Why the future of socializing lies in VR

And what avatar technologies are essential in the era of spatial computing.

In the early days of the internet, people could communicate only through plain text, with no choice of font or design. Later advances in technology enabled more nuanced design and brought digital photos and video. As Mark Zuckerberg said at F8 2016 [1], “We’re always looking for better and richer ways to express ourselves and share with one another.”

Evolution of digital content

In recent years, we’ve seen a rapid rise in live video communication alongside a shift in interpersonal communication and media consumption from desktop to mobile. The next frontier is AR/VR, which allows users to share content that is closer to their actual experience and to consume arguably the most immersive and compelling form of media available to date.

In addition, the use of virtual spaces for traditionally physical experiences is becoming more and more commonplace. The onset of COVID-19 has only accelerated this phenomenon. Examples include people holding wedding ceremonies in Animal Crossing [2], students recreating high schools to study in Minecraft [3], and performers hosting concerts in Fortnite [4].

Communication through avatars

In these digital environments, individuals represent themselves with avatars. The more people use these digital worlds for purposes beyond their intended storylines, the more important nuanced interpersonal communication becomes for the user base.

This phenomenon highlights the need for a new form of digital communication, particularly in light of the shift to AR/VR interfaces. Current text-based messaging solutions are extremely inconvenient on virtual interfaces, while audio streams don’t fully reflect the emotional dimensions of a conversation. Finally, although video holograms depict user emotion accurately, they suffer from the uncanny valley effect, which makes them less than ideal for encouraging user engagement.

Online multiplayer games like the World of Warcraft and Diablo series originally let people communicate through their avatars using text. Since then, technological advances have enabled audio to stream alongside user characters.

The popular multiplayer game Fortnite has taken a step forward in online user interaction by adding user-controlled emotive behavior to its avatars and audio communications. However, these popular Fortnite dances and celebrations are highly limited as a means of communication.

The New York City-based venture studio developing Chudo, a next-generation virtual social network, sees the next phase of avatar technologies employing real-time emotion animations that reflect the user’s speech. Recognizing this progression, Chudo’s developers have reworked the entire way a user creates and interacts with avatars.

Chudo’s work on avatar technologies

Chudo’s developers have set out to create technologies that shift messaging from text-based to voice- and animation-based content, using digital avatars to enable emotion-rich interpersonal communication.

The first step for the team was developing an automatic tool for creating avatars, which is far more user-friendly than the traditional “avatar maker” approach.

The traditional “avatar maker” approach

Incumbent technologies typically employ a manual avatar making toolkit, where the user is expected to build their own personalized digital twin by choosing from hundreds of different facial feature options: eyebrows, chins, noses, etc.

There are three main issues with this approach:

Specific artistic skills are required to design your look-alike avatar.
The user experience is daunting.
It’s labor-intensive for developers to draw enough variations of each facial feature to match everyone.

Chudo’s machine learning approach for avatar creation

While still providing the ability to customize looks, the team behind Chudo built a tool that allows users to create precise virtual copies of themselves, i.e. digital twins, with ease. To put it simply, Chudo’s team has built a machine learning technology that automatically generates avatars based on a user’s selfie photo.

Notably, the technology runs inference locally on the mobile device, without any cloud processing of users’ sensitive data, namely their face pictures.

Here is how it works:

The process of avatar generation starts by constructing a user’s 3D face model from a selfie shot. With this single photo, one neural network builds a 3D mesh of the user’s head.

The mesh then goes through an automatic stylization process that makes it look more cartoonish. Afterwards, it is used as the basis for the avatar’s head.

Next, other neural networks swing into action. Face parts such as hair, eyebrows, facial hair and colors are classified and synthesized (generated) onto the mesh in the appropriate style based on the user’s selfie.

Moreover, if a user is wearing glasses, another neural network will detect them and synthesize a 3D version.
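The steps above can be pictured as a simple pipeline. The sketch below is purely illustrative: every name in it (`Avatar`, `build_head_mesh`, `stylize`, `classify_parts`, `detect_glasses`, `generate_avatar`) is hypothetical, and each neural network is replaced by a stub that returns fixed values, since Chudo's actual code is not public.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vertex = Tuple[float, float, float]


@dataclass
class Avatar:
    head_mesh: List[Vertex]
    parts: Dict[str, str] = field(default_factory=dict)
    glasses: Optional[str] = None


def build_head_mesh(selfie: bytes) -> List[Vertex]:
    """Stub for the network that reconstructs a 3D head mesh from one photo."""
    return [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]


def stylize(mesh: List[Vertex], strength: float = 0.5) -> List[Vertex]:
    """Placeholder stylization: move each vertex part of the way toward the
    mesh centroid. A real stylizer would be far more elaborate."""
    n = len(mesh)
    centroid = tuple(sum(v[i] for v in mesh) / n for i in range(3))
    return [
        tuple(c + (centroid[i] - c) * strength for i, c in enumerate(v))
        for v in mesh
    ]


def classify_parts(selfie: bytes) -> Dict[str, str]:
    """Stub for the classifiers that pick hair, eyebrows, facial hair, colors."""
    return {
        "hair": "short_wavy",
        "eyebrows": "thick",
        "facial_hair": "none",
        "hair_color": "brown",
    }


def detect_glasses(selfie: bytes) -> Optional[str]:
    """Stub glasses detector; a real one would return None without glasses."""
    return "round_frames"


def generate_avatar(selfie: bytes) -> Avatar:
    """Run the stages in the order the article describes."""
    mesh = stylize(build_head_mesh(selfie))
    return Avatar(
        head_mesh=mesh,
        parts=classify_parts(selfie),
        glasses=detect_glasses(selfie),
    )


avatar = generate_avatar(b"raw selfie bytes")
```

Because every stage takes the selfie (or the previous stage's output) and returns plain data, the whole chain can run on the device, which fits the local-inference point made earlier.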

In contrast to platforms like Fortnite and Minecraft, Chudo’s product focus is on user interaction and communication rather than gameplay. To that end, Chudo’s team has built a technology that uses machine learning to recognize speech and translate it into real-time emotional animations.

Chudo’s AI Speech-driven Facial Animation allows users to express themselves in a more nuanced way to foster closer emotional connections between users on the platform.

How the technology was built:

As a first step, Chudo pre-trained its neural network on more than 10,000 hours of speech recordings so that it could transcribe audio into sequences of phones (individual speech sounds). This gave the network its speech recognition capability.

To train the animation network, Chudo’s team used its proprietary facial performance capture solution to extract animation from a large video dataset and obtain audio-to-animation training examples.

Because different actors had produced unique animations for the same utterance, Chudo trained its animation network within a generative adversarial network (GAN) framework to learn the distribution of possible animations. This implementation allows Chudo’s software to choose an animation style based on speech in real time.
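To make the idea of a distribution of possible animations more concrete, here is a toy sketch of the generator side of such a setup. It is not Chudo's model and it trains nothing: the phone-to-mouth-shape table, the function names and the single "jaw open" value per frame are all assumptions made for illustration, and the discriminator and adversarial loss are left out entirely. The only point is that the same utterance, combined with different random noise, yields different but plausible animations.

```python
import random
from typing import List

# Hypothetical base "jaw open" amount per phone, in the range [0, 1].
BASE_JAW_OPEN = {"AA": 0.8, "M": 0.1, "IY": 0.3}


def generate_animation(phones: List[str], noise: List[float]) -> List[float]:
    """Toy generator: one jaw-open value per phone.

    The phone sets the base mouth shape; the noise vector shifts it,
    standing in for the 'style' a trained generator would sample.
    """
    frames = []
    for i, phone in enumerate(phones):
        style_shift = 0.1 * noise[i % len(noise)]
        value = BASE_JAW_OPEN.get(phone, 0.5) + style_shift
        frames.append(min(1.0, max(0.0, value)))  # clamp to [0, 1]
    return frames


def sample_noise(rng: random.Random, dim: int = 4) -> List[float]:
    """Draw a latent noise vector from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
utterance = ["M", "AA", "M", "AA"]
take_a = generate_animation(utterance, sample_noise(rng))
take_b = generate_animation(utterance, sample_noise(rng))
```

In a real GAN, the generator's weights would be learned by competing against a discriminator that tries to tell generated animations from captured ones.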

The larger speech-driven animation network needs the full audio recording to produce animation, which is fine for audio files. To adapt this solution to real-time streams, Chudo applied a technique called “knowledge distillation”, in which the larger, non-real-time network is used to train a smaller real-time architecture. As a result, the real-time version has a smaller memory footprint and requires far fewer computations.
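A minimal sketch of the distillation idea follows, under heavy simplifying assumptions. Both "networks" are toy filters rather than neural networks, and all names (`teacher`, `student`, `distill`, `mse`) are hypothetical. The teacher looks at past and future frames, so it needs the whole recording; the student sees only the current and previous frames, so it can run on a live stream. The student is trained by gradient descent to reproduce the teacher's outputs.

```python
import random
from typing import List


def teacher(signal: List[float]) -> List[float]:
    """Offline stand-in: centered 5-frame average, which needs future frames."""
    n = len(signal)
    out = []
    for t in range(n):
        window = signal[max(0, t - 2): min(n, t + 3)]
        out.append(sum(window) / len(window))
    return out


def student(signal: List[float], weights: List[float]) -> List[float]:
    """Streaming stand-in: weighted sum of the current and past frames only."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            if t - k >= 0:
                acc += w * signal[t - k]
        out.append(acc)
    return out


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill(signals: List[List[float]], taps: int = 3,
            lr: float = 0.05, epochs: int = 200) -> List[float]:
    """Fit the student's weights to the teacher's outputs (mean squared error)."""
    targets = [teacher(s) for s in signals]  # teacher runs offline, once
    weights = [0.0] * taps
    for _ in range(epochs):
        for signal, target in zip(signals, targets):
            pred = student(signal, weights)
            n = len(signal)
            grads = [0.0] * taps
            for t in range(n):
                err = pred[t] - target[t]
                for k in range(taps):
                    if t - k >= 0:
                        grads[k] += 2.0 * err * signal[t - k] / n
            weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


rng = random.Random(0)
train = [[rng.uniform(0.0, 1.0) for _ in range(50)] for _ in range(20)]
weights = distill(train)
```

The student here has only three weights, which mirrors the trade-off described above: it cannot match the teacher exactly because it never sees future frames, but it is much cheaper to run.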

Chudo. Live the future.

The avatar technologies described above are implemented in Chudo, which is now available on the App Store and Google Play. The mobile app is the first step in product development towards building the virtual social platform. For now, Chudo has the very basic functionalities that serve as the foundation of the larger digital world: real-time audio chat, personalized emotes and a multidimensional digital space that the user can explore.

This first version of Chudo allows the team to test out these proprietary avatar technologies in a 3D space and find more specific use cases for virtual gatherings before expanding to new platforms (Oculus, desktop). Chudo is hard at work to refine and expand its functionalities, with your help.

We would also appreciate it if the HackerNoon community shared thoughts and ideas on what use cases we should tackle (e.g. virtual concerts, classrooms). You are welcome to leave your comments below.

References

[1] Facebook F8 Live, April 12, 2016

[2] The Washington Post, “The pandemic canceled their wedding. So they held it in Animal Crossing.”, April 2, 2020

[3] Intelligencer, New York Magazine, “A Group of New York Students Re-created Their High School on Minecraft”, March 30, 2020

[4] The Verge, “Fortnite hosted a Diplo concert in its new party mode”, May 2, 2020