Synthetic Reality: AI and the Metaverse
AI will play the single most important role in building, operating and governing the Open Metaverse.
This article is adapted from my 2014 talk, “Synthetic Reality: How AI Will Shape the Metaverse.” I’ve re-posted it from my personal blog at www.matt-white.com and updated it with more recent AI papers, products and services.
This is a long article, so please use the table of contents to navigate to the sections that interest you most. It starts with some backstory on the Metaverse and generative AI drawn from my own experiences, along with a definition of “synthetic reality,” then jumps into Metaverse use cases for AI with recent examples supported by papers and platforms.
Table of Contents
2. Synthetic Reality — A Definition
3. The Role of AI in the Open Metaverse
4. 3D Assets and World Building
↳ 4.1. 3D Avatar Generation
↳ 4.2. Rigging
↳ 4.3. 3D Asset Generation
↳ 4.4. 3D Scene Generation
↳ 4.5. 3D Fashion
5. Avatar Control & Animation
↳ 5.1. Character Control
↳ 5.2. Character Animation
↳ 5.3. Facial Animation
↳ 5.4. Autonomous Agents
6. Image Generation
↳ 6.1. Materials & Textures
↳ 6.2. AI Art
↳ 6.3. Advertising
7. Simulation
↳ 7.1. Fluid Dynamics
↳ 7.2. Rigid and Soft-Body Dynamics
↳ 7.3. Cloth, Hair & Fur Simulations
↳ 7.4. Natural Phenomena
8. Audio and Communication
↳ 8.1. Music Generation
↳ 8.2. Speech Recognition
↳ 8.3. Speech Synthesis
↳ 8.4. Sound Effects
↳ 8.5. Language Translation
↳ 8.6. Audio Encoding
↳ 8.7. Spatial Audio
↳ 8.8. Conversational Dialog
9. Other Multimedia
↳ 9.1. Storyboarding
↳ 9.2. Scripted Dialogue Generation
↳ 9.3. Writing Tools
↳ 9.4. Video Generation
10. Marketing & Commerce
↳ 10.1. AI Smart Contracts
↳ 10.2. Transaction and Trade Settlements
↳ 10.3. Arbitration
↳ 10.4. Fraud Detection
11. Security, Privacy & Governance
↳ 11.1. Sentiment Analysis
↳ 11.2. Gatekeepers
↳ 11.3. Moderators
↳ 11.4. Threat Detection
↳ 11.5. Authentication
↳ 11.6. Accessibility
13. Additional Resources
↳ 13.1. The Ultimate Guide to 3D Model and Scene Generation Papers
↳ 13.2. arXiv
↳ 13.3. Papers with Code
↳ 13.4. Hugging Face
Up until Goodfellow et al. published their paper on GANs, the most I had seen of generative deep learning was at a deep learning summer school hosted at UCLA, where notable AI leaders Geoffrey Hinton, Yoshua Bengio, Yann LeCun, Andrew Ng and many other brilliant minds in the field delivered talks on deep learning in their own areas of research (the schedule and videos are still available here.)
During the summer school, Ruslan Salakhutdinov, then a professor at the University of Toronto (now at Carnegie Mellon), gave a compelling lecture on generative deep learning and deep Boltzmann machines, demonstrating models generating images of airplanes, querying a trained model on its belief of what Sanskrit should look like, and predicting how the missing half of an image should appear. Although these capabilities existed before the summer school, it was the first time I had seen generative AI in action, and it got me thinking about its future potential.
In 2014, Ian Goodfellow, under the advisement of Yoshua Bengio at the Université de Montréal, was conducting innovative research on a new modeling architecture called Generative Adversarial Networks (GANs). The adversarial setup pits a generator against a discriminator in order to incrementally improve generated images. Although the first results were not at the level of today’s GAN architectures like StyleGAN and CycleGAN, they were impressive nonetheless, and they inspired me to consider how far generative AI could be taken and how it might be used in the Metaverse.
A decade earlier, around 2005, the concept of an Open Metaverse was born during the peak of Linden Lab’s Second Life craze. The SL community had reverse engineered Linden Lab’s protocol and wanted to run their own virtual worlds, taking their avatars and digital assets with them as they moved freely between worlds. This kicked off the first Open Metaverse movement and saw the creation of communities and projects like OpenSim, OpenGrid and the Open Metaverse Foundation, the latter a project we moved into the Linux Foundation in 2022.
The prospect of using generative deep learning models to produce highly plausible synthetic data such as images, audio, video, and 3D assets and scenes was a long way off in 2014, and although we are closer to viable results at the end of 2022, we still have some distance to go. However, we will see major advancements over the course of 2023 and 2024 in generative AI research, producing high-fidelity results and reducing current pain points such as long training times and computational complexity. New model architectures will make generative models much more viable and accessible as tools for the creation, design and manipulation of synthetic media (3D, images, audio, video), text and code generation and manipulation, and hyper-personalization. AI will also be heavily leveraged for non-generative applications like hyper-automation, fraud detection in decentralized environments, AI arbitration, safety enforcement and AI smart contracts.
The ability to generate and simulate real-world environments and systems using artificial intelligence is what I refer to as synthetic reality. I firmly believe that building the Metaverse as a truly immersive, physically accurate and photorealistic environment that is safe, ethical and suitable for global commerce will be intractable without it. There is simply no way to build, operate and govern an environment of universal scale through human activity and conventional programming alone; this is where AI outperforms in creation, automation and computational efficiency.
Synthetic Reality — A Definition
“The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.” — Paul Adrien Maurice Dirac, Quantum Physicist (Quantum Mechanics of Many-Electron Systems, 1929.)
The term “synthetic reality” has been used more in recent years but is far from entering the daily lexicon, since its viability has remained elusive over the past decade. Synthetic reality is, for all intents and purposes, a synthetic version of the real world and all its components and systems (objects, entities, atomic and quantum elements and behaviors, natural phenomena, …), as well as the composability of all of the above. But here is the important part: it is entirely a product of artificial intelligence and has no objective except to simulate the elements of the real world.
Well Matt, isn’t that just virtual reality, except generated by artificial intelligence? Not quite. Virtual reality is a top-down approach to computer-generated environments designed to provide an immersive experience; it could be a game, a training environment or a simulation. It uses mathematical models to simulate real-world physics and experiences, but only to present the user with enough visual information to achieve a reasonable effect, and it does so in a very programmatic and prescriptive way. The emphasis is on perception, not physical simulation at the quantum and atomic levels. Scientific simulations are sometimes, but not always, concerned with quantum and atomic behaviors and physical accuracy, yet often have little to no concern for photorealism. This is due to computational constraints, which force tradeoffs depending on the application: modelers shed unnecessary data and computations that don’t directly affect the results of the simulation.
In VR you are limited by the DoF (degrees of freedom) the programmers afford you. These experiences are often narrative-driven or purpose-built; even sandbox games with no game-like objective still have a purpose, whether to provide a social experience, sell you a plot of land, or deliver some other experience- or value-based objective. Contrast this with synthetic reality, where artificial intelligence generates and governs the simulation and all its elements in a non-prescriptive manner that obeys fundamental natural laws. You might draw a comparison to The Matrix, a machine-generated simulated environment designed to fool the consciousness of its inhabitants into believing they are in the real world. Perhaps the only difference is that synthetic reality is benign: it is not intended to trap its inhabitants but to allow them real-world experiences in an immersive, photorealistic and physically accurate environment, and not to drain them of their bio-electric energy to power robots… yet.
Aside from simulating natural elements, laws and systems, AI is capable of synthesizing human-devised systems and social constructs: culture, economics, self-organizing systems, political systems, governmental functions, and systems that implement rules, controls, procedures, monitoring and validation, ranging from air-traffic control to police and the military. The value and risks of AI synthesizing these systems in the Metaverse should not be understated, as AI is extensible and its development will continue to yield ever more efficient means to replicate and perfect existing systems. (I recognize this is a controversial subject that evokes uncertain, possibly scary outcomes; through the principles of responsible AI we will have to ensure that high-risk systems are developed responsibly, enforcing controllability, explainability, governance and accountability with attention to human-centeredness, fairness and the reduction of human and algorithmic bias.)
Although no Metaverse yet exists and there is no consensus on what it will be, I do believe that artificial intelligence will play the most integral role in building, operating and governing the Open Metaverse and empowering and protecting its inhabitants.
The Role of AI in the Open Metaverse
3D Assets and World Building
In the Metaverse, world building and the creation of 3D assets and scenes will require the heaviest lifting if done with today’s manual and procedural methods; with these classical methods alone, realizing a Metaverse with global presence will be impossible. In my 2014 talk I spoke about both the potential and the necessity of generative AI for producing the 3D content required to build virtual worlds. Up until 2020 it felt like a long way out. Then UC Berkeley researchers Ben Mildenhall, Pratul Srinivasan, Matthew Tancik et al. released their paper “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”. This was a game changer for synthesizing 3D scenes from 2D images. NeRF has inspired additional research, including NDDF and NeLF, which sought to improve on the long rendering times that plagued NeRF. Work continued on improving performance, and in 2022 Nvidia released Instant Neural Graphics Primitives (AKA Instant NeRF), which drastically reduced the time needed to train a NeRF. It was a big enough innovation that Time magazine named it one of the best inventions of 2022.
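To make the mechanism concrete, here is a minimal sketch (not the authors’ code) of the discrete volume-rendering step at the heart of NeRF: a network predicts a density and color per sample along a camera ray, and the pixel color is an alpha-composite of those samples.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """NeRF-style compositing along one ray.
    alpha_i = 1 - exp(-sigma_i * delta_i)   (opacity of sample i)
    T_i     = prod_{j<i} (1 - alpha_j)      (transmittance reaching i)
    pixel   = sum_i T_i * alpha_i * c_i
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas
    return weights @ colors, weights

# toy example: 4 samples along one ray (densities a trained MLP would output)
sigmas = np.array([0.0, 0.5, 2.0, 0.1])
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
deltas = np.full(4, 0.25)          # spacing between samples
pixel, w = composite_ray(sigmas, colors, deltas)
```

Training a NeRF amounts to optimizing the MLP so that compositing like this reproduces the input photographs from their known camera poses.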
The transition from creation to generation is necessary to fulfill the dream of an Open Metaverse. Although it strongly disrupts the status quo and will change the way 3D modelers, animators and other 3D artists work today, it will have a democratizing effect on the generation of 3D assets, scenes and worlds. The reduced barrier to entry will allow lower-cost, more rapidly developed experiences and the commoditization of digital assets for all communities, especially economically challenged and marginalized communities who cannot afford high-powered computers, expensive software and extensive training.
The techniques I see evolving are audio-to-3D, audio-to-scene, and so forth. I don’t say text-to-something because voice will be the input tool of the Metaverse, except for long prescriptive text prompts intended to be loaded like code chunks. There have also been recent developments in in-painting, out-painting, neural style transfer and sketch-to-image, which will evolve into sketch-to-3D, as well as text-guided editing, which could of course become voice-guided editing. These and novel methods yet to come will make 3D digital asset creation and world building a low-friction experience for consumers.
3D Avatar Generation
Discriminative AI (for classification tasks) has already been applied to 3D avatar generation, using an image of a user’s face to match facial features like shape and skin tone against a library of avatar assets. Generative AI, however, will facilitate highly accurate avatar creation from input images, from the most photo-realistic representation all the way down to cartoonish appearances or any preferred style. The great thing about generative AI is that you can easily perform a style transfer to convert your photo-realistic avatar into a style that matches the virtual world you enter, so as to remain style-consistent and abide by the aesthetic and safety rules of that world. One could imagine entering the Roblox virtual world in the Metaverse and one’s avatar instantly becoming a block-style avatar to fit the world’s motif.
Avatar generation will certainly be a very popular feature of the Metaverse and a starting point for those entering it. Being able to modify one’s avatar quickly without having to select from a limited library or request that a 3D modeler custom build one will be a must to provide Metaverse users with higher degrees of autonomy and avatar uniqueness.
Rigging
Avatars are not avatars without rigging. Neural methods have already proven effective at predicting skeletal composition in order to rig 3D assets, such as the work by Xu et al. presented at SIGGRAPH 2020, entitled RigNet.
3D Asset Generation
Although everyone in the Metaverse will start out by generating their own avatar (or perhaps a multitude of them), most of the Metaverse will be constructed of scenes and 3D digital assets. Users will need to build their homes and generate digital accessories like wearables and aesthetics, as well as utility assets such as a sword or a car. Recent advances, beginning with UC Berkeley PhD student Ajay Jain’s work in 2022 on Dream Fields and later that year on DreamFusion with Ben Poole, Ben Mildenhall and Jon Barron from Google, demonstrated that 3D assets could be generated by using a diffusion model trained on 2D images to optimize a NeRF, producing impressive results. Nvidia shortly after released a paper on Get3D, which instead of using a diffusion model trained on 2D images was trained on 3D models, producing high-quality textured meshes.
However, there is a lack of 3D data, and much of what exists is covered by licensing agreements that do not allow training models and redistributing the results, so most work has focused on 2D image inputs. Magic3D, released in late 2022 by Nvidia, improved upon DreamFusion; to date no code has been released to validate the claims made in the paper, but the quality of the outputs is impressive and will certainly lead to further innovations, with the eventuality of photo-realistic, high-resolution, high-poly-count 3D digital assets generated through prompts.
Magic3D: High-Resolution Text-to-3D Content Creation
Although resolution continues to be a computational challenge for 3D asset generation, neural methods can be applied for super-resolution, or upscaling (and conversely downscaling), to produce dynamic LODs (Levels of Detail) for avatars, assets and scenes. This will reduce the need for exorbitant amounts of asset storage; instead, real-time inference can change the LOD for a particular asset based on its distance as perceived by a third party. That said, at the time of writing it is more costly to perform super-resolution than to simply store multiple copies of a 3D asset in the network.
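Whatever produces the detail levels, a runtime still needs a policy for which level to request or synthesize at a given viewer distance. A minimal sketch of such a policy (the threshold distances here are illustrative placeholders, not from any engine):

```python
def select_lod(distance: float, thresholds=(10.0, 50.0, 200.0)) -> int:
    """Pick a level of detail for an asset: 0 is full detail, and higher
    indices are the coarser versions a downscaling model could produce
    on demand. Thresholds are hypothetical example values."""
    for lod, limit in enumerate(thresholds):
        if distance < limit:
            return lod
    return len(thresholds)  # beyond all thresholds: coarsest LOD
```

An engine could evaluate this per frame and trigger super-resolution (or fetch a cached mesh) only when the returned LOD index changes.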
3D Scene Generation
Virtual worlds will consist of scenes, and these scenes can be thought of as the web pages of the Metaverse. Just as the Web is mostly comprised of web pages, the Metaverse will mostly be comprised of 3D scenes. It follows that to make the Metaverse a reality we need the ability to create 3D scenes with as little friction as possible: low cost, low time and low complexity. This is where generative AI comes in to save the day. Although manually and procedurally generated content will still be produced for and in the Metaverse (and authentic human-generated content is likely to be highly sought after due to its scarcity), there will be a need to rapidly generate scenes ranging from simple low-resolution scenes all the way up to highly complex, highly detailed photo-realistic ones. Recent research from Apple called GAUDI shows promising results in this space, with the ability to generate 3D scenes by disentangling radiance fields from camera poses, supporting image-conditioned (image prompt), text-conditioned and unconditioned scene generation.
3D Fashion
The global fashion industry is one of the most environmentally damaging industries: it accounts for up to 10% of global greenhouse gases, about 85% of all textiles end up in landfills or the ocean, and the industry has a terrible track record on human rights violations, including child labor. Moving fashion into virtual environments could only have a net positive effect, and the industry is ripe for innovation. There has been growing interest in fashion for virtual worlds, though it has not seen wide adoption in online games; I would expect that to change with the realization of the Metaverse.
Text-guided textures are already a reality, and recent research has demonstrated the ability to create a 3D reconstruction of clothing from only a single input image. Advances in this methodology will allow users to replicate their favorite real-world clothing items for use on their avatars.
Avatar Control & Animation
Moving avatars around virtual worlds, and between them, is a fundamental primitive of the Metaverse experience. Neural methods have been successfully employed to animate characters without prescriptively defined movements, and great strides have been made in instructing characters or avatars to move based on text instruction.
Character Control
Nvidia has demonstrated how deep reinforcement learning can be used to teach simulated characters to perform tasks and respond to variables in their environment. Researchers have also shown that with neural methods they can instruct characters to perform particular actions, like raising a shield to protect themselves or swinging a sword. This research has very real implications both for instructing avatars on how to interact inside virtual worlds, for instance providing the instruction to “walk to the store” (once semantic understanding advances, of course), and for training and interacting with NPCs (autonomous agents).
OpenAI has demonstrated remarkable advancements in agent (character/avatar) training, with agents able to perform complex tasks within Minecraft. Their VPT (Video PreTraining) method, coupled with fine-tuning via reinforcement learning, demonstrated that agents can learn complex tasks by training on the abundance of online Minecraft videos and then being “dropped” into the Minecraft environment. This approach could be used to teach agents how to perform highly complex tasks in the Metaverse through observational learning.
Character Animation
Advances in neural character animation from joystick, mouse and keyboard inputs can generate very natural movements, as seen in Sebastian Starke’s PhD work on AI4Animation. The network learns spatial-temporal elements of body movements through what he calls a periodic autoencoder. Learning natural movements and being able to easily replicate them will allow avatars and rigged entities to move around and interact with their environments in a very intuitive way that requires very little programming and generalizes well to unseen objects and environments.
Facial Animation
Researchers from the Max Planck Institute and Nvidia have been able to create realistic animations of facial features through speech processing. This technology will be valuable because facial tracking through headsets is less than perfect, and voice is a more consistent and accurate signal for generating plausible facial movements. Audio2Face from Nvidia has already been released and is available in open beta (https://developer.nvidia.com/blog/nvidia-omniverse-audio2face-app-now-available-in-open-beta/).
Autonomous Agents
The Metaverse will be filled with both avatars (players) and agents (non-player characters) that may inhabit humanoid or non-humanoid entities. I believe that autonomous agents will vastly outnumber avatars in the Metaverse and avatars themselves could convert to autonomous agents when their user disconnects from the Metaverse.
At SIGGRAPH 2018, researchers from UC Berkeley presented a paper called DeepMimic, showing that autonomous agents could learn motion from video clips and re-target it for use in physical simulations. This observational or reinforcement learning from videos has broad applications that will allow machines to learn from human experiences for almost anything, reducing the trial-and-error training with high episode counts needed in conventional reinforcement learning.
Image Generation
Image generation, especially using diffusion models like DALL-E 2, Midjourney and Stable Diffusion, was for many people their first experience with generative models. The rate of progress in image generation has been exceptional in recent years, in contrast to the early days of the first GANs, and developments continue in improved architectures and in solving model deficiencies (photorealism, handling text in images, handling multiple concepts). Diffusion models are largely considered the SOTA (state of the art) in image generation and are the first stage in model architectures used for generating visually unique 3D assets, such as DreamFusion (and Stable DreamFusion) as well as Nvidia’s Magic3D. 2D diffusion is preferred due to the abundance of images available on the Internet.
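For intuition, the forward (noising) half of a diffusion model has a simple closed form; the learned network is trained to invert it. A minimal sketch with a standard DDPM-style linear schedule (the array sizes are toy values):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t = prod_{s<=t} (1 - beta_s).
    The denoising network is trained to predict eps from x_t."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

betas = np.linspace(1e-4, 0.02, 1000)   # linear noise schedule
x0 = np.ones((8, 8))                     # stand-in for an image
xt, eps = forward_diffuse(x0, t=999, betas=betas,
                          rng=np.random.default_rng(0))
```

By the final timestep almost no signal remains, which is why generation can start from pure noise and denoise step by step.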
However, generated images on their own will certainly play a role in the Metaverse. They will be applied as textures to change the appearance of buildings or avatars, and one could imagine hanging a piece of generated art inside one’s Metaverse home. Generated images could be bought and sold as collectors’ items, especially when produced by fine-tuned, personalized models whose unique outputs are not easily replicated. The issue of authenticity will extend to whether an image is synthetically generated or a human-produced work; distributed ledger technologies, along with trusted organizations that validate works before committing (minting) them to the network, will be a must.
Materials & Textures
For PBR materials, workflows can be enhanced with auto-generation of color, roughness and metalness maps for metallic (roughness) workflows, and diffuse, glossiness and specular maps for specular (glossiness) workflows. Additionally, surface normal, albedo, height, opacity, ambient occlusion, refraction and self-illumination maps can be generated and applied to 3D models, all through text- or image-guided prompts. Textures can subsequently be manipulated through prompt engineering coupled with graphical or procedural manipulation. AI-generated textures are already in use today and can be accessed on a myriad of sites, including Polycam’s website.
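Since both workflows will coexist in generated content, converting between them matters. A minimal sketch of the commonly used approximation (the map names and the dielectric assumption are illustrative, not tied to any particular engine’s schema):

```python
def gloss_to_roughness(gloss: float) -> float:
    """Widely used approximation when converting a specular/glossiness
    material to a metallic/roughness workflow: roughness = 1 - gloss."""
    return 1.0 - gloss

def convert_material(spec_gloss: dict) -> dict:
    """Convert a specular/glossiness material dict to metallic/roughness.
    Assumes a dielectric (metalness 0); real converters infer metalness
    from the specular color, which is more involved."""
    return {
        "base_color": spec_gloss["diffuse"],
        "roughness": gloss_to_roughness(spec_gloss["glossiness"]),
        "metalness": 0.0,
    }

mat = convert_material({"diffuse": (0.8, 0.8, 0.8), "glossiness": 0.3})
```

A generation pipeline could emit either representation and convert on import, rather than training separate models per workflow.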
Text-guided texture generation has matured to the point where a user can enter a prompt describing the aesthetic appearance of a texture and that texture can be applied to existing 3D geometry as illustrated by a recent paper called TEXTure.
AI Art
AI art generators like Midjourney, NightCafe and Deep Dream Generator are already pervasive and produce stunning results, though they still struggle with photorealism and with combining multiple concepts in a single image. This work will continue to improve in the coming years, although some ongoing legal battles may affect the ability of image models to train on copyrighted images, such as Getty Images’ lawsuit against Stability AI.
AI art will be traded in the Metaverse much as NFTs are traded today, but ideally with stronger and more reliable frameworks than unencrypted IPFS-based solutions using ERC-1155 and ERC-721. When managed by distributed ledgers like blockchains and directed acyclic graphs (DAGs), we can ensure authenticity and provenance and maintain the artificial scarcity of digital art items so they preserve their value, even if they are generated by image models. In the future we should expect all asset management to be decentralized, including 3D assets (this is a core focus of the Digital Asset Management working group at the Metaverse Standards Forum).
Advertising
The ability to quickly generate images, coupled with hyper-personalization, will allow marketers to generate highly personalized advertisements that appeal to individual consumers. Image adverts can be dynamically generated: as you move through the Metaverse you may see the same product advertised in different ways, each tailored for your eyeballs in order to realize higher conversion rates. Think Minority Report.
Simulation
The term simulation has several different applications. At a macro level it is used in the context of a synthetic environment, such as those used in real-time games, CG video production, digital twins, industrial simulations and military simulations. These environments are comprised of lower-level simulations, including visual and physical simulations such as fluid (flow) dynamics, collisions, deformations, fractures and natural phenomena. Will the Metaverse be a simulation? By my measure it will be, since the environment will be real-time and stateful and will rely on physical simulations to generate a simulated experience. AI has extensive applications in simulation and was relied upon heavily for the water simulations in James Cameron’s Avatar: The Way of Water.
In simulations we have both an aesthetic (visual) element and a physical element. Simulations don’t necessarily require high fidelity in both; due to resource limitations (processor, memory, latency and bandwidth) and the specific application, they may not even include both. For instance, in many games you will see water that seems to flow fairly naturally and refract light decently, but when interacting with it you don’t get the physical accuracy you would see in, say, a scientific fluid simulation that employs molecular or continuum methods.
Fluid Dynamics
In fluid simulations, both visual and physical, there have been significant advances using deep learning methods. For physical simulations, researchers were able to optimize flow predictions using a method that combines learned flow fields with a conventional Adams–Bashforth integrator, substantially reducing the computational cost of running a fluid flow simulation.
Correspondingly, in the domain of visual simulations a significant amount of work has taken place in recent years with good results, especially around the interaction of fluids with objects of varying degrees of hydrophobicity. The advantage here is producing visually accurate results that require far fewer run-time computing resources.
Advances have also been made in smoke simulations, with surprisingly accurate results, as witnessed by the optimizations of Hong et al. in their paper “Accelerated Smoke Simulation by Super-Resolution With Deep Learning on Downscaled and Binarized Space.”
Rigid and Soft-Body Dynamics
Rigid-body physics simulates objects that collide without deforming, and is used primarily in games to simulate impacts. Objects are wrapped in a convex collision volume; when two objects collide, the collision is detected in real time and a mathematical model calculates the appropriate force to apply to each object. Rigid-body dynamics is not as computationally demanding as the methods required for soft-body dynamics, which simulate deformable objects and rely on mathematical models such as the spring-mass model, position-based dynamics (PBD), the finite element method (FEM) and the material point method (MPM).
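A minimal sketch of the impulse-based response rigid-body engines typically apply along the contact normal once a collision is detected (reduced to one dimension for clarity; a real engine works with 3D contact points and normals):

```python
def resolve_collision_1d(m1, v1, m2, v2, e=1.0):
    """Impulse magnitude along the contact normal:
    j = -(1 + e) * v_rel / (1/m1 + 1/m2),
    where e is the coefficient of restitution
    (1 = perfectly elastic, 0 = perfectly inelastic).
    The impulse is applied equally and oppositely to both bodies."""
    v_rel = v1 - v2
    if v_rel <= 0:                      # bodies already separating
        return v1, v2
    j = -(1.0 + e) * v_rel / (1.0 / m1 + 1.0 / m2)
    return v1 + j / m1, v2 - j / m2

# equal masses, fully elastic impact: the velocities swap
a, b = resolve_collision_1d(1.0, 2.0, 1.0, -1.0)
```

Note that the update conserves momentum by construction, since the same impulse is added to one body and subtracted from the other.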
Although current physics simulation methods are highly optimized, deep learning has proven able to outperform them. Deep learning models accomplish this by training on large datasets and emulating what a physics solver would do. Some of the most impressive work comes from researchers at Stanford and the University of Oxford, who developed a technique called Deep Emulator Network Search (DENSE) that accelerated simulations by up to billions of times. A team at Ubisoft La Forge demonstrated that feed-forward networks (a simple neural network architecture) trained on outputs from Maya’s nCloth physics solver could accurately predict soft-body behaviors, running roughly 5,000 times faster than the solver they were trained on.
Graph networks have also proven effective at simulating physics, as a team from DeepMind showed with their paper “Learning to Simulate Complex Physics with Graph Networks.” Although the computational performance was not much better than conventional physics solvers, it nonetheless demonstrated that there are other methods for physics and particle simulation that can produce highly plausible results and generalize well to unseen data.
Cloth, Hair & Fur Simulations
At CVPR 2020, researchers from the Max Planck Institute for Informatics gave a talk on cloth simulation: the work by Patel et al., entitled “TailorNet,” produced highly plausible results across different poses, body shapes and garment styles in a supervised fashion. Work in this area advanced in 2022 with the release of “Neural Cloth Simulation,” which operates in an unsupervised fashion, learning to simulate cloth movement without labeled data. The area requires further research, especially around self-collision, but this is still a very nascent area of deep learning research in simulation and is sure to improve.
Natural Phenomena
Physics simulations have been used extensively to predict the effects of natural phenomena, but less work has been done to faithfully simulate both the physical and visual aspects of phenomena such as earthquakes, lightning, volcanoes, tectonic plate movement, clouds, tornadoes and hurricanes. To achieve a highly realistic experience in the Metaverse, however, we need to be able to simulate natural phenomena accurately.
There has been excellent work in light transport simulations, with notable neural rendering papers “Neural Radiosity” and “Gaussian Material Synthesis.”
Further research in simulating natural phenomena includes Moseley et al.’s “Fast Approximate Simulation of Seismic Waves with Deep Learning” for earthquake simulations, along with a large amount of physically-based research focused on deep-learning-powered simulations for high-energy physics, climate simulation (such as turbulent air motion and cloud-related processes), seismic modelling of volcanoes and geodetic applications.
Audio and Communication
Audio in the Metaverse will be absolutely essential, as the Metaverse will be an experiential environment that has to appeal to all senses and will have highly social elements including interactive commerce and engagement with avatars as well as autonomous systems within the Metaverse. Although communication protocols will handle the transmission of audio, there are applications of AI in audio compression and synthesizing audio using natural language processing (NLP) including its sub-disciplines of natural language understanding (NLU) and natural language generation (NLG).
Since DeepMind’s paper “WaveNet: A Generative Model for Raw Audio” was released in 2016, it and subsequent research have demonstrated that generative AI can effectively generate different forms of audio, including speech, music and sounds, from MIDI, waveforms and latent representations. In the Metaverse these methods will evolve and be applied to synthesize speech and generate music tracks, sound effects and background sounds.
OpenAI’s MuseNet is an important piece of research that came out in 2019; it is capable of producing four-minute musical compositions using up to 10 distinct musical instruments. MuseNet uses a MIDI-format training approach, is built on the transformer architecture (the same model architecture used in GPT-2 through GPT-4, including ChatGPT) and was inspired by prior work by Huang et al. on “Music Transformer: Generating Music with Long-Term Structure.”
MusicLM, released by Google Research in January 2023, introduces the ability to generate several minutes of music at 24 kHz purely from text descriptions, as well as conditioning on input audio, outperforming existing models like Riffusion and Mubert. Although it cannot generate coherent lyrics (similar to the issue image models have with generating coherent text), it produces music that matches the eight musical elements of dynamics, form, harmony, melody, rhythm, texture, timbre and tonality. Work in this area will greatly reduce the amount of time it takes to compose and create music with DAWs (digital audio workstations), and advances in this area will soon produce viable lyrics and allow iterative text-based editing.
Speech recognition has been used effectively with digital personal assistants like Siri and Alexa, and besides just recognizing commands there are broad applications for speech recognition in sentiment analysis, crisis monitoring and voice fingerprinting for authentication and authorization. Speech recognition can be used as a front-end to text-to-everything generative models and with speech being the most likely method for interfacing with systems and others in the Metaverse, it goes without saying that speech recognition is a highly practical application of AI.
Although there are ample speech recognition papers, services and products in use today, speech recognition is hampered by the quality of the audio, which includes the sample rate, levels of background noise, pronunciation and accents, the languages spoken and the articulation of the speaker. Impressive work has been done by OpenAI on Whisper, which encodes audio samples into a log-Mel spectrogram representation and was trained on 680,000 hours of multilingual and multitask supervised data. The addition of non-English audio actually enhances the model’s ability to predict next-word tokens on English language speech samples not previously seen by the model.
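To make the log-Mel representation concrete, here is a minimal NumPy sketch of the transform (illustrative only; the 25 ms window, 10 ms hop and 80-mel conventions are common in speech models, but this is not Whisper’s actual preprocessing code):

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Frame the signal, take magnitude FFTs, apply a triangular
    mel filterbank, then compress with a log."""
    window = np.hanning(n_fft)
    frames = [wave[s:s + n_fft] * window
              for s in range(0, len(wave) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2  # power

    # Triangular mel filterbank: edges evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)  # rising edge
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)  # falling edge

    mel = spec @ fbank.T                      # (frames, n_mels)
    return np.log10(np.maximum(mel, 1e-10))  # log compression

# One second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 80)
```

The resulting 2D array of frames-by-mel-bands is what gets fed to the model’s encoder, much like an image.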
Speech synthesis can be applied to give a voice to bots and agents, or to transfer the style of one voice to another. It has applications in storytelling, narrating instructional videos and audiobooks, or reading out articles and written materials for those with vision impairment. Speech generation can be implemented with text-to-speech models like Deep Voice or services like Azure TTS; these models are trained on large datasets in order to produce audio outputs for very specific voice profiles. Other methods can be used for voice cloning, where a pre-trained model is conditioned through “few-shot learning”: a small number of audio samples of the target voice are provided to the model, which is then able to reproduce the voice. Prior work in this area includes Attentron and the meta-learning paper by Huang et al. titled “Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech.”
There are a lot of speech synthesis platforms available today for personal and commercial use including Eleven Labs, Speechify, Azure Text-to-Speech, Google Cloud Text-to-Speech and Amazon Polly.
Sound effects are necessary to provide a realistic experience in movie production, games and simulations and this carries forward to the Metaverse. There have been a number of papers on generating sound effects including work done by Sanchita Ghose and John Prevost called AutoFoley which analyzes the frames in a video and provides an audio soundtrack with very impressive results.
DeepAFx by Martínez Ramírez et al. demonstrates a deep learning model that can perform audio signal processing (audio effects) on existing audio inputs. The model can remove breaths and voice pops to clean up speech, emulate a tube amplifier for guitar effects, and apply additional audio-mastering techniques. Deep learning models like these can be used for advanced audio editing with substantially lower processing requirements, and could even be voice-prompted to apply effects.
Language translation in the Metaverse will break down language barriers, which remain largely unaddressed in today’s web, social and mobile experiences. Much of the work on language translation with machine learning involves converting other languages into English and then from English into another language. This approach can introduce semantic errors and adds an unnecessary processing step, but researchers at Meta have developed a multilingual machine translation model called M2M-100 which can translate directly between any pair of 100 languages, covering 2,200 translation directions.
Audio encoding is necessary for the transmission of low-latency, high-quality audio over networks with varying conditions. Conventional audio codecs like MP3, OGG and AAC are lossy (meaning they lose some data in the compression process), while others like FLAC, ALAC and APE are lossless. There are also uncompressed audio formats like PCM, which is used with WAV files.
Neural audio compression has been applied to different forms of audio, from music to speech and generalized audio. Google Brain published research in 2021 on a novel method of audio compression for speech at low bitrates called Lyra, and quickly followed that work up with a method for more generalized audio called SoundStream. Due to the way neural networks work, SoundStream is considered a lossy compression method, as information is lost during the inference process. SoundStream’s results are better than the EVS and Opus compression algorithms at low bitrates while using one quarter of the bits for encoding, which makes it a solid candidate for voice compression.
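For contrast with learned codecs, the classic lossy approach fits in a few lines: μ-law companding, the scheme used in the G.711 telephony codec (and as the quantization target in WaveNet), compresses a waveform non-linearly into 8-bit codes:

```python
import numpy as np

MU = 255.0  # 8-bit mu-law, as in G.711

def mu_law_encode(x):
    """Compand amplitudes logarithmically, then quantize to 256 levels.
    x is a float waveform in [-1, 1]; output is one byte per sample."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * MU).astype(np.uint8)

def mu_law_decode(codes):
    """Invert the companding. The quantization error is not recoverable,
    which is what makes the codec lossy."""
    y = codes.astype(np.float64) / MU * 2 - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

t = np.linspace(0.0, 1.0, 8000)
signal = 0.5 * np.sin(2 * np.pi * 440 * t)
restored = mu_law_decode(mu_law_encode(signal))
error = np.max(np.abs(signal - restored))
print(error < 0.02)  # True: close to the original, but never exact
```

Neural codecs like SoundStream replace this fixed non-linearity with a learned encoder, quantizer and decoder, which is how they squeeze acceptable quality out of far fewer bits.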
Researchers at UC San Diego and Adobe Research have been able to take mono (single-channel) audio recorded along with 360-degree video and convert it into spatial audio using a deep neural network in a self-supervised manner. Trained on 360-degree video and spatial audio datasets, their model performs well but has issues identifying the source of audio when confronted with multiple potential sources that don’t provide clear indicators (i.e., movement). However, work in this area will improve, and it has applications in localizing audio in 3D virtual worlds to provide a truly immersive experience.
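The general principle of placing a sound in space can be illustrated without a neural network. A constant-power pan law, a standard audio-engineering technique (not the method from the paper above), positions a mono source between the left and right channels while preserving total power:

```python
import numpy as np

def pan_mono_to_stereo(mono, azimuth_deg):
    """Constant-power pan law: map an azimuth in [-45, +45] degrees to
    an angle in [0, 90] degrees, then weight the channels so that
    left**2 + right**2 == 1 at every position. Keeping total power
    constant keeps perceived loudness stable as the source moves."""
    theta = (azimuth_deg + 45.0) / 90.0 * (np.pi / 2.0)
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right])

# Place a 440 Hz source 30 degrees to the listener's right
mono = np.sin(2 * np.pi * 440 * np.arange(4800) / 48000)
stereo = pan_mono_to_stereo(mono, azimuth_deg=30.0)
print(stereo.shape)  # (2, 4800)
```

A learned spatializer effectively infers these per-source gains (and delays, and room reverberation) from visual and audio cues rather than being told the azimuth.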
OpenAI’s ChatGPT is the innovation that awakened most people to the capabilities and potential of generative AI, and its success caught many in generative AI research off guard, including OpenAI’s own CEO, Sam Altman. The underlying technology had been used in research labs with impressive results prior to the launch of ChatGPT, in models like GLaM, LaMDA, Gopher and MT-NLG. In fact, InstructGPT-3, a precursor to ChatGPT, was released in February 2022 to little fanfare but was capable of producing surprisingly accurate responses to questions. Without diving into the GPT model families, their releases and parameter counts (which you can find here), it is apparent that ChatGPT won’t be the only large language model (LLM) for conversational dialog, and it very likely won’t exist in its current monolithic architecture by the time the Metaverse is a reality (hopefully LLMs will become SLMs; I will link an upcoming post here on the major issues with LLMs).
Chatbots have extensive applications in customer service, as personal assistants, in search and in numerous other areas, powered by a combination of large language models (LLMs), natural language processing (NLP), natural language understanding (NLU) and natural language generation (NLG). The most likely outcome is that we will all have a personal assistant powered by these technologies accompanying us through the Metaverse.
There are quite a few commercial platforms, including those in our homes like Alexa and Siri. Some are paired with video deepfake technology, like Synthesia, Rephrase, Speechify and others.
Other Multimedia
All forms of multimedia will have an application in the Metaverse. Some will be in-world media, where content is created inside the Metaverse, and others will be outer-world media, imported into virtual worlds. There are applications ranging from game development and video production to story generation and writing.
Storyboarding
Storyboarding and narrative generation can be accomplished with various machine learning model architectures. In 2019 Chen et al. presented a paper called “Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences,” which took the text for each storyboard panel and retrieved and modified the images that most strongly correlated to the text provided. Although I’m not aware of any such platforms at the moment, it would be trivial to leverage this technique and use a diffusion model or GAN to generate the panel images that most strongly resemble the text descriptions, as opposed to having to modify existing images.
Scripted Dialogue Generation
In developing films, games and other scripted experiences, dialogue for characters can be produced using generative deep learning techniques, including the state of the art in large language models (LLMs), which can produce very natural-sounding conversational dialogue in much the same way that chatbots provide responses to inputs, except that here the dialogue is scripted and non-interactive. There are commercial services today like Character.ai, which uses an LLM (GPT-J) fine-tuned on character dialogues to bring personalities to chatbots, and even ChatGPT can be used to transfer the style of dialog to another speaker. With readily available open source models, reproducing these types of platforms is relatively low-effort, and there will be a large selection to choose from in the Metaverse.
Writing Tools
Prose, lyrics, blogs, articles and books can all be enhanced with generative AI for tasks like ideation, layout, text summarization, table generation and so forth. Although today’s LLMs like GPT-3, GPT-J, LaMDA and others are the state of the art, there will be new modeling techniques that aren’t monolithic models condensing a small subset of human knowledge directly into their weights. Promising work is being done on retrieval models like REALM, which rely on an external corpus to provide knowledge rather than on auto-regressive language models that perform next-token prediction alone. These improved models will reduce bias and hallucinations, increase safety and accuracy, and allow for real-time knowledge updates when paired with frozen language models (for instance, GPT-3 has no knowledge of events after its training data cutoff).
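The retrieval idea can be sketched with a toy example: instead of baking facts into model weights, the system looks them up in an external, updatable corpus at query time. The corpus and bag-of-words similarity below are deliberately simplistic stand-ins for REALM’s learned dense retriever:

```python
import math
from collections import Counter

corpus = {  # stand-in for an external, updatable knowledge store
    "doc1": "the metaverse is a network of interoperable virtual worlds",
    "doc2": "glTF is a transmission format for 3D scenes and models",
    "doc3": "diffusion models generate images from text prompts",
}

def vectorize(text):
    # Bag-of-words term counts (a real retriever uses learned embeddings)
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, k=1):
    """Rank documents by similarity to the query. A retrieval-augmented
    model would feed the top passages to a frozen language model as
    context, so updating the corpus updates the model's knowledge."""
    q = vectorize(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, vectorize(corpus[d])),
                    reverse=True)
    return ranked[:k]

print(retrieve("what format is used to transmit 3D models"))  # ['doc2']
```

Because the knowledge lives outside the model, correcting a fact means editing a document, not retraining billions of parameters.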
Today’s LLMs have many other applications which will be enhanced like code development, presentation creation, search enhancement, cloud systems administration, report generation, sentiment analysis and so on.
ChatGPT: Optimizing Language Models for Dialogue
Video Generation
Hollywood studios are closely watching these developments and have yet to discern whether generative AI will be a productivity tool or a threat. Spatio-temporal consistency has been a challenge in current research, but continued innovation will make this an issue of the past. It is highly probable that even in the 4D Internet (3D plus the dimension of time), 2D media will still play an active role: perhaps virtual viewing parties will enjoy a movie together, or immersive learning will include educational videos. Being able to rapidly generate videos tailored to their audience (hyper-personalization) will also be important in catering to the preferences of individual viewers.
As of 2022, video generation is developing rapidly, including text-to-video, video-to-video, in- and out-painting of video and text-guided editing of videos. Although text-to-video outputs currently suffer from temporal and spatial inconsistency, they are sure to improve continually as more research focuses on this area.
Some of the most important papers to date are Google’s Imagen Video, Meta’s Make-a-Video and Google’s Phenaki.
There are already commercially viable products that use generative AI to perform video editing tasks, like RunwayML’s Gen-1. Platforms like Hour One and Synthesia.ai use predefined video avatars that rely on deep learning to match audio to facial movements; these have been used extensively in building training videos but have also been misappropriated as deepfakes to spread disinformation.
Marketing & Commerce
AI Smart Contracts
Smart contracts are a must in decentralized networks, and procedural, event-triggered smart contracts like those used in today’s distributed ledger networks will still have a place in the Metaverse. However, AI smart contracts will also play a large role due to their ability to learn and generalize well to unseen data without explicit instruction. An AI smart contract could release funds when someone completes an objective in the Metaverse, such as legitimately selling 1,000 widgets to real buyers (where procedural code cannot validate the trustworthiness of buyers but AI can, through probabilistic analysis of each buyer’s past behavior), or serve safety applications, such as identifying questionable transactions with a minor that could reveal an incident of child grooming. AI smart contracts would be the mechanism for implementing any number of use cases, including transactions, dispute resolution and fraud detection, in a trustless and decentralized environment.
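As a purely hypothetical sketch of the widget-sale example (all field names, weights and thresholds here are invented for illustration, and the "model" is a hand-written scoring function standing in for real probabilistic inference), an AI smart contract might combine a procedural condition with a learned legitimacy score:

```python
from dataclasses import dataclass

@dataclass
class Sale:
    buyer_id: str
    account_age_days: int
    prior_purchases: int
    chargeback_rate: float

def legitimacy_score(sale):
    """Stand-in for a trained model: in practice this would be
    probabilistic inference over a buyer's past on-chain behavior."""
    score = 0.4 * min(sale.account_age_days / 365.0, 1.0)
    score += 0.4 * min(sale.prior_purchases / 50.0, 1.0)
    score += 0.2 * (1.0 - min(sale.chargeback_rate * 10.0, 1.0))
    return score

def should_release_funds(sales, target, threshold=0.5):
    """Procedural rule (enough sales) AND AI rule (each sale judged
    legitimate) must both hold before escrowed funds are released."""
    legitimate = [s for s in sales if legitimacy_score(s) >= threshold]
    return len(legitimate) >= target

sales = [
    Sale("alice", 730, 60, 0.00),   # established buyer -> legitimate
    Sale("mallory", 10, 0, 0.50),   # brand new, heavy chargebacks -> rejected
    Sale("bob", 365, 25, 0.01),     # moderate history -> legitimate
]
print(should_release_funds(sales, target=2))  # True: 2 of 3 sales pass
```

The key design point is the split: deterministic conditions stay as auditable procedural code, while the trust judgment that procedural code cannot express is delegated to a model.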
Transaction and Trade Settlements
For conventional settlements you may have a clearing house or a trusted third party that handles transactions, such as ACH, or a forex trader for currency exchange. In today’s decentralized networks we use liquidity pools, managed by smart contracts, to handle settlements, but these are susceptible to attacks and rug pulls. AI smart contracts could be used to manage liquidity pools to facilitate trades. Models can identify the trading parties, analyze the nature of the trade to determine whether it seems reasonable and is consistent with both parties’ past behaviors, and make recommendations to either party based on market conditions. A model can also use rules-based methods to ensure all conditions are met before completing the trade, and pull in external data from oracles to perform a more detailed analysis.
Arbitration
With transactions there will always be disputes, whether because a transaction was fraudulent and unauthorized or because there were issues with the quality or delivery of some good or service. Today this is handled by payment services like Visa, Mastercard, banks or services like PayPal, who employ human dispute resolution agents to analyze the facts of the dispute, whether they be photos of a damaged or incorrect product, tracking information showing a package was delivered, or any other empirical evidence that supports or counters a claim. Deep learning methods can perform the same forms of analysis, produce much more consistent outcomes when the facts are the same, and do so in a fraction of the time compared to a human assessor.
Fraud Detection
Fraud detection is a very straightforward use case that employs discriminative machine learning techniques today. The same sorts of analysis can be applied to distributed ledgers, examining the nature of a transaction, the credibility scores of both parties, transaction metadata like location, similarities to past transactions, and metadata sourced from oracles that adds further variables to consider during inference.
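At its simplest, this kind of discriminative analysis is outlier detection. The sketch below is a deliberately minimal statistical stand-in for a trained fraud model: it flags a transaction whose amount deviates sharply from a party's history, the same "similarity to past transactions" signal described above.

```python
import statistics

def is_anomalous(history, amount, z_threshold=3.0):
    """Flag a transaction whose amount lies more than z_threshold
    standard deviations from this party's historical mean. Real fraud
    models score many features (location, counterparty, timing), but
    the principle is the same: deviation from learned behavior."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return amount != mean  # no variation seen: anything new is suspect
    return abs(amount - mean) / stdev > z_threshold

history = [12.0, 15.0, 11.0, 14.0, 13.0, 12.5, 14.5]
print(is_anomalous(history, 13.0))   # False: consistent with history
print(is_anomalous(history, 250.0))  # True: far outside the pattern
```

A flagged transaction would not be rejected outright but routed for additional checks, exactly as card networks do today.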
Security, Privacy & Governance
Sentiment analysis can be performed on text, voice and video to determine the mood and state of an individual. Some examples are facial-tracking cameras reflecting the mood of the user through their avatar, customer service systems determining when a customer is at risk of terminating their service, and identifying when someone may be experiencing a mental health concern.
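Production systems use trained language models for this, but the core idea can be illustrated with a tiny lexicon-based scorer (the word lists below are invented for the example):

```python
import re

POSITIVE = {"great", "love", "happy", "excellent", "enjoy"}
NEGATIVE = {"terrible", "hate", "angry", "cancel", "frustrated"}

def sentiment(text):
    """Score in [-1, 1]: +1 if only positive lexicon words appear, -1 if
    only negative ones, 0 if neither. A customer-service agent might
    escalate to a human when the score stays low across a session."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment("I love this world, the avatars are excellent"))  # 1.0
print(sentiment("I am frustrated and angry, I want to cancel"))   # -1.0
```

Transformer-based classifiers replace the fixed word lists with learned representations, which is what lets them handle negation, sarcasm and context that a lexicon cannot.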
Virtual worlds will need to be protected from bad actors entering, and to keep out digital assets and avatars that are not appropriate to the virtual world. Although simple rules-based assessment of the metadata associated with a digital asset is the computationally quickest method, deep learning techniques in computer vision can analyze 3D and 2D digital assets to ensure they meet the virtual world’s entry criteria.
Moderators will be needed to ensure the safety of the Metaverse’s inhabitants by enforcing good user behavior and identifying illegal and immoral content. Human moderation is expensive, prone to inconsistent assessments and outcomes, and intractable in large-scale real-time environments. Machine learning models can be trained on the rules of a virtual world and used to moderate conversations and content in real time.
Just like the Internet, the Metaverse will be prone to exploits and adversarial attacks by bad actors and nation states, and subject to espionage and denial-of-service attacks. Where zero-day attacks once went undetected until signatures were updated in intrusion detection systems and firewall software, deep learning models generalize well and can identify previously unseen attacks.
Extensive work has been done at the intersection of machine learning and cybersecurity, including a study by Apruzzese et al. on “The Role of Machine Learning in Cybersecurity.” Many of the same paradigms employed for the Internet and the web are applicable to the Metaverse, and cybersecurity firms will surely develop solutions to address cybersecurity in the Metaverse.
Deep Reinforcement Learning for Cybersecurity Threat Detection and Protection: A Review
Facial recognition will continue to be an important factor for authentication using deep learning. It is fairly straightforward now in terms of technological maturity, but it was not so easy prior to 2017, when Apple introduced the feature in its iPhone X; at the time, others struggled to do facial recognition well. It has become a less active research space, but it remains very important going forward, as measures to defeat facial recognition will continue to be developed.
Deep learning can be used to detect authentication patterns and provide an extra layer of security against unauthorized access to virtual worlds. There will likely be mechanisms where past behaviors are assessed for patterns to determine whether an avatar is a bot or a real person; if they are determined to be a real person, they pass a preliminary check that allows authentication to take place. This would throttle brute-force attacks against virtual worlds and the trade of virtual digital assets.
The Metaverse will be an inclusive space, which means accessibility is key to facilitating interactions and experiences for all who wish to participate. Generative AI enables economic accessibility by reducing the barriers to content creation, such as powerful workstations, expensive software and costly training.
Machine learning has been successfully applied to automated, real-time captioning of speech in images and video to assist those with hearing difficulties, and is available in many commercial systems and software today, including Microsoft’s PowerPoint.
For blind or low-vision users, being able to obtain audio descriptions of images and videos is a necessity. Research published by Wang et al. on “Toward Automatic Audio Description Generation for Accessible Videos,” along with work conducted by myself and my team at UC Berkeley on Descriptive World, which uses a set of camera-equipped glasses to perform real-time audio description of the user’s surroundings, has demonstrated that deep learning can successfully address this challenge.
Audio description techniques could be extended to real-time 3D environments, which will be a necessity for accessibility in the Metaverse. For color-blind users, re-colorization can be easily achieved using the same techniques used to colorize black-and-white images and videos.
We are in the early days of both the Metaverse and AI; however, substantial progress is being made on both fronts. The Metaverse Standards Forum has over 3,500 members, and standards development organizations like the Khronos Group, IEEE, IETF, W3C and OMA3 have active working groups focused on standards development and supporting activities. Organizations like the Open Metaverse Foundation and the OMI Group are working on open source software projects that implement those standards and support the development of the Open Metaverse.
For AI, and especially generative AI, we are seeing exponential growth in the number of research papers released on arXiv, at a rate of about 5,000 per month, compared to only a couple hundred per month in 2010. Innovations are being made at a feverish pace, and each new innovation rapidly obsoletes previous work. Given the use cases we have covered, it is not hard to see that the realization of synthetic reality and the applications of AI in the Metaverse are not fantasy; in fact, they are absolutely necessary. Without sufficient advances in artificial intelligence I wholly believe that the Metaverse will not come to fruition; however, at the current rate of progress in AI research, I remain confident that the work being done to build Metaverse infrastructure and the advances in AI research will converge to empower the creator economy to build a Metaverse on a foundation of interoperable standards, open source software and state-of-the-art AI innovations.
The Ultimate Guide to 3D Model and Scene Generation Papers
Pinar Seyhan Demirdag and I put together a comprehensive list of 3D-related papers which we will work to keep up-to-date; however, with so many papers being published so quickly I would also suggest subscribing to Berkeley Synthetic’s LinkedIn where we post details on new papers including our results from code implementations.
The Ultimate Guide to 3D Model and Scene Generation Papers (Feb 2023)
arXiv is the de facto source for AI research papers (https://www.arxiv.org). However, I would caution that research papers can be challenging for readers who do not have a background in deep learning and its foundational mathematical disciplines, such as statistics, calculus and linear algebra, along with Python programming skills and familiarity with deep learning frameworks like PyTorch and TensorFlow.
Papers with Code
A great deal of AI innovations are not released with code. For those looking to implement the latest papers in AI, I would suggest looking at Papers with Code. (https://paperswithcode.com/)
Hugging Face maintains the Transformers and Diffusers libraries, which make getting started with the state of the art (SOTA) in generative AI model architectures very accessible. They also host projects that allow users to play with community creations, including interfaces for Stable Diffusion (1.5, 2.0, 2.1) and other fun implementations. (https://huggingface.co/)