Meta’s Open-Source ImageBind Works Across Six Modalities

Meta Platforms has built and is open-sourcing ImageBind, an artificial intelligence model that links six modalities: audio, visual, text, thermal, movement and depth data. Currently a research project, it suggests a future in which AI models generate multisensory content. “ImageBind equips machines with a holistic understanding that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move,” Meta says. In other words, ImageBind’s approach more closely approximates human thinking by training on the relationships between things rather than ingesting massive datasets so as to absorb every possibility.

With ImageBind, “we’re introducing an approach that brings machines one step closer to humans’ ability to learn simultaneously, holistically, and directly from many different forms of information — without the need for explicit supervision (the process of organizing and labeling raw data),” Meta says in a blog post.

The platform “can outperform prior specialist models trained for one particular modality,” as described in the scientific paper “ImageBind: One Embedding Space to Bind Them All.”
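As the paper’s title suggests, the core idea is a single joint embedding space, with image data acting as the anchor that binds the other modalities together through naturally paired examples rather than exhaustive labeling. A rough, hypothetical sketch of how that kind of alignment is typically trained (not Meta’s actual code; the encoders and pairing here are stand-ins) is a symmetric contrastive loss that pulls paired image and audio embeddings together in the shared space:

```python
# Illustrative sketch only: align an audio encoder to an image encoder with
# a symmetric contrastive (InfoNCE-style) loss over paired image/audio clips.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, audio_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """image_emb, audio_emb: (batch, dim) embeddings of naturally paired examples."""
    # Normalize so that dot products are cosine similarities in the shared space.
    image_emb = F.normalize(image_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = image_emb @ audio_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))             # i-th image is paired with i-th clip
    # Symmetric cross-entropy: image-to-audio and audio-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 512
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```

Repeating the same recipe for each modality against images is what lets the modalities end up comparable to one another, even when they were never directly paired in training.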

Over the past year, Meta has showcased its generative AI capabilities, generating images, videos and audio from text with Make-A-Scene, Make-A-Video and AudioGen.

With ImageBind, the company says, developers can use the “modalities as input queries and retrieve outputs in other formats.” Meta calls ImageBind “an important step toward building machines that can analyze different kinds of data holistically, as humans do.”
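In practice, using one modality as a query and retrieving another reduces to nearest-neighbor search in the shared embedding space. A minimal, hypothetical sketch (not the ImageBind API; the embeddings and dimensions below are invented for illustration):

```python
# Illustrative sketch only: cross-modal retrieval as nearest-neighbor search
# over embeddings that already live in one shared space.
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, gallery_emb: torch.Tensor, k: int = 3):
    """Return indices of the k gallery items most similar to the query."""
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)
    scores = gallery_emb @ query_emb          # cosine similarity per gallery item
    return scores.topk(k).indices.tolist()

# E.g., an audio-clip embedding as the query against a gallery of image embeddings.
audio_query = torch.randn(512)
image_gallery = torch.randn(100, 512)
print(retrieve(audio_query, image_gallery))
```

Because every modality maps into the same space, the same search works in any direction: audio querying images, text querying depth maps, and so on.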

By way of example, Engadget writes that “if you’re standing in a stimulating environment like a busy city street, your brain (largely unconsciously) absorbs the sights, sounds and other sensory experiences to infer information about passing cars and pedestrians, tall buildings, weather and much more.” The goal is “to cross-reference multimodal data in the way current AI systems do for text inputs.”

One obvious use case for such technology is “a futuristic virtual reality device that not only generates audio and visual input but also your environment and movement on a physical stage,” writes The Verge. “You might ask it to emulate a long sea voyage, and it would not only place you on a ship with the noise of the waves in the background but also the rocking of the deck under your feet and the cool breeze of the ocean air.”
