MAGE AI Unifies Generative and Recognition Image Training

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced a computer vision system that combines image recognition and image generation technology into one training model instead of two. The result, MAGE (short for MAsked Generative Encoder), holds promise for a wide variety of use cases and is expected to reduce costs through unified training, according to the team. “To the best of our knowledge, this is the first model that achieves close to state-of-the-art results for both tasks using the same data and training paradigm,” the researchers said.

“Since generation and recognition tasks require both visual and semantic understanding of data, they should be complementary when combined in a single framework,” but at present these two types of models are typically trained independently, the researchers write in an academic paper on the new model.

“Generation benefits representation by ensuring that both high-level semantics and low-level visual details are captured; conversely, representation benefits generation by providing rich semantic guidance,” explain the MAGE creators.

While developers working in natural language processing have leveraged this synergy — resulting in frameworks such as Google’s BERT and OpenAI’s DALL-E 2 — in computer vision “there are currently no widely adopted models that unify image generation and representation learning in the same framework” with results approaching state-of-the-art, according to the researchers.

To accomplish this unification, MAGE uses masking (rather than the diffusion approach behind DALL-E 2). To develop MAGE, the researchers “used a pre-training approach called masked token modeling,” converting sections of image data “into abstracted versions represented by semantic tokens,” reports VentureBeat.

With input image resolution at 256×256 (“to be consistent with previous generative models,” the paper explains), the resulting token grid is 16×16, for a sequence of 256 tokens. VentureBeat likens the tokens to “mini jigsaw puzzle pieces,” each representing a patch of the original image.
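The arithmetic behind the “jigsaw puzzle pieces” can be sketched as follows. Note that MAGE itself uses a learned tokenizer to produce semantic tokens; slicing the raw image into 16×16-pixel patches, as below, is a simplification that only illustrates how a 256×256 image maps to a 16×16 token grid.

```python
import numpy as np

# A 256x256 RGB image, matching the input resolution quoted in the paper.
image = np.arange(256 * 256 * 3).reshape(256, 256, 3)

# Cut the image into a 16x16 grid of 16x16-pixel patches -- the
# "mini jigsaw puzzle pieces" each token stands in for.
patch = 16
patches = image.reshape(16, patch, 16, patch, 3).swapaxes(1, 2)
print(patches.shape)  # (16, 16, 16, 16, 3): a 16x16 grid of patches

# Flattened into a sequence, that grid yields 256 tokens per image.
print(patches.reshape(-1, patch, patch, 3).shape[0])  # 256
```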

“Once the tokens were ready, some of them were randomly masked and a neural network was trained to predict the hidden ones by gathering the context from the surrounding tokens. That way, the system learned to understand the patterns in an image (image recognition) as well as generate new ones (image generation),” VentureBeat explains.
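The masking-and-prediction loop described above can be sketched in a few lines. This is not MAGE’s actual training code: the codebook size, the special mask-token id, and the uniform stand-in “predictor” are all illustrative assumptions; a real model would be a transformer producing learned logits. The sketch only shows where the mask goes and that the loss is computed at the hidden positions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 1024  # size of the token codebook (illustrative value)
MASK = vocab_size  # a special [MASK] id outside the codebook (assumption)

# One image represented as a sequence of 256 semantic tokens.
tokens = rng.integers(0, vocab_size, size=256)

# Randomly mask a subset of positions; the network sees [MASK] there.
masked_positions = rng.random(256) < 0.5
inputs = np.where(masked_positions, MASK, tokens)

# Stand-in predictor: uniform logits over the codebook at every position.
logits = np.zeros((256, vocab_size))

# Cross-entropy is taken only at the masked positions, so the model must
# infer each hidden token from the surrounding, visible context.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[masked_positions, tokens[masked_positions]].mean()
print(round(loss, 3))  # log(1024) ~= 6.931 for a uniform predictor
```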

“Our key insight in this work is that generation is viewed as ‘reconstructing’ images that are 100 percent masked, while representation learning is viewed as ‘encoding’ images that are zero percent masked,” the researchers write in their paper.
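That insight places both tasks on a single masking-ratio spectrum, which can be sketched as below. The in-between sampling distribution and its parameters are illustrative stand-ins, not the paper’s exact pre-training schedule; only the two endpoints come directly from the quote above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_ratio_for(task: str) -> float:
    """Generation and representation sit at opposite ends of one spectrum."""
    if task == "generation":
        return 1.0  # reconstruct an image that is 100 percent masked
    if task == "representation":
        return 0.0  # encode an image that is zero percent masked
    # Pre-training samples a ratio in between so one model covers both
    # regimes; this clipped Gaussian is an illustrative stand-in.
    return float(np.clip(rng.normal(0.55, 0.25), 0.0, 1.0))

print(mask_ratio_for("generation"))      # 1.0
print(mask_ratio_for("representation"))  # 0.0
```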

Computer vision is a field of artificial intelligence that gives computers the ability to extract information from digital images, video, and other visual input, enabling them to understand and interpret the world around them. Its potential use cases include navigation systems in self-driving cars, medical diagnostics, security and surveillance systems, inventory tracking in retail, and inspection tasks in the manufacturing supply chain.

MAGE “comes at a time when enterprises are going all-in on AI, particularly generative technologies, for improving workflows,” says VentureBeat, adding that “the MIT system still has some flaws and will need to be perfected in the coming months if it is to see adoption.”

Related:
Can We No Longer Believe Anything We See?, The New York Times, 4/8/23
The Best AI Image Generators in 2023, PetaPixel, 2/15/23
