Meta Platforms has unveiled its first natively multimodal model, Chameleon, which observers say could make the company competitive with frontier model firms. Although Chameleon is not yet released, Meta says internal research indicates it outperforms the company’s own Llama 2 in text-only tasks and “matches or exceeds the performance of much larger models” including Google’s Gemini Pro and OpenAI’s GPT-4V in a mixed-modal generation evaluation “where either the prompt or outputs contain mixed sequences of both images and text.” In addition, Meta calls Chameleon’s image generation “non-trivial,” noting that’s “all in a single model.”
“The architecture of Chameleon can unlock new AI applications that require a deep understanding of both visual and textual information,” reports VentureBeat, citing Meta experiments that show “Chameleon achieves state-of-the-art performance in various tasks, including image captioning and visual question answering (VQA), while remaining competitive in text-only tasks.”
In the introduction to its research paper, Meta writes that “Chameleon marks a significant step forward in a unified modeling of full multimodal documents.” PetaPixel takes a deep dive into the meaning of multimodality.
What sets Chameleon apart is that it is purpose-built to be multimodal. Unlike the usual approach to multimodal foundation models, which stitches together separately trained models for different modalities (an approach known as “late fusion”), Chameleon uses an early-fusion token-based mixed-modal architecture.
That means “it has been designed from the ground up to learn from an interleaved mixture of images, text, code and other modalities,” VentureBeat reports, noting “Chameleon transforms images into discrete tokens, as language models do with words. It also uses a unified vocabulary that consists of text, code and image tokens.”
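To make the early-fusion approach concrete, here is a minimal Python sketch of how text and image content can be mapped into one shared token vocabulary and interleaved into a single sequence. Everything here is illustrative: the function names, the vocabulary sizes, and the stand-in quantizer (a hash in place of a learned VQ-style codebook) are assumptions for demonstration, not Meta’s actual implementation.

```python
from typing import List

# Hypothetical vocabulary sizes (illustrative, not Meta's actual numbers).
TEXT_VOCAB_SIZE = 65_536       # text and code tokens
IMAGE_CODEBOOK_SIZE = 8_192    # discrete image tokens from a learned quantizer

# Sentinel tokens marking image boundaries inside the unified vocabulary.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # "begin of image"
EOI = BOI + 1                                # "end of image"

def tokenize_text(text: str) -> List[int]:
    """Stand-in for a real subword tokenizer; maps characters to ids here."""
    return [ord(c) % TEXT_VOCAB_SIZE for c in text]

def quantize_image(patches: List[List[float]]) -> List[int]:
    """Stand-in for a learned image quantizer that maps each patch to a
    discrete codebook id. Ids are offset by TEXT_VOCAB_SIZE so image tokens
    occupy their own range of the same unified vocabulary as text tokens."""
    return [TEXT_VOCAB_SIZE + (hash(tuple(p)) % IMAGE_CODEBOOK_SIZE)
            for p in patches]

def build_mixed_modal_sequence(text_before: str,
                               patches: List[List[float]],
                               text_after: str) -> List[int]:
    """Early fusion: text and image tokens interleave in ONE sequence that a
    single transformer consumes, rather than passing each modality through
    its own encoder and merging the results later (late fusion)."""
    return (tokenize_text(text_before)
            + [BOI] + quantize_image(patches) + [EOI]
            + tokenize_text(text_after))

if __name__ == "__main__":
    toy_patches = [[0.1, 0.2], [0.3, 0.4]]  # toy stand-in for patch features
    seq = build_mixed_modal_sequence("Caption: ", toy_patches, " A cat.")
    print(seq)  # one interleaved token stream over a unified vocabulary
```

The point of the sketch is the shared vocabulary: because image tokens and text tokens are drawn from one id space, a single model can be trained end to end on interleaved sequences of both.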
This is an improvement over late fusion, which thus far exhibits a limited ability “to integrate information across modalities and generate sequences of interleaved images and text,” VentureBeat explains.
Meta’s Chameleon research caps a stretch in which “the arms-race between technology companies regarding AI models has heated up significantly in the last two weeks,” with OpenAI unveiling GPT-4o just before Google announced its Gemini upgrade across Search, followed by Microsoft’s big reveal of Copilot+ PCs and other AI improvements, Tom’s Guide reports.
Meta’s Chameleon should not be confused with “the generative AI model CM3leon (pronounced Chameleon) that Meta AI revealed last summer,” though the new model suggests an evolution of that work, notes Tom’s Guide.
The research paper’s authors say Chameleon is similar to Gemini, though unlike the Google model, Chameleon is “an end-to-end model.” There is no word on when, or whether, Chameleon will be released, or whether it will be open source like Llama 2.