Apple Unveils Progress in Multimodal Large Language Models

Apple researchers have gone public with new methods for training large language models on both text and images, an approach said to enable more powerful and flexible AI systems that could have significant ramifications for future Apple products. The new models, which Apple calls MM1, scale to 30 billion parameters. The researchers identify multimodal large language models (MLLMs) as “the next frontier in foundation models,” exceeding the performance of text-only LLMs and able to “excel at tasks like image captioning, visual question answering and natural language inference.”

“We demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks,” explains the Apple research paper.
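The data-mixture finding can be made concrete with a minimal sketch of weighted sampling across the three data types. The source names and mixture weights below are illustrative assumptions for the sake of the example, not values confirmed by the paper.

```python
import random

# Hypothetical pre-training corpora; the names and weights are
# illustrative assumptions, not Apple's published MM1 configuration.
MIXTURE = [
    ("image_caption", 0.45),           # (image, caption) pairs
    ("interleaved_image_text", 0.45),  # documents with images embedded in text
    ("text_only", 0.10),               # plain text, to preserve language ability
]

def sample_source(mixture=MIXTURE):
    """Pick the data source for the next pre-training batch,
    proportionally to its mixture weight."""
    sources, weights = zip(*mixture)
    return random.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    # Draw sources for a few batches to see the mix in action.
    for _ in range(5):
        print(sample_source())
```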

VentureBeat reports the researchers discovered that both the resolution of input images and the choice of image encoder had a substantial impact on model performance.

“We show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance,” the researchers write, leading to speculation that “continued scaling and refinement of the visual components of these multimodal models will be key to unlocking further gains,” according to VentureBeat.
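To illustrate which design knobs the ablations single out, here is a hedged sketch of a vision-side configuration. The class, field names, and default values are hypothetical, chosen only to make the finding concrete; the comments reflect the researchers’ reported relative impact.

```python
from dataclasses import dataclass

@dataclass
class VisionConfig:
    """Hypothetical configuration for the visual side of an MLLM.

    Per the researchers, the first three fields dominate performance,
    while the connector design matters comparatively little.
    """
    image_encoder: str = "vit-large"  # which pretrained image encoder (high impact)
    image_resolution: int = 336       # input image resolution in pixels (high impact)
    image_token_count: int = 144      # visual tokens passed to the LLM (high impact)
    connector: str = "average-pool"   # vision-language connector design (low impact)
```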

Combining computer vision and visual learning with natural language output presents challenges similar to those OpenAI addressed in its work with robotics company Figure to train a robot called Figure 01 with strong visual comprehension, suggesting Apple’s interest in a higher-level AI assistant, or potentially robotics.

“The MM1 research comes as Apple has been ramping up its investments in artificial intelligence in an effort to catch up with rivals like Google, Microsoft, and Amazon who have raced ahead in integrating generative AI capabilities into their products,” writes VentureBeat.

Apple is expected to spend $1 billion per year on AI development with an ultimate goal of putting GenAI on all its devices, per a Bloomberg report. A secret AI framework called Ajax and a chatbot currently named Apple GPT are among the projects said to be in the works.

SiliconANGLE writes that “Apple’s previous work on multimodal LLMs includes Ferret, a model that was quietly open-sourced in October before being noticed in December.”
