Allen Institute Announces Vision-Optimized Molmo AI Models

The Allen Institute for AI (also known as Ai2, founded by Paul Allen and led by Ali Farhadi) has launched Molmo, a family of four open-source multimodal models. While advanced models “can perceive the world and communicate with us,” Ai2 says, “Molmo goes beyond that to enable one to act in their worlds, unlocking a whole new generation of capabilities, everything from sophisticated web agents to robotics.” On some third-party benchmark tests, Molmo’s 72-billion-parameter flagship outperforms other open models and “performs favorably” against proprietary rivals such as OpenAI’s GPT-4o, Google’s Gemini 1.5, and Anthropic’s Claude 3.5 Sonnet, according to Ai2.

Molmo “weighs in at (according to best estimates) about a tenth” the size of those rivals, yet still “approaches their level of capability,” TechCrunch reports.

The smaller-is-better strategy “lowers barriers to development and provides a robust foundation for the AI community to build innovative applications around Molmo’s unique capabilities,” Ai2 explains in a news release. The most efficient Molmo model has only 1 billion active parameters, making it suitable for on-device use on mobile hardware.

The four main Molmo models, as described by VentureBeat, are:

  • Molmo-72B (72 billion parameters, or settings; the flagship model, based on Alibaba Cloud’s Qwen2-72B open-source model)
  • Molmo-7B-D (“demo model,” based on Alibaba’s Qwen2-7B model)
  • Molmo-7B-O (based on Ai2’s OLMo-7B model)
  • MolmoE-1B (based on Ai2’s OLMoE-1B-7B mixture-of-experts LLM, which Ai2 says “nearly matches the performance of GPT-4V on both academic benchmarks and user preference”)

Technical details have been released in a scientific paper linked from the Ai2 blog, which also includes benchmark highlights and a demo user interface. Select model weights, inference code, and demo materials are available now, and Ai2 says it will release all weights, captioning and fine-tuning data, and source code “in the near future.”
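To show what working with the released weights can look like, here is a minimal inference sketch using Hugging Face’s transformers library. The checkpoint name allenai/Molmo-7B-D-0924 and the custom processor.process and model.generate_from_batch helpers (loaded via trust_remote_code) follow the published model cards and should be treated as assumptions that may shift as Ai2 releases the rest of its code.

```python
# Minimal sketch (not official Ai2 code): load a released Molmo checkpoint
# from Hugging Face and ask it to describe an image. The checkpoint name and
# the custom process()/generate_from_batch() helpers follow the published
# model card and are pulled in via trust_remote_code.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Any RGB image works; here one is fetched from the web.
image = Image.open(
    requests.get("https://picsum.photos/id/237/536/354", stream=True).raw
)

inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}  # batch of 1

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Decode only the newly generated tokens.
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```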

VentureBeat writes that Vaibhav Srivastav, a machine learning developer advocate engineer at Hugging Face, feels “Molmo offers a formidable alternative to closed systems, setting a new standard for open multimodal AI,” and quotes Google DeepMind robotics researcher Ted Xiao calling it “exciting” and the first open visual language model (VLM) “optimized for visual grounding.”

“This capability allows Molmo to provide visual explanations and interact more effectively with physical environments, a feature that is currently lacking in most other multimodal models,” VentureBeat explains.
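To make that concrete: in grounded answers, Molmo can return coordinates it is “pointing” at, which downstream software can act on. The snippet below is a hypothetical illustration, assuming a response that wraps points in XML-like <point> tags with x and y expressed as percentages of the image size; both the tag format and the example response string are assumptions here, not Ai2’s documented output.

```python
import re

# Hypothetical Molmo-style response: coordinates are assumed to be percentages
# of image width/height wrapped in <point> tags (the format is an assumption).
response = (
    'The search button is here: '
    '<point x="92.1" y="6.4" alt="search button">search button</point>'
)

def parse_points(text: str):
    """Return a list of (x_pct, y_pct, label) tuples from <point> tags."""
    pattern = r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>(.*?)</point>'
    return [(float(x), float(y), label) for x, y, label in re.findall(pattern, text)]

def to_pixels(point, width, height):
    """Convert percentage coordinates into pixel coordinates for a given image."""
    x_pct, y_pct, label = point
    return int(x_pct / 100 * width), int(y_pct / 100 * height), label

for p in parse_points(response):
    # A web agent could click here; a robot controller could target this spot.
    print(to_pixels(p, width=1280, height=800))
```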

The secret to training on less data is using “better quality data,” TechCrunch writes. “Instead of training on a library of billions of images that can’t possibly all be quality controlled, described, or deduplicated, Ai2 curated and annotated a set of just 600,000.”
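As a generic illustration of one such curation step, the sketch below filters near-duplicate images with perceptual hashing via the third-party imagehash package. This is not Ai2’s documented pipeline; it only shows the kind of deduplication the passage refers to, and why it is tractable on a set of 600,000 images in a way it cannot be on billions.

```python
# Generic near-duplicate filter (illustration only, not Ai2's pipeline),
# using perceptual hashes from the third-party "imagehash" package.
from pathlib import Path

import imagehash  # pip install imagehash
from PIL import Image

def dedup_images(image_dir: str, max_hamming: int = 4) -> list[Path]:
    """Keep one representative per cluster of visually near-identical images."""
    kept_paths: list[Path] = []
    kept_hashes: list[imagehash.ImageHash] = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        # Two images count as duplicates if their hashes differ in only a few bits.
        if all(h - prev > max_hamming for prev in kept_hashes):
            kept_hashes.append(h)
            kept_paths.append(path)
    return kept_paths

if __name__ == "__main__":
    unique = dedup_images("images/")
    print(f"{len(unique)} unique images kept")
```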
