Nvidia Releases Open-Source Frontier-Class Multimodal LLMs

Nvidia has unveiled NVLM 1.0, a family of open-source multimodal LLMs that the company says performs comparably to proprietary systems from OpenAI and Google. Led by the 72-billion-parameter NVLM-D-72B, Nvidia’s new entry in the AI race achieved what the company describes as “state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models.” Nvidia has made the model weights publicly available and says it will also release the training code, a break from the closed approach of OpenAI, Anthropic and Google.

“Nvidia dropped a bombshell” is how VentureBeat characterizes the move, which it says “grants researchers and developers unprecedented access to cutting-edge technology.”

For the world’s leading supplier of AI chips, lowering the barrier to entry for software that will spur sales of its flagship hardware is a canny business move. The model files can be downloaded from Hugging Face.
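For developers who want to try the model, the weights can be pulled with standard Hugging Face tooling. The snippet below is only a minimal sketch: the repository ID nvidia/NVLM-D-72B and the loading details are assumptions, and the model card on Hugging Face remains the authoritative recipe.

```python
# Minimal sketch of downloading and loading the NVLM-D-72B weights with
# standard Hugging Face tooling. The repository ID and loading flags are
# assumptions; see the model card for the supported procedure.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nvidia/NVLM-D-72B"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # at 72B parameters, half precision and
    device_map="auto",           # multi-GPU sharding are effectively required
    trust_remote_code=True,      # the repo ships custom multimodal model code
).eval()

# Image preprocessing and generation go through the utilities documented on
# the model card rather than a generic pipeline call.
```

Even inference at this scale calls for multiple high-memory GPUs, which is precisely the hardware Nvidia sells.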

“The NVLM-D-72B model shows impressive adaptability in processing complex visual and textual inputs,” writes VentureBeat, adding that in Nvidia’s white paper “researchers provided examples that highlight the model’s ability to interpret memes, analyze images, and solve mathematical problems step-by-step.”

According to Digital Trends, this “production-grade multimodality” is available out of the box, “with exceptional performance across a variety of vision and language tasks, in addition to improved text-based responses compared to the base LLM that the NVLM family is based on.”

To accomplish this, Nvidia integrated “a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities,” the researchers explain.
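In practice, that kind of blending amounts to interleaving samples from a curated text-only corpus with the multimodal image-text, math and reasoning data during fine-tuning. The sketch below is purely illustrative: the dataset objects and the mixing ratio are hypothetical stand-ins, and Nvidia’s actual data recipe is the one described in its paper.

```python
import random
from typing import Iterable, Iterator

# Illustrative sketch of blending a high-quality text-only corpus into
# multimodal fine-tuning. The dataset objects and the 0.3 text ratio are
# hypothetical stand-ins, not Nvidia's published recipe.
def blended_samples(
    text_only_ds: Iterable,   # e.g., curated text-only examples
    multimodal_ds: Iterable,  # e.g., image-text, math and reasoning examples
    text_ratio: float = 0.3,
    seed: int = 0,
) -> Iterator[dict]:
    """Yield training samples, drawing from the text-only corpus with
    probability `text_ratio` and from the multimodal corpus otherwise."""
    rng = random.Random(seed)
    text_iter, mm_iter = iter(text_only_ds), iter(multimodal_ds)
    while True:
        try:
            if rng.random() < text_ratio:
                yield {"modality": "text", "sample": next(text_iter)}
            else:
                yield {"modality": "multimodal", "sample": next(mm_iter)}
        except StopIteration:
            return  # stop when either corpus is exhausted
```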

NVLM-D-72B is notable in that it “improves its performance on text-only tasks after multimodal training,” while some of the other leading models “see a decline in text performance” after similar training, writes VentureBeat, citing Nvidia’s claim that the model “increased its accuracy by an average of 4.3 points across key text benchmarks.”

“Nvidia appears serious about ensuring that this model meets the Open Source Initiative’s newest definition of ‘open source’ by not only making its model weights available for public review, but also promising to release the model’s source code,” reports Digital Trends, calling this “a marked departure from the actions of rivals like OpenAI and Google.”

In doing so, Nvidia is positioning its NVLM family “as a foundation for third-party developers to build their own chatbots and AI applications,” not necessarily to “compete directly against ChatGPT-4o and Gemini 1.5 Pro,” although indirectly it will.
