Alibaba’s Latest Vision Model Has Advanced Video Capability

China’s largest cloud computing company, Alibaba Cloud, has released a new computer vision model, Qwen2-VL, which the company says improves on its predecessor in visual understanding, including video comprehension and recognition of text within images in languages including English, Japanese, French, Spanish, Chinese and others. The company says the model can analyze videos more than 20 minutes long and answer questions about their content. Third-party benchmark tests compare Qwen2-VL favorably to leading competitors, and the company is releasing two open-source versions, with a larger private model to come.
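
For a sense of how that video question-answering might look in practice, here is a minimal sketch using the open-weights checkpoints through Hugging Face transformers, following the usage pattern on the public model card. It assumes a transformers release with Qwen2-VL support plus the qwen-vl-utils helper package; the video path and the question are placeholders.

```python
# Minimal sketch: asking Qwen2-VL about a local video via Hugging Face
# transformers. Assumes `pip install qwen-vl-utils` and a transformers
# version that includes Qwen2-VL; the video path is a placeholder.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
        {"type": "text", "text": "Summarize what happens in this video."},
    ],
}]

# Build the chat prompt and sample frames from the video.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding the answer.
generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```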

“With the new Qwen2-VL, Alibaba is seeking to set new standards for AI models’ interaction with visual data, including the capability to analyze and discern handwriting in multiple languages, identify, describe and distinguish between multiple objects in still images, and even analyze live video in near real time, providing summaries or feedback that could open the door to it being used for tech support and other helpful live operations,” reports VentureBeat.

Alibaba says the model “can maintain a continuous flow of conversation in real time, offering live chat support,” according to VB, which notes that functionality “allows it to act as a personal assistant, helping users by providing insights and information drawn directly from video content.”

It can also handle function calling and tool use based on visual input, “enabling it to retrieve and access external data, such as flight statuses, weather forecasts and package tracking,” writes SiliconANGLE. “That would make it useful for interacting with customer service or workers in the field who could show it images of products, bar codes or other information.”
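
SiliconANGLE doesn’t detail the interface behind this, so the following is only a sketch of a common prompt-based tool-calling pattern such a workflow could use: tools are described in the system prompt, the model is asked to answer with a JSON call, and the application parses and executes it. The tool name get_package_status and its implementation are hypothetical, and the canned reply stands in for actual model generation (say, after the model has read a photo of a shipping label).

```python
# Sketch of a prompt-based tool-calling loop around a vision-language model.
# Everything here is illustrative: the tool, its schema, and the canned
# model reply are hypothetical stand-ins, not an official Qwen2-VL API.
import json

TOOLS = [{
    "name": "get_package_status",  # hypothetical tool for illustration
    "description": "Look up shipping status for a tracking number.",
    "parameters": {"tracking_number": "string"},
}]

SYSTEM_PROMPT = (
    "You can call tools. When one is needed, reply with only a JSON object "
    f'of the form {{"tool": ..., "arguments": ...}}. Tools: {json.dumps(TOOLS)}'
)

def get_package_status(tracking_number: str) -> str:
    """Hypothetical backend lookup; a real app would call a carrier API."""
    return f"Package {tracking_number}: in transit, arriving Friday."

def run_tool_call(model_reply: str) -> str | None:
    """Execute the tool if the model's reply is a JSON tool request."""
    try:
        call = json.loads(model_reply)
    except json.JSONDecodeError:
        return None  # plain-text answer, no tool requested
    if call.get("tool") == "get_package_status":
        return get_package_status(**call["arguments"])
    return None

# Canned reply standing in for generation after the model has read,
# for example, a photo of a shipping label:
reply = '{"tool": "get_package_status", "arguments": {"tracking_number": "1Z999"}}'
print(run_tool_call(reply))  # -> Package 1Z999: in transit, arriving Friday.
```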

Qwen2-VL comes in three versions: Qwen2-VL-72B, Qwen2-VL-7B and Qwen2-VL-2B. Only the two smaller models are open source, available under an Apache 2.0 license on platforms including Hugging Face, as sketched below.
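
Since the two smaller checkpoints are published openly on Hugging Face, fetching one locally is a one-liner; here is a sketch using the huggingface_hub client, with the repo ID taken from the public model listing:

```python
# Download the Apache 2.0-licensed 2B checkpoint from Hugging Face.
# Repo ID per the public listing; swap in the 7B variant as needed.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("Qwen/Qwen2-VL-2B-Instruct")
print("Model files downloaded to:", local_dir)
```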

SiliconANGLE points out limitations, including that “it is unable to extract audio from video files, given that it’s designed only for visual reasoning,” and that at launch the model’s training data has a cutoff date of June 2023.

“It cannot guarantee complete accuracy for complex instructions or scenarios,” SiliconANGLE explains, while noting Alibaba’s claim that “the model’s performance and visual capabilities showcased top-tier benchmarks across most metrics, even surpassing closed-source models such as OpenAI’s flagship GPT-4o and Anthropic’s Claude 3.5 Sonnet.”

VentureBeat lists Meta’s Llama 3.1 and Google’s Gemini-1.5 Flash among the competing models against which Qwen2-VL performed well.

Demo: a WebUI based on Qwen2-VL-72B, developed by Alibaba Cloud.
