By Paula Parisi, April 14, 2025
Among the many tech advancements unveiled at Google Cloud Next is a major generative media upgrade to Vertex AI, Google Cloud’s managed AI development platform. The new Vertex AI Media Studio lets enterprise users generate complete videos from scratch using text prompts. Lyria, Google’s text-to-music model, is now available on Vertex in private preview. Both are subject to an “allowlist.” Chirp 3 now creates custom voices with just 10 seconds of audio input, while Imagen 3 has gained improved capabilities for reconstructing missing or damaged portions of an image. Continue reading Vertex AI Movie Studio Can Create Videos from Start to Score
By Paula Parisi, April 14, 2025
YouTube is rolling out a new tool called Music Assistant to its Creator Music marketplace. Music Assistant uses generative AI to automatically add royalty-free instrumental background music to long-form videos. Accessed via a dedicated tab in Creator Music, the tool lets creators specify the style, mood and instruments of the desired music via text prompts. Creator Music is available in the U.S. to creators enrolled in the YouTube Partner Program. YouTube is also experimenting with a Shorts feature that automatically synchronizes audio for short-form video creators. Continue reading YouTube Brings Generative Music Assistant to Video Creation
By Paula Parisi, April 10, 2025
Amazon has updated its Nova model series: Nova Reel 1.1 can now generate AI videos of up to two minutes and adds a new “multi-shot” feature. Announced in December, Nova Reel marked Amazon’s initial foray into generative video. AWS developer advocate Elizabeth Fuentes says Nova Reel accommodates user prompts of up to 4,000 characters, which can generate a series of six-second shots composing a sequence of up to two minutes. The company also introduced the Nova Sonic real-time voice model that supports third-party enterprise development. Continue reading AWS Updates Nova Reels and Adds Nova Sonic Voice Model
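Since Nova Reel runs on Amazon Bedrock, a multi-shot request would presumably look something like the minimal sketch below, which uses boto3’s asynchronous Bedrock invocation. The model ID, task type and parameter names here are assumptions for illustration based on Bedrock conventions, not details confirmed in the announcement.

```python
import boto3

# A minimal sketch of invoking Nova Reel 1.1 on Amazon Bedrock.
# Model ID, task type and field names are assumed for illustration.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.start_async_invoke(  # video generation runs as an async job
    modelId="amazon.nova-reel-v1:1",   # assumed Nova Reel 1.1 model ID
    modelInput={
        "taskType": "MULTI_SHOT_AUTOMATED",  # assumed multi-shot mode
        "multiShotAutomatedParams": {
            # Prompts of up to 4,000 characters describe the full sequence.
            "text": "A drone shot over a rocky coastline at dawn, then a slow "
                    "pan across a lighthouse as waves crash below.",
        },
        "videoGenerationConfig": {
            "durationSeconds": 120,  # up to two minutes of six-second shots
            "dimension": "1280x720",
            "fps": 24,
        },
    },
    # Async jobs write the finished video to an S3 bucket you own.
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://your-bucket/videos/"}},
)
print(response["invocationArn"])  # poll get_async_invoke() with this ARN
```

Note that the finished video lands in the specified S3 bucket rather than in the API response, reflecting the job-based design a minutes-long render requires.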
By Paula Parisi, March 28, 2025
Alibaba Cloud has released Qwen2.5-Omni-7B, a new AI model the company claims is efficient enough to run on edge devices like mobile phones and laptops. Boasting a relatively light 7-billion-parameter footprint, Qwen2.5-Omni-7B understands text, images, audio and video and generates real-time responses in text and natural speech. Alibaba says its combination of compact size and multimodal capabilities is “unique,” offering “the perfect foundation for developing agile, cost-effective AI agents that deliver tangible value, especially intelligent voice applications.” One example would be using a phone’s camera to help a vision-impaired person navigate their environment. Continue reading Alibaba’s Powerful Multimodal Qwen Model Is Built for Mobile
By Paula Parisi, March 27, 2025
OpenAI has activated the multimodal image generation capabilities of GPT-4o, making them available to ChatGPT users on the Plus, Pro, Team and Free tiers. It replaces DALL-E 3 as the default image generator for the popular chatbot. GPT-4o’s accuracy with text, understanding of symbols and precision with prompts, combined with multimodal capabilities that allow the model to take cues from visual material, have transformed its image generation from largely unpredictable to “consistent and context-aware,” resulting in “a practical tool with precision and power,” claims OpenAI. Continue reading OpenAI Delivers Native GPT-4o Image Generator to ChatGPT
By Paula Parisi, March 27, 2025
Google has released what it calls its most intelligent AI model yet, Gemini 2.5. The first 2.5 model release, an experimental version of Gemini 2.5 Pro, is a next-gen reasoning model that Google says outperformed OpenAI o3-mini and Claude 3.7 Sonnet from Anthropic on common benchmarks “by meaningful margins.” Gemini 2.5 models “are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy,” according to Google. The new model comes just three months after Google released Gemini 2.0 with reasoning and agentic capabilities. Continue reading Google Debuts Next-Gen Reasoning Models with Gemini 2.5
By Paula Parisi, March 19, 2025
San Mateo, California-based game developer Roblox has released a 3D object generator called Cube 3D, the first of several models the company plans to make available. Cube currently generates 3D models and environments from text, and in the future the company plans to add image inputs. Roblox says it is open-sourcing the tool, making it available to users on and off the platform. Cube will serve as the core generative AI system for Roblox’s 3D and 4D plans, the latter referring to interactive responsiveness. The launch coincides with the Game Developers Conference, running through Friday in San Francisco. Continue reading Roblox Reveals Its Generative AI System Cube for 3D and 4D
By Paula Parisi, March 18, 2025
Baidu has launched two new AI systems: the native multimodal foundation model Ernie 4.5 and the deep-thinking reasoning model Ernie X1, which supports features like generative imaging, advanced search and webpage content comprehension. Baidu is touting Ernie X1 as delivering performance comparable to another Chinese model, DeepSeek-R1, at half the price. Both Baidu models are available to the public, including individual users, through the Ernie website. Baidu, the dominant search engine in China, says its new models mark a milestone in both reasoning and multimodal AI, “offering advanced capabilities at a more accessible price point.” Continue reading Baidu Releases New LLMs that Undercut Competition’s Price
By Paula Parisi, March 10, 2025
Google has added Gemini Embedding to its Gemini developer API. This new experimental model translates words, phrases and other text inputs into numerical representations, known as embeddings, that capture their semantic meaning. Embeddings are used in a wide range of applications including document retrieval and classification, potentially reducing costs and improving latency. Google is also testing an expansion of its AI Overviews search feature as part of a Gemini 2.0 update. Called AI Mode, it helps explain complex topics by generating search results that use advanced reasoning and thinking capabilities. Continue reading Google Updates AI Search and Intros Gemini Text Embedding
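To make the retrieval use case concrete, here is a minimal sketch of ranking documents against a query by cosine similarity using the google-generativeai Python SDK; the experimental model name is an assumption and may differ from the identifier Google ultimately ships.

```python
import numpy as np
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Assumed experimental model name; check the Gemini API docs for the
# current embedding model identifier.
MODEL = "models/gemini-embedding-exp-03-07"

def embed(text: str, task: str) -> np.ndarray:
    """Return the embedding vector for a piece of text."""
    result = genai.embed_content(model=MODEL, content=text, task_type=task)
    return np.array(result["embedding"])

docs = [
    "Embeddings map text to vectors that capture semantic meaning.",
    "Veo 2 generates video clips at up to 4K resolution.",
]
doc_vecs = [embed(d, "retrieval_document") for d in docs]
query_vec = embed("How do text embeddings work?", "retrieval_query")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: higher means semantically closer."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Print documents from most to least relevant to the query.
for doc, vec in sorted(zip(docs, doc_vecs), key=lambda p: -cosine(query_vec, p[1])):
    print(f"{cosine(query_vec, vec):.3f}  {doc}")
```

Because each document is embedded once and then compared with cheap vector math, this pattern avoids re-running a full language model per query, which is where the cost and latency savings come from.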
By Paula Parisi, March 5, 2025
Flora is a new software interface built by AI creatives for creative AI applications. Much like Apple reinvented the personal computer UI to make it feel natural for people who were not IT specialists, Flora aims to reframe the way designers and artists interact with generative AI. “AI tools make it easy to create, but lack creative control,” the startup’s founder Weber Wong says, opining that such tools have proven “great for making AI slop, but not for doing great creative work.” Wong’s goal is to make an AI interface everyone will find comfortable and intuitive, simplifying use and curating “the best text, image, and video models.” Continue reading Flora Is a New AI Interface Geared Toward Helping Creatives
By Paula Parisi, February 7, 2025
Snap has created a lightweight AI text-to-image model that will run on-device, expected to power some Snapchat mobile features in the months ahead. Running on an iPhone 16 Pro Max, the model can produce high-resolution images in approximately 1.4 seconds entirely on the phone, which reduces computational costs. Snap says the research model “is the continuation of our long-term investment in cutting edge AI and ML technologies that enable some of today’s most advanced interactive developer and consumer experiences.” Among the Snapchat AI features the new model will enhance are AI Snaps and AI Bitmoji Backgrounds. Continue reading Snap Develops a Lightweight Text-to-Video AI Model In-House
By Paula Parisi, February 6, 2025
ByteDance has developed a generative model that can use a single photo to generate photorealistic video of humans in motion. Called OmniHuman-1, the multimodal system supports various visual and audio styles and can generate people singing, dancing, speaking and moving in a natural fashion. ByteDance says its new technology clears hurdles that hinder existing human-animation generators, such as short play times and over-reliance on high-quality training data. The diffusion transformer-based OmniHuman addresses those challenges by mixing motion-related conditions into the training phase, a solution ByteDance researchers claim is new. Continue reading ByteDance’s AI Model Can Generate Video from Single Image
By Paula Parisi, December 18, 2024
Attempting to stay ahead of OpenAI in the generative video race, Google announced Veo 2, which it says can output 4K clips of two minutes or more at 4096 x 2160 pixels. Competitor Sora can generate video of up to 20 seconds at 1080p. However, TechCrunch says Veo 2’s supremacy is “theoretical” since it is currently available only through Google Labs’ experimental VideoFX platform, which is limited to videos of up to eight seconds at 720p. VideoFX is also waitlisted, but Google says it will expand access this week (with no comment on raising the duration and resolution caps). Continue reading Veo 2 Is Unveiled Weeks After Google Debuted Veo in Preview
By Paula Parisi, December 12, 2024
Ten months after its preview, OpenAI has officially released a Sora video model called Sora Turbo. Described as “hyperrealistic,” Sora Turbo generates clips of 10 to 20 seconds from text or image inputs. It outputs video in widescreen, vertical or square aspect ratios at resolutions from 480p to 1080p. The new product is being made available to ChatGPT Plus and Pro subscribers ($20 and $200 per month, respectively) but is not yet included with ChatGPT Team, Enterprise, or Edu plans, or available to minors. The company explains that Sora videos contain C2PA metadata indicating that they were generated by AI. Continue reading OpenAI Releases Sora, Adding It to ChatGPT Plus, Pro Plans
By Paula Parisi, November 27, 2024
Nvidia has unveiled an AI sound model research project called Fugatto that “can create any combination of music, voices and sounds” based on text and audio inputs. Nvidia describes it as “the world’s most flexible sound machine,” and many appear to agree that the new model represents an audio breakthrough, with the potential to generate a wide array of sounds that have not previously existed. While popular sound models from companies including Suno and ElevenLabs “can compose a song or modify a voice, none have the dexterity of the new offering,” Nvidia claims. Continue reading Nvidia AI Model Fugatto a Breakthrough in Generative Sound