AWS Building Trainium-Powered Supercomputer with Anthropic

Amazon Web Services is building a supercomputer in collaboration with Anthropic, the AI startup in which the e-commerce giant holds an $8 billion minority stake. Hundreds of thousands of AWS’s flagship Trainium chips will be amassed in an “Ultracluster” that, when completed in 2025, will be one of the largest supercomputers in the world for model training, Amazon says. The company also announced the general availability of AWS Trainium2-powered Amazon Elastic Compute Cloud (EC2) instances as well as Trn2 UltraServers designed to train and deploy AI models, and teased its next-generation Trainium3 chips.

The announcements were made at AWS’s re:Invent conference in Las Vegas, where AWS said Apple and Databricks are new Trainium customers. The spotlight on AWS silicon suggests “the company is positioning [Trainium] as a viable alternative to the graphics processing units, or GPUs, sold by chip giant Nvidia,” writes The Wall Street Journal.

Nvidia has become a GPU juggernaut, but the trend of major tech companies designing their own AI chips is worth noting: Microsoft with Azure Maia, Google with its Tensor Processing Units (TPUs), and Cupertino with its own Apple Silicon. All of them continue to use Nvidia GPUs as well, but they are hedging their bets.

“Today, there’s really only one choice on the GPU side, and it’s just Nvidia. We think that customers would appreciate having multiple choices,” AWS CEO Matt Garman told the WSJ.

Wired sees the AWS-Anthropic AI supercomputer initiative, dubbed Project Rainier, as yet another challenge to the status quo.

“Amazon is building one of the world’s most powerful artificial intelligence supercomputers in collaboration with Anthropic, an OpenAI rival that is working to push the frontier of what is possible with artificial intelligence,” Wired writes, adding that “when completed, it will be five times larger than the cluster used to build Anthropic’s current most powerful model.”

Wired reports that AWS is positioning Trainium clusters as “30 to 40 percent cheaper than those that feature Nvidia’s GPUs.”

AWS’s new Trn2 UltraServers up the game, using an “ultra-fast NeuronLink interconnect to connect four Trn2 servers together into one giant server, enabling the fastest training and inference on AWS for the world’s largest models,” Amazon explains in its announcement.

The Project Rainier supercomputer will have “more than 5x the number of exaflops” used to train the current generation of leading AI models, AWS says, adding that “Trainium3, its next generation AI chip, will allow customers to build bigger models faster and deliver superior real-time performance when deploying them.”

WSJ cites IDC figures pegging the market for AI semiconductors at “an estimated $117.5 billion in 2024,” with Nvidia corralling “about 95 percent of the market,” which IDC predicts will reach “an expected $193.3 billion by the end of 2027.”
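Taken together, those IDC figures imply a brisk growth rate for the AI chip market. A quick sketch of the arithmetic (assuming three years of compound growth between the end of 2024 and the end of 2027; the growth rate itself is derived here, not quoted by IDC):

```python
# Implied compound annual growth rate (CAGR) of the AI semiconductor
# market, from the IDC figures quoted above: an estimated $117.5B in
# 2024 growing to an expected $193.3B by the end of 2027.
market_2024 = 117.5  # USD billions (IDC estimate for 2024)
market_2027 = 193.3  # USD billions (IDC projection for end of 2027)
years = 3            # assumption: three full years of compounding

cagr = (market_2027 / market_2024) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # roughly 18% per year
```

At that pace, Nvidia’s reported ~95 percent share would remain an enormous prize for Trainium and other challengers to chip away at.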
