Demand for artificial intelligence from enterprises and consumers alike is putting tremendous pressure on cloud service providers to supply the vast data center resources required to train models and deploy the resulting apps. Microsoft recently opened up about the pivotal role its Azure cloud computing platform played in getting OpenAI’s ChatGPT to release, linking “tens of thousands” of Nvidia A100 GPUs to train the model. Microsoft is already upgrading Azure with Nvidia’s new H100 chips and latest InfiniBand networking to accommodate the next generation of AI supercomputers.
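Training at that scale depends on distributed data parallelism across GPUs joined by high-bandwidth interconnects such as InfiniBand. As a rough illustration only, and not OpenAI’s or Microsoft’s actual training stack, the sketch below shows a minimal PyTorch setup using the NCCL backend, which exploits InfiniBand and NVLink when present; the model, dimensions, and step count are placeholders.

```python
# Minimal multi-GPU training sketch with PyTorch DistributedDataParallel.
# Illustrative only: a stand-in for the kind of large-scale setup described
# in the article, not OpenAI's or Microsoft's actual training code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")  # NCCL uses InfiniBand/NVLink when available
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real LLM would be a transformer with sharded state.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):  # stand-in for a long-running training loop
        x = torch.randn(32, 4096, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()      # gradients are all-reduced across every GPU
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py` on each node, the gradient all-reduce traffic between nodes is exactly what makes the InfiniBand fabric Microsoft mentions so important.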
Azure helped make possible the viral GPT chatbot, which has accrued more than 1 million users since its public debut in November, making it the fastest consumer app launch in history. It is now used by businesses run by billionaire hedge fund manager Ken Griffin, as well as by the food-delivery service Instacart, while Meta Platforms and Shopify are reportedly using the underlying technology for their customer service apps.
“We built a system architecture that could operate and be reliable at a very large scale. That’s what resulted in ChatGPT being possible,” Nidhi Chappell, Microsoft general manager of Azure AI infrastructure, said in a Bloomberg article, adding, “That’s one model that came out of it. There’s going to be many, many others.”
In addition to keeping ChatGPT up and running, Microsoft is using that architecture to run its own Bing AI search bot, launched last month, as well as making the network available to other customers. Microsoft EVP Scott Guthrie, who oversees cloud and AI, told Bloomberg the infrastructure cost for the ChatGPT launch alone was “probably larger” than several hundred million dollars.
In 2019, Microsoft invested its first $1 billion in OpenAI, adding another $10 billion this year. The journey that began with that initial investment resulted in what a Microsoft blog post describes as an Azure AI supercomputing partnership, explaining “how Microsoft’s bet on Azure unlocked an AI revolution.”
“There was definitely a strong push to get bigger models trained for a longer period of time, which means not only do you need to have the biggest infrastructure, you have to be able to run it reliably for a long period of time,” Chappell says in the post.
Microsoft’s Azure roadmap for the supercomputing infrastructure, services, and expertise that can scale to the size and complexity of the latest AI models is outlined in a separate blog post detailing Azure’s GPU-accelerated virtual machines.
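Those GPU-accelerated offerings are exposed as ordinary VM sizes (the Nvidia-backed ND- and NC-series). As a hedged sketch, assuming the `azure-identity` and `azure-mgmt-compute` Python packages and a valid subscription ID in the environment, with the region a placeholder, one could enumerate the GPU sizes available in a region like this:

```python
# List GPU-accelerated (ND/NC-series) VM sizes available in an Azure region.
# Sketch only: assumes azure-identity and azure-mgmt-compute are installed
# and that AZURE_SUBSCRIPTION_ID names a real subscription.
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# "eastus" is a placeholder region; pick one that offers GPU capacity.
for size in client.virtual_machine_sizes.list(location="eastus"):
    # ND- and NC-series size names denote Nvidia GPU-backed virtual machines.
    if size.name.startswith(("Standard_ND", "Standard_NC")):
        print(size.name, size.number_of_cores, "cores,",
              size.memory_in_mb, "MB RAM")
```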
“We didn’t build them a custom thing — it started off as a custom thing, but we always built it in a way to generalize it so that anyone that wants to train a large language model can leverage the same improvements,” Guthrie told Bloomberg. “That’s really helped us become a better cloud for AI broadly.”