Will AWS’ next-gen Trainium2 chips accelerate AI development and put Amazon ahead in the chips race?

AWS appears to have upped the stakes in the AI infrastructure market this week with the rollout of its next-gen Trainium2 chips. What’s more, the company is already teasing its Trainium3 chip, expected in late 2025.

Trainium2 powers AWS’ most powerful instances for generative AI, and the tech giant says it continues to see significant generation-over-generation gains from chip to chip. The ongoing development signals AWS’ intent both to reduce its reliance on outside infrastructure and to establish itself in the AI chip game alongside Google, Microsoft, Nvidia, and others.

The company has also deepened its partnership with OpenAI rival Anthropic, which will use AWS’ platform to train and deploy its Claude models, demonstrating that AWS infrastructure can support intense workloads from one of today’s leading AI builders.

But analysts say it’s still too early to tell whether this will move AWS ahead of the pack, particularly as the company is a relative newcomer to the AI chip game and Nvidia continues to hold roughly 80% of the AI chip market.

“AI is different from the traditional industry, as verticalization is radically changing solutions and power dynamics,” Paul Baier, CEO and cofounder of GAI Insights, told Network World. “In the short term, AI developers should monitor and realize that in this AI hype moment, there is tremendous pressure on hyperscalers to pre-announce products.”

Enhanced performance vs. current instances

Large language models (LLMs) offer great promise, but as they grow in size and in the amount of data they consume, their memory, compute, and bandwidth requirements climb steeply, and costs climb with them.
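To put rough numbers on that scaling: a common rule of thumb for mixed-precision training with an Adam-style optimizer is about 16 bytes of accelerator memory per parameter (2 for bf16 weights, 2 for gradients, 12 for fp32 optimizer state). The sketch below applies that rule to a 100-billion-parameter model; the 16-byte figure is a back-of-the-envelope assumption, not an AWS number.

```python
# Back-of-the-envelope training-memory estimate. Assumption: mixed-precision
# training with an Adam-style optimizer, ~16 bytes of memory per parameter.
PARAMS = 100e9  # 100B-parameter model, the class of LLM AWS cites for Trn2

bytes_per_param = 2 + 2 + 12   # bf16 weights + bf16 grads + fp32 optimizer state
total_bytes = PARAMS * bytes_per_param

print(f"Weights alone (bf16): {PARAMS * 2 / 1e12:.1f} TB")
print(f"Training state total: {total_bytes / 1e12:.1f} TB")
# -> roughly 0.2 TB for the weights and ~1.6 TB of training state, before
#    activations and KV caches, hence the need for many-chip instances.
```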

AWS aims to meet these escalating demands with Trn2 instances, which connect 16 Trainium2 chips to deliver 20.8 peak petaflops of compute. According to AWS, this makes the platform well suited to training and deploying LLMs with 100 billion-plus parameters, at 30% to 40% better price/performance than the current generation of GPU-based instances.
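For developers who want to try the hardware, Trn2 capacity is exposed through the standard EC2 APIs. Below is a minimal boto3 sketch; the trn2.48xlarge instance type matches AWS’ 16-chip announcement, but the AMI ID and region are placeholders to swap for real values in your account.

```python
# Minimal sketch: launching a Trn2 instance through the standard EC2 API.
# Assumptions: trn2.48xlarge is the 16-chip instance type AWS announced;
# the AMI ID and region below are placeholders for illustration only.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: use a Neuron-enabled AMI
    InstanceType="trn2.48xlarge",      # 16 Trainium2 chips per instance
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```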

“That is performance that you cannot get anywhere else,” AWS CEO Matt Garman said onstage at this week’s AWS re:Invent conference.

In addition, Amazon’s Trn2 UltraServers are a new Amazon EC2 offering that links 64 Trainium2 chips with AWS’ NeuronLink interconnect. A single such “ultranode” delivers 83.2 peak petaflops, quadrupling the compute, memory, and networking of a single Trn2 instance, Garman said. “This has a massive impact on latency,” he noted.

AWS aims to push these capabilities further with Trainium3, expected later in 2025. The company says it will provide twice the compute and 40% better efficiency than Trainium2, and that Trainium3-powered UltraServers are expected to be four times as performant as Trn2 UltraServers.

Garman asserted: “It will have more instances, more capabilities, more compute than any other cloud.”

For developers, Trainium2 provides more capability through tighter integration of AI chips with software, Baier pointed out, but it also deepens vendor lock-in, and thus raises longer-term prices. That makes genuinely architecting for “switchability” an important design consideration: the ability to move foundation models across AI chips and adjust processing configurations to support different types of AI workloads. Depending on need, a system can then switch between tasks and targets, ultimately helping with development and scaling, and cutting cost.
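One way to preserve that switchability in code is to isolate device selection behind a single helper, so the same PyTorch model can target Trainium (via the XLA backend that the Neuron SDK’s PyTorch support builds on), Nvidia GPUs, or CPU. A minimal sketch, assuming the standard torch and torch_xla packages:

```python
# Minimal sketch of "switchability": one device-selection helper so a model
# can move between Trainium (XLA), CUDA, and CPU without code changes.
# Assumes standard torch, with torch_xla present on Neuron hosts.
import torch

def pick_device() -> torch.device:
    try:
        import torch_xla.core.xla_model as xm  # available on XLA/Neuron hosts
        return xm.xla_device()                 # Trainium appears as an XLA device
    except ImportError:
        pass
    if torch.cuda.is_available():
        return torch.device("cuda")            # Nvidia GPU path
    return torch.device("cpu")                 # portable fallback

device = pick_device()
model = torch.nn.Linear(512, 512).to(device)   # toy stand-in for a real model
x = torch.randn(8, 512, device=device)
print(model(x).shape, "on", device)
```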

Further, chip and software verticalization have the potential to bring down per-token compute costs, “but time will tell, as there are many factors influencing that overall strategy,” said Baier.

Strong use case with Anthropic

AWS continues to emphasize its expanded partnership with Anthropic. Notably, the companies are jointly developing Project Rainier, an EC2 UltraCluster of Trn2 UltraServers featuring hundreds of thousands of Trainium2 chips. Anthropic is also working with AWS’ Annapurna Labs to write low-level kernels that interact directly with Trainium silicon, and is helping improve computational efficiency.

The tight partnership with the AI startup is a big win for AWS, illustrating that its infrastructure can handle AI workloads intense enough to rival Nvidia, Google Cloud, and Microsoft Azure.

“It is showcasing that they have expertise, they have a service that can effectively serve a popular and very powerful model,” Alvin Nguyen, Forrester senior analyst, told Network World. From a developer perspective, “you should be able to port over your models and see the same benefits Anthropic is seeing.”

Further, AWS’ continued chip gains signal the company’s intent to remain self-reliant. “It shows that they’re committed to developing, and not being dependent on others,” said Nguyen.

But will all companies need that much power?

However, Nguyen pointed out that, while hyperscalers have the resources to build giant AI factories on infrastructure like Trn2, many enterprises will be looking to build smaller, more specialized models that don’t need nearly that much compute. These models are also more specific to their organizations and help keep their proprietary data more secure.

“They need it to be smaller, they need it to be scalable,” he said. Where Trainium2 will benefit smaller players is in how they apply general knowledge, for example by tying their smaller models into larger ones to enhance their capabilities through techniques such as retrieval-augmented generation (RAG).
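The RAG pattern Nguyen alludes to is straightforward in outline: retrieve the documents most relevant to a query, then hand them to the model as context. The sketch below shows only the shape of the pipeline; the embed() and generate() functions are toy placeholders, not any particular vendor’s API.

```python
# Skeleton of retrieval-augmented generation (RAG). embed() and generate()
# are toy placeholders; in practice they would call an embedding model
# and an LLM respectively.
import math

def embed(text: str) -> list[float]:
    # Placeholder embedding: normalized character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def generate(prompt: str) -> str:
    # Placeholder LLM call: echoes the prompt (illustration only).
    return f"[model answer grounded in]\n{prompt}"

docs = ["Trn2 instances use 16 Trainium2 chips.",
        "UltraServers link 64 chips with NeuronLink."]
index = [(doc, embed(doc)) for doc in docs]

def rag(query: str, k: int = 1) -> str:
    q = embed(query)
    # Rank documents by cosine similarity to the query, keep the top k.
    ranked = sorted(index, key=lambda de: -sum(a * b for a, b in zip(q, de[1])))
    context = "\n".join(doc for doc, _ in ranked[:k])
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

print(rag("How many chips does a Trn2 instance use?"))
```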

“There are going to be smaller needs that Trainium will be excellent for as well,” said Nguyen.

While Trainium2, and eventually Trainium3, will be extremely useful for specialized startups such as Anthropic, there are “very few organizations that can afford that level of infrastructure and to deploy it widely,” he said.

Source: Network World