Amazon SageMaker introduces a new generative AI inference optimization capability

Today, Amazon SageMaker announced general availability of a new inference capability that delivers up to ~2x higher throughput while reducing costs by up to ~50% for generative AI models such as Llama 3, Mistral, and Mixtral. For example, with a Llama 3 70B model, you can achieve up to ~2400 tokens/sec on an ml.p5.48xlarge instance, versus ~1200 tokens/sec previously without any optimization.

With this new capability, customers can choose from a menu of the latest model optimization techniques, such as speculative decoding, quantization, and compilation, and apply them to their generative AI models. SageMaker does the heavy lifting of provisioning the required hardware to run the optimization recipe, along with the deep learning frameworks and libraries. Customers get out-of-the-box support for a speculative decoding solution from SageMaker that has been tested for performance at scale on various popular open source models, or they can bring their own speculative decoding solution. For quantization, SageMaker ensures compatibility and support for precision types across different model architectures. For compilation, SageMaker's runtime infrastructure ensures efficient loading and caching of optimized models to reduce auto-scaling time.
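As a rough illustration of how such an optimization recipe might be expressed, the sketch below assembles a request payload for an optimization job applying a quantization recipe. This is a minimal sketch, not the definitive API shape: the job name, S3 URIs, role ARN, and the `OPTION_QUANTIZE` environment key are placeholders/assumptions, and the payload would be submitted via the Boto3 SageMaker client.

```python
# Hypothetical sketch of an optimization-job request payload. All resource
# names (job name, S3 URIs, IAM role ARN) are placeholders, and the exact
# request shape is an assumption based on the capability described above.

def build_optimization_request(job_name, model_s3_uri, output_s3_uri, role_arn):
    """Assemble a quantization recipe request for a SageMaker optimization job."""
    return {
        "OptimizationJobName": job_name,
        "RoleArn": role_arn,
        # Where the unoptimized model artifacts live
        "ModelSource": {"S3": {"S3Uri": model_s3_uri}},
        # Instance type the optimized model will be deployed on
        "DeploymentInstanceType": "ml.p5.48xlarge",
        # The recipe: here, a quantization config; compilation or
        # speculative-decoding recipes would slot in similarly
        "OptimizationConfigs": [
            {
                "ModelQuantizationConfig": {
                    "OverrideEnvironment": {"OPTION_QUANTIZE": "awq"}
                }
            }
        ],
        "OutputConfig": {"S3OutputLocation": output_s3_uri},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

request = build_optimization_request(
    "llama3-70b-awq",
    "s3://my-bucket/llama3-70b/",
    "s3://my-bucket/optimized/",
    "arn:aws:iam::123456789012:role/SageMakerRole",
)
# The request would then be submitted with something like:
#   boto3.client("sagemaker").create_optimization_job(**request)
```

Keeping the payload construction separate from the API call makes the recipe easy to review and version-control before any job is launched.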

Customers can use this new capability through the AWS SDK for Python (Boto3), the SageMaker Python SDK, or the AWS Command Line Interface (AWS CLI). This capability is now generally available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), and South America (São Paulo) Regions.

Source: Amazon AWS