Google Cloud has updated its managed compute service Cloud Run with a new feature that allows enterprises to run real-time AI inference applications serving large language models (LLMs) on Nvidia L4 GPUs.
The feature is significant for developers because Nvidia GPU support extends Cloud Run's capabilities, accelerating the compute required for inference and helping to cut expenditure.
Cloud Run, which was first previewed in April 2019, allows enterprises to run stateless containers that are invocable via HTTP requests.
The managed or serverless compute service is also available on Google Kubernetes Engine (GKE), allowing developers to run containerized HTTP workloads on a managed Kubernetes cluster.
Arguably, the service has been popular among developers because it lets them run computations or workloads on demand, in contrast to a typical cloud instance that is provisioned for a set period and billed whether or not it is being used.
However, growing demand to run AI workloads, and to do so via a serverless compute service, pushed Google to add GPU support to Cloud Run.
The combination of GPU support and the service's serverless nature, according to experts, should benefit enterprises running AI workloads: with Cloud Run they neither have to buy and host compute hardware on-premises nor spend comparatively more by keeping a typical cloud instance running.
“When your app is not in use, the service automatically scales down to zero so that you are not charged for it,” Google wrote in a blog post.
The company claims that the new feature opens up new use cases for developers, including performing real-time inference with lightweight open models such as Google's Gemma (2B/7B) or Meta's Llama 3 (8B) to build custom chatbots or perform on-the-fly document summarization, while scaling to handle spiky user traffic.
Another use case is serving custom fine-tuned generative AI models, such as image generation tailored to a company's brand, and scaling down to optimize costs when nobody is using them.
Additionally, Google said that the service can be used to speed up compute-intensive Cloud Run services, such as on-demand image recognition, video transcoding and streaming, and 3D rendering.
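For a sense of what the chatbot and summarization use case looks like in practice, here is a minimal client-side sketch, assuming a Cloud Run service that fronts an Ollama server hosting Gemma 2B; the service URL and model tag are illustrative placeholders, not values from Google's announcement.

```python
# Minimal sketch: querying a Gemma model served by Ollama on a GPU-backed
# Cloud Run service. SERVICE_URL is a hypothetical placeholder; substitute
# the URL Cloud Run assigns to your deployment.
import requests

SERVICE_URL = "https://my-ollama-service-xyz-uc.a.run.app"  # hypothetical
MODEL = "gemma:2b"  # lightweight open model tag as published in the Ollama library

def generate(prompt: str) -> str:
    """Send a single non-streaming generation request to the Ollama REST API."""
    resp = requests.post(
        f"{SERVICE_URL}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    # Example: on-the-fly summarization of a user-supplied document snippet.
    print(generate("Summarize this support ticket in one sentence: ..."))
```

Because the service scales to zero between requests, the first call after an idle period pays the cold-start penalty discussed below, while subsequent calls hit an already warm instance.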
But are there caveats?
To begin with, enterprises may worry about cold start, a common phenomenon with serverless services. Cold start refers to the time a service needs to load before it can actively serve requests.
This matters to enterprises because it directly affects latency, for example, the time an LLM takes to reply to a user query through an enterprise application.
However, Google seems to have it covered.
“Cloud Run instances with an attached L4 GPU with driver pre-installed starts in approximately 5 seconds, at which point the processes running in your container can start to use the GPU. Then, you’ll need another few seconds for the framework and model to load and initialize,” the company explained in the blog post.
Further, to give enterprises more confidence to try out Cloud Run's new feature, the company has published cold start times for several lightweight models.
Cold start times for the Gemma 2b, Gemma2 9b, Llama2 7b/13b, and Llama3.1 8b models with the Ollama framework range from 11 to 35 seconds, the company wrote, adding that the figures measure the time to start an instance from zero, load the model onto the GPU, and have the LLM return its first word. Other supported frameworks for the service include vLLM and PyTorch, and Cloud Run can also be deployed via Nvidia NIM.
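As a rough illustration of how such a figure could be checked from the client side, the sketch below times the gap between sending a request and receiving the first streamed token from an Ollama endpoint; the service URL is a hypothetical placeholder, and this measures the end-to-end delay a caller sees rather than Google's internal instrumentation.

```python
# Sketch: measure client-observed time to first token from an Ollama server
# running on Cloud Run. SERVICE_URL is a hypothetical placeholder.
import json
import time
import requests

SERVICE_URL = "https://my-ollama-service-xyz-uc.a.run.app"  # hypothetical

def time_to_first_token(model: str, prompt: str) -> float:
    """Return seconds from request submission until the first generated token arrives."""
    start = time.monotonic()
    with requests.post(
        f"{SERVICE_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():  # Ollama streams newline-delimited JSON chunks
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("response"):  # first non-empty token
                return time.monotonic() - start
    raise RuntimeError("stream ended without producing a token")

if __name__ == "__main__":
    print(f"Time to first token: {time_to_first_token('gemma:2b', 'Hello'):.1f}s")
```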
Source: Network World