
Terrill Dicki
Jan 24, 2025 14:36
Discover NVIDIA’s approach to horizontally autoscaling NIM microservices on Kubernetes, using custom metrics for efficient resource management.
NVIDIA has published a detailed approach to horizontally autoscaling its NIM microservices on Kubernetes, as described by Juana Nakfour on the NVIDIA Developer Blog. The technique uses the Kubernetes Horizontal Pod Autoscaler (HPA) to adjust resources dynamically based on custom metrics, improving compute and memory efficiency.
Understanding NVIDIA NIM Microservices
NVIDIA NIM microservices are model inference containers that can be deployed on Kubernetes, making them central to serving large machine learning models. Autoscaling them effectively requires a clear understanding of their compute and memory profiles in a production setting.
Configuring Autoscaling
The process begins with setting up a Kubernetes cluster that includes key components: the Kubernetes Metrics Server, Prometheus, the Prometheus Adapter, and Grafana. These tools gather and visualize the metrics the HPA service relies on.
The Kubernetes Metrics Server collects resource metrics from kubelets and exposes them through the Kubernetes API server. Prometheus and Grafana are used to scrape metrics from pods and build dashboards, while the Prometheus Adapter lets the HPA base its scaling decisions on custom metrics.
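To illustrate how the Prometheus Adapter can expose a pod-level series such as gpu_cache_usage_perc to the HPA, here is a minimal sketch of an adapter rule (for example, in the adapter Helm chart's rules.custom values). The label names and averaging query are assumptions for illustration, not the exact configuration from NVIDIA's guide.

```yaml
# Hypothetical prometheus-adapter rule exposing gpu_cache_usage_perc as a
# per-pod custom metric the HPA can query. Label names and the averaging
# query are assumptions.
rules:
  custom:
    - seriesQuery: 'gpu_cache_usage_perc{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "gpu_cache_usage_perc"
        as: "gpu_cache_usage_perc"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```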
Deploying NIM Microservices
NVIDIA provides a detailed guide for deploying NIM microservices, using the NIM for LLMs model as its example. The process involves setting up the required infrastructure and making sure the NIM for LLMs microservice is ready to scale on GPU cache usage metrics.
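NVIDIA's guide deploys the microservice with its Helm chart; purely as an illustration of the underlying Kubernetes objects, a minimal Deployment requesting a single GPU might look like the sketch below. The image tag, pull secret, and port are placeholder assumptions, not values from the guide.

```yaml
# Illustrative only: a minimal Deployment for a NIM for LLMs container with one GPU.
# Image, pull secret, and port are placeholders; NVIDIA's guide uses its Helm chart.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-llm
  template:
    metadata:
      labels:
        app: nim-llm
    spec:
      imagePullSecrets:
        - name: ngc-registry-secret      # assumed secret for the NGC registry
      containers:
        - name: nim-llm
          image: nvcr.io/nim/your-llm-image:latest   # placeholder image
          ports:
            - containerPort: 8000        # assumed inference API port
          resources:
            limits:
              nvidia.com/gpu: 1          # one GPU per replica
```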
Grafana dashboards visualize these custom metrics, helping operators monitor and adjust resource allocation as traffic and workload demands change. The deployment walkthrough also generates traffic with tools such as genai-perf, which helps evaluate how different concurrency levels affect resource consumption.
Implementing Horizontal Pod Autoscaling
To implement autoscaling, NVIDIA demonstrates creating an HPA resource driven by the gpu_cache_usage_perc metric. In load tests at various concurrency levels, the HPA automatically adjusts the number of pods to maintain performance, showing how effectively it handles fluctuating workloads.
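A minimal sketch of such an HPA, assuming the Prometheus Adapter exposes gpu_cache_usage_perc as a per-pod metric, might look like this; the deployment name, replica bounds, and target value are illustrative assumptions rather than the values used in NVIDIA's walkthrough.

```yaml
# Sketch of an HPA driven by the gpu_cache_usage_perc custom metric.
# Deployment name, replica bounds, and target value are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-llm              # assumed name of the NIM for LLMs deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "500m"  # assumed target: scale out above ~50% average cache usage
```

After applying the manifest with kubectl apply, rerunning the load test at increasing concurrency should show the replica count rising and falling with the metric.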
Future Directions
NVIDIA's methodology opens opportunities for further exploration, such as scaling on multiple metrics like request latency or GPU compute utilization. In addition, using Prometheus Query Language (PromQL) to derive new metrics could extend the autoscaling capabilities.
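As one hedged example of deriving a new scaling signal with PromQL, a Prometheus recording rule could compute average request latency from histogram-style counters; the metric names below are assumptions rather than series published by the NIM service, and the resulting series would still need to be exposed through the Prometheus Adapter before an HPA could use it.

```yaml
# Hypothetical recording rule deriving average request latency (seconds) over
# 5 minutes from assumed _sum/_count counters exposed by the inference service.
groups:
  - name: nim-derived-metrics
    rules:
      - record: nim:request_latency_seconds:avg5m
        expr: |
          rate(request_latency_seconds_sum[5m])
          /
          rate(request_latency_seconds_count[5m])
```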
For a more in-depth walkthrough, see the NVIDIA Developer Blog.
Image source: Shutterstock