Demystifying ML Deployment with Sushant from RingCentral
This week on Towards Scaling Inference, we chat with Sushant Hiray, Director of Machine Learning at RingCentral.
Sushant is an expert AI researcher and data science leader who is focused on building next-generation conversational AI platforms at RingCentral.
In this episode, Sushant discusses his experience in building ML infrastructure from the ground up, techniques for optimizing models, the role of networking and memory sharing for GPUs in Kubernetes, and best practices for fellow ML practitioners.
For the summary, let’s dive right in:
Q: What business problems are you solving through AI?
Our machine learning team focuses on improving meeting experiences for our customers. We leverage AI to help our customers extract as much information as possible from meetings, without feeling fatigued. As a team, we work on an entire suite of products that incorporate AI into different areas, such as live transcription, translation, and meeting summarization, throughout the entire stack.
Q: How do you currently deploy your models in production? Do you work with managed services, or have you built your own ML infrastructure?
Our infrastructure is entirely in-house: we maintain one stack for training large-scale models and a separate inference pipeline built on top of Kubernetes.
Q: How has your experience been deploying your production infrastructure in Kubernetes? What challenges have you faced regarding resource utilization?
Building and deploying the production infrastructure in Kubernetes from the ground up has been an interesting journey. Over time, we have made many optimizations to how we push things into production, including model versioning and effective model monitoring.
However, there are still some general challenges that we constantly deal with, such as scaling workloads based on traffic and effectively utilizing GPUs. Internally, we have conducted benchmarks to determine whether to leverage GPUs or CPUs when deploying models. Additionally, because we work with a wide variety of data, model monitoring presents an interesting challenge.
Q: How do you approach optimization before deploying a model? What are your suggestions?
Nowadays, many optimization techniques have become standard practice. For example, default quantization is one of the things we started off with very early. However, there are many trade-offs depending on whether you want to deploy on CPU or GPU. We have empowered our data scientists to understand these trade-offs and make decisions since they build the model and have intimate knowledge of how it works. Our platform team supports them with automated optimizations to enhance performance and reduce the size of the model. That's the first part.
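The quantization idea Sushant mentions can be sketched in miniature. The snippet below is an illustrative toy, not RingCentral's stack: it shows symmetric post-training int8 quantization of a weight vector, which trades a bounded amount of precision for a 4x smaller representation (all function names here are hypothetical).

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: map floats to int8 codes
    using a single scale derived from the largest absolute weight."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

weights = [0.82, -1.27, 0.003, 0.5]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Quantization error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Real toolchains (e.g. framework-level dynamic or static quantization) add calibration, per-channel scales, and int8 kernels, but the accuracy-versus-size trade-off is the same one shown here.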
💡 The second part is deploying the model. We have different levels of workloads, including streaming workloads like live transcription or live translation, and asynchronous processing. These workloads require entirely different optimization techniques. For offline workloads, longer latencies might be acceptable, but batch processing must be effective. For streaming workloads, latency must be as low as possible, but you still want some level of batching. You don't want to run one request on the entire GPU. Effective batching is crucial for live transcription.
We have been working on generative AI for a while now. For example, if we receive 1,000 simultaneous requests, our goal is to batch them effectively and build a summarization framework on top of that. For us, model deployment and optimization are ongoing efforts, and we continuously evolve and improve our metrics.
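The micro-batching trade-off described above (keep streaming latency bounded, but still feed the GPU batches rather than single requests) can be sketched as a collector that flushes when the batch fills or a deadline passes. This is an illustrative sketch, not RingCentral's implementation; the names are invented.

```python
import time
from queue import Queue, Empty

def collect_batch(q, max_batch=8, max_wait_s=0.02):
    """Gather up to max_batch requests, waiting at most max_wait_s,
    so streaming latency stays bounded while the GPU still sees batches."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline hit: ship a partial batch rather than wait
        try:
            batch.append(q.get(timeout=remaining))
        except Empty:
            break  # queue drained before the deadline
    return batch

q = Queue()
for i in range(20):
    q.put(f"req-{i}")

first = collect_batch(q)  # queue is full, so this returns a full batch
```

Tuning `max_batch` and `max_wait_s` is exactly the latency-versus-throughput knob: offline summarization jobs can tolerate a large batch and a long wait, while live transcription needs a small deadline.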
Q: Does networking play a critical part in managing Kubernetes? What have you learned? How does it affect latency?
In my experience, networking does play a crucial role in managing Kubernetes. However, with a great MLOps team that has experience in DevOps and Kubernetes, we have been able to optimize our system to enhance performance and reduce latencies at every step of the pipeline.
💡 For example, with async workloads, there is often a lot of queuing time as requests are waiting for pods to become available for processing. We have spent a lot of time optimizing auto-scaling to maintain SLAs, which vary depending on the workflow. By pushing model monitoring metrics to Grafana, we have been able to automate many of the processes that used to be done manually.
As a result, we have been able to reduce response times for our AI services and maintain an excellent product experience for our customers.
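The queue-time-driven scaling described above can be reduced to a simple proportional rule, in the spirit of the Kubernetes Horizontal Pod Autoscaler's ratio formula: scale replicas by how far the observed queue wait is from the SLA target. A toy version, with invented names and thresholds:

```python
import math

def desired_replicas(current_replicas, queue_wait_s, target_wait_s,
                     min_replicas=1, max_replicas=50):
    """HPA-style proportional rule: scale replica count by the ratio of
    observed queue wait to the SLA target, clamped to the pool bounds."""
    if queue_wait_s <= 0:
        return min_replicas  # nothing queued: shrink to the floor
    raw = math.ceil(current_replicas * queue_wait_s / target_wait_s)
    return max(min_replicas, min(max_replicas, raw))

# 4 pods, requests waiting 1.5 s against a 0.5 s SLA -> scale to 12
desired_replicas(4, queue_wait_s=1.5, target_wait_s=0.5)
```

In practice the wait-time signal would come from the same monitoring metrics pushed to Grafana, and the bounds keep a traffic spike from spinning up an unbounded number of GPU pods.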
Q: What has been your experience of running multiple models on the same GPU on Kubernetes?
Running multiple models on the same GPU on Kubernetes has been a constant struggle for us. We have customers across the globe and we have to support different languages on our platform. However, the amount of traffic we get for English workloads is almost 10 times more than for other workloads. Traffic for other languages, such as Spanish, German, French, and Italian, is relatively small, and we want to avoid wasting GPU cycles.
💡 To address this issue, we split the workload into 100 parts and run them across multiple GPUs, which lets us serve multiple types of models effectively. However, this introduces another challenge: auto-scaling becomes very tricky. We don't want to waste infrastructure cycles, because bringing a GPU up or down takes a lot of time.
To solve this, we warm up some of these GPUs ahead of demand, while ensuring a fair trade-off between latency and cost. We work closely with Nvidia and are one of their biggest partners in this area. We've optimized a lot of this stack with them, but we still have a long way to go to reach our goals.
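The warm-up idea can be sketched as a small pool of pre-initialized workers: acquiring a warm worker is instant, while a cold start pays the full model-load cost. This is a deliberately simplified illustration (the refill happens inline here, where a real system would do it asynchronously), and every name is hypothetical:

```python
from collections import deque

class WarmPool:
    """Keep a few workers pre-initialized so that low-traffic models
    (e.g. a Spanish or Italian transcription model) avoid a cold start."""

    def __init__(self, warm_size, start_worker):
        self.start_worker = start_worker  # expensive: loads a model onto a GPU
        self.warm = deque(start_worker() for _ in range(warm_size))

    def acquire(self):
        if self.warm:
            worker = self.warm.popleft()           # instant: already warmed up
            self.warm.append(self.start_worker())  # refill (inline for the sketch)
            return worker, "warm"
        return self.start_worker(), "cold"         # pool empty: pay the cold start

worker_ids = iter(range(100))
pool = WarmPool(warm_size=2, start_worker=lambda: next(worker_ids))
worker, path = pool.acquire()
```

The `warm_size` parameter is where the latency-versus-cost trade-off Sushant mentions lives: a bigger pool means fewer cold starts but more idle GPUs.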
Q: What are the implications of infrastructure costs on your business outcomes? Do they limit your ability to experiment with more models and growth?
We have divided our infrastructure cost between training and inference. The inference pipeline is tied to the product, so as long as the unit economics make sense, we can scale up to a certain point. However, we still try to balance usage without hampering model training productivity. We use Nvidia DGX clusters and have a big setup for training these models.
As part of my role, I ensure that we have enough provisioned capacity for our needs.
Q: Can you discuss any best practices or tips for other ML engineer practitioners looking to deploy their model in production, based on your experience?
It depends on the stage of maturity of the AI company. If you are an ML engineer at an early-stage startup, your primary goal should be to serve traffic: get your model-serving pipeline up using a managed service instead of building in-house on Kubernetes. However, as your team matures, you should spend more time on monitoring rather than just deployment, because GPUs can get stuck at times.
For ML monitoring, start with something rudimentary, like monitoring CPU, memory, requests, and latency. As you progress, bring in more advanced techniques like data drift detection and model monitoring. But understanding how to interpret each of these graphs is necessary for actionable insights. If you observe drift in your data or prediction distribution, retrain your model or make changes to the pipeline. Take gradual steps and start with the bare minimum. Over time, add to the stack to create a more robust pipeline as your team grows.
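One common concrete starting point for the drift detection mentioned above is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against production. A minimal self-contained sketch (bin frequencies here are made up for illustration):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift worth investigating or retraining for."""
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        score += (a - e) * math.log(a / e)
    return score

baseline = [0.25, 0.25, 0.25, 0.25]  # bin frequencies at training time
today    = [0.10, 0.20, 0.30, 0.40]  # bin frequencies in production
drift = psi(baseline, today)         # falls in the moderate-drift band
```

Wiring a number like this into the same dashboards used for CPU, memory, and latency is one way to make "interpret the graphs, then retrain" an actionable loop rather than guesswork.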
Thanks for reading. 😄