Mastering ML Deployment with Parth from Primer.ai
This week on Towards Scaling Inference, we chat with Parth Shah, Director of Engineering at Primer.ai
Parth is a seasoned technology leader with extensive experience in MLOps, distributed systems, and cloud technologies. At Primer, he leads a diverse team that delivers foundational capabilities for data orchestration, storage, and ML workflows. Previously, he worked at Bloomberg.
In this episode, he shares how Primer built its ML infrastructure optimized for stream processing, its experiments with techniques like model quantization, its use of ONNX runtimes, and much more.
For the summary, let’s dive right in:
Q: What kind of business problems are you solving through AI?
At Primer, we operate in the NLP space and build ML models that help understand, analyze, and summarize information from unstructured documents. This involves a range of NLP problems: structured data extraction, such as pulling out entities, numbers, and dates, as well as topical and semantic analysis, such as sentiment analysis and document overviews.
Overall, our ML applications span the breadth of NLP problems. We use this technology to comprehend unstructured text information, analyze and structure it, and then compress and summarize the information for our users. This allows them to focus on decision-making rather than reading and understanding textual information.
Q: How do you currently deploy these models in production? Do you work with managed services like Hugging Face or AWS, have you built your own ML infrastructure stack on Kubernetes, or do you take another approach? And what are the pros and cons of the process?
Currently, we have our own solution for ML infrastructure. We started early in this space, around 2019, when the tooling was still primitive.
Although there were solutions available for deploying models, both open-source and commercial, it was important for us to have the ability to deploy these solutions on our customer-hosted side. This was a significant factor when we were considering different managed services.
Q: When deploying models on these services, do you deploy them in their native format or convert to ONNX/TensorRT? And what has been your experience?
This involves simple techniques like quantizing models and using just-in-time compilation. Recently, we have been exploring standard formats like ONNX to extract better performance. By using an open format, we can run the same model on different runtimes.
💡 The latency of a model is a crucial factor to consider. As model sizes increase, we have experimented with techniques such as quantization. When deploying these models on hardware we may not fully control, ONNX runtimes come into play, and in these cases quantization can significantly affect latency. While ONNX Runtime is known for low latency, we must also consider the hardware on which we will be running the models.
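To make the quantization idea concrete, here is a minimal sketch of symmetric INT8 weight quantization in plain Python. This is illustrative only, not Primer's approach; in practice you would use a toolkit such as `onnxruntime.quantization` or PyTorch's dynamic quantization rather than rolling your own.

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized representation."""
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Error per weight is at most half a quantization step (scale / 2).
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

The trade-off Parth mentions shows up here directly: storing INT8 instead of float32 cuts memory roughly 4x, at the cost of a bounded rounding error per weight.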
Our focus at Primer is stream processing. When building our ML infrastructure, we made sure to incorporate stream processing into our solution. One challenge we faced was figuring out how to handle the pre- and post-processing steps, especially in NLP. These steps can be quite involved, requiring the document to be broken down, segmented, and the results aggregated later on.
There is a lot of business logic involved, so we had to figure out how to manage that computation while also optimizing graph execution. These were some of the operational considerations in creating developer workflows for data scientists to leverage the optimization platform and handle pre- and post-processing. We evaluated Triton, but since we already had a working solution, switching posed a challenge. While latency is certainly a concern, we're also trying to balance operational costs from a developer's perspective and keep it simple.
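As a rough illustration of the pre- and post-processing steps described above, here is a hedged sketch: a long document is segmented into chunks that fit a model's input limit, each chunk is run through a stand-in model, and per-chunk results are aggregated. The chunking rule, the `run_model` stub, and the deduplication step are assumptions for illustration, not Primer's actual pipeline.

```python
from typing import Callable

def segment(text: str, max_chars: int = 200) -> list[str]:
    """Pre-processing: break a document into model-sized chunks at sentence boundaries."""
    chunks, current = [], ""
    for sentence in text.split(". "):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + ". "
    if current.strip():
        chunks.append(current.strip())
    return chunks

def run_pipeline(text: str, run_model: Callable[[str], list[str]]) -> list[str]:
    """Segment, run inference per chunk, then aggregate (deduplicate) the results."""
    entities = []
    for chunk in segment(text):
        for ent in run_model(chunk):
            if ent not in entities:  # post-processing: merge per-chunk results
                entities.append(ent)
    return entities
```

In a real system the aggregation step is where much of the business logic lives, going well beyond simple deduplication.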
Q: Could you share your experience with quantization techniques?
💡 When we started, the tooling we wanted to use was not yet mature in terms of feature capabilities, so we began with something simple on our side. This approach has worked out well for us, particularly on the stream-processing side, and our solution is now stable and reliable. It was also important to be able to interoperate between customer-hosted deployments and our SaaS solution, which influenced some of the choices we made.
We have done a significant amount of work on the infrastructure API. It is very simple and integrates into data scientists' GitHub workflows, enabling change management. Deployment is easy and always on, and built-in monitoring tools make it easy to observe models once they are deployed.
Q: What was the scenario where you needed streaming, and what were the challenges you faced when building the system for the first time?
For us, data processing is done in a streaming fashion. We work with open-source data, such as mainstream media content and social media content, which is inherently streaming. That's why we needed something that could provide a more streaming interface to running models instead of just a REST API. When a request comes in, it gets queued up, and then the inference runs on a GPU, with the results pushed to a queue. Our API works in different flavors, but stream processing was critical for us. We optimized the system around it and made sure there are abstractions from a user perspective, so they don't need to know how the model is running.
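The request flow outlined above — requests queued up, inference run, results pushed to an output queue — can be sketched with Python's standard library. The single worker thread and the stand-in model here are illustrative assumptions; Primer's actual system runs models on GPUs behind its own abstractions.

```python
import queue
import threading

def inference_worker(requests, results, model):
    """Consume queued requests, run (stub) inference, push outputs downstream."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: stop the worker
            break
        results.put(model(item))

requests_q, results_q = queue.Queue(), queue.Queue()
model = lambda text: text.upper()  # stand-in for GPU inference
worker = threading.Thread(target=inference_worker, args=(requests_q, results_q, model))
worker.start()

for doc in ["breaking news", "market update"]:
    requests_q.put(doc)
requests_q.put(None)  # signal shutdown
worker.join()
outputs = [results_q.get() for _ in range(2)]
```

The user-facing abstraction is the two queues: a caller never needs to know where or how the model runs, which is the property Parth emphasizes.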
💡 There are still optimization opportunities, such as separating out the CPU and GPU processing to get more performance out of the GPUs. However, we need to balance the scale and latency characteristics we require with operational complexity from a development perspective.
Q: Have you considered serverless GPU deployment as a solution? If so, I would love to hear more about your experience.
We aimed to create a solution that would be easily deployable on the customer side, so we chose Kubernetes as our common abstraction layer. We heavily invested in usage-based resource allocation, auto-scaling, and setup, making it behave in a very serverless and functional way for our customers. We ended up building the solution ourselves, which was an organic evolution over time. Initially, we focused on building something simple, reliable, and integrated into customers' workflows. As for optimizing resource usage and scaling, it becomes easier when you have good adoption and a common API contract that customers are already using.
Q: What is your average GPU utilization across all clusters?
💡 It seems like the number of GPUs we use has been increasing over time. We're also trying to maximize their performance, depending on the models we're using. For example, models with hundreds of millions of parameters take longer to process on GPUs than smaller models. As a result, GPUs are naturally more highly utilized when processing larger models. Additionally, the time it takes for inference requests per model can also skew GPU utilization.
While I don't have exact numbers to share with you right now, one thing we do to maximize utilization is to bin-pack models on a GPU. This helps to maximize both the memory and the actual execution utilization of the GPU itself, which has been really effective for us.
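Bin-packing models onto GPUs by memory footprint can be approximated with a first-fit-decreasing heuristic, sketched below. The model names, sizes, and GPU capacity are made-up numbers for illustration; this is not Primer's scheduler.

```python
def pack_models(model_sizes_gb: dict, gpu_capacity_gb: float) -> list[dict]:
    """First-fit decreasing: place each model on the first GPU with room, largest first."""
    gpus = []  # each GPU is a dict of model name -> memory footprint
    for name, size in sorted(model_sizes_gb.items(), key=lambda kv: -kv[1]):
        for gpu in gpus:
            if sum(gpu.values()) + size <= gpu_capacity_gb:
                gpu[name] = size
                break
        else:
            gpus.append({name: size})  # no room anywhere: allocate a new GPU
    return gpus

models = {"ner": 6.0, "summarizer": 10.0, "sentiment": 3.0, "topics": 4.0}
placement = pack_models(models, gpu_capacity_gb=16.0)
```

Here four models fit on two 16 GB GPUs instead of four, which is the memory-utilization win Parth describes; a production packer would also weigh each model's compute load, not just memory.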
Q: Have you created custom containers for tasks like loading models and managing GPU usage?
We have invested in developer tooling to improve in this area over time. We offer our own containers, and our internal teams can also use their own. We provide developer tooling with the necessary libraries to make the runtime work on our platform. In other words, you can build your own container image by embedding our runtime toolkit with your model itself, and everything will work seamlessly.
Q: Between latency & model accuracy what is more important for you? And how do you achieve the desired goal?
In my current company, we have been mainly focused on accuracy and model performance, even if it comes at the expense of latency. Our use cases and information processing have not required us to make many trade-offs, so accuracy remains our top priority. To improve our ability to make trade-offs, we use techniques such as inference triage, where we train a cheaper model to do the same task as a slower, more accurate model, but use the latter only when necessary. We invested in these techniques from a research perspective, aiming to get the best of both worlds by training a smaller model that can handle most cases, and using a larger model only for more complex problems. By building a system that can identify when to use each model, we can benefit from both accuracy and efficiency.
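The inference-triage idea — route easy inputs to the cheap model and escalate uncertain ones to the larger model — can be sketched as a confidence-gated cascade. The threshold and the two stand-in models below are illustrative assumptions, not Primer's implementation.

```python
def triage_predict(text, cheap_model, large_model, threshold=0.8):
    """Use the cheap model when it is confident; otherwise escalate."""
    label, confidence = cheap_model(text)
    if confidence >= threshold:
        return label, "cheap"
    return large_model(text), "large"

# Stub models: the cheap one returns (label, confidence), the large one a label.
cheap = lambda t: ("positive", 0.95) if "great" in t else ("positive", 0.4)
large = lambda t: "negative"
```

If most traffic clears the confidence threshold, the expensive model runs only on the hard tail, giving the accuracy/efficiency balance described above.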
Q: Can you discuss any best practices or tips for other ML engineers looking to deploy models in production, based on your experience?
For anyone considering deploying a model to production: as a general rule, it's not always necessary to train a transformer. It's good practice to first train a simpler, non-transformer model and benchmark the results. In machine learning, always keep a held-out test set to validate your model's performance. Before starting a project, define the metrics for success and the dataset you'll use to test your model.
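The advice above — start with a simple baseline and validate it against a pre-defined metric on a held-out test set — can be sketched as follows. The keyword baseline and toy test set are placeholders for illustration.

```python
def keyword_baseline(text: str) -> str:
    """A deliberately simple non-transformer model: keyword matching for sentiment."""
    positive_words = ("great", "good", "love")
    return "positive" if any(w in text.lower() for w in positive_words) else "negative"

def accuracy(model, test_set):
    """The success metric, defined up front and scored on a fixed test set."""
    correct = sum(1 for text, label in test_set if model(text) == label)
    return correct / len(test_set)

test_set = [
    ("I love this", "positive"),
    ("great work", "positive"),
    ("this is awful", "negative"),
]
```

Only if a baseline like this falls short of the target metric does the extra cost of a transformer become worth paying.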
💡 When it comes to deployment, it largely depends on the company's constraints. However, I suggest incorporating model deployment and training workflow as part of a larger workflow. This will allow you to reproduce results and keep track of changes in the model development workflow. Good tools for experiment tracking, similar to Git workflows, are handy in these cases.
You don't always need to train the latest deep neural network model. Sometimes, a simpler model can meet the requirements. Make sure you understand the customer's needs and have a representative test dataset to validate the model's performance on the problem you're trying to solve.
When deploying the model, keep reproducibility in mind, both on the training side and the deployment side. If you're just starting out, using managed services is a great option since you don't need to spin up infrastructure. Make sure you have a way to keep track of deployed models and a good monitoring system. If you're scaling your model's deployment, it's important to have visibility into how the models are being called and how GPU utilization is being affected. Both model monitoring and software performance monitoring matter here.
Primer leverages ML models to analyze and summarize unstructured documents. Its applications span NLP problems such as structured data extraction, sentiment analysis, and document overviews. It has developed its own ML infrastructure optimized for stream processing, and to improve performance it has experimented with techniques like model quantization and ONNX runtimes. While prioritizing accuracy over latency, Primer uses inference triage to balance the two. Best practices for deploying models include benchmarking simpler models first, building reproducible deployment and training workflows, investing in monitoring, and using managed services when starting out.
Thanks for reading. 😄