Demystifying ML Deployment with Saulius from Dun & Bradstreet
This week on Towards Scaling Inference, we chat with Saulius Garalevicius, Principal Machine Learning Engineer at Dun & Bradstreet.
Saulius is an accomplished machine learning engineer currently serving as principal engineer at Dun & Bradstreet. He is an enthusiastic self-starter interested in AI, machine learning, computer vision, NLP, and brain-inspired pattern recognition algorithms.
During this episode, Saulius discusses his experience deploying models on managed infrastructure like AWS SageMaker, managing multiple inference workloads, techniques for optimizing models, challenges with the current processes, and best practices for fellow ML practitioners.
For the summary, let’s dive right in:
Q: What business problems are you solving through AI?
Our customers, typically large companies, have a large number of their own customers and leads and need to decide how to market their products or services to them. Our machine learning models predict which leads or accounts should be prioritized for further marketing and outreach. We gather a wide range of attributes about each business entity or lead, and the model identifies those who are likely to become customers, repeat customers, or high spenders. This helps advertisers and marketers focus their time and effort on the most promising leads and customers.
Q: How do you currently deploy these models in production? And what are the pros and cons of the approach taken?
I have experience deploying models in several different environments. I started with a homegrown solution that ran on its own servers and a homebrewed architecture. I also worked with scikit-learn models, which required their own machine or a cluster of machines to do inference. Then I moved to SageMaker models in AWS, where we deploy the models to SageMaker-provided infrastructure and endpoints. AWS SageMaker then automatically manages the load and scales the endpoint infrastructure behind the scenes. I prefer this more automated approach since we can delegate these tasks to AWS SageMaker, set the scaling parameters once, and let SageMaker take care of the load.
This approach is not limited to AWS SageMaker; similar functionality exists in other ML platforms like GCP. That is the big pro of this approach: if configured properly, it can be set and forgotten, letting the platform take care of scaling and inference.
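To make the "set it and forget it" idea concrete, here is a minimal sketch of deploying a scikit-learn model to a SageMaker real-time endpoint with the SageMaker Python SDK. The S3 path, IAM role, entry-point script, endpoint name, and instance type are all hypothetical placeholders, not details from the interview.

```python
from sagemaker.sklearn.model import SKLearnModel

# Hypothetical model artifact, IAM role, and inference script.
model = SKLearnModel(
    model_data="s3://my-bucket/models/lead-scoring/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    entry_point="inference.py",
    framework_version="1.2-1",
)

# Deploy to a managed endpoint; SageMaker provisions and manages the instances.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="lead-scoring-endpoint",
)

# Real-time scoring against the managed endpoint.
print(predictor.predict([[0.1, 0.2, 0.3]]))
```

Once the endpoint exists, an autoscaling policy can be attached to it once and then left alone, which is the "set and forget" part.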
Q: How do you manage inference workloads for different scenarios such as real-time, batch, and streaming? What are some of the challenges you face when you deploy them on AWS SageMaker?
Yes, I have experience with both real-time and batch processing, but not with streaming data. Let's use batch processing as an example. There are several ways to do it, such as SageMaker batch transform jobs, which spin up a separate AWS instance to handle the processing and then shut down once it's complete. Another approach is to send batch workloads to a SageMaker endpoint, which is always available to handle the load.
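As an illustration of the first option, here is a rough sketch of a SageMaker batch transform job via the SageMaker Python SDK. The model name, bucket paths, and instance type are hypothetical; the job provisions its own instances, scores the input, writes results to S3, and shuts everything down.

```python
from sagemaker.transformer import Transformer

# "lead-scoring-model" is a hypothetical, already-registered SageMaker model.
transformer = Transformer(
    model_name="lead-scoring-model",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="MultiRecord",
    assemble_with="Line",
    output_path="s3://my-bucket/batch-output/",
)

# Score a CSV of leads line by line; the instances shut down when the job completes.
transformer.transform(
    data="s3://my-bucket/batch-input/leads.csv",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```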
💡 The main challenge is setting the parameters for scaling and testing that scaling to ensure it can support the expected workloads. I have faced several challenges in making sure we can handle the large amounts of data that need to be run through models and scored. I have found endpoint-based inference to be a promising approach because it can start inference right away and rely on the scaling functionality of the endpoint itself. When deploying on AWS SageMaker, it is important to consider these challenges carefully and ensure the system is configured to handle the expected workload.
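Those scaling parameters are typically set through Application Auto Scaling. Below is a hedged sketch of a target-tracking policy on a SageMaker endpoint variant; the endpoint name, variant name, capacity limits, and target value are hypothetical and would come out of load testing, not from the interview.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and production variant.
resource_id = "endpoint/lead-scoring-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target with min/max instance counts.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking: keep invocations per instance near a threshold found via load tests.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # hypothetical requests per instance target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)
```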
Q: What are the different model types you have?
We use supervised machine learning to predict whether a certain contact or customer will convert or purchase more products. This is a classification problem, so we use classification models to predict whether or not a lead will become a successful customer. Additionally, we use regression models to predict not only whether a lead will become a customer, but also how much money they are likely to spend on a certain product or service. Regression models are used when predicting across a range of values, rather than discrete values.
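The split Saulius describes maps directly onto the two scikit-learn estimator families. A toy sketch with synthetic data standing in for the real lead and account attributes; the gradient boosting estimators are placeholders, not the team's actual model choice.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Synthetic stand-ins for the real firmographic attributes per lead.
X_cls, y_converted = make_classification(n_samples=1000, n_features=20, random_state=0)
X_reg, y_spend = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# Classification: will this lead become a customer? (discrete 0/1 label)
clf = GradientBoostingClassifier().fit(X_cls, y_converted)
conversion_prob = clf.predict_proba(X_cls[:5])[:, 1]

# Regression: how much is the lead likely to spend? (continuous target)
reg = GradientBoostingRegressor().fit(X_reg, y_spend)
expected_spend = reg.predict(X_reg[:5])
```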
Q: When working with classification and regression-based models, do you run inference workloads on CPUs, or do you also use GPUs for the same?
We have considered using GPUs for our inference workloads. AWS SageMaker makes it easy to switch between CPU and GPU instance types. Although we currently use CPUs, we can enable GPUs to see whether the performance improvement justifies the cost.
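In SageMaker, that switch really is just a change of instance type at deploy time. A minimal sketch, assuming a hypothetical serving container and model artifact (and noting that the serving framework still has to be able to use the GPU for the switch to pay off):

```python
from sagemaker.model import Model

# Hypothetical serving image, artifact, and role.
model = Model(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/lead-scoring:latest",
    model_data="s3://my-bucket/models/lead-scoring/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
)

use_gpu = True  # flip to compare latency and cost against the CPU baseline

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge" if use_gpu else "ml.m5.xlarge",
)
```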
Q: How do you optimize machine learning models for inference, taking into account factors such as latency and resource utilization?
There is no automated pipeline to evaluate this, but I personally evaluate the latency, resource utilization, and overall ability to handle load before deploying any solution to production. Once the solution is deployed, it's up to our customers to train and deploy their own models. As they continue to do so, we may have an increasing number of models running in production. Before this happens, we test everything in-house to see how many models we can support and whether we have enough infrastructure to scale to a certain number of models, or a certain amount of data that needs to be run through the models for prediction.
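A simple way to do that kind of pre-deployment latency check is to time repeated calls against a staging endpoint. A rough sketch with boto3; the endpoint name and payload format are hypothetical and depend on the model's inference handler:

```python
import json
import time

import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical staging endpoint and JSON payload.
payload = json.dumps({"features": [0.1, 0.2, 0.3]})

latencies = []
for _ in range(100):
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName="lead-scoring-endpoint-staging",
        ContentType="application/json",
        Body=payload,
    )
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"p50: {latencies[49] * 1000:.1f} ms, p95: {latencies[94] * 1000:.1f} ms")
```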
Q: How do you manage infrastructure costs while autoscaling and make sure to avoid a degraded customer experience? Are you satisfied with the current process, or do you think a serverless infrastructure could help?
Let's discuss two use cases. For real-time scoring, it's essential to respond quickly and within a reasonable time. We might use a SageMaker solution that provides enough endpoint capacity to handle the load. We can monitor the load and determine whether the endpoint is fully loaded or half-loaded, or whether it needs to be scaled.
💡 However, for batch processing, if a customer runs a batch job over big data, they might have to wait quite a long time for it to finish. The pipeline includes many steps, including model scoring. We still have to manage the time it takes for the model prediction steps to finish and make sure it stays within reasonable limits, but we can afford to wait for the SageMaker endpoints to scale up. It takes at least five minutes to detect the need to scale and another five to seven minutes to add another instance to the SageMaker endpoint. By the time it scales, 10 or 15 minutes may have passed, which might be seen as a problem. For customers who rely on batch processing, though, this happens fairly seamlessly. It would be great if scaling happened within a couple of seconds, but SageMaker does not currently provide a solution like that.
On the other hand, for real-time inference, we can usually predict what the load is going to be, monitor it, and scale ahead of time, so scaling under a sudden, unexpected spike should be a rare event.
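That kind of load monitoring can be done from the endpoint's CloudWatch metrics. A small sketch that pulls the last hour of invocations per instance for a hypothetical endpoint and variant, which is one way to judge whether an endpoint is fully loaded, half-loaded, or due for scaling:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical endpoint and variant names.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="InvocationsPerInstance",
    Dimensions=[
        {"Name": "EndpointName", "Value": "lead-scoring-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```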
Q: Can you discuss any best practices or tips for other ML engineers who are looking to deploy their models in production based on your experience?
Deploying models in production is always a challenging task. Even if you see good model performance in your development environment, that may not hold in production. Therefore, it is crucial to test and retest at every stage of development, starting from the prototype phase. During the prototype phase, you should estimate how much load the solution can handle. After the solution is implemented and ready, before deploying to production, you should thoroughly test how it handles the load.
💡 It would be ideal to have the same infrastructure in the QA or testing environment as in the production environment and test that infrastructure with large amounts of data and different scenarios. This way, you can avoid any surprises once the model is deployed to production. In summary, my main takeaway is to always test and retest at every stage of development to ensure that the model performs well in production.
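One lightweight way to exercise a QA endpoint under load before go-live is to fire concurrent requests and look at tail latency. A sketch under the same hypothetical endpoint and payload assumptions as the earlier latency check:

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")
payload = json.dumps({"features": [0.1, 0.2, 0.3]})  # hypothetical payload format


def score_once(_):
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName="lead-scoring-endpoint-qa",  # hypothetical QA endpoint
        ContentType="application/json",
        Body=payload,
    )
    return time.perf_counter() - start


# Burst of concurrent requests to see how the QA endpoint holds up.
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = sorted(pool.map(score_once, range(1000)))

print(f"p95 latency: {latencies[int(0.95 * len(latencies))] * 1000:.1f} ms")
```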
Thanks for reading. 😄