Managing Inference Workloads, Conversion to ONNX, Deploying on AWS & more with Jonathan Tu Nguyen
This week on Towards Scaling Inference, we chat with Jonathan Tu Nguyen, Senior Machine Learning Engineer, Voodoo
Jonathan is currently working as a Senior Machine Learning Engineer at Voodoo. Previously, he worked at BNP Paribas, bringing the latest machine-learning technologies into production for use cases like speech recognition and OCR.
In this episode, he shares his experience deploying models in production using AWS EKS and the ONNX format for efficiency. He also covers techniques for optimizing infrastructure cost and solving latency challenges. Jonathan recommends preparing for growth, monitoring for anomalies, choosing reasonable models, and using automation tools when deploying models in production.
For the summary, let's dive right in:
Q: What kind of business problems are you solving through AI?
In my current project, I am developing an ad network that aggregates ad inventories from our company's supply sources and merchants and matches them with demand sources. We are building a real-time bidding system to compete with companies such as Google, Facebook, AppLovin, and other competitors in the app space. This system enables app developers to monetize their apps quickly.
To predict metrics like click-through rate, install rate, win rate, and win probability, we use machine learning models and predictive algorithms. Additionally, we use multi-embedding to select and display the best ads for our users, ensuring the highest install probability.
Q: How do you currently deploy these models in production? Do you work with managed services like Hugging Face or AWS, have you built your own ML infrastructure stack on Kubernetes, or do you take another approach? And what are the pros and cons of the process?
Currently, we deploy our models in production on AWS using EKS. We previously tried multiple platforms, including AWS SageMaker, but decided to stick with AWS and EKS and build custom infrastructure for improved search functionality. This allows us to avoid over-dependence on other platforms and to keep improving our own infrastructure.
📌 Prior to working at Voodoo, I worked on internal projects that used our own local cluster for training and deploying models. However, we encountered many problems with this setup. The most serious was maintenance cost, as we needed a team to maintain the server and infrastructure. When there were problems with the cluster, we had to pull that team in to fix them, which cost time and money. Another important consideration was capability, as our local cluster and infrastructure were limited.
After encountering these issues, we moved to a cloud-based infrastructure using Domino Data Lab. However, this setup also had drawbacks, including dependence on Domino Data Lab: we had to wait for them to debug problems that we had detected ourselves, and we had to wait for a new release to get the newest features.
Currently, we use our own infrastructure based on AWS and EKS, which is more flexible and gives us ownership of everything, including releases and features. While it requires more resources at the beginning to build the base infrastructure, we can now deploy a model in one or two weeks. This is in contrast to my work at BNP Paribas, where I had to wait a year to deploy a model.
Q: When deploying models on these services, do you deploy them in the native format OR convert them to ONNX/TensorRT? And what has been your experience?
For OCR or speech recognition use cases, our models can be quite large, so we need to convert them to a more efficient format. Our preferred option is the ONNX format, which improves inference time and reduces dependence on the hardware platform. Although we also tested deploying with TensorFlow, we found it more convenient to convert all our projects to PyTorch and then to ONNX.
The ONNX format is compatible with many types of models and performs well in terms of inference time. Currently, I am working on deploying a structured-data model. The data comes from our users, inventory, and games, and we combine it into tabular data; that model is deployed in its native format.
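As a rough illustration of the PyTorch-to-ONNX conversion described above, here is a minimal sketch; the model class, input shape, and file name are hypothetical placeholders, not the team's actual code.

```python
import torch
import torch.nn as nn

# Hypothetical model standing in for a much larger OCR / speech recognition network.
class TinyClassifier(nn.Module):
    def __init__(self, in_features: int = 128, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 128)  # example input used to trace the graph

# Export to ONNX; dynamic_axes keeps the batch dimension flexible at inference time.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)
```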
Q: How do you manage inference workload for different scenarios such as real-time, batch, and streaming? What are some of the challenges you face with each of these scenarios, and how do you address them?
In our current use case, I work on real-time and streaming predictions; I have also worked on batch prediction in previous roles. For real-time predictions, our solution is deployed in its native format, such as Scikit-learn or PyTorch, without conversion to ONNX. We load the model into the memory cache of our deployment and check every 10 minutes for any changes to the base model. Since we have many models and checkpoints, we compare versions to determine which one is current. If there is a change, we load the new model; if not, we load nothing and keep the existing model in the memory cache for real-time predictions.
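The reload-on-change behaviour described above could look roughly like the sketch below. The `load_model` and `fetch_latest_version` helpers are hypothetical, and the 10-minute interval mirrors the one mentioned in the answer.

```python
import threading
import time

CHECK_INTERVAL_SECONDS = 600  # check every 10 minutes, as described above

class ModelCache:
    """Keeps the current model in memory and swaps it only when a newer version appears."""

    def __init__(self, load_model, fetch_latest_version):
        self._load_model = load_model                      # hypothetical: loads a checkpoint by version
        self._fetch_latest_version = fetch_latest_version  # hypothetical: returns newest version id
        self._lock = threading.Lock()
        self.version = fetch_latest_version()
        self.model = load_model(self.version)

    def refresh(self):
        latest = self._fetch_latest_version()
        if latest != self.version:          # only reload when the version actually changed
            new_model = self._load_model(latest)
            with self._lock:
                self.model, self.version = new_model, latest

    def watch(self):
        while True:
            time.sleep(CHECK_INTERVAL_SECONDS)
            self.refresh()

# Usage sketch: run the watcher in a background thread while the service keeps
# serving real-time requests from cache.model.
# cache = ModelCache(load_model, fetch_latest_version)
# threading.Thread(target=cache.watch, daemon=True).start()
```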
To process real-time data, we use tools such as DynamoDB and Kinesis on AWS, along with monitoring and logging tools like Grafana and Prometheus to track real-time predictions. For batch predictions, we have to process a large number of requests at once, such as 1,000 documents. The challenge is fitting the whole batch in memory while keeping the batch processing time low.
💡 To achieve this, we use parallel processing on Kubernetes, spreading the work across multiple processes and nodes to reduce batch processing time. We also increase the batch size fed to the model, which means balancing GPU and CPU memory against inference time. So, we rely on many techniques, especially parallel processing, for batch prediction.
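A minimal sketch of the batch-parallelism idea: split a large batch of documents across worker processes and size each chunk so it fits in memory. The chunk size and the `predict_batch` function are illustrative assumptions, not the team's actual pipeline.

```python
from concurrent.futures import ProcessPoolExecutor

def predict_batch(docs):
    # Hypothetical per-worker inference over a memory-sized chunk of documents.
    return [len(doc) for doc in docs]  # placeholder prediction

def chunk(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def parallel_predict(documents, chunk_size=100, workers=4):
    # Each chunk is small enough to fit in one worker's memory; workers run in parallel,
    # mirroring the multiple-processes-on-Kubernetes idea described above.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(predict_batch, chunk(documents, chunk_size))
    return [pred for batch in results for pred in batch]

if __name__ == "__main__":
    docs = [f"document {i}" for i in range(1000)]
    print(len(parallel_predict(docs)))
```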
Q: Do you see that the infrastructure cost for real-time processing is high because you have to keep these models always on? Is this a big challenge, or are these models small enough that the cost of real-time inference is not a significant factor?
This is a problem we face frequently in our use case, where real-time prediction is critical for real-time bidding. We must return a prediction in less than 200 or 300 milliseconds for our bid to be received. If it takes longer than 300 milliseconds, the platform will not accept the bid, and we lose the opportunity to bid on that placement. This is the most important constraint in our systems, and it creates a lot of operational challenges.
Models can grow larger and deeper, so we need to use model compression to bring them down to the required scale. We also need to manage memory and machines, as we handle around 2 million bids per day. Therefore, our cluster must run all the time and the model must stay loaded continuously. We also retrain and deploy the model every day; our system detects the new model automatically, and we manage peak load within the time frame the platform accepts.
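One common way to compress a model along the lines described above is post-training dynamic quantization of the exported ONNX file. This sketch assumes a `model.onnx` file already exists and uses ONNX Runtime's quantization utility; it is one possible approach, not necessarily the one used in production here.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization converts weights to int8, shrinking the file and often
# speeding up CPU inference, at a small cost in accuracy.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```

Pruning or distillation are alternatives when int8 weights cost too much accuracy.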
Q: Let's say you have to maintain a crucial 300-millisecond latency. Do you usually over-provision hardware? In other words, if the hardware doesn't get fully used, is that acceptable? Or do you tend to over-provision instead of under-provisioning?
Sometimes, we need to increase the capacity of our worker machines to accommodate the larger memory requirements of our models. However, this presents a challenge because more powerful machines can be quite costly and require additional infrastructure. Therefore, we must strike a balance between the depth of our models and their accuracy. We need to choose the most reasonable model that can produce acceptable results within acceptable memory and inference time constraints. We consider this carefully and change machines as needed.
We must consider both latency and model accuracy, depending on the use case. For example, in real-time applications, latency is often the most important constraint: we may need to enforce a strict limit such as 300 milliseconds, or in some cases even 150 milliseconds. Although deeper models can provide higher accuracy, they come at the cost of increased latency, and in real-time prediction we favor lower latency over higher accuracy. Therefore, we draw a curve between latency and accuracy and pick the models that strike the best balance between the two. These models may not be the best overall, but they are the most reasonable given the constraints. This is critical for real-time prediction.
On the other hand, batch prediction is less constrained by latency and therefore allows for more complex techniques. We can utilize more powerful machines to create deeper models and achieve higher accuracy. Batch prediction can be run overnight for business cases that require less urgency.
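Picking "the most reasonable model" under a latency budget, as described above, can be expressed very simply: filter candidates by the latency cap, then take the most accurate survivor. The candidate names and numbers below are made up purely for illustration.

```python
# Hypothetical (latency_ms, accuracy) measurements for candidate models.
candidates = {
    "small_mlp":  (40, 0.71),
    "medium_mlp": (120, 0.74),
    "deep_model": (450, 0.78),  # most accurate, but far too slow for real-time bidding
}

LATENCY_BUDGET_MS = 300  # hard platform limit mentioned above

def pick_model(candidates, budget_ms):
    within_budget = {name: (lat, acc) for name, (lat, acc) in candidates.items() if lat <= budget_ms}
    # Among models that respect the latency constraint, keep the most accurate one.
    return max(within_budget, key=lambda name: within_budget[name][1])

print(pick_model(candidates, LATENCY_BUDGET_MS))  # -> "medium_mlp"
```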
Q: When training these models using PyTorch, TensorFlow, or Scikit, have you built your own serving runtime, or are you using existing tools like TorchScript or TF Serving?
Early on, we relied heavily on TensorFlow for our models. Back in 2018, TensorFlow was the best option available, and we found TensorFlow Serving very convenient: it could handle multiple models, multiple versions, and various deployment configurations. However, we later changed our strategy and infrastructure to PyTorch, because many platforms and models only support PyTorch, which allows for flexible research and development. As a result, we completely converted our models to PyTorch.
💡 Although we could convert PyTorch models to the TensorFlow Serving format, it created many problems with accuracy and version compatibility due to the large size of the models. So, we changed our strategy again and switched to ONNX Runtime. First, we converted our models to the ONNX format, as discussed above. Then, we started using ONNX Runtime, because it has a high-performance inference engine and can be used with many different hardware configurations. We deployed ONNX Runtime on our Kubernetes platform, and it has been running very smoothly.
Overall, switching to ONNX Runtime was one of the best decisions we've made so far.
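Serving the converted model with ONNX Runtime, as described above, can look roughly like this; the file name and input shape are assumptions carried over from the export sketch earlier, and the execution provider is just one option.

```python
import numpy as np
import onnxruntime as ort

# Create an inference session; providers can be swapped (e.g. CUDAExecutionProvider)
# without changing the model file, which is part of ONNX Runtime's appeal.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
batch = np.random.randn(8, 128).astype(np.float32)  # hypothetical batch of feature vectors

outputs = session.run(None, {input_name: batch})
print(outputs[0].shape)
```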
Q: How do you manage and scale your infrastructure for inference as your usage grows?
At Voodoo, we face significant scaling challenges due to our 350+ million active users, and we expect this number to keep growing rapidly; our use case grows exponentially each month. To handle this, we chose to implement auto-scaling in AWS EKS, which has worked well overall. While we encountered some problems, such as with serverless Redshift, we discussed these with Amazon and found them to be minor. Generally, auto-scaling in AWS has adapted well to our use case.
Monitoring tools are crucial for managing our growing user base, and we use Prometheus and Grafana for this purpose. These open-source tools allow us to track a variety of metrics, including user data and system performance.
At BNP, we also faced scaling challenges because our use case spanned multiple countries. Initially, we only used CPUs, but as we expanded, we incorporated multi-processing on GPUs, which allowed us to scale our inference infrastructure. This included using specialized hardware to scale up speech recognition and improve overall performance.
Q: Can you discuss any best practices for other ML Engineers who are looking to deploy models in production based on your experience?
There are a lot of things we encounter in the industry when working on these problems, but I believe the best knowledge comes from the business side. For example, at the beginning of a project like our ad network, we might only target a specific domain or country, such as the US or Canada. But we need to think beyond that: we cannot just stick to one country. We have to prepare for every option and scenario at that scale. So we need to think about growth and scaling up first, before anything else: before models, before infrastructure.
💡 Secondly, I believe monitoring is very important, but it is often ignored by engineers. It is one of the most important aspects, because when we set up monitoring at the beginning, we can detect many problems throughout the project. We need to stick with monitoring and catch issues before they become too big to handle. As part of monitoring, we have to implement some kind of anomaly detection or automatic alerting. For example, we can set a threshold and alert whenever the system goes beyond it. This way, we can detect problems instantly, because in production many things can go wrong, and we cannot rely on our own eyes to catch them.
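As a concrete example of the threshold-based alerting just mentioned, here is a minimal sketch that instruments prediction latency with the `prometheus_client` library and logs a warning when a request exceeds a limit. The metric name, threshold, and fake workload are illustrative assumptions.

```python
import logging
import random
import time

from prometheus_client import Histogram, start_http_server

LATENCY_LIMIT_SECONDS = 0.3  # illustrative threshold, mirroring the 300 ms budget above
PREDICTION_LATENCY = Histogram("prediction_latency_seconds", "End-to-end prediction latency")

logging.basicConfig(level=logging.INFO)

def predict(features):
    start = time.perf_counter()
    time.sleep(random.uniform(0.05, 0.4))  # placeholder for real inference work
    elapsed = time.perf_counter() - start
    PREDICTION_LATENCY.observe(elapsed)    # exported for Prometheus/Grafana dashboards
    if elapsed > LATENCY_LIMIT_SECONDS:
        logging.warning("prediction exceeded %.0f ms: %.0f ms",
                        LATENCY_LIMIT_SECONDS * 1000, elapsed * 1000)
    return features

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    for i in range(20):
        predict({"bid_id": i})
```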
Thirdly, regarding machine learning models: in the industry we see many good benchmarks, architectures, and complex models that outperform others on benchmarks, but practice is different. We have to take everything into account, from latency to memory cost, team skills, and budget. We cannot just pick the best model in the industry; we have to improve it over time and choose one that is reasonable for our use case. It is essential to think beyond the best model and choose one that is reasonable to deploy.
📌 Lastly, I think having a Serverless GPU platform like what you are building at Inferless helps machine learning engineers focus more on the models, data, and monitoring. It takes care of the problems and complexity, and it helps us a lot because there are many things to consider, such as coding, optimization, business use cases, and KPIs. If we have tools to help us work automatically on some domains, it is very helpful. Nowadays, more and more industries are moving in this direction towards automation.
Thanks for reading. 😄