Building an ML Ecosystem with AWS SageMaker and Kubeflow for Model Deployment, with Reza from Dropbox
This week on Towards Scaling Inference, we chat with Reza Rahimi, Senior ML Engineering Manager at Dropbox.
Reza is currently a Senior Engineering Manager at Dropbox, where he leads the strategy, roadmap, and execution of cross-functional AI/ML products. He has more than 17 years of industry experience in machine learning, big data management, experimentation, and analytics.
In this episode, he explains how Dropbox uses AI and ML to recommend better products, optimize funnels, and run experiments. The team uses a combination of open-source tools and cloud services, including AWS SageMaker, Airflow, and Kubeflow. He also discusses their approach to deploying models in production, managing inference workloads, and scaling infrastructure. Finally, he shares tips for other ML engineers looking to deploy models in production: focus on the problem, use proven open-source solutions, containerize, monitor performance, and prioritize privacy and security.
For the summary, let’s dive right in:
Q: What kind of business problems are you solving through AI?
In my role at Dropbox, I oversee the strategy, roadmap, and execution of AI and ML products related to consumer targeting, intent understanding, product and feature recommendation, growth, funnel optimization, and experimentation. I am also responsible for understanding the needs of different stakeholders, identifying pain points, and leveraging machine learning to efficiently solve their problems.
As you know, developing ML products requires complex data analysis to understand pain points and real problems in the business. To scope the problem and identify similar and lookalike users for consumer targeting, we work closely with different stakeholders such as data scientists, PMs, and growth managers.
💡 Regarding consumer targeting, we aim to expand the consumer and user base of different features and products. By leveraging machine learning, we identify similar and lookalike users and target them to promote product adoption and monetization. For product recommendation, we bundle different products and services and aim to identify the best bundle or package that suits the customer's needs and actions. By understanding user intent and interaction with Dropbox, we can recommend the best bundle or package to ensure a good user experience. Lastly, we focus on growth funnel optimization. In this area, we try to understand user needs in different phases of the consumer journey and connect them to the correct workflow in Dropbox.
Machine learning is a powerful tool that we leverage to solve these problems and overcome the main challenges we face in the business.
Q: How do you currently deploy these models in production? Do you work with managed services like Hugging Face, AWS, etc. OR have you built your own ML infrastructure stack over Kubernetes OR any other approach? And what are the pros and cons of the process?
Our main ecosystem is based on AWS and popular open-source tools. For orchestration and workflow, we mainly use Airflow. For many machine learning tasks, we rely on AWS services such as SageMaker. Spark is also a key tool in our ecosystem.
💡 Recently, we decided to switch from Airflow to Kubeflow for machine learning deployment due to benchmarking results that showed Kubeflow's superior scalability, flexibility, and performance. Kubeflow is specifically designed for machine learning, making it a better fit for our needs. Additionally, our company's strategy is to switch to Kubernetes for all deployments, as it provides a friendly ecosystem of compatible tools.
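To make the contrast concrete, here is a minimal sketch of how a training workflow can be expressed with the Kubeflow Pipelines (kfp v2) SDK rather than an Airflow DAG; it is not Dropbox's actual pipeline, and the component bodies, base image, and S3 path are placeholders.

```python
# A minimal sketch of a Kubeflow Pipelines (kfp v2) workflow; the component
# bodies, base image, and S3 path are placeholders, not Dropbox's pipeline.
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10")
def preprocess(raw_path: str) -> str:
    # Placeholder feature-engineering step.
    return raw_path + "/features"

@dsl.component(base_image="python:3.10")
def train(features_path: str) -> str:
    # Placeholder training step; in practice this could hand off to SageMaker.
    return features_path + "/model"

@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(raw_path: str = "s3://example-bucket/raw"):
    features = preprocess(raw_path=raw_path)
    train(features_path=features.output)

if __name__ == "__main__":
    # Compile to a YAML spec that a Kubeflow cluster on Kubernetes can run.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```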
When selecting tools and platforms, we prioritize ones that are friendly and work well together. These are the main tools and infrastructure we use for MLOps and other operations.
Q: When deploying models on these services, do you deploy them in the native format OR convert them to ONNX/TensorRT? And, what has been your experience?
💡 The choice of format depends on the use case. For instance, when the explainability of the model is important, we keep the native format rather than converting it. This is because stakeholders, such as marketing teams, want to understand the model's drivers and check for biases. The native format preserves the information needed to compute SHAP values, which provide that explainability. However, for deep learning models related to text analysis, where portability is essential, we convert to the ONNX format.
We have both formats available, depending on the use case and stakeholder requirements. From a software engineering perspective, using different formats does not pose many challenges, as we have one standard that everyone follows during execution. ML operations run smoothly, and portability for execution is fast. The main pain point is when explainability is essential and we need to keep a framework-specific format such as XGBoost's native one. In such cases, we save the model in that format and also export it to JSON so we can answer stakeholders' questions about the model's drivers.
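As an illustration of the two paths described above, here is a minimal sketch assuming a generic XGBoost classifier rather than Dropbox's actual models: keep the native/JSON format when SHAP-based explainability is needed, and export to ONNX when portability matters.

```python
# A minimal sketch contrasting the two format choices: keep the native/JSON
# XGBoost format when SHAP explainability is needed, export to ONNX when
# portability matters. The model and data here are synthetic placeholders.
import numpy as np
import shap
import xgboost as xgb
from onnxmltools import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType

X = np.random.rand(200, 8).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)
model = xgb.XGBClassifier(n_estimators=20, max_depth=3).fit(X, y)

# Path 1: native/JSON format, which preserves what SHAP needs for explanations.
model.get_booster().save_model("model.json")
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # per-feature drivers for stakeholders

# Path 2: ONNX export for portable, fast inference.
onnx_model = convert_xgboost(
    model, initial_types=[("input", FloatTensorType([None, 8]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```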
Q: How do you manage inference workload for different scenarios such as real-time, batch, and streaming? What are some of the challenges you face with each of these scenarios, and how do you address them?
Depending on the use case, we have three types of models from an ML engineering perspective: batch, real-time, and streaming, and we have use cases for all of them. For example, we use real-time models to recommend the best plan for users based on their short-term behavior; these run fully online. We also have batch models that work on users' long-term interaction with the Dropbox tool, which can span three months or more. We collect all of that interaction data and use it to recommend new features related to a person's interests, such as working with video files; for our product Capture, for instance, we try to recommend it at the time and in the context where it will be most productive. These models run as batch jobs and have generally been more accurate because they have collected more data about the user.
💡 We have two different pipelines for executing these models. The first is for the batch models, which involve heavy-lifting ETL and ML inferencing; we leverage Spark and SageMaker in this domain to run inference. For the real-time models, we use a similar setup, but the inputs come from cached data and other signals, and inference runs behind a real-time API that interacts with users. This is the current ecosystem and the types of models we deploy for our use cases, designed mainly around open-source tools and managed services such as SageMaker.
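As a rough illustration of these two pipelines, here is a minimal sketch using the SageMaker Python SDK and boto3; the model name, endpoint name, and S3 paths are hypothetical, not Dropbox's.

```python
# A minimal sketch of the two inference pipelines: an offline batch-transform
# job and a low-latency endpoint call. Model name, endpoint name, and S3 paths
# are hypothetical.
import json
import boto3
from sagemaker.transformer import Transformer

# Batch pipeline: score a large S3 dataset offline.
batch = Transformer(
    model_name="example-recommender",                 # hypothetical model
    instance_count=2,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/batch-scores/",
)
batch.transform(
    data="s3://example-bucket/features/long-term/",
    content_type="text/csv",
    split_type="Line",
)

# Real-time pipeline: the serving API calls a deployed endpoint per request.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="example-recommender-endpoint",      # hypothetical endpoint
    ContentType="application/json",
    Body=json.dumps({"instances": [[0.2, 0.7, 0.1]]}),
)
print(json.loads(response["Body"].read()))
```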
Q: Which is more efficient - real-time or batch processing? Is computation inefficient in real-time processing? For tasks that require under 100 milliseconds, should accelerated computing like GPUs be used? Or should CPU-based machines still be used for real-time processing in most workloads?
There are a couple of techniques you can follow for real-time models. One is a funneling approach: start with a higher-recall model, then move to a high-precision model. Another is simply to design a fast, simple model from a machine learning perspective. The funneling technique lets you first recall a large pool of potential users and then narrow it down in a very granular way. We use these techniques to reduce latency.
💡 The second consideration is meeting the SLA the product requires for the real-time model; for us, the API should respond in less than 100 milliseconds. Although we have mainly focused on CPU-based use cases, a combination of CPU- and GPU-based models could be very helpful. It would also save on GPU costs, since GPUs can be expensive.
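To make the funneling technique concrete, here is a minimal sketch with placeholder scoring functions standing in for the high-recall and high-precision models; it is not Dropbox's production code.

```python
# A minimal sketch of the funneling pattern: a cheap, high-recall first stage
# prunes candidates, and a slower, high-precision second stage re-ranks only
# the survivors. The scoring functions are placeholders.
import numpy as np

def first_stage_scores(features: np.ndarray) -> np.ndarray:
    # Fast, high-recall model (e.g., a linear model over a few features).
    return features[:, 0]

def second_stage_scores(features: np.ndarray) -> np.ndarray:
    # Slower, high-precision model (e.g., boosted trees over all features).
    return features.mean(axis=1)

def rank_candidates(features: np.ndarray, keep_recall: int = 100, keep_final: int = 10) -> np.ndarray:
    # Stage 1: keep the top `keep_recall` candidates cheaply.
    shortlist = np.argsort(-first_stage_scores(features))[:keep_recall]
    # Stage 2: re-rank only the shortlist with the expensive model.
    reranked = np.argsort(-second_stage_scores(features[shortlist]))[:keep_final]
    return shortlist[reranked]

candidates = np.random.rand(10_000, 16)
print(rank_candidates(candidates))
```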
Regarding deep learning models, we have not used any in my group. However, other ML groups have worked on video and text and use their own GPU-based models. We have a couple of use cases in that domain where heavy deep learning models with many layers are used, and GPUs accelerate the inferencing. Even so, we need to make a trade-off between cost and the performance of the algorithm to operate at the optimal point for execution.
Q: Which model serving runtime are you using? Like TorchScript, Triton, TF Serving OR have you written custom scripts? And, what has been your experience?
We primarily use TensorFlow as our training framework. For serving, we follow the standard containerization approach used by SageMaker: we containerize the model according to its templates and then deploy the workload to an endpoint. The process is straightforward and systematic within our domain.
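As an illustration of that flow, here is a minimal sketch using the SageMaker Python SDK's built-in TensorFlow Serving container; the IAM role, model artifact path, framework version, instance type, and payload are hypothetical.

```python
# A minimal sketch of deploying a containerized TensorFlow SavedModel to a
# real-time SageMaker endpoint. The IAM role, model artifact path, framework
# version, and payload are hypothetical.
import sagemaker
from sagemaker.tensorflow import TensorFlowModel

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/ExampleSageMakerRole"  # hypothetical role

model = TensorFlowModel(
    model_data="s3://example-bucket/models/model.tar.gz",     # hypothetical artifact
    role=role,
    framework_version="2.12",
    sagemaker_session=session,
)

# Deploy the model behind an HTTPS endpoint served by the TF Serving container.
predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.c5.xlarge",
)

# Real-time inference call against the endpoint.
print(predictor.predict({"instances": [[0.1, 0.2, 0.3]]}))
```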
Q: Between latency & model accuracy what is more important for you? And how do you achieve the desired goal?
Accuracy directly impacts the consumer experience, so it is a constant concern, but our API and software engineers also hold us to strict latency targets. Our goal is to keep API latency under roughly 100-150 milliseconds while maintaining a good level of accuracy for these use cases. To achieve this, we use a few specific techniques.
💡 First, we use a very fast model to ensure high recall and select the right users or products for the best experience. Second, we use a high-precision model that works with fewer inputs to drive the final decision. These are the main techniques we leverage. Additionally, we scale the inference endpoint by increasing the number of containers and nodes to achieve the desired latency. Finally, in certain use cases we also use cached data, since it is fresh, close to the service, and very fast to retrieve as an inference input. This helps us satisfy the SLA that comes from the software engineering and API groups.
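As one way to picture the caching technique, here is a minimal sketch of a precomputed-feature lookup with boto3 and DynamoDB; the table name and key schema are hypothetical.

```python
# A minimal sketch of a precomputed-feature lookup used as inference input.
# The table name and key schema are hypothetical.
import boto3

dynamodb = boto3.resource("dynamodb")
feature_table = dynamodb.Table("user-features")  # hypothetical precomputed table

def cached_features(user_id: str) -> list:
    # Millisecond-scale key-value read instead of recomputing features
    # on the request path; falls back to an empty list on a cache miss.
    item = feature_table.get_item(Key={"user_id": user_id}).get("Item", {})
    return item.get("features", [])
```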
Q: How do you manage and scale your infrastructure for inference as your usage grows?
We use the usual software engineering and architectural techniques to serve inference behind APIs. One technique is auto-scaling the inference endpoint: if a lot of requests come in, we scale up the nodes. Another approach is to determine whether a batch model is reasonable for the use case. If it is, we pre-compute the results and keep them cached in Cassandra or DynamoDB for fast access. By pre-computing, we can handle big loads and scale automatically while providing good SLAs for customers; that path is effectively serverless, so we don't need to add auto-scaling to the infrastructure ourselves.
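For the auto-scaling path, here is a minimal sketch of target-tracking scaling on a SageMaker endpoint variant via Application Auto Scaling; the endpoint name, variant, capacities, and target value are hypothetical.

```python
# A minimal sketch of target-tracking auto-scaling for a SageMaker endpoint
# variant via Application Auto Scaling. Endpoint name, variant, capacities,
# and the target value are hypothetical.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/example-recommender-endpoint/variant/AllTraffic"  # hypothetical

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Add instances when average invocations per instance exceed the target.
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```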
Q: Can you discuss any best practices or tips for other ML engineers looking to deploy models in production, based on your experience?
Based on my experience, I want to emphasize that while ML itself is not new, it is still a relatively new practice in industry. It is important to focus on the right problem and get buy-in from all stakeholders to ensure that you are solving a real issue. This can be challenging since different stakeholders often have different perspectives, and not all of them have a background in ML. Therefore, understanding the correct problem to solve is crucial.
💡 My second point is that it is essential to use proven open-source solutions like Spark and TensorFlow, along with cloud services. This helps ensure that you deploy the ML model correctly and don't waste time writing everything from scratch. Sometimes we lose sight of the real problem and just focus on writing our own initial solution, which is not ideal. Instead, we should solve the business problem first and then think about optimizing, using the cloud or other open-source tools to improve the solution.
Thirdly, when starting a project, it's best to start small and build incrementally toward a better model or solution. It's easy to get overwhelmed, since there is often a lot of ambiguity in the problem scope, so it's important to move fast and get good results early. Containerization is also important, since it hides the model's complexity behind abstract containers that can scale up or down.
It's essential to monitor the performance of the model in the cloud and have a good observability pipeline to ensure that the offline metrics stay in a good range. Retraining the model is also critical and requires automation in the pipeline and the MLOps operation.
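One simple way to wire offline metrics into such an observability pipeline is sketched below with boto3 and CloudWatch; the namespace, metric names, and values are hypothetical.

```python
# A minimal sketch of pushing offline evaluation metrics to CloudWatch so
# dashboards and alarms can flag drift out of the expected range. Namespace,
# metric names, and values are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_offline_metrics(model_name: str, auc: float, precision_at_10: float) -> None:
    cloudwatch.put_metric_data(
        Namespace="MLModels/OfflineEvaluation",  # hypothetical namespace
        MetricData=[
            {"MetricName": "AUC", "Value": auc,
             "Dimensions": [{"Name": "Model", "Value": model_name}]},
            {"MetricName": "PrecisionAt10", "Value": precision_at_10,
             "Dimensions": [{"Name": "Model", "Value": model_name}]},
        ],
    )

report_offline_metrics("example-recommender", auc=0.91, precision_at_10=0.42)
```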
Finally, privacy and security are critical concerns when dealing with AI. It's important to work closely with privacy and legal teams to ensure that you are using the correct data and respecting users' privacy. Updating the model is also essential since user behavior may change frequently in the market domain. Therefore, maintenance of the model is also an important part of the MLOps operation.
Thanks for reading. 😄