Simplifying ML Deployment with Harsha from PathAI
This week on Towards Scaling Inference, we chat with Harsha Vardhan Pokkalla, Director of Machine Learning at PathAI
Harsha is a seasoned technology leader currently serving as Director of Machine Learning at PathAI. PathAI’s mission is to improve patient outcomes with AI-powered pathology. It's a Series-C funded company, backed by marquee investors such as General Atlantic and General Catalyst Partners.
In this episode, he discusses PathAI's focus on oncology and non-oncology areas. He shares insights into their in-house infrastructure, with separate stacks optimized for training and for inference. He also talks about why they prioritize accuracy over latency and how understanding the business use case and user workflow helps them optimize their product.
Prefer watching the interview?
For the summary, let’s dive right in:
Q: What kind of business problems are you solving through AI?
At PathAI, we develop products for histopathology, primarily focusing on oncology and non-oncology areas. These products help pharma companies discover novel biomarkers, as well as assist pathologists in granularly diagnosing or scoring different types of diseases in pathology images. Our products are particularly useful for translational research, clinical trials, and diagnostic purposes.
One of the main challenges in this field is the nature of the data: pieces of tissue taken from patients are scanned by high-resolution scanners to create what's known as a whole slide image. These images can be gigapixel-sized, with dimensions up to 100,000 by 100,000 pixels per patient. Therefore, we must build algorithms capable of segmenting these images, identifying different types of cells, separating cancerous and normal tissue, and identifying new biomarkers within these images. Sharing this information with pathologists and pharma companies is the main challenge we face in developing machine learning algorithms.
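To make the scale concrete, here is a minimal sketch of how a gigapixel whole slide image might be broken into tiles before any model sees it. It uses the open-source OpenSlide library as an illustration; the library choice, tile size, and file path are assumptions, not PathAI's actual pipeline.

```python
# Minimal sketch: tiling a gigapixel whole slide image (WSI) for inference.
# OpenSlide is one common open-source WSI reader; this is illustrative only.
import openslide

TILE = 1024  # assumed tile size in pixels

slide = openslide.OpenSlide("patient_slide.svs")  # hypothetical file path
width, height = slide.dimensions                  # can be ~100,000 x 100,000

for y in range(0, height, TILE):
    for x in range(0, width, TILE):
        # read_region returns an RGBA PIL image at full resolution (level 0)
        tile = slide.read_region((x, y), 0, (TILE, TILE)).convert("RGB")
        # ... run segmentation / cell-detection models on each tile ...
```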
Q: How do you currently deploy these models in production? Do you work with managed services like Hugging Face, AWS, etc., OR have you built your own ML infrastructure stack over Kubernetes, OR do you take another approach? And what are the pros and cons of the process?
💡 We began developing our infrastructure back in 2017, and our journey has been very interesting as we have evolved over time. When we started, we built mostly in-house infrastructure, and we keep separate in-house infrastructure for training and inference. For training, we have our own data center and compute servers that we use to train our models. For inference, we use our in-house infrastructure for the majority of use cases, but for some specific use cases, we use AWS to deploy these models.
One of the main challenges we face is the massive size of the images we work with. To support and optimize how we deploy our products on these images, we needed to build custom infrastructure, optimized separately for training and inference, because the whole slide images are massive. Our products are compositions of sub-models, and each sub-model is itself a deep learning model that is computationally heavy. We run these models in parallel and in sequence, segment the images into different regions, and then aggregate the results back to generate patient-level information. All of this requires significant workflow orchestration and inference optimization, utilizing GPUs and CPUs effectively to get those outputs at low cost for our internal batch processing or close-to-real-time processing.
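To illustrate the "composition of sub-models" idea, the sketch below runs two placeholder models per tile and aggregates their outputs into a slide-level summary. The model stand-ins, tensor shapes, and aggregation rule are assumptions for illustration, not PathAI's implementation.

```python
# Illustrative sketch: run several sub-models per tile, then aggregate
# tile-level outputs into a slide-level (and ultimately patient-level) summary.
import torch

tissue_model = torch.nn.Identity()  # stand-in for a tissue segmentation network
cell_model = torch.nn.Identity()    # stand-in for a cell classification network

def process_tile(tile_tensor):
    """Run the sub-models on one tile (GPU-friendly step)."""
    with torch.no_grad():
        tissue_mask = tissue_model(tile_tensor)   # e.g. cancer vs. normal regions
        cell_scores = cell_model(tile_tensor)     # e.g. per-cell class scores
    return {
        "cancer_area": float((tissue_mask > 0.5).float().mean()),
        "cell_score": float(cell_scores.mean()),
    }

def aggregate(tile_results):
    """Combine tile-level outputs into a slide-level summary (CPU step)."""
    n = max(len(tile_results), 1)
    return {
        "cancer_fraction": sum(r["cancer_area"] for r in tile_results) / n,
        "mean_cell_score": sum(r["cell_score"] for r in tile_results) / n,
    }

tiles = [torch.rand(1, 3, 1024, 1024) for _ in range(4)]  # placeholder tiles
slide_summary = aggregate([process_tile(t) for t in tiles])
```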
💡 About 80% of our training takes place in our data center, so for that we do not use AWS storage. However, some storage is still in AWS and is used for inference. For 80% of our use cases, our own data center and in-house infrastructure are sufficient; for the rest, we do use AWS.
Our workflow orchestration is built on top of Argo on Kubernetes and is designed to scale within our data center. The same infrastructure also applies to AWS, supporting different types of nodes and the ability to run different types of models on GPUs, CPUs, and so on.
Q: When deploying models on these services, do you deploy them in the native format OR convert them to ONNX/TensorRT? And, what has been your experience?
We have experimented with various methods to achieve optimized inference at low cost. In our case, however, understanding the use case is crucial. We work with pharma companies where real-time processing of an image is not a must-have requirement, so deploying in a native format is good enough to meet our needs. As a result, most of our deployment is in native format, and all our models are in PyTorch.
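For context, native-format deployment simply means loading the PyTorch model as-is and running it in eval mode, with no ONNX or TensorRT export step. Below is a minimal, hedged sketch of what that looks like; the architecture and checkpoint path are placeholders, not PathAI's actual models.

```python
# Minimal sketch of native-format PyTorch inference (no ONNX/TensorRT export).
# The model architecture and checkpoint path are illustrative assumptions.
import torch

model = torch.nn.Sequential(            # stand-in for the real architecture
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 2),
)
# model.load_state_dict(torch.load("model_checkpoint.pt"))  # hypothetical path
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

batch = torch.rand(16, 3, 1024, 1024, device=device)  # a batch of tiles
with torch.no_grad():
    logits = model(batch)  # accuracy matters more than latency here
```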
Q: How do you manage inference workload for different scenarios such as real-time, batch, and streaming? What are some of the challenges you face with each of these scenarios, and how do you address them?
I will explain two use cases: batch processing and close-to-real-time processing (the latter is not completely real-time). We use in-house infrastructure to optimize how we run different models on GPUs in our data center. We then aggregate the results from these models, which can be CPU-based tasks, and run them in CPU environments. In our workflow orchestration, we separate the computation across different nodes and then aggregate the results.

The aggregation varies depending on whether we are doing batch processing or close-to-real-time processing, for example when we deploy our models in a lab setting, or for clinical trials or diagnostic research. In these cases, a technician scans a glass slide through a scanner to create a massive image file, which they upload to our server or storage. We then run the model and compute results for the entire slide, and a pathologist later reviews the image alongside the model results and interacts with the model. Since the technician uploads the image one day before the pathologist reviews it, we have almost a full day to run the model and get the results. We deliberately design for that lag between upload and review, which reduces our dependency on real-time processing and lets us deploy models at low cost even when inference takes multiple hours.

The pathologist still gets a final stage to interact with the algorithm: they can exclude regions where the algorithm made mistakes, and the slide-level or patient-level results are recomputed interactively on CPU nodes with minimal computation. This is the area we optimize to get close-to-real-time results back to the pathologist while they're interacting with the algorithm.
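The key pattern is that the expensive GPU inference runs hours ahead of time, while the pathologist's interaction only triggers a cheap CPU-side re-aggregation of cached per-tile results. The sketch below shows one way that could look; the data layout and scoring rule are assumptions, not PathAI's actual design.

```python
# Sketch of the interactive step: per-tile model outputs are precomputed (GPU,
# hours before review); when the pathologist excludes regions, only a cheap
# CPU re-aggregation runs. Data layout and scoring are illustrative assumptions.

# Precomputed per-tile results, keyed by tile coordinates (from batch inference).
tile_results = {
    (0, 0): {"cancer_area": 0.12},
    (0, 1): {"cancer_area": 0.55},
    (1, 0): {"cancer_area": 0.03},
    (1, 1): {"cancer_area": 0.40},
}

def slide_score(results, excluded=frozenset()):
    """Re-aggregate the slide-level score, skipping tiles the pathologist excluded."""
    kept = [r["cancer_area"] for coord, r in results.items() if coord not in excluded]
    return sum(kept) / len(kept) if kept else 0.0

baseline = slide_score(tile_results)
# Pathologist flags tile (0, 1) as a mistake; recompute in milliseconds on CPU.
revised = slide_score(tile_results, excluded={(0, 1)})
```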
Q: Do you use something similar to TorchScript for PyTorch models or your own containers for deploying models that interact with workflow tools? How do you launch these model containers?
Most of our code is a custom wrapper built on top of PyTorch. Although we use TorchScript for some use cases, the majority of our wrapper is designed to support our containers and enable deployment and inference.
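For the cases where TorchScript is used, the idea is to trace or script the PyTorch model into a serialized artifact that a serving container can load without the original Python class definitions. A minimal sketch, with a placeholder model and file name, looks like this:

```python
# Minimal TorchScript sketch: trace a model once, then load the serialized
# artifact inside the serving container. The model and paths are placeholders.
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 4), torch.nn.ReLU())
model.eval()

example_input = torch.rand(1, 16)
scripted = torch.jit.trace(model, example_input)   # or torch.jit.script(model)
scripted.save("model_traced.pt")                   # hypothetical artifact name

# Inside the inference container:
loaded = torch.jit.load("model_traced.pt")
with torch.no_grad():
    output = loaded(torch.rand(1, 16))
```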
Q: Is the workflow orchestration tool you use developed in-house, or do you use an external solution?
We use Argo, but we have also created a wrapper on top of it to enable our ML engineers to easily interact with it. This wrapper allows for custom requirements and tasks, such as defining which images and models to use and which GPU or CPU to utilize. It also works for both AWS and our data centers, making it very flexible.
💡 By abstracting away the complexities, our ML engineers can simply request the setup requirements and tasks, and we can execute them independently on either AWS or our data center.
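To illustrate what such an abstraction might look like, here is a purely hypothetical sketch: an ML engineer describes a task declaratively and leaves the AWS-versus-data-center scheduling to the platform. The field names and `submit()` call are inventions for illustration; PathAI's real wrapper and the Argo API will differ.

```python
# Hypothetical sketch of a thin task wrapper over a workflow engine such as Argo.
# Names, fields, and submit() are illustrative assumptions, not a real API.
from dataclasses import dataclass, field

@dataclass
class InferenceTask:
    model_name: str
    image_uris: list
    resource: str = "gpu"          # "gpu" or "cpu" node pool
    target: str = "datacenter"     # "datacenter" or "aws"
    extra_args: dict = field(default_factory=dict)

def submit(task: InferenceTask) -> str:
    """Pretend submission: in reality this would render and submit a workflow."""
    print(f"Submitting {task.model_name} on {task.resource} nodes ({task.target})")
    return "workflow-id-123"  # placeholder identifier

job = InferenceTask(
    model_name="tissue_segmentation_v2",       # hypothetical model name
    image_uris=["s3://bucket/slide_001.svs"],  # hypothetical image location
    resource="gpu",
    target="aws",
)
workflow_id = submit(job)
```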
Q: Between latency and model accuracy, what is more important for you? And how do you achieve the desired goal?
As you know, we work in the healthcare space, specifically in cancer, so accuracy is the metric we prioritize most. To achieve this, we strive to understand our use case, user workflow, and business case so that we can reduce our latency needs. As explained earlier, pre-computing and caching results lets us deploy even heavy models to get results, even if they take longer to run, and adjust the user workflow so that we can use the most accurate models. While accuracy is the most important factor, there are research use cases where we try to balance it against other needs. Primarily, though, we focus on reducing our dependency on low latency, and we sometimes make trade-offs, for instance at the aggregation stage or when a pathologist is interacting with the algorithm. Beyond that, we have found ways to minimize our latency needs as much as possible.
Q: How do you manage and scale your infrastructure for inference as your usage grows?
Within PathAI, we have our own ML Infra & MLOps team. They manage the auto-scaling, and we use the same auto-scaling infrastructure for both AWS and our data center, adjusting to the load we receive. We do have some challenges, especially in our data center, in managing that load: we create queues and prioritize which tasks to run, which to slow down, and so on. Our ML infrastructure team maintains the auto-scaling for both our data center and AWS.
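As a rough illustration of the queueing and prioritization idea (a sketch under assumed priority levels and capacity, not their actual scheduler), jobs can be held in a priority queue and dispatched as workers become available:

```python
# Rough sketch of queueing and prioritizing inference jobs when capacity is
# limited (e.g. in the data center). Priorities and capacity are assumptions.
import heapq
import itertools

queue = []
counter = itertools.count()  # tie-breaker so equal priorities stay FIFO

def enqueue(job_name, priority):
    """Lower number = higher priority (assumed: clinical > research > backfill)."""
    heapq.heappush(queue, (priority, next(counter), job_name))

def dispatch(available_workers):
    """Pop as many jobs as there are free workers; the rest wait or slow down."""
    running = []
    while queue and len(running) < available_workers:
        _, _, job_name = heapq.heappop(queue)
        running.append(job_name)
    return running

enqueue("clinical_trial_slide_017", priority=0)
enqueue("research_batch_backfill", priority=2)
enqueue("diagnostic_slide_204", priority=0)
print(dispatch(available_workers=2))  # higher-priority jobs go first
```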
Q: Can you discuss any best practices or tips for other ML engineers looking to deploy models in production, based on your experience?
Based on my experience, when I started as an engineer, I was mainly focused on achieving perfection in terms of latency and accuracy, which was a challenge. However, I learned that understanding the business case and use case, along with gaining experience and feedback, can provide more opportunities for optimizing not just the ML inference code but the entire product.
💡Therefore, my advice would be to deeply understand the requirements, identify alternate ideas, and prioritize solving the right problems for the business. This will enable engineers to have meaningful discussions with business leaders and optimize the product overall.
By understanding the product use case, engineers can identify alternative ideas and optimize the product to meet the needs of both the ML system and the customers. Understanding the business use case and user workflow helps engineers build different ways of deploying models, allowing them to target their work toward solving the right problems for the business. That is what creates a significant impact.
Thanks for reading. 😄