Demystifying ML Deployment with Raj Katakam from Credit Karma
This week on Towards Scaling Inference, we chat with Raj Katakam, Senior Machine Learning Engineer at Credit Karma.
Raj is an experienced engineer with an MS in computer science. He currently serves as a senior machine learning engineer at Credit Karma, where he works on recommendation systems, risk modeling, and user engagement.
In this episode, Raj discusses how AI is used to solve business problems such as fraud detection and product recommendations. He explains how Credit Karma deploys models on its own infrastructure built on top of the Google Cloud Platform, and he addresses monitoring and observability challenges. As a best practice for ML engineers, he recommends refreshing models as soon as possible and using tools that automate model refreshes without human intervention.
For the summary, let’s dive right in:
Q: What business problems are you solving through AI?
We face several business problems, recommendations being a central one: we want to offer the best financial products to our Credit Karma members and help them progress to the next level of financial success. We also work on fraud detection, where we aim to reduce fraud while providing a better user experience and minimizing friction. These problems mainly involve risk modeling, recommendations, and fraud prevention.
Q: How do you currently deploy these models in production? Do you work with managed services like Hugging Face, AWS, etc., or have you built your own ML infrastructure from scratch on Kubernetes or another approach? And what are the pros and cons of the approach you have taken?
For deployment and inference, we have built our own infrastructure layer on top of the Google Cloud Platform. While this reduces dependency on a single cloud provider, it also means that we maintain the entire inference engine ourselves. Our inference needs have evolved over time, with changes in our model families and cross-platform interactions: we have moved from Spark to TensorFlow, and we now also serve a mix of PMML models.
💡 While this approach performs better for our use cases, it comes with the overhead of building our own systems for monitoring, testability, and end-to-end management. This requires investment and cannot be done overnight.
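To make this more concrete, here is a minimal sketch of what a format-agnostic scoring layer over several model families might look like, covering TensorFlow SavedModels and PMML. The class names, file paths, and library choices (`tensorflow`, `pypmml`) are illustrative assumptions, not Credit Karma's actual code.

```python
from abc import ABC, abstractmethod

import tensorflow as tf                  # assumed dependency, illustrative only
from pypmml import Model as PMMLModel    # assumed dependency, illustrative only


class Scorer(ABC):
    """Common interface so the serving layer stays agnostic of model family."""

    @abstractmethod
    def predict(self, features: dict) -> dict:
        ...


class TensorFlowScorer(Scorer):
    def __init__(self, export_dir: str):
        # Load a SavedModel and grab its default serving signature.
        self._fn = tf.saved_model.load(export_dir).signatures["serving_default"]

    def predict(self, features: dict) -> dict:
        inputs = {k: tf.constant([v]) for k, v in features.items()}
        outputs = self._fn(**inputs)
        return {k: v.numpy().tolist() for k, v in outputs.items()}


class PMMLScorer(Scorer):
    def __init__(self, pmml_path: str):
        self._model = PMMLModel.load(pmml_path)

    def predict(self, features: dict) -> dict:
        return self._model.predict(features)


def load_scorer(kind: str, path: str) -> Scorer:
    # One entry point per model family; new families plug in behind the same API.
    return TensorFlowScorer(path) if kind == "tensorflow" else PMMLScorer(path)
```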
Q: How do you manage inference workloads for different scenarios such as real-time, batch, and streaming? What are some of the challenges you face when you deploy them on GCP?
Regarding infrastructure, we cover all three use cases. For real-time inference, we manage the infrastructure ourselves; for batch inference, we rely on GCP, primarily using Dataflow as the scoring engine. We also have our own streaming pieces, and on the feature side we run a hybrid ecosystem: Google's Feature Store for some low-volume workloads and our internal feature platform, which handles both batch and streaming features, for the primary workload. We rely heavily on CPUs, which are more cost-effective than GPUs, though we are evaluating GPUs for some use cases to see whether they can reduce the workload.
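As a rough illustration of batch scoring on Dataflow, here is a minimal Apache Beam pipeline sketch that reads JSON feature rows from Cloud Storage, scores them with a TensorFlow SavedModel, and writes the results back out. The project ID, bucket names, and model path are placeholders; this is an assumption about the general shape, not the team's actual job.

```python
import json

import apache_beam as beam
import tensorflow as tf  # assumes TensorFlow is installed on the workers
from apache_beam.options.pipeline_options import PipelineOptions


class ScoreFn(beam.DoFn):
    """Loads a SavedModel once per worker and scores JSON feature rows."""

    def __init__(self, model_path: str):
        self._model_path = model_path
        self._predict = None

    def setup(self):
        # setup() runs once per worker, so the model is not reloaded per element.
        self._predict = tf.saved_model.load(self._model_path).signatures["serving_default"]

    def process(self, element: str):
        features = json.loads(element)
        inputs = {k: tf.constant([v]) for k, v in features.items()}
        outputs = self._predict(**inputs)
        yield json.dumps({k: v.numpy().tolist() for k, v in outputs.items()})


def run():
    options = PipelineOptions(
        runner="DataflowRunner",       # use "DirectRunner" to test locally
        project="my-gcp-project",      # placeholder project and bucket names
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFeatures" >> beam.io.ReadFromText("gs://my-bucket/features/*.json")
            | "Score" >> beam.ParDo(ScoreFn("gs://my-bucket/models/current"))
            | "WriteScores" >> beam.io.WriteToText("gs://my-bucket/scores/part")
        )


if __name__ == "__main__":
    run()
```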
Q: Do you have any plans to optimize costs, or how do you approach infrastructure costs?
Well, as I mentioned, there is a maintenance overhead that comes with it. Teams need to be on call and maintain the entire system to ensure that nothing goes down, which brings a lot of challenges in terms of testability and monitoring. For us, model monitoring and ad hoc debugging are important pieces that come bundled with managed inference providers, such as the cloud providers, who offer an umbrella of tools to help with debugging. We are looking at those options as well; irrespective of cost, they should still provide a lot of value in other areas.
Q: How do you approach customization for production workloads?
Customization is a significant aspect, especially when dealing with multiple model families. We rely heavily on PMML, while most cloud providers currently offer first-class support for TensorFlow or PyTorch.
As a result, we need to build our own serving container, which can be time-consuming and involves some trial and error. However, it is necessary to get the best results for our specific needs.
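As an illustration of what such a custom serving container might contain, here is a minimal sketch: a small HTTP wrapper around a PMML evaluator (using `pypmml` and Flask purely as examples) that a container platform can host behind a generic predict endpoint. The route names and model path are placeholders.

```python
from flask import Flask, jsonify, request
from pypmml import Model  # assumed PMML evaluator, illustrative only

app = Flask(__name__)
model = Model.load("model.pmml")  # placeholder path baked into the image


@app.route("/predict", methods=["POST"])
def predict():
    # Expects a flat JSON object of feature name -> value.
    features = request.get_json(force=True)
    return jsonify(model.predict(features))


@app.route("/healthz", methods=["GET"])
def healthz():
    # Liveness endpoint so the orchestrator can restart unhealthy replicas.
    return "ok", 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

In practice, a sketch like this would be packaged into a container image and pushed to the platform's registry so it can be scaled like any other service.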
Q: Do you do quantization or conversion to ONNX or other formats, or do you deploy your models in their native format wherever they are trained?
We use native TensorFlow deployment for TensorFlow models, including deep neural network models, and we likewise serve PMML models in their native format. While ONNX is something we may consider in the future, we have not yet explored its potential benefits.
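For readers curious what the ONNX route would roughly involve, here is a small hedged sketch of inference through ONNX Runtime after a model has been converted (for example with the `tf2onnx` command-line converter). The model path and feature values are placeholders, not anything from Credit Karma's stack.

```python
import numpy as np
import onnxruntime as ort  # assumed dependency, illustrative only

# Load a model that has already been converted to ONNX format.
session = ort.InferenceSession("model.onnx")  # placeholder path
input_name = session.get_inputs()[0].name

# A single row of numeric features; a real model dictates the exact schema.
features = np.array([[0.42, 1.0, 3.5]], dtype=np.float32)
scores = session.run(None, {input_name: features})
print(scores)
```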
Q: Was it a matter of accuracy or a trade-off that you didn't go with ONNX, given that you were already familiar with it?
Our systems rely heavily on Scala; our legacy is in the Java and Scala world, and we inherited a number of components from that environment. Moving to ONNX would mean changing the models and reimplementing all of the associated logic and customization alongside model evaluation. Because of this, trying out alternatives would entail significant overhead, requiring us to cater specifically to certain data or situations.
Q: Do you think the current setup of managed services like Google and your own solution is sufficient? Or do you believe that as it scales, you would seek additional solutions to help you achieve a more managed or serverless environment?
It depends on the solution. The solutions we have seen are similar to what cloud service providers offer. I haven't seen a cross-platform solution where you can put containers and content in one place and easily spin up TensorFlow, PMML, or other clusters, and have them interact with each other. This is important because not all model families can or should use the same modeling ecosystem.
💡 In this case, you need a way to serve one type of model in one place, add customization on top, and have all of these clusters interact with each other. I haven't seen a solution like this yet: cloud service providers focus on one or a few model types and host them in multiple places, but there is no single solution that does it all at once.
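A thin routing layer is one way to picture the kind of cross-cluster interaction described above: a registry maps each model to its family and serving endpoint, and requests are dispatched accordingly. The endpoints and registry entries below are hypothetical, and the PMML endpoint assumes a custom container like the one sketched earlier.

```python
import requests  # assumed dependency, illustrative only

# Model name -> (family, endpoint). In practice this registry would live in a
# config service or database; the entries here are hypothetical.
MODEL_REGISTRY = {
    "recommendation_ranker": (
        "tensorflow",
        "http://tf-serving.internal:8501/v1/models/ranker:predict",
    ),
    "fraud_score_v3": ("pmml", "http://pmml-serving.internal:8080/predict"),
}


def score(model_name: str, features: dict) -> dict:
    family, endpoint = MODEL_REGISTRY[model_name]
    if family == "tensorflow":
        # TensorFlow Serving's REST API expects an "instances" list.
        payload = {"instances": [features]}
    else:
        # The custom PMML container sketched earlier takes a flat feature map.
        payload = features
    response = requests.post(endpoint, json=payload, timeout=1.0)
    response.raise_for_status()
    return response.json()
```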
Q: Could you discuss one challenge your team faced regarding monitoring and observability, and how did you address it?
When working with your own clusters and mechanisms, you need to implement your own logging to capture all inferences. Additionally, you need to monitor the logs for any issues, such as drift or incorrect predictions, and take appropriate action.
💡 If you're using cloud providers, these tools come pre-built with features like drift detection, prediction reports, and feature correlation analysis. You can also integrate your own on-call systems, such as PagerDuty, for end-to-end monitoring. By leveraging these tools, you can catch and address problems faster, avoiding potential losses.
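For teams building this themselves, a common home-grown drift check is the population stability index (PSI) between a reference score distribution (logged at training time) and a recent window of logged predictions. The sketch below uses synthetic data and an illustrative alert threshold, not anything specific to Credit Karma.

```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population stability index between two score distributions."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions, flooring at a tiny value to avoid log(0).
    ref_pct = np.clip(ref_counts / max(ref_counts.sum(), 1), 1e-6, None)
    cur_pct = np.clip(cur_counts / max(cur_counts.sum(), 1), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


# Synthetic example: scores logged at training time vs. scores from this week.
reference_scores = np.random.beta(2, 5, size=10_000)
current_scores = np.random.beta(2.5, 5, size=10_000)

drift = psi(reference_scores, current_scores)
if drift > 0.2:  # a commonly cited rule of thumb, not a universal threshold
    print(f"PSI={drift:.3f}: significant drift, page the on-call")
else:
    print(f"PSI={drift:.3f}: score distribution looks stable")
```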
Q: Can you discuss any best practices or tips for other ML engineers who are looking to deploy their models in production, based on your experience?
Based on my experience, I recommend refreshing models as soon as possible and utilizing tools that can automate model refreshes without human intervention. Seasonality is important, and stale models are not useful. Ideally, once a model is deployed, data scientists or anyone else should not need to see it again. This presents a challenge, but it should be the vision for any engineer to target.
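As a closing illustration of that vision, here is a hedged sketch of an automated refresh decision: a scheduled job compares the deployed model's age and drift against thresholds and, if either is exceeded, kicks off retraining through whatever orchestrator a team already uses. The `trigger_training_pipeline` helper and the thresholds are hypothetical placeholders.

```python
from datetime import datetime, timedelta

MAX_MODEL_AGE = timedelta(days=14)  # illustrative refresh cadence
PSI_THRESHOLD = 0.2                 # illustrative drift cutoff


def trigger_training_pipeline(model_name: str) -> None:
    # Stand-in for a call to your orchestrator (an Airflow DAG run,
    # a Vertex AI pipeline launch, an in-house scheduler, etc.).
    print(f"Launching retraining pipeline for {model_name}")


def maybe_refresh(model_name: str, deployed_at: datetime, drift_psi: float) -> None:
    too_old = datetime.utcnow() - deployed_at > MAX_MODEL_AGE
    drifted = drift_psi > PSI_THRESHOLD
    if too_old or drifted:
        # Retrain and, on success, promote the new model behind the same endpoint.
        trigger_training_pipeline(model_name)


if __name__ == "__main__":
    maybe_refresh("fraud_score_v3", deployed_at=datetime(2024, 1, 1), drift_psi=0.05)
```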
Thanks for reading. 😄