Machine Learning Deployment Made Simple with Denys from Voiceflow
This week on Towards Scaling Inference, we chat with Denys Linkov, Machine Learning Lead at Voiceflow
Denys currently serves as the Machine Learning Team Lead at Voiceflow, a company that aims to make creating voice apps easy, intuitive, and collaborative. His team focuses on natural language understanding (NLU), generative natural language processing (NLP), MLOps, and conversational data discovery and augmentation. In addition to his work at Voiceflow, Denys is also an instructor at LinkedIn Learning, where he teaches courses on GPT and large language models.
In this episode, he discusses Voiceflow's AI use cases and how the company uses both in-house and third-party AI models, walks through training and deployment challenges, and advises validating the business case before investing in AI.
Prefer watching the interview?
For the summary, let’s dive right in:
Q: Could you provide some specific use cases of how Voiceflow AI technology has helped your customers?
What's interesting about Voiceflow is that we have over 100,000 customers, ranging from hobbyist students to some of the biggest companies in the world. Currently, over 60 enterprises use our technology across a variety of sectors such as retail, banking, automotive, utilities, and government.
On our website, we showcase several interesting use cases, but I would like to highlight BMW, one of the largest automotive companies in the world, which used Voiceflow to prototype its in-car voice assistant experience.
Q: Is Voiceflow’s AI-powered software built on top of third-party APIs, or is it a custom solution developed in-house?
When we first started, most of our functionality relied on third-party tools. More recently, however, we have been developing more in-house tools. Currently, we have over four production models, depending on how you count them, ranging from data analysis to classification models. As we have brought more tools in-house, we have been able to create more business value. Interestingly, we have sometimes shifted away from in-house models to third-party models based on business decisions. For example, we had a generative model that we built ourselves, but as we integrated more GPT and GPT-4 features into our platform, it made more sense to use a third-party model based on considerations such as performance, cost, and multilingual functionality. So we have built our own models, used third-party models, and even built models from scratch for unique business cases we have encountered.
Q: Could you walk us through the process of training an AI model for your use cases, from data collection to deployment?
I believe it all starts with the business case, right? Why are we building our own models? What are the benefits? Sometimes, when we use a third-party model for our classification use cases, we encounter challenges operationalizing it. For example, because we're a platform with 100,000 customers, many third-party APIs break down at that scale. Based on our operational requirements and what we knew about the space, we decided to build our own production model rather than stick with an exploratory one. We conducted an R&D sprint for a month and then spent time productionizing the model for English, which covers most of our customer base, as well as for other languages.
💡 This journey was fascinating because we started with an exploratory project, but then we started thinking about the business case. We wanted lower operational costs, faster model training time, and higher accuracy than third-party models, and it all aligned. What was interesting, however, was that as we built different models, we had different requirements for the platform, which I can discuss later. At certain points, we had to adjust the platform because different models had different latency requirements.
We found the business case, validated the technology, conducted a proof of concept, integrated it into the product in a beta, and now it's in production. We're monitoring its technical metrics and customer feedback metrics.
Q: During training, did you use a specific framework or did you start with a base model and then build upon it to create a custom model? In other words, what was your process for creating a custom model?
Some of our models are unsupervised, so we use embedding-based models for those. For our custom models, we use many transformer models from Hugging Face. We standardized on PyTorch, but we built our custom ML serving framework to be framework-agnostic. We created a common set of APIs, so you can use any framework.
💡 Our current process involves training with PyTorch and deploying the compiled model. However, we could switch to TensorFlow or another framework if desired. Initially, we were unsure of the limitations we would face with each framework, so we chose PyTorch as a starting point. We did not delve too deeply into the ecosystem as things change rapidly in this field.
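The common API itself isn't spelled out in the interview, but a minimal sketch of what a framework-agnostic serving contract along those lines could look like in Python is shown below; the class and method names are hypothetical, not Voiceflow's actual framework.

```python
from abc import ABC, abstractmethod
from typing import Any


class ModelRunner(ABC):
    """Hypothetical common serving contract: the platform only ever calls
    load() and predict(), so each framework just supplies its own runner."""

    @abstractmethod
    def load(self, artifact_path: str) -> None:
        ...

    @abstractmethod
    def predict(self, inputs: list[dict[str, Any]]) -> list[dict[str, Any]]:
        ...


class TorchRunner(ModelRunner):
    """PyTorch-backed implementation that loads a compiled TorchScript artifact."""

    def load(self, artifact_path: str) -> None:
        import torch

        self.model = torch.jit.load(artifact_path)
        self.model.eval()

    def predict(self, inputs: list[dict[str, Any]]) -> list[dict[str, Any]]:
        import torch

        with torch.no_grad():
            batch = torch.tensor([x["features"] for x in inputs], dtype=torch.float32)
            scores = self.model(batch)
        return [{"scores": row.tolist()} for row in scores]
```

With a contract like this, switching from PyTorch to TensorFlow means writing a new runner, not changing the platform.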
Q: When deploying models on these services, do you deploy them in the native format or convert them to ONNX/TensorRT?
We initially experimented with some optimizations for our embeddings but ultimately decided not to pursue them. Although the converted models performed well with batching, they did not necessarily improve latency. For example, one of our models had a 20-millisecond latency at P99, which dropped by only a couple of milliseconds when we converted it to ONNX. That did not meaningfully improve our business results, and it made maintenance more difficult. Given that network latencies are often much higher, we decided not to pursue this optimization further for operational reasons.
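For readers who want to reproduce that kind of comparison, here is a rough sketch of exporting a PyTorch model to ONNX and measuring P99 latency under both runtimes; the model, input shape, and run count are placeholders, not Voiceflow's actual benchmark.

```python
import time

import numpy as np
import onnxruntime as ort
import torch

model = torch.nn.Linear(256, 64).eval()  # placeholder model
example = torch.randn(1, 256)            # placeholder input shape

# Export the PyTorch model to ONNX.
torch.onnx.export(model, example, "model.onnx",
                  input_names=["input"], output_names=["output"])
session = ort.InferenceSession("model.onnx")


def p99_latency_ms(fn, runs=1000):
    """Return the 99th-percentile latency of fn in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return float(np.percentile(samples, 99))


with torch.no_grad():
    torch_p99 = p99_latency_ms(lambda: model(example))
onnx_p99 = p99_latency_ms(lambda: session.run(None, {"input": example.numpy()}))
print(f"PyTorch P99: {torch_p99:.2f} ms, ONNX Runtime P99: {onnx_p99:.2f} ms")
```

As Denys notes, a few milliseconds saved at the model level rarely matters once network round-trip time dominates.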
Q: What metrics do you use to evaluate the performance of your AI models, and do you use these metrics to optimize and refine the models?
During our initial tests, we realized that selecting the appropriate base model was crucial. We scoured Hugging Face's library of models, considering their different sizes, and ultimately decided on a model with roughly 100 million parameters. Although there were larger models that theoretically should have performed better, we found that not to be the case and are still investigating why. We also found that fine-tuning the larger models didn't take full advantage of scaling laws: given the dataset we had, the larger model didn't provide much benefit.
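As a rough illustration of that kind of base-model selection, loading a Hugging Face checkpoint in the ~100M-parameter range for classification fine-tuning looks like this; the checkpoint and label count below are stand-ins, since the actual model isn't named in the interview.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in checkpoint around 100M parameters; the model Denys refers to isn't named.
checkpoint = "bert-base-uncased"  # roughly 110M parameters

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels is a placeholder for the number of classes in the task.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=20)

# Sanity-check the parameter count before committing to fine-tuning.
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```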
💡 Later, we discovered that adding more data significantly improved our model's performance. Tripling the size of our training dataset resulted in a substantial improvement, particularly for the underrepresented classes. To measure this, we relied on the F1 score for our classification model, which ensured that the model wasn't just memorizing the majority classes and ignoring the underrepresented ones.
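A quick scikit-learn sketch with toy data (not Voiceflow's) shows why a macro-averaged F1 score, which is one reasonable way to do what Denys describes, catches majority-class memorization where plain accuracy does not.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: class 0 dominates, and the model predicts it every time.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))             # 0.8, looks fine
print(f1_score(y_true, y_pred, average="macro"))  # ~0.30, exposes the ignored classes
```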
Q: What challenges did you face during the development and deployment process of the AI models, and how did you address them?
When I first joined, my main priority was understanding the business case. After that, I made sure the models I wanted to use were appropriate for the cases. About a month after joining, we started building the platform. Although it seems a bit strange, I knew from my previous role that deploying to production can be painful. Since there wasn't any business pressure to deploy models yet, I decided to set up the platform first.
💡 The primary focus of the platform was to make it easy for any technical user to deploy a model; it assumed that you did your model training or experimentation somewhere else. We built the framework mainly using Python, Kubernetes, and Terraform. With it, you can simply run "ml create model", give it a name and some parameters, and it creates everything for you. All you need to do is insert your code into the specially marked areas and then run "ml model deploy"; the platform takes care of the rest.
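Voiceflow's internal CLI isn't public, so the following is only a hypothetical sketch of what a scaffolding command in that spirit might look like, written with click; every name, option, and template here is made up for illustration.

```python
import pathlib

import click

# Hypothetical stub that the CLI generates; the user fills in the marked section.
TEMPLATE = '''"""Generated stub for {name}. Insert your inference code below."""

def predict(payload: dict) -> dict:
    # --- your model code goes here ---
    raise NotImplementedError
'''


@click.group()
def ml():
    """Hypothetical 'ml' deployment CLI."""


@ml.command("create-model")
@click.argument("name")
@click.option("--cpu", default="500m", help="CPU request for the Kubernetes deployment.")
def create_model(name, cpu):
    """Scaffold a model service: a source stub plus a deployment config."""
    path = pathlib.Path(name)
    path.mkdir(exist_ok=True)
    (path / "handler.py").write_text(TEMPLATE.format(name=name))
    (path / "deploy.yaml").write_text(f"name: {name}\ncpu: {cpu}\n")
    click.echo(f"Created model skeleton in ./{name}")


if __name__ == "__main__":
    ml()
```

A real version would also render the Kubernetes and Terraform resources Denys mentions and wire up deployment, which is where most of the value of such a platform lives.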
However, about six to twelve months after building the platform, we realized that one of the new models we wanted to build would pose some challenges. We hadn't designed the platform to serve this new model, which required latencies of less than 150 milliseconds at P99, including network time and the transport layer. While our framework did pretty well at P50, it couldn't meet that P99 target, so we had to rebuild the transport layer specifically for the new model, which took quite a bit of time. It was an interesting experience, but it taught me that you can't always predict what models you'll need to build in the future.
Q: What benefits have your customers seen since implementing the AI-powered solution, and how have they measured the ROI?
Our platform is designed to help users design, build, and deploy conversational assistants, and there are definitely ways to measure its effectiveness. For example, one of our customers, Home Depot, significantly improved their user testing by using our platform. As a result, they were able to test their products more quickly, which improved their UI/UX experience. Our platform also enabled Cosworth, an Australian retailer, to achieve significant performance gains and faster deployments.
Recently, we've seen more and more customers building generative AI assistants using our platform. To make it easier for our enterprise customers to experiment with these generative models, we launched a product called AI Playground. This product allows users without a background in development or data science to test out generative models and block chaining in a simplified UI.
There are many other areas to our product, but these are some of the highlights.
Q: What advice do you have for organizations that are considering incorporating AI into their processes, and what are some common pitfalls they should avoid?
I believe that the most important thing is to understand your use case and the business value it brings. Currently, there is a hype wave around large language models and generative AI. However, it is not always clear what the ROI is, despite the biggest players in the space constantly mentioning generative AI and its impact. Therefore, it is crucial to validate the business case, determine how important generative AI is to your product, and estimate its costs.
💡 If you are going to spend millions of dollars training your own model, you need to ask yourself whether it's worth it, because that kind of funding may not always be available to your company. You may be better off using third-party models or standard NLP models, which are much faster, and specialized models still perform better on most tasks. It's essential to validate the business case rather than being swept up in a technology wave.
Q: Finally, What are your top three Generative AI companies or companies who have successfully incorporated AI in their workflows and why do you like them?
I believe there is a lot of fascinating work being done, especially by the big tech companies. They have shown that they can deliver both platforms and end-user products effectively. For instance, Microsoft Office and the Google Office suite are incredibly powerful and widely used. Facebook has also made some impressive strides in research, such as their work on LLaMA and other models. There are so many interesting research projects and products being developed. Recently, one of the most exciting developments was the MosaicML framework for training large language models in a relatively hands-off manner. These companies are doing some amazing work!
From a personal perspective, I find transcription to be an incredibly useful tool. I use the Whisper model to transcribe my spoken words when I'm writing blog posts or documents. I prefer to talk through my ideas, record them, transcribe them, and then copy and paste the text into the document. This has been my most impactful personal use case.
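For anyone who wants to try the same workflow, the open-source whisper package gets you there in a few lines; the checkpoint size and filename below are arbitrary choices, not necessarily what Denys uses.

```python
import whisper

# Load one of the open-source Whisper checkpoints ("base" is small and fast).
model = whisper.load_model("base")

# Transcribe a recorded voice memo; the filename is a placeholder.
result = model.transcribe("voice_memo.m4a")
print(result["text"])  # paste the transcript into the blog post or document
```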
Thanks for reading. 😄