Demystifying ML Deployment with Bradford from CloudSight
This week on Towards Scaling Inference, we chat with Bradford Folkens, CEO at CloudSight.
Brad is the co-founder and CEO of CloudSight, a seed-funded technology company that specializes in image captioning and understanding, delivering state-of-the-art solutions to people and companies around the world.
In this episode, he discusses how CloudSight's AI technology is used in a variety of industries. He shares how they manage model training and inference, and the techniques they use to prioritize model accuracy. He advises ML engineers to encourage experimentation and agility in AI development, and believes cloud-based infrastructure can help small teams move quickly and efficiently.
For the summary, let’s dive right in:
Q: What kind of business problems are you solving through AI?
Over the years, our technology has been applied to a wide variety of industries. Our focus on visual cognition and understanding has made our technology applicable to many different use cases.
In the retail industry, we've seen success with visual search and one-click sell functionality for classified marketplaces. This means that you can take a picture of something you want to sell and automated processes will take care of the posting details for you, eliminating the need to manually enter information.
💡 Our technology has also made a big impact on accessibility. For those who are visually impaired or blind, our technology can be used to help them see. Recent developments in the IoT space have also shown promise for our technology, particularly in the areas of home and business security, commercial security, surveillance, and other IoT products that rely on visual data.
We've also seen our technology applied in ad tech, social media, semiconductor applications, robotics, moving and storage, and inventory management. With such a wide range of applications, the potential uses for our technology are vast.
Q: How do you currently deploy these models in production? Do you work with managed services like Hugging Face or AWS, or have you built your own ML infrastructure stack on Kubernetes, or taken another approach? And what are the pros and cons of the process?
To be upfront, we've done a lot of things in-house. We've certainly tried a lot of third-party options for training, but we still prefer to do most of our training in-house because we have a lot of data to work with. One thing we found very useful, particularly in the earlier days of AI technology, was owning our own hardware for training. This allowed our data science team to run experiments without worrying about the cost of each one, which encouraged more experimentation and idea generation.
💡 However, keeping up with the rate of technological advancement was a challenge, and we eventually started moving things into the cloud. For training, we still use bare metal machines or VMs with GPUs and TPUs in the cloud. For inferencing, we've always used cloud-based services, although we do have our own Kubernetes cluster to manage machine scaling.
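To give a flavor of what that kind of Kubernetes-based machine scaling can look like (this is an illustrative sketch, not CloudSight's actual setup; the deployment and namespace names are hypothetical), here is a minimal example using the official Kubernetes Python client to resize a GPU inference deployment:

```python
# Minimal sketch: scaling a hypothetical GPU inference Deployment with the
# official Kubernetes Python client. "inference-workers" and "ml" are
# illustrative names, not CloudSight's actual resources.
from kubernetes import client, config


def scale_inference_workers(replicas: int,
                            deployment: str = "inference-workers",
                            namespace: str = "ml") -> None:
    """Patch the Deployment's replica count to the requested size."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )


if __name__ == "__main__":
    scale_inference_workers(4)  # e.g. scale up ahead of an expected traffic spike
```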
Q: When deploying models on these services, do you deploy them in the native format or convert them to ONNX/TensorRT? And what has been your experience?
We optimize the models we deploy in the cloud and on mobile platforms. In some cases, such as with robotics, equipment needs to be disconnected from the internet and run on low-power computers like the Jetson TX2 and other chipsets. We optimize these models to run with less memory for those low-power deployments. We have also had success deploying some of the neural networks we've built onto Apple's platform.
For cloud models, we use TensorFlow for training, but early versions caused challenges, particularly with the serialization layer between the core APIs and the Python layer, which led to slow training. Once we switched to PyTorch, many of these issues went away, and we are now mostly PyTorch-based for training, although some production systems still use TensorFlow.
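As a rough illustration of the native-vs-converted question above, a common first step for low-power targets like the Jetson TX2 is exporting a trained model to a portable format such as ONNX, which an on-device runtime (e.g. TensorRT) can then consume. A minimal sketch, assuming a PyTorch model; the ResNet backbone, input shape, and file name are placeholders:

```python
# Minimal sketch: exporting a PyTorch model to ONNX for low-power/edge runtimes.
# The ResNet-18 backbone and file name are illustrative placeholders only.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # example input used to trace the graph

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}},  # allow variable batch size at inference time
    opset_version=13,
)
```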
Q: How do you manage inference workload for different scenarios such as real-time, batch, and streaming? What are some of the challenges you face with each of these scenarios, and how do you address them?
During the inferencing phase, we optimize in batches as much as possible, filling the GPU with as many images as we can. However, this is not always feasible, so we balance response time against the number of images coming in. We don't want to waste a whole batch on a single image, so if we have images waiting in the queue, we'll run them through the GPU all at once. Our infrastructure is sophisticated enough to handle these different use cases.

That infrastructure has evolved over the past few years. Initially, there were no good industry solutions for turn-key cloud infrastructure, so we built most of it internally. Nowadays, more companies are looking into this, but at the time, we had to build it ourselves.
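As a rough sketch of that queue-plus-batching trade-off (illustrative only; the `run_model` function and the batch-size and wait limits are hypothetical tuning knobs), a worker might collect requests until either the batch is full or a deadline passes:

```python
# Minimal sketch: pull images off a queue and batch them for one GPU pass,
# trading response time against batch size. `run_model` and the limits below
# are hypothetical, not CloudSight's actual values.
import queue
import time

MAX_BATCH = 32           # upper bound on what fits comfortably in GPU memory
MAX_WAIT_SECONDS = 0.05  # don't hold a request longer than this waiting for more


def batch_worker(requests: queue.Queue, run_model) -> None:
    while True:
        batch = [requests.get()]  # block until at least one image arrives
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_model(batch)  # one GPU pass for the whole batch
```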
Q: Which model serving runtime are you using? For example, TorchScript, Triton, or TF Serving, or have you written custom scripts? And what has been your experience?
We have developed many custom containers to serve up inference. Each microservice runs in the background and pulls images off a queue to process them. Building all of this custom tooling over the years has resulted in a sophisticated system.

Part of the reason for this sophistication is that, early in the company's history, we were building up a really large training set through the idea of real-time hybrid intelligence. If the system received an image and the computer vision didn't provide a good enough response, we would send it off to a human to respond in real time, correcting or providing an answer. We would store this in a training set and retrain the computer vision on a semi-real-time basis. This required a lot of internal plumbing.

The containers themselves are part of a containerized setup on the infrastructure through Kubernetes, which makes it very easy to scale up and down. On the flip side, running everything through a queueing channel makes it easy for us to add new types of neural networks to the system without modifying much on the API side or in the other services in the microservices framework.
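A stripped-down illustration of that queue-driven worker pattern, assuming a Redis-backed queue per model (the queue names, job fields, and `model.predict` interface are hypothetical): each new network gets its own worker and queue, so nothing on the API side has to change.

```python
# Minimal sketch: a containerized worker that pulls jobs from a per-model queue
# and pushes results back, so new networks can be added by starting a new
# worker on a new queue. Queue names and the model interface are hypothetical.
import json
import redis

r = redis.Redis(host="redis", port=6379)


def serve_forever(queue_name: str, model) -> None:
    while True:
        _, raw = r.blpop(queue_name)  # block until a job arrives on this model's queue
        job = json.loads(raw)
        caption = model.predict(job["image_url"])  # hypothetical model interface
        r.rpush(job["reply_queue"],
                json.dumps({"id": job["id"], "caption": caption}))
```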
Q: Between latency & model accuracy what is more important for you? And how do you achieve the desired goal?
That's a great question. For the API, our focus is on accuracy, and we have plenty of memory to use on our cloud-based infrastructure. However, for customers who want neural network models adapted for on-device deployments, we work with them to determine their priorities. We can adjust the vocabulary of the language model, prune the parameters, or make other modifications to fine-tune it for their platform and demands. If they prioritize speed over accuracy, we can make those adjustments.
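For instance, one generic way to trade accuracy for footprint in a PyTorch model is magnitude pruning. Here is a minimal sketch using `torch.nn.utils.prune`; the 30% ratio and the choice of layers are purely illustrative knobs, not CloudSight's actual recipe:

```python
# Minimal sketch: pruning a PyTorch model's linear/conv weights by magnitude
# to shrink it for on-device use. The 30% amount is an illustrative trade-off.
import torch.nn as nn
import torch.nn.utils.prune as prune


def prune_model(model: nn.Module, amount: float = 0.3) -> nn.Module:
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the pruning mask into the weights
    return model
```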
Q: How do you manage and scale your infrastructure for inference as your usage grows?
For a long time now, we have prioritized instrumentation. We have instrumented many different metrics throughout our infrastructure, and our monitoring framework watches them across all of our microservices. If we notice a particular increase in demand, we can scale up the number of instances running a particular model. If our data science team deploys a neural network model that shows increased latency or accuracy different from what we're accustomed to, we'll flag it and either pull it or roll back the deployment.
We have invested heavily in this instrumentation and plugged it into our cloud providers' default handling for the types of issues that can arise in a broad deployment with many interdependencies and side effects.
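As a toy illustration of what that kind of instrumentation can look like in code (not CloudSight's actual stack; the metric names and `model.predict` interface are hypothetical), a worker might export latency and error metrics with the Prometheus Python client so a monitoring framework can decide when to scale up or roll back:

```python
# Minimal sketch: instrumenting an inference worker with the Prometheus client.
# Metric names and the model interface are hypothetical.
import time
from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_LATENCY = Histogram("inference_latency_seconds", "Time per inference call")
INFERENCE_ERRORS = Counter("inference_errors_total", "Failed inference calls")


def instrumented_predict(model, image):
    start = time.monotonic()
    try:
        return model.predict(image)  # hypothetical model interface
    except Exception:
        INFERENCE_ERRORS.inc()
        raise
    finally:
        INFERENCE_LATENCY.observe(time.monotonic() - start)


start_http_server(8000)  # expose /metrics for the monitoring framework to scrape
```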
Q: Can you discuss any best practices or tips for other ML engineers looking to deploy models in production, based on your experience?
💡 At CloudSight, we value experimentation. We've been excited about it since the beginning. This applies not only to AI companies developing different types of neural networks and technologies but also to companies on the consuming side of AI. We encourage our clients and other companies to experiment. We've seen many companies rule out certain types of neural networks or other AI-based solutions that would have been a great fit but didn't fit into the product in the way they anticipated.
We've talked to many of them over the years, and we've noticed that companies that are more agile and willing to experiment tend to innovate quickly and overtake those that are not as free to imagine. Companies that are stuck in their market tend to discount new ideas before even trying them out. By playing around with different types of technology and being open to new ideas, you may find a combination of UI/UX or a way of working with product-market fit that could lead to tremendous success. So, we encourage everyone to experiment and see what works.
Q: Looking back, if a platform similar to AWS Lambda had been available back then, would you still have built the deployment yourself, or would you have used such a service to make your cloud inference serverless?
I believe that as long as it's flexible enough, serverless infrastructure can be a valuable tool for engineers and developers. However, it's important to have "escape hatches", or ways of customizing the infrastructure when necessary. Without these, it can be frustrating to feel limited in what you can do, which can lead to wasted time and effort.

For example, my team often uses cloud-based infrastructure options like hosted MySQL or Postgres databases and object storage. We even use Kubernetes to avoid running things on bare-metal VMs when possible. This has helped us achieve a lot with a small team and high capital efficiency for our investors. However, we're always mindful of the pricing model when it comes to metered usage, especially since we deal with a large amount of training data.

Overall, cloud-based infrastructure can be a great tool for small teams to move quickly and efficiently. It's important to weigh the benefits against the potential costs and ensure that there are ways to customize the infrastructure when necessary.
Thanks for reading. 😄