AI for Identifying Malicious Traffic, Managing Hundreds of Models, Techniques to prioritise Model Accuracy & More
This week on Towards Scaling Inference, we chat with Dan Shiebler, Head of Machine Learning at Abnormal Security
Dan currently serves as the Head of Machine Learning at Abnormal Security, where he leads a team of 40+ detection and ML engineers. Together, they have developed data processing and machine learning layers that detect cyber attacks using a variety of techniques, including aggregation systems, traditional machine learning approaches, and LLMs. Prior to this role, he worked at Twitter.
In this episode, he discusses how AI is used to identify malicious traffic and compromised accounts in order to prevent cyber attacks. He explains how models are deployed internally using a federated model architecture to ensure data security, and how Kubernetes is used to monitor utilization and Spot Instances are used for scaling. He also advises other ML engineers to deploy models often in order to better understand deployment challenges.
For the summary, let’s dive right in:
Q: What kind of business problems are you solving through AI?
At Abnormal Security, our core goal is to stop cyber attacks by identifying the difference between regular traffic in a customer's environment and malicious traffic. We focus on identifying compromised and taken-over accounts, as well as the ingress points of incoming messages from email and other services.
The core AI problem we face is a classification problem. Given some events, such as a message or an action taken by a user, we must identify whether the sender of the message or the current operator of the account is an attacker attempting something malicious. This is a traditional problem that is well-suited for machine learning due to its classification nature and black-and-white outcome.
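To make that framing concrete, here is a minimal sketch of treating each event as a binary classification problem. The features, data, and model choice are illustrative assumptions, not Abnormal's actual schema or stack.

```python
# Minimal sketch of the attack/benign classification framing.
# Feature names and data are illustrative, not Abnormal's actual schema.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each row is one event (e.g. an inbound message); each column is a feature
# such as "sender previously seen", "link reputation score", "auth result".
X_train = np.array([
    [1, 0.92, 1],   # familiar sender, reputable links, passed auth -> benign
    [0, 0.10, 0],   # unknown sender, suspicious link, failed auth  -> attack
    [1, 0.85, 1],
    [0, 0.05, 0],
])
y_train = np.array([0, 1, 0, 1])  # 0 = benign, 1 = attack

clf = GradientBoostingClassifier().fit(X_train, y_train)

# At inference time, score a new event and act on the attack probability.
new_event = np.array([[0, 0.2, 1]])
p_attack = clf.predict_proba(new_event)[0, 1]
print(f"P(attack) = {p_attack:.2f}")
```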
However, scaling the raw efficacy and efficiency of the product at solving this problem is a major challenge. In something like recommender systems, framing the problem can be as hard as operating and accelerating it; for us, the main issue is scaling the product so that it solves the problem effectively and efficiently.
Q: How do you currently deploy these models in production? Do you work with managed services like Hugging Face, AWS, etc., OR have you built your own ML infrastructure stack over Kubernetes OR any other approach? And what are the pros and cons of the process?
We use a federated model architecture to accommodate the variety of models with different inference and operation requirements. Instead of relying on a single model, we have hundreds of models that target different types of traffic or are highly specific. Some models are as simple as decision trees, while others are large models deployed with TensorFlow or PyTorch.
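As a rough illustration of what a federated model layer can look like, here is a hedged sketch in which specialized models register the traffic they target and a dispatcher fans each event out to them. All names and the per-model logic are hypothetical, not Abnormal's actual code.

```python
# Hypothetical sketch of a federated model layer: many specialized models,
# each claiming the traffic it targets, rather than one monolithic model.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Event:
    kind: str                                   # e.g. "email", "sign_in"
    features: Dict[str, float] = field(default_factory=dict)

@dataclass
class RegisteredModel:
    name: str
    applies_to: Callable[[Event], bool]         # which traffic this model targets
    score: Callable[[Event], float]             # could wrap a decision tree, ONNX, etc.

class FederatedScorer:
    def __init__(self) -> None:
        self._models: List[RegisteredModel] = []

    def register(self, model: RegisteredModel) -> None:
        self._models.append(model)

    def score(self, event: Event) -> Dict[str, float]:
        # Run every model that targets this slice of traffic; downstream
        # logic can combine the per-model scores into a final verdict.
        return {m.name: m.score(event) for m in self._models if m.applies_to(event)}

scorer = FederatedScorer()
scorer.register(RegisteredModel(
    name="invoice_fraud_tree",
    applies_to=lambda e: e.kind == "email" and e.features.get("has_invoice", 0) > 0,
    score=lambda e: 0.8,   # stand-in for a real specialized model
))
print(scorer.score(Event(kind="email", features={"has_invoice": 1})))
```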
For larger, more traditional models, similar to those in a centralized serving architecture, we deploy everything internally. We avoid using managed services due to the extreme sensitivity of the data we deal with, such as customer information and internal emails. Managed services carry a security and compliance burden that's difficult to justify from our perspective.
Although managed services have benefits such as simplicity and taking advantage of the latest advances in inference efficiency and efficacy, there's a trade-off between adopting cutting-edge technology and ensuring customer data remains safe and secure. This challenge applies across all parts of our ML stack, including inference.
Q: When deploying models on these services, do you deploy them in the native format OR convert to ONNX/TensorRT? And, what has been your experience?
Yes, we use ONNX for deploying our models, but the choice of deployment format depends on the specific use case. We are migrating many of our TensorFlow models to ONNX, while for internal DSL-based models, we use native deployments. We also have some scikit-learn models deployed in their own serialized format.
Although it would be ideal to standardize on a single deployment format, our current deployment complexity is more a product of tech debt than an explicit decision. We have varying degrees of consolidation across feature serving, model training, deployment, and inference. While we have been happy with ONNX and achieved good results with it, consolidating on it has not been the top priority for our team's migration efforts.
The main reason we haven't migrated all of our models to ONNX is the internal DSL we use to run certain models, which isn't compatible with ONNX out of the box. I also haven't had the chance to explore all the different feature transformations we have. Some of our models have complex interfaces between inference and the pre-processing of data, and the abstractions we've built for the different pre-processing stages differ a little from what's standard in out-of-the-box systems. I'm not sure how well the conversion to ONNX would work with that interface. This is one source of technical complexity that has stopped us from being confident about migrating all of our models to ONNX.
For models like BERT, which only run on email tokens with standard tokenizers like the WordPiece tokenizer, migration is less complex. However, for scikit-learn models with many internal tokenizers and references to internal computations over data represented in non-standard ways, there are more hurdles to executing the migration.
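To illustrate the "easy" end of such a migration, here is a sketch of converting a plain scikit-learn model to ONNX and checking parity under onnxruntime. It assumes the skl2onnx and onnxruntime packages and a purely numeric input; models wrapped around custom tokenizers or an internal DSL would not convert this cleanly.

```python
# Sketch of the "easy" end of an ONNX migration: a plain scikit-learn model
# with a numeric input converts cleanly; pipelines with custom tokenizers or
# bespoke feature transforms are where the hurdles described above appear.
import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

X = np.random.rand(100, 8).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X, y)

# Convert: declare the input signature the ONNX graph should expose.
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 8]))]
)

# Serve with onnxruntime and check parity against the native model.
sess = ort.InferenceSession(
    onnx_model.SerializeToString(), providers=["CPUExecutionProvider"]
)
onnx_labels = sess.run(None, {"input": X[:5]})[0]
print(onnx_labels, model.predict(X[:5]))
```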
I believe the equation here revolves around the time we have budgeted for these types of migrations and how they compete with other engineering considerations. Ideally, we would like to standardize, since maintaining multiple serving formats adds complexity, but I don't think the migration effort is currently worth the potential performance improvements. Ultimately, it comes down to the amount of engineering time required to execute the migration.
Q: How do you manage inference workload for different scenarios such as real-time, batch, and streaming? What are some of the challenges you face with each of these scenarios, and how do you address them?
We have three pipelines that stream in real-time, with largely the same structure but different sources for the calls, either through Kafka or directly through an API. Batch, however, is quite different because it's deployed on Spark. The biggest challenge is achieving parity not only in the model's inference but also in the feature joins. In general, feature parity tends to be a substantially larger consideration than model inference parity, due to the need to align what's happening in batch with what's happening in real-time.
Many of our features are aggregate features, which are core to our detection strategies: they capture how unusual a particular message is in the context of a particular organization's environment. This helps us detect whether certain actions indicate an account takeover or a malicious message, based on how different they are from normal business traffic. Understanding normal business traffic, however, requires building aggregation systems that consume lots of events and present a unified, aggregated picture of what normally happens and what has been happening recently. At inference time, that picture has to reflect what was happening at exactly the point in time when the inference was made.
Joining these features in a batch setting, when trying to make predictions for many events that happened at many different points in time, is a very complex operation. There's always some noise that requires some kind of bucketing: a time-travel join quantizes time, which is not fully representative of what's actually happening in real-time. This fundamental difference between the two settings tends to drown out other types of differences.
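A hedged sketch of the batch side of this problem: a point-in-time ("as-of") join that attaches, to each historical event, the most recent aggregate snapshot available before the event. pandas stands in for the Spark job described above, and the hourly snapshot quantization is exactly the kind of bucketing that introduces skew versus real-time.

```python
# Illustrative point-in-time ("as-of") join: for each historical event, attach
# the latest aggregate snapshot computed at or before the event time.
# Column names and data are invented for illustration.
import pandas as pd

events = pd.DataFrame({
    "sender": ["a@x.com", "a@x.com", "b@y.com"],
    "event_time": pd.to_datetime(
        ["2024-01-01 10:05", "2024-01-01 14:30", "2024-01-01 11:45"]
    ),
}).sort_values("event_time")

# Hourly snapshots of an aggregate feature (e.g. messages sent in last 7 days).
snapshots = pd.DataFrame({
    "sender": ["a@x.com", "a@x.com", "b@y.com"],
    "snapshot_time": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 14:00", "2024-01-01 11:00"]
    ),
    "msgs_last_7d": [12, 14, 3],
}).sort_values("snapshot_time")

joined = pd.merge_asof(
    events, snapshots,
    left_on="event_time", right_on="snapshot_time",
    by="sender", direction="backward",   # most recent snapshot at or before the event
)
print(joined)
```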
Q: Which model serving runtime are you using, like TorchScript, Triton, TF Serving, OR have you written custom scripts? And, what has been your experience?
Currently, we are building from scratch, though we have considered using Triton and TensorFlow Serving at different points in time. However, there are a lot of eccentricities in our environment. The biggest one is that we don't have a clean separation between how we do feature processing and how we do model inference; for most models, we lack a standardized representation of the features attached to a particular event. Adopting one of those runtimes would mean connecting it, through a clean API, to the separate system doing feature addition and extraction. Defining and maintaining such an API would introduce complexity around adding new features and making schema changes, so we still believe the current coupling is the right tradeoff for flexibility.
This approach is different from Twitter, where we utilized TF Serving and everything was entirely separated. Adding new features there was a long and arduous process, but model inference was much cheaper. At Abnormal, maintaining flexibility makes model inference more expensive, but we believe it's worth the tradeoff for the moment. However, we're not far from the point where we would switch to a more stable approach, which has its own set of costs on productivity and flexibility.
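To make the tradeoff concrete, here is a hypothetical sketch of the coupled style: feature extractors and the model live behind one interface, so adding a feature is a one-line registration rather than a serving-schema change. Everything here is illustrative, not Abnormal's actual code.

```python
# Hypothetical sketch of the coupled approach: feature extraction and model
# inference live in one process, so adding a feature is a one-line change,
# at the cost of a heavier, less standardized serving path.
from typing import Callable, Dict, List

class CoupledScorer:
    def __init__(self, model: Callable[[List[float]], float]) -> None:
        self._model = model
        # Ordered registry of feature extractors; adding a feature means
        # appending here, with no serving-schema migration.
        self._extractors: List[Callable[[Dict], float]] = []

    def add_feature(self, extractor: Callable[[Dict], float]) -> None:
        self._extractors.append(extractor)

    def score(self, raw_event: Dict) -> float:
        features = [f(raw_event) for f in self._extractors]
        return self._model(features)

# Stand-in model; in practice this could be a tree ensemble or an ONNX session.
scorer = CoupledScorer(model=lambda feats: sum(feats) / max(len(feats), 1))
scorer.add_feature(lambda e: float(e.get("num_links", 0)))
scorer.add_feature(lambda e: 1.0 if e.get("sender_unknown") else 0.0)
print(scorer.score({"num_links": 3, "sender_unknown": True}))
```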
Q: Between latency & model accuracy what is more important for you? And how do you achieve the desired goal?
This is a relevant issue for us because cyber attacks present a unique challenge here. Specifically, we need to make a quick decision about whether or not to deliver an incoming email. Holding it too long makes the recipient wait for the email, or pushes us to be overly aggressive and pull it before our models have run. This can lead to business disruption and a bad customer experience. On the other hand, waiting too long to remediate a bad email can leave the customer at risk if there's something malicious attached to it. Balancing these concerns is critical, as is managing costs.
Our multistage scoring system helps us balance these concerns by dropping certain events from our pipeline based on the predictions of lighter, faster models earlier on. In some cases, we can drop events early, before the later models run, if those earlier predictions have a high degree of confidence. This helps us manage latency and costs, which are both affected by changes to how models are deployed.
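Here is an illustrative sketch of a multistage cascade in that spirit: cheap stages score first, and confidently benign events are dropped before more expensive models run. The stages, scores, and thresholds are invented for illustration.

```python
# Illustrative multistage scoring cascade: cheap models run first and drop
# clearly benign events, so heavier models only see the residual traffic.
from typing import Callable, Dict, List, Optional, Tuple

# Each stage: (name, scoring function, threshold below which we drop the event).
Stage = Tuple[str, Callable[[Dict], float], float]

def cascade_score(event: Dict, stages: List[Stage]) -> Optional[float]:
    score: Optional[float] = None
    for name, model, drop_below in stages:
        score = model(event)
        if score < drop_below:
            # Confidently benign at this stage: stop before paying for later,
            # more expensive models (lower latency and lower cost).
            return None
    return score  # survived every stage; final score drives remediation

stages: List[Stage] = [
    ("heuristics", lambda e: 0.9 if e.get("sender_unknown") else 0.05, 0.1),
    ("fast_tree",  lambda e: 0.7,  0.3),   # stand-in for a light tree model
    ("deep_model", lambda e: 0.95, 0.5),   # stand-in for a BERT-style model
]
print(cascade_score({"sender_unknown": True}, stages))   # scored by all stages
print(cascade_score({"sender_unknown": False}, stages))  # dropped early
```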
Latency is particularly important for the customer experience, but the impact on customers is not equal at all points. For example, the difference between 200 and 400 milliseconds of latency is not significant, but going from five seconds to ten seconds can have a big impact on whether or not a bad email is clicked on and read. The trade-offs we make depend on the specific type of attack we're dealing with, as different types of attacks have different accuracy, latency, and cost considerations.
Q: How do you manage and scale your infrastructure for inference as your usage grows?
We carefully monitor utilization across our different cloud environments using Kubernetes. The nature of business communication across our customer base gives us a predictable usage pattern throughout the day, with clear seasonal patterns. We can reasonably predict capacity and use Spot Instances and similar strategies for scaling when necessary. We use a semi-real-time architecture to avoid overloading and allow auto-scaling to pick up any backlog. While we don't aggressively innovate in scaling, we focus on making use of best practices given the predictability of our traffic.
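As one hedged example of what such best practices can look like, the sketch below uses the official Kubernetes Python client to create a Horizontal Pod Autoscaler for a hypothetical scoring deployment; the deployment name, namespace, and replica bounds are assumptions, not Abnormal's configuration.

```python
# Sketch: create a CPU-based Horizontal Pod Autoscaler for a hypothetical
# "scoring-service" deployment using the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
autoscaling = client.AutoscalingV1Api()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="scoring-service-hpa", namespace="detection"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="scoring-service"
        ),
        min_replicas=4,    # floor sized for the predictable daily baseline
        max_replicas=40,   # headroom for peaks and backlog catch-up
        target_cpu_utilization_percentage=60,
    ),
)
autoscaling.create_namespaced_horizontal_pod_autoscaler(
    namespace="detection", body=hpa
)
```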
Q: Can you discuss any best practices or tips for other ML engineers looking to deploy models in production, based on your experience?
One critical thing in the ML engineering lifecycle is to deploy often. Throughout my career, I've been giving this advice, and over time people have received it better and better. In the early days of ML as part of business processes, ML engineers tended to be more separate from the software lifecycle. But now, most organizations generally accept that it's better to have a closer interface between the data scientists and ML engineers who train models and the process of deploying them into production. This varies across industries and types of companies.
The benefits of early and fast deployments are that you can better understand the actual deployment challenges you will face. It can be difficult to predict these challenges because sometimes there is an offline-online mismatch, where what you see offline is different from what you see online. This can be due to features being served differently or differences in online and offline environments. There are many differences to consider.
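One simple, hedged way to surface this kind of offline-online mismatch is to compare the distribution of a feature in offline training data against values logged at serving time, for example with the population stability index (PSI). The data and the drift threshold below are illustrative.

```python
# One simple way to surface offline/online feature skew: compare a feature's
# distribution in offline training data against values logged at serving
# time, using the population stability index (PSI).
import numpy as np

def psi(offline: np.ndarray, online: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(offline, bins=bins)
    off_frac = np.histogram(offline, bins=edges)[0] / len(offline)
    on_frac = np.histogram(online, bins=edges)[0] / len(online)
    # Clip to avoid division by zero / log(0) on empty buckets.
    off_frac = np.clip(off_frac, 1e-6, None)
    on_frac = np.clip(on_frac, 1e-6, None)
    return float(np.sum((on_frac - off_frac) * np.log(on_frac / off_frac)))

offline_values = np.random.normal(0.0, 1.0, 10_000)  # feature from training set
online_values = np.random.normal(0.3, 1.0, 10_000)   # same feature, serving logs
print(f"PSI = {psi(offline_values, online_values):.3f}")  # rule of thumb: > 0.2 suggests drift
```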
Sometimes, the actual deployment is harder than expected, and you may not be able to do an automated deployment, making it difficult to automatically retrain your model. This can be a major challenge and can result in serious software work to align deployment environments. Other times, you may find that model deployment is more expensive than expected and you need to optimize. The slew of different online challenges you can face is massive.
Taking an experimental mindset is the best piece of advice I can give for the production process.
Thanks for reading. 😄