Machine Learning Use Cases in Cybersecurity, Prioritizing Latency & More with Eric Voots
This week on Towards Scaling Inference, we chat with Eric Voots, Lead Software Engineer - Cybersecurity & Machine Learning at Wells Fargo.
Eric is currently a Lead Software Engineer in Cybersecurity and Machine Learning at Wells Fargo. He has previously worked at US Bank and Bank of America, among others. As an experienced financial-industry professional, he enjoys using his expertise to analyze data and improve processes and efficiency.
In this episode, he discusses how he and his team use unsupervised learning to find anomalies in large volumes of cybersecurity data, and why they prioritize latency over accuracy. To deploy ML models in production successfully, he also advises using software engineering teams effectively, coordinating with other teams, and having an independent team check models before deployment.
For the summary, let’s dive right in:
Q: What kind of business problems are you solving through AI?
Our team focuses on cybersecurity, where vast amounts of data are generated by various vendor products such as firewalls, scanners, anti-virus software, and user behavior scanners. Billions of logs are generated each day, and data sizes reach terabytes, making it difficult to sift through. We tackle this problem by using unsupervised learning to identify anomalies in the data. We present these anomalies to cybersecurity experts, who can investigate further and identify potential threats. By doing this, we help them to focus their efforts on malicious cases, rather than spending their time sifting through millions of rows of data. In short, we help to solve the haystack problem in cybersecurity.
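To make the approach concrete, here is a minimal sketch of unsupervised anomaly triage in Python. The model choice (scikit-learn's IsolationForest), file names, feature columns, and thresholds are illustrative assumptions, not details from the conversation:

```python
# Minimal sketch of the "haystack" approach: score log rows with an
# unsupervised model and surface only the top anomalies for analysts.
# Columns, paths, and thresholds are hypothetical.
import pandas as pd
from sklearn.ensemble import IsolationForest

logs = pd.read_parquet("firewall_logs.parquet")  # hypothetical input
features = logs[["bytes_sent", "bytes_received", "connection_count"]]

model = IsolationForest(contamination=0.001, random_state=42)
logs["anomaly_score"] = model.fit(features).score_samples(features)

# Lower scores are more anomalous; hand the worst 100 rows to experts
# instead of millions of raw log lines.
suspects = logs.nsmallest(100, "anomaly_score")
suspects.to_csv("triage_queue.csv", index=False)
```

The point is the workflow rather than the specific model: anything that ranks rows by unusualness lets analysts start from the most suspicious events.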
Q: How do you currently deploy these models in production? Do you work with managed services like Hugging Face, AWS, etc., have you built your own ML infrastructure stack on Kubernetes, or do you take another approach? And what are the pros and cons of the process?
Yes, we have a custom approach that mainly uses Spark. Although other teams use different methods, we work with big data that can reach billions of rows and terabytes in size. Spark is essential for analyzing that data quickly, because the longer the analysis takes, the longer attackers could potentially stay in the system.
💡 Our primary goal is to analyze data in near real time, with as little latency as possible. While Python is commonly used, it is single-threaded by default, which is inefficient at this scale. Spark, in contrast, parallelizes work across threads and machines and runs on the JVM behind the scenes, although Java knowledge is not necessary to use Spark.
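As a hedged illustration of that point, a small PySpark job like the one below distributes an aggregation that plain Python would run on a single thread; the path and column names are hypothetical:

```python
# Hypothetical sketch: a per-host aggregation that plain Python would run
# single-threaded is spread across executor threads by Spark.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("log-triage")
         .master("local[*]")  # all local cores; a real cluster in production
         .getOrCreate())

logs = spark.read.json("s3://bucket/firewall-logs/*.json")  # hypothetical path

# Per-source event counts and byte totals, computed in parallel per partition.
counts = (logs.groupBy("source_ip")
              .agg(F.count("*").alias("events"),
                   F.sum("bytes_sent").alias("bytes_out")))
counts.orderBy(F.desc("events")).show(20)
```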
While we have been successful with Spark so far, we may eventually hit its limits, particularly with complex NLP work. In such cases, we will likely need to move toward some type of cloud infrastructure. Doing custom NLP work is challenging and not recommended unless you are willing to build the necessary infrastructure; it is possible to do better than off-the-shelf options such as Hugging Face's Transformers, but it requires a lot of effort and expertise.
Q: If I understand correctly, you are currently using Spark jobs to manage some of your learning, and then using TensorFlow in Python behind the scenes to train these machine learning models?
We haven't done much deep learning yet, but it will likely come into play with some of our more complex projects, particularly those related to NLP. So far, we've been focusing on typical time-series anomaly problems. We experimented with TensorFlow and PyTorch but didn't see significant performance gains compared to the computational costs. Running these tools in production requires the right GPU setup, which we didn't have. Even though we have GPUs on-premises, the performance gains for our current projects weren't worth the investment. However, we will likely need to use deep learning for future NLP projects. To make that happen, we will first need to set up the infrastructure. We don't want to be running that locally.
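For context, a common non-deep-learning baseline for this kind of time-series anomaly problem is a rolling z-score; the sketch below uses made-up file and column names and an illustrative threshold:

```python
# Flag points that deviate sharply from recent behavior, with no deep
# learning involved. Window size and threshold are illustrative.
import pandas as pd

events = pd.read_csv("login_counts.csv", parse_dates=["hour"])  # hypothetical
series = events.set_index("hour")["logins"]

rolling = series.rolling(window=24, min_periods=24)  # last 24 hours
zscore = (series - rolling.mean()) / rolling.std()

# Anything 4+ standard deviations from the rolling mean gets surfaced.
anomalies = series[zscore.abs() > 4]
print(anomalies)
```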
Q: Regarding piloting with NLP models, you mentioned Hugging Face. Could you explain why you chose Hugging Face and whether you have experience using their models or are currently experimenting with them?
Our team, in particular, hasn't been in contact with Hugging Face, though we tend to receive a lot of project requests. However, I know that other internal teams have looked into Hugging Face and incorporated it into their existing projects. This is a big bank, and there are many different data science teams working on various projects. Some teams have used Hugging Face for document analysis, because banks deal with a lot of paperwork, and Hugging Face is commonly used for that purpose.
💡 Moving forward, we are considering using Hugging Face to analyze documents for potential cybersecurity threats such as bad links or malicious files. It is not feasible for a human to manually review all of these documents, so we hope to use Hugging Face to automatically identify any suspicious activity. Once identified, the appropriate personnel can investigate further. Therefore, we anticipate using Hugging Face for document analysis. There are simply too many pages to manually review.
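A rough sketch of what that screening could look like with Hugging Face's pipeline API is below. The zero-shot model and candidate labels are placeholders; a team doing this in production would likely fine-tune a classifier on its own documents and threat data:

```python
# Hedged sketch of document screening with a Hugging Face pipeline.
# Model name, labels, and threshold are placeholder assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

doc_text = open("incoming_document.txt").read()  # hypothetical input
result = classifier(doc_text,
                    candidate_labels=["phishing or malicious content",
                                      "benign business document"])

# Labels come back sorted by score; route confident hits to an analyst.
if result["labels"][0] == "phishing or malicious content" and result["scores"][0] > 0.9:
    print("Flag for analyst review")
```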
Q: Once the other teams have trained their models on Hugging Face, do they deploy them in their own cluster or elsewhere?
In many cases we need to keep all the information within the organization. Whether we can send data out depends on the type of data being transmitted. If it contains PII or confidential information, such as internal IP addresses in cybersecurity, we need to be cautious about where we send it. That is a question I defer to the lawyers, so I can't give a definitive answer. I suspect other organizations have similar policies in place, since it depends on the nature of the dataset. Basic geographic information is probably fine, but if it contains personal information like names, addresses, and phone numbers, I would not send it out.
Q: Between latency and model accuracy, which is more important for you? And how do you achieve the desired goal?
Latency is the most important factor for our purposes because it reduces the time to detect an attack or initial malicious event. In contrast, accuracy is difficult to define since labeling something as malicious might require a week or two of investigation. Therefore, latency is more important than accuracy. If we can quickly identify anomalous events, that is better than having more accurate models that take significantly longer to compute. The average time to discover an advanced persistent threat in cybersecurity is 200 days. Thus, any additional reduction in latency is a significant improvement for the cybersecurity world.
Q: Do you run your inference workloads on CPU machines or GPUs?
Mostly CPU, and it's multi-threaded for Spark jobs. There are a few GPU workloads, but I'd estimate that 90-95% of them use CPU. This percentage may shift as we move toward more cloud infrastructure and undertake more NLP projects. If you ask me in a year or so, I'll likely give a very different percentage. So it's a work in progress.
Q: How do you manage and scale your infrastructure for inference as your usage grows?
The key to scaling successfully is obtaining accurate estimates of resource requirements. For internal projects, we begin by monitoring a sample set of 5-10 servers to determine the relative volume before deploying to additional machines. This allows us to properly scale resources and network infrastructure later on.
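That pilot-then-extrapolate sizing reduces to simple arithmetic; the numbers in this sketch are entirely made up:

```python
# Back-of-the-envelope capacity planning from a small pilot.
sample_servers = 10
sample_gb_per_day = 150   # log volume measured on the pilot servers
total_servers = 4000      # planned rollout size (hypothetical)
headroom = 1.5            # safety margin for spikes

projected = sample_gb_per_day / sample_servers * total_servers * headroom
print(f"Plan ingest and storage for ~{projected:,.0f} GB/day")
```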
We also work with network engineers, data engineers, and security teams to ensure proper communication and handle any potential issues. Failure to communicate can result in volume problems or other complications, as we have experienced in the past.
📌 While we currently rely on CPU load balancing, we plan to utilize containerization in the future for faster scaling. Proper communication with all partners is essential in a large organization to ensure successful implementation.
Q: Can you discuss any best practices or tips for other ML engineers who are looking to deploy their model in production based on your experience?
Yes: utilize your software engineering teams effectively, especially for testing. When I was doing a lot of programming early on, I wasn't the best software engineer, and I'm still not. I talk to them to find out what they don't understand about my ML model. Sometimes they don't care about the modeling itself, and you might assume that means they think it's great. But they might point out that you have an unnecessary if statement on line 115, or a while loop without an exit condition. Hence, it's crucial to have proper checks with other teams and proper unit testing for your code. Don't hesitate to ask for help. You're not an expert on every aspect of Python, and you're likely to miss something. Even after using it for ten years, I still miss stuff.
💡 Having good unit testing and sensitivity testing on your model will help you identify what's going to break and what isn't. Don't rely solely on yourself and your own team for checking; have an independent team. ML teams don't always have a QA team the way software teams do. I recommend having an additional team that works independently, although that isn't always possible in some organizations. Don't proofread your own college paper by yourself.
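As a rough illustration of the unit and sensitivity testing Eric recommends, here is a minimal pytest sketch; the scoring function is a hypothetical stand-in for a real model, and the tolerances are arbitrary:

```python
# Minimal pytest sketch: basic contract tests plus a sensitivity test.
# `score` stands in for a real model's scoring function.
import numpy as np

def score(features: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: mean of the feature values per row."""
    return features.sum(axis=1) / features.shape[1]

def test_score_shape_and_range():
    X = np.random.rand(100, 5)
    s = score(X)
    assert s.shape == (100,)            # one score per input row
    assert np.all((s >= 0) & (s <= 1))  # scores stay in the expected range

def test_sensitivity_to_tiny_perturbations():
    X = np.random.rand(100, 5)
    noisy = X + np.random.normal(0, 1e-6, X.shape)
    # A negligible input change should not swing the scores wildly.
    assert np.allclose(score(X), score(noisy), atol=1e-3)
```

Tests like these are cheap to write, and they catch exactly the kind of issue an independent reviewer would flag.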
Thanks for reading. 😄