IT Threat Detection using Neural Search

In this tutorial, we will create a “deep-learning powered” cybersecurity dashboard that simulates monitoring network traffic for malicious events in real time.

July 3, 2022

Network attacks are a broad category of cybersecurity threats in which a malicious actor attempts to disrupt, steal, or corrupt an organization’s data by gaining unauthorized access to its systems. Detecting them is the proverbial “needle in a haystack” problem: it is inherently difficult because it requires finding rare events in extremely large datasets.

When a dataset contains hundreds to thousands of dimensions, it poses tricky challenges (e.g., the curse of dimensionality). Similarity search is an approach to understanding high-dimensional data that works by finding the objects in a collection that are most alike under some chosen definition of similarity. You can think of it as a k-Nearest Neighbor (k-NN) problem where the similarity of objects is measured by distance (source).

Documents with smaller angles relative to the query are considered most relevant. From “Vector Space Search Engines Explained” Link.
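
To make the idea of “smaller angles” concrete, here is a minimal sketch in plain Python that ranks a few documents by cosine similarity to a query. The vectors, the document names, and the helper functions `cosine_similarity` and `rank_by_similarity` are all made up for illustration; they are not part of the tutorial’s code or of Jina.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, documents):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-D vectors; real embeddings have far more dimensions.
documents = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

ranking = rank_by_similarity(query, documents)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A larger cosine value corresponds to a smaller angle, so the documents at the top of the ranking are the ones pointing in nearly the same direction as the query.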

In this series of blog posts, we will build a Jina application that leverages similarity search to classify network traffic flows as either benign or malicious. Our goal is to develop a reliable, scalable, and fast intrusion detection system that detects attacks in real time.

To pull this off, we will perform “network surgery” on a pre-trained neural network, removing the classification layer, and instead repurposing the network as a feature extractor. In other words, our network will output features, as opposed to labels.

Diagram of “network surgery”. Our model’s type and dimensions differ, but the same concept applies. Source.
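
As a rough illustration of the concept, the sketch below builds a toy two-layer network in plain Python and then drops its final classification layer so that it returns hidden features instead of class scores. Everything here is an assumption made for the example: the weights are invented, the layer sizes (4 inputs, 3 hidden units, 2 classes) are arbitrary, and the tutorial’s actual model is a pre-trained network that produces 128-D embeddings.

```python
def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

# Hypothetical toy weights: 4 input features -> 3 hidden units -> 2 classes.
HIDDEN_W = [[0.5, -0.2, 0.1, 0.0],
            [0.0, 0.3, -0.4, 0.2],
            [0.1, 0.1, 0.1, 0.1]]
HIDDEN_B = [0.0, 0.1, -0.1]
CLASSIFIER_W = [[1.0, -1.0, 0.5],
                [-1.0, 1.0, 0.5]]
CLASSIFIER_B = [0.0, 0.0]

# The full model is an ordered list of layers; the last is the classifier.
full_model = [
    lambda x: relu(dense(x, HIDDEN_W, HIDDEN_B)),
    lambda x: dense(x, CLASSIFIER_W, CLASSIFIER_B),
]

def run(layers, inputs):
    """Feed inputs through each layer in order."""
    for layer in layers:
        inputs = layer(inputs)
    return inputs

# "Surgery": keep every layer except the final classification layer.
feature_extractor = full_model[:-1]

sample = [1.0, 2.0, 3.0, 4.0]
class_scores = run(full_model, sample)      # one score per class
embedding = run(feature_extractor, sample)  # hidden features only
print(len(class_scores), len(embedding))
```

With a real deep-learning framework the same step usually amounts to taking the output of the layer just before the classifier, but the exact call depends on the framework and on how the model was built.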

Then, we will take the 128-D embeddings generated by our feature extractor and make them searchable by indexing them using a Jina Flow. By indexing thousands of these 128-D vectors along with their labels (benign/malicious), we can capitalize on the powerful relationship between distance and similarity that vector space facilitates.

This will allow us to take unseen network traffic data from a different day, extract its features, and classify it as benign or malicious by finding its nearest neighbor in the index and assigning it that neighbor’s class.
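
The nearest-neighbor step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the index is a small in-memory list rather than a Jina Flow, the vectors are invented 3-D stand-ins for the 128-D embeddings, Euclidean distance is used as one possible distance measure, and `classify` is a hypothetical helper name.

```python
import math

# Hypothetical labeled index: (embedding, label) pairs.
index = [
    ([0.9, 0.1, 0.0], "benign"),
    ([0.8, 0.2, 0.1], "benign"),
    ([0.1, 0.9, 0.3], "malicious"),
    ([0.0, 0.8, 0.5], "malicious"),
]

def classify(query, labeled_vectors):
    """Return the label of the indexed vector closest to the query."""
    nearest_vector, nearest_label = min(
        labeled_vectors,
        key=lambda item: math.dist(query, item[0]),  # Euclidean distance
    )
    return nearest_label

# Two unseen (made-up) traffic embeddings.
print(classify([0.85, 0.15, 0.05], index))
print(classify([0.05, 0.85, 0.35], index))
```

A linear scan like this is fine for a handful of vectors; indexing thousands of embeddings is where a dedicated search framework earns its keep.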

To recap, we are going to make a slight tweak to a pre-trained neural network and turn a classification problem into a similarity search problem so that we can simulate detecting malicious network traffic in real time.

Similarity search as a basis for classification.

Here are the steps involved:

Generate vector representations of our network traffic by using our network as a feature extractor
Find a similarity measure that makes representations of similar things close together
Find the nearest neighbors of each search query and return their labels (benign/malicious) to identify malicious traffic

This project won’t build itself! Let's get started already and check out our dataset.