For this project, I wanted to focus on anomaly detection in the domain of cybersecurity. I figured that analyzing web logs for anomalies would be a great start to this experiment. After some research, I concluded that unsupervised deep learning would be a good way to implement this type of analysis. An autoencoder neural network is a very popular way to detect anomalies in data. The autoencoder tries to learn to approximate the identity function, so that its output reconstructs its input:
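In commonly used notation (the symbols below are assumptions and are not taken from the original figure), with an encoder $f$, a decoder $g$, weights $W$, biases $b$, and $d$ input features, the idea can be written as:

$$
h_{W,b}(x) = g\big(f(x)\big) \approx x,
\qquad
\mathrm{err}(x) = \frac{1}{d} \sum_{j=1}^{d} \big(x_j - h_{W,b}(x)_j\big)^2
$$

An input that the network reconstructs poorly has a large $\mathrm{err}(x)$, which is what makes it a candidate anomaly.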
Here is what a typical autoencoder model might look like:
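As a rough stand-in for the diagram, here is a minimal NumPy sketch of a one-hidden-layer autoencoder trained with plain gradient descent. It is not the model used in this project: the function names `train_autoencoder` and `reconstruct`, the layer size, the tanh activation, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

def train_autoencoder(X, hidden=2, epochs=500, lr=0.1, seed=0):
    """Train a tiny autoencoder (tanh encoder, linear decoder) on X.

    Hypothetical helper for illustration; X is an (n_samples, n_features) array.
    Returns the learned parameters and the loss recorded at each epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Forward pass: compress to `hidden` units, then expand back to d.
        H = np.tanh(X @ W1 + b1)
        X_hat = H @ W2 + b2
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        gW2 = H.T @ g_out
        gb2 = g_out.sum(axis=0)
        gZ = (g_out @ W2.T) * (1.0 - H ** 2)  # derivative of tanh
        gW1 = X.T @ gZ
        gb1 = gZ.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return (W1, b1, W2, b2), losses

def reconstruct(params, X):
    """Run X through the trained encoder and decoder."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2
```

The hidden layer is narrower than the input, so the network cannot simply copy its input and has to learn the dominant structure of the data instead. A real model would typically use a deep learning framework and more layers.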
For detailed information on these models, there are plenty of blog posts and research papers for the curious mind.
As I needed comprehensive data, I looked for a dataset of web logs that could easily be run through my autoencoder model. I found one on Kaggle: https://www.kaggle.com/shawon10/web-log-dataset#webLog.csv . This dataset is a 10787 x 4 table (10787 rows, 4 columns). The 4 columns represent the IP address, the time, the directory requested, and the HTTP response code. I removed the time column from my data because every one of these entries would be unique and so would not help reveal a pattern in the data that is useful for anomaly detection. Here are some charts from the output of the model:
Statistics on the Reconstruction Errors:
Binning of the Reconstruction Errors:
Plotting of the Reconstruction Errors vs. the data:
The first bubble in the upper left part of the last chart is a non-patterned data point that I purposely included to verify that the model is working correctly. As you can see, it does indeed stand out. I created a pipeline to extract all original data entries whose mean squared error (reconstruction error) is above the 99th percentile. This is the threshold that I used to automatically detect anomalies. Samples of the data above the threshold value can be seen below; all of the data points above the threshold are available on GitHub as a separate text file. You can verify for yourself that these directories are unique in the original dataset. It is impressive that the model was able to figure out which values are anomalies based only on some hyperparameters and training on this data.
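The preprocessing and thresholding steps described above can be sketched roughly as follows. This is a sketch under stated assumptions, not the project's actual pipeline: one-hot encoding is an assumed way of turning the IP, directory, and response code columns into numbers, and the helper names `one_hot_encode`, `reconstruction_errors`, and `rows_above_percentile` are hypothetical. The reconstruction `X_hat` would come from the trained autoencoder.

```python
import numpy as np

def one_hot_encode(rows, drop_columns=(1,)):
    """Drop the given column indices (the time column by default) and
    one-hot encode every remaining categorical column."""
    kept = [[v for i, v in enumerate(r) if i not in drop_columns] for r in rows]
    n_cols = len(kept[0])
    # One sorted vocabulary per remaining column.
    vocabs = [sorted({r[c] for r in kept}) for c in range(n_cols)]
    lookup = [{v: j for j, v in enumerate(vocab)} for vocab in vocabs]
    offsets = np.cumsum([0] + [len(v) for v in vocabs])
    X = np.zeros((len(kept), int(offsets[-1])))
    for i, r in enumerate(kept):
        for c, value in enumerate(r):
            X[i, offsets[c] + lookup[c][value]] = 1.0
    return X, vocabs

def reconstruction_errors(X, X_hat):
    """Mean squared error between each row and its reconstruction."""
    return np.mean((X - X_hat) ** 2, axis=1)

def rows_above_percentile(rows, errors, percentile=99.0):
    """Return the original rows whose error exceeds the given percentile."""
    threshold = np.percentile(errors, percentile)
    flagged = [row for row, e in zip(rows, errors) if e > threshold]
    return flagged, threshold
```

Working with indices into the original rows, rather than with the encoded matrix, is what lets the flagged entries be written out in their readable form (IP, directory, response code).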
If there are issues with accessing my GitHub repo below, I have a zipped file with my code, model, and datasets here: Repo Copy
Please see my GitHub for code, model, and the dataset related to this project.