AI: Deep Learning for Phishing URL Detection

I wanted to continue building my AI and deep learning knowledge. A requirement for this project was that it focus on cyber security. Email-based phishing is a big issue in our society, so I focused my efforts in that direction. I also have something of a specialization in applying deep learning to NLP (natural language processing), which simply reflects my interests and the work that has come out of them.

I decided to use binary classification for this model, so I had to collect both phishing and non-phishing URLs.

For the phishing URLs, I used PhishTank's verified URL database. I wrote logic that polls their API every 4 hours and keeps building a local database. For the non-phishing URLs, I modified a crawler I found on GitHub so that it updates a local database.
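A minimal sketch of how a polling loop like this could accumulate URLs without duplicates, assuming a SQLite store. The table schema, the label convention (1 = phishing) and the `fetch_phishing_urls` callable are all hypothetical placeholders, not the code used in this project, and no real PhishTank API call is shown.

```python
import sqlite3
import time
from typing import Callable, Iterable

POLL_INTERVAL_SECONDS = 4 * 60 * 60  # every 4 hours, as described above


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local URL store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, label INTEGER NOT NULL)"
    )
    return conn


def store_urls(conn: sqlite3.Connection, urls: Iterable[str], label: int) -> int:
    """Insert URLs, skipping ones already stored. Returns how many were new."""
    before = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO urls (url, label) VALUES (?, ?)",
        [(u, label) for u in urls],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    return after - before


def poll_forever(
    conn: sqlite3.Connection,
    fetch_phishing_urls: Callable[[], Iterable[str]],  # hypothetical fetcher
) -> None:
    """Fetch the feed on a fixed interval and grow the local database."""
    while True:
        store_urls(conn, fetch_phishing_urls(), label=1)
        time.sleep(POLL_INTERVAL_SECONDS)
```

Using the URL as the primary key means repeated polls of the same feed only add entries that have not been seen before.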

I set about training a character-embedded bidirectional LSTM. This seems to be a production-worthy, state-of-the-art model that benefits from seeing both the earlier and the later characters in the URL, which helps it pick up the patterns needed for binary classification. The Keras training output is at the end of this post.
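To make "character-embedded" concrete, here is a plain-Python sketch of turning a URL into a fixed-length sequence of character IDs that an embedding layer can consume. The length of 128 comes from the data tensor shape in the training output. The vocabulary construction, the handling of unknown characters and the left-padding are assumptions for illustration, not necessarily the preprocessing used for this model.

```python
from typing import Dict, List

MAX_LEN = 128  # matches the (67997, 128) data tensor in the training output


def build_vocab(urls: List[str]) -> Dict[str, int]:
    """Map each distinct character to an ID starting at 1; 0 is kept for padding."""
    chars = sorted({ch for url in urls for ch in url})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(url: str, vocab: Dict[str, int], max_len: int = MAX_LEN) -> List[int]:
    """Turn a URL into a fixed-length list of character IDs."""
    ids = [vocab[ch] for ch in url if ch in vocab]  # unknown characters are dropped
    ids = ids[-max_len:]                            # keep only the last max_len IDs
    return [0] * (max_len - len(ids)) + ids         # left-pad with zeros
```

Each ID is then looked up in the embedding layer, so the network learns a vector per character rather than per word.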

Below are charts of the training/validation loss and accuracy:

[Figures: training/validation loss; training/validation accuracy]

The model achieved 97.68% accuracy on the test set (10% of the URLs, i.e. 6,799 URLs). I have also included evaluation metrics for this model below: the ROC curve with AUC, confusion matrices, and the F1 score.

ROC/AUC Curve:

[Figures: ROC/AUC curve; ROC/AUC curve, zoomed]

Confusion matrices:

[Figures: confusion matrix, non-normalized; confusion matrix, normalized]

F1 Score:

[Figure: F1 score]
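For readers who want to see how these metrics relate, here is a small dependency-free sketch that derives accuracy and F1 from confusion-matrix counts. The label convention (1 = phishing, 0 = non-phishing) is an assumption, and the sketch is not the evaluation code that produced the figures.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp), assuming 1 = phishing and 0 = non-phishing."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn: int, fp: int, fn: int, tp: int) -> float:
    """Fraction of all predictions that were correct."""
    total = tn + fp + fn + tp
    return (tn + tp) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the phishing class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Accuracy counts both classes equally, while F1 focuses on how well the phishing class is caught, which matters when one class is much rarer than the other.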

For URLs that include directories and files, I seem to get a respectable level of accuracy on unseen data. However, various tests show unreliable predictions for base URLs, so I have code that simply returns no prediction for them, e.g. https://www.zpettry.com
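One way such a base-URL guard could look, sketched with the standard library. The definition of "base URL" used here (no path beyond "/", no query, no fragment) and the `predict` callable are assumptions for illustration, not the project's actual logic.

```python
from typing import Callable, Optional
from urllib.parse import urlsplit


def is_base_url(url: str) -> bool:
    """True when the URL has no path beyond "/", no query and no fragment.

    Assumes the URL includes its scheme (e.g. "https://").
    """
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query and not parts.fragment


def predict_or_skip(url: str, predict: Callable[[str], float]) -> Optional[float]:
    """Return None for base URLs, otherwise defer to the (hypothetical) model."""
    if is_base_url(url):
        return None
    return predict(url)
```

Returning `None` rather than a score makes it explicit to the caller that the model declined to judge the URL.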

I have put together a Flask REST API that can be tested locally. I also have a "request.py" program available that will do the POST request for you. All you have to do is add the URL of your choice.
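As a rough idea of what such a POST request involves, here is a standard-library sketch. The endpoint address, the `/predict` route and the `"url"` JSON field are hypothetical; check the repository's "request.py" for the actual values.

```python
import json
import urllib.request

API_ENDPOINT = "http://127.0.0.1:5000/predict"  # hypothetical local Flask address


def build_request(url_to_check: str, endpoint: str = API_ENDPOINT) -> urllib.request.Request:
    """Build a JSON POST request carrying the URL to classify."""
    body = json.dumps({"url": url_to_check}).encode("utf-8")  # field name is assumed
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON reply (requires the API to be running)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running locally, `send(build_request("http://example.com/login.php"))` would return whatever JSON the server replies with.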

Future Plans:

I have coded logic that continuously acquires both phishing and regular URLs, as I'm thinking about turning this model into more of an anomaly detection paradigm by using an autoencoder. There is a plethora of regular URLs that could be trained on, as the data is incredibly asymmetric. Furthermore, I might start looking into the body of emails and train an anomaly detection model to detect whether a message is phishing. This way I can create an ensemble model. Based on my research, ensemble models seem to outperform non-ensemble methods.

If there are issues with accessing my GitHub repo below, I have a zipped file with my code, model, and datasets here: Repo Copy

Please see my GitHub for code and datasets related to this project.

Because of GitHub size limits, the model can be downloaded here: Model

This is the training output from Keras:

Using TensorFlow backend.
Found 69 unique tokens.
Shape of data tensor: (67997, 128)
Shape of label tensor: (67997,)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 128, 128)          8960      
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 128, 512)          788480    
_________________________________________________________________
bidirectional_11 (Bidirectio (None, 128, 512)          1574912   
_________________________________________________________________
bidirectional_12 (Bidirectio (None, 256)               656384    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 257       
=================================================================
Total params: 3,028,993
Trainable params: 3,028,993
Non-trainable params: 0
_________________________________________________________________
Train on 48957 samples, validate on 12240 samples
Epoch 1/10
48957/48957 [==============================] - 1082s 22ms/step - loss: 0.4997 - acc: 0.7468 - val_loss: 0.3786 - val_acc: 0.8386
Epoch 2/10
48957/48957 [==============================] - 1078s 22ms/step - loss: 0.3326 - acc: 0.8631 - val_loss: 0.2266 - val_acc: 0.9182
Epoch 3/10
48957/48957 [==============================] - 1079s 22ms/step - loss: 0.2686 - acc: 0.8942 - val_loss: 0.1943 - val_acc: 0.9252
Epoch 4/10
48957/48957 [==============================] - 1081s 22ms/step - loss: 0.1852 - acc: 0.9326 - val_loss: 0.1308 - val_acc: 0.9551
Epoch 5/10
48957/48957 [==============================] - 1080s 22ms/step - loss: 0.1664 - acc: 0.9400 - val_loss: 0.1272 - val_acc: 0.9574
Epoch 6/10
48957/48957 [==============================] - 1081s 22ms/step - loss: 0.1274 - acc: 0.9561 - val_loss: 0.0995 - val_acc: 0.9683
Epoch 7/10
48957/48957 [==============================] - 1081s 22ms/step - loss: 0.1006 - acc: 0.9661 - val_loss: 0.0844 - val_acc: 0.9742
Epoch 8/10
48957/48957 [==============================] - 1079s 22ms/step - loss: 0.0894 - acc: 0.9702 - val_loss: 0.0674 - val_acc: 0.9772
Epoch 9/10
48957/48957 [==============================] - 1078s 22ms/step - loss: 0.0839 - acc: 0.9732 - val_loss: 0.0658 - val_acc: 0.9800
Epoch 10/10
48957/48957 [==============================] - 1079s 22ms/step - loss: 0.0717 - acc: 0.9769 - val_loss: 0.0582 - val_acc: 0.9825
6799/6799 [==============================] - 46s 7ms/step
Model Accuracy: 98.29%
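The parameter counts in the model summary above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers. The layer sizes are inferred from the output shapes rather than taken from the training script: 256 units per direction for the first two bidirectional layers, 128 for the third (assuming the two directions are concatenated), and a vocabulary of 70 (the 69 tokens found plus one padding index).

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim + units + 1) * units


def bilstm_params(input_dim: int, units: int) -> int:
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim: int, units: int) -> int:
    """Weights plus one bias per output unit."""
    return (input_dim + 1) * units


counts = [
    embedding_params(70, 128),  # embedding_4
    bilstm_params(128, 256),    # bidirectional_10, output width 2 * 256 = 512
    bilstm_params(512, 256),    # bidirectional_11, output width 512
    bilstm_params(512, 128),    # bidirectional_12, output width 2 * 128 = 256
    dense_params(256, 1),       # dense_4
]
total = sum(counts)
print(counts, total)
```

Under these inferred sizes, each count matches the corresponding "Param #" entry and the total matches the reported 3,028,993.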