Background
At Jericho Security, we strive to prepare people for the next generation of phishing attacks by providing generative phishing simulations modeled on real-life phishing and spear phishing scenarios. To that end, we have always prioritized a wide breadth of realistic phishing simulations that mimic real-life situations to an extraordinary degree.
To track our performance, however, we need measurable metrics that capture both the breadth and the realism of our generations.
This report is meant to shed light on our data science research efforts in conceptualizing, operationalizing, and measuring those metrics.
How did we decide on Semantic Similarity?
Diving deep into what it means for our simulations to have breadth and realism, we decided that our simulations should, on average, be different from each other, while each individual simulation should be similar to a real phishing email. Both criteria require a notion of how much one piece of text overlaps with another, and fortunately the machine learning literature already has fairly well-grounded operationalizations of this concept.
After experimenting with various algorithms and scores, we landed on semantic similarity. We chose it for a couple of reasons: lexical-matching approaches rely on explicit n-gram overlap, which is much more prone to drift over time even when the underlying semantics do not. There are also technical advantages to this choice, such as being able to easily parallelize the embedding computation, and the embeddings themselves are reusable for other tasks.
Calculation of Semantic Similarity
To calculate the semantic similarity between two emails, we rely on SentenceBERT to embed the full text of each email into a shared embedding space. From there, breadth is calculated as the average pairwise cosine similarity between a generated email and all other generated emails, while realism is calculated as the maximum pairwise cosine similarity between a generated email and all real phishing emails.
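As a concrete illustration, the sketch below mirrors that calculation using the sentence-transformers library. The model name and helper functions are illustrative choices for this post rather than our production pipeline.

```python
from sentence_transformers import SentenceTransformer

# Illustrative model choice; any SentenceBERT-style encoder would work here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(emails):
    # Encode full email bodies into unit-normalized embeddings so that the
    # dot product between two rows equals their cosine similarity.
    return model.encode(emails, normalize_embeddings=True)

def diversity_scores(generated_emb):
    # Average pairwise cosine similarity of each generated email against
    # every *other* generated email (lower = more diverse on average).
    sims = generated_emb @ generated_emb.T
    n = sims.shape[0]
    return (sims.sum(axis=1) - 1.0) / (n - 1)  # drop the self-similarity of 1.0

def realism_scores(generated_emb, real_emb):
    # Maximum cosine similarity of each generated email against the real
    # phishing corpus (higher = closer to some real phishing email).
    return (generated_emb @ real_emb.T).max(axis=1)
```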
Currently, we are using a seed dataset drawn from various open-source phishing dumps (SpamAssassin, the Enron dumps, etc.) as our representation of real phishing emails, and we are actively expanding this dataset to include more up-to-date phishing emails.
The above process creates a distribution of diversity and realism scores for this particular sample of generated and real emails. Figure (1) compares the distribution of diversity scores for a random sample of 200 generated emails, produced with varying inputs for target information, pretext, sender information, and attack type, against the distribution of diversity scores for real phishing emails.
Ideally, the generated phishing emails should trend toward the lower end of the average cosine similarity range, since lower values mean the emails are, on average, more dissimilar from one another.
For the realism scores, there is less of a natural baseline, since we only have the one set of generated emails to compare against our real phishing email dataset. That said, Figure (2) shows the distribution of these scores; over time we want them to trend upward, indicating that any given generated phishing email has a close counterpart among real phishing emails.
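For readers who want to reproduce this kind of view, a minimal plotting sketch (assuming matplotlib and the score helpers above) might look like the following; the binning and styling here are arbitrary choices.

```python
import matplotlib.pyplot as plt

def plot_score_distribution(scores_by_label, xlabel):
    # scores_by_label maps a legend label to an array of scores, e.g.
    # {"generated": diversity_scores(gen_emb), "real phishing": ...}.
    for label, scores in scores_by_label.items():
        plt.hist(scores, bins=30, alpha=0.5, density=True, label=label)
    plt.xlabel(xlabel)
    plt.ylabel("density")
    plt.legend()
    plt.show()
```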
Dimensionality Reduction Based Visualization
Another informative way of visualizing these embeddings is through dimensionality reduction-based visualizations. t-SNE and UMAP are common nonlinear dimensionality reduction techniques that tend to preserve local structure, so related emails fall into visible clusters as an added benefit.
We elected to use UMAP in this case because it exposes more tunable hyperparameters. Figure (3) shows the results of running both generated and real phishing emails through the same UMAP transform.
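A rough sketch of that setup with the umap-learn package is shown below. The hyperparameter values, and the choice to fit the reducer on the real corpus before transforming the generated emails, are assumptions for illustration rather than our exact configuration.

```python
import umap

def project_embeddings(generated_emb, real_emb, n_neighbors=15, min_dist=0.1):
    # Fit one reducer so that generated and real emails share the same 2-D
    # space; here it is fit on the real corpus and reused for the generated set.
    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors, min_dist=min_dist)
    real_2d = reducer.fit_transform(real_emb)
    generated_2d = reducer.transform(generated_emb)
    return generated_2d, real_2d
```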
The results help explain the trend we saw above in the raw distributions of diversity scores. Specifically, the generated emails tend to form a cluster that sits as a subset within the real phishing emails. This could be due to a variety of reasons, one being that our pretexts do not overlap with real-world pretexts as much as they should. For example, one of the main clusters in the UMAP visualization is the prototypical Nigerian-prince-style phish, which we have not included as a default pretext in our simulations.
Conclusion
Ultimately, we recognize that our generated emails are meant to be closer to spear phishing, a niche that public phishing email dumps cannot fully capture. As we grow this dataset and expand our capabilities for generation variation, we will continue monitoring these metrics and visualizations to make sure we are moving toward our end goal of increasingly diverse and realistic phishing simulations.