Transforming Phishing Campaigns with Advanced Language Models - Topic Discovery

Published on May 31, 2024

We previously discussed semantic similarity and how Jericho uses it to track metrics such as the diversity and realism of our generated emails.

Introduction

Semantic similarity is calculated from a vector representation of whole documents, a representation that is instrumental for a variety of natural language processing tasks. Today, we will show that a natural extension of the semantic similarity workflow is performing topic discovery on a corpus of emails.

Topic discovery has traditionally been performed under a bag-of-words assumption: documents are treated as unordered collections of individual words, and contextual information is assumed to play little role. That is why common techniques such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) require transforming the documents into what are functionally term-document count matrices.

From there, LSA uses singular value decomposition and cosine similarities to determine document similarity, while LDA fits a hierarchical Bayesian model with Dirichlet priors over the topic and word distributions to infer topics. While the assumption is relatively naive, these methodologies work reasonably well for generating topics, and many research papers have shown that they scale to large text corpora. For example, in the applied stats/political science space, Martin and McCrain (2018) used LDA to parse news articles and qualitatively pick out 15 topics that were characteristic of local and national news.

 

How Jericho improves pretext automation

As Jericho researched how to improve the experience of our users when crafting campaigns, we found an interesting and largely unexplored opportunity: pretext recommendation via topic discovery.

In a traditional security awareness training program, security analysts write templates based on phishing emails caught in the security gateway or found by trawling the internet. This process is time-intensive and costly to do in-house, and it is difficult for traditional third-party phishing simulation providers to perform at scale.

Our thesis is that this process can be automated by using topic discovery to label clusters of emails in an unsupervised fashion. The topic labels can then be recommended as commonly occurring pretexts at various levels: company-specific, industry-specific, and even global.

The semantic similarity analysis (shown in Figure 1) demonstrated that reasonable clusters could be obtained from the semantic embeddings of each document.
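The embed-then-cluster step can be sketched as follows. As a dependency-light stand-in, this example clusters TF-IDF vectors with KMeans; our actual pipeline clusters transformer-based semantic embeddings (and BERTopic pairs them with UMAP and HDBSCAN), but the shape of the step is the same. The emails and cluster count are toy data.

```python
# Stand-in for the embed-and-cluster step of the pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

emails = [
    "your invoice is attached, please review the payment",
    "payment overdue: settle the attached invoice today",
    "reset your password now, your account was locked",
    "unusual login detected, verify your account password",
]

# In production these vectors would be semantic embeddings of each email.
vectors = TfidfVectorizer(stop_words="english").fit_transform(emails)

# Assign each email to one of two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
```

The cluster labels are the only thing the next stage needs: they tell us which documents to pool together when extracting keywords.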

 

The next step is finding representations of these clusters, ideally using the words present in the documents. As explained above, this is usually done under a bag-of-words assumption, using word-count vectorizers to find similar documents.

However, since we already have clusters, we can instead compute term frequency-inverse document frequency (TF-IDF) at the cluster level to find significant keywords in each cluster. This method was proposed by Grootendorst (2022), who also started BERTopic, an open-source library that abstracts much of this process and allows quick and easy experimentation with the topic discovery pipeline.

Following this basic workflow of embedding the documents, projecting the embeddings to a lower-dimensional space, clustering them, and finally applying class-based TF-IDF (c-TF-IDF) to the word-count-vectorized clusters, we show in Table 1 the topics produced by a simple run on the seed dataset described in the previous post.

 

Some topics seem reasonable, such as row 5, which relates to phishing emails about Citibank security issues. On the whole, however, the topics are not readable as natural language and require human parsing to make sense of.

Integrating advanced language models to improve topic readability 

 

With LLM technology, we may not need a human to do that parsing. Given sufficient information about a document cluster and its keywords, an LLM can generate a good representation of the topic, making it human-readable and useful for downstream LLM tasks.

BERTopic provides an easy-to-use hook at the end of the pipeline to integrate it with various LLMs, and we hooked our pipeline up to a Mistral model we run ourselves. Using a custom prompt to inject the document and keyword information into Mistral, we obtained better representations of the topics while keeping the process relatively fast.
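The prompt-construction side of that hook can be sketched as below. This is a hedged, simplified illustration, not our production prompt: the template text, function name, and example inputs are invented for this post, and the actual call to the Mistral model is omitted. It mirrors BERTopic's convention of filling `[DOCUMENTS]` and `[KEYWORDS]` placeholders with cluster-level evidence.

```python
# Hypothetical prompt builder for LLM-based topic labeling.
PROMPT_TEMPLATE = """You are labeling a cluster of phishing emails.
Representative documents:
[DOCUMENTS]
Cluster keywords: [KEYWORDS]
Reply with a short, human-readable topic label."""

def build_topic_prompt(documents: list[str], keywords: list[str]) -> str:
    """Fill the placeholders with a cluster's documents and c-TF-IDF keywords."""
    doc_block = "\n".join(f"- {d}" for d in documents)
    return (PROMPT_TEMPLATE
            .replace("[DOCUMENTS]", doc_block)
            .replace("[KEYWORDS]", ", ".join(keywords)))

prompt = build_topic_prompt(
    ["Your Citibank account has been suspended...",
     "Verify your Citibank login to restore access..."],
    ["citibank", "account", "verify"],
)
print(prompt)
```

The resulting prompt is what gets sent to the model; the model's short free-text reply becomes the human-readable topic name.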

 

These results can be seen in Table 2, where the topic names are now much more human-readable and could even be passed into our phishing email campaign generation process.

Conclusion

There is still room for exploration, for example experimenting with larger LLMs or different prompting techniques and examining their effects on the final topic representations. Overall, this is a significant step toward pretext automation and recommendation, and it improves the user experience during the phishing campaign generation process.