The system autonomously hunts for insights and pinpoints critical themes within vast amounts of textual data, enabling cyber threat analysts to efficiently identify and focus on the most pertinent information and relationships hidden in threat intelligence collections. The system can score themes based on their importance to the overall corpus and their alignment with a specific intelligence goal.
By combining TF-IDF, Louvain community detection, and PageRank, the system harnesses the strengths of each technique. TF-IDF identifies key terms, Louvain clusters documents based on these terms to discover thematic groups, and PageRank assesses the importance of terms within these clusters, addressing each method's limitations while capitalizing on their advantages.
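The interplay of the three techniques can be sketched as follows. This is an illustrative toy pipeline, not the project's actual implementation: the sample documents, the shared-term edge weighting, and the networkx calls are all assumptions made for demonstration.

```python
# Illustrative toy pipeline, not the project's implementation. The sample
# documents, the shared-term edge weighting, and the networkx calls are
# all assumptions made for demonstration.
import math
from collections import Counter

import networkx as nx

docs = {
    "d1": "email data analysis intelligence cloud",
    "d2": "email cloud storage data statistic",
    "d3": "opinion monitor tweet control public",
    "d4": "public opinion tweet guidance monitor",
}

# 1. TF-IDF: weight each term by its frequency in a document, discounted
#    by how many documents contain it.
tokens = {d: text.split() for d, text in docs.items()}
df = Counter(t for toks in tokens.values() for t in set(toks))
n = len(docs)
tfidf = {
    d: {t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
    for d, toks in tokens.items()
}

# 2. Link documents that share discriminative (nonzero-TF-IDF) terms.
key_terms = {d: {t for t, w in scores.items() if w > 0} for d, scores in tfidf.items()}
G = nx.Graph()
for a in docs:
    for b in docs:
        shared = key_terms[a] & key_terms[b]
        if a < b and shared:
            G.add_edge(a, b, weight=len(shared))

# 3. Louvain groups the documents into thematic communities.
communities = nx.community.louvain_communities(G, weight="weight", seed=42)

# 4. PageRank scores importance within the resulting graph.
ranks = nx.pagerank(G, weight="weight")
print(communities)  # two clusters: an email/data theme and a public-opinion theme
```

On this toy corpus, Louvain separates the email/data documents from the public-opinion documents, and PageRank then provides an importance score for each node within the graph.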
Utilizing graph theory in NLP, specifically through the application of the Louvain method, provides a powerful way to visualize and analyze the structural relationships within text data. This approach not only groups related content but also reveals the network dynamics of terms and documents, uncovering deeper insights into textual relationships that traditional methods might overlook.
The system's incorporation of large language models (LLMs) enhances semantic analysis significantly, enabling deeper and more nuanced interpretations of text. It leverages these models to hypothesize about thematic concepts potentially present within the data. This capability facilitates targeted, in-depth analyses, guiding further exploration and understanding of underlying themes.
The system employs a sophisticated embedding strategy that leverages custom vector spaces for enhanced semantic analysis. It calculates the centroid of embeddings from a manually curated list of conceptually significant words, creating a specialized vector space that encapsulates the thematic core of those concepts. By measuring the semantic proximity of document contents to this custom centroid, the system quantitatively evaluates how closely each text aligns with the user's topics of interest. Although term selection for the list is currently manual, the LLM integration in the previous step significantly aids in identifying key thematic elements within the corpus, making term-list creation easy. This use of custom embedding spaces precisely pinpoints where the themes identified by the system manifest across the document corpus, enhancing both the efficiency and depth of textual analysis.
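A minimal sketch of the centroid-and-threshold idea, assuming toy 3-dimensional vectors in place of a real embedding model (the actual system embeds terms and documents with a learned model; the vectors and document names below are invented for illustration):

```python
# Toy sketch of centroid-based concept matching. In the real system the
# vectors come from an embedding model; these 3-d values are assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embeddings for a curated concept list (hypothetical values).
concept_vectors = np.array([
    [0.9, 0.1, 0.0],   # e.g. "machine learning"
    [0.8, 0.2, 0.1],   # e.g. "neural networks"
    [0.7, 0.3, 0.0],   # e.g. "deep learning"
])
centroid = concept_vectors.mean(axis=0)  # thematic core of the concept list

# Score each document embedding against the concept centroid.
doc_vectors = {
    "doc_a": np.array([0.85, 0.15, 0.05]),  # on-topic
    "doc_b": np.array([0.05, 0.10, 0.95]),  # off-topic
}
sensitivity = 0.59  # cosine threshold, as in config.json
scores = {d: cosine(v, centroid) for d, v in doc_vectors.items()}
hits = [d for d, s in scores.items() if s >= sensitivity]
print(hits)  # ['doc_a']
```

Only documents whose cosine similarity to the centroid clears the configured sensitivity are reported as matches.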
{
"concept": "ai",
"sensitivity": 0.59,
"raw_data": "Data/ISOON",
"graph_name": "neo4j",
"loader_type": "text_file",
"system_action": "profile",
"delete_graph": "del",
"analysis_type": "llm"
}
The config.json file allows you to tailor the TI Cyber Hunter system to your specific analytical needs. Below is a breakdown of each configuration setting:
- "concept": "ai": Names the list of terms used to create a custom embedding space, aligned with a particular collection goal.
- "sensitivity": 0.59: Sets the cosine similarity threshold for matching text in the corpus to the custom embedding space, determining how closely text must align with the specified concept.
- "raw_data": "Data/ISOON": Defines the directory containing the input data for analysis.
- "graph_name": "neo4j": Designates the Neo4j database instance used to store and analyze the data (here, the database named neo4j).
- "loader_type": "text_file": Selects the data loader; here it is configured for text documents, with alternative options available for formats such as PDFs.
- "system_action": "profile": When set, the system searches for overarching themes within the corpus rather than focusing solely on text that aligns closely with the custom embedding space.
- "delete_graph": "del": Deletes all existing data within the specified graph to ensure a fresh analysis. Use this option cautiously, primarily when profiling.
- "analysis_type": "llm": Determines the depth of analysis. Setting this to "llm" engages the Large Language Model (Mixtral 8x7B-HF) to hypothesize about identified themes, which can be resource-intensive and impact processing time. Alternatively, users can simply retrieve the raw words associated with each theme for manual interpretation.
These settings collectively enable precise control over how the system processes and analyzes cyber intelligence data, making it adaptable to various investigative needs.
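As a rough illustration of how a driver script such as main.py might consume these settings (the dispatch logic below is an assumption for illustration; only the keys and values come from the sample config above):

```python
# Hypothetical sketch of reading config.json and dispatching on its settings.
# Only the keys/values come from the guide; the branching logic is assumed.
import json

raw = '''{
  "concept": "ai",
  "sensitivity": 0.59,
  "raw_data": "Data/ISOON",
  "graph_name": "neo4j",
  "loader_type": "text_file",
  "system_action": "profile",
  "delete_graph": "del",
  "analysis_type": "llm"
}'''
cfg = json.loads(raw)

if cfg["delete_graph"] == "del":
    print(f"wiping graph '{cfg['graph_name']}' for a fresh run")
if cfg["system_action"] == "profile":
    print("profiling corpus for overarching themes")
elif cfg["concept"]:
    print(f"matching against concept list '{cfg['concept']}' "
          f"at sensitivity {cfg['sensitivity']}")
```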
This guide will walk you through the basic configuration and usage of TI Cyber Hunter. In the example below, we demonstrate how to use the system to analyze the translated ISOON leak. Our focus will be on identifying key topics and concepts within the corpus, followed by an example of how to drill down into documents representing key areas of interest.
Here we will profile the data to determine whether there are any underlying concepts that interest us, and assess their overall importance to the corpus as a whole.
{
"concept": "",
"sensitivity": 0.0,
"raw_data": "Data/ISOON",
"graph_name": "neo4j",
"loader_type": "text_file",
"system_action": "profile",
"delete_graph": "del",
"analysis_type": "llm"
}
After selecting our profiling configurations, we execute main.py.
The output is a JSON file named possible_concepts.json, which contains 39 potentially interesting concepts identified within the corpus. Below is a sample of two particularly intriguing objects that discuss China's adoption of advanced data science methods for email and social media monitoring. Note also that the importance value denotes the importance of that topic relative to the corpus as a whole.
{
"community": "2001",
"words": "analysis, email, data, intelligence, application, layer, decision, make, number, storage, source, phone, cloud, create, statistic",
"analysis": "Yes, the words listed above could represent the concept of \"data analysis and decision-making using technology and cloud-based services,\" including the processes of gathering data from various sources, storing it, analyzing it, and using it to make informed decisions, potentially with the aid of artificial intelligence and application software.",
"importance": "59.551346709134016"
},
Truncated...
{
"community": "986",
"words": "control, account, public, opinion, deployment, parameter, mode, guidance, group, domain, applicable, monitor, center, tweet, advantage",
"analysis": "Yes, the words in this list could represent the concept of \"the monitoring and control of public opinion or discourse, particularly in a digital context, with a focus on parameters, modes, and guidance, and the potential advantages and application of such a system.\" This concept could involve social media platforms like Twitter, centers for data analysis, and the use of parameters to control and monitor public opinion or discourse within a specific domain or group.",
"importance": "44.484950068328914"
},
Truncated...
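The analysis field in each object is produced by asking the LLM whether a community's top terms form a coherent concept. The following prompt construction is a hypothetical reconstruction; the exact wording sent to Mixtral 8x7B-HF is not shown in this guide.

```python
# Hypothetical prompt construction for the per-community LLM check.
# The word list comes from community 2001 above; the prompt wording
# is an assumption, not the project's actual prompt.
words = ("analysis, email, data, intelligence, application, layer, "
         "decision, make, number, storage, source, phone, cloud, "
         "create, statistic")

prompt = (
    "Below is a list of words extracted from one thematic community of a "
    "document corpus. Could these words, taken together, represent a single "
    f"concept? If so, describe it briefly.\n\nWords: {words}"
)
print(prompt)
```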
Now that we've identified the themes available within the corpus, let's proceed to define a custom list of terms from which to create an embedding centroid. This will enable us to precisely locate where these themes appear in the corpus, identifying the specific documents they inhabit. Below is an example of the list of terms we will use to target documents discussing China's adoption of advanced AI. The full list can be viewed in the project folder.
- Deep Learning
- Artificial Intelligence
- Intelligent
- Machine Learning
- Neural Networks
- Clustering
- Dimensionality Reduction
- Feature Engineering
- Model
- Data Preprocessing
- Predictive Modeling
- Classification
- Truncated...
We will define our concept list name and set the cosine similarity value to adjust the sensitivity of the matching mechanism. We will also remove the profile, del, and llm options, as they were only needed for the previous step. It's important to note that while this demo illustrates a logical progression, that progression is not required: users might already have a library of predefined concept lists that they wish to apply directly to a corpus without first profiling it for themes.
{
"concept": "ai",
"sensitivity": 0.59,
"raw_data": "Data/ISOON",
"graph_name": "neo4j",
"loader_type": "text_file",
"system_action": "",
"delete_graph": "",
"analysis_type": ""
}
With the new configuration in place, we execute main.py again.
The output is a JSON file named ai_concept.json, which records which documents hold our data and how closely each aligns with our collection goal (peakRatings). The system sorts the results within the JSON in descending order of alignment. Below is the first part of the results:
{
"w.value": "relationship",
"maxScore": 1.0,
"peakRatings": [
0.6972822085618371,
0.6708208166356886,
0.6283626351796582
],
"contents": [
"XXX...",
"XXX...",
"XXX..."
],
"doc_name": [
"fe245192-1f9c-4f28-9b32-046fb7ce7e1e_5.txt",
"fe245192-1f9c-4f28-9b32-046fb7ce7e1e_8.txt",
"fe245192-1f9c-4f28-9b32-046fb7ce7e1e_3.txt"
]
},
Truncated...
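These result objects can be re-ranked programmatically by their highest peak rating, mirroring the descending order in ai_concept.json. The sample values below are hypothetical; the field names match the output above.

```python
# Ranking result objects by their highest peakRating. Field names come
# from the ai_concept.json sample; the values here are made up.
import json

results = json.loads('''[
  {"doc_name": ["low.txt"],  "peakRatings": [0.61, 0.55]},
  {"doc_name": ["high.txt"], "peakRatings": [0.70, 0.67]}
]''')

ranked = sorted(results, key=lambda r: max(r["peakRatings"]), reverse=True)
top_doc = ranked[0]["doc_name"][0]
print(top_doc)  # high.txt
```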
The documents most aligned with our collection goal (China's adoption of AI/ML) appear at the top of the JSON file. Interestingly, this lets us evaluate the system's effectiveness at surfacing content that aligns with human-assessed interest: HarFangLab's analysts identified the very same documents as among the most intriguing in the leak, matching the first two documents identified by our system.
For a foundational course on offensive security, including an introductory module I authored on the cybersecurity data science concepts discussed here, please visit my colleague's training offering: cyberdagger.com