In this mini-project, we aim to develop a system to analyze email interactions of users within organizations. This is to gain insights about usage patterns, discover social interactions within the institution, and understand the impacts of potential vulnerabilities.
Email interactions between 986 users at a research institution for over two years, including sender user ID, receiver user ID, and send time of 332,334 emails. The data is provided in a single file in “txt” format.
Link to the open-source data: https://snap.stanford.edu/data/email-Eu-core-temporal.html
Effectively, you are given the data for a directed temporal weighted graph in the resources
folder.
A graph is a suitable representation of the interaction between different users. However, depending on which question we want to answer, different graph representations might be more suitable.
A rich form of capturing email interactions is by using a directed graph. For this, you should implement a class called DWInteractionGraph
. It represents a directed weighted graph of user interactions within a specific window of time. Assuming two users A and B, directed means emails from A to B should be captured separately from B to A, and weighted means that the number of emails from A to B should be captured. This class should capture email times, as certain analyses would rely on it. An object of this class should be constructed in one of the following ways:
- Using an input file (similar in format to the one provided to you) and a time window (int array of length two:
[StartTimeInSeconds, EndTimeInSeconds]
). Only email interactions that occurred in this window should be stored in the object, where the time window is inclusive. If no time window is provided, all email interactions in the file should be included. - From another
InteractionGraph
object and a time range filter (int array of length two:[StartTimeInSeconds, EndTimeInSeconds]
) Note that the time window for the resulting object would be the intersection of input time range filter and the input object’s. - From another
InteractionGraph
object and an int array containing user IDs. The returningInteractionGraph
should only contain interactions of these users. Note that not all requested users are guaranteed to be in the inputInteractionGraph
object. Interactions should be added to the new graph as long as either the sender or the reciver is in the specified array, or both.
A more simplified format of capturing the interactions is via an undirected graph. The class to be implemented for this is called UDWInteractionGraph
, for an undirected weighted graph of user interactions within a specific time. Assuming two users A and B, undirected means emails from A to B are not differentiated from those sent from B to A, and weighted means that the total number of emails between A to B should be captured. This class should also capture email times, as certain analyses would rely on it. An object of this class should be constructed in one of the following ways:
- Using an input file (similar in format to the one provided to you) and a time window (int array of length two:
[StartTimeInSeconds, EndTimeInSeconds]
). - From another
UDWInteractionGraph
object and a time range filter (int array of length two:[StartTimeInSeconds, EndTimeInSeconds]
) Again, this time window should be checked to fit in the time window associated with the input object, where the time range is inclusive. - From another
UDWInteractionGraph
object and an int array containing user IDs. The returningUDWInteractionGraph
should only contain interactions of these users. Interactions should be added to the new graph as long as either the sender or the reciver is in the specified array, or both. - From an
InteractionGraph
. Here, theInteractionGraph
should be translated into aUDWInteractionGraph
.
Note 1: Part of solving this task is deciding a suitable data structure for holding data of a DWInteractionGraph
or a UDWInteractionGraph
. There are two common ways to represent a graph: using an adjacency matrix, or an adjacency list. Depending on the graph, one could be better than the other in terms of speed or memory usage. Your ability to finish Task 4 and pass tests in a reasonable time may depend on choosing the representation better suited to this problem.
The DWInteractionGraph
class should implement the following public methods:
Method Name | Parameters | Returns |
---|---|---|
ReportActivityInTimeWindow | Time window in seconds: <int, int> Time window is inclusive on both boundaries. |
[NumberOfSenders, NumberOfReceivers, NumberOfEmailTransactions]: <int, int, int> |
ReportOnUser | User ID: <int> |
[NumberOfEmailsSent, NumberOfEmailsReceived, UniqueUsersInteractedWith]: <int, int, int> If the User ID cannot be found in the graph, returns [0,0,0] |
NthMostActiveUser | N: <int> ,interactionType: <SendOrReceive> (‘SendOrReceive.SEND’ or ‘SendOrReceive.RECEIVE’) |
User ID: <int> If two or more User IDs send or recieve the same number of emails, returns the smallest User ID in the most active set. |
The UDWInteractionGraph
class should implement the following public methods:
Method Name | Parameters | Returns |
---|---|---|
ReportActivityInTimeWindow | Time window in seconds:<int, int> |
[NumberOfUsers, NumberOfEmailTransactions]: <int, int> |
ReportOnUser | User ID: <int> |
[NumberOfEmails, UniqueUsersInteractedWith]: <int, int> |
NthMostActiveUser | N: <int> |
User ID: <int> |
The DWInteractionGraph
class should implement the following public methods:
Method Name | Parameters | Returns |
---|---|---|
BFS | User 1 ID <int> ,User 2 ID <int> |
Returns: List <Integer> Using the breadth first search (BFS) algorithm, find whether there exists a path between User 1 and User 2. If such a path exists, this method should return the list of User IDs in the order visited. If no such path exists, the method should return null. When choosing the next adjacent node to vist, choose the node with the smallest User ID first that has not yet been visited. More on BFS: https://en.wikipedia.org/wiki/Breadth-first_search |
DFS | User 1 ID <int> ,User 2 ID <int> |
Returns: List <Integer> Using the depth first search (DFS) algorithm, find whether there exists a path between User 1 and User 2. If such a path exists, this method should return the list of User IDs in the order visited. If no such path exists, the method should return null. When choosing the next adjacent node to vist, choose the node with the smallest User ID first that has not yet been visited. More on BFS: https://en.wikipedia.org/wiki/Breadth-first_search |
The UDWInteractionGraph
should implement the following public methods:
Given that all users belong to the same institution and are being served by the same email server, they are exposed to a somewhat similar set of vulnerabilities. Thus, a hacker can exploit a potential vulnerability to quickly pollute a significant portion of the users.
Let us assume that the hacker can only send one email containing a malicious attachment. This is because sending more external emails might reveal a pattern making it easy for the firewall to detect it. Thus, the attacker is relying on the internal spread of the malware. This means when victim number 1 sends emails to their colleagues, they spread the malware, and so on. The firewall is triggered N hours after receipt of the first email to victim number 1.
Users may be infected and pollute downstream users at the exact same time that the hacker starts the attack. However, if the hacker attacks at time th
, and the firewall triggers n
hours later, users receiving emails at time t >= th + n
will not be polluted.
In hindsight and having access to an email interaction record, one can calculate the maximum number of users that can be polluted in N hours. This is exactly what you should do in this task. You should implement the MaxBreachedUserCount
method in the DWInteractionGraph
class, which takes as input parameter hours (<int>)
and returns the integer (int
) count of maximum polluted users. Note that:
- All interaction times in the input file are in seconds, but the input provided to this function is in hours.
- This problem is equivalent to finding the time for the first email and the right victim number 1, to maximize the number of users receiving malicious emails in N hours.