tudelft-cda-lab / sage

[TDSC 2021] IntruSion alert-driven Attack Graph Extractor. https://ieeexplore.ieee.org/document/9557854

License: MIT License
Make the code cleaner

Currently, all alert signatures and mappings are hard-coded into `sage.py`, which makes the file unnecessarily large. Extract the alert signatures and mappings into separate files and read/import them at the beginning of the program.
Determine how to represent vertices in AGs that come from sink states. If we assign |Sink, then all sinks of that type will be merged into one node (good for simplicity, bad for readability). However, state IDs salvaged from sinks might be misleading for analysts (think different context). In either case, vertices related to sink states MUST ALWAYS have a dotted border.
Currently, parsing of the input arguments is done manually, which is a bit cumbersome. Furthermore, another optional parameter has to be added for the dataset name. This is needed because some if-checks in the code are CPTC-specific (see issue #24).

- Use `ArgumentParser` (similar to SECLEDS)
- Add a `dataset_name` option (options = {"cptc", "other"}, default value = "other")
- Update the `docker` branch accordingly (only the new options have to be added to the `script.sh` and `input.ini` files, no other changes are necessary, i.e. the `input.ini` file remains only on the `docker` branch, `spdfa-config.ini` remains only on the `main` branch, and argument parsing remains in `sage.py`, with the exception that default values of optional parameters will be added to the `script.sh` file)
- This change affects both the `main` and `docker` branches
In the current implementation, when computing the most targeted service, the first most frequent service is taken, so that the result is deterministic (see PR #10).
On the other hand, there are "unknown" services, which are used when SAGE cannot infer the service from the IANA port mapping. A potential improvement to the tie-breaker would be to explicitly not choose "unknown" as the most targeted service in case of a tie, or to add a small margin (for example, if `http` has a count of 3 and `unknown` has a count of 4, then `http` can still be used). This way a security analyst might get better insights from the AGs, since a specific service can reveal more information than an "unknown" service.
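Such a tie-breaker could be sketched as follows. The function name `most_targeted_service` and the `margin` parameter are hypothetical, not part of SAGE:

```python
from collections import Counter

def most_targeted_service(services, margin=1):
    """Pick the most frequent service, preferring a named service over
    'unknown' when the counts are tied or within `margin`.
    Hypothetical helper; the margin value is an assumption."""
    counts = Counter(services)
    # Sort by count descending, then by name, so the result is deterministic
    # even for equal counts.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_name, top_count = ranked[0]
    if top_name == "unknown":
        # Fall back to the best named service if it is close enough.
        for name, count in ranked[1:]:
            if name != "unknown" and top_count - count <= margin:
                return name
    return top_name
```

With `http` at 3 and `unknown` at 4 and a margin of 1, this returns `http`, matching the example above.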
Add a provision to limit the number of alerts available for making AGs.
What happens if the input folder contains files with alternating order of alerts? Currently the {mode} applies to all files in a folder. Make it more flexible.
Hi,
Thanks for the source code. Is it possible to provide a sample input file? I tried some .json files from http://cptc.rit.edu/2018/t2/events/, but I guess it does not work for all the .json files there.
Implement support for partial paths: for a victim X, all paths that have reached an intermediate vertex Y.
Currently, SAGE is one file, `sage.py`, which is over 1k lines of code. The largest part of the file consists of functions, and only at the very end is the actual main part. A better approach might be to split `sage.py` into separate files, as was also done in SECLEDS:

- `sage.py` with the main part, alert parsing and global parameters
- `plotting.py` with the functions that are related to plotting (including `make_state_groups`)
- `episode_sequence_generation.py` (from making hyperalert sequences to trace generation, i.e. from `aggregate_into_episodes` until `generate_traces`, excluding)
- `model_learning.py` (from `generate_traces` until `make_state_sequences` and `group_episodes_per_av`, including; the code in `group_episodes_per_av` can go into the `make_state_sequences` function, since it just makes the state sequences on an attacker or victim level)
- `ag_generation.py` (converting state sequences into AGs, i.e. `make_attack_graphs` and the related functions)

The `docker` branch will be updated accordingly to make sure that all the files are copied.

Hello, is the CPTC dataset in this experiment under this link? I wonder which of these files I should download, just like the sample-input.json you gave. Thanks very much!
Sometimes there are attackers which generate 99% of the alerts; in the code these are called `bad_ip` and are skipped. Furthermore, there are alerts that occur way too often and could be filtered if necessary (see below). The `bad_ip` might be dataset-specific, and the checks for "Attempted Information Leak" and "Non Suspicious Traffic" might be needed only in the case of `bad_ip`.
The `_remove_duplicate` method checks for `NON_MALICIOUS` traffic; however, the former is a SURICATA category, while the latter is part of the `MicroAttackStage` framework. Fix the `_parse` function accordingly.

The code snippet below (part of the `traverse` function) does not work in the intended way. As mentioned here, a `defaultdict` in Python creates a new element when a key that is not present in the dict is accessed. When the ID is removed from a low-severity sink and there are more states in the trace, `state_list[-1]` will be -1, which will be queried in the `sinks` dictionary (which is a `defaultdict`). This will lead to -1 being added to the (global) `sinks` (`sinks_model`) variable, where it should not be (otherwise states with ID -1 will effectively become sinks).
This was the issue for the CCDC dataset. In the image below, there is a (still reversed) sequence where `netDOS` follows `vulnD`. Because the ID is removed from `vulnD` (as it is a low-severity node), the next transition is from state -1, which cannot be correct, as there are no nodes in the S-PDFA with ID -1 (except for one dummy node, but that is not a problem here). The states in the AGs are essentially the same, except for this -1 ID on some states.

The IDs can be stored in a separate list here (for example, `transitions_list`), which is used only for transitions. The original list (`state_list`) will keep the -1s and will be returned.
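The pitfall and the proposed fix can be illustrated in isolation. This is a minimal sketch, not SAGE's actual `traverse` code; `sinks` stands in for the global `sinks_model`:

```python
from collections import defaultdict

# Hypothetical stand-in for the global sinks_model variable.
sinks = defaultdict(set)

# Buggy pattern: merely *reading* a missing key inserts it.
state_list = [3, -1]          # ID stripped from a low-severity sink becomes -1
_ = sinks[state_list[-1]]     # defaultdict silently creates sinks[-1]
assert -1 in sinks            # -1 now looks like a sink state

# Proposed fix: keep the IDs used for transitions in a separate list,
# so the defaultdict is only queried with real state IDs.
sinks.clear()
transitions_list = [s for s in state_list if s != -1]
for state in transitions_list:
    _ = sinks[state]          # only real IDs reach the defaultdict
assert -1 not in sinks        # state_list itself still keeps its -1s
```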
Some episode subsequences start with a high-severity episode. The viable cuts for subsequences are: [med, low], [high, low], and [high, med]. It could happen that episode subsequences start with a high-severity episode if the attackers did actually start with a high-severity alert, and we even claim that the alert-driven AGs have this as a special property. However, we need to make sure this artefact does not exist because of the sequence cutting.

The problem with the image above happens because of the line `if pieces < 1:` (in `break_into_subbehaviours`). The subsequence above had length 3, while in the `break_into_subbehaviours` function `cut_length = 4`. As a result, the line `pieces = math.floor(len(episodes) / cut_length)` evaluates to `pieces = 0`, and `if pieces < 1` to True, because the corresponding episode sequence has length 3. Hence, the entire subsequence of length 3 is added to the episode subsequences.

Ideally, we want the sequence to be cut entirely based on severity, not on the `cut_length` basis.
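A severity-only cutting rule could look like the sketch below (an assumption, not the current implementation): start a new subsequence whenever the severity drops, which produces exactly the cuts [med, low], [high, low] and [high, med]:

```python
# Severity levels; names are illustrative, not SAGE's constants.
LOW, MED, HIGH = 0, 1, 2

def cut_by_severity(severities):
    """Cut an episode sequence purely on severity: a drop in severity
    starts a new subsequence, with no fixed cut_length involved."""
    subsequences, current = [], []
    for sev in severities:
        if current and sev < current[-1]:  # severity dropped: cut here
            subsequences.append(current)
            current = []
        current.append(sev)
    if current:
        subsequences.append(current)
    return subsequences
```

Note that with this rule no subsequence after the first cut can start with a high-severity episode unless the preceding episode was even higher, and a trailing low-severity episode is kept as its own subsequence rather than discarded.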
In CPTC-2017 and CPTC-2018 datasets, attacker IPs are known, however this might not be the case for other datasets (e.g. CCDC). Because of that, some parts of the code that are CPTC-specific have to be commented out when using, for example, CCDC dataset.
For example:
Furthermore, the code snippet above is executed after learning the S-PDFA, which is too late.
Move this check from `make_state_sequences` into `group_alerts_per_team` (in `sage.py`):

- Check whether the attacker IP is in `src_ip` or in `dst_ip`; if not present, then discard the alert
- If it is in `src_ip`, then add `(src_ip, dst_ip)`; if it is in `dst_ip`, then add `(dst_ip, src_ip)`
- Remove this check from the `make_state_sequences` function

For the future, we might want to address internal paths (leave this as a TODO).
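The proposed check could be sketched like this. The helper name and the alert representation as dicts with `src_ip`/`dst_ip` keys are assumptions for illustration:

```python
def group_alerts_by_attacker(alerts, attacker_ips):
    """Keep only alerts that involve a known attacker IP and orient
    each pair as (attacker, victim). Hypothetical helper, not SAGE's
    actual group_alerts_per_team signature."""
    pairs = []
    for alert in alerts:
        src, dst = alert["src_ip"], alert["dst_ip"]
        if src in attacker_ips:
            pairs.append((src, dst))   # attacker initiated the connection
        elif dst in attacker_ips:
            pairs.append((dst, src))   # attacker is on the receiving side
        # neither IP is a known attacker: discard the alert
    return pairs
```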
`bad_ip` can be renamed to `cptc_bad_ip`. Furthermore, add a specific flag for the dataset (an enum or a string) and add this flag to the if-check, so that it is triggered only for the CPTC dataset. In PR #35, `ArgumentParser` will be used to parse this option or set the default one.

UPDATE: PR #35 has already added the `--dataset` option. In this PR, the option only has to be added to the correct places.
How can alert-driven AGs be used to predict future attacks?
Currently, the method `load_IANA_mapping` throws an error when a time-out occurs. Send another request when a time-out happens. In either case, the user should not have to deal with this error, so it has to be hidden from the user.
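A retry wrapper along these lines would hide transient time-outs from the user. This is a sketch: `fetch` stands in for the actual IANA download and the helper name is made up:

```python
def fetch_with_retry(fetch, retries=3):
    """Call `fetch` (standing in for the IANA port-mapping download)
    up to `retries` times, retrying on time-outs; the user only sees
    an error if every attempt fails. Hypothetical helper."""
    last_error = None
    for _ in range(retries):
        try:
            return fetch()
        except TimeoutError as err:   # swallow transient time-outs
            last_error = err
    raise RuntimeError(f"download failed after {retries} attempts") from last_error
```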
Currently, SAGE does not have any tests. Below are some ideas for potential test cases.
Keep the "ground truth" version of the attack graphs (the dot files). Before every merge to `main`, run the new implementation and compare the AGs to the "ground truth". An example using the scripts from my Research Project repository:
In this test file, AGs are compared with the "ground truth" based on the nodes and edges present (`diff-ags.sh`) and node stats (`stats-nodes-ags.sh`), and the episode traces passed to FlexFringe are also compared with the "ground truth".
This would be very useful for PRs related to refactoring and changing minor parts of the code that do not affect the resulting AGs.
The disadvantage, however, is that the "ground truth" files will have to be updated if the AGs change. On the other hand, the graphs can be compared to the latest SAGE version on the `main` branch.
As mentioned in #14, sinks in AGs can be compared with the sinks in the `json` files generated by FlexFringe. Here no "ground truth" is necessary, as the FlexFringe output (the S-PDFA) is taken as the "ground truth".
# All found sinks in 2017 are indeed sinks (after all fixes)
[jegor@arch SAGE-fork]$ sinks_after_2017=$(find after-2017AGs/ -type f -name '*.dot' | xargs grep -F -l "dotted" | xargs gvpr 'N [ $.style == "dotted" || $.style == "filled,dotted" || $.style == "dotted,filled" ] { print(gsub(gsub($.name, "\r"), "\n", " | ")); }' | sort -u | uniq -i)
[jegor@arch SAGE-fork]$ all_sinks_2017=$(jq '.nodes[] | select(.issink==1) | .id' before-2017.txt.ff.finalsinks.json | sort)
[jegor@arch SAGE-fork]$ echo -e "$sinks_after_2017" | wc -l
141
[jegor@arch SAGE-fork]$ comm -12 <(echo -e "$sinks_after_2017" | sed 's/^.*ID: \([0-9-]\+\)$/\1/' | sort) <(echo -e "$all_sinks_2017") | wc -l
141
# All found sinks in 2018 are indeed sinks (after all fixes)
[jegor@arch SAGE-fork]$ sinks_after_2018=$(find after-2018AGs/ -type f -name '*.dot' | xargs grep -F -l "dotted" | xargs gvpr 'N [ $.style == "dotted" || $.style == "filled,dotted" || $.style == "dotted,filled" ] { print(gsub(gsub($.name, "\r"), "\n", " | ")); }' | sort -u | uniq -i)
[jegor@arch SAGE-fork]$ all_sinks_2018=$(jq '.nodes[] | select(.issink==1) | .id' before-2018.txt.ff.finalsinks.json | sort)
[jegor@arch SAGE-fork]$ echo -e "$sinks_after_2018" | wc -l
104
[jegor@arch SAGE-fork]$ comm -12 <(echo -e "$sinks_after_2018" | sed 's/^.*ID: \([0-9-]\+\)$/\1/' | sort) <(echo -e "$all_sinks_2018") | wc -l
104
# All non-sinks with IDs in 2017 are indeed non-sinks (after all fixes)
[jegor@arch SAGE-fork]$ non_sinks_with_ids_after_2017=$(find after-2017AGs/ -type f -name '*.dot' | xargs gvpr 'N [ $.style != "dotted" && $.style != "filled,dotted" && $.style != "dotted,filled" ] { print(gsub(gsub($.name, "\r"), "\n", " | ")); }' | sort -u | uniq -i | grep 'ID: ')
[jegor@arch SAGE-fork]$ echo -e "$non_sinks_with_ids_after_2017" | wc -l
28
[jegor@arch SAGE-fork]$ all_non_sinks_after_2017=$(jq '.nodes[] | select(.issink==0) | .id' after-2017.txt.ff.final.json | sort -u)
[jegor@arch SAGE-fork]$ comm -12 <(echo -e "$non_sinks_with_ids_after_2017" | sed 's/^.*ID: \([0-9-]\+\)$/\1/' | sort -u) <(echo -e "$all_non_sinks_after_2017") | wc -l
28
# All non-sinks with IDs in 2018 are indeed non-sinks (after all fixes)
[jegor@arch SAGE-fork]$ non_sinks_with_ids_after_2018=$(find after-2018AGs/ -type f -name '*.dot' | xargs gvpr 'N [ $.style != "dotted" && $.style != "filled,dotted" && $.style != "dotted,filled" ] { print(gsub(gsub($.name, "\r"), "\n", " | ")); }' | sort -u | uniq -i | grep 'ID: ')
[jegor@arch SAGE-fork]$ echo -e "$non_sinks_with_ids_after_2018" | wc -l
16
[jegor@arch SAGE-fork]$ all_non_sinks_after_2018=$(jq '.nodes[] | select(.issink==0) | .id' after-2018.txt.ff.final.json | sort -u)
[jegor@arch SAGE-fork]$ comm -12 <(echo -e "$non_sinks_with_ids_after_2018" | sed 's/^.*ID: \([0-9-]\+\)$/\1/' | sort -u) <(echo -e "$all_non_sinks_after_2018") | wc -l
16
The `_get_episodes` method has test cases for episode creation that are commented out. A better solution would be to move these commented tests into a separate test file.
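Such a test file might look like the following skeleton. The grouping rule and the expected values are placeholders, not the real logic of `_get_episodes`:

```python
import unittest

class TestEpisodeCreation(unittest.TestCase):
    """Skeleton for a standalone episode-creation test file; replace the
    placeholder rule below with real calls once the commented-out cases
    are migrated."""

    def test_alerts_close_in_time_form_one_episode(self):
        # Placeholder rule: alerts within 5 time units of the previous
        # alert belong to the same episode.
        alerts = [(0, "scan"), (1, "scan"), (10, "exploit")]
        episodes, current = [], [alerts[0]]
        for alert in alerts[1:]:
            if alert[0] - current[-1][0] <= 5:
                current.append(alert)
            else:
                episodes.append(current)
                current = [alert]
        episodes.append(current)
        self.assertEqual(len(episodes), 2)

if __name__ == "__main__":
    unittest.main()
```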
Cutting episode sequences into episode subsequences could also potentially be tested.
Depending on whether the implementation allows it, episode sequences could be compared with state sequences for consistency.
The `break_into_subbehaviors` function, which is responsible for cutting episode sequences into episode subsequences, discards short subsequences, for example:

This check occurs multiple times within the function, and it is also present in the `generate_traces` function. Ideally, it should be in only one place (in `generate_traces`).
Furthermore, when splitting, if a sequence goes like [low, low, medium, high, low], then [low, low, medium, high] is saved but the last [low] is simply discarded. We probably should not lose alerts like this. On the other hand, it is not clear what to do with a single event either. Maybe we keep them regardless?
Move the check into the `generate_traces` function and update the `break_into_subbehaviours` function accordingly. The resulting attack graphs should be the same as before.
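Centralizing that length check could be as simple as this sketch; the function and threshold names are made up:

```python
MIN_TRACE_LEN = 3  # assumed threshold; the real value lives in sage.py

def filter_short_traces(subsequences):
    """Single place (conceptually inside generate_traces) where short
    episode subsequences are dropped, instead of repeating the length
    check inside break_into_subbehaviours. Hypothetical sketch."""
    return [seq for seq in subsequences if len(seq) >= MIN_TRACE_LEN]
```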