sage's People

Contributors

azqanadeem, jzelenjak, opreacristian2002, smzvandenbroec

sage's Issues

Move alert signatures into separate files

Description

Currently, all alert signatures and mappings are hard-coded into sage.py. This makes the file unnecessarily large.

Proposed solution

Extract alert signatures and mappings into separate files and read/import them at the beginning of a program.
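
A minimal sketch of the loading step, assuming the mappings are stored as a JSON object. The function name `load_signatures`, the file name, and the file layout are hypothetical and not taken from sage.py.

```python
import json
from pathlib import Path


def load_signatures(path):
    """Load a signature-to-label mapping from a JSON file (hypothetical layout)."""
    with Path(path).open(encoding="utf-8") as handle:
        mapping = json.load(handle)
    # Fail early if the file does not hold a key/value mapping
    if not isinstance(mapping, dict):
        raise ValueError(f"{path} must contain a JSON object")
    return mapping
```

With this in place, sage.py could call something like `load_signatures("signatures.json")` once at start-up instead of defining the dictionaries inline.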

State identifiers for sink states: |Sink or state IDs?

Determine how to represent vertices in AGs that come from sink states. If we do assign |Sink, then all sinks of that type will be merged in one node (cool for simplicity, bad for readability). However, state IDs taken from sinks (salvaged) might be misleading for analysts (think different context). In either case, vertices related to sink states MUST ALWAYS have a dotted border.

Add ArgumentParser to SAGE

Description

Currently, the input arguments are parsed manually, which is a bit cumbersome.

image

Furthermore, another optional parameter has to be added for the dataset name. This is needed because some if-checks in the code are CPTC-specific (see issue #24).

Proposed solution

  1. Replace the manual parsing with ArgumentParser (similar to SECLEDS)
  2. Add one more option: dataset_name (options = {"cptc", "other"}, default value = "other")
  3. Optionally, add another optional parameter to not remove the dot files
  4. Update the docker branch accordingly (only the new options have to be added to the script.sh and input.ini files, no other changes are necessary, i.e. the input.ini file remains only on the docker branch, spdfa-config.ini remains only on the main branch, argument parsing remains in the sage.py, with the exception that default values of optional parameters will be added to the script.sh file).
  5. Update the documentation for the main and docker branches
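
The two new options from the list above could be declared roughly as follows. This is a sketch only: the existing positional arguments are omitted because they are not listed in this issue, and the flag name `--keep-files` is an assumption.

```python
import argparse


def build_parser():
    """Build a parser with only the newly proposed options (sketch)."""
    parser = argparse.ArgumentParser(description="Sketch of the proposed SAGE options")
    # Dataset name; some checks in the code are CPTC-specific
    parser.add_argument("--dataset", choices=["cptc", "other"], default="other",
                        help="name of the dataset (default: other)")
    # Hypothetical flag name for "do not remove the dot files"
    parser.add_argument("--keep-files", action="store_true",
                        help="do not remove the generated dot files")
    return parser


# Parsing an explicit argument list, as the command line would supply it
args = build_parser().parse_args(["--dataset", "cptc"])
```

Here `args.dataset` holds the chosen dataset name and `args.keep_files` is a boolean that stays `False` unless the flag is given.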

A better tie-breaker for the most targeted service

In the current implementation, when computing the most targeted service, the first most frequent service is taken, so that the result is deterministic (see PR #10).

image

On the other hand, there are "unknown" services, which are used when SAGE cannot infer the service based on IANA port-mapping.

A potential improvement to the tie-breaker might be to explicitly not choose "unknown" as the most targeted service in case of a tie, or to add a small margin (for example, if http has a count of 3 and unknown has a count of 4, then http can still be used). This way a security analyst might get better insights from the AGs since a specific service might reveal more information than an "unknown" service.
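
One possible shape of such a tie-breaker is sketched below. The function name `most_targeted_service` and the `margin` parameter are hypothetical; the sketch prefers a named service whenever "unknown" leads by at most `margin` occurrences.

```python
from collections import Counter


def most_targeted_service(services, margin=1):
    """Pick the most frequent service, avoiding "unknown" within a small margin."""
    counts = Counter(services)
    if not counts:
        return "unknown"
    # most_common() orders equal counts by first occurrence, so this stays deterministic
    ranked = counts.most_common()
    best, best_count = ranked[0]
    if best != "unknown":
        return best
    # "unknown" is on top: fall back to the next service if it is close enough
    for service, count in ranked[1:]:
        if best_count - count <= margin:
            return service
    return best
```

With `margin=1`, the example from the text (http with a count of 3, unknown with a count of 4) yields http, while a larger gap still yields "unknown".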

Folder with files of different modes

What happens if the input folder contains files with alternating order of alerts? Currently the {mode} applies to all files in a folder. Make it more flexible.

Support for partial paths

Implement support for partial paths: for a victim X, all paths that have reached an intermediate vertex Y.

Split sage.py into separate files

Description

Currently, SAGE is one file sage.py, which is over 1k lines of code. The largest part of the file consists of the functions, and only at the very end there is the actual main part. A better approach might be to split sage.py into separate files, as it was also done in SECLEDS.

Proposed solution

  1. The following files will be present in the repository:
     • sage.py with the main part, alert parsing and global parameters
     • plotting.py with the functions that are related to plotting (including make_state_groups)
     • episode_sequence_generation.py (from making hyperalert sequences to trace generation, i.e. from aggregate_into_episodes until generate_traces excluding)
     • model_learning.py (from generate_traces until make_state_sequences and group_episodes_per_av including; the code in group_episodes_per_av can go to make_state_sequences function, since it just makes the state sequences on an attacker or victim level)
     • ag_generation.py (converting state sequences into AGs, i.e. make_attack_graphs and the related functions)
  2. Furthermore, the global parameters will become function parameters wherever applicable.
  3. Finally, the docker branch will be updated accordingly to make sure that all the files are copied.

Structure the filtering part when parsing the alerts

Description

Sometimes, there are attackers which generate 99% of the alerts, which in the code are called bad_ip and are skipped. Furthermore, there are alerts that occur way too often and could be filtered, if necessary (see below).

image
image
image

The bad_ip might be dataset-specific, and the checks for "Attempted Information Leak" and "Not Suspicious Traffic" might be needed only in case of bad_ip.

Proposed Solution

  1. Check bad IPs for CPTC/CCDC and decide on how to proceed
  2. Check what happens if we remove the check for "Not Suspicious Traffic". The _remove_duplicate method checks for NON_MALICIOUS traffic; however, the former is a Suricata category, while the latter is part of the MicroAttackStage framework
  3. Update the _parse function accordingly

Discarding IDs from low-severity sinks loses transitions

Description

The code snippet below (part of the traverse function) does not work as intended. As mentioned here, a defaultdict in Python creates a new element when a key that is not present in the dict is accessed. When the ID is removed from a low-severity sink and there are more states in the trace, state_list[-1] will be -1, which will then be looked up in the sinks dictionary (which is a defaultdict). This leads to -1 being added to the (global) sinks (sinks_model) variable, where it should not be (otherwise states with ID -1 will effectively be treated as sinks).

image

This was the issue for the CCDC dataset. In the image below, there is a (still reversed) sequence where netDOS follows vulnD. Because the ID is removed from vulnD (as it is a low-severity node), the next transition is from state -1, which cannot be correct as there are no nodes in the S-PDFA with ID -1 (except for one dummy node, but that's not a problem here).

image

The states in the AGs are essentially the same, except for this -1 ID on some states.

Proposed solution

The IDs can be stored in a separate list here (for example, transitions_list), which is used only for transitions. The original list (state_list), which will still contain the -1s, will be returned.
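
The defaultdict side effect and the two-list idea can be illustrated in isolation. This is a sketch under assumptions: the dictionary contents, the helper `record_state`, and its parameters are hypothetical and do not reproduce the traverse function.

```python
from collections import defaultdict

# Stand-in for the sinks dictionary; the entry is made up for illustration
sinks = defaultdict(set)
sinks[7].add("example")
_ = sinks[-1]                      # a plain lookup silently inserts the key -1
polluted = -1 in sinks

# Reading with .get() does not insert a missing key
clean = defaultdict(set)
clean[7].add("example")
_ = clean.get(-1)
still_clean = -1 not in clean


def record_state(state_id, is_low_severity_sink, state_list, transitions_list):
    """Keep the real ID for transitions and the masked ID for the returned list."""
    transitions_list.append(state_id)
    state_list.append(-1 if is_low_severity_sink else state_id)
```

Looking up the next transition through `transitions_list[-1]` then always uses a real state ID, while `state_list` keeps the -1 markers for the caller.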

Episode subsequence starts with a high-severity episode as a result of cutting

Description

Some episode subsequences start with a high-severity episode.

image

The viable cuts for subsequences are: [med, low], [high, low], and [high, med]. It could happen that episode subsequences start with a high-severity episode if the attackers did actually start with a high-severity alert, and we even claim that the alert-driven AGs have this as a special property. However, we need to make sure this artefact is not caused by the sequence cutting.

The problem with the image above happens because of the line if pieces < 1: (in break_into_subbehaviours). The subsequence above had length 3. In break_into_subbehaviours function, cut_length = 4. As a result, the line pieces = math.floor(len(episodes) / cut_length) will evaluate to pieces = 0 and if pieces < 1 to True, because the corresponding episode sequence has length 3. Hence, the entire subsequence of length 3 will be added to the episode subsequences.
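
The arithmetic can be checked directly. The episode labels below are placeholders; only the length 3 and cut_length = 4 come from the description above.

```python
import math

cut_length = 4
episodes = ["high", "medium", "low"]  # hypothetical subsequence of length 3

# floor(3 / 4) = floor(0.75), so no full piece fits
pieces = math.floor(len(episodes) / cut_length)
# True here, so the whole subsequence is kept without any severity-based cut
kept_whole = pieces < 1
```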

Proposed solution

Ideally, we want the sequence to be cut entirely based on severity, not on the basis of cut_length.

Add a flag for CPTC dataset

Description

In the CPTC-2017 and CPTC-2018 datasets, attacker IPs are known; however, this might not be the case for other datasets (e.g. CCDC). Because of that, some parts of the code that are CPTC-specific have to be commented out when using, for example, the CCDC dataset.

For example:

image

Furthermore, the code snippet above is executed after learning the S-PDFA, which is too late.

Proposed solution

Move this check from make_state_sequences into group_alerts_per_team (in sage.py):

  1. Add the check for 10.0.254 in src_ip or in dst_ip - if not present, then discard
  2. If present in src_ip, then add (src_ip, dst_ip). If in dst_ip, then add (dst_ip, src_ip)
  3. Correspondingly update the part in make_state_sequences function
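
Steps 1 and 2 could look roughly like the sketch below. The function name `orient_alert` and the constant name are hypothetical, and the sketch matches the prefix at the start of the address (with a trailing dot) rather than anywhere in the string, which is an assumption about the intended check.

```python
CPTC_ATTACKER_PREFIX = "10.0.254."  # trailing dot avoids matching e.g. 10.0.2540.x


def orient_alert(src_ip, dst_ip, prefix=CPTC_ATTACKER_PREFIX):
    """Return (attacker_ip, victim_ip), or None if no attacker IP is involved."""
    if src_ip.startswith(prefix):
        return (src_ip, dst_ip)
    if dst_ip.startswith(prefix):
        # The attacker is the destination, so swap the pair
        return (dst_ip, src_ip)
    # Neither side is a known attacker address: discard the alert
    return None
```

For example, `orient_alert("10.0.0.20", "10.0.254.30")` returns the pair with the attacker address first, and an alert between two other addresses returns `None`.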

For the future, we might want to address internal paths (leave this as a TODO).

bad_ip can be renamed to cptc_bad_ip

Furthermore, add a specific flag for the dataset (enum or a string) and add this flag to the if-check, so that it is triggered only for the CPTC dataset. In PR #35, ArgumentParser will be used to parse this option or set the default one.

UPDATE: PR #35 has already added the --dataset option. In this PR, this option only has to be added to the correct places.

Add documentation for methods

Description

Currently, block comments are present in the code, helping the user understand the related parts. However, method-level documentation in the form of Python Docstrings is missing.

Proposed solution

Document all the (non-helper) methods using Python Docstrings, in the following format:
image
image

Error in IANA mapping

Description

Currently, the method load_IANA_mapping throws an error when a time-out occurs.

Proposed solution

Send another request when a time-out happens. In either case, the user should not have to deal with this error, so it has to be hidden from the user.
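
A retry wrapper along these lines might be enough. This is a sketch, not the load_IANA_mapping implementation: `fetch_with_retries` and its parameters are hypothetical, and the exception type to retry on depends on the HTTP library in use (with requests it would be `requests.exceptions.Timeout`).

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=1.0,
                       retry_on=(TimeoutError,), sleep=time.sleep):
    """Call fetch() up to `attempts` times; return None if every attempt times out."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            # Wait before the next attempt, but not after the last one
            if attempt < attempts - 1:
                sleep(delay)
    # The caller can fall back to "unknown" services instead of showing an error
    return None
```

Returning `None` instead of raising keeps the time-out hidden from the user, at the cost of the caller having to handle a missing mapping.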

Test cases for SAGE

Currently, SAGE does not have any tests. Below are some ideas for potential test cases.

Regression tests

Keep the "ground truth" version of the attack graphs (the dot files). Before every merge to main, run the new implementation and compare the AGs to the "ground truth". An example using the scripts from my Research Project repository:

image

In this test file, AGs are compared with the "ground truth" based on the nodes and edges present (diff-ags.sh) and node stats (stats-nodes-ags.sh), and the episode traces passed to FlexFringe are also compared with the "ground truth".

This would be very useful for PRs related to refactoring and changing minor parts of the code that do not affect the resulting AGs.

The disadvantage, however, is that the "ground truth" files will have to be updated if the AGs change. On the other hand, the graphs can be compared to the latest SAGE version on the main branch.
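
For illustration, a very rough order-insensitive comparison of two dot files could be written as below. This is not the diff-ags.sh script mentioned above; it assumes that each node or edge statement sits on its own line, and the function names are hypothetical.

```python
def dot_statements(dot_text):
    """Return the set of non-empty, whitespace-stripped lines of a dot file."""
    return {line.strip() for line in dot_text.splitlines() if line.strip()}


def compare_ags(expected_text, actual_text):
    """Return (missing, unexpected) statements relative to the expected graph."""
    expected = dot_statements(expected_text)
    actual = dot_statements(actual_text)
    return expected - actual, actual - expected
```

A regression test would then assert that both returned sets are empty for every pair of "ground truth" and newly generated AGs.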

Sinks in FlexFringe vs sinks in AGs

As mentioned in #14, sinks in AGs can be compared with the sinks in the json files generated by FlexFringe. Here no "ground truth" is necessary as FlexFringe output (the S-PDFA) is taken as the "ground truth".

# All found sinks in 2017 are indeed sinks (after all fixes)
[jegor@arch SAGE-fork]$ sinks_after_2017=$(find after-2017AGs/ -type f -name '*.dot' | xargs grep -F -l "dotted" | xargs gvpr 'N [ $.style == "dotted" || $.style == "filled,dotted" || $.style == "dotted,filled" ] { print(gsub(gsub($.name, "\r"), "\n", " | ")); }' | sort -u | uniq -i)
[jegor@arch SAGE-fork]$ all_sinks_2017=$(jq '.nodes[] | select(.issink==1) | .id' before-2017.txt.ff.finalsinks.json | sort)
[jegor@arch SAGE-fork]$ echo -e "$sinks_after_2017" | wc -l
141
[jegor@arch SAGE-fork]$ comm -12 <(echo -e "$sinks_after_2017" | sed 's/^.*ID: \([0-9-]\+\)$/\1/' | sort) <(echo -e "$all_sinks_2017") | wc -l
141

# All found sinks in 2018 are indeed sinks (after all fixes)
[jegor@arch SAGE-fork]$ sinks_after_2018=$(find after-2018AGs/ -type f -name '*.dot' | xargs grep -F -l "dotted" | xargs gvpr 'N [ $.style == "dotted" || $.style == "filled,dotted" || $.style == "dotted,filled" ] { print(gsub(gsub($.name, "\r"), "\n", " | ")); }' | sort -u | uniq -i)
[jegor@arch SAGE-fork]$ all_sinks_2018=$(jq '.nodes[] | select(.issink==1) | .id' before-2018.txt.ff.finalsinks.json | sort)
[jegor@arch SAGE-fork]$ echo -e "$sinks_after_2018" | wc -l
104
[jegor@arch SAGE-fork]$ comm -12 <(echo -e "$sinks_after_2018" | sed 's/^.*ID: \([0-9-]\+\)$/\1/' | sort) <(echo -e "$all_sinks_2018") | wc -l
104

# All non-sinks with IDs in 2017 are indeed non-sinks (after all fixes)
[jegor@arch SAGE-fork]$ non_sinks_with_ids_after_2017=$(find after-2017AGs/ -type f -name '*.dot' | xargs gvpr 'N [ $.style != "dotted" && $.style != "filled,dotted" && $.style != "dotted,filled" ] { print(gsub(gsub($.name, "\r"), "\n", " | ")); }' | sort -u | uniq -i | grep 'ID: ')
[jegor@arch SAGE-fork]$ echo -e "$non_sinks_with_ids_after_2017" | wc -l
28
[jegor@arch SAGE-fork]$ all_non_sinks_after_2017=$(jq '.nodes[] | select(.issink==0) | .id' after-2017.txt.ff.final.json | sort -u)
[jegor@arch SAGE-fork]$ comm -12 <(echo -e "$non_sinks_with_ids_after_2017" | sed 's/^.*ID: \([0-9-]\+\)$/\1/' | sort -u) <(echo -e "$all_non_sinks_after_2017") | wc -l
28

# All non-sinks with IDs in 2018 are indeed non-sinks (after all fixes)
[jegor@arch SAGE-fork]$ non_sinks_with_ids_after_2018=$(find after-2018AGs/ -type f -name '*.dot' | xargs gvpr 'N [ $.style != "dotted" && $.style != "filled,dotted" && $.style != "dotted,filled" ] { print(gsub(gsub($.name, "\r"), "\n", " | ")); }' | sort -u | uniq -i | grep 'ID: ')
[jegor@arch SAGE-fork]$ echo -e "$non_sinks_with_ids_after_2018" | wc -l
16
[jegor@arch SAGE-fork]$ all_non_sinks_after_2018=$(jq '.nodes[] | select(.issink==0) | .id' after-2018.txt.ff.final.json | sort -u)
[jegor@arch SAGE-fork]$ comm -12 <(echo -e "$non_sinks_with_ids_after_2018" | sed 's/^.*ID: \([0-9-]\+\)$/\1/' | sort -u) <(echo -e "$all_non_sinks_after_2018") | wc -l
16

image
image
image

Episode generation

The _get_episodes method has test cases for episode creation that are commented out. A better solution would be to move these commented-out tests into a separate test file.
image

Cutting episode sequences

Cutting episode sequences into episode subsequences could also potentially be tested.

Episode sequences vs state sequences

Depending on whether the implementation allows this or not, episode sequences could be compared with state sequences for consistency.

Structure discarding short episode subsequences

Description

The break_into_subbehaviors function, which is responsible for cutting episode sequences into episode subsequences, discards short subsequences, for example:
image

This check occurs multiple times within the function and is also present in the generate_traces function. Ideally, it should be in only one place (in generate_traces).

Furthermore, when splitting, if a sequence goes like [low, low, medium, high, low], then [low, low, medium, high] is saved but the last [low] is just discarded. We probably shouldn't lose alerts like this. On the other hand, it is not clear what to do with a single event either. Maybe we keep them regardless?

Proposed solution

  • Leave only the check in the generate_traces function and update the break_into_subbehaviors function accordingly. The resulting attack graphs should be the same as before.
  • For now, these short episodes can be discarded, as they used to be, and it will be left to the user whether they want to keep them or not.
