honeynet / cuckooml

CuckooML: Machine Learning for Cuckoo Sandbox

Home Page: https://honeynet.github.io/cuckooml/

Python 77.46% Shell 3.85% DTrace 0.56% HTML 15.19% JavaScript 1.38% Makefile 0.01% C 0.62% Mako 0.07% CSS 0.45% VBScript 0.07% YARA 0.34%

cuckooml's Issues

`cuckooml.py` moved to `modules/processing/`

Since cuckooml.py moved to modules/processing/, most of the provided code snippets (especially those in the blog posts) and the example CuckooML Jupyter Notebooks no longer work.
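One way to keep older snippets and notebooks importable is to try the new module location first and fall back to an older one. The sketch below is only an illustration: `import_first` is a hypothetical helper, and the fallback path `cuckooml` is an assumption about how the older snippets imported the module.

```python
import importlib


def import_first(candidates):
    """Import and return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate location
    raise ImportError("none of these modules could be imported: %r" % (candidates,))


# Intended use from the repository root (both module paths are assumptions):
# cuckooml = import_first(["modules.processing.cuckooml", "cuckooml"])
```

Listing the new location first means the fallback is only consulted on checkouts that predate the move.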

Import statement in virustotal.py

Shouldn't line 21 of /lib/cuckoo/common/virustotal.py import Instance from modules.processing rather than modules.cuckooml? That is,

    from modules.processing.cuckooml import Instance

instead of

    from modules.cuckooml.cuckooml import Instance

Remove "unknown" OS label

"unknown" OS label needs to be removed in virustotal.py as it collides with "none" label in cuckooml.py.

Resolving abbreviated malware names

Right now the first mapping, i.e. the longest matching string, is used. To improve labelling, all possible matches need to be considered, and the most probable combination of abbreviations, i.e. the one that uses all of the sub-strings, should be chosen.
For example, "adload" is currently split into "a" and "dload", with the latter mapped to downloader. A better split would be "ad" (adware) and "load" (downloader).
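The idea of preferring a split that uses every character can be sketched as a small search over all full-cover segmentations. This is a minimal illustration, not the project's code: the abbreviation table is hypothetical, and picking the split with the fewest pieces as "most probable" is an assumption.

```python
# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"ad": "adware", "load": "downloader", "dload": "downloader"}


def full_cover_splits(name, table=ABBREVIATIONS):
    """Return every split of `name` into keys of `table` that leaves nothing over."""
    memo = {}

    def splits(start):
        if start == len(name):
            return [[]]  # exactly one way to split the empty remainder
        if start not in memo:
            found = []
            for end in range(start + 1, len(name) + 1):
                piece = name[start:end]
                if piece in table:
                    found.extend([piece] + rest for rest in splits(end))
            memo[start] = found
        return memo[start]

    return splits(0)


def resolve(name, table=ABBREVIATIONS):
    """Map `name` to labels using the full-cover split with the fewest pieces."""
    candidates = full_cover_splits(name, table)
    if not candidates:
        return []  # no complete split; a caller could fall back to longest match
    best = min(candidates, key=len)
    return [table[piece] for piece in best]
```

With this table, `resolve("adload")` yields the adware and downloader labels, because "ad" + "load" covers the whole name while "dload" leaves the leading "a" unexplained.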

Store multiple clustering results in the malware analysis JSON

At the moment, at most one set of parameter settings and its clustering results can be stored per clustering algorithm. This should be extended to allow storing clustering results for multiple parameter settings. See the TODO tags in commit 1727bb0.
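One possible layout keys each run by its parameter settings, so several runs of the same algorithm can sit side by side. The sketch below assumes a `"clustering"` section in the analysis dict; that key, the function names, and the example parameters are all hypothetical and not taken from the CuckooML JSON format.

```python
import json


def params_key(params):
    """Build a stable string key from a parameter dict."""
    return json.dumps(params, sort_keys=True)


def store_clustering(analysis, algorithm, params, labels):
    """Record `labels` under `algorithm`, keyed by the parameter settings used."""
    runs = analysis.setdefault("clustering", {}).setdefault(algorithm, {})
    runs[params_key(params)] = {"params": params, "labels": labels}
    return analysis
```

Sorting the keys when serialising the parameters means two runs with the same settings map to the same entry regardless of dict ordering, while different settings get separate entries.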

Useful malware features

The base set of ML features for binaries analysed by Cuckoo is going to be inspired by "Reviewer Integration and Performance Measurement for Malware Detection" by B. Miller et al. (available here).
They name all kinds of binary features, both static and dynamic, which seem a good starting point for this project:

static attributes:
- binary metadata,
- digital signing,
- heuristic tools,
- packer detection,
- portable executable format,
- static imports;
dynamic attributes:
- dynamic imports, mutexes, processes,
- filesystem operations,
- network operations,
- registry operations,
- Windows API calls.

Once implemented, they should be reviewed and revised with regard to their usefulness for this project.

Naive filesystem path concatenation

Right now filesystem paths are created through string concatenation. This should be changed to os.path.join() for robustness.
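As a small illustration of the difference, the helper below builds a report path with os.path.join() instead of concatenating strings. The function name and the analyses/<id>/reports layout are assumptions made for the example.

```python
import os


def report_path(storage_root, task_id, filename="report.json"):
    """Build the path to a report file without manual separator handling."""
    # The analyses/<id>/reports layout is assumed here for illustration.
    return os.path.join(storage_root, "analyses", str(task_id), "reports", filename)
```

Unlike `storage_root + "/analyses/" + ...`, this gives the same result whether or not `storage_root` ends with a separator, and it uses the platform's own separator.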

Global "normalized" field does not correspond to the same field per VT vendor

The global "normalized" field has to be updated to match the corresponding per-vendor VT fields, which have been updated to provide better labelling.

More useful normalised field in VirusTotal JSONs

VirusTotal supplies malware names that are simply not readable. Currently the 'normalised' field generated by Cuckoo and available in the JSONs is not of much use.
The goal is to create better normalised malware names, which can then be used as labels for testing cuckooml clustering and classification.

Make CuckooML plotting dependent on library imports

In the try: import... block, create a global variable that covers all the libraries necessary for plotting, and make CuckooML plotting conditional on it.
The result: no need to install plotting packages if you're only interested in malware analysis with textual output.
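A minimal sketch of that pattern follows. The flag name `HAVE_PLOTTING` and the function `plot_clusters` are hypothetical, and selecting the "Agg" backend is an assumption made so the example also runs without a display.

```python
try:
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend; an assumption for headless use
    import matplotlib.pyplot as plt
    HAVE_PLOTTING = True
except ImportError:
    plt = None
    HAVE_PLOTTING = False


def plot_clusters(points, labels):
    """Return a scatter-plot figure, or None when plotting libraries are missing."""
    if not HAVE_PLOTTING:
        return None  # textual output only
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    figure, axes = plt.subplots()
    axes.scatter(xs, ys, c=labels)  # colour each point by its cluster label
    return figure
```

Every plotting entry point checks the same flag, so the rest of the analysis never touches matplotlib when it is not installed.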

A web interface to display results of clustering

Currently the aim is to use matplotlib for all sorts of plots; an integrated web interface for this would be really useful.

Test AVClass for better labeling/evaluation

Try the tool AVClass to improve the labeling of samples and the evaluation of the different algorithms.

getting error running cuckoo.py --ml

I have tried to run cuckooml but always get the following error. I even tried to run your example in my IDE, but I still get the same error.

    Traceback (most recent call last):
      File "/home/ubuntu/Downloads/pycharm-community-2016.3.2/helpers/pydev/pydevd.py", line 1596, in <module>
        globals = debugger.run(setup['file'], None, None, is_module)
      File "/home/ubuntu/Downloads/pycharm-community-2016.3.2/helpers/pydev/pydevd.py", line 974, in run
        pydev_imports.execfile(file, globals, locals) # execute the script
      File "/home/ubuntu/PycharmProjects/cuckooml/mltestcase.py", line 10, in <module>
        loader.load_binaries("/home/ubuntu/Downloads/cuckooml/sample_data/dict")
      File "/home/ubuntu/PycharmProjects/cuckooml/modules/processing/cuckooml.py", line 1189, in load_binaries
        self.binaries[f].label_sample()
      File "/home/ubuntu/PycharmProjects/cuckooml/modules/processing/cuckooml.py", line 1305, in label_sample
        merged_labels += self.scans[vendor]["normalized"][label_type]
    TypeError: list indices must be integers, not str
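The TypeError means that one of the three lookups on the last line was applied to a list rather than a dict. A small helper like the one below (a hypothetical diagnostic, not part of CuckooML) can show which level has the unexpected type. The sample data in the usage note is made up for illustration.

```python
def first_bad_lookup(data, keys):
    """Return (depth, key, type name) for the first lookup that fails, else None."""
    current = data
    for depth, key in enumerate(keys):
        try:
            current = current[key]
        except (TypeError, KeyError, IndexError):
            # Report the type of the container that rejected the key.
            return depth, key, type(current).__name__
    return None
```

For example, `first_bad_lookup({"VendorA": {"normalized": ["trojan"]}}, ["VendorA", "normalized", "family"])` reports that the third lookup hit a list, which would point at the shape of the "normalized" field in the loaded JSON.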

Sorting in clustering_results.csv

Hi @So-Cool

The sorting issue in clustering_results.csv is as follows:
1,10..19,2,20..[sample end 62], 7,8,9

I'm currently trying to create my own ground truth labels list, which means I will have to account for that sorting mistake when creating my own list. I'm wondering whether the ground truth labels generated by CuckooML are in sync with the clustering results, i.e. are they subject to the same bug or does it only affect the one list?
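The ordering above (1, 10..19, 2, 20.., 7, 8, 9) is what results when numeric identifiers are sorted as strings. Assuming that is the cause, a natural-sort key restores numeric order. `natural_key` is a hypothetical helper, not a CuckooML function.

```python
import re


def natural_key(name):
    """Split `name` into text and integer chunks so digits compare numerically."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


names = ["1", "10", "19", "2", "20", "7"]
by_string = sorted(names)                    # lexicographic, as in the CSV
by_number = sorted(names, key=natural_key)   # numeric order
```

Applying the same key to both the clustering results and a hand-made ground truth list would keep the two in the same order.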

cuckooml showcase

It seems worthwhile to create some kind of a cuckooml showcase that performs clustering on some real data and gives some comments on interpretation of the results; possibly including cuckooml package usage guidelines; maybe in an iPython Notebook format.

Only one set of features is used for clustering

Hi,

thanks for sharing this project! I am in the process of adding features to the nominal feature set. In that process I noticed that my changes were not taken into account in the clustering results, even though I specified nominal in the configuration. I believe the reason is that the code that handles the configuration settings is using an if... elif construct, which will lead to only choosing one set of features. Relevant code snippet is:

    # Select features                               
    selected_features = []                          
    sf = [i.strip() for i in cfg.cuckooml.features.split(",")]
    if "simple" in sf:
        selected_features.append(simple_features)
    elif "nominal" in sf:
        selected_features.append(features_nominal)
    elif "numerical" in sf:
        selected_features.append(features_numerical)
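A possible fix, sketched below, checks each feature set independently so that every set named in the configuration is selected. The function name and the placeholder feature lists are hypothetical. How the selected sets are combined afterwards is left as in the original code.

```python
def select_features(setting, available):
    """Return every feature set named in the comma-separated `setting`."""
    requested = set(name.strip() for name in setting.split(","))
    # Independent membership checks: each requested set is added, not just the first.
    return [features for name, features in available if name in requested]


# Placeholder feature sets, for illustration only.
available = [
    ("simple", ["size"]),
    ("nominal", ["packer"]),
    ("numerical", ["entropy"]),
]
selected = select_features("simple, nominal", available)
```

This is equivalent to replacing each `elif` in the snippet above with a plain `if`.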

Reading in the data for analysis

The simplest solution is reading in the JSONs placed in the /storage directory. At later stages it might be worth developing something more natural.
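A minimal sketch of that approach is shown below. `load_reports` is a hypothetical helper, and it assumes one JSON report per file in a single flat directory.

```python
import json
import os


def load_reports(directory):
    """Load every .json file in `directory`, keyed by file name without extension."""
    reports = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue  # skip anything that is not a JSON report
        with open(os.path.join(directory, filename)) as handle:
            reports[os.path.splitext(filename)[0]] = json.load(handle)
    return reports
```

Keeping the loader behind one function makes it easy to swap in a different data source later without touching the analysis code.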
