honeynet / cuckooml
CuckooML: Machine Learning for Cuckoo Sandbox
Home Page: https://honeynet.github.io/cuckooml/
Since cuckooml.py moved to modules/processing/, most of the provided code snippets (especially in the blog posts) and the example CuckooML Jupyter Notebooks won't work anymore.
On line 21 of /lib/cuckoo/common/virustotal.py, shouldn't it be
from modules.processing.cuckooml import Instance
instead of
from modules.cuckooml.cuckooml import Instance ?
The "unknown" OS label needs to be removed from virustotal.py, as it collides with the "none" label in cuckooml.py.
Right now the first mapping that matches the longest string is used. To improve labelling, all possible matches need to be considered, and the most probable abbreviation combination, i.e. the one that uses all of the sub-strings, should be chosen.
For example, "adload" is currently split into "a" and "dload", with the latter mapped to downloader. A better split would be "ad" (adware) and "load" (downloader).
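A rough sketch of the idea, not the code currently in the repository: try every way of cutting the name into known abbreviations and keep a split that covers the whole string. The `ABBREVIATIONS` table and the `split_label` helper are hypothetical names used for illustration only, and preferring the fewest pieces is just one possible tie-break.

```python
from functools import lru_cache

# Hypothetical abbreviation table; the real mapping in CuckooML may differ.
ABBREVIATIONS = {
    "ad": "adware",
    "load": "downloader",
    "dload": "downloader",
}


def split_label(name, abbreviations=ABBREVIATIONS):
    """Return a split of `name` into known abbreviations that covers the
    whole string with the fewest pieces, or None if no full cover exists."""

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(name):
            return ()
        candidates = []
        for end in range(start + 1, len(name) + 1):
            piece = name[start:end]
            if piece in abbreviations:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        # Prefer the split with the fewest pieces among the full covers.
        return min(candidates, key=len) if candidates else None

    return best(0)


parts = split_label("adload")
labels = [ABBREVIATIONS[p] for p in parts] if parts else []
print(parts, labels)
```

With this table, "dload" is never chosen for "adload" because the leading "a" would be left unexplained, whereas "ad" + "load" uses every character.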
At the moment, at most one set of parameter settings and its clustering results can be stored per clustering algorithm. This should be extended to allow storing clustering results for multiple parameter settings. See the TODO tags in commit 1727bb0.
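One possible shape for this, as a sketch only: key the stored results on both the algorithm name and its parameter settings. `ClusteringStore`, its methods and the example parameters below are made up for illustration and are not part of CuckooML.

```python
import json


class ClusteringStore:
    """Keeps one result per (algorithm, parameter settings) pair."""

    def __init__(self):
        self._results = {}

    @staticmethod
    def _key(algorithm, params):
        # Serialise parameters with sorted keys so that identical settings
        # always map to the same key, whatever the dict ordering.
        return (algorithm, json.dumps(params, sort_keys=True))

    def save(self, algorithm, params, labels):
        self._results[self._key(algorithm, params)] = list(labels)

    def load(self, algorithm, params):
        return self._results.get(self._key(algorithm, params))


store = ClusteringStore()
store.save("dbscan", {"eps": 0.5, "min_samples": 5}, [0, 0, 1, -1])
store.save("dbscan", {"eps": 1.0, "min_samples": 5}, [0, 0, 0, 0])
print(store.load("dbscan", {"min_samples": 5, "eps": 0.5}))
```

Both DBSCAN runs stay available side by side instead of the second overwriting the first.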
The base set of ML features for binaries analysed by Cuckoo is going to be inspired by "Reviewer Integration and Performance Measurement for Malware Detection" by B. Miller et al.
They name all kinds of binary features, both static and dynamic, which seems a good starting point for this project.
Once implemented, they should be reviewed and revised with regard to their usability for this project.
The global "normalized" field has to be updated with the corresponding per-vendor VT fields, which have been updated to provide better labelling.
VirusTotal supplies malware names which are simply not readable. Currently the 'normalised' field generated by Cuckoo and available in the JSONs is not of much use.
The goal is to create better normalised malware names, which can then be used as labels for testing CuckooML clustering and classification.
In the try: import... block, create a global variable covering all the libraries necessary for plotting, and make CuckooML plotting conditional on it.
The result: no need to install plotting packages if you're only interested in malware analysis with textual output.
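A minimal sketch of that pattern, assuming matplotlib is the only plotting dependency. The flag name `HAVE_PLOTTING` and the `plot_cluster_sizes` function are placeholders, not names taken from cuckooml.py.

```python
try:
    import matplotlib
    matplotlib.use("Agg")  # file-only backend, no display needed
    import matplotlib.pyplot as plt
    HAVE_PLOTTING = True
except ImportError:
    plt = None
    HAVE_PLOTTING = False


def plot_cluster_sizes(sizes, path="clusters.png"):
    """Save a bar chart of cluster sizes; return False if plotting is unavailable."""
    if not HAVE_PLOTTING:
        return False
    fig, ax = plt.subplots()
    ax.bar(range(len(sizes)), sizes)
    ax.set_xlabel("cluster")
    ax.set_ylabel("samples")
    fig.savefig(path)
    plt.close(fig)
    return True
```

Callers that only need textual output never touch `plt`, so a missing matplotlib install stops being an import-time failure.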
Currently the aim is to use matplotlib for all sorts of plots; an integrated web interface for this would be really useful.
Try the tool AVClass to improve the labeling of samples and the evaluation of the different algorithms.
I have tried to run cuckooml but I always get the following error. I even tried to run your example in my IDE, but I still get the same error.
Traceback (most recent call last):
  File "/home/ubuntu/Downloads/pycharm-community-2016.3.2/helpers/pydev/pydevd.py", line 1596, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/ubuntu/Downloads/pycharm-community-2016.3.2/helpers/pydev/pydevd.py", line 974, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/home/ubuntu/PycharmProjects/cuckooml/mltestcase.py", line 10, in <module>
    loader.load_binaries("/home/ubuntu/Downloads/cuckooml/sample_data/dict")
  File "/home/ubuntu/PycharmProjects/cuckooml/modules/processing/cuckooml.py", line 1189, in load_binaries
    self.binaries[f].label_sample()
  File "/home/ubuntu/PycharmProjects/cuckooml/modules/processing/cuckooml.py", line 1305, in label_sample
    merged_labels += self.scans[vendor]["normalized"][label_type]
TypeError: list indices must be integers, not str
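The error means that one of the three lookups in that expression hit a list where a dict was expected, but the traceback alone does not say which one. A small diagnostic helper along these lines could narrow it down. `get_nested` and the sample `scans` data are hypothetical and only illustrate the technique.

```python
def get_nested(data, *keys):
    """Follow `keys` through nested dicts.

    Returns (value, None) on success, or (None, message) naming the first
    level that is not a dict or that lacks the requested key.
    """
    current = data
    for depth, key in enumerate(keys):
        if not isinstance(current, dict):
            return None, "level %d is %s, expected dict" % (
                depth, type(current).__name__)
        if key not in current:
            return None, "key %r missing at level %d" % (key, depth)
        current = current[key]
    return current, None


# Made-up scan data in which "normalized" holds a list instead of a dict.
scans = {"VendorA": {"normalized": ["trojan"]}}
value, problem = get_nested(scans, "VendorA", "normalized", "type")
print(value, problem)
```

Running the same check against the real report for the failing sample would show whether the scans container, the vendor entry or the "normalized" field is the list.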
Hi @So-Cool
The sorting issue in clustering_results.csv is as follows:
1,10..19,2,20..[sample end 62], 7,8,9
I'm currently trying to create my own ground truth labels list, which means I will have to account for that sorting mistake when creating my own list. I'm wondering whether the ground truth labels generated by CuckooML are in sync with the clustering results, i.e. are they subject to the same bug or does it only affect the one list?
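The order 1, 10..19, 2, 20.. is what you get when numeric names are sorted as strings. As an illustration (not the code CuckooML uses), a natural sort key that compares digit runs as integers restores the expected order; `natural_key` is a hypothetical helper.

```python
import re


def natural_key(text):
    """Split into digit and non-digit runs so numbers compare as integers."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", text)]


names = ["1", "10", "2", "20", "7"]
print(sorted(names))                   # string order
print(sorted(names, key=natural_key))  # numeric order
```

Applying the same key to both the clustering results and any hand-made ground truth list keeps the two aligned.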
It seems worthwhile to create some kind of a cuckooml showcase that performs clustering on some real data and gives some comments on interpretation of the results; possibly including cuckooml package usage guidelines; maybe in an iPython Notebook format.
Hi,
thanks for sharing this project! I am in the process of adding features to the nominal feature set. In that process I noticed that my changes were not taken into account in the clustering results, even though I specified nominal in the configuration. I believe the reason is that the code that handles the configuration settings is using an if... elif construct, which will lead to only choosing one set of features. Relevant code snippet is:
# Select features
selected_features = []
sf = [i.strip() for i in cfg.cuckooml.features.split(",")]
if "simple" in sf:
    selected_features.append(simple_features)
elif "nominal" in sf:
    selected_features.append(features_nominal)
elif "numerical" in sf:
    selected_features.append(features_numerical)
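A possible fix, sketched standalone: use independent if statements so that every requested feature set is appended. The three feature dictionaries and the `select_features` wrapper below are stand-ins for illustration, and the setting string replaces `cfg.cuckooml.features`.

```python
# Stand-in feature sets; in CuckooML these come from the loaded samples.
simple_features = {"size": 1}
features_nominal = {"packer": "upx"}
features_numerical = {"entropy": 7.2}


def select_features(setting):
    """Return every feature set named in the comma-separated `setting`."""
    sf = [i.strip() for i in setting.split(",")]
    selected_features = []
    # Independent checks: each requested set is added, not just the first match.
    if "simple" in sf:
        selected_features.append(simple_features)
    if "nominal" in sf:
        selected_features.append(features_nominal)
    if "numerical" in sf:
        selected_features.append(features_numerical)
    return selected_features


print(select_features("simple, nominal"))
```

With the original elif chain, the same setting would return only the simple features.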
The simplest solution is reading in the JSONs placed in the /storage directory. At later stages it might be worth developing something more natural.
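As a sketch of that simplest approach, assuming one JSON report per file in a flat directory (the actual layout under /storage may differ); `load_reports` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_reports(directory):
    """Load every *.json file in `directory`, keyed by file name without extension."""
    reports = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            reports[path.stem] = json.load(handle)
    return reports
```

Files with other extensions are skipped, and a malformed JSON file raises immediately rather than being silently ignored.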