
cvefixes's Introduction

Source code is under the MIT license; data is under the CC BY 4.0 license.

CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software

CVEfixes is a comprehensive vulnerability dataset that is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). The goal is to support data-driven security research based on source code and source code metrics related to fixes for CVEs in the NVD by providing detailed information at different interlinked levels of abstraction, such as the commit-, file-, and method level, as well as the repository- and CVE level.

At the initial release, the dataset covers all published CVEs up to 9 June 2021. All open-source projects that were reported in CVE records in the NVD in this time frame and had publicly available git repositories were fetched and considered for the construction of this vulnerability dataset. The dataset is organized as a relational database and covers 5495 vulnerability-fixing commits in 1754 open-source projects for a total of 5365 CVEs in 180 different Common Weakness Enumeration (CWE) types. The dataset includes the source code before and after the fix for 18249 files and 50322 functions. Because of limitations in GitHub storage, we provide a compressed SQL dump of the CVEfixes vulnerability dataset via Zenodo with DOI: 10.5281/zenodo.4476563.

This repository includes the code to replicate the data collection. The complete process has been documented in the paper "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software", a copy of which you will find in the Doc folder.

  • instructions for using CVEfixes are in the first section of INSTALL.md.
  • requirements for gathering CVEfixes from scratch are in REQUIREMENTS.md.
  • instructions for gathering CVEfixes from scratch are in the second section of INSTALL.md.

Citation and Zenodo links

Please cite this work by referring to the paper:

Guru Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21). ACM, 10 pages. https://doi.org/10.1145/3475960.3475985

@inproceedings{bhandari2021:cvefixes,
    title = {{CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software}},
    booktitle = {{Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21)}},
    author = {Bhandari, Guru and Naseer, Amara and Moonen, Leon},
    year = {2021},
    pages = {10},
    publisher = {{ACM}},
    doi = {10.1145/3475960.3475985},
    copyright = {Open Access},
    isbn = {978-1-4503-8680-7},
    language = {en}
}

The GitHub repository containing the code to automatically collect the dataset can be found at https://github.com/secureIT-project/CVEfixes, released with DOI: 10.5281/zenodo.5111494. The dataset has been released on Zenodo with DOI: 10.5281/zenodo.4476563.

Acknowledgement

This work has been financially supported by the Research Council of Norway through the secureIT project (RCN contract #288787).

cvefixes's People

Contributors

leonmoonen, linasvidziunas, satbekmyrza


cvefixes's Issues

From-scratch collection asks for GitHub credentials.

While collecting the dataset from scratch, some GitHub repositories prompt for credentials, and this stops the process. Can you help me, please?

05/21/2023 23:23:10 CVEfixes INFO ----------------------------------------------------------------------
05/21/2023 23:23:10 CVEfixes INFO Retrieving fixes for repo 62 of 3406 - kiwi
05/21/2023 23:23:10 CVEfixes DEBUG Extracting commits for https://github.com/openSUSE/kiwi.git with 4 worker(s) looking for the following hashes:
05/21/2023 23:23:10 CVEfixes DEBUG https://github.com/openSUSE/kiwi.git/commit/f0f74b3f6ac6d47f7919aa9db380c0ad41ffe55f
05/21/2023 23:23:10 CVEfixes DEBUG https://github.com/openSUSE/kiwi.git/commit/88bf491d16942766016c606e4210b4e072c1019f
05/21/2023 23:23:10 git.cmd DEBUG Popen(['git', 'clone', '-v', '--', 'https://github.com/openSUSE/kiwi.git', '/tmp/tmpjk9o3du7/kiwi'], cwd=/media/cvefixes.edgewatch.net/CVEfixes, universal_newlines=True, shell=None, istream=None)
Username for 'https://github.com':
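
A prompt like the one above usually means git could not access the repository anonymously and fell back to asking on the terminal. The following is a minimal sketch, not the project's own code, of one way to make git fail instead of prompting: it sets the standard `GIT_TERMINAL_PROMPT` environment variable to `0`. The helper names `noninteractive_git_env` and `try_clone` are hypothetical and only illustrate the idea.

```python
import os
import subprocess


def noninteractive_git_env(base_env=None):
    """Return a copy of the environment in which git will not prompt on the terminal."""
    env = dict(os.environ if base_env is None else base_env)
    # GIT_TERMINAL_PROMPT=0 makes git fail instead of asking for a
    # username/password on the terminal (supported since git 2.3).
    env["GIT_TERMINAL_PROMPT"] = "0"
    return env


def try_clone(url, target_dir):
    """Hypothetical helper: clone `url` and return (ok, stderr) without a terminal prompt."""
    result = subprocess.run(
        ["git", "clone", "--quiet", "--", url, target_dir],
        env=noninteractive_git_env(),
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr.strip()
```

Because child processes inherit the environment, exporting `GIT_TERMINAL_PROMPT=0` in the shell before starting the collection should have a similar effect. A repository that needs credentials would then be reported as a failed clone rather than blocking the run; whether the collection continues past that failure depends on how the calling code handles the error.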

Getting error while trying to create dataset from scratch

Hello,
I was trying to use this tool to create the dataset from scratch. I followed the steps in Install.md, and all dependencies are satisfied. However, when I run the script create_CVEfixes_from_scratch.sh, I get the following error:
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT * FROM cwe_classification': no such table: cwe_classification
Could you please give me some pointers to resolve the issue?

Thanks in advance.

DatabaseError

The data version I used is Cvefixes_v1.0.7.sql. I noticed that CVEfixed/CVEfixes_v1.0.7/Examples uses CVEfixes.db, but I get an error when I change the path in CVEfixed/CVEfixes_v1.0.7/Examples. Has anything changed in the latest version, or is the database version wrong? Can you help me?
DatabaseError: Execution failed on sql 'SELECT m.name, m.signature, m.nloc, m.parameters, m.token_count, m.code, m.before_change, f.programming_language FROM method_change m, file_change f WHERE f.file_change_id=m.file_change_id AND f.programming_language='C'': no such table: method_change
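
A "no such table" error can be narrowed down by checking which tables the database file actually contains. The sketch below is an illustration under assumptions: `list_tables` and `missing_tables` are hypothetical helpers, and `TABLES_MENTIONED` lists only the table names that appear in the reports on this page, so the real schema may contain more.

```python
import sqlite3
from contextlib import closing

# Table names that appear in the reports on this page; the full schema may contain more.
TABLES_MENTIONED = (
    "cve", "cwe", "cwe_classification", "commits", "file_change", "method_change",
)


def list_tables(db_path):
    """Return the sorted names of all tables in the SQLite file at db_path."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def missing_tables(db_path, expected=TABLES_MENTIONED):
    """Return the expected table names that are absent from the database."""
    present = set(list_tables(db_path))
    return [name for name in expected if name not in present]
```

Note that `sqlite3.connect` silently creates an empty database when the path does not exist, so an empty result from `list_tables` may simply mean that the path points at the wrong file.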

Build CI/CD pipeline

Hi,
I'm a cybersecurity student from Vietnam. My teacher wants me to make use of your dataset in DevSecOps and build a CI/CD pipeline. Do you have any ideas how I can do this? In what stage of a CI/CD pipeline should I use this dataset? And what tools do I need to use in that case? Can you help me out?
Sorry for my bad English. Your work and your dataset are very helpful. Thank you!

Anaconda environment conflict

When I run conda env create -f environment.yml on Ubuntu 20.04, a conflict occurs.
How can I solve it?

Collecting guesslang~=2.0
  Downloading guesslang-2.2.0-py3-none-any.whl (2.5 MB)
  Downloading guesslang-2.0.3-py3-none-any.whl (2.1 MB)
  Downloading guesslang-2.0.1-py3-none-any.whl (2.1 MB)
  Downloading guesslang-2.0.0-py3-none-any.whl (13.0 MB)
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of pydriller to determine which version is compatible with other requirements. This could take a while.

The conflict is caused by:
    guesslang 2.2.1 depends on tensorflow==2.5.0
    guesslang 2.2.0 depends on tensorflow==2.5.0
    guesslang 2.0.3 depends on tensorflow==2.5.0
    guesslang 2.0.1 depends on tensorflow==2.2.0
    guesslang 2.0.0 depends on tensorflow==2.2.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict


Pip subprocess error:
ERROR: Cannot install -r /home/yang/Documents/CVE/CVEfixes/condaenv.ko814pto.requirements.txt (line 2) because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

CondaEnvException: Pip failed

Credentials requested while the job is running

The following credential prompt appeared while the job was running and interrupted it, even though the job is supposed to run unattended:

Username for 'https://bitbucket.org': Password for 'https://bitbucket.org':

I can tell this is not a standard GitHub link, but I am not sure what the best way to bypass this automatically is, so I am asking here first before digging too deep into it.

Thanks very much for such a great project!

Out of memory: Killed process python3

Hello. After 3 days of collecting data, the process failed with an out-of-memory error, even though I had configured 160GB.

Has this happened to you?

Greetings

Parse Error When Trying to gzcat

Hi, I downloaded the CVEfixes_v1.0.7.zip file from https://zenodo.org/record/7029359#.ZD2w4i-970o and unzipped it to get the CVEfixes_v1.0.7.sql.gz file. But when I try to create the CVEfixes.db file using the command given in Install.md, which is gunzip -c Data/CVEfixes_v1.0.7.sql.gz | sqlite3 Data/CVEfixes.db, I get the following errors:

Parse error near line 142728: no such table: method_change
Parse error near line 142729: no such table: method_change
Parse error near line 142730: no such table: method_change
Parse error near line 142731: no such table: method_change
Parse error near line 142732: no such table: method_change
..........

In the end it does create the CVEfixes.db file, but without the method_change, file_change, and commits tables. How do I fix this? I am using a CentOS machine.

Thank You.
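
One way to separate "the dump does not define these tables" from "sqlite3 failed to execute their definitions" is to scan the compressed dump for `CREATE TABLE` statements. The following is a rough sketch under assumptions: `tables_defined_in_dump` is a hypothetical helper, and it assumes each `CREATE TABLE` starts at the beginning of a line, as is typical for SQLite dumps. Source code stored inside INSERT values could in principle produce false matches.

```python
import gzip
import re

# Matches a CREATE TABLE at the start of a line and captures the table name,
# with or without quoting and an optional IF NOT EXISTS clause.
CREATE_TABLE = re.compile(
    r'^\s*CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?["`\[]?(\w+)',
    re.IGNORECASE,
)


def tables_defined_in_dump(gz_path):
    """Return table names from CREATE TABLE lines, in order of appearance."""
    names = []
    # Stream the file line by line so the whole dump never sits in memory.
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            match = CREATE_TABLE.match(line)
            if match:
                names.append(match.group(1))
    return names
```

If the missing tables do show up in this list, the dump itself is probably fine and the problem more likely lies in how the local sqlite3 binary processes those statements.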

Assertion error when attempting to re-collect data

I was trying to re-collect the CVE data locally with a sample limit of zero, such that 2022 and 2023 records were included in the resulting database.

However, I get the following assertion error:

03/21/2023 14:19:30 git.cmd DEBUG Popen(['git', 'version'], cwd=/Users/tomb/CVEfixes, universal_newlines=False, shell=None, istream=None)
03/21/2023 14:19:30 git.cmd DEBUG Popen(['git', 'version'], cwd=/Users/tomb/CVEfixes, universal_newlines=False, shell=None, istream=None)
[]
03/21/2023 14:19:30 CVEfixes INFO ----------------------------------------------------------------------
03/21/2023 14:19:31 CVEfixes INFO The CVE json for 2002 has been merged
03/21/2023 14:19:32 CVEfixes INFO The CVE json for 2003 has been merged
03/21/2023 14:19:34 CVEfixes INFO The CVE json for 2004 has been merged
03/21/2023 14:19:35 CVEfixes INFO The CVE json for 2005 has been merged
03/21/2023 14:19:37 CVEfixes INFO The CVE json for 2006 has been merged
03/21/2023 14:19:39 CVEfixes INFO The CVE json for 2007 has been merged
03/21/2023 14:19:41 CVEfixes INFO The CVE json for 2008 has been merged
03/21/2023 14:19:43 CVEfixes INFO The CVE json for 2009 has been merged
03/21/2023 14:19:45 CVEfixes INFO The CVE json for 2010 has been merged
03/21/2023 14:19:53 CVEfixes INFO The CVE json for 2011 has been merged
03/21/2023 14:19:54 CVEfixes INFO The CVE json for 2012 has been merged
03/21/2023 14:19:57 CVEfixes INFO The CVE json for 2013 has been merged
03/21/2023 14:20:00 CVEfixes INFO The CVE json for 2014 has been merged
03/21/2023 14:20:02 CVEfixes INFO The CVE json for 2015 has been merged
03/21/2023 14:20:04 CVEfixes INFO The CVE json for 2016 has been merged
03/21/2023 14:20:08 CVEfixes INFO The CVE json for 2017 has been merged
03/21/2023 14:20:13 CVEfixes INFO The CVE json for 2018 has been merged
03/21/2023 14:20:16 CVEfixes INFO The CVE json for 2019 has been merged
03/21/2023 14:20:23 CVEfixes INFO The CVE json for 2020 has been merged
03/21/2023 14:20:30 CVEfixes INFO The CVE json for 2021 has been merged
03/21/2023 14:20:34 CVEfixes INFO The CVE json for 2022 has been merged
03/21/2023 14:20:35 CVEfixes INFO The CVE json for 2023 has been merged
03/21/2023 14:20:35 CVEfixes INFO Flattening CVE items and removing the duplicates...
03/21/2023 14:22:35 CVEfixes INFO All CVEs have been merged into the cve table
03/21/2023 14:22:35 CVEfixes INFO ----------------------------------------------------------------------
03/21/2023 14:22:37 CVEfixes INFO Extracting CWE data from cwec_v4.10.xml
03/21/2023 14:22:39 CVEfixes INFO Adding CWE category to CVE records...
03/21/2023 14:25:12 CVEfixes DEBUG List of CWEs from CVEs that are not associated to cwe table are as follows:
03/21/2023 14:25:12 CVEfixes DEBUG {'CWE-1026'}
Traceback (most recent call last):
  File "Code/collect_projects.py", line 245, in <module>
    cve_importer.import_cves()
  File "/Users/tomb/CVEfixes/Code/cve_importer.py", line 175, in import_cves
    assign_cwes_to_cves(df_cve=df_cve)
  File "/Users/tomb/CVEfixes/Code/cve_importer.py", line 128, in assign_cwes_to_cves
    assert set(list(df_cwes_class.cwe_id)).issubset(set(list(df_cwes.cwe_id))), \
AssertionError: Not all foreign keys for the cwe_classification records are present in the cwe table!

Note that if I re-run the code with a non-zero sample limit of 100, it finishes fine. Any ideas what might be causing this? Is it an external issue related to the data of CWE-1026?
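
The failing assertion is a subset check: every cwe_id in the cwe_classification records must also exist in the cwe table. A small sketch with made-up sample values (not real data from the run) shows how a set difference lists the offending IDs instead of only failing; `missing_foreign_keys` is a hypothetical helper, not part of the project's code.

```python
def missing_foreign_keys(child_ids, parent_ids):
    """Return the sorted IDs in child_ids that have no match in parent_ids."""
    return sorted(set(child_ids) - set(parent_ids))


# Made-up sample values, for illustration only.
cwe_table_ids = ["CWE-79", "CWE-89"]
classification_ids = ["CWE-79", "CWE-1026", "CWE-1026"]

orphans = missing_foreign_keys(classification_ids, cwe_table_ids)
print(orphans)  # ['CWE-1026']
```

Applied to the two cwe_id columns named in the traceback, the same difference would show whether CWE-1026 is the only identifier missing from the cwe table.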
