Giter Club home page Giter Club logo

software-mentions's Introduction

A large dataset of software mentions in the biomedical literature

About

We describe the CZ Software Mentions dataset, a new dataset of software mentions in biomedical papers. Plain-text software mentions are extracted with a trained SciBERT model from several sources: the NIH PubMed Central collection and from papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata, and, for a number of mentions, the disambiguated software entities and links. We extract 1.12 million unique string software mentions from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021) and 934k unique mentions from 4 million papers in the Publishers’ collection. There is variation in how software is mentioned in papers and extracted by the NER algorithm. We propose a clustering-based disambiguation algorithm to map plain-text software mentions into distinct software entities and apply it on the NIH PubMed Central Commercial collection. Through this methodology, we disambiguate 1.12 million unique strings extracted by the NER model into ~97,000 unique software entities, covering 78% of all links. We link 185,000 of the mentions to a repository, covering about 55% of all software-paper links. We make all data and code publicly available as a new resource to help assess the impact of software (in particular scientific open source projects) on science.

This code repository accompanies our data release. It is released under the MIT License.

Authors

Ana-Maria Istrate, Donghui Li, Dario Taraborelli, Michaela Torkar, Boris Veytsman, Ivana Williams.

Chan Zuckerberg Initiative

Directories

Links

Code of Conduct

This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to [email protected].

Reporting Security Issues

If you believe you have found a security issue, please responsibly disclose by contacting us at [email protected].

software-mentions's People

Contributors

anaistrate avatar borisveytsman avatar dartar avatar ivana-jelic avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

software-mentions's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.