
software-mention-extraction's Introduction

Software mention extraction and linking from scientific articles 💾

Much of cutting-edge science is built on scientific software, which often makes that software as important as the traditional scholarly literature. Despite this, software is not always treated as such, especially when it comes to funding, credit, and citations. Moreover, with the ever-growing number of open-source tools, it is impossible for many researchers to keep track of the tools, databases, and methods in a given field.

In an effort to automate the process of identifying and crediting relevant and essential software in the biomedical domain, we've developed a machine learning model that extracts mentions of software from scientific articles. The input to the model is text from a scientific article and the output is a list of the software mentioned in it.

We applied this model to the CORD-19 full-text articles and stored the output in CORD-19 Software Mentions. Cite as: Wade, Alex D., & Williams, Ivana. (2021). CORD-19 Software Mentions [Data set]. https://doi.org/10.5061/dryad.vmcvdncs0

Getting started

Dependencies

Python 3.7+ (tested on 3.7.4)
Python packages: pandas, numpy, keras, torch, nltk, scikit-learn (sklearn), transformers, seqeval, tqdm, blink, bs4 (plus the standard-library modules os, json, time, re, itertools, argparse)

Training

Data

Softcite:

  • Softcite data repository

  • Full corpus (downloaded on February 8, 2021)

  • Download the XML file above and place it in the ./data folder.

  • XML file processing notebook: ./notebooks/Parse softcite data.ipynb

    • Input: ./data/softcite_corpus-full.tei.xml

    • Output: ./data/labeled_dfs_all.csv (a minimal parsing sketch follows below)
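The notebook performs the actual parsing; purely for orientation, here is a minimal sketch of the idea, assuming only the TEI markup visible in the issues below (annotated <rs type="..."> spans inside paragraphs). The function name, tokenization, and columns are illustrative, not the notebook's (e.g. paper_id and proper sentence splitting are omitted):

from bs4 import BeautifulSoup
import pandas as pd

def parse_softcite_tei(path="./data/softcite_corpus-full.tei.xml"):
    # Illustrative only: turn each <p> into one "sentence" of (word, IOB tag) rows.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")

    rows = []
    for sent_id, p in enumerate(soup.find_all("p")):
        for node in p.children:
            if getattr(node, "name", None) == "rs":
                # annotated span, e.g. type="software", "version", "publisher", "url"
                tokens = node.get_text().split()
                label = node.get("type", "other")
                tags = ["B-" + label] + ["I-" + label] * (len(tokens) - 1)
            else:
                # plain text between annotations
                tokens = str(node).split()
                tags = ["O"] * len(tokens)
            rows.extend({"word": w, "tag": t, "sentence_id": sent_id}
                        for w, t in zip(tokens, tags))
    return pd.DataFrame(rows)

# parse_softcite_tei().to_csv("./data/labeled_dfs_all.csv", index=False)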


Model

Performance:


Inference

Software Mentions

  • Download the pretrained model 'scibert_software_sent' from s3://meta-prod-ds-storage/software_mentions_extraction/models and place it in the ./models/ folder.

  • Example of how to run the model in inference mode: ./scripts/Software mentions inference mode.ipynb

  • Example:
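The notebook is the reference; as a rough illustration, here is a minimal sketch of running the downloaded checkpoint for token classification, assuming it loads through the Hugging Face transformers auto classes and using a hypothetical label mapping (the notebook defines its own tag set and post-processing):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_DIR = "./models/scibert_software_sent"            # downloaded checkpoint
ID2TAG = {0: "O", 1: "B-software", 2: "I-software",     # assumed label order,
          3: "B-version", 4: "I-version"}               # check the notebook

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
model.eval()

text = "Statistical analysis was conducted using SPSS software v 13.0."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                     # (1, seq_len, num_labels)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tags = [ID2TAG.get(i, "O") for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokens, tags)))                          # wordpiece-level predictions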

Wikipedia Linking

  • This model is based on BLINK

  • Follow the instructions in the BLINK GitHub repository to install it and download the relevant models.

  • Example of how to run the model in inference mode: ./scripts/Link text to wikipedia.ipynb

  • Example:
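The notebook is the reference; as a rough illustration, here is a condensed version of the programmatic usage documented in the BLINK README. The model file names and paths refer to the files downloaded by BLINK's own scripts, and the mention/context values are placeholders:

import argparse
import blink.main_dense as main_dense

models_path = "models/"  # wherever the BLINK models were downloaded
config = {
    "test_entities": None, "test_mentions": None, "interactive": False, "top_k": 10,
    "biencoder_model": models_path + "biencoder_wiki_large.bin",
    "biencoder_config": models_path + "biencoder_wiki_large.json",
    "entity_catalogue": models_path + "entity.jsonl",
    "entity_encoding": models_path + "all_entities_large.t7",
    "crossencoder_model": models_path + "crossencoder_wiki_large.bin",
    "crossencoder_config": models_path + "crossencoder_wiki_large.json",
    "fast": False, "output_path": "logs/",
}
args = argparse.Namespace(**config)
models = main_dense.load_models(args, logger=None)

# Link an extracted software mention using its sentence as context.
data_to_link = [{
    "id": 0, "label": "unknown", "label_id": -1,
    "context_left": "statistical analysis was conducted using ",
    "mention": "spss",
    "context_right": " software v 13.0.",
}]
_, _, _, _, _, predictions, scores = main_dense.run(args, None, *models, test_data=data_to_link)
print(predictions[0][:3])  # top Wikipedia entity candidates for the mention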


Extract mentions of software from the CORD-19 dataset 🦠 📚

  • CORD-19 data: see the CORD-19 release page for more information and download instructions
  • Save the data to the ./data/ folder
  • Run the notebook: ./scripts/Software mentions CORD19.ipynb (a minimal sketch of iterating the full-text parses follows below)
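As a rough illustration, here is a minimal sketch of walking the CORD-19 full-text parses, assuming the standard CORD-19 layout (a document_parses/ directory of JSON files with a body_text field); the notebook handles the actual batching and model calls:

import glob
import json

# Walk the CORD-19 full-text JSON parses and collect the body text of each paper.
for path in glob.glob("./data/document_parses/pdf_json/*.json"):
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    paper_id = doc["paper_id"]
    text = " ".join(p["text"] for p in doc["body_text"])
    # ...run the software-mention model on `text`, as the notebook does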

Contributing

Contributions and ideas are welcome! Please see our contributing guide, and don't hesitate to open an issue or send a pull request to improve the functionality of this project.

Code of Conduct

This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to [email protected].

License

MIT


software-mention-extraction's Issues

Incorrect IOB labels after BERT tokenization

Follow-up of #1 :)

In the notebook notebooks/Train software mentions model.ipynb, when expanding the labels over the extra tokens introduced by the BERT tokenizer, the added labels are copies of the leading label (which starts with B-), whereas they should be "inside" labels starting with I-, following the IOB scheme.

def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        labels.extend([label] * n_subwords)        # <--- problem !
    return tokenized_sentence, labels

For example:

['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(', 
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'B-version', 'B-version', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

The problem is also visible in the notebook notebooks/Software mentions inference mode.ipynb, cells 6-7:

test_sentence = "I used Python package DBSCAN 1.234 for this analysis"
[('I', 'O'), ('used', 'O'), ('Python', 'B-software'), ('package', 'O'), ('DBSCAN', 'B-software'), 
('1', 'B-version'), ('.', 'B-version'), ('234', 'B-version'), ('for', 'O'), ('this', 'O'), ('analysis', 'O')]

We should not have three separate "version" entities, but only one (1.234) spanning three tokens.

When seqeval then computes the evaluation scores, it counts every label starting with B- as a separate entity, over-estimating the scores (they become token-level rather than entity-level scores). There are 4093 software names in the Softcite dataset, but the seqeval evaluation in the training/eval notebook, which is supposed to cover 10% of the corpus, is done on 959 software names; see the support column below:

F1 score: 0.922123
Accuracy score: 0.995906
           precision    recall  f1-score   support

 software     0.9014    0.9343    0.9176       959
  version     0.9216    0.9515    0.9363       309

micro avg     0.9063    0.9385    0.9221      1268
macro avg     0.9063    0.9385    0.9221      1268

To fix the issue, the expanded labels should be prefixed with I-, for example:

def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        # extend tokens replacing B-label by I-label (otherwise we have B- everywhere at each token and wrong 
        # entity scores)
        labels.extend([label])
        if n_subwords>0:
            for i in range(0, n_subwords-1):
                labels.extend([label.replace("B-", "I-")])
    return tokenized_sentence, labels

We then have, as expected:

['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(', 
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'I-version', 'I-version', 'O', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

The evaluation is then correctly performed on 10% of the corpus (404 software names), and the scores become (combined with the fix from #1):

F1 score: 0.855545
           precision    recall  f1-score   support

 software     0.8107    0.8589    0.8341       404
  version     0.9180    0.9412    0.9295       119

micro avg     0.8345    0.8776    0.8555       523
macro avg     0.8352    0.8776    0.8558       523

0.85 is roughly the same F1-score I obtain with my independent implementation using the same model and the Softcite dataset (~0.84).

It's possible that this problem also impacts inference and the ability to segment software names and versions spanning several tokens, because the model is trained on almost exclusively B- tokens, so it's likely not just an evaluation issue.

[Bug] Can't Download Model from S3

The README says:

Download pretrained model 'scibert_software_sent' from: s3://meta-prod-ds-storage/software_mentions_extraction/models and place it in the ./models/ folder.

I tried to do this, but I'm getting an "access denied" error:

$ aws s3 cp s3://meta-prod-ds-storage/software_mentions_extraction/models/scibert_software_sent ./
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

$ aws s3 ls  s3://meta-prod-ds-storage/software_mentions_extraction/models
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************GC2L         iam-role
secret_key     ****************8TvJ         iam-role
    region                us-east-1             imds

Is the bucket closed now? If so, feel free to close this issue. Otherwise, please let me know if there's something I'm missing to access the model.

Wrongly removed tokens

Hello !

Given that you report a 92% F1-score on the Softcite dataset, I was wondering why, with the same SciBERT model and dataset, I was getting 8 points less in F1-score. Reproducing your training with the notebook, I found two problems in your data preparation which, I think, explain the difference.

The first one is related to a filtering step that removes all tokens and tags for labels other than software and version. In cell 3 of the notebook notebooks/Train software mentions model.ipynb, we have:

data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]

This removes, in particular, all tokens for publishers and URLs. For example, here "Microsoft" and "SPSS Inc" are removed:

Input:

Radiographic errors were recorded on individual tick sheets and the information was captured 
in an <rs cert="1.0" resp="#annotator0" type="software" xml:id="a7f72b2925-software-0">Excel</rs> 
spreadsheet (<rs corresp="#a7f72b2925-software-0" resp="#curator" type="publisher">Microsoft</rs>, 
Redmond, WA). The readers resolved any differences by consensus.

Tokens:

['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the', 
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', ',', 'Red', 
'##mond', ',', 'WA', ')', '.']

Input:

The <rs cert="1.0" resp="#annotator0" type="software" xml:id="f204e3a468-software-0">SPSS</rs> 
software version <rs corresp="#f204e3a468-software-0" resp="#annotator0" type="version">11.0</rs> 
(<rs corresp="#f204e3a468-software-0" resp="#curator" type="publisher">SPSS Inc</rs>., Chicago, 
USA) was used for the statistical analysis.

Tokens:

['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', '.', ',', 'Chicago', ',', 'USA', ')', 'was', 
'used', 'for', 'the', 'statistical', 'analysis', '.']

So the model is trained and evaluated without the text corresponding to publishers and URLs. This impacts the evaluation because publishers and URLs are often ambiguous with software names.

This can be fixed by replacing the labels to be excluded with O, so that the tokens are not removed:

data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
# replace 'B-publisher', 'B-url', 'I-publisher','I-url' and reference marker labels by 'O'
data['tag'] = data['tag'].replace(['B-publisher', 'B-url', 'B-bibr', 'B-table', 'B-figure', 'B-formula', 'I-publisher', 'I-url', 'I-bibr', 'I-table', 'I-figure', 'I-formula'], 'O')
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]

We then have, as expected:

['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the', 
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', 'Microsoft', ',', 
'Red', '##mond', ',', 'WA', ')', '.']
['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', 'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'USA', ')', 
'was', 'used', 'for', 'the', 'statistical', 'analysis', '.']

The evaluation then becomes:

F1 score: 0.883440
           precision    recall  f1-score   support

 software     0.8468    0.8737    0.8600       974
  version     0.9440    0.9610    0.9524       333

micro avg     0.8713    0.8959    0.8834      1307
macro avg     0.8715    0.8959    0.8836      1307

However, this probably impacts more than just the evaluation: it also means the model never sees publisher or URL tokens during training, so inference on new articles containing such tokens might degrade too.

Note: I don't know how to open a PR against a notebook, but the code snippet above fixes the problem (in cell 3 of notebooks/Train software mentions model.ipynb).
