
software-mention-extraction's Introduction

Software mention extraction and linking from scientific articles 💾

Much of cutting-edge science is built on scientific software, which often makes that software as important as the traditional scholarly literature. Despite this, software is not always treated as such, especially when it comes to funding, credit, and citations. Moreover, with the ever-growing number of open-source tools, it is impossible for many researchers to keep track of the tools, databases, and methods in a given field.

In an effort to automate the process of identifying and crediting relevant and essential software in the biomedical domain, we've developed a machine learning model that extracts mentions of software from scientific articles. The input to the model is text from a scientific article and the output is a list of the software mentioned in it.

We applied this model to the CORD-19 full-text articles and stored the output in CORD-19 Software Mentions. Cite as: Wade, Alex D., & Williams, Ivana. (2021). CORD-19 Software Mentions [Data set]. https://doi.org/10.5061/dryad.vmcvdncs0

Getting started

Dependencies

Python 3.7+ (tested on 3.7.4)
Python packages: pandas, numpy, keras, torch, nltk, scikit-learn (sklearn), transformers, seqeval, tqdm, blink, bs4 (plus the standard-library modules os, json, time, re, itertools, argparse)

Training

Data

Softcite:

  • Softcite data repository

  • Full corpus (downloaded on February 8, 2021)

  • Download the XML file above and place it in the ./data folder.

  • XML file processing notebook: ./notebooks/Parse softcite data.ipynb

    • Input: ./data/softcite_corpus-full.tei.xml

    • Output: ./data/labeled_dfs_all.csv (a minimal parsing sketch follows below)
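The notebook performs the actual parsing; purely for orientation, here is a minimal sketch of the idea, assuming only the TEI markup visible in the issues below (annotated <rs type="..."> spans inside paragraphs). The function name, tokenization, and columns are illustrative, not the notebook's (e.g. paper_id and proper sentence splitting are omitted):

from bs4 import BeautifulSoup
import pandas as pd

def parse_softcite_tei(path="./data/softcite_corpus-full.tei.xml"):
    # Illustrative only: turn each <p> into one "sentence" of (word, IOB tag) rows.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")

    rows = []
    for sent_id, p in enumerate(soup.find_all("p")):
        for node in p.children:
            if getattr(node, "name", None) == "rs":
                # annotated span, e.g. type="software", "version", "publisher", "url"
                tokens = node.get_text().split()
                label = node.get("type", "other")
                tags = ["B-" + label] + ["I-" + label] * (len(tokens) - 1)
            else:
                # plain text between annotations
                tokens = str(node).split()
                tags = ["O"] * len(tokens)
            rows.extend({"word": w, "tag": t, "sentence_id": sent_id}
                        for w, t in zip(tokens, tags))
    return pd.DataFrame(rows)

# parse_softcite_tei().to_csv("./data/labeled_dfs_all.csv", index=False)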


Model

Performance:


Inference

Software Mentions

  • Download the pretrained model 'scibert_software_sent' from s3://meta-prod-ds-storage/software_mentions_extraction/models and place it in the ./models/ folder.

  • Example of how to run the model in inference mode: ./scripts/Software mentions inference mode.ipynb

  • Example:
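The notebook is the reference; as a rough illustration, here is a minimal sketch of running the downloaded checkpoint for token classification, assuming it loads through the Hugging Face transformers auto classes and using a hypothetical label mapping (the notebook defines its own tag set and post-processing):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_DIR = "./models/scibert_software_sent"            # downloaded checkpoint
ID2TAG = {0: "O", 1: "B-software", 2: "I-software",     # assumed label order,
          3: "B-version", 4: "I-version"}               # check the notebook

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
model.eval()

text = "Statistical analysis was conducted using SPSS software v 13.0."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                     # (1, seq_len, num_labels)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tags = [ID2TAG.get(i, "O") for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokens, tags)))                          # wordpiece-level predictions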

Wikipedia Linking

  • This model is based on BLINK

  • Follow the instructions in the BLINK GitHub repository to install it and download the relevant models.

  • Example of how to run the model in inference mode: ./scripts/Link text to wikipedia.ipynb

  • Example:
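The notebook is the reference; as a rough illustration, here is a condensed version of the programmatic usage documented in the BLINK README. The model file names and paths refer to the files downloaded by BLINK's own scripts, and the mention/context values are placeholders:

import argparse
import blink.main_dense as main_dense

models_path = "models/"  # wherever the BLINK models were downloaded
config = {
    "test_entities": None, "test_mentions": None, "interactive": False, "top_k": 10,
    "biencoder_model": models_path + "biencoder_wiki_large.bin",
    "biencoder_config": models_path + "biencoder_wiki_large.json",
    "entity_catalogue": models_path + "entity.jsonl",
    "entity_encoding": models_path + "all_entities_large.t7",
    "crossencoder_model": models_path + "crossencoder_wiki_large.bin",
    "crossencoder_config": models_path + "crossencoder_wiki_large.json",
    "fast": False, "output_path": "logs/",
}
args = argparse.Namespace(**config)
models = main_dense.load_models(args, logger=None)

# Link an extracted software mention using its sentence as context.
data_to_link = [{
    "id": 0, "label": "unknown", "label_id": -1,
    "context_left": "statistical analysis was conducted using ",
    "mention": "spss",
    "context_right": " software v 13.0.",
}]
_, _, _, _, _, predictions, scores = main_dense.run(args, None, *models, test_data=data_to_link)
print(predictions[0][:3])  # top Wikipedia entity candidates for the mention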


Extract mentions of software from the CORD-19 dataset 🦠 📚

  • CORD-19 data: see the CORD-19 release page for more information and download instructions
  • Save the data to the ./data/ folder
  • Run the notebook: ./scripts/Software mentions CORD19.ipynb (a minimal sketch of iterating the full-text parses follows below)
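As a rough illustration, here is a minimal sketch of walking the CORD-19 full-text parses, assuming the standard CORD-19 layout (a document_parses/ directory of JSON files with a body_text field); the notebook handles the actual batching and model calls:

import glob
import json

# Walk the CORD-19 full-text JSON parses and collect the body text of each paper.
for path in glob.glob("./data/document_parses/pdf_json/*.json"):
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    paper_id = doc["paper_id"]
    text = " ".join(p["text"] for p in doc["body_text"])
    # ...run the software-mention model on `text`, as the notebook does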

Contributing

Contributions and ideas are welcome! Please see our contributing guide, and don't hesitate to open an issue or send a pull request to improve the functionality of this project.

Code of Conduct

This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to [email protected].

License

MIT


software-mention-extraction's Issues

Incorrect IOB labels after BERT tokenization

Follow-up of #1 :)

In the notebook notebooks/Train software mentions model.ipynb, when expanding the labels over the extra tokens introduced by the BERT tokenizer, the added labels are copies of the leading label (which starts with B-), whereas they should be "inside" labels starting with I-, following the IOB scheme.

def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        labels.extend([label] * n_subwords)        # <--- problem !
    return tokenized_sentence, labels

For example:

['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(', 
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'B-version', 'B-version', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

The problem is also visible in the notebook notebooks/Software mentions inference mode.ipynb, cells 6-7:

test_sentence = "I used Python package DBSCAN 1.234 for this analysis"
[('I', 'O'), ('used', 'O'), ('Python', 'B-software'), ('package', 'O'), ('DBSCAN', 'B-software'), 
('1', 'B-version'), ('.', 'B-version'), ('234', 'B-version'), ('for', 'O'), ('this', 'O'), ('analysis', 'O')]

We should not have three separate "version" entities, but only one (1.234) spanning three tokens.

When seqeval then computes the evaluation scores, it counts every label starting with B- as a separate entity, over-estimating the scores (they become token-level rather than entity-level scores). There are 4093 software names in the Softcite dataset, but the seqeval evaluation in the training/eval notebook, which is supposed to cover 10% of the corpus, is done on 959 software names; see the support column below:

F1 score: 0.922123
Accuracy score: 0.995906
           precision    recall  f1-score   support

 software     0.9014    0.9343    0.9176       959
  version     0.9216    0.9515    0.9363       309

micro avg     0.9063    0.9385    0.9221      1268
macro avg     0.9063    0.9385    0.9221      1268

To fix the issue, the expanded labels should be prefixed with I-, for example:

def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        # extend tokens replacing B-label by I-label (otherwise we have B- everywhere at each token and wrong 
        # entity scores)
        labels.extend([label])
        if n_subwords>0:
            for i in range(0, n_subwords-1):
                labels.extend([label.replace("B-", "I-")])
    return tokenized_sentence, labels

We then have, as expected:

['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(', 
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'I-version', 'I-version', 'O', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

The evaluation is then correctly performed on 10% of the corpus (404 software names), and the scores become (combined with the fix from #1):

F1 score: 0.855545
           precision    recall  f1-score   support

 software     0.8107    0.8589    0.8341       404
  version     0.9180    0.9412    0.9295       119

micro avg     0.8345    0.8776    0.8555       523
macro avg     0.8352    0.8776    0.8558       523

0.85 is roughly the same F1-score I obtain with my independent implementation using the same model and the Softcite dataset (~0.84).

It's possible that this problem also impacts inference and the ability to segment software names and versions spanning several tokens, because the model is trained on almost exclusively B- tokens, so it's likely not just an evaluation issue.

[Bug] Can't Download Model from S3

The README says:

Download pretrained model 'scibert_software_sent' from: s3://meta-prod-ds-storage/software_mentions_extraction/models and place it in the ./models/ folder.

I tried to do this, but I'm getting an "access denied" error:

$ aws s3 cp s3://meta-prod-ds-storage/software_mentions_extraction/models/scibert_software_sent ./
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

$ aws s3 ls  s3://meta-prod-ds-storage/software_mentions_extraction/models
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************GC2L         iam-role
secret_key     ****************8TvJ         iam-role
    region                us-east-1             imds

Is the bucket closed now? If so, feel free to close this issue. Otherwise, please let me know if there's something I'm missing to access the model.

Wrongly removed tokens

Hello !

Given that you report a 92% F1-score on the Softcite dataset, I was wondering why, with the same SciBERT model and dataset, I was getting 8 points less in F1-score. Reproducing your training with the notebook, I found two problems in your data preparation which, I think, explain the difference.

The first one is related to a filtering step that removes all tokens and tags for labels other than software and version. In cell 3 of the notebook notebooks/Train software mentions model.ipynb, we have:

data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]

This removes, in particular, all tokens for publishers and URLs. For example, here "Microsoft" and "SPSS Inc" are removed:

Input:

Radiographic errors were recorded on individual tick sheets and the information was captured 
in an <rs cert="1.0" resp="#annotator0" type="software" xml:id="a7f72b2925-software-0">Excel</rs> 
spreadsheet (<rs corresp="#a7f72b2925-software-0" resp="#curator" type="publisher">Microsoft</rs>, 
Redmond, WA). The readers resolved any differences by consensus.

Tokens:

['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the', 
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', ',', 'Red', 
'##mond', ',', 'WA', ')', '.']

Input:

The <rs cert="1.0" resp="#annotator0" type="software" xml:id="f204e3a468-software-0">SPSS</rs> 
software version <rs corresp="#f204e3a468-software-0" resp="#annotator0" type="version">11.0</rs> 
(<rs corresp="#f204e3a468-software-0" resp="#curator" type="publisher">SPSS Inc</rs>., Chicago, 
USA) was used for the statistical analysis.

Tokens:

['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', '.', ',', 'Chicago', ',', 'USA', ')', 'was', 
'used', 'for', 'the', 'statistical', 'analysis', '.']

So the model is trained and evaluated without the text corresponding to publishers and URLs. This impacts the evaluation because publishers and URLs are often ambiguous with software names.

This can be fixed by replacing the labels to be excluded with O, so that the tokens are not removed:

data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
# replace 'B-publisher', 'B-url', 'I-publisher','I-url' and reference marker labels by 'O'
data['tag'] = data['tag'].replace(['B-publisher', 'B-url', 'B-bibr', 'B-table', 'B-figure', 'B-formula', 'I-publisher', 'I-url', 'I-bibr', 'I-table', 'I-figure', 'I-formula'], 'O')
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]

We then have, as expected:

['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the', 
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', 'Microsoft', ',', 
'Red', '##mond', ',', 'WA', ')', '.']
['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', 'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'USA', ')', 
'was', 'used', 'for', 'the', 'statistical', 'analysis', '.']

The evaluation then becomes:

F1 score: 0.883440
           precision    recall  f1-score   support

 software     0.8468    0.8737    0.8600       974
  version     0.9440    0.9610    0.9524       333

micro avg     0.8713    0.8959    0.8834      1307
macro avg     0.8715    0.8959    0.8836      1307

However, this probably impacts more than just the evaluation: it also means the model never sees publisher or URL tokens during training, so inference on new articles containing such tokens might degrade too.

Note: I don't know how to open a PR against a notebook, but the code snippet above fixes the problem (in cell 3 of notebooks/Train software mentions model.ipynb).
