pedromtq / unifunc

Tool for similarity analysis of protein function annotations.

License: MIT License

bioinformatics bioinformatics-analysis text-mining protein-annotation nlp

unifunc's Introduction

UniFunc

UniFunc is a text mining tool that processes a pair of protein function annotations and analyses the text similarity between them. It is mainly used as a cross-linking mechanism or redundancy elimination tool when processing annotations that lack any kind of database identifier.

Citing UniFunc

Please cite https://doi.org/10.1515/hsz-2021-0125 when using UniFunc.

Installing unifunc

Installing unifunc from the conda-forge channel can be achieved by adding conda-forge to your channels with:

conda config --add channels conda-forge
conda config --set channel_priority strict

Once the conda-forge channel has been enabled, unifunc can be installed with:

conda install unifunc

Using UniFunc

UniFunc can be run in two modes:

The default mode returns the similarity score (a float) between the provided strings. To run it, use:

unifunc "this is string1" "this is string2"

The secondary mode requires the user to set a threshold (e.g. 0.95) with the argument -t, and True will be returned if the string similarity is above the threshold, and False otherwise. To run it use:

unifunc string1 string2 -t 0.95

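To use verbose mode, add the argument -v. To redirect output to a file, add the argument -t file_path (note that -t is also the threshold argument described above, so confirm the output flag with unifunc -h).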

To run a sample execution use: unifunc --example
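
The two modes above can also be driven from a script. The following is a minimal sketch, assuming that the unifunc executable is on the PATH and that it prints its result to standard output (neither is guaranteed by this README). The helper names build_unifunc_command and run_unifunc are hypothetical; only the -t and -v arguments come from the text above.

```python
import subprocess


def build_unifunc_command(string1, string2, threshold=None, verbose=False):
    """Build the argument list for one UniFunc comparison (hypothetical helper)."""
    command = ["unifunc", string1, string2]
    if threshold is not None:
        # Secondary mode: UniFunc answers True/False instead of a score.
        command += ["-t", str(threshold)]
    if verbose:
        command.append("-v")
    return command


def run_unifunc(string1, string2, threshold=None):
    """Run UniFunc and return whatever it printed.

    Assumption: the result is written to standard output.
    """
    command = build_unifunc_command(string1, string2, threshold=threshold)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Passing the strings as separate list items avoids shell quoting problems with annotations that contain spaces or special characters.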

Using workflows

At the moment, only one workflow is available: cluster_function. To use it, run unifunc cluster_function -h and you will get all the information regarding inputs.

Updating the corpus

1. Delete all files in UniFunc/Resources/
2. Go to https://www.uniprot.org/uniprot/?query=reviewed
3. Search for all protein entries
4. Choose the columns Entry, Protein names, and Function [CC]
5. Apply columns
6. Download results in tab-separated format
7. Check that the downloaded file has these 3 headers: Entry, Protein names, Function [CC]
8. Rename the downloaded file to uniprot.tab and move it to UniFunc/Resources/
9. Go to http://geneontology.org/docs/download-ontology/
10. Download go.obo
11. Move the file go.obo to UniFunc/Resources/
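
The header check in the steps above can be automated. This is a small sketch, assuming the download is a plain tab-separated text file whose first line holds the column names in the order listed; the function name has_expected_headers is hypothetical and not part of UniFunc.

```python
import csv

# The three column names the corpus file is expected to start with.
EXPECTED_HEADERS = ("Entry", "Protein names", "Function [CC]")


def has_expected_headers(path, expected=EXPECTED_HEADERS):
    """Return True if the first line of a tab-separated file matches `expected`."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty file -> empty header
    return [name.strip() for name in header] == list(expected)
```

For example, has_expected_headers("uniprot.tab") can be called before moving the file into UniFunc/Resources/.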

Here's an overview of the UniFunc workflow:

How does UniFunc work?

The natural language processing of functional descriptions entails several steps:

Text pre-processing:
- Split functional descriptions into documents
- Remove identifiers
- Standardize punctuation
- Remove digits that are not attached to a token
- Standardize ion patterns
- Replace Roman numerals with Arabic numerals
- Divide document into groups of tokens
- Unite certain tokens (for example, “polymerase” and “3” should be merged into “polymerase 3”)
Part-of-speech tagging
- pos_tag with universal tagging (contextual)
- Wordnet tagging (independent)
- Choose best tag (Wordnet takes priority)
- Removal of unwanted tags (determiners, pronouns, particles, and conjunctions)
Token scoring
- Try to find synonyms (WordNet lexicon) shared between the two compared documents
- Build Term Frequency-Inverse Document Frequency (TF-IDF) vectors
Similarity analysis
- Calculate cosine distance between the two scaled vectors
- Calculate Jaccard distance between the two sets of identifiers
- If the similarity score is above 0.8, consider it a match
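
A few of the text pre-processing steps listed above can be sketched as follows. This is a simplified illustration and not UniFunc's actual implementation: it only standardizes punctuation, replaces Roman numerals, and unites a number with the token before it. The function names and regular expressions are assumptions made for this example.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
# Well-formed lowercase Roman numerals from 1 to 3999.
ROMAN_PATTERN = re.compile(
    r"(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})"
)


def roman_to_arabic(numeral):
    """Convert a well-formed lowercase Roman numeral to an integer."""
    values = [ROMAN_VALUES[char] for char in numeral]
    total = 0
    for current, following in zip(values, values[1:] + [0]):
        # A smaller value before a larger one is subtracted (e.g. "iv" = 4).
        total += -current if current < following else current
    return total


def preprocess(description):
    """Lowercase, standardize punctuation, convert numerals, unite tokens."""
    # Replace punctuation with spaces, keeping "+" and "-" for ion patterns.
    text = re.sub(r"[^\w\s+-]", " ", description.lower())
    tokens = []
    for token in text.split():
        # Naive: single letters are skipped, but words such as "mix"
        # would still be read as Roman numerals.
        if len(token) > 1 and ROMAN_PATTERN.fullmatch(token):
            token = str(roman_to_arabic(token))
        if token.isdigit() and tokens:
            # Unite a free-standing number with the preceding token.
            tokens[-1] = f"{tokens[-1]} {token}"
        else:
            tokens.append(token)
    return tokens
```

With this sketch, "DNA polymerase III, subunit beta" becomes the tokens "dna", "polymerase 3", "subunit", and "beta".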

Part-of-speech tagging

Part-of-speech tagging (POST) is the method of lexically classifying tokens based on their definition and context. In the context of this application, the point is to eliminate tokens that are not relevant to the similarity analysis.
After pre-processing, tokens are tagged, independently of context, with a custom tagger based on SequentialBackoffTagger. This tagger uses WordNet’s lexicon to identify the most common lexical category of any given token.
Should a token be present in Wordnet’s lexicon, a list of synonyms and their lexical category is generated, for example:

[(token, noun), (synonym1, noun), (synonym2, verb), (synonym3, adjective), (synonym4, noun)]

The token is then assigned the most common tag, in this case noun.

To adjust this lexicon to biological data, gene ontology tokens are also added.
Untagged tokens are then contextually classified with a Perceptron tagger. The classification obtained from this tagger is not optimal, as a pre-trained classifier is used. In the current context this is of little consequence, since this tagger only serves as a backup when no other tag is available. Ideally a new model would be trained, but this would require a heavy time investment in building a training dataset.
The tokens tagged as being determiners, pronouns, particles, or conjunctions are removed.
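
The majority vote and the tag filter described above can be illustrated in plain Python. This sketch does not call NLTK or WordNet; the synonym list is passed in directly, and the tag labels (DET, PRON, PRT, CONJ) are assumed to follow the universal tagset. The function names are hypothetical.

```python
from collections import Counter

# Tags to drop: determiners, pronouns, particles, and conjunctions.
UNWANTED_TAGS = {"DET", "PRON", "PRT", "CONJ"}


def most_common_tag(tagged_synonyms):
    """Return the lexical category that occurs most often in the list.

    Ties are resolved in favour of the tag that was seen first.
    """
    counts = Counter(tag for _, tag in tagged_synonyms)
    return counts.most_common(1)[0][0]


def remove_unwanted(tagged_tokens):
    """Keep only the tokens whose tag is not in UNWANTED_TAGS."""
    return [token for token, tag in tagged_tokens if tag not in UNWANTED_TAGS]
```

Applied to the example list above, most_common_tag returns noun, since three of the five entries carry that tag.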

Token scoring

In this step, tokens are scored using the “Term Frequency-Inverse Document Frequency” (TF-IDF) technique. This shows which tokens are most relevant to a given annotation, which in turn allows the identification of other annotations that share the same important tokens.

TF-IDF measures the importance of a token to a document in a corpus. To summarize:

TF - Tokens that appear more often in a document should be more important. This is a local (document wide) metric.
IDF - Tokens that appear in too many documents should be less important. This is a global (corpus wide) metric.

In its standard form, TF-IDF is calculated with the following equation:

TF-IDF = (NT / TT) × log(TD / DT)

where:
NT is the number of times the token appears in the document
TT is the total number of tokens in the document
TD is the total number of documents
DT is the number of documents in which the token appears (taken from the frequency table)

The corpus used to build this metric consisted of all 561,911 reviewed proteins from UniProt (as of 2020/04/14). After pre-processing, each protein annotation is split into tokens, and a token frequency table (DT) is calculated and saved to a file.

The TF-IDF score is then locally scaled (min-max scaling relative to the document) so that we can better understand which tokens are most relevant within the analysed document.
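
To make the scoring and scaling concrete, here is a minimal sketch using the textbook TF-IDF formulation. The base-10 logarithm, the fallback document count of 1 for unseen tokens, and the function names are assumptions for this example and may differ from what UniFunc does internally.

```python
import math


def tf_idf(tokens, document_frequency, total_documents):
    """Score each distinct token of one document.

    document_frequency maps a token to the number of corpus documents
    that contain it; unseen tokens are treated as appearing once (assumption).
    """
    total_tokens = len(tokens)
    scores = {}
    for token in set(tokens):
        tf = tokens.count(token) / total_tokens
        idf = math.log10(total_documents / document_frequency.get(token, 1))
        scores[token] = tf * idf
    return scores


def min_max_scale(scores):
    """Rescale the scores of one document to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All tokens are equally important within this document.
        return {token: 1.0 for token in scores}
    return {token: (score - low) / (high - low) for token, score in scores.items()}
```

In a toy corpus of 1000 documents where "atp" appears in 10, "binding" in 100, and "protein" in all 1000, the document ["atp", "binding", "protein"] scales to roughly 1.0, 0.5, and 0.0 respectively, so the rarest token dominates.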

Similarity analysis

Finally, we can compare annotations from different sources by calculating the cosine distance between each pair of scaled TF-IDF vectors. If the tokens they contain, and the importance of those tokens within each document, are roughly the same, the annotations are classified as “identical”. Identifiers within the free-text description are also taken into account via the Jaccard distance metric. A simple intersection is not used, as more general identifiers might lead to too many false positives.
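
Both metrics can be sketched in a few lines. The example below computes similarities (the corresponding distance is 1 minus the similarity) on vectors stored as token-to-score dictionaries. How UniFunc combines the two values into one score is not described here, so the sketch keeps them separate; the function names are hypothetical.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(vector_a) & set(vector_b)
    dot = sum(vector_a[token] * vector_b[token] for token in shared)
    norm_a = math.sqrt(sum(value * value for value in vector_a.values()))
    norm_b = math.sqrt(sum(value * value for value in vector_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares nothing with anything
    return dot / (norm_a * norm_b)


def jaccard_similarity(identifiers_a, identifiers_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(identifiers_a), set(identifiers_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

Dividing by the union is what distinguishes the Jaccard metric from a simple intersection: two annotations that share one general identifier but each carry several unshared ones receive a low score.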

Consensus construction

In this manner we are able to construct groups of hits (from different sources) that match each other, either via identifiers or via free-text descriptions. We then evaluate the quality of each consensus group and select the best one, taking into account:

Percentage of the sequence covered by the hits in the consensus
Significance of the hits (e-value) in the consensus
Significance of the reference datasets
Number of different reference datasets in the consensus

unifunc's Issues

TypeError: _pos_tag() got an unexpected keyword argument 'lang'

Apologies for all of the issues as of late. I had trouble running the conda installation:

Here's my conda environment

(unifunc_env) -bash-4.2$ conda list
# packages in environment at /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/unifunc_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
ca-certificates           2021.10.8            ha878542_0    conda-forge
certifi                   2016.9.26                py36_0    conda-forge
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libblas                   3.9.0           12_linux64_openblas    conda-forge
libcblas                  3.9.0           12_linux64_openblas    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 11.2.0              h1d223b6_11    conda-forge
libgfortran-ng            11.2.0              h69a702a_11    conda-forge
libgfortran5              11.2.0              h5c6108e_11    conda-forge
libgomp                   11.2.0              h1d223b6_11    conda-forge
liblapack                 3.9.0           12_linux64_openblas    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.18          pthreads_h8fe5266_0    conda-forge
libstdcxx-ng              11.2.0              he4da1e4_11    conda-forge
libzlib                   1.2.11            h36c2ea0_1013    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
nltk                      3.2.4                    py36_0    conda-forge
numpy                     1.19.5           py36hfc0c790_2    conda-forge
openssl                   1.1.1l               h7f98852_0    conda-forge
pip                       21.3.1             pyhd8ed1ab_0    conda-forge
python                    3.6.15          hb7a2778_0_cpython    conda-forge
python_abi                3.6                     2_cp36m    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.12.5                   py36_0    conda-forge
setuptools                49.6.0           py36h5fab9bb_3    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.37.0               h9cd32fc_0    conda-forge
tk                        8.6.11               h27826a3_1    conda-forge
unifunc                   1.3.4              pyhd8ed1ab_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h36c2ea0_1013    conda-forge

Here's my sample command:

(unifunc_env) -bash-4.2$ unifunc cluster_function -i sample.tsv -o unifunc_output/ -kh
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/unifunc_env/bin/unifunc", line 10, in <module>
    sys.exit(main())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/unifunc_env/lib/python3.6/site-packages/unifunc/__main__.py", line 84, in main
    argv_cluster_representative_function()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/unifunc_env/lib/python3.6/site-packages/unifunc/__main__.py", line 73, in argv_cluster_representative_function
    output_without_representative=output_without_representative,
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/unifunc_env/lib/python3.6/site-packages/Workflows/Representative_function/Cluster_Representative_Function.py", line 27, in __init__
    self.unifunc = source.UniFunc()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/unifunc_env/lib/python3.6/site-packages/unifunc/source.py", line 1062, in __init__
    self.wordnet_tagger = WordNetTagger(go_terms=self.go_terms, perceptron_tagger=self.tagger)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/unifunc_env/lib/python3.6/site-packages/unifunc/source.py", line 732, in __init__
    if self.tag_tokens_perceptron([g]) not in ['ADP','CONJ','DET','PRON','PRT']:
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/unifunc_env/lib/python3.6/site-packages/unifunc/source.py", line 740, in tag_tokens_perceptron
    return _pos_tag(tokens, tagset='universal', tagger=self.perceptron_tagger,lang='eng')
TypeError: _pos_tag() got an unexpected keyword argument 'lang'
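
The traceback suggests that the _pos_tag function in the installed nltk (3.2.4 in the environment listed above) does not accept a lang keyword, so installing a more recent nltk is likely the simplest fix. As an alternative, the sketch below shows a generic way to drop keyword arguments that a function does not support. It is an illustration only, not the maintainer's fix, and the helper name call_with_supported_kwargs is hypothetical.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, silently dropping keyword arguments it does not accept."""
    parameters = inspect.signature(func).parameters
    accepts_any_keyword = any(
        parameter.kind is inspect.Parameter.VAR_KEYWORD
        for parameter in parameters.values()
    )
    if not accepts_any_keyword:
        kwargs = {name: value for name, value in kwargs.items() if name in parameters}
    return func(*args, **kwargs)
```

With such a helper, a call that passes lang='eng' would still work against a function whose signature only has tokens, tagset, and tagger, at the cost of ignoring the language setting.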

[Feature Request] Have `unifunc cluster_function` default read from stdin if no input is provided

I'm working on implementing this into some pipelines and an option for reading from stdin would be really useful.

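    def read_clustered_annotations(self):
        # Yields clusters one by one to keep the memory footprint low.
        # THE INPUT NEEDS TO BE SORTED BY CLUSTER_ID.
        # Requires `import sys` at the top of the module.
        # Falls back to stdin when no input path is provided.
        file = sys.stdin if self.input_path is None else open(self.input_path)
        try:
            temp = []
            previous_cluster_id = None
            for line in file:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                gene_id, cluster_id, annotation = line.split('\t')
                annotation = self.pre_process_annotations(annotation)
                # A new cluster_id means the previous cluster is complete.
                if temp and cluster_id != previous_cluster_id:
                    yield previous_cluster_id, temp
                    temp = []
                temp.append([gene_id, annotation])
                previous_cluster_id = cluster_id
            # Emit the last cluster, which has no following line to trigger it.
            if temp:
                yield previous_cluster_id, temp
        finally:
            # Never close stdin; only close files opened here.
            if file is not sys.stdin:
                file.close()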

What are your thoughts?

How to run UniFunc?

I'm not sure how to run UniFunc. There is no executable called UniFunc. I fixed the syntax error in the setup.py but when I do that, it still doesn't put UniFunc in my path.

Can you create a new environment and make sure none of the steps are missing?

Downloading it from GitHub

(base) jespinoz@jespinozlt2-osx ~ % git clone https://github.com/PedroMTQ/UniFunc/
Cloning into 'UniFunc'...
remote: Enumerating objects: 145, done.
remote: Counting objects: 100% (145/145), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 145 (delta 64), reused 116 (delta 39), pack-reused 0
Receiving objects: 100% (145/145), 22.43 MiB | 5.16 MiB/s, done.
Resolving deltas: 100% (64/64), done.
(base) jespinoz@jespinozlt2-osx ~ % cd UniFunc
(base) jespinoz@jespinozlt2-osx UniFunc % ls
Images		LICENSE		README.md	Resources	Workflows	__main__.py	conda-recipe	pyproject.toml	setup.py	source.py	unifunc_env.yml

Creating the conda environment

(base) jespinoz@jespinozlt2-osx UniFunc % conda env create -f unifunc_env.yml -n unifunc_env
Collecting package metadata (repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.10.3
  latest version: 4.11.0

Please update conda by running

    $ conda update -n base conda



Downloading and Extracting Packages
requests-2.24.0      | 54 KB     | ############################################################################################################################################################################################################# | 100%
xz-5.2.5             | 282 KB    | ############################################################################################################################################################################################################# | 100%
openssl-1.1.1h       | 3.4 MB    | ############################################################################################################################################################################################################# | 100%
ca-certificates-2020 | 127 KB    | ############################################################################################################################################################################################################# | 100%
readline-8.0         | 397 KB    | ############################################################################################################################################################################################################# | 100%
setuptools-50.3.0    | 939 KB    | ############################################################################################################################################################################################################# | 100%
blas-1.0             | 5 KB      | ############################################################################################################################################################################################################# | 100%
pysocks-1.7.1        | 27 KB     | ############################################################################################################################################################################################################# | 100%
sqlite-3.33.0        | 2.5 MB    | ############################################################################################################################################################################################################# | 100%
urllib3-1.25.11      | 93 KB     | ############################################################################################################################################################################################################# | 100%
python-3.8.5         | 25.1 MB   | ############################################################################################################################################################################################################# | 100%
pyopenssl-19.1.0     | 47 KB     | ############################################################################################################################################################################################################# | 100%
six-1.15.0           | 13 KB     | ############################################################################################################################################################################################################# | 100%
ncurses-6.2          | 1016 KB   | ############################################################################################################################################################################################################# | 100%
libffi-3.3           | 48 KB     | ############################################################################################################################################################################################################# | 100%
wheel-0.35.1         | 36 KB     | ############################################################################################################################################################################################################# | 100%
pycparser-2.20       | 94 KB     | ############################################################################################################################################################################################################# | 100%
brotlipy-0.7.0       | 357 KB    | ############################################################################################################################################################################################################# | 100%
intel-openmp-2020.2  | 1.2 MB    | ############################################################################################################################################################################################################# | 100%
cffi-1.14.3          | 219 KB    | ############################################################################################################################################################################################################# | 100%
mkl_fft-1.2.0        | 162 KB    | ############################################################################################################################################################################################################# | 100%
chardet-3.0.4        | 170 KB    | ############################################################################################################################################################################################################# | 100%
certifi-2020.6.20    | 159 KB    | ############################################################################################################################################################################################################# | 100%
cryptography-3.1.1   | 604 KB    | ############################################################################################################################################################################################################# | 100%
tqdm-4.50.2          | 55 KB     | ############################################################################################################################################################################################################# | 100%
libedit-3.1.20191231 | 102 KB    | ############################################################################################################################################################################################################# | 100%
tk-8.6.10            | 3.3 MB    | ############################################################################################################################################################################################################# | 100%
idna-2.10            | 56 KB     | ############################################################################################################################################################################################################# | 100%
nltk-3.5             | 1.1 MB    | ############################################################################################################################################################################################################# | 100%
joblib-0.17.0        | 205 KB    | ############################################################################################################################################################################################################# | 100%
zlib-1.2.11          | 105 KB    | ############################################################################################################################################################################################################# | 100%
click-7.1.2          | 67 KB     | ############################################################################################################################################################################################################# | 100%
mkl-service-2.3.0    | 46 KB     | ############################################################################################################################################################################################################# | 100%
numpy-base-1.19.1    | 5.1 MB    | ############################################################################################################################################################################################################# | 100%
mkl_random-1.1.1     | 337 KB    | ############################################################################################################################################################################################################# | 100%
mkl-2019.4           | 155.2 MB  | ############################################################################################################################################################################################################# | 100%
libcxx-10.0.0        | 1.0 MB    | ############################################################################################################################################################################################################# | 100%
regex-2020.10.15     | 349 KB    | ############################################################################################################################################################################################################# | 100%
numpy-1.19.1         | 20 KB     | ############################################################################################################################################################################################################# | 100%
pip-20.2.4           | 2.0 MB    | ############################################################################################################################################################################################################# | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate unifunc_env
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) jespinoz@jespinozlt2-osx UniFunc % conda activate unifunc_env

Python version

(unifunc_env) jespinoz@jespinozlt2-osx UniFunc % which python
/Users/jespinoz/anaconda3/envs/unifunc_env/bin/python
(unifunc_env) jespinoz@jespinozlt2-osx UniFunc % python --version
Python 3.8.5

Tried the setup script

(unifunc_env) jespinoz@jespinozlt2-osx UniFunc % python setup.py install
  File "setup.py", line 21
    install_requires=['nltk','numpy','python>=3.6','requests']
    ^
SyntaxError: invalid syntax

Tried the suggested command:

(unifunc_env) jespinoz@jespinozlt2-osx UniFunc % python UniFunc -h
python: can't open file 'UniFunc': [Errno 2] No such file or directory
(unifunc_env) jespinoz@jespinozlt2-osx UniFunc % python __main__.py -h
Traceback (most recent call last):
  File "__main__.py", line 6, in <module>
    from UniFunc import run_unifunc,run_example
ModuleNotFoundError: No module named 'UniFunc'
