iomega / spec2vec_gnps_data_analysis


Analysis and benchmarking of mass spectra similarity measures using the GNPS data set.

License: Apache License 2.0

Jupyter Notebook 99.80% Python 0.20%


spec2vec_gnps_data_analysis's Issues

add tests

Undefined MS_library.list_similars_ctr_idx, MS_library.list_similars_ctr and M_sim_mol in iomega-10-spectra-networking.ipynb

Hi,

I am trying to run iomega-10-spectra-networking.ipynb, but at cell 14 I get an error that MS_library is not defined, and the corresponding attributes (MS_library.list_similars_ctr_idx, MS_library.list_similars_ctr) also appear to be undefined. In addition, M_sim_mol is used in cell 16 before it is defined. Please see the attached image (the cell numbers in the image may differ; the ones above refer to the notebook at the following link): https://github.com/iomega/spec2vec_gnps_data_analysis/blob/master/notebooks/iomega-10-spectra-networking.ipynb

Thanks.

add tests for custom functions

add license

Fix mod.cosine vs cosine swap in library matching function

Add CI

similarities_visualisation

Dear Florian,

I am using a pre-trained spec2vec model and, for the visualization part (Figure 2, "In depth comparison ..."), I am following the steps below as agreed. I am using two MGF files: one from GNPS as the reference and one query file from my own data:

from matchms.Scores import Scores
import spec2vec
import os
import sys
import time
from matchms.filtering import add_losses
from matchms.filtering import add_parent_mass
from matchms.filtering import default_filters
from matchms.filtering import normalize_intensities
from matchms.filtering import reduce_to_number_of_peaks
from matchms.filtering import require_minimum_number_of_peaks
from matchms.filtering import select_by_mz
from matchms.importing import load_from_mgf #mgf to mzML
from matchms.importing import load_adducts
from spec2vec import SpectrumDocument
from spec2vec.model_building import train_new_word2vec_model

def apply_my_filters(s):
    s = normalize_intensities(s)
    s = default_filters(s)
    s = add_parent_mass(s)
    s = reduce_to_number_of_peaks(s, n_required=10, ratio_desired=None)
    s = select_by_mz(s, mz_from=0, mz_to=1000)
    s = add_losses(s, loss_mz_from=10.0, loss_mz_to=200.0)
    s = require_minimum_number_of_peaks(s, n_required=10)
    return s

spectrums = [apply_my_filters(s) for s in load_from_mgf("referenceXXX.mgf")] 
spectrums = [s for s in spectrums if s is not None]
reference_documents = [SpectrumDocument(s, n_decimals=2) for s in spectrums]
model_file = "spec2vec.model"

import gensim   
from matchms import calculate_scores
from spec2vec import Spec2Vec
import numpy as np  

query_spectrums = [apply_my_filters(s) for s in load_from_mgf("queryXXX.mgf")]
query_spectrums = [s for s in query_spectrums if s is not None]
query_documents = [SpectrumDocument(s, n_decimals=2) for s in query_spectrums]
model_file = "spec2vec.model"
model = gensim.models.Word2Vec.load(model_file)

spec2vec = Spec2Vec(model=model, intensity_weighting_power=0.5,
                    allowed_missing_percentage=5.0)
scores = list(calculate_scores(reference_documents, query_documents, spec2vec))
filtered = [(reference, query, score) for (reference, query, score) in scores if reference != query]

sorted_by_score = sorted(filtered, key=lambda elem: elem[2], reverse=True)
similarity_matrix = spec2vec.matrix(reference_documents, query_documents, is_symmetric=True)
filename = 'similarities_spec2vec_germicidins.npy'
np.save(filename, similarity_matrix)

However, the resulting matrix has dimension (1, 2995), which unfortunately cannot be combined with the (12797, 12797) matrices used for the spectra comparison when I follow your directions from iomega-in-depths-spectrum-comparions.ipynb:

from plotting_functions import plot_spectra_comparison

filename = 'similarities_daylight2048_jaccard.npy'
matrix_similarities_fingerprint_daylight = np.load(filename)
filename = 'similarities_cosine_tol0005_200708.npy'
matrix_similarities_cosine = np.load(filename)

filename = 'similarities_cosine_tol0005_200708_matches.npy'
matrix_matches_cosine = np.load(filename)

print("Matrix dimension", matrix_matches_cosine.shape)

matrix_similarities_cosine[matrix_matches_cosine < 6] = 0
filename = 'similarities_mod_cosine_tol0005_200727.npy'
matrix_similarities_mod_cosine = np.load(filename)

filename = 'similarities_mod_cosine_tol0005_200727_matches.npy'
matrix_matches_mod_cosine = np.load(filename)
matrix_similarities_mod_cosine[matrix_matches_mod_cosine < 10] = 0

print("Load spec2vec similarities")

filename = 'similarities_spec2vec_germicidins.npy'
matrix_similarities_spec2vec = np.load(filename)
print("Matrix dimension", matrix_similarities_spec2vec.shape)

pair_selection = np.where((matrix_similarities_cosine < 0.4)
                          & (matrix_similarities_mod_cosine < 0.4)
                          & (matrix_similarities_mod_cosine > 0)
                          & (matrix_similarities_spec2vec > 0.8)
                          & (matrix_similarities_spec2vec < 0.98)
                          & (matrix_similarities_fingerprint_daylight > 0.8))

print("Found ", pair_selection[0].shape, " matching spectral pairs.")

possible_grid_points = np.arange(0, 2000, 50)
grid_points = possible_grid_points[(possible_grid_points > 370) & (possible_grid_points < 980)]
grid_points

ID1 = 1276 #pair_selection[0][pick]
ID2 = 1277 #pair_selection[1][pick]
print(ID1, ID2)
print(spectrums_postprocessed[ID1].get("spectrumid"), spectrums_postprocessed[ID2].get("spectrumid"))
print("Spec2Vec score: {:.4}".format(matrix_similarities_spec2vec[ID1, ID2]))
print("Cosine score: {:.4}".format(matrix_similarities_cosine[ID1, ID2]))
print("Modified cosine score: {:.4}".format(matrix_similarities_mod_cosine[ID1, ID2]))
print("Molecular similarity: {:.4}".format(matrix_similarities_fingerprint_daylight[ID1, ID2]))

csim = plot_spectra_comparison(spectrums_postprocessed[ID1], spectrums_postprocessed[ID2],
                                model,
                                intensity_weighting_power=0.5,
                                num_decimals=2,
                                min_mz=300,
                                max_mz=1000,
                                intensity_threshold=0.05,
                                method="cosine",  # or "modcos"
                                tolerance=0.005,
                                wordsim_cutoff=0.05,
                                circle_size=5,
                                circle_scaling='wordsim',
                                padding=30,
                                display_molecules=True,
                                figsize=(12, 12),
                                filename="example_1276_1277_new.pdf")

The error is raised at the line starting with pair_selection = np.where((matrix_similarities_cosine < 0.4):

ValueError: operands could not be broadcast together with shapes (12797,12797) (1,2995)

Am I doing any unnecessary steps here, or do you have any other thoughts? I will try using different files, but should I perhaps also try a different pre-trained model?
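For readers hitting the same ValueError: the boolean masks inside np.where are combined element by element, so every matrix has to describe the same set of spectrum pairs. A (1, 2995) matrix computed from one pair of files and (12797, 12797) matrices loaded from precomputed .npy files cannot be indexed with the same ID1, ID2. The following is only a minimal sketch of a shape check that could be run before the selection. The helper name find_shape_mismatches is hypothetical and not part of matchms, spec2vec, or this repository, and it deliberately requires exactly equal shapes, which is stricter than NumPy's broadcasting rules.

```python
def find_shape_mismatches(shapes):
    """Return {name: shape} for every entry whose shape differs from the first one.

    `shapes` maps a label to a shape tuple, e.g. the `.shape` of a NumPy array.
    This helper is a hypothetical illustration, not an existing library function.
    """
    names = list(shapes)
    if not names:
        return {}
    expected = tuple(shapes[names[0]])
    return {
        name: tuple(shapes[name])
        for name in names[1:]
        if tuple(shapes[name]) != expected
    }


# Shapes copied from the error message above. In the notebook they would come
# from e.g. matrix_similarities_cosine.shape and matrix_similarities_spec2vec.shape.
shapes = {
    "cosine": (12797, 12797),
    "spec2vec": (1, 2995),
}

mismatches = find_shape_mismatches(shapes)
if mismatches:
    print("Cannot combine these matrices elementwise:", mismatches)
```

If the check reports a mismatch, the matrices were most likely computed over different spectrum collections, and all of them would need to be recomputed for the same reference and query spectra before a joint selection makes sense.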

AllPositive model cannot be loaded

Hi,

while trying to load the 'AllPositive' model using:

import gensim

model_fn = "data/spec2vec_models/spec2vec_AllPositive_ratio05_filtered_iter_15.model"  
model = gensim.models.Word2Vec.load(model_fn)

I get the following import error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-20-8fd5ace11800> in <module>
      3 # model_fn = "data/spec2vec_models/spec2vec_UniqueInchikeys_ratio05_filtered_iter_50.model"
      4 model_fn = "data/spec2vec_models/spec2vec_AllPositive_ratio05_filtered_iter_15.model"  # Cannot be loaded
----> 5 model = gensim.models.Word2Vec.load(model_fn)

/path/to/venv/lib/python3.8/site-packages/gensim/models/word2vec.py in load(cls, *args, **kwargs)
   1139         """
   1140         try:
-> 1141             model = super(Word2Vec, cls).load(*args, **kwargs)
   1142 
   1143             # for backward compatibility for `max_final_vocab` feature

/path/to/venv/lib/python3.8/site-packages/gensim/models/base_any2vec.py in load(cls, *args, **kwargs)
   1228 
   1229         """
-> 1230         model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
   1231         if not hasattr(model, 'ns_exponent'):
   1232             model.ns_exponent = 0.75

/path/to/venv/lib/python3.8/site-packages/gensim/models/base_any2vec.py in load(cls, fname_or_handle, **kwargs)
    600 
    601         """
--> 602         return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
    603 
    604     def save(self, fname_or_handle, **kwargs):

/path/to/venv/lib/python3.8/site-packages/gensim/utils.py in load(cls, fname, mmap)
    433         compress, subname = SaveLoad._adapt_by_suffix(fname)
    434 
--> 435         obj = unpickle(fname)
    436         obj._load_specials(fname, mmap, compress, subname)
    437         logger.info("loaded %s", fname)

/path/to/venv/lib/python3.8/site-packages/gensim/utils.py in unpickle(fname)
   1396         # Because of loading from S3 load can't be used (missing readline in smart_open)
   1397         if sys.version_info > (3, 0):
-> 1398             return _pickle.load(f, encoding='latin1')
   1399         else:
   1400             return _pickle.loads(f.read())

ModuleNotFoundError: No module named 'custom_functions'

When I add the "custom_functions" directory to the Python path:

import gensim

import sys
sys.path.append("/path/to/spec2vec_gnps_data_analysis")

model_fn = "data/spec2vec_models/spec2vec_AllPositive_ratio05_filtered_iter_15.model" 
model = gensim.models.Word2Vec.load(model_fn)

The import error gets "more specific":

...
/path/to/venv/lib/python3.8/site-packages/gensim/utils.py in unpickle(fname)
   1396         # Because of loading from S3 load can't be used (missing readline in smart_open)
   1397         if sys.version_info > (3, 0):
-> 1398             return _pickle.load(f, encoding='latin1')
   1399         else:
   1400             return _pickle.loads(f.read())

ModuleNotFoundError: No module named 'custom_functions.utils_spec2vec'

I could not find a file or module called "utils_spec2vec" anywhere. Nevertheless, loading the "UniqueInchikey" model works just fine. Could something have gone wrong while pickling the larger model?

Best regards,

Eric
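As background on the traceback above: pickle stores classes and functions by their import path rather than by value, so unpickling fails with ModuleNotFoundError when that path cannot be imported in the loading environment. Whether this is what happened to the AllPositive model is an assumption, but it would match the error message. The sketch below reproduces the general behaviour with the standard library only. The module name legacy_helpers and the class Tokenizer are hypothetical stand-ins and are not taken from this repository.

```python
import pickle
import sys
import types


class Tokenizer:
    """Hypothetical class that lived in a helper module when the object was saved."""

    def __init__(self, n_decimals):
        self.n_decimals = n_decimals


# Pretend the class was defined in a module called "legacy_helpers" (made-up name).
Tokenizer.__module__ = "legacy_helpers"
helper_module = types.ModuleType("legacy_helpers")
helper_module.Tokenizer = Tokenizer
sys.modules["legacy_helpers"] = helper_module

# The pickle only records "legacy_helpers.Tokenizer" plus the instance state.
payload = pickle.dumps(Tokenizer(n_decimals=2))

# Simulate an environment in which that module no longer exists.
del sys.modules["legacy_helpers"]
try:
    pickle.loads(payload)
    load_error = None
except ModuleNotFoundError as err:
    load_error = str(err)
print("Load failed with:", load_error)

# Possible workaround: make a module with the recorded name importable again.
stub_module = types.ModuleType("legacy_helpers")
stub_module.Tokenizer = Tokenizer
sys.modules["legacy_helpers"] = stub_module
restored = pickle.loads(payload)
print("Restored n_decimals:", restored.n_decimals)
```

For a real model file, such a workaround only helps if the stub provides every name the pickle refers to, and which names those are for the AllPositive model is not known from this report. Re-saving the model from an environment where the original module is still available would avoid the dependency altogether.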
