iomega / spec2vec_gnps_data_analysis
Analysis and benchmarking of mass spectra similarity measures using gnps data set.
License: Apache License 2.0
Hi,
I am trying to run iomega-10-spectra-networking.ipynb, but at cell 14 I get an error that MS_library is not defined, and the corresponding attributes (MS_library.list_similars_ctr_idx, MS_library.list_similars_ctr) also seem to be undefined. In addition, M_sim_mol is used in cell 16 before it is defined. Please see the attached image. The cell numbers in the image may differ; the ones above refer to the following link: https://github.com/iomega/spec2vec_gnps_data_analysis/blob/master/notebooks/iomega-10-spectra-networking.ipynb
Thanks.
Dear Florian,
When using a pre-trained spec2vec model for the visualization part (Figure 2, "In depth comparison ..."), I am following the steps below, as agreed. I am using two mgf files: one from GNPS as a reference and one query file from my own data:
from matchms.Scores import Scores
import spec2vec
import os
import sys
import time
from matchms.filtering import add_losses
from matchms.filtering import add_parent_mass
from matchms.filtering import default_filters
from matchms.filtering import normalize_intensities
from matchms.filtering import reduce_to_number_of_peaks
from matchms.filtering import require_minimum_number_of_peaks
from matchms.filtering import select_by_mz
from matchms.importing import load_from_mgf #mgf to mzML
from matchms.importing import load_adducts
from spec2vec import SpectrumDocument
from spec2vec.model_building import train_new_word2vec_model
def apply_my_filters(s):
    s = normalize_intensities(s)
    s = default_filters(s)
    s = add_parent_mass(s)
    s = reduce_to_number_of_peaks(s, n_required=10, ratio_desired=None)
    s = select_by_mz(s, mz_from=0, mz_to=1000)
    s = add_losses(s, loss_mz_from=10.0, loss_mz_to=200.0)
    s = require_minimum_number_of_peaks(s, n_required=10)
    return s
spectrums = [apply_my_filters(s) for s in load_from_mgf("referenceXXX.mgf")]
spectrums = [s for s in spectrums if s is not None]
reference_documents = [SpectrumDocument(s, n_decimals=2) for s in spectrums]
model_file = "spec2vec.model"
import gensim
from matchms import calculate_scores
from spec2vec import Spec2Vec
import numpy as np
query_spectrums = [apply_my_filters(s) for s in load_from_mgf("queryXXX.mgf")]
query_spectrums = [s for s in query_spectrums if s is not None]
query_documents = [SpectrumDocument(s, n_decimals=2) for s in query_spectrums]
model_file = "spec2vec.model"
model = gensim.models.Word2Vec.load(model_file)
spec2vec = Spec2Vec(model=model, intensity_weighting_power=0.5,
                    allowed_missing_percentage=5.0)
scores = list(calculate_scores(reference_documents, query_documents, spec2vec))
filtered = [(reference, query, score) for (reference, query, score) in scores if reference != query]
sorted_by_score = sorted(filtered, key=lambda elem: elem[2], reverse=True)
similarity_matrix= spec2vec.matrix(reference_documents, query_documents, is_symmetric=True)
filename = 'similarities_spec2vec_germicidins.npy'
np.save(filename, similarity_matrix)
# Unfortunately, I am getting a matrix of dimension (1, 2995), which cannot be combined with the other similarity matrices of dimension (12797, 12797) when following your directions from iomega-in-depths-spectrum-comparions.ipynb:
from plotting_functions import plot_spectra_comparison
filename = 'similarities_daylight2048_jaccard.npy'
matrix_similarities_fingerprint_daylight = np.load(filename)
filename = 'similarities_cosine_tol0005_200708.npy'
matrix_similarities_cosine = np.load(filename)
filename = 'similarities_cosine_tol0005_200708_matches.npy'
matrix_matches_cosine = np.load(filename)
print("Matrix dimension", matrix_matches_cosine.shape)
matrix_similarities_cosine[matrix_matches_cosine < 6] = 0
filename = 'similarities_mod_cosine_tol0005_200727.npy'
matrix_similarities_mod_cosine = np.load(filename)
filename = 'similarities_mod_cosine_tol0005_200727_matches.npy'
matrix_matches_mod_cosine = np.load(filename)
matrix_similarities_mod_cosine[matrix_matches_mod_cosine < 10] = 0
print("Load spec2vec similarities")
filename = 'similarities_spec2vec_germicidins.npy'
matrix_similarities_spec2vec = np.load(filename)
print("Matrix dimension", matrix_similarities_spec2vec.shape)
pair_selection = np.where((matrix_similarities_cosine < 0.4)
                          & (matrix_similarities_mod_cosine < 0.4)
                          & (matrix_similarities_mod_cosine > 0)
                          & (matrix_similarities_spec2vec > 0.8)
                          & (matrix_similarities_spec2vec < 0.98)
                          & (matrix_similarities_fingerprint_daylight > 0.8))
print("Found ", pair_selection[0].shape, " matching spectral pairs.")
possible_grid_points = np.arange(0, 2000, 50)
grid_points = possible_grid_points[(possible_grid_points > 370) & (possible_grid_points < 980)]
grid_points
ID1 = 1276 #pair_selection[0][pick]
ID2 = 1277 #pair_selection[1][pick]
print(ID1, ID2)
print(spectrums_postprocessed[ID1].get("spectrumid"), spectrums_postprocessed[ID2].get("spectrumid"))
print("Spec2Vec score: {:.4}".format(matrix_similarities_spec2vec[ID1, ID2]))
print("Cosine score: {:.4}".format(matrix_similarities_cosine[ID1, ID2]))
print("Modified cosine score: {:.4}".format(matrix_similarities_mod_cosine[ID1, ID2]))
print("Molecular similarity: {:.4}".format(matrix_similarities_fingerprint_daylight[ID1, ID2]))
csim = plot_spectra_comparison(spectrums_postprocessed[ID1], spectrums_postprocessed[ID2],
                               model,
                               intensity_weighting_power=0.5,
                               num_decimals=2,
                               min_mz=300,
                               max_mz=1000,
                               intensity_threshold=0.05,
                               method="cosine",#"modcos", #
                               tolerance=0.005,
                               wordsim_cutoff=0.05,
                               circle_size=5,
                               circle_scaling='wordsim',
                               padding=30,
                               display_molecules=True,
                               figsize=(12, 12),
                               filename="example_1276_1277_new.pdf")#None)#
ValueError: operands could not be broadcast together with shapes (12797,12797) (1,2995)
Am I doing any unnecessary steps here? Any thoughts? I will try different files, but should I perhaps also try a different pretrained model?
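The ValueError above is NumPy refusing to combine element-wise masks built from matrices of different shapes. A minimal sketch of a shape check that could run before the np.where selection is shown here. The helper name check_same_shape and the toy matrix sizes are assumptions for illustration only; they are not part of the notebook or of spec2vec.

```python
import numpy as np


def check_same_shape(matrices):
    """Return the common shape of all named matrices.

    Raises a ValueError that lists every shape if they differ.
    (Hypothetical helper, not part of the notebook.)
    """
    shapes = {name: m.shape for name, m in matrices.items()}
    if len(set(shapes.values())) != 1:
        raise ValueError(f"Similarity matrices differ in shape: {shapes}")
    return next(iter(shapes.values()))


# Toy stand-ins (made-up sizes) for the real similarity matrices.
rng = np.random.default_rng(0)
cosine = rng.random((4, 4))
spec2vec_all_vs_all = rng.random((4, 4))    # same spectra on both axes
spec2vec_ref_vs_query = rng.random((1, 3))  # 1 reference vs. 3 queries

# Matching shapes: the element-wise selection works.
common_shape = check_same_shape({"cosine": cosine,
                                 "spec2vec": spec2vec_all_vs_all})
pairs = np.where((cosine < 0.4) & (spec2vec_all_vs_all > 0.8))

# Mismatched shapes: fail early with a readable message instead of a
# broadcasting error deep inside the selection expression.
message = None
try:
    check_same_shape({"cosine": cosine, "spec2vec": spec2vec_ref_vs_query})
except ValueError as err:
    message = str(err)
    print(message)
```

In this sketch the selection only makes sense when every matrix was computed on the same set of spectra in the same order, so that index [ID1, ID2] refers to the same pair in each matrix. A reference-versus-query matrix has one row per reference and one column per query, so it cannot be mixed with all-versus-all matrices saved from a different dataset.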
Hei,
while trying to load the 'AllPositive' model using:
import gensim
model_fn = "data/spec2vec_models/spec2vec_AllPositive_ratio05_filtered_iter_15.model"
model = gensim.models.Word2Vec.load(model_fn)
I get the following import error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-20-8fd5ace11800> in <module>
3 # model_fn = "data/spec2vec_models/spec2vec_UniqueInchikeys_ratio05_filtered_iter_50.model"
4 model_fn = "data/spec2vec_models/spec2vec_AllPositive_ratio05_filtered_iter_15.model" # Cannot be loaded
----> 5 model = gensim.models.Word2Vec.load(model_fn)
/path/to/venv/lib/python3.8/site-packages/gensim/models/word2vec.py in load(cls, *args, **kwargs)
1139 """
1140 try:
-> 1141 model = super(Word2Vec, cls).load(*args, **kwargs)
1142
1143 # for backward compatibility for `max_final_vocab` feature
/path/to/venv/lib/python3.8/site-packages/gensim/models/base_any2vec.py in load(cls, *args, **kwargs)
1228
1229 """
-> 1230 model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
1231 if not hasattr(model, 'ns_exponent'):
1232 model.ns_exponent = 0.75
/path/to/venv/lib/python3.8/site-packages/gensim/models/base_any2vec.py in load(cls, fname_or_handle, **kwargs)
600
601 """
--> 602 return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
603
604 def save(self, fname_or_handle, **kwargs):
/path/to/venv/lib/python3.8/site-packages/gensim/utils.py in load(cls, fname, mmap)
433 compress, subname = SaveLoad._adapt_by_suffix(fname)
434
--> 435 obj = unpickle(fname)
436 obj._load_specials(fname, mmap, compress, subname)
437 logger.info("loaded %s", fname)
/path/to/venv/lib/python3.8/site-packages/gensim/utils.py in unpickle(fname)
1396 # Because of loading from S3 load can't be used (missing readline in smart_open)
1397 if sys.version_info > (3, 0):
-> 1398 return _pickle.load(f, encoding='latin1')
1399 else:
1400 return _pickle.loads(f.read())
ModuleNotFoundError: No module named 'custom_functions'
When I add the "custom_functions" directory to the Python path:
import gensim
import sys
sys.path.append("/path/to/spec2vec_gnps_data_analysis")
model_fn = "data/spec2vec_models/spec2vec_AllPositive_ratio05_filtered_iter_15.model"
model = gensim.models.Word2Vec.load(model_fn)
The import error gets "more specific":
...
/path/to/venv/lib/python3.8/site-packages/gensim/utils.py in unpickle(fname)
1396 # Because of loading from S3 load can't be used (missing readline in smart_open)
1397 if sys.version_info > (3, 0):
-> 1398 return _pickle.load(f, encoding='latin1')
1399 else:
1400 return _pickle.loads(f.read())
ModuleNotFoundError: No module named 'custom_functions.utils_spec2vec'
I could not find a file or module called "utils_spec2vec" anywhere. Nevertheless, loading the "UniqueInchikey" model works just fine. Could it be that something went wrong while pickling the larger model?
Best regards,
Eric
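For context, pickle stores objects by reference to the module path where their class was defined, so unpickling fails when that module cannot be imported. The following is a self-contained sketch of that behaviour and of one possible workaround (registering a stub module under the missing name). All names here (legacy_pkg_demo, legacy_utils, EpochLogger) are hypothetical; which names the AllPositive model file actually references is not known from this report.

```python
import pickle
import sys
import types

PKG = "legacy_pkg_demo"              # hypothetical package name
MOD = "legacy_pkg_demo.legacy_utils"  # hypothetical submodule name


class EpochLogger:
    """Placeholder for an object that was saved together with a model."""

    def __init__(self, epochs):
        self.epochs = epochs


def register_stub():
    """Make MOD importable by placing stub modules in sys.modules."""
    package = types.ModuleType(PKG)
    module = types.ModuleType(MOD)
    module.EpochLogger = EpochLogger
    package.legacy_utils = module
    sys.modules[PKG] = package
    sys.modules[MOD] = module


# 1. Pretend the class lived in MOD when the object was pickled.
EpochLogger.__module__ = MOD
register_stub()
payload = pickle.dumps(EpochLogger(epochs=15))

# 2. Simulate a fresh environment where that module does not exist.
del sys.modules[MOD]
del sys.modules[PKG]
error_message = None
try:
    pickle.loads(payload)
except ModuleNotFoundError as err:
    error_message = str(err)

# 3. Workaround sketch: provide the missing module path again, then load.
register_stub()
restored = pickle.loads(payload)
print(error_message, restored.epochs)
```

Whether such a stub is enough for a real model file depends on every class or function the pickle refers to; each of them would have to exist under the expected module path. Re-saving the model from an environment without the custom module would avoid the problem at the source.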