bojarlab / glycowork

Package for processing and analyzing glycans and their role in biology.

Home Page: https://Bojarlab.github.io/glycowork

License: MIT License

Python 6.38% Jupyter Notebook 75.04% HTML 18.14% CSS 0.01% JavaScript 0.43%

machine-learning glycobiology glycans bioinformatics python computational-biology data-science molecular-biology open-source glycomics

glycowork's Introduction

glycowork

Glycans are fundamental biological sequences that are as crucial as DNA, RNA, and proteins. As complex carbohydrates forming branched structures, glycans are ubiquitous yet often overlooked in biological research.

Why Glycans are Important

Ubiquitous in biology
Integral to protein and lipid function
Relevant to human diseases

Challenges in Glycan Analysis

Analyzing glycans is complicated due to their non-linear structures and enormous diversity. But that’s where glycowork comes in.

Introducing glycowork: Your Solution for Glycan-Focused Data Science

Glycowork is a Python package specifically designed to simplify glycan sequence processing and analysis. It offers:

Functions for glycan analysis
Datasets for model training
Full support for IUPAC-condensed string representation. Broad support for IUPAC-extended, LinearCode, Oxford, GlycoCT, and WURCS.
Powerful graph-based architecture for in-depth analysis

Documentation: https://bojarlab.github.io/glycowork/

Contribute: Interested in contributing? Read our Contribution Guidelines

Citation: If glycowork adds value to your project, please cite Thomes et al., 2021

Install

Not familiar with Python? Try our no-code, graphical user interface (glycoworkGUI.exe, can be downloaded at the bottom of the latest Release page) for accessing some of the most useful glycowork functions!

via pip:
pip install glycowork
import glycowork

alternative:
pip install git+https://github.com/BojarLab/glycowork.git
import glycowork

Note that we have optional extra installs for specialized use (further instructions can be found in the Examples tab), such as:
deep learning:
pip install glycowork[ml]
drawing glycan images with GlycoDraw (see install instructions in the Examples tab):
pip install glycowork[draw]
analyzing atomic/chemical properties of glycans:
pip install glycowork[chem]
everything:
pip install glycowork[all]

Data & Models

Glycowork currently contains the following main datasets that are freely available to everyone:

df_glycan
- contains ~50,000 unique glycan sequences, including labels such as ~37,500 species associations, ~16,500 tissue associations, and ~1,000 disease associations
glycan_binding
- contains >550,000 protein-glycan binding interactions, from 1,392 unique glycan-binding proteins

Additionally, we store these trained deep learning models for easy usage, which can be retrieved with the prep_model function:

LectinOracle
- can be used to predict glycan-binding specificity of a protein, given its ESM-1b representation; from Lundstrom et al., 2021
LectinOracle_flex
- operates the same as LectinOracle but can directly use the raw protein sequence as input (no ESM-1b representation required)
SweetNet
- a graph convolutional neural network trained to predict species from glycan, can be used to generate learned glycan representations; from Burkholz et al., 2021
NSequonPred
- given the ESM-1b representation of an N-sequon (+/- 20 AA), this model can predict whether the sequon will be glycosylated

How to use

Glycowork currently contains four main modules:

glycan_data
- stores several glycan datasets and contains helper functions
ml
- contains all the functions for training and using machine learning models, including train-test-split, getting glycan representations, etc.
motif
- contains functions for processing & drawing glycan sequences, identifying motifs and features, and analyzing them
network
- contains functions for constructing and analyzing glycan networks (e.g., biosynthetic networks)

Below are some examples of what you can do with glycowork; be sure to check out the other examples in the full documentation. A non-exhaustive list includes:

using trained AI models for prediction
training your own AI models
motif enrichment analyses
differential glycomics expression analysis
annotating motifs in glycans
drawing publication-quality glycan figures
finding out whether & where glycans are describing the same sequence
m/z to composition to structure to motif mappings
mass calculation
visualizing motif distribution / glycan similarities / sequence properties
constructing and analyzing biosynthetic networks

#drawing publication-quality glycan figures
from glycowork.motif.draw import GlycoDraw
GlycoDraw("Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-2)Man(a1-3)[Neu5Gc(a2-6)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)][GlcNAc(b1-4)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc", highlight_motif = "Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc")

#get motifs, graph features, and sequence features of a set of glycan sequences to train models or analyze glycan properties
glycans = ["Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-2)Man(a1-3)[Gal(b1-3)[Fuc(a1-4)]GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc",
           "Ma3(Ma6)Mb4GNb4GN;N",
           "α-D-Manp-(1→3)[α-D-Manp-(1→6)]-β-D-Manp-(1→4)-β-D-GlcpNAc-(1→4)-β-D-GlcpNAc-(1→",
           "F(3)XA2",
           "WURCS=2.0/5,11,10/[a2122h-1b_1-5_2*NCC/3=O][a1122h-1b_1-5][a1122h-1a_1-5][a2112h-1b_1-5][a1221m-1a_1-5]/1-1-2-3-1-4-3-1-4-5-5/a4-b1_a6-k1_b4-c1_c3-d1_c6-g1_d2-e1_e4-f1_g2-h1_h4-i1_i2-j1",
           """RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dglc-HEX-1:5
4s:n-acetyl
5b:b-dman-HEX-1:5
6b:a-dman-HEX-1:5
7b:b-dglc-HEX-1:5
8s:n-acetyl
9b:b-dgal-HEX-1:5
10s:sulfate
11s:n-acetyl
12b:a-dman-HEX-1:5
13b:b-dglc-HEX-1:5
14s:n-acetyl
15b:b-dgal-HEX-1:5
16s:n-acetyl
LIN
1:1d(2+1)2n
2:1o(4+1)3d
3:3d(2+1)4n
4:3o(4+1)5d
5:5o(3+1)6d
6:6o(2+1)7d
7:7d(2+1)8n
8:7o(4+1)9d
9:9o(-1+1)10n
10:9d(2+1)11n
11:5o(6+1)12d
12:12o(2+1)13d
13:13d(2+1)14n
14:13o(4+1)15d
15:15d(2+1)16n"""]
from glycowork.motif.annotate import annotate_dataset
out = annotate_dataset(glycans, feature_set = ['known', 'terminal', 'exhaustive'])

	Internal_LewisX	SialylLewisX	Terminal_LewisA	H_type2	Chitobiose	Trimannosylcore	Terminal_LacNAc_type1	Internal_LacNAc_type2	Terminal_LacNAc_type2	Terminal_LacdiNAc_type2	core_fucose	core_fucose(a1-3)	Nglycan_complex	Nglycan_complex2	M3FX	Fuc	Gal	GalNAc	GalNAcOS	GlcNAc	Man	Neu5Ac	Xyl	Fuc(a1-2)Gal	Fuc(a1-3)GlcNAc	Fuc(a1-4)GlcNAc	Fuc(a1-6)GlcNAc	Fuc(a1-?)GlcNAc	Gal(b1-3)GlcNAc	Gal(b1-4)GlcNAc	Gal(b1-?)GlcNAc	GalNAc(b1-4)GlcNAc	GalNAcOS(b1-4)GlcNAc	GlcNAc(b1-2)Man	GlcNAc(b1-4)GlcNAc	GlcNAc(b1-?)Man	Man(a1-3)Man	Man(a1-6)Man	Man(a1-?)Man	Man(b1-4)GlcNAc	Neu5Ac(a2-3)Gal	Xyl(b1-2)Man	GalNAc(b1-4)	Neu5Ac(a2-3)	Gal(b1-3)	GalNAcOS(b1-4)	Fuc(a1-2)	Man(a1-6)	Man(a1-3)	Fuc(a1-6)	Xyl(b1-2)	Gal(b1-4)	Fuc(a1-3)	Fuc(a1-4)	Man(a1-?)	Gal(b1-?)	Fuc(a1-?)	GlcNAc(b1-?)
Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-2)Man(a1-3)[Gal(b1-3)[Fuc(a1-4)]GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc	1	1	1	0	1	1	1	1	0	0	1	0	1	0	0	3	2	0	0	4	3	1	0	0	1	1	1	3	1	1	2	0	0	2	1	2	1	1	2	1	1	0	0	1	1	0	0	0	0	1	0	0	1	1	0	1	3	0
Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	2	3	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	1	1	2	1	0	0	0	0	0	0	0	1	1	0	0	0	0	0	2	0	0	0
Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	2	3	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	1	1	2	1	0	0	0	0	0	0	0	1	1	0	0	0	0	0	2	0	0	0
GlcNAc(b1-?)Man(a1-3)[GlcNAc(b1-?)Man(a1-6)][Xyl(b1-2)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-3)]GlcNAc	0	0	0	0	1	1	0	0	0	0	0	1	1	1	1	1	0	0	0	4	3	0	1	0	1	0	0	1	0	0	0	0	0	0	1	2	1	1	2	1	0	1	0	0	0	0	0	0	0	0	1	0	1	0	0	0	1	2
Fuc(a1-2)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc	0	0	0	1	1	1	0	1	1	0	1	0	1	0	0	2	2	0	0	4	3	0	0	1	0	0	1	1	0	2	2	0	0	2	1	2	1	1	2	1	0	0	0	0	0	0	1	0	0	1	0	1	0	0	0	1	2	0
GalNAcOS(b1-4)GlcNAc(b1-2)Man(a1-3)[GalNAc(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc	0	0	0	0	1	1	0	0	0	1	0	0	1	0	0	0	0	1	1	4	3	0	0	0	0	0	0	0	0	0	0	1	1	2	1	2	1	1	2	1	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0

#using graphs, you can easily check whether a glycan contains a specific motif; how about internal Lewis A/X motifs?
from glycowork.motif.graph import subgraph_isomorphism
print(subgraph_isomorphism('Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-6)[Gal(b1-3)]GalNAc',
                     'Fuc(a1-?)[Gal(b1-?)]GlcNAc', termini_list = ['terminal', 'internal', 'flexible']))
print(subgraph_isomorphism('Neu5Ac(a2-3)Gal(b1-3)[Fuc(a1-4)]GlcNAc(b1-6)[Gal(b1-3)]GalNAc',
                     'Fuc(a1-?)[Gal(b1-?)]GlcNAc', termini_list = ['t', 'i', 'f']))
print(subgraph_isomorphism('Gal(b1-3)[Fuc(a1-4)]GlcNAc(b1-6)[Gal(b1-3)]GalNAc',
                     'Fuc(a1-?)[Gal(b1-?)]GlcNAc', termini_list = ['t', 'i', 'f']))

#or you could find the terminal epitopes of a glycan
from glycowork.motif.annotate import get_terminal_structures
print("\nTerminal structures:")
print(get_terminal_structures('Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc'))

True
True
False

Terminal structures:
['Man(a1-3)', 'Man(a1-6)', 'Fuc(a1-6)']

#given a composition, find matching glycan structures in SugarBase; specific for glycan classes and taxonomy
from glycowork.motif.tokenization import compositions_to_structures
print(compositions_to_structures([{'Hex':3, 'HexNAc':4}], glycan_class = 'N'))

#or we could calculate the mass of this composition
from glycowork.motif.tokenization import composition_to_mass
print("\nMass of the composition Hex3HexNAc4")
print(composition_to_mass({'Hex':3, 'HexNAc':4}))
print(composition_to_mass("H3N4"))
print(composition_to_mass("Hex3HexNAc4"))

0 compositions could not be matched. Run with verbose = True to see which compositions.
                                               glycan  abundance
0   GlcNAc(b1-2)Man(a1-3)[GlcNAc(b1-2)Man(a1-6)]Ma...          0
1   GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)[Man(a1-6)]Ma...          0
2   GlcNAc(b1-2)Man(a1-6)[Man(a1-3)][GlcNAc(b1-4)]...          0
3   Man(a1-3)[GlcNAc(b1-4)][GlcNAc(b1-2)Man(a1-6)]...          0
4   GlcNAc(b1-2)Man(a1-3)[GlcNAc(b1-4)][Man(a1-6)]...          0
5   GlcNAc(?1-?)Man(a1-3)[GlcNAc(b1-?)Man(a1-6)]Ma...          0
6   GlcNAc(b1-2)[GlcNAc(b1-4)]Man(a1-3)[Man(a1-6)]...          0
7   GlcNAc(b1-2)Man(a1-3)[GlcNAc(b1-6)Man(a1-6)]Ma...          0
8   GlcNAc(b1-4)Man(a1-3)[GlcNAc(b1-6)Man(a1-6)]Ma...          0
9   GlcNAc(b1-2)Man(a1-3)[GlcNAc(b1-2)Man(a1-6)][G...          0
10  GlcNAc(b1-2)Man(a1-3)[GlcNAc(b1-2)[GlcNAc(b1-4...          0
11  GlcNAc(b1-2)[GlcNAc(b1-4)]Man(a1-3)[GlcNAc(b1-...          0
12  GlcNAc(b1-4)Man(a1-3)[GlcNAc(b1-2)Man(a1-6)]Ma...          0
13  Man(a1-3)[GlcNAc(b1-2)[GlcNAc(b1-6)]Man(a1-6)]...          0
14  GalNAc(b1-4)GlcNAc(b1-2)Man(a1-6)[Man(a1-3)]Ma...          0

Mass of the composition Hex3HexNAc4
1316.4865545999999
1316.4865545999999
1316.4865545999999

glycowork's Issues

Special character '-' in index of df input to get_heatmap() throws unintuitive error

Description

When get_heatmap() receives a df whose index contains '-', it raises an error whose meaning is hard to work out. Other special characters might trigger the same error; I haven't tested them. This was tested on the main branch. A more descriptive error message would help users.

Code:

# Imports
from glycowork.motif.analysis import get_heatmap
import pandas as pd

# Example with dataframe with alphanumerical index without special character
df_no_special = pd.DataFrame({
    'Gal(b1-3)GalNAc': [0.242959, 0.208267, 0.223529, 0.245893, 0.297072],
    'GalOS(b1-3)GalNAc': [0.007172, 0.012267, 0.017346, 0.004030, 0.006302],
    'Gal(b1-3)[Fuc(a1-?)]GalNAc': [0.140820, 0.183320, 0.182716, 0.202931, 0.160700],
    'Fuc(a1-2)Gal(b1-3)GalNAc': [1.811925, 1.782675, 1.249882, 1.189128, 1.221432],
    'Fuc(a1-?)[HexNAc(?1-?)]GalNAc': [0.098392, 0.079560, 0.063288, 0.047380, 0.051726]
}, index=['A1', 'B1', 'C1', 'D1', 'E1'])

print("Heatmap with alphanumeric index:")
get_heatmap(df_no_special, motifs=True)

# Example with dataframe with index that contains '-'
df_with_special = pd.DataFrame({
    'Gal(b1-3)GalNAc': [0.242959, 0.208267, 0.223529, 0.245893, 0.297072],
    'GalOS(b1-3)GalNAc': [0.007172, 0.012267, 0.017346, 0.004030, 0.006302],
    'Gal(b1-3)[Fuc(a1-?)]GalNAc': [0.140820, 0.183320, 0.182716, 0.202931, 0.160700],
    'Fuc(a1-2)Gal(b1-3)GalNAc': [1.811925, 1.782675, 1.249882, 1.189128, 1.221432],
    'Fuc(a1-?)[HexNAc(?1-?)]GalNAc': [0.098392, 0.079560, 0.063288, 0.047380, 0.051726]
}, index=['A-1', 'B-1', 'C-1', 'D-1', 'E-1'])

print("Heatmap with special character index:")
get_heatmap(df_with_special, motifs=True)

Error:

ValueError Traceback (most recent call last)
Cell In[87], line 29
27 # Example with character index
28 print("Heatmap with Character Index:")
---> 29 get_heatmap(df_with_special, motifs=True)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\analysis.py:214, in get_heatmap(df, motifs, feature_set, transform, datatype, rarity_filter, filepath, index_col, custom_motifs, return_plot, **kwargs)
212 df = df.dropna(axis = 1)
213 if motifs:
--> 214 df = clean_up_heatmap(df.T)
215 if not (df < 0).any().any():
216 df /= df.sum()

Cell In[61], line 36, in debug_clean_up_heatmap(df)
34 max_idx_series = grouped.apply(lambda group: group.index.to_series().str.len().idxmax())
35 print("max_idx_series:", max_idx_series)
---> 36 result = df.loc[max_idx_series].drop_duplicates()
37 print("Result DataFrame after cleanup:")
38 print(result)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexing.py:1418, in _LocIndexer._getitem_axis(self, key, axis)
1416 if not (isinstance(key, tuple) and isinstance(labels, MultiIndex)):
1417 if hasattr(key, "ndim") and key.ndim > 1:
-> 1418 raise ValueError("Cannot index with multidimensional key")
1420 return self._getitem_iterable(key, axis=axis)
1422 # nested tuple slicing

ValueError: Cannot index with multidimensional key
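
Until the error message is improved, one possible workaround is to check the sample labels before calling get_heatmap(). The sketch below is plain Python and assumes that the hyphen really is the trigger, as reported above (other characters are untested). The helper names find_suspicious_labels and sanitize_labels are hypothetical and not part of glycowork.

```python
import re

# Assumption: only letters, digits, and underscores are treated as safe here;
# the report above tested '-' only, so this whitelist is deliberately strict.
SAFE_LABEL = re.compile(r"^\w+$")

def find_suspicious_labels(labels):
    """Return the labels that contain anything besides letters, digits, or '_'."""
    return [label for label in labels if not SAFE_LABEL.match(str(label))]

def sanitize_labels(labels):
    """Replace every run of other characters with a single underscore."""
    return [re.sub(r"\W+", "_", str(label)) for label in labels]

labels = ['A-1', 'B-1', 'C1']
print(find_suspicious_labels(labels))  # labels worth renaming
print(sanitize_labels(labels))         # candidate replacements
```

With a DataFrame, the cleaned labels could be assigned back with df.index = sanitize_labels(df.index). Renaming can create duplicate labels, so it is worth checking the result for uniqueness.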

IndexError in `get_differential_biosynthesis`.

Hi! I encountered an IndexError using get_differential_biosynthesis on my own dataset.

result = get_differential_biosynthesis(prepared_sub, group1_sub, group2_sub)

prepared_sub is a subset of my dataset containing only 6 samples for debugging.
I've formatted this DataFrame with IUPAC strings in the first column ("glycan") and one sample per following column.

Full traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/motif/processing.py:923, in rescue_glycans.<locals>.wrapper(*args, **kwargs)
    921 try:
    922   # Try running the original function
--> 923   return func(*args, **kwargs)
    924 except Exception as e:
    925   # If an error occurs, attempt to rescue the glycan sequences

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:744, in construct_network(glycans, allowed_ptms, edge_type, permitted_roots, abundances)
    743 if 'GlcNAc' in ''.join(permitted_roots) and any(g.count('Man') >= 5 for g in network.nodes()):
--> 744   add_high_man_removal(network)
    745 if abundances:

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:622, in add_high_man_removal(network)
    621 if target.count('Man') >= 5:
--> 622   diff_attr = network[source][target]['diff']
    623   edges_to_add.append((target, source, diff_attr))

KeyError: 'diff'

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 result = get_differential_biosynthesis(prepared_sub, group1_sub, group2_sub)

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:1371, in get_differential_biosynthesis(df, group1, group2, analysis, paired)
   1369 root = list(infer_roots(df.index.tolist()))
   1370 root = max(root, key = len) if '-ol' not in root[0] else min(root, key = len)
-> 1371 nets = {col: estimate_weights(construct_network(df.index.tolist(), abundances = df[col].values.tolist()), root = root) for col in all_groups}
   1372 res = {col: get_maximum_flow(nets[col], source = root) for col in all_groups}
   1373 if analysis == "reaction":

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:1371, in <dictcomp>(.0)
   1369 root = list(infer_roots(df.index.tolist()))
   1370 root = max(root, key = len) if '-ol' not in root[0] else min(root, key = len)
-> 1371 nets = {col: estimate_weights(construct_network(df.index.tolist(), abundances = df[col].values.tolist()), root = root) for col in all_groups}
   1372 res = {col: get_maximum_flow(nets[col], source = root) for col in all_groups}
   1373 if analysis == "reaction":

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/motif/processing.py:928, in rescue_glycans.<locals>.wrapper(*args, **kwargs)
    926 rescued_args = [canonicalize_iupac(arg) if isinstance(arg, str) else [canonicalize_iupac(a) for a in arg] if isinstance(arg, list) and isinstance(arg[0], str) else arg for arg in args]
    927 # After rescuing, attempt to run the function again
--> 928 return func(*rescued_args, **kwargs)

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:646, in construct_network(glycans, allowed_ptms, edge_type, permitted_roots, abundances)
    644 if permitted_roots is None:
    645   permitted_roots = infer_roots(glycans)
--> 646 abundance_mapping = {glycans[k]: abundances[k] for k in range(len(glycans))} if abundances else {}
    647 # Generating graph from adjacency of observed glycans
    648 min_size = min([k.count('(') for k in permitted_roots]) + 1

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:646, in <dictcomp>(.0)
    644 if permitted_roots is None:
    645   permitted_roots = infer_roots(glycans)
--> 646 abundance_mapping = {glycans[k]: abundances[k] for k in range(len(glycans))} if abundances else {}
    647 # Generating graph from adjacency of observed glycans
    648 min_size = min([k.count('(') for k in permitted_roots]) + 1

IndexError: list index out of range

More information for reference:

glycowork version: v1.2.0
Python version: 3.11
Device: MacBook Air M3
System: MacOS 14.5 (23F79)

I did a little debugging with PyCharm and found that abundances[k] with k being 62 (line 646 in biosynthesis.py) triggered the error. Note that prepared_sub had only 62 rows.
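
The last frame of the traceback indexes abundances by position over range(len(glycans)), so it raises IndexError as soon as glycans is longer than abundances. Why the two lists differ in length is not established here. The sketch below only illustrates how an explicit length check turns the failure into a readable message; map_abundances is a hypothetical helper, not glycowork's code.

```python
def map_abundances(glycans, abundances):
    """Pair each glycan with its abundance, failing early with a clear message."""
    if len(glycans) != len(abundances):
        raise ValueError(
            f"Got {len(glycans)} glycans but {len(abundances)} abundances; "
            "both lists must have the same length."
        )
    return dict(zip(glycans, abundances))

glycans = ["Gal(b1-3)GalNAc", "Fuc(a1-2)Gal(b1-3)GalNAc", "GalNAc"]
abundances = [0.7, 0.3]  # one value short, to reproduce the mismatch

try:
    map_abundances(glycans, abundances)
except ValueError as err:
    print(err)
```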

Some problems with the runtime of the code

Dear Glycowork

When I tried to run the example "Constructing and exploring biosynthetic networks", plot_network gave the following error message:

Would you please help me solve this problem?

kind regards,

Deprecated pandas-related code in get_differential_expression() + unintuitive error

Running the latest version of pandas (2.2.2) and the Dev branch on glycowork (commit 6f81f00), there are two errors when running the following code:

import pandas as pd
from glycowork.motif.analysis import get_differential_expression

data = {
    'Glycan': ['Gal(b1-3)GalNAc', 'GalOS(b1-3)GalNAc', 'Gal(b1-3)[Fuc(a1-?)]GalNAc'],
    'Sample1': [1.1, 0.2, 0.3],
    'Sample2': [1.2, 0.1, 0.2],
    'Sample3': [0.1, 1.8, 1.9],
    'Sample4': [0.2, 1.1, 1.2]
}
differential_glycomics_df = pd.DataFrame(data)

group1 = ['Sample1', 'Sample2']
group2 = ['Sample3', 'Sample4']

get_differential_expression(df = differential_glycomics_df,
                            group1 = group1,
                            group2 = group2,
                            motifs = True,
                            feature_set = ['exhaustive'],
                            paired = False,
                            min_samples = 0.1)

The first message is a warning about deprecated pandas behavior. The second is a divide-by-zero warning that is hard to interpret.

C:\Users\xerhma\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\glycan_data\stats.py:696: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set pd.set_option('future.no_silent_downcasting', True)
row = row.fillna(nan_placeholder)

C:\Users\xerhma\AppData\Local\Programs\Python\Python312\Lib\site-packages\scipy\stats\_morestats.py:3345: RuntimeWarning: divide by zero encountered in scalar divide
W = numer / denom
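
One possible explanation for the second warning, offered as an assumption rather than a diagnosis: if it comes from a Levene-type test for equal variances with median centering, then groups of exactly two samples always give a zero denominator, because both values lie equally far from their median. The pure-Python sketch below illustrates that arithmetic. levene_denominator is a hypothetical name and this is not scipy's implementation.

```python
from statistics import mean, median

def levene_denominator(*groups):
    """Within-group sum of squares of |x - group median|, the quantity a
    Levene-type statistic divides by (up to a constant factor)."""
    total = 0.0
    for group in groups:
        center = median(group)
        deviations = [abs(x - center) for x in group]
        dev_mean = mean(deviations)
        total += sum((d - dev_mean) ** 2 for d in deviations)
    return total

# Two samples per group, as in the example above
# (values chosen so the arithmetic is exact in floating point)
print(levene_denominator([1.0, 3.0], [2.0, 8.0]))  # 0.0

# Three samples per group no longer force a zero denominator
print(levene_denominator([1.0, 3.0, 4.0], [2.0, 8.0, 9.0]))
```

If this assumption holds, the warning would be a consequence of the group size rather than of the glycan values themselves.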

Bad IUPAC sequence in glycowork/glycan_data/v9_df_species.csv

I found the following bad IUPAC sequence in glycowork/glycan_data/v9_df_species.csv:

{Gal(b1-?)GlcNAc(b1-?)}{Neu5Ac(a2-?)}Gal(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc

It has two Man(a1-3) linkages connected to the same Man(b1-4) residue.
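
Sequences like this can be flagged automatically. The sketch below is a minimal check that assumes balanced brackets and IUPAC-condensed notation. For every residue it collects the linkages attached to it and reports positions that occur more than once. duplicate_attachment_positions is a hypothetical helper, not a glycowork function.

```python
import re
from collections import Counter

def duplicate_attachment_positions(glycan):
    """Return (residue, position) pairs where two substituents claim the same position."""
    # Floating parts in curly braces have no defined attachment site, so drop them
    core = re.sub(r"\{[^{}]*\}", "", glycan)
    # Tokens are linkages like '(a1-3)', the brackets '[' and ']', and residue names
    tokens = re.findall(r"\([^()]*\)|\[|\]|[^()\[\]]+", core)
    problems = []
    for i, token in enumerate(tokens):
        if token in ("[", "]") or token.startswith("("):
            continue
        linkages = []
        j = i - 1
        # Each ']' directly before the residue closes a branch attached to it
        while j >= 0 and tokens[j] == "]":
            linkages.append(tokens[j - 1])
            depth = 1
            j -= 1
            while j >= 0 and depth > 0:  # walk back to the matching '['
                if tokens[j] == "]":
                    depth += 1
                elif tokens[j] == "[":
                    depth -= 1
                j -= 1
        # The main-chain substituent, if any, sits right before the branches
        if j >= 0 and tokens[j].startswith("("):
            linkages.append(tokens[j])
        positions = Counter(link.strip("()").split("-")[-1] for link in linkages)
        problems += [(token, pos) for pos, n in positions.items() if n > 1 and pos != "?"]
    return problems

bad = "{Gal(b1-?)GlcNAc(b1-?)}{Neu5Ac(a2-?)}Gal(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc"
print(duplicate_attachment_positions(bad))  # [('Man', '3')]
```

Unknown positions ('?') are ignored, so the check only reports conflicts that are stated explicitly in the sequence.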

annotate_figure() cannot draw terminal glycans

annotate_figure() does not draw terminal glycan structures generated by, for example, get_differential_expression(). Example code:

import pandas as pd
from glycowork.motif.analysis import get_differential_expression
from glycowork.motif.analysis import get_volcano
from glycowork.motif.draw import annotate_figure
from glycowork.motif.draw import GlycoDraw

data = {
    'Glycan': ['Gal(b1-3)GalNAc', 'GalOS(b1-3)GalNAc', 'Gal(b1-3)[Fuc(a1-?)]GalNAc', 'GlcNAc(b1-2)Man(a1-3)Man', 'Man(a1-6)[Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man', 'Neu5Ac(a2-3)Gal(b1-3)GalNAc'],
    'Sample1': [1.1, 0.2, 0.3, 0.5, 0.7, 1.0, 0.6],
    'Sample2': [1.2, 0.1, 0.2, 0.4, 0.8, 0.9, 0.5],
    'Sample3': [0.1, 1.8, 1.9, 0.3, 0.6, 0.8, 1.2],
    'Sample4': [0.2, 1.1, 1.2, 0.2, 0.5, 0.7, 1.1],
    'Sample5': [1.3, 0.3, 0.4, 0.6, 0.9, 1.1, 0.7],
    'Sample6': [1.4, 0.4, 0.5, 0.7, 1.0, 1.2, 0.8],
    'Sample7': [0.3, 1.9, 2.0, 0.4, 0.7, 0.9, 1.3],
    'Sample8': [0.4, 1.2, 1.3, 0.3, 0.6, 0.8, 1.2]
}
differential_glycomics_df = pd.DataFrame(data)

# Define the groups
group1 = ['Sample1', 'Sample2', 'Sample5', 'Sample6']
group2 = ['Sample3', 'Sample4', 'Sample7', 'Sample8']

differential_expression = get_differential_expression(df = differential_glycomics_df,
                            group1 = group1,
                            group2 = group2,
                            motifs = True,
                            feature_set = ['terminal1', 'terminal2', 'terminal3'],
                            paired = False,
                            min_samples = 0.1)

print(differential_expression)

# Differential glycomics volcano plot
volcano = get_volcano(differential_expression,
                      y_thresh = 0.05,
                      annotate_volcano = True,
                      filepath = './volcano.svg')

For an example that works, just replace the feature_set with 'exhaustive' or 'known'.

GlycoDraw can draw terminal glycans, even though it looks a bit weird. It seems to be interpreted as a modification called "Terminal_" on the left-most glycan. Example:

GlycoDraw('Terminal_Gal(b1-?)[Fuc(a1-?)]GalNAc')

A more logical representation might perhaps be to draw a linkage at the reducing end? If so it seems to me that this would require changes in how terminal structures are determined.

About Model Analysis

Dear Glycowork
I saw the documentation stating that we can use analyze_ml_model to analyze machine learning models. Is there a similar method for analyzing deep learning models?

terminal1 combined with terminal2 doesn't work as feature set

Similar to issue #53. When terminal1 is combined with terminal2, a similar error happens. It seems to be specific to this combination; I tried a few other combinations and they worked. I assume there's a similar problem with the logic, but I couldn't make sense of it when I looked at the code, so I can't provide a code solution.

Example code below:

# Setup
from glycowork.motif.analysis import get_heatmap
import pandas as pd
data = {
    'Glycan': ['Gal(b1-3)GalNAc', 'GalOS(b1-3)GalNAc', 'Gal(b1-3)[Fuc(a1-?)]GalNAc', 'GlcNAc(b1-2)Man(a1-3)Man', 'Man(a1-6)[Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man', 'Neu5Ac(a2-3)Gal(b1-3)GalNAc'],
    'Sample1': [1.1, 0.2, 0.3, 0.5, 0.7, 1.0, 0.6],
    'Sample2': [1.2, 0.1, 0.2, 0.4, 0.8, 0.9, 0.5],
    'Sample3': [0.1, 1.8, 1.9, 0.3, 0.6, 0.8, 1.2],
    'Sample4': [0.2, 1.1, 1.2, 0.2, 0.5, 0.7, 1.1],
    'Sample5': [1.3, 0.3, 0.4, 0.6, 0.9, 1.1, 0.7],
    'Sample6': [1.4, 0.4, 0.5, 0.7, 1.0, 1.2, 0.8],
    'Sample7': [0.3, 1.9, 2.0, 0.4, 0.7, 0.9, 1.3],
    'Sample8': [0.4, 1.2, 1.3, 0.3, 0.6, 0.8, 1.2]
}
data = pd.DataFrame(data)

# This works
get_heatmap(data,
           motifs = True,
           feature_set=[
               'terminal1',
               #'terminal2',
               #'terminal3'
                       ])

# This works
get_heatmap(data,
           motifs = True,
           feature_set=[
               #'terminal1',
               'terminal2',
               #'terminal3'
                       ])

# This fails
get_heatmap(data,
           motifs = True,
           feature_set=[
               'terminal1',
               'terminal2',
               #'terminal3'
                       ])

# This works
get_heatmap(data,
           motifs = True,
           feature_set=[
               'terminal1',
               #'terminal2',
               'terminal3'
                       ])

# This works
get_heatmap(data,
           motifs = True,
           feature_set=[
               #'terminal1',
               'terminal2',
               'terminal3'
                       ])

# This fails
get_heatmap(data,
           motifs = True,
           feature_set=[
               'terminal1',
               'terminal2',
               'terminal3'
                       ])
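
The manual checks above could also be run as a loop. The sketch below tries every non-empty combination of options and records which ones raise. find_failing_combinations is a hypothetical helper, and fake_heatmap is a stand-in that only mimics the behaviour reported in this issue so that the example runs without glycowork.

```python
from itertools import combinations

def find_failing_combinations(func, options):
    """Call func(list(combo)) for every non-empty combination of options
    and collect the combinations that raise, with the exception text."""
    failures = []
    for size in range(1, len(options) + 1):
        for combo in combinations(options, size):
            try:
                func(list(combo))
            except Exception as err:  # broad on purpose: we only record what breaks
                failures.append((combo, f"{type(err).__name__}: {err}"))
    return failures

# Stand-in for the real call, mimicking the behaviour reported above
def fake_heatmap(feature_set):
    if 'terminal1' in feature_set and 'terminal2' in feature_set:
        raise ValueError("Length mismatch")

options = ['terminal1', 'terminal2', 'terminal3']
for combo, message in find_failing_combinations(fake_heatmap, options):
    print(combo, '->', message)
```

To test the real function, the stand-in could be swapped for lambda fs: get_heatmap(data, motifs = True, feature_set = fs).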

Wrong orientation of core-fucoses in `GlycoDraw`.

Hi! Thanks for developing this wonderful package. The GlycoDraw function produces really modern, publication-ready SNFG cartoons.

I've found some bugs though:

GlycoDraw may output empty SVG files when vertical is set to True.
The orientation of the core-fucose is unorthodox. For example, GlycoDraw draws a cartoon for "H4N3F1" as follows:

However, the core-fucose is more often drawn in another orientation by the N-glycomics community:

Bad IUPAC sequence in glycowork/glycan_data/v9_df_species.csv

I found the following bad IUPAC sequence in glycowork/glycan_data/v9_df_species.csv:

GlcNAc(b1-2)Man(a1-?)[Man(a1-?)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)GlcNAc

It is missing a ] just before the reducing end GlcNAc.
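
A missing bracket of this kind can be caught with a simple stack-based check. The sketch below is plain Python; find_unbalanced_brackets is a hypothetical helper and not part of glycowork.

```python
def find_unbalanced_brackets(glycan):
    """Return (position, character) for every bracket that has no partner."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack, problems = [], []
    for pos, char in enumerate(glycan):
        if char in '([{':
            stack.append((pos, char))
        elif char in pairs:
            if stack and stack[-1][1] == pairs[char]:
                stack.pop()  # matched the most recent opener
            else:
                problems.append((pos, char))  # closer without a matching opener
    # Anything left on the stack was opened but never closed
    return problems + stack

bad = "GlcNAc(b1-2)Man(a1-?)[Man(a1-?)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)GlcNAc"
for pos, char in find_unbalanced_brackets(bad):
    print(f"unmatched '{char}' at position {pos}")
```

Running such a check over the glycan column of a CSV file would list every sequence with unmatched brackets, though it says nothing about whether the balanced ones are chemically sensible.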

get_pca() errors and unable to input groups as df

Description
get_pca() emits warnings indicating that parts of the code rely on deprecated pandas behavior.

get_pca() does not seem to accept a groups df, as described in the documentation, for drawing colors and shapes. It seems limited to using a list to draw color.

Environment

Using a fresh install of the dev version with the following command on windows: pip install glycowork[draw]@git+https://github.com/BojarLab/glycowork.git@dev
Using python 3.12 in jupyter-lab 4.2.1 to run the code

Code for deprecated code

import pandas as pd
from glycowork.motif.analysis import get_pca

# Sample data
data = {
    'Glycan': ['Gal(b1-3)GalNAc', 'GalOS(b1-3)GalNAc', 'Gal(b1-3)[Fuc(a1-?)]GalNAc', 'Fuc(a1-2)Gal(b1-3)GalNAc', 'Fuc(a1-?)[HexNAc(?1-?)]GalNAc'],
    'Sample1': [0.5, 0.3, 0.2, 0.7, 0.9],
    'Sample2': [0.4, 0.2, 0.3, 0.6, 0.8],
    'Sample3': [0.3, 0.1, 0.4, 0.5, 0.7]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Perform PCA
get_pca(df, groups=None, motifs=True, feature_set=['known', 'exhaustive'], pc_x=1, pc_y=2, color=None, shape=None, filepath='', custom_motifs=[], transform=None, rarity_filter=0.05)

Error for deprecated code

C:\Users\xerhma\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\annotate.py:406: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do frame.T.groupby(...) without axis instead.
out_matrix = out_matrix.groupby(by = out_matrix.columns, axis = 1).sum()

Code for bugged groups input

# Sample data
data = {
    'Glycan': ['Gal(b1-3)GalNAc', 'GalOS(b1-3)GalNAc', 'Gal(b1-3)[Fuc(a1-?)]GalNAc', 'Fuc(a1-2)Gal(b1-3)GalNAc', 'Fuc(a1-?)[HexNAc(?1-?)]GalNAc'],
    'Sample1': [0.5, 0.3, 0.2, 0.7, 0.9],
    'Sample2': [0.4, 0.2, 0.3, 0.6, 0.8],
    'Sample3': [0.3, 0.1, 0.4, 0.5, 0.7]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Sample groups data
groups_data = {
    'id': ['Sample1', 'Sample2', 'Sample3'],
    'Treatment': ['Control', 'Treatment', 'Control']
}

# Create groups DataFrame
groups_df = pd.DataFrame(groups_data)

# Perform PCA
get_pca(df, groups=groups_df, motifs=True, feature_set=['known', 'exhaustive'], pc_x=1, pc_y=2, color='Treatment', shape=None, filepath='', custom_motifs=[], transform=None, rarity_filter=0.05)

Error for bugged groups input

C:\Users\xerhma\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\annotate.py:406: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do frame.T.groupby(...) without axis instead.
out_matrix = out_matrix.groupby(by = out_matrix.columns, axis = 1).sum()

ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_30572\2247117151.py in ?()
18 # Create groups DataFrame
19 groups_df = pd.DataFrame(groups_data)
20
21 # Perform PCA
---> 22 get_pca(df, groups=groups_df, motifs=True, feature_set=['known', 'exhaustive'], pc_x=1, pc_y=2, color='Treatment', shape=None, filepath='', custom_motifs=[], transform=None, rarity_filter=0.05)

~\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\analysis.py in ?(df, groups, motifs, feature_set, pc_x, pc_y, color, shape, filepath, custom_motifs, transform, rarity_filter)
497 # get pca
498 if motifs:
499 # Motif extraction and quantification
500 df = quantify_motifs(df.iloc[:, 1:], df.iloc[:, 0].values.tolist(), feature_set, custom_motifs = custom_motifs, remove_redundant = False).T.reset_index()
--> 501 X = np.array(df.iloc[:, 1:len(groups)+1].T) if groups and isinstance(groups, list) else np.array(df.iloc[:, 1:].T)
502 scaler = StandardScaler()
503 X_std = scaler.fit_transform(X)
504 pca = PCA()

~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\generic.py in ?(self)
1575 @final
1576 def __nonzero__(self) -> NoReturn:
-> 1577 raise ValueError(
1578 f"The truth value of a {type(self).__name__} is ambiguous. "
1579 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1580 )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
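
The traceback points at the expression groups and isinstance(groups, list), which asks for the truth value of groups before checking its type. A DataFrame refuses that question. The sketch below reproduces the pattern without pandas: FakeFrame is a hypothetical stand-in that raises on truth testing, as a DataFrame does. Putting the type check first is one way to avoid the error. It is shown as an illustration, not as the fix adopted by glycowork.

```python
class FakeFrame:
    """Stand-in for a pandas DataFrame: refuses to be used as a boolean."""
    def __bool__(self):
        raise ValueError("The truth value of a FakeFrame is ambiguous.")

def uses_list_groups_buggy(groups):
    # Mirrors the order in the traceback: truthiness is evaluated first
    return bool(groups and isinstance(groups, list))

def uses_list_groups_safe(groups):
    # Type check first; `and` short-circuits before any truth test on a frame
    return isinstance(groups, list) and len(groups) > 0

print(uses_list_groups_safe(FakeFrame()))               # False
print(uses_list_groups_safe(['Control', 'Treatment']))  # True
print(uses_list_groups_safe(None))                      # False

try:
    uses_list_groups_buggy(FakeFrame())
except ValueError as err:
    print("buggy order raises:", err)
```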

Code for successful drawing using list as groups input

# Sample data
data = {
    'Glycan': ['Gal(b1-3)GalNAc', 'GalOS(b1-3)GalNAc', 'Gal(b1-3)[Fuc(a1-?)]GalNAc', 'Fuc(a1-2)Gal(b1-3)GalNAc', 'Fuc(a1-?)[HexNAc(?1-?)]GalNAc'],
    'Sample1': [0.5, 0.3, 0.2, 0.7, 0.9],
    'Sample2': [0.4, 0.2, 0.3, 0.6, 0.8],
    'Sample3': [0.3, 0.1, 0.4, 0.5, 0.7]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Perform PCA
get_pca(df, groups = ['Control', 'Treatment', 'Control'], motifs=True, feature_set=['known', 'exhaustive'], pc_x=1, pc_y=2, shape=None, filepath='', custom_motifs=[], transform=None, rarity_filter=0.05)

v3_sugarbase.csv WURCS and glytoucan_acc columns do not match the glycan column

Hello,

While looking at the v3_sugarbase.csv static file, I noticed that there's a mismatch between the glycan column's IUPAC notation and the WURCS and glytoucan_acc columns.

For example, the row with glycan_id = 2, the glycan is GlcNAc(b1-2)[Gal(b1-3)[Neu5Ac(a2-6)]GlcNAc(b1-4)]Man(a1-3)[GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc, which is a NeuAc-containing glycan:

The WURCS column contains a much shorter sequence, WURCS=2.0/2,5,4/[a2122h-1b_1-5][a2211m-1a_1-5]/1-2-2-2-2/a2-b1_b4-c1_c3-d1_c4-e1, which does not contain NeuAc. It parses to:

and the glytoucan_acc column references https://glytoucan.org/Structures/Glycans/G52117LP, which matches my parsing.

There are many more examples like this, but I wasn't able to successfully parse the whole table.
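
One cheap way to flag rows like this without fully parsing either notation is to compare residue counts. The sketch below assumes the WURCS 2.0 header lists unique residues, total residues, and linkages in that order (which matches the example above), and that an IUPAC-condensed glycan has one more residue than it has linkages. Both function names are hypothetical and the check only detects mismatches, it does not prove that two sequences agree.

```python
import re

def wurcs_residue_count(wurcs):
    """Read the total residue count from a WURCS 2.0 header (unique,residues,linkages)."""
    match = re.match(r"WURCS=2\.0/(\d+),(\d+),(\d+)/", wurcs)
    if match is None:
        raise ValueError("not a WURCS 2.0 string")
    return int(match.group(2))

def iupac_residue_count(glycan):
    """Count residues in an IUPAC-condensed string as number of linkages + 1."""
    linkages = re.findall(r"\([ab?][\d?]-[\d?/]+\)", glycan)
    return len(linkages) + 1

glycan = ("GlcNAc(b1-2)[Gal(b1-3)[Neu5Ac(a2-6)]GlcNAc(b1-4)]Man(a1-3)"
          "[GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc")
wurcs = ("WURCS=2.0/2,5,4/[a2122h-1b_1-5][a2211m-1a_1-5]/1-2-2-2-2/"
         "a2-b1_b4-c1_c3-d1_c4-e1")

print(iupac_residue_count(glycan), wurcs_residue_count(wurcs))
```

For the row quoted above, the two counts differ, which is consistent with the mismatch described.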

About Model Training

Dear Glycowork
When I train the SweetNet model, the results differ between runs even though I use the same data each time. Which parameters do I need to fix so that the results are consistent from run to run? I trained the model on a GPU.
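
A common first step for this kind of question is to seed every random number generator before building the model and the data loaders. The sketch below is a general PyTorch-style recipe, not a confirmed answer from the glycowork maintainers. `seed_everything` is a hypothetical helper, and NumPy and PyTorch are only seeded if they are installed. Some GPU operations can remain non-deterministic even with fixed seeds.

```python
import random

def seed_everything(seed=42):
    """Seed the common sources of randomness; NumPy and PyTorch only if installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
        # cuDNN autotuning can select different kernels between runs
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)
```

The call would go at the top of the training script, before weight initialization and before any shuffling of the training data.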

terminal2 feature set broken

At least in get_heatmap() and get_differential_expression() on the dev branch (latest commit), using terminal2 as a feature set produces a lengthy error message. It ends with a length mismatch in which the expected axis has twice as many elements as the new values. Interestingly, terminal1 and terminal3 work just fine.

Example code:

# Setup
from glycowork.motif.analysis import get_heatmap
import pandas as pd
data = {
    'Glycan': ['Gal(b1-3)GalNAc', 'GalOS(b1-3)GalNAc', 'Gal(b1-3)[Fuc(a1-?)]GalNAc', 'GlcNAc(b1-2)Man(a1-3)Man', 'Man(a1-6)[Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man', 'Neu5Ac(a2-3)Gal(b1-3)GalNAc'],
    'Sample1': [1.1, 0.2, 0.3, 0.5, 0.7, 1.0, 0.6],
    'Sample2': [1.2, 0.1, 0.2, 0.4, 0.8, 0.9, 0.5],
    'Sample3': [0.1, 1.8, 1.9, 0.3, 0.6, 0.8, 1.2],
    'Sample4': [0.2, 1.1, 1.2, 0.2, 0.5, 0.7, 1.1],
    'Sample5': [1.3, 0.3, 0.4, 0.6, 0.9, 1.1, 0.7],
    'Sample6': [1.4, 0.4, 0.5, 0.7, 1.0, 1.2, 0.8],
    'Sample7': [0.3, 1.9, 2.0, 0.4, 0.7, 0.9, 1.3],
    'Sample8': [0.4, 1.2, 1.3, 0.3, 0.6, 0.8, 1.2]
}
data = pd.DataFrame(data)

# This works
get_heatmap(data,
           motifs = True,
           feature_set=['terminal1'])

# This fails
get_heatmap(data,
           motifs = True,
           feature_set=['terminal2'])

# This works
get_heatmap(data,
           motifs = True,
           feature_set=['terminal3'])

# This fails
get_heatmap(data,
           motifs = True,
           feature_set=['terminal1','terminal2','terminal3'])

Error:

ValueError Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\processing.py:935, in rescue_glycans.<locals>.wrapper(*args, **kwargs)
933 try:
934 # Try running the original function
--> 935 return func(*args, **kwargs)
936 except Exception:
937 # If an error occurs, attempt to rescue the glycan sequences

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\annotate.py:252, in annotate_dataset(glycans, motifs, feature_set, termini_list, condense, custom_motifs)
251 bag_out = pd.concat([bag_out, shadow_bag], axis = 1).reset_index(drop = True)
--> 252 bag_out.index = glycans
253 bag_out.columns = ['Terminal_' + c for c in bag_out.columns]

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\generic.py:6313, in NDFrame.__setattr__(self, name, value)
6312 object.__getattribute__(self, name)
-> 6313 return object.__setattr__(self, name, value)
6314 except AttributeError:

File properties.pyx:69, in pandas._libs.properties.AxisProperty.__set__()

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\generic.py:814, in NDFrame._set_axis(self, axis, labels)
813 labels = ensure_index(labels)
--> 814 self._mgr.set_axis(axis, labels)
815 self._clear_item_cache()

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\internals\managers.py:238, in BaseBlockManager.set_axis(self, axis, new_labels)
236 def set_axis(self, axis: AxisInt, new_labels: Index) -> None:
237 # Caller is responsible for ensuring we have an Index object.
--> 238 self._validate_set_axis(axis, new_labels)
239 self.axes[axis] = new_labels

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\internals\base.py:98, in DataManager._validate_set_axis(self, axis, new_labels)
97 elif new_len != old_len:
---> 98 raise ValueError(
99 f"Length mismatch: Expected axis has {old_len} elements, new "
100 f"values have {new_len} elements"
101 )

ValueError: Length mismatch: Expected axis has 14 elements, new values have 7 elements

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
Cell In[39], line 2
1 # This fails
----> 2 get_heatmap(data,
3 motifs = True,
4 feature_set=['terminal1','terminal2','terminal3'])

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\analysis.py:263, in get_heatmap(df, motifs, feature_set, transform, datatype, rarity_filter, filepath, index_col, custom_motifs, return_plot, **kwargs)
261 raise ValueError("A heatmap needs to have at least two motifs.")
262 if datatype == 'response':
--> 263 df = quantify_motifs(df, df.index.tolist(), feature_set, custom_motifs = custom_motifs)
264 elif datatype == 'presence':
265 # Count glycan motifs and remove rare motifs from the result
266 df_motif = annotate_dataset(df.index.tolist(), feature_set = feature_set, condense = True, custom_motifs = custom_motifs)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\annotate.py:308, in quantify_motifs(df, glycans, feature_set, custom_motifs, remove_redundant)
306 df = pd.read_csv(df) if df.endswith(".csv") else pd.read_excel(df)
307 # Motif extraction
--> 308 df_motif = annotate_dataset(glycans, feature_set = feature_set,
309 condense = True, custom_motifs = custom_motifs)
310 collect_dic = {}
311 df = df.T

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\processing.py:940, in rescue_glycans.<locals>.wrapper(*args, **kwargs)
938 rescued_args = [canonicalize_iupac(arg) if isinstance(arg, str) else [canonicalize_iupac(a) for a in arg] if isinstance(arg, list) and arg and isinstance(arg[0], str) else arg for arg in args]
939 # After rescuing, attempt to run the function again
--> 940 return func(*rescued_args, **kwargs)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\glycowork\motif\annotate.py:252, in annotate_dataset(glycans, motifs, feature_set, termini_list, condense, custom_motifs)
250 shadow_bag = pd.DataFrame([{i: j.count(i) for i in repertoire if '?' in i} for j in shadow_glycans])
251 bag_out = pd.concat([bag_out, shadow_bag], axis = 1).reset_index(drop = True)
--> 252 bag_out.index = glycans
253 bag_out.columns = ['Terminal_' + c for c in bag_out.columns]
254 shopping_cart.append(bag_out)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\generic.py:6313, in NDFrame.__setattr__(self, name, value)
6311 try:
6312 object.__getattribute__(self, name)
-> 6313 return object.__setattr__(self, name, value)
6314 except AttributeError:
6315 pass

File properties.pyx:69, in pandas._libs.properties.AxisProperty.__set__()

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\generic.py:814, in NDFrame._set_axis(self, axis, labels)
809 """
810 This is called from the cython code when we set the index attribute
811 directly, e.g. series.index = [1, 2, 3].
812 """
813 labels = ensure_index(labels)
--> 814 self._mgr.set_axis(axis, labels)
815 self._clear_item_cache()

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\internals\managers.py:238, in BaseBlockManager.set_axis(self, axis, new_labels)
236 def set_axis(self, axis: AxisInt, new_labels: Index) -> None:
237 # Caller is responsible for ensuring we have an Index object.
--> 238 self._validate_set_axis(axis, new_labels)
239 self.axes[axis] = new_labels

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\internals\base.py:98, in DataManager._validate_set_axis(self, axis, new_labels)
95 pass
97 elif new_len != old_len:
---> 98 raise ValueError(
99 f"Length mismatch: Expected axis has {old_len} elements, new "
100 f"values have {new_len} elements"
101 )

ValueError: Length mismatch: Expected axis has 14 elements, new values have 7 elements
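
The final message says the concatenated frame has twice as many rows as there are glycans. One mechanism that produces exactly this with `pd.concat(..., axis = 1)` is a pair of frames whose row labels differ, so that rows are unioned instead of aligned. The sketch below illustrates that mechanism on toy data. Whether it is the actual cause inside annotate_dataset is an assumption, and the column names are made up for the example.

```python
import pandas as pd

glycans = ["Gal(b1-3)GalNAc", "Neu5Ac(a2-3)Gal(b1-3)GalNAc"]
bag_out = pd.DataFrame({"Gal": [1, 0]}, index=glycans)  # rows labelled by glycan
shadow_bag = pd.DataFrame({"Gal(b1-?)": [1, 0]})         # rows labelled 0..n-1

# Different row labels: concat unions them, doubling the row count.
misaligned = pd.concat([bag_out, shadow_bag], axis=1)

# Same positional labels on both sides: rows line up one to one.
aligned = pd.concat([bag_out.reset_index(drop=True), shadow_bag], axis=1)
aligned.index = glycans

print(misaligned.shape)
print(aligned.shape)
```

Resetting the index on both frames before concatenating, rather than after, keeps the row count equal to the number of glycans.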

00_core.ipynb: ValueError

characterize_monosaccharide('Xyl', rank = 'Kingdom', focus = 'Plantae', modifications = True)
ValueError: cannot set using a list-like indexer with a different length than the value

Mysterious error in get_differential_expression() "Cannot index with multidimensional key"

When running a similar example as issue #51, there's a mysterious error "Cannot index with multidimensional key". Running the dev branch.

import pandas as pd
from glycowork.motif.analysis import get_differential_expression
from glycowork.motif.analysis import get_volcano
from glycowork.motif.draw import annotate_figure
from glycowork.motif.draw import GlycoDraw

data = {
    'Glycan': ['Gal(b1-3)GalNAc', 'GalOS(b1-3)GalNAc', 'Gal(b1-3)[Fuc(a1-?)]GalNAc', 'GlcNAc(b1-2)Man(a1-3)Man', 'Man(a1-6)[Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man', 'Neu5Ac(a2-3)Gal(b1-3)GalNAc'],
    'Sample1': [1.1, 0.2, 0.3, 0.5, 0.7, 1.0, 0.6],
    'Sample2': [1.2, 0.1, 0.2, 0.4, 0.8, 0.9, 0.5],
    'Sample3': [0.1, 1.8, 1.9, 0.3, 0.6, 0.8, 1.2],
    'Sample4': [0.2, 1.1, 1.2, 0.2, 0.5, 0.7, 1.1],
    'Sample5': [1.3, 0.3, 0.4, 0.6, 0.9, 1.1, 0.7],
    'Sample6': [1.4, 0.4, 0.5, 0.7, 1.0, 1.2, 0.8],
    'Sample7': [0.3, 1.9, 2.0, 0.4, 0.7, 0.9, 1.3],
    'Sample8': [0.4, 1.2, 1.3, 0.3, 0.6, 0.8, 1.2]
}
differential_glycomics_df = pd.DataFrame(data)

# Define the groups
group1 = ['Sample1', 'Sample2', 'Sample5', 'Sample6']
group2 = ['Sample3', 'Sample4', 'Sample7', 'Sample8']

differential_expression = get_differential_expression(df = differential_glycomics_df,
                            group1 = group1,
                            group2 = group2,
                            motifs = True,
                            feature_set = ['terminal1'],
                            paired = False,
                            min_samples = 0.1)

print(differential_expression)

Getting an error when using lectinOracle in glycowork 0.8.0

Running the lectinOracle example from the glycowork documentation gives the following error:

02_glycan_data: ValueError

df_glycan.index = df_glycan.glycan.values.tolist()
AttributeError: 'DataFrame' object has no attribute 'glycan'

construct_network() can't handle (a1), (α1), (b1) or (β1) linkages and canonicalize_iupac() do not fix them

When I was trying to construct a network from a large list of around 100 glycans, I kept getting an error that I found difficult to interpret. Eventually I found that, after processing with canonicalize_iupac(), some of the glycan linkages were in the form (b1) or (a1) instead of the expected (b1-?) or (a1-?), and that these glycans were behind the error.

Further downstream, construct_network() seems to treat glycans with linkages written like these as being in the Oxford format. Whenever such glycans are part of the list I get the same kind of error, for example:

construct_network(["Gal(b1)GalNAc"])

Gives the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[135], line 2
      1 # Top most expressed glycans
----> 2 construct_network(["Gal(b1)GalNAc"])

File ~\anaconda3\Lib\site-packages\glycowork\network\biosynthesis.py:663, in construct_network(glycans, allowed_ptms, edge_type, permitted_roots, abundances)
    661 network.remove_edges_from(to_remove)
    662 # Create edge and node labels
--> 663 nx.set_edge_attributes(network, {el: find_diff(el[0], el[1], graph_dic) for el in network.edges()}, 'diffs')
    664 virtual_labels = {k: (1 if k in virtual_nodes else 0) for k in network.nodes()}
    665 nx.set_node_attributes(network, virtual_labels, 'virtual')

File ~\anaconda3\Lib\site-packages\glycowork\network\biosynthesis.py:663, in <dictcomp>(.0)
    661 network.remove_edges_from(to_remove)
    662 # Create edge and node labels
--> 663 nx.set_edge_attributes(network, {el: find_diff(el[0], el[1], graph_dic) for el in network.edges()}, 'diffs')
    664 virtual_labels = {k: (1 if k in virtual_nodes else 0) for k in network.nodes()}
    665 nx.set_node_attributes(network, virtual_labels, 'virtual')

File ~\anaconda3\Lib\site-packages\glycowork\network\biosynthesis.py:180, in find_diff(glycan_a, glycan_b, graph_dic)
    178 matched = subgraph_isomorphism(graphs[0], graphs[1], return_matches = True)
    179 if not isinstance(matched, bool) and matched[0]:
--> 180   return graph_to_string(graphs[0].subgraph([k for k in graphs[0].nodes() if k not in matched[1][0]]))
    181 else:
    182   return 'disregard'

File ~\anaconda3\Lib\site-packages\glycowork\motif\graph.py:450, in graph_to_string(graph)
    448   return parts[:parts.rfind('{')] + parts[parts.rfind('{')+1:]
    449 else:
--> 450   return graph_to_string_int(graph)

File ~\anaconda3\Lib\site-packages\glycowork\motif\graph.py:426, in graph_to_string_int(graph)
    424 if ')(' in nodes and ((nodes.index(')(') < nodes.index('(')) or (nodes[:nodes.index(')(')].count(')') == nodes[:nodes.index(')(')].count('('))):
    425   nodes = nodes.replace(')(', '(', 1)
--> 426 return canonicalize_iupac(nodes.strip('()'))

File ~\anaconda3\Lib\site-packages\glycowork\motif\processing.py:824, in canonicalize_iupac(glycan)
    822   return
    823 elif ((glycan[-1].isdigit() and bool(re.search("[A-Z]", glycan))) or (glycan[-2].isdigit() and glycan[-1] == ']') or glycan.endswith('B') or glycan.endswith("LacDiNAc")) and 'e' not in glycan and '-' not in glycan:
--> 824   glycan = oxford_to_iupac(glycan)
    825 # Canonicalize usage of monosaccharides and linkages
    826 replace_dic = {'Nac': 'NAc', 'AC': 'Ac', 'Nc': 'NAc', 'NeuAc': 'Neu5Ac', 'NeuNAc': 'Neu5Ac', 'NeuGc': 'Neu5Gc',
    827                '\u03B1': 'a', '\u03B2': 'b', 'N(Gc)': 'NGc', 'GL': 'Gl', 'GaN': 'GalN', '(9Ac)': '9Ac',
    828                'KDN': 'Kdn', 'OSO3': 'S', '-O-Su-': 'S', '(S)': 'S', 'SO3-': 'S', 'SO3(-)': 'S', 'H2PO3': 'P', '(P)': 'P',
    829                '–': '-', ' ': '', ',': '-', 'α': 'a', 'β': 'b', 'ß': 'b', '.': '', '((': '(', '))': ')', '→': '-',
    830                'Glcp': 'Glc', 'Galp': 'Gal', 'Manp': 'Man', 'Fucp': 'Fuc', 'Neup': 'Neu', 'a?': 'a1',
    831                '5Ac4Ac': '4Ac5Ac', '(-)': '(?1-?)'}

File ~\anaconda3\Lib\site-packages\glycowork\motif\processing.py:715, in oxford_to_iupac(oxford)
    710 oxford_wo_branches = bracket_removal(oxford)
    711 branches = {"A": int(oxford_wo_branches[oxford_wo_branches.index("A")+1]) if "A" in oxford_wo_branches and oxford_wo_branches[oxford_wo_branches.index("A")+1] != "c" else 0,
    712             "G": int(oxford_wo_branches[oxford_wo_branches.index("G")+1]) if "G" in oxford_wo_branches and oxford_wo_branches[oxford_wo_branches.index("G")+1] != "a" else 0,
    713             "S": int(oxford_wo_branches[oxford_wo_branches.index("S")+1]) if "S" in oxford_wo_branches and oxford_wo_branches[oxford_wo_branches.index("S")+1] != "g" else 0}
    714 extras = {"Sg": int(oxford_wo_branches[oxford_wo_branches.index("Sg")+2]) if "Sg" in oxford_wo_branches else 0,
--> 715           "Ga": int(oxford_wo_branches[oxford_wo_branches.index("Ga")+2]) if "Ga" in oxford_wo_branches else 0,
    716           "Lac": int(oxford_wo_branches[oxford_wo_branches.index("Lac")+3]) if "Lac" in oxford_wo_branches and oxford_wo_branches[oxford_wo_branches.index("Lac")+3] != "D" else 0,
    717           "LacDiNAc": 1 if "LacDiN" in oxford_wo_branches else 0}
    718 specified_linkages = {'Neu5Ac(a2-?)': oxford[oxford.index("S")+2:] if branches['S'] else []}
    719 specified_linkages = {k: [int(n) for n in v[:v.index(']')].split(',')] for k, v in specified_linkages.items() if v}

ValueError: invalid literal for int() with base 10: 'l'

Whereas

construct_network(["Gal(b1-?)GalNAc"])

Is fine.

Furthermore, when none of the linkages in the string are fully written out, canonicalize_iupac() yields what looks to me like incorrect linkages. For example:
canonicalize_iupac("Gal(b1)Gal(b1)Gal(b1)Gal(b1)Gal(b1)Gal(b1)GalNAc")
Yields:
'Gal(b1-1)Gal(b1-1)Gal(b1-1)Gal(b1-1)Gal(b1-1)Gal(b1-1)GalNAc'
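
Until this is handled inside the library, a possible workaround is to complete the bare linkages before passing the strings on. The sketch below is a hypothetical pre-processing helper, not glycowork's own behavior. It assumes that a linkage written as `(b1)` or `(a1)` means the attachment position is unknown and should become `(b1-?)` or `(a1-?)`.

```python
import re

def complete_bare_linkages(glycan):
    """Turn linkages such as (b1) or (a2) into (b1-?) / (a2-?); full linkages are untouched."""
    return re.sub(r"\(([ab?])(\d)\)", r"(\1\2-?)", glycan)

print(complete_bare_linkages("Gal(b1)GalNAc"))
print(complete_bare_linkages("Gal(b1)Gal(b1-3)GalNAc"))
```

Greek-letter linkages such as (α1) or (β1) are not covered by this pattern and would need to be converted to a/b first.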

Custom motifs do not work for several functions

I've been struggling to make use of custom motifs. Having looked at the code I suspect it might not work for other functions than annotate_dataset(). For example, get_heatmap() does not pass custom_motifs to annotate_dataset(), which maybe was intended to be done with **kwargs? I tested some naive solutions, for example by adding **kwargs to the get_heatmap() call to annotate_dataset(). As another example I tried adding custom_motifs = [] as a variable in get_heatmap() and pass it on to annotate_dataset() with custom_motifs = custom_motifs but this did not work. I believe more thorough changes might be needed to be able to use custom motifs for most functions.

Incorrect error when using exhaustive feature set

Running latest dev commit db6c19d

The error "Warning: ['exhaustive'] not recognized as features." comes up when using exhaustive as a feature set. I cannot grasp why by looking at the code and it seems to otherwise work as expected. Sample code below.

from glycowork.motif.annotate import annotate_dataset

annotate_dataset(glycans = ['Gal(b1-3)GalNAc', 
                            'GalOS(b1-3)GalNAc', 
                            'Gal(b1-3)[Fuc(a1-?)]GalNAc', 
                            'GlcNAc(b1-2)Man(a1-3)Man', 
                            'Man(a1-6)[Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 
                            'Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man', 
                            'Neu5Ac(a2-3)Gal(b1-3)GalNAc'],
                feature_set = ['exhaustive'])

Scipy incompatibility error: which version should I use?

Hello,

Thank you for this package! I'm having some trouble loading it. TL;DR: using python 3.8 and scipy 1.8.0, I get a module load error for 'scipy.sparse.linalg.eigen.arpack'. This only happens when I run

from glycowork.motif import analysis

It would be helpful if you could suggest the version of scipy that will let me run the code, and/or document the version compatibility.

Thanks!

Here's my full error message:

ModuleNotFoundError                       Traceback (most recent call last)
Input In [15], in <cell line: 1>()
----> 1 from glycowork.motif import analysis

File /opt/homebrew/Caskroom/miniforge/base/envs/alive/lib/python3.8/site-packages/glycowork/motif/analysis.py:12, in <module>
      9 from sklearn.manifold import TSNE
     11 from glycowork.glycan_data.loader import lib, glycan_emb, df_species
---> 12 from glycowork.motif.annotate import annotate_dataset, link_find
     13 from glycowork.motif.graph import subgraph_isomorphism
     16 def get_pvals_motifs(df, glycan_col_name = 'glycan', label_col_name = 'target',
     17                      libr = None, thresh = 1.645, sorting = True,
     18                      feature_set = ['exhaustive'], extra = 'termini',
     19                      wildcard_list = [], multiple_samples = False,
     20                      motifs = None, estimate_speedup = False):

File /opt/homebrew/Caskroom/miniforge/base/envs/alive/lib/python3.8/site-packages/glycowork/motif/annotate.py:7, in <module>
      4 import re
      6 from glycowork.glycan_data.loader import lib, linkages, motif_list, find_nth, unwrap
----> 7 from glycowork.motif.graph import subgraph_isomorphism, generate_graph_features, glycan_to_nxGraph, try_string_conversion, compare_glycans
      8 from glycowork.motif.processing import small_motif_find
     11 def convert_to_counts_glycoletter(glycan, libr = None):

File /opt/homebrew/Caskroom/miniforge/base/envs/alive/lib/python3.8/site-packages/glycowork/motif/graph.py:7, in <module>
      5 import numpy as np
      6 import pandas as pd
----> 7 from scipy.sparse.linalg.eigen.arpack import eigsh
      9 def character_to_label(character, libr = None):
     10   """tokenizes character by indexing passed library\n
     11   | Arguments:
     12   | :-
   (...)
     17   | Returns index of character in library
     18   """

ModuleNotFoundError: No module named 'scipy.sparse.linalg.eigen.arpack'; 'scipy.sparse.linalg.eigen' is not a package
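
In SciPy, `eigsh` is available from the public namespace `scipy.sparse.linalg`, whereas the private path in the traceback is no longer importable in the scipy 1.8.0 environment shown here. A version-tolerant import can be sketched as below. `import_first_available` is a hypothetical helper, not something glycowork provides, and the SciPy call is left commented out so the block runs without SciPy installed.

```python
import importlib

def import_first_available(candidates, attribute):
    """Return `attribute` from the first module path in `candidates` that provides it."""
    for module_path in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attribute)
        except (ImportError, AttributeError):
            continue
    raise ImportError(f"{attribute} not found in any of {candidates}")

# For the case in this issue (requires SciPy):
# eigsh = import_first_available(
#     ["scipy.sparse.linalg", "scipy.sparse.linalg.eigen.arpack"], "eigsh")

# Demonstration with standard-library modules only:
sqrt = import_first_available(["no_such_module_xyz", "math"], "sqrt")
print(sqrt(9.0))
```

Listing the public path first means newer SciPy releases never touch the private module at all.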

And my full environment:

anyio=3.5.0=py38h10201cd_0
appnope=0.1.2=py38h10201cd_2
argon2-cffi=21.3.0=pyhd8ed1ab_0
argon2-cffi-bindings=21.2.0=py38hea4295b_1
asttokens=2.0.5=pyhd8ed1ab_0
attrs=21.4.0=pyhd8ed1ab_0
babel=2.9.1=pyh44b312d_0
backcall=0.2.0=pyh9f0ad1d_0
backports=1.0=py_2
backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
beautifulsoup4=4.10.0=pyha770c72_0
bleach=4.1.0=pyhd8ed1ab_0
brotli=1.0.9=h3422bc3_6
brotli-bin=1.0.9=h3422bc3_6
brotlipy=0.7.0=py38hea4295b_1003
c-ares=1.18.1=h3422bc3_0
ca-certificates=2021.10.8=h4653dfc_0
cachecontrol=0.12.11=pypi_0
cached-property=1.5.2=hd8ed1ab_1
cached_property=1.5.2=pyha770c72_1
certifi=2021.10.8=py38h10201cd_2
cffi=1.15.0=py38hc67bbb8_0
charset-normalizer=2.0.12=pyhd8ed1ab_0
click=8.1.2=pypi_0
cryptography=36.0.1=py38h10d4710_0
cycler=0.11.0=pyhd8ed1ab_0
cython=0.29.28=pypi_0
debugpy=1.5.1=py38h6f2b01f_0
decorator=5.1.1=pyhd8ed1ab_0
defusedxml=0.7.1=pyhd8ed1ab_0
emperor=1.0.3=pypi_0
entrypoints=0.4=pyhd8ed1ab_0
et-xmlfile=1.1.0=pypi_0
executing=0.8.3=pyhd8ed1ab_0
flit-core=3.7.1=pyhd8ed1ab_0
fonttools=4.30.0=py38h33210d7_0
freetype=2.10.4=h17b34a0_1
future=0.18.2=pypi_0
giflib=5.2.1=h27ca646_2
glycowork=0.4.0=pypi_0
h5py=3.6.0=nompi_py38hacf61ce_100
hdf5=1.12.1=nompi_hf9525e8_104
hdmedians=0.14.2=pypi_0
icu=69.1=hbdafb3b_0
idna=3.3=pyhd8ed1ab_0
importlib-metadata=4.11.3=py38h10201cd_0
importlib_resources=5.4.0=pyhd8ed1ab_0
ipykernel=6.9.2=py38h2cb4d76_0
ipython=8.1.1=py38h10201cd_0
ipython_genutils=0.2.0=py_1
ipywidgets=7.6.5=pyhd8ed1ab_0
jbig=2.1=h3422bc3_2003
jedi=0.18.1=py38h10201cd_0
jinja2=3.0.3=pyhd8ed1ab_0
joblib=1.1.0=pyhd8ed1ab_0
jpeg=9e=h3422bc3_0
json5=0.9.5=pyh9f0ad1d_0
jsonschema=4.4.0=pyhd8ed1ab_0
jupyter=1.0.0=py38h10201cd_7
jupyter_client=7.1.2=pyhd8ed1ab_0
jupyter_console=6.4.3=pyhd8ed1ab_0
jupyter_core=4.9.2=py38h10201cd_0
jupyter_server=1.15.3=pyhd8ed1ab_0
jupyterlab=3.3.2=pyhd8ed1ab_0
jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
jupyterlab_server=2.10.3=pyhd8ed1ab_0
jupyterlab_widgets=1.0.2=pyhd8ed1ab_0
kiwisolver=1.3.2=py38h1670459_1
krb5=1.19.3=hf9b2bbe_0
lcms2=2.12=had6a04f_0
lerc=3.0=hbdafb3b_0
libblas=3.9.0=13_osxarm64_openblas
libbrotlicommon=1.0.9=h3422bc3_6
libbrotlidec=1.0.9=h3422bc3_6
libbrotlienc=1.0.9=h3422bc3_6
libcblas=3.9.0=13_osxarm64_openblas
libcurl=7.82.0=hb0e6552_0
libcxx=13.0.1=h6a5c8ee_0
libdeflate=1.10=h3422bc3_0
libedit=3.1.20191231=hc8eb9b7_2
libev=4.33=h642e427_1
libffi=3.4.2=h3422bc3_5
libgfortran=5.0.0.dev0=11_0_1_hf114ba7_23
libgfortran5=11.0.1.dev0=hf114ba7_23
liblapack=3.9.0=13_osxarm64_openblas
libnghttp2=1.47.0=he723fca_0
libopenblas=0.3.18=openmp_h5dd58f0_0
libpng=1.6.37=hf7e6567_2
libsodium=1.0.18=h27ca646_1
libssh2=1.10.0=hb80f160_2
libtiff=4.3.0=h77dc3b6_3
libuv=1.43.0=h3422bc3_0
libwebp=1.2.2=h0d20362_0
libwebp-base=1.2.2=h3422bc3_1
libxcb=1.13=h9b22ae9_1004
libzlib=1.2.11=hee7b306_1013
llvm-openmp=13.0.1=h455960f_1
lockfile=0.12.2=pypi_0
lz4-c=1.9.3=hbdafb3b_1
markupsafe=2.1.0=py38h33210d7_1
matplotlib-base=3.5.1=py38hb140015_0
matplotlib-inline=0.1.3=pyhd8ed1ab_0
mistune=0.8.4=py38hea4295b_1005
mpld3=0.5.7=pypi_0
msgpack=1.0.3=pypi_0
munkres=1.1.4=pyh9f0ad1d_0
natsort=8.1.0=pypi_0
nbclassic=0.3.6=pyhd8ed1ab_0
nbclient=0.5.13=pyhd8ed1ab_0
nbconvert=6.4.4=py38h10201cd_0
nbformat=5.2.0=pyhd8ed1ab_0
ncurses=6.3=hc470f4d_0
nest-asyncio=1.5.4=pyhd8ed1ab_0
networkx=2.8=pyhd8ed1ab_0
nodejs=17.4.0=habd0e26_0
notebook=6.4.9=pyha770c72_0
notebook-shim=0.1.0=pyhd8ed1ab_0
numpy=1.22.3=py38hf29d37f_0
openjpeg=2.4.0=h062765e_1
openpyxl=3.0.9=pypi_0
openssl=1.1.1n=h90dfc92_0
packaging=21.3=pyhd8ed1ab_0
pandas=1.4.1=py38h3777fb4_0
pandocfilters=1.5.0=pyhd8ed1ab_0
parso=0.8.3=pyhd8ed1ab_0
patsy=0.5.2=pyhd8ed1ab_0
pexpect=4.8.0=pyh9f0ad1d_2
pickleshare=0.7.5=py_1003
pillow=9.0.1=py38h7ff1586_2
pip=22.0.4=pyhd8ed1ab_0
prometheus_client=0.13.1=pyhd8ed1ab_0
prompt-toolkit=3.0.27=pyha770c72_0
prompt_toolkit=3.0.27=hd8ed1ab_0
psutil=5.9.0=py38hea4295b_0
pthread-stubs=0.4=h27ca646_1001
ptitprince=0.2.5=pypi_0
ptyprocess=0.7.0=pyhd3deb0d_0
pure_eval=0.2.2=pyhd8ed1ab_0
pycparser=2.21=pyhd8ed1ab_0
pygments=2.11.2=pyhd8ed1ab_0
pyhamcrest=2.0.3=pypi_0
pyopenssl=22.0.0=pyhd8ed1ab_0
pyparsing=3.0.7=pyhd8ed1ab_0
pyrsistent=0.18.1=py38hea4295b_0
pysocks=1.7.1=py38h10201cd_4
python=3.8.12=hab31e5c_3_cpython
python-dateutil=2.8.2=pyhd8ed1ab_0
python-graphviz=0.19.1=pypi_0
python_abi=3.8=2_cp38
pytz=2021.3=pyhd8ed1ab_0
pyzmq=22.3.0=py38h51b17a6_1
readline=8.1=hedafd6a_0
regex=2022.3.15=pypi_0
requests=2.27.1=pyhd8ed1ab_0
scikit-bio=0.5.7=pypi_0
scikit-learn=1.0.2=py38h2cd4032_0
scipy=1.8.0=py38hd0c9ec0_1
seaborn=0.11.2=hd8ed1ab_0
seaborn-base=0.11.2=pyhd8ed1ab_0
send2trash=1.8.0=pyhd8ed1ab_0
setuptools=60.9.3=py38h10201cd_0
six=1.16.0=pyh6c4a22f_0
sklearn=0.0=pypi_0
sniffio=1.2.0=py38h10201cd_2
soupsieve=2.3.1=pyhd8ed1ab_0
sqlite=3.37.0=h72a2b83_0
stack_data=0.2.0=pyhd8ed1ab_0
statsmodels=0.13.2=py38h691f20f_0
terminado=0.13.3=py38h10201cd_0
testpath=0.6.0=pyhd8ed1ab_0
threadpoolctl=3.1.0=pyh8a188c0_0
tk=8.6.12=he1e0b03_0
torch=1.11.0=pypi_0
tornado=6.1=py38hea4295b_2
traitlets=5.1.1=pyhd8ed1ab_0
typing-extensions=4.2.0=pypi_0
unicodedata2=14.0.0=py38hea4295b_0
urllib3=1.26.8=pyhd8ed1ab_1
wcwidth=0.2.5=pyh9f0ad1d_2
webencodings=0.5.1=py_1
websocket-client=1.3.1=pyhd8ed1ab_0
wheel=0.37.1=pyhd8ed1ab_0
widgetsnbextension=3.5.2=py38h10201cd_1
xgboost=1.6.0=pypi_0
xlrd=2.0.1=pypi_0
xorg-libxau=1.0.9=h27ca646_0
xorg-libxdmcp=1.1.3=h27ca646_0
xz=5.2.5=h642e427_1
zeromq=4.3.4=hbdafb3b_1
zipp=3.7.0=pyhd8ed1ab_1
zlib=1.2.11=hee7b306_1013
zstd=1.5.2=h861e0a7_0

more example code please...

Dear Glycowork,

Thank you for your software. I have been trying to work through it so that I can predict the binding capacity of some lectins from their sequences using lectin_oracle_flex. However, the example code in https://bojarlab.github.io/glycowork/examples.html#example2 is not enough for me to do this myself.

Would you kindly expand your example for the lectin section?

kind regards,

Peter Thorpe

Problems with LectinOracle regression training

Hi,

when retraining LectinOracle in a regression setting using the train_model(...) function I'm getting an error stating a problem with appending to a list in line 137 of glycowork/ml/model_training.py:

running_acc.append(y.cpu().detach().numpy(), pred.cpu().detach().numpy())

The problem is that append only takes one argument, not two as provided in that line.

A code example (with the train_loader and val_loader filled as in your LectinOracle Notebook):

model = LectinOracle(input_size_prot=1280, input_size_glyco=len(lib) + 1, 
        hidden_size=128, num_classes=1, data_min=data_min, data_max=data_max)
model.apply(init_weights)
model.cuda()

optimizer = optim.Adam(model.parameters(), lr=0.0005, weight_decay=0.0)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, 80)
criterion = nn.MSELoss().cuda()

model = train_model(model=model, dataloaders={"train": train_loader, "val": val_loader}, 
        criterion=criterion, optimizer=optimizer, scheduler=scheduler, num_epochs=100, 
        patience=20, mode="regression")

Compared to the metrics handling for the classification setting, the regression branch is missing the function call that computes the accuracy (compute_accuracy), as done in line 133:

running_acc.append(accuracy_score(y.cpu().detach().numpy().astype(int), pred2))

But as accuracy is not a suitable metric in a regression training, it might be better to rework the metrics for regression training in general (in the following, only accuracy and matthews coefficient are reported, both classification metrics).
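
As a rough illustration of what regression-appropriate metric handling could look like, the sketch below accumulates targets and predictions with one `extend` call per list and computes error metrics at the end of the epoch. This is not the code in model_training.py. `regression_metrics` and the toy batches are hypothetical, and plain Python lists stand in for the detached tensors.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and MAE over all accumulated samples."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

running_true, running_pred = [], []
# Toy (target, prediction) batches standing in for the data loader output.
batches = [([1.0, 2.0], [1.5, 2.0]), ([3.0], [2.0])]
for y, pred in batches:
    # One argument per call: extend each list separately.
    running_true.extend(y)
    running_pred.extend(pred)

metrics = regression_metrics(running_true, running_pred)
print(metrics)
```

Keeping two separate lists avoids the two-argument `append` call and leaves the choice of metric to the end of the epoch.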

Best, Roman

Loading pretrained models fails on cpu-only devices

Hi Daniel!

If I run the following code:

from glycowork.ml.models import SweetNet, init_weights, trained_SweetNet

On a machine without gpu (in this case github actions runner), I get the following error

from glycowork.ml.models import SweetNet, init_weights, trained_SweetNet
/usr/share/miniconda/lib/python3.8/site-packages/glycowork/ml/models.py:15: in <module>
    trained_SweetNet = torch.load(data_path)
/usr/share/miniconda/lib/python3.8/site-packages/torch/serialization.py:712: in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
/usr/share/miniconda/lib/python3.8/site-packages/torch/serialization.py:1046: in _load
    result = unpickler.load()
/usr/share/miniconda/lib/python3.8/site-packages/torch/serialization.py:1016: in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
/usr/share/miniconda/lib/python3.8/site-packages/torch/serialization.py:1001: in load_tensor
    wrap_storage=restore_location(storage, location),
/usr/share/miniconda/lib/python3.8/site-packages/torch/serialization.py:176: in default_restore_location
    result = fn(storage, location)
/usr/share/miniconda/lib/python3.8/site-packages/torch/serialization.py:152: in _cuda_deserialize
    device = validate_cuda_device(location)
/usr/share/miniconda/lib/python3.8/site-packages/torch/serialization.py:136: in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
E   RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Error: Process completed with exit code 4.
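
The error message itself names the remedy: pass `map_location` to `torch.load`. A minimal sketch of a device-aware loader follows. `resolve_device` and `load_checkpoint` are hypothetical helpers, not part of glycowork, and PyTorch is imported lazily so the first helper can be used without it.

```python
def resolve_device(cuda_available):
    """Pick the device string to map stored tensors onto."""
    return "cuda" if cuda_available else "cpu"

def load_checkpoint(path):
    """Load a saved object, remapping CUDA tensors to CPU when no GPU is present."""
    import torch  # lazy import: only needed when a checkpoint is actually loaded
    device = resolve_device(torch.cuda.is_available())
    return torch.load(path, map_location=torch.device(device))

print(resolve_device(False))
```

Applied to the line in the traceback, this would amount to calling `torch.load(data_path, map_location=...)` with the resolved device instead of the bare `torch.load(data_path)`.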
