svalkiers / clustcr

CDR3 clustering module providing a new method for fast and accurate clustering of large data sets of CDR3 amino acid sequences, and offering functionalities for downstream analysis of clustering results.

License: Other

Python 99.94% R 0.05% Dockerfile 0.01%

markov-clustering-algorithm faiss immunoinformatics

clustcr's Introduction

ClusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity

A two-step clustering approach that combines the speed of the Faiss Clustering Library with the accuracy of the Markov Clustering Algorithm.

On a standard machine*, clusTCR can cluster 1 million CDR3 sequences in under 5 minutes.
*Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz, using 8 CPUs

Compared to other state-of-the-art clustering algorithms (GLIPH2, iSMART and tcrdist), clusTCR shows comparable clustering quality, but provides a steep increase in speed and scalability.

Documentation & Install

All of our documentation, installation info and examples can be found via the Documentation & Install link above. To get you started, here's how to install clusTCR:

$ conda install clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge

There's also a GPU version available, with support for the use_gpu parameter in the Clustering interface.

$ conda install clustcr-gpu -c svalkiers -c bioconda -c pytorch -c conda-forge

Note that this works with specific GPUs only; see our docs for more information.

To update, use a similar command:

$ conda update clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge

Development Guide

Environment

To start developing, after cloning the repository, create the necessary environment

$ conda env create -f conda/env.yml

The requirements are slightly different for the GPU supported version

$ conda env create -f conda/env_gpu.yml

Building Packages

New conda packages are built with conda build.
Note that the correct channels (pytorch, bioconda and conda-forge) must be added first, or passed in the command with -c as in the install commands above.

$ conda build conda/clustcr/

For the GPU package:

$ conda build conda/clustcr-gpu/

Cite

Please cite as:

Sebastiaan Valkiers, Max Van Houcke, Kris Laukens, Pieter Meysman, ClusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity, Bioinformatics, 2021;, btab446, https://doi.org/10.1093/bioinformatics/btab446

Bibtex:

@article{valkiers2021clustcr,
    author = {Valkiers, Sebastiaan and Van Houcke, Max and Laukens, Kris and Meysman, Pieter},
    title = "{ClusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity}",
    journal = {Bioinformatics},
    year = {2021},
    month = {06},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab446},
    url = {https://doi.org/10.1093/bioinformatics/btab446},
    note = {btab446},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btab446/38660282/btab446.pdf},
}

clustcr's Issues

Could you please create a release

To promote reproducible science, could you please use git tags. Creating a tag also creates a release for your project. We require tagged releases when building scientific software. Pulling from the master is not reproducible.

I would also recommend following the semantic versioning standard, [Semver](https://semver.org/), with version numbers in the form Major.Minor.Patch. Please do not follow the git examples by putting a "v" as the leading character. GitHub will create a "release" when the tag is pushed.
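
As an illustration of the requested tag format, here is a minimal sketch of a checker for plain Major.Minor.Patch tags. The function name parse_release_tag is hypothetical and not part of clusTCR; the pattern covers only the numeric core of semantic versioning and rejects a leading "v".

```python
import re

# Major.Minor.Patch with non-negative integers and no leading zeros;
# pre-release and build metadata are deliberately left out of this sketch.
SEMVER_CORE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_release_tag(tag: str) -> tuple[int, int, int]:
    """Return (major, minor, patch) or raise ValueError for a malformed tag."""
    match = SEMVER_CORE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a Major.Minor.Patch tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


print(parse_release_tag("1.0.3"))
```

A tag such as "v1.0.3" raises ValueError here, matching the request to avoid the leading "v".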

Thank you for making your software available

How to import file output from MiXCR?

MiXCR's output file format is said to follow the AIRR standard. But when I use the following code to import my dataset in MiXCR format:

data = read_cdr3('mixcr_out/2018-R-KF-DTCR461.clonotypes.TRB.txt', data_format='airr')

an error was reported:

sys:1: DtypeWarning: Columns (8,12) have mixed types.Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
  File "/mnt/data/lianm/software/Miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'productive'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/data/lianm/software/clusTCR/clustcr/input/datasets.py", line 39, in read_cdr3
    return parse_airr(file)
  File "/mnt/data/lianm/software/clusTCR/clustcr/input/airr.py", line 5, in parse_airr
    data = data[data["productive"]==True]
  File "/mnt/data/lianm/software/Miniconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/mnt/data/lianm/software/Miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'productive'

Could you please give me some advice? Thanks a lot!
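
The KeyError above says the table has no "productive" column. One possible workaround, sketched here under assumptions, is to read the table directly with pandas and pull out the CDR3 column yourself. The column name aaSeqCDR3 is taken from another issue on this page, the helper extract_cdr3 is hypothetical, and the rows below are made up for illustration.

```python
import io

import pandas as pd


def extract_cdr3(table, cdr3_col="aaSeqCDR3"):
    """Read a tab-separated clonotype table and return unique CDR3 sequences."""
    # low_memory=False avoids the mixed-type DtypeWarning on wide tables.
    df = pd.read_csv(table, sep="\t", low_memory=False)
    if cdr3_col not in df.columns:
        raise KeyError(f"column {cdr3_col!r} not found; available: {list(df.columns)}")
    return df[cdr3_col].dropna().drop_duplicates().reset_index(drop=True)


# Tiny in-memory stand-in for a MiXCR export (made-up rows, for illustration).
example = io.StringIO(
    "cloneId\taaSeqCDR3\tvGene\n"
    "0\tCASSLGQGHYNEQFF\tTRBV7-9\n"
    "1\tCASSPGQGHYNEQFF\tTRBV7-9\n"
    "2\tCASSLGQGHYNEQFF\tTRBV7-9\n"
)
cdr3 = extract_cdr3(example)
print(cdr3.tolist())
```

The resulting Series of sequences could then be passed to clustering.fit, in the same way other examples on this page pass a column of CDR3 sequences.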

Enabling Travis CI to automatically check if conda build is successful

No objects to concatenate in clustering.fit

Hi!

After updating to the latest release of clusTCR, I am facing an issue while attempting to fit the clustering to data (please see the complete traceback below). The same functions worked perfectly with the previous version. I initialize the clustering object like clustering = Clustering(n_cpus=24, chain='A') (though the same error occurs if I don't specify the chain, both for TRA and TRB input data). I'd be grateful for your help with this issue.

ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 output = clustering.fit(tra_data, include_vgene = True, 
      2                         cdr3_col="aaSeqCDR3", 
      3                         v_gene_col="vGene")

File ~/anaconda3/envs/clustcr_103/lib/python3.10/site-packages/clustcr/clustering/tools.py:96, in timeit.<locals>.timed(*args, **kwargs)
     94 def timed(*args, **kwargs):
     95     start = time.time()
---> 96     result = myfunc(*args, **kwargs)
     97     end = time.time()
     98     print(f'Total time to run ClusTCR: {(end-start):.3f}s')

File ~/anaconda3/envs/clustcr_103/lib/python3.10/site-packages/clustcr/clustering/clustering.py:429, in Clustering.fit(self, data, include_vgene, cdr3_col, v_gene_col, alpha)
    425 """
    426 Function that calls the indicated clustering method and returns clusters in a ClusteringResult
    427 """
    428 if include_vgene:
--> 429     return self._vgene_clustering(data, cdr3_col, v_gene_col)
    430 else:
    431     try:

File ~/anaconda3/envs/clustcr_103/lib/python3.10/site-packages/clustcr/clustering/clustering.py:346, in Clustering._vgene_clustering(self, data, cdr3_col, v_gene_col)
    343 super_clusters = self._faiss(subset["junction_aa"])
    344 # Second clustering step
    345 clusters = ClusteringResult(
--> 346     MCL_multiprocessing_from_preclusters(
    347         super_clusters, self.mcl_params, self.n_cpus
    348         ), chain=self.chain
    349                             ).clusters_df
    350 clusters.cluster += c # adjust cluster identifiers to ensure they stay unique
    351 subset = subset.merge(clusters, left_on="junction_aa", right_on="junction_aa")

File ~/anaconda3/envs/clustcr_103/lib/python3.10/site-packages/clustcr/clustering/methods.py:139, in MCL_multiprocessing_from_preclusters(preclust, mcl_hyper, n_cpus)
    137     if c != 0:
    138         nodelist[c]['cluster'] += nodelist[c - 1]['cluster'].max() + 1
--> 139 return pd.concat(nodelist, ignore_index=True)

File ~/anaconda3/envs/clustcr_103/lib/python3.10/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    379 elif copy and using_copy_on_write():
    380     copy = False
--> 382 op = _Concatenator(
    383     objs,
    384     axis=axis,
    385     ignore_index=ignore_index,
    386     join=join,
    387     keys=keys,
    388     levels=levels,
    389     names=names,
    390     verify_integrity=verify_integrity,
    391     copy=copy,
    392     sort=sort,
    393 )
    395 return op.get_result()

File ~/anaconda3/envs/clustcr_103/lib/python3.10/site-packages/pandas/core/reshape/concat.py:445, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    442 self.verify_integrity = verify_integrity
    443 self.copy = copy
--> 445 objs, keys = self._clean_keys_and_objs(objs, keys)
    447 # figure out what our result ndim is going to be
    448 ndims = self._get_ndims(objs)

File ~/anaconda3/envs/clustcr_103/lib/python3.10/site-packages/pandas/core/reshape/concat.py:507, in _Concatenator._clean_keys_and_objs(self, objs, keys)
    504     objs_list = list(objs)
    506 if len(objs_list) == 0:
--> 507     raise ValueError("No objects to concatenate")
    509 if keys is None:
    510     objs_list = list(com.not_none(*objs_list))

ValueError: No objects to concatenate

Yury
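
The final frame of the traceback shows pd.concat being called with an empty list. The following is a minimal sketch of a defensive pattern for that situation, not the fix adopted in clusTCR; the helper name concat_or_empty is hypothetical and the column names are borrowed from the traceback.

```python
import pandas as pd


def concat_or_empty(frames, columns):
    """Concatenate DataFrames, returning an empty frame when there are none."""
    frames = [f for f in frames if f is not None and len(f) > 0]
    if not frames:
        # pd.concat([]) raises ValueError("No objects to concatenate").
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)


empty = concat_or_empty([], columns=["junction_aa", "cluster"])
print(empty.shape)
```

A guard like this only hides the symptom; why no pre-clusters were produced for a given V gene subset would still need to be investigated.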

Strange output

Dear author,

I have 10000 distinct CDR3 sequences, all of length 15. I ran the following code on them:

import pandas as pd

from clustcr import Clustering
from clustcr import datasets

cdr3 = pd.read_csv('15.txt').iloc[:,0]

clustering = Clustering(use_gpu=True)

output = clustering.fit(cdr3)

edges = output.export_network(filename='15_edgelist.txt')

output.write_to_csv('15_nodelist.txt')

However, the output file shows there are only 5 clusters. In each cluster, the difference between sequences is only one amino acid.

issue with metaclustering and airr data

Hello Sebastian and co.

Thanks a lot for designing this nice package to understand the nature of TCR repertoire and potential expansions. My goal with the current airr data I have is to compare differences in clusters between different subjects and I think your batch approach can be really useful for this objective.

I have been following some sections of your documentation, but unfortunately I am stuck on the demo for clustering a set of repertoires simultaneously. The main issue is with the metarepertoire function. Here is the error:

In [14]: training_sample_size = round(1000 * (total_cdr3s / 5000))
...: training_sample = metarepertoire(directory=datadir,
...: data_format='airr',
...: n_sequences=training_sample_size)
...:

TypeError Traceback (most recent call last)
Cell In[14], line 2
1 training_sample_size = round(1000 * (total_cdr3s / 5000))
----> 2 training_sample = metarepertoire(directory=datadir,
3 data_format='airr',
4 n_sequences=training_sample_size)

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/clustcr/input/datasets.py:65, in metarepertoire(directory, data_format, out_format, n_sequences)
63 meta = pd.concat([meta, parse_immuneaccess(file, out_format=out_format)])
64 elif data_format.lower()=='airr':
---> 65 meta = pd.concat([meta, parse_airr(file)])
66 elif data_format.lower()=='tcrex':
67 meta = pd.concat([meta, parse_tcrex(file)])

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
325 if len(args) > num_allow_args:
326 warnings.warn(
327 msg.format(arguments=_format_argument_list(allow_args)),
328 FutureWarning,
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/core/reshape/concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
146 @deprecate_nonkeyword_arguments(version=None, allowed_args=["objs"])
147 def concat(
148 objs: Iterable[NDFrame] | Mapping[HashableT, NDFrame],
(...)
157 copy: bool = True,
158 ) -> DataFrame | Series:
159 """
160 Concatenate pandas objects along a particular axis.
161
(...)
366 1 3 4
367 """
--> 368 op = _Concatenator(
369 objs,
370 axis=axis,
371 ignore_index=ignore_index,
372 join=join,
373 keys=keys,
374 levels=levels,
375 names=names,
376 verify_integrity=verify_integrity,
377 copy=copy,
378 sort=sort,
379 )
381 return op.get_result()

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/core/reshape/concat.py:458, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
453 if not isinstance(obj, (ABCSeries, ABCDataFrame)):
454 msg = (
455 f"cannot concatenate object of type '{type(obj)}'; "
456 "only Series and DataFrame objs are valid"
457 )
--> 458 raise TypeError(msg)
460 ndims.add(obj.ndim)
462 # get the sample
463 # want the highest ndim that we have, and must be non-empty
464 # unless all objs are empty

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

I think the main issue is that AIRR files are not loaded as a pandas DataFrame. See this code as an example:

data = read_cdr3('/mnt/c/Users/usuari/Desktop/mixcr-4.1.2/clustcr/output_TRB_SP_135.tsv', data_format='airr')

In [25]: data
Out[25]:
array(['CASSQGFGTQYF', 'CASSQSQYAEQFF', 'CASSRGAADTLYF', ...,
'SASSLGQNNSPLHF', 'SASSSYEQHF', 'RGHTGQLYF'], dtype=object)

Do you have any idea what the problem could be?

All my best,

Guillem Sanchez
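
The TypeError arises because pd.concat accepts only Series and DataFrame objects, while the parsed AIRR data arrives as a NumPy array. A minimal sketch of a workaround on the user side follows; the helper as_frame is hypothetical and the column name junction_aa is an assumption borrowed from other tracebacks on this page.

```python
import numpy as np
import pandas as pd


def as_frame(obj, column="junction_aa"):
    """Wrap arrays or lists in a one-column DataFrame; pass DataFrames through."""
    if isinstance(obj, pd.DataFrame):
        return obj
    if isinstance(obj, pd.Series):
        return obj.to_frame(name=column)
    return pd.DataFrame({column: np.asarray(obj, dtype=object)})


# Two parsed repertoires, as returned in array form (sequences from the issue).
parsed_a = np.array(["CASSQGFGTQYF", "CASSQSQYAEQFF"], dtype=object)
parsed_b = np.array(["CASSRGAADTLYF"], dtype=object)

# pd.concat([parsed_a, parsed_b]) would raise the TypeError shown above.
meta = pd.concat([as_frame(parsed_a), as_frame(parsed_b)], ignore_index=True)
print(meta["junction_aa"].tolist())
```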

Levenshtein distance

Hi,
I want to use clustcr with Levenshtein distance as the distance metric. According to the docs, it should look like this:

clustering = Clustering(distance_metric='levenshtein')

But when I run this and look inside the Clustering constructor, it doesn't accept the argument "distance_metric".
Is there another way to use Levenshtein distance?
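
For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one sequence into another. The sketch below is a plain dynamic-programming version written for illustration; it is not clusTCR's implementation and says nothing about how the library would plug such a metric in.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Two sequences from this page that differ by a single substitution.
print(levenshtein("CASSLGQGHYNEQFF", "CASSPGQGHYNEQFF"))  # → 1
```

Unlike Hamming distance, this metric is defined for sequences of different lengths, which is usually the reason for asking for it.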

Floats in slice indices during pgen computation

Hi!

I was testing the pgen calculation after the recent updates allowing the selection of chains, and bumped into a bug related to python2 integer division within OLGA. It seems that, while in some functions you have added div3int() to avoid the errors, some of the functions within OLGA (seemingly, the ones that are called for alpha-chain data) still contain "unprotected" python2 integer divisions, for example, in the compute_Pi_V_insVJ_given_J function, lines 1832, 1850, 1867 (I noticed these ones, but I'm not sure there aren't any more).
The issue leads to the following stack trace when attempting to compute features for the alpha chain (never mind occasional lines I inserted while debugging):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 features = output.compute_features(compute_pgen=True)
      2 features.head()

File ~/anaconda3/envs/clustcr-new/lib/python3.10/site-packages/clustcr/clustering/clustering.py:43, in ClusteringResult.compute_features(self, compute_pgen)
     42 def compute_features(self, compute_pgen=True):
---> 43     return FeatureGenerator(self.clusters_df).get_features(
     44         chain=self.chain,
     45         compute_pgen=compute_pgen
     46         )

File ~/anaconda3/envs/clustcr-new/lib/python3.10/site-packages/clustcr/analysis/features.py:153, in FeatureGenerator.get_features(self, chain, compute_pgen)
    151 pchem = self._calc_physchem()
    152 if compute_pgen:
--> 153     pgen = self._calc_pgen(chain=chain)
    154     return self._combine(aavar, pchem, pgen)
    155 else:

File ~/anaconda3/envs/clustcr-new/lib/python3.10/site-packages/clustcr/analysis/features.py:125, in FeatureGenerator._calc_pgen(self, chain)
    123 pgen_model = get_olga_model(chain=chain)
    124 # Compute Pgen
--> 125 p = [pgen_model.compute_aa_CDR3_pgen(seq) for seq in self.nodes["junction_aa"]]
    126 # Format results
    127 self.nodes["pgen"] = p

File ~/anaconda3/envs/clustcr-new/lib/python3.10/site-packages/clustcr/analysis/features.py:125, in <listcomp>(.0)
    123 pgen_model = get_olga_model(chain=chain)
    124 # Compute Pgen
--> 125 p = [pgen_model.compute_aa_CDR3_pgen(seq) for seq in self.nodes["junction_aa"]]
    126 # Format results
    127 self.nodes["pgen"] = p

File ~/anaconda3/envs/clustcr-new/lib/python3.10/site-packages/clustcr/modules/olga/generation_probability.py:274, in GenerationProbability.compute_aa_CDR3_pgen(self, CDR3_seq, V_usage_mask_in, J_usage_mask_in, print_warnings)
    272 print(CDR3_seq, file=log_file)
    273 log_file.close()
--> 274 return self.compute_CDR3_pgen(CDR3_seq, V_usage_mask, J_usage_mask)

File ~/anaconda3/envs/clustcr-new/lib/python3.10/site-packages/clustcr/modules/olga/generation_probability.py:1679, in GenerationProbabilityVJ.compute_CDR3_pgen(self, CDR3_seq, V_usage_mask, J_usage_mask)
   1676 Pi_V_given_J, max_V_align = self.compute_Pi_V_given_J(CDR3_seq, V_usage_mask, r_J_usage_mask)
   1678 #Include insertions (R and PinsVJ) to get the total contribution from the left (3') side conditioned on J gene. Return Pi_V_insVJ_given_J
-> 1679 Pi_V_insVJ_given_J = self.compute_Pi_V_insVJ_given_J(CDR3_seq, Pi_V_given_J, max_V_align)
   1681 pgen = 0
   1682 #zip Pi_V_insVJ_given_J and Pi_J together for each J gene to get total pgen

File ~/anaconda3/envs/clustcr-new/lib/python3.10/site-packages/clustcr/modules/olga/generation_probability.py:1856, in GenerationProbabilityVJ.compute_Pi_V_insVJ_given_J(self, CDR3_seq, Pi_V_given_J, max_V_align)
   1853 base_ins = 1
   1855 #Loop over all other insertions using base_nt_vec
-> 1856 for aa in CDR3_seq[init_pos/3 + 1: init_pos/3 + max_insertions/3]:
   1857     Pi_V_insVJ_given_J[j][:, init_pos+base_ins+1] += self.PinsVJ[base_ins + 1]*np.dot(self.Svj[aa], current_base_nt_vec)
   1858     Pi_V_insVJ_given_J[j][:, init_pos+base_ins+2] += self.PinsVJ[base_ins + 2]*np.dot(self.Dvj[aa], current_base_nt_vec)

TypeError: slice indices must be integers or None or have an __index__ method

Could you please fix this problem?

Thanks in advance and sorry for bothering you again :)

Best,
Yury
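
The failure can be reproduced outside OLGA. In Python 3 the / operator always returns a float, so a slice bound written for Python 2 integer division is no longer valid, while // keeps the integer result. The variable names below mirror the traceback; the sequence and numbers are made up for illustration.

```python
seq = "CAVRDSNYQLIW"
init_pos = 6
max_insertions = 9

# Python 2 semantics: 6/3 == 2. Python 3: 6/3 == 2.0, which is not a valid index.
try:
    seq[init_pos / 3 + 1: init_pos / 3 + max_insertions / 3]
except TypeError as err:
    print(f"TypeError: {err}")

# Floor division restores the integer result on both Python 2 and 3.
window = seq[init_pos // 3 + 1: init_pos // 3 + max_insertions // 3]
print(window)  # → RD
```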

Error installing using conda

Trying to install from the svalkiers channel using conda, but running into the error below. Any ideas?

conda install -c svalkiers clustcr

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment:
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Error in the clustcr.input.datasets

Runtime error

When running the code for the first time, using the example from the Evaluating part of the documentation, I got this error:

/home/dimchatz/anaconda3/envs/clustering/lib/python3.9/site-packages/clustcr/clustering/mcl.py:136: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nodelist = nodelist.append(nodes)
Total time to run '_twostep': 0.713s
Traceback (most recent call last):
  File "/home/dimchatz/Desktop/clusTCR_trial.py", line 5, in <module>
    epitopes = datasets.test_epitope()
AttributeError: module 'clustcr.input.datasets' has no attribute 'test_epitope'

I have updated the package and searched on Google, but nothing came up.

Error message using .summary()

This only happens when I use my own data; it works with the given dataset.
It produces the initial output csv but doesn't show the summary or features.

output.summary()
Traceback (most recent call last):
File "", line 1, in
File "/home/__/miniconda3/lib/python3.9/site-packages/clustcr/clustering/clustering.py", line 32, in summary
motifs = FeatureGenerator(self.clusters_df).clustermotif(cutoff=motif_cutoff)
File "/home//miniconda3/lib/python3.9/site-packages/clustcr/analysis/features.py", line 184, in clustermotif
profile = profile_matrix(sequences)
File "/home//miniconda3/lib/python3.9/site-packages/clustcr/analysis/tools.py", line 47, in profile_matrix
profile[i][pos] = np.round(psc.loc[i] / len(sequences),2)
KeyError: ''

features = output.compute_features(compute_pgen=True)
/home//miniconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:380: RuntimeWarning: Mean of empty slice.
avg = a.mean(axis)
/home/_/miniconda3/lib/python3.9/site-packages/numpy/core/methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
File "", line 1, in
File "/home//miniconda3/lib/python3.9/site-packages/clustcr/clustering/clustering.py", line 49, in compute_features
return FeatureGenerator(self.clusters_df).get_features(compute_pgen=compute_pgen)
File "/home//miniconda3/lib/python3.9/site-packages/clustcr/analysis/features.py", line 167, in get_features
pchem = self.calc_physchem()
File "/home/___/miniconda3/lib/python3.9/site-packages/clustcr/analysis/features.py", line 106, in calc_physchem
properties[prop].append(np.average([physchem_properties[prop][aa] for aa in seq]))
File "/home/_____/miniconda3/lib/python3.9/site-packages/clustcr/analysis/features.py", line 106, in
properties[prop].append(np.average([physchem_properties[prop][aa] for aa in seq]))
KeyError: ''

Appreciate any help!
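
One possible cause, not confirmed in this thread, is input that contains empty strings or characters outside the 20 standard amino acids, since both tracebacks fail on a per-character lookup and NumPy warns about an empty slice. The sketch below shows one way to screen sequences before clustering; the helper split_valid is hypothetical and the example sequences are made up.

```python
import pandas as pd

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def split_valid(sequences):
    """Split sequences into (valid, rejected) by the 20-letter amino acid alphabet."""
    seqs = pd.Series(sequences, dtype=object)
    is_valid = seqs.map(
        lambda s: isinstance(s, str) and len(s) > 0 and set(s) <= STANDARD_AA
    ).astype(bool)
    return seqs[is_valid].reset_index(drop=True), seqs[~is_valid].reset_index(drop=True)


valid, rejected = split_valid(
    ["CASSLGQGHYNEQFF", "CASS*GQGHYNEQFF", "", "CASS_EQFF", None]
)
print(valid.tolist())
print(rejected.tolist())
```

Inspecting the rejected entries should show quickly whether stop codons, frameshift markers or missing values are present in the data.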

Examples in 'Clustering' raises different errors

It seems that clustering with any method other than faiss halts with the error "0-dimensional array given. Array must be at least two-dimensional" when using Python 3.11.4 and the latest conda version of clustcr.

Also, the example in Clustering/Usage:

clustering = Clustering()
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3)

fails with the error "Wrong input. Please provide an iterable object containing CDR3 amino acid sequences." This is irrespective of the Python version.

It seems that fit() ignores the cdr3_col argument if include_vgene=False, as this works:

clustering = Clustering()
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3, include_vgene=True, cdr3_col="junction_aa", v_gene_col="v_call")

but this fails:

clustering = Clustering()
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3, include_vgene=False, cdr3_col="junction_aa")

Here's a complete example

#!/usr/bin/env python3

from clustcr import Clustering, datasets

# This works
clustering = Clustering(method='faiss')
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3['junction_aa'])
output = clustering.fit(cdr3, include_vgene=True, cdr3_col="junction_aa", v_gene_col="v_call")

data = datasets.vdjdb_paired()
cdr3, alpha = data['CDR3_beta'], data['CDR3_alpha']
output = clustering.fit(cdr3, alpha=alpha)

# This fails with 'Wrong input. Please provide an iterable object containing CDR3 amino acid sequences.'
clustering = Clustering()
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3)

# MCL and two-step methods both fail with '0-dimensional array given. Array must be at least two-dimensional'
mcl_clustering = Clustering(method='mcl')
output = mcl_clustering.fit(cdr3)

ts_clustering = Clustering(method='two-step')
output = ts_clustering.fit(cdr3)

It works under Python 3.10.12 and the latest conda version of clustcr (the clustering completes with all methods), though SciPy complains: "UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.25.1)".

I used the following commands to create the conda environments:

conda create -n clustcr python clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge
conda create -n clustcr3-10 python=3.10 clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge

predicting new TCR peptide to existing cluster

Hi, after building clusters with the Clustering/fit function, is there a convenient way to take a new incoming TCR peptide sequence and assign it to an existing cluster number? I know I could do this myself with a pattern search on the cluster motifs, but with thousands of patterns that would not be efficient. So I am wondering whether there is a better way, or an existing function that already does this.

Thanks!

Jason
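
Whether clusTCR offers a function for this is not answered on this page. As a rough do-it-yourself sketch, a new sequence could be assigned to the first cluster that holds a member within Hamming distance 1. The helper names are hypothetical, the distance threshold is an assumption, and the clusters_df layout (columns CDR3 and cluster) is taken from the output shown in another issue on this page.

```python
import pandas as pd


def hamming_within(a, b, max_dist=1):
    """True when a and b have equal length and at most max_dist mismatches."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_dist


def assign_cluster(sequence, clusters_df, seq_col="CDR3", cluster_col="cluster"):
    """Return the id of the first cluster with a near-identical member, else None."""
    for member, cluster_id in zip(clusters_df[seq_col], clusters_df[cluster_col]):
        if hamming_within(sequence, member):
            return int(cluster_id)
    return None


clusters_df = pd.DataFrame({
    "CDR3": ["CASSLGQGHYNEQFF", "CASSPGQGHYNEQFF", "CASSDSGTDTQYF", "CASSLSGTDTQYF"],
    "cluster": [0, 0, 200, 200],
})
print(assign_cluster("CASSTGQGHYNEQFF", clusters_df))
print(assign_cluster("CAAAAAAAAAF", clusters_df))
```

This is a linear scan over all clustered sequences; for large outputs, grouping members by length first would avoid most of the comparisons.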

Urgent! Error using clustering.fit

I got the following error when using the clustering function

I installed it using the following command and set up the environment:
#conda install clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge
#conda env create -f env.yml

Could you please take a look at it?
Many thanks!

Retrieve the centroids

Hello! Thanks for this amazing library. Is there a way to retrieve only the centroid of each cluster? Are they perhaps the first sequence in each cluster (i.e. row 0 from cluster 0)? Thanks a lot.

Clustering method for TCRdist

Hi, I'm wondering which clustering method you used for TCRdist, since it only gives pairwise similarity scores. I noticed many options in the tcrdist module in this repository. Which one did you use for evaluating TCRdist?

Best,
Yuepeng

Unequal-length sequences being clustered together

Dear ClusTCR developer,

I am using ClusTCR's MCL method to cluster about 3k CDRH3 sequences, and found that a few clusters contain sequences of different lengths. As I understand it, the similarity metric, Hamming distance, is only defined for sequences of the same length, which should prevent sequences of unequal length from ending up in one cluster. Am I missing something here? Thanks.

import pandas as pd
from clustcr import Clustering

df_3k = pd.read_csv("seq_3k.txt")
clustering = Clustering(n_cpus=3, method="mcl")
output = clustering.fit(df_3k["CDRH3_sequence"])
df_seq = output.clusters_df

# Collect the clusters whose member sequences do not all share one length
abnormal = []
for c_idx in set(df_seq.cluster):
    lengths = set(df_seq[df_seq.cluster == c_idx]["CDR3"].map(len))
    if len(lengths) != 1:
        abnormal.append(c_idx)
print("abnormal cluster index:", abnormal)

Here is the sequence file:
seq_3k.txt
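
To make the premise concrete, the sketch below defines Hamming distance so that it refuses unequal lengths, and shows a compact pandas check for mixed-length clusters. Both helpers are hypothetical and the toy table is made up; the column names CDR3 and cluster follow the clusters_df used in the snippet above.

```python
import pandas as pd


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; undefined for unequal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def mixed_length_clusters(clusters_df, seq_col="CDR3", cluster_col="cluster"):
    """Return ids of clusters whose members do not all share one length."""
    lengths = clusters_df[seq_col].str.len()
    n_lengths = lengths.groupby(clusters_df[cluster_col]).nunique()
    return n_lengths[n_lengths > 1].index.tolist()


toy = pd.DataFrame({
    "CDR3": ["CASSDSGTDTQYF", "CASSLSGTDTQYF", "CASSLQGSNQPQHF", "CASSLGSNQPQHF"],
    "cluster": [0, 0, 1, 1],
})
print(hamming("CASSDSGTDTQYF", "CASSLSGTDTQYF"))  # → 1
print(mixed_length_clusters(toy))  # → [1]
```

If a clustering output contains mixed-length clusters, the method that produced them cannot be relying on this strict definition alone.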

Sequences loss after clustering?

>>> import clustcr as ct
>>> cdr3 = ct.datasets.test_cdr3()
>>> cdr3.size
2851
>>> cdr3.unique().size
2851
>>> clustering = ct.Clustering()
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3  cluster
0     CASSLGQGHYNEQFF        0
1     CASSPGQGHYNEQFF        0
2     CASSSGTGPNEKLFF        1
3     CASTSGTGPNEKLFF        1
4     CASSPGTAPNEKLFF        1
..                ...      ...
637    CASSLQGSNQPQHF      199
638     CASSDSGTDTQYF      200
639     CASSLSGTDTQYF      200
640  CSARAGGGEAKNIQYF      201
641  CSARASGGEAKNIQYF      201

[642 rows x 2 columns]
>>> sum(len(seqs) for seqs in output.cluster_contents())
642

Where are the remaining 2851 - 642 = 2209 sequences?

>>> clustering = ct.Clustering(method="mcl")
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3  cluster
0     CASSLGQGHYNEQFF        0
1     CASSPGQGHYNEQFF        0
2     CASSSGTGPNEKLFF        1
3     CASTSGTGPNEKLFF        1
4     CASSPGTAPNEKLFF        1
..                ...      ...
637    CASSLQGSNQPQHF      199
638     CASSDSGTDTQYF      200
639     CASSLSGTDTQYF      200
640  CSARAGGGEAKNIQYF      201
641  CSARASGGEAKNIQYF      201

[642 rows x 2 columns]

>>> clustering = ct.Clustering(method="two-step")
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3  cluster
0     CASSLGQGHYNEQFF        0
1     CASSPGQGHYNEQFF        0
2     CASSSGTGPNEKLFF        1
3     CASTSGTGPNEKLFF        1
4     CASSPGTAPNEKLFF        1
..                ...      ...
637    CASSLQGSNQPQHF      199
638     CASSDSGTDTQYF      200
639     CASSLSGTDTQYF      200
640  CSARAGGGEAKNIQYF      201
641  CSARASGGEAKNIQYF      201

[642 rows x 2 columns]

It is also weird that different methods resulted in a clusters_df of the same size.

>>> import importlib.metadata
>>> importlib.metadata.version("clustcr")
'0+untagged.115.gba1ad3c'
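
To see which input sequences are absent from the output, the input can be compared against clusters_df directly. The sketch below uses made-up toy data and a hypothetical helper named unclustered. A plausible explanation, not confirmed in this thread, is that sequences without a sufficiently similar neighbour are simply not reported as part of any cluster.

```python
import pandas as pd


def unclustered(input_sequences, clusters_df, seq_col="CDR3"):
    """Return the input sequences that do not appear in the clustering output."""
    seqs = pd.Series(input_sequences, dtype=object).drop_duplicates()
    return seqs[~seqs.isin(clusters_df[seq_col])].reset_index(drop=True)


cdr3 = ["CASSLGQGHYNEQFF", "CASSPGQGHYNEQFF", "CASSIRSSYEQYF", "CSARAGGGEAKNIQYF"]
clusters_df = pd.DataFrame({
    "CDR3": ["CASSLGQGHYNEQFF", "CASSPGQGHYNEQFF"],
    "cluster": [0, 0],
})
missing = unclustered(cdr3, clusters_df)
print(len(cdr3) - len(clusters_df), missing.tolist())
```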

Documentation link in repo info (next to summary)

pgen calculation is hard-coded to use human TRB model

Hi!

We recently noticed a discrepancy between the pgen scores calculated by OLGA and those produced by clusTCR for the alpha-chain of human TCR. While investigating this issue, I noticed that the calculation of pgen in clusTCR is hard-coded to use human beta-chain model:

Method _calc_pgen() in features.py, lines 128-131:

params_file_name = path.join(DIR,'modules/olga/default_models/human_T_beta/model_params.txt')
marginals_file_name = path.join(DIR,'modules/olga/default_models/human_T_beta/model_marginals.txt')
V_anchor_pos_file = path.join(DIR,'modules/olga/default_models/human_T_beta/V_gene_CDR3_anchors.csv')
J_anchor_pos_file = path.join(DIR,'modules/olga/default_models/human_T_beta/J_gene_CDR3_anchors.csv')

Is there any motivation for this limitation? And, if not, could you please add arguments to the compute_features method that let the user choose the model used for the pgen calculation?
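
A minimal sketch of the requested behaviour, assuming the alpha-chain model sits in a sibling directory named human_T_alpha with the same four file names (only human_T_beta is confirmed by the snippet above). The function olga_model_files and the chain codes "A" and "B" are hypothetical illustrations, not the clusTCR API.

```python
from os import path

# Assumed directory names under modules/olga/default_models; only
# human_T_beta is confirmed by the snippet quoted in the issue.
MODEL_DIRS = {
    "B": "human_T_beta",
    "A": "human_T_alpha",
}


def olga_model_files(base_dir, chain="B"):
    """Return the four OLGA model file paths for the requested chain."""
    if chain not in MODEL_DIRS:
        raise ValueError(f"unsupported chain {chain!r}; expected one of {sorted(MODEL_DIRS)}")
    model_dir = path.join(base_dir, "modules/olga/default_models", MODEL_DIRS[chain])
    return {
        "params": path.join(model_dir, "model_params.txt"),
        "marginals": path.join(model_dir, "model_marginals.txt"),
        "v_anchors": path.join(model_dir, "V_gene_CDR3_anchors.csv"),
        "j_anchors": path.join(model_dir, "J_gene_CDR3_anchors.csv"),
    }


for name, file_path in olga_model_files("clustcr", chain="A").items():
    print(name, file_path)
```

Keeping the chain-to-directory mapping in one place would make it straightforward to add further models later without touching the pgen code itself.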