Comments (3)

pwwang commented on August 20, 2024

With the latest version, installed via pip install git+https://github.com/svalkiers/clusTCR.git@b618118:

>>> import clustcr as ct
>>> cdr3 = ct.datasets.test_cdr3()
>>> clustering = ct.Clustering()
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3 cluster
0       CASTPQGAYEQYF       0
1       CASTPTGAYEQYF       0
2        CASSLGQIEQYF       1
3        CASSLGQKEQYF       1
4        CASSLGQGEQYF       1
..                ...     ...
789      CASSEGSQEVFF     237
790  CSARAGGGEAKNIQYF     238
791  CSARASGGEAKNIQYF     238
792     CASSDSGTDTQYF     239
793     CASSLSGTDTQYF     239

[794 rows x 2 columns]

>>> clustering = ct.Clustering(method="mcl")
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3  cluster
0       CASTPQGAYEQYF        0
1       CASTPTGAYEQYF        0
2        CASSLGQIEQYF        1
3        CASSLGQKEQYF        1
4        CASSLGQGEQYF        1
..                ...      ...
789      CASSEGSQEVFF      237
790  CSARAGGGEAKNIQYF      238
791  CSARASGGEAKNIQYF      238
792     CASSDSGTDTQYF      239
793     CASSLSGTDTQYF      239

[794 rows x 2 columns]

>>> clustering.method
'MCL'
>>> clustering = ct.Clustering(method="faiss")
>>> clustering.method
'FAISS'
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                      CDR3  cluster
0     CASSYLPGQGDHYSNQPQHF        0
1      CASSFEAGQGFFSNQPQHF        0
2      CASSFEPGQGFYSNQPQHF        0
3     CASSYEPGQVSHYSNQPQHF        0
4            CASSFGVEDEQYF        0
...                    ...      ...
3387        CATSDVNGAYEQYF        0
3388        CSARGGSVFYEQYF        0
3389        CSARGGERFYEQYF        0
3390      CASSASTSDYSYEQYF        0
3391      CASSDLTGTAYNEQFF        0

[3392 rows x 2 columns]

The faiss method even assigned every sequence to cluster 0.

>>> import importlib.metadata
>>> importlib.metadata.version("clustcr")
'0+untagged.267.gb618118'


svalkiers commented on August 20, 2024

Hi, thanks for using ClusTCR. I'll try to give a comprehensive answer to each of your questions:

Where are the rest 2851 - 641 = 2210 sequences?

To answer your first question: ClusTCR takes all sequences into account, but not every sequence belongs to a cluster. This is inherent to the clustering procedure. In its second pass, ClusTCR builds a network in which an edge is drawn between two sequences only if their Hamming distance (the number of mismatched amino acids) is at most 1. Sequences that have no such connection are not part of the network and are therefore considered outliers. As such, they are not reported in the clustering results.
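To make the outlier rule concrete, here is a minimal pure-Python sketch of the edge criterion described above. It is not ClusTCR's implementation: the helper names `hamming` and `connected_sequences` are hypothetical, sequences of different length are simply treated as unconnected (an assumption), and no MCL step is run. It only shows which sequences end up without any edge and would therefore be missing from the reported clusters.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatched positions; return None when lengths differ
    (assumption: such pairs are never connected)."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def connected_sequences(seqs):
    """Return the edges (pairs at Hamming distance <= 1) and the set of
    sequences that take part in at least one edge."""
    edges = []
    for a, b in combinations(seqs, 2):
        d = hamming(a, b)
        if d is not None and d <= 1:
            edges.append((a, b))
    connected = {s for edge in edges for s in edge}
    return edges, connected


# A handful of sequences taken from the output shown earlier in this thread.
seqs = [
    "CASTPQGAYEQYF",
    "CASTPTGAYEQYF",
    "CASSLGQIEQYF",
    "CASSLGQKEQYF",
    "CASSLGQGEQYF",
    "CASSEGSQEVFF",  # has no neighbour within this small toy set
]

edges, connected = connected_sequences(seqs)
outliers = [s for s in seqs if s not in connected]
print(outliers)
```

In this toy set the last sequence has no partner within one mismatch, so it is the only outlier. In the full test data it does have a neighbour, which is why it appears in the results above.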

Also weird that different methods resulted in the same size of clusters_df.

You see this result because, with ClusTCR's default parameters, the two-step approach and the MCL method give identical results for small data sets. The first pass, i.e. the faiss-based clustering, groups the sequences into large 'superclusters'. You can set the size of these superclusters with the faiss_cluster_size parameter of Clustering(). By default, this value is 5000. Since the test data contains fewer than 5000 sequences, they are all grouped into the same supercluster, and MCL is then applied to that single supercluster. Consequently, the two-step method reports the same results as the MCL method whenever your data set is smaller than faiss_cluster_size.
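The arithmetic behind this can be sketched as follows. The thread does not state how ClusTCR derives the number of superclusters from faiss_cluster_size, so the rule below (roughly the number of sequences divided by faiss_cluster_size, with a minimum of one) is an assumption made purely for illustration, and `assumed_supercluster_count` is a hypothetical helper, not part of the library.

```python
import math


def assumed_supercluster_count(n_sequences, faiss_cluster_size=5000):
    """Hypothetical estimate of how many superclusters the first pass
    would form; the real rule inside ClusTCR may differ."""
    return max(1, math.ceil(n_sequences / faiss_cluster_size))


# 3392 is the row count of the faiss output shown earlier in this thread.
n_sequences = 3392

for size in (5000, 500):
    count = assumed_supercluster_count(n_sequences, size)
    print(f"faiss_cluster_size={size}: about {count} supercluster(s)")
```

Under this assumption, the default of 5000 leaves everything in one supercluster, while a smaller value such as ct.Clustering(faiss_cluster_size=500) would be expected to split the data in the first pass, so the two-step and MCL results could then differ.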

faiss method even resulted in all seqs being clustered to 0.

See previous comment.

I hope this was helpful. If you have any further questions, please don't hesitate to ask; I will gladly answer them.

All the best,
Sebastiaan


pwwang commented on August 20, 2024

WOW! Super clear explanation! Appreciate it! 👍
