
compairr's People

Contributors

torognes


compairr's Issues

Overlap metrics

Hi,

Does CompAIRR allow the calculation of overlap metrics such as Jaccard or Morisita-Horn index or should this be done manually (afterwards)?

Thank you and kind regards,
Sebastiaan Valkiers

Max 65535 repertoires

If more than 65535 repertoires are used, the counting of matches in each repertoire will be wrong. Probably due to a 16-bit integer used somewhere.
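The suspected wrap-around can be illustrated in a few lines. This is only a sketch, assuming the repertoire index is stored in an unsigned 16-bit integer; it is not taken from the CompAIRR source, and `to_uint16` is a made-up helper.

```python
import ctypes

def to_uint16(value: int) -> int:
    """Store value in an unsigned 16-bit integer and read it back."""
    # ctypes does no overflow checking, so the value wraps modulo 65536.
    return ctypes.c_uint16(value).value

# Indices up to 65535 survive; anything above silently wraps around.
for repertoire_index in (65534, 65535, 65536, 65537):
    print(repertoire_index, "->", to_uint16(repertoire_index))
```

If this is the cause, matches for repertoire 65536 would be counted under repertoire 0, and so on.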

compairr MH index strange results

Hello! I have used CompAIRR with the MH index, but the results seem very strange. For example, the MH index from comparing a sample with itself does not equal 1 and also varies between samples, e.g. sample 1 vs sample 1 = 138.04, sample 2 vs sample 2 = 76.74. I went through the publication and the GitHub repository but did not manage to find information about MH index values. This is the command I run: compairr -d 0 -l $f1$f2.log -m -o $f1$f2.out -s MH -t 7 -g -u $f1 $f2. Is there a reason for MH index values to be higher than 1, and why do the values change so much between samples? I can provide a matrix with all the values if needed.

Thanks:)
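For reference, under the textbook definitions the Morisita-Horn index of a repertoire compared with itself is exactly 1, and the Jaccard index always lies between 0 and 1. The sketch below computes both from plain clonotype-count dictionaries. It assumes exact matching (d=0) on deduplicated input and is not CompAIRR's own implementation; the function names and sample data are made up for illustration.

```python
def morisita_horn(x: dict, y: dict) -> float:
    """Morisita-Horn index from two {clonotype: count} dictionaries."""
    total_x = sum(x.values())
    total_y = sum(y.values())
    # Sum of count products over clonotypes present in both repertoires.
    shared = sum(x[k] * y[k] for k in x.keys() & y.keys())
    # Simpson-type dominance terms for each repertoire.
    d_x = sum(c * c for c in x.values()) / total_x ** 2
    d_y = sum(c * c for c in y.values()) / total_y ** 2
    return 2 * shared / ((d_x + d_y) * total_x * total_y)

def jaccard(x: dict, y: dict) -> float:
    """Jaccard index on clonotype presence/absence."""
    return len(x.keys() & y.keys()) / len(x.keys() | y.keys())

rep_a = {"CASSL": 3, "CASSP": 1}
rep_b = {"CASSL": 1, "CASSQ": 2}
print(morisita_horn(rep_a, rep_a), morisita_horn(rep_a, rep_b))
print(jaccard(rep_a, rep_a), jaccard(rep_a, rep_b))
```

Values far above 1 for a self-comparison therefore suggest that the input violates one of these assumptions, for instance through duplicated entries.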

Allow d>3

To make the tool more general, we could add the possibility of allowing any value for d. An alternative algorithm for identifying similar sequences should probably be used when d>2.

Distance 2 with indels

Due to the way sequence variants are generated, in some cases variants at distance 2 with indels may be generated multiple times and the resulting values may be inaccurate.
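A small sketch of how this kind of double counting can arise. It assumes a naive generator that applies single-residue deletions twice; CompAIRR's actual variant generation may differ, and the helper names are hypothetical.

```python
from collections import Counter

def deletions(seq: str) -> list[str]:
    """All sequences obtained by deleting one residue, duplicates kept."""
    return [seq[:i] + seq[i + 1:] for i in range(len(seq))]

def two_deletions(seq: str) -> list[str]:
    """Apply a single deletion twice, again keeping duplicates."""
    return [v2 for v1 in deletions(seq) for v2 in deletions(v1)]

variants = two_deletions("CASS")
counts = Counter(variants)
print(len(variants), "generated,", len(counts), "distinct")
print(counts)
```

Every distinct variant is produced more than once here, so any match counted per generated variant would be inflated unless the variants are deduplicated first.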

Detect duplicates in the input

If there are exact duplicates in the input (same repertoire id, same sequence, same V-gene, same J-gene) when d=0, the resulting MH-index or Jaccard index would be bogus. This should be detected and the program should terminate, telling the user to deduplicate / dereplicate their data first.

It should be possible to perform this check quickly by looking up the hashes.
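One possible shape for such a check, sketched with a hash set. The column names `repertoire_id`, `junction_aa`, `v_call` and `j_call` follow AIRR naming and are an assumption about the input; `find_duplicate` is a hypothetical helper, not an existing CompAIRR function.

```python
def find_duplicate(records):
    """Return the first record whose key was already seen, or None."""
    seen = set()
    for rec in records:
        key = (rec["repertoire_id"], rec["junction_aa"],
               rec["v_call"], rec["j_call"])
        if key in seen:  # set lookup hashes the tuple, O(1) on average
            return rec
        seen.add(key)
    return None

records = [
    {"repertoire_id": "A", "junction_aa": "CASSL", "v_call": "TRBV6-5", "j_call": "TRBJ2-1"},
    {"repertoire_id": "B", "junction_aa": "CASSL", "v_call": "TRBV6-5", "j_call": "TRBJ2-1"},
    {"repertoire_id": "A", "junction_aa": "CASSL", "v_call": "TRBV6-5", "j_call": "TRBJ2-1"},
]
dup = find_duplicate(records)
if dup is not None:
    print("Duplicate entry found; please deduplicate the input first:", dup)
```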

Segmentation fault when clustering sequences

CompAIRR sometimes crashes with a segfault when clustering nucleotide sequences. A user has reported this issue.

The error has been reproduced and a potential reason has been identified.

Handle non-standard sequence symbols better

CompAIRR will currently abort if any character other than the 20 standard amino acid symbols (ACDEFGHIKLMNPQRSTVWY in upper or lower case) appears in the junction_aa column.

In some datasets, the symbols _ and * appear in the amino acid sequence to represent a frame shift or a translation stop.

We could improve CompAIRR by either treating these symbols (_*) like any other amino acid symbol, or by ignoring the sequences where these symbols appear. This kind of behaviour could be selected with additional options, but I think the default behaviour should be to abort with an informative error message.

This could perhaps also be extended to other non-standard amino acid symbols, like BJOUXZ, and even to non-standard nucleotide symbols; only ACGTU are allowed now.
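The three behaviours discussed above could look roughly like this. It is a sketch only: the policy names "abort", "accept" and "ignore" and the function `check_sequence` are invented for illustration and do not correspond to existing CompAIRR options.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def check_sequence(seq, policy="abort", extra_symbols="_*"):
    """Return seq if usable, None if it should be skipped.

    Raises ValueError when the sequence must cause an abort.
    """
    unexpected = {c for c in seq.upper() if c not in STANDARD_AA}
    if not unexpected:
        return seq
    if policy == "accept" and unexpected <= set(extra_symbols):
        return seq   # treat _ and * like any other symbol
    if policy == "ignore":
        return None  # caller drops this sequence
    raise ValueError(
        f"non-standard symbol(s) {sorted(unexpected)} in sequence {seq!r}"
    )

print(check_sequence("CAS*L", policy="accept"))
print(check_sequence("CAS_L", policy="ignore"))
```

With the default policy, the same sequences raise an error that names the offending symbols.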

AIRR standards compliance

Make CompAIRR compliant with the AIRR standard for software tools:

Needed:

  • Open source code in public repository
  • AIRR standard file formats etc
  • Include example data and automated check
  • Provide information about run parameters as part of the output.
  • Provide a container build file
  • Provide user support, clearly stating which level of support users can expect, and how and from whom to obtain it.

Further steps:

  • Apply for ratification
  • Obtain certificate of compliance
  • Add badge

Is sequence matching exact?

Hi,

I've been using CompAIRR for calculating the overlap between different repertoires. I noticed that, when computing the overlap between one sample and itself, CompAIRR produces a different (greater) number than the number of sequences present in that repertoire. For example, if repertoire X contains 200,000 clonotypes, the CompAIRR n_x_n overlap matrix returns a number >200,000 for the overlap between X and X. My question is therefore: is the CompAIRR result an approximation of the overlap between two repertoires, or is it exact and am I misinterpreting the results?

For reference I used the following parameters:

compairr -m -f -o output.tsv input.tsv

Thank you.

Kind regards,
Sebastiaan Valkiers

Increase resolution of timing of operations

CompAIRR shows the timing of some of the operations that take a significant amount of time. However, it is just shown with a resolution of seconds. The resolution should be increased to enable more precise measurements.
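To illustrate the difference in resolution, here is a generic sketch using a high-resolution monotonic clock. It is written in Python for brevity and says nothing about how CompAIRR measures time internally; `timed` is a hypothetical helper.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds as a float)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

result, elapsed = timed(sum, range(1_000_000))
print(f"whole seconds: {int(elapsed)} s")            # short steps show as 0
print(f"milliseconds:  {elapsed * 1000:.3f} ms")
```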

New feature: Copy additional columns from input to output files

Make a CompAIRR parameter called something like columns-to-keep, where the names of additional columns of interest that are not transferred by default could be specified. So if the user specifies --columns-to-keep epitope, the pairs file has additional columns epitope_1 and epitope_2. And if the epitope column is only present in one of the input files, the fields in the pairs file could just be empty rather than throwing an error. This feature could be of general use for people who want to further analyse the sequences in the pairs file, it's essentially just transferring additional sequence metadata so the user does not have to map this data back to the input files.
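A sketch of the proposed behaviour, assuming each input record is available as a dictionary. The pairs-file layout shown here is simplified and `pair_row` is a hypothetical helper; only the `_1`/`_2` suffixing and the empty-field fallback come from the proposal above.

```python
def pair_row(rec1, rec2, columns_to_keep=()):
    """Build one pairs-file row, copying extra columns from both inputs."""
    row = {
        "junction_aa_1": rec1["junction_aa"],
        "junction_aa_2": rec2["junction_aa"],
    }
    for col in columns_to_keep:
        # A column missing from one input yields an empty field, not an error.
        row[f"{col}_1"] = rec1.get(col, "")
        row[f"{col}_2"] = rec2.get(col, "")
    return row

rec1 = {"junction_aa": "CASSL", "epitope": "GILGFVFTL"}
rec2 = {"junction_aa": "CASSL"}
print(pair_row(rec1, rec2, columns_to_keep=["epitope"]))
```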

Default repertoire ID if column missing

If no repertoire_id column is provided, assume all sequences in the file belong to the same repertoire (could default to IDs 1 and 2 for the first and second file respectively).
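A minimal sketch of this fallback for a tab-separated input. The function `read_repertoire_file` is hypothetical and the example data is made up.

```python
import csv
import io

def read_repertoire_file(handle, default_id):
    """Read TSV rows, filling in repertoire_id when the column is absent."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row.setdefault("repertoire_id", default_id)
        rows.append(row)
    return rows

first_file = io.StringIO("junction_aa\tv_call\nCASSL\tTRBV6-5\n")
print(read_repertoire_file(first_file, default_id="1"))
```

An existing repertoire_id column is left untouched; the default only applies when the column is missing.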

Self-comparison

Add possibility of comparing a repertoire set with itself, without having to read it twice.

Allow partial V or J gene match

Partial matches with V or J gene names could optionally be allowed, perhaps based on prefix matches in the gene name. This needs further investigation.
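One way prefix matching could work, sketched under the assumption that gene names follow the IMGT pattern family-gene*allele (for example TRBV6-5*01). The `level` parameter and both function names are invented for illustration.

```python
def gene_prefix(name: str, level: str = "gene") -> str:
    """Truncate a gene name to the requested level of detail."""
    if level in ("gene", "family"):
        name = name.split("*")[0]   # drop the allele, e.g. *01
    if level == "family":
        name = name.split("-")[0]   # drop the gene number, e.g. -5
    return name

def genes_match(a: str, b: str, level: str = "allele") -> bool:
    """Compare two gene names at allele, gene or family level."""
    return gene_prefix(a, level) == gene_prefix(b, level)

print(genes_match("TRBV6-5*01", "TRBV6-5*02", level="gene"))
print(genes_match("TRBV6-5*01", "TRBV6-2*01", level="family"))
```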

Unable to upgrade version

Hi, I downloaded the latest version of compairr (1.7.0) and compiled it according to the instructions in the readme. However, when I check the version after installation (using compairr -v) it shows version 1.6.1.

Also, uninstalling via:

make uninstall

or

sudo make uninstall

doesn't work, and returns:

make: *** No rule to make target 'install'.  Stop.

Any idea how to solve this issue?

A flag to switch to cdr3 instead of junction

Add a flag to switch to cdr3 instead of junction. In that case, the sequence columns would be cdr3 (nt) and cdr3_aa, both for input and output. Nothing would change in how the sequences are treated further. AIRR files can contain either junction or cdr3; the difference is that junction includes one extra amino acid at each end (the conserved residues flanking the CDR3).
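The relationship between the two fields can be sketched as follows, assuming complete, in-frame junctions where the flanking residues are present. The helper names are hypothetical.

```python
def junction_aa_to_cdr3_aa(junction_aa: str) -> str:
    """Drop the one flanking amino acid at each end of the junction."""
    return junction_aa[1:-1]

def junction_to_cdr3(junction: str) -> str:
    """Same for nucleotides: one codon (3 nt) is dropped at each end."""
    return junction[3:-3]

print(junction_aa_to_cdr3_aa("CASSLGQETQYF"))
print(junction_to_cdr3("TGTGCCAGCTTC"))
```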

Output results in alternative format

The output could optionally be presented in an alternative three column format with sample (repertoire) names in the two first columns and the overlap value in the third column.
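A sketch of the conversion from a matrix to the three-column layout described above. The function name and the example values are made up.

```python
def matrix_to_long(row_names, col_names, matrix):
    """Flatten an overlap matrix into (row name, column name, value) rows."""
    return [
        (r, c, matrix[i][j])
        for i, r in enumerate(row_names)
        for j, c in enumerate(col_names)
    ]

rows = matrix_to_long(["A", "B"], ["X", "Y"], [[5, 0], [2, 7]])
for rep1, rep2, value in rows:
    print(f"{rep1}\t{rep2}\t{value}")
```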
