Giter Club home page Giter Club logo

Comments (9)

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 06, 2012 10:24:49

By the way, if you need my simulated dataset, drop me a note, and I will post it here.

from cutadapt.

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 06, 2012 10:58:45

I also noticed that the histogram of adapter lengths is still not accurate.

In my simulation, for command: CutAdapt -b adapter_1 -b adapter_2 contamined.reads
===Adapter 1===
Histogram of adapter lengths (5')
length count
9 412
10 468
11 421
12 396
13 410
14 379

Histogram of adapter lengths (3' or within)
length count
9 409
10 472
11 416
12 392
13 412
14 403

=== Adapter 2 ===
Histogram of adapter lengths (5')
length count
9 295
10 345
11 288
12 323
13 276
14 304

Histogram of adapter lengths (3' or within)
length count
9 301
10 351
11 335
12 299
13 299
14 302

Thus, the sum of reads trimmed at 9-14 bp are:
length count
9 1417
10 1636
11 1460
12 1410
13 1397
14 1388

However, when I checked the trimmed reads, I found the number of reads trimmed at the specific length is not the same as reported by the histogram:
length count
9 1505
10 1554
11 1453
12 1412
13 1398
14 1386

Is there something I should pay attention to?

Best,
Ying

from cutadapt.

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 09, 2012 06:36:29

Iā€™d like to reproduce your experiment. Could you send me the dataset via mail or attach it to this issue?

from cutadapt.

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 09, 2012 11:55:13

Hi, Marcel,

Since my read file is larger than 10M, I emailed it to you. Please play around with it.

Thank you.

Best,
Ying

from cutadapt.

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 09, 2012 15:18:41

Ok, thanks for the mail. I think I've figured it out. I've only looked at the comparison of -b vs -a, but the problem is probably the same for -b vs -g.

When you use the -a option to specify a 3' adapter, you're essentially saying that the adapter must occur after the sequence you are interested in. The adapter itself and everything following it is trimmed. This is already in the README and is probably clear. But, the problem is that in your data, you also have reads with simulated full-length 5' adapters, which are in the beginning of the read. If you use -a and such a read is found, the entire read is trimmed (leaving a read of length zero). In your example above, every time this happens, it will also increase the counter for reads that were found at full length (25 or 33 in your case).

I suggest that you update to cutadapt 1.1, in which the histogram output has changed slightly and should be a little less confusing (although it still is not perfect, I admit). For cutadapt 1.1, I get output like this when using your command line with the two '-a' adapters:

=== Adapter 1 ===

Adapter 'AATGATACGGCGACCACCGAGATCT', length 25, was trimmed 30712 times.

Lengths of removed sequences
length count expected
[...]
17 394 0.0
18 415 0.0
19 407 0.0
20 372 0.0
21 426 0.0
22 369 0.0
23 408 0.0
24 400 0.0
25 394 0.0

=50 1164 0.0

Note how it now says "lengths of removed sequence" and not "Histogram of adapter lengths", which is different from cutadapt 1.0 and more what one expects. You'll see immediately that the 394 adapters of length 25 are correctly found, but that also in 1164 cases, the adapter was found in the beginning of the read, trimming the entire read of 50 bp. The 1164 is a bit too high, but this is due to the error rate. If you reduce it to zero, the result should be better.

(Note that the histogram is cut off at a certain point and for all lengths after this point, only the cumulative counts are shown. While playing around with your data, I realized that this is not helpful and I have now changed that behavior in the Git version of cutadapt.)

I hope this fully explains the discrepancies you've been seing.

from cutadapt.

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 10, 2012 07:39:41

Thank you, Marcel, for this clear explanation.

It also showed me why I saw the removal of an entire read in my result.

Just a side question, do you have any recommendation on using CutAdapt to remove adapter sequences?

Obviously, based on my experiment, I will stick to -b option. If so, what is the point to use -a/-g option?

In some sense, I think the -g/-a option is mis-leading, because it still searches for the presence of adapter sequences "anywhere" in the read. Previously, when I used -a option, I though it would search the adapter sequences from the 3' end and stop when reaching the full length of the adapter, which means, -a option would only match the expression pattern for the 25 or 33 bp at the 3' end. I don't expect it to extend the pattern match to 5' end.

On the other hand, why the -a/-g option will treat the full length adapter sequences differently than the part sequences? I mean, I don't think there is significant difference between the 23-, 24-, and 25-bp of contamination. However, only the 25-bp "contamination" results in over-trimming at 5' end, which means reads like this "AATGATACGGCGACCACCGAGATCTNNNNNNNNNNNNNNNNNNNNNN" got trimmed, but reads like this "ATGATACGGCGACCACCGAGATCTNNNNNNNNNNNNNNNNNNNNNNN" was not trimmed. So how do you define the pattern match for -a/-g option? Why searching for the 24-bp contamination will not extend to 5' end?

from cutadapt.

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 10, 2012 08:39:54

do you have any recommendation on using CutAdapt to remove adapter sequences?

It depends on the real data that you have and what you expect could happen. If an adapter can occur only in the 5' end, then using -a makes sense: If it's in the beginning of the read, then trimming the read to length zero is the right thing to do. That could happen when the 5' and 3' adapters are attached to each other without anything between them. Using -b in that setting would be problematic as the sequence after the adapter could just be noise or another adapter.

In some sense, I think the -g/-a option is mis-leading, because it still searches for the presence of adapter sequences "anywhere" in the read.

I had to name the options somehow -- it's just a mnemonic. The authoritative description is in the README. For the -a adapter, it says explicitly that the read will be empty if the adapter is found in the beginning.

So how do you define the pattern match for -a/-g option? Why searching for the 24-bp contamination will not extend to 5' end?

For -a, the start of the adapter must be within the read. There are two reasons.

First, the model behind this is that the sequence is "5' adapter - insert - 3' adapter". The extreme case is that the insert is empty. Allowing to see a proper suffix of the 3' adapter would be something else entirely: We then also allow some kind of degradation of the 3' adapter. But then we'd also need to allow that we see a degraded adapter when the insert is nonempty. This kind of degradation may perhaps happen, I don't know. It could even be allowed in cutadapt, but since the model is different, it would have to be implemented as a different kind of adapter.

Second, there is a practical reason: If suffixes of a 3' adapter were allowed to match in the beginning of the read, then short, random matches, in which only the last few bases of the adapter match would lead to the read being trimmed to length zero.

If you encounter real data in which the assumptions that cutadapt makes don't fit the physical process, I'm eager to learn about that and change or extend cutadapt to model that process (and I have done so in the past).

from cutadapt.

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 10, 2012 08:55:49

I see.

Thanks again.

from cutadapt.

marcelm avatar marcelm commented on August 27, 2024

From [email protected] on July 10, 2012 09:08:34

I hope it's ok to close the issue now. Please open a new one for other problems or e-mail me.

Status: Done

from cutadapt.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    šŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. šŸ“ŠšŸ“ˆšŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ā¤ļø Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.