bigdatagenomics / bdg-formats
Open source formats for scalable genomic processing systems using Avro. Apache 2 licensed.
License: Apache License 2.0
The documentation for VariantAnnotation fields states that Number=R VCF INFO attribute values are split for multi-allelic sites:
/**
Total read depth, VCF INFO reserved key AD, Number=R, split for multi-allelic
sites.
*/
union { null, int } readDepth = null;
/**
Forward strand read depth, VCF INFO reserved key ADF, Number=R, split for
multi-allelic sites.
*/
union { null, int } forwardReadDepth = null;
/**
Reverse strand read depth, VCF INFO reserved key ADR, Number=R, split for
multi-allelic sites.
*/
union { null, int } reverseReadDepth = null;
...
/**
Additional variant attributes that do not fit into the standard fields above.
The values are stored as strings, even for flag, integer, and float types. VCF
INFO key values with Number=., Number=0, Number=1, and Number=[n] are shared across
all alternate alleles in the same VCF record. VCF INFO key values with Number=A and
Number=R are split for multi-allelic sites.
*/
map<string> attributes = {};
When converting VCF records to VariantAnnotation records, we assume that the first index of the array is the reference allele value, and use the alternate allele index to extract the value for the alternate allele.
##INFO=<ID=MY,Number=R,Type=Integer,Description="My Number=R attribute">
1 1024 . G A,T . PASS MY=10,20,30
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"attributes": "MY=20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"attributes": "MY=30"}}
When converting from VariantAnnotation back to VCF records, we no longer have access to the reference allele value. I can think of at least four ways to handle this:
Use a sentinel value (e.g. -1) for the lost reference allele value:
1 1024 . G A . PASS MY=-1,20
1 1024 . G T . PASS MY=-1,30
Use the VCF missing value:
1 1024 . G A . PASS MY=.,20
1 1024 . G T . PASS MY=.,30
Drop the reference allele value, changing the effective cardinality:
1 1024 . G A . PASS MY=20
1 1024 . G T . PASS MY=30
Keep Variant and VariantAnnotation records for the reference allele when splitting multi-allelic sites:
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "<*>", "annotation":{"attributes": "MY=10"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"attributes": "MY=20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"attributes": "MY=30"}}
1 1024 . G <*> . PASS MY=10
1 1024 . G A . PASS MY=20
1 1024 . G T . PASS MY=30
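Three of these ways (a sentinel, the VCF missing value, and merging a kept reference-allele value back in) can be sketched as one hypothetical helper; the names here are mine, not from the schema:

```python
def rebuild_number_r(alt_value, reference_value=None, sentinel=None):
    """Rebuild a Number=R INFO value for a split record whose reference
    allele value may have been lost: use reference_value when a
    reference-allele record was kept, else a sentinel (e.g. -1) if given,
    else the VCF missing value '.'."""
    if reference_value is not None:
        ref = reference_value
    elif sentinel is not None:
        ref = sentinel
    else:
        ref = "."
    return f"MY={ref},{alt_value}"

print(rebuild_number_r(20, reference_value=10))  # MY=10,20
print(rebuild_number_r(20, sentinel=-1))         # MY=-1,20
print(rebuild_number_r(20))                      # MY=.,20
```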
For the Number=R VCF INFO attribute values that map to fields (currently AD, ADF, ADR) we have a couple more options:
1 1024 . G A,T . PASS AD=10,20,30
Use array<int> for the field type, with cardinality 2 (reference allele value first, then alternate allele value):
array<int> readDepth = [];
array<int> forwardReadDepth = [];
array<int> reverseReadDepth = [];
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 10,20}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"readDepth": 10,30}}
1 1024 . G A . PASS AD=10,20
1 1024 . G T . PASS AD=10,30
Use separate fields for the reference allele values:
union { null, int } readDepth = null;
union { null, int } forwardReadDepth = null;
union { null, int } reverseReadDepth = null;
union { null, int } referenceReadDepth = null;
union { null, int } referenceForwardReadDepth = null;
union { null, int } referenceReverseReadDepth = null;
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 20, "referenceReadDepth": 10}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 30, "referenceReadDepth": 10}}
1 1024 . G A . PASS AD=10,20
1 1024 . G T . PASS AD=10,30
A new option, combined with option 5 above, as proposed below:
##INFO=<ID=MY,Number=R,Type=Integer,Description="My Number=R attribute">
1 1024 . G A,T . PASS AD=5,15,25;MY=10,20,30
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": [5,15], "attributes": "MY=10,20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"readDepth": [5,25], "attributes": "MY=10,30"}}
1 1024 . G A . PASS AD=5,15;MY=10,20
1 1024 . G T . PASS AD=5,25;MY=10,30
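Under this proposal, the split of the AD field could look like the following sketch (the helper name is hypothetical):

```python
def split_allelic_depth(ad_values, alt_index):
    """Split a Number=R AD value into the proposed two-element array:
    [reference allele value, alternate allele value]."""
    return [ad_values[0], ad_values[1 + alt_index]]

# AD=5,15,25 with ALT alleles A,T:
print(split_allelic_depth([5, 15, 25], 0))  # [5, 15]
print(split_allelic_depth([5, 15, 25], 1))  # [5, 25]
```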
Similar to the VariantCallingAnnotations
record (see #51), I think the DatabaseVariantAnnotation
record needs some TLC.
Am I missing this somewhere, or is the sample name not stored in AlignmentRecord? I'm looking for something equivalent to SAMRecord.getReadGroup().getSample() from https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMReadGroupRecord.java#L70
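For reference, the sample name lives in the SM tag of the @RG line in a SAM/BAM header; a minimal, library-free sketch of pulling it out (this helper is hypothetical, not part of bdg-formats):

```python
def sample_from_rg_line(rg_line):
    """Extract the SM (sample) value from a SAM @RG header line."""
    if not rg_line.startswith("@RG"):
        return None
    for field in rg_line.split("\t")[1:]:
        tag, _, value = field.partition(":")
        if tag == "SM":
            return value
    return None

line = "@RG\tID:group1\tSM:NA12878\tPL:ILLUMINA"
print(sample_from_rg_line(line))  # NA12878
```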
Having a diagram of all the data structures of bdg-formats would help a newcomer get started in the project.
Many tools allow us to generate such a diagram automatically from Java sources, like UMLGraph, which is open source.
I'll submit a pull request shortly that achieves this feature.
As raised on bigdatagenomics/adam#815, the readNum field of AlignmentRecord is a bit vaguely named. E.g., readNum could be read as a UUID, instead of the number of the read in the fragment.
I just noticed this field getting lost when converting into and out of BAMs with ADAM. Should we/I add it? Should we just infer it in ADAM? Or continue not supporting it?
See TLEN in the SAM spec. htsjdk puts it in SAMRecords as "inferred insert size".
Per discussion at bigdatagenomics/adam#1290: currently we just have fragmentStartPosition and fragmentLength. To perform predicate filtering by a ReferenceRegion, we need a fragmentEndPosition field.
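Assuming fragmentStartPosition follows the 0-based, half-open convention, the end position could be derived from the two existing fields (a sketch of the relationship, not the actual schema change):

```python
def fragment_end_position(fragment_start_position, fragment_length):
    """Derive an exclusive fragment end coordinate, assuming 0-based,
    half-open coordinates."""
    return fragment_start_position + fragment_length

# A fragment starting at 1024 with length 300:
print(fragment_end_position(1024, 300))  # 1324
```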
Publish the C/C++ artifacts, generated from the latest JSON (via bigdatagenomics.github.io?).
In bigdatagenomics/adam#815 we decided that the normalization provided by the Sequences in the Fragment record wasn't that useful and was somewhat hard to reason about.
I remember there was some discussion of converting the name of 'ADAMRecord' to 'Read' at the same time as we made the move to remove the 'ADAM' prefix from a lot of these schemas -- although I can't find the discussion of it, at the time.
However, can I vote that we change the name of 'Read' back to 'Record'? If 'Record' is too generic, maybe something more specific like 'AlignmentRecord'.
The point is, the 'Read' schema actually bears a many-to-one relationship with the reads themselves (which will become clear, if we start parsing the FASTQ files directly in any context where we're parsing the raw data), and using a '[something]Record' name will (continue to) evoke the association with SAMRecord which is so clearly implied by the actual presence of fields with similar names and semantics.
Thoughts?
Related to bigdatagenomics/adam#194, and #108. Specifically, this is a subset of #108 that I'd like to get into 0.10.0.
Variant has:
/**
True if filters were applied for this variant. VCF column 7 "FILTER" any value other
than the missing value.
*/
union { null, boolean } filtersApplied = null;
/**
True if all filters for this variant passed. VCF column 7 "FILTER" value PASS.
*/
union { null, boolean } filtersPassed = null;
/**
Zero or more filters that failed for this variant. VCF column 7 "FILTER" shared across
all alleles in the same VCF record.
*/
array<string> filtersFailed = [];
While VariantCallingAnnotations has:
// FILTER: True or false implies that filters were applied and this variant PASSed or not,
// while 'null' implies that no filters were applied.
union { null, boolean } variantIsPassing = null;
array<string> variantFilters = [];
I'm going to make VariantCallingAnnotations match Variant.
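The Variant semantics above suggest a mapping from the VCF FILTER column like the following (a hypothetical helper illustrating the field semantics, not ADAM's converter):

```python
def parse_filter_column(filter_value):
    """Map a VCF FILTER column value onto the Variant filter fields:
    '.' -> filters not applied; 'PASS' -> applied and passed;
    anything else -> a semicolon-separated list of failed filters."""
    if filter_value == ".":
        return {"filtersApplied": False, "filtersPassed": None, "filtersFailed": []}
    if filter_value == "PASS":
        return {"filtersApplied": True, "filtersPassed": True, "filtersFailed": []}
    return {
        "filtersApplied": True,
        "filtersPassed": False,
        "filtersFailed": filter_value.split(";"),
    }

print(parse_filter_column("q10;s50"))
# {'filtersApplied': True, 'filtersPassed': False, 'filtersFailed': ['q10', 's50']}
```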
Releasing bdg-formats fails under Java 8 because certain Javadoc warnings have changed to errors in Java 8. We should look more closely to see whether the issues are from the comments we've written inline in our avdl, or are caused by the Javadoc generated by Avro. To work around, we can just cut the releases using Java 7.
The last revision to the Pileup schema removed too many fields; we need the sampleId field (or its equivalent) added back in.
This is a ticket to capture our collective thinking around the re-organization of the Feature schema. The Feature schema needs to be edited to satisfy the following (additional) requirements:
See discussion on bigdatagenomics/adam#1103
I can't find the Pileup class file in the org.bdgenomics.formats.avro package, and then there is this error:
val pileups = reads.adamRecords2Pileup().cache()
:32: error: value adamRecords2Pileup is not a member of org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord]
val pileups = reads.adamRecords2Pileup().cache()
How can I get the file and resolve the error?
Thanks!
jack xu
I have trouble accessing those "end" fields (e.g. AlignmentRecord.end, Variant.end) with Spark SQL, because end is a reserved keyword there and it conflicts with the field names.
I was wondering: Is it possible to assign different names to those fields?
To support things like random barcodes, drop-seq, etc.
Followup to #44.
Specifically, better schema level support for:
END field
ADAM is on Avro 1.7.7, while bdg-formats is on 1.7.4. I think this is causing some weird behavior with the Spark 1.5 stream of releases, which pull in Avro 1.7.7.
#103 removed the StructuralVariant and StructuralVariantType records; this issue suggests we might want to reconsider that decision.
The reference between Variant and VariantAnnotation should be forward so that it can be projected away.
We should leave some top-level documentation in bdg-formats, as part of the README.
This should explain, among other things, that
@heuermh will break out the variant/genotype changes into smaller chunks, so that we can roll them into ADAM downstream incrementally.
I wanted to revisit VariantCallingAnnotations. It seems a lot of these fields are GATK specific, and it seems hard to extend to add new annotations or variant calling output.
Might it make more sense to move most of these to attributes and improve VCF output from attributes (which it doesn't seem to do currently)?
We should add fields for:
We should also add a tag array.
DatabaseVariantAnnotation to VariantAnnotation
VariantCallingAnnotations to GenotypeAnnotation
TranscriptEffect under VariantAnnotation
Seems like both of these are still hanging around; was there any resolution on this?
I can't remember the part of the discussion around #90 where alternateAllele in TranscriptEffect was dropped. As far as I can tell, this is still necessary when reading in variants to associate an effect with a specific alternate allele.
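For context, the ANN INFO value defined by the VCF annotation standard (used by SnpEff) carries the allele as the first '|'-separated subfield of each comma-separated effect, which is what makes this association possible. A sketch of filtering effects down to one alternate allele (the helper name is mine):

```python
def effects_for_allele(ann_value, alternate_allele):
    """Keep only the comma-separated ANN effects whose first subfield
    (the allele) matches the given alternate allele."""
    return [
        effect
        for effect in ann_value.split(",")
        if effect.split("|")[0] == alternate_allele
    ]

ann = "A|missense_variant|MODERATE|GENE1,T|synonymous_variant|LOW|GENE1"
print(effects_for_allele(ann, "T"))  # ['T|synonymous_variant|LOW|GENE1']
```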
Minor code style and doc fixes:
Base enum
I was writing code that munged NucleotideContigFragments and realized that Contig should be changed to Reference, and NucleotideContigFragment should be renamed to Contig or ContigFragment.
Along with #9
Small nit: I feel like bigdatagenomics/bdg-formats is redundant. Would anyone be opposed to renaming to bigdatagenomics/formats? I will leave this open for a week and make the changes if no one is opposed.
The original position and original cigar flags are useful for describing the alignment of a read prior to realignment.
May contain incompatible API changes.
Source diff
apache/avro@release-1.7.7...release-1.8.0
This stems from the -onlyvariants flag we added to vcf2adam, which only writes out the variant information. IMO, annotations associated with a variant should be packaged with a variant, not a genotype. If you're denormalizing the variant information into the Genotype, you shouldn't denormalize these two pieces separately. This is annoying from the -onlyvariants perspective, because at the moment this ends up storing minimal info on the variants, when what I really want to do is analyze the metadata on the variants. Thoughts?
Similar to 95f4b5b, we want to move the Contig record out of Variant and replace it with a string contigName field.
@arahuja implied there had been some discussion around this in the past.
AFAICT there is no good way right now to capture there being two non-reference alleles at one locus, having one Variant per Genotype.
In the immediate term I will work around this by emitting two lines / Variants in my VCFs, but I'm curious whether we should support the other way here. Thanks!
In the 0.1.1 implementation, the ADAMFeature contains a single field 'trackName', which the comments imply (for GFF/GTF files) contains both the 'feature' and 'source' values from the original record.
However, for parsing out GFF and GTF files into hierarchical or structured gene models, we're going to need to represent those two fields as separate values.