Comments (9)

ekg commented on August 14, 2024

I'm referring to: https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/references.avdl#L48

from ga4gh-schemas.

lh3 commented on August 14, 2024

The use of this flag is common. The chrY sequence in GRCh37 contains the PARs, but the chrY used by the 1000g has the PARs hard masked. It is not the official chrY and does not have an accession number. For this 1000g sequence, isDerived should be set. Also in GRCh37, there is a single 'R' on one chromosome. Some versions of GRCh37 have it converted to N, so the md5 will be different. In the version of GRCh38 intended for mapping, multiple chromosomes have some centromeric regions hard masked. Some cancer groups also hard mask wrong regions in the reference genome. These are all derived sequences without official accession numbers.
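
To make the md5 point concrete, here is a minimal Python sketch using a toy sequence rather than real GRCh37 data. The `Reference` record, its field names, the `EXAMPLE_ACC.1` accession, and the choice to hash upper-cased bases are all assumptions made for illustration; they are not taken from the ga4gh schema.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def sequence_md5(bases: str) -> str:
    """MD5 of the upper-cased bases (an assumed convention for this sketch)."""
    return hashlib.md5(bases.upper().encode("ascii")).hexdigest()


def hard_mask(bases: str, start: int, end: int) -> str:
    """Replace bases[start:end] with N, as when hard masking a region."""
    return bases[:start] + "N" * (end - start) + bases[end:]


@dataclass
class Reference:
    # Hypothetical record; the field names only loosely follow this thread.
    name: str
    md5: str
    is_derived: bool
    source_accession: Optional[str]


official = "ACGTRACGTACGT"             # toy sequence containing one IUPAC 'R'
masked = hard_mask(official, 0, 4)     # toy stand-in for a hard-masked PAR
no_iupac = official.replace("R", "N")  # 'R' converted to 'N'

# The official sequence keeps is_derived=False; the edited copy points back
# at the same (made-up) accession but is flagged as derived.
official_ref = Reference("chrY", sequence_md5(official), False, "EXAMPLE_ACC.1")
masked_ref = Reference("chrY", sequence_md5(masked), True, "EXAMPLE_ACC.1")
```

Any change to the bases, whether masking a region or converting a single 'R', yields a different checksum, which is why such sequences cannot be identified by the official accession alone.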

calbach commented on August 14, 2024

I now think I understand the fields, but they are confusing as currently documented, and it's not clear how a new user would make use of them. I think they need improved documentation, or else need to be reworked. I'm not sure which, because I don't exactly understand the role/semantics of derived References in a ga4gh repository.

The main thing I'm missing is the motivation. Is the primary purpose of these derived references to allow comparison between datasets that are aligned against an original reference and a derived reference? Presumably the user may consider sourceDivergence when deciding whether to allow this. Or is the purpose to avoid fetching bases for a derived reference if you already have the bases for the original? Or is this just for provenance?

Depending on the answer above, would it provide any value to have derivedFromReferenceId in place of isDerived, which points to the parent reference if any?

ekg commented on August 14, 2024

Great, this explains things much better. I will file a pull request to improve the documentation.
(Sent by email on Aug 21, 2014, in reply to Heng Li's comment above on #130.)

lh3 commented on August 14, 2024

Purpose: data mapped to different derived versions of the same sourceAccession can be retrieved jointly.
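
One way to picture joint retrieval is to group references by their source accession, so that an official sequence and its derived versions fall into the same bucket. This is only a sketch under assumptions: the `Reference` tuple, the identifiers, and the accession strings are hypothetical and do not come from the schema or any real repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Reference(NamedTuple):
    # Hypothetical minimal record for illustration only.
    id: str
    is_derived: bool
    source_accession: Optional[str]


def group_by_source(refs: Iterable[Reference]) -> Dict[str, List[str]]:
    """Map each source accession to the ids of all references that share it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ref in refs:
        # References with no known source cannot be related to anything else.
        if ref.source_accession is not None:
            groups[ref.source_accession].append(ref.id)
    return dict(groups)


refs = [
    Reference("official-chrY", False, "ACC_Y.1"),  # made-up accessions
    Reference("masked-chrY", True, "ACC_Y.1"),
    Reference("chr1", False, "ACC_1.1"),
]
groups = group_by_source(refs)
```

A query for data on `ACC_Y.1` could then consider reads mapped to either `official-chrY` or `masked-chrY`. Whether that is acceptable for a given analysis would presumably depend on how far the derived sequence diverges from its source.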

delagoya commented on August 14, 2024

@ekg any update on a PR?

cassiedoll commented on August 14, 2024

Did a PR happen for this one? Is this issue ready to close?
(Trying to get us ready for our v0.5 cleanup release!)

delagoya commented on August 14, 2024

No PR and no recent comments. Closing.

diekhans commented on August 14, 2024

Thanks @lh3 for pointing this out. This description needs to be added to the ga4gh documentation, not left in a ticket.
