smarter-database's Introduction

SMARTER Database

SMARTER-database aims to collect data produced by the WP4 group in the context of the SMARTER project and to merge them with already available data.

Project Organization

├── data
│   ├── external        <- Data from third party sources.
│   ├── interim         <- Intermediate data that has been transformed.
│   ├── processed       <- The final, canonical data sets for modeling.
│   └── raw             <- The original, immutable data dump.
│
├── database            <- MongoDB smarter database docker-composed image
│
├── docs                <- A default Sphinx project; see sphinx-doc.org for details
│
├── models              <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks           <- Jupyter notebooks. Naming convention is a number (for ordering),
│                          the creator's initials, and a short `-` delimited description, e.g.
│                          `1.0-jqp-initial-data-exploration`.
│
├── references          <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports             <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures         <- Generated graphics and figures to be used in reporting
│
├── src                 <- Source code for use in this project.
│   ├── __init__.py     <- Makes src a Python module
│   │
│   ├── data            <- Scripts to download or generate data
│   │
│   ├── features        <- Scripts to turn raw data into features for modeling
│   │
│   ├── models          <- Scripts to train models and then use trained models to make
│   │   │                  predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization   <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
├── tests               <- Folder to test python modules / scripts
│
├── HISTORY.rst         <- Project change log
├── LICENSE             <- Project LICENSE
├── Makefile            <- Makefile with commands like `make data` or `make train`
├── README.md           <- The top-level README for developers using this project.
├── TODO.md             <- Things that need to be done
├── environment.yml     <- Conda environment file
├── pytest.ini          <- Configuration file for `pytest` testing environment
├── requirements.txt    <- The requirements file for reproducing the analysis environment, e.g.
│                          generated with `pip freeze > requirements.txt`
├── setup.py            <- makes project pip installable (pip install -e .) so src can be imported
├── test_environment.py <- Helper script to test if environment is properly set
└── tox.ini             <- Tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience

smarter-database's People

Contributors

bunop, dependabot[bot]

smarter-database's Issues

:card_file_box: track breeds and countries in countries and breeds collection respectively

Is your feature request related to a problem? Please describe.
When filtering samples in SMARTER-frontend, the suggested breed and country values do not take into account which breeds are available in each country and vice versa, so it's possible to select a breed and country combination which returns 0 results. If the breed and country collections track countries and breeds respectively, it will be possible to query a breed endpoint and filter by country

Describe the solution you'd like
Track breeds and countries in the countries and breeds collections respectively. The script src/data/update_db_status.py could track that information in the collections

Describe alternatives you've considered
An alternative could be executing an aggregation pipeline within sample collections using SMARTER-backend

Additional context

  • Model countries and breed as ListField
  • Update collections with update_db_status.py
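
The bookkeeping described above can be sketched in plain Python. This is only an illustration, assuming each sample exposes a `breed` and a `country` value; the function name `collect_breed_countries` and the sample records are hypothetical and are not taken from update_db_status.py.

```python
from collections import defaultdict


def collect_breed_countries(samples):
    """Return (breed -> sorted countries, country -> sorted breeds)."""
    breed2countries = defaultdict(set)
    country2breeds = defaultdict(set)

    for sample in samples:
        breed2countries[sample["breed"]].add(sample["country"])
        country2breeds[sample["country"]].add(sample["breed"])

    # sorted lists are easy to store in a list-like field
    return (
        {breed: sorted(values) for breed, values in breed2countries.items()},
        {country: sorted(values) for country, values in country2breeds.items()},
    )


# hypothetical sample records, for illustration only
samples = [
    {"breed": "Texel", "country": "Italy"},
    {"breed": "Texel", "country": "France"},
    {"breed": "Merino", "country": "Italy"},
]
breeds, countries = collect_breed_countries(samples)
```

The two dictionaries could then be written into the list fields of the breed and country documents.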

:sparkles: import data from Hungary

Is your feature request related to a problem? Please describe.
Data from Hungary need to be imported

Describe the solution you'd like
Add the Hungarian dataset to the SMARTER database. Model breed names like White Dorper, Rusty Tsigai and Ile de France (check for breeds already in the database).

Describe alternatives you've considered
There are no alternatives to Hungarian data load

Additional context

  • Model Hungarian breed names
  • Import from plink (seems in forward OAR3)
  • Add GPS coordinates (if any)

:sparkles: integrate background data with published datasets

Is your feature request related to a problem? Please describe.
Add additional published datasets to fill gaps and provide more data to SMARTER database

Describe the solution you'd like
Some published datasets could be added to SMARTER-database

Describe alternatives you've considered
Background data could remain as it is, however adding new data could provide useful information

Additional context

  • Front Genet 2021 Jun 18;12:612492. doi: 10.3389/fgene.2021.612492. (Welsh sheep breeds)
  • Barbato M, Hailer F, Orozco-Terwengel P, Kijas J, Mereu P, Cabras P, et al. Genomic signatures of adaptive introgression from European mouflon into domestic sheep. Sci Rep. 2017;7:7623
  • Ciani et al. Genet Sel Evol (2020) 52:25 https://doi.org/10.1186/s12711-020-00545-7
  • https://www.nature.com/articles/s41598-019-44137-y/ (Northwest Africa)
  • Heredity (Edinb)2017 Mar;118(3):293-301. doi: 10.1038/hdy.2016.86. Epub 2016 Sep 14 (Algerian sheep)

:card_file_box: check variants data before update

Is your feature request related to a problem? Please describe.
There are affy probes like Affx-291664366 and Affx-281280004 which map to the same cust_id: "oar3_OAR8_62459373". However, they are A/G and A/C respectively: I can't track both of them on the same SNP, so I need to check alleles (illumina_top coding) in order to understand if the SNP is the same, otherwise I could skip the import

Describe the solution you'd like
Track the illumina_top attribute at variant level: update only if these records match. Alleles need to be checked at base level, since a location could be wrong during updates and can't be trusted on its own. This attribute should be generated when creating a variant and should never change

Describe alternatives you've considered
I could skip import or overwrite with new data, but I can't merge dataset with different alleles

Additional context

  • track illumina_top at variant level
  • skip variant update if illumina_top alleles between variant and new location don't match
  • use dates to understand if I need to update location or not
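
A minimal sketch of the check listed above, assuming the stored variant and the incoming location both carry an illumina_top string and a date. The function name `should_update_location` and its signature are hypothetical.

```python
from datetime import datetime


def should_update_location(variant_top, stored_date, new_top, new_date):
    """Decide whether a new location may replace the stored one.

    The update is allowed only if the illumina_top alleles are identical
    and the new record is more recent than the stored one.
    """
    if variant_top != new_top:
        # different alleles: not the same SNP, skip the import
        return False

    return new_date > stored_date


# the A/G versus A/C case described in this issue is rejected
rejected = should_update_location(
    "A/G", datetime(2013, 1, 1), "A/C", datetime(2020, 1, 1))
```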

:sparkles: Import coordinates from Sheep and Goat projects

Is your feature request related to a problem? Please describe.
Sheep and Goat projects have SNPs coordinates for the four major assemblies I need (CHI1, ARS, OAR3, OAR4)

Describe the solution you'd like
Check genomic coordinates, make a report and then fill those data into database

Describe alternatives you've considered
Data could be fetched directly from EBI or EnsEMBL, however it's interesting to merge multiple data sources to understand where problems could be. Data from the consortium will not be used to provide the final data files; evaluate if this information needs to be provided with the final version

Additional context
Those data could be useful to cover missing information (like rs in goat). These data lack alignment information, so getting illumina_top from this type of data makes no sense.

  • Manage the two different data files provided by consortium
  • Force data update when importing from consortium
  • Track date when importing from consortium
  • Determine illumina_top data directly from variant for Sheep
  • Import data for goat

:bug: imported illumina report multi breeds file have the same FID in output

Describe the bug
When importing an illumina report file for a multi-breed dataset, we get the same FID in the ped output. This seems related to the fact that the fid variable is read once and then never re-assigned when the sample changes:

# determine fid from sample, if not received as argument
if not fid:
    sample = self.SampleSpecies.objects.get(
        original_id=row.sample_id,
        dataset=dataset
    )

    fid = sample.breed_code
    logger.debug(f"Found breed {fid} from {row.sample_id}")

Once fid is set, no more updates are possible.

To Reproduce
Steps to reproduce the behavior:
Import an illumina report multibreeds dataset and see the same FID in the output file

Expected behavior
FID needs to change according to the sample breed code

Additional context

  • define a test with two samples with different FID
  • check that FID changes in output
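
One possible way to avoid the problem, shown as a sketch rather than the project's actual fix: keep the command-line `fid` untouched and compute a per-row value. Here `breed_codes` is a plain dictionary standing in for the database lookup, and `assign_fids` is a hypothetical helper name.

```python
def assign_fids(rows, breed_codes, fid=None):
    """Yield (fid, sample_id) pairs for each row of a report.

    `rows` is a list of sample ids, `breed_codes` maps a sample id to its
    breed code (a stand-in for the database lookup). If `fid` is given it
    is used for every row, otherwise it is looked up per sample.
    """
    for sample_id in rows:
        # use a local name, so the value found for one sample
        # is never carried over to the next one
        row_fid = fid if fid else breed_codes[sample_id]
        yield row_fid, sample_id


# two samples with different breed codes (illustrative values)
codes = {"sample1": "TEX", "sample2": "MER"}
result = list(assign_fids(["sample1", "sample2"], codes))
```

The same two-sample setup could serve as the test case requested in the list above.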

:memo: update documentation

Is your feature request related to a problem? Please describe.
Describe how this project works

Describe the solution you'd like
Documentation should be updated: the importing process and data generation process need to be described

Describe alternatives you've considered
We can document at another time, however documentation will help others to better understand the whole process

Additional context

  • Improve documentation
  • Set up documentation with readthedocs

:bug: indel SNPs should be discarded from final dataset

Describe the bug
Variants like oar3_OAR1_80093512_dup are indels, and can't be converted into illumina_top nor included in the final dataset file

To Reproduce
Steps to reproduce the behavior:

  1. Try to create samples from foreground french dataset (binary plink HD)

Expected behavior
Such SNPs should be tracked, for example with an is_indel flag, in order to be ignored while fetching coordinates. Or such SNPs could be ignored when created from manifest (like Affymetrix indel SNPs are)

Screenshots

{
  "version" : "Oar_v3.1",
  "chrom" : "X",
  "position" : 50971660,
  "illumina" : "I/D",
  "illumina_strand" : "MINUS",
  "strand" : "PLUS",
  "imported_from" : "manifest",
  "date" : ISODate("2013-04-23T00:00:00Z"),
  "consequences" : [ ]
}

Additional context
Choose one of these two distinct strategies:

  • Detect indels while importing dataset from manifest
  • Add is_indel attribute to such SNPs in VariantSpecies
  • Ignore such indel variants in plinkio.SmarterMixin.fetch_coordinates

Or

  • skip SNPs during manifest upload (like import_affymetrix.py does)
  • check no-errors while uploading SNPchimp coordinates

Search and add this flag manually with mongodb:

db.variantSheep.updateMany({$or: [{locations: {$elemMatch: {imported_from: "manifest", illumina: "D/I"}}}, {locations: {$elemMatch: {imported_from: "manifest", illumina: "I/D"}}}]}, {$set: {is_indel: true}})
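
The same condition used by the query above can be expressed on a single location record. This is a sketch working on plain dictionaries; `is_indel` as a function is hypothetical and only mirrors the `I/D` and `D/I` manifest values shown in this issue.

```python
def is_indel(location):
    """Return True if a manifest location describes an insertion/deletion."""
    return (
        location.get("imported_from") == "manifest"
        and location.get("illumina") in {"I/D", "D/I"}
    )


# the location reported in this issue (date omitted for brevity)
location = {
    "version": "Oar_v3.1",
    "chrom": "X",
    "position": 50971660,
    "illumina": "I/D",
    "illumina_strand": "MINUS",
    "strand": "PLUS",
    "imported_from": "manifest",
}
flag = is_indel(location)
```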

:boom: should smarter_id change its format?

Is your feature request related to a problem? Please describe.
The smarter_id attribute is not a good primary key since it incorporates additional information in its name, like country and breed_code. This means that modifying a country or a breed_code for a sample causes inconsistencies between this primary id and the fields, leaving a stable id which contains wrong information. On the other hand, fixing smarter_id relying on new data means that these ids will change with the database version.

Describe the solution you'd like
smarter_ids should be anonymized, maybe tracking the project name with the number (like SMARTER-OA-000002310). Species needs to be kept to avoid name collisions between goat and sheep
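
A sketch of such an identifier, assuming that the `OA` part of the example is a species code and that the number is zero-padded to nine digits as in SMARTER-OA-000002310. The helper name `make_smarter_id` is hypothetical.

```python
def make_smarter_id(species_code, number):
    """Build an anonymous identifier like SMARTER-OA-000002310."""
    return f"SMARTER-{species_code}-{number:09d}"


example_id = make_smarter_id("OA", 2310)
```

An id built this way carries no country or breed information, so it stays valid when those fields are corrected.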

Describe alternatives you've considered
An alternative could be fixing values after data import, using a mongoengine query to correct wrong values while keeping smarter_id stable. However, this means not trusting the information in ids anymore.

Additional context
The Galway sheep breed was tracked as Germany instead of Ireland since the country was derived from a wrong set of coordinates. Code and country need to be fixed

:sparkles: generate final genotype files for OAR4 and CHI1 assemblies

Is your feature request related to a problem? Please describe.
Genotype files for the remaining coordinate systems need to be generated (OAR4 and CHI1)

Describe the solution you'd like
Genotype data need to be generated with other assemblies relying on configuration files or parameters: in this way there's the possibility to support a new assembly in the future with a few changes.

Describe alternatives you've considered
If we want to support other genome versions, we need to find a way to deal with them: there could be a version-specific Makefile, a Makefile dependency or a workflow managed with software like Nextflow

Additional context

  • Provide additional assembly version as parameters
  • Convert genotypes in CHI1 and OAR4 genomic coordinates
  • Ensure that operations on SampleSpecies collections occur exactly once
  • Publish data on FTP site
  • Update FTP README
  • Check all the documentation where assembly versions are mentioned

:sparkles: add new background datasets for goat

Is your feature request related to a problem? Please describe.
Add two more background datasets for goat

Describe the solution you'd like
Test and update Goat background samples with samples from Burren et al. 2016 and Cortellari et al. 2021

Describe alternatives you've considered

Additional context

  • Add Burren et al. 2016 samples
  • Add Cortellari et al. 2021 samples
  • Check no Bezoar duplicated (from Cortellari)

:sparkles: import latest Uruguayan data

Is your feature request related to a problem? Please describe.
A new batch of data has been uploaded from Uruguay

Describe the solution you'd like
Import latest Uruguay data as foreground data

Describe alternatives you've considered
There are no alternatives, data should be imported

Additional context

  • Check data type
  • Check user IDs
  • Check SNPs are in database
  • Import data into SMARTER-database

:recycle: model location with MultiPointField

Is your feature request related to a problem? Please describe.
Modelling locations like:

# GPS location
# NOTE: X, Y where X is longitude, Y latitude
locations = mongoengine.ListField(mongoengine.PointField(), default=None)

This seems not to work properly when querying.

Describe the solution you'd like
Try to model those attributes with MultiPointField
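
For reference, GeoJSON represents several positions as a single MultiPoint geometry, which is the kind of value a MultiPointField is meant to store. The sketch below only builds that structure from (longitude, latitude) pairs; the helper name `to_multipoint` and the coordinates are illustrative.

```python
def to_multipoint(points):
    """Build a GeoJSON MultiPoint from (longitude, latitude) pairs."""
    return {
        "type": "MultiPoint",
        "coordinates": [[lon, lat] for lon, lat in points],
    }


# X is longitude, Y is latitude, as in the note above
geometry = to_multipoint([(9.19, 45.46), (2.35, 48.85)])
```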

Describe alternatives you've considered
An alternative could be a custom method like is_top to evaluate locations. However, this may not perform as well as MultiPointField. Another solution is to write locations in different attributes (location1, location2, ...)

Additional context

:sparkles: import data from Spain datasets

Is your feature request related to a problem? Please describe.
Foreground data coming from Spain need to be imported in SMARTER database

Describe the solution you'd like
Data come from an Affymetrix chip which is not modeled by the database. The chip data need to be uploaded into the database before merging data into the SMARTER dataset

Describe alternatives you've considered
I could insert the data I can model, however I would miss about 20% of SNPs which I can't find in the database. Moreover I need to check genotypes and try to figure out what they should be in TOP coding. This will be difficult to handle for reference genomes which are different from the ones used to generate the chip data

Additional context

  • solve issues about the 10K chip: are these genotypes a subset of the data I have? Do I need to import them into the database?
  • import the new manifest data
  • move the src.features.plinkio.AffyPlinkIO class to a generic AffyCellIO class (to handle cell files).
  • support for comments or tabs in src.features.plinkio.TextPlinkIO class
  • import Spain data (and metadata) into SMARTER database

:construction_worker: Set Up Continuous Integration

Is your feature request related to a problem? Please describe.
This repo lacks CI

Describe the solution you'd like
Set up Travis CI and Coverage (maybe scrutinizer)

Describe alternatives you've considered
Tests could be run before a commit, however it's better to configure such features

Additional context
Mind to Partner Queue Solution

:sparkles: check that illumina A/B alleles are tracked properly

Is your feature request related to a problem? Please describe.
The illumina technical notes clearly state which alleles need to be A/B. Check variant import after illumina updates.

Describe the solution you'd like
Illumina import should take account of A/B when determining illumina top

Describe alternatives you've considered
This step seems not to be important: illumina TOP is currently determined and enforced during data import. Correctly detecting the A/B allele should affect only a few SNPs, without consequences on TOP detection

Additional context

  • Update src.features.illumina module
  • Update tests accordingly
  • Test variants after variants import

:boom: model database migrations

Is your feature request related to a problem? Please describe.
Dropping and re-creating data every time a change is needed is slow and error-prone: if intermediate and processed data files are not removed and the collections are not restored to their initial state, we can have inconsistencies

Describe the solution you'd like
Database changes need to be modeled with migration when possible. A change could be done without running the entire import process. Test database migrations to understand if they could be applied in this context

Describe alternatives you've considered
src.features.smarterdb.get_or_create_sample could be modified to check if sample was changed. The same concept could be applied on dataset imports.

Additional context
See:

:sparkles: determine illumina_forward genotype from manifest data

Is your feature request related to a problem? Please describe.
It's not possible to convert from forward to top using data imported from manifest since the Location.illumina_forward attribute cannot be set or derived from manifest data. Consider, for example, SNP 250506CS3900386000001_696.1:

{
        "version" : "Oar_v3.1",
        "chrom" : "16",
        "position" : 62646307,
        "illumina" : "A/G",
        "illumina_strand" : "TOP",
        "strand" : "TOP",
        "imported_from" : "manifest",
        "date" : ISODate("2009-01-07T00:00:00Z"),
        "consequences" : [ ]
}

Even if strand is TOP and illumina_strand is TOP, the wanted illumina_forward allele is T/C, since it depends on genome mapping. The illumina_strand tells how to cut the SNPs from the probe (and since it is top, I can take the illumina record as it is to have an illumina top). The strand is the direction of the alignment, however I don't know if the probe was reversed or not in this alignment (probably yes, since there is an affymetrix probe in this record which seems to be reversed: it has a T/C snp in sequence)

Describe the solution you'd like
illumina_forward should be determined using the genomic sequence and manifest information. In this way, we will be able to convert from forward to top using manifest data (ie goat forward data input)

Describe alternatives you've considered
An alternative could be getting data from other sources, for example illumina row files, since that conversion doesn't rely on the illumina_forward attribute. However, the script import_from_illumina.py should change in order to determine fid from database and not from command line (which is necessary for multi breeds illumina row files).

Additional context

  • read the proper genome file while reading from manifest
  • determine the reference allele from sequence relying on SNP location
  • check if reference allele belongs to manifest illumina feature. If yes, illumina_forward could be equal to illumina feature, if not illumina should be reverse complemented
  • cross check illumina_forward feature with other source (ie. SNPchimp)
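
The rule in the list above can be sketched as follows. It assumes the reference allele has already been read from the genome at the SNP position; the function signature is hypothetical, and for single-base alleles the reverse complement reduces to complementing each allele.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def illumina_forward(illumina, reference_allele):
    """Derive forward alleles from a manifest `illumina` value (e.g. "A/G").

    If the reference allele is one of the manifest alleles, the manifest
    value is already in forward orientation; otherwise both alleles are
    complemented.
    """
    alleles = illumina.split("/")

    if reference_allele in alleles:
        return illumina

    complemented = [COMPLEMENT[allele] for allele in alleles]

    if reference_allele not in complemented:
        raise ValueError(
            f"{reference_allele} doesn't match {illumina} in any orientation")

    return "/".join(complemented)


# the SNP discussed in this issue: A/G in manifest, T on the genome
forward = illumina_forward("A/G", "T")
```

Note that A/T and C/G SNPs read the same on both strands, so the reference allele alone cannot tell their orientation; such cases would need the cross check with another source mentioned above.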

:rotating_light: fix issues with linter

Describe the bug
Linter tests should pass without errors

To Reproduce
Steps to reproduce the behavior:

flake8 src
# or 
tox
# or see linter github workflow

Expected behavior
Linter process should exit with success

Additional context

  • refactor dbsnp feature module

:recycle: refactor tests which require database objects

Is your feature request related to a problem? Please describe.
Tests should be refactored in order to initialize data from fixtures. This will make the testing process easier, even with the most difficult requirement setups

Describe the solution you'd like
Database-related tests should start by loading fixtures, and requirements must be managed by uploading json objects

Describe alternatives you've considered
Tests could remain as they are, however it seems difficult to properly set up certain tests

Additional context
See tests in SMARTER-backend and take inspiration from django tests

:bug: check country for french dataset with GPS coordinates

Describe the bug
There are some sheep animals assigned to a wrong country; this can be verified by inspecting the GPS coordinates and doing a reverse geocoding

To Reproduce
An example of this are animals coming from this dataset:

# collect geo data from Romanov sheeps
romanov_sheep <- get_smarter_geojson(
  "Sheep",
  query = list(breed = "Romanov")) %>%
    sf::st_cast("POINT", do_split = FALSE)

# display data with leaflet
leaflet::leaflet(
  data = romanov_sheep
  ) %>%
    leaflet::addTiles() %>%
    leaflet::addMarkers(
      clusterOptions = leaflet::markerClusterOptions(), label = ~smarter_id
    )

Expected behavior
Those animals need to be assigned to the proper country (ie RUSSIA* for Romanov sheep)

Additional context

  • check samples with reverse geocoding
  • fix countries and their respective dataset
  • import and re-generate IDS (however, this raises again the problem spotted in #73)
  • new database release

:boom: define a SMARTER snp coordinate version

Is your feature request related to a problem? Please describe.
Variant information comes from different sources, each one describing only a subset of all smarter variants. The SMARTER coordinate version should merge the most updated and trusted information and provide a final set of coordinates used to generate the most updated genotype file

Describe the solution you'd like
A solution could be trying to align probes directly to the genome of interest. This could solve ambiguities, determine new relations between chips and get information on the latest assembly

Describe alternatives you've considered
All the evidence for a variation should be compared and ordered by update time and by authority. The most updated and trusted coordinates should be used to define a smarter coordinate source which can be displayed in API and web interface.

Additional context

  • track dates when importing snps from snpchimp
  • illumina_top attribute should be checked during insertion and written into the database
  • merge and rank evidences
  • define a comprehensive coordinate version relying on different evidences
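
Ranking by update time and by authority could look like the sketch below. The `AUTHORITY` table is purely hypothetical (this issue does not state which source is more trusted), as is the helper name `best_location`; locations are plain dictionaries with `date` and `imported_from` keys like the records shown elsewhere in these issues.

```python
from datetime import datetime

# hypothetical authority ranking: lower value means more trusted
AUTHORITY = {"consortium": 0, "SNPchimp": 1, "manifest": 2}


def best_location(locations):
    """Return the most recent location, breaking ties by authority."""
    return max(
        locations,
        key=lambda loc: (loc["date"], -AUTHORITY.get(loc["imported_from"], 99)),
    )


# illustrative records: positions and dates are made up
locations = [
    {"imported_from": "manifest", "date": datetime(2009, 1, 7), "position": 1},
    {"imported_from": "SNPchimp", "date": datetime(2017, 3, 13), "position": 2},
]
chosen = best_location(locations)
```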

:sparkles: import data from affymetrix file

Is your feature request related to a problem? Please describe.
Data coming from affymetrix need to be imported into SMARTER database.

Describe the solution you'd like
Files are similar to plink format, however the first plink columns are missing (only sample name and genotype are present). Sample names need to be converted using metadata files (this is different from the usual alias used for breed conversion). Since there are more data than those which need to be shared within SMARTER, we don't want to create a record while loading data

Describe alternatives you've considered
Fixing data before import may not work since the format is slightly different from plink format. Subsetting samples before uploading could be useful with other datasets

Additional context

  • Add a function in src.features.plinkio.SmarterMixin to ignore a ped line if the sample is not present in database
  • Import samples from file by providing country and breeds as parameters
  • Define alias while importing samples
  • Search in database relying on original_id or alias
  • Parse affy genotype data and discard INDELs
  • Rely on original coordinate system to determine illumina top alleles
  • Apply the desired coordinate system
  • Add a custom script for affymetrix data import

:sparkles: add multiple phenotypes to samples

Is your feature request related to a problem? Please describe.
There are samples with multiple phenotypes (more rows to be added to the same sample)

Describe the solution you'd like
Try to upload all data in Phenotypes using lists

Describe alternatives you've considered
We can create another collection in which to put every record as an object. However, this means dealing with another endpoint in backend, r-smarter-api and web interface, and requires users to merge data themselves

Additional context

  • import multiple phenotypes for the same sample
  • test stuff within SMARTER-database project
  • try to load data through SMARTER-backend
  • try to collect data using r-smarter-api
  • try to visualize data with SMARTER-frontend
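
A sketch of the list-based approach, assuming phenotype rows arrive as dictionaries keyed by a `sample_id` column. The column names and the helper `group_phenotypes` are hypothetical.

```python
from collections import defaultdict


def group_phenotypes(rows):
    """Collect every phenotype value of a sample into lists."""
    grouped = defaultdict(lambda: defaultdict(list))

    for row in rows:
        sample_id = row["sample_id"]
        for key, value in row.items():
            if key != "sample_id":
                grouped[sample_id][key].append(value)

    return {sample: dict(values) for sample, values in grouped.items()}


# two rows for the same sample (illustrative values)
rows = [
    {"sample_id": "A1", "weight": 41.5},
    {"sample_id": "A1", "weight": 43.0},
    {"sample_id": "B2", "weight": 38.2},
]
phenotypes = group_phenotypes(rows)
```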

:wrench: configure database connection

Is your feature request related to a problem? Please describe.
Connect to a database in a remote host, in order to execute the import process in a computing resource

Describe the solution you'd like
Track configuration parameters in env file and set default values for development

Describe alternatives you've considered
Data could be imported on the same host as the database (as it currently is)

Additional context

  • Track configuration parameters in env file
  • Test for default values for local development
  • Test import process
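
Reading parameters from the environment with development defaults could be sketched as below. The variable names `MONGODB_HOST`, `MONGODB_PORT` and `MONGODB_NAME`, the default database name and the helper `get_db_settings` are assumptions made for illustration; 27017 is the default MongoDB port.

```python
import os


def get_db_settings(environ=os.environ):
    """Read connection parameters, falling back to development defaults."""
    return {
        "host": environ.get("MONGODB_HOST", "localhost"),
        "port": int(environ.get("MONGODB_PORT", "27017")),
        "name": environ.get("MONGODB_NAME", "smarter"),
    }


# with no variables set, local development defaults are returned
defaults = get_db_settings({})
```

Passing the mapping as an argument keeps the function easy to test without touching the real environment.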

:sparkles: import data from dbSNP

Is your feature request related to a problem? Please describe.
Import data from dbSNP

Describe the solution you'd like
Need to insert last dbSNP information in database. Should check if I could find info for the chips that are actually supported

Describe alternatives you've considered
Those data are not so recent, since dbSNP hasn't been updated since 2017. Reference positions need to be more recent

Additional context

  • Fix linter issues
  • Refactor src.features.dbsnp module
  • Load data from dbSNP152

:sparkles: track `ALT` and `REF` alleles in locations

Is your feature request related to a problem? Please describe.
Tracking ALT and REF alleles for each location is required to convert data into VCF format

Describe the solution you'd like
ALT and REF alleles should be determined from genome alignments and then tracked for each Location object

Describe alternatives you've considered
The sheep genome consortium tracks genotypes with REF/ALT convention, data can be imported from here. However, all the SNPs not managed by them (ie Affymetrix) cannot have these attributes

Additional context
Please see comments in #18 for import_consortium.py

  • Add ALT and REF class attributes to src.features.smarterdb.Location
  • Track latest data from Sheep genome consortium

:white_check_mark: check goat coordinates

Is your feature request related to a problem? Please describe.
Check goat coordinates as described by email

Describe the solution you'd like
Check and fix chromosome names according to http://www.goatgenome.org/data/capri4dbsnp-base-CHI-ARS-OAR-UMD.csv and http://www.goatgenome.org/data/Goat_IGGC_65K_v2_A2_chr_mapinfo_Aux_file.txt

Describe alternatives you've considered
The data which comes from manifest seems to be almost correct, however there are SNPs which are annotated on X chromosomes while they are in a scaffold

Additional context

The probes have been mapped on ARS1 assembly with the following naming of chromosomes/scaffolds:

    1 to 29
    MT
    X.1 X.2 for scaffold assigned to X in the auxiliary and csv files, X in the bpm file
    Y.1 to Y.11 for scaffolds assigned to Y in the auxiliary and csv files, Y in the bpm file
    0.scaffold_name…for unplaced scaffold (example: 0.NW_017191871.1) in the auxiliary and csv files, 0.NW_scaffold in the bpm file.

For probes either unlocalized or with multiple localizations or not confirmed as belonging to X or Y scaffolds (after studying the genotypes), localization was set to null.

  • check goat chromosomes with aux file
  • check chromosomes in Variants locations: mind to scaffold, null, and non-autosomal
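
The naming scheme quoted above (as used in the auxiliary and csv files) can be turned into a small classifier. This is a sketch: the category labels and the function name `classify_chrom` are hypothetical, and a missing value is treated as the null localization.

```python
def classify_chrom(name):
    """Classify a chromosome name from the auxiliary/csv files."""
    if name is None or name == "":
        return "unlocalized"
    if name == "MT":
        return "mitochondrial"
    if name.startswith("0."):
        return "unplaced"
    if name.startswith("X."):
        return "X scaffold"
    if name.startswith("Y."):
        return "Y scaffold"
    if name.isdigit() and 1 <= int(name) <= 29:
        return "autosome"
    return "unknown"


# the unplaced scaffold example given above
label = classify_chrom("0.NW_017191871.1")
```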

:sparkles: add WGS background data for Goat

Is your feature request related to a problem? Please describe.
Add WGS background data for goat animals coming from Berihulay et al. 2019

Describe the solution you'd like
Import WGS background data like we did for iSheep data

Describe alternatives you've considered
Alternative could be different datasets or ignoring WGS data for Goats

Additional context

  • Check for ARS1 assembly (search SNPs relying on information in database)
  • Extract SNPs relying on probe position
  • Create metadata according to publication and Supplementary Information
  • Convert into illumina top and import into database

:sparkles: track constants and status in database

Is your feature request related to a problem? Please describe.
Default attributes should be tracked in database, with information on releases and status

Describe the solution you'd like
Database should be coupled with library version. Tracking defaults in database could be useful for SMARTER-backend

Describe alternatives you've considered
Defaults and magic numbers could be tracked in scripts, however that information would need to be duplicated in SMARTER-backend

Additional context

  • track database version into a collection
  • track defaults into a collections (ie assembly versions)
  • illumina_top attribute should be checked during insertion and written into the database (moved in #37)
  • track dates when importing snps from manifest
  • track dates when importing snps from snpchimp (moved in #37)
  • track type (foreground, background) within samples

:sparkles: import data from isheep database

Is your feature request related to a problem? Please describe.
SMARTER database can be integrated with the isheep database, which contains data from CAS partner

Describe the solution you'd like
All genotypes from isheep database can be downloaded in VCF format (used both for chips and WGS). Convert data into illumina top then add genotypes into database

Describe alternatives you've considered
There are no alternatives if we need to add CAS data into SMARTER dataset

Additional context

  • Track species in SampleSpecies species field
  • Support custom species during sample creation
  • Fix species for ADAPTMAP dataset
  • Fix metadata table
  • Create isheep samples
  • update genomic coordinates for OAR4 assembly
  • update rs_id if possible
  • convert VCF into plink format
  • import data from chips and WGS

:sparkles: convert genotypes from top to forward

Is your feature request related to a problem? Please describe.
It would be nice to convert genotypes into forward coding format. This would be useful to transform data into other formats like VCF

Describe the solution you'd like
The ideal solution will be working with datasets separately, by converting genotypes in temporary files (and folders)

Describe alternatives you've considered
An alternative would be working on the final dataset and doing the conversion using the SNPconvert.py tool. This would be easier since the data are in the same format and fewer changes are required to maintain this project

Additional context
Ideally, this task should:

  • Convert from illumina to forward in src.features.smarterdb.Location
  • Add a _to_forward function in src.features.plinkio.SmarterMixin
  • Support src and dst coding in src.features.plinkio.SmarterMixin._process_genotypes , _process_pedline and update_pedfile
  • Convert into forward using SNPconvert.py
  • Add a script to transform PLINK into VCF and fix allele references using bcftools
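
As a minimal sketch of the allele-level step behind a top-to-forward conversion, the snippet below complements a genotype when the TOP strand of a SNP lies on the reverse strand of the assembly. The function name `top_to_forward` and the `top_on_forward` flag are hypothetical and are not part of `src.features.plinkio.SmarterMixin`; in practice the orientation would come from the stored location of each SNP.

```python
# Hypothetical helper: names and signature are assumptions for illustration only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "0": "0", "-": "-"}


def complement(allele):
    """Return the complementary base; missing codes ('0', '-') are unchanged."""
    return COMPLEMENT[allele.upper()]


def top_to_forward(genotype, top_on_forward):
    """Convert a genotype from illumina TOP to forward coding.

    genotype: pair of alleles in TOP coding, e.g. ("A", "G")
    top_on_forward: True if the TOP strand of this SNP matches the forward
        strand of the assembly (assumed to be known for each SNP)
    """
    if top_on_forward:
        # TOP and forward coincide: nothing to change
        return tuple(genotype)
    # TOP lies on the reverse strand: complement each allele
    return tuple(complement(allele) for allele in genotype)


print(top_to_forward(("A", "G"), top_on_forward=False))
```

Missing genotypes coded as `0` pass through unchanged, so the same function can be applied to every column of a ped line.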

:sparkles: track affymetrix data

Is your feature request related to a problem? Please describe.
Refactor the variants collection to track information from affymetrix

Describe the solution you'd like
The affymetrix ID should be a key in variant collections. Moreover, we need to track dates in locations (when the chip was updated)

Describe alternatives you've considered
Affymetrix data could be stored in other collections; however, it would be difficult to translate or search for a SNP between chips

Additional context

  • model affymetrix data in variant collections
  • load data from affymetrix manifest
  • calculate illumina_top from affymetrix sequence
  • check genotypes between illumina and affymetrix
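
The sketch below shows one way the TOP/BOT designation could be derived from a probe sequence written as `FLANK[A/B]FLANK`, following the published Illumina rule: unambiguous SNPs containing A are TOP and those containing T are BOT, while A/T and C/G SNPs are resolved by walking outwards until a flanking pair has an A or T on one side and a C or G on the other. The function names and the input format are assumptions for illustration, not the project's actual implementation.

```python
import re

# Hypothetical helpers: names and the 'FLANK[A/B]FLANK' input are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
SNP_PATTERN = re.compile(r"([ACGTN]*)\[([ACGT])/([ACGT])\]([ACGTN]*)")


def illumina_strand(sequence):
    """Return 'TOP' or 'BOT' for a sequence such as 'ACGT[A/G]TTGC'."""
    match = SNP_PATTERN.fullmatch(sequence.upper())
    if match is None:
        raise ValueError(f"Unsupported sequence: {sequence}")
    left, allele1, allele2, right = match.groups()
    alleles = {allele1, allele2}

    # Unambiguous SNPs: the strand depends only on the alleles
    if alleles in ({"A", "C"}, {"A", "G"}):
        return "TOP"
    if alleles in ({"T", "C"}, {"T", "G"}):
        return "BOT"

    # Ambiguous SNPs (A/T or C/G): walk outwards from the SNP
    for five_prime, three_prime in zip(reversed(left), right):
        if five_prime in "AT" and three_prime in "CG":
            return "TOP"
        if five_prime in "CG" and three_prime in "AT":
            return "BOT"
    raise ValueError(f"Cannot determine strand for: {sequence}")


def top_alleles(sequence):
    """Return the two alleles expressed on the TOP strand."""
    match = SNP_PATTERN.fullmatch(sequence.upper())
    strand = illumina_strand(sequence)
    allele1, allele2 = match.group(2), match.group(3)
    if strand == "BOT":
        # BOT alleles are the complement of the TOP alleles
        return allele1.translate(COMPLEMENT), allele2.translate(COMPLEMENT)
    return allele1, allele2


print(illumina_strand("ACGT[T/C]TTGC"), top_alleles("ACGT[T/C]TTGC"))
```

Comparing the alleles returned here with the `illumina_top` value already stored for the matching Illumina SNP would be one way to approach the genotype check listed above.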

:sparkles: add SNPconvert.py to main src/data scripts

Is your feature request related to a problem? Please describe.
SNPconvert.py is a script based on SMARTER-database libraries which can convert genomic coordinates of datasets without importing them into SMARTER-database. It should be managed and maintained with SMARTER-database main scripts

Describe the solution you'd like
Support and maintain SNPconvert.py in main branch

Describe alternatives you've considered
Even if this script can be used as it is, it will be easier to maintain all software within SMARTER-database development

Additional context

  • Add integration test for SNPconvert.py
  • Support SNPconvert.py in main documentation

:zap: enhance genotype conversion

Is your feature request related to a problem? Please describe.
SNP genotype conversion is very slow. For example, if the original file is already in plink binary format, converting it takes too long because temporary plink text files must be created. Converting the whole dataset into FORWARD (see #111) generates huge temporary files. Converting from binary to text takes time because the information for the same sample is column based in the binary format and row based in the text format.

Describe the solution you'd like
Ideally, data conversion should be done with binary formats (not text). If it is possible to change only the alleles in the .bim file of a plink binary fileset, that approach should be used.

Describe alternatives you've considered
We have worked with text files until now, and it works. However, it is very inefficient.

Additional context

  • understand how plinkio deals with writing files
  • evaluate plink transposed binary format
  • evaluate updating only the alleles in the *.bim file for binary format
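
A minimal sketch of the last option, assuming the standard six-column .bim layout (chromosome, variant ID, genetic distance, position, allele 1, allele 2). Because the .bed file encodes genotypes relative to the two allele columns, rewriting both alleles of a SNP in place keeps the binary genotypes valid. The helper names and the set of SNP IDs to complement are hypothetical.

```python
# Hypothetical helpers: names and the 'reverse_ids' input are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_reverse(reverse_ids):
    """Build a recoding function that complements alleles of the given SNPs."""
    def recode(snp_id, allele1, allele2):
        if snp_id in reverse_ids:
            # '0' (missing) is not in the table, so it is left untouched
            return allele1.translate(COMPLEMENT), allele2.translate(COMPLEMENT)
        return allele1, allele2
    return recode


def recode_bim(lines, recode):
    """Return new .bim lines with the two allele columns rewritten.

    Column order is preserved, so genotypes in the .bed file still refer
    to the same first and second allele.
    """
    output = []
    for line in lines:
        if not line.strip():
            continue
        chrom, snp_id, cm, position, allele1, allele2 = line.split()
        allele1, allele2 = recode(snp_id, allele1, allele2)
        output.append("\t".join([chrom, snp_id, cm, position, allele1, allele2]))
    return output


bim = ["1\tsnp1\t0\t100\tA\tG", "1\tsnp2\t0\t200\tT\tC"]
print(recode_bim(bim, complement_reverse({"snp2"})))
```

Only the small .bim text file is rewritten here; the .bed and .fam files would be copied or linked unchanged, which avoids the binary-to-text round trip described above.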

:sparkles: add Croatian sheep data

Is your feature request related to a problem? Please describe.
Add data from Drzaic et al 2022. The related dataset can be found in dryad

Describe the solution you'd like
Check data and import HD SNPs. Supplementary materials have information about GPS coordinates

Describe alternatives you've considered
An alternative could be not to import this dataset, freezing the database content in its current state

Additional context

  • check assembly
  • ensure these samples are not already in the database
  • fix and import GPS coordinates
  • load data into database

:sparkles: add a collection for country objects

Is your feature request related to a problem? Please describe.
Having a collection of supported countries could help in suggesting items through the portal interface

Describe the solution you'd like
Define a collection of countries and rely on that while managing samples

Describe alternatives you've considered
Countries could be derived with aggregations on the Samples collections. However, this could take more time than modelling distinct countries in a new collection

Additional context

  • check country collections before inserting samples or dataset
  • use the pycountry module to collect information on countries

:bug: check that breed exists while importing phenotypes

Describe the bug
The Malagueña purpose metadata was not correctly inserted into the database. Also, landrace was recorded as landrance in the breed table

To Reproduce
Search for Malagueña phenotype

Expected behavior
The Malagueña should have the purpose phenotype

Additional context

  • Check that breed exists while inserting phenotype data
  • Fix Malagueña and Landrace (goat) breeds
  • Check traditional Arran goat -> should be Aran
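
One possible shape for the breed check, sketched with the standard library only: before a phenotype is attached, the breed name is compared against the known breeds and close matches are suggested for likely misspellings. `check_breed`, `UnknownBreedError` and the list of breed names are hypothetical; in the real scripts the known breeds would come from the database.

```python
import difflib


class UnknownBreedError(ValueError):
    """Raised when a phenotype refers to a breed that is not in the database."""


def check_breed(name, known_breeds):
    """Return the breed name if known, otherwise raise with suggestions.

    known_breeds: list of breed names (assumed to be loaded from the database)
    """
    if name in known_breeds:
        return name
    # Suggest similar names to catch typos such as 'Landrance'
    suggestions = difflib.get_close_matches(name, known_breeds, n=3)
    hint = f" (did you mean: {', '.join(suggestions)}?)" if suggestions else ""
    raise UnknownBreedError(f"Breed '{name}' not found{hint}")


breeds = ["Landrace", "Malagueña", "Aran"]
try:
    check_breed("Landrance", breeds)
except UnknownBreedError as error:
    print(error)
```

Raising instead of silently skipping the record makes a missing or misspelled breed visible at import time.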
