maestro's People

Contributors

andricdu, blabadi, buwujiu, dependabot[bot], henro001, justincorrigible, kevinfhartmann, leoraba, lepsalex, lrivera-oicr, rosibaj, rtisma, yalturmes

maestro's Issues

Implement Remove Analysis function for analysis_centric index

Right now, Maestro only supports removing ES documents when removal of an analysis is requested (either through the REST endpoint /index/repository/{repositoryCode}/study/{studyId}/analysis/{analysisId}
or by listening to Kafka topics).

Requirement:

  1. Extend and refactor the removeAnalysis method of DefaultIndexer so that removing an analysis can remove either file documents or analysis documents.

Record-Type Indexing Configuration

Users should have the ability to define indexing rules from SONG.
Rules should be based on any field or combination of fields found in SONG. E.g., this could be by:

  • Study
  • Analysis Type
  • Data

Rules for configuration should be on an INCLUDE or EXCLUDE basis. E.g.:

  • INCLUDE in the index only records with status = published
  • EXCLUDE records with analysis_type = vcf
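
A minimal sketch of how INCLUDE / EXCLUDE rules like the two examples could be evaluated. Everything here is hypothetical (the class name IndexingRules, the Rule type, and the flat field-to-value record); it also assumes a record is indexed only when it matches every INCLUDE rule and no EXCLUDE rule, which the issue does not specify.

```java
import java.util.List;
import java.util.Map;

public class IndexingRules {
    // Rule kinds taken from the INCLUDE / EXCLUDE wording of the issue.
    enum Mode { INCLUDE, EXCLUDE }

    // Hypothetical rule: a field name, the value to match, and what a match means.
    static final class Rule {
        final Mode mode;
        final String field;
        final String value;

        Rule(Mode mode, String field, String value) {
            this.mode = mode;
            this.field = field;
            this.value = value;
        }

        // A missing field never matches.
        boolean matches(Map<String, String> record) {
            return value.equals(record.get(field));
        }
    }

    // Assumption: index a record only if all INCLUDE rules match and no EXCLUDE rule matches.
    static boolean shouldIndex(Map<String, String> record, List<Rule> rules) {
        for (Rule rule : rules) {
            boolean hit = rule.matches(record);
            if (rule.mode == Mode.INCLUDE && !hit) return false;
            if (rule.mode == Mode.EXCLUDE && hit) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule(Mode.INCLUDE, "status", "published"),
            new Rule(Mode.EXCLUDE, "analysis_type", "vcf"));

        Map<String, String> bam = Map.of("status", "published", "analysis_type", "bam");
        Map<String, String> vcf = Map.of("status", "published", "analysis_type", "vcf");

        System.out.println(shouldIndex(bam, rules)); // prints true
        System.out.println(shouldIndex(vcf, rules)); // prints false
    }
}
```

Rules on a combination of fields would fall out of this shape by listing several rules together; nested SONG fields would need a path lookup instead of a flat map.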

Analysis of changes to Maestro for Dynamic Song

  • Currently maestro is configured to work with the relational song database
  • What needs to be done to enable it to work with dynamic song?
  • Develop a technical specification to update maestro

Conflict Resolution

What happens when 2 records do not match in 2 repositories?

Expected Behaviour

If the data of the same Analysis-ID does not match in two different song repositories, then:

  1. The analysis document should be deleted from the index.
  2. An alert for data-conflict-resolution should be sent to a message queue recording the repository and both analysis ids (is there a unique SONG record id for each of these as well?)

Criteria to delete a record:
All fields must have exactly the same value, EXCEPT date-time values such as created_date or updated_date and the info fields. If even a single other field does not match, delete the record.
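
The comparison described above could look roughly like the sketch below. It assumes each analysis is available as a flat map of top-level fields and that the ignored fields are exactly created_date, updated_date and info; the class and method names (ConflictCheck, isConflict) are made up for illustration and are not part of Maestro.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ConflictCheck {
    // Top-level fields left out of the comparison (assumed names, taken from the issue text).
    static final Set<String> IGNORED = Set.of("created_date", "updated_date", "info");

    // Copy the document without the ignored fields so the original is not modified.
    static Map<String, Object> comparable(Map<String, Object> doc) {
        Map<String, Object> copy = new HashMap<>(doc);
        copy.keySet().removeAll(IGNORED);
        return copy;
    }

    // Two versions conflict when any remaining field differs or is missing on one side.
    static boolean isConflict(Map<String, Object> a, Map<String, Object> b) {
        return !comparable(a).equals(comparable(b));
    }

    public static void main(String[] args) {
        Map<String, Object> repoA = Map.of(
            "analysis_id", "A1", "state", "PUBLISHED", "updated_date", "2020-01-01");
        Map<String, Object> repoB = Map.of(
            "analysis_id", "A1", "state", "PUBLISHED", "updated_date", "2020-02-02");
        Map<String, Object> repoC = Map.of(
            "analysis_id", "A1", "state", "UNPUBLISHED", "updated_date", "2020-01-01");

        System.out.println(isConflict(repoA, repoB)); // prints false (only a date differs)
        System.out.println(isConflict(repoA, repoC)); // prints true (state differs)
    }
}
```

Date-time or info fields nested deeper in the document are not handled by this sketch; they would need a recursive walk.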

Index repo result only showing one index name

Problem: when indexing multiple indices, the index/repo endpoint returns only one index result, even though both indices have been successfully indexed.

Expected: should return index results for both indices.

Implement index repo to analysis_centric_index

  • Make the existing POST endpoint /index/repository/ index an entire repository to both analysis_centric and file_centric.
  • Add a generic method in Indexer API to allow repository multi-index.

Feature Request: Update file_centric mappings and transformation changes with changes to SONG base schema

List of fields that need to be changed:

  • all submitter id fields need to be "submitter_id"
  • add gender
  • add new specimen fields
  • include analysis type
  • index repository base url

mapping updates in progress with @rosibaj and @blabadi are here:
icgc-argo/argo-metadata-schemas#18

  • only the maestro mapping is relevant
  • review the difference for missing fields

Expected Outcome

We can see in ES (whether through Kibana or another method) that Maestro indexes from a Song into a file_repository index with the new mapping.

Update the analysis_centric mapping + transformation logic

Update the analysis_centric mapping to correctly reflect the Song base schema. Some changes include:

  • all submitter id fields need to be "submitter_id"
  • add gender
  • add new specimen fields
  • include analysis type
  • index repository base url

https://github.com/icgc-argo/argo-metadata-schemas/blob/master/mappings/maestro_analysis_centric.json

Expected Outcome

We can see in ES (whether through Kibana or another method) that Maestro indexes from a Song into an analysis_repository index with the new mapping.

Refactor to support multiple Indices

Maestro is currently targeted toward indexing a single index, file_centric. This means that many of the functions and classes are not prepared to work within the scope of a second index type; they are only programmed to work with a file_centric view.

Thus, we need to adjust the code in maestro to be able to respond to a multi-index structure.

  • refactor DefaultIndexer to remove the duplication of code
  • refactor FileCentricElasticSearchAdapter + AnalysisCentricElasticSearchAdapter to be combined into a single adapter

CI/CD Setup

  • deploy Jenkins to collab from configurations
  • Write Kubernetes deploy scripts

Maestro Dynamic Song Schema Update

Changes needed for Maestro:

Background:
The dynamic schema portion in SONG is called data. Maestro needs to read the data portion in order to index analyses, studies, and repositories.

experiment is a required field; other fields under data are optional depending on the schema.
For example, the data field could be:

"data" : {
  "experiment" : { "myField1": "" },
  "myField2" : {.....},
  "myField3" : {......}
}

To do:

  • update FileCentricAnalysis model to use type : {name, version}
  • update FileCentricAnalysis model, by replacing experiment field with data which will include the dynamic data portion
  • update the getExperiement method in FileCentricDocumentConverter.java to be a getData method
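
As a rough illustration of the model change in the to-do list, the sketch below keeps the dynamic portion as an untyped map and only enforces the required experiment key. DynamicAnalysisSketch and AnalysisType are hypothetical names, not the real FileCentricAnalysis class, and treating version as an integer is an assumption.

```java
import java.util.Map;

public class DynamicAnalysisSketch {
    // Hypothetical value type for the "type : {name, version}" pair.
    static final class AnalysisType {
        final String name;
        final int version; // assumed to be an integer

        AnalysisType(String name, int version) {
            this.name = name;
            this.version = version;
        }
    }

    final AnalysisType type;
    // The dynamic portion stays an untyped map because its shape depends on the schema.
    final Map<String, Object> data;

    DynamicAnalysisSketch(AnalysisType type, Map<String, Object> data) {
        // "experiment" is the only key the issue describes as required.
        if (!data.containsKey("experiment")) {
            throw new IllegalArgumentException("data.experiment is required");
        }
        this.type = type;
        this.data = data;
    }

    // Stands in for the getData method from the to-do list: returns the whole dynamic portion.
    Map<String, Object> getData() {
        return data;
    }

    public static void main(String[] args) {
        Map<String, Object> data = Map.of(
            "experiment", Map.of("myField1", ""),
            "myField2", Map.of());
        DynamicAnalysisSketch analysis =
            new DynamicAnalysisSketch(new AnalysisType("exampleType", 1), data);
        System.out.println(analysis.getData().containsKey("experiment")); // prints true
    }
}
```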

Conflict resolution - part 2

This expands on the work done for conflict resolution; see #25.
The goal of this ticket is to remove the deletion step of the conflict resolution and instead refuse the new conflicting document, while still notifying on failure.

This is based on the discussion with @christinayung: we want this as the MVP to avoid taking on the work needed in Song to handle consistency (deciding and checking the primary file owner, replication process consistency, etc.). That work still needs to be handled and planned for later on, since inconsistency is an issue.

Refactor Maestro so that event processing can respond to multiple index types

We need to update these classes so that their logic handles both file_centric and analysis_centric. Currently there is one topic for all. The event processing logic needs to be updated to process both file_centric and analysis_centric outputs.

  • refactor SongAnalysisStreamListener to use indexAnalysis()
  • refactor IndexingMessagesStreamListener

Deploy ES cluster on ARGO QA env

  • Set up the QA cluster correctly with volumes
  • mimic the prod setup so that we will be ready when the time comes
  • not containerized
  • Dusan has an Ansible playbook ready, 3 VMs

BUG - Error getting study TEST-PR from song dev/qa

When Maestro gets the Song study TEST-PR from song-qa and dev, the following exception is thrown: springframework.core.io.buffer.DataBufferLimitException: Exceeded limit on max bytes to buffer : 262144.

Cause: TEST-PR exceeds the default spring codec max in memory size which is 256KB.

Indexing exclusion configs

Users should have the ability to define indexing rules.

  • For MVP, go with a config file
  • Exclude basis for Study, Donor, File and Sample
  • For every Analysis, do a check against the rules
  • check to see if it matched any of the excluded ids
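
The per-analysis check in the last two bullets might be sketched as follows. The layout is an assumption: excluded ids are grouped by entity type (study, donor, file, sample) as they might be after loading the config file, and each analysis exposes its ids under the same keys. ExclusionCheck and the sample ids are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExclusionCheck {
    // Returns true when any id carried by the analysis appears in the excluded set for its entity type.
    static boolean isExcluded(Map<String, List<String>> analysisIds,
                              Map<String, Set<String>> excludedIds) {
        for (Map.Entry<String, List<String>> entry : analysisIds.entrySet()) {
            // Entity types with no configured exclusions fall back to an empty set.
            Set<String> excluded = excludedIds.getOrDefault(entry.getKey(), Set.of());
            for (String id : entry.getValue()) {
                if (excluded.contains(id)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical exclusion config: one excluded study and one excluded donor.
        Map<String, Set<String>> excluded = Map.of(
            "study", Set.of("STUDY-1"),
            "donor", Set.of("DO1"));

        Map<String, List<String>> skipped = Map.of(
            "study", List.of("STUDY-2"),
            "donor", List.of("DO1", "DO2"));
        Map<String, List<String>> kept = Map.of(
            "study", List.of("STUDY-2"),
            "donor", List.of("DO3"));

        System.out.println(isExcluded(skipped, excluded)); // prints true (donor DO1 is excluded)
        System.out.println(isExcluded(kept, excluded));    // prints false
    }
}
```

An analysis for which isExcluded returns true would simply be skipped before any document is built.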

Event Based Indexing

Users should be able to index discrete units of data, not the whole database each time, configured by an event system

Indexer should be able to listen to multiple event streams

  • Decide between Kafka, RabbitMQ or possibly another if there is something better
    -----> We decided on Kafka
  • Song is already sending message queues

Create Documentation

  • Set up Read the Docs
  • Installation
  • How to set the configs
  • How to hook up with Song & Arranger

Index result not showing index name when index fails

Problem: When indexing an analysis fails, the returned result does not show which index failed (it should be file_centric, analysis_centric, or both). It currently shows "indexName:": null.

Expected results: indexName should be either file_centric_1.0 or analysis_centric_1.0 depending on which index has failed.
