Comments (9)
from etl.
I'm not quite sure what you mean by 'deprecate the content.' Do you mean to push a new commit to the repo with the old scripts deleted and the notebook added in place? That would make sense.
Do we have any protocols worked out for ETL processes in general? I haven't come across anything in our docs. I think ETL code is as critical as any other code if it generates data that may be used in production.
> Do you mean to push a new commit to the repo with the old scripts deleted and the notebook added in place? That would make sense.
Something like that. We should decide whether to use the notebook as the primary artifact or to have a standalone script.
> Do we have any protocols worked out for ETL processes in general? I haven't come across anything in our docs. I think ETL code is as critical as any other code if it generates data that may be used in production.
No, I don't think we have any protocols. Do you have an example of such a protocol? What would you want that protocol to specify?
Our ETL will partially (mostly?) rely on GHC ETL for extracting metadata into BQ tables. The way I see it, our ETL scripts will just shuffle GHC-generated content around.
We most definitely need to keep track of the code used to do that shuffling in a repository, and ideally we should have code reviews for such critical pieces of code. That's the basics of the protocol I would suggest.
We can/should discuss at the Friday meeting.
ISB-CGC ETL is migrating to small, task-oriented scripts, one per ETL operation. After trying Jupyter notebooks for ETL, we decided to switch to plain scripts driven by a YAML configuration file that can be archived.
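To make the idea concrete, here is a minimal sketch of what "small task scripts driven by an archivable configuration" could look like. It is not ISB-CGC's actual code: the task names (`extract`, `rename_column`), the `TASKS` registry, and the config layout are all hypothetical. To avoid a third-party dependency, a dict literal stands in for the parsed YAML file; with PyYAML installed, the same structure could presumably be obtained via `yaml.safe_load`.

```python
import hashlib
import json
from datetime import datetime, timezone


# Hypothetical task functions; each performs one small ETL operation
# on a shared state dict and returns it.
def extract(params, state):
    state["rows"] = list(params["rows"])
    return state


def rename_column(params, state):
    old, new = params["old"], params["new"]
    state["rows"] = [
        {(new if key == old else key): value for key, value in row.items()}
        for row in state["rows"]
    ]
    return state


# Registry mapping the task names used in the config to functions.
TASKS = {"extract": extract, "rename_column": rename_column}


def run_pipeline(config):
    """Run each configured step in order; return (state, archive_record)."""
    state = {}
    for step in config["steps"]:
        state = TASKS[step["task"]](step.get("params", {}), state)
    # Serialize deterministically so identical configs hash identically.
    serialized = json.dumps(config, sort_keys=True)
    record = {
        "config": config,
        "config_sha256": hashlib.sha256(serialized.encode("utf-8")).hexdigest(),
        "ran_at": datetime.now(timezone.utc).isoformat(),
    }
    return state, record


# Stand-in for a parsed YAML configuration file (assumed layout).
CONFIG = {
    "steps": [
        {"task": "extract", "params": {"rows": [{"PatientID": "p1"}]}},
        {
            "task": "rename_column",
            "params": {"old": "PatientID", "new": "dicom_patient_id"},
        },
    ]
}
```

Calling `run_pipeline(CONFIG)` returns the transformed rows plus a record containing the config, its hash, and a timestamp. Storing that record next to the output is one way the configuration could be archived for later reference.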
I look forward to learning more details about this so I can understand what it means.
Along with keeping scripts in a repo, I would think we would also want to keep logs of every ETL process, i.e., which scripts were applied to which table on which date, etc. Unfortunately, Bill will be away for today's meeting.
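As a rough illustration of such a log, the sketch below appends one JSON line per ETL run, recording the script, the target table, and the date. The function names, the field names, and the optional `commit` field are assumptions made for this example, not an existing convention in this repo.

```python
import json
from datetime import datetime, timezone


def log_etl_run(log_path, script, table, commit=None, when=None):
    """Append one record: which script was applied to which table, and when."""
    when = when or datetime.now(timezone.utc)
    entry = {
        "script": script,
        "table": table,
        "commit": commit,  # optional repo revision of the script
        "applied_at": when.isoformat(),
    }
    # Append-only JSON Lines file: one run per line, never rewritten.
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_etl_log(log_path):
    """Return all logged runs in the order they were written."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

An append-only file keeps the history simple to audit. The same records could just as well be written to a dedicated BQ table if that fits the workflow better.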
On a related note, can we filter out and keep in a log forever all write queries for every table we maintain?
This is no longer relevant at this point.