thehyve / tmtk

tranSMART Arborist ETL toolkit

Home Page: https://pypi.org/project/tmtk/

License: GNU Lesser General Public License v3.0

Python 59.89% R 0.39% JavaScript 18.96% CSS 19.64% HTML 1.12%

data-modeling data-curation transmart jupyter-notebook

tmtk's Introduction

tmtk

A toolkit for ETL curation for the tranSMART data warehouse. The tranSMART curation toolkit (tmtk) can be used to edit and validate studies before loading them with transmart-batch.

For general documentation visit readthedocs.

Installation

Installing via Anaconda Cloud or Pip package managers

Anaconda:

conda install -c conda-forge tmtk

Pip:

pip install tmtk

Installing manually

Initialize a virtualenv

python3 -m venv env
source env/bin/activate

Installation from source

To install tmtk and all dependencies into your Python environment, and enable the Arborist Jupyter notebook extension, run:

pip install -r requirements.txt
python setup.py install

Or, to run the tool from source in development mode:

pip install -r requirements.txt
python setup.py develop
jupyter-nbextension install --py tmtk.arborist
jupyter-serverextension enable tmtk.arborist

Requirements

The dependencies are listed in requirements.txt; optional dependencies are listed in requirements-dev.txt.

Licence

LGPL-3.0

tmtk's People

Contributors

Stargazers

Watchers

Forkers

jochemb brendahijmans rubyaryat ahmad-abdellatif

tmtk's Issues

validate_all gives 'truth value of array' error when multiple rows per patient

When a study has multiple rows per patient per concept, validation gives a CRITICAL error, but the message is obscured by a ValueError (the truth value of an array is ambiguous).

This should be an optional error, raised only when validating for pre-17.1.
The real error text should be shown, not obscured by this message.
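
The ValueError quoted above is NumPy's standard complaint when an array with more than one element is used where a single boolean is expected. The report does not show where tmtk hits this, so the sketch below only reproduces the general failure mode and one way around it; the variable and function names are hypothetical.

```python
import numpy as np

# Hypothetical values: two rows for the same patient and concept.
values = np.array([120, 135])

def is_ambiguous(arr):
    """Return True if using `arr == 120` as a condition raises ValueError."""
    try:
        if arr == 120:  # elementwise comparison yields an array, not a bool
            pass
    except ValueError:
        return True
    return False

# Reducing explicitly with .any() (or .all()) gives a single boolean.
has_match = bool((values == 120).any())
```

A validator that may receive several values per patient can reduce the comparison explicitly before branching on it, which leaves room to report the intended message instead.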

Saving study objects always includes modifiers, ontology_mapping and trial_visits

The current assumption introduces a breaking change when loading with transmart-batch: transmart-batch does not recognize these files, yet they are added to the clinical.params file.

Validate subject_sample_mapping study id

Change variable id from string to tuple

The function Clinical.get_variable() requires a string of the format 'filename__column_number'.
I would suggest changing this into a tuple. A tuple seems a more natural key for two components than a newly formatted ID string.
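
A sketch of the difference, assuming the string ID joins a file name and a column number with a double underscore. `VariableId` and `parse_string_id` are hypothetical names used for illustration and are not part of tmtk.

```python
from typing import NamedTuple

class VariableId(NamedTuple):
    """Hypothetical two-part key: data file name plus column number."""
    filename: str
    column: int

def parse_string_id(var_id: str) -> VariableId:
    """Split a 'filename__column' string ID on its last double underscore."""
    filename, column = var_id.rsplit("__", 1)
    return VariableId(filename, int(column))

# A tuple key needs no parsing and is not confused by '__' inside a file name.
variables = {VariableId("clinical__data.txt", 3): "Age"}
key = parse_string_id("clinical__data.txt__3")
```

Because a named tuple compares and hashes like a plain tuple, `variables[("clinical__data.txt", 3)]` would find the same entry.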

Validate on Category CD + Datalabel length

Check the combined string length of Category CD + data label + top node (study name plus the private/public studies prefix) and, ideally, the categorical value. This cannot be longer than 700 characters.
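
Such a check could look roughly like the sketch below. The 700-character limit comes from the text above; joining the parts with a backslash is an assumption about how the final path is built, and the function names are hypothetical.

```python
MAX_PATH_LENGTH = 700  # limit stated in the issue

def concept_path_length(top_node, category_cd, data_label, value=""):
    """Length of the joined path; the backslash joining is an assumption."""
    parts = [p for p in (top_node, category_cd, data_label, value) if p]
    return len("\\".join(parts))

def path_too_long(top_node, category_cd, data_label, value=""):
    """True when the joined path exceeds MAX_PATH_LENGTH characters."""
    length = concept_path_length(top_node, category_cd, data_label, value)
    return length > MAX_PATH_LENGTH
```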

Attribute Error when trying to use SkinnyExport on its own

SkinnyExport_Jupyter.pdf

I need to transform existing transmart-batch files into transmart-copy format, so I tried to use the SkinnyExport tool in Jupyter. However, I can't get the sample study to work with the sample code from export_to_skinny.py.
All steps and the error message are in the attached file.

If wordmap file is empty call_boris() returns an error

StopIteration Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
1196 try:
-> 1197 data = self._reader.read(nrows)
1198 except StopIteration:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8629)()

StopIteration:

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
in ()
----> 1 zero_study.call_boris()

/Users/wibopipping/tools/tmtk/tmtk/study.py in call_boris(self)
137
138 def call_boris(self):
--> 139 arborist.call_boris(self)
140
141 @property

/Users/wibopipping/tools/tmtk/tmtk/arborist/common.py in call_boris(to_be_shuffled)
51 raise utils.ClassError(type(to_be_shuffled, 'pd.DataFrame, tmtk.Clinical or tmtk.Study'))
52
---> 53 json_data = create_concept_tree(to_be_shuffled)
54
55 json_data = launch_arborist_gui(json_data) # Returns modified json_data

/Users/wibopipping/tools/tmtk/tmtk/arborist/jstreecontrol.py in create_concept_tree(column_object)
25 """
26 if isinstance(column_object, tmtk.Study):
---> 27 concept_tree = create_tree_from_study(column_object)
28
29 elif isinstance(column_object, pd.DataFrame):

/Users/wibopipping/tools/tmtk/tmtk/arborist/jstreecontrol.py in create_tree_from_study(study_object, concept_tree)
49 concept_tree = ConceptTree()
50
---> 51 concept_tree = create_tree_from_clinical(study_object.Clinical, concept_tree)
52
53 for map_file in study_object.subject_sample_mappings:

/Users/wibopipping/tools/tmtk/tmtk/arborist/jstreecontrol.py in create_tree_from_clinical(clinical_object, concept_tree)
80 data_args = variable.column_map_data
81 concept_path = variable.concept_path
---> 82 categories = variable.word_map_dict if not variable.is_numeric else {}
83
84 # Add filename to SUBJ_ID, this is a work around for unique path constraint.

/Users/wibopipping/tools/tmtk/tmtk/clinical/Variable.py in is_numeric(self)
58 @property
59 def is_numeric(self):
---> 60 return utils.is_numeric(self.mapped_values)
61
62 @property

/Users/wibopipping/tools/tmtk/tmtk/clinical/Variable.py in mapped_values(self)
93 @property
94 def mapped_values(self):
---> 95 return [v for k, v in self.word_map_dict.items()]
96
97 def validate(self, verbosity=2):

/Users/wibopipping/tools/tmtk/tmtk/clinical/Variable.py in word_map_dict(self)
88 :return: dict
89 """
---> 90 word_map = self.parent.WordMapping.get_word_map(self.id_)
91 return {v: word_map.get(v, v) for v in self.unique_values}
92

/Users/wibopipping/tools/tmtk/tmtk/clinical/WordMapping.py in get_word_map(self, var_id)
34
35 filename, column = var_id.rsplit('__', 1)
---> 36 f = self.df.ix[:, 0].astype(str) == filename
37 c = self.df.ix[:, 1].astype(str) == column
38 if sum(f & c):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/Werkzeug-0.11.9-py3.5.egg/werkzeug/utils.py in __get__(self, obj, type)
71 value = obj.__dict__.get(self.__name__, _missing)
72 if value is _missing:
---> 73 value = self.func(obj)
74 obj.__dict__[self.__name__] = value
75 return value

/Users/wibopipping/tools/tmtk/tmtk/utils/filebase.py in df(self)
13 def df(self):
14 if self.path and os.path.exists(self.path):
---> 15 df, self._hash_init = file2df(self.path, hashed=True)
16 else:
17 CPrint.warn("No dataframe found on disk for {}, creating.".format(self))

/Users/wibopipping/tools/tmtk/tmtk/utils/Generic.py in file2df(path, hashed)
58 df = pd.read_table(path,
59 sep='\t',
---> 60 dtype=object)
61 if hashed:
62 hash_value = hash(df.__bytes__())

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
283 return parser
284
--> 285 return parser.read()
286
287 _parser_defaults = {

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
745 raise ValueError('skip_footer not supported for iteration')
746
--> 747 ret = self._engine.read(nrows)
748
749 if self.options.get('as_recarray'):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
1202 self.index_col,
1203 self.index_names,
-> 1204 dtype=self.kwds.get('dtype'))
1205 else:
1206 raise

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in _get_empty_meta(columns, index_col, index_names, dtype)
2257 # Convert column indexes to column names.
2258 dtype = dict((columns[k] if com.is_integer(k) else k, v)
-> 2259 for k, v in compat.iteritems(dtype))
2260
2261 if index_col is None or index_col is False:

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/compat/__init__.py in iteritems(obj, **kwargs)
132 func = getattr(obj, "iteritems", None)
133 if not func:
--> 134 func = obj.items
135 return func(**kwargs)
136

AttributeError: type object 'object' has no attribute 'items'
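
One defensive pattern for files without content is sketched below. It assumes a recent pandas, where reading a zero-byte file raises `pandas.errors.EmptyDataError` and a header-only file simply yields an empty frame. `read_tsv_or_empty` is a hypothetical helper, not tmtk's `file2df`, and the column names are placeholders.

```python
import os
import tempfile

import pandas as pd

def read_tsv_or_empty(path, columns):
    """Read a tab-separated file; fall back to an empty frame if it has no content."""
    try:
        return pd.read_csv(path, sep="\t", dtype=object)
    except pd.errors.EmptyDataError:
        # Zero-byte file: return an empty table with the expected columns.
        return pd.DataFrame(columns=columns, dtype=object)

# Usage with a throwaway zero-byte file.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    empty_path = handle.name
word_map = read_tsv_or_empty(empty_path, ["Filename", "Column Number", "Original", "New"])
os.remove(empty_path)
```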

disable editing filename and column number in gui

Best to limit flexibility here.

Tag support: Context menu to "Add Metadata" and flexible way to add more tags.

The tag node's standard "data" array has an array called "tags", which has tag_titles as keys and [ tag_description, tag_weight ] as values.

Currently the Arborist GUI only shows these metadata tags. We need:

Create new meta data node in context menu
Button to add new tags to existing metadata node
Any changes made should be put back in the node data array
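
The data layout described above can be sketched as a plain dictionary. The node below is hypothetical (including its "path" key), and `add_tag` and `tag_rows` are illustrative helpers rather than Arborist code.

```python
# Hypothetical node: data["tags"] maps tag title -> [description, weight].
node = {
    "path": "\\Study\\Demographics\\Age",
    "data": {"tags": {"Unit": ["years", 3], "Source": ["questionnaire", 5]}},
}

def add_tag(node, title, description, weight):
    """Add or overwrite one tag and keep it in the node's data."""
    node["data"].setdefault("tags", {})[title] = [description, weight]

def tag_rows(node):
    """Flatten a node's tags into (path, title, description, weight) rows."""
    tags = node["data"].get("tags", {})
    return [(node["path"], title, desc, weight)
            for title, (desc, weight) in tags.items()]
```

Writing every change straight back into the node's "data" means the tag table can always be rebuilt from the tree alone.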

tmtk expects a wordmapping to load the study

tmtk fails to load the study if no wordmapping file is specified in the clinical.params file.

Make SUBJ_ID special case

Allow duplicates so that they do not disappear when multiple SUBJ_ID nodes are put in the same folder.
Also, remove children.

Parse date fields to be in correct format

START_DATE and END_DATE fields cause an error when loading with transmart-copy.

Insert into i2b2demodata.observation_fact   4% │███████▌ │ 50/1241 (0:00:01 / 0:00:25)
2018-03-15 11:16:00,278 [ERROR] Error processing row 52 of i2b2demodata/observation_fact.tsv: Text '2005-11-15' could not be parsed at index 10
2018-03-15 11:16:00,293 [ERROR] Text '2005-11-15' could not be parsed at index 10

Add validation to START_DATE and END_DATE fields
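
The parse failure at index 10 falls just past the end of '2005-11-15', which suggests (but does not prove) that the loader expected a time component after the date. A minimal normalization sketch under that assumption follows; the target format and the list of accepted input formats are guesses to adapt, not documented transmart-copy requirements.

```python
from datetime import datetime

# Assumption: the loader wants a full timestamp; adjust TARGET_FORMAT as needed.
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"
ACCEPTED_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")

def normalize_date(text):
    """Return `text` in TARGET_FORMAT, or raise ValueError if it cannot be parsed."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue
    raise ValueError("Unrecognized date value: {!r}".format(text))
```

The same function can double as a validator: any START_DATE or END_DATE value that raises ValueError would be reported before loading.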

tmtk call_boris() gives error when save and return to jupyter if no changes

ValueError Traceback (most recent call last)
in ()
----> 1 zero_study.call_boris()

/Users/wibopipping/tools/tmtk/tmtk/study.py in call_boris(self)
137
138 def call_boris(self):
--> 139 arborist.call_boris(self)
140
141 @property

/Users/wibopipping/tools/tmtk/tmtk/arborist/common.py in call_boris(to_be_shuffled)
56
57 if isinstance(to_be_shuffled, tmtk.Study):
---> 58 update_study_from_json(to_be_shuffled, json_data=json_data)
59 elif isinstance(to_be_shuffled, tmtk.Clinical):
60 update_clinical_from_json(to_be_shuffled, json_data=json_data)

/Users/wibopipping/tools/tmtk/tmtk/arborist/common.py in update_study_from_json(study, json_data)
132 study.Clinical.ColumnMapping.df = concept_tree.column_mapping_file
133 study.Clinical.WordMapping.df = concept_tree.word_mapping
--> 134 study.Tags.df = concept_tree.tags_file
135
136 high_dim_paths = concept_tree.high_dim_paths

/Users/wibopipping/tools/tmtk/tmtk/arborist/jstreecontrol.py in tags_file(self)
235 # This reduces the nested dictionary to a flat one.
236 flat_mapping = [row for nest_list in all_mappings for row in nest_list]
--> 237 df = pd.concat([pd.Series(row) for row in flat_mapping], axis=1).T
238
239 df.columns = ['Concept Path', 'Title', 'Description', 'Weight']

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/tools/merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
810 keys=keys, levels=levels, names=names,
811 verify_integrity=verify_integrity,
--> 812 copy=copy)
813 return op.get_result()
814

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
843
844 if len(objs) == 0:
--> 845 raise ValueError('No objects to concatenate')
846
847 if keys is None:

ValueError: No objects to concatenate
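
`pd.concat` raises this error whenever it receives an empty list, which is what happens here when there are no tag rows. One way to sidestep it, sketched below, is to build the frame directly from the row list. The column names are taken from the traceback above; `tags_frame` is a hypothetical stand-in for the `tags_file` property.

```python
import pandas as pd

TAG_COLUMNS = ["Concept Path", "Title", "Description", "Weight"]

def tags_frame(rows):
    """Build the tags table; an empty list yields an empty, correctly labeled frame."""
    return pd.DataFrame(list(rows), columns=TAG_COLUMNS)

empty = tags_frame([])
filled = tags_frame([("\\Study\\Age", "Unit", "years", 3)])
```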

Smart reloading of study

Could be lazy loading (JIT) and/or using meta data (or md5hash) to decide what to reload into memory.
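
The md5-based variant could be sketched as follows. `ReloadTracker` is a hypothetical helper and not an existing tmtk class; it only decides whether a file changed since it was last seen and leaves the actual loading to the caller.

```python
import hashlib

def file_md5(path, chunk_size=65536):
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

class ReloadTracker:
    """Remember the last seen digest per path and report when a reload is needed."""

    def __init__(self):
        self._seen = {}

    def needs_reload(self, path):
        current = file_md5(path)
        changed = self._seen.get(path) != current
        self._seen[path] = current
        return changed
```

Hashing still reads the whole file, so for very large files a cheaper first check on modification time and size may be worth considering.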

Clinical class definition out of __init__

The __init__ file should be reserved for package management (like __all__) and doesn't seem to be a natural place for a class definition.

+ in data label value

When a word mapping introduces a + into a value, the Arborist turns the + into a subfolder.

Example: word mapping value: 3 --> L+R
Expected behaviour is that the arborist shows a data value of L+R

Implement save option for study

There is no need to read everything from disk when saving the study to the same location. Use df_has_changed to determine whether an object should be written.

Use write_to as base with the changed assumption that it saves to the exact same location.
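
A rough sketch of the idea, using a stand-in class rather than tmtk's own file objects. `TrackedTable` is hypothetical, and it detects changes by comparing against a stored copy, which is an assumption; only the `df_has_changed` name is taken from the text above.

```python
import pandas as pd

class TrackedTable:
    """Hypothetical wrapper that writes its dataframe only when it has changed."""

    def __init__(self, df):
        self.df = df
        self._saved_state = df.copy()  # snapshot of the last saved content

    @property
    def df_has_changed(self):
        return not self.df.equals(self._saved_state)

    def save(self, path):
        """Write to `path` if needed; return True when a write happened."""
        if not self.df_has_changed:
            return False
        self.df.to_csv(path, sep="\t", index=False)
        self._saved_state = self.df.copy()
        return True
```

Keeping a full copy doubles memory use; storing a hash of the saved content instead is the usual trade-off.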

Updating security_required after top_node resets the top_node

First updating the top_node of a study object and then updating the security_required field resets the top_node field to '\Private studies'

VariableCollection could use query/filter function

Currently the class has a .get function, which requires a variable ID string.
It could use a better search function. Possible options are an attribute-value filter or a data label method.
For usability, this could also raise an error if nothing is found instead of returning None.
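
An attribute-value filter along these lines could look like the sketch below. `filter_variables` is a hypothetical function, and the stand-in objects only mimic variables with a few attributes; the attribute names are illustrative.

```python
from types import SimpleNamespace

def filter_variables(variables, **criteria):
    """Return variables whose attributes equal all given criteria.

    Raises LookupError instead of returning None when nothing matches.
    """
    matches = [
        var for var in variables
        if all(getattr(var, name, None) == wanted
               for name, wanted in criteria.items())
    ]
    if not matches:
        raise LookupError("No variable matches {}".format(criteria))
    return matches

# Stand-in objects for illustration only.
variables = [
    SimpleNamespace(data_label="Age", is_numeric=True),
    SimpleNamespace(data_label="Gender", is_numeric=False),
]
numeric = filter_variables(variables, is_numeric=True)
```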