tmtk

A toolkit for ETL curation for the tranSMART data warehouse. The tranSMART curation toolkit (tmtk) can be used to edit and validate studies before loading them with transmart-batch.

For general documentation visit readthedocs.

Installation

Installing via Anaconda Cloud or Pip package managers

Anaconda:

conda install -c conda-forge tmtk

Pip:

pip install tmtk

Installing manually

Initialize a virtualenv

python3 -m venv env
source env/bin/activate

Installation from source

To install tmtk and all dependencies into your Python environment, and enable the Arborist Jupyter notebook extension, run:

pip install -r requirements.txt
python setup.py install

Or, to run the tool from source in development mode:

pip install -r requirements.txt
python setup.py develop
jupyter-nbextension install --py tmtk.arborist
jupyter-serverextension enable tmtk.arborist

Requirements

Dependencies are listed in requirements.txt; optional dependencies are in requirements-dev.txt.

Licence

LGPL-3.0

tmtk's People

Contributors

alepev, brendahijmans, ewelinagr, forus, gijskant, jochemb, spayralbe, wardweistra

tmtk's Issues

Change variable id from string to tuple

The function Clinical.get_variable() requires a string ID of the format 'filename__column_number'.
I would suggest changing this to a tuple, which seems a more natural key for two components than a specially formatted ID string.
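
A minimal sketch of what a tuple key could look like, assuming the string form is split on the last '__' (as the rsplit call in the WordMapping traceback further down suggests). VariableId, parse_variable_id and format_variable_id are hypothetical names, not part of tmtk:

```python
from typing import NamedTuple


class VariableId(NamedTuple):
    """Hypothetical two-part key: data file name plus column number."""
    filename: str
    column: int


def parse_variable_id(var_id: str) -> VariableId:
    """Split a legacy 'filename__column' string into a tuple key."""
    # rsplit on the last '__' keeps any earlier '__' in the file name intact.
    filename, column = var_id.rsplit('__', 1)
    return VariableId(filename, int(column))


def format_variable_id(key: VariableId) -> str:
    """Rebuild the legacy string form, e.g. for backwards compatibility."""
    return '{}__{}'.format(key.filename, key.column)


key = parse_variable_id('my__clinical_data.txt__3')
```

Because a NamedTuple is an ordinary tuple, such a key could be used directly as a dictionary key while still offering named fields.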

Validate on Category CD + Datalabel length

Check the combined string length of Category CD + data label + top node (study name + private/public study) and, ideally, the categorical value. This cannot be longer than 700 characters.
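
Such a check could be sketched roughly as follows. The 700-character limit comes from the description above; the function names and the choice to join the parts with a single backslash are assumptions for illustration, and the real path construction may differ:

```python
MAX_PATH_LENGTH = 700  # limit quoted in the issue description


def concept_path_length(top_node, category_cd, data_label, value=''):
    """Length of the combined path; joining with a backslash is an assumption."""
    parts = [p for p in (top_node, category_cd, data_label, value) if p]
    return len('\\'.join(parts))


def path_too_long(top_node, category_cd, data_label, value=''):
    """True when the combined path exceeds MAX_PATH_LENGTH characters."""
    length = concept_path_length(top_node, category_cd, data_label, value)
    return length > MAX_PATH_LENGTH
```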

If wordmap file is empty call_boris() returns an error


StopIteration Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
1196 try:
-> 1197 data = self._reader.read(nrows)
1198 except StopIteration:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8629)()

StopIteration:

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
in ()
----> 1 zero_study.call_boris()

/Users/wibopipping/tools/tmtk/tmtk/study.py in call_boris(self)
137
138 def call_boris(self):
--> 139 arborist.call_boris(self)
140
141 @property

/Users/wibopipping/tools/tmtk/tmtk/arborist/common.py in call_boris(to_be_shuffled)
51 raise utils.ClassError(type(to_be_shuffled, 'pd.DataFrame, tmtk.Clinical or tmtk.Study'))
52
---> 53 json_data = create_concept_tree(to_be_shuffled)
54
55 json_data = launch_arborist_gui(json_data) # Returns modified json_data

/Users/wibopipping/tools/tmtk/tmtk/arborist/jstreecontrol.py in create_concept_tree(column_object)
25 """
26 if isinstance(column_object, tmtk.Study):
---> 27 concept_tree = create_tree_from_study(column_object)
28
29 elif isinstance(column_object, pd.DataFrame):

/Users/wibopipping/tools/tmtk/tmtk/arborist/jstreecontrol.py in create_tree_from_study(study_object, concept_tree)
49 concept_tree = ConceptTree()
50
---> 51 concept_tree = create_tree_from_clinical(study_object.Clinical, concept_tree)
52
53 for map_file in study_object.subject_sample_mappings:

/Users/wibopipping/tools/tmtk/tmtk/arborist/jstreecontrol.py in create_tree_from_clinical(clinical_object, concept_tree)
80 data_args = variable.column_map_data
81 concept_path = variable.concept_path
---> 82 categories = variable.word_map_dict if not variable.is_numeric else {}
83
84 # Add filename to SUBJ_ID, this is a work around for unique path constraint.

/Users/wibopipping/tools/tmtk/tmtk/clinical/Variable.py in is_numeric(self)
58 @property
59 def is_numeric(self):
---> 60 return utils.is_numeric(self.mapped_values)
61
62 @property

/Users/wibopipping/tools/tmtk/tmtk/clinical/Variable.py in mapped_values(self)
93 @property
94 def mapped_values(self):
---> 95 return [v for k, v in self.word_map_dict.items()]
96
97 def validate(self, verbosity=2):

/Users/wibopipping/tools/tmtk/tmtk/clinical/Variable.py in word_map_dict(self)
88 :return: dict
89 """
---> 90 word_map = self.parent.WordMapping.get_word_map(self.id_)
91 return {v: word_map.get(v, v) for v in self.unique_values}
92

/Users/wibopipping/tools/tmtk/tmtk/clinical/WordMapping.py in get_word_map(self, var_id)
34
35 filename, column = var_id.rsplit('__', 1)
---> 36 f = self.df.ix[:, 0].astype(str) == filename
37 c = self.df.ix[:, 1].astype(str) == column
38 if sum(f & c):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/Werkzeug-0.11.9-py3.5.egg/werkzeug/utils.py in __get__(self, obj, type)
71 value = obj.__dict__.get(self.__name__, _missing)
72 if value is _missing:
---> 73 value = self.func(obj)
74 obj.__dict__[self.__name__] = value
75 return value

/Users/wibopipping/tools/tmtk/tmtk/utils/filebase.py in df(self)
13 def df(self):
14 if self.path and os.path.exists(self.path):
---> 15 df, self._hash_init = file2df(self.path, hashed=True)
16 else:
17 CPrint.warn("No dataframe found on disk for {}, creating.".format(self))

/Users/wibopipping/tools/tmtk/tmtk/utils/Generic.py in file2df(path, hashed)
58 df = pd.read_table(path,
59 sep='\t',
---> 60 dtype=object)
61 if hashed:
62 hash_value = hash(df.bytes())

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
283 return parser
284
--> 285 return parser.read()
286
287 _parser_defaults = {

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
745 raise ValueError('skip_footer not supported for iteration')
746
--> 747 ret = self._engine.read(nrows)
748
749 if self.options.get('as_recarray'):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
1202 self.index_col,
1203 self.index_names,
-> 1204 dtype=self.kwds.get('dtype'))
1205 else:
1206 raise

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py in _get_empty_meta(columns, index_col, index_names, dtype)
2257 # Convert column indexes to column names.
2258 dtype = dict((columns[k] if com.is_integer(k) else k, v)
-> 2259 for k, v in compat.iteritems(dtype))
2260
2261 if index_col is None or index_col is False:

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/compat/__init__.py in iteritems(obj, **kwargs)
132 func = getattr(obj, "iteritems", None)
133 if not func:
--> 134 func = obj.items
135 return func(**kwargs)
136

AttributeError: type object 'object' has no attribute 'items'
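
One possible guard, sketched under the assumption that the failure is triggered by parsing a word mapping file with no content: check for data before handing the path to the parser. has_data_rows is a hypothetical helper, not an existing tmtk function:

```python
import os


def has_data_rows(path):
    """Return True if the file at `path` contains at least one non-blank line."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return False
    with open(path, encoding='utf-8') as handle:
        return any(line.strip() for line in handle)
```

A caller could fall back to an empty word mapping when this returns False instead of attempting to read the file.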

Tag support: Context menu to "Add Metadata" and flexible way to add more tags.

The tag node's standard "data" array has an array called "tags", which has tag_titles as keys and [ tag_description, tag_weight ] as values.

Currently the Arborist GUI only shows these metadata tags. We need:

  • Create new metadata node in context menu
  • Button to add new tags to existing metadata node
  • Any changes made should be put back in the node data array
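
The data layout described above can be illustrated with a small sketch. The node is modelled as a plain dictionary; add_tag and the 'text' and 'type' keys are assumptions for illustration, and only the "data" and "tags" structure comes from the description:

```python
def add_tag(node, title, description, weight):
    """Add or overwrite one tag in a node's data['tags'] mapping, in place."""
    data = node.setdefault('data', {})
    tags = data.setdefault('tags', {})
    # Each tag title maps to [tag_description, tag_weight].
    tags[title] = [description, weight]
    return node


meta_node = {'text': 'META', 'type': 'tag', 'data': {'tags': {}}}
add_tag(meta_node, 'Contact', 'Data manager', 3)
```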

Make SUBJ_ID special case

Allow duplicates so they do not disappear when multiple are put in the same folder.
Also, remove children.

Parse date fields to be in correct format

START_DATE and END_DATE fields cause an error when loading with transmart-copy.

Insert into i2b2demodata.observation_fact   4% │███████▌                                                                                                                                                                                     │   50/1241 (0:00:01 / 0:00:25)
2018-03-15 11:16:00,278 [ERROR] Error processing row 52 of i2b2demodata/observation_fact.tsv: Text '2005-11-15' could not be parsed at index 10
2018-03-15 11:16:00,293 [ERROR] Text '2005-11-15' could not be parsed at index 10

Add validation to START_DATE and END_DATE fields
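
A validation step might look like the sketch below. The error at index 10 hints that the loader expects more than a bare date, but the exact format is not stated here, so TARGET_FORMAT is an assumption, as are the names normalise_date and ACCEPTED_FORMATS:

```python
from datetime import datetime

# Assumed target format; the loader's real expectation is not stated in the issue.
TARGET_FORMAT = '%Y-%m-%d %H:%M:%S'
ACCEPTED_FORMATS = ('%Y-%m-%d %H:%M:%S', '%Y-%m-%d')


def normalise_date(value):
    """Return `value` in TARGET_FORMAT, or raise ValueError if unparseable."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue
    raise ValueError('Unrecognised date value: {!r}'.format(value))
```

Raising on unrecognised values lets a validator report the offending row before loading starts, rather than failing partway through an insert.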

call_boris() raises an error on save and return to Jupyter when no changes were made


ValueError Traceback (most recent call last)
in ()
----> 1 zero_study.call_boris()

/Users/wibopipping/tools/tmtk/tmtk/study.py in call_boris(self)
137
138 def call_boris(self):
--> 139 arborist.call_boris(self)
140
141 @property

/Users/wibopipping/tools/tmtk/tmtk/arborist/common.py in call_boris(to_be_shuffled)
56
57 if isinstance(to_be_shuffled, tmtk.Study):
---> 58 update_study_from_json(to_be_shuffled, json_data=json_data)
59 elif isinstance(to_be_shuffled, tmtk.Clinical):
60 update_clinical_from_json(to_be_shuffled, json_data=json_data)

/Users/wibopipping/tools/tmtk/tmtk/arborist/common.py in update_study_from_json(study, json_data)
132 study.Clinical.ColumnMapping.df = concept_tree.column_mapping_file
133 study.Clinical.WordMapping.df = concept_tree.word_mapping
--> 134 study.Tags.df = concept_tree.tags_file
135
136 high_dim_paths = concept_tree.high_dim_paths

/Users/wibopipping/tools/tmtk/tmtk/arborist/jstreecontrol.py in tags_file(self)
235 # This reduces the nested dictionary to a flat one.
236 flat_mapping = [row for nest_list in all_mappings for row in nest_list]
--> 237 df = pd.concat([pd.Series(row) for row in flat_mapping], axis=1).T
238
239 df.columns = ['Concept Path', 'Title', 'Description', 'Weight']

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/tools/merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
810 keys=keys, levels=levels, names=names,
811 verify_integrity=verify_integrity,
--> 812 copy=copy)
813 return op.get_result()
814

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
843
844 if len(objs) == 0:
--> 845 raise ValueError('No objects to concatenate')
846
847 if keys is None:

ValueError: No objects to concatenate
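
The traceback shows the concat call receiving an empty list. A pure-Python sketch of a guard, assuming the tag rows arrive as nested lists as in the flattening step above; tags_table and TAG_COLUMNS are hypothetical names:

```python
TAG_COLUMNS = ['Concept Path', 'Title', 'Description', 'Weight']


def flatten_tag_rows(all_mappings):
    """Flatten nested lists of tag rows; an empty input gives an empty list."""
    return [row for nest_list in all_mappings for row in nest_list]


def tags_table(all_mappings):
    """Return (columns, rows); rows is simply empty when there are no tags."""
    rows = flatten_tag_rows(all_mappings)
    # An empty `rows` is the case in which the concatenation should be skipped.
    return TAG_COLUMNS, rows
```

With the row list in hand, the caller can decide to produce an empty table with only the header instead of concatenating nothing.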

Smart reloading of study

Could be lazy loading (JIT) and/or using meta data (or md5hash) to decide what to reload into memory.
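
The hash-based variant could be sketched like this, using only the standard library. file_md5 and needs_reload are hypothetical helpers, and keeping the known hashes in a plain dictionary is an assumption:

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def needs_reload(path, known_hashes):
    """True if `path` changed since its hash was last stored in `known_hashes`."""
    current = file_md5(path)
    changed = known_hashes.get(path) != current
    known_hashes[path] = current  # remember the latest state
    return changed
```

Hashing still reads the whole file, so a cheaper first check on modification time or size may be worth considering for large data files.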

+ in data label value

When a word mapping introduces a +, the Arborist turns the + into a subfolder.

Example: word mapping value: 3 --> L+R
Expected behaviour is that the arborist shows a data value of L+R

Implement save option for study

There is no need to read everything from disk when saving the study to the same location. Use df_has_changed to determine whether an object should be written.

Use write_to as the base, with the changed assumption that it saves to the exact same location.
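
As a rough sketch of the idea, assuming each file object exposes a path and a df_has_changed flag. TrackedFile is a hypothetical stand-in that models the flag as a plain attribute and the content as text, which may differ from how tmtk represents it:

```python
class TrackedFile:
    """Hypothetical stand-in for a tmtk file object with a change flag."""

    def __init__(self, path, content, df_has_changed=False):
        self.path = path
        self.content = content
        self.df_has_changed = df_has_changed


def save_changed(files):
    """Write only files flagged as changed; return the paths written."""
    written = []
    for item in files:
        if not item.df_has_changed:
            continue  # unchanged since loading, nothing to write
        with open(item.path, 'w', encoding='utf-8') as handle:
            handle.write(item.content)
        item.df_has_changed = False
        written.append(item.path)
    return written
```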

VariableCollection could use query/filter function

Currently the class has a .get function, which requires a variable ID string.
This could use a better search function. Possible options are an attribute-value filter or a data label method.
For usability, this could also raise an error if nothing is found instead of returning None.
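
An attribute-value filter along these lines might look like the sketch below. Variable here is a hypothetical minimal class, and filter_variables is an illustrative name rather than an existing method:

```python
class Variable:
    """Hypothetical minimal variable with an ID and a data label."""

    def __init__(self, id_, data_label):
        self.id_ = id_
        self.data_label = data_label


def filter_variables(variables, **criteria):
    """Return variables whose attributes equal all given criteria.

    Raises LookupError instead of returning None when nothing matches.
    """
    matches = [v for v in variables
               if all(getattr(v, name, None) == value
                      for name, value in criteria.items())]
    if not matches:
        raise LookupError('No variable matches {!r}'.format(criteria))
    return matches
```

For example, filter_variables(all_vars, data_label='Age') would return every variable labelled Age, whatever file it came from.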
