mld3 / fiddle

FlexIble Data-Driven pipeLinE – a preprocessing pipeline that transforms structured EHR data into feature vectors to be used with ML algorithms. https://doi.org/10.1093/jamia/ocaa139

Home Page: http://tiny.cc/get_FIDDLE

License: MIT License

Python 5.57% Jupyter Notebook 94.30% R 0.12% Dockerfile 0.01%
electronic-health-records preprocessing machine-learning jamia data-science

fiddle's People

Contributors

shengpu-tang

fiddle's Issues

Pre-Filter ID Issue

Hello,

I am trying to replicate the results from the paper, and I am stuck at generating the features for the time-variant variables. When I run the following, I get a KeyError: ID in the Pre-Filter step. Could you please help me figure out what I am doing wrong?
I have run all the steps from mimic3_experiments to generate the required files.
Thank you!
[Screenshot: image_2022-02-23_133529]

Having trouble processing ICD codes

I am not using MIMIC-III or eICU data; since this pipeline should be applicable to other EHR datasets, I am using it on in-house EHR data. No matter how I preprocess the ICD codes (e.g., ICD9:V50.2 vs. V50.2 vs. V502), I always encounter the error below:

--------------------------------------------------------------------------------
2-B) Transform time-dependent data
--------------------------------------------------------------------------------
Total variables    : 31734
Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\indexes\base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'icd_code:0'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
    main()
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 138, in main
    X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 235, in process_time_dependent
    df_time_series, dtypes_time_series = transform_time_series_table(df_data_time_series, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 430, in transform_time_series_table
    variables_num_freq = get_frequent_numeric_variables(df_in, variables, theta_freq, args)
  File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 93, in get_frequent_numeric_variables
    numeric_vars = [col for col in variables if df_types[col] == 'Numeric']
  File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 93, in <listcomp>
    numeric_vars = [col for col in variables if df_types[col] == 'Numeric']
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
    return self._get_value(key)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'icd_code:0'

My df_types contains only one ICD-related variable name, icd_code, which is correct. However, the parse_variable_data_type step has produced a whole new list of variable names that begin with icd, which is why variables contains a long list of "icd_code:*" elements. The whole process is confusing and short on detail. Could you please point me to the source of the error? Many thanks.
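
One way to narrow this down, sketched here with plain pandas rather than FIDDLE's own code, is to compare the labels being looked up against the index of the type table before the lookup happens. The names df_types and variables mirror those in the traceback; the data and the helper find_missing_labels are made up for illustration.

```python
import pandas as pd

# Illustrative stand-in for the type table: one entry per variable name.
df_types = pd.Series({"icd_code": "Categorical", "heart_rate": "Numeric"})

# Illustrative stand-in for the expanded variable list described in the issue.
variables = ["heart_rate", "icd_code:0", "icd_code:1"]


def find_missing_labels(labels, index):
    """Return the labels that are absent from the index, preserving order.

    Hypothetical helper, not part of FIDDLE.
    """
    return [label for label in labels if label not in index]


missing = find_missing_labels(variables, df_types.index)
print("Labels missing from df_types:", missing)

# Looking up any missing label raises the same kind of KeyError as in the traceback.
try:
    df_types["icd_code:0"]
except KeyError as err:
    print("KeyError:", err)
```

If the list of missing labels is non-empty, the variable list and the type table were presumably built from different naming schemes, which is where to look first.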

Error when not discretizing MIMIC-III time-series data - TypeError: bad operand type for unary ~: 'float'

I am running FIDDLE on data extracted from MIMIC-III using the pipeline outlined in FIDDLE-experiments. I have my population of ICU stays and am running FIDDLE with these parameters:

--T=240.0
--dt=1.0
--theta_1=0.003
--theta_2=0.003
--theta_freq=1
--stats_functions 'mean'

and other default ones found in run_make_all.sh.

I get the following error:

Traceback (most recent call last):  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/runpy.py", line 193, in _run_module_as_main  
    "__main__", mod_spec)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/runpy.py", line 85, in _run_code  
    exec(code, run_globals)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/run.py", line 141, in <module>  
    main()  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/run.py", line 138, in main  
    X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/steps.py", line 244, in process_time_dependent  
    X_all, X_all_feature_names, X_discretization_bins = map_time_series_features(df_time_series, dtypes_time_series, args)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/steps.py", line 604, in map_time_series_features  
    df.loc[~numeric_mask, col] = np.nan  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/generic.py", line 1532, in __invert__  
    new_data = self._mgr.apply(operator.invert)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 325, in apply  
    applied = b.apply(f, **kwargs)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 381, in apply  
    result = func(self.values, **kwargs)  
TypeError: bad operand type for unary ~: 'float'

Do you know what could be causing this error? I was able to determine that it first occurs in column 225958, where numeric_mask contains at least one NaN value. That suggests column 225958 contains None values; however, in my input_data.p file there are no None or NaN variable_values for variable_name == '225958'.
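
The error itself can be reproduced outside FIDDLE. The sketch below assumes only that the mask is a pandas Series holding booleans plus at least one NaN, as described above; it does not show how FIDDLE builds numeric_mask. Such a Series has object dtype, so ~ is applied element by element and fails on the float NaN. The workaround shown (treating every non-True entry as False) is one possible choice, not the project's fix.

```python
import numpy as np
import pandas as pd

# Hypothetical mask: a NaN among booleans forces object dtype.
numeric_mask = pd.Series([np.nan, True, False], dtype=object)

# Inverting an object-dtype Series applies ~ to each element; NaN is a float.
try:
    ~numeric_mask
    error_message = None
except TypeError as err:
    error_message = str(err)
print(error_message)

# Possible workaround (an assumption about the desired semantics):
# anything that is not equal to True, including NaN, counts as False.
clean_mask = numeric_mask.eq(True)

# With a proper boolean mask, the assignment from the traceback works.
values = pd.Series([1.5, 2.0, 3.0])
values.loc[~clean_mask] = np.nan
print(values.tolist())
```

Checking numeric_mask.dtype and numeric_mask.isna().sum() just before the failing line would confirm whether this is what happens for column 225958.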

Mapping data between datasets

I'm having an issue mapping data between institutions when one dataset has a subset of the variables in the second dataset. When I run the first dataset through FIDDLE, I get a set of discretization bins that I want to apply to the second dataset. However, since the second dataset has variables that are not available in the first dataset, FIDDLE doesn't have any discretization bins to apply to those variables. Is it possible to have FIDDLE drop any data that it doesn't have discretization bins for?
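
Until the pipeline supports this directly, one workaround is to filter the second dataset before it reaches FIDDLE. This is a minimal sketch in plain pandas, assuming a long-format table with one row per observation; the column names (ID, t, variable_name, variable_value), the data, and the set variables_with_bins are illustrative assumptions, not FIDDLE outputs.

```python
import pandas as pd

# Illustrative long-format records for the second dataset.
df_second = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "t": [0.0, 1.0, 0.0, 2.0],
    "variable_name": ["heart_rate", "lactate", "heart_rate", "new_lab"],
    "variable_value": [80, 1.2, 95, 7],
})

# Hypothetical set of variables for which the first dataset produced bins.
variables_with_bins = {"heart_rate", "lactate"}

# Keep only rows whose variable is known from the first dataset.
keep = df_second["variable_name"].isin(variables_with_bins)
dropped = sorted(df_second.loc[~keep, "variable_name"].unique())
df_filtered = df_second.loc[keep].reset_index(drop=True)

print("Dropped variables:", dropped)
print(df_filtered)
```

Logging the dropped variable names makes it easy to see how much of the second dataset is being discarded.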

Generating features for train and test data

To have a reasonable experimental setting, I need to generate features for the training data and keep the feature names. Then the features for the testing data should be generated using these feature names from the training step. Is there any way to do this with FIDDLE? Thanks!
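
Independently of what FIDDLE offers, the column alignment can be done after the fact. The sketch below assumes the features are available as a pandas DataFrame and that the training feature names were saved as a list; all names and values are hypothetical. Columns unseen in training are dropped, and training columns missing from the test data are filled with 0, which is only sensible for indicator-style features.

```python
import pandas as pd

# Hypothetical feature names saved from the training run.
train_feature_names = ["age_bin_0", "age_bin_1", "hr_mean"]

# Hypothetical test features: one column missing, one extra, different order.
X_test = pd.DataFrame({
    "hr_mean": [0.2, 0.7],
    "age_bin_1": [1, 0],
    "extra": [1, 1],
})

# Reorder to the training layout, drop unknown columns, fill absent ones with 0.
X_test_aligned = X_test.reindex(columns=train_feature_names, fill_value=0)
print(X_test_aligned)
```

This only aligns column names; any bins or thresholds learned on the training data still have to be reused when the test features are generated.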
