
ohdsi / whiterabbit

WhiteRabbit is a small application that can be used to analyse the structure and contents of a database as preparation for designing an ETL. It comes with RabbitInAHat, an application for interactive design of an ETL to the OMOP Common Data Model with the help of the scan report generated by White Rabbit.

Home Page: http://ohdsi.github.io/WhiteRabbit

License: Apache License 2.0

Java 100.00%

whiterabbit's People

Contributors

aguynamedryan, anthonysena, aperotte, arestenko, bartpraats, blootsvoets, chrisknoll, clairblacketer, claire-oi, dependabot[bot], ericavoss, janblom, jeff-m-sullivan, kesadae11, marc-outins, mark-velez, maximmoinat, mgkahn, msuchard, qkerby, schuemie, snyk-bot, spayralbe, thonitub, watilde


whiterabbit's Issues

Running on RedShift, Column Headers may be incorrect.

There is a problem with WhiteRabbit running on RedShift.

SELECT COLUMN_NAME,DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS

Does not necessarily return rows in the order they exist in the table, which is required for the scan report to be correct. I added:

order by ordinal_position

to the query in SourceDataScan.java and it fixed the problem.
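The complete patched query might look like this (a sketch of the fix described above; the actual query in SourceDataScan.java may also filter by schema and table, so the WHERE clause here is illustrative):

```sql
-- information_schema.columns does not guarantee row order, but
-- ordinal_position reflects the order columns were defined in the table.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = ?   -- hypothetical placeholder for the scanned table
ORDER BY ordinal_position;
```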

[Suggested Feature] Ability to undo text edits

This one might be challenging, but any good GUI system has a complete system of undo/redo for all important actions. If I accidentally delete all the text in a comment or logic box, I have no way of undoing that action.

Hopefully Java has some easy way of adding undo to text areas.
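Java does make this fairly easy: javax.swing.undo.UndoManager can listen to a text component's Document. A minimal sketch (class and method names are illustrative, not RiaH's actual code):

```java
import javax.swing.JTextArea;
import javax.swing.undo.UndoManager;

public class UndoableTextArea {
    // Attach an UndoManager to the text area's document; every edit the
    // user makes then becomes undoable via undo.undo() / undo.redo().
    public static UndoManager attachUndo(JTextArea area) {
        UndoManager undo = new UndoManager();
        area.getDocument().addUndoableEditListener(undo);
        return undo;
    }

    public static void main(String[] args) {
        JTextArea area = new JTextArea();
        UndoManager undo = attachUndo(area);
        area.setText("some mapping logic");
        undo.undo(); // reverts the text insertion
        System.out.println("[" + area.getText() + "]");
    }
}
```

The UndoManager would typically be wired to an "Undo" menu item or Ctrl+Z key binding.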

Unable to load on Mac OS X 10.9

java -version
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)

java -jar ~/Downloads/WhiteRabbit_v0/WhiteRabbit.jar
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/ohdsi/whiteRabbit/WhiteRabbitMain : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

[Suggested Feature] Bring highlighted arrows to top

As of #26, if a user has clicked on a target or source box, the arrows leading into or out of that box are highlighted. I assume those highlighted arrows are the ones a user is most interested in, so, along with being highlighted, they should be placed on top of all other arrows, making them easier to select.
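A sketch of the idea, assuming arrows are painted in list order so that the last-painted arrows sit on top (the Arrow class here is a stand-in for RabbitInAHat's real one):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ArrowZOrder {
    // Hypothetical minimal arrow model; the real class lives in RabbitInAHat.
    public static class Arrow {
        public final String name;
        public final boolean highlighted;
        public Arrow(String name, boolean highlighted) {
            this.name = name;
            this.highlighted = highlighted;
        }
    }

    // Painting happens in list order, so a stable sort that moves highlighted
    // arrows to the end draws them on top of the others and makes them the
    // first thing the mouse hits.
    public static List<Arrow> paintOrder(List<Arrow> arrows) {
        List<Arrow> ordered = new ArrayList<>(arrows);
        ordered.sort(Comparator.comparing((Arrow a) -> a.highlighted)); // false before true
        return ordered;
    }
}
```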

Option to exclude value counts from RabbitInAHat output

Frequency distributions generated by the WhiteRabbit scan report are incredibly valuable, but since they can be polluted with PHI, an option to avoid duplicating them in the RabbitInAHat output (i.e. the appendix and spec file) would help prevent proliferation of these data. At Columbia we typically delete the appendix manually after generating the ETL doc, which is of course error-prone.

Furthermore, continuing to duplicate these data in the spec file will undermine its usability once serialization to a human-readable format (#10) is implemented. We found that data from org.ohdsi.rabbitInAHat.dataModel.valueCounts accounts for the overwhelming majority of our spec file size. Since we deal with large datasets, and adoption of a human-readable format likely precludes compression, we would be expected to read and edit a file exceeding 34MB, whereas without the counts it would be ~500KB.

[Suggested Feature] Overall Notes Section

It would be good if there were a place to put a "header note" or just general notes: for example, pre-processing that must happen before the ETL starts (e.g. pivoting SEER data so you can actually work with it), or notes you simply need to record SOMEWHERE.

The contents of this text box could simply be emitted at the top of the generated document.

Rabbit-In-A-Hat: Too Many Tables, Export Images Off

(From Charlie Baily/Evanette Burrows)

For Rabbit-in-a-Hat, we found that when too many tables are involved, the high-level source-to-OMOP tables diagram did not export completely to the Word document. If there were a way to scale the image so it fits in the final document, that would be a great addition to the tool. Also, any documentation on how to navigate to the higher-level table comments versus the field-specific comments would be helpful; we misplaced comments because of where we clicked.

(screenshot attached)

[Suggested Feature] Summarization of Column Mapped

The Medicare ETL group (Mark Danese, Jennifer Duryea, Michelle Gleeson, @aguynamedryan) was chatting about Rabbit-In-A-Hat, and we thought it would be interesting to have some metric on how much is mapped: XX% of tables are mapped and XX% of columns are mapped. We recognize this isn't the best metric, but it gives you a sense of how much has been reviewed.

[Suggested Feature] Indicate possible concepts that can be assigned to a column

When filling out columns like, say, the observation.observation_type_concept_id or the visit_occurrence.visit_type columns it would be nice to see the list of possible concepts (and their concept_ids) associated with those columns. The list of possibilities is small and the query to find the associated concepts is straightforward. Many times while we were working on the SynPUF ETL, @ericaVoss had to hop out of Rabbit in a Hat to look up these possible values and it seems like we could avoid this.

[Suggested Feature] Load multiple Source Tables to map fields to one CDM Field

RIH only allows one source table to be loaded at a time, so when the rules for a CDM field draw on multiple source fields from different tables, the resulting mapping logic is fragmented and discontinuous.
Would there be a way to update RIH to load multiple source tables, so that source values from different fields in different tables can be mapped to one CDM field, and the generated graphic shows a more comprehensive mapping?

[Suggested Feature] Create percentages for each frequency count from White Rabbit

During the SynPUF ETL, we occasionally referred to the White Rabbit scans of values for a given source column. The frequency counts are helpful, but as @mav7014 pointed out in #29, they leak PHI. In my mind, they also don't easily convey how big a slice of my data contains a particular value. I'd like to see what percentage a given value represents out of the overall frequency. Perhaps if we reported small values as "less than 0.1%" or something, we'd also not leak PHI and @mav7014 could reincorporate frequencies into his ETL spec decision-making process.

Please note that I also don’t intend for reporting of percentages to replace counts entirely. In some cases, it would be nice to have both raw counts and percents.
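A sketch of the proposed formatting, with the "less than 0.1%" floor treated as an assumed PHI threshold rather than any established OHDSI policy:

```java
public class FrequencyFormatter {
    // Format a value's share of the total, floored at "<0.1%" so that
    // very rare values don't reveal small cell counts.
    public static String formatPercent(long count, long total) {
        double pct = 100.0 * count / total;
        if (pct < 0.1) {
            return "<0.1%";
        }
        return String.format(java.util.Locale.ROOT, "%.1f%%", pct);
    }
}
```

For the example above: 21,123 rows out of 1M would render as "2.1%".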

Consider sorting columns alphabetically in RabbitInAHat

When attempting to map input columns to CDM columns, neither attribute list is sorted. This leaves the user to scan the lists to find the fields of interest. With small column lists this is not a big issue, but when processing more than roughly 15 columns it requires scrolling and dragging.

Consider sorting column names alphabetically to make searching easier.
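The sort itself is a one-liner; a sketch (a case-insensitive comparator seems the safer choice, since source schemas mix upper- and lower-case names):

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSorter {
    // Case-insensitive alphabetical sort, so e.g. "CLAIM_ID" and
    // "claim_type" end up adjacent regardless of the source's casing.
    public static List<String> sorted(List<String> columnNames) {
        List<String> copy = new ArrayList<>(columnNames);
        copy.sort(String.CASE_INSENSITIVE_ORDER);
        return copy;
    }
}
```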

[Suggested Feature] Load Sandbox CDMs as target tables/columns

During the March F2F, we decided it would be nice to try out potential CDM changes in a “sandbox implementation”. In order for us to build an ETL spec document against a “sandbox” CDM, we need the ability to load the schemas for those sandbox CDMs into RiaH.

At least in the short term I don’t envision a user being able to add/remove schema information directly in RiaH. Instead a user would need to prepare a “schema file” or something that can then be loaded into RiaH.

[Suggested Feature] Format for Storing CDM/Source Schema Information

It would be good to find a simple JSON format to store schema and column-related information, in order to satisfy the related issues.

There is JSON Schema, but that looks like an overwrought format and I’m not sure it would satisfy our actual needs. Plus, there don’t seem to be any Java libraries that would help us read in that format.

Instead, I propose a simple syntax like in this example: https://gist.github.com/aguynamedryan/4ecbc821ace160feed59
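The linked gist's exact syntax is not reproduced here; purely as one hypothetical illustration, a flat JSON shape covering table and column names plus types might look like:

```json
{
  "cdm_version": "sandbox-v5",
  "tables": [
    {
      "name": "person",
      "columns": [
        { "name": "person_id", "type": "integer", "required": true },
        { "name": "gender_concept_id", "type": "integer", "required": true }
      ]
    }
  ]
}
```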

WhiteRabbit: 1M scan fail

(From Charlie Baily/Evanette Burrows)

100K worked in all cases. Interestingly to me, 1M failed on several tables, with what seemed a reasonably generous heap allowance for the JRE. But I don't use Java a lot for big datasets, so I don't have a feel for internal overhead. Maybe some -Xmx hints would be helpful, but it's a minor point.

(screenshots attached)
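As an example of such a hint: the bundled launch scripts are named for a 1.5 GB heap, so raising -Xmx beyond that when launching manually may let the 1M-row scan finish (the 4g value below is illustrative, not a tested recommendation):

```shell
# Launch the scan with a larger heap than the stock scripts allow
java -Xmx4g -jar WhiteRabbit.jar
```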

[Suggested Feature] Analyze White Rabbit scan document to detect which vocabulary(ies) are present in a source table

CDMv5 introduced the idea of "domains". Domains dictate which table(s) a concept should live in, and these domain assignments can be surprising. For instance, HCPCS code G0248 has been assigned the domain "Observation", meaning this HCPCS code does not generate a procedure_occurrence but is instead stored in the observation table. As another example, the ICD-9 code V53.2 (fitting of hearing aid) maps to the domain "Procedure", meaning the ICD-9 code generates a procedure_occurrence, not a condition_occurrence.

It would be nice to lessen the cognitive burden on our ETLers and provide them some automated guidance on how their source data might best be mapped to the CDM. For example, when working on the SynPUF ETL, we completely forgot that some ICD-9 codes generate procedure_occurrences. We had to go back and draw a bunch of arrows and come up with a lot of new logic long after we thought we had finished the spec and our minds had moved on to other matters.

If we matched the values for each column in White Rabbit’s scan report against the values in concept.concept_code, we could see if the column consistently matches codes for a particular vocabulary. This gives us very helpful information that we can use to automate some of the work done in Rabbit in a Hat.

For instance, in the SynPUF data, we could take White Rabbit's values from the dx1 column in the inpatient file and try to match them against concept.concept_code. We'd find that those values all match ICD-9 concepts, from which we can reasonably infer that inpatient.dx1 is an ICD-9 column. And since the CDMv5 vocabulary tells us that ICD-9 codes are associated not only with the Condition domain but with Observation, Measurement, and Procedure as well, RiaH could then automatically draw arrows not just between the inpatient table and condition_occurrence, but between inpatient and measurement, observation, and procedure_occurrence as well.

We can do this for every vocabulary we find in each source table, drawing arrows between source tables and their potential target tables. Not only can we make the table-level mappings, but we can link the *_concept_id field and the *_source_value fields to the source column.

Combine this feature with #35 and RiaH begins to be a tool that guides an ETLer through the mapping process, rather than requiring the ETLer to intimately know the source data, the CDM, and the domains assigned to each source vocabulary. Instead, RiaH informs the ETLer about the relationships they must consider. Plus, it will help avoid situations like the one we encountered, where we completely ignored a set of relationships because we didn't realize which domains ICD-9 mapped into.
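The matching step could be sketched like this; the 95% cut-off and the idea of feeding in the scan report's value list are assumptions, not existing WhiteRabbit behavior:

```java
import java.util.List;
import java.util.Set;

public class VocabularyGuesser {
    // Fraction of a column's scanned values found in a vocabulary's set of
    // concept codes (as loaded from concept.concept_code for one vocabulary).
    public static double matchFraction(List<String> columnValues, Set<String> conceptCodes) {
        if (columnValues.isEmpty()) {
            return 0.0;
        }
        long hits = columnValues.stream().filter(conceptCodes::contains).count();
        return (double) hits / columnValues.size();
    }

    // A high match fraction suggests the column holds that vocabulary;
    // the 0.95 threshold is an assumed cut-off, chosen only for illustration.
    public static boolean looksLike(List<String> columnValues, Set<String> conceptCodes) {
        return matchFraction(columnValues, conceptCodes) >= 0.95;
    }
}
```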

Include proportions with Frequency Counts

I like how Rabbit in a Hat provides frequency information about the contents of columns. It would be nice if, along with the frequency count for each value, we also provided the proportion of rows with that value. 21,123 rows sounds like a lot, but when the sample is 1M rows, that only represents about 2.1% of the total. It helps put things in perspective.

WhiteRabbit: scan should indicate int values with leading zeros

Hi,

Class SourceDataScan currently infers that a field has int type if the trimmed value is accepted by Java's long parser without throwing a number format exception in StringUtilities#isLong.

Unfortunately, multi-character codes that contain only digits but start with one or more leading zeros can still be parsed to a long. If those leading zeros are significant, the user has to watch for such values on the frequency tabs of the scan result to avoid data truncation (and if the frequency list for the column is itself truncated, they may never find the "evidence" that the value needs to be stored in, say, a character column in a database).

Please consider adding a separate field type that can be differentiated from regular ints (e.g. int_lz for "int with leading zero"), or classifying input fields with such values as varchar.
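A sketch of a stricter check that rejects leading-zero values (StringUtilities#isLong's real behavior is only paraphrased in the comments; the class and method names here are illustrative):

```java
public class FieldTypeGuesser {
    // Like parsing with Long.parseLong, but refuses values whose leading
    // zeros would be lost by storing them as integers (e.g. "007"), so
    // such columns can be classified as varchar (or a hypothetical int_lz).
    public static boolean isPlainLong(String value) {
        String v = value.trim();
        String digits = (v.startsWith("-") || v.startsWith("+")) ? v.substring(1) : v;
        if (digits.isEmpty()) {
            return false;
        }
        for (int i = 0; i < digits.length(); i++) {
            if (!Character.isDigit(digits.charAt(i))) {
                return false;
            }
        }
        // "0" itself is fine; "007" is not.
        if (digits.length() > 1 && digits.charAt(0) == '0') {
            return false;
        }
        try {
            Long.parseLong(v);
            return true;
        } catch (NumberFormatException e) {
            return false; // e.g. overflows a long
        }
    }
}
```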

Thanks,
András

Remember last directory

When I open a file (either Scan Report or ETL Specs), it'd be nice if Rabbit in a Hat remembered that path, so the next time I go to save/open a file I'm back at the last location.

Split off Rabbit in a Hat into a new repo

Though they are closely related, WhiteRabbit and Rabbit in a Hat are two separate programs and it makes better sense in my head to have them in separate repositories.

[Suggested Feature] Indicate column’s data type

This would be a nice bit of information to know about a target column. For instance, specimen.quantity is a float while procedure_occurrence.quantity is an integer. Nice things to know as you’re spec'ing an ETL.

[Suggested Feature] Indicate which mappings (represented by arrows) have no information

It would be nice to be able to tell, at a glance, which column mappings have no comments or logic filled out. At the table-level view, a table would be considered “incomplete” if it contains any “incomplete” column mappings. That way, as I’m looking over the entire ETL, I can see what work is still left to be done.

Tentatively, we could render just the outline of the arrows in a thick border, leaving the center of the arrow "hollow". Or we could color the arrows yellow?

[Suggested Feature] Drop the launch scripts in favor of double-clicking JAR

Rabbit in a Hat and WhiteRabbit both use a batch script to launch because we're changing the memory allocation using a command-line argument.

Unfortunately, memory allocation must be specified at the command line and can't be changed from inside a Java program itself or by adding some arguments to the manifest of a JAR. Dumb.

But I poked around a bit and found an alternative approach: a small launcher JAR that starts a second JVM in another process, and that second process can be given the appropriate command-line arguments to adjust memory allocation.

Might be nice to drop the scripts in favor of just having a JAR file we can double click.
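One way the stub-JAR approach could look; jar names, the heap value, and the class name are illustrative:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class Launcher {
    // Build the command a tiny "stub" jar would use to relaunch the real
    // application in a second JVM with the heap flag applied; double-clicking
    // the stub jar then replaces the .bat/.sh launch scripts.
    public static List<String> relaunchCommand(String mainJar, String maxHeap) {
        List<String> cmd = new ArrayList<>();
        // Use the same JRE that launched the stub.
        cmd.add(System.getProperty("java.home") + File.separator + "bin" + File.separator + "java");
        cmd.add("-Xmx" + maxHeap);
        cmd.add("-jar");
        cmd.add(mainJar);
        return cmd;
    }

    // The stub's main() would then do something like:
    //   new ProcessBuilder(relaunchCommand("WhiteRabbit.jar", "1500m")).inheritIO().start();
}
```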

Add ant build file

An ant build.xml file to construct release jars will automate the build process.

Searching for DB Tables

(From Charlie Baily/Evanette Burrows)

When selecting tables, it would be good to have a search box. At some locations, their DB can have many tables and looking through the long list can be burdensome.

Can't Build

Prior to 1a7aa93, I wasn't able to build this project by running "ant" from the command line. After 1a7aa93, I'm still unable to build the project, but I'm getting a couple of new errors I'd like to throw your way.

First, the <fileset dir="bin"/> lines are causing the build to fail with the message WhiteRabbit/bin does not exist.

If I manually make the bin dir and rerun the ant command, the build seems to complete, but when I cd in to the dist directory and run

sh RabbitInAHat1.5GB.sh 
Error: Could not find or load main class org.ohdsi.rabbitInAHat.RabbitInAHatMain

and with WhiteRabbit, the same

sh WhiteRabbit1.5GB.sh
Error: Could not find or load main class org.ohdsi.whiteRabbit.WhiteRabbitMain

As soon as I'm able to build this project, I'll be able to test a few changes I've made to RabbitInAHat in relation to #21

Rabbit in a Hat: Label CDMv4 or CDMv5

When displaying the CDM tables in Rabbit in a Hat, the header currently reads "CDM". It would be nice if it displayed "CDMv4" or "CDMv5" depending on which CDM version was being used at the time.

[Suggested Feature] Select multiple boxes and draw arrows between them

Some source formats, e.g. CMS’s LDS, use a denormalized format to store sets of information, e.g. the SynPUF outpatient facility claims file contains up to 45 HCPCS fields. @ericaVoss had to draw 45 separate arrows between the HCPCS fields and the procedure_occurrence.procedure_concept_id. Then she had to draw 45 more arrows between the HCPCS fields and the procedure_occurrence.procedure_source_value.

It would be great to be able to select multiple source columns and multiple target columns and then select something from a drop down menu like “draw arrows” to end up generating arrows between each source column and target column.

This raises some questions about if we should be able to select multiple arrows and apply identical comments/logic text to them, but that’s for another ticket.

Using WhiteRabbit to generate fake data

Hi,

I used White Rabbit to scan OMOP V4 tables and then used the data profiles to generate fake data. However, only several tables (care_site, location, organization, oscar_result, provider) were included in the output. Was the limitation the result of this being an experimental function or me using OMOP V4 tables?

Thank you in advance.

[Suggested Feature] Show documentation about a column

During the SynPUF ETL, we frequently flipped back and forth between RiaH and the CDM documentation on the wiki. Why not include this documentation in RiaH, so I can click on a target column and read the wiki documentation without leaving the application?

E.g. When clicking on a table, we’d show the documentation for the entire table (e.g. description, column-by-column information, conventions). When clicking on a column, we’d show the documentation specific to that column.

[Suggested Feature] Scroll column panes independent of each other

Oftentimes, the source or target tables contain more columns than fit on a single screen. Dragging arrows between columns that aren't visible on the screen together requires a rather tedious and time-consuming click-and-drag action. It would be nice if the source column view and target column view could be scrolled independently of each other. Then the relevant source column and target column could be positioned next to each other, and drawing an arrow becomes much faster.

[Suggested Feature] Load spec file from file path provided at command line

When I go to test out code I've changed in Rabbit in a Hat, I end up having to use the file dialog box to navigate to and then open an ETL spec document. This makes viewing changes a bit more time consuming than I'd like.

It'd be nice if I could specify a file path to an ETL spec file on the command line when launching Rabbit in a Hat and have it load in that ETL spec file on startup. Since it's a command line argument, I can just set my IDE to pass in a file path to a test ETL spec document every time I launch Rabbit in a Hat from within my IDE.
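The argument handling would be small; a sketch (class and method names are illustrative, not RiaH's actual startup code):

```java
import java.io.File;

public class StartupArgs {
    // If the first command-line argument names an existing file, treat it as
    // the ETL spec to open at startup; otherwise fall back to the open dialog.
    public static File specFromArgs(String[] args) {
        if (args.length > 0) {
            File f = new File(args[0]);
            if (f.exists()) {
                return f;
            }
        }
        return null; // caller shows the usual file dialog
    }
}
```

An IDE run configuration can then pass the path of a test spec on every launch.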

[Suggested Feature] Ability to undo adding/removing arrows

This one might be challenging, but any good GUI system has a complete system of undo/redo for all important actions.

It appears that if I remove an arrow between two tables, all the column-level mappings are preserved so that if I reinstate the table-level arrow, the column-level mappings are reinstated, but it would still be nice to have an "undo" feature.

[Rabbit In a Hat] Warning Needed when Opening Files

If you pick the wrong file (e.g. you try to open the ETL file as a scan report), Rabbit in a Hat just hangs. It might be nice for it to warn the user that the chosen file type is inappropriate.

Also, I would switch the order of the menu items: you need to open the SCAN REPORT before you have an ETL, so there is a logical order there.

These are both cosmetic items.

Remove Ability to Save ETL Spec Using Java-based Serialization

We just ran into some issues refactoring the code and had to fix them in #72; we're wondering whether using Java's serialization to save ETL spec files is the best approach.

At this time, we can't rename the fields in any classes without breaking the ability to read in an ETL spec. That's a bit of a bummer.

@mav7014 has already implemented an alternative means of serialization using JSON via the json-io library, which seems more flexible and malleable than Java's native serialization format. Naively, if we renamed a field in a class we could, if nothing else, do a find/replace on the serialized JSON and change all the keys to the new field name. Maybe there are even better ways to deserialize an object that has been refactored. But as best we can tell, native Java serialization makes this hard, if not impossible.

Assuming we all agree that JSON is the better serialization format going forward, we propose the following for milestone 0.5.0:

  • Read in ETL specs using either native Java serialization or JSON serialization
  • Write out ETL specs only in JSON format

In future releases, we'll remove the ability to read in an ETL spec that was saved in native Java format.

This way, users can always use release 0.5.0 to read in their old ETL specs, and then save them back in JSON format in order for those files to be read into any release after 0.5.0. Essentially, 0.5.0 will serve as a way to convert Java-based files to JSON-based files.

After 0.5.0 we'll need to come up with a system for reading in older versions of JSON-based ETL specs and migrate them to the latest version of RiaH.
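Telling the two formats apart at read time is straightforward, since Java's native serialization streams begin with the magic bytes 0xAC 0xED while a JSON spec starts with '{'. A sketch of the sniffing step (not RiaH's actual loader):

```java
import java.io.IOException;
import java.io.PushbackInputStream;

public class SpecFormatSniffer {
    // Peek at the first byte without consuming it: 0xAC means native Java
    // serialization (STREAM_MAGIC is 0xACED), anything else is assumed JSON.
    public static boolean isNativeSerialization(PushbackInputStream in) throws IOException {
        int first = in.read();
        if (first != -1) {
            in.unread(first); // put the byte back for the real reader
        }
        return first == 0xAC;
    }
}
```

Release 0.5.0 could call this once per file and dispatch to ObjectInputStream or the JSON reader accordingly.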

Can't delete arrows in Rabbit-In-A-Hat

For some reason the keyboard event doesn't make it down to the component. Suggest adding an Action with a keyboard shortcut instead of trying to capture the event.
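With Swing's key-binding API the shortcut fires regardless of which child component has focus, which sidesteps the event-delivery problem; a sketch (the panel and callback names are illustrative):

```java
import java.awt.event.ActionEvent;
import javax.swing.AbstractAction;
import javax.swing.JComponent;
import javax.swing.KeyStroke;

public class DeleteArrowBinding {
    // Register Delete as a key binding on the mapping panel; unlike a raw
    // KeyListener, WHEN_IN_FOCUSED_WINDOW bindings fire even when focus
    // sits on a child component.
    public static void install(JComponent panel, Runnable deleteSelectedArrow) {
        panel.getInputMap(JComponent.WHEN_IN_FOCUSED_WINDOW)
             .put(KeyStroke.getKeyStroke("DELETE"), "deleteArrow");
        panel.getActionMap().put("deleteArrow", new AbstractAction() {
            public void actionPerformed(ActionEvent e) {
                deleteSelectedArrow.run();
            }
        });
    }
}
```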

[Suggested Feature] Change background of mapping pane to white

The background is currently gray and we have black, translucent arrows drawn on top. Those arrows might stand out a bit better if they were rendered over a white background. We’ll probably need to change the opacity of the arrows to make them more opaque to compensate for the lighter background.

whiterabbit scan report does not include empty tables in the source schema

We just ran through an ETL design session for the OpenMRS community. One issue uncovered was that their source schema included valid tables we needed to ETL, but since that specific instance had no data in some of the tables, they didn't appear in the scan report. Can we make it an option to include or exclude empty tables in the report?

Rabbit in a Hat: Suggestion for viewing/manipulating arrows

I'm finding it a bit hard to see which arrows go to which tables when many arrows are drawn on the screen. Also, when I'm trying to manipulate an arrow (e.g. remove an arrow pointing to a table), I find it hard to ensure I've grabbed the correct one.

Perhaps the UI could be updated with the following behavior:

To better view sets of arrows:

  • If no box (table) is selected, show all arrows with equal emphasis (essentially, the view that RiaH provides now)
  • If I click on a source table, highlight all the arrows originating from this table
  • If I click on a CDM table, highlight all the arrows pointing to this table

To help with manipulating arrows:

  • Limit my ability to remove an arrow only to those arrows that are highlighted, e.g. to remove an arrow between the source table "beneficiary" and the CDM table "location", I must first click on the "beneficiary" or "location" table so the appropriate set of arrows are highlighted.
    • This might be a bit clunky, but it would at least remove uncertainty when trying to manipulate arrows when many arrows occupy the same space.

I see highlighting achieved by some combination of the following:

  • Increase the opacity of "highlighted" arrows so they are darker
  • Decrease the opacity of the non-"highlighted" arrows so they are fainter
  • Change the color of the highlighted arrows to a bright, bold color (magenta or lime green or something)
