
Tabular data utilities to support the Data Innovation Program (DIP), which brings together data from different ministries in a secure environment, and other initiatives that use tabular data, such as the BCWS Fuel Type Layer (FTL) project

License: Apache License 2.0

Topics: citz, data-science, flnro, csv, education, fixed-width, health, mcfd, msp, pharmanet


diputils

Utilities to support the Data Innovation Program (DIP). Windows, Mac and Linux environments are generally supported; some components may be specific to Windows or Cygwin, or may require the POPDATA SRE to run

Features

This package currently supports several big-data-friendly operations for tabular data, reflecting properties of data in the DIP environment:

  • Fixed-width format, with a data dictionary or header file specifying field names and widths (see the parsing sketch after this list)
  • Described by DIP metadata, some of which is available from the BC Data Catalogue
  • Potentially quite large: tens of GB per file or more
  • Containing duplicate records that need to be de-duplicated
  • Zipped
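
As a quick illustration of the fixed-width-plus-data-dictionary idea (a minimal Python sketch; the field names and widths below are made up, not taken from any DIP data dictionary):

# A data dictionary supplies each field's name and width (values here are invented).
fields = [("studyid", 10), ("birth_year", 4), ("postal", 6)]

def parse_fixed_width(line, fields):
    # Split one fixed-width record into a dict of field name -> stripped value.
    record, start = {}, 0
    for name, width in fields:
        record[name] = line[start:start + width].strip()
        start += width
    return record

line = "A000000001" "1986" "V5K0A1"   # one 20-character record
print(parse_fixed_width(line, fields))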

As another example, de-duplicating records is easy to neglect if no efficient implementation is available for doing it. Note: de-duplication currently requires enough RAM to hold the data, so it may be necessary to split datasets (e.g. by year) first.
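
A minimal sketch of why the in-memory approach needs RAM proportional to the number of distinct records (illustrative only; unique.cpp is the tool intended for this, and the file names below are hypothetical):

seen = set()
with open("records.csv") as fin, open("records_unique.csv", "w") as fout:
    for line in fin:
        if line not in seen:       # every distinct record stays in memory
            seen.add(line)
            fout.write(line)
print(len(seen), "unique records retained")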

Guiding principles:

  • Making results obtainable by increasing the data volume that can be processed, e.g. by reading files incrementally so that the system's main storage becomes the limiting factor (i.e., moving from the tens-of-GB range to the hundreds-or-thousands-of-GB range)
  • Self-contained: no dependencies beyond the base language features included in Python, C/C++ and R. When working in a secure environment, software requests can take weeks to be approved, so having a reference set of functions that are simple and transparent enough to recreate portions of manually (if need be) supports researchers' flexibility
  • Language agnostic, using a quick-and-dirty approach: borrowing from the Unix tradition by making procedures (written in any language) available (from any language) through the system interface

In R:

system("command_to_run")

In C/C++:

system("command to run")

In python:

import os; os.system("command_to_run")
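
A slightly fuller version of the same pattern (a minimal sketch; subprocess is used instead of os.system to capture output and errors, and the csv_slice invocation mirrors the usage message shown later in this README):

import subprocess

# Run a diputils command-line utility and check that it succeeded.
result = subprocess.run(["./csv_slice.exe", "cohort.csv", "studyid"],
                        capture_output=True, text=True)
if result.returncode != 0:
    print("csv_slice failed:", result.stderr)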

Project Status

This project is developed and supported by the DPD Partnerships and Capacity (PAC) branch and partners, including the BC Wildfire Service (BCWS) Predictive Services Unit

Projects supported include:

  • Race-based data initiative (RBD)
  • DIP projects:
    • CYMH
    • DIP Development
    • Education Special Needs
    • Children in Care (In-Care Network)
  • BC Wildfire Service (BCWS) Fuel Type Layer Project

Example contents by directory:

cpp: C++ based scripts

  • unique.cpp: data de-duplication. This script makes it possible to de-duplicate extremely large tables. When it concludes, it reports how many unique records were retained in the output
  • dd_slice_apply_cohort.cpp: using a data dictionary, convert a flat-file to CSV format. This version of the script takes a cohort file as input so that only records pertaining to a cohort of interest are retained (can be very helpful to reduce data volume)
  • unique_msp.cpp: MSP specific: filters MSP (Medical Services Plan) data for unique records, based on the fields that define a unique MSP transaction
  • unzp.cpp: unzip all zip files in the present directory in parallel, for speed
  • csv_slice.cpp: slice certain columns out of a (potentially arbitrarily large) CSV file
  • csv_sort_date.cpp: sort the records of a CSV file by date
  • csv_split_year.cpp: split a large CSV file into per-year portions
  • csv_split.cpp: convert a CSV file into a columnar format (multi single-col "CSV" files)
  • count_col.cpp: fast counting of outcomes within columnar dataset (single-col "CSV")
  • csv_select.cpp: quick and dirty version of SELECT command in SQL
  • pqt.cpp: prototype data compression by dictionary encoding and bit packing (see the sketch after this list)
  • upqt.cpp: undo the above compression
  • csv_cat.cpp: concatenate CSV files to create a larger (arbitrarily large) one!
  • pnet_check.cpp: filter pharmanet (PNET) data according to an industry-standard method
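
To illustrate the general idea behind pqt.cpp (a minimal Python sketch of dictionary encoding and bit packing; this is not the repository's actual on-disk format):

# Dictionary encoding: replace repeated values with small integer codes,
# then bit-pack the codes so each uses only as many bits as needed.
values = ["apple", "pear", "apple", "apple", "plum", "pear"]

dictionary, codes = {}, []
for v in values:
    codes.append(dictionary.setdefault(v, len(dictionary)))

bits_per_code = max(1, (len(dictionary) - 1).bit_length())   # 2 bits for 3 distinct values
packed = 0
for i, c in enumerate(codes):
    packed |= c << (i * bits_per_code)

print(dictionary, codes, bits_per_code, hex(packed))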

py: Python based scripts

  • get_metadata.py: fetch DIP metadata from BC Data Catalogue (BCDC)
  • make_fakedata.py: synthesize DIP data from the above metadata
  • dd_list.py: find and convert "all" data dictionaries available in the environment (all at the time of development) to a common, cleaned format
  • dd_match_data_file.py: for a data file, find an acceptable data dictionary to use to open it
  • df_list.py: list available data files (flat files) in the POPDATA environment
  • dd_get.py: pull the "most current" copy of a given table (named by its file name)
  • csv_to_fixedwidth.py: convert CSV data to "fixed width" format
  • parquet_to_csv.py: convert an Apache-parquet format file to CSV format
  • csv_grep.py: grep operation for CSV files whose output is also valid CSV, with header (see the sketch after this list)
  • indent.py: indent a code file
  • multicore.py: run jobs listed in a shell script, in parallel
  • dd_sliceapply_all.py: convert all flat files in a directory to CSV (finding the data dictionaries automatically, etc.)
  • forever.py: loop a command repeatedly, for example when monitoring jobs that run for hours or days
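
A minimal sketch of the csv_grep idea (illustrative only, not the repository's implementation; the input file name and search term are hypothetical):

import csv

pattern = "asthma"                                   # hypothetical search term
with open("input.csv", newline="") as fin, \
     open("input.csv_grep.csv", "w", newline="") as fout:
    reader, writer = csv.reader(fin), csv.writer(fout)
    writer.writerow(next(reader))                    # keep the header: output is valid CSV
    for row in reader:
        if any(pattern in field for field in row):
            writer.writerow(row)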

R: R based scripts

  • dd_unpack.R: convert an Excel file (e.g. data dictionary) to CSV format
  • fw2parquet.R: convert a fixed-width file (with data dictionary in expected format) to Apache Parquet format
  • csv_plot.R: create a simple scatter plot (with trendline) from a CSV (specific 2-col format)

c: C based scripts

Sample use (outside of DIP):

get_metadata.py (fetch DIP metadata from DataBC)

Terminal:

python3 get_metadata.py

(Alternatively, open the script in Python IDLE and select Run from the menu, or press F5)
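
For context, the BC Data Catalogue is a CKAN-based catalogue, so record metadata can be fetched through the standard CKAN API. A minimal sketch (illustrative only, and not necessarily how get_metadata.py works; the URL and query string are assumptions):

import json
import urllib.request

# package_search is a standard CKAN endpoint; the query string here is an assumption.
url = ("https://catalogue.data.gov.bc.ca/api/3/action/"
       "package_search?q=data+innovation+program")
with urllib.request.urlopen(url) as response:
    result = json.load(response)

for record in result["result"]["results"]:
    print(record["title"])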

make_fakedata.py (generate fake data, loosely based on DIP metadata from DataBC; the data are not realistic)

Terminal:

python3 make_fakedata.py

(Alternatively, open Python scripts in Python IDLE and select Run from the menu, or press F5)

Optional arguments:

python3 make_fakedata.py [minimum number of rows, per file] [maximum number of rows, per file]

  • minimum number of rows, per file: lower bound for the random number of rows of synthetic data to generate
  • maximum number of rows, per file: upper bound for the random number of rows of synthetic data to generate
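
For example (the row counts here are arbitrary), to generate between 100 and 1000 rows per file:

python3 make_fakedata.py 100 1000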

Notes: CSV format synthetic data (.csv) are supplied alongside zipped fixed-width format synthetic data (.zip) for comparison. In the actual DIP environment:

  • data are provided in fixed-width format ONLY (no CSV files)
  • the data dictionaries don't typically appear in the same folder as the fixed-width files they're associated with. Software for matching fixed-width files with data dictionaries is provided in this repository

Using inside of DIP environment

Building "diputils" command-line utilities in SRE:

  1. Copy the contents of this folder into your private folder (R:/$USER/bin/)
  • copy the tar.gz file into your home folder, and extract it there
  • this file, bash.bat, etc. should reside directly in R:/$USER/bin/: for example, if they were extracted to R:/$USER/bin/diputils-master/, the files should be moved up a level
  2. Navigate there (R:/$USER/bin) in Windows Explorer (the file manager)

    • for example, if your user name is "bob", the place to go is: R:/bob/bin
  3. Double-click on bash.bat to enter the Cygwin bash prompt (the programs should be built automatically)

To check whether the utilities are working, type (and press return):

csv_slice

If the programs built correctly, you should see something like:

Error: usage: slice.exe [infile] [Field name] .. [Field name n]

To find out where a particular script lives (at the terminal, in the bin/ folder), for example to find csv_slice:

find ./ -name "csv_slice*"

in this instance the output was:

./cpp/csv_slice.cpp

./csv_slice.exe

The .cpp file is the C++ source code in the appropriate folder; the .exe file is the command invoked at the terminal

Notes:

To find your user name at the Linux / Cygwin prompt:

type (followed by return):

whoami

For example, if your user name were bob, the terminal should respond with:

bob

Example uses (not all DIP specific):

Opening and unpacking data for a cohort

This operation may be quite slow and require some manual intervention. The process here likely covers only a fraction of the available data sets, since new ones have been added over time, and there may be issues due to formatting updates.

Copy a cohort file (a CSV with a studyid column) to the present (tmp) folder:

cp /cygdrive/r/.../cohort.csv .

To confirm the file is there, type:

ls

And press return.

Slice out the studyid field

This is a terminal command, so press return after it:

csv_slice studyid cohort.csv

Examine the first 10 lines of the result file:

head -10 cohort.csv_slice.csv

Move the studyid-only file to a simpler filename:

For convenience:

mv cohort.csv_slice.csv studyid.csv

Then press return.

Fetch and extract all data for a cohort:

sup_cohort studyid.csv

Fetching and unpacking the latest version of a specific data file

Find pharmanet data

find /cygdrive/r/DATA/ -name "*pharmanet*"

/cygdrive/r/DATA/2019-04-24/docs/data_dictionary_pharmanet-january-1-1996-onwards.xlsx

/cygdrive/r/DATA/2019-04-24/pharmanet

Make a local copy of pharmanet files (subset for your study population):

pnet_get studyid.csv

Note: this script needs to be updated to reflect currently available data here

Converting a "flat file" to csv:

First get a copy of the file:

df_get hlth_prod_final.dat

And convert it to CSV:

dd_sliceapply_all hlth_prod_final.dat

In-place removal of whitespace characters from end of a file

This promotes sanity, as some programs may interpret a terminating newline character as an extra record, leading to inaccuracies or errors:

snip studyid.csv
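
A minimal sketch of the same idea (illustrative only, not the repository's snip implementation), truncating trailing whitespace in place:

import os

filename = "studyid.csv"                  # hypothetical input
with open(filename, "rb+") as f:
    f.seek(0, os.SEEK_END)
    end = f.tell()
    # Step backwards over trailing whitespace bytes (newlines, spaces, tabs, CR).
    while end > 0:
        f.seek(end - 1)
        if f.read(1) not in b"\r\n \t":
            break
        end -= 1
    f.truncate(end)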

Concatenating pharmanet files

The script pnet_get, above, handles this.

Checking pharmanet files for bad data (according to filtering algorithm provided by MoH subject matter expert)

pnet_check dsp_rpt.dat_dd_sliceapply.csv

Bad data, if detected, should appear in a separate file.

Example of analyzing mental health (MH) drug usage from PNET:

This script depends on one proprietary table (.xls) that is not provided here; otherwise, it should download, unpack, clean, concatenate, and analyze pharmanet data for a cohort without intervention.

pnet_druglist studyid.csv

Getting Help

Please contact [email protected] for assistance or to provide feedback

Contributors

ashlinrichardson, lindsay-fredrick, repo-mountie[bot]

License

Copyright 2020 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

diputils's Issues

Add project lifecycle badge

No Project Lifecycle Badge found in your readme!

Hello! I scanned your readme and could not find a project lifecycle badge. A project lifecycle badge will provide contributors to your project as well as other stakeholders (platform services, executive) insight into the lifecycle of your repository.

What is a Project Lifecycle Badge?

It is a simple image that neatly describes your project's stage in its lifecycle. More information can be found in the project lifecycle badges documentation.

What do I need to do?

I suggest you make a PR into your README.md and add a project lifecycle badge near the top where it is easy for your users to pick it up :). Once it is merged feel free to close this issue. I will not open up a new one :)

Add missing topics

TL;DR

Topics greatly improve the discoverability of repos; please add the short code from the table below to the topics of your repo so that ministries can use GitHub's search to find out what repos belong to them and other visitors can find useful content (and reuse it!).

Why Topic

In short order we'll add our 800th repo. This large number clearly demonstrates the success of using GitHub and our Open Source initiative. This huge success means it's critical that we work to make our content as discoverable as possible; through discoverability, we promote code reuse across a large decentralized organization like the Government of British Columbia, and allow ministries to find the repos they own.

What to do

Below is a table of abbreviations, a.k.a. short codes, for each ministry; they're the ones used in all @gov.bc.ca email addresses. Please add the short code of the ministry or organization that "owns" this repo as a topic.

add a topic

That's it, you're done!

How to use

Once topics are added, you can use them in GitHub's search. For example, enter something like org:bcgov topic:citz to find all the repos that belong to Citizens' Services. You can refine this search by adding key words specific to a subject you're interested in. To learn more about searching through repos check out GitHub's doc on searching.

Pro Tip 🤓

  • If your org is not in the list below, or the table contains errors, please create an issue here.

  • While you're doing this, add additional topics that would help someone searching for "something". These can be the language used javascript or R; something like opendata or data for data only repos; or any other key words that are useful.

  • Add a meaningful description to your repo. This is hugely valuable to people looking through our repositories.

  • If your application is live, add the production URL.

Ministry Short Codes

Short Code Organization Name
AEST Advanced Education, Skills & Training
AGRI Agriculture
ALC Agriculture Land Commission
AG Attorney General
MCF Children & Family Development
CITZ Citizens' Services
DBC Destination BC
EMBC Emergency Management BC
EAO Environmental Assessment Office
EDUC Education
EMPR Energy, Mines & Petroleum Resources
ENV Environment & Climate Change Strategy
FIN Finance
FLNR Forests, Lands, Natural Resource Operations & Rural Development
HLTH Health
IRR Indigenous Relations & Reconciliation
JEDC Jobs, Economic Development & Competitiveness
LBR Labour Policy & Legislation
LDB BC Liquor Distribution Branch
MMHA Mental Health & Addictions
MAH Municipal Affairs & Housing
BCPC Pension Corporation
PSA Public Safety & Solicitor General & Emergency B.C.
SDPR Social Development & Poverty Reduction
TCA Tourism, Arts & Culture
TRAN Transportation & Infrastructure

NOTE See an error or omission? Please create an issue here to get it remedied.

0. full linkage assessment

With respect to any two given data sets, quantify the size of the population mutually covered by both (a minimal sketch follows the list below)

  • compute a linkage matrix by key count and as a fraction of the total
  • multitemporal: compute the linkage matrix on data partitioned by year, and plot the changes
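
A minimal sketch of the single-pair case (illustrative only; the file names and the studyid key column are assumptions):

import csv

def studyids(path, key="studyid"):
    # Read the set of study IDs from a CSV file's key column.
    with open(path, newline="") as f:
        return {row[key] for row in csv.DictReader(f)}

a = studyids("dataset_a.csv")             # hypothetical file names
b = studyids("dataset_b.csv")
both = a & b
print(len(both), "IDs appear in both;",
      "fraction of combined population:", len(both) / len(a | b))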

It's Been a While Since This Repository has Been Updated

This issue is a kind reminder that your repository has been inactive for 181 days. Some repositories are maintained in accordance with business requirements that infrequently change thus appearing inactive, and some repositories are inactive because they are unmaintained.

To help differentiate products that are unmaintained from products that do not require frequent maintenance, repomountie will open an issue whenever a repository has not been updated in 180 days.

  • If this product is being actively maintained, please close this issue.
  • If this repository isn't being actively maintained anymore, please archive this repository. Also, for bonus points, please add a dormant or retired life cycle badge.

Thank you for your help ensuring effective governance of our open-source ecosystem!

1. In-memory Agnostic, Language Agnostic, Concurrent Pattern

An in-memory-agnostic, distributed-agnostic, language-agnostic pattern for reproducible methods and interoperability (concurrent and separate reads, processing and writes), plus a lightweight, scalable, GUI-agnostic "ETL-like" representation:

  • automatic point-and-click GUI generation (automatic creation of drop-down menus)
  • automatic discovery of functions within repos, plus auto-generation of hooks/interfaces to them from other languages
  • the environment should not be localized to a particular file-system directory or memory location, nor be specific to a given language


utilities and methods with a consistent interface across all languages of interest

Utilities and methods with a consistent interface across all languages of interest; of particular value: R, Python, and C++ for the heaviest lifting

  • code documentation autogeneration
  • workflow autodocumentation
  • learn and deploy reticulate
  • learn and deploy Rcpp
  • apply reticulate and Rcpp (or equivalent) to language-agnostic pattern

3. advanced methods

Implement convenient interfaces to, or occasionally directly implement, contemporary methods for general-purpose use within R or Python. Caveat: these need to be robust to moderately large data, and must be accessible from both R and Python

  • low-dimensional embedding for visualization: Isomap, t-SNE
  • combinatorial pattern exploration: correlation and co-occurrence analysis
  • unsupervised classification: k-means, DBSCAN, HDBSCAN, mean-shift clustering, HAC
  • supervised classification: C4.5 decision tree classifier, random forest classifier
  • encodings for categorical data: one-hot, etc.

2. basic ops

Basic operations to support minimal data quality assessment, make life more livable, and increase the ease and effectiveness of data-science SWAT-team deployments, all in the large-tabular-data context

  • path normalization for interop between environments (classify path format by OS and translate to native format)
  • data type detect: nominal, numeric, date, geo
  • date detect and format validation
  • data dictionary vs file matching
  • data dict normalization plus recovery from multiline cells
  • metadata: fields search, description search, w support for fuzzy matching
  • semantic matching
  • autodetect and application of human-readable lookups present in other tables
  • flatfile parsing -- all sets
  • dataset identification and integration
  • redundant records detection -- large data
  • lossless data compression
  • windowing for multitemporal analysis
  • low memory (large data) sorting, incl. but not limited to: by date!
  • no requirement for a specific install location
  • allow people to select versions for data
  • parse and filter largest files bypassing RAM memory limitation restrictions

