
Tabular data utilities to support the Data Innovation Program (DIP), which brings together data from different ministries in a secure environment, and other initiatives that use tabular data, such as the BCWS Fuel Type Layer (FTL) project

License: Apache License 2.0

Topics: citz, data-science, flnro, csv, education, fixed-width, health, mcfd, msp, pharmanet


diputils

Utilities to support the Data Innovation Program (DIP). Windows, Mac and Linux environments are generally supported; some components may be specific to Windows or Cygwin, or may require the POPDATA SRE to run

Features

This package currently supports several big-data-friendly operations for tabular data, reflecting properties of data in the DIP environment:

  • Fixed-width format, with a data dictionary or header file specifying field names and widths (see the parsing sketch after this list)
  • Described by DIP metadata, some of which is available from the BC Data Catalogue
  • Potentially quite large: tens of GB per file or more
  • Containing duplicate records that need to be de-duplicated
  • Zipped
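
As a quick illustration of the fixed-width-plus-data-dictionary idea (a minimal Python sketch; the field names and widths below are made up, not taken from any DIP data dictionary):

# A data dictionary supplies each field's name and width (values here are invented).
fields = [("studyid", 10), ("birth_year", 4), ("postal", 6)]

def parse_fixed_width(line, fields):
    # Split one fixed-width record into a dict of field name -> stripped value.
    record, start = {}, 0
    for name, width in fields:
        record[name] = line[start:start + width].strip()
        start += width
    return record

line = "A000000001" "1986" "V5K0A1"   # one 20-character record
print(parse_fixed_width(line, fields))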

As another example, de-duplicating records is easy to neglect if no efficient implementation is available for doing it. Note: de-duplication currently requires enough RAM to hold the data, so it may be necessary to split datasets (e.g. by year) first.
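
A minimal sketch of why the in-memory approach needs RAM proportional to the number of distinct records (illustrative only; unique.cpp is the tool intended for this, and the file names below are hypothetical):

seen = set()
with open("records.csv") as fin, open("records_unique.csv", "w") as fout:
    for line in fin:
        if line not in seen:       # every distinct record stays in memory
            seen.add(line)
            fout.write(line)
print(len(seen), "unique records retained")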

Guiding principles:

  • Making results obtainable by increasing the data volume that can be processed, e.g. by reading files incrementally so that the system's main storage becomes the limiting factor (i.e., moving from the tens-of-GB range to the hundreds-or-thousands-of-GB range)
  • Self-contained: no dependencies beyond the base language features included in Python, C/C++ and R. When working in a secure environment, software requests can take weeks to be approved, so having a reference set of functions that are simple and transparent enough to recreate portions of manually (if need be) supports researchers' flexibility
  • Language agnostic, using a quick-and-dirty approach: borrowing from the Unix tradition by making procedures (written in any language) available (from any language) through the system interface

In R:

system("command_to_run")

In C/C++:

system("command to run")

In python:

import os; os.system("command_to_run")
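
A slightly fuller version of the same pattern (a minimal sketch; subprocess is used instead of os.system to capture output and errors, and the csv_slice invocation mirrors the usage message shown later in this README):

import subprocess

# Run a diputils command-line utility and check that it succeeded.
result = subprocess.run(["./csv_slice.exe", "cohort.csv", "studyid"],
                        capture_output=True, text=True)
if result.returncode != 0:
    print("csv_slice failed:", result.stderr)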

Project Status

This project is developed and supported by the DPD Partnerships and Capacity (PAC) branch and partners, including the BC Wildfire Service (BCWS) Predictive Services Unit

Projects supported include:

  • Race-based data initiative (RBD)
  • DIP projects:
    • CYMH
    • DIP Development
    • Education Special Needs
    • Children in Care (In-Care Network)
  • BC Wildfire Service (BCWS) Fuel Type Layer Project

Example contents by directory:

cpp: C++ based scripts

  • unique.cpp: data de-duplication. This script makes it possible to de-duplicate extremely large tables. When it concludes, it reports how many unique records were retained in the output
  • dd_slice_apply_cohort.cpp: using a data dictionary, convert a flat-file to CSV format. This version of the script takes a cohort file as input so that only records pertaining to a cohort of interest are retained (can be very helpful to reduce data volume)
  • unique_msp.cpp: MSP specific: filters MSP (Medical Services Plan) data for unique records, based on the fields that define a unique MSP transaction
  • unzp.cpp: unzip all zip files in the present directory in parallel, for speed
  • csv_slice.cpp: slice certain columns out of a (potentially arbitrarily large) CSV file
  • csv_sort_date.cpp: sort the records of a CSV file by date
  • csv_split_year.cpp: split a large CSV file into per-year portions
  • csv_split.cpp: convert a CSV file into a columnar format (multi single-col "CSV" files)
  • count_col.cpp: fast counting of outcomes within columnar dataset (single-col "CSV")
  • csv_select.cpp: quick and dirty version of SELECT command in SQL
  • pqt.cpp: prototype data compression by dictionary encoding and bit packing (see the sketch after this list)
  • upqt.cpp: undo the above compression
  • csv_cat.cpp: concatenate CSV files to create a larger (arbitrarily large) one!
  • pnet_check.cpp: filter pharmanet (PNET) data according to an industry-standard method
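
To illustrate the general idea behind pqt.cpp (a minimal Python sketch of dictionary encoding and bit packing; this is not the repository's actual on-disk format):

# Dictionary encoding: replace repeated values with small integer codes,
# then bit-pack the codes so each uses only as many bits as needed.
values = ["apple", "pear", "apple", "apple", "plum", "pear"]

dictionary, codes = {}, []
for v in values:
    codes.append(dictionary.setdefault(v, len(dictionary)))

bits_per_code = max(1, (len(dictionary) - 1).bit_length())   # 2 bits for 3 distinct values
packed = 0
for i, c in enumerate(codes):
    packed |= c << (i * bits_per_code)

print(dictionary, codes, bits_per_code, hex(packed))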

py: Python based scripts

  • get_metadata.py: fetch DIP metadata from BC Data Catalogue (BCDC)
  • make_fakedata.py: synthesize DIP data from the above metadata
  • dd_list.py: find and convert "all" data dictionaries available in the environment (all at the time of development) to a common, cleaned format
  • dd_match_data_file.py: for a data file, find an acceptable data dictionary to use to open it
  • df_list.py: list available data files (flat files) in the POPDATA environment
  • dd_get.py: pull the "most current" copy of a given table (named by its file name)
  • csv_to_fixedwidth.py: convert CSV data to "fixed width" format
  • parquet_to_csv.py: convert an Apache-parquet format file to CSV format
  • csv_grep.py: grep operation for CSV files whose output is also valid CSV, with header (see the sketch after this list)
  • indent.py: indent a code file
  • multicore.py: run jobs listed in a shell script, in parallel
  • dd_sliceapply_all.py: convert all flat files in a directory to CSV (finding the data dictionaries automatically, etc.)
  • forever.py: loop a command repeatedly, for example when monitoring jobs that run for hours or days
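
A minimal sketch of the csv_grep idea (illustrative only, not the repository's implementation; the input file name and search term are hypothetical):

import csv

pattern = "asthma"                                   # hypothetical search term
with open("input.csv", newline="") as fin, \
     open("input.csv_grep.csv", "w", newline="") as fout:
    reader, writer = csv.reader(fin), csv.writer(fout)
    writer.writerow(next(reader))                    # keep the header: output is valid CSV
    for row in reader:
        if any(pattern in field for field in row):
            writer.writerow(row)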

R: R based scripts

  • dd_unpack.R: convert an Excel file (e.g. data dictionary) to CSV format
  • fw2parquet.R: convert a fixed-width file (with data dictionary in expected format) to Apache Parquet format
  • csv_plot.R: create a simple scatter plot (with trendline) from a CSV (specific 2-col format)

c: C based scripts

Sample use (outside of DIP):

get_metadata.py (fetch DIP metadata from DataBC)

Terminal:

python3 get_metadata.py

(Alternatively, open the script in Python IDLE and select Run from the menu, or press F5)
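
For context, the BC Data Catalogue is a CKAN-based catalogue, so record metadata can be fetched through the standard CKAN API. A minimal sketch (illustrative only, and not necessarily how get_metadata.py works; the URL and query string are assumptions):

import json
import urllib.request

# package_search is a standard CKAN endpoint; the query string here is an assumption.
url = ("https://catalogue.data.gov.bc.ca/api/3/action/"
       "package_search?q=data+innovation+program")
with urllib.request.urlopen(url) as response:
    result = json.load(response)

for record in result["result"]["results"]:
    print(record["title"])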

make_fakedata.py (generate fake data, loosely based on DIP metadata from DataBC; the data are not realistic)

Terminal:

python3 make_fakedata.py

(Alternatively, open Python scripts in Python IDLE and select Run from the menu, or press F5)

Optional arguments:

python3 make_fakedata.py [minimum number of rows, per file] [maximum number of rows, per file]

  • minimum number of rows, per file: lower bound for the random number of rows of synthetic data to generate
  • maximum number of rows, per file: upper bound for the random number of rows of synthetic data to generate
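
For example (the row counts here are arbitrary), to generate between 100 and 1000 rows per file:

python3 make_fakedata.py 100 1000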

Notes: CSV format synthetic data (.csv) are supplied alongside zipped fixed-width format synthetic data (.zip) for comparison. In the actual DIP environment:

  • data are provided in fixed-width format ONLY (no CSV files)
  • the data dictionaries don't typically appear in the same folder as the fixed-width files they're associated with. Software for matching fixed-width files with data dictionaries is provided in this repository

Using inside of DIP environment

Building "diputils" command-line utilities in SRE:

  1. Copy the contents of this folder into your private folder (R:/$USER/bin/)
  • copy the tar.gz file into your home folder, and extract it there
  • this file, bash.bat, etc. should reside directly in R:/$USER/bin/: for example, if they were extracted to R:/$USER/bin/diputils-master/, the files should be moved up a level
  2. Navigate there (R:/$USER/bin) in Windows Explorer (the file manager)

    • for example, if your user name is "bob", the place to go is: R:/bob/bin
  3. Double-click on bash.bat to enter the Cygwin bash prompt (the programs should be built automatically)

To check whether the utilities are working, type (and press return):

csv_slice

If the programs built correctly, you should see something like:

Error: usage: slice.exe [infile] [Field name] .. [Field name n]

To find out where a particular script lives (at the terminal, in the bin/ folder), for example to find csv_slice:

find ./ -name "csv_slice*"

in this instance the output was:

./cpp/csv_slice.cpp

./csv_slice.exe

The .cpp file is the C++ source code in the appropriate folder; the .exe file is the command invoked at the terminal

Notes:

To find your user name at the Linux / Cygwin prompt:

type (followed by return):

whoami

For example, if your user name were bob, the terminal should respond with:

bob

Example uses (not all DIP specific):

Opening and unpacking data for a cohort

This operation may be quite slow and require some manual intervention. The process here likely covers only a fraction of the available data sets, since new ones have been added over time, and there may be issues due to formatting updates.

Copy a cohort file (a CSV with a studyid column) to the present (tmp) folder:

cp /cygdrive/r/.../cohort.csv .

To confirm the file is there, type:

ls

And press return.

Slice out the studyid field

This is a terminal command, so press return after it:

csv_slice studyid cohort.csv

Examine the first 10 lines of the result file:

head -10 cohort.csv_slice.csv

Move the studyid-only file to a simpler filename:

For convenience:

mv cohort.csv_slice.csv studyid.csv

Then press return.

Fetch and extract all data for a cohort:

sup_cohort studyid.csv

Fetching and unpacking the latest version of a specific data file

Find pharmanet data

find /cygdrive/r/DATA/ -name "*pharmanet*"

/cygdrive/r/DATA/2019-04-24/docs/data_dictionary_pharmanet-january-1-1996-onwards.xlsx

/cygdrive/r/DATA/2019-04-24/pharmanet

Make a local copy of pharmanet files (subset for your study population):

pnet_get studyid.csv

Note: this script needs to be updated to reflect currently available data here

Converting a "flat file" to csv:

First get a copy of the file:

df_get hlth_prod_final.dat

And convert it to CSV:

dd_sliceapply_all hlth_prod_final.dat

In-place removal of whitespace characters from end of a file

This promotes sanity, as some programs may interpret a terminating newline character as an extra record, leading to inaccuracies or errors:

snip studyid.csv
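
A minimal sketch of the same idea (illustrative only, not the repository's snip implementation), truncating trailing whitespace in place:

import os

filename = "studyid.csv"                  # hypothetical input
with open(filename, "rb+") as f:
    f.seek(0, os.SEEK_END)
    end = f.tell()
    # Step backwards over trailing whitespace bytes (newlines, spaces, tabs, CR).
    while end > 0:
        f.seek(end - 1)
        if f.read(1) not in b"\r\n \t":
            break
        end -= 1
    f.truncate(end)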

Concatenating pharmanet files

The script pnet_get, above, handles this.

Checking pharmanet files for bad data (according to filtering algorithm provided by MoH subject matter expert)

pnet_check dsp_rpt.dat_dd_sliceapply.csv

Bad data, if detected, should appear in a separate file.

Example of analyzing mental health (MH) drug usage from PNET:

This script depends on one proprietary table (.xls) that is not provided here; otherwise, it should download, unpack, clean, concatenate, and analyze pharmanet data for a cohort without intervention.

pnet_druglist studyid.csv

Getting Help

Please contact [email protected] for assistance or to provide feedback

Contributors

ashlinrichardson, lindsay-fredrick, repo-mountie[bot]

License

Copyright 2020 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

diputils's Issues

Add project lifecycle badge

No Project Lifecycle Badge found in your readme!

Hello! I scanned your readme and could not find a project lifecycle badge. A project lifecycle badge will provide contributors to your project as well as other stakeholders (platform services, executive) insight into the lifecycle of your repository.

What is a Project Lifecycle Badge?

It is a simple image that neatly describes your project's stage in its lifecycle. More information can be found in the project lifecycle badges documentation.

What do I need to do?

I suggest you make a PR into your README.md and add a project lifecycle badge near the top where it is easy for your users to pick it up :). Once it is merged feel free to close this issue. I will not open up a new one :)

Add missing topics

TL;DR

Topics greatly improve the discoverability of repos; please add the short code from the table below to the topics of your repo so that ministries can use GitHub's search to find out what repos belong to them and other visitors can find useful content (and reuse it!).

Why Topic

In short order we'll add our 800th repo. This large number clearly demonstrates the success of using GitHub and our Open Source initiative. This huge success means it's critical that we work to make our content as discoverable as possible; through discoverability, we promote code reuse across a large decentralized organization like the Government of British Columbia, and allow ministries to find the repos they own.

What to do

Below is a table of abbreviations, a.k.a. short codes, for each ministry; they're the ones used in all @gov.bc.ca email addresses. Please add the short code of the ministry or organization that "owns" this repo as a topic.

add a topic

That's it, you're done!

How to use

Once topics are added, you can use them in GitHub's search. For example, enter something like org:bcgov topic:citz to find all the repos that belong to Citizens' Services. You can refine this search by adding key words specific to a subject you're interested in. To learn more about searching through repos check out GitHub's doc on searching.

Pro Tip 🤓

  • If your org is not in the list below, or the table contains errors, please create an issue here.

  • While you're doing this, add additional topics that would help someone searching for "something". These can be the language used javascript or R; something like opendata or data for data only repos; or any other key words that are useful.

  • Add a meaningful description to your repo. This is hugely valuable to people looking through our repositories.

  • If your application is live, add the production URL.

Ministry Short Codes

Short Code Organization Name
AEST Advanced Education, Skills & Training
AGRI Agriculture
ALC Agriculture Land Commission
AG Attorney General
MCF Children & Family Development
CITZ Citizens' Services
DBC Destination BC
EMBC Emergency Management BC
EAO Environmental Assessment Office
EDUC Education
EMPR Energy, Mines & Petroleum Resources
ENV Environment & Climate Change Strategy
FIN Finance
FLNR Forests, Lands, Natural Resource Operations & Rural Development
HLTH Health
IRR Indigenous Relations & Reconciliation
JEDC Jobs, Economic Development & Competitiveness
LBR Labour Policy & Legislation
LDB BC Liquor Distribution Branch
MMHA Mental Health & Addictions
MAH Municipal Affairs & Housing
BCPC Pension Corporation
PSA Public Safety & Solicitor General & Emergency B.C.
SDPR Social Development & Poverty Reduction
TCA Tourism, Arts & Culture
TRAN Transportation & Infrastructure

NOTE See an error or omission? Please create an issue here to get it remedied.

0. full linkage assessment

With respect to any two given data sets, quantify the size of the population mutually covered by both (a minimal sketch follows the list below)

  • compute a linkage matrix by key count and as a fraction of the total
  • multitemporal: compute the linkage matrix on data partitioned by year, and plot the changes
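
A minimal sketch of the single-pair case (illustrative only; the file names and the studyid key column are assumptions):

import csv

def studyids(path, key="studyid"):
    # Read the set of study IDs from a CSV file's key column.
    with open(path, newline="") as f:
        return {row[key] for row in csv.DictReader(f)}

a = studyids("dataset_a.csv")             # hypothetical file names
b = studyids("dataset_b.csv")
both = a & b
print(len(both), "IDs appear in both;",
      "fraction of combined population:", len(both) / len(a | b))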

It's Been a While Since This Repository has Been Updated

This issue is a kind reminder that your repository has been inactive for 181 days. Some repositories are maintained in accordance with business requirements that infrequently change thus appearing inactive, and some repositories are inactive because they are unmaintained.

To help differentiate products that are unmaintained from products that do not require frequent maintenance, repomountie will open an issue whenever a repository has not been updated in 180 days.

  • If this product is being actively maintained, please close this issue.
  • If this repository isn't being actively maintained anymore, please archive this repository. Also, for bonus points, please add a dormant or retired life cycle badge.

Thank you for your help ensuring effective governance of our open-source ecosystem!

1. In-memory Agnostic, Language Agnostic, Concurrent Pattern

An in-memory-agnostic, distributed-agnostic, language-agnostic pattern for reproducible methods and interoperability (concurrent and separate reads, processing and writes), plus a lightweight, scalable, GUI-agnostic "ETL-like" representation:

  • automatic point-and-click GUI generation (automatic creation of drop-down menus)
  • automatic discovery of functions within repos, plus auto-generation of hooks/interfaces to them from other languages
  • the environment should not be localized to a particular file-system directory or memory location, nor be specific to a given language


utilities and methods with a consistent interface across all languages of interest

Utilities and methods with a consistent interface across all languages of interest; of particular value: R, Python, and C++ for the heaviest lifting

  • code documentation autogeneration
  • workflow autodocumentation
  • learn and deploy reticulate
  • learn and deploy Rcpp
  • apply reticulate and Rcpp (or equivalent) to language-agnostic pattern

3. advanced methods

Implement convenient interfaces to, or occasionally directly implement, contemporary methods for general-purpose use within R or Python. Caveat: these need to be robust to moderately large data, and must be accessible from both R and Python

  • low-dimensional embedding for visualization: Isomap, t-SNE
  • combinatorial pattern exploration: correlation and co-occurrence analysis
  • unsupervised classification: k-means, DBSCAN, HDBSCAN, mean-shift clustering, HAC
  • supervised classification: C4.5 decision tree classifier, random forest classifier
  • encodings for categorical data: one-hot, etc.

2. basic ops

Basic operations to support minimal data quality assessment, make life more livable, and increase the ease and effectiveness of data-science SWAT-team deployments, all in the large-tabular-data context

  • path normalization for interop between environments (classify path format by OS and translate to native format)
  • data type detect: nominal, numeric, date, geo
  • date detect and format validation
  • data dictionary vs file matching
  • data dict normalization plus recovery from multiline cells
  • metadata: fields search, description search, w support for fuzzy matching
  • semantic matching
  • autodetect and application of human-readable lookups present in other tables
  • flatfile parsing -- all sets
  • dataset identification and integration
  • redundant records detection -- large data
  • lossless data compression
  • windowing for multitemporal analysis
  • low memory (large data) sorting, incl. but not limited to: by date!
  • no requirement for a specific install location
  • allow people to select versions for data
  • parse and filter largest files bypassing RAM memory limitation restrictions

