INS-ETL

Introduction

To mine data for the INS project, work from the root folder of the repository.

Mining

Simply run the following command to initiate the mining process:

node index.js

Data Loading

Change to the data_loading folder for the data loading process. The following steps show how to load the data into the database.

Pre-requisites

Python 3.6 or newer

An initialized and running Neo4j database

Add Dependencies

Run the following command to install dependencies.

pip3 install -r requirements.txt

Or run pip install -r requirements.txt if you are using virtualenv. The dependencies included in requirements.txt are listed below:

pyyaml

neo4j - version 1.7.6

boto3

requests
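Before running the loader, it can help to confirm that the interpreter and the listed packages are available. The following is a minimal sketch, not part of the repository. The import names are assumptions: the pyyaml package is normally imported as yaml, and the other three are assumed to be imported under their package names. The sketch does not check the neo4j driver version.

```python
import importlib.util
import sys

# Assumed import names for the packages listed in requirements.txt.
# pyyaml is imported as "yaml"; the others use their package names.
REQUIRED_MODULES = ["yaml", "neo4j", "boto3", "requests"]


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.6 or newer."""
    return tuple(version_info[:2]) >= (3, 6)


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

Calling missing_modules() with no arguments returns an empty list when every dependency is installed in the active environment.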

Configuration File

All Data Loader inputs can be set in a YAML configuration file using the fields defined below. Using a configuration file makes the Data Loader command significantly shorter.

An example configuration file can be found in config/config.yml.

neo4j:uri: Address of the target Neo4j endpoint

neo4j:user: Username to be used for the Neo4j database

neo4j:password: Password to be used for the Neo4j database

schema: The file path(s) of the YAML formatted schema file(s)

prop_file: The file containing the properties for the specified schema

dataset: The directory containing the data to be loaded, or a temporary directory if loading from an S3 bucket
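As an illustration of how these fields fit together, the sketch below models the configuration as a Python dictionary and reports which fields are absent. Only the field names come from the list above. The nesting under neo4j is inferred from names such as neo4j:uri, every value is a hypothetical placeholder, and treating all six fields as required is an assumption. The real config/config.yml may be organized differently.

```python
# Hypothetical values; only the field names are taken from the documentation.
EXAMPLE_CONFIG = {
    "neo4j": {
        "uri": "bolt://localhost:7687",  # placeholder endpoint
        "user": "neo4j",                 # placeholder username
        "password": "change-me",         # placeholder password
    },
    "schema": [
        "model-desc/ins_model_file.yaml",
        "model-desc/ins_model_properties.yaml",
    ],
    "prop_file": "model-desc/props-ins.yml",
    "dataset": "data",
}

# Assumption: every documented field is required.
REQUIRED_FIELDS = [
    ("neo4j", "uri"),
    ("neo4j", "user"),
    ("neo4j", "password"),
    ("schema",),
    ("prop_file",),
    ("dataset",),
]


def missing_fields(config):
    """Return the documented fields that are absent, written as 'a:b' paths."""
    missing = []
    for path in REQUIRED_FIELDS:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(":".join(path))
                break
            node = node[key]
    return missing
```

A dictionary loaded from the YAML file could be passed to missing_fields in the same way as EXAMPLE_CONFIG.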

Run the Pre-processing Pipeline

The INS project has a data pre-processing pipeline that consists of several Python scripts. These scripts format the data and, in one case, generate a report about it.

The raw gathered data will not work in the INS web application unless these scripts have been run.

Run these scripts from the root directory of the INS-ETL project in the following order. They act on the data in the '/data' directory:

  1. python date_restriction_for_outputs.py (filters data based on dates)

  2. python project_abstract_formatter.py (removes '\n' characters from project abstracts)

  3. python extra_whitespace_formatter.py (collapses any run of whitespace to a single space)

  4. python calculate_award_amount_ranges.py (formats a column of the project data for use on the UI)

  5. python tag_representative_project.py (formats a column of the project data that is internal to the application but required for it to work properly)

  6. python output_count_report.py (generates a report on the data, intended for data validation)
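The six steps above can be run in one go. The following is a minimal sketch, not a script shipped with the project: it takes the script names and their order from the list above and stops at the first script that exits with an error. The helper names build_commands and run_pipeline are hypothetical.

```python
import subprocess
import sys

# Script names and order are taken from the numbered list above.
PIPELINE_SCRIPTS = [
    "date_restriction_for_outputs.py",
    "project_abstract_formatter.py",
    "extra_whitespace_formatter.py",
    "calculate_award_amount_ranges.py",
    "tag_representative_project.py",
    "output_count_report.py",
]


def build_commands(python_executable=sys.executable):
    """Return one argument list per script, in pipeline order."""
    return [[python_executable, script] for script in PIPELINE_SCRIPTS]


def run_pipeline(runner=subprocess.run):
    """Run each script in order; check=True raises on the first failure."""
    for command in build_commands():
        runner(command, check=True)
```

Calling run_pipeline() from the root directory of the INS-ETL project runs the scripts with the same interpreter that runs the sketch. The runner parameter exists only so the order can be checked without executing anything.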

The scripts make the following assumptions:

  1. Every file has the extension '.txt' or '.tsv'. By convention, manually curated data ends in '.txt' and automatically gathered data ends in '.tsv'.
  2. Every filename starts with the type of data in the file, and this prefix is case sensitive. For example, type 'patent' has files 'patent_application.tsv' and 'patent_grant.tsv', not 'granted_patent.tsv' or 'Patent_application.tsv'. Any further annotation in the filename must come after the data type prefix.
  3. All files are tab delimited.
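The first two assumptions can be checked before the pipeline runs. The sketch below is one possible check, not code from the repository. The function name follows_convention is hypothetical, the list of data types has to be supplied by the caller, and the comparison of extensions is assumed to be case sensitive as well.

```python
import os

ALLOWED_EXTENSIONS = (".txt", ".tsv")


def follows_convention(filename, data_types):
    """Return True if the name has an allowed extension and starts with a data type.

    Both checks are case sensitive, so 'Patent_application.tsv' is rejected
    when the data type is 'patent'.
    """
    base = os.path.basename(filename)
    stem, extension = os.path.splitext(base)
    if extension not in ALLOWED_EXTENSIONS:
        return False
    return any(stem.startswith(data_type) for data_type in data_types)
```

With data_types set to ["patent"], the two accepted filenames from the example above pass and the two rejected ones fail.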

NOTE0: Manually curated data that was copied and pasted may contain characters that do not display properly and look like this: '�'. These need to be fixed by hand. In general, take care when preparing manually curated data.
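Locating these characters is easier with a small helper. The following sketch assumes the character in question is the Unicode replacement character U+FFFD, and the function name is hypothetical. It only reports positions; the fix itself still has to be made by hand.

```python
# Assumption: the badly displayed character is U+FFFD, the Unicode
# replacement character.
REPLACEMENT_CHARACTER = "\ufffd"


def find_replacement_characters(lines):
    """Return (line_number, column) pairs, both 1-based, for each U+FFFD found."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        for column, character in enumerate(line, start=1):
            if character == REPLACEMENT_CHARACTER:
                hits.append((line_number, column))
    return hits
```

The lines can come from a file opened with open(path, encoding="utf-8", errors="replace"), in which case bytes that cannot be decoded are also reported, because Python substitutes U+FFFD for them.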

NOTE1: Automatically gathered data is not always perfect. It occasionally contains sparse or ill-formatted rows, which can usually be removed safely. In our experience, such rows are very rare.
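One way to spot such rows is to compare each row's field count with the header's. The sketch below assumes the first row of the file is a header, which the text above does not state, and treats quote characters as ordinary data. The function name is hypothetical, and the sketch only reports rows; whether to remove them remains a manual decision.

```python
import csv
import io


def find_irregular_rows(tsv_text):
    """Return (row_number, field_count) for rows that differ from the header.

    Row numbers are 1-based and the header is row 1.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = next(reader, None)
    if header is None:
        return []
    expected = len(header)
    return [
        (row_number, len(row))
        for row_number, row in enumerate(reader, start=2)
        if len(row) != expected
    ]
```

A blank line in the middle of the file is reported with a field count of 0.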

Load Data into Neo4j

Run the following command from the data_loading folder to load the data into the Neo4j database:

python loader.py config/config.yml -p <neo4j password> -s model-desc/ins_model_file.yaml -s model-desc/ins_model_properties.yaml --prop-file model-desc/props-ins.yml --no-backup --dataset data
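The same command can be assembled in Python so that the password is read from the environment instead of being typed into the shell. This is only a sketch: the arguments are copied from the command above, while the function names and the NEO4J_PASSWORD variable name are hypothetical and not something the loader itself is known to read.

```python
import os


def build_loader_command(password):
    """Return the loader command from the README as an argument list."""
    return [
        "python", "loader.py", "config/config.yml",
        "-p", password,
        "-s", "model-desc/ins_model_file.yaml",
        "-s", "model-desc/ins_model_properties.yaml",
        "--prop-file", "model-desc/props-ins.yml",
        "--no-backup",
        "--dataset", "data",
    ]


def password_from_environment(variable="NEO4J_PASSWORD", environ=os.environ):
    """Read the password from a (hypothetical) environment variable."""
    password = environ.get(variable)
    if not password:
        raise RuntimeError("environment variable %s is not set" % variable)
    return password
```

The resulting list can be passed to subprocess.run with check=True and cwd="data_loading". The password is still visible in the process arguments while the loader runs.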

Contributors

alexd-conf, david-yuwei, n2iw
