Giter Club home page Giter Club logo

pg_ceds's Introduction

PostgreSQL flavored Common Education Data Standards Database

Initial pass at porting the T-SQL provided at https://ceds.ed.gov/data/ceds-nds-v5-sqlserver2008.zip. The script suggests using CREATE DATABASE CEDS; prior to running the script and connecting to that database, but it is not necessary. The script first creates the ceds schema ````CREATE SCHEMA ceds;``` and then begins building out tables, indices, and defining the primary/foreign key constraints.

Updates

CEDS is going through the next round of updates. It will take a while to incorporate those updates into the repository, but once available I should start working on it and make it available.

General info

On a MacBook Pro (i7 2.8Ghz quad core w/16GB RAM and an SSD) running PostgreSQL 9.4, it took < 5 seconds to build the data base and populate the lookup tables with the values provided by the CEDS initiative.

On an HP (i5 2.5GHz dual core w/4GB RAM and a spinning HD) running Windows 7 it took ~ 25 seconds to build the ceds Schema and tables (creating the data base was not included as part of the timing).

Files

ceds-nds-v5.sql

This file builds out the CEDS schema, all of the tables/objects, and all of the primary/foreign key constraints.

refTableData.sql

This file populates look up tables used for foreign key constraints across the system.

TODO

  1. Optimization. The current data table structure for the system has lots of inefficiencies that could be addressed quite easily by recasting the data. For example, the sortorder columns in lookup tables are cast as double precision floating point numbers, when an integer would completely suffice and other methods (e.g., during insertion of values) are available to control the order in which specific records appear in the table. Other times, varchar columns are used as ID codes when the table already includes a serial type ID column that could serve the same purpose. In essence, the myriad storage format issues in the CEDS schema as released by the US Department of Education is likely to create performance issues when the database grows due to the increased amount of disk over which queries will need to search for data elements.

  2. Currently the tables are set up with uniform parameters, but it is clear that some tables can be further optimized based on the likelihood of additional data being written to them. There are ~700 tables, so this will be a longer term project.

  3. Additionally, some of the field constraints (e.g., allowing null application dates) do not match the logical/scientific assumptions of the datum. For example, if an application is submitted it is a necessity that it be submitted on some date, at some time, to some location. Allowing these fields to remain null will allow bad data to continuously enter the system.

  4. Integrating PostGIS extensions. For address fields, creating normaddy and/or pagc_normaddy objects on entry will optimize geocoding operations on the back end and provide additional ways to interrogate the quality of the data (e.g., presenting the end user with each of the distinct fields defined in the two previously mentioned objects).

  5. Lastly, for some fields it makes sense to include additional exclusion constraints based on passing functional tests, but these are not currently implemented. For example, the characters ", &, !, @, #, $, %, ^, *, (, ), [, ], {, }, `, ;, :, <, >, /, ?, , |, =, +, 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0 are illegal characters in name fields (e.g., they are not allowed when naming a person). In some cases, the " character can be used to delimit an alias (e.g., William "Billy" Buchanan), which requires a separate business rule to enforce matching quotation marks in the field as the only possible way to allow the character. The other characters could easily be removed, or an error could be thrown to prevent the data from being committed in the data system. Building out these types of logical business processes shifts data quality burdens to the end users who are best positioned to fully interrogate the datum with a high degree of fidelity and expertise.

pg_ceds's People

Contributors

wbuchanan avatar

Stargazers

 avatar Jared Knowles avatar

Watchers

James Cloos avatar Jared Knowles avatar  avatar

Forkers

daresthedevil

pg_ceds's Issues

Sourcing script?

Hi @wbuchanan

What is the code to read this into PostGres? I'm not familiar with Postgres.

I am trying in psql to run

\i ceds-nds-v5.sql

But this keeps returning one of two errors for me -- either:

psql:ceds-nds-v5.sql:2: ERROR:  invalid byte sequence for encoding "UTF8": 0xff

OR

psql:ceds-nds-v5.sql:3: ERROR:  syntax error at or near "iot"
LINE 1: iot.i  ftEdry
        ^

Any idea what I might be doing wrong?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.