Location Data Importer

Location Data Importer is a Scala application that imports Ordnance Survey data into a MongoDB database. It is primarily used to back the locate-api, which provides address and local authority lookups by postcode or GSS code.

Issued under an MIT license, see LICENSE file.

Ordnance Survey

This is backed by two Ordnance Survey products:

- Address Base Premium: This provides all the details of actual addresses. [https://www.ordnancesurvey.co.uk/psma/index.html](https://www.ordnancesurvey.co.uk/psma/index.html)
- Code Point: Provides a mapping between postcodes and a number of ONS statistical codes. These codes are used to identify local authorities, countries and electoral wards.

Additionally, we use the ONS Local Authority dataset, manually linked to the OS local custodian dataset, to provide extra validation of the Address to Local Authority resolution.

Download from:

https://www.ordnancesurvey.co.uk/sso/login.shtml

The update timetable can be found at:

http://www.ordnancesurvey.co.uk/business-and-government/help-and-support/products/addressbase-epoch-dates.html

Release notes:

http://www.ordnancesurvey.co.uk/business-and-government/help-and-support/products/addressbase-release-notes.html

NLPG Data entry conventions:

http://www.iahub.net/docs/1398672866952.pdf

Basic usage

The importer compiles to a jar:

(1) Use SBT to build project (from project root)

$ ./sbt
> assembly

(2) Run jar (from project root)

$ java -jar target/scala-2.10/location-data-importer.jar

The jar runs with command line options. Help is available:

$ java -jar target/scala-2.10/location-data-importer.jar --help

Example run:

$ java -jar target/scala-2.10/location-data-importer.jar -a docs/examples/addressbase/ -c docs/examples/codepoint/

A script in the root of the project runs the above:

$ ./run-examples.sh

Mongo:

- The importer uses a Mongo database, which can be installed via ./scripts/mongo/mongo-replica-set.sh

Command Line Options

Usage: Location Data Importer [options]

   -a <value> | --addresses <value>
         Location of address base files files
   -c <value> | --codepoint <value>
         Location of code point files)
   -o | --addressOnly
         Only do address import stage. (Default process all stages)
   -p <value> | --password <value>
         Password for the mongo (default none)
   -u <value> | --username <value>
         Username for the mongo (default none)
  --version
        0.1
  • addresses: Directory containing address base premium files. (provide path)

  • codepoint: Directory containing code point files. (provide path)

  • addressOnly: If the codepoint and street phases are complete, just run the address phase

  • password / username: Mongo credentials if required.

Directory Structure

  • docs: Contains various useful documents regarding Ordnance Survey products

    • user guide for code point.
    • user guides for boundary line.
    • a number of Address Base Premium documents, including an introductory PowerPoint.
    • example CSV files for an address base file and a code point file.
  • scripts: mongo set up scripts

  • test-cases: OS supplied test cases for checking quality of generated addresses

  • testdata: Test files for application test suite

Process:

(1) Code point files are parsed and inserted into the Mongo DB. This is done up front because the resulting collection serves as a reference table: address postcodes are queried against it to discover GSS codes.

Fields:

  • country: Derived from ONS codes: England (E92000001), Scotland (S92000003), Wales (W92000004), N Ireland (N92000002).

  • gssCode: Unitary Authority, Metropolitan and Non-Metropolitan District, London Borough or Scottish Council Area in which the postcode falls.

  • easting/northing/lat/long: location of the CPLC. The CPLC is the location indicator for this code point: a point within the postcode area that is nearest the mean position of its postal addresses. It is not the geographical centre of the postcode area.

  • nhsRegionalHealthAuthority: English Pan Strategic Health Authority in which CPLC falls. [optional]

  • nhsHealthAuthority: English Strategic Health Authority or Scottish Health Board in which CPLC falls. [optional]

  • county: County in which CPLC falls. [optional]

  • ward: Electoral Ward or Division in which CPLC falls. [optional]

    - Example document:
    
    {
        "_id" : ObjectId("52b1e7c38de22313daeedc38"),
        "postcode" : "ab101aa",
        "country" : "Scotland",
        "gssCode" : "S12000033",
        "name" : "Grampian",
        "easting" : 394251,
        "northing" : 806376,
        "lat" : 57.14823168960546,
        "long" : -2.0966478399737416,
        "nhsRegionalHealthAuthority" : "S08000005",
        "nhsHealthAuthority" : "S08000006",
        "county" : "S08000004",
        "ward" : "S13002483"
    }
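
The country field described above can be derived with a simple lookup on the ONS country code. The following is a minimal Scala sketch of that idea; the names `Countries` and `fromOnsCode` are hypothetical and not taken from the importer's source, and trimming and upper-casing the input is an assumption.

```scala
// Sketch only: object and method names are hypothetical, not the importer's own.
object Countries {
  // ONS country codes as listed in the field description above.
  private val byOnsCode: Map[String, String] = Map(
    "E92000001" -> "England",
    "S92000003" -> "Scotland",
    "W92000004" -> "Wales",
    "N92000002" -> "N Ireland"
  )

  /** Returns the country name for an ONS country code, or None if the code is unknown. */
  def fromOnsCode(code: String): Option[String] =
    byOnsCode.get(code.trim.toUpperCase) // assumption: input may carry stray whitespace or case
}
```

Returning an `Option` keeps the "unknown code" case explicit, so the caller decides whether to reject the row or log it.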
    

(2) Address base files are parsed for Street data and inserted into the Mongo DB.

- The persisted street objects are a join between the Address Base Premium notions of
    - Street
    - Street Description

- This is done because each street exists in a single Address Base file but may be referenced by address objects across several files. Consequently we process streets first and use the collection as a lookup table when constructing the more complex address objects.

- Example document:

    {
        "_id" : ObjectId("52b1e85b8de22313db0943d2"),
        "usrn" : "21870044",
        "streetDescription" : "NIPPER ALLEY",
        "townName" : "KINGSTON UPON THAMES",
        "administrativeArea" : "KINGSTON UPON THAMES",
        "recordType" : "officiallyDesignated",
        "state" : "open",
        "surface" : "metalled",
        "classification" : "allVehicles",
        "file" : "TQ1565.csv"
    }
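
The join between Street and Street Description rows, and the resulting lookup by USRN, might look roughly like the sketch below. All type and field names are hypothetical simplifications (the real AddressBase records carry many more fields), and dropping streets that have no description is an assumption about the importer's behaviour.

```scala
// Hypothetical, simplified row types; not the importer's actual model.
case class StreetRow(usrn: String, recordType: String, state: String)
case class StreetDescriptorRow(usrn: String, description: String, town: String, administrativeArea: String)
case class Street(usrn: String, streetDescription: String, townName: String,
                  administrativeArea: String, recordType: String, state: String)

object Streets {
  /** Joins street rows to their descriptors on USRN and indexes the result by USRN.
    * Assumption: streets without a matching descriptor are dropped. */
  def join(streets: Seq[StreetRow], descriptors: Seq[StreetDescriptorRow]): Map[String, Street] = {
    val descriptorByUsrn = descriptors.map(d => d.usrn -> d).toMap
    streets.flatMap { s =>
      descriptorByUsrn.get(s.usrn).map { d =>
        s.usrn -> Street(s.usrn, d.description, d.town, d.administrativeArea, s.recordType, s.state)
      }
    }.toMap
  }
}
```

In the importer the joined objects are stored in Mongo rather than held in a `Map`; the in-memory map here only illustrates the lookup-by-USRN shape.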

(3) Address base files are then processed a second time to build up addresses.

- Address base files contain several types of row. We care about:
    - BLPU 
        Required – a BLPU is defined as a real-world object that is an ‘area of land, property or structure of fixed location having uniform occupation, ownership or function’. The BLPU is the core element of AddressBase Premium. In essence, a BLPU associates a real-world object on the ground to a UPRN.
    - LPI 
        Required – an LPI is a structured text entry that identifies a BLPU.
    - Organisation 
        (Optional. Company at an address)
    - Classification 
        (Required. Current use of the property, i.e. residential or commercial, to some level of details)
    - DeliveryPoint
        (Optional) Used to verify the postcode on the BLPU. If there is a discrepancy, the DeliveryPoint postcode is used.

Each CSV file denotes an area. Each file contains many rows (often tens of thousands), covering each of the row types defined above. The uploader loads the entire file into memory and tries to associate each BLPU with the appropriate LPI, Organisation and Classification. This is done by means of the UPRN, which acts as a primary key across these data types.
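
As an illustration of the UPRN association described above, the sketch below groups each row type by UPRN and attaches the matches to each BLPU. The case classes are hypothetical, heavily reduced stand-ins for the real row types, and taking the first Organisation for a UPRN is an assumption.

```scala
// Hypothetical, heavily simplified row types keyed by UPRN.
case class Blpu(uprn: String, postcode: String, endDate: Option[String])
case class Lpi(uprn: String, usrn: String, active: Boolean)
case class Classification(uprn: String, code: String)
case class Organisation(uprn: String, name: String)

case class AddressRows(blpu: Blpu, lpis: Seq[Lpi],
                       classifications: Seq[Classification], organisation: Option[Organisation])

object UprnJoin {
  /** Collects, for each BLPU, the rows of the other types that share its UPRN. */
  def associate(blpus: Seq[Blpu], lpis: Seq[Lpi],
                classifications: Seq[Classification],
                organisations: Seq[Organisation]): Seq[AddressRows] = {
    // Index each row type once so the per-BLPU lookup is a map access.
    val lpisByUprn = lpis.groupBy(_.uprn)
    val classByUprn = classifications.groupBy(_.uprn)
    val orgByUprn = organisations.groupBy(_.uprn)
    blpus.map { b =>
      AddressRows(
        b,
        lpisByUprn.getOrElse(b.uprn, Seq.empty),
        classByUprn.getOrElse(b.uprn, Seq.empty),
        orgByUprn.get(b.uprn).flatMap(_.headOption) // assumption: first organisation wins
      )
    }
  }
}
```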

- Filtering BLPUs
    - We don't treat all BLPUs as valid, and some will be skipped according to the following rules:
    (1) BLPU has an end date. We only want current BLPUs
    (2) No associated LPI available.
    (3) No active LPI available.
    (4) No classification available.
    (5) No matching CodePoint entry
    (6) No matching active street
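
The six skip rules above can be read as an ordered chain of checks. The sketch below is one possible encoding; the `Candidate` type is hypothetical and carries only the facts the rules need, and the order in which the rules are evaluated is an assumption.

```scala
// Hypothetical candidate record: only the facts the six rules need.
case class Candidate(
  blpuEndDate: Option[String], // set when the BLPU is no longer current
  lpiActive: Seq[Boolean],     // one flag per LPI sharing the UPRN
  hasClassification: Boolean,
  hasCodePoint: Boolean,       // postcode found in the code point collection
  hasActiveStreet: Boolean     // USRN resolves to an active street
)

object BlpuFilter {
  /** Returns the first reason to skip the candidate, or None if it passes all six rules. */
  def skipReason(c: Candidate): Option[String] =
    if (c.blpuEndDate.isDefined) Some("BLPU has an end date")
    else if (c.lpiActive.isEmpty) Some("No associated LPI available")
    else if (!c.lpiActive.contains(true)) Some("No active LPI available")
    else if (!c.hasClassification) Some("No classification available")
    else if (!c.hasCodePoint) Some("No matching CodePoint entry")
    else if (!c.hasActiveStreet) Some("No matching active street")
    else None
}
```

Returning the reason rather than a plain Boolean makes it straightforward to report why an address was rejected.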

(4) Persisting Addresses

- Once a list of AddressWrappers (objects containing the BLPU, LPI, Classification and optional Organisation) has been created, we iterate through it, constructing the Address format we want to persist.

- We query the code point and street collections in Mongo, by postcode and by USRN (unique street reference number, present on an LPI) respectively, to acquire the details of the Country and Local Authority (from the code point) and of the street (from the street collection).

- Example JSON:

    {
        "_id" : ObjectId("533d9b8d8de2c5f30a429e3c"),
        "postcode" : "kt13aw",
        "gssCode" : "E09000021",
        "country" : "England",
        "createdAt" : ISODate("2014-04-03T17:34:03.990Z"),
        "presentation" : {
            "property" : "Po Box Number 1487",
            "street" : "Hogsmill Lane",
            "town" : "Kingston Upon Thames",
            "postcode" : "KT1 3AW",
            "uprn" : "10015033889"
        },
        "location" : {
            "lat" : 51.404780948504616,
            "long" : -0.2928459753727967
        },
        "details" : {
            "blpuCreatedAt" : ISODate("2010-01-16T00:00:00Z"),
            "blpuUpdatedAt" : ISODate("2010-01-16T00:00:00Z"),
            "classification" : "OR03",
            "state" : "approved",
            "isPostalAddress" : false,
            "isCommercial" : false,
            "isResidential" : false,
            "usrn" : "21880163",
            "file" : "TQ1565.csv",
            "organisation" : "Medical Associates Ltd",
            "primaryClassification" : "Other",
            "secondaryClassification" : "Royal Mail Infrastructure"
        },
        "ordering" : {
            "paoText" : "poboxnumber1487"
        }
    }

(5) Processing files

- Directories supplied by the user on the command line are checked to ensure they contain only CSV files. It is an error if anything else is present.

- Files are checked for validity on a per row basis - for correct number of columns per type, and for having the mandatory fields required to create the above data structures.
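
The two checks above could be expressed as small pure functions, as in the sketch below. The names are hypothetical, the case-insensitive `.csv` match is an assumption, and the expected column count and mandatory column positions per row type are parameters here because the source does not list them.

```scala
object FileChecks {
  /** Names from a directory listing that are not CSV files; a non-empty result is an error.
    * Assumption: the extension is matched case-insensitively. */
  def nonCsvFiles(fileNames: Seq[String]): Seq[String] =
    fileNames.filterNot(_.toLowerCase.endsWith(".csv"))

  /** True when a parsed row has the expected column count and every
    * mandatory column (given by zero-based index) is non-blank. */
  def validRow(columns: Seq[String], expectedColumns: Int, mandatory: Seq[Int]): Boolean =
    columns.size == expectedColumns &&
      mandatory.forall(i => i >= 0 && i < columns.size && columns(i).trim.nonEmpty)
}
```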

(6) Logging

Uses Java logging, configured via logback.xml.

(7) Credentials

Credentials for APIs are not created by the application. To create tokens for the API clients, insert them manually:

    db.authorizationToken.insert({token:"Me6ZGsSKqkVLNLS9fzYBvrGCQF4"})

    db.authorizationToken.ensureIndex({token:1})

Locate API has endpoints for making new tokens, but needs a first token to get going.

Boundary line

Boundary line is the polygon data set outlining the various administrative boundaries in the UK. This is required to give the correct granular mapping between an address and the area in which it sits.

Currently this is not used as part of the import; only the custodian codes in AddressBase and the code point files are used.

Boundary line comes as shape files. Processing shape files into mongo can be done with GDAL.

    brew install gdal
    ogr2ogr -f "GeoJSON" ~/unit.json unitary_electoral_division_region.shp unitary_electoral_division_region
    ogr2ogr -f "GeoJSON" ~/district.json district_borough_unitary_region.shp district_borough_unitary_region

Data sets currently used are: district_borough_unitary_region

The docs directory contains a geojson file containing the data which can be imported into mongo with the following command:

mongoimport --db locate --collection authorityBoundaries district-quarter.geojson

Validation

Pre/Post import it's worth checking the state of the nation:

(1) Get full count:

db.addresses.count()

(2) Count per Local Authority, stores results in the collection local_authority_test:

db.addresses.mapReduce(
	function() {emit(this.gssCode, 1);},
	function(key, values) {return Array.sum(values);},
	{
		out: "local_authority_test"
	}
)

(2) Count per postcode, stores results in the collection postcode_test:

db.addresses.mapReduce(
	function() {emit(this.postcode, 1);},
	function(key, values) {return Array.sum(values);},
	{
		out: "postcode_test"
	}
)

(3) Extract the data into CSV files.

mongoexport -d locate -c local_authority_test --csv  --fields _id,value -o local_authority_test.csv

mongoexport -d locate -c postcode_test --csv  --fields _id,value -o postcode_test.csv

(4) Errors - Instances where address or file is rejected

* Invalid file - File has invalid rows - whole file skipped

* Invalid row - Row invalid in a file for the given type, provokes Invalid file error

* BLPU is inactive - This BLPU is inactive, so skipped

* BLPU has no matching LPI - No LPI available for this UPRN

* BLPU has no matching active LPI - No active LPI available for this UPRN

* BLPU has no classification - BLPU has no classification entry

* No street found - No street description for street

* No active street - Street not active

* No local authority found for address - No LA found for this address either in Locate mapping or in Codepoint

* No street found for address - this Address cannot find a street

* Audit - address failed to pass audit

(5) Updates - Instances where Address Base address is modified as per Locate requirements

* Using DeliveryPoint postcode - Delivery Point and BLPU differ on postcodes - using Delivery Point

* GSSCode and Custodian code mismatch - check code point related GSSCODE against one derived from Custodian code

* Using delivery point for street - street type for this address is of type "street description" use delivery point street instead.
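
The first update above (preferring the DeliveryPoint postcode when it differs from the BLPU postcode) can be sketched as follows. The names are hypothetical, and the normalisation to lower case without spaces is inferred from the example documents (for instance "KT1 3AW" stored as "kt13aw") rather than confirmed from the importer's source.

```scala
object Postcodes {
  /** Lower-cases and strips whitespace, matching the stored form seen in the example documents. */
  def normalise(postcode: String): String =
    postcode.replaceAll("\\s+", "").toLowerCase

  /** Prefers the DeliveryPoint postcode when it is present and differs from the BLPU one.
    * The Boolean reports whether the DeliveryPoint value replaced the BLPU value. */
  def resolve(blpuPostcode: String, deliveryPointPostcode: Option[String]): (String, Boolean) =
    deliveryPointPostcode.map(normalise) match {
      case Some(dp) if dp != normalise(blpuPostcode) => (dp, true)
      case _ => (normalise(blpuPostcode), false)
    }
}
```

The returned flag gives the caller a hook for reporting the "Using DeliveryPoint postcode" update.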
