address-parser

This is a minimal implementation of a US address parser built with the spaCy NLP library. The accompanying blog post covers the implementation and execution details at length.

Prerequisites

Folders

Training corpus

A sample corpus of US addresses for training and testing the parser is in the corpus/dataset folder. The JSON-based rules required by the entity ruler are in corpus/rules.
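For illustration, spaCy entity ruler rules are dictionaries with a label and a pattern, where the pattern is either an exact phrase or a list of token attributes. The sketch below builds two such rules and serializes them one JSON object per line. The labels come from the sample output in the Predictions section; the specific patterns and the JSON-lines layout are assumptions and may not match the files under corpus/rules.

```python
import json

# Hypothetical rules; the real files under corpus/rules may differ.
patterns = [
    # Phrase pattern: matches the exact text "USA".
    {"label": "COUNTRY", "pattern": "USA"},
    # Token pattern: a single token made of exactly five digits.
    {"label": "ZIP_CODE", "pattern": [{"TEXT": {"REGEX": r"^\d{5}$"}}]},
]


def to_jsonl(rules):
    """Serialize rules as one JSON object per line."""
    return "\n".join(json.dumps(rule) for rule in rules)


jsonl = to_jsonl(patterns)
print(jsonl)
```

A phrase pattern is enough for a closed vocabulary such as country or state names, while a token pattern suits values defined by shape, such as ZIP codes.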

Config

config contains files for initializing training parameters:

base_config.cfg: Initializes the pipeline and the training batch size.

base_config_er.cfg: Same as base_config.cfg, with additional entity ruler settings.

config.cfg: Pre-filled config file obtained by executing init fill-config.

config_er.cfg: Pre-filled config file with additional entity ruler settings.

Output

output contains the final trained models (with and without entity rules).



Training prerequisites

Before starting the training process, we need to:

i) Obtain a pre-filled training config that has the required training parameters.

ii) Build spacy-docbin (binary serialized representation) files for the training and test datasets.

Pre-filled training config: Run the command below from the command line to generate a pre-filled config file. It takes base_config.cfg as input and writes the pre-filled training config to config.cfg.

python -m spacy init fill-config config\base_config.cfg config\config.cfg

Similarly, pointing this command at base_config_er.cfg produces the entity-ruler based pre-filled config, config_er.cfg.
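The command above uses Windows-style path separators. As a minimal sketch, assuming the config folder layout described earlier, the same two invocations can be built with pathlib so they run unchanged on other platforms. The helper names fill_config_command and fill_all are hypothetical and not part of this repository.

```python
import subprocess
import sys
from pathlib import Path


def fill_config_command(base_cfg, filled_cfg):
    """Build the `spacy init fill-config` command as an argument list."""
    return [
        sys.executable, "-m", "spacy", "init", "fill-config",
        str(Path(base_cfg)), str(Path(filled_cfg)),
    ]


def fill_all(config_dir="config", run=False):
    """Build (and optionally run) the command for both base configs."""
    config_dir = Path(config_dir)
    pairs = [
        ("base_config.cfg", "config.cfg"),
        ("base_config_er.cfg", "config_er.cfg"),
    ]
    commands = [
        fill_config_command(config_dir / src, config_dir / dst)
        for src, dst in pairs
    ]
    if run:
        for cmd in commands:
            # check=True raises if spaCy reports an error in the base config.
            subprocess.run(cmd, check=True)
    return commands


commands = fill_all()  # dry run: only builds the argument lists
```

Calling fill_all(run=True) from the repository root would execute both commands, provided spaCy is installed in the active environment.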

Prepare spacy-docbins: Next, build the docbin files by executing training_data_prep.py.

python training_data_prep.py

This takes the raw CSV training/test datasets as input and writes docbin files to the corpus/spacy-docbins folder.
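To make the conversion step concrete, here is a minimal sketch of how labelled addresses can be turned into a DocBin. It is not the code in training_data_prep.py. It assumes each row pairs an address string with its (text, label) components in left-to-right order, and the helper names find_spans and build_docbin are hypothetical.

```python
def find_spans(address, components):
    """Locate each (text, label) component in the address string.

    Returns (start, end, label) character offsets, searching left to right
    so that repeated substrings resolve in order.
    """
    spans, cursor = [], 0
    for text, label in components:
        start = address.find(text, cursor)
        if start == -1:
            raise ValueError(f"{text!r} not found in {address!r}")
        end = start + len(text)
        spans.append((start, end, label))
        cursor = end
    return spans


def build_docbin(rows):
    """Convert (address, components) rows into a spaCy DocBin."""
    import spacy  # imported here so find_spans stays usable without spaCy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for address, components in rows:
        doc = nlp.make_doc(address)
        ents = []
        for start, end, label in find_spans(address, components):
            span = doc.char_span(start, end, label=label)
            if span is not None:  # None: offsets do not align with tokens
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin


address = "130 W BOSE ST STE 100, PARK RIDGE, IL, 60068, USA"
components = [
    ("130", "BUILDING_NO"), ("W BOSE ST", "STREET_NAME"),
    ("PARK RIDGE", "CITY"), ("IL", "STATE"),
    ("60068", "ZIP_CODE"), ("USA", "COUNTRY"),
]
spans = find_spans(address, components)
```

With spaCy installed, build_docbin([(address, components)]).to_disk(...) would write the serialized file; the target path is left to the caller.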


Training loop execution

To start the training process, run the train command below:

python -m spacy train config\config.cfg --paths.train corpus\spacy-docbins\train.spacy --paths.dev corpus\spacy-docbins\test.spacy --output output\models --training.eval_frequency 10 --training.max_steps 300

This saves the trained NER models under output\models, where spaCy writes the best-scoring and the most recent pipelines as model-best and model-last.
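The same run can also be started from Python. The sketch below assumes a spaCy v3 release that exposes spacy.cli.train.train, and mirrors the command-line overrides as a dictionary keyed by dotted config names. The helper names train_overrides and run_training are hypothetical.

```python
from pathlib import Path


def train_overrides(train_path, dev_path, eval_frequency=10, max_steps=300):
    """Mirror the CLI overrides as a dict keyed by dotted config names."""
    return {
        "paths.train": str(Path(train_path)),
        "paths.dev": str(Path(dev_path)),
        "training.eval_frequency": eval_frequency,
        "training.max_steps": max_steps,
    }


def run_training(config_path="config/config.cfg", output_dir="output/models"):
    """Run training through spaCy's Python entry point."""
    from spacy.cli.train import train  # requires spaCy v3

    docbins = Path("corpus") / "spacy-docbins"
    overrides = train_overrides(docbins / "train.spacy", docbins / "test.spacy")
    train(Path(config_path), Path(output_dir), overrides=overrides)


overrides = train_overrides(
    "corpus/spacy-docbins/train.spacy",
    "corpus/spacy-docbins/test.spacy",
)
```

Keeping the overrides in one function makes it easy to try a different max_steps or eval_frequency without editing the config file.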



Predictions

Predictions for a few sample US addresses can be checked by executing predict.py:

python predict.py

Output:

Address string -> 130 W BOSE ST STE 100, PARK RIDGE, IL, 60068, USA
Parsed address -> [('130', 'BUILDING_NO'), ('W BOSE ST', 'STREET_NAME'), ('PARK RIDGE', 'CITY'), ('IL', 'STATE'), ('60068', 'ZIP_CODE'), ('USA', 'COUNTRY')]
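Output in this shape can be produced by reading the entities off a processed document. The following is a minimal sketch rather than the contents of predict.py. It assumes the trained pipeline is loaded from output/models/model-best, and the function names are hypothetical.

```python
def parse_address(nlp, address):
    """Run the pipeline and return a (text, label) pair for each entity."""
    doc = nlp(address)
    return [(ent.text, ent.label_) for ent in doc.ents]


def load_and_parse(address, model_dir="output/models/model-best"):
    """Load a trained pipeline from disk and parse a single address."""
    import spacy  # imported here so parse_address works with any pipeline

    nlp = spacy.load(model_dir)  # assumed location of the trained model
    return parse_address(nlp, address)
```

Passing the pipeline in as an argument means the model is loaded once and reused across many addresses, instead of being reloaded on every call.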

Contributors

swapnil-saxena
