

address-standardizer's Issues

Tokenizer: Improving word splitting process

This is primarily an issue for German language addresses. In German, "CHAUSSEE" is a street type and it might be joined to the street name. It is common to abbreviate it to "CH". So we want to split words like "nameCHAUSSEE" or "nameCH" into two tokens, one for the name and one for the type. The problem comes from the short abbreviation. For example, "BACH" gets split into "BA", "CH". The current logic pre-splits all the attached words before the string gets passed to the parser to create tokens. We could change the logic to first check whether a word is in the lexicon and, if it is, not split it.

Also, while this is mostly needed for separating the TYPE from the street name in German, at the time of parsing the address string into tokens we are not able to tell which component of the address any given token belongs to; that only becomes clear much later in the process. Another downside is that some city names have prefixes and/or suffixes that also match some of these splitting rules, which makes it harder to identify them appropriately.

Assuming we wanted to take this approach, the logic would be something like this:

while (the string has chars) {
   if (the head of the string matches a lexicon entry) {
     make it a token and push it on the stream
   }
   else {
      if (we need to split it) {
         split it
         push part1, emdash, part2 on the token stream
      }
      else {
         push the word on the token stream
      }
   }
   push punctuation on the token stream
}

But here is the problem: every word that might get split but should not be split needs to be added to the lexicon. For example, every German word ending in "ch", and every word ending in any of the other suffix attachments, would need to be added if they are also common words in addresses. This could make the lexicon HUGE and I'm not sure it is worthwhile. For example, if BACH always gets split into BA - CH, this should not impact the ability to geocode accurately.

The current workaround is that the code generates one set of tokens that are split and a second set that are not split, standardizes both, and picks the best scoring match. Also, making sure that such words are classified as both TYPE and WORD in the lexicon can solve some of the CITY word splitting issues.
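
A minimal sketch of the lexicon-first check described above; ToyLexicon, splitWord() and the suffix list are illustrative placeholders, not the real Tokenizer API:

#include <set>
#include <string>
#include <vector>

struct ToyLexicon {
    std::set<std::string> words;
    bool has(const std::string &w) const { return words.count(w) > 0; }
};

// Split "nameCHAUSSEE" / "nameCH" style attachments, but only when the whole
// word is NOT already a lexicon entry, so "BACH" stays "BACH".
std::vector<std::string> splitWord(const ToyLexicon &lex, const std::string &word,
                                   const std::vector<std::string> &suffixes) {
    if (lex.has(word))
        return { word };                      // known word: never split it
    for (const auto &suf : suffixes) {
        if (word.size() > suf.size() &&
            word.compare(word.size() - suf.size(), suf.size(), suf) == 0) {
            // push part1, <emdash marker>, part2 on the token stream
            return { word.substr(0, word.size() - suf.size()), "<emdash>", suf };
        }
    }
    return { word };
}

For example, with "BACH" in the lexicon and suffixes {"CHAUSSEE", "CH"}, this keeps "BACH" whole but splits "MUSTERCH" into "MUSTER", "<emdash>", "CH".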

Add Query level caching of standardizer in Postgresql

Currently the standardizer needs to create and load the grammar and lexicon for every address. We need to initialize and cache the standardizer and then reuse it for each additional address, for example when standardizing a whole table.
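
A rough sketch of the caching idea, assuming a Standardizer class whose constructor does the expensive lexicon and grammar loading; the real PostgreSQL wrapper would tie the cache lifetime to the query or function call context rather than a process-global map:

#include <map>
#include <memory>
#include <string>

// Stub standing in for the real standardizer: constructing it is what loads
// and compiles the lexicon and grammar, which is the expensive step we want
// to do only once per query/table run.
struct Standardizer {
    Standardizer(const std::string & /*lexFile*/, const std::string & /*grammarFile*/) {
        // load + compile lexicon and grammar here
    }
};

std::shared_ptr<Standardizer> getStandardizer(const std::string &lexFile,
                                              const std::string &grammarFile) {
    static std::map<std::string, std::shared_ptr<Standardizer>> cache;
    const std::string key = lexFile + "|" + grammarFile;
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;                    // reuse: skip re-loading
    auto s = std::make_shared<Standardizer>(lexFile, grammarFile);
    cache.emplace(key, s);
    return s;
}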

Look into breaking the lexicon regex into multiple smaller regexes

Currently, Lexicon::regex() returns one HUGE regex string. This might be too large for a large lexicon. There are two potential ways that this might be improved:

  1. implement something equivalent to Perl's Regexp::Optimizer
  2. change the call to Lexicon::regex(char) and create multiple regex patterns based on the leading character

The Tokenizer would need changes for item 2, but my thought is that smaller regexes will be more memory efficient and will be evaluated faster.
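
A rough sketch of option 2, bucketing entries by their leading character (entry escaping omitted; the names are illustrative, not the real Lexicon API):

#include <map>
#include <string>
#include <vector>

// Build one (smaller) alternation per leading character instead of a single
// huge pattern covering the whole lexicon.
std::map<char, std::string> buildRegexBuckets(const std::vector<std::string> &entries) {
    std::map<char, std::vector<std::string>> buckets;
    for (const auto &e : entries)
        if (!e.empty())
            buckets[e[0]].push_back(e);

    std::map<char, std::string> regexes;
    for (const auto &bucket : buckets) {
        std::string pattern;
        for (const auto &e : bucket.second) {
            if (!pattern.empty()) pattern += "|";
            pattern += e;
        }
        regexes[bucket.first] = "(?:" + pattern + ")";
    }
    return regexes;
}

Lexicon::regex(char) would then return the bucket's pattern, and the Tokenizer would pick it based on the first character of the remaining input.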

Infinite recursion in grammar check and search

It is possible to construct grammars that cause infinite recursion. For example:

[a]
@a
@c

You can also do it via indirect recursion using multiple rules like:

[a]
@b @c

[b]
@a @c

Both Grammar::check and the Search::search family of methods need to be modified to allow and/or detect this, and perhaps limit the level of recursion to some reasonable number or throw an error that can be caught. Currently they consume all of the stack and cause a SEGV.

The reason we might want to have such rules is to allow for the automatic consumption of multiple tokens for a given section like CITY.
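
One possible shape for the detection, sketched with placeholder names (MetaRule, children) rather than the real Grammar classes:

#include <set>
#include <stdexcept>
#include <string>
#include <vector>

struct MetaRule {
    std::string name;
    std::vector<const MetaRule *> children;   // referenced @rules
};

// Keep the set of meta rules on the current expansion path plus a hard depth
// limit; direct or indirect recursion is reported instead of blowing the stack.
void checkRule(const MetaRule *rule, std::set<std::string> &onPath, int depth) {
    if (depth > 100)
        throw std::runtime_error("grammar nesting too deep at: " + rule->name);
    if (!onPath.insert(rule->name).second)
        throw std::runtime_error("recursive grammar rule: " + rule->name);
    for (const MetaRule *child : rule->children)
        checkRule(child, onPath, depth + 1);
    onPath.erase(rule->name);                 // backtrack: rule may appear on other paths
}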

Redesign Grammar and Search to improve performance

The current design is clean and simple and relies on storing the data in a map. The downside of this is that we are constantly running find(rule) on the map. Since the grammar is really a tree of nodes, it might be faster to create nodes with pointers to children and just follow the pointers.

TODO:

  • verify that the bulk of the time is spent doing find()
  • redesign the grammar to use pointers rather than the current map (see the sketch below)
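
A hedged sketch of what the pointer-based layout could look like; GrammarNode and link() are illustrative names, not the existing API:

#include <map>
#include <string>
#include <vector>

struct GrammarNode {
    std::string name;
    // each alternative is an ordered sequence of sub-rule nodes
    std::vector<std::vector<GrammarNode *>> alternatives;
};

// One-time resolution pass after parsing: turn rule names into pointers so the
// search can follow children directly instead of calling find() on a map for
// every step.
void link(std::map<std::string, GrammarNode> &nodes,
          const std::map<std::string, std::vector<std::vector<std::string>>> &parsedRules) {
    for (const auto &rule : parsedRules) {
        GrammarNode &node = nodes[rule.first];
        node.name = rule.first;
        for (const auto &alternative : rule.second) {
            std::vector<GrammarNode *> sequence;
            for (const auto &ref : alternative)
                sequence.push_back(&nodes[ref]);   // lookup happens once, here
            node.alternatives.push_back(sequence);
        }
    }
}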

Collected some stats on how many calls to find():

$ time ../src/tester/t2 lex-greatbritain.txt grammar/greatbritain.grammar '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Address: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Normalized: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
UpperCase: '1A SEASTONE COTTAGES WEYBOURNE HOLT NR25 7HG UK'
Timer: read lexicon: 2 ms
Timer: load lexicon: 149 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 164 ms
-------------------------------
TOKEN:  1               NUMBER  BADTOKEN                1
TOKEN:  A               WORD,SINGLE     BADTOKEN                0
TOKEN:  SEASTONE                WORD    BADTOKEN                0
TOKEN:  COTTAGES                WORD,TYPE       BADTOKEN                1
TOKEN:  WEYBOURNE               WORD    BADTOKEN                0
TOKEN:  HOLT            WORD    BADTOKEN                0
TOKEN:  NR25            MIXED,PCH       BADTOKEN                0
TOKEN:  7HG             MIXED,PCT       BADTOKEN                0
TOKEN:  UK              NATION  BADTOKEN                1
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 16 ms
Timer: Search::searchAndReclassBest: 6816 ms
Stats: findMetas: 198583
Stats: findRules: 149479
bestCost: 0.566667
TOKEN:  1       1       NUMBER  HOUSE           1
TOKEN:  A       A       SINGLE  HOUSE           0
TOKEN:  SEASTONE        SEASTONE        WORD    STREET          0
TOKEN:  COTTAGES        COTTAGES        TYPE    SUFTYP          1
TOKEN:  WEYBOURNE       WEYBOURNE       WORD    CITY            0
TOKEN:  HOLT    HOLT    WORD    CITY            0
TOKEN:  NR25    NR25    MIXED   POSTAL          0
TOKEN:  7HG     7HG     MIXED   POSTAL          0
TOKEN:  UK      UNITED KINGDOM OF GREAT BRITAIN NATION  NATION          1

real    0m7.161s
user    0m7.138s
sys     0m0.020s

The grammar needs to support optional rules

Currently we do not have the ability to support an optional token, and it would be desirable to do this to simplify the grammars. The current workaround is to construct multiple rules. For example, if we wanted an optional COMMA token, then we could construct something like in #19:

@street @comma @city
@street @city

but if you have multiple optional tokens you have to generate rules for all possible combinations. We might indicate an optional token with a syntax like @comma? or ?comma; the search class would then need to be updated to handle it during the search (see the sketch below).
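
A small sketch of how rules containing optional elements could be expanded into the concrete rules the current search already handles (names are illustrative):

#include <string>
#include <vector>

struct RuleElement {
    std::string name;
    bool optional;
};

// Each optional element doubles the number of generated rules, which is
// exactly the combinatorial expansion we currently write out by hand.
std::vector<std::vector<std::string>> expand(const std::vector<RuleElement> &rule) {
    std::vector<std::vector<std::string>> out{{}};
    for (const auto &el : rule) {
        if (el.optional) {
            std::vector<std::vector<std::string>> withEl = out;
            for (auto &r : withEl) r.push_back(el.name);
            out.insert(out.end(), withEl.begin(), withEl.end());  // keep both variants
        } else {
            for (auto &r : out) r.push_back(el.name);
        }
    }
    return out;
}

// expand({{"@street", false}, {"@comma", true}, {"@city", false}}) yields
// {"@street", "@city"} and {"@street", "@comma", "@city"}.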

Make Alternate Token search optional via postgres wrappers and API

Code was added to handle extraction of alternate token streams that then get passed to the search and the best is selected. While this generates better results in some cases it can be expensive. The code was added here:

git diff 6d8dd71 284b251

Because this impacts both queries and the standardization of the reference data set, it is likely that the option can only be set at the database level and not per query: if the reference set was standardized without this option and the query was standardized with it, you would get different results from the standardizer and they would therefore not match.

The other option might be as a build parameter.

Notes for Documentation

  • need to review country data and how house numbers are represented relative to how they are recognized in the grammar
    • NUMBER DASH NUMBER -> HOUSE HOUSE HOUSE
    • NUMBER DASH NUMBER -> HOUSE IGNORE IGNORE
  • creating Unicode lexicon entries for °: there are at least two similar symbols, \xc2\xb0 and \xc2\xba, and there may be other similar sets of characters
  • standardize to full spelling or abbreviations like in state names
  • how to do repeating tokens
  • how to do optional tokens
  • when and why you need to restandardize reference set
  • sample lexicons and grammars in as_config
    • loading the sample data
    • queries using the sample data
    • making changes to the sample data
  • performance considerations
    • keep lexicons as small as possible
    • use compiled lexicons
    • keep grammars as simple as possible
    • avoid extra phrases like building names, place names, and unit identifiers if you don't need them

Parsing 'KM 15,500' needs to be figured out

The first two commands disable some token filtering so you can see how it is getting tokenized.
This might be another case for supporting regexes in the lexicon.

../src/tester/t2 lex-italy.txt grammar/italy.grammar 'VIA ARDEATINA KM 15,500 00134 ROMA RM ' ''
../src/tester/t2 lex-italy.txt grammar/italy.grammar 'VIA ARDEATINA KM 15,500 00134 ROMA RM ' 'SPACE'

../src/tester/t2 lex-italy.txt grammar/italy.grammar 'VIA ARDEATINA KM 15,500 00134 ROMA RM '
Address: 'VIA ARDEATINA KM 15,500 00134 ROMA RM '
Normalized: 'VIA ARDEATINA KM 15,500 00134 ROMA RM '
UpperCase: 'VIA ARDEATINA KM 15,500 00134 ROMA RM '
-------------------------------
TOKEN:  VIA             TYPE    BADTOKEN                1
TOKEN:  ARDEATINA               WORD    BADTOKEN                0
TOKEN:  KM              WORD,DOUBLE     BADTOKEN                0
TOKEN:  15              NUMBER  BADTOKEN                0
TOKEN:  500             NUMBER  BADTOKEN                0
TOKEN:  00134           NUMBER,QUINT    BADTOKEN                0
TOKEN:  ROMA            WORD,PROV       BADTOKEN                1
TOKEN:  RM              PROV    BADTOKEN                1
-------------------------------
#### Failed to find a match (searchAndReclassBest)(-1)

Review Tokenizer::splitToken

Currently, Tokenizer::splitToken will split words like 500W -> 500 W, 2A -> 2 A, N123 -> N 123, I80 -> I 80, but not 11YY. The idea is that it would separate directionals from numbers, or an Interstate designation from its number. The problem is that this makes sense in the USA, but maybe not internationally, and maybe we should leave these tokens as MIXED.

Another option might be to add support for regular expression matching and/or splitting that can be defined in the lexicon, although this will complicate things. Maybe we can add new LexEntry types like:

REGMATCH:\t<pattern>\t<InClass::Type>
REGREPLACE:\t<pattern>\t<replacement>\t<tokens_to_check>
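
A self-contained sketch of how a REGREPLACE-style entry might be applied to a token before classification. It uses std::regex just to keep the example standalone (the project itself uses Boost.Regex), and the entry shown is hypothetical:

#include <iostream>
#include <regex>
#include <string>

struct RegexEntry {
    std::string pattern;      // <pattern>
    std::string replacement;  // <replacement>
};

std::string applyRegexEntry(const RegexEntry &entry, const std::string &token) {
    return std::regex_replace(token, std::regex(entry.pattern), entry.replacement);
}

int main() {
    RegexEntry e{ R"((\d+)([A-Z]))", "$1 $2" };       // hypothetical REGREPLACE entry
    std::cout << applyRegexEntry(e, "500W") << "\n";  // prints "500 W"
}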

Review how PUNCT chars are classified

A couple of issues:

  • L' is in the lexicon as a STOPWORD but it is not getting recognized; this might be related to the 'º' issue #32
  • '-' is not getting classified as a DASH token
time ../src/tester/t2 lex-usa.txt grammar/usa.grammar "475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA" 'SPACE'
Address: '475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA'
Normalized: '475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA'
UpperCase: '475 L'ENFANT PLAZA SW RM 370 IBU WASHINGTON DC 20260–6500  USA'
Timer: read lexicon: 9 ms
Timer: load lexicon: 381 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 174 ms
-------------------------------
TOKEN:  475             NUMBER  BADTOKEN                0
TOKEN:  L               WORD,SINGLE     BADTOKEN                0
TOKEN:  '               PUNCT   BADTOKEN                0
TOKEN:  ENFANT          WORD    BADTOKEN                0
TOKEN:  PLAZA           TYPE,BUILDT     BADTOKEN                1
TOKEN:  SW              DIRECT  BADTOKEN                1
TOKEN:  RM              TYPE,ROAD,UNITH,BADTOKEN        BADTOKEN               1
TOKEN:  370             NUMBER  BADTOKEN                0
TOKEN:  IBU             WORD    BADTOKEN                0
TOKEN:  WASHINGTON              WORD,CITY,PROV  BADTOKEN                1
TOKEN:  DC              WORD,ROAD,PROV  BADTOKEN                1
TOKEN:  20260           NUMBER,QUINT    BADTOKEN                0
TOKEN:  –               PUNCT   BADTOKEN                0
TOKEN:  6500            NUMBER,QUAD     BADTOKEN                0
TOKEN:  USA             NATION  BADTOKEN                1
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 17 ms
Timer: Search::searchAndReclassBest: 1139 ms
#### Failed to find a match (searchAndReclassBest)(-1)

real    0m1.738s
user    0m1.732s
sys     0m0.004s

Performance Evaluation and Improvements

I've added timing stats to src/tester/t2.cpp and we need to evaluate the performance bottlenecks and see what can be done to speed things up.

  • the example below takes almost 2 seconds to search the grammar. Some other grammars take longer.
  • data/lex-usa.txt has about 3000 entries and it takes about 700 ms to load it and 700 ms to tokenize an address.
$ time ../src/tester/t2 lex-usa.txt grammar/usa.grammar "475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA"
Address: '475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA'
Normalized: '475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA'
UpperCase: '475 L'ENFANT PLAZA SW RM 370 IBU WASHINGTON DC 20260–6500  USA'
Timer: read lexicon: 9 ms
Timer: load lexicon: 704 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 702 ms
-------------------------------
TOKEN:  475             NUMBER  BADTOKEN                0
TOKEN:  L               WORD,SINGLE     BADTOKEN                0
TOKEN:  ENFANT          WORD    BADTOKEN                0
TOKEN:  PLAZA           TYPE,BUILDT     BADTOKEN                1
TOKEN:  SW              DIRECT  BADTOKEN                1
TOKEN:  RM              TYPE,ROAD,UNITH,BADTOKEN        BADTOKEN               1
TOKEN:  370             NUMBER  BADTOKEN                0
TOKEN:  IBU             WORD    BADTOKEN                0
TOKEN:  WASHINGTON              WORD,CITY,PROV  BADTOKEN                1
TOKEN:  DC              WORD,ROAD,PROV  BADTOKEN                1
TOKEN:  20260           NUMBER,QUINT    BADTOKEN                0
TOKEN:  6500            NUMBER,QUAD     BADTOKEN                0
TOKEN:  USA             NATION  BADTOKEN                1
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 22 ms
Timer: Search::searchAndReclassBest: 1967 ms
bestCost: 0.622222
TOKEN:  475     475     NUMBER  HOUSE           0
TOKEN:  L       L       SINGLE  HOUSE           0
TOKEN:  ENFANT  ENFANT  WORD    STREET          0
TOKEN:  PLAZA   PLAZA   TYPE    SUFTYP          1
TOKEN:  SW      SOUTHWEST       DIRECT  SUFDIR          1
TOKEN:  RM      ROOM    UNITH   UNITH           1
TOKEN:  370     370     NUMBER  UNITT           0
TOKEN:  IBU     IBU     WORD    UNITT           0
TOKEN:  WASHINGTON      WASHINGTON      CITY    CITY            1
TOKEN:  DC      DISTRICT OF COLUMBIA    PROV    PROV            1
TOKEN:  20260   20260   QUINT   POSTAL          0
TOKEN:  6500    6500    QUAD    POSTAL          0
TOKEN:  USA     USA     NATION  NATION          1

real    0m3.420s
user    0m3.401s
sys     0m0.016s

Tokenizer treats space as punct token

Currently the Tokenizer classifies a space as punctuation. We might want to separate space from punctuation so they can be filtered separately and/or handled separately in the grammar.

Allow regex in Lexicon Entries

It would be handy to allow regular expressions in lexicon entries to identify things like number formats, postal code formats, etc.

Search class needs to be extended

We need an interface like: Search::search(const std::vector<Token> &tokens)
where the behavior is as follows (a sketch follows the list):

  1. enumerate patterns from tokens
  2. search the patterns for the best
  3. standardize the tokens based on the best match
  4. return the standardized tokens
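
A minimal sketch of that interface, with Token, Pattern and Grammar reduced to stubs:

#include <vector>

struct Token {};
struct Pattern {};
struct Grammar {};

class Search {
public:
    explicit Search(const Grammar &grammar) : grammar_(grammar) {}

    std::vector<Token> search(const std::vector<Token> &tokens) const {
        // 1. enumerate candidate patterns from the token classifications
        std::vector<Pattern> patterns = enumeratePatterns(tokens);
        // 2. search the grammar for the best scoring pattern
        const Pattern &best = searchBest(patterns);
        // 3. + 4. standardize the tokens against the best match and return them
        return standardize(tokens, best);
    }

private:
    std::vector<Pattern> enumeratePatterns(const std::vector<Token> &tokens) const;
    const Pattern &searchBest(const std::vector<Pattern> &patterns) const;
    std::vector<Token> standardize(const std::vector<Token> &tokens,
                                   const Pattern &best) const;

    const Grammar &grammar_;
};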

Optimize Regex patterns from Lexicon

Currently, we take a list of all the lexicon entries and concatenate them together, separated with a '|' char, to build a huge regex that is used in the Tokenizer. This works but it could be much faster. Perl has packages Regexp::List and Regexp::Optimizer that take a list of words and build an optimized regex that matches them. This could be coded in C++ and used.

When looking at callgrind results we spend a significant amount of the processing time in the Boost.Regex code, so my thought is that this could potentially speed things up, especially since the optimized regexes can be cached at the query level.
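
A small first step in that direction, sketched here: escape each entry and emit longer entries first, so that multi-word phrases are preferred over their prefixes under first-match alternation. A real Regexp::Optimizer equivalent would additionally factor out common prefixes into a trie-shaped pattern.

#include <algorithm>
#include <string>
#include <vector>

std::string escapeRegex(const std::string &s) {
    static const std::string meta = R"(\^$.|?*+()[]{})";
    std::string out;
    for (char c : s) {
        if (meta.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

std::string buildAlternation(std::vector<std::string> entries) {
    std::sort(entries.begin(), entries.end(),
              [](const std::string &a, const std::string &b) {
                  return a.size() > b.size();        // longest entries first
              });
    std::string pattern;
    for (const auto &e : entries) {
        if (!pattern.empty()) pattern += "|";
        pattern += escapeRegex(e);
    }
    return "(?:" + pattern + ")";
}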

operator <<

Instead of this:

void Lexicon::dump() const {
    std::stringstream ss;
    ss << name_ << "\t"
       << langAsString() << "\t"
       << locale_;

    // later we might want to output to a logger instead of cout
    std::cout << "Lexicon: " << ss.str() << "\n";

    // TODO dump the lexicon entries
    std::cout << "Lexicon Entries (TODO)" << "\n";
}

overload like this:

friend std::ostream &operator<<( std::ostream &ss, const Lexicon &lex ) {
    ss << "Lexicon: "
       << lex.name_ << "\t"
       << lex.langAsString() << "\t"
       << lex.locale_ << "\n"
       << "Lexicon Entries (TODO)" << "\n";
    // TODO dump the lexicon entries
    return ss;
}

usage:

   Lexicon mylex1;
   Lexicon mylex2;

   // old usage:
   std::cerr << "The two lexicons:\n";
   // old usage dumps into std::cout !!!!!
   mylex1.dump();
   mylex2.dump();

   // new usage: it can "dump" wherever I need
   std::cerr << "The two lexicons:\n" << mylex1 << mylex2 << "\n";
   std::cout << "The two lexicons:\n" << mylex1 << mylex2 << "\n";

Convert data from lex, gaz and rules to new formats

The new code handles things differently so it might not make sense to convert the old files. This needs to be evaluated. The lex and gaz files should not be a problem; the rules should be able to be greatly simplified, because the grammar files support hierarchical grammar definitions, which would remove a lot of the redundancy in the rules.txt file.

Portugal Grammar might need some work

Here are some sample addresses from http://www.upu.int/fileadmin/documentsFiles/activities/addressingUnit/prtEn.pdf
The issue is how to handle sub-locality level 2 and sub-locality and locality.

Portugal has 18 districts which might be considered PROV but they do not explicitly come into play in the addresses. For Navteq data we can extract all the localities which would get tagged as CITY, and then potentially IGNORE the sub-localities.

The issue is not about recognizing them but how to categorize them when we pass them to the geocoder. We currently only have CITY, PROV, and NATION. We could stuff the CITY (aka locality) into PROV and then put the sub-locality words into CITY.

Getting more addresses for Portugal and comparing them with real street data will probably make more sense of this.

## Home delivery (large towns):
MANUEL GASPAR                                    addressee
LG DR ANTÓNIO VIANA 1 2 DTO             street + premises, floor, side
1250–096 LISBOA                                    postcode + locality
PORTUGAL                                              country

## with sub-locality:
MARIA SILVA ANDRADE                           addressee
R PRINCIPAL VV ANDRADE                      street + premises
QUINTA DA PROVENÇA                           sub-locality level 2
CASAIS NOVOS                                        sub-locality
2580-347 ALENQUER                               postcode + locality
PORTUGAL                                               country

## Home delivery (rural region):
DR. NUNO FIGUEIREDO                           addressee
R. LEAL DA CAMARA 31 RL ESQ             street + premises, floor, side
ALGUEIRÃO                                              sub-locality
2725–079 MEM MARTINS                         postcode + locality
PORTUGAL                                               country

## PO Box delivery:
PATRICIA MARTINS                                   addressee
APARTADO 42024                                    PO Box
EC –D. LUÍS                                              post office
1201–950 LISBOA                                    postcode + locality
PORTUGAL                                              country

## Delivery to private letter boxes:
ENG. MANUEL SOUSA                              addressee
RUA DAS DESCOBERTAS                        street
CCI 8318                                                   PO Box
PENTEADO                                               sub-locality
2860–571 MOITA                                      postcode + locality
PORTUGAL                                               country

The current grammar handles these by putting them into the extra field:

[macro]
@locality @postal @citywords @region @country
@locality @postal @citywords @country
@locality @postal @citywords @region
@locality @postal @citywords
@postal @citywords @region @country
@postal @citywords @country
@postal @citywords @region
@postal @citywords

[locality]
@word @locality
@word

[word]
WORD -> EXTRA -> 0.3

[citywords]
@city @citywords
@city

These take too long to standardize

$ time ../src/tester/t2 lex-usa.txt grammar/usa.grammar "475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA"
Address: '475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA'
Normalized: '475 L'Enfant Plaza SW Rm 370 IBU WASHINGTON DC 20260–6500  USA'
UpperCase: '475 L'ENFANT PLAZA SW RM 370 IBU WASHINGTON DC 20260–6500  USA'
Timer: read lexicon: 9 ms
Timer: load lexicon: 725 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 694 ms
-------------------------------
TOKEN:  475             NUMBER  BADTOKEN                0
TOKEN:  L               WORD,SINGLE     BADTOKEN                0
TOKEN:  ENFANT          WORD    BADTOKEN                0
TOKEN:  PLAZA           TYPE,BUILDT     BADTOKEN                1
TOKEN:  SW              DIRECT  BADTOKEN                1
TOKEN:  RM              TYPE,ROAD,UNITH,BADTOKEN        BADTOKEN               1
TOKEN:  370             NUMBER  BADTOKEN                0
TOKEN:  IBU             WORD    BADTOKEN                0
TOKEN:  WASHINGTON              WORD,CITY,PROV  BADTOKEN                1
TOKEN:  DC              WORD,ROAD,PROV  BADTOKEN                1
TOKEN:  20260           NUMBER,QUINT    BADTOKEN                0
TOKEN:  6500            NUMBER,QUAD     BADTOKEN                0
TOKEN:  USA             NATION  BADTOKEN                1
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 23 ms
OK loading grammar:
Timer: Search::searchAndReclassBest: 2362 ms
Stats: findMetas: 19008
Stats: findRules: 82272
bestCost: 0.622222
TOKEN:  475     475     NUMBER  HOUSE           0
TOKEN:  L       L       SINGLE  HOUSE           0
TOKEN:  ENFANT  ENFANT  WORD    STREET          0
TOKEN:  PLAZA   PLAZA   TYPE    SUFTYP          1
TOKEN:  SW      SOUTHWEST       DIRECT  SUFDIR          1
TOKEN:  RM      ROOM    UNITH   UNITH           1
TOKEN:  370     370     NUMBER  UNITT           0
TOKEN:  IBU     IBU     WORD    UNITT           0
TOKEN:  WASHINGTON      WASHINGTON      CITY    CITY            1
TOKEN:  DC      DISTRICT OF COLUMBIA    PROV    PROV            1
TOKEN:  20260   20260   QUINT   POSTAL          0
TOKEN:  6500    6500    QUAD    POSTAL          0
TOKEN:  USA     USA     NATION  NATION          1

real    0m3.828s
user    0m3.805s
sys     0m0.020s

Removing some clauses speeds this up. We need to look at what in the grammar can be simplified and how we might speed up the Lexicon and Tokenizer.

$ time ../src/tester/t2 lex-usa.txt grammar/usa.grammar "475 L'Enfant Plaza SW WASHINGTON DC 20260"
Address: '475 L'Enfant Plaza SW WASHINGTON DC 20260'
Normalized: '475 L'Enfant Plaza SW WASHINGTON DC 20260'
UpperCase: '475 L'ENFANT PLAZA SW WASHINGTON DC 20260'
Timer: read lexicon: 9 ms
Timer: load lexicon: 722 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 694 ms
-------------------------------
TOKEN:  475             NUMBER  BADTOKEN                0
TOKEN:  L               WORD,SINGLE     BADTOKEN                0
TOKEN:  ENFANT          WORD    BADTOKEN                0
TOKEN:  PLAZA           TYPE,BUILDT     BADTOKEN                1
TOKEN:  SW              DIRECT  BADTOKEN                1
TOKEN:  WASHINGTON              WORD,CITY,PROV  BADTOKEN                1
TOKEN:  DC              WORD,ROAD,PROV  BADTOKEN                1
TOKEN:  20260           NUMBER,QUINT    BADTOKEN                0
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 22 ms
OK loading grammar:
Timer: Search::searchAndReclassBest: 284 ms
Stats: findMetas: 2304
Stats: findRules: 10064
bestCost: 0.571429
TOKEN:  475     475     NUMBER  HOUSE           0
TOKEN:  L       L       SINGLE  HOUSE           0
TOKEN:  ENFANT  ENFANT  WORD    STREET          0
TOKEN:  PLAZA   PLAZA   TYPE    SUFTYP          1
TOKEN:  SW      SOUTHWEST       DIRECT  SUFDIR          1
TOKEN:  WASHINGTON      WASHINGTON      CITY    CITY            1
TOKEN:  DC      DISTRICT OF COLUMBIA    PROV    PROV            1
TOKEN:  20260   20260   QUINT   POSTAL          0

real    0m1.746s
user    0m1.736s
sys     0m0.008s

The grammar needs to support rules with no output tokens

In conjunction with #19 and #20, we need an OutClass::Type that basically throws away the token. In the case of STOPWORD in the InClass we can filter them out so the grammar never sees them, but in the case of a COMMA that we want to pass forward, so we can assign the tokens before and after it correctly, we also need a way to discard the token from the final output.

Change search from a single match to look for all possible matches

This needs to be evaluated. It may not be worth the effort or provide better results. The idea is based on the fact that a grammar might have multiple matches if you consider how you divide up multiple WORD tokens among different rules.

  • this would be slower than finding the first match
  • do we need end-of-pattern markers to assist with backtracking to look for the next rule?
  • how do we handle backtracking?

Postgresql 11.6: Compiling Error

I am trying to compile the address standardizer for PostgreSQL 11.6. I have strictly followed the instructions: dependencies installed, patch applied for PostgreSQL 11 (Makefile.global) ...
With "sudo make PGVER=11 -f Makefile.pg install" I get the following error message:

error: can't create module summary index for buffer: Expected a single module
LLVM ERROR: ThinLink didn't create an index
/usr/lib/postgresql/11/lib/pgxs/src/makefiles/pgxs.mk:229: recipe for target 'install' failed
make: *** [install] Error 1

If I compile the Address Standardizer on the same system under PostgreSQL 10, the compilation runs without errors.

My system: Ubuntu 18.04, PostgreSQL 11.6, PostGIS 2.5.

Got any advice for me?

Handle joining words split by emdash token

Words get split because of attached types and their abbreviations (in German), and sometimes words get split that should not be, like "BACH" getting split into "BA" <emdash> "CH" because "CH" is an abbreviation for "CHAUSSEE" that might be suffix-attached.

So we need to generate two sequences of tokens, one split and one not, so we can match both and select whichever is best (see the sketch below).
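
A sketch of the "match both and keep the better one" step; search() and its scoring convention are placeholders (a higher score is assumed to be better here):

#include <vector>

struct Token {};
struct Result {
    double score = 0.0;
    std::vector<Token> tokens;
};

// Assumed to exist: runs the grammar search on one token stream and returns
// the standardized tokens plus a score.
Result search(const std::vector<Token> &tokens);

Result bestOf(const std::vector<Token> &splitStream,
              const std::vector<Token> &unsplitStream) {
    Result withSplit = search(splitStream);
    Result withoutSplit = search(unsplitStream);
    return (withSplit.score >= withoutSplit.score) ? withSplit : withoutSplit;
}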

Notes on branch grammartrie

The grammartrie branch is an attempt to speed up searching patterns in a grammar. The idea is to create a trie index of the grammar so that searching it is O(1). The problem with the current implementation is that it needs LOTS of memory to build the trie for a complex grammar, and it is expensive to build the trie because you have to generate a list of all possible patterns represented in the grammar and load them into the trie. I wrote serialize and deSerialize methods with the idea that we could compile the grammar into a trie, serialize it, and then deSerialize it when we want to use it.

Here are some stats from my experiments. The number for each country is the number of unique patterns generated for that country's grammar, followed by the sizes of the grammar and the serialized trie file for each country.

[grammartrie] ~/work/address-standardizer/src$ for x in ../data/sample/*.tri ; do echo `basename $x` "  " `grep '\[' $x | wc -l`; done
andorra.tri    5058
australia.tri    280020
austria.tri    15462
belgium.tri    6102
canada.tri    17717076
denmark.tri    5070
finland.tri    10460
france.tri    52200
germany.tri    68184
greatbritain.tri    4143920
ireland.tri    13561920
italy.tri    30800
liechtenstein.tri    5160
luxembourg.tri    38940
monaco.tri    52200
netherlands.tri    393480
newzealand.tri    4170684
norway.tri    46440
portugal.tri    334880
san_marino.tri    1940
spain.tri    996760
sweden.tri    2940
switzerland.tri    52080
usa.tri    56679552
vatican.tri    1224

[grammartrie] ~/work/address-standardizer/data$ ls -lh sample/*.{gmr,tri}
-rw-r--r-- 1 woodbri woodbri 1.7K Apr 10 16:26 sample/andorra.gmr
-rw-r--r-- 1 woodbri woodbri 356K Apr 10 16:38 sample/andorra.tri
-rw-r--r-- 1 woodbri woodbri 2.2K Apr 10 16:27 sample/australia.gmr
-rw-r--r-- 1 woodbri woodbri  26M Apr 10 16:38 sample/australia.tri
-rw-r--r-- 1 woodbri woodbri 1.9K Apr 10 16:27 sample/austria.gmr
-rw-r--r-- 1 woodbri woodbri 1.2M Apr 10 16:39 sample/austria.tri
-rw-r--r-- 1 woodbri woodbri 1.7K Apr 10 16:28 sample/belgium.gmr
-rw-r--r-- 1 woodbri woodbri 465K Apr 10 16:39 sample/belgium.tri
-rw-r--r-- 1 woodbri woodbri 3.5K Apr 10 16:24 sample/canada.gmr
-rw-r--r-- 1 woodbri woodbri 1.9G Apr 10 16:57 sample/canada.tri
-rw-r--r-- 1 woodbri woodbri 1.4K Apr  4 09:13 sample/denmark.gmr
-rw-r--r-- 1 woodbri woodbri 380K Apr 10 16:57 sample/denmark.tri
-rw-r--r-- 1 woodbri woodbri 2.1K Apr 10 16:28 sample/finland.gmr
-rw-r--r-- 1 woodbri woodbri 743K Apr 10 16:57 sample/finland.tri
-rw-r--r-- 1 woodbri woodbri 1.8K Apr 10 16:29 sample/france.gmr
-rw-r--r-- 1 woodbri woodbri 4.5M Apr 10 16:57 sample/france.tri
-rw-r--r-- 1 woodbri woodbri 2.2K Apr 10 16:29 sample/germany.gmr
-rw-r--r-- 1 woodbri woodbri 5.3M Apr 10 16:57 sample/germany.tri
-rw-r--r-- 1 woodbri woodbri 2.8K Apr  9 11:54 sample/greatbritain.gmr
-rw-r--r-- 1 woodbri woodbri 379M Apr 10 17:02 sample/greatbritain.tri
-rw-r--r-- 1 woodbri woodbri 3.1K Apr 10 16:30 sample/ireland.gmr
-rw-r--r-- 1 woodbri woodbri 1.3G Apr 10 17:17 sample/ireland.tri
-rw-r--r-- 1 woodbri woodbri 2.1K Apr 10 16:30 sample/italy.gmr
-rw-r--r-- 1 woodbri woodbri 2.5M Apr 10 17:17 sample/italy.tri
-rw-r--r-- 1 woodbri woodbri 1.7K Apr 10 16:30 sample/liechtenstein.gmr
-rw-r--r-- 1 woodbri woodbri 377K Apr 10 17:17 sample/liechtenstein.tri
-rw-r--r-- 1 woodbri woodbri 2.2K Apr 10 16:30 sample/luxembourg.gmr
-rw-r--r-- 1 woodbri woodbri 2.9M Apr 10 17:17 sample/luxembourg.tri
-rw-r--r-- 1 woodbri woodbri 1.8K Apr 10 16:31 sample/monaco.gmr
-rw-r--r-- 1 woodbri woodbri 4.5M Apr 10 17:17 sample/monaco.tri
-rw-r--r-- 1 woodbri woodbri 2.7K Apr 10 16:31 sample/netherlands.gmr
-rw-r--r-- 1 woodbri woodbri  35M Apr 10 17:18 sample/netherlands.tri
-rw-r--r-- 1 woodbri woodbri 2.4K Apr 10 16:31 sample/newzealand.gmr
-rw-r--r-- 1 woodbri woodbri 415M Apr 10 17:22 sample/newzealand.tri
-rw-r--r-- 1 woodbri woodbri 2.3K Apr 10 16:31 sample/norway.gmr
-rw-r--r-- 1 woodbri woodbri 3.7M Apr 10 17:22 sample/norway.tri
-rw-r--r-- 1 woodbri woodbri 2.5K Apr 10 16:32 sample/portugal.gmr
-rw-r--r-- 1 woodbri woodbri  32M Apr 10 17:22 sample/portugal.tri
-rw-r--r-- 1 woodbri woodbri 1.8K Apr 10 16:32 sample/san_marino.gmr
-rw-r--r-- 1 woodbri woodbri 120K Apr 10 17:22 sample/san_marino.tri
-rw-r--r-- 1 woodbri woodbri 2.3K Apr 10 16:32 sample/spain.gmr
-rw-r--r-- 1 woodbri woodbri 114M Apr 10 17:24 sample/spain.tri
-rw-r--r-- 1 woodbri woodbri 1.7K Apr 10 16:33 sample/sweden.gmr
-rw-r--r-- 1 woodbri woodbri 211K Apr 10 17:24 sample/sweden.tri
-rw-r--r-- 1 woodbri woodbri 2.0K Apr 10 16:33 sample/switzerland.gmr
-rw-r--r-- 1 woodbri woodbri 4.2M Apr 10 17:24 sample/switzerland.tri
-rw-r--r-- 1 woodbri woodbri 4.3K Apr 10 16:36 sample/usa.gmr
-rw-r--r-- 1 woodbri woodbri 5.5G Apr 10 18:24 sample/usa.tri
-rw-r--r-- 1 woodbri woodbri 1.2K Apr 10 16:34 sample/vatican.gmr
-rw-r--r-- 1 woodbri woodbri  77K Apr 10 18:25 sample/vatican.tri

Also, to test performance of the largest grammar (usa.gmr and usa.tri), I wrote src/tester/read-grammar-trie.cpp to just deSerialize() the usa.tri file; it required 23.8 GB of memory and 28 minutes to load it, and another 1-2 minutes to deallocate it.

There are a few things still to look at:

  • need to move reclass code from search.cpp to grammar.cpp
  • run callgrind on the current code and see if it can be optimized.
  • look into block memory management to speed up memory allocation and clean up
  • all InClass and OutClass types can be represented in a single byte instead of an int, which would cut that memory consumption by a factor of 4
  • we allocate a Rule and save its pointer in the trie, but 50% of the Rule (i.e., the InClass vector) is the pattern that gets us to the Rule, so we could create a simplified version of the Rule and cut its size in half. For usa.tri there are 56,679,552 Rules. If on average a Rule is 12 ints + 12 ints + 1 float, we need 5.3 GB, but storing 12 ints + 1 float is 2.7 GB, and if we store those as bytes we only need 0.7 GB. This is only for the Rules; we still need to create all the nodes for the trie in the first place.
  • we could also change serialize/deSerialize to write binary files/blobs so we could eliminate reading the *.tri files as text and parsing and converting them.
    I estimate that applying all these changes would potentially reduce the memory consumption of usa.tri by about 4.8 G, and have a significant performance improvement, but we would need to look at improving the memory efficiency of the trie nodes.
  • analyze how many trie nodes are in usa.tri and look for ways to make this more efficient; at a minimum there are 56,679,552 nodes.

Generate Country Specific Grammar Files

  • Andorra
  • Australia
  • Belgium
  • Canada
  • Denmark
  • Finland
  • France
  • Germany
  • Great Britain
  • Ireland
  • Italy
  • Liechtenstein
  • Luxembourg
  • Monaco
  • Netherlands
  • New Zealand
  • Norway
  • Portugal
  • Puerto Rico (see United States)
  • San Marino
  • Spain
  • Sweden
  • Switzerland
  • United States
  • Vatican
  • Others to be determined.

Add InClass::COMMA so grammars can key on this

Currently commas are classified as PUNCT, which usually gets filtered out. It might be helpful to support a COMMA token to help disambiguate things like 'main st north city', where NORTH might belong to the street name or it might belong to the city. If we have 'main st, north city' we currently filter out the generic PUNCT tokens to simplify the grammars, but by keeping the COMMA tokens we can write rules like:

@housenumber @street @comma @city @comma @prov ...
@housenumber @street @city @prov ...

where the first matches if commas exist and the second matches if they are not there.

Tokenizer does not split off ° symbol

In lex-spain.txt we have:

LEXENTRY:   °   °   UNITH   ATT_SUF,DET_SUF

this should split the token '4°' into two tokens '4' and '°'

After some debugging, it looks like a problem with the regex. We generate "\B°\b", which is correct for all other splitting, but the '\B' does not seem to work between '\d' and '°'.

I tried to work around this by changing Tokenizer.cpp:50 to:

boost::u32regex re = boost::make_u32regex( std::string( "\\<(\\d+)([[:alpha:]\\p{L}°])\\>" ) );
boost::u32regex re = boost::make_u32regex( std::string( "^(\\d+)([[:alpha:]\\p{L}°])$" ) );

but this did not work either. More investigation is needed.

Add support for intersections

Currently the standard address structure only has fields for one street name. To be able to recognize and standardize intersections we need to be able to capture and standardize two street names and return them. For example:

Broadway & 81st Street
Hwy 51 & 22nd Ave

Redesign Tokenizer to work with Lexicon

Issue: the Lexicon entries can be multiple words but the current Tokenizer does not take that into account.

Potential solutions:

  1. extract all Lexicon keys and build a regex to compare to the head of the string being Tokenized
  2. cycle thru all keys matching against head of string

Item 2 can be optimized by grouping Lexicon phrases based on the first 1-2 chars of the phrase and only cycling through the relevant group on each token pass.

Create a template trie class and use it to replace std::map usage

It turns out that where we need a std::map<std::string, something*>, we could alternatively create a trie with a pointer to something stored in the terminal node.

Why and where would this make sense? In the lexicon we have to build a trie to create the regex. So rather than store the lexicon as a map AND build a trie only for the purpose of the regex, we could create a trie instead of the map when loading the lexicon and also use it for creating the regex. The trie can also be used for checking whether a word is in the lexicon, using a variant of isWord() that returns the pointer if the word exists or NULL if it does not.

Another potential use could be in storing the grammar and assisting in the grammar search. This needs to be thought out in more detail, because we have Rules and MetaRules mixed in a grammar, but there are a lot of common prefix sequences that currently get searched multiple times, so constructing a tree similar to a trie could eliminate this and speed things up significantly.
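
A minimal sketch of such a template trie, storing the payload pointer in the node that terminates a key:

#include <map>
#include <memory>
#include <string>

template <typename T>
class Trie {
public:
    // Store a pointer to the payload at the node that terminates the key.
    void insert(const std::string &key, T *value) {
        Node *n = &root_;
        for (char c : key) {
            auto &child = n->children[c];
            if (!child)
                child = std::make_unique<Node>();
            n = child.get();
        }
        n->value = value;
    }

    // Variant of isWord(): returns the payload if the key exists, else nullptr.
    T *find(const std::string &key) const {
        const Node *n = &root_;
        for (char c : key) {
            auto it = n->children.find(c);
            if (it == n->children.end())
                return nullptr;
            n = it->second.get();
        }
        return n->value;
    }

private:
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        T *value = nullptr;
    };
    Node root_;
};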

Handle multiple identical adjacent tokens

Need to handle multiple adjacent WORD tokens

  • collapse them within a rule
  • keep them across rules?
  • handle them in the tokenizer?
  • handle them in the Grammar search algorithm
  • consider extending the grammar to support
    • <token>+ one or more tokens
    • <token>* zero or more tokens
    • <token>{n,m} for n to m tokens

Problems with German and splitting tokens

../src/tester/t2 lex-german-merge.txt grammar/german.grammar 'waldweg 33 54321  konstanz baden-württemberg de'
Address: 'waldweg 33 54321  konstanz baden-württemberg de'
Normalized: 'waldweg 33 54321  konstanz baden-württemberg de'
UpperCase: 'WALDWEG 33 54321  KONSTANZ BADEN-WÜRTTEMBERG DE'
-------------------------------
TOKEN:  WALD            WORD    BADTOKEN
TOKEN:  WE              WORD,DOUBLE     BADTOKEN
TOKEN:  G               WORD,TYPE       BADTOKEN
TOKEN:  33              NUMBER  BADTOKEN
TOKEN:  54321           NUMBER,QUINT    BADTOKEN
TOKEN:  KONSTANZ                WORD    BADTOKEN
TOKEN:  BADEN           WORD    BADTOKEN
TOKEN:  W               WORD,TYPE       BADTOKEN
TOKEN:  ÜRTTEM          WORD    BADTOKEN
TOKEN:  BERG            TYPE    BADTOKEN
TOKEN:  DE              NATION  BADTOKEN
-------------------------------
TOKEN:  WALD            WORD    BADTOKEN
TOKEN:  WEG             WORD,TYPE,DOUBLE        BADTOKEN
TOKEN:  33              NUMBER  BADTOKEN
TOKEN:  54321           NUMBER,QUINT    BADTOKEN
TOKEN:  KONSTANZ                WORD    BADTOKEN
TOKEN:  BADEN           WORD    BADTOKEN
TOKEN:  W               WORD,TYPE       BADTOKEN
TOKEN:  ÜRTTEM          WORD    BADTOKEN
TOKEN:  BERG            TYPE    BADTOKEN
TOKEN:  DE              NATION  BADTOKEN
-------------------------------
#### Failed to find a match (searchAndReclassBest)(-1)

Some additional test cases:

 'waldweg 33 54321  konstanz mecklenburg-vorpommern de' 
'waldweg 33 54321  konstanz nordrhein-westphalen de'
'diemSTRASSE 33 54321  berlin test test berlin de'
