
Comments (3)

cmhodgson commented on June 19, 2024

We will not be able to identify street names or localities that are outside the province, so we won't be able to tell whether a province name in the input is really being used as a province name. "blah blah Ontario" is just as likely to be an occupant or site name on Ontario Street as it is a street address in Ontario.

from ols-geocoder.

mraross commented on June 19, 2024

As a first step, we should focus on reliably identifying alien addresses and assigning an address.isAlien fault to them instead of matching them to a false-positive address somewhere in BC. Here's a current example of such a false positive using Geocoder 4.1:

 122 Albert St, Port Melbourne, Victoria australia

matches to this:

 122 Lambert St, Quesnel, BC.

At least with accurate alien detection, a script can filter aliens out of the batch geocoder results file and apply a global geocoder to them.
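The filtering step described above could look roughly like the sketch below. It assumes the batch results file is a CSV with a `faults` column in which the proposed `address.isAlien` fault would appear. The column names and the sample rows are assumptions made for illustration, not the geocoder's confirmed output format.

```python
import csv
import io

ALIEN_FAULT = "address.isAlien"  # fault name proposed in this thread


def split_aliens(results_file, fault_column="faults"):
    """Split batch result rows into (bc_rows, alien_rows).

    fault_column is a hypothetical column name; adjust it to the real file.
    """
    bc_rows, alien_rows = [], []
    for row in csv.DictReader(results_file):
        faults = row.get(fault_column) or ""
        if ALIEN_FAULT.lower() in faults.lower():
            alien_rows.append(row)   # hand these to a global geocoder
        else:
            bc_rows.append(row)
    return bc_rows, alien_rows


# Made-up sample standing in for a batch geocoder results file.
sample = io.StringIO(
    "addressString,faults\n"
    '"122 Albert St, Port Melbourne, Victoria australia",address.isAlien\n'
    '"122 Lambert St, Quesnel, BC",\n'
)
bc_rows, alien_rows = split_aliens(sample)
print([row["addressString"] for row in alien_rows])
```

The rows in `alien_rows` would then be passed to whichever global geocoder is chosen.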

@bstratto Feel free to add a comment describing potential alien detectors you've discovered in your rejected address analysis.


bstratto commented on June 19, 2024

Below are a few patterns for identifying addresses in other countries. This is based on analysis of the 13 million HealthIdeas addresses and reflects the examples available in that dataset.

These patterns capture only a subset of the HealthIdeas addresses in these countries. For many more addresses there is no “safe” pattern (i.e. any pattern would risk also eliminating addresses that include a BC location).

Pattern for addresses in Germany:
The HealthIdeas addresses show that people use the general formats:
• German zip code + “, GERMANY, BC”
• German zip code + German locality + “, GERMANY, BC”

To make this pattern safe, we would have to check that the text in the German locality position is not in fact the name of a BC locality. For example, there could be an address “10319 HOPE, GERMANY, BC”.

The below pattern was tested with HealthIdeas and returns only German alien addresses:
• The first 5 characters are numbers
• The length of the address is <= 30 (longer addressStrings tend to include BC address text)
• The address ends with “, GERMANY, BC”
• The text preceding “, GERMANY, BC” is not the abbreviation for a street type
• The text preceding “, GERMANY, BC” is not a known BC locality

Below are some examples. This pattern identifies 65 addresses in the HealthIdeas dataset:

| addressString | Standardized address | Score |
| --- | --- | --- |
| 55131 MAINZ, GERMANY, BC | German Rd, Flatrock, BC | 55 |
| 60486 FRANKFURT, GERMANY, BC | BC | 1 |
| 61350 BAD HOMBERG, GERMANY, BC | BC | 1 |
| 27612 LOXO ZECHT, GERMANY, BC | Zacht 5 near Kanaka Bar, BC | 52 |
| 28211 BREMEN, GERMANY, BC | 28211 Herman S. Braich Blvd, Mission, BC | 76 |
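The five checks listed for German addresses can be sketched as a single predicate. This is only an illustration: `BC_LOCALITIES` and `STREET_TYPE_ABBREVIATIONS` are tiny hypothetical stand-ins for the geocoder's reference data, and "the text preceding" is read here as everything between the 5-digit code and “, GERMANY, BC”.

```python
import re

# Hypothetical stand-ins for the geocoder's reference data (assumptions).
BC_LOCALITIES = {"HOPE", "SURREY", "VERNON", "QUESNEL"}
STREET_TYPE_ABBREVIATIONS = {"ST", "AVE", "RD", "BLVD", "DR", "HWY"}

GERMANY_SUFFIX = ", GERMANY, BC"


def is_german_alien(address_string: str) -> bool:
    """Return True if address_string passes all five German-alien checks."""
    s = address_string.strip().upper()
    if len(s) > 30:                      # longer strings tend to hold BC address text
        return False
    if not s.endswith(GERMANY_SUFFIX):   # must end with ", GERMANY, BC"
        return False
    if not re.match(r"[0-9]{5}", s):     # first 5 characters are digits
        return False
    # Text between the 5-digit code and the suffix; empty when only a code is given.
    preceding = s[5:-len(GERMANY_SUFFIX)].strip(" ,")
    if preceding in STREET_TYPE_ABBREVIATIONS:
        return False
    if preceding in BC_LOCALITIES:
        return False
    return True


print(is_german_alien("55131 MAINZ, GERMANY, BC"))
print(is_german_alien("10319 HOPE, GERMANY, BC"))
```

With the sample sets above, the first call passes every check and the second is rejected because “HOPE” is treated as a BC locality.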

Pattern for addresses in England:
The HealthIdeas addresses show that people use the general formats:
• “, ENGLAND, BC”
• England locality + “ ENGLAND, BC” or England locality + “, ENGLAND, BC”

To make this pattern safe, we would have to check that the text in the England locality position is not in fact the name of a BC locality. For example, there could be an address “HOPE, ENGLAND, BC”. This text, however, may also include localities that exist in both England and BC, such as “SURREY, ENGLAND, BC”. Geocoder would have to “make a call” regarding these.

The below pattern was tested with HealthIdeas and returns only England alien addresses:
• The first character is not a number
• The length of the address is <= 25 (longer addressStrings tend to include BC address text)
• The address ends with “ENGLAND, BC”
• The text preceding “ENGLAND, BC” does not include a known BC locality

Below are some examples. This pattern identifies 173 addresses in the HealthIdeas dataset:

| addressString | Standardized address | Score |
| --- | --- | --- |
| VISITOR FROM, ENGLAND, BC | Vision Way, Langford, BC | 24 |
| WELSHPOOL, ENGLAND, BC | BC | 1 |
| VISITING, ENGLAND, BC | BC | 1 |
| WEST SUSSEX, ENGLAND, BC | West Boulevard, Vancouver, BC | 69 |
| , KENT ENGLAND, BC | England Rd, Courtenay, BC | 64 |
| , ENGLAND, BC | England Ave, Courtenay, BC | 62 |
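The four checks listed for England addresses could be sketched along the following lines. `BC_LOCALITIES` is a small hypothetical sample, not the real locality list, and "does not include a known BC locality" is interpreted as a whole-word search within the text before “ENGLAND, BC”.

```python
import re

# Hypothetical sample of BC localities (assumption); a real check would use
# the geocoder's full locality list.
BC_LOCALITIES = {"HOPE", "SURREY", "VERNON", "PRINCE GEORGE"}

ENGLAND_SUFFIX = "ENGLAND, BC"


def is_england_alien(address_string: str) -> bool:
    """Return True if address_string passes all four England-alien checks."""
    s = address_string.strip().upper()
    if len(s) > 25 or not s.endswith(ENGLAND_SUFFIX):
        return False
    if s[0].isdigit():                   # first character must not be a number
        return False
    preceding = s[:-len(ENGLAND_SUFFIX)]
    # Reject if any known BC locality appears as a whole word or phrase.
    return not any(
        re.search(rf"\b{re.escape(locality)}\b", preceding)
        for locality in BC_LOCALITIES
    )


print(is_england_alien("WEST SUSSEX, ENGLAND, BC"))
print(is_england_alien("SURREY, ENGLAND, BC"))
```

Under this sketch, “SURREY, ENGLAND, BC” is always rejected, which is one way of making the call mentioned above for localities that exist in both places; the opposite choice would be equally easy to encode.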

Pattern for addresses in the United States:
The HealthIdeas addresses show that people use the general formats:
• text + “, USA, BC”
• text + “, US, BC”
• 6 to 10 numeric digits + text + “ , USA, BC”

To make these patterns safe, we would have to check that the text is not in fact the name of a BC locality. For example, there could be an address “HOPE, USA, BC”. This text, however, may also include localities that have similar names or exist in both US and BC, such as “MT VERNON, USA, BC”. Geocoder would have to “make a call” regarding these.

The below patterns were tested with HealthIdeas and return only United States alien addresses:

Pattern 1: Non-numeric (106 addresses found in HealthIdeas)
• The first 2 characters are not a number
• The length of the address is <= 19 (longer addressStrings tend to include BC address text)
• The address ends with “ US, BC” or “ USA, BC”
• The text preceding “ US, BC” (or “ USA, BC”) is not a known BC locality or “UVIC” or “UBC”

Pattern 2: Numeric (52 addresses found in HealthIdeas)
• The first 6 characters are numbers
• The length of the address is <= 19 (longer addressStrings tend to include BC address text)
• The address ends with “, USA, BC”
• The characters in position 7-10 are one of these: space, comma, “U”, “S”

Below are some examples. Numeric addresses were redacted.

| addressString | Standardized address | Score |
| --- | --- | --- |
| MT VERNON, USA, BC | Mt Atkinson Pl, Vernon, BC | 69 |
| ALTONA PA, USA, BC | Pa-aat 6 near Pitt Island, BC | 54 |
| , US, BC | BC | 1 |
| , USA, BC | BC | 1 |
| ARIZONA, USA, BC | BC | 1 |
| BBBBBB, USA, BC | BC | 1 |
| 9999999, USA, BC | BC | 1 |
| 9999999, USA, BC | BC | 1 |
| 9999999 01, USA, BC | BC | 1 |
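The two United States patterns can be sketched as a pair of predicates. As with any such sketch, the reference data is a hypothetical sample, and two readings are assumptions: "the first 2 characters are not a number" is taken to mean that neither character is a digit, and Pattern 2 is implemented literally as written.

```python
import re

# Hypothetical sample of BC localities (assumption).
BC_LOCALITIES = {"HOPE", "SURREY", "VERNON"}
NOT_ALIEN_TEXT = BC_LOCALITIES | {"UVIC", "UBC"}
MAX_LENGTH = 19


def is_us_alien_non_numeric(address_string: str) -> bool:
    """Pattern 1: non-numeric text followed by ' US, BC' or ' USA, BC'."""
    s = address_string.strip().upper()
    if len(s) > MAX_LENGTH:
        return False
    if any(ch.isdigit() for ch in s[:2]):    # neither of the first 2 chars is a digit
        return False
    for suffix in (" USA, BC", " US, BC"):
        if s.endswith(suffix):
            preceding = s[:-len(suffix)].strip(" ,")
            return preceding not in NOT_ALIEN_TEXT
    return False


def is_us_alien_numeric(address_string: str) -> bool:
    """Pattern 2: six leading digits followed closely by ', USA, BC'."""
    s = address_string.strip().upper()
    if len(s) > MAX_LENGTH or not s.endswith(", USA, BC"):
        return False
    if not re.fullmatch(r"[0-9]{6}", s[:6]):  # first 6 characters are digits
        return False
    # Positions 7-10 (1-based) may only hold space, comma, "U" or "S".
    return all(ch in " ,US" for ch in s[6:10])


print(is_us_alien_non_numeric("MT VERNON, USA, BC"))
print(is_us_alien_numeric("999999, USA, BC"))
```

Read literally, Pattern 2 rejects a string with seven leading digits such as the redacted “9999999, USA, BC”, because position 7 holds a digit. If the intent was to accept 6 to 10 leading digits, the last check would also need to allow digits in positions 7-10.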
