
Comments (16)

mraross commented on June 19, 2024

One possible heuristic is subtractive geocoding. The idea comes from the observation that, in autocompletion mode, a civic number and just a few characters are enough to find five potential matches of which one is almost always correct. Here's the basic algorithm:

Starting from the beginning of the bad address string, find the first number in the string and call it a civic number.

Join the civic number plus a space character to the next five or six characters and call this addressString.

Perform a geocode using &addressString, &autocomplete=true, and &maxResults=5. If one or more matches have a good score, the bad address is salvageable. Otherwise, give up on the bad address.

for each well-matched address
loop
    Match each address element value set by the autocomplete request to a substring in the
    bad address and subtract (i.e., remove) all remaining unmatched words in the bad address.

    Geocode what remains of the bad address.
    If the match score is perfect, the correct address has been salvaged from the bad address.
end loop
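
The string handling in the steps above can be sketched in Python. This is only an illustration, not the geocoder's implementation: the function names `autocomplete_query` and `subtract` are hypothetical, the geocode calls themselves are left out, and matching is done on whole words, case-insensitively.

```python
import re


def autocomplete_query(bad_address, n_chars=6):
    """Build 'civicNumber + space + next n_chars characters' (hypothetical helper).

    Returns None when the bad address contains no number at all.
    """
    m = re.search(r"\d+", bad_address)
    if m is None:
        return None
    rest = bad_address[m.end():].lstrip()
    return f"{m.group()} {rest[:n_chars]}".rstrip()


def subtract(bad_address, candidate):
    """Keep only the words of bad_address that also occur in candidate."""
    wanted = {w.upper() for w in re.findall(r"\w+", candidate)}
    kept = [w for w in re.findall(r"\w+", bad_address) if w.upper() in wanted]
    return " ".join(kept)


bad = "V 650 PENTICTON ST N COTTAGE HOSPICE, VANCOUVER, BC"
query = autocomplete_query(bad)
remainder = subtract(bad, "650 N Penticton St, Vancouver, BC")
```

Here `query` is what would be sent to the autocomplete request, and `remainder` is what would be geocoded again in the loop.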

from ols-geocoder.

mraross commented on June 19, 2024

V 650 PENTICTON ST N COTTAGE HOSPICE, VANCOUVER, BC
geocodes to the following:

Score: 31
Match Precision: CIVIC_NUMBER
Precision Points: 100
Faults:
UNRECOGNIZED.notAllowed 33
UNIT_DESIGNATOR.missing 0
UNIT_NUMBER.notMatched 1
LOCALITY.notMatched 35
STREET_DIRECTION.notSuffix 0

650 PENTIC autocompletes to the following:

  1. 650 Penticton St, Vancouver, BC
  2. 650 Penticton Ave, Vancouver, BC
  3. 650 N Penticton St, Vancouver, BC

None of the matches contain V, COTTAGE, or HOSPICE.
Match 1 doesn't match N; match 2 doesn't match ST.
Match 3 is best since it matches every word except V, COTTAGE, and HOSPICE, as follows:
650 PENTICTON ST N VANCOUVER, BC
V 650 PENTICTON ST N COTTAGE HOSPICE, VANCOUVER, BC

So match 3 is the salvaged address.
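
One way to pick the best autocomplete match automatically is to score word overlap. The sketch below is an assumption about how that ranking could work, not something the geocoder does today: `overlap_score` and `best_candidate` are hypothetical names, and the score simply counts bad-address words found in the candidate minus candidate words missing from the bad address.

```python
import re


def words(text):
    """Split text into upper-cased word tokens."""
    return [w.upper() for w in re.findall(r"\w+", text)]


def overlap_score(bad_address, candidate):
    """Shared words count for the candidate; candidate-only words count against it."""
    bad = set(words(bad_address))
    cand = set(words(candidate))
    return len(bad & cand) - len(cand - bad)


def best_candidate(bad_address, candidates):
    """Return the candidate with the highest overlap score."""
    return max(candidates, key=lambda c: overlap_score(bad_address, c))


bad = "V 650 PENTICTON ST N COTTAGE HOSPICE, VANCOUVER, BC"
matches = [
    "650 Penticton St, Vancouver, BC",
    "650 Penticton Ave, Vancouver, BC",
    "650 N Penticton St, Vancouver, BC",
]
best = best_candidate(bad, matches)
```

With these three matches the scores are 5, 3, and 6, so the third match wins, in line with the manual reasoning.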

mraross commented on June 19, 2024

2187 SOUTH ALDER ST S, CAMPBELL RIVER, BC
geocodes to the following:

Score: 67
Match Precision: CIVIC_NUMBER
Precision Points: 100
Faults:
UNRECOGNIZED.notAllowed 33

2187 SOUTH autocompletes to:

  1. 2187 South Rd, Gabriola Island, BC
  2. 2187 South Rd, Gabriola Island, BC
  3. 2187 South Wellington Rd, Nanaimo, BC
  4. 2187 South Nechako Pl, Miworth, BC
  5. 2187 South Lake Lane, Whistler, BC

None of these matches allows 2187 SOUTH ALDER ST S, CAMPBELL RIVER, BC to be salvaged, so subtractive geocoding fails in this case.

mraross commented on June 19, 2024

Another heuristic is "treat address prefix as an occupant". Given the following example:

MAYFAIR MALL 3147 DOUGLAS ST, VICTORIA, BC

Working left to right one word at a time, insert an occupant delimiter after each word and geocode the result; if the score is above minScore, return the geocoded address.
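
A minimal sketch of that left-to-right scan follows. The `**` delimiter, the function names, and the `geocode` callable (assumed to return a numeric score for an address string) are all assumptions made for illustration; substitute whatever the geocoder actually uses.

```python
def occupant_splits(address, delimiter="**"):
    """Yield the address with the delimiter inserted after word 1, then word 2, ..."""
    words = address.split()
    for i in range(1, len(words)):
        yield " ".join(words[:i] + [delimiter] + words[i:])


def salvage_occupant(address, geocode, min_score):
    """Return (candidate, score) for the first split scoring above min_score, else None.

    `geocode` is a hypothetical callable: address string -> numeric score.
    """
    for candidate in occupant_splits(address):
        score = geocode(candidate)
        if score > min_score:
            return candidate, score
    return None


candidates = list(occupant_splits("MAYFAIR MALL 3147 DOUGLAS ST, VICTORIA, BC"))
```

For the example address, the second candidate is `MAYFAIR MALL ** 3147 DOUGLAS ST, VICTORIA, BC`, which is the split one would expect to score well.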

mraross commented on June 19, 2024

Another heuristic is to remove a care-of (c/o) name. Here is a c/o example:

c/o Moe Stooge, 1- 1175 Douglas St, Victoria bc

so the parser could treat c/o as an indicator that an occupant name follows.

Needed by all clients with existing lists of addresses that need cleaning, standardizing, and geocoding.
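
As a rough illustration of the c/o idea, the sketch below splits a leading care-of name from the rest of the address. It assumes the name ends at the first comma, which holds for the example above but may not hold in general; `split_care_of` is a hypothetical name.

```python
import re

# Leading "c/o", then the occupant name up to the first comma, then the address.
CARE_OF = re.compile(r"^\s*c/o\s+(?P<occupant>[^,]+),\s*(?P<rest>.+)$", re.IGNORECASE)


def split_care_of(address):
    """Return (occupant, remaining_address); occupant is None when there is no c/o prefix."""
    m = CARE_OF.match(address)
    if m is None:
        return None, address
    return m.group("occupant").strip(), m.group("rest").strip()


occupant, rest = split_care_of("c/o Moe Stooge, 1- 1175 Douglas St, Victoria bc")
```

Here `occupant` is `Moe Stooge` and `rest` is `1- 1175 Douglas St, Victoria bc`, which can then be geocoded on its own.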

cmhodgson commented on June 19, 2024

Note that the c/o case is also handled by #11.

cmhodgson commented on June 19, 2024

A useful option might be to automatically apply the more aggressive and time-consuming salvage heuristics to any input address that doesn't have a match within the tolerance. This would allow single-pass batch geocoding that takes more time as needed to find an acceptable match, rather than requiring users to resubmit failures to a second, more aggressive salvage service. This would be enabled with an additional parameter, e.g., &tryHard=true

mraross commented on June 19, 2024

Agreed as per my comment on Feb 8 above:

"We could support a salvage flag in the addresses resource or create a separate resource called addresses/salvage"

cmhodgson commented on June 19, 2024

The estimate for this depends on how complicated the heuristics we decide to implement are. Some relatively simple examples have been given that may be easily implemented; however, the goal will likely be to handle all "common" failures, as identified by #27, the result of which is not known at this time. One week is a reasonable starting point, but we could potentially spend double that to try to catch more cases.

mraross commented on June 19, 2024

I would like to put more time into this and have the work determined by the outcome of the data pattern analysis task, so maybe two weeks' work?

cmhodgson commented on June 19, 2024

Right, but I am suggesting the salvage operation is conditional on a normal geocode failing, rather than always looking harder for the best result. Depending on the heuristics employed, this might be implied (why keep looking if you've already found something "good enough"?). However, the expectation of a "pure" salvage operation is that the input has already been deemed hard to match, whereas the 1-pass approach doesn't assume anything about the input.

mraross commented on June 19, 2024

Sorry, you're right. I was assuming salvage=true meant all input addresses were bad. As a parameter name, neither salvage nor tryHard captures the conditionality. Maybe the name doesn't have to, as long as we define what it means, as in:

salvage - if True, geocoder will attempt a speedy match; if that fails, it will try progressively longer matching heuristics to come up with a plausible match.
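
The conditional behaviour in that definition could look roughly like the sketch below. Everything here is hypothetical: `geocode` is assumed to be a callable returning a dict with a `score` key, and each heuristic is assumed to be a function that rewrites the address string or returns None when it does not apply.

```python
def geocode_with_salvage(address, geocode, heuristics, min_score, salvage=False):
    """Try a speedy match first; only if it fails and salvage is on, try the heuristics.

    `heuristics` should be ordered from cheapest to most expensive.
    """
    result = geocode(address)
    if result["score"] >= min_score or not salvage:
        return result
    for heuristic in heuristics:
        candidate = heuristic(address)
        if candidate is None:
            continue  # heuristic does not apply to this address
        retry = geocode(candidate)
        if retry["score"] >= min_score:
            return retry
    return result  # nothing better was found; fall back to the first result
```

Good addresses never pay for the slower heuristics, which is what makes a single-pass batch run practical.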

mraross commented on June 19, 2024

Here are some unexpected/noise words and phrases found in rejected addresses:

Phone
No Phone
AT - bad abbreviation for apartment
unknown
canada - following st, not province
rtd mail
return mail
mailing
dnc - followed by one or more words; usually follows street
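
A simple noise filter for some of these phrases might look like the sketch below. It covers only the fixed phrases from the list; the position-dependent cases (AT, canada, dnc) need context and are left out. The function name `strip_noise` is hypothetical, and blindly removing words such as "unknown" could damage a legitimate street name, so treat this as a starting point only.

```python
import re

NOISE_PHRASES = ["no phone", "phone", "unknown", "rtd mail", "return mail", "mailing"]

# Longest phrases first so "no phone" is removed as a unit before "phone" is tried.
_alternatives = [
    r"\s+".join(re.escape(word) for word in phrase.split())
    for phrase in sorted(NOISE_PHRASES, key=len, reverse=True)
]
_NOISE = re.compile(r"\b(?:" + "|".join(_alternatives) + r")\b", re.IGNORECASE)


def strip_noise(address):
    """Remove known noise phrases and tidy up the leftover spacing."""
    cleaned = _NOISE.sub(" ", address)
    cleaned = re.sub(r"\s+,", ",", cleaned)   # no space before a comma
    cleaned = re.sub(r"\s+", " ", cleaned)    # collapse runs of whitespace
    return cleaned.strip()


example = strip_noise("123 Main St No Phone, Victoria, BC")
```

The example address is made up; after filtering it reads `123 Main St, Victoria, BC`.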

mraross commented on June 19, 2024

Here are some unrecognized building names. They may appear at the beginning or following street. Counts are shown in round brackets:

amica (64)
aqua (5)
espana + {1,2} (13)
oscar (9)

name + "apts" (205) - following street or at beginning
apartment building (687) - at beginning or following street
Hotel (9) - as last word in a building name
House (297) - as last word in a building name
Lodge (89) - as last word in a building name
Manor (150) - as last word in a building name
Care Centre or Care Cen (74) - as last words in a building name
sunrise or sunrise of vancouver (62)
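
To show how the "last word in a building name" cues could be used, here is a small sketch that peels a building name off the front of an address. It handles only names at the beginning (not names following the street), the helper name `split_building_name` is hypothetical, and the sample addresses are invented.

```python
import re

ONE_WORD_ENDINGS = {"apts", "hotel", "house", "lodge", "manor"}
TWO_WORD_ENDINGS = {("care", "centre"), ("care", "cen")}


def split_building_name(address):
    """Return (building_name, rest) when the text before the first digit ends
    with a known building word; otherwise return (None, address)."""
    m = re.search(r"\d", address)
    if m is None or m.start() == 0:
        return None, address
    prefix = address[:m.start()].strip(" ,")
    words = prefix.lower().split()
    if words and (words[-1] in ONE_WORD_ENDINGS or tuple(words[-2:]) in TWO_WORD_ENDINGS):
        return prefix, address[m.start():]
    return None, address


name, rest = split_building_name("Rose Manor 123 Main St, Victoria, BC")
```

Whole-word comparison keeps a prefix such as "Lighthouse" from being mistaken for a name ending in "House".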

mraross commented on June 19, 2024

Noise immunity to unknown site names or other garbage at the front of an address seems to be the most promising thing to implement. As discussed, you could try the following frontGate heuristic:

  1. In a left-to-right word scan, look for the first number token, slap a frontGate just before it, and try to geocode the result; if you get a score of 95 or higher, return the result with an added SITE_NAME.unknown fault and a five-point penalty. One additional refinement is to check whether the word immediately left of the first number token is a unitDescriptor and act appropriately.

  2. If step 1 fails, try a sliding frontGate: place a frontGate after the first word and try to geocode that; if it returns a perfect match at any given matchPrecision, return the result with an additional five-point penalty. If there is no perfect match, slide the frontGate right one word and repeat. For example:

    Harry's Harpoonery Atlin, BC

should return the following:

Harry's Harpoonery -- Atlin, BC

with a score of 63, matchPrecision of locality, and penalty of 5 points for missing frontGate(?).

In addition to better noise immunity, the above approach allows the frontGate to be optional on input and takes care of c/o elements.
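
The two steps can be sketched as follows. This is a sketch under assumptions, not the geocoder's code: `geocode` is a hypothetical callable returning `(score, faults)`, a "perfect match" is taken to mean an empty fault list, the `--` gate is the one shown in the example above, and the unitDescriptor refinement is omitted.

```python
def front_gate_salvage(address, geocode, gate="--", penalty=5):
    """Return (gated_address, penalized_score, faults), or None if nothing matched."""
    words = address.split()

    # Step 1: put the gate just before the first token that starts with a digit.
    first_num = next((i for i, w in enumerate(words) if w[0].isdigit()), None)
    if first_num:  # skips None (no number) and 0 (nothing in front of the number)
        candidate = " ".join(words[:first_num] + [gate] + words[first_num:])
        score, faults = geocode(candidate)
        if score >= 95:
            return candidate, score - penalty, faults + ["SITE_NAME.unknown"]

    # Step 2: slide the gate right one word at a time until a perfect match appears.
    for i in range(1, len(words)):
        candidate = " ".join(words[:i] + [gate] + words[i:])
        score, faults = geocode(candidate)
        if not faults:
            return candidate, score - penalty, faults
    return None
```

For `Harry's Harpoonery Atlin, BC`, step 1 is skipped because there is no number token, and step 2 tries `Harry's -- Harpoonery Atlin, BC` before reaching `Harry's Harpoonery -- Atlin, BC`.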

mraross commented on June 19, 2024

Another possible heuristic to try after the frontGate heuristic is a strict pattern match with two noise word areas:

civicNumber sw-streetName streetType * sw-locality 'BC'
civicNumber sw-streetName streetType unitDesignator unitNumber * sw-locality 'BC'
unitNumber civicNumber sw-streetName streetType * sw-locality 'BC'

where sw means single-word (e.g., single-word streetName)
and * means one or more noise words

These three patterns should match thousands of addresses with noise.
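
As a rough sketch, the three patterns could be expressed as regular expressions. The street-type and unit-designator lists below are small placeholder sets chosen for illustration (a real implementation would use the geocoder's own vocabularies), and `match_noisy` is a hypothetical name.

```python
import re

TYPES = "ST|AVE|RD|DR|PL"   # placeholder street types
UNITS = "UNIT|APT|SUITE"    # placeholder unit designators

CIVIC = r"(?P<civicNumber>\d+)"
STREET = rf"(?P<streetName>[A-Z]+) (?P<streetType>{TYPES})"
UNIT_NO = r"(?P<unitNumber>\d+[A-Z]?)"
TAIL = r"(?P<noise>.+) (?P<locality>[A-Z]+) BC"   # noise, single-word locality, 'BC'

# The more specific patterns are tried first, so unit elements are not read as noise.
PATTERNS = [
    re.compile(rf"{CIVIC} {STREET} (?P<unitDesignator>{UNITS}) {UNIT_NO} {TAIL}"),
    re.compile(rf"{UNIT_NO} {CIVIC} {STREET} {TAIL}"),
    re.compile(rf"{CIVIC} {STREET} {TAIL}"),
]


def match_noisy(address):
    """Return the matched address elements as a dict, or None if no pattern fits."""
    text = re.sub(r"[,\s]+", " ", address).strip().upper()
    for pattern in PATTERNS:
        m = pattern.fullmatch(text)
        if m:
            return m.groupdict()
    return None


parts = match_noisy("123 Main St return mail Victoria, BC")
```

For this invented address, `parts` has civicNumber `123`, streetName `MAIN`, streetType `ST`, noise `RETURN MAIL`, and locality `VICTORIA`. An address with no noise words returns None, since `*` requires at least one.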
