posuto's Introduction

posuto

Posuto is a wrapper for the postal code data distributed by Japan Post. It makes mapping Japanese postal codes to addresses easier than working with the raw CSV.

You can read more about the motivations for posuto in Parsing the Infamous Japanese Postal CSV.

Issues do not need to be written in English. (issueを英語で書く必要はありません。)

Postbox character by Irasutoya

Features:

  • multi-line neighborhoods are joined
  • parenthetical notes are put in a separate field
  • change reasons are converted from flags to labels
  • kana records are unified for easy access
  • codes with multiple areas provide a list of alternates

Romaji provided by JP Post were previously included in this library, but they are extremely low quality and hard to sync, due to being updated separately. If you need romaji it is recommended you use cutlet instead.

To install:

pip install posuto

Example usage:

import posuto as 〒

🗼 = 〒.get('〒105-0011')

print(🗼)
# "東京都港区芝公園"
print(🗼.prefecture)
# "東京都"
print(🗼.kana)
# "トウキョウトミナトクシバコウエン"
print(🗼.note)
# None

Note: Unfortunately 〒 and 🗼 are not valid identifiers in Python, so the above is pseudocode. See examples/sample.py for an executable version.

You can provide a postal code with basic formatting, and postal data will be returned as a named tuple with a few convenience functions. Read on for details of how quirks in the original data are handled.
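
As a rough illustration of what "basic formatting" could involve, the sketch below reduces an input such as '〒105-0011' to seven bare digits. `normalize_code` is a hypothetical helper, not posuto's actual implementation; the only grounded detail is that an error message quoted later on this page shows '〒869-0198' being looked up as 8690198.

```python
import re
import unicodedata


def normalize_code(raw: str) -> str:
    """Reduce a loosely formatted postal code to its seven digits.

    Hypothetical helper for illustration; posuto's own parsing may differ.
    """
    # NFKC folds full-width digits and hyphens into their ASCII forms.
    folded = unicodedata.normalize("NFKC", raw)
    # Drop the postal mark, hyphens, spaces and anything else non-numeric.
    digits = re.sub(r"[^0-9]", "", folded)
    if len(digits) != 7:
        raise ValueError(f"expected 7 digits, got {raw!r}")
    return digits


print(normalize_code("〒105-0011"))
```

Rejecting inputs that do not reduce to exactly seven digits is a design choice of this sketch, made so that malformed input fails early instead of producing a confusing lookup error.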

Details

The original CSV files are kept in source control but are not distributed as part of the pip package. Instead, the CSV is converted to JSON, which is then loaded into an SQLite database and included in the package distribution. As a result, most of the code complexity in this package lives in the build step rather than at runtime.
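
The JSON-to-SQLite step can be pictured with the minimal sketch below. The table name `postal_data` and its `code` and `data` columns are taken from the query visible in the tracebacks quoted later on this page; the primary key, the helper names `build_db` and `lookup`, and the abbreviated sample records are assumptions for illustration, not the actual build script.

```python
import json
import sqlite3

# Abbreviated sample records; the real converted data has many more fields.
records = {
    "1050011": {"prefecture": "東京都", "city": "港区", "neighborhood": "芝公園"},
    "6020847": {"prefecture": "京都府", "city": "京都市上京区", "neighborhood": "大宮町"},
}


def build_db(records, path=":memory:"):
    """Store each record as a JSON string keyed by postal code (assumed schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE postal_data (code TEXT PRIMARY KEY, data TEXT)")
    conn.executemany(
        "INSERT INTO postal_data VALUES (?, ?)",
        ((code, json.dumps(rec, ensure_ascii=False)) for code, rec in records.items()),
    )
    conn.commit()
    return conn


def lookup(conn, code):
    """Fetch and decode one record, raising KeyError for unknown codes."""
    row = conn.execute(
        "SELECT data FROM postal_data WHERE code = ?", (code,)
    ).fetchone()
    if row is None:
        raise KeyError("No such postal code: " + code)
    return json.loads(row[0])


conn = build_db(records)
print(lookup(conn, "1050011")["city"])
```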

The postal code data has many irregularities and oddities. This section explains how they are handled.

As another note, in normal usage posuto doesn't require any dependencies. When actually building the postal data from the raw CSVs mojimoji is used for character conversion and iconv for encoding conversion.

Field names

The primary fields of an address and the translations preferred here for each are:

  • 都道府県: prefecture
  • 市区町村: city
  • 町域名: neighborhood

import posuto

# Tokyo Tower (🗼)
tt = posuto.get('〒105-0011')
print(tt.prefecture, tt.city, tt.neighborhood)
# "東京都 港区 芝公園"

Notes

The postal data often includes notes in the neighborhood field. These are always in parentheses, with one exception: "以下に掲載がない場合" ("when not listed below"). All notes are put in the note field, and no attempt is made to extract their yomigana or romaji (which are often not available anyway).

minatoku = posuto.get('1050000')
print(minatoku.note)
# "以下に掲載がない場合"
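
The rule above can be sketched as a small function that separates a raw neighborhood field into a name and a note. `split_note` is a hypothetical name, and returning an empty neighborhood for the "以下に掲載がない場合" case is an assumption of this sketch; posuto's build code may treat that case differently.

```python
import re

# Matches "name(note)" with either ASCII or full-width parentheses.
_NOTE_RE = re.compile(r"^(?P<name>[^(（]*)[(（](?P<note>.*)[)）]$")

# The one note that appears without parentheses.
FALLBACK = "以下に掲載がない場合"


def split_note(raw):
    """Return (neighborhood, note) for a raw neighborhood field."""
    if raw == FALLBACK:
        # Assumption: the whole field is a note and no name remains.
        return "", raw
    match = _NOTE_RE.match(raw)
    if match:
        return match.group("name"), match.group("note")
    return raw, None


print(split_note("三小牛町(ヘ)"))
```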

Yomigana

Yomigana are converted to full-width kana.
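
For a sense of what this conversion does, the sketch below uses Unicode NFKC normalization from the standard library as a stand-in. This is not how posuto does it (the build uses mojimoji, as noted above), and NFKC also alters other characters such as full-width ASCII, so treat it as an illustration only. `to_fullwidth_kana` is a hypothetical name.

```python
import unicodedata


def to_fullwidth_kana(text):
    """Convert half-width katakana to full-width using NFKC normalization.

    Half-width voiced marks are combined with the preceding kana,
    so two half-width characters can become one full-width character.
    """
    return unicodedata.normalize("NFKC", text)


print(to_fullwidth_kana("ｼﾊﾞｺｳｴﾝ"))
```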

Long Neighborhood Names

The postal data README explains that when the neighborhood field is over 38 characters it will be continued onto multiple lines. This is not explicitly marked in the data, and where line breaks are inserted in long neighborhoods appears to be random (it's often neither after the 38th character nor at a reasonable word boundary). The only indicator of long lines is an unclosed parenthesis on the first line. Such long lines are always in order in the original file.

In posuto, the parenthetical information is considered a note and put in the note field.

omiya = posuto.get('6020847')
print(omiya)
# "京都府京都市上京区大宮町"
print(omiya.note)
# "今出川通河原町西入、今出川通寺町東入、今出川通寺町東入下る、河原町通今出川下る、河原町通今出川下る西入、寺町通今出川下る東入、中筋通石薬師上る"
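
The joining rule described in this section (an unclosed parenthesis marks the start of a run, and continuation rows follow in order) can be sketched as below. The function names are hypothetical, and a real build step would presumably also check that the continuation rows share the same postal code; that check is left out here for brevity.

```python
def _is_open(field):
    """True when the field has more opening than closing parentheses."""
    opens = field.count("(") + field.count("（")
    closes = field.count(")") + field.count("）")
    return opens > closes


def join_continuations(fields):
    """Merge neighborhood fields that were split across consecutive rows."""
    merged = []
    buffer = None  # holds the partial field while a run is open
    for field in fields:
        if buffer is not None:
            buffer += field
            if ")" in field or "）" in field:
                merged.append(buffer)
                buffer = None
        elif _is_open(field):
            buffer = field
        else:
            merged.append(field)
    if buffer is not None:
        # Keep an unterminated run instead of silently dropping it.
        merged.append(buffer)
    return merged


rows = ["大桑町(ア、イ、", "ク、ケ、", "ロ甲、和)", "三小牛町(ヘ)"]
print(join_continuations(rows))
```

The sample rows are shortened from the 921-8046 entry quoted in one of the issues on this page.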

Multiple Regions in One Code

Sometimes a postal code covers multiple regions. Often the city is the same and just the neighborhood varies, but sometimes part of the city field varies, or even the whole city field. Codes like this are indicated by the "一つの郵便番号で二以上の町域を表す場合の表示" field in the original CSV data, which is called multi here.

For now, if more than one region shares a single code, the main entry is the first region listed in the main CSV, and the other regions are stored as a list in the alternates property. There may be a better way to do this.
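
A minimal sketch of this first-listed-wins arrangement follows. `group_by_code` is a hypothetical helper that assumes rows arrive in CSV order as dicts with a `postal_code` key; it is not the actual build code.

```python
def group_by_code(rows):
    """Group rows by postal code; the first row wins, the rest become alternates."""
    entries = {}
    for row in rows:
        code = row["postal_code"]
        # Every record carries its own (initially empty) alternates list.
        record = dict(row, alternates=[])
        if code in entries:
            entries[code]["alternates"].append(record)
        else:
            entries[code] = record
    return entries


rows = [
    {"postal_code": "9218046", "neighborhood": "大桑町"},
    {"postal_code": "9218046", "neighborhood": "三小牛町"},
    {"postal_code": "1050011", "neighborhood": "芝公園"},
]
entries = group_by_code(rows)
print([alt["neighborhood"] for alt in entries["9218046"]["alternates"]])
```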

Programming Notes

This section is for notes on the use of the library itself as opposed to notes about the data structure.

Multi-threaded Environments

By default, posuto creates a DB connection and cursor on startup and reuses it for all requests. In the typical single-threaded, read-only scenario this is not a problem, but it causes warnings (and may cause problems) in a multi-threaded scenario. In that case you can manage db connections manually using a context manager object.

from posuto import Posuto

with Posuto() as pp:
    tower = pp.get('〒105-0011')

When the object is used this way, the connection is closed automatically when the with block exits.

License

The original postal data is provided by JP Post with an indication they will not assert copyright. The code in this repository is released under the MIT or WTFPL license.

posuto's People

Contributors

digitalaun, polm, yukota

posuto's Issues

Speeding up lookups

Lookup speed appears to differ considerably from that of jusho, a similar library.

import timeit
import posuto

timeit.timeit(lambda: posuto.get('160-0021'), number=100)

→7.848345300000801

import timeit
from jusho import Jusho

postman = Jusho()

timeit.timeit(lambda: postman.by_zip_code('160-0021'), number=100)

→0.018534900002123322

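Both libraries looked like similar implementations, so I was puzzled and read through the source code. jusho creates an index on its SQLite table, as shown at the link below, whereas I could not find an equivalent step in posuto, so that may be the cause. (Apologies if I have misunderstood.)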
https://github.com/nagataaaas/Jusho/blob/5fa5037d42c16e1dceca693afd9a04f269815f90/jusho/gateway/insert.py#L37

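jusho does not appear to support things like the individual postal codes assigned to large business establishments (大口事業所個別番号), so there may well be situations where I would want to use posuto. If any kind of speedup is possible, I would appreciate your considering it.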
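
To illustrate the effect the issue describes, the sketch below asks SQLite for its query plan before and after an index is created. The table name and columns follow the query seen in posuto's tracebacks on this page; the index name `idx_postal_code` and the dummy rows are made up, and whether posuto's packaged database actually lacked an index is the issue author's guess, not something this sketch confirms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postal_data (code TEXT, data TEXT)")
# Dummy rows only; the codes and payloads are placeholders.
conn.executemany(
    "INSERT INTO postal_data VALUES (?, ?)",
    ((f"{n:07d}", "{}") for n in range(10_000)),
)


def query_plan(conn):
    """Return SQLite's plan for a lookup by code as one string."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT data FROM postal_data WHERE code = ?",
        ("0001234",),
    ).fetchall()
    # The last column of each row is the human-readable plan step.
    return " ".join(row[-1] for row in rows)


before = query_plan(conn)
conn.execute("CREATE INDEX idx_postal_code ON postal_data (code)")
after = query_plan(conn)
print(before)
print(after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should report a full scan of the table and the second a search using the index.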

Special postal codes not handled

Hi,

Saw that the version 0.2.0 was out and that it had migrated from JSON to Sqlite. I've never used a Python library that does that (not that I am aware) nor packaged one. So decided to try and see if that worked, if that'd be slow, etc.

Installation was super smooth 👍 no issues found.

Then decided to test with a random address. Picked Tamana/Kumamoto (my distant family hometown), then googled a random address, and found this website: https://www.town.nagasu.lg.jp/default.html

The footer of the page contains: " 〒869-0198 熊本県玉名郡長洲町大字長洲2766番地 Tel:0968-78-3111 Fax:0968-78-1092"

I got the postal code, and tried the following code:

>>> import posuto
>>> posuto.get('〒869-0198')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/venv/lib/python3.8/site-packages/posuto/posuto.py", line 52, in get
    base = dict(_fetch_code(code))
  File "/tmp/venv/lib/python3.8/site-packages/posuto/posuto.py", line 21, in _fetch_code
    raise KeyError("No such postal code: " + code)
KeyError: 'No such postal code: 8690198'
>>> 

Searching the same postal code on Google.co.jp returns the right location on the map.

Not sure how to provide a pull request, but thought it could be useful to report this missing postal code?

Anyway, great library, and nice trick of including an sqlite DB, might come in handy some day.

Thanks!
Bruno

Suggestion: build and distribute as a SQLite DB file

This library currently distributes data as a built JSON file, which has to be read into memory in full.

If the library used a binary SQLite DB file it could run lookups against an index and avoid needing to load the entire file into memory.

Since Python includes sqlite3 in the Python standard library this could be done without needing any extra library dependencies.

SyntaxError: invalid character in identifier

How do we input 〒?

Python 3.6.8 (default, Jul 12 2019, 20:53:32)
[GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import posuto as 〒
  File "<stdin>", line 1
    import posuto as 〒
                     ^
SyntaxError: invalid character in identifier

sqlite3.ProgrammingError is raised even when using the syntax for multi-threaded environments

Symptom

When I follow the usage described at the link below and call it from a multi-threaded environment, this exception is raised:
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread.
https://github.com/polm/posuto#multi-threaded-environments

Traceback

Some file paths have been abbreviated.

  File ".venv/lib/python3.8/site-packages/posuto/posuto.py", line 38, in get
    return get(code, self._db)
  File ".venv/lib/python3.8/site-packages/posuto/posuto.py", line 75, in get
    base = dict(_fetch_code(code))
  File ".venv/lib/python3.8/site-packages/posuto/posuto.py", line 41, in _fetch_code
    db.execute("select data from postal_data where code = ?", (code,))
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140153315018560 and this is thread id 140153219274496.

Library isn't thread safe

From me trying to use this inside a Django application:

  [...snip...]
  File "/home/rtpg/.virtualenvs/hats-env/lib/python3.8/site-packages/posuto/posuto.py", line 53, in get
    base = dict(_fetch_code(code))
  File "/home/rtpg/.virtualenvs/hats-env/lib/python3.8/site-packages/posuto/posuto.py", line 17, in _fetch_code
    DB.execute("select data from postal_data where code = ?", (code,))
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140451714606912 and this is thread id 140450079946496.

My understanding is that web servers like gunicorn run in a pre-fork model, where they import all your application, then fork the process into worker threads. So, in particular, stuff like import posuto will run on a different thread

Since we're using a global cursor here, instead of a thread-local, we run into this conflict

DB = CONN.cursor()

My guess is that just having some thread-local (you can have multiple read connections to sqlite without issue, to my knowledge) would let this code transparently work. My current workaround is to defer the import until after the fork.
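
The thread-local idea suggested here could look roughly like the sketch below, which gives each thread its own connection and cursor on first use. `get_cursor` and `_local` are hypothetical names, and this is not a description of how posuto resolved the issue. As a simplification, the cached cursor ignores a later change of `db_path` within the same thread.

```python
import sqlite3
import threading

# Each thread sees its own independent attributes on this object.
_local = threading.local()


def get_cursor(db_path):
    """Return a cursor owned by the calling thread, creating it on first use."""
    cursor = getattr(_local, "cursor", None)
    if cursor is None:
        conn = sqlite3.connect(db_path)
        _local.conn = conn  # keep the connection alive alongside its cursor
        _local.cursor = cursor = conn.cursor()
    return cursor
```

Because a connection is only ever used by the thread that created it, the "SQLite objects created in a thread can only be used in that same thread" check is never triggered.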

Generate a nice diff

Each month I review updates and manually create a summary, but it should be possible to do that automatically. Example information:

  • added/deleted/changed place names
  • rank of prefectures by count of changes
  • something for the office data?
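
A first cut at the summary could be computed with set operations over two monthly snapshots, as in the sketch below. The snapshot shape (a dict mapping postal code to a record with `prefecture` and `neighborhood` keys), the function name `summarize_changes`, and the placeholder data are all assumptions for illustration; office data is not covered.

```python
from collections import Counter


def summarize_changes(old, new):
    """Compare two {postal_code: record} snapshots and summarize the changes."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    changed = sorted(c for c in old.keys() & new.keys() if old[c] != new[c])

    # Count changes per prefecture to rank prefectures by activity.
    by_prefecture = Counter()
    for code in added + changed:
        by_prefecture[new[code]["prefecture"]] += 1
    for code in deleted:
        by_prefecture[old[code]["prefecture"]] += 1

    return {
        "added": added,
        "deleted": deleted,
        "changed": changed,
        "prefecture_rank": by_prefecture.most_common(),
    }


# Placeholder snapshots with invented codes and names.
old = {
    "0000001": {"prefecture": "東京都", "neighborhood": "A"},
    "0000002": {"prefecture": "東京都", "neighborhood": "B"},
    "0000003": {"prefecture": "京都府", "neighborhood": "C"},
}
new = {
    "0000001": {"prefecture": "東京都", "neighborhood": "A"},
    "0000002": {"prefecture": "東京都", "neighborhood": "B2"},
    "0000004": {"prefecture": "東京都", "neighborhood": "D"},
}
summary = summarize_changes(old, new)
print(summary["prefecture_rank"])
```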

TypeError from missing romaji on a certain postcode

>>> posuto.get("1057529")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../lib/python3.8/site-packages/posuto/posuto.py", line 57, in get
    out = PostalCode(**base)
TypeError: __new__() missing 3 required positional arguments: 'prefecture_romaji', 'city_romaji', and 'neighborhood_romaji'

I'm not clear on how this database gets built. Is this perhaps the effect of some data getting carried over from an older implementation of this system?
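
The error arises because a named tuple requires every field unless defaults are declared. The sketch below reproduces the situation with a reduced, hypothetical field list (the real `PostalCode` has more fields) and shows one defensive option, giving the romaji fields a default of None. This is not necessarily how the library fixed it. The sample record is the alternate entry from the 921-8046 report on this page, which has no romaji keys.

```python
from collections import namedtuple

# Reduced field list for illustration only.
FIELDS = [
    "postal_code", "prefecture", "city", "neighborhood",
    "prefecture_romaji", "city_romaji", "neighborhood_romaji",
]

# Defaults apply to the rightmost fields, so the three romaji fields
# become optional (requires Python 3.7 or later).
PostalCode = namedtuple("PostalCode", FIELDS, defaults=(None, None, None))

record = {
    "postal_code": "9218046",
    "prefecture": "石川県",
    "city": "金沢市",
    "neighborhood": "三小牛町",
}
entry = PostalCode(**record)
print(entry.city, entry.city_romaji)
```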

Error when specifying the postal code "921-8046"

Excuse me for writing in Japanese. Specifying the postal code 921-8046 results in the following error.

>>> import posuto
>>> posuto.get('921-8046')
(omitted)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/posuto/posuto.py", line 56, in get
    base['alternates'] = [PostalCode(**aa) for aa in base['alternates']]
  File "/usr/local/lib/python3.6/site-packages/posuto/posuto.py", line 56, in <listcomp>
    base['alternates'] = [PostalCode(**aa) for aa in base['alternates']]
TypeError: __new__() missing 3 required positional arguments: 'prefecture_romaji', 'city_romaji', and 'neighborhood_romaji'

Environment checked:

  • posuto-0.2.1
  • Python 3.6.10

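I wondered whether this was the so-called "KEN_ALL.CSV entry spanning multiple lines" case, but that does not seem to be it.
I could not quite track down which cases are affected (perhaps because postaldata.json is missing some of the namedtuple fields?), so I would appreciate it if you could take a look.

For reference: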

$ curl -sS https://raw.githubusercontent.com/polm/posuto/master/posuto/postaldata.json | jq '.["9218046"]'
{
  "jisx0402": "17201",
  "old_code": "92181",
  "postal_code": "9218046",
  "prefecture_kana": "イシカワケン",
  "city_kana": "カナザワシ",
  "neighborhood_kana": "オオクワマチ",
  "prefecture": "石川県",
  "city": "金沢市",
  "neighborhood": "大桑町",
  "partial": true,
  "koazabanchi": false,
  "chome": false,
  "multi": true,
  "update_status": "変更なし",
  "update_reason": "変更なし",
  "multiline": true,
  "alternates": [
    {
      "jisx0402": "17201",
      "old_code": "92181",
      "postal_code": "9218046",
      "prefecture_kana": "イシカワケン",
      "city_kana": "カナザワシ",
      "neighborhood_kana": "ミツコウジマチ",
      "prefecture": "石川県",
      "city": "金沢市",
      "neighborhood": "三小牛町",
      "partial": true,
      "koazabanchi": false,
      "chome": false,
      "multi": true,
      "update_status": "変更なし",
      "update_reason": "変更なし",
      "multiline": false,
      "alternates": [],
      "note": "ヘ"
    }
  ],
  "note": "ア、イ、ヰ、ウ、上野、ヲ、オ乙、鐘搗山、上川原、上猫下、ク、ケ、御所谷、小寺山、シ、下上野、下西欠、平、チ、ツ乙、ツ丙、テ、ト、中上野、中尾山、中平、中ノ大平、西ノ山、猫シタイ、ノ、ハ、開、法師山、坊山、マ、鱒川淵、ム、元末、元涌波庚、ヤ、リ、ル、レ乙、レ甲、ロ乙、ロ甲、和",
  "prefecture_romaji": "Ishikawa Ken",
  "city_romaji": "Kanazawa Shi",
  "neighborhood_romaji": "Okuwamachi"
}

$ curl -sS -L "https://github.com/polm/posuto/blob/master/raw/ken_all.utf8.csv?raw=true" | grep "9218046"
17201,"92181","9218046","イシカワケン","カナザワシ","オオクワマチ(ア、イ、イ、ウ、ウエノ、オ、オオツ、カネツキヤマ、カミカワラ、カミネコシタ、","石川県","金沢市","大桑町(ア、イ、ヰ、ウ、上野、ヲ、オ乙、鐘搗山、上川原、上猫下、",1,0,0,1,0,0
17201,"92181","9218046","イシカワケン","カナザワシ","ク、ケ、ゴショガダニ、コデラヤマ、シ、シモウエノ、シモニシガケ、ダイラ、チ、ツオツ、ツヘイ、テ、ト、","石川県","金沢市","ク、ケ、御所谷、小寺山、シ、下上野、下西欠、平、チ、ツ乙、ツ丙、テ、ト、",1,0,0,1,0,0
17201,"92181","9218046","イシカワケン","カナザワシ","ナカウエノ、ナカオヤマ、ナカダイラ、ナカノオオヒラ、ニシノヤマ、ネコノシタイ、ノ、ハ、ヒラキ、","石川県","金沢市","中上野、中尾山、中平、中ノ大平、西ノ山、猫シタイ、ノ、ハ、開、",1,0,0,1,0,0
17201,"92181","9218046","イシカワケン","カナザワシ","ホウシヤマ、ボウヤマ、マ、マスカワブチ、ム、モトスエ、モトワクナミコウ、ヤ、リ、ル、レオツ、","石川県","金沢市","法師山、坊山、マ、鱒川淵、ム、元末、元涌波庚、ヤ、リ、ル、レ乙、",1,0,0,1,0,0
17201,"92181","9218046","イシカワケン","カナザワシ","レコウ、ロオツ、ロコウ、ワ)","石川県","金沢市","レ甲、ロ乙、ロ甲、和)",1,0,0,1,0,0
17201,"92181","9218046","イシカワケン","カナザワシ","ミツコウジマチ(ヘ)","石川県","金沢市","三小牛町(ヘ)",1,0,0,1,0,0

$ docker run -it --rm python:3.8.6-slim /bin/bash -c 'pip3 install posuto; python -c "import posuto; posuto.get("921-8046")"'
Collecting posuto
Downloading posuto-0.2.1.tar.gz (5.4 MB)
|████████████████████████████████| 5.4 MB 3.6 MB/s
Building wheels for collected packages: posuto
Building wheel for posuto (setup.py) ... done
Created wheel for posuto: filename=posuto-0.2.1-py3-none-any.whl size=5324729 sha256=5f3fcfe986c5072b00316b391f0b1d4faa7b576bbd9852c5036a266e31dfbbe9
Stored in directory: /root/.cache/pip/wheels/6b/a0/fb/8b586611424f543d9622afcbe2e916efee1e98e6293305bf5b
Successfully built posuto
Installing collected packages: posuto
Successfully installed posuto-0.2.1
WARNING: You are using pip version 20.3.1; however, version 20.3.3 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/posuto/posuto.py", line 56, in get
    base['alternates'] = [PostalCode(**aa) for aa in base['alternates']]
  File "/usr/local/lib/python3.8/site-packages/posuto/posuto.py", line 56, in <listcomp>
    base['alternates'] = [PostalCode(**aa) for aa in base['alternates']]
TypeError: __new__() missing 3 required positional arguments: 'prefecture_romaji', 'city_romaji', and 'neighborhood_romaji'

Use Calendar Versioning

From the next version, posuto will use calendar versioning. The version will be of the form YYYY.MM.release.

While ideally posuto would have data and code well-separated, the truth is that they're entangled, and it doesn't really make sense to try to read future data with an old version. For old data, the compiled data cannot be re-used but the preprocessing script can be re-run to generate old data.

Because updates are closely linked to the calendar this seems like a more reasonable versioning scheme than the current one.

Automate updates with Github Actions

Since the location of the data is fixed and predictable, updates should be automatic. Releases can still be manual until the process is confirmed to work.

Use the UTF8 CSV file

JP Post started distributing a UTF8 version of the main CSV file in June 2023.

https://www.post.japanpost.jp/zipcode/dl/utf-zip.html

Besides using UTF8, this replaces the weird line continuations with long lines, which is a welcome update. Unfortunately, it doesn't resolve other issues like comments in place names, and columns don't have headings. Also, no equivalent file is provided for the office postal codes.

It would be good to update posuto to use this new file after verifying that there is no difference between our conversion and the official one.
