iwpnd / flashgeotext

Extract city and country mentions from text, like GeoText, but using FlashText (an Aho-Corasick implementation) instead of regular expressions.

License: MIT License

Python 79.27% Jupyter Notebook 20.53% Batchfile 0.14% Shell 0.07%

geotext named-entity-extraction flashtext python search-in-text search

flashgeotext's Introduction

Hi 👋, I'm Ben

I'm a backend software engineer with a background in geographic sciences, a strong focus on geospatial solutions and a heart for open-source.

🗺 I'm open for new challenges.
🛴 I'm a Backend Software Developer at TIER with the geo team.
🚙 I was a Data Engineer and Software Developer at HERE with Local Data Intelligence.

flashgeotext's People

Forkers

francbartoli yqchen123 trivedisorabh chazgorman abeusher mausamadh jerrychong25

flashgeotext's Issues

Optionally list found synonyms in extract output

geotext.extract(article.text, span_info=True)

>> {
    'city_names': {
        'Москва': {
            'count': 3, 
            'span_info': [(2, 8), (204, 210), (826, 832)]
            }
        }
    }

It would be cool to list the found toponyms optionally also:

geotext.extract(article.text, span_info=True, list_synonyms=True)

>> {
    'city_names': {
        'Москва': {
            'count': 3, 
            'span_info': [(2, 8), (204, 210), (826, 832)],
            'found_as': ['Москве', 'Москвы'] 
            }
        }
    }

So basically just parsing text[span_info].
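The idea of slicing the text at each span can be sketched in a few lines. This is only an illustration: `found_as` is a hypothetical helper name taken from the proposed output key, not an existing flashgeotext function.

```python
def found_as(text, spans):
    """Return the distinct surface forms found at the given (start, end) spans.

    Forms are kept in first-seen order. `spans` is assumed to have the same
    shape as the 'span_info' list shown above.
    """
    forms = []
    for start, end in spans:
        form = text[start:end]
        if form not in forms:
            forms.append(form)
    return forms
```

Calling `found_as(article.text, [(2, 8), (204, 210), (826, 832)])` would then give the list that the proposed `'found_as'` key holds.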

.validate() returns LookupValidation object without repr

LookupData.validate() should return some kind of repr to point the user to errors in the LookupData
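A minimal sketch of what such a repr could look like. `ValidationReport`, its `status` field and its `errors` mapping are hypothetical stand-ins chosen for illustration; the actual fields of LookupValidation may differ.

```python
from dataclasses import dataclass, field


@dataclass(repr=False)
class ValidationReport:
    """Hypothetical stand-in for a validation result (not the library's class)."""

    status: str = "No errors detected"
    # Assumed shape: offending key -> list of human-readable messages.
    errors: dict = field(default_factory=dict)

    def __repr__(self):
        if not self.errors:
            return f"<ValidationReport: {self.status}>"
        lines = [f"<ValidationReport: {len(self.errors)} key(s) with errors>"]
        for key, messages in self.errors.items():
            for message in messages:
                lines.append(f"  {key}: {message}")
        return "\n".join(lines)
```

Printing such an object in a REPL would point directly at the keys that need fixing instead of showing a bare object address.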

Initial impressions and questions of flashgeotext for extracting countries from affiliations

Thanks for posting at elyase/geotext#23 (comment) letting me know about this package. I'm interested in it as a way to extract countries referred to in author affiliations.

For example, here is an affiliation:

'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.

For my project, I'd like to know what countries are mentioned (either directly or inferred from a place mention inside that country).

If I run the following (with v0.2.0):

import flashgeotext.geotext
geotexter = flashgeotext.geotext.GeoText(use_demo_data=True)
affil = """\
'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, \
Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, \
Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.
"""
geo_text = geotexter.extract(affil, span_info=False)
geo_text

I get the following output:

2020-03-02 18:50:46.475 | DEBUG    | flashgeotext.lookup:add:194 - cities added to pool
2020-03-02 18:50:46.479 | DEBUG    | flashgeotext.lookup:add:194 - countries added to pool
2020-03-02 18:50:46.480 | DEBUG    | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']
{'cities': {'University': {'count': 2},
  'Saarbrücken': {'count': 1},
  'Carnegie': {'count': 1},
  'Pittsburgh': {'count': 1},
  'Berlin': {'count': 2},
  'Parys': {'count': 2}},
 'countries': {'Germany': {'count': 2},
  'United States': {'count': 1},
  'France': {'count': 2}}}

Some impressions / questions:

Do city mentions count towards country mentions? If yes, why does "United States" not have a count of 2 for "Pittsburgh" and "USA"?
Is "Parys" meant to be "Paris"? I'm not sure why this conversion is made.
Counting "University" as a city will almost always be a false positive for us, although I'm guessing this is a source data issue.
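For the first question, one way to fold city mentions into country counts outside the library is sketched below. It assumes the output shape shown above and a user-supplied city-to-country table; `CITY_TO_COUNTRY` and `countries_with_inferred` are hypothetical names, and nothing here implies that flashgeotext ships such a mapping.

```python
from collections import Counter

# Hypothetical lookup table maintained by the user, not part of flashgeotext.
CITY_TO_COUNTRY = {
    "Saarbrücken": "Germany",
    "Pittsburgh": "United States",
    "Berlin": "Germany",
}


def countries_with_inferred(result, city_to_country):
    """Add each known city's count to the count of the country it lies in."""
    totals = Counter(
        {name: info["count"] for name, info in result.get("countries", {}).items()}
    )
    for city, info in result.get("cities", {}).items():
        country = city_to_country.get(city)
        if country is not None:  # cities missing from the table are ignored
            totals[country] += info["count"]
    return dict(totals)
```

Cities that are absent from the table, such as the spurious "University" match, simply do not contribute.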

Thanks for considering this feedback / helping answer any of these questions.

[ENHANCEMENT] Add a method to list supported scripts for LookupData

There is currently no way to list the supported scripts that reside in /resources/script.json other than

from flashgeotext import settings

print(settings.SCRIPTS.keys())

>> dict_keys(['cyrillic', 'ascii', 'latin', 'syriac', 'ethiopic', 'arabic', 'bengali', 'balinese', 'bamum', 'devanagari', 'tai_viet', 'tibetan', 'bassah_vah', 'buginese', 'chakma', 'cherokee', 'canadian_aboriginal', 'thaana', 'greek', 'adlam', 'gujarati', 'hebrew', 'armenian', 'yi', 'javanese', 'georgian', 'new_tai_lue', 'thai_tham', 'khmer', 'kannada', 'lisu', 'lao', 'mandaic', 'malayalam', 'myanmar', 'n_ko', 'oriya', 'gurmukhi', 'tifinagh', 'sinhala', 'sundanese', 'tamil', 'tai_le', 'telugu', 'thai', 'vai'])

so add a method to show the user the supported scripts in a better way.
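A rough sketch of such a helper. It takes any mapping keyed by script name, such as `settings.SCRIPTS` above; the function name `supported_scripts` is an assumption, not an existing method.

```python
def supported_scripts(scripts):
    """Return the script names of a SCRIPTS-like mapping as a sorted list."""
    return sorted(scripts)


def format_supported_scripts(scripts):
    """Render the script names one per line for printing."""
    return "\n".join(f"- {name}" for name in supported_scripts(scripts))
```

With the library installed, `print(format_supported_scripts(settings.SCRIPTS))` would print one script per line instead of a raw `dict_keys` object.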

Debugging seems to cause fatal errors

Hello, your method helps me a lot. I wanted to debug at some point, but much to my surprise, it shows fatal errors as soon as I start debugging. How can we solve this problem?

[Feature] handle different non_word_boundaries

Currently non_word_boundaries in the Trie are constructed as:

import string

non_word_boundaries = set(string.digits + string.ascii_letters + '_')
print(non_word_boundaries)

>> {'k', '6', 's', 'M', 'i', 'S', 'm', 'E', 'r', 'W', 'v', 'l', 
'R', 'f', 'e', 'X', '7', '3', 'q', 'w', '0', 'x', 'V', 'C', 'n', 
'I', '4', 'D', 'z', 'G', 'L', '2', 'T', 'U', '_', 'B', 't', 'Q', 
'd', '9', 'h', 'o', 'c', 'u', 'P', 'K', 'Y', 'p', 'A', 'J', 'O', 
'N', 'H', 'j', 'a', 'Z', '5', '1', 'b', 'y', 'F', '8', 'g'}

The problem arises when one decides to look up Cyrillic keywords in a Cyrillic text. Because of this limitation, flashgeotext does not reliably extract the longest match: every character not present in non_word_boundaries is treated as a word boundary and stops the traversal through the trie early.

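Say the keyword is:

{"Нижневартовск": ["Нижневартовск"]}

and the text is:

.. Нижневартовском ..

Flashgeotext will extract Нижневартовск, even though it is only a prefix of the longer word, because о is not part of non_word_boundaries and is therefore treated as a word boundary.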
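The effect can be reproduced without the trie. The sketch below uses only the standard library; `build_non_word_boundaries` and `is_whole_word_match` are hypothetical helpers, and the Cyrillic range is limited to the basic Russian alphabet for brevity.

```python
import string

# Basic Russian alphabet: U+0410..U+044F plus Ё/ё (an assumption for brevity;
# other Cyrillic letters would need to be added for other languages).
CYRILLIC = "".join(chr(code) for code in range(0x0410, 0x0450)) + "Ёё"


def build_non_word_boundaries(extra_chars=""):
    """Default ASCII set from above, optionally extended with more characters."""
    return set(string.digits + string.ascii_letters + "_") | set(extra_chars)


def is_whole_word_match(text, start, end, non_word_boundaries):
    """True if text[start:end] is not glued to a word character on either side."""
    before_ok = start == 0 or text[start - 1] not in non_word_boundaries
    after_ok = end == len(text) or text[end] not in non_word_boundaries
    return before_ok and after_ok
```

With the default set, the match inside Нижневартовском is accepted as a whole word; once the Cyrillic letters are added to the set, the same span is rejected because the following о counts as part of the word.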

Publishing package on conda-forge

Hi @iwpnd, this is a great package! Unfortunately, it is not available on conda-forge, while flashtext is.
I'm building a library around it and I can only fetch packages from conda-forge, so I'm wondering whether you would accept a PR to publish it on conda-forge eventually.
For now the workaround is to embed all of flashgeotext statically as a module, but I would love to declare it in the dependencies.

Countries / cities with only (or none) capitals are not recognized

As you can see on both pictures, the correct country / city is only extracted and recognized if it is correctly capitalized. It would be really helpful if the package can also extract and normalize countries / cities that are formatted differently.
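Until the package handles this itself, one possible workaround is to normalize names through a case-folded index built from the lookup data. This sketch assumes lookup data shaped like `{canonical_name: [synonyms]}`; both helper names are hypothetical and not part of flashgeotext.

```python
def build_casefolded_index(lookup):
    """Map every case-folded name or synonym to its canonical spelling."""
    index = {}
    for canonical, synonyms in lookup.items():
        for name in [canonical, *synonyms]:
            index[name.casefold()] = canonical
    return index


def normalize(name, index):
    """Return the canonical spelling for a name in any capitalization, or None."""
    return index.get(name.casefold())
```

This only normalizes names that have already been isolated; it does not by itself find lowercase mentions inside running text.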

Hide trace log

Hi!

I'd like to hide the log trace. How can I do that?

2021-04-07 19:27:23.225 | DEBUG | flashgeotext.lookup:add:194 - cities added to pool
2021-04-07 19:27:23.231 | DEBUG | flashgeotext.lookup:add:194 - countries added to pool
2021-04-07 19:27:23.235 | DEBUG | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']
2021-04-07 19:27:24.777 | DEBUG | flashgeotext.lookup:add:194 - cities added to pool
2021-04-07 19:27:24.784 | DEBUG | flashgeotext.lookup:add:194 - countries added to pool
2021-04-07 19:27:24.786 | DEBUG | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']
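The format of these lines matches loguru's default output, so one possible approach, assuming flashgeotext does log through loguru, is to disable messages coming from the `flashgeotext` namespace before creating a GeoText instance. `silence_flashgeotext_logs` is a hypothetical helper name.

```python
def silence_flashgeotext_logs():
    """Disable loguru messages emitted by flashgeotext modules.

    Returns True if loguru is installed and the messages were disabled,
    False if loguru could not be imported (assumption: the library logs
    via loguru, as the log format above suggests).
    """
    try:
        from loguru import logger
    except ImportError:
        return False
    logger.disable("flashgeotext")
    return True
```

`logger.enable("flashgeotext")` turns the messages back on if they are needed for debugging.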

Cannot install 0.3.0 or 0.3.1

Hi:

I tried to install flashgeotext 0.3.1 via pip or pipenv, but I get the following error:

COMMAND:
pipenv install flashgeotext~=0.3.0

ERROR:
ERROR: Could not find a version that satisfies the requirement flashgeotext~=0.3.0
ERROR: No matching distribution found for flashgeotext~=0.3.0

How can I solve it?

Thanks in advance.
Best.

Missing cities and countries

Hello,

I have started trying out this library, but it seems to be missing cities and countries mostly from South America. What's the best way to update the cities.json and countries.json files? Is it ok just to add the data in there manually?

Also, how does this library map Shanghai to China? Where is that relation defined, and why does it not behave the same for Caracas?

>>> geotext.extract(input_text="Living in Caracas", span_info=True)
{'cities': {'Caracas': {'count': 1, 'span_info': [(10, 17)]}}, 'countries': {}}

Thanks in advance!

TypeError: slice indices must be integers or None or have an index method when span_info is False

According to the docs at https://flashgeotext.iwpnd.pw/geotext/, the extract method of the GeoText class has an optional argument span_info : bool - return span_info. Defaults to True. However, when span_info is passed as False, GeoText fails to parse the text and throws a slice index error.

Below are the steps to reproduce the error:

from flashgeotext.geotext import GeoText
geotext2 = GeoText()

input_text = '''Shanghai. The Chinese Ministry of Finance in Shanghai said that China plans
                    to cut tariffs on $75 billion worth of goods that the country
                    imports from the US. Washington welcomes the decision.'''

geotext2.extract(input_text=input_text, span_info=False)

Output: > "found_as": [input_text[span_start:span_end]], TypeError: slice indices must be integers or None or have an index method
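A sketch of how the result could be assembled so that slicing only happens when integer spans are available. It assumes each match is either a `(name, start, end)` tuple or a bare name string when no span information was requested; `collect_matches` is a hypothetical helper, not the library's actual code.

```python
def collect_matches(input_text, matches):
    """Build a result dict, adding span_info/found_as only when spans exist."""
    result = {}
    for match in matches:
        if isinstance(match, tuple):
            # Span-carrying match: safe to slice the input text.
            name, start, end = match
            entry = result.setdefault(
                name, {"count": 0, "span_info": [], "found_as": []}
            )
            entry["span_info"].append((start, end))
            entry["found_as"].append(input_text[start:end])
        else:
            # Bare name: no indices available, so only count the mention.
            entry = result.setdefault(match, {"count": 0})
        entry["count"] += 1
    return result
```

With bare names the output contains only counts, which matches the span-free output shape shown in the earlier issue.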

[docs] fix linebreaks in pydoc-markdown docs for geotext.md and lookup.md

pydoc-markdown does not seem to render line breaks properly in the docs. This should be fixed.

About Taiwan

Hi, Thanks for this excellent repo which helps me a lot in processing the news data. I really appreciate it. However, I have some suggestions and hope I'm not being particularly offensive.

According to countries.json, Taiwan is listed as an independent country. I went through the code in this repo and found that this may be due to an error in the original web page List_of_alternative_country_names. Nevertheless, as stated by the United Nations, Taiwan belongs to China.

I sincerely hope you can update this great repo and fix this issue.
Thanks!