flashgeotext's Introduction

Hi 👋, I'm Ben

I'm a backend software engineer with a background in geographic sciences, a strong focus on geospatial solutions and a heart for open-source.

  • 🗺 I'm open for new challenges.
  • 🛴 I'm a Backend Software Developer at TIER with the geo team.
  • 🚙 I was a Data Engineer and Software Developer at HERE with Local Data Intelligence.

flashgeotext's People

Contributors

dependabot-preview[bot], dependabot[bot], francbartoli, iwpnd

flashgeotext's Issues

Optionally list found synonyms in extract output

geotext.extract(article.text, span_info=True)

>> {
    'city_names': {
        'Москва': {
            'count': 3, 
            'span_info': [(2, 8), (204, 210), (826, 832)]
            }
        }
    }

It would be cool to optionally list the found toponyms as well:

geotext.extract(article.text, span_info=True, list_synonyms=True)

>> {
    'city_names': {
        'Москва': {
            'count': 3, 
            'span_info': [(2, 8), (204, 210), (826, 832)],
            'found_as': ['Москве', 'Москвы'] 
            }
        }
    }

So basically just slicing the text with each span in span_info.
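
A minimal sketch of that idea, assuming each span is a (start, end) pair of character offsets as in the output above. The helper name found_as and the sample sentence are made up for illustration; this is not part of flashgeotext's API.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]


def found_as(text: str, spans: List[Span]) -> List[str]:
    """Return the distinct surface forms at the given spans, in order of first appearance."""
    seen: Dict[str, None] = {}  # dict keys keep insertion order and drop duplicates
    for start, end in spans:
        seen.setdefault(text[start:end], None)
    return list(seen)


# Hypothetical sample text with two inflected forms of the same city name.
text = "В Москве тепло. Жители Москвы рады."
spans = [(2, 8), (23, 29)]

print(found_as(text, spans))  # ['Москве', 'Москвы']
```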

Initial impressions and questions of flashgeotext for extracting countries from affiliations

Thanks for posting at elyase/geotext#23 (comment) letting me know about this package. I'm interested in it as a way to extract countries referred to in author affiliations.

For example, here is an affiliation:

'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.

For my project, I'd like to know what countries are mentioned (either directly or inferred from a place mention inside that country).

If I run the following (with v0.2.0):

import flashgeotext.geotext
geotexter = flashgeotext.geotext.GeoText(use_demo_data=True)
affil = """\
'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, \
Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, \
Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.
"""
geo_text = geotexter.extract(affil, span_info=False)
geo_text

I get the following output:

2020-03-02 18:50:46.475 | DEBUG    | flashgeotext.lookup:add:194 - cities added to pool
2020-03-02 18:50:46.479 | DEBUG    | flashgeotext.lookup:add:194 - countries added to pool
2020-03-02 18:50:46.480 | DEBUG    | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']
{'cities': {'University': {'count': 2},
  'Saarbrücken': {'count': 1},
  'Carnegie': {'count': 1},
  'Pittsburgh': {'count': 1},
  'Berlin': {'count': 2},
  'Parys': {'count': 2}},
 'countries': {'Germany': {'count': 2},
  'United States': {'count': 1},
  'France': {'count': 2}}}

Some impressions / questions:

  1. Are the city mentions counting towards country mentions? If yes, why does "United States" not have a count of 2 for "Pittsburgh" and "USA"?

  2. Is "Parys" for "Paris"? I'm not sure why this conversion is made.

  3. Counting "University" as a city will almost always be a false positive for us, although I'm guessing this is a source data issue.

Thanks for considering this feedback / helping answer any of these questions.
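
For the use case described above (countries mentioned directly or inferred from a city), one possible approach is to post-process the extract output with a user-supplied city-to-country table. This is only a sketch: CITY_TO_COUNTRY and country_mentions are hypothetical names, and nothing here assumes that the demo data ships such a mapping.

```python
from collections import Counter
from typing import Dict

# Hypothetical lookup table maintained by the caller.
# "Parys" is used as a key because that is the label shown in the output above.
CITY_TO_COUNTRY = {
    "Saarbrücken": "Germany",
    "Pittsburgh": "United States",
    "Berlin": "Germany",
    "Parys": "France",
}


def country_mentions(result: dict, city_to_country: Dict[str, str]) -> Dict[str, int]:
    """Add city counts to the counts of the countries they map to."""
    counts: Counter = Counter()
    for country, info in result.get("countries", {}).items():
        counts[country] += info["count"]
    for city, info in result.get("cities", {}).items():
        country = city_to_country.get(city)
        if country is not None:  # unmapped hits such as 'University' are ignored
            counts[country] += info["count"]
    return dict(counts)


# The extract output quoted above, copied as a plain dict.
result = {
    "cities": {
        "University": {"count": 2},
        "Saarbrücken": {"count": 1},
        "Carnegie": {"count": 1},
        "Pittsburgh": {"count": 1},
        "Berlin": {"count": 2},
        "Parys": {"count": 2},
    },
    "countries": {
        "Germany": {"count": 2},
        "United States": {"count": 1},
        "France": {"count": 2},
    },
}

print(country_mentions(result, CITY_TO_COUNTRY))
```

Leaving a city out of the table is also a cheap way to drop false positives such as "University".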

[ENHANCEMENT] Add a method to list supported scripts for LookupData

There is currently no way to list the supported scripts that reside in /resources/script.json other than

from flashgeotext import settings

print(settings.SCRIPTS.keys())

>> dict_keys(['cyrillic', 'ascii', 'latin', 'syriac', 'ethiopic', 'arabic', 'bengali', 'balinese', 'bamum', 'devanagari', 'tai_viet', 'tibetan', 'bassah_vah', 'buginese', 'chakma', 'cherokee', 'canadian_aboriginal', 'thaana', 'greek', 'adlam', 'gujarati', 'hebrew', 'armenian', 'yi', 'javanese', 'georgian', 'new_tai_lue', 'thai_tham', 'khmer', 'kannada', 'lisu', 'lao', 'mandaic', 'malayalam', 'myanmar', 'n_ko', 'oriya', 'gurmukhi', 'tifinagh', 'sinhala', 'sundanese', 'tamil', 'tai_le', 'telugu', 'thai', 'vai'])

so add a method to show the user the supported scripts in a better way.
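
A rough sketch of what such a helper could look like. The function name supported_scripts is hypothetical, and the SCRIPTS dict below is a stand-in holding a few of the keys listed above with placeholder values, not the real settings.SCRIPTS.

```python
from typing import List, Mapping


def supported_scripts(scripts: Mapping[str, object]) -> List[str]:
    """Return the script names in alphabetical order."""
    return sorted(scripts)


# Stand-in for settings.SCRIPTS; the values are placeholders.
SCRIPTS = {"cyrillic": None, "ascii": None, "latin": None, "arabic": None}

print(", ".join(supported_scripts(SCRIPTS)))  # arabic, ascii, cyrillic, latin
```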

Debugging seems to cause fatal errors

Hello, your method helps me a lot. I just wanted to debug at some point but, much to my surprise, running under a debugger produces fatal errors. How can we solve this problem?

[Feature] handle different non_word_boundaries

Currently non_word_boundaries in the Trie are constructed as:

import string

non_word_boundaries = set(string.digits + string.ascii_letters + '_')
print(non_word_boundaries)

>> {'k', '6', 's', 'M', 'i', 'S', 'm', 'E', 'r', 'W', 'v', 'l', 
'R', 'f', 'e', 'X', '7', '3', 'q', 'w', '0', 'x', 'V', 'C', 'n', 
'I', '4', 'D', 'z', 'G', 'L', '2', 'T', 'U', '_', 'B', 't', 'Q', 
'd', '9', 'h', 'o', 'c', 'u', 'P', 'K', 'Y', 'p', 'A', 'J', 'O', 
'N', 'H', 'j', 'a', 'Z', '5', '1', 'b', 'y', 'F', '8', 'g'}

The problem arises when one looks up Cyrillic keywords in a Cyrillic text. Because of this limitation, flashgeotext does not reliably extract the longest match: any character not present in non_word_boundaries is treated as a word boundary and stops the trie traversal early.

Say the keyword is:

{"Нижневартовск": ["Нижневартовск"]}

and the text is:

.. Нижневартовском ..

flashgeotext will extract Нижневартовск, because о is not part of non_word_boundaries.
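
The effect can be reproduced without the trie. The sketch below is a simplified stand-in (plain substring search plus a boundary check, not flashtext's actual algorithm), and the CYRILLIC_LETTERS set is an assumption that covers only the basic Russian alphabet.

```python
import string
from typing import List, Set, Tuple

ASCII_NON_WORD_BOUNDARIES = set(string.digits + string.ascii_letters + "_")

# Assumption: U+0410..U+044F (А..я) plus Ё/ё is enough for this example.
CYRILLIC_LETTERS = {chr(cp) for cp in range(0x0410, 0x0450)} | {"Ё", "ё"}


def find_whole_words(
    text: str, keyword: str, non_word_boundaries: Set[str]
) -> List[Tuple[int, int]]:
    """Return spans of keyword whose neighbours are not in non_word_boundaries."""
    spans = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        before_ok = start == 0 or text[start - 1] not in non_word_boundaries
        after_ok = end == len(text) or text[end] not in non_word_boundaries
        if before_ok and after_ok:
            spans.append((start, end))
        start = text.find(keyword, start + 1)
    return spans


keyword = "Нижневартовск"
text = "в Нижневартовском районе"

# ASCII-only set: 'о' counts as a boundary, so the partial match is reported.
print(find_whole_words(text, keyword, ASCII_NON_WORD_BOUNDARIES))  # [(2, 15)]

# With Cyrillic letters added, the inflected form is no longer matched.
print(find_whole_words(text, keyword, ASCII_NON_WORD_BOUNDARIES | CYRILLIC_LETTERS))  # []
```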

Publishing package on conda-forge

Hi @iwpnd, this is a great package! Unfortunately, it is not available on conda-forge, while flashtext is.
I'm building a library around it and can only fetch packages from conda-forge, so I'm wondering whether you would accept a PR to publish it on conda-forge eventually.
For now, the workaround is to embed all of flashgeotext statically as a module, but I would love to declare it in the dependencies.

Hide trace log

Hi!

I'd like to hide the log trace. How can I do it?

2021-04-07 19:27:23.225 | DEBUG | flashgeotext.lookup:add:194 - cities added to pool
2021-04-07 19:27:23.231 | DEBUG | flashgeotext.lookup:add:194 - countries added to pool
2021-04-07 19:27:23.235 | DEBUG | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']
2021-04-07 19:27:24.777 | DEBUG | flashgeotext.lookup:add:194 - cities added to pool
2021-04-07 19:27:24.784 | DEBUG | flashgeotext.lookup:add:194 - countries added to pool
2021-04-07 19:27:24.786 | DEBUG | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']

Cannot install 0.3.0 or 0.3.1

Hi:

I tried to install flashgeotext 0.3.1 via pip or pipenv, but I get the following error:

COMMAND:
pipenv install flashgeotext~=0.3.0

ERROR:
ERROR: Could not find a version that satisfies the requirement flashgeotext~=0.3.0
ERROR: No matching distribution found for flashgeotext~=0.3.0

How can I solve it?

Thanks in advance.
Best.

Missing cities and countries

Hello,

I have started trying out this library, but it seems to be missing cities and countries mostly from South America. What's the best way to update the cities.json and countries.json files? Is it ok just to add the data in there manually?

Also, how does this library map Shanghai to China? Where is that relation mapped? Why does it not behave the same for Caracas?

>>> geotext.extract(input_text="Living in Caracas", span_info=True)
{'cities': {'Caracas': {'count': 1, 'span_info': [(10, 17)]}}, 'countries': {}}

Thanks in advance!

TypeError: slice indices must be integers or None or have an __index__ method when span_info is False

As per the doc https://flashgeotext.iwpnd.pw/geotext/, the extract method of the GeoText class has an optional argument span_info : bool - return span_info. Defaults to True. However, when span_info is passed as False, GeoText fails to parse the text and raises a slice-index TypeError.

Below are the steps to reproduce the error:

from flashgeotext.geotext import GeoText
geotext2 = GeoText()

input_text = '''Shanghai. The Chinese Ministry of Finance in Shanghai said that China plans
                    to cut tariffs on $75 billion worth of goods that the country
                    imports from the US. Washington welcomes the decision.'''

geotext2.extract(input_text=input_text, span_info=False)

Output: > "found_as": [input_text[span_start:span_end]], TypeError: slice indices must be integers or None or have an __index__ method
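
Until this is fixed, one possible workaround (an assumption on my side, not a fix from the maintainer) is to call extract with the default span_info=True and strip the spans afterwards. The helper name drop_span_info is hypothetical, and the sample dict is shaped like the Caracas output quoted elsewhere on this page.

```python
from typing import Any, Dict


def drop_span_info(result: Dict[str, Dict[str, Dict[str, Any]]]) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Return a copy of an extract result without the 'span_info' entries."""
    return {
        category: {
            name: {key: value for key, value in info.items() if key != "span_info"}
            for name, info in places.items()
        }
        for category, places in result.items()
    }


# A result shaped like the output of extract(..., span_info=True).
result = {
    "cities": {"Caracas": {"count": 1, "span_info": [(10, 17)]}},
    "countries": {},
}

print(drop_span_info(result))  # {'cities': {'Caracas': {'count': 1}}, 'countries': {}}
```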

About Taiwan

Hi, thanks for this excellent repo, which helps me a lot in processing news data. I really appreciate it. However, I have a suggestion and hope it does not come across as offensive.

According to countries.json, Taiwan is listed as an independent country. I went through the code in this repo and found that this may be due to an error in the original web page List_of_alternative_country_names. Nevertheless, as stated by the United Nations, Taiwan belongs to China.

I sincerely hope you can update this great repo and fix this issue.
Thanks!
