
wiktionaryparser's Introduction

Wiktionary Parser

A Python project that downloads words from the English Wiktionary (en.wiktionary.org) and parses article content into an easy-to-use JSON format. Right now, it parses etymologies, definitions, pronunciations, examples, audio links, and related words.


JSON structure

[{
    "pronunciations": {
        "text": ["pronunciation text"],
        "audio": ["pronunciation audio"]
    },
    "definitions": [{
        "relatedWords": [{
            "relationshipType": "word relationship type",
            "words": ["list of related words"]
        }],
        "text": ["list of definitions"],
        "partOfSpeech": "part of speech",
        "examples": ["list of examples"]
    }],
    "etymology": "etymology text",
}]

Installation

Using pip
  • run pip install wiktionaryparser
From Source
  • Clone the repo or download the zip
  • cd to the folder
  • run pip install -r "requirements.txt"

Usage

  • Import the WiktionaryParser class.
  • Initialize an object and use the fetch("word", "language") method.
  • The default language is English; it can be changed using the set_default_language method.
  • Include/exclude parts of speech to be parsed using include_part_of_speech(part_of_speech) and exclude_part_of_speech(part_of_speech).
  • Include/exclude relations to be parsed using include_relation(relation) and exclude_relation(relation).

Examples

>>> from wiktionaryparser import WiktionaryParser
>>> parser = WiktionaryParser()
>>> word = parser.fetch('test')
>>> another_word = parser.fetch('test', 'french')
>>> parser.set_default_language('french')
>>> parser.exclude_part_of_speech('noun')
>>> parser.include_relation('alternative forms')
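The result of fetch() is a plain list of dicts in the JSON structure shown above. A minimal sketch of navigating it, using a hand-made sample instead of a live fetch() call; the parts_of_speech helper is illustrative, not part of the library:

```python
# A hand-made stand-in for a parser.fetch() result, in the documented shape;
# no network request is made here.
sample = [{
    "pronunciations": {"text": ["IPA: /test/"], "audio": []},
    "definitions": [{
        "relatedWords": [{"relationshipType": "synonyms", "words": ["trial"]}],
        "text": ["test (plural tests)", "A challenge, trial."],
        "partOfSpeech": "noun",
        "examples": ["a test of patience"],
    }],
    "etymology": "From Old French test.",
}]

def parts_of_speech(word_data):
    """Collect every part of speech present across all entries."""
    return [d["partOfSpeech"] for entry in word_data for d in entry["definitions"]]

print(parts_of_speech(sample))  # ['noun']
```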

Requirements

  • requests==2.20.0
  • beautifulsoup4==4.4.0

Contributions

If you want to add features or improvements, or to report issues, feel free to send a pull request!

License

Wiktionary Parser is licensed under MIT.

wiktionaryparser's People

Contributors

duffrecords, flowgunso, imlutr, jsch8q, jsibbiso, nikita-moor, ragunyrasta, suyashb95, the-zebulan, wannaphong


wiktionaryparser's Issues

Problem with utils

I've got a problem like this:

File "C:\Program Files\Python36\lib\site-packages\wiktionaryparser\__init__.py", line 2, in <module>
    from .WikiParse import WiktionaryParser
File "C:\Program Files\Python36\lib\site-packages\wiktionaryparser\WikiParse.py", line 7, in <module>
    from utils import WordData, Definition, RelatedWord
ModuleNotFoundError: No module named 'utils'

Create a test utility script that downloads test HTML and markdown and saves it as a part of the repo

Self assigned.

  • Set up mocks of the fetch() method to use offline files. Leave a few copies of the tests as integration tests calling the actual Wiktionary, preferably for shorter word pages that won't change any time soon.
  • Create scripts/download_wiktionary_test_files.py that does the following:
    • Download HTML files of all words under test.
    • Download the Markdown files from the API endpoint below
https://en.wiktionary.org/w/api.php?action=parse&oldid={{OLDID}}&prop=wikitext&formatversion=2&format=json
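A sketch of what such a script could look like, using only the standard library; the fixture directory, file naming, and the {oldid} placeholder are assumptions, not the repo's actual layout:

```python
# Sketch of scripts/download_wiktionary_test_files.py (layout assumed).
import json
import os
from urllib.request import urlopen

API = ("https://en.wiktionary.org/w/api.php?action=parse&oldid={oldid}"
       "&prop=wikitext&formatversion=2&format=json")
PAGE = "https://en.wiktionary.org/wiki/{word}?printable=yes"
OUT_DIR = "tests/fixtures"  # assumed location inside the repo

def download(word, oldid):
    """Save the rendered HTML and the raw wikitext for one word under test."""
    os.makedirs(OUT_DIR, exist_ok=True)
    with urlopen(PAGE.format(word=word)) as resp:
        html = resp.read().decode("utf-8")
    with open(os.path.join(OUT_DIR, word + ".html"), "w", encoding="utf-8") as f:
        f.write(html)
    with urlopen(API.format(oldid=oldid)) as resp:
        wikitext = json.load(resp)["parse"]["wikitext"]
    with open(os.path.join(OUT_DIR, word + ".txt"), "w", encoding="utf-8") as f:
        f.write(wikitext)
```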

AttributeError on a German Adverb

Hi,
I know it's supposed to have been fixed, but it still exists, for example when translating the German adverb überallhin, no matter whether you set the default language or part_of_speech or not.

Any solution or workaround?
Thanks

Install issues on Linux

Doing
pip install wiktionaryparser
and running the test example yields:
Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

from wiktionaryparser import WiktionaryParser
Traceback (most recent call last):
File "", line 1, in
File "/home/ragu/anaconda3/lib/python3.6/site-packages/wiktionaryparser/init.py", line 2, in
from .WikiParse import WiktionaryParser
File "/home/ragu/anaconda3/lib/python3.6/site-packages/wiktionaryparser/WikiParse.py", line 7, in
from utils import WordData, Definition, RelatedWord
ImportError: cannot import name 'WordData'

utils is included in the base distribution but is likely not being imported correctly in WiktionaryParser. Also, I noticed that setup.py is missing from the main package.

However both of the above can be overcome if I run the code from the source tree i.e. without installing it.

Love the package though. Great package satisfying a great need

Pronunciations are not parsed

I am trying to fetch the English definition for "to grapple".
At the bottom of this message is the result of the fetch. As you can see, the pronunciation section is empty, but the original page contains the pronunciation. It may be related to the fact that WiktionaryParser puts the pronunciation inside the etymology, while on the original page the pronunciation is outside it.

open source license

Hi Suyash, what license is this code released under? I have plans to add to this library but would like to know beforehand. I would vote for GPL, LGPL, or MIT.

As of 0.0.6, WikiParse.py is no longer included in the package.

With 0.0.5:

% ls /home/alexei/.virtualenvs/4ivn7uXV/lib/python3.5/site-packages/wiktionaryparser WikiParse.py __init__.py __pycache__ tests.py utils.py

With 0.0.6:

% ls /home/alexei/.virtualenvs/4ivn7uXV/lib/python3.5/site-packages/wiktionaryparser __pycache__ setup.py

This, of course, breaks the package so it cannot be used, since the package code is not included.

Map_to_object() issue for words with more than 8 etymologies

The map_to_object() function inside core.py doesn't work properly when a word has more than 8 etymologies, such as cat. Here are the definitions included for all etymologies of this word, as well as the comparisons that are made in the function's code:

Etymology 1
'1.2' <= '1.2.1' < '1.3'
'1.2' <= '1.2.2' < '1.3'

Etymology 2
'1.3' <= '1.3.1' < '1.4'

Etymology 3
'1.4' <= '1.4.1' < '1.5'
'1.4' <= '1.4.2' < '1.5'

Etymology 4
'1.5' <= '1.5.1' < '1.6'

Etymology 5
'1.6' <= '1.6.1' < '1.7'

Etymology 6
'1.7' <= '1.7.1' < '1.8'

Etymology 7
'1.8' <= '1.8.1' < '1.9'

Etymology 8

Etymology 9
'1.10' <= '1.2.1' < '999'
'1.10' <= '1.2.2' < '999'
'1.10' <= '1.3.1' < '999'
'1.10' <= '1.4.1' < '999'
'1.10' <= '1.4.2' < '999'
'1.10' <= '1.5.1' < '999'
'1.10' <= '1.6.1' < '999'
'1.10' <= '1.7.1' < '999'
'1.10' <= '1.8.1' < '999'
'1.10' <= '1.9.1' < '999'
'1.10' <= '1.10.1' < '999'

whereas the last two etymologies should have been:

Etymology 8
'1.10' <= '1.9.1' < '999'

Etymology 9
'1.10' <= '1.10.1' < '999'

Since I couldn't quickly find a fix that would pass all the tests and the bug isn't a critical one, I chose to submit an issue here.
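The root cause is visible in the comparisons above: the dotted section indices are compared as strings, and lexicographically '1.10' sorts before '1.2', so Etymology 9's range swallows every earlier definition. A sketch of the failure and of a tuple-based comparison that would order the indices numerically (an illustration, not the library's actual fix):

```python
# String comparison misfiles section 1.2.1 under the Etymology 9 range
# ('1.10' .. '999'), because '1.10' < '1.2' lexicographically.
assert '1.10' <= '1.2.1' < '999'  # passes for strings, but is numerically wrong

def as_tuple(index):
    """Convert a dotted section index like '1.10.1' to an integer tuple."""
    return tuple(int(part) for part in index.split('.'))

# Tuple comparison gives the intended ordering:
assert not (as_tuple('1.10') <= as_tuple('1.2.1'))      # 1.2.1 precedes 1.10
assert as_tuple('1.9') <= as_tuple('1.9.1') < as_tuple('1.10')
```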

Support parsing translations

It is missing translation information :(

>>> word = parser.fetch('smartphone')
>>> print(word)
[{'etymology': 'smart +\u200e phone', 'definitions': [{'partOfSpeech': 'noun', 'text': 'smartphone \u200e(plural smartphones)\nA mobile phone with more advanced features and greater computing capacity than a featurephone.\n', 'relatedWords': [], 'exampleUses': []}], 'audioLinks': [], 'pronunciations': ['(UK) IPA: /ˈsmɑːtfəʊn/', '(US) IPA: /ˈsmɑɹtfoʊn/']}]
>>> 

But thank you for this interesting and nice project!

Quick question :)

Hey, first of all, great job on this project!
I was trying to do the same thing before finding your work. However, I was trying to use the provided API. I successfully managed to get the pages I wanted, but so far I have failed to parse the resulting HTML with BS 😭

Did you try to use the API? If so, what made you switch to the method you're currently using?
Really good job, though. I'll try a bit more, but I've definitely cloned the repo and will fork it if I add anything useful!

Misparsing of Ancient Greek pronunciations?

Given a page like this:

https://en.wiktionary.org/wiki/%E1%BC%80%CE%B3%CE%B3%CE%B5%CE%BB%CE%AF%CE%B1#Ancient_Greek

the parser returns:

[
{
"etymology": "From \u1f04\u03b3\u03b3\u03b5\u03bb\u03bf\u03c2 (\u00e1ngelos, \u201cmessenger\u201d) +\u200e -\u1fd0\u0301\u1fb1 (-\u00ed\u0101, abstract noun suffix).\n",
"definitions": [
{
"partOfSpeech": "noun",
"text": "\u1f00\u03b3\u03b3\u03b5\u03bb\u1fd0\u0301\u1fb1 \u2022 (angel\u00ed\u0101)\u00a0f (genitive \u1f00\u03b3\u03b3\u03b5\u03bb\u1fd0\u0301\u1fb1\u03c2); first declension\nmessage, news, report\nAlso, the substance or means of such communication\ninstruction, command\n",
"relatedWords": [],
"examples": []
}
],
"pronunciations": {
"text": [
"\u1f00\u03b3\u03b3\u03b5\u03bb\u1fd0\u0301\u1fb1 \u2022 (angel\u00ed\u0101)\u00a0f (genitive \u1f00\u03b3\u03b3\u03b5\u03bb\u1fd0\u0301\u1fb1\u03c2); first declension"
],
"audio": []
}
}
]

where the 'pronunciations'.'text' entry is (basically) the same as the 'definitions'.'text' entry.

Does not list sub-definitions.

For example, the page for the word "cat" reads as the following:

wiktionary entries for 'cat'

but WiktionaryParser gives this:

"text": [
  "cat (plural cats)",
  "An animal of the family Felidae:",
  "A person:",
  "(nautical) A strong tackle used to hoist an anchor to the cathead of a ship.",
  "..."
]

There's also the line "cat (plural cats)" at the beginning that shouldn't be there, but that's a separate issue and can easily be removed in code.
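Until that is fixed upstream, the stray head line can be stripped in user code; a workaround sketch whose heuristic (drop entries that open with the word itself followed by a parenthesis) is an assumption:

```python
# Drop the "cat (plural cats)" head line from a parsed definition text list.
word = "cat"
text = [
    "cat (plural cats)",
    "An animal of the family Felidae:",
    "A person:",
]
definitions = [t for t in text if not t.startswith(word + " (")]
print(definitions)  # the head line is gone
```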

Uncaught error when selected language for the word does not exist

The word 'behagelig' for example has only a definition in Norwegian Bokmål.

https://en.wiktionary.org/wiki/behagelig#Norwegian_Bokm%C3%A5l

If I try to query for the language 'Norwegian Nynorsk' for example:

parser.fetch("behagelig", "norwegian nynorsk")

I get the following error:

Traceback (most recent call last):
  File "./scrap.py", line 24, in <module>
    another_word = parser.fetch(wordToScrap, NN)
  File "/home/c0rn3j/wikivirtpython/lib/python3.6/site-packages/wiktionaryparser/WikiParse.py", line 233, in fetch
    return self.get_word_data(language.lower())
  File "/home/c0rn3j/wikivirtpython/lib/python3.6/site-packages/wiktionaryparser/WikiParse.py", line 84, in get_word_data
    if index.startswith(start_index):
TypeError: startswith first arg must be str or a tuple of str, not NoneType
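The traceback suggests get_word_data() reaches index.startswith(start_index) with start_index still None, because the page has no heading for the requested language. A defensive sketch of the missing guard; the function name and shape are assumptions, not the library's internals:

```python
def sections_for_language(indices, start_index):
    """indices: dotted section ids found on the page.
    start_index: id of the requested language heading, or None when that
    language is absent from the page entirely."""
    if start_index is None:
        return []  # language not on the page -> empty result, no TypeError
    # In the real code the prefix would be more precise; this is a sketch.
    return [i for i in indices if i.startswith(start_index)]

print(sections_for_language(['1.1', '1.2'], None))       # []
print(sections_for_language(['1.1', '1.2', '2.1'], '1'))  # ['1.1', '1.2']
```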

Supporting German as base language

I want to use this project, but I would like to use the German Wiktionary. I intend to fork this project and make the required adaptations. Is there any interest in merging the result back via a PR? It would require some structural changes, but adding more languages later might become easier.

Some words get partially scraped from a language other than the specified one

Case in point: trying to scrape Admiral in Norwegian Bokmål yields this:

[
  {
    "etymology": "From Arabic أَمِير الْبَحْر‎ (ʾamīr al-baḥr, “commander of the fleet”), via French amiral",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "admìrāl m (Cyrillic spelling адмѝра̄л)\n\nadmiral\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

But that definition is from Serbo-Croatian instead.

https://en.wiktionary.org/wiki/admiral#Norwegian_Bokm%C3%A5l

If a word has an article on Wikipedia, the parser fails to remove that text

Hello,

If a word has an article on Wikipedia, the parser will include that article's title in the text field of the definition.
Ex: try parsing alexin

For now, I am working around it like this:

definition['text'] = [
    s for s in definition['text'].splitlines()
    if s and not (s.lower() == 'wikipedia' or s.lower().startswith('wikipedia has an'))
]

Import error in Python 3.5

I'm using:

>>> from wiktionaryparser import WiktionaryParser
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    from wiktionaryparser import WiktionaryParser
  File "C:\WinPython-64bit-35\python-3.5.1.amd64\lib\site-packages\wiktionaryparser\__init__.py", line 1, in <module>
    from WikiParse import WiktionaryParser
ImportError: No module named 'WikiParse'

Does WiktionaryParser not support Python 3.5?

AttributeError on some words

Trying to obtain a parser object for the word "patronise" (the entry exists and looks like the others):

Traceback (most recent call last):
File "/path/path/path/noise.py", line 562, in wikidata
return WiktionaryParser().fetch(word)
File "/home/user/py36/lib/python3.6/site-packages/wiktionaryparser/WikiParse.py", line 308, in fetch
return self.get_word_data(language.lower())
File "/home/user/py36/lib/python3.6/site-packages/wiktionaryparser/WikiParse.py", line 117, in get_word_data
'definitions': self.parse_definitions(word_contents),
File "/home/user/py36/lib/python3.6/site-packages/wiktionaryparser/WikiParse.py", line 168, in parse_definitions
while table.name != 'ol':
AttributeError: 'NoneType' object has no attribute 'name'

Is there a way to use a dictionary other than any_language to English? (e.g.: Portuguese to Portuguese)

Is there a way to use this parser for a given language to given language definition?

from wiktionaryparser import WiktionaryParser
parser = WiktionaryParser()
print(parser.fetch('seda', 'portuguese'))  # silk in English

Gives me:

[{'pronunciations': {'audio': ['//upload.wikimedia.org/wikipedia/commons/c/cb/Pt-br-seda.ogg', '//upload.wikimedia.org/wikipedia/commons/c/cb/Pt-br-seda.ogg'], 'text': ['(Portugal) IPA: /ˈse.dɐ/', 'Hyphenation: se‧da']}, 'etymology': 'From Old Portuguese seda, from Latin saeta (“animal hair”).\n', 'definitions': [{'relatedWords': [{'relationshipType': 'derived terms', 'words': ['bicho-da-seda', 'sedoso']}], 'text': 'seda f (plural sedas)\n\n(uncountable) silk (a type of fiber)\na piece of silken cloth or silken clothes\n', 'partOfSpeech': 'noun', 'examples': []}]}]

That is: silk (a type of fiber)\na piece of silken cloth or silken clothes, which is the same as on the website: https://en.wiktionary.org/wiki/seda#Portuguese.

I would like it to give me the Portuguese version:

https://pt.wiktionary.org/wiki/seda: substância filamentosa segregada pelo bicho-da-seda ....

I would say that that is not possible because of the hardcoding in the following line:

self.url = "https://en.wiktionary.org/wiki/{}?printable=yes"

But I'm waiting for your opinion on that.
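That line is indeed the crux: the base URL is an en.wiktionary.org template, and the HTML parsing is written against that edition's page layout, so simply swapping the subdomain would not be enough. A sketch of what the substitution itself looks like; the url_template helper is illustrative, not the library's API:

```python
# The quoted attribute is a format template; another edition differs only in
# the subdomain, but the parser's HTML handling still targets en.wiktionary.
url_template = "https://{edition}.wiktionary.org/wiki/{{}}?printable=yes"
pt_url = url_template.format(edition="pt").format("seda")
print(pt_url)  # https://pt.wiktionary.org/wiki/seda?printable=yes
```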

I'm using this project on my own project: https://github.com/fmv1992/vim_dictionary. Thanks for the good work!

Import Error: No module named 'utils'

I'm using Python 3.5.3 on macOS in a conda environment.

As below, I cannot import this package in Terminal.

Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 12:15:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from wiktionaryparser import WiktionaryParser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/username/miniconda3/envs/proj_env/lib/python3.5/site-packages/wiktionaryparser/__init__.py", line 2, in <module>
    from .WikiParse import WiktionaryParser
  File "/Users/username/miniconda3/envs/proj_env/lib/python3.5/site-packages/wiktionaryparser/WikiParse.py", line 7, in <module>
    from utils import WordData, Definition, RelatedWord
ImportError: No module named 'utils'

ImportError: cannot import name zip_longest

I'm getting an error when running the sample:

Traceback (most recent call last):
  File "/Users/steven/Desktop/d.py", line 1, in <module>
    from wiktionaryparser import WiktionaryParser
  File "/Library/Python/2.7/site-packages/wiktionaryparser/__init__.py", line 2, in <module>
    from .WikiParse import WiktionaryParser
  File "/Library/Python/2.7/site-packages/wiktionaryparser/WikiParse.py", line 6, in <module>
    from itertools import zip_longest
ImportError: cannot import name zip_longest

Python version 2.7

Examples are not parsed correctly

I am trying to fetch the English definition for "to grapple".
When doing so, WiktionaryParser parses the examples incorrectly. In particular, the first row contains all the examples separated by '\n', and from the second row on, the same examples are repeated:

            "examples": [
                "to grapple with one's conscience",
                "Hakluyt\nThe gallies were grappled to the Centurion.Shakespeare\nGrapple them to thy soul with hoops of steel.",
                "The gallies were grappled to the Centurion.",
                "Grapple them to thy soul with hoops of steel."
            ],

Tests fail

python -m WiktionaryParser.tests.test
Testing "patronise" in English
Testing "test" in English
F

FAIL: test_multiple_languages (main.TestParser)

Traceback (most recent call last):
File "C:\Users\RMANCUSO00\Documents\progetti_miei\WiktionaryParser\tests\test.py", line 24, in test_multiple_languages
self.assertEqual(DeepDiff(parsed_word, sample_output[lang][word], ignore_order=True), {})
AssertionError: {'values_changed': {"root['etymology']": {[1187 chars].'}}} != {}


Ran 1 test in 2.157s

FAILED (failures=1)

I also don't understand what the test is supposed to test.

word with urlencoded umlaut not working

The German word Güter is not returned by this library, while another word with the same umlaut, Sünde, works.

from wiktionaryparser import WiktionaryParser
parser = WiktionaryParser()
parser.set_default_language('german')
parser.fetch('G%C3%BCter') # returns empty array
parser.fetch('S%C3%BCnde') # returns regular JSON
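A likely explanation is that fetch() expects the literal word, not a percent-encoded one, so the encoded string is encoded a second time in the request. Decoding first is a workaround sketch (why the encoded 'S%C3%BCnde' happens to work is unclear):

```python
from urllib.parse import unquote

# Decode the percent-encoded word before handing it to the parser.
word = unquote('G%C3%BCter')
print(word)  # Güter
# parser.fetch(word) would then request the page for 'Güter' itself.
```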

Supporting French as base language

Hi everyone,

As part of a school project where I needed a bunch of words in french and their definition (also in french), I forked the project and modified the code to my needs. The code is here : https://github.com/cedric-audy/WiktionaryParser .

Solved: I was unable to pull from the repo and use wiktionaryparser as a 'package'. For now, when I need it, I include the whole thing in my project and import WiktionaryParser, which is impractical. However, I had no problem installing the main version using pip. Help would be appreciated on that.

After a bit of tinkering I can now retrieve a definition and etymology (see image), with the help of this code : https://github.com/cedric-audy/french_wiktionary_scraper .

image

I am fairly new to all this (git, forking, Python, etc.), so help would be appreciated in making a version of WiktionaryParser that works with French as the base language.

[Norwegian] Some pages are not being scraped properly

Out of all the issues I opened here, this one is the most important to me, as I've used this project for the creation of a Kindle-compatible dictionary, and incomplete/missing entries are the bane of every dictionary project ^^


https://en.wiktionary.org/wiki/seg#Norwegian_Bokm%C3%A5l
Missing completely

[{"etymology": "", "definitions": [], "pronunciations": {"text": [], "audio": []}}]

https://en.wiktionary.org/wiki/ham#Norwegian_Bokm%C3%A5l
Missing completely

[{"etymology": "", "definitions": [], "pronunciations": {"text": ["IPA: /h\u0251m/"], "audio": []}}]

https://en.wiktionary.org/wiki/by#Norwegian_Bokm%C3%A5l
Missing the verb definition

[
	{
		"etymology": "From Old Norse býr (“place (to camp or settle), land, property, lot; and later settlement”).\n",
		"definitions": [
			{
				"partOfSpeech": "noun",
				"text": "by m (definite singular byen, indefinite plural byer, definite plural byene)\n\ntown, city (regardless of population size or land area)\n",
				"relatedWords": [
					{
						"relationshipType": "derived terms",
						"words": [
							"bydel",
							"byfornyelse, byfornying",
							"bygdeby",
							"bymessig",
							"bystat",
							"bystatus",
							"drabantby",
							"ferieby",
							"gamleby",
							"havneby",
							"hjemby",
							"landsby",
							"Mexico by",
							"naboby",
							"spøkelsesby",
							"storby"
						]
					}
				],
				"examples": []
			}
		],
		"pronunciations": {
			"text": [],
			"audio": []
		}
	},
	{
		"etymology": "From byde, from Old Norse bjóða, from Proto-Germanic *beudaną (“to offer”), from Proto-Indo-European *bʰewdʰ- (“to wake, rise up”).\n",
		"definitions": [],
		"pronunciations": {
			"text": [],
			"audio": []
		}
	}
]

Here's a list of errors from my project for words in Norwegian Bokmål. It is entirely possible that some errors are due to mistakes in my own scripts, but all the ones I checked were thrown because WiktionaryParser did not parse the pages properly or at all.

https://haste.rys.pw/raw/vevafamiwo

Another half-broken entry:

https://en.wiktionary.org/wiki/for#Norwegian_Bokm%C3%A5l

Request Add Contraction

Request that contractions be added to the JSON structure.

For example, he's.

[{'etymology': '',
'definitions': [],
'pronunciations': {'text': ['IPA: /ˈhiːz/'],
'audio': ['//upload.wikimedia.org/wikipedia/commons/4/43/En-us-he%27s.ogg']}}]

Can't get the project to run

Trying to get the project to run using Python 3 in a virtual env.

c0rn3j@Luxuria : ~
[0] % pip list                         
Package          Version  
---------------- ---------
beautifulsoup4   4.4.0    
certifi          2018.4.16
chardet          3.0.4    
idna             2.7      
pip              10.0.1   
requests         2.7.0    
setuptools       39.2.0   
urllib3          1.23     
wheel            0.31.1   
wiktionaryparser 0.0.6    
(wikivirtpython) 

c0rn3j@Luxuria : ~
[0] % python  
Python 3.6.5 (default, May 11 2018, 04:00:52) 
[GCC 8.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from wiktionaryparser import WiktionaryParser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'WiktionaryParser'

But it won't even import; any idea what I'm doing wrong?

EDIT: Got it to run now in the venv.
I need to be cd'd into the git-cloned directory for it to work, though, which is suboptimal.

Package          Version  
---------------- ---------
beautifulsoup4   4.6.0    
certifi          2018.4.16
chardet          3.0.4    
html5lib         1.0.1    
idna             2.7      
lxml             4.2.3    
pip              10.0.1   
requests         2.19.1   
setuptools       39.2.0   
six              1.11.0   
urllib3          1.23     
webencodings     0.5.1    
wheel            0.31.1   
wiktionaryparser 0.0.6   

Win10 64-bit new install

Did pip install ... which completed successfully.

>>> from wiktionaryparser import WiktionaryParser
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python34\lib\site-packages\wiktionaryparser\__init__.py", line 2, in <module>
    from .WikiParse import WiktionaryParser
  File "C:\Python34\lib\site-packages\wiktionaryparser\WikiParse.py", line 7, in <module>
    from utils import WordData, Definition, RelatedWord
ImportError: cannot import name 'WordData'

Multi-line 'etymology' entries resolve to the last line only

If the etymology section of the page for a given word spans multiple lines (e.g. see Latin 'video': https://en.wiktionary.org/wiki/video#Latin), the parser returns only the last line of this section.

Thus:
Latin
Etymology
From Proto-Italic *widēō, from Proto-Indo-European *weyd- (“to know; see”).

Cognates include Ancient Greek εἴδω (eídō), Mycenaean Greek 𐀹𐀆 (wi-de), Sanskrit वेत्ति (vétti), Russian ви́деть (vídetʹ), Old English witan (English wit), German wissen, Macedonian види (vidi), Swedish veta.

yields only the line beginning 'Cognates...' in the returned JSON element, whereas the first line is also required.

still get some empty elements

Even after commit af23b50 I get some empty fields.

Example:

[{'definitions': [{'examples': [],
                   'partOfSpeech': 'noun',
                   'relatedWords': [],
                   'text': ['Krismasi\xa0(n class, no plural)', 'Christmas']}],
  'etymology': 'From English Christmas.\n',
  'pronunciations': {'audio': [], 'text': []}}]

Test case:

#!/usr/bin/env python3

from pprint import pprint

from wiktionaryparser import WiktionaryParser

parser = WiktionaryParser()
parser.set_default_language('swahili')
word = parser.fetch("Krismasi")

pprint(word)

Exception when parsing related words for "nutritiously" in English

The parser encounters an AttributeError when fetching the entry for 'nutritiously'.

>>> from wiktionaryparser import WiktionaryParser
>>> p = WiktionaryParser()
>>> p.fetch('nutritiously')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/wiktionaryparser.py", line 260, in fetch
    return self.get_word_data(language.lower())
  File "/usr/local/lib/python3.6/dist-packages/wiktionaryparser.py", line 117, in get_word_data
    'related': self.parse_related_words(word_contents),
  File "/usr/local/lib/python3.6/dist-packages/wiktionaryparser.py", line 221, in parse_related_words
    while not parent_tag.find_all('li'):
AttributeError: 'NoneType' object has no attribute 'find_all'

I tried again with nothing in p.RELATIONS, and it worked.

>>> for i in list(p.RELATIONS):
...     p.exclude_relation(i)
... 
>>> p.RELATIONS
[]
>>>
>>> p.fetch('nutritiously')
[{'etymology': 'nutritious +\u200e -ly', 'definitions': [{'partOfSpeech': 'adverb', 'text': ['nutritiously (comparative more nutritiously, superlative most nutritiously)', 'In a way that provides nutrition.'], 'relatedWords': [], 'examples': []}], 'pronunciations': {'text': [], 'audio': []}}]

Optimization when scraping the same page for multiple languages

At the moment I have this simple scraper - https://haste.c0rn3j.com/ahiyofahuf.py

It takes a word and scrapes it in two languages. However, this seems to send two requests to Wiktionary instead of just one (it is, after all, requesting the same page).

Is there a way I can scrape both languages in one request, to make the process faster and the load on Wiktionary smaller?

EDIT: Assuming this is not currently implemented:

The parser could save the whole pages to /tmp/WiktionaryParser/. /tmp/ on every decent distro gets cleaned after reboot, and it should be a tmpfs on most distros (RAM storage).

So the parser just checks /tmp to see whether the file is already there and not older than, say, 24 hours (user-configurable?), and acts accordingly.

I think this should be user-configurable behavior, since scraping XXk pages can take a lot of memory.

If implemented, it should be mentioned on the README.
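The proposed behavior can be sketched in a few lines; the cache path and TTL come from the suggestion above, while the fetch_page callable and hashing scheme are assumptions for illustration:

```python
# Minimal sketch of the proposed cache: page HTML saved under
# /tmp/WiktionaryParser/ and reused while younger than a TTL.
import hashlib
import os
import time

CACHE_DIR = "/tmp/WiktionaryParser"
TTL_SECONDS = 24 * 60 * 60  # user-configurable, as suggested above

def cached_page(url, fetch_page):
    """Return the page for `url`, fetching it only on a cache miss.
    `fetch_page` is any callable that takes a URL and returns HTML."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest())
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < TTL_SECONDS:
        with open(path, encoding="utf-8") as f:
            return f.read()  # fresh enough: serve from cache
    html = fetch_page(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

With this shape, fetching the same page for two languages costs one request: the second call hits the cache.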

Support for parsing inflections

Definitions on Wiktionary can come with inflections, as in this screenshot:

image

Raw entry from WiktionaryParser:

fly (imperative fly, present tense flyr, simple past fløy, past participle flydd or fløyet)

to fly

What I'd like to see parsed:

Inflection: [imperative] fly
Inflection: [present tense] flyr
Inflection: [simple past] fløy
Inflection: [past participle] flydd
Inflection: [past participle] fløyet

Word definition:
to fly

My use case is creating a dictionary with inflection support (for e-readers like Kindle).

I've worked around this limitation in my scripts but it'd be nice if I didn't have to and it was supported.

EDIT: This needs so many workarounds and has so many edge cases that I wouldn't be surprised if this never gets implemented as out of scope >.>
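For what it's worth, the single head-line shape quoted above can be split with a short sketch; this handles only the exact "label form[ or form]" layout shown and none of the edge cases the EDIT mentions:

```python
import re

head = ("fly (imperative fly, present tense flyr, simple past fløy, "
        "past participle flydd or fløyet)")

def parse_inflections(head_line):
    """Split a head line's parenthesized part into (label, form) pairs."""
    inner = re.search(r"\((.*)\)", head_line).group(1)
    result = []
    for chunk in inner.split(", "):
        # forms may be joined by " or "; the label is everything before them
        parts = chunk.split(" or ")
        label, _, first_form = parts[0].rpartition(" ")
        for form in [first_form] + parts[1:]:
            result.append((label, form))
    return result

for label, form in parse_inflections(head):
    print(f"Inflection: [{label}] {form}")
```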

not all existing relations are returned

Example query: for the word "white", querying the English Wiktionary does not return derived words for the noun part of speech. In the entry for "boy", this relation is parsed correctly.

chore: Mild reorganization of project files for better dev experience, tidiness

Self assigned, if the repo owner gives permission.

I'd like to move wiktionaryparser.py and the utils.py into a new wiktionaryparser folder, making the project structure conform to the structures of other Python repositories.

proj-structure

proj-structure-2

I'll make sure to add a CONTRIBUTING.md with the shell script contributors can copy-paste to get their dev env set up (venv, pip install -e .).

I'll also make sure to add the classes in utils.py to __all__ so the API doesn't change.

The aim's just to:

  • Make the repo tidier.
  • Make Intellisense work again since, for some reason, the current project structure is preventing it from doing autocompletion.
  • Pave the way for absolute imports.

Lookup gives definition from wrong language; possibly related to "see also"

With current master, I still get definitions from a wrong language in some cases. It happens when I look up "nao" and "nami" for Swahili.

Both of these have a "See also" link at the top. Maybe that confuses the parser?

https://en.wiktionary.org/wiki/nami#Swahili
https://en.wiktionary.org/wiki/nao#Swahili

Script:

#!/usr/bin/env python3

from pprint import pprint
import sys

from wiktionaryparser import WiktionaryParser

parser = WiktionaryParser()
parser.set_default_language('swahili')

word = parser.fetch(sys.argv[1])

pprint(word)
./lookup-word nami
[{'definitions': [{'examples': [],
                   'partOfSpeech': 'noun',
                   'relatedWords': [],
                   'text': ['nami', 'younger sister']},

"younger sister" is Comanche, not Swahili.

Fix and improve tests

  • Fix broken tests
  • Split single test that tests all languages into parameterized tests that test 1 word only

Support for Cantonese

Hi,

Thank you for this excellent parser.

I am trying to run this parser on Cantonese entries in Wiktionary, but they are not always found, and when they are found, the data returned is... weird.

I attached an example where some words that exist in Wiktionary are not returned, and other words actually get a different pronunciation (!) than the one written in Wiktionary.

Your help is much appreciated!

Thanks.

This is the code I run:

from wiktionaryparser import WiktionaryParser
import codecs
import sys

parser = WiktionaryParser()

with open('wiktionary_data.txt', "w", encoding="UTF-8") as outFile:
    with codecs.open("wordcount.log", "r", encoding="UTF-8") as inFile:
        for line in inFile:
            word = line.split()[0]
            wiktionaryWord = parser.fetch(word, "Chinese")

            if len(wiktionaryWord) > 0:
                outFile.write(word + "," + str(wiktionaryWord[0]) + "\n")

This is the file wordcount.log

喺 133175
我 84912
個 81040
你 75672
咁 66798
唔 60689
嘅 56333
啊 54957
係 48753
誒 46097
