dpriskorn / itemsubjector
CLI-tool to easily add "main subject" aka topics in bulk to groups of items on Wikidata
License: GNU General Public License v3.0
A rule of thumb could be: show one sample item for every 20 items, but always a minimum of 50.
That means:
500 -> 50
1000 -> 50
2000 -> 100
3000 -> 150
4000 -> 200
5000 -> 250
6000 -> 300
For batches larger than 4000, throw a warning that the batch size is so big that it should be split up, by first running with --no-aliases if possible.
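A minimal sketch of the rule of thumb and the batch-size warning above (function names are hypothetical):

```python
def sample_size(total_items: int) -> int:
    """One sample item per 20 items, but never fewer than 50."""
    return max(50, total_items // 20)


def warn_if_too_big(total_items: int) -> str:
    """Return a warning for very large batches, per the suggestion above."""
    if total_items > 4000:
        return ("Warning: this batch is very large; consider splitting it up, "
                "e.g. by first running with --no-aliases.")
    return ""
```

The thresholds match the table above: 2000 items yield a sample of 100, and anything up to 1000 items falls back to the minimum of 50.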
e.g. remove carcinoma from this https://www.wikidata.org/wiki/Q64228075
As a user, I want to choose whether to search for items matching the label and/or one of the aliases, so I get as many hits as possible.
Pseudo code:
also fetch the aliases from WDQS
ask user for which ones to include (or all)
https://console-menu.readthedocs.io/en/latest/consolemenu/MultiSelectMenu.html
add them to a new attribute in class Labels: search_strings
fetch based on that (with one query if possible)
use https://pmitzias.com/SPARQLBurger/docs.html to generate the SPARQL query using UNION
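A hand-rolled sketch of the single UNION query the pseudo code describes; SPARQLBurger would generate something similar, but the exact query shape here is an assumption:

```python
def build_union_query(search_strings, limit=10000):
    """Build one WDQS query that matches labels containing any of the
    search strings, with one UNION branch per string."""
    branches = []
    for s in search_strings:
        # Escape characters that would break the string literal:
        escaped = s.replace("\\", "\\\\").replace('"', '\\"')
        branches.append(
            "{ ?item rdfs:label ?label . "
            f'FILTER(CONTAINS(LCASE(?label), LCASE("{escaped}"))) }}'
        )
    union = "\n    UNION\n    ".join(branches)
    return f"SELECT DISTINCT ?item WHERE {{\n    {union}\n}} LIMIT {limit}"
```

With the label plus each selected alias in `search_strings`, this fetches everything in one query instead of one query per string.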
When working on HCV https://www.wikidata.org/wiki/Q154869 I see that there is another item with "HCV 229E" and I want to make sure that they don't get included in my batch.
https://www.wikidata.org/wiki/Q65510721 has main subject muon, but "muon g-2" is a better fit. As it is now, undoing the whole batch of 800 items and partly redoing it is the only way to recategorize these items.
A command line flag with --replace qid-to-be-replaced --replacement new-qid would be nice
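The suggested flag pair could look like this with argparse (the QIDs in the example are hypothetical; wiring into the tool is an assumption):

```python
import argparse

parser = argparse.ArgumentParser(
    description="Replace one main subject QID with another on a batch of items"
)
parser.add_argument("--replace", metavar="QID",
                    help="QID of the main subject to be replaced")
parser.add_argument("--replacement", metavar="QID",
                    help="QID of the new main subject")

# Hypothetical QIDs, for illustration only:
args = parser.parse_args(["--replace", "Q111", "--replacement", "Q222"])
```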
I want to add the following main subjects to the articles
and I run the following command (from specific topics to generic topics)
$ python itemsubjector.py -na -l Q108684373 Q101116078
Even though the addition of 'Q108684373' is complete, I see articles with the text 'scoping review protocol' in the list for 'scoping review'.
This issue may be related to Issue 14
From the riksdagen task, see https://github.com/dpriskorn/ItemSubjector/blob/prepare-batch/models/suggestion.py#L72
Running command git clone -q git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-e8o5ih20/wikibaseintegrator_728b3c0d1e3b474b9f15e676bf978aca
fatal: remote error:
The unauthenticated git protocol on port 9418 is no longer supported.
Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
WARNING: Discarding git+git://github.com/LeMyst/[email protected]#egg=wikibaseintegrator. Command errored out with exit status 128: git clone -q git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-e8o5ih20/wikibaseintegrator_728b3c0d1e3b474b9f15e676bf978aca Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement wikibaseintegrator (unavailable)
ERROR: No matching distribution found for wikibaseintegrator (unavailable)
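GitHub disabled the unauthenticated git:// protocol (see the linked blog post). Until the requirement is updated to an https URL, a known workaround is to tell git to rewrite such URLs before running pip again:

```shell
# git:// on port 9418 is no longer supported by GitHub; rewrite such URLs
# to https:// so pip's "git clone" succeeds again.
git config --global url."https://github.com/".insteadOf git://github.com/
```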
e.g. on https://www.wikidata.org/wiki/Q99467166 we only want the most specific one left
It is a list of random items in all released versions.
$ poetry install
This can be implemented using MultiSelectMenu.
They appear in both rats and humans and are currently very hard to validate reliably.
After installing in PAWS I ran this command:
poetry run python itemsubjector.py -a Q40858
I then selected 2 to work on Riksdagen documents. This caused this screen:
Working on naturgas, see http://www.wikidata.org/entity/Q40858
Got a total of 78 items
Please keep an eye on the lag of the WDQS cluster and avoid working if it is over a few minutes:
https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&viewPanel=8&from=now-30m&to=now&refresh=1d
You can see whether any lagging servers are pooled here:
https://config-master.wikimedia.org/pybal/eqiad/wdqs
If any enabled servers are lagging more than 5-10 minutes you can search phabricator for open tickets to see if the team is on it.
If you don't find any feel free to create a new ticket like this:
https://phabricator.wikimedia.org/T291621
Running 1 job(s) with a total of 1 items non-interactively now. You can take a coffee break and lean back :)
Traceback (most recent call last):
File "/home/paws/.itemsubjector/itemsubjector.py", line 8, in <module>
itemsubjector.run()
File "/home/paws/.itemsubjector/src/__init__.py", line 164, in run
handle_job_preparation_or_run_directly_if_any_jobs(
File "/home/paws/.itemsubjector/src/helpers/jobs.py", line 154, in handle_job_preparation_or_run_directly_if_any_jobs
batchjobs.run_jobs()
File "/home/paws/.itemsubjector/src/models/batch_jobs.py", line 45, in run_jobs
job.suggestion.add_to_items(
File "/home/paws/.itemsubjector/src/models/suggestion.py", line 111, in add_to_items
f"to {clean_rich_formatting(target_item.label)}"
File "/home/paws/.itemsubjector/src/helpers/cleaning.py", line 24, in clean_rich_formatting
return label.replace("[/", "['/")
AttributeError: 'NoneType' object has no attribute 'replace'
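A minimal guard for this crash, based on the traceback above (the real fix may differ, e.g. skipping items without a label earlier in the pipeline):

```python
def clean_rich_formatting(label):
    """Escape '[/' so the label is not interpreted as rich markup.
    Guard against items whose label is missing (None) in the chosen language."""
    if label is None:
        return ""
    return label.replace("[/", "['/")
```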
Enable recursing through the items of a whole category, e.g. https://en.wikipedia.org/wiki/Category:Tyrosine_kinase_receptors
I get this error in PAWS on v0.3.3 when doing single subject:
Picking a random main subject
Working on naturgas
Do you want to continue? [Y/Enter/n]:
Traceback (most recent call last):
File "/home/paws/.itemsubjector/itemsubjector.py", line 8, in <module>
itemsubjector.run()
File "/home/paws/.itemsubjector/src/__init__.py", line 79, in run
main_subjects.get_validated_main_subjects_as_jobs()
File "/home/paws/.itemsubjector/src/models/main_subjects.py", line 108, in get_validated_main_subjects_as_jobs
job = main_subject_item.fetch_items_and_get_job_if_confirmed()
File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 240, in fetch_items_and_get_job_if_confirmed
return self.__fetch_and_parse__()
File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 250, in __fetch_and_parse__
self.__prepare_before_fetching_items__()
File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 188, in __prepare_before_fetching_items__
self.__extract_search_strings__()
File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 141, in __extract_search_strings__
elif self.id in config.no_alias_for_scholarly_items:
AttributeError: module 'config' has no attribute 'no_alias_for_scholarly_items'
My command was poetry run python itemsubjector.py -a Q40858
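A defensive sketch for this crash: reading the setting with getattr keeps an older config.py (one that predates the new attribute) from breaking the run. SimpleNamespace stands in for the real config module here.

```python
from types import SimpleNamespace

def no_alias_qids(config) -> list:
    """Return the configured list of QIDs to skip aliases for,
    falling back to an empty list if the setting is missing."""
    return getattr(config, "no_alias_for_scholarly_items", [])

# Stand-ins for an old and a new config module:
old_config = SimpleNamespace()
new_config = SimpleNamespace(no_alias_for_scholarly_items=["Q40858"])
```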
Where to find them? Bulk download from fatcat!
https://archive.org/download/fatcat_bulk_exports_2021-10-07/abstracts.json.gz 10G abstracts
Similar to QuickStatements batches, ItemSubjector could have a Flask frontend that runs in Toolforge and executes the user's batches.
This requires OAuth and Flask.
Lucas made a good Toolforge Flask template to get started.
At least 700 items were edited before the replace-bug was fixed, and the tool erroneously removed a lot of main subjects.
Manual cleanup is ongoing.
see https://www.wikidata.org/w/index.php?title=Special:Contributions/So9q&offset=20210913090748&limit=100&target=So9q
On checking the latest version and 0.3-alpha2, I am getting the following error:
Traceback (most recent call last):
File "itemsubjector.py", line 3, in <module>
import src
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/__init__.py", line 11, in <module>
from src.helpers.console import (
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/helpers/console.py", line 11, in <module>
from src.models.batch_job import BatchJob
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/batch_job.py", line 3, in <module>
from src.models.items import Items
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/items/__init__.py", line 10, in <module>
from src.models.wikimedia.wikidata.sparql_item import SparqlItem
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/wikimedia/wikidata/sparql_item.py", line 4, in <module>
import config.items
ModuleNotFoundError: No module named 'config.items'
itemsubjector.py -l Q1234 --match-string "search for this"
Is it possible to add an option (advanced option) to disable search for aliases while adding main subject to Wikidata items?
This has the advantage that the human can focus on jobs with a big impact instead of drowning in small jobs with only a few matches, where 99% contain no false positives and which can easily be reverted by anyone even if they do.
As a user, I want to approve all batches without having to sit and wait in between, because my time is valuable.
This would enable us to exclude groups of articles based on the main subjects of the journals :)
see https://www.wikidata.org/w/index.php?title=Topic:Wivsgl0y23flu93q&topic_showPostId=wj0bq23pbit5scil#flow-post-wj0bq23pbit5scil
Jean-Fred:
Run the interactive part on toolforge on the shell, and from there kick off a grid engine job ?
Dennis Priskorn:
I have not learned how the grid engine works yet.
Maybe a new flag --grid-engine can be added, which saves the to-be-processed QIDs in a pickle.
Then a new script can read that and run a non-interactive batch for each one?
The latter can be executed in the engine as a job
--approve-only might be a better name
This is useful, because there are already many thousand different main subjects and many of them are not matched properly with all relevant articles yet.
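The pickle handoff sketched in the discussion above could look like this (the file name and list format are assumptions):

```python
import pickle

QUEUE_FILE = "approved_qids.pkl"  # hypothetical file name

def save_approved_qids(qids):
    """Interactive run: persist the approved QIDs for a later grid engine job."""
    with open(QUEUE_FILE, "wb") as f:
        pickle.dump(qids, f)

def load_approved_qids():
    """Non-interactive job: read the QIDs back and run one batch per QID."""
    with open(QUEUE_FILE, "rb") as f:
        return pickle.load(f)
```

The interactive session on the shell would call `save_approved_qids` after approval, and the grid engine job would call `load_approved_qids` and process each QID non-interactively.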
Thanks for correcting the previous errors in 0.3-alpha3.
I checked out the latest commit in the main branch and 0.3-alpha4, and now I face the following OAuth error:
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/my_venv/lib/python3.7/site-packages/oauthlib/oauth2/rfc6749/parameters.py", line 432, in validate_token_parameters
raise_from_error(params.get('error'), params)
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/my_venv/lib/python3.7/site-packages/oauthlib/oauth2/rfc6749/errors.py", line 402, in raise_from_error
raise cls(**kwargs)
oauthlib.oauth2.rfc6749.errors.InvalidClientIdError: (invalid_request) The request is missing a required parameter, includes an invalid parameter value, includes a parameter more than once, or is otherwise malformed.
Any idea on this error? I checked with other scripts of mine; there are no issues.
e.g. case control study in https://www.wikidata.org/wiki/Q44681960 where we have https://www.wikidata.org/wiki/Q961652 (with a dash)
e.g. needed for https://www.wikidata.org/wiki/Q108570971
reported by jsamwrites
I have checked out the latest version v0.3.1 in PAWS and tried to run pip install -r requirements.txt as the README says.
I then get the error message:
ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
Labels with an apostrophe (') currently do not work. I think an escape character needs to be added before sending the query string to WDQS.
Take for example:
Alzheimer's disease (Q11081)
returns the following error
Fetching items with labels that have one of the search strings by running a total of 11 queries on WDQS...INFO:backoff:Backing off execute_sparql_query(...) for 1.0s (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url
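A minimal escaping sketch; whether the tool builds single- or double-quoted SPARQL literals is an assumption, so both quote kinds and backslashes are escaped:

```python
def escape_sparql_literal(text: str) -> str:
    """Escape characters that would break a SPARQL string literal,
    such as the apostrophe in "Alzheimer's disease"."""
    return (
        text.replace("\\", "\\\\")   # backslashes first, so escapes are not doubled
            .replace("'", "\\'")
            .replace('"', '\\"')
    )
```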
In v0.2 I am trying pip install -r requirements.txt
in PAWS and get this error message:
Collecting wikibaseintegrator
Cloning git://github.com/LeMyst/WikibaseIntegrator (to revision v0.12.0.dev5) to /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb
Running command git clone --filter=blob:none --quiet git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb
fatal: unable to connect to github.com:
github.com[0: 140.82.113.4]: errno=Connection timed out
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb did not run successfully.
│ exit code: 128
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
What should I do?