dpriskorn / itemsubjector
CLI-tool to easily add "main subject" aka topics in bulk to groups of items on Wikidata
License: GNU General Public License v3.0
A rule of thumb could be: show one sample item for every 20 items, but always a minimum of 50.
That means:
500 -> 50
1000 -> 50
2000 -> 100
3000 -> 150
4000 -> 200
5000 -> 250
6000 -> 300
For batches larger than 4000, throw a warning that the batch size is so big that it should be split up, by first running with --no-aliases if possible.
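A minimal sketch of the rule of thumb and the batch-size warning above (function names are hypothetical):

```python
def sample_size(total_items: int) -> int:
    """One sample item per 20 items, but never fewer than 50."""
    return max(50, total_items // 20)


def warn_if_too_big(total_items: int) -> str:
    """Return a warning for very large batches, per the suggestion above."""
    if total_items > 4000:
        return ("Warning: this batch is very large; consider splitting it up, "
                "e.g. by first running with --no-aliases.")
    return ""
```

The thresholds match the table above: 2000 items yield a sample of 100, and anything up to 1000 items falls back to the minimum of 50.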
e.g. remove carcinoma from this https://www.wikidata.org/wiki/Q64228075
As a user, I want to choose whether to search for items matching the label and/or one of the aliases, so I get as many hits as possible.
Pseudo code:
also fetch the aliases from WDQS
ask user for which ones to include (or all)
https://console-menu.readthedocs.io/en/latest/consolemenu/MultiSelectMenu.html
add them to a new attribute in class Labels: search_strings
fetch based on that (with one query if possible)
use https://pmitzias.com/SPARQLBurger/docs.html to generate the SPARQL query using UNION
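A hand-rolled sketch of the single UNION query the pseudo code describes; SPARQLBurger would generate something similar, but the exact query shape here is an assumption:

```python
def build_union_query(search_strings, limit=10000):
    """Build one WDQS query that matches labels containing any of the
    search strings, with one UNION branch per string."""
    branches = []
    for s in search_strings:
        # Escape characters that would break the string literal:
        escaped = s.replace("\\", "\\\\").replace('"', '\\"')
        branches.append(
            "{ ?item rdfs:label ?label . "
            f'FILTER(CONTAINS(LCASE(?label), LCASE("{escaped}"))) }}'
        )
    union = "\n    UNION\n    ".join(branches)
    return f"SELECT DISTINCT ?item WHERE {{\n    {union}\n}} LIMIT {limit}"
```

With the label plus each selected alias in `search_strings`, this fetches everything in one query instead of one query per string.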
When working on HCV https://www.wikidata.org/wiki/Q154869 I see that there is another item with "HCV 229E" and I want to make sure that they don't get included in my batch.
https://www.wikidata.org/wiki/Q65510721 has main subject muon, but "muon g-2" is a better fit. As it is now, undoing the whole batch of 800 items and partly redoing it is the only way to recategorize these items.
A command line flag with --replace qid-to-be-replaced --replacement new-qid would be nice
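The suggested flag pair could look like this with argparse (the QIDs in the example are hypothetical; wiring into the tool is an assumption):

```python
import argparse

parser = argparse.ArgumentParser(
    description="Replace one main subject QID with another on a batch of items"
)
parser.add_argument("--replace", metavar="QID",
                    help="QID of the main subject to be replaced")
parser.add_argument("--replacement", metavar="QID",
                    help="QID of the new main subject")

# Hypothetical QIDs, for illustration only:
args = parser.parse_args(["--replace", "Q111", "--replacement", "Q222"])
```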
I want to add the following main subjects to the articles
and I run the following command (from specific topics to generic topics)
$ python itemsubjector.py -na -l Q108684373 Q101116078
Even though the addition of 'Q108684373' is complete, I see articles with the text 'scoping review protocol' in the list for 'scoping review'.
This issue may be related to Issue 14
From the riksdagen task, see https://github.com/dpriskorn/ItemSubjector/blob/prepare-batch/models/suggestion.py#L72
Running command git clone -q git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-e8o5ih20/wikibaseintegrator_728b3c0d1e3b474b9f15e676bf978aca
fatal: remote error:
The unauthenticated git protocol on port 9418 is no longer supported.
Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
WARNING: Discarding git+git://github.com/LeMyst/[email protected]#egg=wikibaseintegrator. Command errored out with exit status 128: git clone -q git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-e8o5ih20/wikibaseintegrator_728b3c0d1e3b474b9f15e676bf978aca Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement wikibaseintegrator (unavailable)
ERROR: No matching distribution found for wikibaseintegrator (unavailable)
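GitHub disabled the unauthenticated git:// protocol (see the linked blog post). Until the requirement is updated to an https URL, a known workaround is to tell git to rewrite such URLs before running pip again:

```shell
# git:// on port 9418 is no longer supported by GitHub; rewrite such URLs
# to https:// so pip's "git clone" succeeds again.
git config --global url."https://github.com/".insteadOf git://github.com/
```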
e.g. on https://www.wikidata.org/wiki/Q99467166 we only want the most specific one left
It is a list of random items in all released versions.
$ poetry install
This can be implemented using MultiSelectMenu.
They appear in both rats and humans and are currently very hard to validate reliably.
After installing in PAWS I ran this command:
poetry run python itemsubjector.py -a Q40858
I then selected 2 to work on Riksdagen documents. This caused this screen:
Working on naturgas, see http://www.wikidata.org/entity/Q40858
Got a total of 78 items
Please keep an eye on the lag of the WDQS cluster and avoid working if it is over a few minutes:
https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&viewPanel=8&from=now-30m&to=now&refresh=1d
You can see whether any lagging servers are pooled here:
https://config-master.wikimedia.org/pybal/eqiad/wdqs
If any enabled servers are lagging more than 5-10 minutes you can search phabricator for open tickets to see if the team is on it.
If you don't find any feel free to create a new ticket like this:
https://phabricator.wikimedia.org/T291621
Running 1 job(s) with a total of 1 items non-interactively now. You can take a coffee break and lean back :)
Traceback (most recent call last):
File "/home/paws/.itemsubjector/itemsubjector.py", line 8, in <module>
itemsubjector.run()
File "/home/paws/.itemsubjector/src/__init__.py", line 164, in run
handle_job_preparation_or_run_directly_if_any_jobs(
File "/home/paws/.itemsubjector/src/helpers/jobs.py", line 154, in handle_job_preparation_or_run_directly_if_any_jobs
batchjobs.run_jobs()
File "/home/paws/.itemsubjector/src/models/batch_jobs.py", line 45, in run_jobs
job.suggestion.add_to_items(
File "/home/paws/.itemsubjector/src/models/suggestion.py", line 111, in add_to_items
f"to {clean_rich_formatting(target_item.label)}"
File "/home/paws/.itemsubjector/src/helpers/cleaning.py", line 24, in clean_rich_formatting
return label.replace("[/", "['/")
AttributeError: 'NoneType' object has no attribute 'replace'
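A minimal guard for this crash, based on the traceback above (the real fix may differ, e.g. skipping items without a label earlier in the pipeline):

```python
def clean_rich_formatting(label):
    """Escape '[/' so the label is not interpreted as rich markup.
    Guard against items whose label is missing (None) in the chosen language."""
    if label is None:
        return ""
    return label.replace("[/", "['/")
```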
Enable recursing through the items of a whole category, e.g. https://en.wikipedia.org/wiki/Category:Tyrosine_kinase_receptors
I get this error in PAWS on v0.3.3 when doing single subject:
Picking a random main subject
Working on naturgas
Do you want to continue? [Y/Enter/n]:
Traceback (most recent call last):
File "/home/paws/.itemsubjector/itemsubjector.py", line 8, in <module>
itemsubjector.run()
File "/home/paws/.itemsubjector/src/__init__.py", line 79, in run
main_subjects.get_validated_main_subjects_as_jobs()
File "/home/paws/.itemsubjector/src/models/main_subjects.py", line 108, in get_validated_main_subjects_as_jobs
job = main_subject_item.fetch_items_and_get_job_if_confirmed()
File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 240, in fetch_items_and_get_job_if_confirmed
return self.__fetch_and_parse__()
File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 250, in __fetch_and_parse__
self.__prepare_before_fetching_items__()
File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 188, in __prepare_before_fetching_items__
self.__extract_search_strings__()
File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 141, in __extract_search_strings__
elif self.id in config.no_alias_for_scholarly_items:
AttributeError: module 'config' has no attribute 'no_alias_for_scholarly_items'
My command was poetry run python itemsubjector.py -a Q40858
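A defensive sketch for this crash: reading the setting with getattr keeps an older config.py (one that predates the new attribute) from breaking the run. SimpleNamespace stands in for the real config module here.

```python
from types import SimpleNamespace

def no_alias_qids(config) -> list:
    """Return the configured list of QIDs to skip aliases for,
    falling back to an empty list if the setting is missing."""
    return getattr(config, "no_alias_for_scholarly_items", [])

# Stand-ins for an old and a new config module:
old_config = SimpleNamespace()
new_config = SimpleNamespace(no_alias_for_scholarly_items=["Q40858"])
```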
Where to find them? Bulk download from fatcat!
https://archive.org/download/fatcat_bulk_exports_2021-10-07/abstracts.json.gz 10G abstracts
Similar to QuickStatements batches, ItemSubjector could have a Flask frontend that runs in Toolforge and executes the user's batches.
This requires OAuth and Flask.
Lucas made a good Toolforge Flask template to get started.
At least 700 items were edited before the replace-bug was fixed, and the tool erroneously removed a lot of main subjects.
Manual cleanup is ongoing.
see https://www.wikidata.org/w/index.php?title=Special:Contributions/So9q&offset=20210913090748&limit=100&target=So9q
On checking the latest version and 0.3-alpha2, I am getting the following error:
Traceback (most recent call last):
File "itemsubjector.py", line 3, in <module>
import src
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/__init__.py", line 11, in <module>
from src.helpers.console import (
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/helpers/console.py", line 11, in <module>
from src.models.batch_job import BatchJob
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/batch_job.py", line 3, in <module>
from src.models.items import Items
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/items/__init__.py", line 10, in <module>
from src.models.wikimedia.wikidata.sparql_item import SparqlItem
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/wikimedia/wikidata/sparql_item.py", line 4, in <module>
import config.items
ModuleNotFoundError: No module named 'config.items'
itemsubjector.py -l Q1234 --match-string "search for this"
Is it possible to add an option (advanced option) to disable search for aliases while adding main subject to Wikidata items?
This has the advantage that the human can focus on jobs with a big impact instead of drowning in small jobs with only a few matches, where 99% contain no false positives and which can easily be reverted by anyone even if they do.
As a user, I want to approve all batches without having to sit and wait in between, because my time is valuable.
This would enable us to exclude groups of articles based on the main subjects of the journals :)
see https://www.wikidata.org/w/index.php?title=Topic:Wivsgl0y23flu93q&topic_showPostId=wj0bq23pbit5scil#flow-post-wj0bq23pbit5scil
Jean-Fred:
Run the interactive part on toolforge on the shell, and from there kick off a grid engine job ?
Dennis Priskorn:
I have not learned how the grid engine works yet.
Maybe a new flag --grid-engine can be added, which saves the to-be-processed QIDs in a pickle.
Then a new script can read that and run a non-interactive batch for each one?
The latter can be executed in the engine as a job
--approve-only might be a better name
This is useful, because there are already many thousand different main subjects and many of them are not matched properly with all relevant articles yet.
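The pickle handoff sketched in the discussion above could look like this (the file name and list format are assumptions):

```python
import pickle

QUEUE_FILE = "approved_qids.pkl"  # hypothetical file name

def save_approved_qids(qids):
    """Interactive run: persist the approved QIDs for a later grid engine job."""
    with open(QUEUE_FILE, "wb") as f:
        pickle.dump(qids, f)

def load_approved_qids():
    """Non-interactive job: read the QIDs back and run one batch per QID."""
    with open(QUEUE_FILE, "rb") as f:
        return pickle.load(f)
```

The interactive session on the shell would call `save_approved_qids` after approval, and the grid engine job would call `load_approved_qids` and process each QID non-interactively.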
Thanks for correcting the previous errors in 0.3-alpha3.
I checked out the latest commit in the main branch and 0.3-alpha4, and now I face the following OAuth error:
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/my_venv/lib/python3.7/site-packages/oauthlib/oauth2/rfc6749/parameters.py", line 432, in validate_token_parameters
raise_from_error(params.get('error'), params)
File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/my_venv/lib/python3.7/site-packages/oauthlib/oauth2/rfc6749/errors.py", line 402, in raise_from_error
raise cls(**kwargs)
oauthlib.oauth2.rfc6749.errors.InvalidClientIdError: (invalid_request) The request is missing a required parameter, includes an invalid parameter value, includes a parameter more than once, or is otherwise malformed.
Any idea on this error? I checked with other scripts of mine; there are no issues.
e.g. case control study in https://www.wikidata.org/wiki/Q44681960 where we have https://www.wikidata.org/wiki/Q961652 (with a dash)
e.g. needed for https://www.wikidata.org/wiki/Q108570971
reported by jsamwrites
I have checked out the latest version v0.3.1 in PAWS and tried to run pip install -r requirements.txt as the README says.
I then get the error message:
ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
Labels with an apostrophe (') currently do not work. I think an escape character needs to be added before sending the query string to WDQS.
Take for example:
Alzheimer's disease (Q11081)
returns the following error
Fetching items with labels that have one of the search strings by running a total of 11 queries on WDQS...INFO:backoff:Backing off execute_sparql_query(...) for 1.0s (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url
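A minimal escaping sketch; whether the tool builds single- or double-quoted SPARQL literals is an assumption, so both quote kinds and backslashes are escaped:

```python
def escape_sparql_literal(text: str) -> str:
    """Escape characters that would break a SPARQL string literal,
    such as the apostrophe in "Alzheimer's disease"."""
    return (
        text.replace("\\", "\\\\")   # backslashes first, so escapes are not doubled
            .replace("'", "\\'")
            .replace('"', '\\"')
    )
```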
In v0.2 I am trying pip install -r requirements.txt
in PAWS and get this error message:
Collecting wikibaseintegrator
Cloning git://github.com/LeMyst/WikibaseIntegrator (to revision v0.12.0.dev5) to /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb
Running command git clone --filter=blob:none --quiet git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb
fatal: unable to connect to github.com:
github.com[0: 140.82.113.4]: errno=Connection timed out
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb did not run successfully.
│ exit code: 128
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
What should I do?