pyorcidator's Introduction

PyORCIDator

PyORCIDator is a wrapper around ORCID data for integration into Wikidata.

For each ORCID, it currently imports only:

  • Employment data (without titles or start and end dates)
  • Education data (with titles and start and end dates)

It generates a QuickStatements batch with a standard English description of "researcher" and an occupation → researcher statement.

Next steps

The following features are on the development list:

  • Adding authorship (P50) statements for all listed articles.
  • Extracting Google Scholar and Twitter IDs.

Installation

The easiest way to install PyORCIDator is to clone the repository with:

git clone https://github.com/lubianat/pyorcidator.git

Then, install it from the project's root directory with:

pip install -e .

Usage

To run PyORCIDator interactively, run:

pyorcidator import

To run a query for a single ORCID, run:

pyorcidator import --orcid 0000-0003-2473-2313

To run a query with a list of ORCIDs, run:

# here orcids.txt is a file containing one ORCID per line
pyorcidator import_list --orcid-list orcids.txt

Related Work

pyorcidator's People

Contributors

cthoyt, gabriellovate, jvfe, lubianat

pyorcidator's Issues

Think about implementing a test suite

I believe testing is a good thing to work on going forward, since the base functionality of the code is more or less in place. As we implement new features, it would be a nice safety net to have a test suite alongside, so we don't break the basics. I'd recommend going with pytest, but Python's own testing module, unittest, is pretty good too.

Avoid code repetition between import_info and import_info_from_list

Since the end result of the two modules is essentially the same, the desired functionality (reading from a list of ORCIDs) could be achieved with a type guard in import_info.py: if the argument provided is a Path() that exists, read it and loop through the ORCIDs; if it is a plain string, just send it to render_orcid_qs. I believe it would make the code simpler and cleaner.
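
A minimal sketch of that type guard, assuming a single import_info entry point; the import path and exact signature of render_orcid_qs are assumptions:

from pathlib import Path

# render_orcid_qs is the existing helper mentioned above; this import path is
# an assumption about where it lives in the package.
from pyorcidator.import_info import render_orcid_qs


def import_info(orcid_or_path: str) -> None:
    """Handle either a single ORCID string or a path to a file of ORCIDs."""
    maybe_path = Path(orcid_or_path)
    if maybe_path.is_file():
        # The argument is an existing file: read one ORCID per line and loop.
        for line in maybe_path.read_text().splitlines():
            orcid = line.strip()
            if orcid:
                render_orcid_qs(orcid)
    else:
        # Otherwise treat the argument as a single ORCID string.
        render_orcid_qs(orcid_or_path)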

Describe in the README what it actually does

For example: does it add missing papers? Is that optional? Does it try to follow links to websites provided on the profile and extract, for example, a Twitter account or a Google Scholar account? Does it link the (new) ORCID profile to existing articles?

ModuleNotFoundError: No module named 'pyorcidator'

@cthoyt I tried to install pyorcidator as a package by running pip3 install . in a virtual env. When I try to call the function, it gives the following error:

Traceback (most recent call last):
  File "/home/lubianat/Documents/main_venv/bin/pyorcidator", line 5, in <module>
    from pyorcidator.cli import cli
ModuleNotFoundError: No module named 'pyorcidator'

It is installed, though, as it is in /home/lubianat/Documents/main_venv/bin/pyorcidator. Do you know what might be happening?

403 error when using import or import_list

Sometimes when we run pyorcidator import or pyorcidator import_list, it raises an HTTP 403 Forbidden error:

File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

The stacktrace appears to lead to SPARQLWrapper in the helper.lookup_id function.
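
Not sure if this is the cause here, but the Wikidata Query Service sometimes returns 403 when the request has a missing or generic User-Agent. It might be worth checking whether helper.lookup_id sets one; a minimal sketch with SPARQLWrapper (the agent string is illustrative):

from SPARQLWrapper import JSON, SPARQLWrapper

# A descriptive agent string identifies the tool, per Wikimedia's user-agent
# policy; generic or missing agents are sometimes rejected with HTTP 403.
sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="pyorcidator (https://github.com/lubianat/pyorcidator)",
)
sparql.setQuery('SELECT ?item WHERE { ?item wdt:P496 "0000-0003-2473-2313" }')
sparql.setReturnFormat(JSON)
print(sparql.query().convert()["results"]["bindings"])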

Enable creating Wikidata record for a given ORCID non-interactively

While the goal of this repo is to have a human in the loop to help disambiguate affiliations and other aspects of ORCID pages, it would be nice to have a non-interactive mode that, for a given ORCID, does just a simple set of things (instance of, occupation, ORCID annotation) plus some other high-confidence resolution of affiliations.
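
A rough sketch of what that minimal, non-interactive output could look like as QuickStatements V1 text, assuming the usual properties (P31 instance of, P106 occupation, P496 ORCID iD) and the researcher item Q1650915; the helper name and exact output are illustrative:

def minimal_quickstatements(orcid: str, label: str) -> str:
    """Build QuickStatements V1 lines for the non-interactive case.

    Hypothetical helper: creates a new item with an English label and
    description, instance of (P31) human (Q5), occupation (P106)
    researcher (Q1650915), and the ORCID iD (P496).
    """
    return "\n".join(
        [
            "CREATE",
            f'LAST|Len|"{label}"',
            'LAST|Den|"researcher"',
            "LAST|P31|Q5",
            "LAST|P106|Q1650915",
            f'LAST|P496|"{orcid}"',
        ]
    )


# Placeholder name; in practice the label would come from the ORCID record.
print(minimal_quickstatements("0000-0003-2473-2313", "Example Researcher"))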

Related to cthoyt/orcidio#7

Split QuickStatements datamodel into own package?

Hi @lubianat @jvfe, would it be alright if I split the code I wrote for the QuickStatements data model into its own stand-alone package? I also want to write a full-fledged client for interacting with the QuickStatements API, but this isn't the core goal of this package.

I'll keep compatibility with the current interface, so later it would be possible to replace code in this repository with that code.

get_organization_list not returning full list of organizations

In the branch tests/get_org_list_bug I built a simple test to check if helper.get_organization_list could return a list of all organizations in the sample data. However, it currently only returns 2 of the 4 organizations (Output of pytest -v):

E       AssertionError: assert ['Harvard Med...l', 'Q152171'] == ['Harvard Med...unhofer SCAI']
E         At index 1 diff: 'Q152171' != 'Enveda Biosciences'
E         Right contains 2 more items, first extra item: 'Q152171'
E         Full diff:
E         - ['Harvard Medical School', 'Enveda Biosciences', 'Q152171', 'Fraunhofer SCAI']
E         + ['Harvard Medical School', 'Q152171']

As you can see, the function only returns the first and the third employment entries.
I believe this piece of code is the reason:

if a["disambiguated-organization"] is None:
    continue

If it doesn't find a key for disambiguated-organization (which is the case for the second and fourth entries I showed above), it jumps to the next entry in the list instead of returning it at the end. So, what's the reason behind this line? Is there something I'm missing here? Thanks!
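
If the intention is to keep those entries rather than drop them, one option would be to fall back to the plain organization name when no disambiguated identifier is present. A rough sketch (the entry layout follows the ORCID organization dicts; the resolve callable is a stand-in for the existing QID lookup):

from typing import Any, Callable, Dict, List


def get_organization_labels(
    entries: List[Dict[str, Any]],
    resolve: Callable[[Dict[str, Any]], str],
) -> List[str]:
    """Sketch: return one label per organization entry instead of skipping some.

    `resolve` stands in for whatever the real code does with a
    "disambiguated-organization" block (e.g. mapping it to a QID like Q152171).
    """
    labels = []
    for a in entries:
        disambiguated = a.get("disambiguated-organization")
        if disambiguated is None:
            # Fall back to the plain organization name rather than `continue`,
            # so entries like "Enveda Biosciences" are still returned.
            labels.append(a["name"])
        else:
            labels.append(resolve(disambiguated))
    return labels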

Improve handling and skipping when term not present in Wikidata

Sometimes a term is not on Wikidata and we'd rather just exclude it instead of going through the effort of creating a new entry.

Not sure how to implement this:

  • a special dict value ("SKIP") that is skipped? (a rough sketch follows this list)
  • a flag that skips any missing term when running import?
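
A rough sketch of the first option, using a sentinel value in the lookup dictionary (names here are illustrative):

from typing import Dict, Optional

# Sentinel stored in the local lookup dictionaries (e.g. degrees.json) to mark
# terms that should be skipped rather than prompted for again.
SKIP = "SKIP"


def qid_for_term(term: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the mapped QID for `term`, or None if it is marked to be skipped.

    Sketch only: `mapping` stands in for one of pyorcidator's lookup dicts;
    the real lookup and prompting logic lives elsewhere.
    """
    value = mapping.get(term)
    if value == SKIP:
        # Explicitly marked as "not on Wikidata, don't ask again".
        return None
    return value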

Implement data model for quickstatements

Right now, the construction of the quickstatements text is really hard to understand, and therefore hard to debug or extend. I would suggest creating a data model (i.e., a set of interconnected classes) that can better assist with constructing quickstatements programmatically and can also implement serialization to text.
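
A rough sketch of the kind of data model meant here, using dataclasses; the class and field names are illustrative, not what ended up in the package (see the "Split QuickStatements datamodel" issue above):

from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """One property-value pair in QuickStatements V1 syntax, with optional qualifiers."""

    prop: str  # e.g. "P106"
    value: str  # e.g. "Q1650915", or a quoted string such as '"0000-0001-2345-6789"'
    qualifiers: List["Statement"] = field(default_factory=list)

    def serialize(self, subject: str = "LAST") -> str:
        parts = [subject, self.prop, self.value]
        for qualifier in self.qualifiers:
            parts.extend([qualifier.prop, qualifier.value])
        return "|".join(parts)


@dataclass
class NewItem:
    """A new Wikidata item plus the statements to attach to it."""

    label: str
    description: str
    statements: List[Statement] = field(default_factory=list)

    def serialize(self) -> str:
        lines = ["CREATE", f'LAST|Len|"{self.label}"', f'LAST|Den|"{self.description}"']
        lines.extend(statement.serialize() for statement in self.statements)
        return "\n".join(lines)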

Include black in the CI

Add a CI check to verify that the code is properly formatted with black.

@jvfe about time we blackened the whole project ;)

For sure, I'll probably include black in the CI too, just to be safe. But that's for another PR.

Originally posted by @jvfe in #41 (comment)

Pre-populate degree dictionary

Using a SPARQL query to get all subclasses of academic title (Q3529618) would be a nice way to pre-populate degrees.json. The following SPARQL query (run at https://w.wiki/5o9H) gets the job done:

SELECT ?itemLabel ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Caveats:

  • This should be extended to multiple languages
  • Some labels are empty; those should be filtered out either in SPARQL or in post-processing (I realize this is likely due to there not being English labels)
  • There might be other terms besides academic title that are relevant, but this seems like a pretty good start

Alternate Multi-lingual SPARQL

SELECT DISTINCT ?label ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  ?item rdfs:label ?label .
}

Note that DISTINCT doesn't collapse entries that are tagged with multiple languages but still have the same text.
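
Running one of the queries above and dumping the result could look roughly like this (sketch only; the exact shape expected by degrees.json is an assumption, and this variant restricts to English labels, so the multi-language caveat still applies):

import json

from SPARQLWrapper import JSON, SPARQLWrapper

# English-only variant of the queries above, just for this sketch; the
# multi-language caveat would need a different FILTER or post-processing.
QUERY = """
SELECT DISTINCT ?label ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}
"""

sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="pyorcidator degree pre-population (https://github.com/lubianat/pyorcidator)",
)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
bindings = sparql.query().convert()["results"]["bindings"]

# Map label -> QID, trimming the entity URI prefix and dropping empty labels.
degrees = {
    b["label"]["value"]: b["item"]["value"].rsplit("/", 1)[-1]
    for b in bindings
    if b["label"]["value"].strip()
}

with open("degrees.json", "w") as fh:
    json.dump(degrees, fh, indent=2, ensure_ascii=False)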

Add logic to look for disambiguations on Wikidata

e.g.:

"organization": {
    "name": "University of Regensburg",
    "address": {
        "city": "Regensburg",
        "region": "Bayern",
        "country": "DE"
    },
    "disambiguated-organization": {
        "disambiguated-organization-identifier": "grid.7727.5",
        "disambiguation-source": "GRID"
    }
}

The code should look up the ID on Wikidata before asking the user for a key.
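
A rough sketch of that lookup for the GRID case, assuming P2427 is the Wikidata property for GRID identifiers (other disambiguation sources, e.g. ROR, would need their own properties):

from typing import Optional

from SPARQLWrapper import JSON, SPARQLWrapper


def lookup_grid(grid_id: str) -> Optional[str]:
    """Return the QID of the organization with the given GRID ID, if any.

    Sketch only: assumes P2427 is Wikidata's GRID ID property.
    """
    sparql = SPARQLWrapper(
        "https://query.wikidata.org/sparql",
        agent="pyorcidator (https://github.com/lubianat/pyorcidator)",
    )
    sparql.setQuery(f'SELECT ?item WHERE {{ ?item wdt:P2427 "{grid_id}" }} LIMIT 1')
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    if not bindings:
        # Nothing found on Wikidata: fall back to asking the user, as now.
        return None
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1]


print(lookup_grid("grid.7727.5"))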
