pyorcidator's Introduction

PyORCIDator

PyORCIDator is a wrapper around ORCID data for integration into Wikidata.

For each ORCID, it currently imports only:

  • Employment data (without titles or start and end dates)
  • Education data (with titles and start and end dates)

It generates a QuickStatements batch with a standard English description of "researcher" and an occupation → researcher statement.

Next steps

The following features are on the development list:

  • Adding authorship (P50) statements for all listed articles.
  • Extracting Google Scholar and Twitter IDs.

Installation

The easiest way to install PyORCIDator is to clone the repository with:

git clone https://github.com/lubianat/pyorcidator.git

Then, install it from the project's root directory with:

pip install -e .

Usage

To run PyORCIDator interactively, run:

pyorcidator import

To run a query for a single ORCID, run:

pyorcidator import --orcid 0000-0003-2473-2313

To run a query with a list of ORCIDs, run:

# here orcids.txt is a file containing one ORCID per line
pyorcidator import_list --orcid-list orcids.txt

Related Work

pyorcidator's People

Contributors

cthoyt, gabriellovate, jvfe, lubianat

pyorcidator's Issues

Think about implementing a test suite

I believe testing is a good thing to work on going forward, since the base functionality of the code is more or less in place. As we implement new features, it would be a nice safety net to have a test suite alongside, so we don't break the basics. I'd recommend going with pytest, but Python's own testing module, unittest, is pretty good too.

Avoid code repetition between import_info and import_info_from_list

Since the end result of the two modules is essentially the same, the desired functionality (reading from a list of ORCIDs) could be achieved with a type guard in import_info.py: if the argument provided is a Path() that exists, read it and loop through the ORCIDs; if it is a plain string, just send it to render_orcid_qs. I believe it would make the code simpler and cleaner.
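
A minimal sketch of that type guard, assuming a single import_info entry point; the import path and exact signature of render_orcid_qs are assumptions:

from pathlib import Path

# render_orcid_qs is the existing helper mentioned above; this import path is
# an assumption about where it lives in the package.
from pyorcidator.import_info import render_orcid_qs


def import_info(orcid_or_path: str) -> None:
    """Handle either a single ORCID string or a path to a file of ORCIDs."""
    maybe_path = Path(orcid_or_path)
    if maybe_path.is_file():
        # The argument is an existing file: read one ORCID per line and loop.
        for line in maybe_path.read_text().splitlines():
            orcid = line.strip()
            if orcid:
                render_orcid_qs(orcid)
    else:
        # Otherwise treat the argument as a single ORCID string.
        render_orcid_qs(orcid_or_path)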

Describe in the README what it actually does

For example: does it add missing papers? Is that optional? Does it try to follow links to websites provided on the profile and extract, for example, a Twitter account or a Google Scholar account? Does it link the (new) ORCID profile to existing articles?

ModuleNotFoundError: No module named 'pyorcidator'

@cthoyt I tried to install pyorcidator as a package by running pip3 install . in a virtual env. When I try to call the function, it gives the following error:

Traceback (most recent call last):
  File "/home/lubianat/Documents/main_venv/bin/pyorcidator", line 5, in <module>
    from pyorcidator.cli import cli
ModuleNotFoundError: No module named 'pyorcidator'

It is installed, though, as it is in /home/lubianat/Documents/main_venv/bin/pyorcidator. Do you know what might be happening?

403 error when using import or import_list

Sometimes when we run pyorcidator import or pyorcidator import_list, it raises an HTTP 403 Forbidden error:

File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

The stacktrace appears to lead to SPARQLWrapper in the helper.lookup_id function.
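
Not sure if this is the cause here, but the Wikidata Query Service sometimes returns 403 when the request has a missing or generic User-Agent. It might be worth checking whether helper.lookup_id sets one; a minimal sketch with SPARQLWrapper (the agent string is illustrative):

from SPARQLWrapper import JSON, SPARQLWrapper

# A descriptive agent string identifies the tool, per Wikimedia's user-agent
# policy; generic or missing agents are sometimes rejected with HTTP 403.
sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="pyorcidator (https://github.com/lubianat/pyorcidator)",
)
sparql.setQuery('SELECT ?item WHERE { ?item wdt:P496 "0000-0003-2473-2313" }')
sparql.setReturnFormat(JSON)
print(sparql.query().convert()["results"]["bindings"])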

Enable creating Wikidata record for a given ORCID non-interactively

While the goal of this repo is to have a human in the loop to help disambiguate affiliations and other aspects of ORCID pages, it would be nice to have a non-interactive mode that, for a given ORCID, does just a simple set of things (instance of, occupation, ORCID annotation) plus some other high-confidence resolution of affiliations.
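
A rough sketch of what that minimal, non-interactive output could look like as QuickStatements V1 text, assuming the usual properties (P31 instance of, P106 occupation, P496 ORCID iD) and the researcher item Q1650915; the helper name and exact output are illustrative:

def minimal_quickstatements(orcid: str, label: str) -> str:
    """Build QuickStatements V1 lines for the non-interactive case.

    Hypothetical helper: creates a new item with an English label and
    description, instance of (P31) human (Q5), occupation (P106)
    researcher (Q1650915), and the ORCID iD (P496).
    """
    return "\n".join(
        [
            "CREATE",
            f'LAST|Len|"{label}"',
            'LAST|Den|"researcher"',
            "LAST|P31|Q5",
            "LAST|P106|Q1650915",
            f'LAST|P496|"{orcid}"',
        ]
    )


# Placeholder name; in practice the label would come from the ORCID record.
print(minimal_quickstatements("0000-0003-2473-2313", "Example Researcher"))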

Related to cthoyt/orcidio#7

Split QuickStatements datamodel into own package?

Hi @lubianat @jvfe, would it be alright if I split the code I wrote for the QuickStatements data model into its own stand-alone package? I also want to write a full-fledged client for interacting with the QuickStatements API, but this isn't the core goal of this package.

I'll keep compatibility with the current interface, so later it would be possible to replace code in this repository with that code.

get_organization_list not returning full list of organizations

In the branch tests/get_org_list_bug I built a simple test to check if helper.get_organization_list could return a list of all organizations in the sample data. However, it currently only returns 2 of the 4 organizations (Output of pytest -v):

E       AssertionError: assert ['Harvard Med...l', 'Q152171'] == ['Harvard Med...unhofer SCAI']
E         At index 1 diff: 'Q152171' != 'Enveda Biosciences'
E         Right contains 2 more items, first extra item: 'Q152171'
E         Full diff:
E         - ['Harvard Medical School', 'Enveda Biosciences', 'Q152171', 'Fraunhofer SCAI']
E         + ['Harvard Medical School', 'Q152171']

As you can see, the function only returns the first and the third employment entries.
I believe this piece of code is the reason:

if a["disambiguated-organization"] is None:
    continue

If it doesn't find a key for disambiguated-organization (which is the case for the second and fourth entries I showed above), it jumps to the next entry in the list instead of returning it at the end. So, what's the reason behind this line? Is there something I'm missing here? Thanks!
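
If the intention is to keep those entries rather than drop them, one option would be to fall back to the plain organization name when no disambiguated identifier is present. A rough sketch (the entry layout follows the ORCID organization dicts; the resolve callable is a stand-in for the existing QID lookup):

from typing import Any, Callable, Dict, List


def get_organization_labels(
    entries: List[Dict[str, Any]],
    resolve: Callable[[Dict[str, Any]], str],
) -> List[str]:
    """Sketch: return one label per organization entry instead of skipping some.

    `resolve` stands in for whatever the real code does with a
    "disambiguated-organization" block (e.g. mapping it to a QID like Q152171).
    """
    labels = []
    for a in entries:
        disambiguated = a.get("disambiguated-organization")
        if disambiguated is None:
            # Fall back to the plain organization name rather than `continue`,
            # so entries like "Enveda Biosciences" are still returned.
            labels.append(a["name"])
        else:
            labels.append(resolve(disambiguated))
    return labels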

Improve handling and skipping when term not present in Wikidata

Sometimes a term is not on Wikidata and we'd rather just exclude it instead of going through the effort of creating a new entry.

Not sure how to implement this:

  • a special dict value ("SKIP") that is skipped? (a rough sketch follows this list)
  • a flag that skips any missing term when running import?
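
A rough sketch of the first option, using a sentinel value in the lookup dictionary (names here are illustrative):

from typing import Dict, Optional

# Sentinel stored in the local lookup dictionaries (e.g. degrees.json) to mark
# terms that should be skipped rather than prompted for again.
SKIP = "SKIP"


def qid_for_term(term: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the mapped QID for `term`, or None if it is marked to be skipped.

    Sketch only: `mapping` stands in for one of pyorcidator's lookup dicts;
    the real lookup and prompting logic lives elsewhere.
    """
    value = mapping.get(term)
    if value == SKIP:
        # Explicitly marked as "not on Wikidata, don't ask again".
        return None
    return value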

Implement data model for quickstatements

Right now, the construction of the quickstatements text is really hard to understand, and therefore hard to debug or extend. I would suggest creating a data model (i.e., a set of interconnected classes) that can better assist with constructing quickstatements programmatically and can also implement serialization to text.
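
A rough sketch of the kind of data model meant here, using dataclasses; the class and field names are illustrative, not what ended up in the package (see the "Split QuickStatements datamodel" issue above):

from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """One property-value pair in QuickStatements V1 syntax, with optional qualifiers."""

    prop: str  # e.g. "P106"
    value: str  # e.g. "Q1650915", or a quoted string such as '"0000-0001-2345-6789"'
    qualifiers: List["Statement"] = field(default_factory=list)

    def serialize(self, subject: str = "LAST") -> str:
        parts = [subject, self.prop, self.value]
        for qualifier in self.qualifiers:
            parts.extend([qualifier.prop, qualifier.value])
        return "|".join(parts)


@dataclass
class NewItem:
    """A new Wikidata item plus the statements to attach to it."""

    label: str
    description: str
    statements: List[Statement] = field(default_factory=list)

    def serialize(self) -> str:
        lines = ["CREATE", f'LAST|Len|"{self.label}"', f'LAST|Den|"{self.description}"']
        lines.extend(statement.serialize() for statement in self.statements)
        return "\n".join(lines)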

Include black in the CI

Add a CI check to verify that the code is properly formatted with black.

@jvfe about time we blackened the whole project ;)

For sure, I'll probably include black in the CI too, just to be safe. But that's for another PR.

Originally posted by @jvfe in #41 (comment)

Pre-populate degree dictionary

Using a SPARQL query to get all subclasses of academic title (Q3529618) would be a nice way to pre-populate degrees.json. The following SPARQL query (run at https://w.wiki/5o9H) gets the job done:

SELECT ?itemLabel ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Caveats:

  • This should be extended to multiple languages
  • Some labels are empty; those should be filtered out either in SPARQL or in post-processing (I realize this is likely due to there not being English labels)
  • There might be other terms besides academic title that are relevant, but this seems like a pretty good start

Alternate Multi-lingual SPARQL

SELECT DISTINCT ?label ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  ?item rdfs:label ?label .
}

Note that DISTINCT doesn't collapse entries that are tagged with multiple languages but still have the same text.
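
Running one of the queries above and dumping the result could look roughly like this (sketch only; the exact shape expected by degrees.json is an assumption, and this variant restricts to English labels, so the multi-language caveat still applies):

import json

from SPARQLWrapper import JSON, SPARQLWrapper

# English-only variant of the queries above, just for this sketch; the
# multi-language caveat would need a different FILTER or post-processing.
QUERY = """
SELECT DISTINCT ?label ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}
"""

sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="pyorcidator degree pre-population (https://github.com/lubianat/pyorcidator)",
)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
bindings = sparql.query().convert()["results"]["bindings"]

# Map label -> QID, trimming the entity URI prefix and dropping empty labels.
degrees = {
    b["label"]["value"]: b["item"]["value"].rsplit("/", 1)[-1]
    for b in bindings
    if b["label"]["value"].strip()
}

with open("degrees.json", "w") as fh:
    json.dump(degrees, fh, indent=2, ensure_ascii=False)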

Add logic to look for disambiguations on Wikidata

e.g.:

"organization": {
    "name": "University of Regensburg",
    "address": {
        "city": "Regensburg",
        "region": "Bayern",
        "country": "DE"
    },
    "disambiguated-organization": {
        "disambiguated-organization-identifier": "grid.7727.5",
        "disambiguation-source": "GRID"
    }
}

The code should look up the ID on Wikidata before asking the user for a key.
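
A rough sketch of that lookup for the GRID case, assuming P2427 is the Wikidata property for GRID identifiers (other disambiguation sources, e.g. ROR, would need their own properties):

from typing import Optional

from SPARQLWrapper import JSON, SPARQLWrapper


def lookup_grid(grid_id: str) -> Optional[str]:
    """Return the QID of the organization with the given GRID ID, if any.

    Sketch only: assumes P2427 is Wikidata's GRID ID property.
    """
    sparql = SPARQLWrapper(
        "https://query.wikidata.org/sparql",
        agent="pyorcidator (https://github.com/lubianat/pyorcidator)",
    )
    sparql.setQuery(f'SELECT ?item WHERE {{ ?item wdt:P2427 "{grid_id}" }} LIMIT 1')
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    if not bindings:
        # Nothing found on Wikidata: fall back to asking the user, as now.
        return None
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1]


print(lookup_grid("grid.7727.5"))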
