metacheck's Introduction

Badges: Main · Codecov test coverage · CRAN status · Lifecycle: experimental

Open Access Metadata Compliance Checker

Automatically check metadata compliance for funded open access articles.

Check your own DOIs · Look at an example

Why Metadata Compliance Matters

Open access grants and transformation contracts with publishers increasingly require licensing metadata.

General metadata recommendations:

Supporting Libraries and OA Funders

The compliance checker helps libraries and funders of open access publications streamline their metadata monitoring.

  • Become Metadata Compliant

    Independently verify that your publications are compliant with the metadata requirements of library consortia and funding agencies.
  • Identify Areas of Improvement

    Identify nonconforming metadata and target the publishers and publications to best improve your compliance.
  • Receive Emailed Reports

    Get high-quality reports parametrised for your institution and funded publications by email.
  • Dig Down into your Data

    Use generated spreadsheets with your own tools and analyses.
  • Get Answers

    Rely on the community of users and experts to interpret results and troubleshoot noncompliant or missing metadata.
    FAQs are coming soon.
  • Use Open Source for Open Access

    Co-create value for the community by using a compliance tool powered by open source software and based on open data.
    Contributions are welcome.

Technical Implementation

The Open Access Metadata Compliance Checker is powered by metacheck, an R package.

The package includes:

  • compliance checks
  • a parametrised rmarkdown compliance report
  • a webapp to send e-mail reports
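A minimal way to try the package locally might look like this (the repository path is taken from the Actions URLs in the issues below, and `runMetacheck()` is the entry point named there; verify both before relying on them):

```shell
# sketch: install from GitHub and launch the webapp locally
# (assumes the remotes package is installed; repo path and entry point
# are taken from links elsewhere in this document, not verified)
Rscript -e 'remotes::install_github("subugoe/metacheck")'
Rscript -e 'metacheck::runMetacheck()'
```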

metacheck's People

Contributors

maxheld83, njahn82


metacheck's Issues

retrieve email send status from smtp relay

Currently, we don't really hear back about what the SMTP relay (Mailjet, at the moment) does with the email, i.e. when/if it is actually sent, bounced, or otherwise handled.

AFAIK, what we currently hear from mailjet is merely that mailjet has received our request.

It would be nice to have access to that and use that status:

  • in testing
  • in the UI (i.e. tell users when their email has actually been sent by the smtp)

This kind of bit me just now: Mailjet seemed to be working through a queue for about 20 minutes, sending no email (I couldn't find them in gmail or gwdg), and then a bunch came through all at once.
This kind of thing could/should be tested once we get a proper status response.
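One possible route, assuming Mailjet's v3 REST API (the endpoint and field names here are taken from memory of the Mailjet docs and should be verified): poll the message resource after submission and surface its status.

```shell
# sketch: poll Mailjet for the status of recently submitted messages
# (assumes MJ_APIKEY_PUBLIC / MJ_APIKEY_PRIVATE are set and jq is installed;
# endpoint and fields per Mailjet's v3 REST docs -- verify before relying on this)
curl -s --user "$MJ_APIKEY_PUBLIC:$MJ_APIKEY_PRIVATE" \
  "https://api.mailjet.com/v3/REST/message?Limit=10" \
  | jq '.Data[] | {ID, Status}'
```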

use separate API tokens

Aside from security best practices, there may be another reason to use separate API tokens for separate people and, especially, services: rate limits.

Crossref states for Metadata Plus that:

Rate limiting of the API is primarily on a per access token basis. If a method allows, for example, for 75 requests per rate limit window, then it allows 75 requests per window per access token. This number can depend on the system state and may need to change. If it does, Crossref will publish it in the response headers.
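The headers Crossref mentions can be inspected directly; a minimal sketch against the public API (for Metadata Plus, the token would go in a `Crossref-Plus-API-Token` header, per the Crossref docs):

```shell
# sketch: print the rate-limit headers Crossref sends with every response
curl -s -D - -o /dev/null "https://api.crossref.org/works?rows=0" \
  | grep -i '^x-rate-limit'
# typically something like:
#   x-rate-limit-limit: 50
#   x-rate-limit-interval: 1s
```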

The problem with us (people) and several services (Azure, GitHub Actions) sharing the same token is that one of these users might (accidentally) exhaust the rate limit at the expense of another user or machine.
This can easily happen, because it is generally OK to go up to the rate limit on any individual machine or service.
As a result, seemingly unrelated services or other users' queries may break intermittently, which could be quite surprising and hard to debug.

This is unlikely to be an issue initially, but may well become one eventually, and should be addressed head-on with at least one token per user and service.

Depending on our scaling Azure may even need several tokens, or the shiny app must ask Azure how many instances there currently are and then share/divide the rate limit accordingly.
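The "share/divide" idea is simple arithmetic; a sketch using the 75-requests-per-window figure from the Crossref quote above and a hypothetical instance count (in practice, the count would come from querying Azure):

```shell
# sketch: split a shared per-token limit evenly across running instances
# (the instance count is hardcoded here; a real version would ask Azure)
TOTAL_LIMIT=75
INSTANCES=3
PER_INSTANCE=$(( TOTAL_LIMIT / INSTANCES ))
echo "each instance may send $PER_INSTANCE requests per window"
```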

As a (slow) workaround, falling back to the open api might help #36

add test where runMetacheck is run inside of runtime container

haha, this was a fun one just now solved in https://github.com/subugoe/metacheck/actions/runs/413556533.

There was apparently (?) a missing dependency, which never came up before because:

  • it was only needed when there were missing DOIs, which did not come up in much of the testing
  • the dependency (gluedown) is probably part of muggle-buildtime anyway (?), so it never came up in GitHub Actions testing
  • inside Azure, muggle-runtime probably did not have it, and that's why it failed.

This is a pretty bad failure mode.
Fixes:

  • also run a (smoke?) test on muggle-runtime in GitHub Actions (though even the entrypoint wouldn't have caught that?!)
  • don't use any dependencies in the rmarkdown template, because forgetting them doesn't get caught by R CMD check (as per #52).
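The first fix could be a single extra CI step; a sketch, where the image name and template path are placeholders (the real runtime image reference and the template's installed location would need to be filled in):

```shell
# sketch: smoke-test inside the *runtime* image in CI, so runtime-only missing
# dependencies (like the gluedown case above) fail the build, not production.
# Image name and template path are placeholders.
docker run --rm subugoe/metacheck-runtime:latest Rscript -e '
  template <- system.file("rmarkdown", "report.Rmd", package = "metacheck")
  rmarkdown::render(template)  # rendering exercises the template dependencies
' || { echo "runtime smoke test failed"; exit 1; }
```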

deploy to google cloud run (or similar batch service)

Since the creation and sending of the email is actually stateless (i.e. we do not need a continuous connection to the client), we can do 95% of this on a (much cheaper/easier) service such as Google Cloud Run.
This also works with the muggle containers, so there should be very little extra DevOps work.
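Deploying the existing container there is essentially a one-liner with the gcloud CLI; a sketch with placeholder project, region, and image values:

```shell
# sketch: deploy the (muggle-based) runtime container to Cloud Run
# (PROJECT_ID, region, and image tag are placeholders)
gcloud run deploy metacheck \
  --image "gcr.io/PROJECT_ID/metacheck:latest" \
  --platform managed \
  --region europe-west3 \
  --memory 1Gi
```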

internationalise landing page

The whole multilingual setup is a bit of a mess currently.
There is, for example, currently no easy way to have a landing page in German without English artefacts.
The sites are also not properly reported as "translations" of other sites.

It might actually just be easiest to do all of this in English.

render asynchronously

Rendering the email and other expensive operations should run asynchronously, so as not to block the Shiny session.
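In shell terms, the idea is just moving the expensive step into a background process; in Shiny itself, the promises/future packages would be the idiomatic route (an assumption on our part, not something decided here):

```shell
# toy sketch: do the expensive render in a background process so the
# foreground (standing in for the user's session) stays responsive
render_report() { sleep 1; echo "report rendered"; }  # stand-in for rmarkdown::render()
render_report &          # kick off in the background
echo "session still responsive"
wait                     # in Shiny, a promise callback would fire here instead
```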

refactor ELT + analysis codebase

this is a bit of a big issue, but just to keep track of what should be factored out:

  • the rmarkdown template (for the mail) strictly should not do any work, but should merely define the presentation of results (i.e. the order of tables and perhaps some commentary).
    • Among other things, the rmarkdown template currently creates the excel output, which makes this quite brittle (and required the hack-fix for #44).
    • the substantive documentation in the template (i.e. what this or that means, not just "hello" boilerplate) should live in the roxygen2 documentation for the respective functions.
      We can then figure out a way to dynamically pull this into the rmarkdown template.
      (I know that it's easily possible to pull random rmarkdown chunks into the roxygen docs, but it would be even more expressive to do it the other way around -- will have to investigate.)
      The logic here is that this kind of info really belongs with the functions, where it can be maintained together, tested, etc.
    • no dependencies in rmarkdown templates, or at least not without programmatically rendering the template as part of the check.
      If there is a missing dependency, as in https://github.com/subugoe/metacheck/actions/runs/413556533, this creates a pretty thorny failure mode, because the missing dep isn't caught by R CMD check.
  • simply no R code that isn't a function and isn't covered by at least R CMD check and friends, ideally test() as well.
    tbc.
    tbc.

add legal disclaimer

@njahn82 some (IANAL) thoughts on the legal issues just raised:

  1. I'd hope that we can somehow rely on the MIT license (which the package is also under) in using the web UI.
    That expressly excludes any kind of warranty or guarantees, and has been thoroughly tested to that effect.
  2. If we can retrieve and store the RORs for all/most users, we might drop the email address as soon as we've sent out the email.
    I'd imagine that in that situation, we'd only store RORs and DOIs, neither of which is personal information, and thus perhaps the GDPR regime wouldn't even apply.

change name

Can we change the name @njahn82? (Sorry, bit of a pet peeve).

This covers non-hybrid publications as well, correct?

Would just metacheck work? It's short, available::available() everywhere, and covers the core of what it's doing.

I know this isn't super important, but it's easiest to do this at the beginning, before we have to change it in a bunch of places.

pass on secrets to azure

Azure secrets are currently pasted manually into portal.azure.com; that's not ideal.
Might look into Azure Key Vault or something similar.
