research-software-ecosystem / content

A metadata commons to store research software metadata

License: Creative Commons Attribution 4.0 International

content's Introduction

Research Software Ecosystem (RSEc) contents repository

This repository contains the metadata aggregated for the Research Software Ecosystem (RSEc). The purpose of this repository is to act as a central place for the exchange of these metadata for multiple projects, including bio.tools, Biocontainers, Bioconda, OpenEBench, Debian Med, BIII.eu, etc.

Contents outline

All software metadata are in the data folder of this repository, each software package/tool being in a distinct folder, which contains multiple files. Each of these files contains the metadata regarding the software coming from a specific resource, or reformatted in a specific format.

# Example for the contents of the 'fastqc' folder:
fastqc.biotools.json # metadata for the fastqc bio.tools entry (pulled by RSEc bot)
bioconda_fastqc.yaml # metadata for the bioconda fastqc package (pulled by RSEc bot)
biocontainers.yaml # metadata for the fastqc biocontainers image (pushed by biocontainers bot)
fastqc.oeb.metrics.json  # metadata for the OpenEBench fastqc package metrics (pulled by RSEc bot)
fastqc.debian.yaml  # metadata for the Debian Med fastqc package (pulled by RSEc bot)
fastqc.bioschemas.jsonld  # metadata for the fastqc package, converted from bio.tools metadata (pulled by RSEc bot)
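
As an illustration of the layout above, here is a minimal Python sketch that reports which sources have a metadata file for a given tool. The file name patterns are taken from the fastqc example; the function names `expected_files` and `available_sources` are hypothetical and not part of the repository's scripts.

```python
from pathlib import Path


def expected_files(tool_id):
    """File names per source, following the fastqc example (assumed to generalise)."""
    return {
        "bio.tools": f"{tool_id}.biotools.json",
        "Bioconda": f"bioconda_{tool_id}.yaml",
        "Biocontainers": "biocontainers.yaml",
        "OpenEBench": f"{tool_id}.oeb.metrics.json",
        "Debian Med": f"{tool_id}.debian.yaml",
        "Bioschemas": f"{tool_id}.bioschemas.jsonld",
    }


def available_sources(data_dir, tool_id):
    """Return {source: file name} for the files that actually exist for this tool."""
    tool_dir = Path(data_dir) / tool_id
    return {
        source: name
        for source, name in expected_files(tool_id).items()
        if (tool_dir / name).is_file()
    }
```

Calling `available_sources("data", "fastqc")` from a checkout of the repository would list only the sources that currently provide a file for fastqc.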

content's Issues

CI error during biocontainers pull request

After an automatic PR submission from biocontainers to add a tool/version, the bio-tools CI failed, cf. https://github.com/bio-tools/content/runs/500476138?check_suite_focus=true

on comment-pr action:

github.GithubException.GithubException: 403 {"message": "Resource not accessible by integration", "documentation_url": "https://developer.github.com/v3/issues/comments/#create-a-comment"}

Furthermore, many workflows are executed, such as "validate debian yml files", "check biotools", etc., but they check the whole repository, not just the files/tools modified by the pull request.
As a result, the CI for a PR may fail because of an issue in another tool that is unrelated to the PR.
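
One possible way to restrict the checks is to derive the affected tool folders from the list of changed paths and validate only those. The sketch below assumes the changed paths are already available as repository-relative strings (for example from `git diff --name-only`); the function name `changed_tool_ids` is hypothetical.

```python
from pathlib import PurePosixPath


def changed_tool_ids(changed_paths):
    """Return the tool folders under data/ touched by a set of changed files."""
    tools = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Expect data/<tool folder>/<file>; anything else is not tool metadata.
        if len(parts) >= 3 and parts[0] == "data":
            tools.add(parts[1])
    return tools
```

Each workflow could then loop over this set instead of over the whole data folder.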

Consider shim for Debian Common Tool Descriptors

Post from Michael Crusoe on Debian list is below. Would a biotoolsSchema<>CTD shim be useful?

Package: wnpp
Severity: wishlist
Owner: Debian Med team [email protected]

Package name : ctdconverter
Version : 2.0
Upstream Author : WorkflowConversion
URL : https://github.com/WorkflowConversion/CTDConverter/
License : GPL-3 / Apache-2.0
Programming Lang: Python
Description : Convert CTD files into Galaxy tool and CWL CommandLineTool files

Common Tool Descriptors (CTDs) are XML documents that represent the inputs, outputs, parameters of command line tools in a platform-independent way.

CTDConverter, given one or more Common Tool Descriptors (CTD) XML files, generates Galaxy tool wrappers and Common Workflow Language (CWL) Command Line Tool v1.0 standard descriptions from CTD files.

Will assist in including CWL descriptions in the seqan-apps, openms/topp, flexbar, and lambda-align packages

Will be team-maintained by Debian Med

UI for entry creation, editing and registration with seamless GitHub integration

One of many issues around GitHub-based content management for bio.tools.

Information mapping, gap analysis & schema revision

Mapping of tool information from different projects (to be integrated) with gap analysis for anticipated information requirements, followed by revision of biotoolsSchema to support integration of new data where possible and desirable.

Bioschemas dump uses undefined `schema` prefix

The data dump includes multiple instances where the undefined prefix schema is used instead of sc. As a result, the corresponding data is generated with an incorrect IRI for two types:

schema:Organization
schema:Person

Issue caused by https://github.com/bio-tools/content/blob/master/scripts/bioschemas/biotools_to_bioschemas.py
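
A check along these lines could catch the problem in CI. It is only a sketch: it assumes the dump is JSON-LD with an inline `@context` object, looks at property keys and at `@type`/`@id` values only, and the name `undeclared_prefixes` is hypothetical.

```python
import re

# A compact IRI looks like prefix:suffix; "http://..." is excluded by the lookahead.
CURIE = re.compile(r"^([A-Za-z][\w.-]*):(?!//)")


def undeclared_prefixes(doc):
    """Return prefixes used in keys, @type and @id that the @context does not declare."""
    context = doc.get("@context", {})
    declared = set(context) if isinstance(context, dict) else set()
    used = set()

    def note(text):
        match = CURIE.match(text)
        if match:
            used.add(match.group(1))

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "@context":
                    continue
                if key in ("@type", "@id"):
                    for item in value if isinstance(value, list) else [value]:
                        if isinstance(item, str):
                            note(item)
                elif not key.startswith("@"):
                    note(key)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(doc)
    return used - declared
```

For a document whose context declares only sc, a node typed schema:Person would be reported as using the undeclared prefix schema.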

Data dump to GitHub

From @joncison on September 4, 2018 11:27

Nightly dump of all content (in XML and JSON formats?) to GitHub, as a convenience (or, at least to begin with, just a one-off dump)

Copied from original issue: bio-tools/biotoolsRegistry#355

Please change 'clustalw' ID

https://bio.tools/clustalw is a super-old deprecated version of Clustal W and Clustal X, without documentation, license, etc., just a link to a software archive.

The "current" (well still quite "mature") version of Clustal W and Clustal X is https://bio.tools/clustal2.

I improved the annotations of both of these records, but it would be very nice to fix the following, in order to avoid confusion:

Change the 'clustalw' ID to 'clustal1' (can be done immediately)
Let alternative IDs 'clustalw' and 'clustalx' redirect to https://bio.tools/clustal2 (a nice-to-have feature for users)

Note: I also added the alternative spellings of the names (e.g. ClustalW) to the descriptions of Clustal 2 and Clustal Omega, so that the correct records can be found by the full-text search. I also added the corresponding Debian and Conda packages.

rename repo

To automate some PRs, we need to fork the repo so that we can push our updates and create the PRs.
The name content is quite generic. Something like bio-tools-data would be better, to avoid conflicts with other repos in our organization and to make the purpose of the repo clearer.

where should container/package be defined?

Should we add entries to the download section of tools, with a specified type and version?

Action on tool delete

@bgruening @hmenager @joncison @matuskalas @piotrgithub1

Hi all,

This issue is regarding tool deletion from the ecosystem, especially pull requests coming from outside (e.g. from bio.tools or from forks).

It could be the case that in our content file structure we have something like:
data/tool_id/tool_id.biotools.json where this is the only file. If we delete this file then the whole folder goes as well. If this is the intended behaviour, that's fine, but we have to agree on what happens on delete.

I was thinking that, instead of deleting the file immediately, especially when changes are coming from bio.tools, we could mark tools as deleted by renaming the bio.tools files to something like .deleted.tool_id.biotools.json. This way the folder doesn't go away.

I think we can code and handle both cases, it's just that we have to agree on what we want the functionality to be. What are your opinions on this? Please tag other people.
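
For discussion, the "mark as deleted" option could look roughly like the sketch below. The `.deleted.` prefix follows the naming suggested above; the function name `mark_deleted` is hypothetical and nothing like it exists in the repository yet.

```python
from pathlib import Path


def mark_deleted(biotools_file):
    """Rename <tool_id>.biotools.json to .deleted.<tool_id>.biotools.json in place."""
    path = Path(biotools_file)
    target = path.with_name(".deleted." + path.name)
    path.rename(target)
    return target
```

Because the renamed file stays in the tool folder, the folder remains in the repository even when the bio.tools file was the only one in it.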

biotoolsID handling in GitHub ecosystem and bio.tools

@bgruening @joncison @hmenager @matuskalas @piotrgithub1

I am creating this issue to discuss / consult on how we manage bio.tools IDs in both bio.tools and GitHub.

This topic is important because bio.tools is a tool ID provider which means the bio.tools IDs need to be persistent and not change without a serious reason.

Currently how I see bio.tools IDs working:
At tool creation:

In the bio.tools registration interface: suggest the biotoolsID as a URL-safe version of the tool name, but also allow the user to edit the biotoolsID field (with validations on top, of course). Once the tool has been registered and approved (see below), the ID cannot change anymore.
In GitHub: When a pull request for a new tool comes in it has to follow the basic rules for a new tool which involve having a new folder with the name:
{{biotoolsID of the new tool}} and in the new folder to have a JSON file with the name
{{biotoolsID of the new tool}}.biotools.json and that JSON file with the new tool annotations needs to contain a biotoolsID field with the same value for the biotoolsID; it also needs to contain a biotoolsCURIE field with the value:
biotools:{{biotoolsID of the new tool}} and of course a name, a homepage, a description (and other things we require)
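
The GitHub-side rules for a new tool could be checked with something like the following sketch. It covers only the rules listed above (folder name, file name, biotoolsID, biotoolsCURIE, name, homepage, description); the function name `check_new_tool` is hypothetical.

```python
import json
from pathlib import Path


def check_new_tool(tool_dir):
    """Return a list of problems found in a new tool folder (empty list means OK)."""
    tool_dir = Path(tool_dir)
    tool_id = tool_dir.name  # the folder name is expected to be the biotoolsID
    json_path = tool_dir / f"{tool_id}.biotools.json"
    if not json_path.is_file():
        return [f"missing file {json_path.name}"]

    entry = json.loads(json_path.read_text(encoding="utf-8"))
    problems = []
    if entry.get("biotoolsID") != tool_id:
        problems.append("biotoolsID does not match the folder name")
    if entry.get("biotoolsCURIE") != f"biotools:{tool_id}":
        problems.append("biotoolsCURIE is not biotools:<biotoolsID>")
    for field in ("name", "homepage", "description"):
        if not entry.get(field):
            problems.append(f"missing required field: {field}")
    return problems
```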

We need to decide on how we handle the pull request merging, if one of the core team users needs to approve/review the new entry, or perhaps the user who created the entry needs to review the PR or if the tool creation PR gets automatically merged after all validations (I would not go for that).

My opinion is that since we already allow the tool into the bio.tools database immediately, we can reserve the right to approve new tools before they get added to the GitHub side.

The initial addition to the bio.tools database should be done with a "pending approval" flag, which should be resolved on the GitHub side as either an approval or a rejection of the tool.

I think the approval or rejection should mainly focus on the fact that the tool is indeed an actual tool and if the id of the tool is acceptable (i.e. not completely different from the name or having some other weird value; I think we will rarely encounter a situation where a tool has an unacceptable id). Everything else about the tool annotation can be fixed later (e.g. wrong toolType, missing license etc)

At tool update

bio.tools will ignore any changes to the biotoolsID and biotoolsCURIE and will not pass these changes to GitHub. Perhaps bio.tools should even report a validation error if an ID change is attempted.
Similar to bio.tools, GitHub should not allow changes to biotoolsID-related values (either in a file, or the folder name, or anywhere), at least not in the files concerning bio.tools. If for some reason an ID change is needed, then we create a whole new entry (and the corresponding file structure) and then delete the old one. Given that there might be other files in the parent folder from other providers (e.g. from Bioconda, OpenEBench etc.), we need to keep them and allow the other providers to update their entries. We can also discuss keeping the old entry/ID, redirecting it to the new ID in bio.tools, and flagging this situation in GitHub somehow.

Please give your opinions on the above and also tag others.

Weird folder with Biocontainers JSON file

@hmenager @osallou
Hi guys,

There is a folder:
https://github.com/bio-tools/content/tree/master/data/CCMetagen

with a file called
https://github.com/bio-tools/content/blob/master/data/CCMetagen/biocontainers.biotools.json

which is not a bio.tools file. I assume this CCMetagen folder, which is different from
https://github.com/bio-tools/content/tree/master/data/ccmetagen

is just some folder that got left behind. Do we need this? I'm also asking because it contains the file with the name biocontainers.biotools.json which is the naming convention for bio.tools files.

Thanks,
Hans

Public Galaxy Server and Tool Metadata

Had a conversation with @matuskalas at 2 consecutive Galaxy CoFests about improving the search functionality of the Galaxy Platform Directory. In between those two conversations I had a conversation with my bosses about increasing the amount of information related to Galaxy in Bio.Tools and GalaxyCat.

It became obvious while talking with Matúš at this year’s CoFest that these goals complement each other nicely.

This issue could be created in many places:

Bio.Tools, any number of repos (picked Content)
GalaxyCat
GalaxyProject, any number of repos (including galaxy, galaxy-hub,...)

Eventually there may be pull requests in many of those places (including ToolDog).

Goals

Increase presence of public Galaxy servers and their tools in Bio.Tools and GalaxyCat.
Increase Awareness of Bio.Tools and GalaxyCat in the Galaxy Community.
Simultaneously, make the Galaxy Platform Directory contain more useful and searchable information about those platforms.

How?

That’s what this issue is here to discuss. One item seems uncontroversial to me:

We should use ontology terms to do this. EDAM and a Taxa ontology seem most useful, but others also have obvious applications. For example, RepeatExplorer is all about repetitive elements, and that suggests the Sequence Ontology.

And a starting smattering of open questions:

Which ontologies? Any ontology that is available in a lookup service, or only a core set of ontologies?
How do we add new ontologies, either to our limited list, or that aren’t in our selected lookup service? For example the Climate Workbench server may use ontologies that aren’t in biology-centric lookup services.
Where and how do we store and access server ontology information? On the server itself, seems like a good idea, but adding this to metadata in the hub might be a good fallback.
A fair amount of work (I think) has gone into supporting EDAM annotation of individual tools. How can we encourage tool wrappers to actually use this functionality?
How do we make the Galaxy, and larger bioinformatics communities aware of these resources, once they are updated?

Preserving full-strength validation & error reporting

One of many issues around GitHub-based content management for bio.tools.

Could we add a link to Dmitry's dashboard?

It would be nice to have it linked from somewhere near the top of the README.

I couldn't find it either via OpenEBench or via web search engine 🙁

@redmitry

Rename main branch to 'main'

Also in https://github.com/bio-tools/ repos

Technical issues around GitHub-based content management (placeholder)

From @joncison on December 18, 2018 13:42

Use this thread to list (refer to) specific actions, and for general technical discussion, around the mooted GitHub-based content management architecture (see bio-tools/biotoolsRegistry#355 and bio-tools/biotoolsRegistry#242)

Copied from original issue: bio-tools/biotoolsRegistry#399

bio.tools IDs in Bioconda not in bio.tools

Below is a list of bio.tools IDs from Bioconda recipes that do not point to any bio.tools entries. Some are fixable, in which case I added the new ID in parentheses (needs checking), and some would require adding tools to bio.tools.

bam2fasta (None)
bold-identification (None)
chorus2 (None)
extract_fasta_seq (None)
faqcs (None)
gem_mapper (gemmapper,gem3)
hla-la (None)
ivar (None)
krakenhll (None)
ksw (None)
magicblast (None)
malt (None)
malva (None)
manta9235 (manta_sv)
mgkit
miniax
mutmap
ngmerge
pgcgap
phame
piret
popera
qtlseq
roast
sepp
sepp-refgg138
sepp-refsilva128
slimm
super_distance
taxonomy_ranks
tba
wham6216 (wham-variants)
womtool
xhmm

CI should skip non "biotools" entries

CI in Travis is triggered for some checks, but they do not apply to biocontainers and similar entries (at least for now).
So if commits only touch biocontainers.json and similar files, the CI could skip the tests. Or should bots make commits with a [skip ci] message?
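
The first option could be as simple as the sketch below, which decides from the changed paths whether the bio.tools checks need to run at all. It assumes that only files ending in `.biotools.json` are relevant to those checks; the function name is hypothetical.

```python
def needs_biotools_checks(changed_paths):
    """True if at least one changed file is a bio.tools metadata file."""
    return any(path.endswith(".biotools.json") for path in changed_paths)
```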

Prefix directories for scalability

It may well be too late, but looking at this for the first time after a discussion with @bgruening, if other communities are likely to join and the list grows further (which would be great! 🎉), I wonder if adding a top-level prefix directory (/1/1000genomes) to reduce the total number of listings in a single directory by a factor of ~36 has been discussed. (Currently there are 38543 entries in the data directory.)

See for example the package links in Debian:

http://ftp.de.debian.org/debian/pool/main/a/acct/acct_6.6.4-5+b1_amd64.deb
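
A minimal sketch of the mapping, assuming the prefix is simply the first character of the folder name in lower case (26 letters plus 10 digits gives the factor of ~36). The function name `prefixed_path` is hypothetical.

```python
def prefixed_path(tool_id):
    """Map a tool folder name to <first character>/<tool folder>."""
    if not tool_id:
        raise ValueError("tool_id must not be empty")
    return f"{tool_id[0].lower()}/{tool_id}"
```

With this scheme, 1000genomes would live under data/1/1000genomes.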

Data serialisation and transformation

One of many issues around GitHub-based content management for bio.tools.

Remove EDAM term labels before upload in runbiotools

The term labels should be removed because outdated labels create errors, e.g.:

    "topic": [
        {
            "term": "Immunoproteins, genes and antigens",
            "uri": "http://edamontology.org/topic_2830"
        },

For this term the label in the latest version of EDAM has changed, creating an error when we upload the tool:

ERROR:root:error while uploading ../content/data/swehla/swehla.biotools.json (status 400): {"topic":[{"general_errors":["The term does not match the URI: Immunoproteins, genes and antigens, http://edamontology.org/topic_2830."]},{},{},{}]}
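
One way to do this before upload is sketched below: walk the entry and drop the `term` key wherever the sibling `uri` points to EDAM, so that only the URI is sent. This assumes the bio.tools API accepts EDAM annotations that carry only a `uri`, which should be verified; the function name is hypothetical.

```python
EDAM_PREFIX = "http://edamontology.org/"


def strip_edam_labels(node):
    """Return a copy of the entry without 'term' labels next to EDAM URIs."""
    if isinstance(node, dict):
        is_edam = str(node.get("uri", "")).startswith(EDAM_PREFIX)
        return {
            key: strip_edam_labels(value)
            for key, value in node.items()
            if not (is_edam and key == "term")
        }
    if isinstance(node, list):
        return [strip_edam_labels(item) for item in node]
    return node
```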

data integration: augmenting bio.tools

One of many issues around GitHub-based content management for bio.tools.

Entry ownership and editing rights (preservation of)

One of many issues around GitHub-based content management for bio.tools.

Committing bioschemas right after creating

We might need to put more work into this GitHub action. It creates bioschemas on every pull request and tries to push them to the origin HEAD. It then gets a 403 from GitHub, because the actions bot running for the PR owner doesn't have enough rights to push. I guess it would be better to create new pull requests with the bioschemas on every push. With auto-generated PRs we would have more control over the situation, and we could clearly see if something goes wrong.
Ping @bgruening @matuskalas @hmenager

What does spread DOIs do?

@bgruening @OlegZharkov

Excuse my ignorance, but I get emails from github-actions bot about spread dois. I know @OlegZharkov worked on this at the Freiburg hackathon, but I see it looks at bioconda yaml files, debian files and also bio.tools json files. What does it do with the bio.tools json files? Does it add DOIs where they are not available or what exactly?

Thanks

bioschemas generation folder

There is a problem with some tools, where the bioschemas file is not generated in the same folder as the bio.tools file. For example, the SPROUTS file is generated in data/SPROUTS (https://github.com/bio-tools/content/tree/master/data/SPROUTS) instead of data/sprouts (https://github.com/bio-tools/content/tree/master/data/sprouts).

I would advocate the following:

agree that all tool folders should be all-lowercase,
change the bioschemas generation script (see line below) so that the folder is all-lowercase.

https://github.com/bio-tools/content/blob/efb0e789a6610f0cafff44551fa36111eb840de9/scripts/bioschemas/biotools_to_bioschemas.py#L318
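
To find the folders already affected, a small check like the following sketch could be run over the names in the data folder. It only groups names that differ by case; the function name `case_collisions` is hypothetical.

```python
from collections import defaultdict


def case_collisions(folder_names):
    """Group folder names that are identical apart from upper/lower case."""
    groups = defaultdict(list)
    for name in folder_names:
        groups[name.lower()].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```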

Tools ecosystem processes

This table is meant to document the various processes (workflows) taking place in this repository:

| Source name | Import | Validation | Cross-link | Report | Publish |
| --- | --- | --- | --- | --- | --- |
| bio.tools | biotools-import.yaml | biotools-testrunregistry.yml | doi-load.yaml | biotools-pullrequest-analyzer.yml | bioschemas-build.yml bioschemas-build-dump.yml |
| bioconda | bioconda-import.yaml | | | | |
| biocontainers | push from biocontainers | | | | |
| OpenEBench metrics | openebench-git-populous | | | | |
| Debian Med | debianmed-import.yml | debian-yaml-validator.yaml | doi-load.yaml | | |
| BIII | biii-import.yaml | | | | biii-import.yaml |
| Global | | | | report.yaml | |

Exclude some biotoolsIDs from id checking

@matuskalas @bgruening @hmenager @joncison
Please provide some links to the resources that have biotoolsIDs which cannot be changed, so when I go through and verify the existing biotoolsIDs I can exclude these from the check

Thanks

Deal with deleted content in a provider registry

Related to PR #594

Validation of IDs as part of CI in TE
Bio.tools, OEB, BIII, etc. PRs deleting old records! Work needed in those resources to keep track of deletions!
TE CI should purge empty directories
I would also love if TE CI maintains a log / TSV / dashboard that would show deleted records (maybe also added and updated)

Issues with 3 Debian links to bio.tools

In https://blends.debian.org/med/tasks/bio
for the tools:

Belvu
Dotter
Blixem
each link to bio.tools points to:
https://bio.tools/{belvu,blixem,dotter}

the html code is:

<span class="registry_biotools"><a href="https://bio.tools/{belvu,blixem,dotter}">Bio.tools</a><span class="registry">

The same applies to the other registries for the above 3 entries.

Managing the curation maintenance burden (pull requests)

One of many issues around GitHub-based content management for bio.tools.

Create test issue for ecosystem

Autogenerated issue, ignore

bio.tools database update mechanism

One of many issues around GitHub-based content management for bio.tools.

Checking and forbidding "citation explosion"

(This issue comes from a consensus discussion incl. @bgruening @OlegZharkov )

By "citation explosion", we mean that an article (or another DO) is used as a (primary) citation in way too many tools.

This occurred e.g. in some Galaxy tools annotated with a generic Galaxy article DOI, or some Bioconductor packages annotated with a generic Bioconductor article DOI. The records for the overall workbenches | suites | collections should be annotated with these article DOIs, but not their member tools.

Note: In bio.tools, we can annotate a publication/DOI at the level of a workbench or toolkit, but we should consider if we want to allow it also at the level of (arbitrary) collections. Applicable when the representation of collections is strengthened. It would surely make many maintainers of collections very happy (think also about ELIXIR Service Bundles, or community efforts like Debian Med etc.). @hansioan @joncison @jvanheld

Another kind of occurrence, applying to bio.tools, is in multiple deployments (services) of popular tools (e.g. again Galaxy servers, or sequence alignment & search tools). In that case it would be ok to allow citation of the generic tool article DOI, but it shouldn't be a 'Primary' publication, and there definitely should be a relation to the actual tool (which then has the citation stated).

There are many reasons why this should not be allowed: noise (incl. worsened search & info integration), multiplication of information, not fair, getting high metrics from different-level publications (altmetrics, cit counts, ...) that isn't about this particular tool or service, etc.

It's not clear yet what a good cut-off for "too many" should be, but most likely something greater than 4 and MUCH smaller than 10, when we speak in terms of bio.tools records. In the case of Bioconda or Debian packages, the number could be a bit bigger, up to double at most (more granularity of src pkgs, but no deployed services).

It will be the task of the CI in bio-tools/content to uncover such inappropriate "citation explosion", and warn everyone that curation is needed. The citations over the yet-to-be-found cut-off could also be automatically removed, or in any case at least ignored when integrating. (If auto-removed, then curation will have to add them back to the one or few records where they legitimately belong. If not auto-removed, curation will have to go through the reported lists and remove them everywhere except where they actually belong.)

The cut-off needs to be explicitly documented for the users, and part of the checks of each record ("Error: This citation has already been used X times as 'Primary'.")
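
The counting part of such a check could be sketched as follows. The cut-off is deliberately left as a parameter because it has not been decided. The sketch assumes bio.tools entries with a `publication` list whose items carry `doi` and `type` (either a string or a list of strings); the function names are hypothetical.

```python
from collections import Counter


def primary_dois(entry):
    """Yield the lower-cased DOIs that an entry cites as 'Primary'."""
    for pub in entry.get("publication", []):
        types = pub.get("type") or []
        if isinstance(types, str):
            types = [types]
        if pub.get("doi") and "Primary" in types:
            yield pub["doi"].lower()


def exploded_citations(entries, cutoff):
    """Return {doi: number of records} for DOIs used as 'Primary' in more than cutoff records."""
    counts = Counter(doi for entry in entries for doi in set(primary_dois(entry)))
    return {doi: n for doi, n in counts.items() if n > cutoff}
```

The resulting mapping could feed both the warning for curators and the per-record error message proposed above.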

huge pending PRs

Hi,
We have many pending PRs injected by github-actions, which will soon make handling PRs confusing.

💌 Files from Debian 🐞

In the awesome YAMLs from Debian - despite the highly appreciated perfectionism - we found a couple of 🐞🐛. @smoe if/when you have a bit of time for hacking the "yamlDump" again, it'd be super lovely if you could test the following records/phenomena with the edam.sh and edamJson2biotools.py? e^6 thanks!! 🚀🙏🏽

beast looks like a cluttered record from beast-mcmc and beast-mcmc2 deb src packages, resulting in invalid YAML

description: >
  BEAST is a cross-platform program for Bayesian MCMC analysis of molecular
  sequences. It is entirely orientated towards rooted, time-measured
  phylogenies inferred using strict or relaxed molecular clock models. It
  can be used as a method of reconstructing phylogenies but is also a
  framework for testing evolutionary hypotheses without conditioning on a
  single tree topology. BEAST uses MCMC to average over tree space, so that
  each tree is weighted proportional to its posterior probability. Included
  is a simple to use user-interface program for setting up standard
  analyses and a suit of programs for analysing the results.
version: 1.10.4
no new upstream version of beast-mcmc (1.x) but rather a rewritten
  version.
version: 2.6.0

The same problem for soapdenovo, deb src pkgs soapdenovo and soapdenovo2

Note 1: In these 2 cases, there is probably a reason to maintain both major versions in Debian (or isn't it?), and therefore we should consider that in bio.tools too: consider whether they should have different descriptions, and maybe also EDAM annotation (if not, keep just 1 record), plus credits, pubs, ...

Are there any more pairs of src pkgs that point to the same bio.tools record? What should be the general solution, or options, here? Any additional ideas on this issue @hmenager @bgruening ? (e.g. having 2 debian.yaml files in 1 bio-tools/content directory for the start? )
dnacopy: some YAML validators are happy, but some dislike the colon+space in R package: DNA copy number data analysis
bowtie: funny "punctuation" of function Genome indexing (Burrow-Wheeler). Ok in bowtie2. It looks like the only occurrence of this phenomenon.

Note 2: Btw., @hmenager @bgruening @OlegZharkov have the best experiences with using the ruamel.yaml python lib for creating pretty YAML files.

Curation work to ensure all entries give "canonical" tool descriptions

One of many issues around GitHub-based content management for bio.tools.

missing tools

The following is a list of tools that are missing or are not easy to map. A few of them are high-profile tools.

xref: https://github.com/galaxyproject/tools-iuc/pull/3411/files

annotatemyids
anndata
barcode_splitter
basil
bax2bam
bctools
berokka
python-bioext
biom_format
bwameth
cat
chromeister
codeml
colibread
crispr_studio
crossmap
datamash
ena-upload-cli
enasearch
export2graphlan
fargene
fastani
fastp
fastqe
fermi2
fitlong
flair
mutect2
gecko
genehunter
genrich
goslimmer
gprofiler2
graphlan
gubbins
hansel
hicexplorer
hivclustering
humann2
idr
iqtree
ivar
jq
kcalign
kma
lofreq
lorikeet
lumpy_sv
maxbin2
medaka
dreme -- meme_meme and meme_fimo exist
meme_chip
miniasm
mitobim
moabs
mothur
msaboot
multigps
nanoplot
nanopolishcomp
nonpareil
novoplasty
nudup
ococo
odgi
picrust
pipelign
pizzly
porechop
progressivemauve
proteinortho
pureclip
pyGenomeTracks
qfilt
raven-assembler
sarscov2formatter
sarscov2summary
seacr
seqwish
shasta
slamdunk
alleyoop
snp-dists
socru
spaln
spyboat
stacks2 --> is that stacks1
star fusion
structureharvester
swiftlink
tb_variant_filter
tbl2gff3
tetoolkit_tetranscripts
tetyper
tn93
transtermhp
valet
vardict_java
variant_analyzer
vegan
vg
volcanoplot
vsnp
simple_weather
xpath
zerone

https://github.com/wtsi-hpag/Scaff10X

json files should be "pretty" printed

The compact JSON form makes pull requests harder to review.

If a pull request adds, for example, a download element, the diff of the compact form shows only the full line as changed. This makes it difficult for a manual review to see the actual changes.

With pretty printing, we could see exactly which lines are affected.
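
With the Python standard library this is a one-liner; the sketch below re-serialises a compact JSON string with indentation. The indent width of 4 and the function name `pretty` are assumptions, not an agreed convention.

```python
import json


def pretty(compact_json):
    """Re-serialise a JSON document with one element per line, keeping key order."""
    data = json.loads(compact_json)
    return json.dumps(data, indent=4, ensure_ascii=False) + "\n"
```

Key order is preserved (no `sort_keys`), so converting existing files changes only the layout, not the order of the fields.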

Order uploaded tools in runbiotools to handle tool "dependencies"

This is to avoid this kind of message:

error while uploading ../content/data/protk/protk.biotools.json (status 400): {"relation":[{"biotoolsID":["There is no resource with biotoolsID:XTandem;The resource might have been deleted."]}],"editPermission":{"authors":[{"general_errors":["Specified user does not exsist: proteomics.bio.tools."]}]}}
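
The ordering could be obtained with a topological sort over the `relation` fields, as in the sketch below (Python 3.9+ for `graphlib`). It assumes each entry has a `biotoolsID` and an optional `relation` list whose items carry a `biotoolsID`; relations to tools outside the upload batch are ignored. The function name `upload_order` is hypothetical and this is not how runbiotools currently works.

```python
from graphlib import TopologicalSorter


def upload_order(entries):
    """Return biotoolsIDs ordered so that related tools are uploaded first."""
    ids = {entry["biotoolsID"] for entry in entries}
    graph = {}
    for entry in entries:
        tool_id = entry["biotoolsID"]
        deps = {
            rel["biotoolsID"]
            for rel in entry.get("relation", [])
            if rel.get("biotoolsID") in ids
        }
        deps.discard(tool_id)  # ignore self-references
        graph[tool_id] = deps
    # Raises graphlib.CycleError if two tools reference each other.
    return list(TopologicalSorter(graph).static_order())
```

Mutual relations between two tools would raise `CycleError`, so a real implementation would need a fallback for that case, for example uploading such entries without relations first and adding the relations in a second pass.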

Data integration: ensuring coverage in bio.tools

One of many issues around GitHub-based content management for bio.tools.
