
Introduction

ClearlyDefined, defined.

This repo holds the docs, artwork, and other organizational content in support of ClearlyDefined.

Contributing

This project welcomes contributions and suggestions, and we've documented the details in how to get involved.

The Code of Conduct for this project details how the community interacts in an inclusive and respectful manner. Please keep it in mind as you engage here.

Website

This website is built using Docusaurus, a modern static website generator.

Installation

$ yarn

Local Development

$ yarn start

This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.

Build

$ yarn build

This command generates static content into the build directory and can be served using any static content hosting service.

Deployment

Using SSH:

$ USE_SSH=true yarn deploy

Not using SSH:

$ GIT_USER=<Your GitHub username> yarn deploy

If you are using GitHub Pages for hosting, this command is a convenient way to build the website and push to the gh-pages branch.

Contributors

adrian-sufaru, dabutvin, daniellandau, dependabot[bot], disulliv, fredrick-tam, geneh, grvillic, iamwillbar, ignacionr, jamiemagee, jeffmcaffer, jeffmendoza, jeffwilcox, jpeddicord, jsoref, larainema, michaeltsenglz, mkeating, moranthomas, mpcen, nellshamrell, pnibakuze-ms, qtomlinson, sebastianwolf-sap, snyk-bot, storrisi, teju-manchenella, thomashintz, tmarble


Issues

Separate license info in curation and summarization

Create a distinct place in the curation/summary for "licensed" info as separate from "described" data. The described data is info about the project itself (CLA? code of conduct?) and file groupings (e.g., tests, build tools, ...) that can be used for filtering.

Add schema validate to curations

Validate schema

  • of proposed curation when a PR is opened/modified (e.g., the status check)
  • of the resultant summary after applying the proposal (also in the status check)
  • in the web ui (separate issue for that)
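
As a rough illustration, validation in the status check could be as simple as compiling a curation schema with Ajv and reporting any errors. The schema fields below are placeholders, not the project's actual curation schema.

const Ajv = require('ajv')

// Illustrative stand-in for the real curation schema
const curationSchema = {
  type: 'object',
  properties: {
    described: { type: 'object' },
    licensed: {
      type: 'object',
      properties: { declared: { type: 'string' } }
    }
  },
  additionalProperties: false
}

const ajv = new Ajv({ allErrors: true })
const validateCuration = ajv.compile(curationSchema)

function checkCuration(curation) {
  if (validateCuration(curation)) return { valid: true }
  // Surface Ajv's error list so the status check (or the web UI) can display it
  return { valid: false, errors: validateCuration.errors }
}

console.log(checkCuration({ licensed: { declared: 'MIT' } }))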

write aggregator

The aggregator takes summarized information from multiple sources and aggregates it according to a precedence model. For now the precedence can be really simple; this is more about the plumbing.
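
A minimal sketch of that simple precedence model, assuming a fixed tool order and a flat field-level merge (both placeholders for the real model):

// Aggregate summarized results from multiple tools using a fixed precedence order
const precedence = ['scancode', 'clearlydefined', 'curation'] // lowest to highest priority

function aggregate(summaries) {
  // summaries: { toolName: summaryObject }
  return precedence.reduce((result, tool) => {
    const summary = summaries[tool]
    if (!summary) return result
    return { ...result, ...summary } // later (higher precedence) entries overwrite earlier ones
  }, {})
}

const aggregated = aggregate({
  scancode: { license: 'MIT', copyright: 'Copyright Foo' },
  curation: { license: 'Apache-2.0' }
})
console.log(aggregated) // { license: 'Apache-2.0', copyright: 'Copyright Foo' }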

Build updated architectural diagrams

The architectural design of ClearlyDefined has progressed significantly from when we were first thinking about this project. We want to make sure we have updated architectural diagrams to present to potential adopters, code contributors, and for including in presentations.

@jeffmcaffer @iamwillbar

Rename everything

We need to rename everything in the code and get the code hygiene up now that we've moved along conceptually and in development.

Add links to definition that point to source, tools, ...

Based on the _metadata links coming out of the crawler, add some links to the component definition as it is being summarized and curated. Note, links likely should not be curated. Examples are:

  • source
  • tool results
  • curation

The "links" here are not necessarily clickable (though they might be rendered that way). Rather they are a way to convey relationships. So, for example, someone looking to see what tools have been run for a component (or indeed what tools' output went into a definition) should be able to inspect the links and see what's gone on.

Need to think a little about the format of the data. Likely need

  • name
  • type
  • data

as a minimum.
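
For illustration only, a links array following that name/type/data minimum might look like the following; the type values, coordinates, and URLs are assumptions, not a settled format.

// All values below are illustrative
const links = [
  { name: 'source', type: 'git', data: 'https://github.com/jquery/jquery/tree/3.3.1' },
  { name: 'scancode', type: 'harvest', data: 'npm/npmjs/-/jquery/3.3.1/scancode/2.9.2' },
  { name: 'curation', type: 'pull-request', data: 'https://github.com/clearlydefined/curated-data/pull/1' }
]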

implement "no auth" path for offline

The new GitHub auth story is slick but requires you to be online for much of the system to work. For example, you can't call the API or use the website effectively without being online. Even when online, a slow connection (e.g., on an airplane over the Atlantic) can make things unpredictable.

Propose that for localhost setups we allow "no auth" (perhaps as the default?). That would trickle down into the teams, etc., that we'll use for permissions. Those operations that truly need auth will necessarily have to be online, so we should be good.

Add origins/maven endpoint

To support the user queuing Maven artifacts in the UI, the service needs to expose two origins/maven endpoints, one that aids in discovering components (group/artifact) and another that serves up the versions of a given group/artifact.

Depending on the Maven API, this may be a lot like the GitHub one where the org (i.e., group) search is served up by one GitHub API and the repo (i.e., artifact) search is served up by another.
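
A rough sketch of what the two endpoints could look like backed by Maven Central's search API; the route shapes, query parameters, and response handling are assumptions for illustration.

const express = require('express')
const app = express()

const SEARCH = 'https://search.maven.org/solrsearch/select'

// Suggest group/artifact pairs matching a partial name
app.get('/origins/maven/:name', async (request, response) => {
  const url = `${SEARCH}?q=${encodeURIComponent(request.params.name)}&rows=20&wt=json`
  const result = await (await fetch(url)).json()
  response.send(result.response.docs.map(doc => `${doc.g}/${doc.a}`))
})

// List the known versions of a given group/artifact
app.get('/origins/maven/:group/:artifact/revisions', async (request, response) => {
  const query = encodeURIComponent(`g:"${request.params.group}" AND a:"${request.params.artifact}"`)
  const url = `${SEARCH}?q=${query}&core=gav&rows=100&wt=json`
  const result = await (await fetch(url)).json()
  response.send(result.response.docs.map(doc => doc.v))
})

app.listen(3000)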

set up harvesting execution

Set up the OSS Review Toolkit to

  • run on some repo(s)
  • take the output and call the service API to store as harvested data
  • Kick off a PR?

Add webhook in service to track crawler changes

As the crawler produces output it will trigger webhooks (see clearlydefined/crawler#44). On the service we need a webhook implementation that takes the POST'd event and "does the right thing".

In the first instance the action will be to recompute the cached result for the newly scanned component. So, the webhook will say "wrote /npm/npmjs/foo/1.0.0/scancode" or some such. The webhook will call ComponentService.computeAndStore for the triggering NPM (in this case). That will recompute the fundamental result and store it for future retrieval.

Note: When this is done, we need to remove the call to computeAndStore in the get method of the ComponentService
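
A minimal sketch of that webhook, assuming the event body carries the written path (e.g., { path: 'npm/npmjs/-/foo/1.0.0/scancode' }) and that a ComponentService with computeAndStore is available; both shapes are assumptions.

const express = require('express')
const app = express()

// Stub standing in for the real ComponentService
const componentService = {
  computeAndStore: async coordinates => console.log('recomputing definition for', coordinates)
}

app.post('/webhook/crawler', express.json(), async (request, response) => {
  try {
    // Drop the trailing tool segment (e.g. '/scancode') to get the component coordinates
    const coordinates = request.body.path.split('/').slice(0, 5).join('/')
    await componentService.computeAndStore(coordinates)
    response.status(200).end()
  } catch (error) {
    response.status(500).send(error.message)
  }
})

app.listen(3000)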

sort out source `path` story

Some source locations need a path within a repo. We talked in theory about how this would be described, but that has fallen by the wayside in the implementation and in the pathing/spec work floating around.

Need to figure out the right data model but this should be driven from real scenarios.

story for curating arrays

Need a way to express changes to arrays of values like copyright holders. In particular,

  • addition
  • removal
  • update (if not expressed as a remove plus an add)
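
One possible shape for such edits, sketched against copyright holders; the operation names and application logic are illustrative, not a decided format.

// Apply a list of array-edit operations to an array of values
function applyArrayCuration(values, operations) {
  let result = [...values]
  for (const op of operations) {
    if (op.add) result.push(op.add)
    if (op.remove) result = result.filter(value => value !== op.remove)
    if (op.replace) result = result.map(value => (value === op.replace.from ? op.replace.to : value))
  }
  return result
}

const holders = applyArrayCuration(
  ['Copyright Foo Inc', 'Copyright Bar'],
  [
    { remove: 'Copyright Bar' },
    { replace: { from: 'Copyright Foo Inc', to: 'Copyright Foo, Inc.' } },
    { add: 'Copyright Baz' }
  ]
)
console.log(holders) // ['Copyright Foo, Inc.', 'Copyright Baz']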

Tests for the service

define the testing approach for different elements of the service

  • REST apis
  • plain function code
  • configuration
  • providers

Write status check to replace build

We need a richer experience for curators than we can get with a CI build log etc. Write a simple status check in the service that validates the PR and leaves a link that takes the user to a place on clearlydefined.io where they can see content.

We can incrementally make that content better.

Consider getting "source" data when getting definition data

Right now summarization/curation happens for the precise entity that has been requested, e.g., just the npm package or just the GitHub repo. In the case of a package where we know (and have processed) the source, we should allow for the source's data to also be included in the package summary.

This is one of the value points of ClearlyDefined. We build up a network of information about entities and can aggregate that info to be more informative.

In implementation terms this means altering the definitions route execution to

  • consider a param that indicates the mode (or some such) as to whether or not to include the source data
  • optionally overlay the package and source data in some order TBD

This may also benefit from allowing the definition to qualify its source location info with a set of filters (dimensions) that would then be applied to the summarization of the source. This way, the package can express what parts of the source it includes. This is akin to the path element of the sourceLocation information.
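
A rough sketch of the overlay idea, assuming a hypothetical expand=source option, a getDefinition lookup, and a sourceLocation with type/provider/namespace/name/revision; the merge order (package wins) is also just an assumption.

// Stub data standing in for the real definition store
const store = {
  'npm/npmjs/-/foo/1.0.0': {
    described: { sourceLocation: { type: 'git', provider: 'github', namespace: 'bar', name: 'foo', revision: 'abc123' } }
  },
  'git/github/bar/foo/abc123': { licensed: { declared: 'MIT' } }
}
const getDefinition = async coordinates => store[coordinates] || {}

async function getExpandedDefinition(coordinates, { expand } = {}) {
  const definition = await getDefinition(coordinates)
  if (expand !== 'source' || !definition.described || !definition.described.sourceLocation) return definition
  const source = definition.described.sourceLocation
  const sourceCoordinates = [source.type, source.provider, source.namespace || '-', source.name, source.revision].join('/')
  const sourceDefinition = await getDefinition(sourceCoordinates)
  // Overlay: package-level data takes precedence, source data fills in the rest
  return { ...sourceDefinition, ...definition }
}

getExpandedDefinition('npm/npmjs/-/foo/1.0.0', { expand: 'source' }).then(result => console.log(result))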

Summarize API should be able to produce valid SPDX files

Input:

  • Optional param in the summarize API request to request SPDX format.

Output:

  • Valid SPDX file with minimal data points (source URL, license, copyright) plus the minimal set of required fields to be a valid SPDX file.

Side-notes:

  • Thomas is working on minimizing the set of required fields
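
Purely as an illustration of scale, a minimal SPDX 2.x output might carry little more than the fields below; the exact required-field set is what's being minimized per the side note, and all values here are made up.

// Illustrative minimal SPDX document for a single package
const spdxDocument = {
  spdxVersion: 'SPDX-2.1',
  dataLicense: 'CC0-1.0',
  SPDXID: 'SPDXRef-DOCUMENT',
  name: 'jquery-3.3.1',
  documentNamespace: 'https://clearlydefined.io/spdx/npm/npmjs/-/jquery/3.3.1', // assumed namespace scheme
  creationInfo: { created: '2018-03-01T00:00:00Z', creators: ['Tool: ClearlyDefined'] },
  packages: [
    {
      SPDXID: 'SPDXRef-Package',
      name: 'jquery',
      versionInfo: '3.3.1',
      downloadLocation: 'https://registry.npmjs.org/jquery/-/jquery-3.3.1.tgz',
      licenseConcluded: 'MIT',
      licenseDeclared: 'MIT',
      copyrightText: 'Copyright JS Foundation and other contributors'
    }
  ]
}

console.log(JSON.stringify(spdxDocument, null, 2))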

Write a ClearlyDefined cache for review toolkit

When the review toolkit downloader runs it consults a cache to see if the desired component has already been scanned. Right now that works against Artifactory. Need one that looks at the ClearlyDefined GitHub harvested data repo

Permissions for who can queue a harvest request

Like the curator model, there should be protection on who can queue a harvest request. This can be managed by membership in a GitHub team or some such and require GH auth when the request is made

Create Press Release mockup

As a means of scoping the work for the MVP, write a fake press release that announces the project, extols its virtues, and has a call to action to both consume and contribute.

implement harvest PUT and GET

Need to move away from GitHub as the harvest store. There are too many limitations and it is not really needed as long as we have a way of browsing and exposing the harvested data to the community

API key and rate limiting design

Need something simple.

Consider CloudFlare as we are already using their infrastructure to front for *clearlydefined.io. @jpeddicord suggests using one of the Express rate-limiter middlewares; that might be easier to manage, at least in the first instance.
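
A minimal sketch of that middleware route, using the express-rate-limit package and keying on an API key header; the header name and limits are assumptions.

const express = require('express')
const rateLimit = require('express-rate-limit')

const app = express()

app.use(
  rateLimit({
    windowMs: 15 * 60 * 1000, // 15-minute window
    max: 100, // requests per window per key (or per IP when no key is sent)
    keyGenerator: request => request.get('x-api-key') || request.ip
  })
)

app.listen(3000)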

ClearlyDefined badges for projects

  • Similar to Shields.io - have an endpoint where the URL contains your project details and that serves up a project image/SVG you can put in your markdown.
  • e.g. clearlydefined.io/badge/npm…
  • The badge works out if you’ve met the criteria and serves up the appropriate color badge (need to sort out what the criteria are for the 3 levels)
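
A sketch of how that endpoint might work by redirecting to a Shields.io static badge; the route, the score lookup, and the three-level thresholds are all assumptions.

const express = require('express')
const app = express()

// Placeholder for the real definition score computation
const getScore = async coordinates => 75

app.get('/badge/:type/:provider/:namespace/:name/:revision', async (request, response) => {
  const { type, provider, namespace, name, revision } = request.params
  const score = await getScore([type, provider, namespace, name, revision].join('/'))
  // Illustrative mapping of the three levels to colors
  const color = score >= 80 ? 'brightgreen' : score >= 50 ? 'yellow' : 'red'
  response.redirect(`https://img.shields.io/badge/ClearlyDefined-${score}%25-${color}.svg`)
})

app.listen(3000)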

API for github/npm search

To support the UI we need APIs roughly equivalent to (though not as fully featured as) the search APIs present on github.com and npms.io. We can turn around and call those if needed.

Here the user is looking for a package to queue. We want them to input good data, so we need auto-completion etc. This is true for GitHub orgs, repos, and commits, and npm scopes, names, and versions. Of course, there will be more package managers in the future.

We should also consider (though not necessarily implement right now) that the response will include ClearlyDefined info such as "yup, we've already run these tools on that version" that would further help guide the user.

Summarized data structure

Define (and implement) the Summarized data format. Consider building on the structure that Philippe talked about yesterday in MineCode.

Add batching GET operations

For various user and automation scenarios we need endpoints that can take a list of components and give back the related data (e.g., harvest results, component definitions, ...)
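
A minimal sketch of one such endpoint: a POST that takes an array of coordinates and returns a map of definitions; the route and the getDefinition stub are assumptions.

const express = require('express')
const app = express()

const getDefinition = async coordinates => ({ coordinates }) // stub for the real lookup

app.post('/definitions', express.json(), async (request, response) => {
  const coordinatesList = request.body // e.g. ["npm/npmjs/-/jquery/3.3.1", ...]
  const entries = await Promise.all(
    coordinatesList.map(async coordinates => [coordinates, await getDefinition(coordinates)])
  )
  response.send(Object.fromEntries(entries))
})

app.listen(3000)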

Clean up copyrights in summarized data

The data output from ScanCode, particularly around copyrights, is pretty noisy: random characters, duplicates that differ only by case or punctuation, Inc. vs Inc vs Incorporated, ...

This degrades the apparent quality of the data for the end user, so cleaning it up is key to showing value and building trust.

We can implement this in several different places. Ideally the upstream tools would just be less noisy but ... Current thinking is to put this in the summarizers. There is a summarizer for each tool, so each would do the work as it goes from the raw data form to the summarized/canonical form. This might be a shared bit of code that does all the simplifications, but rather than introduce a new step in the flow, consider adding it to summarization.
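
A sketch of what such a summarizer pass could do; the specific normalization rules (strip stray characters, unify the Inc variants, de-duplicate case-insensitively) are illustrative.

// Normalize noisy copyright statements and drop near-duplicates
function normalizeCopyrights(statements) {
  const seen = new Map()
  for (const raw of statements) {
    const cleaned = raw
      .replace(/[^\w\s,.()-]/g, ' ') // drop random characters
      .replace(/\b(incorporated|inc\.?)(\s|$)/gi, 'Inc.$2') // Inc vs Inc. vs Incorporated
      .replace(/\s+/g, ' ')
      .trim()
    const key = cleaned.toLowerCase() // collapse duplicates that differ only by case
    if (!seen.has(key)) seen.set(key, cleaned)
  }
  return [...seen.values()]
}

console.log(normalizeCopyrights(['Copyright Foo Inc', 'copyright foo inc.', 'Copyright Foo Incorporated ***']))
// ['Copyright Foo Inc.']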

Add the ability to filter files from summarization and have that list curatable

On looking at some initial data for jquery, it became apparent that it was very noisy due to things like .d.ts files being included and not very well understood by the scanner.

We can still have the harvesters harvest data about these files but when we summarize, we need to be able to filter these out.

Moreover, this needs to be curatable so that people can define the filter and see the outcome, then save that and have it apply to the final output.
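
A minimal sketch of applying a curatable glob filter before summarization, using minimatch; the patterns and file-entry shape are assumptions.

const minimatch = require('minimatch')

// Drop any file whose path matches one of the curated exclusion patterns
function filterFiles(files, excludePatterns) {
  return files.filter(file => !excludePatterns.some(pattern => minimatch(file.path, pattern)))
}

const kept = filterFiles(
  [{ path: 'src/jquery.js' }, { path: 'dist/types/jquery.d.ts' }, { path: 'test/unit/core.js' }],
  ['**/*.d.ts', 'test/**'] // e.g. drop TypeScript declaration files and tests from the summary
)
console.log(kept.map(file => file.path)) // ['src/jquery.js']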

Make merged metadata available to curators

Curators looking at a status check should be able to easily see the effect of the changes (e.g., merged data) as well as have handy pointers/links to available scans and the original source repo.

Authentication for write operations

API operations that make or propose modifications to data should require authentication. We need to decide what authentication makes sense.
