PureScript Registry Development

This repository hosts the code for the PureScript Registry and its affiliated projects. If you are new to the registry and are interested in its development, then the following should be helpful:

  • The specification describes what the registry is, how it works, and its fundamental types and operations.
  • The contributor guide describes how to get started working on the registry repository.

If you are interested in the products of the registry, then you should see:

Finally, as always, package documentation is hosted on Pursuit.

Below you can find the original RFC for a PureScript registry, as it was written after the Bower registry shut down.


Problem at hand

PureScript needs a way to distribute packages. We used to rely on the Bower registry for that, but this is not possible anymore.

Goals for a PureScript package registry

Here's a non-comprehensive list of desirable properties/goals that we'd like a PureScript registry to achieve, and that we think this design covers:

  • independent: we're coming from a situation where relying on a third-party registry has failed us, so we need something that we can control and possibly host ourselves.
  • immutable: published packages are immutable - once a version has been published, its source code is forever packaged in the tarball uploaded to our storage(s). The only exception is that unpublishing is possible for some time after publishing. This goal also stems directly from our experience with Bower, where we were not able to prevent packages from disappearing and/or being altered: since we were not storing the source anywhere, but just pointing at its original location, we had no control over it. This means that with this Registry, if your package builds today, then you're guaranteed that the packages you depend on will not disappear.
  • with content hashes, so that we're able to support independent storage backends and be sure that the content they serve is the same, and - perhaps most importantly - that its integrity is preserved over time.
  • with version bounds for packages, so that package authors can control the amount of support that they want to provide for their packages - i.e. allowing a wider range of versions for upstream dependencies means more support.
  • with a way for trusted editors to publish new versions, so that under specified conditions faulty bounds/dependencies can be corrected in a timely fashion if package authors are not around.
  • easy to publish: releasing a new version is optimized for and entirely automated.
  • package manifests are declarative: authors need not concern themselves with the "how", but just declare properties about their packages. E.g. there's no such thing as NPM's postinstall and other similar hooks.
  • no webserver: all the software running the Registry is designed in such a way that we don't need to authenticate people ourselves, nor need them to upload anything, nor need to expose any webserver in general. This greatly reduces the attack surface from a security standpoint.
  • with first class support for Package Sets: see the relevant section for more info.

This section has been informed by happenings, discussions, and various other readings on the internet, such as this one and this one.

Non-goals of this design

Things we do not aim to achieve with this design:

  • all-purpose use: we are going to care about hosting PureScript packages only.
  • user-facing frontend: this design aims to provide procedures for publishing, collecting, hosting and distributing PureScript packages. The metadata about packages will be exposed publicly, but we do not concern ourselves with presenting it in a navigable/queryable way - other tools/services should be built on top of this data to achieve that.

Proposed design: "Just a GitHub Repo"

Two main ideas here:

  • the Registry is nothing more than some data - in our case tarballs of package sources stored somewhere public - and some metadata linked to the data so that we can make sense of it - in our case this GitHub repo is our "metadata storage"
  • to minimize the attack surface we will use a pull model where all sources are fetched by our CI rather than having authors upload them. In practice this means that all the registry operations will run on the CI infrastructure of this repo.

This repo will contain:

  • CI workflows that run the Registry operations
  • the source for all this CI
  • all the issues/pull-requests that trigger CI operations that affect the Registry
  • metadata about packages, such as the list of maintainers, published versions, hashes of the archives, etc.
  • the package-sets

All of the above is about metadata, while the real data (i.e. the package tarballs) will live on various storage backends.

The Package Manifest

A Manifest stores all the metadata (e.g. package name, version, dependencies, etc) for a specific version of a specific package.

Packages are expected to include in their sources a purs.json file that conforms to the Manifest schema, to ensure forwards-compatibility with future schemas. This means that new clients will be able to read old schemas, but not vice-versa. The reason why forward (rather than backward) compatibility is needed is that package manifests are baked into the (immutable) package tarballs forever, which means that any client (especially old ones) should always be able to read such a manifest.

This means that the only changes allowed to the schema are:

  • adding new fields
  • removing optional fields
  • relaxing constraints not covered by the type system

For more info about the different kinds of schema compatibility, see here

All the pre-registry packages will be grandfathered in, see here for details. You can find some examples of the Manifests that have been generated for them in the examples folder.

Note: the Location schema includes support for packages that are not published from the root of the repository, by supplying the (optional) subdir field. This means that a repository could potentially host several packages (commonly called a "monorepo").

Registry Versions & Version Ranges

The PureScript registry allows packages to specify a version for themselves and version ranges for their dependencies.

We use a restricted version of the SemVer spec which only allows versions with major, minor, and patch places (no build metadata or prerelease identifiers) and version ranges with the >= and < operators.

This decision keeps versions and version ranges easy to read, understand, and maintain over time.

Package Versions

Package versions always take the form X.Y.Z, representing major, minor, and patch places. All three places must be natural numbers. For example, in a manifest file:

{
  "name": "my-package",
  "version": "1.0.1"
}

If a package uses all three places (i.e. it begins with a non-zero number, such as 1.0.0), then:

  • MAJOR means values have been changed or removed, and represents a breaking change to the package.
  • MINOR means values have been added, but existing values are unchanged.
  • PATCH means the API is unchanged and there is no risk of breaking code.

If a package only uses two places (i.e. it begins with a zero, such as 0.1.0), then:

  • MAJOR is unused because it is zero
  • MINOR means values have been changed or removed and represents a breaking change to the package
  • PATCH means values may have been added, but existing values are unchanged

If a package uses only one place (i.e. it begins with two zeros, such as 0.0.1), then all changes are potentially breaking changes.

Version Ranges

Version ranges are always of the form >=X.Y.Z <X.Y.Z, where both versions must be valid and the first version must be less than the second version.

When comparing versions, the major place takes precedence, then the minor place, and then the patch place. For example:

  • 1.0.0 is greater than 0.12.0
  • 0.1.0 is greater than 0.0.12
  • 0.0.1 is greater than 0.0.0

All dependencies must take this form. For example, in a manifest file:

{
  "name": "my-package",
  "license": "MIT",
  "version": "1.0.1",
  "dependencies": {
    "aff": ">=1.0.0 <2.0.0",
    "prelude": ">=2.1.5 <2.1.6",
    "zmq": ">=0.1.0 <12.19.124"
  }
}

The Registry API

The Registry should support various automated (i.e. no/little human intervention required) operations:

  • adding new packages
  • adding new versions of a package
  • unpublishing a package

Adding a new package

As package authors, the only thing we need to do in order to have the Registry upload our package is to tell it where to get the sources.

We can do that by opening an issue containing JSON that conforms to the schema of an Addition.
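
As an illustration, an Addition payload might look like the following - note that the field names here are assumptions made for the sake of the example, and the linked schema is authoritative:

{
  "packageName": "my-package",
  "ref": "v1.0.1",
  "location": {
    "githubOwner": "someauthor",
    "githubRepo": "purescript-my-package",
    "subdir": null
  }
}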

Note: this operation should be entirely automated by the package manager, and transparent to the user. I.e. package authors shouldn't need to be aware of the inner workings of the Registry in order to publish a package: they should be able to tell the package manager "publish this" and be given back either a confirmation of success or failure, or a place to follow updates about the fate of the publishing process.

Implementation detail: how do we "automatically open a GitHub issue" while at the same time not requiring a GitHub authentication token from the users? The idea is that a package manager that wants to avoid handling tokens can instead generate a URL that the user can navigate to, so that they can preview the issue content before opening it. This is an example of such a link.

Once the issue is open, the CI in this repo will:

  • detect if this is an Addition, and continue running if so
  • fetch the git repo the Repo refers to, checking out the ref specified in the Addition, and considering the package directory to be subdir if specified, or the root of the repo if not
  • run the checks for package admission on the package source we just checked out. Note: package managers are generally expected to run the same checks locally as well, to shorten the feedback loop for authors.
  • if all is well, upload the tarball to the storages. Note: if any of the Storage Backends is down we fail here, so that the problem can be addressed.
  • generate the package's Metadata file:
    • add the SHA256 of the tarball
    • add the author of the release as a maintainer of the package. If that is unavailable (e.g. if a release is published by a bot), then it's acceptable to skip this and proceed anyway: the list of maintainers of a package should be curated by Trustees in any case, as it's only going to be useful for actions that require manual intervention.
  • optionally add the package to the next Package Set
  • upload the package documentation to Pursuit

The CI will post updates in the issue as it runs the various operations, and close the issue once all the above operations have completed correctly.

Once the issue has been closed the package can be considered published.

Publishing a new version of an existing package

It is largely the same process as above, with the main difference being that the body of the created issue will conform to the schema of an Update.
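
A hypothetical Update payload could be as small as the package name and the new ref to publish (again, the field names are illustrative and the linked schema is authoritative):

{
  "packageName": "my-package",
  "ref": "v1.1.0"
}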

Unpublishing a package/release

Unpublishing a version for a package can be done by creating an issue containing JSON conforming to the schema of an Unpublish.
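
As a sketch, an Unpublish payload might carry the package name, the version to unpublish, and the reason for unpublishing (field names are illustrative; see the linked schema for the authoritative shape):

{
  "packageName": "my-package",
  "version": "1.0.1",
  "reason": "Published by mistake"
}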

CI will verify that all the following conditions hold:

  • the author of the issue is either one of the maintainers or one of the Registry Trustees
  • the version is less than 1 week old

If these conditions hold, then CI will:

  • move that package version from published to unpublished in the package Metadata
  • delete that package version from the storages

Unpublishing is allowed for security reasons (e.g. if some package was taken over maliciously), but it's allowed only for a set period of time because of the leftpad problem (i.e. breaking everyone's builds).

Exceptions to this rule are legal concerns (e.g. DMCA takedown requests) for which Trustees might have to remove packages at any time.

Package metadata

Every package will have its own file in the packages folder of this repo.

You can see the schema of this file here; the main reasons for this file to exist are to track the following (a sketch of such a file follows the list):

  • the upstream location for the sources of the package
  • published versions and the SHA256 for their tarball as computed by our CI. Note: these are going to be sorted in ascending order according to SemVer
  • unpublished versions together with the reason for unpublishing
  • GitHub usernames of package maintainers, so that we'll be able to contact them if any action is needed for any of their packages
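
As a sketch covering the information above (the exact field names are defined by the schema linked earlier), such a metadata file might look like:

{
  "location": {
    "githubOwner": "someauthor",
    "githubRepo": "purescript-my-package"
  },
  "maintainers": ["someauthor"],
  "published": {
    "1.0.0": { "ref": "v1.0.0", "hash": "sha256-..." },
    "1.0.1": { "ref": "v1.0.1", "hash": "sha256-..." }
  },
  "unpublished": {
    "1.0.2": { "reason": "Published by mistake" }
  }
}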

Package Sets

As noted in the beginning, Package Sets are a first class citizen of this design.

This repo will be the single source of truth for the package-sets - you can find an example here - from which we'll generate various metadata files to be used by the package manager. Further details are yet to be defined.

Making your own Package Set

While the upstream package sets will only contain packages from the Registry, it is common to need a custom package set that contains packages that are not present in the Registry.

In this case the format in which the extra-Registry packages are specified will depend on what the client accepts.

One such client will be Spago, where we'll define an extra-Registry package as:

let Registry = https://raw.githubusercontent.com/purescript/registry/master/v1/Registry.dhall

let SpagoPkg =
      < Repo : { repo : Registry.Location, ref : Text }
      | Local : Registry.Prelude.Location.Type
      >

...that is, an extra-Registry package in Spago can point either to a local path or to a remote repository.

Here's an example of a package set that is exactly like the upstream, except for the effect package, which instead points to some repo on GitHub:

-- We parametrize the upstream package set and the Address type by the package type that our client accepts:
let upstream = https://raw.githubusercontent.com/purescript/registry/master/v1/sets/20200418.dhall SpagoPkg
let Address = Registry.Address SpagoPkg

let overrides =
    { effect = Address.External (SpagoPkg.Repo
        { ref = "v0.0.1"
        , repo = Registry.Repo.GitHub
            { subdir = None Text
            , githubOwner = "someauthor"
            , githubRepo = "somerepo"
            }
        })
    }

in { compiler = upstream.compiler, packages = upstream.packages // overrides }

Registry Trustees

The "Registry Trustees" mentioned all across this document are a group of trusted janitors that have write access to this repo.

Their main task will be to eventually publish - under very specific conditions - new versions/revisions of packages that need adjustments.

The reason why this is necessary (vs. only letting the authors publish new versions) is that for version-solving to work in package managers the Registry will need maintenance. This maintenance will ideally be done by package authors, but for a set of reasons authors sometimes become unresponsive.

And the reason why such maintenance needs to happen is that otherwise older versions of packages with bad bounds will still break things, even if newer versions have good bounds. Registries which don't support revisions instead support another kind of "mutation" called "yanking", which allows a maintainer to tell a solver not to consider a particular version any more when constructing build plans. You can find here a great comparison between the two approaches, illustrating why we support revisions.

Trustees will have to work under a set of constraints so that their activity will not cause disruption, unreproducible builds, etc. They will also strive to involve maintainers in this process as much as possible while being friendly, helpful, and respectful. In general, trustees aim to empower and educate maintainers about the tools at their disposal to better manage their own packages. Being a part of this curation process is entirely optional and can be opted-out from.

Trustees will try to contact the maintainer of a package for 4 weeks before publishing a new revision, except if the author has opted out from this process, in which case they won't do anything.

Trustees will not change the source of a package, but only its metadata in the Manifest file.

Trustees are allowed to publish new revisions (i.e. versions that bump the pre-release segment from SemVer), to:

  • relax version bounds
  • tighten version bounds
  • add/remove dependencies to make the package build correctly

Note: there is no API defined yet for this operation.

Name squatting and reassigning names

If you'd like to reuse a package name that has already been taken, you can open an issue in this repo, tagging the current owner (whose username you can find in the package's metadata file).

If no agreement with the current owner has been reached after 4 weeks, then the Registry Trustees will address it.

For more details see the policy that NPM has for this, which we will follow when not otherwise specified.

The Package Index

I.e. the answer to the question:

How do I know which dependencies package X at version Y has?

Without an index of all the package manifests you'd have to fetch the right tarball and look at its purs.json.

That might be a lot of work to do at scale, and there are use cases - e.g. for package-sets - where we need to look up lots of manifests to build the dependency graph. So we'll store all the package manifests in a separate location yet to be defined (it's really an implementation detail and will most likely be just another repository, inspired by the same infrastructure for Rust).
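
As a sketch, assuming a crates.io-style layout in which each package gets a file containing one JSON manifest per line (an assumption, since the actual location and format are yet to be defined), the entries for a package could look like:

{ "name": "my-package", "version": "1.0.0", "dependencies": { "prelude": ">=2.1.5 <2.1.6" } }
{ "name": "my-package", "version": "1.0.1", "dependencies": { "prelude": ">=2.1.5 <2.1.6" } }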

Storage Backends

As noted above, this repository will hold all the metadata for packages, but the actual data - i.e. package tarballs - will be stored somewhere else, and we call each of these locations a "storage backend".

Clients will need to be pointed at places they can fetch package tarballs from, so here we'll store a mapping from the name of each storage backend to a function that, given (1) a package name and (2) a package version, returns the URL where the tarball for that package version can be fetched.
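
For instance, such a mapping could be as simple as a URL template per backend name - the backend name and template syntax below are hypothetical, and the actual mappings file is authoritative:

{
  "example-backend": "https://packages.example.org/{name}/{version}.tar.gz"
}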

We maintain the list of all the Storage Backends and the aforementioned mappings here.

We also provide a small utility to demonstrate how to use the mappings.

There can be more than one storage backend at any given time, and it's always possible to add more - in fact this can easily be done by:

  • looking at all the package metadata files, to get all the published versions for every package
  • then downloading the tarballs from an existing backend, and uploading them to the new location
  • updating the mappings file with the new Backend (see the sketch below)
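
Continuing the hypothetical mapping format sketched above, adding a second backend would just mean adding another entry:

{
  "example-backend": "https://packages.example.org/{name}/{version}.tar.gz",
  "example-mirror": "https://mirror.example.com/purescript/{name}/{version}.tar.gz"
}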

Downloading a package

A package manager should download a specific version of a package in the following way:

  1. given "package name" and "version", the URL to fetch the tarball can be computed as described above
  2. fetch the tarball from one of the backends
  3. lookup the SHA256 for that tarball in the package metadata file
  4. verify that the SHA256 of the tarball downloaded in (2) matches the one from (3)
  5. unpack the tarball and use the sources as desired

Note: we are ensuring that the package we download is the same file for all backends because we are storing the SHA256 for every tarball in a separate location from the storage backends (this repo).

Implementation plan

It is paramount that we provide the smoothest migration path that we can achieve with the resources we have. This is because we feel the ecosystem is already close to maturity (at this point breaking changes happen very rarely in practice), and we don't want to unnecessarily mess with everyone's workflow, especially if it's possible to avoid that with some planning.

So a big chunk of our work is going towards ensuring that Bower packages are gracefully grandfathered into the new system. This basically means that for each of them we will:

  • generate a package manifest
  • upload them to the first storage backend
  • keep doing that for a while so that package authors have some time to adjust to the new publishing flow

What has happened already:

  • we're not relying on the Bower registry anymore for guaranteeing package uniqueness in the ecosystem. New packages are referenced in this file, while all the packages from the Bower registry are referenced here
  • we have drafted how the registry should behave, what the API is, how things will look, etc. (this document)
  • we set up the first storage backend, maintained by the Packaging Team

What is happening right now:

  • we're figuring out the last details of the package Manifest, which is the big blocker for proceeding further, since it will be baked into all the tarballs uploaded to the storage.
  • we're writing the CI code to import the Bower packages as described above

What will happen after this:

  • we'll start using this repo as the source of truth for publishing new package sets
  • we'll write the CI code to implement the Registry API, so that authors will be able to publish new packages (albeit manually at first)
  • then implement automation to interact with the API in one package manager (most likely Spago)
  • then only after that we'll adjust package managers to use the tarballs from the Registry in a way that is compliant with this spec.

The Registry CI

All the Registry CI is implemented in PureScript and runs on GitHub Actions. Source can be found in the ci folder, while the workflows folder contains the various CI flows.

Checks on new packages

Yet to be defined: see this issue

Mirroring the Registry

As noted above, "The Registry" is really just:

  • this git repo containing metadata
  • plus various places that store the package tarballs

Mirroring all of this to an alternative location would consist of:

  • mirroring the git repo - it's just another git remote and there are plenty of providers
  • copying all the release artifacts to another hosting location. This can be done by looking at the package metadata and literally downloading all the packages listed there, then reuploading them to the new location
  • adding another "tarball upload destination" to the registry CI, to keep all the backends in sync
  • adding another Storage Backend entry in this repo

Additionally we could keep some kind of "RSS feed" in this repo with all the notifications from package uploads, so other tools will be able to listen to these events and act on that information.

FAQ

Why not use X instead?

We have of course investigated other registries before rolling our own.

Our main requirement is to have "dependency flattening": there should be only one version of every package installed for every build.

All the general-purpose registries (i.e. not very tied to a specific language) that we looked at do not seem to support this.

E.g. it would be possible for us to upload packages to NPM, but installing the packages from there would not work, because NPM might pull in multiple versions of the same package, according to the needs of each dependent.

Why not a webserver like everyone else?

These are the main reasons why we prefer to handle this with git+CI, rather than deploying a separate service:

  • visibility: webserver logs are hidden, while CI happens in the open and everyone can audit what happens
  • maintenance: a webserver needs to be deployed and kept up, CI is always there

How do I conform JSON to a Dhall type?

Install dhall, then:

$ cat "your-file.json" | json-to-dhall --records-loose --unions-strict "./YourDhallType.dhall"

Authors of this proposal

This design is authored by @f-f, with suggestions and ideas from:

Development

Setup

Create a .env file based off .env.example and fill in the values for the environment variables:

cp .env.example .env

If you are running scripts in the repository, such as the legacy registry import script, then you may wish to use the provided Nix shell to make sure you have all necessary dependencies available.

$ nix develop

registry-dev's Issues

Consider updating check that dependencies are self-contained

Currently, we check that all package dependencies are part of the registry right after fetching the bowerfile:

https://github.com/purescript/registry/blob/daf1399c458b8159deb473358a707c660af2fc27/ci/src/Registry/Scripts/BowerImport.purs#L133

However, this means that package versions could be dropped from the registry later (for example, when being converted to Manifests), and if all package versions for a package are dropped then the package isn't in the registry at all.

For this reason, we could end up considering a package valid because all of its dependencies are in the registry, only to later on drop one of its dependencies from the registry.

We could avoid this problem by making the check for self-contained dependencies the final check we run.

Reassign uint, float32, arraybuffer

I would like to reassign soon (not quite yet)

  • zaquest/purescript-uint → purescript-contrib/purescript-uint
  • athanclark/purescript-float32 → purescript-contrib/purescript-float32
  • jacereda/purescript-arraybuffer → purescript-contrib/purescript-arraybuffer

Discussion in purescript-contrib/governance#40

Implement SPDX license expressions

From #4:

For me, a big part of what makes SPDX attractive is that it supports license expressions, like MIT OR Apache-2.0, so I think it would be nice to support those here if we can. Implementing the expression language in Dhall might not be the best idea, but would using Text and then validating the license expressions in the curator work?

For context, here's the full documentation on license expressions.

While having a Text there and validating it in CI seems like a good option, it doesn't really provide feedback to authors on the correctness of their license expression until they push here.

It looks like the full grammar of the expressions is pretty small, so I think implementing it in pure Dhall (with a similar approach to the JSON type) would be quite approachable and possibly easy to use for authors, while at the same time guaranteeing correctness.
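
For reference, a manifest carrying a license expression would just use it as the value of the license field, e.g. (illustrative):

{
  "name": "my-package",
  "license": "MIT OR Apache-2.0",
  "version": "1.0.1"
}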

Upload docs to Pursuit

Users can manually upload documentation to Pursuit today. But it's easy to forget to upload docs after
publishing a library and the existing process is error-prone. For these reasons, Pursuit documentation
is frequently out of date or missing versions altogether -- even for core libraries!

As part of the new registry pipeline, we should support a step to automatically upload package
documentation to Pursuit. This will help ensure that Pursuit has up-to-date documentation for all versions
of packages in the registry.

However, in order to do that we have to make sure that both:

  1. We compile with a compatible compiler version
  2. We compile with compatible dependency versions (in other words "we have a working build plan")

We are not there yet because:

  1. users cannot specify which compiler versions their package is compatible with
  2. we don't have a working solver on the Registry yet.

So our idea for how to approach this is:

  • we can start including "publish to Pursuit on behalf of the user" as an optional step, warning the user if we didn't succeed (while still not failing the pipeline)
  • resolve the blockers so that docs publishing always succeeds, by
    • accepting compiler bounds in the Manifest
    • implementing a solver that works on the Registry

Add packages to a package set on upload

Similar to #154; packages uploaded to the registry should automatically be added to the latest package set, if possible. If the package can't be added to the package set then the registry should issue a warning (but not fail the package upload). Package sets will still need manual curation (i.e. a breaking change at the root of a package set will eventually need manual management to update the root package and all its dependents), but this will help keep package sets up to date over time.

To accomplish this, we'll need to:

  1. Add or update the registry package to the package set that's currently in the default branch of the package sets repository
  2. Compile the entire package set with its compatible compiler version(s) to verify that the package works with the set
  3. If successful, commit the updated package set back to the default branch of the package sets repository

Fall back to spago.dhall files for packages without a bower.json

Some packages are in the PureScript registry and do not have bower.json files, but do have spago.dhall files sufficient for producing a valid Manifest. One example is halogen-hooks:

https://github.com/thomashoneyman/purescript-halogen-hooks/tree/v0.5.0

This repository doesn't have a bower.json file, but it does have a spago.dhall file that is sufficient for producing a manifest:

https://github.com/thomashoneyman/purescript-halogen-hooks/blob/v0.5.0/spago.dhall

I'm proposing that, when we find a package that has a Spago file but not a Bower file, we attempt to either:

  • parse the spago.dhall file to get the package license and dependencies
  • generate a bower.json file from the spago.dhall file, and then import the Bower file as usual

Suggestion: ban fancy semver ranges

I have felt for quite a long time that semver ranges like ^x.y.z or ~x.y.z are an antipattern, because they obscure what the actual bounds are, and they encourage you to tighten lower bounds prematurely. For instance, lots of JS tools push you towards using a caret range with whatever version you happened to be using when you were writing the code. Of course, it's a reasonable default at first if you're adding a dependency at version x.y.z to declare the bounds as >=x.y.z <(x+1).0.0 (note: this is what ^x.y.z means as long as x >= 1), but if you later make a change to support newer versions of that dependency, you often won't need to tighten the lower bound at the same time. That is, if I've written ^x.y.z and I want to allow my package to be used with a new version (x+1).0.0 of the same dependency, caret ranges encourage me to just write ^(x+1).0.0, which means that my package can no longer be used with any x.* version. Instead, if I was forced to write >=x.y.z <(x+1).0.0 to begin with, it's easy to see how to relax the upper bound without tightening the lower bound.

To be explicit, my proposal is that the only acceptable version range operators are >, >=, <, <=, and ==.

Note that the caret or tilde ranges aren't part of the semver spec; as far as I'm aware, they were first introduced by the node semver package. Elm does this already (and in fact goes a little further), by requiring that all version ranges are of the form a.b.c <= v < d.e.f.

Decode Bowerfiles as lenient JSON and only parse necessary fields

In #210 we saw a number of packages fail with a MalformedBowerJson error, namely because of issues like this:

{
  "name": "purescript-facebook",
  "description": "Idiomatic Purescript bindings for the Facebook SDK",
  "license": "Apache-2.0",
  "keywords": "facebook purescipt",
  ...
}

This fails to decode as a Bower.PackageMeta because the keywords field has to be an array of keywords, not a plain string. We don't even use this field in our manifests, though, so we shouldn't care if it isn't Bower-compliant!

The errors aren't always just that the file can't be decoded as a Bower.PackageMeta. Sometimes they aren't valid JSON at all:

{
  "name": "purescript-endpoints-express",
  "version": "0.0.1",
  "authors": {
    "Simon Van Casteren <[email protected]>"
  },
  ...
}

This file isn't valid JSON because of the authors key: its value is wrapped in braces but contains a bare string rather than key/value pairs. We can't parse this as Json at all.

And yet we can pull off the keys that we know need to be there -- the name, the version, the dependencies, and the devDependencies -- and attempt to decode their contents. To avoid needlessly removing packages from the registry, I suggest that we change how we parse Bowerfiles once #210 merges so that packages like these can be converted into manifests despite their JSON issues.
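
In other words, out of a Bowerfile we only really need to decode something like the following (illustrative values):

{
  "name": "purescript-facebook",
  "version": "0.0.1",
  "dependencies": { "purescript-prelude": "^4.0.0" },
  "devDependencies": {}
}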

There are some files we still might not be able to parse as Json, because even the keys we're looking for have issues of their own. For example, the next one is even more insidious:

{
    "name": "purescript-uint",
    ...
    "dependencies": {
        "purescript-enums": "^v5.0.0",
        "purescript-gen": "^v3.0.0",
        "purescript-math": "^v3.0.0",
        "purescript-maybe": "^v5.0.0",
        "purescript-prelude": "^v5.0.0",
        "purescript-psci-support": "^v5.0.0",
    }
    "devDependencies": {
        "purescript-effect": "^v3.0.0",
        "purescript-quickcheck": "^v7.1.0",
        "purescript-quickcheck-laws": "^v6.0.1"
    }
}

There's a trailing comma after the "purescript-psci-support" dependency! It's not valid JSON! It would be nice if we could handle this case too, but perhaps we just pull off the keys we need and if they're not valid JSON then we just drop the package.

Fixed record of `targets` vs a `Map` of them

Right now, a Package can have a Map of Targets:

https://github.com/purescript/registry/blob/e90a200ddcd2216258ec78145014f0fad5c5a6eb/v1/Package.dhall#L24-L25

This means that in practice users will have a record with arbitrary keys, and call toMap on it.

An alternative design could be to fix the target names and just have the record there instead, so we'd have targets : Targets, where Targets is defined as follows:

{ lib : Target
, app : Optional Target
, test : Optional Target
, bench : Optional Target
-- ..and so on, for all the targets we want to include
}

The current design is nice as the users are free to come up with all the targets they need. However, some things would get much easier with the alternative design mentioned above:

  • enforcing the existence of a lib target - with the current design we have to run a check in CI, with the alternative the typechecker enforces it
  • overriding something inside a target (useful e.g. for Trustees, see #14) - right now it would require copying the whole targets section (because there's no way to override values in a Map, one has to copy all of it and change the things they don't like).
    Example of changing a dependency for the latest version of aff (see here).
    Current approach:
    https://raw.githubusercontent.com/purescript/registry/master/packages/aff/v5.1.2.dhall
    with targets =
          toMap
            { lib = Registry.Target::{
              , sources = [ "src/**/*.purs" ]
              , dependencies =
                  toMap
                    { exceptions = "^4.0.0"
                    , effect = "^3.0.0"
                    , unsafe-coerce = "^4.0.0"
                    , transformers = "^4.0.0"
                    , parallel = "^4.0.0"
                    , datetime = "^4.0.0"
                    , functions = "^4.0.0"
                    }
              }
            , test = Registry.Target::{
              , sources = [ "src/**/*.purs", "test/**/*.purs" ]
              , dependencies =
                  toMap
                    { free = "^5.0.0"
                    , console = "^4.1.0"
                    , minibench = "^2.0.0"
                    , assert = "^4.0.0"
                    , partial = "^2.0.0"
                    }
              }
            }
    Makes it kind of hard to spot the updated dependency..
    With the alternative instead:
    https://raw.githubusercontent.com/purescript/registry/master/packages/aff/v5.1.2.dhall
    with targets.lib.dependencies =
                 toMap
                    { exceptions = "^4.0.0"
                    , effect = "^3.0.0"
                    , unsafe-coerce = "^4.0.0"
                    , transformers = "^4.0.0"
                    , parallel = "^4.0.0"
                    , datetime = "^4.0.0"
                    , functions = "^4.0.0"
                    }
    It's still hard to spot that effect was updated to 3.0.0 - so I guess the right solution would be to have a Dhall operator to update things in a Map - but now it's clear that we're only updating the dependencies of the lib target.

How is unpublishing possible?

  • immutability: packages are largely immutable - once a version has been published then its source code is forever packaged in the tarball uploaded to our storage
    • ..but unpublishing will be possible for some time after publishing

How is unpublishing going to be implemented? "Make a new commit removing the file" isn't a great option because the package is still available in the history. Or do you plan to rewrite the history as well?

Produce a graph of package versions and their dependencies from the registry index

The registry contains a registry index, which is a list of all packages in the registry along with their manifests (which themselves include information about dependencies, licenses, and so on):

https://github.com/purescript/registry/blob/3f5c48e518f657ab03c4178df976648c9fb1437f/ci/src/Registry/Index.purs#L24

In order to upload a new registry index as part of our pipeline, we need to be able to order packages at specific versions by their dependencies, so that we only upload packages that already have all their dependencies present (at correct versions) in the registry. That way the registry index is always correct.

I'm proposing we implement a function:

toVersionsGraph :: RegistryIndex -> PackageVersionsGraph

where a PackageVersionsGraph contains every package version in the registry, and we can order it topologically:

toOrderedArray :: PackageVersionsGraph -> Array Manifest

...so as to walk through each package version in order and upload it to the registry index.

Root JSON diffs always contain trailing comma issues

All of the diffs in pull requests for additions contain extra lines, which just makes the history uglier. Some commits, like this one, exist just to deal with the commas. JSON5 supports trailing commas (and comments), or the JSON could be comma-first style, which would eliminate this issue.

Clarify the mirroring story

During the meetup the other day, the topic of "aren't we over-relying on GitHub?" was brought up.

This makes sense as GitHub seems to go down regularly, so while we don't have to take any specific measures about that right now, we should at least take this into account in the design, so we can easily put in place any measures at a later time.

The main concerns we discussed are:

  • how do we mirror the registry in practice? Right now there are instructions on how it's possible in principle, but maybe we should think of a more concrete plan - possible options for storage could be DigitalOcean or S3 or B2, or something else.
    Moreover, it's not clear if it would be possible to get "hooks" for new package uploads: so that if someone wants to keep a mirrored version of the storage they'd get notifications from the Curator "hey I'm uploading this package, you should mirror it".
    As a final note, this should be taken into account when integrating with package managers: how do we keep track of "alternative storage locations"? Should this be a setting of package managers or should this be saved in the registry? It's definitely a design question that is unexplored so far and should be addressed.
  • how can people "mirror" package sets? I.e. one desire that was expressed was "can I just package the whole package set in a tarball and use that if Github is down?" - This is already possible today if you "install all packages in the set" and then package up your .spago folder and upload that somewhere.
    However, maybe we could automate this downstream - e.g. we could have Spago commands that take care of this (something like spago set-backup and spago set-restore), and/or this could be taken into account while addressing the concerns in the previous point

Discussion: Use `user/package` as package identifier

Change the type of the package identifier from Text to {user : Text, name : Text}, where user is the github username and name is the package name chosen by that user.
For example affjax would be named slamdata/affjax

Advantages

  1. Easier forking of packages, as the new contributor does not need to make up an artificial new name
    • I think this is especially useful when forking packages from users who don't merge PRs
  2. Less room for arguments regarding who should get what package name
  3. No implicit "official" version in the registry.

Disadvantages

  1. Installation is a little more complicated
  2. Possible confusion over whose package is meant

IDK if package-sets should also carry the prefix. IMO package-sets are opinionated and should probably take a stand on what they consider "official".

Automated upgrade support for breaking changes

Could we set up a framework to enable automated (or mostly automated) upgrades for dependents of packages with breaking changes? I'm envisioning the ability to run spago --upgrade-set and have code changes automatically applied when possible (with the user's permission, of course).

We could start by just listing which packages were detected as having breaking changes, and then link to their release pages. For example:

spago --upgrade-set
...
react-basic-hooks upgraded from v4.2.1 to v5.0.0
  Migration guide: https://github.com/spicydonuts/purescript-react-basic-hooks/releases/tag/v5.0.0

We could take this a step further by allowing upgrade scripts to automate some of these operations. For example, with the above upgrade, automatically rename:
React.Basic.Hooks.component
to:
React.Basic.Hooks.reactComponent

I suspect there's a nice way to define these transformations and then let the parser and/or compiler help out with applying them. The registry seems like a good place to store these automated upgrade instructions. For example: packages/foo/v2.0.0.upgrade to handle updating dependents of Foo from the latest v1 to v2. In cases where no manual upgrade steps are required, we could automate PRs to all dependent packages in a package set, so that trivial (yet breaking) API changes are painless. We could also investigate automated upgrades for breaking compiler changes. There's an effort to do this with Elm, but I think we could do even better. Perhaps even less manual coordination would be required for 0.15.0+ releases https://discourse.purescript.org/t/updating-the-ecosystem-for-upcoming-0-14-0-release/1144

Attempt to parse licenses and dependencies even without a bower.json

This issue is related to #220, #218, and #211, all of which discuss how to import packages that have a malformed Bowerfile, or a Bowerfile missing a license key, or have no Bowerfile at all (but have other files, such as a spago.dhall file).

I propose that we view this problem from a new angle: how do we get the information necessary to produce a Manifest for a package, regardless of the presence of a Bower or other files?

https://github.com/purescript/registry/blob/c28f98584e9858d3185a3842dd2fca31514574f4/v1/Manifest.dhall#L13-L24

To construct this, we need:

  1. The package name (we get this from the new-packages.json and bower-packages.json files)
  2. The package repository (we get this from the same two files)
  3. The package version (we get this from GitHub tags)
  4. The package license (we get this from the Bowerfile)
  5. The package targets, i.e. dependencies and dev dependencies (we get this from the Bowerfile)

We already handle the first three items about as well as we possibly can, but the work in #220, #218, and #211 all have to do with getting the package license and targets. We've started with the idea that we'll just pull everything out of the Bowerfile, but a lot of packages have missing Bowerfiles or Bowerfiles with no license key.

My proposal is that we no longer consider ourselves to be parsing a BowerFile, but rather a license and targets, and it's not necessary that we get that information from a Bowerfile. For example:

  1. We can get the license and/or targets directly from a Bowerfile.
  2. We can get the license and/or targets directly from a Spago file (either by generating a bower.json file or using say spago ls deps).
  3. We can possibly get targets from psc-package files, though I've never used it myself and I'm not sure if it's worth bothering with this.
  4. We can get the license from a package.json file (but not dependencies).
  5. We can get the license from the repository root in the LICENSE file.

So I'm proposing that we attempt to get the license and targets by moving through these steps: read as much information as we can from the Bowerfile, then the Spago file, then the package.json, and finally the LICENSE in the repo root. Only if we can't get the license and targets from any of those sources, or a combination of them, do we fail the package.

Package maintainers

We are planning some API operations that would perform sensitive actions directly on this repo, such as:

  • unpublishing a package
  • changing the repo address

In order to err on the safe side, we could only authorize the @purescript/packaging team to perform such operations.
However, it would be nice if we could allow packages' maintainers to perform such operations too.

The main issue to solve if we are to allow this is: how do we authenticate them?
I.e. how do we know that "Person X" is really a maintainer of package Y?

In order to tie together the entities of "package" and "maintainer" we could store a list of maintainers in the Manifest: since they have to be committed directly to the repo then "having write access" means "I can add myself as a maintainer", and this proves the relation.
However, how do we link "I'm a maintainer" to "I can perform an operation on the registry"?

An option for this would be to store GitHub usernames in the maintainers list: in this way if user X opens an issue to perform some operation then we can just check if they are in the maintainers' list, and safely authenticate them.
Does this tie use too much to GitHub?

A more agnostic option would be to use email addresses instead, but then authenticating them becomes more cumbersome?

Renaming a package from the `bower-packages.json`

Thanks a lot for your work on the purescript/registry!

I have this quite unusual scenario - I want to rename/deprecate a package which is listed in bower-packages.json (specifically polyform-validators into polyform-batteries-core - it was renamed a long time ago, but not in the bower registry itself).
I can also add this as a new package and deprecate the other one. The problem is that in such a case I need to change the existing (validators) lib URL, so that the deprecated package links to a stub repo with a deprecation warning. Could you please suggest how to approach this?

Allowed versions and Trustees publishing

In purescript/registry#76 (comment) we figured that using SemVer's "prerelease segment" for Trustees to publish revisions won't work, as the way the spec orders prereleases is not the one we'd like.

I'll report here the discussion from the thread, as a starting point for further discussion:

@hdgarrood: I'm not sure using the semver prerelease segment will work so well for this. Firstly, prerelease versions are considered less than non-prerelease versions according to semver, so v1.0.0 compares greater than v1.0.0-r1. Secondly, doesn't this prevent package authors from using the prerelease segment for their own purposes? I wouldn't really mind if we reserved the prerelease segment for our own use, but in that case we would be diverging from semver (if we don't allow authors to use all of the components of semver, we aren't really using semver) so I think we ought to be more upfront about that if that's what we go with.

@f-f: Great points. I wouldn't like us to diverge from SemVer, but other than that I have no strong opinion on how to do this.
Do you have a concrete idea that we could write down here?

@hdgarrood: I just looked over the semver spec again and it does say that prerelease identifiers and build metadata are optional, so actually I take that back: we are well within our rights to only accept what the spec describes as "normal" versions into the registry (i.e. those which don't have prerelease identifiers or build metadata), and since it's quite rare to use prerelease identifiers and build metadata in practice (at least for libraries), I think it may even be a good idea to reject versions which use them. I also think it would be nice to set the registry architecture up so that it's impossible for revisions to affect anything other than the package metadata. What Hackage does is handle revisions of metadata for a version separately from the version tarball itself, so that there's only ever one tarball for a version, and revisions of the metadata are stored separately. It then becomes the job of the registry client to fetch the most recent version of the metadata alongside the package tarball. That approach sounds sensible to me, and I think it suggests that metadata revisions are separate from package versions. So maybe this is just a question for the package index?

@f-f: I am not quite comfortable with Hackage's approach of storing package sources and metadata revisions separately, as I think it has a couple of problems:

  1. there is only one tarball for a version, which means that "hashing the package sources" doesn't guarantee integrity and security by itself, as you'd need to fetch a metadata file as well, and you'd need to guarantee integrity (i.e. hash it and distribute the hashes) either of this file or of the whole bundle. In the latter case, one then also needs to define how to compose the files to compute the hash. Preserving integrity of the metadata as well is necessary because things like "changing version bounds" could introduce malicious code if not supervised, and not guaranteeing integrity for them would mean that people would get them in their build believing that the version hasn't changed (because the hash did not), rendering builds nondeterministic and insecure.
  2. doing this would mean that the registry-index repo would become a source of truth, which means that we'd either have to maintain these two repos in sync (e.g. the moment we add metadata in there we also need to sync the list of versions here, etc) or unify them
  3. ...but would storing metadata in here mean that packages wouldn't version that? If metadata is not stored alongside the package sources, then there would be no way to use a package without going through the Registry (and I can see some people not wanting to do that because of security or other company concerns), as they'd have to figure out how the package is defining dependencies, bounds, etc. If we store the metadata alongside the sources instead, then we have the question of "which metadata is correct, the one here or the one in the registry?". This last reason alone is why the current design ditches this aspect (i.e. storing manifests in here and optionally in packages) from the previous draft.

About prerelease identifiers: some packages have been using the pre-release segment (e.g. Halogen or aff) and I think it's good to allow them to, as it has an important role in package versioning.

I also read again through the SemVer spec, and found an issue which might actually help us here: it looks like it considers two versions with everything equal but the build metadata as equal in the sorting. I.e. it ignores build metadata when sorting. This is a problem for us because we need to get a stable sorting of releases (to figure out the last version of a package), so we would either have to:

  • disallow releases with build metadata altogether
  • or reserve the build metadata segment only for Trustees to cut new revisions (so going back to that idea from the previous version of the draft), and expand the sorting so that it would consider versions with build metadata as "later" than those without. The ordering of different build metadata segments should be no problem, as the spec already defines how to sort prerelease segments, and their grammar is the same, so we could just reuse that.
    This behaviour seems to be what apt does as well - distros patch upstream packages and add build metadata so that the package manager picks the patched versions as newer - and I think it works very well there

Note that both of the above options mean that we'll slightly diverge from SemVer, but I'd consider this quite fine, since it's basically just getting rid of undefined behaviour.

@hdgarrood:

  1. Aren't we intending to provide access to individual package manifest files without downloading a whole package tarball anyway, via the package index? Ensuring the integrity of those package manifest files is already a problem we'll need to deal with, surely?
  2. In that case, could we store package manifest files in the storage backend alongside the package tarballs and call that the source of truth for them? Then, the package index would always be derived from the storage backend, and would not be a source of truth?
  3. I think disallowing build metadata makes sense. We could also require that each package may only upload one version with the same major, minor, patch, and prerelease components. For example, if you've already uploaded 1.0.0+abc, then I think the registry should reject a subsequent upload of 1.0.0+def. I think defining an ordering for the build metadata would be a more serious violation of the semver spec, because the spec specifically says that you mustn't do that. If we go against this, I think it is likely to cause funny behaviour in clients which implement the semver spec accurately: for example, you couldn't have versions 1.0.0+abc and 1.0.0+def in a Set together. The only way I can make sense of the build metadata ordering requirement from the semver spec is if package registries should be refusing package uploads which differ from an existing version only in the build metadata component, i.e. uploading two different versions with the same major, minor, patch, and prerelease components should be disallowed.

@f-f: How would Trustees publish revisions if we disallow build metadata? And why would sorting by that be a "serious violation"? SemVer doesn't say "you shouldn't do that because it's bad", it just says "we don't do that in SemVer".
Registry clients are supposed to implement a spec that we define here. If we say "it's SemVer plus ordering by build metadata", then that is the spec.

@hdgarrood: By treating revisions as a separate thing from package versions? It’s a more serious violation in my mind because it’s not just filling a gap in the spec, it’s going against something the spec explicitly says. The versioning libraries that exist aren’t “semver plus build metadata,” they’re just semver.

Implement Bower mass-import

We already have some code that can get all the Bower package manifests and generate our manifests, so in order to be able to do the mass-import of Bower packages to the Registry we need to implement some code that:

  • downloads the tarball for a commit hash from a GitHub repo
  • sticks the generated manifest inside
  • repackages the whole thing
  • computes the sha256 of the new tarball

Note that we are planning to have CI checks for new packages, and it would be good to run them for imported packages too:

  • all the checks listed in #23
  • the name check in #3

Plus we should convert the semver ranges in the dependencies to simpler ones when generating the manifest, see #43

Write tests for the Registry.Index module

The Registry.Index module supports reading and writing to the registry index. (For context, the registry index is a listing of all packages in the registry along with their manifest files.) This code has not yet been tested, and we need to make sure that, once run, it produces a valid registry index that can be read back.
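
For example, a basic round-trip property we'd want to hold, with `write` and `read` standing in for the module's actual API (the real function names may differ):

module Test.RegistryIndex where

import Prelude
import Data.Foldable (traverse_)
import Effect.Aff (Aff)

-- Insert every manifest, read the index back, and compare.
roundTrips
  :: forall manifest
   . Eq manifest
  => { write :: manifest -> Aff Unit, read :: Aff (Array manifest) }
  -> Array manifest
  -> Aff Boolean
roundTrips index manifests = do
  traverse_ index.write manifests
  stored <- index.read
  pure (stored == manifests)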

Implement package upload to storage backend

We should have some code that, given a tarball and package info (name, version), uploads it to our S3-compatible storage backend (which is going to be on the PureScript org's DigitalOcean account).
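
A sketch of the shape this could take, where `putObject` stands in for a hypothetical S3 client binding and the `name/version.tar.gz` key layout is an assumption:

module Upload where

import Prelude
import Effect.Aff (Aff)

type FilePath = String

uploadTarball
  :: (String -> FilePath -> Aff Unit)  -- hypothetical `putObject key file`
  -> { name :: String, version :: String }
  -> FilePath
  -> Aff Unit
uploadTarball putObject pkg tarball =
  putObject (pkg.name <> "/" <> pkg.version <> ".tar.gz") tarball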

Boundaries and rules for "package editing by Trustees"

The possibility for Trustees to edit package manifests and publish patched releases is a feature of the current design.

However it's a very sensitive topic, as we're messing with things like build reproducibility, API surface and versioning guarantees.

So if we want to keep this possibility built in, it's quite necessary to clearly specify under which conditions and procedures the Registry maintainers would be allowed to do that.

There's already a section in the README about this, but I'm also opening this issue following the concerns in #4:

I don't think "without having to ask authors" is the perspective we should take here. As a Haskell package author, it can be very annoying to learn that a Hackage trustee has tweaked version bounds without my knowledge or approval; at the very least I think we should require the trustee to inform the package author that this has happened.

Is the idea that for each package version, there should be an identifiable latest revision of it? It's not entirely clear to me how this will work with the current design.

Generally I don't trust Hackage trustees to understand my code well enough to know when it is safe to weaken a constraint, as breaking changes can be introduced if the APIs of any of the libraries you're using will "bleed through" into your code's public API. For instance, the version of language-javascript we are using in the PureScript compiler is effectively part of the compiler's API, because that determines what we accept in FFI files (aside: this is part of why I don't like it when people package the compiler in places like Nix or Homebrew with different sets of dependencies than those specified by our stack.yaml).

Parse LICENSE file if no license is listed in the Bowerfile

Some packages do not declare a license field in their Bowerfiles, despite having a valid SPDX license in the repository root. One example is aff-coroutines at its initial released version:

https://github.com/purescript-contrib/purescript-aff-coroutines/tree/v0.1.0

If we don't find a license in the imported Bowerfile (or in the imported spago.dhall file if / when #218 is implemented), but there is a LICENSE file in the repository root, then we should try to parse that and use it as the package's license in the produced Manifest.
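
A sketch of that fallback logic, where `readLicenseFile` and `detectSpdx` are hypothetical helpers (the latter could be backed by an existing SPDX-matching library):

module LicenseFallback where

import Prelude
import Data.Maybe (Maybe(..))
import Effect.Aff (Aff)

licenseFor
  :: { fromManifest :: Maybe String            -- license declared in the Bowerfile, if any
     , readLicenseFile :: Aff (Maybe String)   -- contents of LICENSE in the repo root
     , detectSpdx :: String -> Maybe String    -- hypothetical SPDX detector
     }
  -> Aff (Maybe String)
licenseFor opts = case opts.fromManifest of
  Just license -> pure (Just license)
  Nothing -> do
    contents <- opts.readLicenseFile
    pure (contents >>= opts.detectSpdx)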

Suggestion: use `dhall format`

Dhall is a requirement of the registry, and it ships with a first-party formatter. It would seem better if everyone were encouraged to use it for style consistency - this project should be using dhall format as well.
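
For example (the exact invocation depends on the dhall version; newer releases format in place without the flag):

dhall format --inplace ./v1/Package.dhall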

Only accept packages from GitHub?

The Repo type currently in master can support GitHub or a generic git link:

https://github.com/purescript/registry/blob/e90a200ddcd2216258ec78145014f0fad5c5a6eb/v1/Repo.dhall#L1-L11

..but there was a concern expressed in #4 about the fact that maybe we should accept only packages from GitHub for now:

Right now Pursuit only supports packages hosted on GitHub, and nobody has complained (not on the issue tracker or in any other way I might have seen, at least). The reasons for this are that it allows us to use the GitHub API to get a rendered HTML readme (although I now consider that a misfeature and I'd like to remove it), and also it allows us to construct source links, since the source links take you to GitHub. If we allow non-GH packages in the registry, this means we won't be able to upload those packages to Pursuit right now (which is not a dealbreaker but probably worth considering). I'd quite like to have package sources hosted in the same place as the HTML API docs for source links though, which will probably be easier to do now with our shiny new registry; if we did that we'd be able to allow non-GH packages without losing source links.

Trim JS dependencies out of imported Bowerfiles

Some packages cannot be imported to the new registry from the old registry because they contain dependencies that are not in the registry. However, sometimes these dependencies are not in our registry list because they are JavaScript dependencies. Ordinarily we don't consider JS dependencies; for example, the purescript-react package doesn't depend on react even though it does require it to work.

There are packages that would be valid if they did not declare any JS dependencies in their Bowerfiles. Since we don't care about their JS dependencies anyway, I'm proposing that we strip out non-PureScript dependencies from packages being added to the registry.

One example is purescript-ace, which fails because it depends on ace-builds (a JS package):

https://github.com/purescript-contrib/purescript-ace/blob/38cd31854d779ce565dd43f138aeda588ec71d4c/bower.json#L30-L31

It's a shame to remove ace from the registry altogether just because it declares this dependency (it's used by various PureScript applications), given that it would work if we simply dropped the JS dependency from the package.

To fix this issue, we could:

  • Process Bowerfiles after they have been imported to strip out packages that do not have a purescript- prefix.
  • Or, attempt to keep the Bowerfile as-is, and only strip out packages without purescript- prefixes if the Bowerfile is found to contain non-registry dependencies.

I believe it's safe to use the purescript- prefix as a heuristic for identifying PureScript packages in the Bower registry, but I may be mistaken; if so, we can simply close this issue.
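
A sketch of the stripping step under that heuristic, where the dependencies are modelled as a plain Data.Map of names to ranges (the real Bowerfile type may differ):

module TrimDeps where

import Prelude
import Data.Map (Map)
import Data.Map as Map
import Data.Maybe (isJust)
import Data.String (Pattern(..), stripPrefix)

-- Keep only dependencies that look like PureScript packages.
trimJsDependencies :: forall range. Map String range -> Map String range
trimJsDependencies = Map.filterKeys \name ->
  isJust (stripPrefix (Pattern "purescript-") name)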

Verify that we can `git push` from the publishing pipeline

Right now we run the API CI pipeline on issues and issue comments - we should verify that from there we are able to authenticate and git push to:

  • this registry repo itself
  • and the registry-index repo

The best way to test this would be from some dummy repos, rather than doing it here (i.e. everyone can test this on their own repos)

Package name constraints

I would like to be quite strict about what is allowed as a package name. In particular, there are some packages with uppercase characters in this repo right now and I'd like to enforce that package names are lowercase. Partially because it's very annoying to have to remember the casing of package names, and partially because it's potentially a vector for attacks with similar-looking characters, like uppercase I vs lowercase l.

Here's what bower has to say about package names (source):

  • Must be unique.
  • Should be slug style for simplicity, consistency and compatibility. Example: unicorn-cake
  • Lowercase, a-z, can contain digits, 0-9, can contain dash or dot but not start/end with them.
  • Consecutive dashes or dots not allowed.
  • 50 characters or less.

I'd like to keep the restriction that the only allowed characters are a-z, 0-9, dot, and dash, and ban consecutive dots and dashes. There's an argument to be made that 50 characters is not enough (especially as the purescript- prefix alone takes 11 characters), but I think we should have an upper limit, so 50 is probably sensible as a starting point for now, and we can always relax it later.

Thoughts on implementing the above as part of CI?
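
As a starting point for that CI check, here's a sketch of the validation; the function and error-message names are illustrative:

module PackageNameCheck where

import Prelude
import Data.Either (Either(..))
import Data.Foldable (all)
import Data.Maybe (Maybe(..))
import Data.String (Pattern(..), contains)
import Data.String.CodeUnits (charAt, length, toCharArray)

validatePackageName :: String -> Either String String
validatePackageName name
  | length name == 0 = Left "name must be non-empty"
  | length name > 50 = Left "name must be 50 characters or less"
  | not (all allowed (toCharArray name)) = Left "only a-z, 0-9, dash and dot are allowed"
  | boundary (charAt 0 name) || boundary (charAt (length name - 1) name) =
      Left "name must not start or end with a dash or dot"
  | contains (Pattern "--") name || contains (Pattern "..") name =
      Left "consecutive dashes or dots are not allowed"
  | otherwise = Right name
  where
  allowed c = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '-' || c == '.'
  boundary c = c == Just '-' || c == Just '.'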

META: Address fixable problems found with the existing Bower import

The current Bower import code, runnable with:

GITHUB_TOKEN=<token> spago run -m Registry.Scripts.BowerImport

...produces a bower-exclusions.json file with all packages that don't produce valid manifests. Many of these packages could produce valid manifests with a little tweaking. For example, some packages have a trivial misspelling of their LICENSE type, which we can rewrite for them:

https://github.com/purescript/registry/blob/3f5c48e518f657ab03c4178df976648c9fb1437f/ci/src/Registry/Scripts/BowerImport.purs#L226-L232

This issue tracks classifying errors that are currently excluding packages, but which we would like to fix so that more packages are included in the registry.
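
For instance, the rewriting could be a simple mapping; these particular rewrites are illustrative, and the linked code has the real list:

module FixLicense where

import Prelude

-- Map common misspellings to valid SPDX identifiers.
fixLicense :: String -> String
fixLicense = case _ of
  "Apache 2.0" -> "Apache-2.0"
  "Apache 2" -> "Apache-2.0"
  "BSD" -> "BSD-3-Clause"
  other -> other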

Consider removing targets from the manifest

The current manifest specification includes a targets key:
https://github.com/purescript/registry/blob/55bce52392cab4b595ac1f542954cfceeef2d431/v1/Manifest.dhall#L22-L23

where a Target is defined as:

https://github.com/purescript/registry/blob/55bce52392cab4b595ac1f542954cfceeef2d431/v1/Target.dhall#L14-L20

...but it's not entirely clear to me why we need to include targets at all in a package manifest, as the registry (implicitly, at least) relies on a single target called lib:

https://github.com/purescript/registry/blob/3d091b10d63350e8d374612e0eabbf825394cb40/README.md#the-package-manifest

I understand the value of targets when developing a package -- it's important to be able to specify things to compile (and their extra dependencies) besides src, like tests, bench, examples, and so on. For that reason it's vital that a build tool (like Spago) can support additional targets.

But I don't see why the extra targets information needs to be included in the manifest file unless the registry or package managers depending on the registry can do something with that information. If the lib target is all the registry cares about, then why not flatten that target into the manifest instead of checking for a lib target?

Add EditorConfig

I'm sure we already know about EditorConfig. Halogen has one, as do many other projects, and it would help with consistency. My editor was giving me 8-space tabs (the default in Vim) without this file.
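
A minimal .editorconfig along these lines (the concrete settings are a suggestion, not something the project has agreed on):

root = true

[*]
charset = utf-8
indent_style = space
indent_size = 2
trim_trailing_whitespace = true
insert_final_newline = true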

Record total packages and versions in stats

Currently, the Stats module outputs statistics with this format:

Number of successful packages: 1369
Number of failed packages: 474
Number of successful versions: 9565
Number of failed versions: 1694
Failures by error:
  manifestError: 1378 occurrences (371 packages / 1378 versions)
    missingLicense: 928 occurrences (241 packages / 928 versions)
    badDependencyVersions: 283 occurrences (104 packages / 283 versions)
    badLicense: 179 occurrences (45 packages / 179 versions)
    badVersion: 48 occurrences (32 packages / 48 versions)
  nonRegistryDependencies: 246 occurrences (47 packages / 246 versions)
  missingBowerfile: 64 occurrences (34 packages / 64 versions)
  noReleases: 32 occurrences (32 packages / 0 versions)
  malformedBowerJson: 5 occurrences (4 packages / 5 versions)
  malformedPackageName: 1 occurrences (1 packages / 1 versions)

However, it would also be useful to know the total number of packages and versions we attempted to work with. For example, these statistics list 1,369 "successful" packages and 474 "failed" packages, but we only have something like 1500-1600 packages in total; some packages are present in both the success and failure maps because some, but not all, of their versions succeeded.

I think the output of the first few lines ought to be more like:

Packages: 1500 total (1369 with successes, 474 with failures)
Versions: 11000 total (9565 successful, 1694 failed)

where the total packages won't be a sum of its constituents because some packages will be in both maps, but where the total versions should be the sum of its constituents because a version can't be failed and successful.
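
Computing the package total then amounts to taking the size of the union of the two maps' key sets (names here are illustrative, not the Stats module's API):

module StatsTotals where

import Prelude
import Data.Map (Map)
import Data.Map as Map
import Data.Set as Set

-- Packages can appear in both maps, so the total is a set union, not a sum.
totalPackages :: forall a b. Map String a -> Map String b -> Int
totalPackages successes failures =
  Set.size (Set.union (keys successes) (keys failures))
  where
  keys :: forall v. Map String v -> Set.Set String
  keys = Set.fromFoldable <<< Map.keys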

We could also in the future be more granular and segment packages with all successes, all failures, and a mixture of both, but that's a larger task.

Remove default license from the `Package` type

Right now we default to the MIT license for packages that don't specify it:
https://github.com/purescript/registry/blob/e90a200ddcd2216258ec78145014f0fad5c5a6eb/v1/Registry.dhall#L32-L36

However, as noted in #4:

I don't think we should have a default for the license field; picking a license should be something we require the maintainer to actively do, I think.

..so we should remove it from the defaults and see if there are any issues with packages imported from Bower (this will require tweaking the curator here)

Consider lower-casing package names imported from Bower

The Bower import script currently excludes only one package for having a malformed package name:

  "malformedPackageName": {
    "TypeAhead": {
      "v0.1.0": "Package name should start with a lower case char or a digit; pos = 0",
    }
  }

However, this would be a valid package name if it were all lower-cased. This issue tracks discussion on whether we should fix this package name or exclude it from the registry -- and, in general, if we should process package names to lower-case them prior to parsing them as PackageNames.
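
If we do decide to normalize, the tweak is small; here `parse` stands in for the actual PackageName parser:

module LenientParse where

import Prelude
import Data.Either (Either)
import Data.String (toLower)

-- Lower-case the raw name before handing it to the parser.
parseLenient :: forall e a. (String -> Either e a) -> String -> Either e a
parseLenient parse = parse <<< toLower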

Backend specific dependencies

Some packages have implicit dependencies on JavaScript packages that aren’t enforced anywhere and must be propagated manually. For instance react-basic unsurprisingly depends on react and react-dom, uuid depends on uuid and uuid-validate, …

At work we circumvent this by installing our packages with both psc-package (for PureScript sources and dependencies) and npm (an empty "files" array in package.json ensuring this only installs JavaScript dependencies). As you can imagine, versions installed by npm frequently differ from those in our internal package set, so we hope for the best and things mostly work because we don't update our JavaScript dependencies often. At least we only have one npm dependency to worry about.

I'd like to propose tracking backend-specific dependencies and installers in package targets to solve this, by giving the following type to backends instead of Optional Text:

{ compile : Optional Text
, install : Optional Text
, dependencies : Map Text Text
}

Spago could then automatically install the backend dependencies with its specified install command (if any).

An alternative would be to define backends as a union, so that unambiguous fields can be omitted and types can be made as precise as needed per backend:

< JavaScript : { installer : < Npm | Yarn >, dependencies : Map Text Text } >

There are probably better ways to do this, and the registry perhaps doesn't need to take this into account from the start; in any case I'm curious to hear other people's thoughts and solutions about this.

Document why `Package` manifests won't have a uniform schema in the Registry

In #4 I expressed this concern:

I was thinking about moving the packages folder under v1 too, but decided otherwise.
The reason is that when we change the Package type (or in general when the hash of Registry.dhall changes) we'll make a v2 folder - we wouldn't have to migrate all the files right away, so I think keeping some of them on old versions of the schema is probably fine.
I don't have a strong opinion on this and I'm fine with either - there's value in having a packages folder for every version: it's more files, but consumers then have the assurance that all packages down that folder match the corresponding type.

As a note, migrating between versions of different schemas is usually possible in pure Dhall.
Fictional example: let's say in v2 we want to go from a name : Text to name : List Text.

Then we could write a migration function in Dhall:

let v1tov2 = \(pkg : ./v1/Package.dhall) ->  pkg // { name = [ pkg.name ] }

in v1tov2

..and if you'd like to migrate some old definition of a package, then it would just be a matter of applying the function to it:

./v1tov2.dhall ./some_v1_package_definition.dhall

This is nice because consumers can choose which version of the schema they want to work with, and migrate the data according to their needs, while at the same time we don't need to duplicate data here at all.

A recap of the problem first: we'll want to change the Package schema over time. How do we handle migrations, old versions, clients coding against one interface, etc?

In the quote above you can find a solution for how to handle data migration, but what's still not clear is how to ensure that clients can pull the manifest files in the schema they expect.

At first I thought we had a choice between these two options:

  1. we keep manifests in the version they've been originally published in
  2. we keep multiple copies of the same manifest, one for each version of the Package schema

Option 2 would be really nice, because e.g. a client that needs to query the manifest for prelude/v5.1.2 could choose to do it for v1 or v2.

..however, we cannot do this, because of the constraint of keeping in the registry repo the exact manifest that was contained in the published tarball.
If we had multiple instances of the same manifest (but in different schemas), then we wouldn't know which one got into the tarball.

This means that clients will have to negotiate the version of the Package schema they're trying to use.
This means:

  • download the manifest
  • try to typecheck it against v1
  • try to typecheck it against v2
  • ..etc.

This is of course totally fine, as long as we are aware of it.
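
In code, the negotiation could look something like this, where the decoders and the migration are hypothetical helpers:

module ManifestNegotiation where

import Prelude
import Control.Alt ((<|>))
import Data.Maybe (Maybe)

-- Try the newest schema first, then fall back to older ones and migrate forward.
negotiate
  :: forall v1 v2
   . { decodeV2 :: String -> Maybe v2
     , decodeV1 :: String -> Maybe v1
     , migrate :: v1 -> v2
     }
  -> String
  -> Maybe v2
negotiate schemas raw =
  schemas.decodeV2 raw <|> map schemas.migrate (schemas.decodeV1 raw)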

So all of this needs to be documented.

Support packages with sources not in `src`

For any package published on Bower right now it is possible to assume that one would get their source files with the following glob: src/**/*.purs

In purescript/spago#288 we explored what it would take to lift that limitation (it's a good read to get some context on what follows here)

One of the main blockers for that was the assumption coming from Bower registry that "one package = one repo". Since here we're moving away from the git-based model and towards packaging sources in immutable archives, I thought we could take this occasion to allow publishing packages with sources in arbitrary paths.
This would allow things like publishing several packages from the same monorepo.

There has been some discussion in #4 about this, and we're spinning off this issue to keep track of a concern expressed there:

The issue I'm worried about with sources is basically for setups which don't use Spago; I want to avoid making life difficult for people who are still using other things like psc-package or pulp/bower. Yes, it's probably not too much effort to switch a build script of just a few lines in most cases, especially with spago init, but it's harder if the assumption is built into more complex tooling, such as, say, editor plugins. This assumption is built into purs publish right now too. It becomes significantly more complicated if these build scripts or tooling can't/don't want to assume the presence of Spago on the machine in use, or don't want to add Dhall as a dependency.

I would be happy if we could have the curator require that the sources for the lib target for any published package are just ["src/**/*.purs"], at least to start with; I could see us relaxing that later once most of the ecosystem is using this registry.

Print out statistics on package failures in Bower import

The Bower import tool converts Bowerfiles for packages in the Bower registry into valid Manifests. However, many packages fail along the way and are collected into a PackageFailures map:

https://github.com/purescript/registry/blob/3f5c48e518f657ab03c4178df976648c9fb1437f/ci/src/Registry/Scripts/BowerImport/Error.purs#L15

Later, these failures are written out to a file called bower-exclusions.json:
https://github.com/purescript/registry/blob/3f5c48e518f657ab03c4178df976648c9fb1437f/ci/src/Registry/Scripts/BowerImport.purs#L167-L169

However, we also would benefit from knowing some statistics about the package failures, which could be written out to a file or just printed to the console. As a first cut, we'd specifically like to know:

  • How many packages succeeded in total? How many packages failed in total?
  • How many package versions succeeded in total? How many package versions failed in total?
  • How many incidences are there of each error (MissingLicense, BadDependencyVersions, etc.), sorted in descending order (i.e. the errors with the most failures at the top of the list and those with the fewest at the bottom)?

These statistics can help us prioritize what issues to fix and report on how many packages are being omitted from the PureScript registry and for what reasons.
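
A sketch of the error tallying, sorted with the most common errors first; error names are plain strings here for simplicity:

module FailureStats where

import Prelude
import Data.Array (sortBy)
import Data.Map as Map
import Data.Tuple (Tuple(..), snd)

-- Count occurrences of each error and sort descending by count.
tally :: Array String -> Array (Tuple String Int)
tally errors =
  sortBy (\a b -> compare (snd b) (snd a))
    (Map.toUnfoldable (Map.fromFoldableWith (+) (map (\e -> Tuple e 1) errors)))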

Implement CI checks pipeline

There are a bunch of things that we need to enforce in CI, so that we don't add packages that don't follow the criteria we defined.

Checks we want to run for every package:

  • verify that the package name satisfies the constraints (as defined in #3)
  • verify that the repo value is the same for all the versions of a package (to prevent packages "stealing" package names)
  • verify that the license is a valid SPDX expression (see #13)
  • verify that the repo is on GitHub (see #15)
  • reject packages with ES6 syntax in their FFI (see #24)
  • ensure that a package has a lib Target
  • ensure that a package being added has sources: ["src/**/*.purs"] in its lib Target (see here)
  • ensure that no subdir is present for new packages until we are confident that downstream tooling can deal with it (see here)

This issue is about:

  • tracking that all these checks are implemented
  • writing them down in the spec, so that clients can implement them too
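
The runner itself can stay simple: each check returns an error message on failure, and we collect them all rather than stopping at the first. The manifest shape and names below are illustrative:

module Checks where

import Prelude
import Data.Array (mapMaybe)
import Data.Either (Either(..))
import Data.Maybe (Maybe)

type Check manifest = manifest -> Maybe String

-- Run every check and report all failures at once.
runChecks :: forall manifest. Array (Check manifest) -> manifest -> Either (Array String) Unit
runChecks checks manifest = case mapMaybe (\check -> check manifest) checks of
  [] -> Right unit
  errors -> Left errors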
