
Comments (14)

dmikusa commented on September 3, 2024

The dependency-building infrastructure described above does not encompass any of the dependencies that contribute to the Java buildpacks. The Java buildpacks have their own system for managing dependencies. It is worth considering what a consolidation of these systems might look like.

+1 - I was just talking to @ForestEckhardt about how we can consolidate some of our pipelines. I'd definitely be interested in arriving at a single way to handle dependencies across all the Paketo buildpacks.

I'd also be happy to share our experiences using a more federated approach and pulling dependencies directly from upstream locations. There are some good parts and some challenges as well.


robdimsdale commented on September 3, 2024

One possible additional goal that we could consider is making the dep server itself a system that others can re-use - either in part or as a whole. Although this would potentially increase the support burden on the dependencies team, I think it could also result in a significant value add for other buildpack authors outside of the paketo core team.

I have three cases in mind where buildpack authors outside of the paketo team could benefit from a reusable (and federated) dep-server:

  1. Third-party OSS teams maintaining their own buildpacks with GitHub infrastructure - they could stand up their own dep server (e.g. api.deps.some-oss-team.com) and use GitHub Actions to build their dependencies (see the sketch after this list for what querying such a server might look like).
  2. Third-party commercial teams maintaining their own buildpacks with private GitHub infrastructure. I don't know for sure how GitHub Actions works in a private model, but assuming it works similarly to the OSS actions, this would be identical to the above.
  3. Third-party commercial teams who cannot use GitHub Actions. They would potentially be able to stand up a dep server to gain its API, but would have to use an alternative system (e.g. Concourse) to run the GitHub scripts.
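To make the reuse idea concrete, here is a minimal sketch of what querying such a dep server could look like from Go. The endpoint shape (`/v1/dependency?name=...`) and the response fields are assumptions loosely modeled on the Paketo dep-server's API; a third-party deployment might expose something different.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// DependencyMetadata mirrors the kind of per-version metadata a dep server
// could expose. The field names here are illustrative, not a published schema.
type DependencyMetadata struct {
	Name         string   `json:"name"`
	Version      string   `json:"version"`
	URI          string   `json:"uri"`
	SHA256       string   `json:"sha256"`
	SourceURI    string   `json:"source"`
	SourceSHA256 string   `json:"source_sha256"`
	CPE          string   `json:"cpe"`
	PURL         string   `json:"purl"`
	Licenses     []string `json:"licenses"`
}

func main() {
	// Hypothetical third-party deployment of a dep server (see item 1 above).
	resp, err := http.Get("https://api.deps.some-oss-team.com/v1/dependency?name=go")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var versions []DependencyMetadata
	if err := json.NewDecoder(resp.Body).Decode(&versions); err != nil {
		panic(err)
	}

	for _, v := range versions {
		fmt.Printf("%s %s -> %s (sha256 %s)\n", v.Name, v.Version, v.URI, v.SHA256)
	}
}
```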


dmikusa commented on September 3, 2024

I wanted to also add this proposal as well: https://docs.google.com/document/d/1g5rRW-oE_v8Gdvz-CiCOK9z2rxg6L5XniKI25Zq2j6M/edit

I believe it to be complementary to what's been discussed already. You can take a look at the google doc for now, but I hope to get this into an RFC format in the near future.


fg-j commented on September 3, 2024

@dmikusa-pivotal I'd like to hear about your experience with the federated approach, for sure.


ForestEckhardt commented on September 3, 2024

I have a couple of questions/comments:

  1. Would moving to the workflow using already-hosted dependencies mean that we would stop trying to gather other information about these dependencies, such as the purl, cpe, and license, or would we still want to generate that information for our buildpack.toml?
  2. There are certain dependencies that we repackage for size reasons. I am thinking mostly of the .NET dependency, which we do some minor pruning on and also repackage using xz compression. By switching to the Microsoft-hosted dependencies we would no longer be getting that compression, and our dependencies would grow in size.
  3. There are some dependencies that we currently restructure to ensure that they can be decompressed directly into a layer without any further manipulation. This is often accomplished by stripping a top-level directory off of the packaged artifact. This is something that packit is capable of; I just wanted to note it (see the sketch after this list).
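To illustrate that restructuring step, here is a minimal sketch of extracting a gzipped tarball into a layer directory while dropping the leading path component. It uses only the Go standard library and is not the actual packit implementation.

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// extractStrippingTopDir unpacks a .tar.gz into dest, dropping the first path
// component of every entry, so "some-dep-1.2.3/bin/tool" lands at "bin/tool".
// NOTE: a real implementation would also handle symlinks and guard against
// path traversal; this sketch omits both for brevity.
func extractStrippingTopDir(archivePath, dest string) error {
	f, err := os.Open(archivePath)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}

		// Strip the leading directory component.
		parts := strings.SplitN(filepath.ToSlash(hdr.Name), "/", 2)
		if len(parts) < 2 || parts[1] == "" {
			continue
		}
		target := filepath.Join(dest, filepath.FromSlash(parts[1]))

		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, os.FileMode(hdr.Mode)); err != nil {
				return err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
				return err
			}
			out, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, os.FileMode(hdr.Mode))
			if err != nil {
				return err
			}
			if _, err := io.Copy(out, tr); err != nil {
				out.Close()
				return err
			}
			out.Close()
		}
	}
}

func main() {
	// "dependency.tar.gz" and "layer" are placeholder paths.
	if err := extractStrippingTopDir("dependency.tar.gz", "layer"); err != nil {
		panic(err)
	}
}
```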

Overall I am excited by the prospect of being able to eliminate some of the magic and some of the back-breaking work, and I think that it would make it easier for us to get new languages from community members. I think that it will also ultimately make our buildpacks make more sense, because we are downloading and installing the same dependency that your average developer is using. I think that this will make it easier for users that need to replace the dependency with one hosted on a mirror as well, because the artifact in question will not be any different from the publicly available one.


dmikusa commented on September 3, 2024

@dmikusa-pivotal I'd like to hear about your experience with the federated approach, for sure.

Some notes off the top of my head:

  1. You are dependent on the 3rd party's CDN for performance. Java pulls a number of dependencies from Github. The Github CDN is great in the US. It's not great in other parts of the world. I've gotten reports that it takes users upwards of five minutes to download something that takes me 20s to download. As the buildpack team, we have little recourse for issues like this. (As a side note, this is why caching and alternative download repo support are high on the priority list for the Java team)

  2. Detecting dependency updates is a pain. We're managing something like 20 different actions to check and fetch dependencies. In good cases, there's an API we can use to check for new versions & fetch downloads. In bad cases, we're basically screen-scraping data and links, which is fragile and breaks occasionally. Even downloading GitHub resources can be a pain because projects structure their GitHub repos/releases/tags/assets in slightly different ways. (See the polling sketch after this list.)

  3. You have to be kind to the 3rd party CDN you're targeting. Many of them are OSS and we don't want to generate a lot of traffic, so we have to keep polling for updates at less frequent intervals (like daily). We also have to think about what pointing lots of buildpack users to someone's CDN might do. Buildpacks may need to download resources multiple times, which can consume more bandwidth and skew download metrics.

  4. It doesn't really work if you need to get resources that are behind a login. Even if we're allowed to redistribute them, if the vendor requires you to log in first, that doesn't work, because essentially the end-user would need to log in to download the resource.

  5. Most 3rd parties are not publishing sha256 hashes for downloads. Many don't publish any hash, but some will use just a sha1 hash. This means that we have to download some unknown resource, calculate the sha256 hash and use that. It's not technically hard, but it voids some of the usefulness of the sha256 hash.

  6. Speaking of hashes, some 3rd parties will change their downloads for a published release and not bump the version number, so all of a sudden the hash will just stop matching and the buildpack will break. It then requires us to investigate and see what happened, which isn't always easy to determine. Then, if we trust the change, it requires us to update buildpack.toml and publish a new buildpack version.

  7. Like @ForestEckhardt mentioned, you get whatever the 3rd party publishes. If they have a weird folder structure or include a bunch of stuff you don't need, that all gets installed. This hasn't been a huge issue for us, but we do occasionally have to strip a top-level directory off an archive or move some binaries into a bin/ directory as part of the install process in the buildpack. Probably what's more challenging here is the upcoming ARM64 work. It's getting more common, but not everyone is publishing ARM64 binaries at the moment. If the project doesn't have them, then you're kind of stuck.

  8. I think 7 raises a question, though: should we be stripping things out, or should we be giving users stock downloads? We've largely taken the approach of providing stock downloads (what you get if you as a user go and download the resource), but there have been cases where users have asked us to prune things. I can understand both sides of the argument.
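As a concrete reference for the check-and-fetch loop described in point 2 (and the hash calculation in point 5), here is a minimal sketch that polls the standard GitHub releases API for the latest version and computes a sha256 for the first asset. The repository name is a placeholder, and a real update action would compare the tag against what is already in buildpack.toml rather than just printing it.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// release mirrors the handful of fields we need from the GitHub releases API.
type release struct {
	TagName string `json:"tag_name"`
	Assets  []struct {
		Name               string `json:"name"`
		BrowserDownloadURL string `json:"browser_download_url"`
	} `json:"assets"`
}

func main() {
	// "some-org/some-dependency" is a placeholder for a project that
	// publishes its releases through GitHub.
	resp, err := http.Get("https://api.github.com/repos/some-org/some-dependency/releases/latest")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var rel release
	if err := json.NewDecoder(resp.Body).Decode(&rel); err != nil {
		panic(err)
	}
	fmt.Println("latest version:", rel.TagName)

	if len(rel.Assets) == 0 {
		return
	}

	// Many projects publish no sha256 (point 5), so compute one ourselves
	// from the downloaded asset.
	asset, err := http.Get(rel.Assets[0].BrowserDownloadURL)
	if err != nil {
		panic(err)
	}
	defer asset.Body.Close()

	sum := sha256.New()
	if _, err := io.Copy(sum, asset.Body); err != nil {
		panic(err)
	}
	fmt.Printf("%s sha256 %x\n", rel.Assets[0].Name, sum.Sum(nil))
}
```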

Would moving to the workflow using already-hosted dependencies mean that we would stop trying to gather other information about these dependencies, such as the purl, cpe, and license, or would we still want to generate that information for our buildpack.toml?

On the Java buildpacks, we have this information in buildpack.toml and update it when we update releases. The base information tends not to change, but we have to keep the versions in these fields all in sync. How is this being sourced with deps-server?
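For readers who have not looked inside a buildpack.toml, the sketch below shows the kind of per-dependency metadata being discussed (purl, cpe, license, hashes) and prints it as a TOML fragment. The field set is modeled on entries commonly seen in Paketo buildpack.toml files, and all values are made up for illustration.

```go
package main

import (
	"os"

	"github.com/BurntSushi/toml"
)

// metadataDependency models one [[metadata.dependencies]] entry. Individual
// buildpacks record slightly different keys; this set is illustrative.
type metadataDependency struct {
	ID           string   `toml:"id"`
	Name         string   `toml:"name"`
	Version      string   `toml:"version"`
	URI          string   `toml:"uri"`
	SHA256       string   `toml:"sha256"`
	Source       string   `toml:"source"`
	SourceSHA256 string   `toml:"source_sha256"`
	CPE          string   `toml:"cpe"`
	PURL         string   `toml:"purl"`
	Licenses     []string `toml:"licenses"`
	Stacks       []string `toml:"stacks"`
}

type metadata struct {
	Dependencies []metadataDependency `toml:"dependencies"`
}

type buildpackTOML struct {
	Metadata metadata `toml:"metadata"`
}

func main() {
	// Placeholder values only; hashes and identifiers are not real.
	doc := buildpackTOML{Metadata: metadata{Dependencies: []metadataDependency{{
		ID:           "some-dependency",
		Name:         "Some Dependency",
		Version:      "1.2.3",
		URI:          "https://example.com/some-dependency-1.2.3-linux-x64.tar.xz",
		SHA256:       "<sha256-of-packaged-artifact>",
		Source:       "https://example.com/some-dependency-1.2.3-src.tar.gz",
		SourceSHA256: "<sha256-of-source>",
		CPE:          "cpe:2.3:a:example:some-dependency:1.2.3:*:*:*:*:*:*:*",
		PURL:         "pkg:generic/some-dependency@1.2.3",
		Licenses:     []string{"Apache-2.0"},
		Stacks:       []string{"io.buildpacks.stacks.bionic"},
	}}}}

	// Emit the entry as the buildpack.toml fragment it represents.
	if err := toml.NewEncoder(os.Stdout).Encode(doc); err != nil {
		panic(err)
	}
}
```

Keeping versions, hashes, cpe, and purl in sync across these fields is exactly the bookkeeping burden described above.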


ForestEckhardt commented on September 3, 2024

On the Java buildpacks, we have this information in buildpack.toml and update it when we update releases. The base information tends not to change, but we have to keep the versions in these fields all in sync. How is this being sourced with deps-server?

It is being generated when the dependency is input into the system, so it is possible. I was just curious if this also meant we would be stripping down the data we are providing for each dependency. For the most part, many of our buildpacks use this to construct the old SBOM format, and it may not make sense to have it there in the long run if we are using Syft.


ryanmoran commented on September 3, 2024

Would moving to the workflow using already-hosted dependencies mean that we would stop trying to gather other information about these dependencies, such as the purl, cpe, and license, or would we still want to generate that information for our buildpack.toml?

@ForestEckhardt, yes. We would still want this information. So, a solution would need to take this into account.


ryanmoran commented on September 3, 2024

For the most part many of our buildpacks use this to construct the old SBOM format and it may not make sense to have it there in the long run if we are using Syft.

It is also used to generate the new SBOM format: https://github.com/paketo-buildpacks/packit/blob/2247967a3f873b178f6fb16c5e6411646ca0882a/sbom/sbom.go#L72-L97


ForestEckhardt commented on September 3, 2024

I stand corrected


sophiewigmore commented on September 3, 2024

@dmikusa-pivotal thanks for outlining those cases, it's super helpful. I'd say that item 2 ("Detecting dependency updates is a pain") is true regardless of what dependency management approach you take. The dep-server has similar issues anyway, so I'm not too worried about that.

Out of the items you mentioned, number 6 around hashes changing is the most concerning to me. Whatever process we implement, I think it'll be really important to have a way to reconcile mismatched SHAs or detect changes.

Numbers 7 and 8 around modifications to the dependency are also pretty complicated, but I think moving to the federated approach will be a big help with this, since we can potentially delegate those types of decisions to language-family maintainers.


dmikusa commented on September 3, 2024

👍

One other thought that's been a hindrance for us. Github Actions are not well suited to checking for dependency updates. There is no trigger or event, even if the 3rd party is using Github to release code, so you end up having to poll for updates.

Presently, we're polling daily, because if we do it more often we'll blow past the limits GitHub Actions puts on the execution of our jobs. In some cases, this means we have to manually trigger the job, such as when we need to get an urgent update released. It's not a big deal and it's easy to do, but it's manual work.

Also, if you have a buildpack that has many dependencies then you run into an issue with how to organize them. The Liberty buildpack, for example, has quite a few dependencies that we monitor. We presently have them set up such that each dependency has its own workflow. The workflows are largely the same but just check for different resources. This has some advantages in that it's easy to have them all run in parallel, if one fails it doesn't impact others, and it's easy to trigger just a single resource if you need to force an update or re-run a failed update. It's not nice in that the parallelization makes us hit Github Action limits faster, there's lots of duplication across workflows, and it's extremely inefficient (Github Actions spins up a new VM for each workflow & job).

Personally, I'd like them to be more efficient. I've thought about how we could merge them all into a single workflow and job with multiple steps, but then you don't get the same parallelization and it's not easy to run/re-run a specific update. It's also not clear if that would help reduce duplication in the workflow; possibly, but I haven't looked at it from that angle.

I've also thought about moving this type of fetch outside of Github Actions, somewhere it can be done more efficiently and then using hooks/API to trigger Github Actions or submitting a PR directly. That's a big step though and we haven't had time to investigate it further.


garethjevans commented on September 3, 2024

We don't have the luxury of using GitHub Actions internally, so we have built out a dependency update system in Concourse with a few custom Concourse Resources.

Re Point 5: There are a lot of dependencies that don't provide sha256sums but do provide others. If the sha256 exists, we download and use that; if it doesn't, we calculate a new one against the downloaded binary, but we also verify the downloaded binary against some of the other shas that are available. This gives us an extra bit of confidence that the binary is what we were expecting it to be.

Re Semver: It would be great if everything supported semver compatible version numbers - but they don't. We've added in quite a bit of logic around handling non-semver compatible version numbers.
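To make the Point 5 flow above concrete, here is a minimal sketch that verifies a download against an upstream-published sha1 when no sha256 is available, and records a locally computed sha256 alongside it. The URL and published hash are placeholders; the sha1 is used only to cross-check what the upstream publishes, not as the hash to pin.

```go
package main

import (
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder values: a project that only publishes a sha1 for its release.
	uri := "https://example.com/some-dependency-1.2.3.tar.gz"
	publishedSHA1 := "<sha1-published-by-upstream>"

	resp, err := http.Get(uri)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Feed the download through both hashes in a single pass.
	s1 := sha1.New()
	s256 := sha256.New()
	if _, err := io.Copy(io.MultiWriter(s1, s256), resp.Body); err != nil {
		panic(err)
	}

	// Cross-check whatever hash the upstream does publish...
	if hex.EncodeToString(s1.Sum(nil)) != publishedSHA1 {
		panic("downloaded binary does not match the published sha1")
	}

	// ...and record our own sha256 for use in buildpack.toml going forward.
	fmt.Println("sha256:", hex.EncodeToString(s256.Sum(nil)))
}
```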


ryanmoran commented on September 3, 2024

As the RFCs for dependency management have been approved and merged, and work is already underway to implement them, I will close this issue.

