Repology

Repology is a service which monitors a lot of package repositories and other sources and aggregates data on software package versions, reporting new releases and packaging problems.

This repository contains Repology updater code, a backend service which updates the repository information. See also the web application code.

Dependencies

Needed for fetching/parsing repository data:

Development dependencies

Optional, for doing HTML validation when running tests:

Optional, for checking schemas of configuration files:

Optional, for python code linting:

Running

Preparing

Since Repology rules live in a separate repository, you'll need to clone it first. The location is arbitrary, but the default configuration file points to the rules.d subdirectory, so using it is the simplest way.

git clone https://github.com/repology/repology-rules.git rules.d

Configuration

First, you may need to tune settings which are shared by all Repology utilities, such as the directory for storing downloaded repository state or the DSN (a string which specifies how to connect to the PostgreSQL database). See repology.conf.default for default values. To override them, create repology.conf in the same directory (don't edit repology.conf.default!), specify the path to an alternative config via the REPOLOGY_SETTINGS environment variable, or override settings via the command line.

By default, Repology uses the ./_state directory for storing raw and parsed repository data and repology/repology/repology as database/user/password on localhost.

Creating the database

For the following steps you'll need to set up the database. Ensure PostgreSQL server is up and running, and execute the following commands to create the database for repology:

psql --username postgres -c "CREATE DATABASE repology"
psql --username postgres -c "CREATE USER repology WITH PASSWORD 'repology'"
psql --username postgres -c "GRANT ALL ON DATABASE repology TO repology"
psql --username postgres --dbname repology -c "GRANT CREATE ON SCHEMA public TO PUBLIC"
psql --username postgres --dbname repology -c "CREATE EXTENSION pg_trgm"
psql --username postgres --dbname repology -c "CREATE EXTENSION libversion"

If you want to change the credentials, don't forget to add the actual ones to repology.conf.

Populating the database

Note that you need more than 11GiB of disk space for Repology PostgreSQL database and additionally more than 11GiB space for raw and parsed repository data if you decide to run a complete update process.

Option 1: use dump

The fastest and simplest way to fill the database is to use a database dump of the main Repology instance:

curl -s https://dumps.repology.org/repology-database-dump-latest.sql.zst | unzstd | psql -U repology

Option 2: complete update

Another option is to go through the complete update process, which includes fetching and parsing all repository data from scratch and pushing it to the database.

First, init the database schema:

./repology-update.py --initdb

Note that this command drops all existing data in Repology database, if any. You only need to run this command once.

Next, run the update process:

./repology-update.py --fetch --fetch --parse --database --postupdate

Expect it to take several hours the first time; subsequent updates will be faster. You can use the same command to update. Brief explanation of the options used here:

  • --fetch tells the utility to fetch raw repository data (download files, scrape websites, clone git repos) into state directory. Note that it needs to be specified twice to allow updating.
  • --parse enables parsing downloaded data into internal format which is also saved into state directory.
  • --database pushes processed package data into the database.
  • --postupdate runs optional cleanup tasks.

Documentation

Author

License

GPLv3 or later, see COPYING.

repology-updater's People

Contributors

amdmi3, c72578, daissi, dshein-alt, dylanaraps, fd00, fthaltun, jmahler, jon-turney, jriddell, jrmarino, jtojnar, l2dy, lieryan, lleyton, lots0logs, mhoush, mikhailnov, mmatongo, nemani, pabs3, pink-mist, robert-scheck, ryantm, ryzhovau, staudey, timb87, tombriden, vsoch, whissi

repology-updater's Issues

Add chocolatey support

Fetcher is rather trivial: get https://chocolatey.org/api/v2/Packages()?$filter=IsLatestVersion, parse XML, get next page from <link rel="next" href="">. All package info is available in XML, including name, version, tags, comment, author, www.
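
The paging step described above can be sketched in Python with the standard library. This is only an illustration: the function name `parse_feed_page` and the sample feed are made up, and the real Chocolatey feed carries more fields (version, tags, author and so on) whose exact element names are not shown here.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the feed


def parse_feed_page(xml_text):
    """Return (entry titles, next page URL or None) for one feed page."""
    root = ET.fromstring(xml_text)
    titles = [entry.findtext(ATOM + "title") for entry in root.iter(ATOM + "entry")]
    next_url = None
    # The paging link is looked up among the direct children of <feed>.
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "next":
            next_url = link.get("href")
            break
    return titles, next_url


# Hypothetical, heavily shortened page used only for demonstration.
SAMPLE_PAGE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>7zip</title></entry>
  <entry><title>git</title></entry>
  <link rel="next" href="https://example.invalid/Packages?$skip=2"/>
</feed>"""

titles, next_url = parse_feed_page(SAMPLE_PAGE)
print(titles, next_url)
```

A real fetcher would call this in a loop, downloading `next_url` until it comes back as `None`.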

Split repo fetching and parsing

These are not tied together. Fetching is more generic and may be common to multiple parsing techniques. Currently, there are just 2 types of fetchers:

  • plain file [with options to gunzip or bunzip it after downloading]
  • git repository
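
One possible shape for such a split, as a rough sketch: a repository is described by an independent fetch step and parse step, so the same fetcher (here a plain download with optional gunzip) can feed different parsers. All names below (`Repository`, `gunzipped`, `parse_lines`) are hypothetical and not taken from the Repology code.

```python
import gzip
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Repository:
    name: str
    fetch: Callable[[], bytes]            # how to obtain raw data
    parse: Callable[[bytes], List[str]]   # how to turn raw data into package names


def gunzipped(fetch):
    """Wrap a fetcher so that its output is gunzipped after downloading."""
    return lambda: gzip.decompress(fetch())


def parse_lines(data):
    """Toy parser: one package name per non-empty line."""
    return [line for line in data.decode().splitlines() if line]


def fake_download():
    """Stand-in for a real download; returns gzip-compressed bytes."""
    return gzip.compress(b"foo\nbar\n")


repo = Repository("demo", gunzipped(fake_download), parse_lines)
packages = repo.parse(repo.fetch())
print(packages)
```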

Detect dates in versions

E.g. 20[01][0-9]\.?[01][0-9]\.?[0123][0-9]

What to do with this:

  • Convert to a single format (see abcmidi package problem) to fix comparison
  • If all items contain a date, compare these instead (other parts are likely uninformative, e.g. 0.0.20160916 vs. git20160916 vs. 2016.09.16)
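
A minimal sketch of the detection and normalization idea, using the pattern above with capture groups added. The helper name `extract_date` is an assumption, and the pattern does not validate month or day ranges (for example, month 19 would still match).

```python
import re

# Pattern from the issue text, with groups for year, month and day.
DATE_RE = re.compile(r"(20[01][0-9])\.?([01][0-9])\.?([0123][0-9])")


def extract_date(version):
    """Return the embedded date as YYYYMMDD, or None if no date is found."""
    match = DATE_RE.search(version)
    if match is None:
        return None
    return "".join(match.groups())


versions = ["0.0.20160916", "git20160916", "2016.09.16", "1.2.3"]
dates = [extract_date(v) for v in versions]
print(dates)
```

With all three date-bearing versions reduced to the same string, they compare as equal regardless of prefix or separators.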

Website functionality ideas

  • Browse paginated package lists A...Z
  • Browse by category
  • Browse by maintainer
  • Browse outdated packages per repository
  • Browse by packages (packages with best support across distros)
  • Compare repositories
  • Compare maintainers
  • Suggest maintainer with same interests

Don't leave partial state after failed fetch

Failing to fetch a single repository shouldn't interrupt the whole fetch; it should only stop that repository from updating. Also, failure to fetch most repositories should remove their state, so incomplete state is not parsed.
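
One way to avoid partial state, sketched under assumptions: fetch into a temporary directory next to the state directory and swap it in only on success. The function name `fetch_with_rollback` and the callback signature are hypothetical. Note that this sketch keeps the previous complete state on failure rather than removing it, which is a policy choice.

```python
import os
import shutil
import tempfile


def fetch_with_rollback(state_dir, fetch):
    """Run fetch(tmp_dir); replace state_dir only if the fetch succeeds."""
    parent = os.path.dirname(os.path.abspath(state_dir))
    tmp_dir = tempfile.mkdtemp(dir=parent)
    try:
        fetch(tmp_dir)
    except Exception:
        shutil.rmtree(tmp_dir)  # discard the partial download
        return False
    if os.path.exists(state_dir):
        shutil.rmtree(state_dir)
    os.rename(tmp_dir, state_dir)
    return True


def good_fetch(path):
    with open(os.path.join(path, "index"), "w") as f:
        f.write("data")


def bad_fetch(path):
    with open(os.path.join(path, "index"), "w") as f:
        f.write("partial")
    raise IOError("connection lost")


root = tempfile.mkdtemp()
state = os.path.join(root, "repo.state")
first = fetch_with_rollback(state, good_fetch)
second = fetch_with_rollback(state, bad_fetch)
with open(os.path.join(state, "index")) as f:
    content = f.read()
print(first, second, content)
```

Because each repository gets its own call, a failure in one does not interrupt the others.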

Gentoo needs more package splits

Need to be split: btf (dev-java and sci-libs)

Actually, a lot more:

find gentoo.git -mindepth 2 -maxdepth 2 -type d |
   grep -Ev 'dev-(perl|python|haskell)' |
   awk -F/ '{print $NF}' |
   sort | uniq -d

ace acl ada amap analog apel atlas attica auctex baloo balsa barcode bbdb bfm binclock bluedevil bson btf build c-support calc calendar cdcover cdrtools charm checkpassword coffee-script color crystal csv daemontools dash dictionary dirdiff docker dolphin ebuild-mode ecb eject elib emacs ess exo fam fcgi ffmpeg fuse gambit gdl git glade glu gnupg gnuplot gom gpgme grip gsasl haskell-mode highline icecream igrep info jack jal jama jde jpeg json kactivities kde-gtk-config kdeplasma-addons kdesu kfilemetadata kglobalaccel khotkeys kinfocenter kmenuedit knewstuff krunner kscreen kstart ksysguard kwin kwrited languagetool launchy lemon libelf libffi libgudev libiconv libintl libkscreen libnet libusb locale lookup lzma magic mailcrypt mailx man mars mash mavros mc mediawiki mew milou mime-types mldonkey mmix mmm-mode modutils mongo mpack mpc msgpack muse mysql nagios nemesis ninja nitrogen notification-daemon nut ocaml openmsx otter pam par pcl pdv picard pkgconfig planner plasma-mediacenter plasma-nm plasma-workspace pmake pms polkit-kde-agent polyglot powerdevil psgml psi python-mode rails re2 redis reduce riece ruby rubygems screen session shadow signify silo ski skkserv slim slurm smack sml-mode snappy spice spin splat splice sqlite3 ssh surf systemsettings szip teco texinfo tf time tokyocabinet tornado tree uclibc udev vc vm w3m xclip xslide xsp yacc zenburn zenirc
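
For reference, roughly the same check can be sketched in Python. The function name `duplicated_names` is made up, and unlike the shell pipeline the exclusion pattern is applied to the category name only rather than to the whole path; the demo tree is fabricated for illustration.

```python
import os
import re
import tempfile
from collections import Counter

SKIP = re.compile(r"dev-(perl|python|haskell)")


def duplicated_names(tree):
    """Return package directory names that occur in more than one category."""
    counts = Counter()
    for category in sorted(os.listdir(tree)):
        cat_path = os.path.join(tree, category)
        if not os.path.isdir(cat_path) or SKIP.search(category):
            continue
        for package in os.listdir(cat_path):
            if os.path.isdir(os.path.join(cat_path, package)):
                counts[package] += 1
    return sorted(name for name, n in counts.items() if n > 1)


# Tiny fake tree standing in for a gentoo.git checkout.
tree = tempfile.mkdtemp()
for path in ["dev-java/btf", "sci-libs/btf", "dev-perl/json", "dev-libs/json"]:
    os.makedirs(os.path.join(tree, *path.split("/")))

duplicates = duplicated_names(tree)
print(duplicates)
```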

Add sisyphus support

Code is there, need an easy way to download all .spec files. Todo: contact sisyphus guys

Extend version ignores

  • Ignore only newer (single version bad)
  • Ignore always (versioning schema totally broken)

Improve pkgsrc support

Currently we use list of packages instead of parsing pkgsrc, because the latter process (make index) is too slow. Not sure what we can do here now though.

Track package state changes

Track package state changes (such as version updates, new packages). Provide RSS streams of such events, display icons for recently updated/added packages.
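
A minimal sketch of how such events could be derived, assuming two snapshots are available as plain `{package: version}` dictionaries. The function name `state_changes` and the event tuples are illustrative only.

```python
def state_changes(old, new):
    """Compare {package: version} snapshots and return a list of events."""
    events = []
    for name in sorted(new):
        if name not in old:
            events.append(("added", name, new[name]))
        elif old[name] != new[name]:
            events.append(("updated", name, new[name]))
    for name in sorted(old):
        if name not in new:
            events.append(("removed", name, old[name]))
    return events


events = state_changes(
    {"zlib": "1.2.8", "curl": "7.50.0", "oldpkg": "1.0"},
    {"zlib": "1.2.11", "curl": "7.50.0", "newpkg": "0.1"},
)
print(events)
```

A list like this could then be rendered as an RSS feed or used to pick which packages get a "recently updated" icon.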

Parallel downloading/parsing

Since repositories are independent, it should be easy to fetch and parse them in parallel. This will be really useful for slow repositories such as pkgsrc (index generation) and Fedora (slow fetching).
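
A sketch of the idea using `concurrent.futures` from the standard library. The names `update_all` and `fake_update` are hypothetical, and `fake_update` stands in for the real per-repository fetch and parse work. A failure in one repository is recorded and does not stop the others.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def update_all(repositories, update_one, max_workers=4):
    """Run update_one(name) for each repository concurrently.

    Returns {name: result or the exception that was raised}.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(update_one, name): name for name in repositories}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc
    return results


def fake_update(name):
    """Stand-in for fetching and parsing one repository."""
    if name == "broken":
        raise RuntimeError("fetch failed")
    return len(name)


results = update_all(["freebsd", "pkgsrc", "broken"], fake_update)
print(results)
```

Threads suit the network-bound fetching; CPU-heavy parsing might be better served by a process pool.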

No-lonely mode for repository

Need a way to make specific repositories not produce lonely packages. This will allow incompatible repositories which have too many packages which do not match with other repos to still take part in comparison. Useful for OpenSUSE as long as it's based on binary package lists and non-unix repositories like F-Droid and Chocolatey which have too many non-portable projects.
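
As a rough illustration of the filtering this implies: a project is kept only if at least one repository outside the no-lonely set carries it. The function name `drop_lonely` and the data layout are assumptions made for this sketch.

```python
def drop_lonely(metapackages, no_lonely_repos):
    """Keep projects present in at least one repository outside no_lonely_repos.

    metapackages maps a project name to the set of repositories that carry it.
    """
    return {
        name: repos
        for name, repos in metapackages.items()
        if any(repo not in no_lonely_repos for repo in repos)
    }


metapackages = {
    "firefox": {"freebsd", "chocolatey"},
    "someapp": {"fdroid"},
    "zlib": {"debian"},
}
kept = drop_lonely(metapackages, {"chocolatey", "fdroid"})
print(sorted(kept))
```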

Add unit tests

Unit tests for at least rules processing and version comparison are done; now a complete test is needed, which includes parsing and processing fake repository data.

Badges support

For a named package, generate a badge listing the repositories it's present in, with version info

Integrate with vulnerability databases

Mark vulnerable package versions

The plan:

  • Implement harvesting CPE data from upstream repositories
    • ❌ GUIX contains cpe_name (is useless without vendor)
    • ❌ FreeBSD ports define CPE_VENDOR and CPE_PRODUCT, but these are not exposed in INDEX
    • Gentoo contains usable CPE metadata
  • Implement database storage for project → CPE relations
  • Implement fetching and parsing CPE data (https://nvd.nist.gov/vuln/data-feeds#JSON_FEED)
  • Implement setting vulnerable flag for affected packages
    • Match incoming packages against vulnerable version ranges in the database
    • Force project update on new CPE for it (by resetting its hash)
    • Bulk updating vulnerable status on all packages seemed viable at first, but it is not: we can't update binding tables properly this way to be able to do filtering based on the vulnerable property
  • Implement stub for handling patched vulnerability information from repositories (discussed in #1045)
  • Add vulnerable flag to binding tables to allow using it in project filtering
  • Integrate vulnerability updates into delta update process properly
    • During update, only run on incoming packages
    • When vulnerabilities are updated, queue affected projects in dedicated table, reset their hashes before pushing new packages
  • Add per-maintainer and per-repository vulnerable packages/projects counters
  • When all code is in place, force update for all projects with defined cpe_vendor/cpe_product
  • Add per-project vulnerability flag in order to be able to show "Vulns" project page conditionally
  • Add/ensure that vulnerability counts are saved in repository history, so we can plot graphs
  • Implement vulnerability based history events

Add support for AUR

How do I download a single-file AUR index (the one pacman uses?) or the whole AUR as a single repository? No idea for now. As a last resort, the AUR website may be scraped.

Add OpenSUSE support

Available for testing in newrepos branch. Uses binary package lists, so partially unsuitable for comparison, will be implemented as a shadow repo. Also contains too little info. Investigate a possibility of fetching complete data.

Improve pagination

To make pagination more usable, it needs to work with package names (e.g. aa..ak, ak..bc instead of 1, 2)
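
A small sketch of name-based page labels, assuming pages are labelled with their first and last full package names rather than shortened prefixes. The function name `name_pages` is made up.

```python
def name_pages(names, per_page):
    """Split sorted names into pages labelled by their first and last name."""
    ordered = sorted(names)
    pages = []
    for start in range(0, len(ordered), per_page):
        chunk = ordered[start:start + per_page]
        pages.append((f"{chunk[0]}..{chunk[-1]}", chunk))
    return pages


pages = name_pages(["zsh", "bash", "curl", "awk", "git"], per_page=2)
labels = [label for label, _ in pages]
print(labels)
```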

Support for removed packages

If a package was removed, it's probably problematic and there's little reason to bring it back. For FreeBSD, the MOVED file may be used.

Package merging rules

Packages are named differently across repos. Need rules to merge differently named packages into single entity. Need single package rules (extreme-tuxracer + extremetuxracer) as well as generic rules (FreeBSD: p5-Foo-Bar, Debian libfoo-bar-perl).
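
Both kinds of rules can be sketched as an ordered list of regular expression rewrites. This is only an illustration: the `perl:` prefix, the `RULES` table and the function `canonical_name` are assumptions for this sketch and do not reflect the actual rule format.

```python
import re

# Ordered (pattern, replacement) pairs; the first matching rule wins.
RULES = [
    (re.compile(r"^p5-(.+)$"), r"perl:\1"),        # generic: FreeBSD Perl modules
    (re.compile(r"^lib(.+)-perl$"), r"perl:\1"),   # generic: Debian Perl modules
    (re.compile(r"^extreme-tuxracer$"), "extremetuxracer"),  # single package
]


def canonical_name(name):
    """Map a repository-specific package name to a common one."""
    lowered = name.lower()
    for pattern, replacement in RULES:
        if pattern.match(lowered):
            return pattern.sub(replacement, lowered)
    return lowered


names = [canonical_name(n) for n in
         ["p5-Foo-Bar", "libfoo-bar-perl", "extreme-tuxracer", "zlib"]]
print(names)
```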

Rules TODO

  • apmod:perl
  • apmod:python
  • apmod:wsgi
  • apr (ignore on freebsd)
  • argus, argus-clients (-sasl on freebsd)
  • asterisk (merge 11, 13 on freebsd)
  • autodia (gentoo wtf)
  • bonnie (pkgsrc wtf)

Additional repository support ideas

Please share any ideas on what additional repositories we can support. A description of how to fetch all package data from a specific repository is preferred. Approved repositories with a determined fetching algorithm are split into separate bugs and eventually implemented.

Classic *nix package repositories

  • Any RPM repos
    • Fedora (see #36)
    • OpenSUSE (see #44)
    • AltLinux Sisyphus (see #24)
    • Fedora EPEL
  • slackware (#331)
  • homebrew (see #198)
  • DragonFlyBSD's dports (pretty much the same as FreeBSD ports minus some packages which don't build on DragonFly)
  • 💡 VectorLinux

From Fedora release-monitoring

Unsorted

Other platforms

  • NixOS
  • YACP (Yet another cygwin ports) (though project is somewhat inactive)
  • Rosa
  • GuixSD
  • 💡 OpenPandora
  • buckaroo 👍 json recipes, 👎 custom naming (fixed by some rules). Also looks dead already (true, 75% outdatedness).
  • MSYS2 (see #262)

Since these will contain too many irrelevant unique packages, doable as shadow repos:

  • Chocolatey (see #43)
  • F-Droid (only a handful of packages manually whitelisted, too much android garbage)

Upstream repos

Doable as shadow repos as well

  • CPAN (perl packages)
  • PyPi (python packages)
  • RubyGems
  • 💡 Others? (node, etc).
  • 💡 GitHub, for projects which fetch from there. Need feedback loop here (parse normal repos -> get github urls -> parse github urls for latest tags)

New version detectors

  • 💡 FreeBSD's portscout
  • 💡 pkgsrc's thingy
  • 💡 AllMyChanges.com just for completeness, usefulness is questionable: it's focused on iOS apps, and the stuff which intersects with *nix software is fairly outdated

...more ideas?

Allow multiple packages per repo

Currently, each metapackage only allows a single package per repo (e.g. only one package for FreeBSD). This leads to shadowing and information loss (when e.g. php55, php56, php70 are merged into a single metapackage). Allow multiple packages per repo, always take the highest version, but leave all information there.
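
A minimal sketch of that behaviour: every package is kept, and the highest version per repo is computed on top. The version comparison here is deliberately naive (dot-separated integers only) and is an assumption of this sketch; real-world versions need a proper comparison routine. The names `version_key` and `summarize` are hypothetical.

```python
def version_key(version):
    """Naive sort key: dot-separated integers only."""
    return tuple(int(part) for part in version.split("."))


def summarize(packages):
    """packages: list of (repo, name, version) tuples.

    Returns {repo: {"best": highest version, "packages": [(name, version), ...]}}.
    """
    by_repo = {}
    for repo, name, version in packages:
        by_repo.setdefault(repo, []).append((name, version))
    return {
        repo: {
            "best": max((v for _, v in items), key=version_key),
            "packages": items,
        }
        for repo, items in by_repo.items()
    }


summary = summarize([
    ("freebsd", "php55", "5.5.38"),
    ("freebsd", "php56", "5.6.40"),
    ("freebsd", "php70", "7.0.33"),
    ("debian", "php", "7.0.33"),
])
print(summary["freebsd"]["best"], len(summary["freebsd"]["packages"]))
```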

Script for prototype repology.org generation

Before proper dynamic backend is developed, let's just make a static site generator. Required features:

  • Paginated package lists
    • Plain (just packages)
    • Outdated for each repository
    • Absent for each repository
    • Per maintainer
    • Per category
  • Summary page for each package

Add Fedora packages

Fetchable via web: https://admin.fedoraproject.org/pkgdb/packages/

Available for testing in newrepos branch. Still TODO:

  • Improve fetching. Sequential fetching takes ~6 hours. Maybe do a parallel fetch.
  • Improve parsing. Need more clever .spec parser which will be useful for other RPM repos.
    • Add substitution support (Package: %foovar%)
    • Fix parsing of multiline fields (%description)
