repology / repology-updater

Repology backend service to update repository and package data

Home Page: https://repology.org

License: GNU General Public License v3.0

Python 98.82% Makefile 0.19% PLpgSQL 0.99%

packages repository version package

repology-updater's Introduction

Repology

Repology is a service which monitors a lot of package repositories and other sources and aggregates data on software package versions, reporting new releases and packaging problems.

This repository contains Repology updater code, a backend service which updates the repository information. See also the web application code.

Dependencies

Python 3.11+
Python module Jinja2
Python module libversion (also requires libversion C library)
Python module psycopg2
Python module pyyaml
Python module xxhash
Python module pydantic
PostgreSQL 15.0+
PostgreSQL extension libversion

Needed for fetching/parsing repository data:

Python module jsonslicer
Python module lxml
Python module protobuf
Python module pyparsing
Python module requests
Python module rpm (comes with RPM package manager)
Python module rubymarshal
Python module sqlite3 (part of Python, sometimes packaged separately)
Python module tomli
Python module yarl
Python module zstandard
git
rsync
subversion

Development dependencies

Optional, for doing HTML validation when running tests:

Python module pytidylib and tidy-html5 library

Optional, for checking schemas of configuration files:

Python module voluptuous

Optional, for python code linting:

Python module flake8
Python module flake8-builtins
Python module flake8-import-order
Python module flake8-quotes
Python module mypy

Running

Preparing

Since Repology rules live in a separate repository, you'll need to clone it first. The location may be arbitrary, but the default configuration file points to the rules.d subdirectory, so using it is the simplest way.

git clone https://github.com/repology/repology-rules.git rules.d

Configuration

First, you may need to tune settings which are shared by all repology utilities, such as the directory for storing downloaded repository state or the DSN (a string which specifies how to connect to the PostgreSQL database). See repology.conf.default for default values. To override them, create repology.conf in the same directory (don't edit repology.conf.default!), specify the path to an alternative config via the REPOLOGY_SETTINGS environment variable, or override settings via the command line.

By default, repology uses ./_state directory for storing raw and parsed repository data and repology/repology/repology database/user/password on localhost.

Creating the database

For the following steps you'll need to set up the database. Ensure PostgreSQL server is up and running, and execute the following commands to create the database for repology:

psql --username postgres -c "CREATE DATABASE repology"
psql --username postgres -c "CREATE USER repology WITH PASSWORD 'repology'"
psql --username postgres -c "GRANT ALL ON DATABASE repology TO repology"
psql --username postgres --dbname repology -c "GRANT CREATE ON SCHEMA public TO PUBLIC"
psql --username postgres --dbname repology -c "CREATE EXTENSION pg_trgm"
psql --username postgres --dbname repology -c "CREATE EXTENSION libversion"

If you want to change the credentials, don't forget to add the actual ones to repology.conf.

Populating the database

Note that you need more than 11GiB of disk space for Repology PostgreSQL database and additionally more than 11GiB space for raw and parsed repository data if you decide to run a complete update process.

Option 1: use dump

The fastest and simplest way to fill the database is to use a database dump of the main Repology instance:

curl -s https://dumps.repology.org/repology-database-dump-latest.sql.zst | unzstd | psql -U repology

Option 2: complete update

Another option would be to go through complete update process which includes fetching and parsing all repository data from scratch and pushing it to the database.

First, init the database schema:

./repology-update.py --initdb

Note that this command drops all existing data in Repology database, if any. You only need to run this command once.

Next, run the update process:

./repology-update.py --fetch --fetch --parse --database --postupdate

Expect it to take several hours the first time; subsequent updates will be faster. You can use the same command to update. Brief explanation of the options used here:

--fetch tells the utility to fetch raw repository data (download files, scrape websites, clone git repos) into state directory. Note that it needs to be specified twice to allow updating.
--parse enables parsing downloaded data into internal format which is also saved into state directory.
--database pushes processed package data into the database.
--postupdate runs optional cleanup tasks.
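
As a small illustration of the options above, the following sketch assembles the argument list for one update run. The helper name `build_update_command` and its `allow_update` parameter are hypothetical and not part of repology; only the script name and the flags come from the text above.

```python
def build_update_command(allow_update=True, script="./repology-update.py"):
    """Assemble the argument list for one update run (hypothetical helper)."""
    args = [script, "--fetch"]
    if allow_update:
        # Per the note above, --fetch is given twice to allow updating.
        args.append("--fetch")
    args += ["--parse", "--database", "--postupdate"]
    return args


if __name__ == "__main__":
    # Print the command instead of running it; it could be passed to
    # subprocess.run() from the repository checkout.
    print(" ".join(build_update_command()))
```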

Documentation

How to extend or fix rules for package matching
How repology compares versions

Author

Dmitry Marakasov

License

GPLv3 or later, see COPYING.

repology-updater's Issues

Add chocolatey support

Fetcher is rather trivial: get https://chocolatey.org/api/v2/Packages()?$filter=IsLatestVersion, parse XML, get next page from <link rel="next" href="">. All package info is available in XML, including name, version, tags, comment, author, www.
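
A minimal sketch of the paging loop described above, assuming the feed is an Atom document whose entries carry the package name in `<title>` and the version in an OData `Version` property. The namespace URIs, the element layout and the `fetch` callable are assumptions made for illustration; the real feed should be checked before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces (Atom plus OData v2 data services).
ATOM = "{http://www.w3.org/2005/Atom}"
DATA = "{http://schemas.microsoft.com/ado/2007/08/dataservices}"
META = "{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}"


def parse_page(xml_text):
    """Return (packages, next_url) for one page of the feed."""
    root = ET.fromstring(xml_text)
    packages = []
    for entry in root.iter(ATOM + "entry"):
        props = entry.find(META + "properties")
        version = props.findtext(DATA + "Version") if props is not None else None
        packages.append({"name": entry.findtext(ATOM + "title"), "version": version})
    # The next page, if any, is advertised by a top-level <link rel="next">.
    next_url = None
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "next":
            next_url = link.get("href")
    return packages, next_url


def iter_packages(first_url, fetch):
    """Follow rel="next" links; fetch(url) must return the page text."""
    url = first_url
    while url is not None:
        packages, url = parse_page(fetch(url))
        yield from packages
```

Passing `fetch` in as a parameter keeps the paging logic independent of the HTTP client, so it can be exercised with canned pages.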

Split repo fetching and parsing

These are not tied together. Fetching is more generic and may be common to multiple parsing techniques. Currently, there are just 2 types of fetchers:

plain file [with options to gunzip or bunzip it after downloading]
git repository
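
One way the split could look, as a sketch only: a fetcher returns raw bytes, and a parser is any function that turns those bytes into (name, version) pairs. The class and function names here are hypothetical and do not reflect the actual repology code.

```python
import gzip


class FileFetcher:
    """Plain-file fetcher with an optional gunzip step (hypothetical)."""

    def __init__(self, download, gunzip=False):
        self._download = download  # zero-argument callable returning bytes
        self._gunzip = gunzip

    def fetch(self):
        data = self._download()
        return gzip.decompress(data) if self._gunzip else data


def parse_name_version_lines(raw):
    """Toy parser: one 'name version' pair per non-empty line."""
    for line in raw.decode().splitlines():
        if line.strip():
            name, version = line.split()
            yield name, version


def update(fetcher, parser):
    """Any fetcher can be combined with any parser."""
    return list(parser(fetcher.fetch()))
```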

Incomplete support flag for repositories

E.g. draw repository names in the table header in gray and display a tooltip, to show users that package data for a specific repository is incomplete.

Detect dates in versions

E.g. 20[01][0-9]\.?[01][0-9]\.?[0123][0-9]

What to do with this:

Convert to a single format (see abcmidi package problem) to fix comparison
If all items contain a date, compare the dates instead (other parts are likely uninformative, e.g. 0.0.20160916 vs. git20160916 vs. 2016.09.16)
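
A minimal sketch of both ideas, using the pattern given above with capture groups added. The function names are hypothetical; the sketch only normalizes dates to `YYYYMMDD` and does not attempt full version comparison.

```python
import re

# The pattern from the issue text, with groups for year, month and day.
DATE_RE = re.compile(r"(20[01][0-9])\.?([01][0-9])\.?([0123][0-9])")


def extract_date(version):
    """Return the embedded date as YYYYMMDD, or None if there is none."""
    match = DATE_RE.search(version)
    return "".join(match.groups()) if match else None


def date_keys(versions):
    """Return normalized dates if every version contains one, else None."""
    dates = [extract_date(v) for v in versions]
    return dates if all(d is not None for d in dates) else None
```

With this, 0.0.20160916, git20160916 and 2016.09.16 all reduce to the same key, while a set containing a version without a date falls back to ordinary comparison.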

Add flag to disable shadows

Useful for testing

Ability to ignore version in comparison

E.g. debian uses fake versions for perl modules: these should be ignored not to clobber other distros.

Website functionality ideas

Browse paginated package lists A...Z
Browse by category
Browse by maintainer
Browse outdated packages per repository
Browse by packages (packages with best support across distros)
Compare repositories
Compare maintainers
Suggest maintainer with same interests

Don't leave partial state after failed fetch

Failing to fetch single repository shouldn't interrupt the whole fetch, it should only stop single repository from updating. Also, failure to fetch most repositories should remove their state, so incomplete state is not parsed.
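
A sketch of one possible approach, assuming each repository's state lives in its own directory: fetch into a temporary directory and move it into place only on success. The `fetch_all` helper and the layout are hypothetical. Note one design choice that differs slightly from the text: on failure the partial download is discarded, but a previous complete state, if any, is left in place.

```python
import os
import shutil
import tempfile


def fetch_all(repos, state_root):
    """repos maps a name to a callable(dest_dir). Returns failed names."""
    failed = []
    os.makedirs(state_root, exist_ok=True)
    for name, fetch in repos.items():
        final = os.path.join(state_root, name)
        tmp = tempfile.mkdtemp(prefix=name + ".", dir=state_root)
        try:
            fetch(tmp)
        except Exception:
            # Discard the partial state and continue with other repositories.
            shutil.rmtree(tmp, ignore_errors=True)
            failed.append(name)
            continue
        # Swap in the new state. This is not atomic between the two calls;
        # a production version would need a more careful swap.
        if os.path.isdir(final):
            shutil.rmtree(final)
        os.rename(tmp, final)
    return failed
```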

Gentoo needs more package splits

Need to be split: btf (dev-java and sci-libs)

Actually, a lot more:

find gentoo.git -type d -maxdepth 2 -mindepth 2 |
   egrep -v 'dev-(perl|python|haskell)' |
   awk -F/ '{print $NF}' |
   sort | uniq -d

ace acl ada amap analog apel atlas attica auctex baloo balsa barcode bbdb bfm binclock bluedevil bson btf build c-support calc calendar cdcover cdrtools charm checkpassword coffee-script color crystal csv daemontools dash dictionary dirdiff docker dolphin ebuild-mode ecb eject elib emacs ess exo fam fcgi ffmpeg fuse gambit gdl git glade glu gnupg gnuplot gom gpgme grip gsasl haskell-mode highline icecream igrep info jack jal jama jde jpeg json kactivities kde-gtk-config kdeplasma-addons kdesu kfilemetadata kglobalaccel khotkeys kinfocenter kmenuedit knewstuff krunner kscreen kstart ksysguard kwin kwrited languagetool launchy lemon libelf libffi libgudev libiconv libintl libkscreen libnet libusb locale lookup lzma magic mailcrypt mailx man mars mash mavros mc mediawiki mew milou mime-types mldonkey mmix mmm-mode modutils mongo mpack mpc msgpack muse mysql nagios nemesis ninja nitrogen notification-daemon nut ocaml openmsx otter pam par pcl pdv picard pkgconfig planner plasma-mediacenter plasma-nm plasma-workspace pmake pms polkit-kde-agent polyglot powerdevil psgml psi python-mode rails re2 redis reduce riece ruby rubygems screen session shadow signify silo ski skkserv slim slurm smack sml-mode snappy spice spin splat splice sqlite3 ssh surf systemsettings szip teco texinfo tf time tokyocabinet tornado tree uclibc udev vc vm w3m xclip xslide xsp yacc zenburn zenirc
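
For reference, a rough Python equivalent of the shell pipeline above. It is a sketch that assumes the usual category/package directory layout and applies the exclusion pattern to the category name only, which may differ slightly from the `egrep -v` on the full path.

```python
import os
import re
from collections import Counter

# Same exclusion as the egrep -v in the pipeline above.
EXCLUDE_RE = re.compile(r"dev-(perl|python|haskell)")


def duplicated_package_names(tree_root):
    """Return package names that appear in more than one category."""
    counts = Counter()
    for category in os.listdir(tree_root):
        category_path = os.path.join(tree_root, category)
        if not os.path.isdir(category_path) or EXCLUDE_RE.search(category):
            continue
        for package in os.listdir(category_path):
            if os.path.isdir(os.path.join(category_path, package)):
                counts[package] += 1
    return sorted(name for name, count in counts.items() if count > 1)
```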

Add sisyphus support

Code is there, need an easy way to download all .spec files. Todo: contact sisyphus guys

Extend version ignores

Ignore only newer (single version bad)
Ignore always (versioning schema totally broken)

Improve pkgsrc support

Currently we use list of packages instead of parsing pkgsrc, because the latter process (make index) is too slow. Not sure what we can do here now though.

Statistics page

Split package mixer into separate class

Add pagination to maintainers page

The list is too long otherwise

Track package state changes

Track package state changes (such as version updates, new packages). Provide RSS streams of such events, display icons for recently updated/added packages.

Parallel downloading/parsing

Since repositories are independent, it should be easy to fetch and parse them in parallel. This will be really useful for slow repositories such as pkgsrc (index generation) and Fedora (slow fetching).
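
A minimal sketch of the idea using a thread pool from the standard library. The `update_in_parallel` helper is hypothetical, and each job stands for the fetch-and-parse step of one repository; a failure in one job is recorded without stopping the others.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def update_in_parallel(repos, max_workers=4):
    """repos maps a name to a zero-argument callable.

    Returns (results, errors), both keyed by repository name.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(job): name for name, job in repos.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = exc
    return results, errors
```

Threads suit the fetching side, which is mostly waiting on the network; CPU-heavy parsing might benefit from a process pool instead.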

No-lonely mode for repository

Need a way to make specific repositories not produce lonely packages. This will allow incompatible repositories which have too many packages which do not match with other repos to still take part in comparison. Useful for OpenSUSE as long as it's based on binary package lists and non-unix repositories like F-Droid and Chocolatey which have too many non-portable projects.

Make fetchers/parsers output logs

Instead of polluting stderr

Split repo fetching/analysis and report generation into separate utils

Multiple levels of outdatedness

E.g. darker red color

Add unit tests

~~At least for rules processing and version comparison~~ done, now complete unit test is needed, which includes parsing and processing fake repository data.

Add info on patches

Some repos include patches, it would be nice to report them upstream.

Badges support

For named package, generate a badge with repositories it's present in with version info

Integrate with vulnerability databases

Mark vulnerable package versions

The plan:

Add logo/favicon

Package search

This is of course planned with proper backend.

Add ability to ignore whole repository

Add support for AUR

How do I download a single-file AUR index (the one pacman uses?) or the whole AUR as a single repository? No idea for now. As a last resort, the AUR website may be scraped.

For Debian, count uploaders as maintainers too

Add OpenSUSE support

Available for testing in newrepos branch. Uses binary package lists, so partially unsuitable for comparison, will be implemented as a shadow repo. Also contains too little info. Investigate a possibility of fetching complete data.

Improve pagination

To make pagination more usable, it needs to work with package names (e.g. aa..ak, ak..bc instead of 1, 2)
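
A small sketch of name-based page labels, assuming pages are fixed-size slices of the sorted package list and each label is built from the first two characters of the first and last name on the page. The function name and parameters are hypothetical.

```python
def paginate_by_name(names, page_size, prefix_len=2):
    """Split sorted names into pages labelled like 'aa..ak'."""
    ordered = sorted(names)
    pages = []
    for start in range(0, len(ordered), page_size):
        chunk = ordered[start:start + page_size]
        label = f"{chunk[0][:prefix_len]}..{chunk[-1][:prefix_len]}"
        pages.append((label, chunk))
    return pages
```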

Support for removed packages

If a package was removed, it's probably problematic and there's little reason to bring it back. For FreeBSD, may use the MOVED file

Allow version matching and transformations

ver, verpat and setver

Develop proper dynamic web backend

Add links to packages

Make versions link to packages in specific repos (or VCS pages)

Allow multiple rules to be applied

Keep transformed name in the process, pass it from rule to rule; pass terminates (rename to stop)
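
A sketch of such a rule chain, in which the transformed name is passed from rule to rule and a rule may end processing. The rule keys `pattern`, `setname` and `stop` are hypothetical names chosen for illustration, not the actual rule format.

```python
import re


def apply_rules(name, rules):
    """Apply every matching rule in order, carrying the name along."""
    for rule in rules:
        match = re.fullmatch(rule["pattern"], name)
        if match is None:
            continue
        if "setname" in rule:
            # The template may reference groups of the pattern, e.g. \1.
            name = match.expand(rule["setname"])
        if rule.get("stop"):
            break
    return name
```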

Add set of pages with outdated packages per repo

This depends on some refactoring though

Package merging rules

Packages are named differently across repos. Need rules to merge differently named packages into single entity. Need single package rules (extreme-tuxracer + extremetuxracer) as well as generic rules (FreeBSD: p5-Foo-Bar, Debian libfoo-bar-perl).
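
A minimal sketch of both rule kinds, using the examples from the text. The rule tables, the `perl:` target prefix and the lowercasing step are assumptions made for illustration; the real rules live in the separate rules repository.

```python
import re
from collections import defaultdict

# Single-package rules: one specific name mapped to another.
SINGLE_RULES = {"extreme-tuxracer": "extremetuxracer"}

# Generic per-repository rules: (pattern, replacement template).
GENERIC_RULES = {
    "freebsd": [(re.compile(r"p5-(.+)"), r"perl:\1")],
    "debian": [(re.compile(r"lib(.+)-perl"), r"perl:\1")],
}


def effective_name(repo, name):
    """Map a repository-specific package name to a common name."""
    name = name.lower()
    name = SINGLE_RULES.get(name, name)
    for pattern, template in GENERIC_RULES.get(repo, []):
        match = pattern.fullmatch(name)
        if match:
            return match.expand(template)
    return name


def merge(packages):
    """Group (repo, name) pairs under their common name."""
    merged = defaultdict(list)
    for repo, name in packages:
        merged[effective_name(repo, name)].append((repo, name))
    return dict(merged)
```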

Rules TODO

apmod:perl
apmod:python
apmod:wsgi
apr (ignore on freebsd)
argus, argus-clients (-sasl on freebsd)
asterisk (merge 11, 13 on freebsd)
autodia (gentoo wtf)
bonnie (pkgsrc wtf)

Additional repository support ideas

Please share any ideas on what additional repositories we can support. A description of how to fetch all package data from a specific repository is preferred. Approved repositories with a determined fetching algorithm are split into separate bugs and eventually implemented.

Classic *nix package repositories

Any RPM repos
- ✅ ~~Fedora~~ (see #36)
- ✅ ~~OpenSUSE~~ (see #44)
- ✅ ~~AltLinux Sisyphus~~ (see #24)
- ✅ ~~Fedora EPEL~~
✅ ~~slackware~~ (#331)
✅ ~~homebrew~~ (see #198)
✅ ~~DragonFlyBSD's dports (pretty much the same as FreeBSD ports minus some packages which don't build on DragonFly)~~
💡 VectorLinux

From Fedora release-monitoring

✅ ~~Alpine linux~~
✅ ~~CRUX whatever that is~~

Unsorted

✅ ~~SliTaz~~ (#664)
💡 NuTyX (only shell based pkgbuilds are available)
❌ ~~Sabotage~~ (do not use versions)
✅ ~~salix~~
✅ ~~solus (https://git.solus-project.com/, https://packages.solus-project.com/unstable/)~~ (#341)

Other platforms

✅ ~~NixOS~~
✅ ~~YACP (Yet another cygwin ports)~~ (though project is somewhat inactive)
✅ ~~Rosa~~
✅ ~~GuixSD~~
💡 OpenPandora
✅ ~~buckaroo~~ 👍 json recipes, 👎 custom naming (fixed by some rules). Also looks dead already (true, 75% outdatedness).
✅ ~~MSYS2~~ (see #262 )

Since these will contain too many irrelevant unique packages, doable as shadow repos:

✅ ~~Chocolatey~~ (see #43)
✅ ~~F-Droid~~ (only a handful of packages manually whitelisted, too much android garbage)

Upstream repos

Doable as shadow repos as well

✅ ~~CPAN (perl packages)~~
✅ ~~PyPi (python packages)~~
✅ ~~RubyGems~~
💡 Others? (node, etc).
💡 GitHub, for projects which fetch from there. Need feedback loop here (parse normal repos -> get github urls -> parse github urls for latest tags)

New version detectors

💡 FreeBSD's portscout
💡 pkgsrc's thingy
💡 AllMyChanges.com just for completeness, usefulness is questionable: it's focused on iOS apps, and the stuff which intersects with *nix software is fairly outdated

...more ideas?

Allow multiple packages per repo

Currently, each metapackage only allows a single package per repo (e.g. only one package for FreeBSD). This leads to shadowing and information loss (when e.g. php55, php56, php70 are merged into a single metapackage). Allow multiple packages per repo, always take the highest version, but leave all information there.
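
A sketch of the proposed behaviour: keep every package of a repository and report the highest version per repository. The `version_key` function is a deliberately naive stand-in that assumes purely numeric dotted versions; real comparison would go through the project's version comparison logic. All names here are hypothetical.

```python
def version_key(version):
    """Naive stand-in: assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def summarize(packages):
    """packages: iterable of (repo, name, version) within one metapackage."""
    by_repo = {}
    for repo, name, version in packages:
        by_repo.setdefault(repo, []).append((name, version))
    return {
        repo: {
            # The highest version represents the repository...
            "best": max((version for _, version in entries), key=version_key),
            # ...but every package stays available.
            "packages": entries,
        }
        for repo, entries in by_repo.items()
    }
```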

Script for prototype repology.org generation

Before proper dynamic backend is developed, let's just make a static site generator. Required features:

Paginated package lists
- Plain (just packages)
- Outdated for each repository
- Absent for each repository
- Per maintainer
- Per category
Summary page for each package

Add Fedora packages

Fetchable via web: https://admin.fedoraproject.org/pkgdb/packages/

Available for testing in newrepos branch. Still TODO:

Improve fetching. Sequential fetching takes ~6 hours. Maybe do a parallel fetch.
Improve parsing. Need more clever .spec parser which will be useful for other RPM repos.
- Add substitution support (Package: %foovar%)
- Fix parsing of multiline fields (%description)

Debian JavaScript Maintainers <[email protected]>
Debian Javascript Maintainers <[email protected]>