rgeissert / http-redirector
Debian mirrors HTTP redirector
Home Page: http://httpredir.debian.org/
Most likely displaying some details, but an overall map would be a start. One where the visitor's location is also included.
As a first step, the checker should not need to have any mirror type-specific knowledge. At present, the code should be redundant enough in those areas to make it easy to abstract those parts. The mirror type-specific stuff could go into Mirror::Type::foo.
It may happen that a subset is switched to a new serial even when there are no IPv6 mirrors that are up to date. This leads to v6 requests being redirected to another subset until a v6-enabled mirror is up to date and joins the subset.
The subsets should depend on the IP family, therefore allowing v6 mirrors to be older than their v4 counterpart.
Would it be possible to include additional location showing where the redirector (geoip) believes the query is coming from?
For example:
IP: xxx.xxx.xxx.xx
AS: xxxx
Continent: NA
Country: xx
State: xxxxx
City: xxxxx
At present, the field in the master list is blindly trusted. There should at least be a one-time check to make sure IPv6 connectivity does work.
The v4 AS peering database can't be used for v6 clients, even if ticket #11 was fixed. Each IP version needs its own database.
Pre-wheezy's versions of APT abort on header lines longer than 360 characters, inclusive.
Whenever support for RFC6249 is enabled for GET requests, either the number of links should be:
As soon as ftpsync switches from "ALL but ($archs)" to "NONE but ($archs)", whenever a new architecture is included by a mirror, the redirector won't make use of it.
It would become a pain having to rebuild the database very often just to add a new architecture to a mirror.
It appears that in some cases the GeoLiteCity.dat database, incorrectly, says a given IP is in one country, yet the more generic GeoIP.dat database does indicate the correct country. Ideally, with an AS match, or AS-peer match this shouldn't be an issue, but we are not there yet, and it has happened.
Turn the redirector from a CGI into an application server using plack, to allow it to be run as a fastcgi or mod_perl application.
This requires a few changes as to the way the output is handled, but the main changes are related to the databases, and how they would need to be reloaded.
Some mirrors have good connectivity within their country but have poor connectivity to the outside world. Such kinds of restrictions (country-based, AS-based, etc) should be allowed to be taken into consideration by the redirector.
This kind of restriction should probably be added as another field in the master list. E.g.
Restricted-to: "AS" | "country" | "subnet"
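A minimal sketch of how such a restriction could be honoured when picking candidates, written in Python for illustration only (the redirector itself is Perl). The `Mirror` and `Client` records and the `may_serve` helper are hypothetical names, and the "subnet" case is left out:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mirror:
    host: str
    country: str
    asn: int
    # Value of the proposed Restricted-to field: "AS", "country" or None.
    restricted_to: Optional[str] = None


@dataclass
class Client:
    country: str
    asn: int


def may_serve(mirror: Mirror, client: Client) -> bool:
    """Return True if the mirror's restriction allows serving this client."""
    if mirror.restricted_to is None:
        return True
    if mirror.restricted_to == "AS":
        return mirror.asn == client.asn
    if mirror.restricted_to == "country":
        return mirror.country == client.country
    # Unknown or unsupported restriction (e.g. "subnet" in this sketch):
    # err on the side of not using the mirror.
    return False
```

Filtering the candidate list with `may_serve` keeps a country-restricted mirror away from clients abroad while leaving unrestricted mirrors untouched.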
I'm currently getting redirected to an out-of-date mirror. Since the redirector is supposed to detect these, I'm filing this as a bug.
% sudo apt-get update
Get:1 http://http.debian.net sid InRelease [268 kB]
E: Release file for http://http.debian.net/debian/dists/sid/InRelease is expired (invalid since 1d 20h 13min 24s). Updates for this repository will not be applied.
% wget http://http.debian.net/debian/dists/sid/InRelease
--2013-05-13 00:29:01-- http://http.debian.net/debian/dists/sid/InRelease
Resolving http.debian.net (http.debian.net)... 46.4.205.43, 2a01:4f8:131:152b::42
Connecting to http.debian.net (http.debian.net)|46.4.205.43|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://debian.utalca.cl/debian/dists/sid/InRelease [following]
--2013-05-13 00:29:01-- http://debian.utalca.cl/debian/dists/sid/InRelease
Resolving debian.utalca.cl (debian.utalca.cl)... 190.110.100.3
Connecting to debian.utalca.cl (debian.utalca.cl)|190.110.100.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 268429 (262K) [text/plain]
Saving to: ‘InRelease’
100%[======================================>] 268,429 748KB/s in 0.4s
2013-05-13 00:29:01 (748 KB/s) - ‘InRelease’ saved [268429/268429]
Whether they are disabled, and the reasons why; it should all be displayed in some page so that people can easily check why a mirror is not used.
The feature introduced in 15ef32c doesn't check if the local mirror matches the $mirror_type
Not sure if it should auto-add architectures listed in the trace file but not in the db, or only use the list to know for a fact that an architecture isn't mirrored.
update.pl currently removes "unused fields" (i.e. blacklisting), while it should only allow fields that are known to be used (i.e. whitelisting).
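The difference between the two approaches can be sketched as follows, in Python for illustration. The field names in `KNOWN_FIELDS` are placeholders, not the list update.pl actually needs:

```python
# Hypothetical set of fields the redirector is known to use.
KNOWN_FIELDS = {"Site", "Country", "Archive-http", "Restricted-to"}


def keep_known_fields(entry: dict) -> dict:
    """Whitelisting: keep only the fields that are known to be used."""
    return {name: value for name, value in entry.items() if name in KNOWN_FIELDS}
```

With whitelisting, a field newly added to the master list is dropped until it is explicitly allowed, instead of slipping through because nobody listed it as unused.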
Hi,
iquartile() does a lot of pushing, popping, reimplementing ceil() and whatnot. It can be written much simpler as:
sub iquartile(@) {
my @elems = @_;
my ($lower,
$lower = POSIX::ceil($lower);
$upper = POSIX::ceil($upper);
return @elems[$lower..$upper];
}
Some mirrors don't even perform two-stage updates, leading to inconsistent mirror views. Some code should be added to automatically detect and disable them.
For instance, in North America, if the mirror(s) in El Salvador are disabled, it should prefer mirrors in the US over those in Mexico, given that most international links go all the way to the US and then to the destination country.
If a mirror runs a redirector on top of its real copy, the checker might be redirected to a different mirror when the trace file is requested. This will result in the mirror being disabled.
If the checker is aware of mirrors that run the redirector, it could download the file from serve/ instead and correctly monitor the mirror.
This most likely requires a new field in the master list.
Hi,
The distance calculation assumes the Earth is a flat rectangle that doesn't wrap, which is obviously not the case. For small distances, the “flat rectangle” assumption is not so bad, but not properly wrapping, be it at the zero-meridian or between -180/+180 (I don't know offhand which one is used), can give completely bogus results.
You probably want http://en.wikipedia.org/wiki/Haversine_formula instead.
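For reference, a small Python sketch of the haversine great-circle distance. The function name and the mean Earth radius constant are choices made for this example, not something taken from the redirector's code:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

Because only the sine of half the longitude difference is used, two points at 179 and -179 degrees of longitude come out two degrees apart rather than 358, so the wrap-around problem goes away.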
Given that the checker may use an 'incoming' database, the translate-log may produce incorrect output or even fail due to the mismatch of mirror ids.
Since the only input to the translate-log script is the checker's output, the latter should probably tell the former the name of the mirrors database it is actually reading from.
Commits 1e33f17 and 306b6d3 (part of issue #3) introduced support for the Restricted-to field in the master list. However, due to check.pl's per-continent age check and the way the restriction was implemented (by simply not adding the mirrors to some indexes), they are not checked for freshness.
This could lead to inconsistencies that would only be noticed in the scope of the mirror's restriction (AS, country, etc.)
Even after creating a geo location-based subset of the population of mirrors that may serve a file, the number of alternatives may go from one to over 15 based on some real data-backed tests.
Whenever APT starts handling external redirections better, this will cause issues. A great diversity of mirrors would only be beneficial if there's at least one file that needs to be downloaded from each, ideally more than just one.
The redirector should limit the candidates to about 5 and see how it works.
Given certain conditions, it is very well possible for a subset to switch back to an older master stamp, which may lead to a temporary inconsistent view of the indexes and missing files.
Similarly to issue #7, if one redirector redirects the request to a mirror that has an instance of the redirector, the mirror's instance may redirect the user away one more time. This could lead to redirection loops.
If the redirector is aware of a mirror running an instance, it could bypass the mirror's instance by redirecting the request to serve/.
This change could re-use the new field in the master list mentioned in issue #7.
Some geographically dispersed ASes have more than one mirror, and it would be convenient to redirect requests only to the local mirror. In general the use of geo location should address this kind of issue, but it is not uncommon for the free database to lack accuracy.
To reduce network traffic (and possibly skip some checks), the trace files downloaded from the mirrors should be stored. Whenever there's a local copy of a trace file, the GET request should include an If-Modified-Since.
LWP::UserAgent has a mirror method that might do the trick.
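The conditional request itself is simple; here is a Python sketch of the idea, for illustration only. The `conditional_request` helper and the paths passed to it are hypothetical, and the stored copy's modification time is assumed to reflect when the trace was last fetched:

```python
import email.utils
import os
import urllib.request


def conditional_request(url, local_path):
    """Build a GET request that carries If-Modified-Since when a stored
    copy of the trace file already exists at local_path."""
    req = urllib.request.Request(url)
    if os.path.exists(local_path):
        mtime = os.path.getmtime(local_path)
        # HTTP dates are always expressed in GMT.
        req.add_header("If-Modified-Since",
                       email.utils.formatdate(mtime, usegmt=True))
    return req
```

When the request is sent with `urllib.request.urlopen`, a 304 response is reported as an `HTTPError` with code 304, which the caller would treat as "the stored copy is still current".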
For requests for /dists/$dist/$comp/binary-$arch/Packages.gz, given that mirrors only differ on the set of architectures they include, it should be possible to include the depth parameter, as specified in RFC6249.
E.g. a request for /dists/sid/main/binary-armel/Packages.gz would include a
Link: <http://mirror.tld/path/to/dists/sid/main/binary-armel/Packages.gz>; rel=duplicate; depth=1
Another example, a request for /dists/sid/main/binary-armel/Packages.diff/Index:
Link: <http://mirror.tld/path/to/dists/sid/main/binary-armel/Packages.diff/Index>; rel=duplicate; depth=2
Similarly, this could be done for Contents-$arch.diff/, Translation-$lang.diff/ (or directly to i18n/, but would reduce parallelism for clients that do not actually use the alternative download locations), /tools/, and /project/.
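The depth values in the examples above follow from counting how many path components lie below the directory that is assumed to be identical across mirrors. A Python sketch, where `duplicate_link` and its `shared_dir` argument are names made up for this example:

```python
def duplicate_link(mirror_base: str, path: str, shared_dir: str) -> str:
    """Build a Link header value with an RFC 6249 depth parameter.

    shared_dir is the directory (ending in '/') whose whole content is
    assumed to be the same on every mirror that carries it.
    """
    if not path.startswith(shared_dir):
        raise ValueError("path is not below the shared directory")
    remainder = path[len(shared_dir):]
    # depth=1 covers the file's own directory; each extra '/' adds a level.
    depth = remainder.count("/") + 1
    return f"<{mirror_base}{path}>; rel=duplicate; depth={depth}"
```

With `shared_dir` set to `dists/sid/main/binary-armel/`, a request for Packages.gz yields depth=1 and a request for Packages.diff/Index yields depth=2, matching the two headers shown.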
If a mirror attempts to sync more than one time between mirror pushes, the site's trace would be even more recent than it previously was. This doesn't affect the code's assumption that if site > master it is fully synchronised.
Unless, the following sequence occurs:
The whole netblock used by tunnelbroker is currently detected as located in the US.
As for what can be done, the whois records should be more or less correct at least regarding country information.
There are currently some requests that are forced to be served by mirrors that use ftpsync. This is to guarantee that there will not be inconsistencies caused by some index files being synchronised too early.
Since there are quite a few mirrors that don't use ftpsync, and some of them do keep up with the recommended rsync settings and other changes introduced in ftpsync, it would be ideal not to treat them as second-class citizens.
Continuously performing all sorts of checks to determine if they don't follow the recommendations is doomed to fail. Perhaps a new field could be introduced in the trace file that states which "features"/changes they have been updated to.
For instance, mirrors that correctly sync the InRelease file in the second stage could include:
Revision: InRelease
The absence of the field would indicate that such mirror should not be used to serve InRelease files.
Similarly, for the translation files issues:
Revision: i18n
Whether or not this field should be included in ftpsync-generated trace files should be considered. For consistency, it probably should.
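On the checker side, reading the proposed field could look roughly like this Python sketch. It assumes one or more "Revision:" lines in the trace file, each carrying whitespace-separated feature names; that layout is an assumption, since the field does not exist yet:

```python
def trace_revisions(trace_text: str) -> set:
    """Collect the feature names announced in Revision: lines of a trace file."""
    revisions = set()
    for line in trace_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "Revision":
            revisions.update(value.split())
    return revisions


def may_serve_inrelease(trace_text: str) -> bool:
    """A mirror without 'Revision: InRelease' is not used for InRelease files."""
    return "InRelease" in trace_revisions(trace_text)
```

A trace without any Revision line yields an empty set, so such mirrors are excluded from InRelease and i18n requests by default, as described above.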
This may occur, for example, when multiple mirrors that only include a few architectures and are updated from a mirror in another subset finish updating before other mirrors that include more architectures are also done updating. In this scenario, the subset may end up having no mirror that can satisfy requests for some architectures.
The rules for promoting a master stamp should take into account that at least files for the most popular architectures can be served from within the subset.
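One way to express such a rule, sketched in Python. The set of "popular" architectures and the `can_promote` helper are placeholders for illustration:

```python
# Hypothetical choice of architectures that must remain servable.
POPULAR_ARCHS = frozenset({"amd64", "i386"})


def can_promote(updated_mirror_archs, required=POPULAR_ARCHS):
    """Decide whether a subset may switch to a new master stamp.

    updated_mirror_archs holds one set of architectures per mirror in the
    subset that already carries the new stamp.
    """
    covered = set().union(*list(updated_mirror_archs))
    return set(required) <= covered
```

In the scenario described above, the first mirrors to finish would only cover a few architectures, so `can_promote` returns False until a mirror carrying the popular ones has caught up.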
As per discussion with Md, the feature introduced in a7265c9 should not be applied to mirrors which are named directly in the input. That is, if the mirror's host name is given instead of their ASN, the limit should not be applied as it might be intentional.
It doesn't make any sense and it could cause issues due to the depth parameter.
There are very few times where a geo lookup fails because there's no entry for a given IP subnet. In such cases, the code ends with a 501 Not Implemented. It should probably send the client to some well-connected mirror in the US or the EU, making it configurable of course.
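A configurable fallback could be as small as the following Python sketch. The mirror URLs, the shape of the geo record and the `candidates_for` helper are all invented for the example:

```python
# Hypothetical, configurable list of well-connected mirrors.
FALLBACK_MIRRORS = [
    "http://mirror-us.example/debian/",
    "http://mirror-eu.example/debian/",
]


def candidates_for(geo_record, by_country, fallback=FALLBACK_MIRRORS):
    """Return candidate mirrors, using the fallback when the geo lookup
    returned nothing for the client's address."""
    if geo_record is None:
        return list(fallback)
    return by_country.get(geo_record["country"], list(fallback))
```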
The current implementation for http.debian.net sends a db dump over ssh to a script on the remote server that validates it to avoid perl code and then imports it.
The validation could probably be omitted if the database was created using nstore, potentially making it compatible with the remote instance. Testing is needed to guarantee compatibility between versions of Storable. If a "network order" database were to be used, performance testing would also be needed.
Many mirrors don't have a site trace, or at least it doesn't match their hostname. After a certain number of failures, the checking of those mirrors should be rate-limited: it is a waste of time to check them every X minutes when they are usually not going to be fixed.
Ditto for mirrors that go down, are very out of date, and the like.
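One possible rate-limiting policy is exponential back-off after a few consecutive failures, sketched here in Python. The interval values and the failure threshold are made-up numbers, not settings the checker has:

```python
BASE_INTERVAL = 15 * 60        # seconds between checks of a healthy mirror
MAX_INTERVAL = 24 * 60 * 60    # never wait longer than a day
FAILURES_BEFORE_BACKOFF = 3    # tolerate a few failures before slowing down


def next_check_delay(consecutive_failures: int) -> int:
    """Seconds to wait before checking a mirror again."""
    if consecutive_failures < FAILURES_BEFORE_BACKOFF:
        return BASE_INTERVAL
    exponent = consecutive_failures - FAILURES_BEFORE_BACKOFF + 1
    # Double the interval for each further failure, up to the cap.
    return min(BASE_INTERVAL * 2 ** exponent, MAX_INTERVAL)
```

A mirror that starts passing its checks again would have its failure counter reset, bringing it back to the normal interval.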
By making it a daemon that constantly re-calculates the database even on partial mirror-rechecks, it should be possible to reduce the time it takes for some changes to take effect. It would also avoid the requirement of running a cronjob.
If this is done, the mirrors should probably be prioritised by: "reference mirrors" and later by continent (or sub-regions).
It would, for instance, be useful to detect dropped architectures after the database was built. Ideally mirrors should list the architectures they include, but...
At the moment trying to re-run non-default checks is tricky, as one needs to play the "write to the 'incoming' database" dance and that is not fun to do in a cronjob.
There's currently no code to confirm that the assumptions regarding IPv6 support actually hold true.
Some checks should be added, at least to know if and when a separate IPv6 database is necessary.
Many mirrors remain disabled because by the time they are up to date wrt their upstream, their subset is out of date already.
Instead of creating per-continent subsets (which sort of works for NA and EU), the subsets could be based on the date they were last updated.
I.e. a client from AS A should be redirected to the mirror in its own AS as long as it is not very out of date and even if their country's mirror is more up to date.
"Out of date" would still mean anything older than twelve hours (two archive pulses).
At present it is not enabled by default because it could potentially enable an architecture that was disabled by the architectures check (the one based on additional HTTP requests).
The traces check should be modified, probably in a similar way to the archs check, so that it does not re-enable an arch it didn't disable itself.
I've read in some places that most HTTP caches don't cache redirections unless they include an Expires header. All this should be investigated and tested, keeping in mind that traditional web caching is problematic with APT's repository design.
When a non-definitive geoip result is obtained for a mirror, the country in the master list is used, but the continent is never defined. Therefore, it is possible for a mirror to be assigned to a non-existing continent.
The demo page only lists the subset of the population of candidate mirrors. However, it appears that some people would like to know about all the other candidates, and as such they should be displayed. Perhaps the whole demo mode should stop using HEAD and use some alternative way to request and return the desired information.
501 Not Implemented was chosen as the geo lookup failures are mostly caused by trying to look up an IP address in a private range. However, when the request doesn't come from a private address, it should send a more appropriate error code.
Related to issue #18
There are some files in dists/ that are known not to exist in certain releases. For example, there are no InRelease files in squeeze.
It should therefore be possible to "blackhole" (i.e. throw a 404) certain requests to avoid a useless request to a mirror.
To err on the side of safety, it should be T(mirror_type,codename,file_pattern)
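A rough Python sketch of such a lookup, keyed on all three values. The table contents, the "archive" type name and the `is_blackholed` helper are illustrative assumptions:

```python
import fnmatch

# Hypothetical table of (mirror_type, codename, file_pattern) entries.
BLACKHOLE = [
    ("archive", "squeeze", "dists/squeeze/InRelease"),
]


def is_blackholed(mirror_type: str, codename: str, path: str) -> bool:
    """True if the request should be answered with a 404 right away."""
    return any(
        t == mirror_type and c == codename and fnmatch.fnmatchcase(path, pattern)
        for t, c, pattern in BLACKHOLE
    )
```

Requiring all three values to match keeps the rule narrow: the same file name stays reachable for other releases and other mirror types.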
When I try using http://http.debian.net/ with apt-cacher-ng, I receive 307 errors.