internetarchive / wayback
This project forked from iipc/openwayback
IA's public Wayback Machine (moved from SourceForge)
When page=X is added to the request, some rows disappear. For example:
http://web.archive.org/cdx/search/cdx?url=www.expertsender.ru&matchType=domain
document.body.innerText.split("\n").length = 6486
http://web.archive.org/cdx/search/cdx?url=www.expertsender.ru&matchType=domain&page=0
document.body.innerText.split("\n").length = 6246
http://web.archive.org/cdx/search/cdx?url=www.expertsender.ru&matchType=domain&page=1
returns an empty result.
Example of a row that disappears:
ru,expertsender,blog)/ispolzovanie-gif-v-emejl-rassylkax-kejs-ot-butik-ru 20160401182916 http://blog.expertsender.ru:80/ispolzovanie-gif-v-emejl-rassylkax-kejs-ot-butik-ru/ text/html 200 K6ZNHY3FGL6X67KDYYW5U7L3WEJRSIM5 12454
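For anyone reproducing this, each line of the default CDX output is space-separated into seven fields. A minimal parsing sketch (the helper name and field list follow the CDX server's default field order; this is illustrative, not Wayback's own code):

```python
# Parse one line of default CDX server output into its seven default fields.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]

def parse_cdx_line(line):
    """Split a space-separated CDX line into a field dict."""
    return dict(zip(CDX_FIELDS, line.split(" ")))

row = parse_cdx_line(
    "ru,expertsender,blog)/ispolzovanie-gif-v-emejl-rassylkax-kejs-ot-butik-ru "
    "20160401182916 "
    "http://blog.expertsender.ru:80/ispolzovanie-gif-v-emejl-rassylkax-kejs-ot-butik-ru/ "
    "text/html 200 K6ZNHY3FGL6X67KDYYW5U7L3WEJRSIM5 12454")
print(row["timestamp"], row["statuscode"])
```

Counting lines parsed this way from the paged and unpaged responses makes the discrepancy easy to measure.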
Problems found while investigating WWM-163 (replay is blocked even though the robots.txt response is 403):
Wayback should differentiate 404 and 403 from other failures and treat them as a success, rather than a failure.
Let's say I want to get a list of all images under a given domain. It so happens that this query spans multiple pages.
If I use the parameter filter=image/jpeg and a given page happens to contain no images, that page appears blank instead of being filled with results from later pages.
The Content-Range header field in a capture needs to be passed through as-is to the replay response for audio file playback to work. Found as part of ARI-3774.
AlphaParitionIndexTest assumes HashMap.values() returns items in a particular order, and it breaks with OpenJDK 7u51 (more precisely 7u51-2.4.4-0ubuntu0.12.04.2 on AMD64).
Navigate to: https://web.archive.org/web/http://bugs.chromium.org/p/project-zero/issues/detail?id=1139
See that Wayback says it's blocked by robots.txt:
See that the robots.txt for that domain, while complicated, specifically allows that type of URL:
User-agent: *
# Start by disallowing everything.
Disallow: /
# Some specific things are okay, though.
Allow: /$
Allow: /hosting
Allow: /p/*/adminIntro
# Query strings are hard. We only allow ?id=N, no other parameters.
Allow: /p/*/issues/detail?id=*
Disallow: /p/*/issues/detail?id=*&*
Disallow: /p/*/issues/detail?*&id=*
# 10 second crawl delay for bots that honor it.
Crawl-delay: 10
Expected: complex robots.txt files are parsed and matched correctly by the Wayback Machine.
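The expected matching behavior can be illustrated with a simplified longest-match evaluator (this sketch follows Google-style precedence, where the longest matching pattern wins and Allow wins ties; it is an illustration, not OpenWayback's actual parser):

```python
import re

# The rules from the robots.txt above, in order.
RULES = [
    ("disallow", "/"),
    ("allow", "/$"),
    ("allow", "/hosting"),
    ("allow", "/p/*/adminIntro"),
    ("allow", "/p/*/issues/detail?id=*"),
    ("disallow", "/p/*/issues/detail?id=*&*"),
    ("disallow", "/p/*/issues/detail?*&id=*"),
]

def pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end.
    regex = "".join(".*" if c == "*" else re.escape(c) for c in pattern)
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile("^" + regex)

def allowed(path):
    best = None  # ((pattern_length, is_allow), verdict); longest wins, Allow wins ties
    for verdict, pattern in RULES:
        if pattern_to_regex(pattern).match(path):
            key = (len(pattern), verdict == "allow")
            if best is None or key > best[0]:
                best = (key, verdict)
    return best is None or best[1] == "allow"

print(allowed("/p/project-zero/issues/detail?id=1139"))  # True
```

Under these semantics the URL in question matches `Allow: /p/*/issues/detail?id=*` (the longest applicable rule), so it should replay.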
Sometimes a website displays a favicon even though one isn't explicitly defined in the page; the site for Iridion II, for example.
It would be nice if, when no favicon is explicitly defined, the Wayback Machine looked for one at %DOMAIN%/favicon.ico.
It would probably be preferable, if more resource-consuming, to start in the same folder as the current URL and step backwards until it finds something or reaches the domain root. However, I've seen only one case where a favicon lived deeper than the root, and I'm not even sure it was ever used.
Playback of certain URLs fails with net::ERR_CONTENT_LENGTH_MISMATCH (Chrome error message). All captures of the URL are warc/revisit; there are no original captures.
Wayback is supposed to return a 404 response instead of 200 in this case, but it plays back content from a revisit record (which has no response payload). The closest capture has a WARC-Refers-To-Date pointing to another revisit capture. AccessPoint.retrievePayloadForIdenticalContentRevisit blindly believes WARC-Refers-To-Date always points to a non-revisit record (i.e. the original capture), and the subsequent CDX query does not exclude revisit captures.
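The fix implied here is to exclude warc/revisit records from the lookup and answer 404 when no original exists. A sketch of that logic (data shapes and names are ours, for illustration):

```python
# Resolve a revisit against the CDX rows for the URL, excluding other
# revisit records, and signal "not found" when no original capture exists.
def resolve_original(captures, refers_to_date):
    """captures: list of (timestamp, mimetype) CDX rows for the URL."""
    candidates = [c for c in captures
                  if c[0] == refers_to_date and c[1] != "warc/revisit"]
    if not candidates:
        return None  # caller should answer 404 instead of replaying an empty payload
    return candidates[0]

captures = [("20160101000000", "warc/revisit"),
            ("20160202000000", "warc/revisit")]
print(resolve_original(captures, "20160101000000"))  # None: only revisits exist
```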
Wayback repairs URLs like http:/example.com/ to http://example.com/, but does not repair https:/example.com/ to https://example.com/. It should.
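A repair rule covering both schemes can be sketched with a single regex (the pattern is ours, for illustration):

```python
import re

# Insert the missing slash after "http:" or "https:" when only one is present.
def repair_scheme(url):
    return re.sub(r"^(https?):/(?!/)", r"\1://", url)

print(repair_scheme("https:/example.com/"))  # https://example.com/
```

The negative lookahead leaves already-correct URLs untouched.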
Source: ARI-4337
Hi,
I want to get all archived pages for domain and all its subdomains. So I'm using the following url:
There are no records for the subdomain news.tut.by. But if I try the following URL, I get a lot of records for news.tut.by:
Thanks
http://web.archive.org/cdx/search/cdx?url=google.com&matchType=domain&output=json
It would be useful if output like this also included the total count of elements (the number of archived pages).
An XHTML capture results in an XML parse error in the browser, because the head insert is inserted before the XML declaration <?xml version="1.0" ... ?>.
CharsetDetector fails to detect the correct character encoding when the META tag says charset=UTF-16 but the content is in fact UTF-8. This is because CharsetDetector gives higher priority to the META tag than to the charset detected from the content. Reimplement CharsetDetector in reference to the WHATWG encoding-sniffing algorithm: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#encoding-sniffing-algorithm
Known internally as ARI-3933.
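One relevant detail of the WHATWG algorithm: a charset=UTF-16 label found by scanning ASCII-compatible bytes cannot be correct (a real UTF-16 document would not be readable that way), so the spec says to use UTF-8 instead. A sketch of that normalization step (function name is ours):

```python
# Normalize a meta-declared charset label per the WHATWG prescan rules:
# UTF-16 labels become UTF-8, x-user-defined becomes windows-1252.
def effective_meta_charset(label):
    label = label.strip().lower()
    if label in ("utf-16", "utf-16le", "utf-16be"):
        return "utf-8"
    if label == "x-user-defined":
        return "windows-1252"
    return label

print(effective_meta_charset("UTF-16"))  # utf-8
```

Applying this rule alone would fix the exact failure described in this issue.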
FastArchivalUrlReplayParserEventHandler gets confused by what looks like an end-tag inside a script. A minimized test case is
<html><head><script>/</g;900>a;a<k;</script></head><body></body></html>
JspInsert is inserted between a;a and <k;, because </g;900> is parsed as an end-tag.
Internally known as WWM-118.
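Per the HTML parsing rules, script data ends only at a case-insensitive "</script" sequence, not at any "</x"-shaped run. A sketch of scanning for the true end of the script content in the minimized test case (helper name is ours):

```python
import re

# Find script content by scanning only for "</script", as the HTML spec requires.
def script_content(html, start):
    """start: index just past '<script>'. Returns content up to '</script'."""
    m = re.compile(r"</script", re.IGNORECASE).search(html, start)
    return html[start:m.start()] if m else html[start:]

html = "<html><head><script>/</g;900>a;a<k;</script></head><body></body></html>"
start = html.index("<script>") + len("<script>")
print(script_content(html, start))  # /</g;900>a;a<k;
```

A tokenizer using this rule would treat </g;900> as plain script text and not emit the insert there.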
We've gone through several iterations trying to come up with a good URL rewrite scheme for archival-URL mode. Our conclusion at this point is that we need to maintain the form of the original URL before and after rewrite. By form we mean the various degrees of absolute/relative-ness of a URL. In other words we want to rewrite a full URL to a full URL (http://www.example.com to http://web.archive.org/20140101121314/http://www.example.com), protocol-relative to protocol-relative (//www.example.com to //web.archive.org/20140101121314/http://www.example.com), a relative path to a relative path (styles/mobile.css to styles/mobile.css), etc.
We found this rather awkward to achieve with the existing framework for URI rewriting. The ResultURIConverter.makeReplayURI() method takes only two String parameters, datespec and url, so it has no access to the context in which url was found. To work around this, there is a lot of clumsy code around it, which results in an overly complex framework. Here are some observations:
- ReplayParseContext has ResultURIConverter instances for each of the context flags (e.g. cs_), built through ContextResultURIConverterFactory, just for including context flags in the replay URL. Those instances would be unnecessary if makeReplayURI() took context flags as an argument.
- ContextResultURIConverterFactory has two different uses. While its getContextConverter method has a single argument called flags, implying context flags, it can also receive the replay URL prefix (see AccessPointAdapter.getUriConverter()). I suppose the ContextResultURIConverterFactory implementations taking context flags would have been unnecessary if ResultURIConverter.makeReplayURI() took context flags as an argument.
- ReplayParseContext.contextualizeUrl(String, String) checks whether URL-rewrite is necessary, and then converts the URL to full absolute form before passing it to ResultURIConverter. This makes it impossible for a ResultURIConverter implementation to preserve the mode of URL described above. Considering ResultURIConverter's primary role, these steps should be left to the ResultURIConverter implementation.
- When the page being replayed is, e.g., http://www.example.com, relative URLs need to be converted to a full path. ResultURIConverter needs to know the URL being replayed to achieve this. This is another supporting case for additional parameters in ResultURIConverter.makeReplayURI().
- EmbeddedCDXServerIndex.addTimegateHeaders() prepends mementoPrefix to the URI returned by ResultURIConverter to ensure Memento URLs are always in absolute form. This is necessary because ResultURIConverter is used for two different purposes, and it breaks if ResultURIConverter returns different forms of URL depending on the context.
- We need to feed the X-Forwarded-Proto request header field into URL rewriting so that it can build absolute URLs with the appropriate protocol (http or https). We worked around this by storing the header value in a ThreadLocal.

Our JIRA ARI-4033 depends on the resolution of this issue.
Resolution Plan:
- Have AccessPointAdapter implement ResultURIConverter. This (along with the change below) should make ContextResultURIConverterFactory unnecessary.
- ReplayParseContext can implement (for better modularity and ease of testing)

When searching for an archived version of a URL with status code 2xx, it can currently take some time before an archived version is found that was captured while the page was still available. Finding the right version of an archived URL would become a lot easier if it were easy to see which archived versions returned status code 2xx or 3xx.
Currently an archived page is shown in the Wayback Machine as a blue circle on the date it was archived; see for example http://wayback.archive.org/web/20010501000000*/http://archive.org. Multiple colors could be used here to indicate the status code of a page, for example:
When a URL is archived multiple times on the same day, a larger circle is shown. Multiple colors could be added to this larger circle to show the status codes with which the page was archived, for example:
The same idea can be used for the black bars showing the number of archived versions per month.
I think implementing colors, or some other way of showing what status code a URL returned when it was archived, would be very helpful for finding the right version of a URL.
Let me know if this isn't the right repo, but ran into an issue when testing archival features on http://www.goodbyetohalos.com/
Like many webcomics using WordPress nowadays, Goodbye to Halos uses the HTML5 srcset attribute to display different image sizes to different devices:
<img
width="800" height="1200"
src="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg"
class="attachment-full size-full" alt=""
srcset="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg 800w,
http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-480x720.jpg 480w,
http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-96x144.jpg 96w"
sizes="(max-width: 800px) 100vw, 800px"
data-webcomic-parent="837"
>
So far, so good. However, after crawling/scraping these with Wayback, only the src URL is scraped and rewritten, leading to the image on the wayback'ed page still being served from the original server:
<img
width="800" height="1200"
src="/web/20170127042412im_/http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg"
class="attachment-full size-full" alt=""
srcset="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg 800w,
http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-480x720.jpg 480w,
http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-96x144.jpg 96w"
sizes="(max-width: 800px) 100vw, 800px"
data-webcomic-parent="837"
>
This is very obvious because the original site doesn't use https, so it leads to a broken image in the Wayback Machine view:
Obviously, the correct behavior here is that all of the images should be scraped (in this case they're just resizings, but in theory they could be completely different images; nothing prevents that) and rewritten.
Thanks! let me know if you need more information, or want me to whip up a more minimal test case
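The rewriting requested above can be sketched as splitting the attribute into candidates and prefixing each URL (naive comma splitting; URLs containing commas would need a smarter tokenizer; the replay prefix is taken from the example above):

```python
# Rewrite every candidate URL in a srcset attribute, not just src.
def rewrite_srcset(srcset, prefix="/web/20170127042412im_/"):
    out = []
    for candidate in srcset.split(","):
        parts = candidate.strip().split(None, 1)  # URL [descriptor]
        parts[0] = prefix + parts[0]
        out.append(" ".join(parts))
    return ", ".join(out)

srcset = ("http://example.com/a.jpg 800w, "
          "http://example.com/a-480.jpg 480w")
print(rewrite_srcset(srcset))
```

The descriptor tokens (800w, 480w, pixel densities like 2x) pass through unchanged; only the URL component is prefixed.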
ReplayParseContext has ad-hoc support for the case where URLs are escaped in the target resource. For example it recognizes URLs written in JavaScript as "http://example.com/..." as absolute URLs. This approach has a few problems:
- ReplayParseContext gets messy.
- JSStringTransformer is also used for other types of resources, which can have different ways of escaping characters.

It would be more robust to implement unescaping in JSStringTransformer, so it passes clean URLs to ReplayParseContext. It could also escape special characters back before inserting rewritten URLs.
Hi,
I was looking at syncing up our forks, and couldn't proceed because webarchive-commons was forked a while ago: 6555609
I've just pulled your changes to webarchive-commons into the IIPC version and rolled a 1.1.3 release including that change (and a number of bugfixes). Would you consider switching back to the IIPC version? It would make keeping our forks in sync much easier.
Thanks,
Andy
I have downloaded the wayback cdx-server API and imported it in Eclipse, but an HTTP 404 error occurs. I have checked the deployment assembly and also checked the pom.xml.
If a text (HTML, CSS, JavaScript) response is gzip-encoded (has Content-Encoding: gzip), the replay response has a weirdly-named header field: X-Archive-Orig-X-Archive-Orig-Encoding: gzip. It is supposed to be X-Archive-Orig-Encoding: gzip.
This is because TextReplayRenderer.decodeResource replaces the Content-Encoding header field with an X-Archive-Orig-Encoding header field while applying gzip-decode, and then RedirectRewritingHttpHeaderProcessor prepends X-Archive-Orig- to it (note this is configurable).
An easy solution would be to avoid prepending the prefix when the header field name already starts with X-Archive-Orig-, but this sounds too ad-hoc. The X-Archive-Orig- prefix is currently hard-coded, but we may want to make it configurable.
EmbeddedCDXServerIndex has a timestampDedupLength property for culling captures, to prevent the capture search result page from getting too crowded (we call this feature timestamp-dedup hereafter). This property applies to all capture search queries, whether for the capture list page or for looking up the closest capture for replay.
While we want timestamp-dedup for the capture list page, we learned it is problematic for capture lookup for replay, because it often breaks revisit resolution. We want to disable timestamp-dedup when the capture search query is for replay.
Internally known as ARI-3883.
RobotRule.blocksPathForUA(String, String) returns false for any path with this robots.txt:
User-agent: *
Disallow:
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-login.php
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Per the robots.txt specification, an empty Disallow: shall simply be ignored. RobotRules instead returns false when it hits an empty Disallow:, ignoring the rest of the rules.
Found by ARI-4212.
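The correct handling can be sketched as skipping empty Disallow: lines rather than aborting the record (illustrative parser, not the actual RobotRules code):

```python
# Collect Disallow paths, ignoring empty Disallow: lines as the spec requires.
def parse_disallows(lines):
    disallows = []
    for line in lines:
        if not line.lower().startswith("disallow:"):
            continue
        path = line.split(":", 1)[1].strip()
        if path:  # an empty Disallow disallows nothing; skip it
            disallows.append(path)
    return disallows

def blocks_path(disallows, path):
    return any(path.startswith(d) for d in disallows)

rules = parse_disallows(["User-agent: *", "Disallow:", "Disallow: /wp-admin"])
print(blocks_path(rules, "/wp-admin/index.php"))  # True
```

With this handling, the rules after the empty Disallow: in the reported robots.txt still take effect.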
Originally reported in ARI-3880.
Failure case has HTML like this:
<html>
<head>
<title>...</title>
<script type="text/javascript" src="scripts/header.js"></script>
<p align="center">
...
FastArchivalUrlReplayParseEventHandler fails to insert the body-insert (jspInsertPath) because the relevant code block is skipped while the inHead flag is true (set by the appearance of a HEAD tag). This results in a failure to render the top-of-the-page banner (typically the disclaimer and navigation bar).
Currently Wayback does nothing special with range requests and simply renders whichever capture matches the URL + timestamp combination. This works as long as the capture is either a 200 response (the browser assumes the server does not support Range requests) or a 206 response with a matching Content-Range.
We recently found that some HTML5 browsers, when playing a video, first probe the server by making a range request for the entire file, then make another request for a small range near the end of the file. If the server does not return a 206 response matching the request, the browser stops video playback. To support HTML5 video playback, Wayback needs to implement range request handling of its own.
This issue is internally known as ARI-4254.
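What the replay side would need to do can be sketched as parsing a single "bytes=start-end" range over the stored payload and building the matching 206 response (names are ours; suffix ranges like "bytes=-500" and multi-range requests are omitted for brevity):

```python
# Serve one "bytes=start-end" range over a capture payload with a 206 response.
def serve_range(payload, range_header):
    total = len(payload)
    spec = range_header.split("=", 1)[1]
    start_s, end_s = spec.split("-", 1)
    start = int(start_s) if start_s else 0
    end = int(end_s) if end_s else total - 1
    end = min(end, total - 1)
    body = payload[start:end + 1]
    headers = {"Content-Range": "bytes %d-%d/%d" % (start, end, total),
               "Content-Length": str(len(body))}
    return 206, headers, body

status, headers, body = serve_range(b"0123456789", "bytes=8-")
print(status, headers["Content-Range"])  # 206 bytes 8-9/10
```

An open-ended range near the end of the file, as in the probing behavior described above, gets a 206 with the exact Content-Range the browser expects.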
(This is an issue item for already completed work)
Determine the mime type by looking into the payload when the mimetype in the search result is either suspected to have an incorrect value (e.g. text/html) or missing (e.g. unk).
Known internally as ARI-3822, ARI-3888, WWM-58. Bug fixes in ARI-4071 and ARI-4078.
Base work is done in commits 65dfc40 through 7d9d332; bug fixes are being tracked on the mimetype-detector branch.
A Resource record is always rendered as text/html, regardless of the Content-Type WARC header field.
This is due to a lack of metadata record support in JWATResource. It does not return the Content-Type header field from its getHttpHeaders() method, so Tomcat supplies the default value text/html.
Known internally as WWM-126.
Blocked captures are often referred to by later revisit captures, and there's a need for making such blocked captures available only for replaying revisit captures.
Internally known as ARI-3879 and ARI-4034.
Archive-It found an issue with the Referer header generated by the Flash plugin for Firefox (ARI-4169) and wants to extend ServerRelativeArchivalRedirect with a supplemental method for obtaining the ArchivalUrl context. As the method depends on a private JavaScript library, we'd like to keep the enhancement local to Archive-It for now. Unfortunately, ServerRelativeArchivalRedirect has no extension point to enable this.
The plan is to move the code in ServerRelativeArchivalRedirect that parses the Referer into a new method, so that a sub-class can override it.
Here, for example. Note how the URL has ampersands in it. If you were to click to another point in the timeline, the URL you would go to would have all of the ampersands replaced with &amp;, resulting in a different set of crawls being shown.
Sure, with this page you would still see something, but in other cases the user won't be as lucky.
It appears the core issue is that, for whatever reason, the wbCurrentUrl variable is HTML-encoded. Bizarrely, this does not happen on the "see all crawls" page.
This could probably be fixed by changing line 74 of that file to var wbCurrentUrl = "<%= StringEscapeUtils.unescapeHtml(searchUrlJS) %>";
Currently percent-encoded URLs are not rewritten. For example, the text from https://web.archive.org/web/20150804131701/http://blip.tv/file/get/NostalgiaCritic-NCPlanetOfTheApes401.m4v?showplayer=2014093037100220150422135039&referrer=http://blip.tv&mask=11&skin=flashvars&view=url should be rewritten like:
Original:
message=http%3A%2F%2Fj41.video2.blip.tv%2F5520014255207%2FNostalgiaCritic-NCPlanetOfTheApes401.m4v%3Fir%3D96428%26sr%3D2334
Rewritten:
message=http%3A%2F%2Fweb.archive.org%2Fweb%2F20150804131701%2Fhttp%3A%2F%2Fj41.video2.blip.tv%2F5520014255207%2FNostalgiaCritic-NCPlanetOfTheApes401.m4v%3Fir%3D96428%26sr%3D2334
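The transformation can be sketched as decode, prepend the archival prefix, re-encode (the prefix matches the example above; the helper name is ours):

```python
from urllib.parse import quote, unquote

# Rewrite a percent-encoded URL embedded in a query parameter value.
def rewrite_encoded(value, prefix="http://web.archive.org/web/20150804131701/"):
    return quote(prefix + unquote(value), safe="")

encoded = ("http%3A%2F%2Fj41.video2.blip.tv%2F5520014255207%2F"
           "NostalgiaCritic-NCPlanetOfTheApes401.m4v%3Fir%3D96428%26sr%3D2334")
print(rewrite_encoded(encoded))
```

Passing safe="" keeps the slashes in the rewritten value encoded, exactly as in the expected output above.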
Timestamp-collapsing returns the first best capture in each group. Another option is to return the last best capture in each group. There are cases where the latter works better than the former.
Internally known as ARI-3994.
The revisit record handling code makes a bad assumption that revisit records are always an instance of WarcResource. There is an alternative implementation, JWATResource, and revisit replay throws a ClassCastException with it.
Internally known as WWM-101.
From WWM-110.
Some UAs do more URL-encoding than strictly necessary. Notably, * is sometimes passed to Wayback %-encoded as %2A. Currently this results in a 404 error. There seems to be nothing against URL-decoding the date component of an Archival-URL before parsing, so that 2010%2A is recognized as 2010*.
At https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#advanced-usage:
- The "Closest Timestamp Match" link fails; no such section exists.
- The "Resumption Key" link fails; such a section exists, but with a different anchor.
- The "Resolve Revisits" link fails; no such section exists.
Internally known as ARI-4024.
Wayback has two distinct interfaces for rewriting text resources: StringTransformer and RewriteRule. It would be useful if we could somehow unify them. At the least, there is a need for using MultipleRegexReplaceStringTransformer as a RewriteRule. The first step is to have MultiRegexReplaceStringTransformer implement the RewriteRule interface.
Changes are on the unify-rewrite branch, and ready to merge into openwayback.
When the filter below is used, some rows are not present in the result list:
http://web.archive.org/cdx/search/cdx?url=http://www.expertsender.ru/bundles/core&matchType=prefix
Example row
ru,expertsender)/bundles/core?v=9nx-rocbnddl6mfbsncc8jgbjid4p8wyv00b9yjdxm81 20161015143737 http://www.expertsender.ru/bundles/core?v=9nX-roCbNddL6MFBsnCc8JGbjiD4p8wYv00b9YJdXm81 text/css 200 JROZKHBMIZC6TXGOLNJGIPRE73Q23WTD 28877
To find this row you have to specify the full URL:
http://web.archive.org/cdx/search/cdx?url=http://www.expertsender.ru/bundles/core?v=9nX-roCbNddL6MFBsnCc8JGbjiD4p8wYv00b9YJdXm81&matchType=prefix
EmbeddedCDXServer has a feature that turns off the robots exclusion check for embeds, but it is not working at all, because the PrivTokenAuthChecker.isAllUrlAccessAllowed() method turns the robots exclusion flag back on.
I consider it bad practice for a getter method to have this kind of side-effect.
Hey,
How can I get all the text/html pages that have been archived?
Thanks
UIResults.makeCaptureQueryUrl() generates a very long URL with a pile of unnecessary query parameters. This significantly increases the size of the URL query result page.
Embed-mode replay first searches for captures with the timestampSearchKey flag turned on, for faster lookup. If the URL has a long revisit history and replay therefore cannot resolve the revisit within the constrained time range for timestampSearchKey, it reruns the capture query with the timestampSearchKey flag turned off. It is supposed to re-initialize captureSelector at that point, but it doesn't. So the replay code moves on to the next capture, returns a redirect response, and repeats.
Currently the collection-sensitive exclusion filter provided by CompositeAccessPoint is inflexible:
- CustomPolicyOracleFilter is hard-coded, and it can only be combined with the ExclusionFilterFactorys configured in CompositeAccessPoint's staticExclusions property.
- Exclusion rules are matched on urlkey only, prohibiting time-ranged exclusion rules.
- CDXServer cannot pass oraclePolicy (used for delivering custom rewrite rules) from the ExclusionFilter to the capture search result.

As a result, EmbeddedCDXServerIndex has to inject the ExclusionFilterFactory from AccessPointAdapter's exclusionFactory into CDXToCaptureSearchResultWriter's exclusionFilter - that is, the Oracle exclusion filter runs at the final step of the CDX processing pipeline. This turned out to be problematic, since exclusion happens after timestamp-deduplication. Apparently CDXToCaptureSearchResultWriter's exclusionFilter is necessary solely to support use of the Oracle exclusion filter with CDXServer. Having multiple ways of configuring exclusion filters makes the code hard to follow, and customization painful.
Add a configuration option to PerfWritingHttpServletResponse for writing the perfStats response header field in JSON format. JSON is easier for monitoring tools to parse.
In making use of the Wayback CDX server API (documented here), I noticed that when using resumeKeys I get odd behavior when leaving the urlkey field out of the fieldOrder. Specifically, it looks like the CDX server jumps directly to the 2013 era, even though there are valid records before that:
$ wget -q -U '' -O - 'https://web.archive.org/cdx/search/cdx?collapse=timestamp%3A8&url=https%3A%2F%2Farchive.org&limit=5&fl=timestamp%2Cstatuscode&showResumeKey=true'
19970126045828 200
19971011050034 200
19971211122953 200
19980109140106 200
19980113025731 200
-+19980113025732
$ wget -q -U '' -O - 'https://web.archive.org/cdx/search/cdx?collapse=timestamp%3A8&url=https%3A%2F%2Farchive.org&limit=5&fl=timestamp%2Cstatuscode&showResumeKey=true&resumeKey=-+19980113025732'
20131019030216 502
20130818180757 502
20130402123654 502
20130902085637 502
20130903032956 502
Everything seems to work fine if I include the urlkey field:
$ wget -q -U '' -O - 'https://web.archive.org/cdx/search/cdx?collapse=timestamp%3A8&url=https%3A%2F%2Farchive.org&limit=5&fl=urlkey,timestamp%2Cstatuscode&showResumeKey=true'
org,archive)/ 19970126045828 200
org,archive)/ 19971011050034 200
org,archive)/ 19971211122953 200
org,archive)/ 19980109140106 200
org,archive)/ 19980113025731 200
org%2Carchive%29%2F+19980113025732
$ wget -q -U '' -O - 'https://web.archive.org/cdx/search/cdx?collapse=timestamp%3A8&url=https%3A%2F%2Farchive.org&limit=5&fl=urlkey,timestamp%2Cstatuscode&showResumeKey=true&resumeKey=org%2Carchive%29%2F+19980113025732'
org,archive)/ 19980129163431 200
org,archive)/ 19980501124530 200
org,archive)/ 19990116225149 200
org,archive)/ 19990117003935 200
org,archive)/ 19990202042615 200
org%2Carchive%29%2F+19990202042616
Perhaps there's an undocumented dependency on passing the urlkey field?
Thanks
Issue ARI-4272 reports that Wayback replay fails with ResourceNotInArchive for a URL ending with &* even though there are multiple captures of it. For example:
- http://example.com/*
- http://example.com/index.php?p=90575209& at 20031224194819
FlexResourceStore throws a NullPointerException if any of the configured PathIndex files is missing:
WARNING: Runtime Error
org.archive.wayback.exception.ResourceNotAvailableException: File not Found: aaa.warc.gz
at org.archive.wayback.resourcestore.FlexResourceStore.retrieveResource(FlexResourceStore.java:266)
I'm listening to the Dead on archive.org, and it would sure be swell to have gapless playback, to reduce the buzzkill when listening to Dead concerts (I'm currently working through 1989 :).
This is almost certainly the wrong repo for this ticket. Can you point me to the right repo?
I looked at the banner and "Help," "Jobs," and "Volunteer" all sound the same to me, and none of them answered the "which repo" question for me.
The closest was https://developers.archive.org/get-started/. That seems to be aimed more at developers using JSON APIs than developers interested in helping with the software itself.
I also browsed around https://github.com/internetarchive ... and even ended up on https://github.com/iipc (:scream_cat: good lord, what is this!?). I found a repo for Wayback, but I think that's different from what I want, right? Is there a repo for archive.org?
Thanks! :-)