Giter Club home page Giter Club logo

Comments (18)

anjackson avatar anjackson commented on August 27, 2024

Not sure how best to handle this. Could we wrap the StoredCollection in an iterator class that adds a finalize method to close the cursor on GC? And/or monitor when the iterator runs out and close it before handing back control?

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

Also I think the Danish web archive team were looking at upgrading BDB, and maybe later versions offer different options?

from heritrix3.

ato avatar ato commented on August 27, 2024

Note that finalizers were deprecated in jdk 9 and are removed entirely from the language in jdk 12.

from heritrix3.

ato avatar ato commented on August 27, 2024

Given the only method the existing code implements apparantly is toArray() it seems like Arrays.asList(toArray()).iterator() should do

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

Well that does allow the tests to run, and most of them work. The one that doesn't may be a faulty test.

/*
* XXX I think browsers differ on this behavior. This is what
* org.apache.http.impl.cookie.BrowserCompatSpec does.
*/
curi = makeCrawlURI("http://SUBDOMAIN.example.com:7777/");
fetcher().process(curi);
assertTrue(FetchHTTPTests.httpRequestString(curi).contains("Cookie: foo=bar\r\n"));

EDIT Having clipped that test out, all the tests run fine.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

See https://github.com/ukwa/heritrix3/tree/upgrade-httpclient for experimental branch.

from heritrix3.

ato avatar ato commented on August 27, 2024

Per RFC6265:

Unless the cookie's attributes indicate otherwise, the cookie is returned only to the origin server (and not, for example, to any subdomains),

So that subdomain test is indeed now incorrect behaviour.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

Having run this for a while, this problem turned up:

 java.lang.RuntimeException: not implemented
        at org.archive.modules.fetcher.AbstractCookieStore$LimitedCookieStoreFacade.clearExpired(AbstractCookieStore.java:235)
        at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:183)
        at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:133)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:183)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:745)
        at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

It turned up A LOT.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

Looks like this needs implementing before we can consider this version workable:

@Override
public boolean clearExpired(Date date) {
throw new RuntimeException("not implemented");
}
@Override
public void clear() {
throw new RuntimeException("not implemented");
}

from heritrix3.

ato avatar ato commented on August 27, 2024

Looks like the old version just ignored expired cookies. The new version ignores and then calls clearExpired if it encountered any. So old behaviour could be preserved by making clearExpired do nothing rather than throwing. We might actually want to implement it though, not sure of the implications.

I reverted the PR temporarily until we have a solution.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

The follow-on work stalled with an error handling HTTP Digest authentication: #260 (comment)

I've created a new branch on a more recent version and am looking at this again. It seems that the HTTP Digest error arose because of this part of the FetchHTTPRequest implementation:

AuthCache authCache = httpClientContext.getAuthCache();
if (authCache == null) {
authCache = new BasicAuthCache();
httpClientContext.setAuthCache(authCache);
}
authCache.put(host, authScheme);

This overrides the default AuthCache behaviour, but this caching pattern appears to break because of how Digests are handled. See https://issues.apache.org/jira/browse/HTTPCLIENT-1855 for a related issue that seems to show how poor caching can break Digest auth.

If I simply remove the above code from FetchHTTPRequest, then the tests all pass. The failing test was:

// fetch a fresh uri to make sure auth info was cached and we don't get another 401
curi = makeCrawlURI("http://localhost:7778/auth/2");
fetcher().process(curi);
httpRequestString = httpRequestString(curi);
assertTrue(httpRequestString.contains("Authorization: Digest"));
// otherwise should be a normal 200 response
runDefaultChecks(curi, "requestLine", "hostHeader");

and the fact that the test passes seems to indicated that the authentication information is being cached just fine but the default behaviour.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

I'm putting together a new PR, but I note that this still does not deal with this comment from @ato (above):

Looks like the old version just ignored expired cookies. The new version ignores and then calls clearExpired if it encountered any. So old behaviour could be preserved by making clearExpired do nothing rather than throwing. We might actually want to implement it though, not sure of the implications.

I'm also unsure of the implications of this - whether it's okay to ignore expired cookies, or whether they should be expired properly. Seems like we should probably implement the expiration.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

Looking through the code, it seems that when processing each HTTP response, the code calls addCookie for every cookie. Under the hood, this does:

if (!cookie.isExpired(new Date())) {
cookies.put(key, cookie);
} else {
cookies.remove(key);
}

i.e. I think every cookie gets re-added every time, and at that time every expired cookie will be removed. Looking at the HTTP Client code (org.apache.http.client.protocol.RequestAddCookies), I find:

         // Per RFC 6265, 5.3
        // The user agent must evict all expired cookies if, at any time, an expired cookie
        // exists in the cookie store
        if (expired) {
            cookieStore.clearExpired(now);
        }

So I guess the gap is that H3 is expired cookies after a request, whereas HTTP Client is expiring them before. This does seem sensible, but would be quite difficult to implement, as the BdbCookieStore is careful to protect the underlying store and only present an immutable list to the HTTP Client.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

Ugh, attempting to use a SNAPSHOT build for testing and hitting a weird error:

heritrix_1                           | java.lang.NoSuchMethodError: java.nio.CharBuffer.flip()Ljava/nio/CharBuffer;
heritrix_1                           | 	at org.archive.modules.net.Robotstxt.initializeFromReader(Robotstxt.java:92)
heritrix_1                           | 	at org.archive.modules.net.Robotstxt.<init>(Robotstxt.java:70)
heritrix_1                           | 	at org.archive.modules.net.CrawlServer.updateRobots(CrawlServer.java:180)
heritrix_1                           | 	at uk.bl.wap.crawler.postprocessor.ModifiedDispositionProcessor.innerProcess(ModifiedDispositionProcessor.java:78)
heritrix_1                           | 	at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
heritrix_1                           | 	at org.archive.modules.Processor.process(Processor.java:142)
heritrix_1                           | 	at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
heritrix_1                           | 	at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

Which implies the code was compile for Java 9, but running under Java 8. It was compiled under Oracle JDK 8, targeting 1.8. Very odd.

See https://stackoverflow.com/questions/61267495/exception-in-thread-main-java-lang-nosuchmethoderror-java-nio-bytebuffer-flip and note that CharBuffer gained it's own .flip() method when Java 9 came out.

I'm trying upgrading the maven-compiler-plugin as this often helps resolve issues like this.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

Okay, PR #397 is available for testing.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

I tracked down the .flip() issue - although my machines command line javac was version 8, Maven was using the version installed by Homebrew (JDK 15), and as such was referring to the later run-time libraries. I'll try to address this via the new release functionality.

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

I'd be particularly pleased if anyone can confirm dropping the cache is the best way to fix the Digest authentication issue noted above: #245 (comment)

from heritrix3.

anjackson avatar anjackson commented on August 27, 2024

Closed by #397.

from heritrix3.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.