Comments (18)
Not sure how best to handle this. Could we wrap the StoredCollection in an iterator class that adds a finalize method to close the cursor on GC? And/or monitor when the iterator runs out and close it before handing back control?
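The second idea (close on exhaustion rather than on GC) could be sketched roughly as below. This is illustrative only: the class and constructor here are hypothetical, and the Closeable stands in for the underlying BDB cursor rather than the actual StoredCollection API.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch: wrap a cursor-backed iterator and release the cursor as
// soon as the iterator runs out (or when close() is called early).
class CursorClosingIterator<T> implements Iterator<T>, Closeable {
    private final Iterator<T> delegate;
    private final Closeable cursor; // stand-in for the BDB cursor
    private boolean closed = false;

    CursorClosingIterator(Iterator<T> delegate, Closeable cursor) {
        this.delegate = delegate;
        this.cursor = cursor;
    }

    @Override
    public boolean hasNext() {
        boolean more = !closed && delegate.hasNext();
        if (!more) {
            close(); // exhausted: release the cursor immediately
        }
        return more;
    }

    @Override
    public T next() {
        if (closed) {
            throw new NoSuchElementException();
        }
        return delegate.next();
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            try {
                cursor.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

This avoids relying on finalization entirely, which matters given the deprecation noted below; the trade-off is that a caller who abandons the iterator mid-walk must remember to call close().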
from heritrix3.
Also I think the Danish web archive team were looking at upgrading BDB, and maybe later versions offer different options?
Note that finalizers were deprecated in jdk 9 and are removed entirely from the language in jdk 12.
Given that apparently the only method the existing code implements is toArray(), it seems like Arrays.asList(toArray()).iterator() should do.
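The snapshot approach suggested above could look roughly like this. The List parameter stands in for the StoredCollection; the point is that toArray() walks the cursor to completion, so no cursor stays open while the caller iterates.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;

// Sketch of the snapshot approach: materialise the collection once
// via toArray(), then iterate the plain in-memory copy.
class SnapshotIterators {
    static Iterator<Object> snapshotIterator(Collection<?> storedCollection) {
        // toArray() consumes (and can close) the underlying cursor;
        // the returned array is an ordinary heap copy.
        return Arrays.asList(storedCollection.toArray()).iterator();
    }
}
```

The cost is memory proportional to the collection size, which may or may not matter for the collections involved here.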
Well, that does allow the tests to run, and most of them work. The one that doesn't may be a faulty test.
EDIT: Having clipped that test out, all the tests run fine.
See https://github.com/ukwa/heritrix3/tree/upgrade-httpclient for experimental branch.
Per RFC 6265:
Unless the cookie's attributes indicate otherwise, the cookie is returned only to the origin server (and not, for example, to any subdomains),
So that subdomain test is indeed asserting behaviour that is now incorrect.
Having run this for a while, this problem turned up:
java.lang.RuntimeException: not implemented
at org.archive.modules.fetcher.AbstractCookieStore$LimitedCookieStoreFacade.clearExpired(AbstractCookieStore.java:235)
at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:183)
at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:133)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:183)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:745)
at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
at org.archive.modules.Processor.process(Processor.java:142)
at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
It turned up A LOT.
Looks like this needs implementing before we can consider this version workable:
Looks like the old version just ignored expired cookies. The new version also ignores them, but then calls clearExpired if it encountered any. So the old behaviour could be preserved by making clearExpired do nothing rather than throwing. We might actually want to implement it though; I'm not sure of the implications.
I reverted the PR temporarily until we have a solution.
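The no-op workaround described above could be as small as the sketch below. The method mirrors the signature of org.apache.http.client.CookieStore#clearExpired (which the facade in the stack trace implements), but the surrounding class here is only a stand-in for LimitedCookieStoreFacade.

```java
import java.util.Date;

// Illustrative sketch: instead of throwing "not implemented", make
// clearExpired a no-op so the old behaviour (expired cookies are
// simply never sent, but linger in the store) is preserved.
class NoOpClearExpired {
    /** Report that nothing was removed, rather than throwing. */
    public boolean clearExpired(Date date) {
        // Expired cookies are already filtered out when cookies are
        // matched against a request; skipping eviction here only
        // means they accumulate in the store over time.
        return false;
    }
}
```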
The follow-on work stalled with an error handling HTTP Digest authentication: #260 (comment)
I've created a new branch on a more recent version and am looking at this again. It seems that the HTTP Digest error arose because of this part of the FetchHTTPRequest implementation:
This overrides the default AuthCache behaviour, but this caching pattern appears to break because of how Digests are handled. See https://issues.apache.org/jira/browse/HTTPCLIENT-1855 for a related issue that seems to show how poor caching can break Digest auth.
If I simply remove the above code from FetchHTTPRequest, then the tests all pass. The failing test was:
and the fact that the test passes seems to indicate that the authentication information is being cached just fine by the default behaviour.
I'm putting together a new PR, but I note that this still does not deal with this comment from @ato (above):
Looks like the old version just ignored expired cookies. The new version also ignores them, but then calls clearExpired if it encountered any. So the old behaviour could be preserved by making clearExpired do nothing rather than throwing. We might actually want to implement it though; I'm not sure of the implications.
I'm also unsure of the implications here: whether it's okay to simply ignore expired cookies, or whether they should be evicted properly. It seems like we should probably implement the expiration.
Looking through the code, it seems that when processing each HTTP response, the code calls addCookie for every cookie. Under the hood, this does:
i.e. I think every cookie gets re-added every time, and at that time every expired cookie will be removed. Looking at the HTTP Client code (org.apache.http.client.protocol.RequestAddCookies), I find:
// Per RFC 6265, 5.3
// The user agent must evict all expired cookies if, at any time, an expired cookie
// exists in the cookie store
if (expired) {
cookieStore.clearExpired(now);
}
So I guess the gap is that H3 is expiring cookies after a request, whereas HTTP Client is expiring them before. This does seem sensible, but it would be quite difficult to implement, as the BdbCookieStore is careful to protect the underlying store and only presents an immutable list to the HTTP Client.
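If the facade were ever given a hook into the mutable map that BdbCookieStore guards, a real clearExpired might look like the sketch below. Everything here is an assumption: the real class deliberately hides its internal store, so this would need a new (e.g. package-private) accessor, and the Cookie interface here is a minimal stand-in for org.apache.http.cookie.Cookie.

```java
import java.util.Date;
import java.util.Iterator;
import java.util.Map;

// Hedged sketch of implementing clearExpired against a mutable
// cookie map; the BdbCookieStore internals may differ.
class ExpiringStore {
    /** Minimal stand-in for org.apache.http.cookie.Cookie. */
    interface Cookie {
        boolean isExpired(Date date);
    }

    private final Map<String, Cookie> cookies;

    ExpiringStore(Map<String, Cookie> cookies) {
        this.cookies = cookies;
    }

    /** Remove every cookie that has expired as of {@code date}. */
    boolean clearExpired(Date date) {
        boolean removed = false;
        for (Iterator<Map.Entry<String, Cookie>> it =
                cookies.entrySet().iterator(); it.hasNext(); ) {
            if (it.next().getValue().isExpired(date)) {
                it.remove(); // evict via the iterator, per RFC 6265 5.3
                removed = true;
            }
        }
        return removed;
    }
}
```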
Ugh, attempting to use a SNAPSHOT build for testing and hitting a weird error:
heritrix_1 | java.lang.NoSuchMethodError: java.nio.CharBuffer.flip()Ljava/nio/CharBuffer;
heritrix_1 | at org.archive.modules.net.Robotstxt.initializeFromReader(Robotstxt.java:92)
heritrix_1 | at org.archive.modules.net.Robotstxt.<init>(Robotstxt.java:70)
heritrix_1 | at org.archive.modules.net.CrawlServer.updateRobots(CrawlServer.java:180)
heritrix_1 | at uk.bl.wap.crawler.postprocessor.ModifiedDispositionProcessor.innerProcess(ModifiedDispositionProcessor.java:78)
heritrix_1 | at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
heritrix_1 | at org.archive.modules.Processor.process(Processor.java:142)
heritrix_1 | at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
heritrix_1 | at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
This implies the code was compiled for Java 9 but is running under Java 8, yet it was compiled under Oracle JDK 8, targeting 1.8. Very odd.
See https://stackoverflow.com/questions/61267495/exception-in-thread-main-java-lang-nosuchmethoderror-java-nio-bytebuffer-flip and note that CharBuffer gained its own .flip() method when Java 9 came out.
I'm trying an upgrade of the maven-compiler-plugin, as this often helps resolve issues like this.
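The failure mode is worth spelling out: Java 9 gave CharBuffer a covariant flip() override returning CharBuffer, so code compiled against a JDK 9+ class library emits the call site CharBuffer.flip()Ljava/nio/CharBuffer;, which does not exist on a Java 8 runtime, even with -target 1.8. A defensive source-level workaround (sometimes seen in libraries that must run on both) is to call flip() through the Buffer supertype:

```java
import java.nio.Buffer;
import java.nio.CharBuffer;

// Demonstrates the portable call pattern: casting to Buffer forces
// the compiler to emit Buffer.flip()Ljava/nio/Buffer;, which exists
// on JDK 8 as well as JDK 9+.
class BufferFlipDemo {
    static String readBack(CharBuffer buf) {
        buf.put("abc");
        ((Buffer) buf).flip(); // portable across JDK 8 and 9+
        return buf.toString(); // contents between position and limit
    }
}
```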
Okay, PR #397 is available for testing.
I tracked down the .flip() issue: although my machine's command-line javac was version 8, Maven was using the version installed by Homebrew (JDK 15), and as such was referring to the later runtime libraries. I'll try to address this via the new release functionality.
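For reference, the release option (available in maven-compiler-plugin 3.6+, when building on JDK 9+) pins both the bytecode level and the platform API the code is linked against, which guards against exactly this class of error. A minimal sketch of the relevant pom.xml fragment (plugin version illustrative):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <version>3.8.1</version>
  <configuration>
    <!-- Unlike source/target, release also links against the
         Java 8 platform API, so a JDK 15 build cannot pick up
         CharBuffer.flip()'s Java 9+ return type. -->
    <release>8</release>
  </configuration>
</plugin>
```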
I'd be particularly pleased if anyone can confirm dropping the cache is the best way to fix the Digest authentication issue noted above: #245 (comment)
Closed by #397.