Comments (8)
Hm, well it seems each store gets its own `ObjectIdentityBdbManualCache`, and each of those has its own `AutoKryo` instance, so this seems reasonably safe. i.e. the code should be able to reload the checkpoint as long as the code that registers the classes for that part of the system has not been changed?
from heritrix3.
Note that Kryo operates a `ThreadLocal` cache that could perhaps have a bug in it? A long shot, but we're running a very old Kryo (1.01, not even the 1.05 I linked to above).
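For what it's worth, the way later Kryo versions are typically used safely from multiple threads is one instance per thread via a `ThreadLocal`. A minimal sketch of that pattern, using a trivial stand-in class rather than a real Kryo instance:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalSerializer {
    // Stand-in for a non-thread-safe serializer, such as an old Kryo instance.
    static class Serializer {
        static final AtomicInteger created = new AtomicInteger();
        final int id = created.incrementAndGet();
    }

    // One serializer per thread: no mutable state is shared between threads.
    static final ThreadLocal<Serializer> SERIALIZERS =
            ThreadLocal.withInitial(Serializer::new);

    public static void main(String[] args) throws InterruptedException {
        int[] ids = new int[2];
        Thread a = new Thread(() -> ids[0] = SERIALIZERS.get().id);
        Thread b = new Thread(() -> ids[1] = SERIALIZERS.get().id);
        a.start(); b.start(); a.join(); b.join();
        // Each thread saw a distinct instance.
        System.out.println(ids[0] != ids[1]); // prints true
    }
}
```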
FWIW, this appears to be Kryo @ 1.01 https://github.com/EsotericSoftware/kryo/blob/00902f65f222ebcd5bea197535f41cda1009b967/src/com/esotericsoftware/kryo/Kryo.java
Hm, okay, Kryo registration appears to be scoped per instance and deterministic. No idea why this is throwing errors. I'm going to run a full crawl, but without resuming from a checkpoint, and see if the issue turns up then.
Other observations:

- Appears to happen only when starting from a checkpoint (possibly re-starting from a previously-used checkpoint?).
- No problems for a few hours, then suddenly the same error over and over, a few times every second. The removal listener is only run once per item in the removal notification queue, so each error corresponds to a different entry. The cache in question has a maximum size of 10,000 and an expiry time of five minutes.
- The total number of queues in Heritrix was around 21,000 when the problem first arose.
- The problem arose shortly after a significant increase in the number of `exhausted` queues, most of which appeared to have been in the `inactive` state just prior to becoming `exhausted`. (Based on this, that means going from 'out of rotation' to 'empty' pretty much immediately.)
- Most of the exceptions are from the `processFinish` trace (see above), but some are from the `schedule` trace (see below). If this issue arises from a runtime race condition, it seems likely to be between these two.
- The `processFinish` trace is part of the end of the whole crawl process. The `schedule` trace occurs earlier in the `ToeThread`'s progression (during the disposition chain). The former operates on the `ToeThread`'s own `WorkQueue` whereas the latter is often operating on other queues.
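For context on the 'one error per entry' point above: the write-back happens in a removal listener, so every evicted or expired entry triggers exactly one serialisation attempt. A dependency-free sketch of that mechanism, using `LinkedHashMap`'s eviction hook as a stand-in for Guava's `CacheBuilder` removal listener (the names here are illustrative, not the real Heritrix code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WriteBackCacheSketch {
    public static void main(String[] args) {
        List<String> persisted = new ArrayList<>();
        int maxSize = 3; // the real cache uses 10,000 plus a 5-minute expiry

        // Evicting map: when capacity is exceeded, the eldest entry is
        // handed to the "removal listener" (here: a write into `persisted`,
        // standing in for the Kryo-serialised put into BDB).
        Map<String, String> cache = new LinkedHashMap<String, String>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() > maxSize) {
                    persisted.add(eldest.getKey()); // exactly one call per evicted entry
                    return true;
                }
                return false;
            }
        };

        for (int i = 0; i < 5; i++) cache.put("q" + i, "queue-state-" + i);
        System.out.println(persisted); // q0 and q1 were evicted, in order
    }
}
```

The key behaviour this models is that the persistence code runs once per evicted entry, which is why a burst of queue evictions produces a matching burst of serialisation errors.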
```
java.lang.IllegalArgumentException: Can not set org.archive.modules.fetcher.FetchStats field org.archive.crawler.frontier.WorkQueue.substats to java.lang.Byte
    at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
    at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
    at sun.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:81)
    at java.lang.reflect.Field.set(Field.java:764)
    at com.esotericsoftware.kryo.serialize.FieldSerializer$CachedField.set(FieldSerializer.java:290)
    at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:209)
    at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
    at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:512)
    at com.esotericsoftware.kryo.ObjectBuffer.readObjectData(ObjectBuffer.java:212)
    at org.archive.bdb.KryoBinding.entryToObject(KryoBinding.java:84)
    at com.sleepycat.collections.DataView.makeValue(DataView.java:595)
    at com.sleepycat.collections.DataCursor.getCurrentValue(DataCursor.java:349)
    at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:813)
    at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
    at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
    at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
    at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
    at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
    at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
    at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
    at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
    at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
    at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
    at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
    at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:409)
    at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleAlways(WorkQueueFrontier.java:332)
    at org.archive.crawler.frontier.AbstractFrontier.receive(AbstractFrontier.java:549)
    at org.archive.crawler.util.SetBasedUriUniqFilter.add(SetBasedUriUniqFilter.java:85)
    at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleIfUnique(WorkQueueFrontier.java:377)
    at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:355)
    at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:189)
    at org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
```
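The top frame of that trace is just reflection refusing a type mismatch: the stream has decoded to a `Byte` where a `FetchStats` was expected, and `Field.set` rejects it. A minimal reproduction of that specific failure, with hypothetical `Stats`/`Queue` classes standing in for `FetchStats`/`WorkQueue`:

```java
import java.lang.reflect.Field;

public class FieldTypeMismatch {
    static class Stats { }                 // stand-in for FetchStats
    static class Queue { Stats substats; } // stand-in for WorkQueue

    public static void main(String[] args) throws Exception {
        Field f = Queue.class.getDeclaredField("substats");
        f.setAccessible(true);
        try {
            // Simulates what FieldSerializer.CachedField.set attempts when
            // the byte stream decodes to the wrong type for the field.
            f.set(new Queue(), Byte.valueOf((byte) 23));
        } catch (IllegalArgumentException e) {
            System.out.println("IllegalArgumentException: " + e.getMessage());
        }
    }
}
```

So the exception itself tells us nothing about *why* the stream decoded wrongly; it only confirms that the decoded value and the field's declared type disagreed.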
So, to my frustration, I think I'll just have to leave this. To summarise:

- The `BdbWorkQueue` instances appear to have been loaded from disk once, but when the time comes to re-write them, the exact same Kryo code appears to behave differently the second time it runs.
- The very specific nature of the repeating error is difficult to understand. The sudden but specific mismatch of serialisation code seems to imply the Kryo registration map or related caches are getting damaged somehow.
- It is very difficult to see how corruption of the BDB data store itself would lead to such a 'neat' error.
- It is perhaps notable that `FetchStats` does not have a default constructor, and so serialisation relies on the magic done in `AutoKryo` to invoke the constructor via reflection. This code appears to rely on `ReflectionFactory` being thread-safe, but there's no indication it should not be. Even if this was a problem, it is hard to see how it could work on the first load and then become a problem on the second.
- Note also that the internal Kryo class registration map was a `HashMap` in version 1.01 but a `ConcurrentHashMap` by version 1.05. That change suggests registration in 1.01 is not thread-safe, but there does not appear to be any multi-threaded registration going on AFAICT (each thread has separate registration on startup, I think?!). Even if there is, it's not clear how the de-serialisation could appear to work once and then fail the second time.
- The `sync()` code underlying the `ObjectIdentityBdbManualCache` does not appear to be remotely thread-safe: while the `sync()` method itself is `synchronized`, nothing prevents the in-memory cache being modified via `getOrUse` while the `sync()` is occurring. But, yet again, it's difficult to imagine how this would lead to the errors observed here.
- Also, some notes from Adam via Slack: "I think I've seen issues arise after resuming from the same checkpoint multiple times." "It's been a while since I've run into it, but I'm pretty sure it was a bdb deserialization thing. It wasn't something that happened consistently. I think part of the issue for reproducing it was that the issues didn't always happen immediately."
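To make the registration point concrete: Kryo 1.x assigns class IDs sequentially at registration time, so reader and writer only agree if they register the same classes in the same order. A toy registry (not the real Kryo API) showing how a missing registration yields exactly an 'unregistered class ID' on read:

```java
import java.util.HashMap;
import java.util.Map;

public class RegistrationSketch {
    // Toy stand-in for Kryo's internal id<->class registration maps.
    static class Registry {
        private final Map<Class<?>, Integer> classToId = new HashMap<>();
        private final Map<Integer, Class<?>> idToClass = new HashMap<>();
        private int nextId = 1;

        void register(Class<?> c) { // sequential IDs, as in Kryo 1.x
            classToId.put(c, nextId);
            idToClass.put(nextId, c);
            nextId++;
        }
        int idOf(Class<?> c) { return classToId.get(c); }
        Class<?> classOf(int id) {
            Class<?> c = idToClass.get(id);
            if (c == null) throw new IllegalStateException(
                    "Encountered unregistered class ID: " + id);
            return c;
        }
    }

    public static void main(String[] args) {
        Registry writer = new Registry();
        writer.register(String.class);
        writer.register(Integer.class); // assigned ID 2

        Registry reader = new Registry();
        reader.register(String.class);  // reader never registers Integer

        int idInStream = writer.idOf(Integer.class); // ID 2 goes into the stream
        try {
            reader.classOf(idInStream);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The puzzle in this issue is that no such mismatch should be possible: registration is per-instance, deterministic, and done once at startup, yet the symptoms look exactly like this sketch.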
I will perhaps return to this if/when we look at updating BDB and Kryo to something other than deeply-ancient versions.
Just observed in a crawl (admittedly one troubled by #245 issues):
```
WARNING: com.google.common.cache.LocalCache processPendingNotifications Exception thrown by removal listener [Wed Mar 27 01:24:27 GMT 2019]
java.util.ConcurrentModificationException
    at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211)
    at java.util.TreeMap$EntryIterator.next(TreeMap.java:1247)
    at java.util.TreeMap$EntryIterator.next(TreeMap.java:1242)
    at com.esotericsoftware.kryo.serialize.MapSerializer.writeObjectData(MapSerializer.java:90)
    at com.esotericsoftware.kryo.serialize.FieldSerializer.writeObjectData(FieldSerializer.java:161)
    at com.esotericsoftware.kryo.serialize.FieldSerializer.writeObjectData(FieldSerializer.java:161)
    at com.esotericsoftware.kryo.Kryo.writeObjectData(Kryo.java:453)
    at com.esotericsoftware.kryo.ObjectBuffer.writeObjectData(ObjectBuffer.java:262)
    at org.archive.bdb.KryoBinding.objectToEntry(KryoBinding.java:79)
    at com.sleepycat.collections.DataView.useValue(DataView.java:549)
    at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:817)
    at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
    at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
    at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
    at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
    at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
    at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
    at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
    at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
    at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
    at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
    at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
    at org.archive.crawler.frontier.AbstractFrontier.tally(AbstractFrontier.java:636)
    at org.archive.crawler.frontier.AbstractFrontier.doJournalAdded(AbstractFrontier.java:648)
    at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:408)
    at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleAlways(WorkQueueFrontier.java:332)
    at org.archive.crawler.frontier.AbstractFrontier.receive(AbstractFrontier.java:549)
    at org.archive.crawler.util.SetBasedUriUniqFilter.add(SetBasedUriUniqFilter.java:85)
    at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleIfUnique(WorkQueueFrontier.java:377)
    at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:355)
    at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:189)
    at org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
```
This was the first error observed. It repeated at `Wed Mar 27 03:03:40 GMT 2019`. Then, somewhat later:
```
WARNING: com.google.common.cache.LocalCache processPendingNotifications Exception thrown by removal listener [Wed Mar 27 07:30:12 GMT 2019]
com.esotericsoftware.kryo.SerializationException: Unable to deserialize object of type: org.archive.crawler.frontier.BdbWorkQueue
    at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:514)
    at com.esotericsoftware.kryo.ObjectBuffer.readObjectData(ObjectBuffer.java:212)
    at org.archive.bdb.KryoBinding.entryToObject(KryoBinding.java:84)
    at com.sleepycat.collections.DataView.makeValue(DataView.java:595)
    at com.sleepycat.collections.DataCursor.getCurrentValue(DataCursor.java:349)
    at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:813)
    at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
    at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
    at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
    at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
    at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
    at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
    at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
    at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
    at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
    at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
    at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
    at org.archive.crawler.frontier.AbstractFrontier.tally(AbstractFrontier.java:636)
    at org.archive.crawler.frontier.AbstractFrontier.doJournalFinishedFailure(AbstractFrontier.java:670)
    at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:1008)
    at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:569)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
Caused by: com.esotericsoftware.kryo.SerializationException: Encountered unregistered class ID: 23
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:381)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:469)
    at com.esotericsoftware.kryo.serialize.MapSerializer.readObjectData(MapSerializer.java:129)
    at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:198)
    at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
    at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:198)
    at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
    at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:512)
    ... 21 more
```
which then repeats a LOT.
At the time of the first error there were nearly 25K queues, although nearly 20K of them were exhausted. Note that this crawl did NOT resume from a checkpoint (although there were checkpoint files lying around, if that matters?).
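The `ConcurrentModificationException` in the first trace is what `TreeMap`'s fail-fast iterator throws when the map is structurally modified mid-iteration, which is presumably what happens if another thread mutates a `WorkQueue`'s map field while `MapSerializer` is walking it. The same exception can be reproduced deterministically in a single thread (the key names below are purely illustrative):

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.TreeMap;

public class TreeMapCme {
    public static void main(String[] args) {
        Map<String, Integer> substats = new TreeMap<>();
        substats.put("novel", 1);
        substats.put("dupByHash", 2);

        try {
            // MapSerializer.writeObjectData iterates entrySet() like this;
            // a put() of a new key from another thread during the walk has
            // the same effect as this in-loop modification.
            for (Map.Entry<String, Integer> e : substats.entrySet()) {
                substats.put("success", 3); // structural modification
            }
        } catch (ConcurrentModificationException e) {
            System.out.println("ConcurrentModificationException");
        }
    }
}
```

That this exception shows up on the same `makeDirty` write-back path as the deserialization errors is consistent with unsynchronised concurrent access to the queue objects being serialised.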
I believe this issue was to do with problems in managing checkpoints, and should have been resolved by #281.