Giter Club home page Giter Club logo

Comments (8)

anjackson avatar anjackson commented on July 24, 2024

Hm, well it seems each store gets it's own ObjectIdentityBdbManualCache and each of those has it's own AutoKryo instance so this seems reasonably safe. i.e. the code should be able to reload the checkpoint if the code that registers the classes for that part of the system has not been changed?

from heritrix3.

anjackson avatar anjackson commented on July 24, 2024

Note that Kryo operates a ThreadLocal cache that could perhaps have a bug in it? Long shot but we're running a very old Kryo (1.01 - not even the 1.05 I linked to above).

from heritrix3.

anjackson avatar anjackson commented on July 24, 2024

FWIW, this appears to be Kryo @ 1.01 https://github.com/EsotericSoftware/kryo/blob/00902f65f222ebcd5bea197535f41cda1009b967/src/com/esotericsoftware/kryo/Kryo.java

from heritrix3.

anjackson avatar anjackson commented on July 24, 2024

Hm okay Kryo registration appears to be scoped per instance and deterministic. No idea why this is throwing errors.

Iā€™m going to run a full crawl but without resuming from a checkpoint and see if the issue turns up then.

from heritrix3.

anjackson avatar anjackson commented on July 24, 2024

Other observations:

  • Appears to happen only when starting from a checkpoint (possibly re-starting from a previously-used checkpoint?)
  • No problems for a few hours, then suddenly the same error over and over a few times every second. The removal listener is only run once per item in the removal notification queue, so each error corresponds to a different entry. The cache in question has a maximum size of 10,000 and expiry time of five minutes.
  • The total number of queues in Heritrix was around 21,000 at when the problem first arose.
  • The problem arose shortly after a significant increase in the number of exhausted queues most of which appeared to have been inactive state just prior to becoming exhausted. (Based on this that means going from 'out of rotation' to 'empty' pretty much immediately).
  • Most of the exceptions are from the processFinish trace (see above), but some are from the schedule trace (see below). If this issue arises from a runtime race-condition it seems likely to be between these two.
  • The 'processFinishedtrace is part of the end of the whole crawl process. Thescheduletrace occurs earlier in theToeThread's progression (during the disposition chain). The former operates on the ToeThread's own WorkQueue` whereas the latter is often operating on other queues.
java.lang.IllegalArgumentException: Can not set org.archive.modules.fetcher.FetchStats field org.archive.crawler.frontier.WorkQueue.substats to java.lang.Byte
        at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
        at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
        at sun.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:81)
        at java.lang.reflect.Field.set(Field.java:764)
        at com.esotericsoftware.kryo.serialize.FieldSerializer$CachedField.set(FieldSerializer.java:290)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:209)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
        at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:512)
        at com.esotericsoftware.kryo.ObjectBuffer.readObjectData(ObjectBuffer.java:212)
        at org.archive.bdb.KryoBinding.entryToObject(KryoBinding.java:84)
        at com.sleepycat.collections.DataView.makeValue(DataView.java:595)
        at com.sleepycat.collections.DataCursor.getCurrentValue(DataCursor.java:349)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:813)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
        at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:409)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleAlways(WorkQueueFrontier.java:332)
        at org.archive.crawler.frontier.AbstractFrontier.receive(AbstractFrontier.java:549)
        at org.archive.crawler.util.SetBasedUriUniqFilter.add(SetBasedUriUniqFilter.java:85)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleIfUnique(WorkQueueFrontier.java:377)
        at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:355)
        at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:189)
        at org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

from heritrix3.

anjackson avatar anjackson commented on July 24, 2024

So, to my frustration, I think I'll just have to leave this. To summarise:

  • The BdbWorkQueue instances appear to have been loaded from disk once, but when the time comes to re-write them, the exact-same Kryo code appears to behave differently the second time it runs.
  • The very specific nature of the repeating error is difficult to understand. The sudden but specific mismatch of serialisation code seems to imply the Kryo registration map or related caches are getting damaged somehow.
  • It is very difficult to see how corruption of the BDB data store itself would lead to such a 'neat' error.
  • It is perhaps notable that FetchStats does not have a default constructor, and so serialisation relies on the magic done in AutoKryo to invoke the constructor via reflection. This code appears to rely on ReflectionFactory being thread-safe, but there's no indication it should not be. Even if this was a problem, it is hard to see how this could work on the first load and then become a problem on the second.
  • Note also that the internal Kryo class registration map was a HashMap in version 1.01 but a ConcurrentHashMap by version 1.05. The underlying issue indicates that 1.01 registration is not thread-safe, but there does not appear to be any multi-threaded registration going on AFAICT (each thread has separate registration on startup I think?!). Even if there is, it's not clear how the de-serialisation could appear to work once and then fail the second time.
  • The sync() code underlying the ObjectIdentityBdbManualCache does not appear to be remotely thread-safe in that, while the sync() method itself is synchronized, nothing prevents the in-memory cache being modified via getOrUse while the sync() is occurring. But, yet again, it's difficult to imagine how this would lead to the errors observed here.
  • Also, some notes from Adam via Slack: "I think I've seen issues arise after resuming from the same checkpoint multiple times." "It's been a while since I've run into it, but I'm pretty sure it was a bdb deserialization thing. It wasn't something that happened consistently. I think part of the issue for reproducing it was that the issues didn't always happen immediately."

I will perhaps return to this if/when we look at updating BDB and Kryo to something other than deeply-ancient versions.

from heritrix3.

anjackson avatar anjackson commented on July 24, 2024

Just observed in an (admittedly troubled by #245 issues) crawl:

WARNING: com.google.common.cache.LocalCache processPendingNotifications Exception thrown by removal listener [Wed Mar 27 01:24:27 GMT 2019]
java.util.ConcurrentModificationException
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1247)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1242)
        at com.esotericsoftware.kryo.serialize.MapSerializer.writeObjectData(MapSerializer.java:90)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.writeObjectData(FieldSerializer.java:161)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.writeObjectData(FieldSerializer.java:161)
        at com.esotericsoftware.kryo.Kryo.writeObjectData(Kryo.java:453)
        at com.esotericsoftware.kryo.ObjectBuffer.writeObjectData(ObjectBuffer.java:262)
        at org.archive.bdb.KryoBinding.objectToEntry(KryoBinding.java:79)
        at com.sleepycat.collections.DataView.useValue(DataView.java:549)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:817)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
        at org.archive.crawler.frontier.AbstractFrontier.tally(AbstractFrontier.java:636)
        at org.archive.crawler.frontier.AbstractFrontier.doJournalAdded(AbstractFrontier.java:648)
        at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:408)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleAlways(WorkQueueFrontier.java:332)
        at org.archive.crawler.frontier.AbstractFrontier.receive(AbstractFrontier.java:549)
        at org.archive.crawler.util.SetBasedUriUniqFilter.add(SetBasedUriUniqFilter.java:85)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleIfUnique(WorkQueueFrontier.java:377)
        at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:355)
        at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:189)
        at org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

is the first error observed. It repeated at Wed Mar 27 03:03:40 GMT 2019. Then, somewhat later:

WARNING: com.google.common.cache.LocalCache processPendingNotifications Exception thrown by removal listener [Wed Mar 27 07:30:12 GMT 2019]
com.esotericsoftware.kryo.SerializationException: Unable to deserialize object of type: org.archive.crawler.frontier.BdbWorkQueue
        at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:514)
        at com.esotericsoftware.kryo.ObjectBuffer.readObjectData(ObjectBuffer.java:212)
        at org.archive.bdb.KryoBinding.entryToObject(KryoBinding.java:84)
        at com.sleepycat.collections.DataView.makeValue(DataView.java:595)
        at com.sleepycat.collections.DataCursor.getCurrentValue(DataCursor.java:349)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:813)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
        at org.archive.crawler.frontier.AbstractFrontier.tally(AbstractFrontier.java:636)
        at org.archive.crawler.frontier.AbstractFrontier.doJournalFinishedFailure(AbstractFrontier.java:670)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:1008)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:569)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
Caused by: com.esotericsoftware.kryo.SerializationException: Encountered unregistered class ID: 23
        at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:381)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:469)
        at com.esotericsoftware.kryo.serialize.MapSerializer.readObjectData(MapSerializer.java:129)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:198)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:198)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
        at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:512)
        ... 21 more

which then repeats a LOT.

At the time of the first error there were nearly 25K queues although nearly 20K of them were exhausted. Note that this crawl did NOT resume from a checkpoint (although there were checkpoint files lying around, if that matters?)

from heritrix3.

anjackson avatar anjackson commented on July 24, 2024

I believe this issue was to do with problems in managing checkpoints, and should have been resolved by #281.

from heritrix3.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    šŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. šŸ“ŠšŸ“ˆšŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ā¤ļø Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.