Giter Club home page Giter Club logo

Comments (4)

arne-bdt avatar arne-bdt commented on June 7, 2024

My work on an alternative to GraphMem is progressing slowly.
So I decided to contribute a few improvements to the existing GraphMem on the way.
Please review my PR.

from jena.

arne-bdt avatar arne-bdt commented on June 7, 2024

There's a straightforward way to improve the speed of GraphMem for Graph#add by nearly 50% and Graph#delete by 10-35%. This change has almost no impact on any other operation. However, it uses an additional array for the hashCodes in HashCommon. As a result, the memory consumption of the indexing structures (excluding the space for the triples) would increase by 31-51%.

I propose not to implement this for the deprecated GraphMem, but instead to create a successor GraphMem2 with the following changes:

  • Speed up Graph#add and Graph#delete accepting the given memory cost
  • Remove support for Iterator#remove to reduce complexity
    (other implementations like DatasetGraphInMemory don't support this either)
  • Switch to term-equality as proposed in issue#1867 mostly by:
    • Removing calls to org.apache.jena.graph.Node#sameValueAs
    • Replacing usages of org.apache.jena.graph.Node#getIndexingValue with the Node itself

What do you say?

from jena.

afs avatar afs commented on June 7, 2024

Yes - term-quality.

A GraphMem2 sounds like a good plan; just the switch to term-equality means we need to migrate somehow. Jena5 is likely later this year.

No support Iterator#remove is generally a good idea. It is at best an "optional extra" not a core part of iterators IMO.

The extra space usage - this I am in two minds about. Faster is better but so is having an storage that optimizes for space. It doesn't sound if it is so fundamental as to be irreversible.

from jena.

arne-bdt avatar arne-bdt commented on June 7, 2024

Thanks, that's the information I needed to proceed.

Adding and removing the array with the hashes is straightforward. It might also be quite simple to have two variants of the graph store. That way, users can choose between the fast variant or the low memory one.

Speaking of which:
There are two fields tracking changes for potential concurrent modifications:

  1. org.apache.jena.mem.ArrayBunch#changes
  2. org.apache.jena.mem.HashCommon#changes

The one in ArrayBunch is volatile, but the one in HashCommon is not. ArrayBunch is totally fine with integer overflow, since it checks for equality when verifying:
if (changes != initialChanges) throw new ConcurrentModificationException();

HashCommon, however, wouldn't handle an integer overflow correctly:
if (changes > initialChanges) throw new ConcurrentModificationException();

The document header of java.util.HashMap states:
` *

Note that the fail-fast behavior of an iterator cannot be guaranteed

  • as it is, generally speaking, impossible to make any hard guarantees in the
  • presence of unsynchronized concurrent modification. Fail-fast iterators
  • throw {@code ConcurrentModificationException} on a best-effort basis.
  • Therefore, it would be wrong to write a program that depended on this
  • exception for its correctness: the fail-fast behavior of iterators
  • should be used only to detect bugs.`

Tracking the changes here might be the safest possible way. However, HashCommon should then also make changes volatile and compare for equality rather than using a greater than comparison.

On the other hand, if we removed both fields and checked org.apache.jena.mem.HashCommon#size and org.apache.jena.mem.ArrayBunch#size instead, we would save memory (since there are many instances of both classes) and avoid incrementing a second variable on #add and #remove operations. The iterators should still help detect concurrent modifications in almost any scenario. Only if there are an equal number of #add and #remove calls, so that the total size of the graph is not affected, could there be undetected concurrent modifications while executing an iterator. This seems unlikely enough to me that it would prevent anyone from detecting a concurrency bug.
What is your opinion in this?

from jena.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.