ParallelGit

A high-performance, in-memory Java 7 NIO filesystem for Git.

Quick start

Maven:

<dependency>
  <groupId>com.beijunyi</groupId>
  <artifactId>parallelgit-filesystem</artifactId>
  <version>2.0.0</version>
</dependency>

Gradle:

implementation 'com.beijunyi:parallelgit-filesystem:2.0.0'

Basic usage

Read - Copy a file from repository to hard drive:

public void readFile() throws IOException {
  try(GitFileSystem gfs = Gfs.newFileSystem("my_branch", "/project/repository")) {
    Path source = gfs.getPath("/settings.xml"); // repo
    Path target = Paths.get("/app/config/settings.xml"); // hard drive
    Files.copy(source, target);
  }
}

Write - Copy a file to repository and commit:

public void writeFile() throws IOException {
  try(GitFileSystem gfs = Gfs.newFileSystem("my_branch", "/project/repository")) {
    Path source = Paths.get("/app/config/settings.xml"); // hard drive
    Path target = gfs.getPath("/settings.xml"); // repo
    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING); // overwrite the file if it already exists in the branch
    Gfs.commit(gfs).message("Update settings").execute();
  }
}

Project purpose explained

Git is a unique type of data store. Its special data structure offers useful features such as:

  • Keeping history snapshots at a very low cost
  • Automatic duplication detection
  • Remote backup
  • Merging and conflict resolution

Git is well known and widely used as a VCS, yet few software applications use Git as an internal data store. One reason is the lack of a high-level API that allows efficient communication between an application and Git.

Consider the workflow in software development. The standard steps to make changes in Git are:

Checkout a branch ⇒ Write files ⇒ Add files to index ⇒ Commit

While this model works well enough for (human) developers, it does not fit the architecture of a server-side application. The reasons are:

  • Only one branch can be checked out at a time
  • Checking out a branch has a heavy I/O overhead, as files need to be deleted and re-created on the hard drive
  • Every context switch needs a checkout

There are ways around these problems, but they usually involve creating blobs and trees manually, which is verbose and error-prone.

ParallelGit is a layer between application logic and Git. It abstracts away Git's low-level object manipulation details and provides a friendly interface that extends the Java 7 NIO filesystem API. The filesystem itself operates in memory, with data pulled from the hard drive on demand.

With ParallelGit an application can control a Git repository as if it were a normal filesystem. Any branch or commit can be checked out instantly at minimal resource cost. Multiple filesystem instances can be hosted simultaneously without interfering with each other.

I/O & performance explained

As with any data store, a single request is usually very small compared to the size of the store. It would be overkill to select an entire table from a SQL database when only one row is requested. Similarly, checking out all files in a branch is usually unnecessary for common tasks.

ParallelGit adopts a lazy loading strategy to minimise I/O and other resource usage. For inputs, directories and file contents are only loaded when the task demands them. For outputs, new blobs and trees are only created when a commit is created.

Read requests

Imagine a branch with the file tree below in its HEAD commit. The task is to read the 3 .java files from this branch.

 /
 ├──app-core
 │   └──src
 │       ├──main
 │       │   ├──MyFactory.java *(to read)
 │       │   └──MyProduct.java *(to read)
 │       └──test
 │           └──ProductionTest.java *(to read)
 └──app-web
     ├──index.jsp
     └──style.css

Directories and files are stored as tree and blob objects in Git. Every tree object holds references to its child nodes.

When the branch is checked out, its HEAD commit is parsed and stored in memory. A commit object holds a reference to the tree object that corresponds to its root directory.

To read the file /app-core/src/main/MyFactory.java, ParallelGit needs to resolve its parent directories recursively, i.e.:

1) /
2) /app-core
3) /app-core/src
4) /app-core/src/main

After the last tree object is resolved, ParallelGit finds the blob object of MyFactory.java, which can then be parsed and converted into a byte[] or String depending on the task requirements.

The second file, /app-core/src/main/MyProduct.java, lives in the same directory. As the required tree objects for this request are already available in memory, ParallelGit simply finds the blob reference from its parent and retrieves the file data.

The last file, /app-core/src/test/ProductionTest.java, shares a common ancestor, /app-core/src, with the previous two files. From this subtree ParallelGit resolves its other child, /app-core/src/test, which leads to the blob of ProductionTest.java.
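
The read walk-through above can be mimicked with a small, self-contained sketch. It is a toy model and not ParallelGit's implementation: the class LazyTreeDemo, the string object ids ("root", "core", "b1", ...) and the treeLoads counter are all made up for illustration. It only shows that each tree is loaded at most once and that /app-web is never touched.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Toy model of lazy tree loading; this is not ParallelGit's actual implementation. */
class LazyTreeDemo {

  // Hypothetical object store: tree id -> (child name -> child id), blob id -> content.
  final Map<String, Map<String, String>> trees = new HashMap<>();
  final Map<String, byte[]> blobs = new HashMap<>();

  // Trees that have been "loaded from disk" so far, keyed by object id.
  final Map<String, Map<String, String>> loaded = new HashMap<>();
  int treeLoads = 0;

  void putTree(String id, String... nameIdPairs) {
    Map<String, String> children = new HashMap<>();
    for (int i = 0; i < nameIdPairs.length; i += 2) {
      children.put(nameIdPairs[i], nameIdPairs[i + 1]);
    }
    trees.put(id, children);
  }

  void putBlob(String id, String content) {
    blobs.put(id, content.getBytes(StandardCharsets.UTF_8));
  }

  /** Returns the tree with the given id, loading it only on first use. */
  private Map<String, String> loadTree(String id) {
    Map<String, String> tree = loaded.get(id);
    if (tree == null) {
      tree = trees.get(id);
      if (tree == null) throw new IllegalArgumentException("Not a tree: " + id);
      loaded.put(id, tree);
      treeLoads++;
    }
    return tree;
  }

  /** Resolves the path one directory at a time, starting from the root tree. */
  byte[] read(String rootId, String path) {
    String id = rootId;
    for (String name : path.substring(1).split("/")) {
      id = loadTree(id).get(name);
      if (id == null) throw new IllegalArgumentException("No such entry: " + path);
    }
    byte[] data = blobs.get(id);
    if (data == null) throw new IllegalArgumentException("Not a file: " + path);
    return data;
  }

  /** Builds the example file tree from the text above. */
  static LazyTreeDemo sample() {
    LazyTreeDemo d = new LazyTreeDemo();
    d.putTree("root", "app-core", "core", "app-web", "web");
    d.putTree("core", "src", "src");
    d.putTree("src", "main", "main", "test", "test");
    d.putTree("main", "MyFactory.java", "b1", "MyProduct.java", "b2");
    d.putTree("test", "ProductionTest.java", "b3");
    d.putTree("web", "index.jsp", "b4", "style.css", "b5");
    d.putBlob("b1", "class MyFactory {}");
    d.putBlob("b2", "class MyProduct {}");
    d.putBlob("b3", "class ProductionTest {}");
    d.putBlob("b4", "<html/>");
    d.putBlob("b5", "body {}");
    return d;
  }

  public static void main(String[] args) {
    LazyTreeDemo d = sample();
    d.read("root", "/app-core/src/main/MyFactory.java");
    System.out.println(d.treeLoads); // 4: /, /app-core, /app-core/src, /app-core/src/main
    d.read("root", "/app-core/src/main/MyProduct.java");
    System.out.println(d.treeLoads); // 4: every required tree is already in memory
    d.read("root", "/app-core/src/test/ProductionTest.java");
    System.out.println(d.treeLoads); // 5: only /app-core/src/test is new
    System.out.println(d.loaded.containsKey("web")); // false: /app-web was never needed
  }
}
```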

Write requests

In the same branch, assume there is a follow-up task to change MyFactory.java.

 /
 ├──app-core
 │   └──src
 │       ├──main
 │       │   ├──MyFactory.java *(to update)
 │       │   └──MyProduct.java
 │       └──test
 │           └──ProductionTest.java
 └──app-web
     ├──index.jsp
     └──style.css

Because every object reference in Git is the hash of the object's content, changing a file's content changes its hash, and therefore the hashes of all its parent directories.

All changes are staged in memory before being committed to the repository. Nothing is written to the hard drive while MyFactory.java is being updated.

When Gfs.commit(...).execute() is called, ParallelGit creates a blob object for the new file content. To make this blob reachable, ParallelGit creates the tree objects for its updated parent directories, i.e.:

1) /app-core/src/main
2) /app-core/src
3) /app-core
4) /

After the root tree object is created, ParallelGit creates a new commit and makes it the HEAD of the branch.
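
The way a single file change ripples up to the root can be illustrated with a small sketch. This is a toy model under stated assumptions: the HashPropagationDemo class and its "blob:"/"tree:" encoding are invented for the example and do not match Git's real object format. Only the principle is the same, namely that a directory's hash is derived from the hashes of its children.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of hash propagation; the object encoding is simplified and is not Git's. */
class HashPropagationDemo {

  /** A file has content and no children; a directory has children and no content. */
  static final class Node {
    final String content;
    final TreeMap<String, Node> children;
    Node(String content) { this.content = content; this.children = null; }
    Node() { this.content = null; this.children = new TreeMap<>(); }
  }

  static String sha1(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) hex.append(String.format("%02x", b));
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Creates or replaces the file at the given absolute path, creating directories as needed. */
  static void put(Node root, String path, String content) {
    String[] names = path.substring(1).split("/");
    Node dir = root;
    for (int i = 0; i < names.length - 1; i++) {
      dir = dir.children.computeIfAbsent(names[i], n -> new Node());
    }
    dir.children.put(names[names.length - 1], new Node(content));
  }

  /** Hashes the node bottom-up and records the hash of every path in 'out'. */
  static String hash(String path, Node node, Map<String, String> out) {
    String h;
    if (node.children == null) {
      h = sha1("blob:" + node.content);
    } else {
      StringBuilder entries = new StringBuilder("tree:");
      for (Map.Entry<String, Node> e : node.children.entrySet()) {
        String childPath = path.equals("/") ? "/" + e.getKey() : path + "/" + e.getKey();
        entries.append(e.getKey()).append('=').append(hash(childPath, e.getValue(), out)).append(';');
      }
      h = sha1(entries.toString());
    }
    out.put(path, h);
    return h;
  }

  static Map<String, String> hashAll(Node root) {
    Map<String, String> out = new TreeMap<>();
    hash("/", root, out);
    return out;
  }

  /** Returns the paths whose hash differs between the two snapshots. */
  static Set<String> changedPaths(Map<String, String> before, Map<String, String> after) {
    Set<String> changed = new TreeSet<>();
    for (Map.Entry<String, String> e : after.entrySet()) {
      if (!e.getValue().equals(before.get(e.getKey()))) changed.add(e.getKey());
    }
    return changed;
  }

  /** Builds the example tree, updates MyFactory.java and reports what changed. */
  static Set<String> demo() {
    Node root = new Node();
    put(root, "/app-core/src/main/MyFactory.java", "factory v1");
    put(root, "/app-core/src/main/MyProduct.java", "product");
    put(root, "/app-core/src/test/ProductionTest.java", "test");
    put(root, "/app-web/index.jsp", "index");
    put(root, "/app-web/style.css", "style");
    Map<String, String> before = hashAll(root);
    put(root, "/app-core/src/main/MyFactory.java", "factory v2");
    return changedPaths(before, hashAll(root));
  }

  public static void main(String[] args) {
    // Prints the updated file and its four parent directories; nothing else changes.
    System.out.println(demo());
  }
}
```

In this model the hashes of /app-core/src/test and /app-web stay the same, which is why their existing tree objects can be reused by the new commit.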

Complexity

The key performance property is that the resource usage per task is linear in the size of the task's scope. The size of the repository has no impact on an individual task's runtime or memory footprint.

Advanced features

Merge

public void mergeFeatureBranch() throws IOException {
  try(GitFileSystem gfs = Gfs.newFileSystem("master", "/project/repository")) {
    GfsMerge.Result result = Gfs.merge(gfs).source("feature_branch").execute();
    assert result.isSuccessful();
  }
}

Conflict resolution

// a magical method that can resolve any conflicts
public abstract void resolveConflicts(GitFileSystem gfs, Map<String, MergeConflict> conflicts);

public void mergeFeatureBranch() throws IOException {
  try(GitFileSystem gfs = Gfs.newFileSystem("master", "/project/repository")) {
    GfsMerge.Result result = Gfs.merge(gfs).source("feature_branch").execute();
    assert result.getStatus() == GfsMerge.Status.CONFLICTING;
      
    resolveConflicts(gfs, result.getConflicts());
    Gfs.commit(gfs).execute();
  }
}

Create stash

// a magical method that does very interesting work
public abstract void doSomeWork(GitFileSystem gfs);

public void stashIncompleteWork() throws IOException {
  try(GitFileSystem gfs = Gfs.newFileSystem("master", "/project/repository")) {
    doSomeWork(gfs);
    Gfs.createStash(gfs).execute();
  }
}

Apply stash

// a magical method that does some more interesting work
public abstract void doSomeMoreWork(GitFileSystem gfs);

public void continuePreviousWork() throws IOException {
  try(GitFileSystem gfs = Gfs.newFileSystem("master", "/project/repository")) {
    Gfs.applyStash(gfs)
       .stash(0)  // (optional) to specify the index of the stash to apply 
       .execute();
    doSomeMoreWork(gfs);
  }
}

Reset

// a magical method that does good work at the second time
public abstract void doSomeWork(GitFileSystem gfs);

public void doSomeGoodWork() throws IOException {
  try(GitFileSystem gfs = Gfs.newFileSystem("master", "/project/repository")) {
    doSomeWork(gfs);
    Gfs.reset(gfs).execute();
    doSomeWork(gfs);
  }
}

Handy Utils

The package com.beijunyi.parallelgit.utils contains a collection of utility classes for common Git tasks.

  1. BlobUtils - Blob insertion, byte array retrieval
  2. BranchUtils - Branch creation, branch HEAD reference update
  3. CacheUtils - Index cache manipulation
  4. CommitUtils - Commit creation, commit history retrieval
  5. GitFileUtils - Shortcuts for read-only file access
  6. RefUtils - Ref name normalisation, Ref-log retrieval
  7. RepositoryUtils - Repository creation, repository settings
  8. StashUtils - Stash manipulation
  9. TagUtils - Tag manipulation
  10. TreeUtils - Tree insertion, tree/subtree retrieval

License

This project is licensed under the Apache License, Version 2.0.

Contributors

pimpcapital, novalis, gaborhorvath

parallelgit's Issues

Is this project still maintained?

I have been implementing a read-only access to git as a Java-7 nio FS, only to discover recently that your project already does it, and seems more complete as it also implements write access.

It seems like the source code is of good quality, and with lots of unit tests, so I’d be most interested in using it in my project. It only lacks some docs, as apparently there is no Javadoc at all in the source code (but the code seems clear enough to be understandable).

This repository has got 110 stars and 22 forks, which does suggest that the project raises some interest. It is used by two artifacts published on Maven Central: The Modern Way Server Core, which seems abandoned (last updated June 2018), and browserbox-maven-plugin, updated a few days ago and still using ParallelGit, AFAICS.

However, this project has not been updated in over three years. Is it abandoned? Is there some hope of seeing a release 2.1? I see that some commits have been added since 2.0.0. Also, some of the forks visible on GitHub are one commit ahead of the base, which suggests that some patches are available and could be integrated. (Please comment here, anyone who has forked this project and would be willing to contribute back.)

In summary, any information about the current status of this project would be most welcome!

Circular loading of installed providers detected

final URI uri = URI.create("GFS:http://....../example1.git");
final FileSystem fs1 = FileSystems.getFileSystem(uri);

I would expect this to return null if newFileSystem has not been called, but I get:

java.util.ServiceConfigurationError: java.nio.file.spi.FileSystemProvider: Provider com.beijunyi.parallelgit.filesystem.GitFileSystemProvider could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at java.nio.file.spi.FileSystemProvider.loadInstalledProviders(FileSystemProvider.java:119)
at java.nio.file.spi.FileSystemProvider.access$000(FileSystemProvider.java:77)
at java.nio.file.spi.FileSystemProvider$1.run(FileSystemProvider.java:169)
at java.nio.file.spi.FileSystemProvider$1.run(FileSystemProvider.java:166)
at java.security.AccessController.doPrivileged(Native Method)
at java.nio.file.spi.FileSystemProvider.installedProviders(FileSystemProvider.java:166)
at java.nio.file.FileSystems.getFileSystem(FileSystems.java:219)
at ....

Caused by: java.lang.Error: Circular loading of installed providers detected
at java.nio.file.spi.FileSystemProvider.installedProviders(FileSystemProvider.java:161)
at com.beijunyi.parallelgit.filesystem.GitFileSystemProvider.getInstalledProvider(GitFileSystemProvider.java:219)
at com.beijunyi.parallelgit.filesystem.GitFileSystemProvider.<init>(GitFileSystemProvider.java:30)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 34 more
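
The trace above shows the provider's constructor calling back into FileSystemProvider.installedProviders() while the providers are still being loaded. A minimal sketch of that re-entrancy pattern follows. It is a toy model, not the JDK's or ParallelGit's code: Registry, EagerProvider and PlainProvider are hypothetical names, and an IllegalStateException stands in for the java.lang.Error seen in the trace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Toy model of re-entrant provider loading; all names here are hypothetical. */
class CircularLoadingDemo {

  /** Loads providers once and guards against being re-entered while loading. */
  static final class Registry {
    private final List<Supplier<Object>> factories = new ArrayList<>();
    private List<Object> installed;
    private boolean loading;

    void register(Supplier<Object> factory) { factories.add(factory); }

    List<Object> installedProviders() {
      if (installed == null) {
        if (loading) {
          throw new IllegalStateException("Circular loading of installed providers detected");
        }
        loading = true;
        try {
          List<Object> result = new ArrayList<>();
          for (Supplier<Object> factory : factories) result.add(factory.get());
          installed = result;
        } finally {
          loading = false;
        }
      }
      return installed;
    }
  }

  /** Looks itself up in the registry while it is still being constructed. */
  static final class EagerProvider {
    EagerProvider(Registry registry) { registry.installedProviders(); }
  }

  /** Performs no registry lookup in its constructor. */
  static final class PlainProvider {}

  static String tryLoad(Registry registry) {
    try {
      return "loaded " + registry.installedProviders().size();
    } catch (IllegalStateException e) {
      return e.getMessage();
    }
  }

  static String loadEager() {
    Registry registry = new Registry();
    registry.register(() -> new EagerProvider(registry));
    return tryLoad(registry);
  }

  static String loadPlain() {
    Registry registry = new Registry();
    registry.register(PlainProvider::new);
    return tryLoad(registry);
  }

  public static void main(String[] args) {
    System.out.println(loadEager()); // Circular loading of installed providers detected
    System.out.println(loadPlain()); // loaded 1
  }
}
```

In this model the failure disappears as soon as the constructor stops querying the registry, which suggests deferring any such lookup until after construction.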

GfsObjectService closes repository without incrementing useCnt during construction

When closed, GfsObjectService calls close() on the reader, the inserter and the repo. During construction, however, only the object reader and object inserter are created. I think incrementOpen should be called on the repo during construction so that close() works as expected.

  @Override
  public synchronized void close() {
    if(!closed) {
      closed = true;
      reader.close();
      inserter.close();
      repo.close();
    }
  }

  GfsObjectService(final Repository repo) {
    this.repo = repo;   // should be updated to      this.repo = repo.incrementOpen();
    this.reader = repo.newObjectReader();
    this.inserter = repo.newObjectInserter();
  }

Please let me know if you would like me to create a PR for this.
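
To make the concern concrete, here is a minimal sketch of use counting with a toy handle. CountedRepo and Service are hypothetical stand-ins written for this example, not JGit's Repository or the real GfsObjectService. The sketch assumes that close() decrements a use count and that the handle is considered released once the count reaches zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a use-counted handle; all names here are hypothetical. */
class UseCountDemo {

  /** Stand-in for a repository handle that is released when its use count reaches zero. */
  static final class CountedRepo implements AutoCloseable {
    private final AtomicInteger useCnt = new AtomicInteger(1); // the creator holds one reference

    void incrementOpen() { useCnt.incrementAndGet(); }

    @Override
    public void close() { useCnt.decrementAndGet(); }

    boolean isOpen() { return useCnt.get() > 0; }
  }

  /** Stand-in for a service that closes the repo it was given. */
  static final class Service implements AutoCloseable {
    private final CountedRepo repo;

    Service(CountedRepo repo, boolean acquire) {
      if (acquire) repo.incrementOpen(); // take our own reference so close() is balanced
      this.repo = repo;
    }

    @Override
    public void close() { repo.close(); }
  }

  /** Reports whether the creator's handle is still open after the service has been closed. */
  static boolean ownerStillOpenAfterServiceClose(boolean acquire) {
    CountedRepo repo = new CountedRepo();
    new Service(repo, acquire).close();
    return repo.isOpen();
  }

  public static void main(String[] args) {
    System.out.println(ownerStillOpenAfterServiceClose(false)); // false: the service released the owner's reference
    System.out.println(ownerStillOpenAfterServiceClose(true));  // true: the service only released its own reference
  }
}
```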

GitFileSystemProvider.getInstance() classloader issues

There is a class-loading problem when the library is used in Eclipse with a Maven project. It does not find the instance of GitFileSystemProvider because...

FileSystemProvider.getInstance() calls FileSystemProvider.installedProviders(), which eventually calls ClassLoader.getSystemClassLoader().

The problem is that the "system" class loader is not the one that sees the parallelgit-filesystem jar. We could try to reason about what the "boot class loader" or the "system class loader" is, and about what IDEs or web servers do with class loading, but that would still be fragile. Can you simply get the singleton instance without calling FileSystemProvider.getInstance()? It is not needed, is it?
