bigdoc's Introduction

Overview

'bigdoc' lets you handle gigabyte-scale files easily and with high performance. You can search huge files for bytes or words, and read data or text from them.

It is licensed under the MIT license.

Quick start

Search a big file for a sequence of bytes quickly.

This works on files in the megabyte to gigabyte range.

package org.example;

import java.io.File;
import java.util.List;

import org.riversun.bigdoc.bin.BigFileSearcher;

public class Example {

	public static void main(String[] args) throws Exception {

		byte[] searchBytes = "hello world.".getBytes("UTF-8");

		File file = new File("/var/tmp/yourBigfile.bin");

		BigFileSearcher searcher = new BigFileSearcher();

		List<Long> findList = searcher.searchBigFile(file, searchBytes);

		System.out.println("positions = " + findList);
	}
}

Example code for canceling a search in progress

When the search runs asynchronously (for example, on a separate thread), #cancel can be called to stop it partway through.

package org.riversun.bigdoc.bin;

import java.io.File;
import java.io.UnsupportedEncodingException;
import java.util.List;

import org.riversun.bigdoc.bin.BigFileSearcher.OnRealtimeResultListener;

public class Example {

  public static void main(String[] args) throws UnsupportedEncodingException, InterruptedException {
    byte[] searchBytes = "sometext".getBytes("UTF-8");
    
    File file = new File("path/to/file");

    final BigFileSearcher searcher = new BigFileSearcher();

    searcher.setUseOptimization(true);
    searcher.setSubBufferSize(256);
    searcher.setSubThreadSize(Runtime.getRuntime().availableProcessors());

    final SearchCondition sc = new SearchCondition();
    
    sc.srcFile = file;
    sc.startPosition = 0;
    sc.searchBytes = searchBytes;

    sc.onRealtimeResultListener = new OnRealtimeResultListener() {

      @Override
      public void onRealtimeResultListener(float progress, List<Long> pointerList) {
        System.out.println("progress:" + progress + " pointerList:" + pointerList);
      }
    };

    final Thread th = new Thread(new Runnable() {

      @Override
      public void run() {
        List<Long> searchBigFileRealtime = searcher.searchBigFile(sc);
      }
    });

    th.start();

    Thread.sleep(1500);

    searcher.cancel();

    th.join();

  }
}

Performance Test

Searching a big file for a sequence of bytes

Environment

Tested on AWS EC2 t2.* instances.

Results

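| File size | EC2 t2.2xlarge (8 vCPU, 32 GiB) | EC2 t2.xlarge (4 vCPU, 16 GiB) | EC2 t2.large (2 vCPU, 8 GiB) | EC2 t2.medium (2 vCPU, 4 GiB) |
|-----------|---------------------------------|--------------------------------|------------------------------|-------------------------------|
| 10MB      | 0.5s                            | 0.6s                           | 0.8s                         | 0.8s                          |
| 50MB      | 2.8s                            | 5.9s                           | 13.4s                        | 12.8s                         |
| 100MB     | 5.4s                            | 10.7s                          | 25.9s                        | 25.1s                         |
| 250MB     | 15.7s                           | 32.6s                          | 77.1s                        | 74.8s                         |
| 1GB       | 55.9s                           | 120.5s                         | 286.1s                       | -                             |
| 5GB       | 259.6s                          | 566.1s                         | -                            | -                             |
| 10GB      | 507.0s                          | 1081.7s                        | -                            | -                             |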

Please Note

  • Processing speed depends on the number of CPU cores (including hyper-threading), not on memory capacity.
  • Results vary with the Java environment: the Java version, the compiler, and runtime optimizations.
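The first note can be checked against the Results table with a little arithmetic. The sketch below divides the t2.xlarge (4 vCPU) times by the t2.2xlarge (8 vCPU) times for the three largest files. The class and method names are hypothetical and used only for illustration; the numbers are copied from the table.

```java
import java.util.Locale;

public class SpeedupSketch {

    /** Returns how many times faster the second run was than the first. */
    public static double speedup(double slowerSeconds, double fasterSeconds) {
        return slowerSeconds / fasterSeconds;
    }

    public static void main(String[] args) {
        // Times in seconds, copied from the Results table.
        String[] fileSizes = {"1GB", "5GB", "10GB"};
        double[] fourVcpu = {120.5, 566.1, 1081.7};  // EC2 t2.xlarge
        double[] eightVcpu = {55.9, 259.6, 507.0};   // EC2 t2.2xlarge

        for (int i = 0; i < fileSizes.length; i++) {
            double ratio = speedup(fourVcpu[i], eightVcpu[i]);
            System.out.println(String.format(Locale.ROOT,
                    "%s: 8 vCPU is %.2fx faster than 4 vCPU", fileSizes[i], ratio));
        }
    }
}
```

The ratios come out slightly above 2, so doubling the vCPU count from 4 to 8 roughly halves the search time in these measurements. The t2.large and t2.medium columns (both 2 vCPU, but 8 GiB versus 4 GiB) are nearly identical, which is consistent with memory capacity having little effect.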

Architecture and Tuning

You can tune performance with the following methods, adjusting them to the number of CPU cores and the available memory.

  • BigFileSearcher#setBlockSize
  • BigFileSearcher#setMaxNumOfThreads
  • BigFileSearcher#setBufferSizePerWorker
  • BigFileSearcher#setBufferSize
  • BigFileSearcher#setSubBufferSize
  • BigFileSearcher#setSubThreadSize

BigFileSearcher searches for a sequence of bytes by dividing a big file into multiple blocks and using multiple workers to search those blocks concurrently. Each worker thread searches one block sequentially. The number of workers is set with #setMaxNumOfThreads. Within a single worker thread, data is read into memory and searched in chunks of the size set with #setBufferSize. The small area used to compare byte sequences during the search is called a window, and its size is set with #setSubBufferSize. Multiple windows can be processed concurrently, and the number of concurrent windows per worker is set with #setSubThreadSize.
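The block-and-worker idea can be illustrated with a minimal, self-contained sketch. This is not bigdoc's implementation: the class `BlockSearchSketch` and its methods are hypothetical, the data is an in-memory byte array rather than a file, and the buffer and window levels are left out. The `blockSize` and `numThreads` parameters only play roles loosely analogous to #setBlockSize and #setMaxNumOfThreads. A match is assigned to the block in which it starts, and a worker may read past its block's end so that matches spanning a block boundary are not missed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockSearchSketch {

    /** Returns the start offsets, in ascending order, of every occurrence of pattern in data. */
    public static List<Long> search(byte[] data, byte[] pattern, int blockSize, int numThreads) {
        if (pattern.length == 0 || blockSize <= 0 || numThreads <= 0) {
            throw new IllegalArgumentException("pattern must be non-empty; sizes must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // One task per block; the pool runs up to numThreads of them concurrently.
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (long start = 0; start < data.length; start += blockSize) {
                final long blockStart = start;
                final long blockEnd = Math.min(blockStart + blockSize, data.length);
                Callable<List<Long>> task = () -> searchBlock(data, pattern, blockStart, blockEnd);
                futures.add(pool.submit(task));
            }
            // Blocks were submitted in file order, so concatenating keeps offsets sorted.
            List<Long> result = new ArrayList<>();
            for (Future<List<Long>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("block search failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Sequentially finds matches that start inside [blockStart, blockEnd). */
    private static List<Long> searchBlock(byte[] data, byte[] pattern, long blockStart, long blockEnd) {
        List<Long> hits = new ArrayList<>();
        // A match may extend beyond blockEnd, but never beyond the end of the data.
        long lastStart = Math.min(blockEnd - 1, (long) data.length - pattern.length);
        for (long pos = blockStart; pos <= lastStart; pos++) {
            if (matchesAt(data, pattern, (int) pos)) {
                hits.add(pos);
            }
        }
        return hits;
    }

    private static boolean matchesAt(byte[] data, byte[] pattern, int pos) {
        for (int i = 0; i < pattern.length; i++) {
            if (data[pos + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "abcabcabc".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        byte[] pattern = "abc".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        // Block size 4 forces the match at offset 3 to span two blocks.
        System.out.println(search(data, pattern, 4, 2)); // prints [0, 3, 6]
    }
}
```

In this sketch each block is a separate task, so with more blocks than threads the remaining blocks wait in the pool's queue. A real file-based searcher would additionally read each block through a bounded buffer instead of holding the whole file in memory.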

More Details

See the Javadoc:

https://riversun.github.io/javadoc/bigdoc/

Downloads

Maven

  • Add the following dependency to your Maven pom.xml file.
<dependency>
    <groupId>org.riversun</groupId>
    <artifactId>bigdoc</artifactId>
    <version>0.4.0</version>
</dependency>
