big-sequence-file's Introduction

big-sequence-file

A big file abstraction, similar to an embedded Java implementation of Kafka, that provides sequential data access.

QuickStart

  • Sample Code
BigSequenceFile bsf = null;

try {
  bsf = new BigSequenceFile("hello.bsf");
  bsf.open();

  // append three trunks (records) in order
  bsf.appendTrunk("abc".getBytes());
  bsf.appendTrunk("def".getBytes());
  bsf.appendTrunk("g".getBytes());

  // read back the next trunk in sequence
  byte[] trunk = bsf.deductTrunk();
  System.out.println(new String(trunk));

} finally {
  if (bsf != null) {
    bsf.close();
  }
}

For more information, see the QuickStart.java example.

  • Maven Dependency
<dependency>
  <groupId>com.github.downgoon</groupId>
  <artifactId>big-sequence-file</artifactId>
  <version>0.1.0</version>
</dependency>

  • Underlying Structure

When two '.bsf' files are created, with new BigSequenceFile("hello.bsf") and new BigSequenceFile("world.bsf"), the underlying files may look as follows:

$ tree .
├── hello.bsf
├── hello_0.seg
├── world.bsf
├── world_0.seg
├── world_1.seg
└── world_2.seg

The .bsf file holds the metadata of the user-named BSF file (e.g. hello.bsf), while multiple .seg files store the data. In general, a BSF file consists of exactly one .bsf file and one or more .seg files in the underlying storage layer.

For Developers

big-sequence-file's People

Contributors

downgoon

big-sequence-file's Issues

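The .bsf header can be simplified, and .seg files do not need memory mapping

Issue #1 points out that there is no need to record usedSize in the header.

Taking that one step further, memory mapping is not needed either. Memory mapping was mainly there to make writing back the header efficient. With no header, every write is an append, and memory mapping actually hurts efficiency.

Now consider the .bsf header:

The write position does not need to be recorded at all. It is always the end of the last file (the one with the largest number in its name), because without memory mapping there is no preallocated space and data is simply appended.

The read position does need to be recorded. Note that systems such as Kafka do not record it. They leave that task to the client, and the client can of course re-read from any point.

Finally, trunk-count is optional. If recorded, it shows at a glance whether message consumption is lagging. Leaving it out does not affect the main operations: sequential write and sequential read.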
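
The rule for the write position (the end of the segment with the largest number) can be sketched in Java as follows. This is a minimal illustration, not code from the library: the class LastSegment and its find method are hypothetical names, and the XXX_N.seg naming is taken from the directory listing in the README.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of big-sequence-file.
final class LastSegment {

    /** Returns the segment name with the largest numeric suffix for the given base name. */
    static Optional<String> find(String baseName, List<String> fileNames) {
        Pattern pattern = Pattern.compile(Pattern.quote(baseName) + "_(\\d+)\\.seg");
        String best = null;
        long bestIndex = -1;
        for (String name : fileNames) {
            Matcher m = pattern.matcher(name);
            if (m.matches()) {
                // Compare numerically so that _10 sorts after _9.
                long index = Long.parseLong(m.group(1));
                if (index > bestIndex) {
                    bestIndex = index;
                    best = name;
                }
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "hello.bsf", "hello_0.seg",
                "world.bsf", "world_0.seg", "world_1.seg", "world_2.seg");
        System.out.println(find("world", names).orElse("none"));
    }
}
```

With append-only segments, the write position would then simply be the current length of that file (for example, via java.io.File.length()), so nothing about it has to be stored in the .bsf header.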

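No need to record usedSize in the segment file header

Design review

A segment has a fixed 12-byte header. The most essential piece of information is the fourth field, usedSize, which is the number of bytes of the segment file that are actually in use. You might ask why this value has to be stored separately: isn't the file size already managed by the operating system's file system? The main reason is that the implementation uses memory mapping underneath, and each mapping covers BSFConf.segmentLimitBytes bytes, 128MB by default. In effect, every segment is allocated in units of 128MB, whether or not it actually uses that much.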
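
The effect described above can be reproduced with a small Java sketch. It assumes a typical JVM and file system, where mapping a region in READ_WRITE mode beyond the current end of a file grows the file to the mapped size. The class name MappedSegmentDemo and the 4096-byte limit are illustrative choices and are not taken from the library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo, not part of big-sequence-file.
final class MappedSegmentDemo {

    /** Maps limitBytes of a fresh temp file, writes payload, and returns the file size. */
    static long sizeAfterMappedWrite(int limitBytes, byte[] payload) {
        try {
            Path file = Files.createTempFile("demo", ".seg");
            file.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(
                    file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Mapping past the end of the file extends it to limitBytes.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, limitBytes);
                buf.put(payload);
                buf.force();
                return ch.size();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "abc".getBytes(StandardCharsets.UTF_8);
        long size = sizeAfterMappedWrite(4096, payload);
        System.out.println("payload=" + payload.length + "B, file size=" + size + "B");
    }
}
```

Because the file size reported by the file system reflects the mapped region rather than the bytes written, a mapped segment needs some other way, such as usedSize, to know where its valid data ends.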

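Borrowing from Kafka

In big-sequence-file, multiple segments are named XXX_0.seg, XXX_1.seg, XXX_2.seg. In Kafka, the suffix of a segment file name records that file's starting offset relative to the whole file, which in effect records where the previous file ended. This makes the number of valid bytes in each file known.

For example: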

hello.bsf
hello_00000000000000000000.seg
hello_00000000000000001003.seg
hello_00000000000000001946.seg
hello_00000000000000003068.seg

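Assume the preallocated capacity of a segment is 1KB (1024 bytes), and that exceeding it triggers a switch to the next file. Then the valid size of hello_00000000000000000000.seg is the number in the next file's name, 1003 B, and 1024-1003=21 B is wasted (this is called intra-segment fragmentation). Likewise, the valid size of hello_00000000000000001003.seg is 1946-1003=943 B, with 1024-943=81 B wasted. We can also infer that the total length of the first message in the next file (hello_00000000000000001946.seg) must have exceeded 81 B, otherwise it would not have triggered the switch to a new file.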
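
The arithmetic above can be checked with a short Java sketch that derives each segment's valid size from the offset in the next file's name. The class SegmentOffsets and its methods are hypothetical names used only for illustration. The sketch uses the first three file names from the listing and the 1024-byte capacity used in the calculation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of big-sequence-file.
final class SegmentOffsets {

    /** Parses the numeric suffix of a name such as "hello_00000000000000001003.seg". */
    static long startOffset(String fileName) {
        int underscore = fileName.lastIndexOf('_');
        int dot = fileName.lastIndexOf('.');
        return Long.parseLong(fileName.substring(underscore + 1, dot));
    }

    /** Valid bytes of every segment except the last one, given names in offset order. */
    static List<Long> validSizes(List<String> sortedNames) {
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i + 1 < sortedNames.size(); i++) {
            // A segment ends where the next one starts.
            sizes.add(startOffset(sortedNames.get(i + 1)) - startOffset(sortedNames.get(i)));
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "hello_00000000000000000000.seg",
                "hello_00000000000000001003.seg",
                "hello_00000000000000001946.seg");
        long capacity = 1024; // assumed segment capacity in bytes
        for (long size : validSizes(names)) {
            System.out.println("valid=" + size + "B wasted=" + (capacity - size) + "B");
        }
    }
}
```

The size of the last segment cannot be derived this way, since no later file name bounds it. With append-only writes it would come from the file length instead.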
