downgoon / big-sequence-file

a big file providing sequential data access

Java 100.00%

big-files kafka bigpipe

big-sequence-file's Introduction

big-sequence-file

A big file, similar to an embedded Java implementation of kafka, providing sequential data access.

QuickStart

Sample Code

BigSequenceFile bsf = null;

try {
  bsf = new BigSequenceFile("hello.bsf");
  bsf.open();

  bsf.appendTrunk("abc".getBytes());
  bsf.appendTrunk("def".getBytes());
  bsf.appendTrunk("g".getBytes());

  byte[] trunk = bsf.deductTrunk();
  System.out.println(new String(trunk));

} finally {
  if (bsf != null) {
    bsf.close();
  }
}

For more information, please read the QuickStart.java example.

Maven dependency

<dependency>
  <groupId>com.github.downgoon</groupId>
  <artifactId>big-sequence-file</artifactId>
  <version>0.1.0</version>
</dependency>

Underlying structure

When we create two '.bsf' files with new BigSequenceFile("hello.bsf") and new BigSequenceFile("world.bsf"), the underlying files may look as follows:

$ tree .
├── hello.bsf
├── hello_0.seg
├── world.bsf
├── world_0.seg
├── world_1.seg
└── world_2.seg

The .bsf file manages the meta info of the user-named BSF file (e.g. hello.bsf), while multiple .seg files store the data. In general, a BSF file always consists of exactly one .bsf file and several .seg files in the underlying storage layer.
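The layout above can be explored with a few lines of plain java.nio. The following is a minimal sketch, not part of the library: the class SegmentLister and its methods are hypothetical, and the `<base>_<index>.seg` naming pattern is assumed from the listing above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the .seg files that belong to one BSF file.
final class SegmentLister {

    // Returns the numeric index of "<base>_<index>.seg", or -1 if the name does not match.
    static int segmentIndex(String baseName, String fileName) {
        Matcher m = Pattern.compile(Pattern.quote(baseName) + "_(\\d+)\\.seg").matcher(fileName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Lists the segments of baseName in dir, ordered by their numeric index.
    static List<Path> listSegments(Path dir, String baseName) throws IOException {
        List<Path> segments = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (segmentIndex(baseName, p.getFileName().toString()) >= 0) {
                    segments.add(p);
                }
            }
        }
        segments.sort(Comparator.comparingInt(p -> segmentIndex(baseName, p.getFileName().toString())));
        return segments;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the example layout as empty files in a temporary directory.
        Path dir = Files.createTempDirectory("bsf-demo");
        String[] names = {"hello.bsf", "hello_0.seg", "world.bsf", "world_0.seg", "world_1.seg", "world_2.seg"};
        for (String name : names) {
            Files.createFile(dir.resolve(name));
        }
        for (Path p : listSegments(dir, "world")) {
            System.out.println(p.getFileName());
        }
    }
}
```

Sorting by the parsed number rather than by file name keeps world_10.seg after world_9.seg once a file grows past ten segments.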

For Developers

Developer Guide

big-sequence-file's Issues

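The .bsf header can be simplified, and .seg files do not need memory mapping

Issue #1 pointed out that there is no need to record usedSize in the header.

It follows that memory mapping is not needed either. Memory mapping mainly makes writing back to the header efficient; with no header, every write is an append, and memory mapping hurts efficiency instead.

Now look at the .bsf header:

The write position does not need to be recorded at all: it is always the end of the last file (the one with the highest number in its name), because without memory mapping there is no preallocated space and data is simply appended.

The read position does need to be recorded. Systems like kafka, by contrast, do not record it; they leave that task to the client, and the client can of course re-read from any point.

Finally, trunk-count is optional. Recording it shows at a glance whether message consumption is lagging; not recording it does not affect the main operations: sequential write and sequential read.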
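The append-only idea can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the class AppendOnlySegment is hypothetical, and the payload is written raw, with no trunk framing. The point is only that the write position is the current file size, so nothing about it needs to be stored in a header.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: append to a segment without memory mapping.
final class AppendOnlySegment {

    // Appends payload to the end of the segment and returns the offset where it starts.
    static long append(Path segment, byte[] payload) throws IOException {
        try (FileChannel ch = FileChannel.open(segment,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            long start = ch.size(); // the write position is simply the end of the file
            ByteBuffer buf = ByteBuffer.wrap(payload);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            return start;
        }
    }

    // Runs two appends on a temporary segment; returns {firstOffset, secondOffset, finalSize}.
    static long[] demo() {
        try {
            Path segment = Files.createTempFile("demo_0", ".seg");
            segment.toFile().deleteOnExit();
            long first = append(segment, new byte[] {'a', 'b', 'c'});
            long second = append(segment, new byte[] {'d', 'e'});
            return new long[] {first, second, Files.size(segment)};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println("first=" + r[0] + " second=" + r[1] + " size=" + r[2]);
    }
}
```

Because the file is never preallocated, its size as reported by the file system is also its used size.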

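No need to record usedSize in the segment file header

Design review

A segment has a fixed 12 B header. The most essential piece of information is the fourth field, usedSize, which gives the number of bytes of the segment file actually in use. You might ask why this value needs to be stored separately: doesn't the operating system's file system already track the file size? The main reason is that the implementation uses memory mapping underneath and maps BSFConf.segmentLimitBytes bytes each time (128 MB by default), so every segment is allocated in 128 MB units whether or not it actually uses that space.

Borrowing from `kafka`

In big-sequence-file, segments are named XXX_0.seg, XXX_1.seg, XXX_2.seg. In kafka, the suffix of a segment file name records the starting offset of that file relative to the whole logical file, which is equivalent to recording where the previous file ends, so the number of valid bytes in each file is known.

For example: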
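The preallocation effect can be reproduced with plain java.nio. The sketch below is not the library's code: the class MappedSegmentDemo is hypothetical, and a 1 MB limit is used instead of the 128 MB default to keep the demo small. On common JDKs and platforms, mapping a region past the end of the file in READ_WRITE mode grows the file to the mapped size, so the file-system size no longer says how many bytes were written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo: a memory-mapped segment is as large as the mapping, not the payload.
final class MappedSegmentDemo {

    // Maps limitBytes of a fresh file, writes payload at offset 0, returns the resulting file size.
    static long mappedFileSize(int limitBytes, byte[] payload) {
        try {
            Path segment = Files.createTempFile("demo_0", ".seg");
            segment.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(segment,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, limitBytes);
                buf.put(payload); // only payload.length bytes carry data
                buf.force();      // flush the mapped region to disk
            }
            return Files.size(segment);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int limit = 1024 * 1024; // assumed demo limit, smaller than the 128 MB default
        long size = mappedFileSize(limit, new byte[] {'a', 'b', 'c'});
        System.out.println("payload bytes: 3, file size: " + size);
    }
}
```

This gap between written bytes and file size is why a mapped segment needs a separate usedSize field.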

hello.bsf
hello_00000000000000000000.seg
hello_00000000000000001003.seg
hello_00000000000000001946.seg
hello_00000000000000003068.seg

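Suppose a segment's preallocated capacity is 1 KB (1024 B), and writing switches to the next file once that is exceeded. Then the valid size of hello_00000000000000000000.seg is the number in the next file's name, 1003, and 1024-1003=21 B is wasted (called intra-segment fragmentation). Likewise, the valid size of hello_00000000000000001003.seg is 1946-1003=943 B, wasting 1024-943=81 B. We can also infer that the total length of the first message in the next file (hello_00000000000000001946.seg) must have exceeded 81 B; otherwise it would not have triggered the switch to a new file.

Memory mapping experiments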
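The arithmetic above can be written down as a small helper. This is a sketch of the proposed naming scheme only: the class SegmentOffsets and its method names are hypothetical, and the 1024 B capacity is the value assumed in the example.

```java
// Hypothetical helper: derives used and wasted bytes from offset-suffixed segment names.
final class SegmentOffsets {

    // Parses the starting offset from a name such as "hello_00000000000000001003.seg".
    static long baseOffset(String fileName) {
        int start = fileName.lastIndexOf('_') + 1;
        int end = fileName.length() - ".seg".length();
        return Long.parseLong(fileName.substring(start, end));
    }

    // Valid bytes in a segment: the next segment's base offset minus its own.
    static long usedBytes(String current, String next) {
        return baseOffset(next) - baseOffset(current);
    }

    // Intra-segment fragmentation: preallocated capacity minus valid bytes.
    static long wastedBytes(String current, String next, long capacityBytes) {
        return capacityBytes - usedBytes(current, next);
    }

    public static void main(String[] args) {
        long capacity = 1024; // assumed segment capacity in bytes, as in the example
        String s0 = "hello_00000000000000000000.seg";
        String s1 = "hello_00000000000000001003.seg";
        String s2 = "hello_00000000000000001946.seg";

        System.out.println(usedBytes(s0, s1));             // prints 1003
        System.out.println(wastedBytes(s0, s1, capacity)); // prints 21
        System.out.println(usedBytes(s1, s2));             // prints 943
        System.out.println(wastedBytes(s1, s2, capacity)); // prints 81
    }
}
```

The last segment has no successor, so its used size would have to come from the file size itself, which works once the file is no longer preallocated.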

https://github.com/downgoon/big-sequence-file/blob/master/docs/labs.md
