
hadoop-perfect-file's Introduction

Hadoop-Perfect-File: A fast-access container for small files

Hadoop Perfect File (HPF), like other Hadoop index-based archive formats, combines small files into large files before storing them on HDFS. HPF organizes its index system efficiently and provides very fast access performance. Unlike HAR files, MapFiles, and similar formats, where a particular file's metadata cannot be accessed directly from the index file, HPF offers direct access to a file's metadata.

By using a monotone minimal perfect hash function in its index system, HPF can calculate the offset and limit between which a file's metadata is stored in the index file. After calculating the offset and limit, HPF seeks to the offset position in the index file and reads up to the limit. Seeking to arbitrary positions in a file can take time when the file is very large. To keep seek operations from degrading, we prevent index files from growing too large by distributing file metadata across several index files with an extendible hash function. Our approach also allows appending more files after the HPF file has been created, without decompressing and recompressing the entire archive.
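
The lookup idea can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes fixed-size metadata records (RECORD_SIZE is invented here), though it reuses the sux4j MMPHF class that HPF itself depends on. The hash function maps a key to its rank among the index file's keys, which yields the exact byte range of the metadata record.

import java.util.Arrays;
import java.util.List;

import it.unimi.dsi.bits.TransformationStrategies;
import it.unimi.dsi.sux4j.mph.HollowTrieMonotoneMinimalPerfectHashFunction;

public class IndexLookupSketch {
    // Hypothetical fixed record size; HPF's real index layout may differ.
    static final long RECORD_SIZE = 64;

    public static void main(String[] args) throws Exception {
        // Build the MMPHF over the (sorted) keys held by one index file.
        List<String> sortedKeys = Arrays.asList("a.txt", "b.txt", "c.txt");
        HollowTrieMonotoneMinimalPerfectHashFunction<String> mph =
                new HollowTrieMonotoneMinimalPerfectHashFunction<>(
                        sortedKeys, TransformationStrategies.prefixFreeUtf16());

        // For a key present in the archive, the MMPHF returns its rank,
        // so the metadata record can be located without scanning the index.
        long slot = mph.getLong("b.txt");
        long offset = slot * RECORD_SIZE;   // where to seek in the index file
        long limit = offset + RECORD_SIZE;  // where to stop reading
        System.out.println("read metadata bytes [" + offset + ", " + limit + ")");
    }
}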

(Figure: HPF file creation)

Creating and adding files to an HPF file is done with the PerfectFile.Writer class, and reading an HPF file is done with the PerfectFile.Reader class.

Example of writing to an HPF file

From Local

In this example, we add all the files located in the client's local folder D:/files/ to the HPF file whose path on HDFS is /data/file.hpf. If /data/file.hpf does not exist on HDFS, it is created automatically.

// Assumes these imports:
// import org.apache.hadoop.conf.Configuration;
// import org.apache.hadoop.fs.FileStatus;
// import org.apache.hadoop.fs.FileSystem;
// import org.apache.hadoop.fs.Path;
// import net.almightshell.pf.PerfectFile;

Configuration conf = new Configuration();
FileSystem lfs = FileSystem.getLocal(conf); // the client's local file system

try (PerfectFile.Writer writer = new PerfectFile.Writer(conf, new Path("/data/file.hpf"), 200000)) {
    for (FileStatus status : lfs.listStatus(new Path("D:/files/"))) {
        String key = status.getPath().getName(); // the file name is used as the lookup key
        Path path = status.getPath();
        writer.putFromLocal(key, path);          // copy the local file into the archive
    }
}

The last parameter (200000) of the PerfectFile.Writer constructor is the maximum number of file metadata entries that each index file can hold.

From HDFS

In this example, we add all the files located in the HDFS folder /data/files/ to the HPF file whose path on HDFS is /data/file.hpf.

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf); // HDFS (the configured default file system)

try (PerfectFile.Writer writer = new PerfectFile.Writer(conf, new Path("/data/file.hpf"), 200000)) {
    for (FileStatus status : fs.listStatus(new Path("/data/files/"))) {
        String key = status.getPath().getName(); // the file name is used as the lookup key
        Path path = status.getPath();
        writer.put(key, path);                   // add the HDFS file to the archive
    }
}

Example of reading from an HPF file

To read a file from the HPF file, two functions are available: get(), which returns an input stream over the file's contents, and getBytes(), which returns the file's binary contents.

try (PerfectFile.Reader reader = new PerfectFile.Reader(conf, new Path("/data/file.hpf"))) {
    for (String key : keys) { // keys: a collection of file names to look up
        InputStream inputStream = reader.get(key);  // stream over the file's contents
        byte[] bs = reader.getBytes(key);           // or the whole contents at once
    }
}
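
As a usage sketch, the stream returned by get() can be consumed with Hadoop's standard IOUtils. The file name example.txt and the destination path below are invented for illustration:

// Assumes: java.io.*, org.apache.hadoop.io.IOUtils, plus the imports shown earlier.
try (PerfectFile.Reader reader = new PerfectFile.Reader(conf, new Path("/data/file.hpf"));
     InputStream in = reader.get("example.txt");                        // a key assumed to be in the archive
     OutputStream out = new FileOutputStream("D:/restored/example.txt")) {
    IOUtils.copyBytes(in, out, conf, false); // copy the archived file's contents to local disk
}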

hadoop-perfect-file's People

Contributors

tchaye59


hadoop-perfect-file's Issues

question about creating part file

I found that no matter how large a data set I uploaded, only one part file was created in the HPF system, even though its size was larger than blockSize and partMaxSize. Thank you for your answer.
(Screenshot: 2022-04-29 23:58:22)

recoveryOnFailure error

In the WriterTest, an exception occurs when the writer is restarted after a manual interruption.

java.lang.IllegalArgumentException: The input bit vectors are not distinct

	at it.unimi.dsi.sux4j.mph.HollowTrieMonotoneMinimalPerfectHashFunction.<init>(HollowTrieMonotoneMinimalPerfectHashFunction.java:180)
	at net.almightshell.pf.PerfectTableHolder.reloadBucketDictionary(PerfectTableHolder.java:71)
	at net.almightshell.pf.PerfectFile$Writer.close(PerfectFile.java:226)
	at net.almightshell.pf.WriterTest.testPut_Path(WriterTest.java:53)
