Giter Club home page Giter Club logo

externalsortinginjava's Introduction

Externalsortinginjava

docs-badge Java CI

External-Memory Sorting in Java: useful to sort very large files using multiple cores and an external-memory algorithm.

The versions 0.1 of the library are compatible with Java 6 and above. Versions 0.2 and above require at least Java 8.

This code is used in Apache Jackrabbit Oak as well as in Apache Beam and in Spotify scio.

Code sample

import com.google.code.externalsorting.ExternalSort;

//... inputfile: input file name
//... outputfile: output file name
// next command sorts the lines from inputfile to outputfile
int numLinesWritten = ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(new File(inputfile)), new File(outputfile));
// you can also provide a custom string comparator, see API

Code sample (CSV)

For sorting CSV files, it might be more convenient to use CsvExternalSort.

import com.google.code.externalsorting.CsvExternalSort;
import com.google.code.externalsorting.CsvSortOptions;

// provide a comparator
Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
//... inputfile: input file name
//... outputfile: output file name
//...provide sort options
CsvSortOptions sortOptions = new CsvSortOptions
				.Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
				.charset(Charset.defaultCharset())
				.distinct(false)
				.numHeader(1)
				.skipHeader(false)
				.format(CSVFormat.DEFAULT)
				.build();
// container to store the header lines
ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();

// next two lines sort the lines from inputfile to outputfile
List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
// at this point you can access header if you'd like.
int numWrittenLines = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);

The numHeader parameter is the number of lines of headers in the CSV files (typically 1 or 0) and the skipHeader parameter indicates whether you would like to exclude these lines from the parsing.

API Documentation

http://www.javadoc.io/doc/com.google.code.externalsortinginjava/externalsortinginjava/

Maven dependency

You can download the jar files from the Maven central repository: https://repo1.maven.org/maven2/com/google/code/externalsortinginjava/externalsortinginjava/

You can also specify the dependency in the Maven "pom.xml" file:

    <dependencies>
         <dependency>
	     <groupId>com.google.code.externalsortinginjava</groupId>
	     <artifactId>externalsortinginjava</artifactId>
	     <version>[0.6.0,)</version>
         </dependency>
     </dependencies>

How to build

  • get the java jdk
  • Install Maven 2
  • mvn install - builds jar (requires signing)
  • or mvn package - builds jar (does not require signing)
  • mvn test - runs tests

externalsortinginjava's People

Contributors

dependabot[bot] avatar dritter-sap avatar evgenydatanerds avatar feyyazesat avatar gecec avatar hansenmc avatar julianselser avatar lemire avatar nddipiazza avatar pbloem avatar suneyz avatar tsangjustin avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

externalsortinginjava's Issues

performance for external sorter

Thanks for this module .
I was wondering if you have any information regarding how the performance of this module would be when using SSDs vs sayy using a rocksDB embedded KV store that relies on SST files.

if i don't want to take a dependency on rocksdb this library makes perfect use case for me , but wondering if its more performant , wanted to use it for some streaming ingesting use cases with low latency requirement,

Ability to set the line separator for the final file

Currently fbw.newLine(); is used exclusively, but I require the ability to use single line feed on windows so the files can interchanged with other systems, without changing the default system line separator.

Randomizer

Could this be modified to randomize huge files instead of sort them much like "sort -R" does?

mergeSortedFiles with BufferedWriter cannot be used

I'm trying to use the merge method with a GZIPOutput Buffered writer.
The method mergeSortedFiles(BufferedWriter fbw, final Comparator cmp, boolean distinct, List buffers) looks perfect, but the dependency to BinaryFileBuffer makes it impossible to use.
As BinaryFileBuffer is a non-public child class, you cannot use it externally and you cannot extend it either.
Could you just make the BinaryFileBuffer a standard public class in its own file?

Thanks

ExternalSort.sortInBatch returns empty file, ExternalSort.mergeSortedFiles doesn't delete it

I have a test that call ExternalSort.sortInBatch with a very small maxMemory - 100 bytes. When I call that method on a small file, ExternalSort.sortInBatch creates 5 temporary files, one of which is empty. I then call ExternalSort.mergeSortedFiles on the result of ExternalSort.sortInBatch. The resulting merged file is correct, however ExternalSort.mergeSortedFiles doesn't delete the empty file.

That seems to be caused by the empty file still being in use (I'm running on Windows). The code only adds non-empty files to the priority queue:

`

       public static long mergeSortedFiles(BufferedWriter fbw,
            final Comparator<String> cmp, boolean distinct,
            List<IOStringStack> buffers)  ...
            for (IOStringStack bfb : buffers) {
                    if (!bfb.empty()) {
                            pq.add(bfb);
                    }
            }

`
Since the empty file is not on the queue, it's not being closed so it cannot be deleted.

There is probably more than one way to fix this issue - e.g., if a file is empty, don't create an IOStringStack for it. I'm not sure what's the best way.

Update commons-csv to 1.9.0

I'm using org.apache.commons:commons-csv:1.9.0
Could we update it in externalsortinginjava ?
So when I have commons-csv:1.9.0 running in runtime, I know that externalsortinginjava was compiled against 1.9 as well.

CsvExternalSort.mergeSortedFiles ignores CsvSortOptions.numHeader

CsvExternalSort.mergeSortedFiles ignores CsvSortOptions.numHeader which could trigger issues in the provided Comparator as it will try to compare header line with other lines. The issue only occurs if the batch has size > 1.

README example:

import com.google.code.externalsorting.CsvExternalSort;
import com.google.code.externalsorting.CsvSortOptions;

// provide a comparator
Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
//... inputfile: input file name
//... outputfile: output file name
//...provide sort options
CsvSortOptions sortOptions = new CsvSortOptions
				.Builder(CsvExternalSort.DEFAULTMAXTEMPFILES, comparator, 1, CsvExternalSort.estimateAvailableMemory())
				.charset(Charset.defaultCharset())
				.distinct(false)
				.numHeader(1)
				.skipHeader(false)
				.format(CSVFormat.DEFAULT)
				.build();

// next two lines sort the lines from inputfile to outputfile
List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions);;
CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true);

What is numHeader and skipHeader?

I was a bit confused with the skipheader and numHeader in the the CSVSortOptions.
Couldn't find the description in your API documentation.
Can you please elaborate?

ExternalSort.sortInBatch - java.io.IOException: The system cannot find the path specified

Hi,

I used to work with this library on Linux but now when i work in windows throw me this exception java.ioException. I verified the paths and there are OK. If i am doing something wrong, please tell me.

Thank in advance

This is my code:

`

def sortFile(inputFile: String, outputFile: String): Int = {
    logger.info("sorting file " + inputFile)
    logger.info("sorted file " + outputFile)

    val distinct = true
    val maxtmpfiles = 1024
    val cs = java.nio.charset.Charset.defaultCharset()
    val tempFileStore: File = new File(outputFile)
    val usegzip = false
    val headersize = 0
    var response = 0
    try {

      val comparator = ExternalSort.defaultcomparator

      val fileToSort = new File(inputFile)
      val fileSorted = new File(outputFile)

      if (!fileToSort.exists()) {
        logger.error("File Not found " + inputFile)
      }

      val l = ExternalSort.sortInBatch(fileToSort, comparator,
        maxtmpfiles, cs, tempFileStore, distinct, headersize, usegzip) //This is the line 204 that cause the exception 

      if (!fileSorted.exists()) {
        logger.error("File not found " + outputFile)
      }

      response = ExternalSort.mergeSortedFiles(l, fileSorted, comparator, cs,
        distinct, false, usegzip)


    } catch {
      case e: Exception => {
        logger.error("Error sorting", e)
      }
    }
    response
  }

`

And this the Log output:

sorting file C:\subscriber-files\main-prepago.csv sorted file C:\subscriber-files\sorted-main-prepago.csv Error sorting java.io.IOException: The system cannot find the path specified at java.io.WinNTFileSystem.createFileExclusively(Native Method) ~[?:1.8.0_65] at java.io.File.createTempFile(File.java:2024) ~[?:1.8.0_65] at com.google.code.externalsorting.ExternalSort.sortAndSave(ExternalSort.java:465) ~[externalsortinginjava-0.1.9.jar:?] at com.google.code.externalsorting.ExternalSort.sortInBatch(ExternalSort.java:594) ~[externalsortinginjava-0.1.9.jar:?] at com.google.code.externalsorting.ExternalSort.sortInBatch(ExternalSort.java:731) ~[externalsortinginjava-0.1.9.jar:?] at subscriberdumper.utils.FileUtils$.sortFile(Utils.scala:204) [subscribers-dumper-1.0-SNAPSHOT.jar:?]

StringSizeEstimator

do you consider the padding when you caculate the size of string ?the size of an object
Is a multiple of 8

How to use multi core maximize?

I found code in sortAndSave

            boolean distinct, boolean usegzip, boolean parallel) throws IOException {
            if (parallel) {
              tmplist = tmplist.parallelStream().sorted(cmp).collect(Collectors.toCollection(ArrayList<String>::new));
            } else {
              Collections.sort(tmplist, cmp);
            }

Is this the only thing related in multiple core (wrote in readme)?

In mergeSortedFiles seem read one line from the sorted files one by one.
in sortInBatch . seem sort one block one by one . files.add(sortAndSave(tmplist, cmp, cs,tmpdirectory, distinct, usegzip, parallel));

Can we do concurrent handling in mergeSortedFiles (like read block concurrent ) and in sortInBatch (one thread merge the smaller , one thread to merge the bigger, such like 20 tmpfiles to 10 tmpfiles then to 4 then to 1)

Sort CSV based on multiple columns

1. Is it possible to sort a CSV file based on multiple columns, say firstName, lastName, and dob?
For example in case multiple records have the same firstName, those records will be sorted based on the lastName value. If lastName is also same, they will be sorted based on the dob value, And so on.

2. Can this method also be applied to sort fixed length files based on multiple columns?

If yes, can you point me to where can I find some samples of those?

CsvExternalSort.sortInBatch drops a row for each batch

In the case where if (currentBlock.get() < blocksize) is false, the CSVRecord e is lost. You'll notice this as a missing row for each block.

My solution was to remove the conditional logic for processing e, and add the conditional logic only in the part where the block is processed.

Here is my updated code:
` try (CSVParser parser = new CSVParser(fbr, sortOptions.getFormat())) {
parser.spliterator().forEachRemaining(e -> {
if (e.getRecordNumber() <= sortOptions.getNumHeader()) {
header[0] = e;
} else {
tmplist.add(e);
currentBlock.addAndGet(SizeEstimator.estimatedSizeOf(e));
}
if (currentBlock.get() >= blocksize) {
try {
files.add(sortAndSave(tmplist, tmpdirectory, sortOptions, header[0]));
} catch (IOException e1) {
logger.warn(String.format("Error during the sort in batch"),e1); //$NON-NLS-1$
}
tmplist.clear();
currentBlock.getAndSet(0);
}

		});
	}

`

CsvExternalSort.mergeSortedFiles rowcounter when sortOptions.isDistinct()

Hi, thanks for this great library!

I've noticed CsvExternalSort.mergeSortedFiles() counts all rows even when sortOptions.isDistinct() and was wondering if this is on purpose, I expected it to return the number of rows in the sorted file.

I can already count the number of rows in the input but Im forced to reopen the output file and count myself otherwise. There is no test for distinct(true) on CsvExternalSortTest.java so Im not sure.

Seems like we could move the counter next to printRecord() in mergeSortedFiles like this?

if (sortOptions.isDistinct() && checkDuplicateLine(r, lastLine)) {
} else {
	printer.printRecord(r);
	++rowcounter;
	lastLine = r;
}

We'd find the feature really useful, let me know if its ok to submit a PR regarding the above or if we could discuss a way to include it that makes sense if this doesnt. Thanks!

Output file is same as input file. NO Change

I have a csv file with float value in first column. I run the program but output file is same as input file.
Here is my code:-

Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
        File inputFile=new File("/home/signity/Documents/workshop/csvjobmultiplier/TEMP_DIR/test/short.csv");
        File outputfile=new File("/home/signity/Documents/workshop/csvjobmultiplier/TEMP_DIR/test/output_ks.csv");
        CsvSortOptions sortOptions = new CsvSortOptions
                .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
                .charset(Charset.defaultCharset())
                .distinct(false)
                .numHeader(0)
                .skipHeader(false)
                .format(CSVFormat.DEFAULT)
                .build();
        ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();
        List<File> sortInBatch = CsvExternalSort.sortInBatch(inputFile, null, sortOptions, header);
        CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);

INPUT CSV FILE SYNTAX:-

0.3,0.3,"39826ab68245a00e85853fb04d337c17",TRUE
10,10,"39826ab68245a00e85853fb04d337c17",TRUE
10,10,"39826ab68245a00e85853fb04d337c17",TRUE
126,126,"39826ab68245a00e85853fb04d337c17",TRUE
126,126,"39826ab68245a00e85853fb04d337c17",TRUE
126,126,"39826ab68245a00e85853fb04d337c17",TRUE
17,17,"39826ab68245a00e85853fb04d337c17",TRUE
2,2,"39826ab68245a00e85853fb04d337c17",TRUE
2,2,"39826ab68245a00e85853fb04d337c17",TRUE
21,21,"39826ab68245a00e85853fb04d337c17",TRUE
22,22,"39826ab68245a00e85853fb04d337c17",TRUE
233,233,"39826ab68245a00e85853fb04d337c17",TRUE
233,233,"39826ab68245a00e85853fb04d337c17",TRUE
233,233,"39826ab68245a00e85853fb04d337c17",TRUE

OUTPUT CSV FILE (NO CHANGE):-

0.3,0.3,"39826ab68245a00e85853fb04d337c17",TRUE
10,10,"39826ab68245a00e85853fb04d337c17",TRUE
10,10,"39826ab68245a00e85853fb04d337c17",TRUE
126,126,"39826ab68245a00e85853fb04d337c17",TRUE
126,126,"39826ab68245a00e85853fb04d337c17",TRUE
126,126,"39826ab68245a00e85853fb04d337c17",TRUE
17,17,"39826ab68245a00e85853fb04d337c17",TRUE
2,2,"39826ab68245a00e85853fb04d337c17",TRUE
2,2,"39826ab68245a00e85853fb04d337c17",TRUE
21,21,"39826ab68245a00e85853fb04d337c17",TRUE
22,22,"39826ab68245a00e85853fb04d337c17",TRUE
233,233,"39826ab68245a00e85853fb04d337c17",TRUE
233,233,"39826ab68245a00e85853fb04d337c17",TRUE
233,233,"39826ab68245a00e85853fb04d337c17",TRUE

Could you please add below method to the class ExternalSort

Currently the code does not allow someone to use his own BufferWriter class.
The current method "mergeSortedFiles" which supports BufferWriter has also another input parameter referring to a private class and therefore cannot be used (BinaryFileBuffer).

In order to increase flexibility with your code, could you please add below method:

        public static long mergeSortedFiles(List<File> files, BufferedWriter fbw,
                final Comparator<String> cmp, Charset cs, boolean distinct,
                boolean append, boolean usegzip) throws IOException {
                ArrayList<BinaryFileBuffer> bfbs = new ArrayList<>();
                for (File f : files) {
                        final int BUFFERSIZE = 2048;
                        InputStream in = new FileInputStream(f);
                        BufferedReader br;
                        if (usegzip) {
                                br = new BufferedReader(
                                        new InputStreamReader(
                                                new GZIPInputStream(in,
                                                        BUFFERSIZE), cs));
                        } else {
                                br = new BufferedReader(new InputStreamReader(
                                        in, cs));
                        }

                        BinaryFileBuffer bfb = new BinaryFileBuffer(br);
                        bfbs.add(bfb);
                }
                long rowcounter = mergeSortedFiles(fbw, cmp, distinct, bfbs);
                for (File f : files) {
                        f.delete();
                }
                return rowcounter;
        }

Thanks and regards
Guillaume

README Code Sample argument confusing

According to CsvSortOptions.Builder the order of constructor arguments is:

public Builder(long datalength, Comparator<CSVRecord> cmp, int maxTmpFiles, long maxMemory)

The README code sample states:

import com.google.code.externalsorting.CsvExternalSort;
import com.google.code.externalsorting.CsvSortOptions;

// provide a comparator
Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
//... inputfile: input file name
//... outputfile: output file name
//...provide sort options
CsvSortOptions sortOptions = new CsvSortOptions
				.Builder(CsvExternalSort.DEFAULTMAXTEMPFILES, comparator, 1, CsvExternalSort.estimateAvailableMemory())
				.charset(Charset.defaultCharset())
				.distinct(false)
				.numHeader(1)
				.skipHeader(false)
				.format(CSVFormat.DEFAULT)
				.build();

// next two lines sort the lines from inputfile to outputfile
List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions);;
CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true);

Should the first argument use another constant and the third argument replaced with CsvExternalSort.DEFAULTMAXTEMPFILES?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.