Paleo

Immutable Java 8 data frames with typed columns.

A data frame is composed of 0..n named columns, which all contain the same number of row values. Each column has a fixed data type, which allows for type-safe value access. The following column types are supported out-of-the-box:

  • Int: Primitive int values

  • Long: Primitive long values

  • Double: Primitive double values

  • Boolean: Primitive boolean values

  • String: java.lang.String values

  • Timestamp: java.time.Instant values

  • Category: Categorical String values (aka factors)
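To illustrate what "categorical" means for the Category type listed above, here is a minimal sketch of factor-style storage in plain Java: each distinct string is stored once and every row keeps only a small integer code. FactorSketch and its members are hypothetical names for illustration; this is not Paleo's CategoryColumn implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of factor-style storage (not a Paleo class).
final class FactorSketch {
    // Distinct values in order of first appearance.
    final List<String> categories = new ArrayList<>();
    // One integer code per row, pointing into the categories list.
    final int[] codes;

    FactorSketch(String... values) {
        Map<String, Integer> index = new LinkedHashMap<>();
        codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = index.get(values[i]);
            if (code == null) {
                // First occurrence: register a new category.
                code = categories.size();
                index.put(values[i], code);
                categories.add(values[i]);
            }
            codes[i] = code;
        }
    }

    // Resolves the code of a row back to its string value.
    String get(int row) {
        return categories.get(codes[row]);
    }

    public static void main(String[] args) {
        FactorSketch colors = new FactorSketch("Yellow", "Blue", "Yellow", "Green");
        System.out.println(colors.categories);             // [Yellow, Blue, Green]
        System.out.println(Arrays.toString(colors.codes)); // [0, 1, 0, 2]
    }
}
```

With many rows and few distinct values, such a layout typically needs far less memory than one String reference per row.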

Columns can be created via simple factory methods, through a fluent builder API, or from text files.

Hello Paleo

The paleo-core module provides all classes to identify, create, and structure typed columns:

// Type-safe column identifiers
final StringColumnId NAME = StringColumnId.of("Name");
final CategoryColumnId COLOR = CategoryColumnId.of("Color");
final DoubleColumnId SERVING_SIZE = DoubleColumnId.of("Serving Size (g)");

// Convenient column creation
StringColumn nameColumn = StringColumn.ofAll(NAME, "Banana", "Blueberry", "Lemon", "Apple");
CategoryColumn colorColumn = CategoryColumn.ofAll(COLOR, "Yellow", "Blue", "Yellow", "Green");
DoubleColumn servingSizeColumn = DoubleColumn.ofAll(SERVING_SIZE, 118, 148, 83, 182);

// Grouping columns into a data frame
DataFrame dataFrame = DataFrame.ofAll(nameColumn, colorColumn, servingSizeColumn);

// Typed random access to individual values (based on rowIndex / columnId)
String lemon = dataFrame.getValueAt(2, NAME);
double appleServingSize = dataFrame.getValueAt(3, SERVING_SIZE);

// Typed stream-based access to all values
DoubleStream servingSizes = servingSizeColumn.valueStream();
double maxServingSize = servingSizes.summaryStatistics().getMax();

// Smart column implementations
Set<String> colors = colorColumn.getCategories();

Parsing From Text / File

The paleo-io module parses data frames from tab-delimited or comma-separated text representations. The structure of the data frame (i.e. the names and types of its columns) can be defined in one of two ways:

Header Rows

In its simplest format, the tab-delimited text representation contains the column names and types directly in a header. The first row specifies the column names and the second row specifies the column types; the actual data starts on the third row:

1 Name    Color
2 String  Category
3 Banana  Yellow
...
n Apple   Green

The contents can then be parsed via Parser.tsv(Reader in) or Parser.csv(Reader in), for example:

final String EXAMPLE =
            "Name\tColor\tServing Size (g)\n" +
            "String\tCategory\tDouble\n" +
            "Banana\tYellow\t118\n" +
            "Blueberry\tBlue\t148\n" +
            "Lemon\tYellow\t83\n" +
            "Apple\tGreen\t182";

DataFrame dataFrame = Parser.tsv(new StringReader(EXAMPLE));
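To make the header-row layout concrete, the following plain-Java sketch reads the two header rows by hand and separates them from the data rows. It only illustrates the file format described above. HeaderRowSketch, readHeader, and readRows are hypothetical names, and the code does not reflect how Parser.tsv is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only (not part of Paleo).
final class HeaderRowSketch {

    // Maps each column name (row 1) to its declared type (row 2), in column order.
    static Map<String, String> readHeader(String text) {
        String[] lines = text.split("\r?\n");
        if (lines.length < 2) {
            throw new IllegalArgumentException("Expected a name row and a type row");
        }
        String[] names = lines[0].split("\t", -1);
        String[] types = lines[1].split("\t", -1);
        if (names.length != types.length) {
            throw new IllegalArgumentException("Name row and type row differ in length");
        }
        Map<String, String> header = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            header.put(names[i], types[i]);
        }
        return header;
    }

    // Returns the data rows (third row onward), each split into raw cell strings.
    static List<String[]> readRows(String text) {
        String[] lines = text.split("\r?\n");
        List<String[]> rows = new ArrayList<>();
        for (int i = 2; i < lines.length; i++) {
            rows.add(lines[i].split("\t", -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        String example =
                "Name\tColor\tServing Size (g)\n" +
                "String\tCategory\tDouble\n" +
                "Banana\tYellow\t118\n" +
                "Blueberry\tBlue\t148\n" +
                "Lemon\tYellow\t83\n" +
                "Apple\tGreen\t182";
        // {Name=String, Color=Category, Serving Size (g)=Double}
        System.out.println(readHeader(example));
        // 4
        System.out.println(readRows(example).size());
    }
}
```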

External JSON Schema

Generally it is advisable to separate the structural information from the actual data. Paleo therefore supports the definition of an external JSON schema. The format is inspired by the JSON Table Schema:

{
  "title": "Example Schema",
  "dataFileName": "data.txt",
  "charsetName": "ISO-8859-1", // (1)
  "fields": [
    {
      "name": "Name",
      "type": "String"
    },
    {
      "name": "Color",
      "type": "Category"
    },
    {
      "name": "Serving Size",
      "type": "Double",
      "metaData": { "unit": "g" }
    },
    {
      "name": "Exemplary Date",
      "type": "Timestamp",
      "format": "yyyyMMddHHmmss"
    }
  ]
}
  (1) Optionally specify an encoding for the data file. The // (1) marker in the listing is a callout, not part of the schema.
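The format value of the Timestamp field in the schema above looks like a java.time pattern. The sketch below shows how a value in that pattern could be turned into a java.time.Instant using only the JDK. It assumes that the pattern follows DateTimeFormatter syntax and that the text denotes UTC. Neither assumption is confirmed by this document, and TimestampFormatSketch and parseUtc are hypothetical names.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only (not part of Paleo).
final class TimestampFormatSketch {

    // Parses text with the given pattern and interprets the result as UTC (an assumption).
    static Instant parseUtc(String text, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
        return LocalDateTime.parse(text, formatter).toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        // Prints 2016-02-29T12:30:00Z
        System.out.println(parseUtc("20160229123000", "yyyyMMddHHmmss"));
    }
}
```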

Dedicated parsing methods allow you to first parse the schema from JSON and then use it to create a DataFrame. A given base directory is used to load the actual data (i.e. to resolve the location of the configured dataFileName):

Schema schema = Schema.parseJson(new StringReader(EXAMPLE_SCHEMA));
DataFrame dataFrame = Parser.tsv(schema, baseDir);

Working With Parsed Data Frames

Once a DataFrame instance has been parsed, its data can be accessed through a type-safe API:

final String EXAMPLE =
            "Name\tColor\tServing Size (g)\n" +
            "String\tCategory\tDouble\n" +
            "Banana\tYellow\t118\n" +
            "Blueberry\tBlue\t148\n" +
            "Lemon\tYellow\t83\n" +
            "Apple\tGreen\t182";

DataFrame dataFrame = Parser.tsv(new StringReader(EXAMPLE));

// Lookup typed identifiers by column index
final StringColumnId NAME = dataFrame.getColumnId(0, ColumnType.STRING);
final CategoryColumnId COLOR = dataFrame.getColumnId(1, ColumnType.CATEGORY);
final DoubleColumnId SERVING_SIZE = dataFrame.getColumnId(2, ColumnType.DOUBLE);

// Use identifier to access columns & values
StringColumn nameColumn = dataFrame.getColumn(NAME);
IndexedSeq<String> nameValues = nameColumn.getValues();

// ... or access individual values via row index / column id
String yellow = dataFrame.getValueAt(2, COLOR);

Usage

All modules are available via Bintray/JCenter.

Repository Configuration

Gradle:

repositories {
    jcenter()
}

Maven settings.xml:

<repository>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
    <id>central</id>
    <name>bintray</name>
    <url>http://jcenter.bintray.com</url>
</repository>

Using the paleo-core module

Gradle:

compile 'ch.netzwerg:paleo-core:0.13.2'

Maven:

<dependency>
    <groupId>ch.netzwerg</groupId>
    <artifactId>paleo-core</artifactId>
    <version>0.13.2</version>
    <type>jar</type>
</dependency>

Using the paleo-io module

Optional (requires paleo-core)

Gradle:

compile 'ch.netzwerg:paleo-io:0.13.2'

Maven:

<dependency>
    <groupId>ch.netzwerg</groupId>
    <artifactId>paleo-io</artifactId>
    <version>0.13.2</version>
    <type>jar</type>
</dependency>

Vavr

Paleo makes extensive use of the Vavr library. Vavr provides collection classes that offer functionality well beyond the standard JDK collections. Working with the Vavr classes is highly recommended, but it is always possible to back out and convert to JDK standards (e.g. with toJavaList()).

Factory-Methods vs. Builders

Paleo tries to strike the best compromise between parsing speed, index-based value lookup, and memory usage. That is why it offers two ways to create columns. Static factory methods allow for convenient construction if all values are already available. Individual column builders should be used if a column is constructed by adding values one after another. Please be aware that the builders are not thread-safe.
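The difference between the two creation styles can be sketched in plain Java. DoubleValues and its Builder are hypothetical stand-ins for illustration, not Paleo's classes: the factory copies a complete array once, while the builder grows a buffer as values arrive and freezes it on build().

```java
import java.util.Arrays;

// Hypothetical stand-in for an immutable column of primitive doubles (not a Paleo class).
final class DoubleValues {
    private final double[] values;

    private DoubleValues(double[] values) {
        this.values = values;
    }

    // Factory style: all values are known up front, so a single defensive copy suffices.
    static DoubleValues ofAll(double... values) {
        return new DoubleValues(values.clone());
    }

    static Builder builder() {
        return new Builder();
    }

    double get(int index) {
        return values[index];
    }

    int size() {
        return values.length;
    }

    // Builder style: values arrive one at a time and the buffer doubles when full.
    // Like the builders described above, this sketch is not thread-safe.
    static final class Builder {
        private double[] buffer = new double[8];
        private int count;

        Builder add(double value) {
            if (count == buffer.length) {
                buffer = Arrays.copyOf(buffer, count * 2);
            }
            buffer[count++] = value;
            return this;
        }

        // Copies the filled part of the buffer so later add() calls cannot alter the result.
        DoubleValues build() {
            return new DoubleValues(Arrays.copyOf(buffer, count));
        }
    }

    public static void main(String[] args) {
        DoubleValues viaFactory = DoubleValues.ofAll(118, 148, 83, 182);
        DoubleValues viaBuilder = DoubleValues.builder().add(118).add(148).add(83).add(182).build();
        System.out.println(viaFactory.size() + " " + viaBuilder.get(2)); // 4 83.0
    }
}
```

The builder pays for occasional buffer growth and a final copy, which is why the factory is the more convenient choice when the values already exist as a whole.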

Why The Name?

The backing data structures are all about raw values and primitive types, which somehow reminded me of the paleo diet.

Contributions

Pull requests are very welcome. Please note that by submitting a pull request, you agree to license your contribution under the "Apache License Version 2.0".
