krangl's Introduction

krangl

krangl is no longer developed. It was a wonderful experiment, but it has been superseded by the more complete, more usable and more modern https://github.com/Kotlin/dataframe.

krangl is a {K}otlin library for data w{rangl}ing. By implementing a grammar of data manipulation using a modern functional-style API, it lets you filter, transform, aggregate and reshape tabular data.

krangl is heavily inspired by the amazing dplyr for R. It is written in Kotlin and excels in Kotlin, but also emphasizes good Java interop. It mimics the API of dplyr while carefully adding more typed constructs where possible.

If you're not sure how to proceed, check out the "krangl in 10 minutes" section in the krangl user guide.

Installation

To get started simply add it as a dependency to your build.gradle:

repositories {
    mavenCentral() 
}

dependencies {
    implementation "com.github.holgerbrandl:krangl:0.18.4"
}

Declaring the repository is purely optional as it is the default already.
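
For projects that use the Gradle Kotlin DSL (a build.gradle.kts file) instead of the Groovy build.gradle shown above, the same declaration can be sketched as follows. The coordinates are the ones given above; nothing else about the build is assumed.

```kotlin
// build.gradle.kts: Kotlin DSL sketch equivalent to the Groovy snippet above
repositories {
    mavenCentral()
}

dependencies {
    // same group, artifact and version as in the Groovy example
    implementation("com.github.holgerbrandl:krangl:0.18.4")
}
```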

You can also use JitPack with Maven or Gradle to build the latest snapshot as a dependency in your project.

repositories {
    maven { url 'https://jitpack.io' }
}
dependencies {
    implementation 'com.github.holgerbrandl:krangl:-SNAPSHOT'
}

To build and install it into your local maven cache, simply clone the repo and run

./gradlew install

Features

  • Filter, transform, aggregate and reshape tabular data

  • Modern, user-friendly and easy-to-learn data-science API

  • Reads from plain and compressed tsv, csv, json, or any delimited format, with or without header, from local or remote sources

  • Supports grouped operations

  • Ships with JDBC support

  • Tables can contain atomic columns (int, double, boolean) as well as object columns

  • Reshape tables from wide to long and back

  • Table joins (left, right, semi, inner, outer)

  • Cross tabulation

  • Descriptive statistics (mean, min, max, median, ...)

  • Functional API inspired by dplyr, pandas, and Kotlin stdlib

  • many more...

krangl is just about data wrangling. For data visualization we recommend kravis which seamlessly integrates with krangl and implements a grammar to build a wide variety of plots.

Examples

// Read data-frame from disk
val iris = DataFrame.readTSV("data/iris.txt")


// Create data-frame in memory
val df: DataFrame = dataFrameOf(
    "first_name", "last_name", "age", "weight")(
    "Max", "Doe", 23, 55,
    "Franz", "Smith", 23, 88,
    "Horst", "Keanes", 12, 82
)

// Or from csv
// val otherDF = DataFrame.readCSV("path/to/file")

// Print rows
df                              // with implicit string conversion using default options
df.print(colNames = false)      // with custom printing options

// Print structure
df.schema()


// Add columns with mutate
// by adding constant values as new column
df.addColumn("salary_category") { 3 }

// by doing basic column arithmetics
df.addColumn("age_3y_later") { it["age"] + 3 }

// Note: krangl dataframes are immutable so we need to (re)assign results to preserve changes.
val newDF = df.addColumn("full_name") { it["first_name"] + " " + it["last_name"] }

// Also feel free to mix types here since krangl overloads arithmetic operators like + for dataframe-columns
df.addColumn("user_id") { it["last_name"] + "_id" + rowNumber }

// Create new attributes with string operations like matching, splitting or extraction.
df.addColumn("with_anz") { it["first_name"].asStrings().map { it!!.contains("anz") } }

// Note: krangl is using 'null' as missing value, and provides convenience methods to process non-NA bits
df.addColumn("first_name_initial") { it["first_name"].map<String>{ it.first() } }

// or add multiple columns at once
df.addColumns(
    "age_plus3" to { it["age"] + 3 },
    "initials" to { it["first_name"].map<String> { it.first() } concat it["last_name"].map<String> { it.first() } }
)


// Sort your data with sortedBy
df.sortedBy("age")
// and add secondary sorting attributes as varargs
df.sortedBy("age", "weight")
df.sortedByDescending("age")
df.sortedBy { it["weight"].asInts() }


// Subset columns with select
df.select2 { it is IntCol } // functional style column selection
df.select("last_name", "weight")    // positive selection
df.remove("weight", "age")  // negative selection
df.select({ endsWith("name") })    // selector mini-language


// Subset rows with vectorized filter
df.filter { it["age"] eq 23 }
df.filter { it["weight"] gt 50 }
df.filter({ it["last_name"].isMatching { startsWith("Do")  }})

// In case vectorized operations are not possible or available we can also filter tables by row
// which allows for scalar operators
df.filterByRow { it["age"] as Int > 5 }
df.filterByRow { (it["age"] as Int).rem(10) == 0 } // round birthdays :-)


// Summarize

// do simple cross tabulations
df.count("age", "last_name")

// ... or calculate single summary statistic
df.summarize("mean_age" to { it["age"].mean(true) })

// ... or multiple summary statistics
df.summarize(
    "min_age" to { it["age"].min() },
    "max_age" to { it["age"].max() }
)

// for the sake of R and Python adoptability you can also use `=` here
df.summarize(
    "min_age" `=` { it["age"].min() },
    "max_age" `=` { it["age"].max() }
)

// Grouped operations
val groupedDf: DataFrame = df.groupBy("age") // or provide multiple grouping attributes with varargs
val sumDF = groupedDf.summarize(
    "mean_weight" to { it["weight"].mean(removeNA = true) },
    "num_persons" to { nrow }
)

// Optionally ungroup the data
sumDF.ungroup().print()

// generate object bindings for kotlin.
// Unfortunately the syntax is a bit odd since we can not access the variable name by reflection
sumDF.printDataClassSchema("Person")

// This will generate and print the following conversion code:
data class Person(val age: Int, val mean_weight: Double, val num_persons: Int)

val records = sumDF.rows.map { row -> Person(row["age"] as Int, row["mean_weight"] as Double, row["num_persons"] as Int) }

// Now we can use the krangl result table in a strongly typed way
records.first().mean_weight

// Vice versa we can also convert an existing set of objects into a data-frame
val recordsDF = records.asDataFrame()
recordsDF.print()

// to populate a data-frame with selected properties only, we can do
val deparsedDF = records.deparseRecords { mapOf("age" to it.age, "weight" to it.mean_weight) }

Documentation

krangl is not yet mature, full of bugs and its API is in constant flux. Nevertheless, feel welcome to submit pull-requests or tickets, or simply get in touch via gitter (see button on top).

  • Krangl User Guide for detailed information about the API and usage examples.
  • API Docs for detailed information about the API including many usage examples
  • TBD krangl Cheat Sheet

Another great introduction to data science with Kotlin was presented at KotlinConf 2019 by Roman Belov from JetBrains.

How to contribute?

Feel welcome to post ideas, suggestions and criticism to our tracker.

We always welcome pull requests. :-)

You could also show your spiritual support by upvoting krangl here on github.

Also see

  • Developer Information with technical notes & details about how to build, test, release and improve krangl
  • Roadmap complementing the tracker with a backlog

Also, there are a few issues in the IDE itself which limit the applicability/usability of krangl, so you may want to vote for

  • KT-24789 "Unresolved reference" when running a script which is a symlink to a script outside of source roots
  • KT-12583 IDE REPL should run in project root directory
  • KT-11409 Allow to "Send Selection To Kotlin Console"
  • KT-13319 Support ":paste" for pasting multi-line expressions in REPL
  • KT-21224 REPL output is not aligned with input

krangl's People

Contributors

benmccann, crystalsplitter, davidpedrosa, holgerbrandl, jcheungshred, kopilov, leanderg, leandroc89, melastmohican, nikitinas, oliviercavadenti, robertperrotta, sorokod, stangls, tcasstevens, thomasnield, tokuhirom, tuesd4y

krangl's Issues

Turning Column into LocalDate

I'm having an issue creating a new column that turns date text into a Java 8 LocalDate.

package com.swa.np.myproject

import krangl.*
import java.time.LocalDate
import java.time.format.DateTimeFormatter

fun main(args: Array<String>) {

    val df = DataFrame.fromCSV("C:\\Users\\e98594\\Desktop\\Iter1R1 - New Mktg Day Parameters_Inputs\\Globals\\Connection Builder_V1.csv")

    val converted = df.mutate(TableFormula("Converted EffectiveDate") {
        it["EffectiveDate"].asStrings().map { it?.substring(0,8) }.filterNotNull().map { LocalDate.parse(it, DateTimeFormatter.ofPattern("M/d/yyyy")) }
    })

    println(converted)
}

ERROR:

Exception in thread "main" java.lang.UnsupportedOperationException
	at krangl.SimpleDataFrameKt.handleListErasure(SimpleDataFrame.kt:391)
	at krangl.SimpleDataFrameKt.anyAsColumn(SimpleDataFrame.kt:355)
	at krangl.SimpleDataFrame.mutate(SimpleDataFrame.kt:195)
	at com.swa.np.myproject.TestKt.main(Test.kt:16)

It seems to work fine if I map() it to anything other than a LocalDate.

package com.swa.np.myproject

import krangl.*

fun main(args: Array<String>) {

    val df = DataFrame.fromCSV("C:\\Users\\e98594\\Desktop\\Iter1R1 - New Mktg Day Parameters_Inputs\\Globals\\Connection Builder_V1.csv")
    df.print()

    val converted = df.mutate(TableFormula("Converted EffectiveDate") {
        it["EffectiveDate"].asStrings().map { it?.substring(0,8) }
    })

    println(converted)
}
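
Independent of krangl, the date-parsing step from the report can be sketched with the standard library alone. This is only an illustration under assumptions: the cell format (a date optionally followed by a space and a time, e.g. `1/5/2018 0:00`) is a guess, since the report does not show the data, and `parseDates` is a hypothetical helper. Unlike the `filterNotNull()` variant above, it keeps one entry per input row, so the result has the same length as the source column. Whether the row count or the `LocalDate` element type triggers the exception is not established here.

```kotlin
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Pattern taken from the report above; accepts one- or two-digit months and days.
val effectiveDateFormat: DateTimeFormatter = DateTimeFormatter.ofPattern("M/d/yyyy")

// Hypothetical helper: parse each cell and keep null for missing values,
// so the output has exactly as many entries as the input.
fun parseDates(cells: List<String?>): List<LocalDate?> =
    cells.map { cell ->
        // Assumption: anything after the first space (e.g. a time) can be dropped.
        cell?.trim()?.substringBefore(' ')?.let { LocalDate.parse(it, effectiveDateFormat) }
    }

fun main() {
    val raw = listOf("1/5/2018 0:00", null, "12/25/2018")
    println(parseDates(raw)) // [2018-01-05, null, 2018-12-25]
}
```

Cutting at the first space rather than at a fixed `substring(0,8)` avoids truncating dates with two-digit months or days, which are longer than eight characters.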

The output of "unite" is inconsistent with the one in tidyr

If I run the following program:
val df: DataFrame = dataFrameOf(
"id", "Year", "A", "B")(
1, 2007, 5, 10,
1, 2008, 2, 0,
1, 2009, 3, 50,
2, 2007, 7, 13,
2, 2008, 5, 17,
2, 2009, 6, 17
)
var dfUnite = df.unite("newCol", mutableListOf("id", "A"))
dfUnite.print()

I will get the following output from krangl:
Year B newCol
2007 10 1_5
2008 0 1_2
2009 50 1_3
2007 13 2_7
2008 17 2_5
2009 17 2_6

However, if I use the same API (unite) in tidyr, I get the following output:
p4_input1 %>% unite(newCol, id, A)
newCol Year B
1_5 2007 10
1_2 2008 0
1_3 2009 50
2_7 2007 13
2_5 2008 17
2_6 2009 17

It seems that "unite" in tidyr inserts the new column (newCol) at the position of the first column (id) being united, while krangl always puts the new column in the last position.

It would be nice if krangl could generate output that is consistent with the original dplyr/tidyr library.
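
The placement rule described above can be sketched on a toy table, here an ordered map from column name to values. This is not krangl's implementation; `uniteInPlace` is a hypothetical helper that only illustrates putting the united column where the first united column used to be.

```kotlin
// Hypothetical helper on a toy table (ordered map: column name -> values).
// The united column replaces the first united column it encounters; the
// remaining united columns are dropped, all other columns keep their order.
fun uniteInPlace(
    table: Map<String, List<Any?>>,
    newCol: String,
    cols: List<String>,
    sep: String = "_"
): LinkedHashMap<String, List<Any?>> {
    val nRows = table.values.firstOrNull()?.size ?: 0
    val united: List<Any?> = (0 until nRows).map { row ->
        cols.joinToString(sep) { col -> table.getValue(col)[row].toString() }
    }
    val result = LinkedHashMap<String, List<Any?>>()
    var inserted = false
    for ((name, values) in table) {
        if (name in cols) {
            if (!inserted) {
                result[newCol] = united // take the slot of the first united column
                inserted = true
            }
        } else {
            result[name] = values
        }
    }
    return result
}

fun main() {
    val df = linkedMapOf<String, List<Any?>>(
        "id" to listOf(1, 1, 2),
        "Year" to listOf(2007, 2008, 2007),
        "A" to listOf(5, 2, 7),
        "B" to listOf(10, 0, 13)
    )
    val united = uniteInPlace(df, "newCol", listOf("id", "A"))
    println(united.keys)      // [newCol, Year, B]
    println(united["newCol"]) // [1_5, 1_2, 2_7]
}
```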

df.sortedByDescending(...) produces unexpected (unsorted) results

I've been getting unusual results from DataFrame.sortedByDescending() whereas sortedBy() worked as expected. The following two examples were taken from the main page here and slightly augmented. I wanted to sort dataframe rows by the "weight" column in descending order:

Example (Integers)

fun main(args: Array<String>) {
    val df = dataFrameOf("first_name", "last_name", "age", "weight")(
            "Max", "Doe", 23, 55,
            "Franz", "Smith", 23, 10,
            "Horst", "Keanes", 12, 0,
            "Horst", "Keanes", 12, 0,
            "Horst", "Keanes", 12, 1,
            "Horst", "Keanes", 12, 50,
            "Horst", "Keanes", 12, 2
    )

    df.sortedByDescending("weight").print()
}

Result

A DataFrame: 7 x 4
first_name   last_name   age   weight
       Max         Doe    23       55
     Horst      Keanes    12        0
     Horst      Keanes    12        2
     Horst      Keanes    12       50
     Horst      Keanes    12        1
     Franz       Smith    23       10
     Horst      Keanes    12        0

Example (Double + negative)

fun main(args: Array<String>) {
    val df = dataFrameOf("first_name", "last_name", "age", "weight")(
            "Max", "Doe", 23, 55.0,
            "Franz", "Smith", 23, -10.0,
            "Horst", "Keanes", 12, 0.0,
            "Horst", "Keanes", 12, 0.10,
            "Horst", "Keanes", 12, 1.0,
            "Horst", "Keanes", 12, -0.05,
            "Horst", "Keanes", 12, -0.02
    )

    df.sortedByDescending("weight").print()
}

Result

A DataFrame: 7 x 4
first_name   last_name   age   weight
       Max         Doe    23       55
     Horst      Keanes    12    -0.02
     Horst      Keanes    12      0.1
     Horst      Keanes    12        0
     Franz       Smith    23      -10
     Horst      Keanes    12    -0.05
     Horst      Keanes    12        1

In both cases, the "weight" column in the printout has been reordered, but it is not sorted at all. With sortedBy(), the "weight" column was sorted properly.

This looks like a bug to me.
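
For reference, the ordering one would expect from the second example can be reproduced with the Kotlin standard library alone. The sketch below does not use krangl; `PersonRow` is a hypothetical stand-in for a data-frame row.

```kotlin
// Hypothetical stand-in for one row of the example table.
data class PersonRow(val firstName: String, val lastName: String, val age: Int, val weight: Double)

// Sort rows by weight, largest first. The standard-library sort is stable,
// so rows with equal weights keep their original relative order.
fun sortByWeightDescending(rows: List<PersonRow>): List<PersonRow> =
    rows.sortedByDescending { it.weight }

fun main() {
    val rows = listOf(
        PersonRow("Max", "Doe", 23, 55.0),
        PersonRow("Franz", "Smith", 23, -10.0),
        PersonRow("Horst", "Keanes", 12, 0.0),
        PersonRow("Horst", "Keanes", 12, 0.10),
        PersonRow("Horst", "Keanes", 12, 1.0),
        PersonRow("Horst", "Keanes", 12, -0.05),
        PersonRow("Horst", "Keanes", 12, -0.02)
    )
    println(sortByWeightDescending(rows).map { it.weight })
    // [55.0, 1.0, 0.1, 0.0, -0.02, -0.05, -10.0]
}
```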

How to append a total row

What's the best way to total a column?
Say you have a df like this:

| Name  | Duration | Color  |
-----------------------------
| Foo   | 100      | Blue   |
| Goo   | 200      | Red    |
| Bar   | 300      | Yellow |

I don't see a sum() or total() method on DataCol - only mean, min, etc.
I can total the column myself like so:
val total = df["duration"].asInts().sumBy { it -> it!! }
but how do I append this to the data frame to end up with this:

| Name  | Duration | Color  |
-----------------------------
| Foo   | 100      | Blue   |
| Goo   | 200      | Red    |
| Bar   | 300      | Yellow |
| Total | 600      |        |
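
The desired result can be sketched without krangl on a plain list of rows. `Entry` and `withTotalRow` are hypothetical names; the color is nullable so that the total row can leave it empty.

```kotlin
// Hypothetical row type for the table above.
data class Entry(val name: String, val duration: Int, val color: String?)

// Return the rows plus one extra "Total" row whose duration is the column sum.
fun withTotalRow(rows: List<Entry>): List<Entry> =
    rows + Entry("Total", rows.map { it.duration }.sum(), null)

fun main() {
    val rows = listOf(
        Entry("Foo", 100, "Blue"),
        Entry("Goo", 200, "Red"),
        Entry("Bar", 300, "Yellow")
    )
    withTotalRow(rows).forEach { println(it) }
    // last line printed: Entry(name=Total, duration=600, color=null)
}
```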

Add options to DataFrame.printDataClassSchema

The code generated by the DataFrame.printDataClassSchema method is unusable when column headers with spaces in them are used.

I would suggest either wrapping the column headers in backticks (`) or converting them to the regular Kotlin naming convention (e.g. User Id to userId).
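
Both suggestions can be sketched as small string helpers. These are hypothetical functions, not part of krangl, and they ignore edge cases such as Kotlin keywords or headers that start with a digit.

```kotlin
// Hypothetical helper: "User Id" -> "userId", "first_name" -> "firstName".
fun toCamelCase(header: String): String {
    val words = header.split(Regex("[^A-Za-z0-9]+")).filter { it.isNotEmpty() }
    if (words.isEmpty()) return "_"
    return words.first().lowercase() +
        words.drop(1).joinToString("") { w -> w.lowercase().replaceFirstChar { it.uppercase() } }
}

// Hypothetical helper: wrap a header in backticks unless it is already a plain identifier.
fun toBacktickedName(header: String): String =
    if (Regex("[A-Za-z_][A-Za-z0-9_]*").matches(header)) header else "`$header`"

fun main() {
    println(toCamelCase("User Id"))      // userId
    println(toBacktickedName("User Id")) // `User Id`
    println(toBacktickedName("age"))     // age
}
```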

Provide a License

Hey there.
Really liking the approach of krangl. I would like to use it in one of my projects, but there is no LICENSE provided.
Would you be willing to publish it under, for example, the MIT or Apache License?
Best Regards!

java.lang.NullPointerException when using DataFrame.fromCSV()

Hi,

here is my code :

import krangl.*

fun test(pathCSV: String) {
    // Create data-frame  in memory
    val otherDF = DataFrame.fromCSV(pathCSV)
    otherDF.print(colNames = true)
}

I am just trying to load a CSV file into a data frame from its file path.
The file looks like this (head -n 8 iris.csv):

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa

And I get this error at runtime:

Exception in thread "main" java.lang.NullPointerException
        at krangl.TableIOKt.fromCSV(TableIO.kt:57)
        at krangl.TableIOKt.fromCSV(TableIO.kt:35)
        at krangl.TableIOKt.fromCSV$default(TableIO.kt:25)
        at krangl.TableIOKt.fromCSV(TableIO.kt:18)
        at com.mycompany.dataframes.DataFrameKt.test(DataFrame.kt:15)
        at com.mycompany.controler.ControlerKt.main(Controler.kt:34)

I would appreciate any answer or advice.
Thanks.
P.S.: I have already read the file with some Python code, using the pandas library.

Documentation

Can you add a more friendly installation guide? E.g. for people that are very new to kotlin and gradle?

TableFormula shouldn't have to be specified

I don't know if I agree the name mutate() is optimal because it implies the DataFrame is mutable, which does not seem to be the case because it actually yields a new DataFrame.

package com.swa.np.myproject

import krangl.DataFrame
import krangl.TableFormula
import krangl.fromCSV

fun main(args: Array<String>) {
    val df = DataFrame.fromCSV("C:\\Users\\e98594\\Desktop\\Iter1R1 - New Mktg Day Parameters_Inputs\\Globals\\Connection Builder_V1.csv")

    val newDf = df.mutate(TableFormula("Test") {3})

    print(newDf)
}

Also, the documented arguments in the README do not work. You have to explicitly provide a TableFormula so I may put in a PR later to fix this.

Provide more elegant object bindings

Something along these lines:

data class Person(val name:String, val age:Int)

val persons: Iterable<Person> = df.mapTo<Person>()

Internal impl could use reflection on reified type:

...
T::class.constructors.first().call(args)
...

Add unit tests to support this for all basic types including object and nested dfs.
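
The reflection idea can be sketched on plain lists of values. Note the swap: instead of `T::class.constructors` (which needs the kotlin-reflect library at runtime), this sketch uses Java reflection via `T::class.java`, and it matches values to constructor parameters by position only. `mapRowsTo` is a hypothetical name and operates on a list of rows, not on a krangl DataFrame.

```kotlin
data class Person(val name: String, val age: Int)

// Hypothetical sketch (not krangl API): treat each inner list as one row and pass
// its values, in order, to the constructor of T with the same number of parameters.
inline fun <reified T : Any> List<List<Any?>>.mapRowsTo(): List<T> = map { row ->
    val ctor = T::class.java.constructors.first { it.parameterCount == row.size }
    ctor.newInstance(*row.toTypedArray()) as T
}

fun main() {
    val rows = listOf(listOf<Any?>("Max", 23), listOf<Any?>("Franz", 42))
    val persons: List<Person> = rows.mapRowsTo()
    println(persons) // [Person(name=Max, age=23), Person(name=Franz, age=42)]
}
```

Positional matching is fragile when column order and constructor order differ; a name-based mapping would need parameter names, which is where kotlin-reflect would come in.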

CSV files without header take first line as a header always

Apache Commons CSV allows specifying headers.
Iterable<CSVRecord> records = CSVFormat.RFC4180.withHeader("ID", "CustomerNo", "Name").parse(in);
for (CSVRecord record : records) {
    String id = record.get("ID");
    String customerNo = record.get("CustomerNo");
    String name = record.get("Name");
}
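
The requested behaviour, treating every line as data and taking column names from the caller, can be sketched in plain Kotlin. `parseHeaderless` is a hypothetical helper, not a krangl function, and it splits naively on commas, so quoted fields and escapes are not handled.

```kotlin
// Hypothetical helper: parse header-less CSV text, naming the columns from `header`.
// Naive split on ',' only; quoted fields and escapes are not handled.
fun parseHeaderless(text: String, header: List<String>): List<Map<String, String>> =
    text.lineSequence()
        .filter { it.isNotBlank() }
        .map { line ->
            val cells = line.split(',')
            require(cells.size == header.size) {
                "Expected ${header.size} cells but got ${cells.size}: $line"
            }
            header.zip(cells).toMap()
        }
        .toList()

fun main() {
    val csv = "1,C-17,Ann\n2,C-42,Bob"
    val records = parseHeaderless(csv, listOf("ID", "CustomerNo", "Name"))
    println(records.size)          // 2 (the first line is data, not a header)
    println(records[0]["Name"])    // Ann
}
```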

How to modify data

I'm porting some python code to kotlin and I'm stuck. The python code creates a new column by applying a lambda with some if statements to compare values of columns. For example add a new column and assign a value if the status is true and assign a different value if the status is false. It uses assign, apply, and loc to do this.

I cannot find a way to do this with krangl. Is it possible?

I can't write some code examples because I have to write this on my phone because my company blocks logging into GitHub.

Thanks
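
As a language-level illustration of the pattern described (one derived value per row, chosen by conditions on other columns), here is a sketch in plain Kotlin. The column names and the `Account` type are made up, since the issue gives none, and no krangl call is assumed.

```kotlin
// Hypothetical row type: the issue does not name its columns.
data class Account(val status: Boolean, val balance: Double, val limit: Double)

// Derive one new value per row from conditions on the other columns.
fun riskLabel(row: Account): String = when {
    !row.status -> "inactive"
    row.balance > row.limit -> "over_limit"
    else -> "ok"
}

fun main() {
    val rows = listOf(
        Account(status = true, balance = 120.0, limit = 100.0),
        Account(status = true, balance = 40.0, limit = 100.0),
        Account(status = false, balance = 0.0, limit = 100.0)
    )
    // The new "column" is simply one derived value per row.
    println(rows.map { riskLabel(it) }) // [over_limit, ok, inactive]
}
```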

Hide columns in `print` after exceeding maximum line length

Similar to tibble printing:

> require(nycflights13)
> flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515        2.      830            819
 2  2013     1     1      533            529        4.      850            830
 3  2013     1     1      542            540        2.      923            850
 4  2013     1     1      544            545       -1.     1004           1022
 5  2013     1     1      554            600       -6.      812            837
 6  2013     1     1      554            558       -4.      740            728
 7  2013     1     1      555            600       -5.      913            854
 8  2013     1     1      557            600       -3.      709            723
 9  2013     1     1      557            600       -3.      838            846
10  2013     1     1      558            600       -2.      753            745
# ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
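
The column-hiding step can be sketched as a simple width budget: show leading columns while they fit, and list the rest by name only. `fitColumns` is a hypothetical helper, and the widths are assumed to be precomputed from the header and cell contents.

```kotlin
// Hypothetical helper: choose the leading columns whose widths (plus separators)
// fit into maxWidth; the remaining columns are returned as "hidden".
fun fitColumns(
    widths: List<Pair<String, Int>>,
    maxWidth: Int,
    sep: Int = 1
): Pair<List<String>, List<String>> {
    var used = 0
    val shown = mutableListOf<String>()
    for ((name, width) in widths) {
        val needed = width + if (shown.isEmpty()) 0 else sep
        if (used + needed > maxWidth) break // stop at the first column that does not fit
        used += needed
        shown += name
    }
    val hidden = widths.map { it.first }.drop(shown.size)
    return shown to hidden
}

fun main() {
    val widths = listOf("year" to 5, "month" to 5, "day" to 5, "dep_time" to 8)
    val (shown, hidden) = fitColumns(widths, maxWidth = 18)
    println(shown)  // [year, month, day]
    println(hidden) // [dep_time]
}
```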

Is mutate() gone?

I'm on krangl v0.6 and I can't use df.mutate() at all. I looked in DataFrame.kt and I see references to it in comments but no method declaration. I do see it in v0.4 though.

Allow specifying return type of DataFrame.get(name: String)

It would be nice to have a function on DataFrame similar to operator fun get(name: String): DataCol, except you can specify the return type with generics.

Calling the function could look this way:
val d:IntCol = dataFrame.getColumn("d")
or:
val d = dataFrame.getColumn<IntCol>("d")

A possible implementation (only tested for SimpleDataFrame):

inline fun <reified T : DataCol> DataFrame.getColumn(name: String): T =
        try {
            val column = cols.first { it.name == name }
            if (column is T) {
                column
            } else {
                val msg = "Could not cast column '${name}' of type '${column::class.simpleName}' to type '${T::class}'"
                throw ColumnTypeCastException(msg)
            }
        } catch (e: NoSuchElementException) {
            throw NoSuchElementException("No column found with name '$name'")
        }

The function could also be named get, but I find getColumn a bit clearer in what it returns. Unfortunately, it can't be called from Java because of reified.

Related to this:
I think get on DoubleCol (IntCol etc.) should return a Double? (Int? etc.) instead of Any?.

edit:
I guess you could also just use dataFrame["d"] as IntCol
