holgerbrandl / krangl

krangl is a {K}otlin DSL for data w{rangl}ing

License: MIT License

Kotlin 99.20% R 0.49% Java 0.31%

krangl's Introduction

krangl

krangl is no longer developed. It was a wonderful experiment, but has been superseded by the more complete, more usable and more modern https://github.com/Kotlin/dataframe.

krangl is a {K}otlin library for data w{rangl}ing. By implementing a grammar of data manipulation using a modern functional-style API, it allows you to filter, transform, aggregate and reshape tabular data.

krangl is heavily inspired by the amazing dplyr for R. krangl is written in Kotlin and excels in Kotlin, but also emphasizes good Java interop. It mimics the API of dplyr, while carefully adding more typed constructs where possible.

Installation
Features
Examples
Documentation
How to contribute?

If you're not sure how to proceed, check out the "krangl in 10 minutes" section in the krangl user guide.

Installation

To get started simply add it as a dependency to your build.gradle:

repositories {
    mavenCentral() 
}

dependencies {
    implementation "com.github.holgerbrandl:krangl:0.18.4"
}

Declaring the repository is purely optional as it is the default already.

You can also use JitPack with Maven or Gradle to build the latest snapshot as a dependency in your project.

repositories {
    maven { url 'https://jitpack.io' }
}
dependencies {
    implementation 'com.github.holgerbrandl:krangl:-SNAPSHOT'
}

To build and install it into your local maven cache, simply clone the repo and run

./gradlew install

Features

Filter, transform, aggregate and reshape tabular data
Modern, user-friendly and easy-to-learn data-science API
Reads from plain and compressed tsv, csv, json, or any delimited format with or without header from local or remote
Supports grouped operations
Ships with JDBC support
Tables can contain atomic columns (int, double, boolean) as well as object columns
Reshape tables from wide to long and back
Table joins (left, right, semi, inner, outer)
Cross tabulation
Descriptive statistics (mean, min, max, median, ...)
Functional API inspired by dplyr, pandas, and Kotlin stdlib
many more...

krangl is just about data wrangling. For data visualization we recommend kravis which seamlessly integrates with krangl and implements a grammar to build a wide variety of plots.

Examples

// Read data-frame from disk
val iris = DataFrame.readTSV("data/iris.txt")


// Create data-frame in memory
val df: DataFrame = dataFrameOf(
    "first_name", "last_name", "age", "weight")(
    "Max", "Doe", 23, 55,
    "Franz", "Smith", 23, 88,
    "Horst", "Keanes", 12, 82
)

// Or from csv
// val otherDF = DataFrame.readCSV("path/to/file")

// Print rows
df                              // with implicit string conversion using default options
df.print(colNames = false)      // with custom printing options

// Print structure
df.schema()


// Add columns with mutate
// by adding constant values as new column
df.addColumn("salary_category") { 3 }

// by doing basic column arithmetics
df.addColumn("age_3y_later") { it["age"] + 3 }

// Note: krangl dataframes are immutable so we need to (re)assign results to preserve changes.
val newDF = df.addColumn("full_name") { it["first_name"] + " " + it["last_name"] }

// Also feel free to mix types here since krangl overloads arithmetic operators like + for dataframe-columns
df.addColumn("user_id") { it["last_name"] + "_id" + rowNumber }

// Create new attributes with string operations like matching, splitting or extraction.
df.addColumn("with_anz") { it["first_name"].asStrings().map { it!!.contains("anz") } }

// Note: krangl is using 'null' as missing value, and provides convenience methods to process non-NA bits
df.addColumn("first_name_initial") { it["first_name"].map<String>{ it.first() } }

// or add multiple columns at once
df.addColumns(
    "age_plus3" to { it["age"] + 3 },
    "initials" to { it["first_name"].map<String> { it.first() } concat it["last_name"].map<String> { it.first() } }
)


// Sort your data with sortedBy
df.sortedBy("age")
// and add secondary sorting attributes as varargs
df.sortedBy("age", "weight")
df.sortedByDescending("age")
df.sortedBy { it["weight"].asInts() }


// Subset columns with select
df.select2 { it is IntCol } // functional style column selection
df.select("last_name", "weight")    // positive selection
df.remove("weight", "age")  // negative selection
df.select({ endsWith("name") })    // selector mini-language


// Subset rows with vectorized filter
df.filter { it["age"] eq 23 }
df.filter { it["weight"] gt 50 }
df.filter({ it["last_name"].isMatching { startsWith("Do")  }})

// In case vectorized operations are not possible or available we can also filter tables by row
// which allows for scalar operators
df.filterByRow { it["age"] as Int > 5 }
df.filterByRow { (it["age"] as Int).rem(10) == 0 } // round birthdays :-)


// Summarize

// do simple cross tabulations
df.count("age", "last_name")

// ... or calculate single summary statistic
df.summarize("mean_age" to { it["age"].mean(true) })

// ... or multiple summary statistics
df.summarize(
    "min_age" to { it["age"].min() },
    "max_age" to { it["age"].max() }
)

// for the sake of R and Python adoptability you can also use `=` here
df.summarize(
    "min_age" `=` { it["age"].min() },
    "max_age" `=` { it["age"].max() }
)

// Grouped operations
val groupedDf: DataFrame = df.groupBy("age") // or provide multiple grouping attributes with varargs
val sumDF = groupedDf.summarize(
    "mean_weight" to { it["weight"].mean(removeNA = true) },
    "num_persons" to { nrow }
)

// Optionally ungroup the data
sumDF.ungroup().print()

// generate object bindings for kotlin.
// Unfortunately the syntax is a bit odd since we can not access the variable name by reflection
sumDF.printDataClassSchema("Person")

// This will generate and print the following conversion code:
data class Person(val age: Int, val mean_weight: Double, val num_persons: Int)

val records = sumDF.rows.map { row -> Person(row["age"] as Int, row["mean_weight"] as Double, row["num_persons"] as Int) }

// Now we can use the krangl result table in a strongly typed way
records.first().mean_weight

// Vice versa we can also convert an existing set of objects into a data-frame
val recordsDF = records.asDataFrame()
recordsDF.print()

// to populate a data-frame with selected properties only, we can do
val deparsedDF = records.deparseRecords { mapOf("age" to it.age, "weight" to it.mean_weight) }

Documentation

krangl is not yet mature, full of bugs and its API is in constant flux. Nevertheless, feel welcome to submit pull-requests or tickets, or simply get in touch via gitter (see button on top).

Krangl User Guide for detailed information about the API and usage examples.
API Docs for detailed information about the API including many usage examples
TBD krangl Cheat Sheet

Another great introduction into data-science with kotlin was presented at 2019's KotlinConf by Roman Belov from JetBrains.

How to contribute?

Feel welcome to post ideas, suggestions and criticism to our tracker.

We always welcome pull requests. :-)

You could also show your spiritual support by upvoting krangl here on github.

Also see

Developer Information with technical notes & details about how to build, test, release and improve krangl
Roadmap complementing the tracker with a backlog

Also, there are a few issues in the IDE itself which limit the applicability/usability of krangl, so you may want to vote for

KT-24789 "Unresolved reference" when running a script which is a symlink to a script outside of source roots
KT-12583 IDE REPL should run in project root directory
KT-11409 Allow to "Send Selection To Kotlin Console"
KT-13319 Support ":paste" for pasting multi-line expressions in REPL
KT-21224 REPL output is not aligned with input

krangl's Issues

add jdbc interface

https://jtablesaw.wordpress.com/2016/06/19/new-load-data-from-any-rdbms/

Implement `Iterable<Foo>.asDataFrame()` for complex cascaded types

Currently just data-classes with simply atomic properties are tested. Make sure that also more complex types work. E.g.

data class Foo(val name: String, val bar: File, val stuff: List<URL>)
foos.asDataFrame()

should result in 3 column df with 2 object columns for bar and stuff.

`sleepData.sortedBy{ "order" }` should fail or be prevented by more typed api

provide category utilities

startsWith
endsWith
contains
matches

all should be done in na-aware manner

consider to use column indices for faster access

http://stackoverflow.com/questions/1108/how-does-database-indexing-work

Turning Column into LocalDate

I'm having an issue creating a new column that turns date text into a Java 8 LocalDate.

package com.swa.np.myproject

import krangl.*
import java.time.LocalDate
import java.time.format.DateTimeFormatter

fun main(args: Array<String>) {

    val df = DataFrame.fromCSV("C:\\Users\\e98594\\Desktop\\Iter1R1 - New Mktg Day Parameters_Inputs\\Globals\\Connection Builder_V1.csv")

    val converted = df.mutate(TableFormula("Converted EffectiveDate") {
        it["EffectiveDate"].asStrings().map { it?.substring(0,8) }.filterNotNull().map { LocalDate.parse(it, DateTimeFormatter.ofPattern("M/d/yyyy")) }
    })

    println(converted)
}

ERROR:

Exception in thread "main" java.lang.UnsupportedOperationException
	at krangl.SimpleDataFrameKt.handleListErasure(SimpleDataFrame.kt:391)
	at krangl.SimpleDataFrameKt.anyAsColumn(SimpleDataFrame.kt:355)
	at krangl.SimpleDataFrame.mutate(SimpleDataFrame.kt:195)
	at com.swa.np.myproject.TestKt.main(Test.kt:16)

It seems to work fine if I map() it to anything other than a LocalDate.

package com.swa.np.myproject

import krangl.*

fun main(args: Array<String>) {

    val df = DataFrame.fromCSV("C:\\Users\\e98594\\Desktop\\Iter1R1 - New Mktg Day Parameters_Inputs\\Globals\\Connection Builder_V1.csv")
    df.print()

    val converted = df.mutate(TableFormula("Converted EffectiveDate") {
        it["EffectiveDate"].asStrings().map { it?.substring(0,8) }
    })

    println(converted)
}
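
A side note on the conversion step itself, independent of the krangl exception above: `filterNotNull()` drops missing cells, so the resulting list can be shorter than the table has rows. The sketch below is plain Kotlin without krangl and keeps nulls in place. The function name `parseDates` and the assumption that a cell holds an `M/d/yyyy` date optionally followed by a time part are illustrative only; whether a given krangl version accepts a `LocalDate` list as a column is not confirmed here.

```kotlin
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeParseException

private val US_DATE: DateTimeFormatter = DateTimeFormatter.ofPattern("M/d/yyyy")

/** Parses each cell as M/d/yyyy; null or unparsable cells stay null so the list length is preserved. */
fun parseDates(cells: List<String?>): List<LocalDate?> = cells.map { cell ->
    // Only the part before the first space is considered, e.g. "1/5/2018 00:00" -> "1/5/2018".
    val datePart = cell?.trim()?.substringBefore(' ')
    if (datePart.isNullOrEmpty()) null
    else try {
        LocalDate.parse(datePart, US_DATE)
    } catch (e: DateTimeParseException) {
        null // treat malformed text as a missing value
    }
}

fun main() {
    val raw = listOf("1/5/2018 00:00", null, "12/31/2017", "n/a")
    println(parseDates(raw)) // [2018-01-05, null, 2017-12-31, null]
}
```

Using `substringBefore(' ')` instead of a fixed `substring(0, 8)` avoids cutting dates such as `12/31/2017`, which are ten characters long.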

use proper logging instead of system.out/err

improve reshaping functionality by adding `unite` and `separate`

Can not add scalar object as column

Whereas this works fine for atomic values, it fails for objects. Example

irisData.addColumn("my_regex"){ "foo$it".toRegex()}

Selecting Columns without Names

Is there a way to select columns by index rather than via column names if a given dataframe does not have column names?

The output of "unite" is inconsistent with the one in tidyr

If I run the following program:
val df: DataFrame = dataFrameOf(
"id", "Year", "A", "B")(
1, 2007, 5, 10,
1, 2008, 2, 0,
1, 2009, 3, 50,
2, 2007, 7, 13,
2, 2008, 5, 17,
2, 2009, 6, 17
)
var dfUnite = df.unite("newCol", mutableListOf("id", "A"))
dfUnite.print()

I will get the following output from krangl:
Year B newCol
2007 10 1_5
2008 0 1_2
2009 50 1_3
2007 13 2_7
2008 17 2_5
2009 17 2_6

However,
If I use the same api (unite) in tidyr, I will get the following output:
p4_input1 %>% unite(newCol, id, A)
newCol Year B
1_5 2007 10
1_2 2008 0
1_3 2009 50
2_7 2007 13
2_5 2008 17
2_6 2009 17

It seems that the "unite" in tidyr will insert the new column (newCol) to the position of the first column (id) that is being united while Krangl always puts the new column to the last position.

It would be nice if krangl could generate output that is consistent with the original dplyr/tidyr library.
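
The column order requested above can be sketched without krangl. This is a plain-Kotlin illustration of the placement rule only (new column takes the slot of the first united column); `uniteColumnOrder` is a hypothetical helper, not part of krangl or tidyr.

```kotlin
/**
 * Returns the column order after uniting [united] into [newCol]:
 * the new column takes the slot of the first united column (tidyr-style).
 */
fun uniteColumnOrder(columns: List<String>, newCol: String, united: List<String>): List<String> {
    val insertAt = columns.indexOfFirst { it in united }
    require(insertAt >= 0) { "None of $united found in $columns" }
    val result = mutableListOf<String>()
    columns.forEachIndexed { index, name ->
        if (index == insertAt) result += newCol // place the new column here
        if (name !in united) result += name     // united columns are dropped
    }
    return result
}

fun main() {
    println(uniteColumnOrder(listOf("id", "Year", "A", "B"), "newCol", listOf("id", "A")))
    // [newCol, Year, B]
}
```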

df.sortedByDescending(...) produces unexpected (unsorted) results

I've been getting unusual results from DataFrame.sortedByDescending() whereas sortedBy() worked as expected. The following two examples were taken from the main page here and slightly augmented. I wanted to sort dataframe rows by the "weight" column in descending order:

Example (Integers)

fun main(args: Array<String>) {
    val df = dataFrameOf("first_name", "last_name", "age", "weight")(
            "Max", "Doe", 23, 55,
            "Franz", "Smith", 23, 10,
            "Horst", "Keanes", 12, 0,
            "Horst", "Keanes", 12, 0,
            "Horst", "Keanes", 12, 1,
            "Horst", "Keanes", 12, 50,
            "Horst", "Keanes", 12, 2
    )

    df.sortedByDescending("weight").print()
}

Result

A DataFrame: 7 x 4
first_name   last_name   age   weight
       Max         Doe    23       55
     Horst      Keanes    12        0
     Horst      Keanes    12        2
     Horst      Keanes    12       50
     Horst      Keanes    12        1
     Franz       Smith    23       10
     Horst      Keanes    12        0

Example (Double + negative)

fun main(args: Array<String>) {
    val df = dataFrameOf("first_name", "last_name", "age", "weight")(
            "Max", "Doe", 23, 55.0,
            "Franz", "Smith", 23, -10.0,
            "Horst", "Keanes", 12, 0.0,
            "Horst", "Keanes", 12, 0.10,
            "Horst", "Keanes", 12, 1.0,
            "Horst", "Keanes", 12, -0.05,
            "Horst", "Keanes", 12, -0.02
    )

    df.sortedByDescending("weight").print()
}

Result

A DataFrame: 7 x 4
first_name   last_name   age   weight
       Max         Doe    23       55
     Horst      Keanes    12    -0.02
     Horst      Keanes    12      0.1
     Horst      Keanes    12        0
     Franz       Smith    23      -10
     Horst      Keanes    12    -0.05
     Horst      Keanes    12        1

In both cases, the "weight" column in the printout is different, but not sorted at all. In case of sortedBy(), the "weight" column was properly sorted...

This looks like a bug to me.
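
For reference, the expected order for the Double example can be derived with the Kotlin standard library alone. The sketch below computes a row permutation and applies it; it says nothing about how krangl sorts internally, and `descendingOrder` is a name made up for this illustration.

```kotlin
/** Returns the row indices that order [values] from largest to smallest (stable for ties). */
fun descendingOrder(values: List<Double>): List<Int> =
    values.indices.sortedByDescending { values[it] }

fun main() {
    val weight = listOf(55.0, -10.0, 0.0, 0.10, 1.0, -0.05, -0.02)
    val order = descendingOrder(weight)
    println(order)                    // [0, 4, 3, 2, 6, 5, 1]
    println(order.map { weight[it] }) // [55.0, 1.0, 0.1, 0.0, -0.02, -0.05, -10.0]
}
```

Applying the same permutation to every column yields the sorted table, which gives a simple oracle for a regression test.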

Add excel import/export

How to append a total row

What's the best way to total a column?
Say you have a df like this:

| Name  | Duration | Color  |
-----------------------------
| Foo   | 100      | Blue   |
| Goo   | 200      | Red    |
| Bar   | 300      | Yellow |

I don't see a sum() or total() method on DataCol - only mean, min, etc.
I can total the column myself like so:
val total = df["duration"].asInts().sumBy { it -> it!! }
but how do I append this to the data frame to end up with this:

| Name  | Duration | Color  |
-----------------------------
| Foo   | 100      | Blue   |
| Goo   | 200      | Red    |
| Bar   | 300      | Yellow |
| Total | 600      |        |
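
The shape of the desired result can be sketched on plain Kotlin rows. This does not use krangl; `appendTotalRow` and the row-as-map representation are assumptions made for the illustration, and how to turn the rows back into a data frame is left open.

```kotlin
/** Appends a row holding the sum of [sumCol]; [labelCol] gets "Total", all other cells stay null (NA). */
fun appendTotalRow(
    rows: List<Map<String, Any?>>,
    labelCol: String,
    sumCol: String
): List<Map<String, Any?>> {
    // Null cells are skipped, mirroring an NA-aware sum.
    val total = rows.mapNotNull { it[sumCol] as Int? }.sum()
    val columns = rows.firstOrNull()?.keys ?: setOf(labelCol, sumCol)
    val totalRow: Map<String, Any?> = columns.associateWith { col ->
        when (col) {
            labelCol -> "Total"
            sumCol -> total
            else -> null
        }
    }
    return rows + totalRow
}

fun main() {
    val rows = listOf(
        mapOf("Name" to "Foo", "Duration" to 100, "Color" to "Blue"),
        mapOf("Name" to "Goo", "Duration" to 200, "Color" to "Red"),
        mapOf("Name" to "Bar", "Duration" to 300, "Color" to "Yellow")
    )
    println(appendTotalRow(rows, "Name", "Duration").last())
    // {Name=Total, Duration=600, Color=null}
}
```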

consistently log non-idiomatic use of core-verbs as warnings

Add options to DataFrame.printDataClassSchema

The code generated by the DataFrame.printDataClassSchema method is unusable when column headers with spaces in them are used.

I would suggest either wrapping the column headers in backticks (`) or converting them to the regular kotlin naming convention (eg: User Id to userId)
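
Both suggestions can be sketched in a few lines. The helpers below are hypothetical and not part of krangl; they only show the two naming strategies, and neither handles Kotlin hard keywords such as `class`.

```kotlin
/** Converts a header such as "User Id" to a camelCase property name such as "userId". */
fun toPropertyName(header: String): String {
    // Split on anything that is not a letter, digit or underscore.
    val words = header.trim().split(Regex("[^A-Za-z0-9_]+")).filter { it.isNotEmpty() }
    if (words.isEmpty()) return "_"
    val name = words.first().lowercase() +
        words.drop(1).joinToString("") { w -> w.lowercase().replaceFirstChar { it.uppercase() } }
    // Identifiers must not start with a digit.
    return if (name.first().isDigit()) "_$name" else name
}

/** Alternative: keeps the header verbatim and wraps it in backticks when it is not a plain identifier. */
fun toBacktickedName(header: String): String =
    if (Regex("[A-Za-z_][A-Za-z0-9_]*").matches(header)) header else "`$header`"

fun main() {
    println(toPropertyName("User Id"))   // userId
    println(toBacktickedName("User Id")) // `User Id`
    println(toBacktickedName("age"))     // age
}
```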

csv reader does not handle incorrect type guesses

Example:

Provide a License

Hey there.
I really like krangl's approach and would like to use it in one of my projects, but no LICENSE is provided.
Would you consider publishing it under, for example, the MIT or Apache License?
Best Regards!

Impl `Iterable<Any>.asDataFrame()` via reflection

scan top-level properties and add as corresponding columns
optionally recurse into non-atomic properties (similar to JSON flattening in R)

Add renjin bindings as a subproject

See https://github.com/bedatadriven/renjin-gradle-example/blob/master/src/test/java/org/renjin/gradle/RenjinGradleTest.java for an example.

Continue peeking until we hit the first/N non-NA values for column type detection

See kplyr.peekCol
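
A minimal sketch of the idea, assuming cells arrive as nullable strings and that "NA" and blank cells mark missing values. `ColType` and `guessColType` are illustrative names, not the krangl or kplyr implementation.

```kotlin
enum class ColType { INT, DOUBLE, BOOLEAN, STRING }

/** Guesses a column type from the first [peekSize] non-NA cells; NA cells are skipped rather than counted. */
fun guessColType(cells: Sequence<String?>, peekSize: Int = 5, na: String = "NA"): ColType {
    val sample = cells.filterNotNull()
        .filter { it.isNotBlank() && it != na }
        .take(peekSize) // lazily stops reading once enough values are seen
        .toList()
    return when {
        sample.isEmpty() -> ColType.STRING
        sample.all { it.toIntOrNull() != null } -> ColType.INT
        sample.all { it.toDoubleOrNull() != null } -> ColType.DOUBLE
        sample.all { it.equals("true", true) || it.equals("false", true) } -> ColType.BOOLEAN
        else -> ColType.STRING
    }
}

fun main() {
    val column = sequenceOf("NA", null, "", "3", "4.5", "7")
    println(guessColType(column))               // DOUBLE
    println(guessColType(column, peekSize = 1)) // INT
}
```

The second call shows why a small peek window can still guess wrong: the first non-NA value looks like an integer although later values are not.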

Create example for https://github.com/JetBrains/Exposed

java.lang.NullPointerException when using DataFrame.fromCSV()

Hi,

here is my code :

import krangl.*

fun test(pathCSV: String) {
    // Create data-frame  in memory
    val otherDF = DataFrame.fromCSV(pathCSV)
    otherDF.print(colNames = true)
}

I am just trying to load a CSV file into a dataframe from its file path.
The file looks like this (head -n 8 iris.csv):

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa

And i get this error at runtime :

Exception in thread "main" java.lang.NullPointerException
        at krangl.TableIOKt.fromCSV(TableIO.kt:57)
        at krangl.TableIOKt.fromCSV(TableIO.kt:35)
        at krangl.TableIOKt.fromCSV$default(TableIO.kt:25)
        at krangl.TableIOKt.fromCSV(TableIO.kt:18)
        at com.mycompany.dataframes.DataFrameKt.test(DataFrame.kt:15)
        at com.mycompany.controler.ControlerKt.main(Controler.kt:34)

If you have some answer or advice.
Thanks.
p.s: I already read it with some Python code , using Pandas library.

Provide database integration example or actual backend

See https://github.com/JetBrains/Exposed, especially

Database.connect("jdbc:h2:mem:test", driver = "org.h2.Driver")

`schema()` should not throw memory exception

To reproduce

val jsonUrl = "internal link to json user model"
        val df = DataFrame.fromJson(jsonUrl)
        df.schema()

Documentation

Can you add a more friendly installation guide? E.g. for people that are very new to kotlin and gradle?

Print row numbers by default when using `print`

provide equivalent for dplyr::summarize_each and dplyr::mutate_each

use selector API

TableFormula shouldn't have to be specified

I don't know if I agree the name mutate() is optimal because it implies the DataFrame is mutable, which does not seem to be the case because it actually yields a new DataFrame.

package com.swa.np.myproject

import krangl.DataFrame
import krangl.TableFormula
import krangl.fromCSV

fun main(args: Array<String>) {
    val df = DataFrame.fromCSV("C:\\Users\\e98594\\Desktop\\Iter1R1 - New Mktg Day Parameters_Inputs\\Globals\\Connection Builder_V1.csv")

    val newDf = df.mutate(TableFormula("Test") {3})

    print(newDf)
}

Also, the documented arguments in the README do not work. You have to explicitly provide a TableFormula so I may put in a PR later to fix this.

`DataFrame.print` should round numeric values for better readability

Return `Sequence` instead of `Iterable` where possible to defer computation where possible

see https://stackoverflow.com/questions/35629159/kotlins-iterable-and-sequence-look-exactly-same-why-are-two-types-required/35630670
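
The difference can be made visible with the standard library alone. The sketch below counts how many elements a mapping step touches before the first match is found; the function names are made up for this example.

```kotlin
/** Eager: `map` on a List transforms every element before `first` runs. */
fun touchedEagerly(rows: List<Int>): Int {
    var touched = 0
    rows.map { touched++; it * 2 }.first { it > 4 }
    return touched
}

/** Lazy: a Sequence pulls elements one at a time and stops at the first match. */
fun touchedLazily(rows: List<Int>): Int {
    var touched = 0
    rows.asSequence().map { touched++; it * 2 }.first { it > 4 }
    return touched
}

fun main() {
    val rows = (1..1000).toList()
    println(touchedEagerly(rows)) // 1000
    println(touchedLazily(rows))  // 3
}
```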

Add function literal support for `count`

Add ability to group by using indicator function `df.groupBy{ it["name"].startsWith("F") }`

All other main verbs allow to do so and it is convenient once in a while.

Mixing of positive and negative column selection should throw `InvalidColumnSelectException`

Example

df.gather("year", "rainfall", columns = { except("city") AND startsWith("coast") } )

     city   1995   2000   2005             year   rainfall
  Dresden    343    252    423   coast_distance        400
Frankfurt    534    435    913   coast_distance        534

except starts a negative selection, but startsWith converts it into a positive one.
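
One way such a check could look, sketched on plain lists. It assumes each selector yields a tri-state value per column (include, exclude, no opinion); the `Selection` alias, the two selector functions and the exception class defined here are stand-ins for illustration and are not taken from the krangl sources.

```kotlin
class InvalidColumnSelectException(message: String) : RuntimeException(message)

/** true = column explicitly included, false = explicitly excluded, null = selector has no opinion. */
typealias Selection = List<Boolean?>

fun startsWith(prefix: String, columns: List<String>): Selection =
    columns.map { if (it.startsWith(prefix)) true else null }

fun except(name: String, columns: List<String>): Selection =
    columns.map { if (it == name) false else null }

/** Combines two selections and rejects a mix of positive and negative entries. */
fun combine(a: Selection, b: Selection): Selection {
    val merged = a.zip(b) { x, y -> x ?: y }
    if (true in merged && false in merged) {
        throw InvalidColumnSelectException("Mixing positive and negative column selection is not allowed")
    }
    return merged
}

fun main() {
    val columns = listOf("city", "1995", "2000", "2005", "coast_distance")
    try {
        combine(except("city", columns), startsWith("coast", columns))
    } catch (e: InvalidColumnSelectException) {
        println(e.message) // Mixing positive and negative column selection is not allowed
    }
}
```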

implement kplyrized `tidyr::separate` and `tidyr::gather`

Provide more elegant object bindings

Something along

data class Person(val name:String, val age:Int)

val persons: Iterable<Person> = df.mapTo<Person>()

Internal impl could use reflection on reified type:

...
T::class.constructors.first().call(args)
...

Add unit tests to support this for all basic types including object and nested dfs.
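
A rough sketch of the reified approach on plain row lists. Note two deliberate deviations from the snippet above: it uses Java reflection (`T::class.java.constructors`) so that no `kotlin-reflect` dependency is needed, and it spreads the arguments with `*`, since passing the list itself would hand the constructor a single argument. `mapRowsTo` is a hypothetical name, and the sketch assumes the row values are in the same order as the constructor parameters.

```kotlin
data class Person(val name: String, val age: Int)

/**
 * Builds one T per row by calling T's first public constructor with the row values in column order.
 * Assumes the columns line up with the constructor parameters.
 */
inline fun <reified T : Any> List<List<Any?>>.mapRowsTo(): List<T> {
    val constructor = T::class.java.constructors.first()
    // The spread operator passes the cells as individual arguments rather than as one list.
    return map { row -> constructor.newInstance(*row.toTypedArray()) as T }
}

fun main() {
    val rows = listOf(listOf("Max", 23), listOf("Franz", 12))
    val persons: List<Person> = rows.mapRowsTo()
    println(persons) // [Person(name=Max, age=23), Person(name=Franz, age=12)]
}
```

Matching columns to parameters by name rather than by position would need parameter names at runtime, which is where `kotlin-reflect` becomes useful.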

Add generic NA aware mapper for filter expressions

Example (too clumsy to be fun)

df.filter({ it["last_name"].asStrings().map { it!!.startsWith("Do") }.toBooleanArray() })
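
What a more convenient mapper might look like, sketched on a plain `List<String?>` rather than a krangl column. `naAwareMatch` and its `naValue` parameter are hypothetical names for this illustration.

```kotlin
/** Applies [predicate] to every non-null cell; null (NA) cells yield [naValue] instead of throwing. */
fun List<String?>.naAwareMatch(naValue: Boolean = false, predicate: String.() -> Boolean): BooleanArray =
    map { cell -> if (cell == null) naValue else cell.predicate() }.toBooleanArray()

fun main() {
    val lastNames = listOf("Doe", null, "Smith", "Dorn")
    println(lastNames.naAwareMatch { startsWith("Do") }.toList()) // [true, false, false, true]
}
```

Taking the predicate as a lambda with receiver lets callers write `startsWith("Do")` directly, without `it!!` or an explicit `toBooleanArray()`.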

CSV files without header take first line as a header always

Apache Commons CSV allows specifying headers.
Iterable<CSVRecord> records = CSVFormat.RFC4180.withHeader("ID", "CustomerNo", "Name").parse(in);
for (CSVRecord record : records) {
    String id = record.get("ID");
    String customerNo = record.get("CustomerNo");
    String name = record.get("Name");
}

intro presentation link in README is dead

http://holgerbrandl.github.io/krangl/krangl_intro/krangl_intro.html

Add `complete`

See https://rdrr.io/cran/tidyr/man/complete.html

How to modify data

I'm porting some python code to kotlin and I'm stuck. The python code creates a new column by applying a lambda with some if statements to compare values of columns. For example add a new column and assign a value if the status is true and assign a different value if the status is false. It uses assign, apply, and loc to do this.

I cannot find a way to do this with krangl. Is it possible?

I can't write some code examples because I have to write this on my phone because my company blocks logging into GitHub.

Thanks

Hide columns in `print` after exceeding maximum line length

Similar to tibble printing:

> require(nycflights13)
> flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515        2.      830            819
 2  2013     1     1      533            529        4.      850            830
 3  2013     1     1      542            540        2.      923            850
 4  2013     1     1      544            545       -1.     1004           1022
 5  2013     1     1      554            600       -6.      812            837
 6  2013     1     1      554            558       -4.      740            728
 7  2013     1     1      555            600       -5.      913            854
 8  2013     1     1      557            600       -3.      709            723
 9  2013     1     1      557            600       -3.      838            846
10  2013     1     1      558            600       -2.      753            745
# ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

`flightsData.schema()` fails with `java.lang.OutOfMemoryError`

Implement `nest` and `unnest`

See

Is mutate() gone?

I'm on krangl v0.6 and I can't use df.mutate() at all. I looked in DataFrame.kt and I see references to it in comments but no method declaration. I do see it in v0.4 though.

Allow specifying return type of DataFrame.get(name: String)

It would be nice to have a function on DataFrame similar to operator fun get(name: String): DataCol, except you can specify the return type with generics.

Calling the function could look this way:
val d:IntCol = dataFrame.getColumn("d")
or:
val d = dataFrame.getColumn<IntCol>("d")

A possible implementation (only tested for SimpleDataFrame):

inline fun <reified T : DataCol> DataFrame.getColumn(name: String): T =
        try {
            val column = cols.first { it.name == name }
            if (column is T) {
                column
            } else {
                val msg = "Could not cast column '${name}' of type '${column::class.simpleName}' to type '${T::class}'"
                throw ColumnTypeCastException(msg)
            }
        } catch (e: NoSuchElementException) {
            throw NoSuchElementException("No column found with name '$name'")
        }

The function could also be named get, but I find getColumn a bit clearer in what it returns. Unfortunately, it can't be called from Java because of reified.

Related to this:
I think get on DoubleCol (IntCol etc.) should return a Double? (Int? etc.) instead of Any?.

edit:
I guess you could also just use dataFrame["d"] as IntCol

compare with tablesaw

https://jtablesaw.wordpress.com/an-introduction/

feature comparison
link
performance comparison
