saddle's Introduction

Saddle: Scala Data Library for JVM and Scala.js

Introduction

Saddle is a data manipulation library for Scala that provides array-backed, indexed, one- and two-dimensional data structures that are specialized on primitive types to avoid the overhead of boxing and unboxing.

Saddle offers numerical calculations, automatic alignment of data along indices, robustness to missing (N/A) values, and facilities for I/O.

Features

  • All of saddle's core data structures avoid boxing of primitive types thus maintaining optimal memory efficiency and cache locality.
  • One- and two-dimensional vectors (Vec[T] and Mat[T]).
  • Constant time lookup index supporting database-like inner and outer joins (Index[T]).
  • Combinations of index and vector types, both 1D (Series[Key,Value]) and 2D (Frame[RowKey,ColumnKey,Value]).
  • Support for multilevel indexes, and data manipulations like pivots, joins, merges, group by-s, sorts.
  • Convenient vectorized binary operations between the above data structures.
  • Automatic, non-boxed handling of missing values.
  • Native linear algebra backed by BLAS/LAPACK on amd64, aarch64 and Apple arm64. On Linux, this needs a system-wide installation of the BLAS and LAPACK shared libraries.
  • Getting data in and out of the library:
    • Extremely fast CSV, integer and floating point parsers doing minimal allocations and minimal branching.
    • CSV writer.
    • Support for reading contiguous numeric arrays from npy files.
    • Fast and memory efficient binary serialization format.
    • Circe and jsoniter-scala type classes.
  • Published for Scala on the JVM and Scala.js, Scala 2.13 and 3.

Documentation

How to build the code

You need sbt: sbt test

How to build the website

The website is built with hugo and the hugo-book theme.

The theme is a git submodule. It must be initialized.

git submodule update --init

Create and serve the site with:

sbt docs/mdoc docs/unidoc && cd website && hugo

License

Saddle is distributed under the Apache License Version 2.0 (see LICENSE file).

Copyright

Copyright (c) 2013-2015 Novus Partners, Inc.

Copyright (c) 2013-2015 The Saddle Development Team

All rights reserved.

Saddle is subject to a shared copyright. Each contributor retains copyright to his or her contributions to Saddle, and is free to annotate these contributions via code repository commit messages. The copyright to the entirety of the code base is shared among the Saddle Development Team, comprised of the developers who have made such contributions.

The copyright and license of each file shall read as follows:

Copyright (c) 2013-2015 Saddle Development Team

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Individual contributors may, if they so desire, append their names to the CONTRIBUTORS file.

Code in saddle-core/src/main/scala/org/saddle/util/LongMap.scala has different copyright terms, see its header.

Code in saddle-core/src/main/scala/org/saddle/Buffer.scala has different copyright terms, see its header.

Code in spire-prng has different copyright terms, see the spire-prng/COPYING.

Code in FastDoubleParser.scala is a translation of https://github.com/wrandelshofer/FastDoubleParser. The original Java code is licensed as Copyright © 2022. Werner Randelshofer, Switzerland. MIT License. The test data for FastDoubleParser.scala is licensed as Copyright © Nigel Tao, Apache License Version 2.0.

About the Copyright Holders

Adam Klein began Saddle development in 2012 while an employee of Novus Partners, Inc. The code was released by Novus under this license in 2013. Adam Klein is lead developer. Saddle was inspired by earlier prototypes developed by Chris Lewis, Cheng Peng, & David Cru. Saddle was also inspired by previous work with pandas, a data analysis library written in Python.

Code in the saddle-linalg/ folder is contributed by Istvan Bartha.

This repository is a fork of the original Saddle repository which has seen no activity for some time.

saddle's People

Contributors

adamklein, andrelfpinto, chrislewis, folone, jvns, marklister, pityka, scala-steward, sv3ndk, tnielens, wheaties

saddle's Issues

Can saddle not read CSV files from multiple threads?

Hi,
Saddle is a great tool for reading CSV in Scala, much like pandas. For CSV files with a huge number of rows, however, we need multiple threads reading the file to speed up loading. Reading from a single thread works without problems, but with multiple threads saddle cannot read the file and fails with the error [ CharsetDecoder Current state = RESET, new state = FLUSHED ] in val dataCsvOpt = csv.CsvParser.parseFile[String](csvFileHandle,fieldSeparator='\t',recordSeparator = "\n").toOption

scala 3 support

Placeholder issue for the scala 3 support effort.
Have any challenges already been identified for this migration?

Cross product in Frame.apply

Frame(
  Series(0 -> 1,2 -> 2, 1 -> 3, 0 -> 4),
  Series(1 -> 1,2 -> 2, 0 -> 3, 0 -> 4),
  Series(0 -> 1,1 -> 2, 2 -> 3, 0 -> 4)
)

yields

val res3: org.saddle.Frame[Int,Int,Int] =
[10 x 3]
      0  1  2
     -- -- --
0 ->  1  3  1
0 ->  1  3  4
0 ->  1  4  1
0 ->  1  4  4
0 ->  4  3  1
0 ->  4  3  4
0 ->  4  4  1
0 ->  4  4  4
2 ->  2  2  3
1 ->  3  1  2

If the row indices are not the same and they contain duplicates, then Frame.apply performs a series of cross products. This loses no data, and from a certain perspective it is correct. However, it is rarely the expected behavior, and on any medium-sized data it blows up.
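The row count of such a result can be estimated without building the frame. The following is a minimal sketch in plain Scala (it does not use the Saddle API; `CrossProductSize` and its methods are hypothetical names). It assumes outer-join semantics in which a key absent from one index still contributes a single row.

```scala
object CrossProductSize {
  // Row keys of the three Series from the example above.
  val exampleKeys: Seq[Seq[Int]] = Seq(
    Seq(0, 2, 1, 0),
    Seq(1, 2, 0, 0),
    Seq(0, 1, 2, 0)
  )

  // Number of times each key occurs in one index.
  def multiplicities(index: Seq[Int]): Map[Int, Int] =
    index.groupBy(identity).map { case (k, vs) => k -> vs.size }

  // For each key, multiply its multiplicities across all indexes
  // (a key absent from an index counts as 1), then sum over the keys.
  def joinedRowCount(indexes: Seq[Seq[Int]]): Long = {
    val counts = indexes.map(multiplicities)
    val allKeys = counts.flatMap(_.keys).distinct
    allKeys.map { k =>
      counts.map(c => math.max(c.getOrElse(k, 0), 1).toLong).product
    }.sum
  }
}
```

For the example, key 0 occurs twice in each index and contributes 2 × 2 × 2 = 8 rows, while keys 1 and 2 contribute one row each, giving the 10 rows shown. Three indexes that each repeat a single key 1000 times would already produce 10^9 rows.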

[proposition] mask-based implementation for missing values

Currently, saddle uses one value of each primitive type to represent NA. For floating point numbers, this is straightforward as they already include such a value. For other types (Boolean, Byte, Int, etc), it isn't straightforward and an arbitrary value must be used. Currently the minimum value is used (Byte.MinValue, Short.MinValue, etc).

I think this approach has important drawbacks:

  • users are unlikely to know about that encoding and could use the min values. That leads to surprising behavior.
  • operations resulting in the MinValue would result in a missing value.
  • the binary operations on collections lose simplicity. An implementation such as if (tag.isMissing(v1)) v1 else v1 + 2 might prevent the JVM's loop optimizations from kicking in.
  • the .raw(i)-like api exposes unnecessary complexity to the user.
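The second drawback can be reproduced without Saddle. This is a minimal sketch in plain Scala that only mimics the sentinel encoding described above; `SentinelNA`, `isMissing` and `addScalar` are hypothetical names, not the library's API.

```scala
object SentinelNA {
  // Assumption for illustration: Int.MinValue stands in for NA.
  val NA: Int = Int.MinValue

  def isMissing(v: Int): Boolean = v == NA

  // Element-wise addition that propagates NA, in the style of the
  // snippet quoted in the list above.
  def addScalar(xs: Array[Int], s: Int): Array[Int] =
    xs.map(v => if (isMissing(v)) v else v + s)

  def main(args: Array[String]): Unit = {
    // Int.MaxValue + 1 overflows to Int.MinValue, so a valid input
    // silently turns into a "missing" output.
    val result = addScalar(Array(1, Int.MaxValue), 1)
    println(result.map(isMissing).mkString(", "))
  }
}
```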

An alternative approach would be to use a mask-based implementation for the integer-based Vec[T]s. That is, the vector stores a companion Array[Boolean] marking which entries are missing. This approach is used by pandas.
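A rough sketch of what such a structure could look like, in plain Scala. `MaskedIntVec` is a hypothetical name and not part of Saddle; the point is only that every Int value stays usable and that the arithmetic loop carries no missing-value branch.

```scala
// Hypothetical mask-based vector: values and a parallel missing-mask.
final class MaskedIntVec(val values: Array[Int], val missing: Array[Boolean]) {
  require(values.length == missing.length, "values and mask must have the same length")

  def length: Int = values.length

  // None for a missing entry, Some(value) otherwise.
  def get(i: Int): Option[Int] =
    if (missing(i)) None else Some(values(i))

  // Branch-free element-wise addition; the mask is carried over unchanged.
  def +(s: Int): MaskedIntVec = {
    val out = new Array[Int](values.length)
    var i = 0
    while (i < values.length) {
      out(i) = values(i) + s
      i += 1
    }
    new MaskedIntVec(out, missing.clone())
  }
}
```

The cost is one extra Boolean per element (a bitset would reduce this), which is the trade-off this proposal would need to weigh.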

How to use the groupBy method?

Hi, I created a Frame[Int,Int,String] with saddle and want to count how often each element occurs in one column, something like

frame.columnAt(index).groupby( ele => ele.count), but I found that I cannot write it like this. How can I do it? Thanks.
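As a starting point, the counting itself can be sketched on a plain Scala collection. This does not use the Saddle API: it assumes the column's values have already been extracted into a Seq, and `ColumnCounts.countValues` is a hypothetical helper.

```scala
object ColumnCounts {
  // Group equal values together and count the size of each group.
  def countValues[A](column: Seq[A]): Map[A, Int] =
    column.groupBy(identity).map { case (value, occurrences) =>
      value -> occurrences.size
    }

  def main(args: Array[String]): Unit = {
    val column = Seq("a", "b", "a", "c", "a")
    println(countValues(column))
  }
}
```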

Join of identical non-unique indexes is not correct

  "Outer join of same non-unique indexes " in {
      val ix1 = Index(0, 0)
      val ix2 = Index(0, 0)
      val res = ix1.join(ix2, how = index.OuterJoin)
      /*
        Outer join two columns with repeated values should generate all possible pairs

        Correct assertions:
      res.index must_== Index(0, 0, 0, 0)
      res.lTake.get must_== Array(0, 0, 1, 1)
      res.rTake.get must_== Array(0, 1, 0, 1)

      However the current implementation has a shortcut for identical indexes and passes these:
       */
      res.index must_== Index(0, 0)
      res.lTake must_== None
      res.rTake must_== None

    }

This is because in the implementation there is a shortcut:

def join(left: Index[T], right: Index[T], how: JoinType): ReIndexer[T] = {
    if (left == right) {
      ReIndexer(None, None, right)
    } else if..

The above condition should check for uniqueness.
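The expected behavior can be sketched in plain Scala. This is a naive illustration rather than Saddle's implementation: `NonUniqueJoin` and its methods are hypothetical, arrays stand in for Index, and -1 marks a side with no match.

```scala
object NonUniqueJoin {
  // Returns (joined keys, positions taken from left, positions taken from right).
  def outerJoin(left: Array[Int], right: Array[Int]): (Array[Int], Array[Int], Array[Int]) = {
    // All positions of each key in the right index, in order of appearance.
    val rightPos: Map[Int, Seq[Int]] = right.indices.groupBy(i => right(i))
    val keys = Array.newBuilder[Int]
    val lTake = Array.newBuilder[Int]
    val rTake = Array.newBuilder[Int]

    // Every left position pairs with every matching right position.
    for (l <- left.indices) {
      rightPos.get(left(l)) match {
        case Some(rs) =>
          rs.foreach { r =>
            keys += left(l)
            lTake += l
            rTake += r
          }
        case None =>
          keys += left(l)
          lTake += l
          rTake += -1
      }
    }

    // Right-only keys complete the outer join.
    val leftKeys = left.toSet
    for (r <- right.indices if !leftKeys(right(r))) {
      keys += right(r)
      lTake += -1
      rTake += r
    }
    (keys.result(), lTake.result(), rTake.result())
  }

  def isUnique(ix: Array[Int]): Boolean = ix.distinct.length == ix.length

  // The identity shortcut is only safe when the index has no duplicates.
  def canShortcut(left: Array[Int], right: Array[Int]): Boolean =
    left.sameElements(right) && isUnique(left)
}
```

With both inputs equal to (0, 0), `outerJoin` yields the keys (0, 0, 0, 0) and the take arrays (0, 0, 1, 1) and (0, 1, 0, 1) from the commented-out assertions, and `canShortcut` returns false.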

Vec[Unit] throws

org.saddle.Vec(()).map(println) throws

java.lang.IllegalArgumentException
  at java.lang.reflect.Array.newArray(Native Method)
  at java.lang.reflect.Array.newInstance(Array.java:75)
  at scala.reflect.ClassTag.newArray(ClassTag.scala:66)
  at scala.reflect.ClassTag.newArray$(ClassTag.scala:65)
  at org.saddle.scalar.ScalarTagAny.newArray(ScalarTagAny.scala:23)
  at scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1278)
  at scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1276)
  at scala.collection.AbstractIterable.toArray(Iterable.scala:919)
  at org.saddle.Vec$.apply(Vec.scala:52)

Even if not very useful, it should not blow up.

flaky tests in MatCheck

Elementwise matrix operations with scalar (I,D) => B
[info]   + op < works
[error]   x op <= works
[error]    Falsified after 12 passed tests.
[error]    > ARG_0: [100 x 100]
[error]     1269956579 -1651588045  -231839047  1009904595  ...   598167718 -1091060996  -56654787   376489325 
[error]     -808139313  1961476986 -1414633998  -383026468  ...  1799829621  1321614098 1435773124   358693408 
[error]     -118862603  -518816064   385642582   -17791247  ...  -655062533 -1228565848 1002622225  2070131445 
[error]    -1099712990   639182453  -786264453   205137673  ...  1188727935   539641967 1972840830  1311770901 
[error]    ...
[error]     -297240988 -1704265288  1080866841  2115918235  ...  -179810603   873080620 -964765650  1945350146 
[error]     1566446872 -1128376140 -1901055413 -1774102629  ... -2143441516   -65718677  166885557 -1808064785 
[error]     1555259345   581502718   614097437 -1372645149  ... -1596531520   -78749081 -584278297  -534540360 
[error]     -304506615 -1694821768 -1703229201   -57132164  ...  1995229777  1356349720 1926262097  2135818793 
[error]    
[error]    > ARG_1: -5.305570023460599E-165
[error]    The seed is rHuHhy2VdhK_RTuBaXb7PkeY6mOGYXBmj_fsLhY-L0J=
[error]    
[error]    > [100 x 100]
[error]    false  true  true false  ... false  true  true false 
[error]     true false  true  true  ... false false false false 
[error]     true  true false  true  ...  true  true false false 
[error]     true false  true false  ... false false false false 
[error]    ...
[error]     true  true false false  ...  true false  true false 
[error]    false  true  true  true  ...  true  true false  true 
[error]    false false false  true  ...  true  true  true  true 
[error]     true  true  true  true  ... false false false false 
[error]     != [100 x 100]
[error]    false  true  true false  ... false  true  true false 
[error]     true false  true  true  ... false false false false 
[error]     true  true false  true  ...  true  true false false 
[error]     true false  true false  ... false false false false 
[error]    ...
[error]     true  true false false  ...  true false  true false 
[error]    false  true  true  true  ...  true  true false  true 
[error]    false false false  true  ...  true  true  true  true 
[error]     true  true  true  true  ... false false false false  (MatCheck.scala:78)

`ScalarTagAny` issues with missing values

Unexpected: ScalarTagChar.isMissing(na.to[Char]) returns false.

I noticed this by testing Vec[Char]('a', 'b', na).fillForward(), which didn't yield the expected result.

add ToC in doc pages

The current hugo theme does not support page tocs. Switching to a doc theme that supports it would improve the doc navigability.
Example of page toc here: https://doks.netlify.app/docs/prologue/introduction/

Two potential alternatives:

Cherry on the cake: all these doc templates have search bars.

My suggestion is to switch to Docusaurus with mdoc integration, because many other reference libraries in the Scala ecosystem use it.

I'd be glad to take this task.

scala 3 support

Scala 3 support is blocked on the lack of the @specialized annotation. Boxing would turn a fast and terminating program into a slow and non-terminating (OOM) program.

Loop optimizations investigation

Hotspot has support for auto-vectorization. That is, the jvm will identify certain looping patterns on arrays and generate vectorized instructions (SIMD).

Investigate whether saddle triggers these optimizations by inspecting the produced x86_64 assembly. Good candidates for the investigations are the binary operations implemented in VecSclrElemOp, VecVecElemOp, etc.
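For such an investigation it may help to compare two loop shapes side by side. The sketch below is plain Scala and does not reproduce the code in VecSclrElemOp; `LoopShapes` is a hypothetical name, and using Int.MinValue as the missing-value sentinel is an assumption made for illustration.

```scala
object LoopShapes {
  // Plain counted loop over a primitive array with no branch in the body.
  def addPlain(xs: Array[Int], s: Int): Array[Int] = {
    val out = new Array[Int](xs.length)
    var i = 0
    while (i < xs.length) {
      out(i) = xs(i) + s
      i += 1
    }
    out
  }

  // Same loop with a per-element missing-value check in the body.
  def addWithNaCheck(xs: Array[Int], s: Int): Array[Int] = {
    val out = new Array[Int](xs.length)
    var i = 0
    while (i < xs.length) {
      val v = xs(i)
      out(i) = if (v == Int.MinValue) v else v + s
      i += 1
    }
    out
  }
}
```

The generated assembly for both methods can be inspected with the HotSpot diagnostic options -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which require the hsdis disassembler plugin to be installed. Whether either loop is actually vectorized depends on the JVM version and CPU.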

transpose on Panel fails

import org.saddle._
Panel(Vec(1, 2, 3), Vec("hello", "world", "!")).T

throws

java.lang.ArrayStoreException
	at java.lang.System.arraycopy(Native Method)
	at org.saddle.array.package$.$anonfun$flatten$2(package.scala:623)
	at org.saddle.array.package$.$anonfun$flatten$2$adapted(package.scala:621)
	at scala.collection.immutable.Vector.foreach(Vector.scala:1856)
	at org.saddle.array.package$.flatten(package.scala:621)
	at org.saddle.scalar.ScalarTagAny.concat(ScalarTagAny.scala:64)
	at org.saddle.Frame.toMat(Frame.scala:1426)
	at org.saddle.Frame.T(Frame.scala:168)
	at repl.MdocSession$App.<init>(scalar.worksheet.sc:11)
	at repl.MdocSession$.app(scalar.worksheet.sc:3)

Small refactors requiring major version bump

  • drop implicit param of Series.proxyWith
  • fix Melter.melt3_2 (type params A,B,C,D,E and not A, B, C, D)
  • remove with AnyRef in Scalar.ord
  • widen all implicit defs of Slice and ScalarTag. Don't use subtypes or object types. Review all implicit defs.
  • ...

bug in csv parser

There is a bug in the new csv parser which manifests if the buffer ends around a line break. It can be worked around by using a larger buffer. Further investigation is needed to find the specific cause.
