
featran's Introduction

featran

Build Status codecov.io Maven Central Scaladoc Scala Steward badge

Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning. It supports various collection types for feature extraction and various output formats for feature representation.

Introduction

Most feature transformation logic requires two steps: a global aggregation to summarize the data, followed by an element-wise mapping to transform each element. For example:

  • Min-Max Scaler
    • Aggregation: global min & max
    • Mapping: scale each value from [min, max] to [0, 1]
  • One-Hot Encoder
    • Aggregation: distinct labels
    • Mapping: convert each label to a binary vector

We can implement this in a naive way using reduce and map.

case class Point(score: Double, label: String)
val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

// Aggregation: global min, global max, and the set of distinct labels
val a = data
  .map(p => (p.score, p.score, Set(p.label)))
  .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))

// Mapping: min-max scale the score, then append a one-hot encoding of the label
val features = data.map { p =>
  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
}

But this is unmanageable for complex feature sets. The same logic can be expressed concisely in Featran.

import com.spotify.featran._
import com.spotify.featran.transformers._

val fs = FeatureSpec.of[Point]
  .required(_.score)(MinMaxScaler("min-max"))
  .required(_.label)(OneHotEncoder("one-hot"))

val fe = fs.extract(data)
val names = fe.featureNames
val features = fe.featureValues[Seq[Double]]

Featran also supports the following additional features:

  • Extract from Scala collections, Flink DataSets, Scalding TypedPipes, Scio SCollections and Spark RDDs
  • Output as Scala collections, Breeze dense and sparse vectors, TensorFlow Example Protobuf, XGBoost LabeledPoint and NumPy .npy file
  • Import aggregation from a previous extraction for training, validation and test sets
  • Compose feature specifications and separate outputs
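
The third bullet (importing a previous aggregation) can be sketched in plain Scala. The helpers below are hypothetical stand-ins, shown only to illustrate that the aggregation is computed once on the training set and then reused unchanged on other sets:

```scala
// Hypothetical stand-ins illustrating settings reuse; not featran's API.
case class Point(score: Double, label: String)
case class Settings(min: Double, max: Double)

// Aggregation: computed once, on the training set only
def aggregate(data: Seq[Point]): Settings =
  Settings(data.map(_.score).min, data.map(_.score).max)

// Mapping: reuses a previously computed aggregation
def transform(data: Seq[Point], s: Settings): Seq[Double] =
  data.map(p => (p.score - s.min) / (s.max - s.min))

val train    = Seq(Point(1.0, "a"), Point(3.0, "b"))
val settings = aggregate(train)

// Validation/test data is scaled with the *training* min/max
val scaled = transform(Seq(Point(2.0, "a")), settings)
```

In featran itself, this aggregation is serialized as settings that a later extraction can import (cf. FeatureSpec#extractWithSettings).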

See Examples (source) for detailed examples. See the transformers package for a complete list of available feature transformers.

See ScalaDocs for current API documentation.

Presentations

Artifacts

Featran includes the following artifacts:

  • featran-core - core library, support for extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors
  • featran-java - Java interface, see JavaExample.java
  • featran-flink - support for extraction from Flink DataSet
  • featran-scalding - support for extraction from Scalding TypedPipe
  • featran-scio - support for extraction from Scio SCollection
  • featran-spark - support for extraction from Spark RDD
  • featran-tensorflow - support for output as TensorFlow Example Protobuf
  • featran-xgboost - support for output as XGBoost LabeledPoint
  • featran-numpy - support for output as NumPy .npy file

License

Copyright 2016-2017 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

featran's People

Contributors

alaiacano, andrewsmartin, clairemcginty, dependabot[bot], derenrich, fallonchen, jbx, jcazevedo, jd557, kellen, martinbomio, nevillelyh, ravwojdyla, regadas, richwhitjr, rustedbones, scala-steward, sckelemen, slhansen, stormy-ua, syodage, yonromai


featran's Issues

java.lang.NullPointerException when accessing featureNames/featureValues of MultiFeatureSpec

I am getting a java.lang.NullPointerException when trying to access either featureNames or featureValues. When I use either of the two specs separately it works fine, but when I combine them in a MultiFeatureSpec it fails. Is it a bug or am I doing something wrong?

@BigQueryType.fromQuery(
    """
      |#standardSQL
      |SELECT album_gid, album.num_tracks AS num_tracks,
      |album.availability.latest_date AS latest_date,
      |global_popularity.popularity_normalized AS popularity_normalized,
      |album.duration AS duration
      |FROM (SELECT * FROM `knowledge-graph-112233.album_entity.album_entity_%s` LIMIT 1000)
      |WHERE album.num_tracks >= 3
    """.stripMargin, "$LATEST"
  ) class AlbumMeta

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    val date = args("date").replace("-", "")
    val output = args("output")

    val albumFeatures = sc.typedBigQuery[AlbumMeta](AlbumMeta.query.format(date))

    val conSpec = FeatureSpec.of[AlbumMeta]
      .required(_.duration.get.toDouble)(StandardScaler("duration"))
      .required(_.duration.get.toDouble)(StandardScaler("duration_mean", withMean=true))
      .required(_.duration.get.toDouble)(Identity("identity"))
      .required(_.duration.get.toDouble)(MinMaxScaler("min_max"))

    val albumSpec = FeatureSpec.of[AlbumMeta]
      .required(_.album_gid.get)(OneHotEncoder("album"))

    //    val spec_extracted = albumSpec.extract(albumFeatures)
    val spec_extracted = MultiFeatureSpec(conSpec, albumSpec).extract(albumFeatures)

    val t = spec_extracted.featureNames

    sc.close().waitUntilFinish()
  }

Error:
Caused by: java.lang.NullPointerException
at com.spotify.featran.FeatureSet.multiFeatureNames(FeatureSpec.scala:231)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.scio.util.Functions$$anon$8.processElement(Functions.scala:145)

Create load tests for Sparse/Dense arrays

Users report that the sparse representation of features might be more expensive than the dense one. AFAIR we do a little bit more work there, like creating new arrays after #59 and mapping. It would be nice to add some load tests and optimize that code path.

Provide defaults for FeatranSpec

In my code, I need to use featran to do identity transformation only:

  def applyFeatranSpec(s: SCollection[TrainingExample], vecSize: Int)
  : FeatureExtractor[SCollection, TrainingExample] = {
    FeatureSpec.of[TrainingExample]
      .required(e => e.context.map(_.toDouble).toArray)(VectorIdentity("context", vecSize))
      .required(e => e.target.toDouble)(Identity("target"))
      .extract(s)
  }

where

  case class TrainingExample(context: List[Long],
                             target: Long)

Would love the Featran API to provide good spec defaults, given a case class, so that I don't have to write this boilerplate code unless I need some specific feature transformation logic.

Frequency rank transformer

Feature request from an internal user:
"I want to transform a string feature to integer ids, where the most common string gets id 1, the second most common gets id 2, etc., with a cap at 10000, so everything else gets mapped to 0. Is that easily doable with featran?
The original dataset might have say 1 billion strings (of which say 1 million are unique)."

Not sure if this can be done with approximation in a single pass. The naive approach would be to build a Map[String, Int] of counts in the reduce phase and rank it (e.g. with a priority queue) before the map phase.
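
A plain-Scala sketch of that naive two-pass approach (illustrative only; a small cap is used here instead of 10000, and all names are hypothetical):

```scala
// Sketch of the naive frequency-rank idea; not featran's API.
val cap  = 3 // the request was 10000; kept small for illustration
val data = Seq("a", "b", "a", "c", "a", "b", "d")

// Reduce phase: count occurrences of each string
val counts: Map[String, Long] =
  data.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

// Rank by descending frequency; the most common string gets id 1
val ranks: Map[String, Int] = counts.toSeq
  .sortBy { case (k, n) => (-n, k) } // break frequency ties deterministically
  .take(cap)
  .zipWithIndex
  .map { case ((k, _), i) => k -> (i + 1) }
  .toMap

// Map phase: anything outside the top `cap` maps to 0
def rankOf(s: String): Int = ranks.getOrElse(s, 0)
```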

Add PositionalOneHotEncoder Transformer

The name needs some work, but the idea is: when you have categorical features like (red, blue, green) and a feature value of green, instead of outputting <0, 0, 1> the transformer should output 2.
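
A minimal sketch of the idea in plain Scala (hypothetical, not a featran transformer): the aggregation is the same as OneHotEncoder's (the distinct categories), but the mapping emits the category's position rather than a binary vector.

```scala
// Hypothetical positional encoding; insertion order fixes the positions.
val categories = Seq("red", "blue", "green")

val index: Map[String, Int] = categories.zipWithIndex.toMap

def encode(label: String): Int = index(label) // "green" -> 2 instead of <0, 0, 1>
```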

Add an Avro module

So we can write output records in Avro GenericRecords and eventually Parquet/Arrow compatible formats.

Add scaling factor in Hash*HotEncoders

We can provide 2 params:

  • hashBucketSize: Int = 0
  • sizeScalingFactor: Double = 4.0

If hashBucketSize > 0 we use that for assigning labels to buckets; if it is 0 we use the HLL-estimated size * sizeScalingFactor instead. A 4x factor gives us ~5% collisions according to #23.
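
The proposed rule could look like this sketch (plain Scala; the names mirror the params above, and the HLL estimate is passed in as a plain number rather than computed):

```scala
// Sketch of the proposed bucket-size rule; names are hypothetical.
// An explicit hashBucketSize wins; otherwise scale the estimated
// cardinality (HLL in featran) by sizeScalingFactor.
def numBuckets(hashBucketSize: Int,
               estimatedSize: Long,
               sizeScalingFactor: Double = 4.0): Int =
  if (hashBucketSize > 0) hashBucketSize
  else math.ceil(estimatedSize * sizeScalingFactor).toInt
```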

one hot encoder for large N

The one hot encoder takes a long time to run when the number of tokens is large. I believe the problem lies in the fact that the current implementation iterates over each element in the dictionary:

https://github.com/spotify/featran/blob/master/core/src/main/scala/com/spotify/featran/transformers/OneHotEncoder.scala#L45

I tried writing an implementation that uses a hash map instead of a list to store the tokens (plus their index), but it is still necessary to "inflate" the feature with the empty elements.

https://github.com/slhansen/featran/blob/master/core/src/main/scala/com/spotify/featran/transformers/OneHotEncoder.scala#L57-#L68

Since we can assume that only one element will be set, it would be preferable to default to an empty feature vector and just add that one element at the given index.
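
A plain-Scala sketch of that suggestion (hypothetical; featran's actual FeatureBuilder API differs): look the token up in a hash map and describe the one-hot result by its single active position, instead of iterating over the whole dictionary or materializing the dense vector.

```scala
// Sketch: O(1) lookup per element via a prebuilt index; not featran's code.
val dictionary = Seq("a", "b", "c") // stands in for a large vocabulary

val index: Map[String, Int] = dictionary.zipWithIndex.toMap

// One-hot as a sparse (size, activePosition) pair; the dense vector is
// implicitly all zeros except at the looked-up position.
def oneHotSparse(label: String): Option[(Int, Int)] =
  index.get(label).map(i => (dictionary.size, i))
```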

Support Selective OneHotEncoder

Currently the OHE encodes all values, but there might be thousands of them, which will blow up the training data. It would be good to have a control to limit the number of columns generated by OHE.

Some solutions from the discussions on slack:

  1. Have a parameter "N" which defines the number of columns and most occurring N categories will be kept.
  2. Have the list of required categories be defined in the feature spec. The user defines a list of N required categories and only those are OHE; others are ignored.
  3. Have a percentile defined, e.g. keep all categories accounting for say 90% of the observations.

From initial discussions it seemed #1 and #3 are harder to implement than #2. Open to discussion.
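
A plain-Scala sketch of option 1 (hypothetical, not featran code): the aggregation keeps only the N most frequent categories, and the mapping one-hot encodes against that reduced set, with dropped or unseen categories mapping to all zeros.

```scala
// Sketch of "keep the top-N categories"; names are hypothetical.
val n        = 2
val observed = Seq("a", "b", "a", "c", "a", "b")

// Aggregation: the N most frequent categories, ties broken deterministically
val kept: Seq[String] = observed
  .groupBy(identity)
  .toSeq
  .sortBy { case (k, v) => (-v.size, k) }
  .take(n)
  .map(_._1)

// Mapping: dropped or unseen categories become an all-zero vector
def encode(label: String): Seq[Double] =
  kept.map(k => if (k == label) 1.0 else 0.0)
```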

Refactor tests

Split up transformer tests, add utilities for deterministic and non-deterministic transformers, and test Transformer directly.

Collect feature stats

Could be useful for debugging. A couple of thoughts:

  • Opt-in for user specified columns
  • Pre/post transformation
  • How do we deal with vectors, etc.?

Java wrapper?

We might want to add a Java wrapper so that featran can be used in Java backend services.

IndexOutOfBoundsException when using HashOneHotEncoder with SparseVector

Not sure if this is a user bug or a featran bug, but here's the partial stack trace:

Caused by: java.lang.IndexOutOfBoundsException: 8239914 not in [0,8239912)
	at breeze.linalg.VectorBuilder.breeze$linalg$VectorBuilder$$boundsCheck(VectorBuilder.scala:92)
	at breeze.linalg.VectorBuilder.add(VectorBuilder.scala:112)
	at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:196)
	at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:195)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:73)
	at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at breeze.linalg.SparseVector$.apply(SparseVector.scala:195)
	at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:134)
	at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:117)
	at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:96)
	at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:94)
	at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)

Performance issue if `plus` is expensive

For transformers like *HotEncoder where `plus` is expensive, i.e. set or map concatenation, we might end up creating many temporary objects and causing GC pressure. We could potentially override sumOption, but it's not clear how to implement this across multiple transformers.
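
The idea behind sumOption can be sketched in plain Scala (this is not Algebird's actual signature): fold all inputs into one mutable accumulator instead of performing pairwise immutable unions, which allocate a fresh set per `plus`.

```scala
// Sketch of the sumOption optimization idea; not Algebird's API.
def plus(x: Set[String], y: Set[String]): Set[String] = x ++ y // allocates per call

// One mutable accumulator over the whole batch: no intermediate sets
def sumOption(xs: Seq[Set[String]]): Option[Set[String]] =
  if (xs.isEmpty) None
  else {
    val b = scala.collection.mutable.Set.empty[String]
    xs.foreach(b ++= _)
    Some(b.toSet)
  }
```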

Optimize extract with settings for streaming/backend use cases

Right now FeatureSpec#extractWithSettings does two things: parsing the JSON settings and extracting with .map. This can be inefficient in a backend where the data is a Seq[T] of one element.

We should either split it into two steps or add a streaming API so that elements can be fed into the Seq lazily.
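
The proposed two-step split could look like this sketch (all names are hypothetical; the real settings are JSON, and a CSV string stands in for them here):

```scala
// Hypothetical two-step API: parse settings once, extract many times.
case class ParsedSettings(min: Double, max: Double)

// Step 1: done once at startup (stands in for parsing the JSON settings)
def parseSettings(s: String): ParsedSettings = {
  val parts = s.split(",")
  ParsedSettings(parts(0).toDouble, parts(1).toDouble)
}

// Step 2: cheap per-element extraction against the pre-parsed settings,
// suitable for a backend handling one element per request
def extractOne(x: Double, s: ParsedSettings): Double =
  (x - s.min) / (s.max - s.min)

val settings = parseSettings("1.0,3.0")
val v = extractOne(2.0, settings)
```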

Settings versioning

We need versioning to prevent loading settings from incompatible versions of featran.

UnsupportedOperationException when using TensorFlowFeatureBuilder

Getting the following error when using TensorFlowFeatureBuilder to build Example with fe.featureValues[Example]:

java.lang.UnsupportedOperationException
    at com.google.protobuf.MapField.ensureMutable(MapField.java:290)
    at com.google.protobuf.MapField$MutatabilityAwareMap.put(MapField.java:333)
    at org.tensorflow.example.Features$Builder.putFeature(Features.java:631)
    at com.spotify.featran.tensorflow.package$TensorFlowFeatureBuilder$.add(package.scala:30)
    at com.spotify.featran.CrossingFeatureBuilder.add(CrossingFeatureBuilder.scala:96)
    at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:55)
    at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:42)
    at com.spotify.featran.transformers.Transformer.optBuildFeatures(Transformer.scala:83)
    at com.spotify.featran.Feature.unsafeBuildFeatures(FeatureSpec.scala:148)
    at com.spotify.featran.FeatureSet.featureValues(FeatureSpec.scala:270)
    at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:94)
    at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:93)
    at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)

Spec:

    FeatureSpec
      .of[Features]
      .required(_.doubleFeature)(StandardScaler("reg_feature", withMean = true, withStd = true))

Using featran 0.1.11, scio 0.4.3 and protobuf-java 3.3.1.
