
featran's Introduction

featran

Build Status codecov.io Maven Central Scaladoc Scala Steward badge

Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning. It supports various collection types for feature extraction and various output formats for feature representation.

Introduction

Most feature transformation logic requires two steps: a global aggregation to summarize the data, followed by an element-wise mapping to transform each element. For example:

  • Min-Max Scaler
    • Aggregation: global min & max
    • Mapping: scale each value from [min, max] to [0, 1]
  • One-Hot Encoder
    • Aggregation: distinct labels
    • Mapping: convert each label to a binary vector

We can implement this in a naive way using reduce and map.

case class Point(score: Double, label: String)
val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

// Aggregation: global min, global max, and the set of distinct labels
val a = data
  .map(p => (p.score, p.score, Set(p.label)))
  .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))

// Mapping: min-max scale the score, then append a one-hot encoding of the label
val features = data.map { p =>
  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
}

But this is unmanageable for complex feature sets. The same logic can be expressed concisely in Featran.

import com.spotify.featran._
import com.spotify.featran.transformers._

val fs = FeatureSpec.of[Point]
  .required(_.score)(MinMaxScaler("min-max"))
  .required(_.label)(OneHotEncoder("one-hot"))

val fe = fs.extract(data)
val names = fe.featureNames
val features = fe.featureValues[Seq[Double]]

Featran also supports the following additional features:

  • Extract from Scala collections, Flink DataSets, Scalding TypedPipes, Scio SCollections and Spark RDDs
  • Output as Scala collections, Breeze dense and sparse vectors, TensorFlow Example Protobuf, XGBoost LabeledPoint and NumPy .npy file
  • Import aggregation from a previous extraction for training, validation and test sets
  • Compose feature specifications and separate outputs
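
The third bullet (importing a previous aggregation) can be sketched in plain Scala. The helpers below are hypothetical stand-ins, shown only to illustrate that the aggregation is computed once on the training set and then reused unchanged on other sets:

```scala
// Hypothetical stand-ins illustrating settings reuse; not featran's API.
case class Point(score: Double, label: String)
case class Settings(min: Double, max: Double)

// Aggregation: computed once, on the training set only
def aggregate(data: Seq[Point]): Settings =
  Settings(data.map(_.score).min, data.map(_.score).max)

// Mapping: reuses a previously computed aggregation
def transform(data: Seq[Point], s: Settings): Seq[Double] =
  data.map(p => (p.score - s.min) / (s.max - s.min))

val train    = Seq(Point(1.0, "a"), Point(3.0, "b"))
val settings = aggregate(train)

// Validation/test data is scaled with the *training* min/max
val scaled = transform(Seq(Point(2.0, "a")), settings)
```

In featran itself, this aggregation is serialized as settings that a later extraction can import (cf. FeatureSpec#extractWithSettings).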

See Examples (source) for detailed examples. See the transformers package for a complete list of available feature transformers.

See ScalaDocs for current API documentation.

Presentations

Artifacts

Featran includes the following artifacts:

  • featran-core - core library, support for extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors
  • featran-java - Java interface, see JavaExample.java
  • featran-flink - support for extraction from Flink DataSet
  • featran-scalding - support for extraction from Scalding TypedPipe
  • featran-scio - support for extraction from Scio SCollection
  • featran-spark - support for extraction from Spark RDD
  • featran-tensorflow - support for output as TensorFlow Example Protobuf
  • featran-xgboost - support for output as XGBoost LabeledPoint
  • featran-numpy - support for output as NumPy .npy file

License

Copyright 2016-2017 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

featran's People

Contributors

alaiacano, andrewsmartin, clairemcginty, dependabot[bot], derenrich, fallonchen, jbx, jcazevedo, jd557, kellen, martinbomio, nevillelyh, ravwojdyla, regadas, richwhitjr, rustedbones, scala-steward, sckelemen, slhansen, stormy-ua, syodage, yonromai


featran's Issues

java.lang.NullPointerException when accessing featureNames/featureValues of MultiFeatureSpec

I am getting a java.lang.NullPointerException when trying to access either featureNames or featureValues. When I use either of the two specs separately it works fine, but when I combine them in a MultiFeatureSpec it fails. Is it a bug or am I doing something wrong?

@BigQueryType.fromQuery(
    """
      |#standardSQL
      |SELECT album_gid, album.num_tracks AS num_tracks,
      |album.availability.latest_date AS latest_date,
      |global_popularity.popularity_normalized AS popularity_normalized,
      |album.duration AS duration
      |FROM (SELECT * FROM `knowledge-graph-112233.album_entity.album_entity_%s` LIMIT 1000)
      |WHERE album.num_tracks >= 3
    """.stripMargin, "$LATEST"
  ) class AlbumMeta

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    val date = args("date").replace("-", "")
    val output = args("output")

    val albumFeatures = sc.typedBigQuery[AlbumMeta](AlbumMeta.query.format(date))

    val conSpec = FeatureSpec.of[AlbumMeta]
      .required(_.duration.get.toDouble)(StandardScaler("duration"))
      .required(_.duration.get.toDouble)(StandardScaler("duration_mean", withMean=true))
      .required(_.duration.get.toDouble)(Identity("identity"))
      .required(_.duration.get.toDouble)(MinMaxScaler("min_max"))

    val albumSpec = FeatureSpec.of[AlbumMeta]
      .required(_.album_gid.get)(OneHotEncoder("album"))

    //    val spec_extracted = albumSpec.extract(albumFeatures)
    val spec_extracted = MultiFeatureSpec(conSpec, albumSpec).extract(albumFeatures)

    val t = spec_extracted.featureNames

    sc.close().waitUntilFinish()
  }

Error:
Caused by: java.lang.NullPointerException
at com.spotify.featran.FeatureSet.multiFeatureNames(FeatureSpec.scala:231)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.scio.util.Functions$$anon$8.processElement(Functions.scala:145)

Create load tests for Sparse/Dense arrays

Users report that the sparse representation of features might be more expensive than the dense one. AFAIR we do a little bit more work there, like creating new arrays after #59 and mapping. It would be nice to add some load tests and optimize that code path.

Provide defaults for FeatranSpec

In my code, I need to use featran to do identity transformation only:

  def applyFeatranSpec(s: SCollection[TrainingExample], vecSize: Int)
  : FeatureExtractor[SCollection, TrainingExample] = {
    FeatureSpec.of[TrainingExample]
      .required(e => e.context.map(_.toDouble).toArray)(VectorIdentity("context", vecSize))
      .required(e => e.target.toDouble)(Identity("target"))
      .extract(s)
  }

where

  case class TrainingExample(context: List[Long],
                             target: Long)

Would love the Featran API to provide good spec defaults, given a case class, so that I don't have to write this boilerplate code unless I need some specific feature transformation logic.

Frequency rank transformer

Feature request from an internal user:
"I want to transform a string feature to integer ids, where the most common string gets id 1, the second most common gets id 2, etc., with a cap at 10000, so everything else gets mapped to 0. Is that easily doable with featran?
The original dataset might have say 1 billion strings (of which say 1 million are unique)."

Not sure if this can be done with approximation in a single pass. The naive approach would be to build a Map[String, Int] of counts in the reduce phase and rank it (e.g. with a priority queue) before the map phase.
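
A plain-Scala sketch of that naive two-pass approach (illustrative only; a small cap is used here instead of 10000, and all names are hypothetical):

```scala
// Sketch of the naive frequency-rank idea; not featran's API.
val cap  = 3 // the request was 10000; kept small for illustration
val data = Seq("a", "b", "a", "c", "a", "b", "d")

// Reduce phase: count occurrences of each string
val counts: Map[String, Long] =
  data.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

// Rank by descending frequency; the most common string gets id 1
val ranks: Map[String, Int] = counts.toSeq
  .sortBy { case (k, n) => (-n, k) } // break frequency ties deterministically
  .take(cap)
  .zipWithIndex
  .map { case ((k, _), i) => k -> (i + 1) }
  .toMap

// Map phase: anything outside the top `cap` maps to 0
def rankOf(s: String): Int = ranks.getOrElse(s, 0)
```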

Add PositionalOneHotEncoder Transformer

The name needs some work, but the idea is: when you have categorical features like (red, blue, green) and a feature value of green, instead of outputting <0, 0, 1> the transformer should output 2.
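
A minimal sketch of the idea in plain Scala (hypothetical, not a featran transformer): the aggregation is the same as OneHotEncoder's (the distinct categories), but the mapping emits the category's position rather than a binary vector.

```scala
// Hypothetical positional encoding; insertion order fixes the positions.
val categories = Seq("red", "blue", "green")

val index: Map[String, Int] = categories.zipWithIndex.toMap

def encode(label: String): Int = index(label) // "green" -> 2 instead of <0, 0, 1>
```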

Add an Avro module

So we can write output records in Avro GenericRecords and eventually Parquet/Arrow compatible formats.

Add scaling factor in Hash*HotEncoders

We can provide 2 params:

  • hashBucketSize: Int = 0
  • sizeScalingFactor: Double = 4.0

If hashBucketSize > 0 we use that for assigning labels to buckets; if it is 0 we use the HLL-estimated size * sizeScalingFactor instead. A 4x factor gives us ~5% collisions according to #23.
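
The proposed rule could look like this sketch (plain Scala; the names mirror the params above, and the HLL estimate is passed in as a plain number rather than computed):

```scala
// Sketch of the proposed bucket-size rule; names are hypothetical.
// An explicit hashBucketSize wins; otherwise scale the estimated
// cardinality (HLL in featran) by sizeScalingFactor.
def numBuckets(hashBucketSize: Int,
               estimatedSize: Long,
               sizeScalingFactor: Double = 4.0): Int =
  if (hashBucketSize > 0) hashBucketSize
  else math.ceil(estimatedSize * sizeScalingFactor).toInt
```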

one hot encoder for large N

The one hot encoder takes a long time to run when the number of tokens is large. I believe the problem lies in the fact that the current implementation iterates over each element in the dictionary:

https://github.com/spotify/featran/blob/master/core/src/main/scala/com/spotify/featran/transformers/OneHotEncoder.scala#L45

I tried writing an implementation that uses a hash map instead of a list to store the tokens (plus their index), but it is still necessary to "inflate" the feature with the empty elements.

https://github.com/slhansen/featran/blob/master/core/src/main/scala/com/spotify/featran/transformers/OneHotEncoder.scala#L57-#L68

Since we can assume that only one element will be set, it would be preferable to default to an empty feature vector and just add that one element at the given index.
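
A plain-Scala sketch of that suggestion (hypothetical; featran's actual FeatureBuilder API differs): look the token up in a hash map and describe the one-hot result by its single active position, instead of iterating over the whole dictionary or materializing the dense vector.

```scala
// Sketch: O(1) lookup per element via a prebuilt index; not featran's code.
val dictionary = Seq("a", "b", "c") // stands in for a large vocabulary

val index: Map[String, Int] = dictionary.zipWithIndex.toMap

// One-hot as a sparse (size, activePosition) pair; the dense vector is
// implicitly all zeros except at the looked-up position.
def oneHotSparse(label: String): Option[(Int, Int)] =
  index.get(label).map(i => (dictionary.size, i))
```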

Support Selective OneHotEncoder

Currently the OHE encodes all values, but there might be thousands of them, which will blow up the training data. It would be good to have a control to limit the number of columns generated by OHE.

Some solutions from the discussions on slack:

  1. Have a parameter "N" which defines the number of columns and most occurring N categories will be kept.
  2. Have the list of required categories be defined in the feature spec. The user defines a list of N required categories and only those are OHE; others are ignored.
  3. Have a percentile defined, e.g. keep all categories accounting for say 90% of the observations.

From initial discussions it seemed #1 and #3 are harder to implement than #2. Open to discussion.
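
A plain-Scala sketch of option 1 (hypothetical, not featran code): the aggregation keeps only the N most frequent categories, and the mapping one-hot encodes against that reduced set, with dropped or unseen categories mapping to all zeros.

```scala
// Sketch of "keep the top-N categories"; names are hypothetical.
val n        = 2
val observed = Seq("a", "b", "a", "c", "a", "b")

// Aggregation: the N most frequent categories, ties broken deterministically
val kept: Seq[String] = observed
  .groupBy(identity)
  .toSeq
  .sortBy { case (k, v) => (-v.size, k) }
  .take(n)
  .map(_._1)

// Mapping: dropped or unseen categories become an all-zero vector
def encode(label: String): Seq[Double] =
  kept.map(k => if (k == label) 1.0 else 0.0)
```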

Refactor tests

Split up transformer tests, add utilities for deterministic and non-deterministic transformers, and test Transformer directly.

Collect feature stats

Could be useful for debugging. A couple of thoughts:

  • Opt-in for user specified columns
  • Pre/post transformation
  • How do we deal with vectors, etc.?

Java wrapper?

We might want to add a Java wrapper so that featran can be used in Java backend services.

IndexOutOfBoundsException when using HashOneHotEncoder with SparseVector

Not sure if this is a user bug or a featran bug, but here's the partial stack trace:

Caused by: java.lang.IndexOutOfBoundsException: 8239914 not in [0,8239912)
	at breeze.linalg.VectorBuilder.breeze$linalg$VectorBuilder$$boundsCheck(VectorBuilder.scala:92)
	at breeze.linalg.VectorBuilder.add(VectorBuilder.scala:112)
	at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:196)
	at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:195)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:73)
	at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at breeze.linalg.SparseVector$.apply(SparseVector.scala:195)
	at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:134)
	at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:117)
	at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:96)
	at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:94)
	at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)

Performance issue if `plus` is expensive

For transformers like *HotEncoder where `plus` is expensive, i.e. set or map concatenation, we might end up creating many temporary objects and causing GC pressure. We could potentially override sumOption, but it's not clear how to implement this across multiple transformers.
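
The idea behind sumOption can be sketched in plain Scala (this is not Algebird's actual signature): fold all inputs into one mutable accumulator instead of performing pairwise immutable unions, which allocate a fresh set per `plus`.

```scala
// Sketch of the sumOption optimization idea; not Algebird's API.
def plus(x: Set[String], y: Set[String]): Set[String] = x ++ y // allocates per call

// One mutable accumulator over the whole batch: no intermediate sets
def sumOption(xs: Seq[Set[String]]): Option[Set[String]] =
  if (xs.isEmpty) None
  else {
    val b = scala.collection.mutable.Set.empty[String]
    xs.foreach(b ++= _)
    Some(b.toSet)
  }
```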

Optimize extract with settings for streaming/backend use cases

Right now FeatureSpec#extractWithSettings does two things: parsing the JSON settings and extracting with .map. This can be inefficient in a backend where the data is a Seq[T] of one element.

We should either split it into two steps or add a streaming API so that elements can be fed into the Seq lazily.
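
The proposed two-step split could look like this sketch (all names are hypothetical; the real settings are JSON, and a CSV string stands in for them here):

```scala
// Hypothetical two-step API: parse settings once, extract many times.
case class ParsedSettings(min: Double, max: Double)

// Step 1: done once at startup (stands in for parsing the JSON settings)
def parseSettings(s: String): ParsedSettings = {
  val parts = s.split(",")
  ParsedSettings(parts(0).toDouble, parts(1).toDouble)
}

// Step 2: cheap per-element extraction against the pre-parsed settings,
// suitable for a backend handling one element per request
def extractOne(x: Double, s: ParsedSettings): Double =
  (x - s.min) / (s.max - s.min)

val settings = parseSettings("1.0,3.0")
val v = extractOne(2.0, settings)
```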

Settings versioning

We need versioning to prevent loading settings from incompatible versions of featran.

UnsupportedOperationException when using TensorFlowFeatureBuilder

Getting the following error when using TensorFlowFeatureBuilder to build Example with fe.featureValues[Example]:

java.lang.UnsupportedOperationException
    at com.google.protobuf.MapField.ensureMutable(MapField.java:290)
    at com.google.protobuf.MapField$MutatabilityAwareMap.put(MapField.java:333)
    at org.tensorflow.example.Features$Builder.putFeature(Features.java:631)
    at com.spotify.featran.tensorflow.package$TensorFlowFeatureBuilder$.add(package.scala:30)
    at com.spotify.featran.CrossingFeatureBuilder.add(CrossingFeatureBuilder.scala:96)
    at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:55)
    at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:42)
    at com.spotify.featran.transformers.Transformer.optBuildFeatures(Transformer.scala:83)
    at com.spotify.featran.Feature.unsafeBuildFeatures(FeatureSpec.scala:148)
    at com.spotify.featran.FeatureSet.featureValues(FeatureSpec.scala:270)
    at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:94)
    at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:93)
    at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)

Spec:

    FeatureSpec
      .of[Features]
      .required(_.doubleFeature)(StandardScaler("reg_feature", withMean = true, withStd = true))

Using featran 0.1.11, scio 0.4.3 and protobuf-java 3.3.1.
