Comments (10)
I meant to implement a transformer that matches https://www.tensorflow.org/api_docs/python/tf/contrib/layers/sparse_column_with_hash_bucket. TF requires a hash_bucket_size, so in our case we could either let the user provide that value or optionally estimate it with a HLL in the aggregation step.
from featran.
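For the HLL idea, here is a toy HyperLogLog showing what cardinality estimation in the aggregation step could look like. It is illustrative only (a real implementation would likely reuse e.g. Algebird's HyperLogLogMonoid); the `Hll` class and its methods are made-up names for this sketch:

```scala
import scala.util.hashing.MurmurHash3

// Toy HyperLogLog with 2^p registers (illustrative, not production-grade).
class Hll(p: Int) {
  private val m = 1 << p
  private val regs = new Array[Int](m)

  def add(s: String): Unit = {
    // 32-bit hash: top p bits pick a register, the rest give the rank.
    val h = MurmurHash3.stringHash(s)
    val idx = h >>> (32 - p)
    val rest = h << p
    // rank = position of the first set bit in the remaining 32 - p bits.
    val rank =
      if (rest == 0) (32 - p) + 1
      else Integer.numberOfLeadingZeros(rest) + 1
    if (rank > regs(idx)) regs(idx) = rank
  }

  def estimate: Double = {
    val alpha = 0.7213 / (1 + 1.079 / m) // standard bias-correction constant
    val raw = alpha * m * m / regs.map(r => math.pow(2.0, -r)).sum
    val zeros = regs.count(_ == 0)
    // Small-range correction: fall back to linear counting.
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

With p = 12 the standard error is around 1.6%, which is plenty for sizing a hash space.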
Maybe we should have a one-hot encoder that is just a hasher: given a possible hash space, it takes the category and hashes it into the space. It would basically be free hashing if a few collisions are OK. Alternatively, we could count distinct elements using a HLL and use that to size the hash space.
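A minimal sketch of what that hashing one-hot encoder could look like, assuming MurmurHash3 as the hash (TF's fast variant uses FarmHash Fingerprint64 instead); `HashOneHotEncoder` and its methods are hypothetical names, not featran API:

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical hashing one-hot encoder (not featran's actual implementation).
case class HashOneHotEncoder(hashBucketSize: Int) {
  // Map a category string to a bucket index in [0, hashBucketSize).
  def bucket(category: String): Int = {
    val h = MurmurHash3.stringHash(category)
    // floorMod keeps the index non-negative for negative hash values.
    java.lang.Math.floorMod(h, hashBucketSize)
  }

  // Dense one-hot vector with a single 1.0 at the hashed index.
  def transform(category: String): Array[Double] = {
    val v = new Array[Double](hashBucketSize)
    v(bucket(category)) = 1.0
    v
  }
}
```

A sparse representation would normally be preferred in practice, since the dense vector is hashBucketSize long with a single non-zero entry.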
@richwhitjr can you post some references here? I guess collisions are unavoidable, and HLL needs to be a separate step outside of featran.
TF's implementation here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/string_to_hash_bucket_op.h
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/string_to_hash_bucket_op.cc
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/lib/hash/hash.h
I'll try porting those over.
Have something WIP here: https://github.com/spotify/featran/tree/neville/hash-hot
The collision rate & output dimension can be tuned via hashBucketSize, though the collision rate can be pretty high.
Using a list of 466544 English words, HLL estimates the size at 464220. Here are the dense vector sizes (Array[Double], in MB) and collision stats at different bucket sizes:
Input size: 466544
hashBucketSize: 1073741824 8192.0MB collisions: 97 0.0208%
hashBucketSize: 536870912 4096.0MB collisions: 210 0.0450%
hashBucketSize: 268435456 2048.0MB collisions: 398 0.0853%
hashBucketSize: 134217728 1024.0MB collisions: 836 0.1792%
hashBucketSize: 67108864 512.0MB collisions: 1629 0.3492%
hashBucketSize: 33554432 256.0MB collisions: 3189 0.6835%
hashBucketSize: 16777216 128.0MB collisions: 6271 1.3441%
hashBucketSize: 8388608 64.0MB collisions: 12354 2.6480%
hashBucketSize: 4194304 32.0MB collisions: 24001 5.1444%
hashBucketSize: 2097152 16.0MB collisions: 44540 9.5468%
hashBucketSize: 1048576 8.0MB collisions: 77568 16.6261%
hashBucketSize: 524288 4.0MB collisions: 117448 25.1740%
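As a sanity check on the table: assuming a uniform hash, the expected number of buckets that end up with more than one key is b * (1 - (1 + n/b) * exp(-n/b)), which is roughly n^2 / (2b) when n << b. For n = 466544 this predicts ~101 at b = 2^30 and ~117k at b = 2^19, close to the measured 97 and 117448, so the reported collision counts look consistent with that reading. A sketch (`expectedCollidingBuckets` is a hypothetical helper, not featran API):

```scala
// Expected number of buckets receiving two or more keys when n keys are
// hashed uniformly into b buckets: b * (1 - (1 + n/b) * e^(-n/b)).
// For n << b this is approximately n^2 / (2b).
def expectedCollidingBuckets(n: Long, b: Long): Double = {
  val x = n.toDouble / b.toDouble
  b.toDouble * (1.0 - (1.0 + x) * math.exp(-x))
}
```

This gives a way to predict the collision rate for any hashBucketSize without re-running the benchmark.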
Hard to know what is acceptable. I wonder if we can check this collision rate against TF to see what they found to be good enough. I doubt we could do much better than theirs with this approach without going deep into hashing logic.
Well, most collisions are caused by the hash % bucketSize step. Without that, i.e. when size == Int.MaxValue, 97 collisions is close to what I found in other benchmarks.
Maybe instead of using the estimated size directly, we can recommend some heuristics, like 8x (~5% collisions) or 16x (~2.5%) the estimated size?
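A sketch of that heuristic, assuming the approximation collidingBuckets/n ≈ n/(2b) suggested by the table above; `recommendBucketSize` is a hypothetical helper, not featran API:

```scala
// Hypothetical helper: pick a hash bucket size from an estimated
// cardinality and a target collision rate, using rate ≈ n / (2b), then
// rounding up to a power of two (matching the benchmark sizes above).
def recommendBucketSize(estimatedSize: Long, targetRate: Double): Long = {
  require(targetRate > 0.0 && targetRate < 1.0, "rate must be in (0, 1)")
  val raw = math.ceil(estimatedSize / (2.0 * targetRate)).toLong
  // Round up to the next power of two.
  var b = 1L
  while (b < raw) b <<= 1
  b
}
```

The power-of-two rounding means the realized rate lands at or below the target, at the cost of up to 2x the memory.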
That makes sense. Maybe the options are collision rate or hash bucket size. We may have to give a warning if they set the collision rate too low.
Fixed in #26