
data_science

Keras:

Apps:

PyGotham:

Journalist LDA and ML:

Europython:

Scipy 2016:

28.7

27.7

26.7

25.7

24.7

22.7

21.7

20.7

19.7

18.7

15.7

14.7

data science summit:

daily

13.7

12.7

11.7

8.7

7.7

6.7

5.7

4.7

1.7

30.6

29.6

28.6

27.6

24.6

23.6

22.6

21.6

20.6

18.6

17.6

16.6

toread:

15.6

13.6

11.6

9.6

user classifiers:

Readings:

8.6

7.6

6.6

1.6

10 lessons learned from Xavier, recap:

  • implicit signals beat explicit ones (almost always): clickbait, rating psychology
  • your model will learn what you teach it to learn: features, loss function, F-score
  • supervised + unsupervised = life
  • everything is an ensemble
  • model sequences: the output of one model is the input of another (see the stacking sketch after this list)
  • feature engineering: reusable, transformable, interpretable, reliable
  • ML infra: the experimentation phase values ease of use, flexibility, and reusability; the production phase values performance and scalability
  • debug feature values
  • you don't need to distribute your ML algorithm
  • data science + ML engineering = perfection
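A minimal sketch of the "model sequences" idea above (the output of one model feeding another), assuming scikit-learn and a toy dataset; this is an illustrative stacking setup, not Xavier's actual pipeline:

```python
# Stacking sketch: stage-1 model output becomes a stage-2 model input.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: out-of-fold predictions so stage 2 never sees leaked labels.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
oof = cross_val_predict(rf, X_train, y_train, cv=5, method="predict_proba")[:, 1]

# Stage 2: append the first model's output as an extra feature.
X_train_stacked = np.column_stack([X_train, oof])
lr = LogisticRegression(max_iter=1000).fit(X_train_stacked, y_train)

# At test time, refit stage 1 on all training data and chain the two models.
rf.fit(X_train, y_train)
X_test_stacked = np.column_stack([X_test, rf.predict_proba(X_test)[:, 1]])
print("stacked accuracy:", lr.score(X_test_stacked, y_test))
```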

31.5

30.5

29.5

26.5

25.5

In summary, here is what I recommend if you plan to use word2vec: choose the right training parameters and training data for word2vec; use an average predictor for queries, sentences, and paragraphs (code here) after picking a dominant word set; and apply deep learning on the resulting vectors.
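A minimal sketch of that average predictor, assuming gensim; the toy corpus, stop-word list, and model settings below are placeholders, not the setup from the quoted recommendation:

```python
# Represent a query/sentence by the mean of its word vectors, after dropping
# non-dominant words (crude stop-word filter as a stand-in for a dominant word set).
import numpy as np
from gensim.models import Word2Vec

corpus = [["cheap", "auto", "insurance"], ["car", "insurance", "quote"],
          ["car", "loan", "rates"], ["auto", "repair", "shop"]] * 200
wv = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=10).wv

STOP = {"the", "a", "of", "and", "for"}  # illustrative dominant-word filter

def avg_vector(text):
    words = [w for w in text.lower().split() if w in wv and w not in STOP]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

query_vec = avg_vector("cheap car insurance")  # feed this into a downstream model
```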

===

For SGNS, here is what I believe really happens during the training: if two words appear together, the training will try to increase their cosine similarity; if two words never appear together, the training will reduce it. So if there are a lot of user queries such as “auto insurance” and “car insurance”, then the “auto” vector will be similar to the “insurance” vector (cosine similarity ~= 0.3) and the “car” vector will also be similar to the “insurance” vector. Since “insurance”, “loan” and “repair” rarely appear together in the same context, their vectors have small mutual cosine similarity (~= 0.1). We can treat them as orthogonal to each other and think of them as different dimensions. After training is complete, the “auto” vector will be very similar to the “car” vector (cosine similarity ~= 0.6), because both are similar along the “insurance”, “loan” and “repair” dimensions. This intuition is useful if you want to design your training data to better meet the goal of your text-learning task.
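A small sketch for inspecting this kind of similarity structure with gensim's skip-gram negative sampling (sg=1, negative > 0); the query corpus below is a made-up placeholder for a real query log:

```python
# Train SGNS on tokenized queries, then check which words end up close
# because they share contexts such as "insurance", "loan", "repair".
from gensim.models import Word2Vec

queries = [["cheap", "auto", "insurance"], ["car", "insurance", "quote"],
           ["auto", "loan", "rates"], ["car", "repair", "cost"]] * 500

model = Word2Vec(queries, vector_size=100, sg=1, negative=5,
                 window=2, min_count=1, epochs=10, seed=0)

print(model.wv.similarity("auto", "car"))
print(model.wv.similarity("insurance", "loan"))
print(model.wv.most_similar("auto", topn=3))
```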

===

For short sentences/phrases, Tomas Mikolov recommends simply adding up the individual word vectors to get a "sentence vector" (see his recent NIPS slides).

For longer documents, it is an open research question how to derive their representation, so no wonder you're having trouble :)

I like the way word2vec runs (no need for powerful hardware to process huge collections of text). It's more usable than LSA or any system which requires a term-document matrix.

Actually, LSA requires less structured input (only a bag-of-words matrix, whereas word2vec requires exact word sequences), so if anything word2vec is the more demanding of the two in terms of input.
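A quick illustration of that input-format difference, assuming scikit-learn and gensim with a made-up three-document corpus: LSA only needs an order-free term-document matrix, while word2vec consumes ordered token sequences.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from gensim.models import Word2Vec

docs = ["auto insurance quote", "car insurance quote", "car loan rates"]

# LSA: order-free counts, factorized with truncated SVD.
X = CountVectorizer().fit_transform(docs)
lsa_vectors = TruncatedSVD(n_components=2).fit_transform(X)

# word2vec: exact word sequences (here just whitespace-tokenized docs).
w2v = Word2Vec([d.split() for d in docs], vector_size=50, window=2, min_count=1)
```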

24.5

TSNE:

Conferences:

20.5

19.5

18.5

sentifi:

http://davidrosenberg.github.io/ml2016/#home

pydatalondon 2016:

spotify:

lda asyn, auto alpha: http://rare-technologies.com/python-lda-in-gensim-christmas-edition/
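A hedged sketch of the auto-tuned alpha option the linked post discusses, assuming gensim's LdaModel; the toy corpus and settings are placeholders:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["auto", "insurance"], ["car", "loan"], ["deep", "learning"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# alpha="auto" lets gensim learn an asymmetric document-topic prior from the data.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", passes=10)
```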

mapk: https://github.com/benhamner/Metrics/tree/master/Python/ml_metrics

iclr2016: https://tensortalk.com/?cat=conference-iclr-2016

l.m.thang

https://github.com/jxieeducation/DIY-Data-Science

http://drivendata.github.io/cookiecutter-data-science/

http://ofey.me/papers/sparse_ijcai16.pdf

Spotify:

skflow:

a few useful things to know about ML:

tdb: https://github.com/ericjang/tdb

dask for task parallel, delayed: http://dask.pydata.org/en/latest/examples-tutorials.html
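A tiny sketch of dask.delayed for task-parallel graphs, in the spirit of the linked examples; the load/total functions are made up:

```python
from dask import delayed

@delayed
def load(i):
    return list(range(i))

@delayed
def total(parts):
    return sum(len(p) for p in parts)

result = total([load(i) for i in range(4)])  # lazily builds the task graph
print(result.compute())                      # executes the graph in parallel
```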

skflow:

http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/

https://medium.com/a-year-of-artificial-intelligence/lenny-2-autoencoders-and-word-embeddings-oh-my-576403b0113a#.ecj0iv4n8

https://github.com/andrewt3000/DL4NLP/blob/master/README.md

tf:

tf chatbot: https://github.com/nicolas-ivanov/tf_seq2seq_chatbot

Bayesian Opt: https://github.com/fmfn/BayesianOptimization/blob/master/examples/visualization.ipynb

click-o-tron rnn: http://clickotron.com

auto generated headline clickbait with recurrent neural networks: https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/

http://blog.computationalcomplexity.org/2016/04/the-master-algorithm.html

http://jyotiska.github.io/blog/posts/python_libraries.html

LSTM: http://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/

CS224d:

SOTA of sentiment analysis, Mikolov and me :)

Thang M. L: http://web.stanford.edu/class/cs224n/handouts/cs224n-lecture16-nmt.pdf

CS224d reports:

QA in keras:

Chinese LSTM + word2vec:

DL with SA: https://cs224d.stanford.edu/reports/HongJames.pdf

MAB:

cnn nudity detection: http://blog.clarifai.com/what-convolutional-neural-networks-see-at-when-they-see-nudity/#.VxbdB0xcSko

sigopt: https://github.com/sigopt/sigopt_sklearn

first contact with TF: http://www.jorditorres.org/first-contact-with-tensorflow/

eval of ML using A/B or multibandit: http://blog.dato.com/how-to-evaluate-machine-learning-models-the-pitfalls-of-ab-testing

how to make mistakes in Python: www.oreilly.com/programming/free/files/how-to-make-mistakes-in-python.pdf

keras tut: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/keras_tutorial.pdf

Ogrisel word embedding: https://speakerd.s3.amazonaws.com/presentations/31f18ad0522c0132b9b662e7bb117668/Word_Embeddings.pdf

Tensorflow whitepaper: http://download.tensorflow.org/paper/whitepaper2015.pdf

Arimo distributed tensorflow: https://arimo.com/machine-learning/deep-learning/2016/arimo-distributed-tensorflow-on-spark/

Best ever word2vec in code: http://nbviewer.jupyter.org/github/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb

TF japanese: http://www.slideshare.net/yutakashino/tensorflow-white-paper

TF tut101: https://github.com/aymericdamien/TensorFlow-Examples

Jeff Dean: http://learningsys.org/slides/NIPS-Learning-Systems-Workshop-TensorFlow-Jeff-Dean.pdf

DL: http://www.thoughtly.co/blog/deep-learning-lesson-1/

Distributed TF: https://www.tensorflow.org/versions/r0.8/how_tos/distributed/index.html

playground: http://playground.tensorflow.org/

Hoang Duong blog: http://hduongtrong.github.io/

Word2vec short explanation: http://hduongtrong.github.io/2015/11/20/word2vec/

ForestSpy: https://github.com/jvns/forestspy/blob/master/inspecting%20random%20forest%20models.ipynb

Netflix:

Lessons learned

WMD:

Hanoi trip:

VinhKhuc:

RS:

Data science bootcamp: https://cambridgecoding.com/datascience-bootcamp#outline

CambridgeCoding NLP:

Annoy:

RPForest: https://github.com/lyst/rpforest

LightFM: https://github.com/lyst/lightfm

Secure because of math: https://www.youtube.com/watch?v=TYVCVzEJhhQ

Talking machines: http://www.thetalkingmachines.com/

Dive into DS: https://github.com/rasbt/dive-into-machine-learning

DS process: https://www.oreilly.com/ideas/building-a-high-throughput-data-science-machine

Friendship paradox: https://vuhavan.wordpress.com/2016/03/25/ban-ban-ban-nhieu-hon-ban-ban/

AB test:

EMNLP 2015:

To read:

Idols:

IPython/Jupyter:

LSTM:

RNN:

Unicode:

EVENTS:

  • April 8-10 2016: PyData Madrid
  • April 15-17 2016: PyData Florence
  • May 6-8 2016: PyData London hosted by Bloomberg
  • May 20-21 2016: PyData Berlin
  • September 14-16 2016: PyData Carolinas hosted by IBM
  • October 7-9 2016: PyData DC hosted by Capital One
  • November 28-30 2016: PyData Cologne

Other Conference Dates Coming Soon!

QUOTES:

  • My name is Sherlock Holmes. It is my business to know what other people don't know.
  • Take the first step in faith. You don't have to see the whole staircase, just take the first step. [M. L. King, Jr.]
  • "Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." [Arthur Conan Doyle]

STATS:

BOOKS:

CLUSTER:

EMBEDDING:

Linux:

BENCHMARK:

DIY:

Products:

Full stack:

Must seen:

Must read:

Curated:

Cool blogs:

Visualizations:

Writing:

Teaching:
