Data-Science-course-Columbia-Note

Data science is multidisciplinary: collect, observe, and process data ... -> prediction and recommendation.

Data is now exploding (people, sensors, social ...).

Data visualization and data summarization are important because they help people find out what kind of data is important. They are among the first steps of data science.

What does data science need? Statistics, mathematics, machine learning, optimization.

In the "map" stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors. The intermediate results are then combined by a single reduce task.

The most popular open source implementation of MapReduce is the Hadoop project
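The map and reduce stages described above can be sketched on a single machine in plain Python. This is only an illustration of the idea, not Hadoop's API: the word-count task, the `map_task` and `reduce_task` names, and the input chunks are all assumptions made for the example.

```python
from collections import defaultdict

def map_task(chunk):
    """Map stage: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_task(pairs):
    """Reduce stage: combine all intermediate pairs into one count per word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical input, already split into chunks (one chunk per "processor").
chunks = ["data science is fun", "data is everywhere"]

# Every chunk goes through the same map task; the results are then combined
# by a single reduce task.
intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
word_counts = reduce_task(intermediate)
print(word_counts)
```

In a real cluster each chunk would be mapped on a different machine; here the list comprehension simply stands in for that distribution step.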

II.1 Statistics

  1. Some concepts

Population of interest -> who is used for sampling?

Data generation process -> create data -> analyse -> output

The data generation process is affected by assumptions:
- good assumption -> good data -> good output
- bad assumption -> bad data -> bad output (revise)

Data are numbers in context (the meaning of data depends on the individual and the context). Statistics makes observations on individuals (e.g. Mr. A, Mr. B) over a set of variables (e.g. size, age, ...).

Types of variable:
- categorical: no arithmetic meaning (e.g. gender, nationality, ...)
- quantitative: has arithmetic meaning (e.g. age, size)
- ordinal: no arithmetic meaning but has an order (e.g. How often do you study? Every day, sometimes, ...)

Statistics are summaries of numerical data. They do not tell the whole story, but they are useful and meaningful.

  1. Display numerical data

For a categorical variable, we can display counts or percentages (e.g. number of women, number of men), following what we call the area principle (pie, line).

For a quantitative variable, we can display counts or percentages over intervals of values, usually with a histogram (e.g. age 1-10, age 10-20).
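As a rough sketch of the two kinds of display, the snippet below computes the numbers behind a categorical chart (counts and percentages) and behind a histogram (counts per interval). The sample values and the interval width of 10 are made up for illustration.

```python
from collections import Counter

# Hypothetical categorical variable: count and percentage per category.
genders = ["female", "male", "female", "female", "male"]
counts = Counter(genders)
percentages = {category: 100 * n / len(genders) for category, n in counts.items()}

# Hypothetical quantitative variable: count per interval of width 10,
# keyed by the lower bound of the interval (0 means ages 0-9, 10 means 10-19, ...).
ages = [3, 8, 12, 15, 19, 24, 37]
bin_width = 10
histogram = Counter((age // bin_width) * bin_width for age in ages)

print(counts, percentages)
print(sorted(histogram.items()))
```

These dictionaries are what a plotting library would draw as a pie chart and a histogram.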

2.1 Center and variation
- mean: numerical average
- median: midpoint
- mode: the point that has the most individual values
- variance: s^2 = (1/(n-1)) Sigma (Xi - Xmean)^2
- standard deviation: s = sqrt(variance)
- quantile (percentile, threshold): the value such that a given % of the values are less than it
- quartiles: the set of quantiles that have 25%, 50%, and 75% of the values less than them
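A minimal sketch of these summaries using Python's standard `statistics` module (Python 3.8 or later for `quantiles`). The sample is made up, and the variance is also computed by hand to show the 1/(n-1) formula. Note that several conventions exist for quantiles; the code uses the module's default method, which is an assumption rather than something the notes specify.

```python
import math
import statistics

values = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

mean = statistics.mean(values)          # numerical average
median = statistics.median(values)      # midpoint of the sorted values
mode = statistics.mode(values)          # most frequent value
variance = statistics.variance(values)  # sample variance, divides by n - 1
std_dev = statistics.stdev(values)      # square root of the sample variance

# The same variance written out from the formula.
n = len(values)
manual_variance = sum((x - mean) ** 2 for x in values) / (n - 1)

# Quartiles: the three cut points at 25%, 50%, and 75%.
q1, q2, q3 = statistics.quantiles(values, n=4)

print(mean, median, mode, variance, std_dev)
print(q1, q2, q3)
```

The second quartile coincides with the median, which is why a box plot labels its middle line either way.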

2.2 Box plot: five values: min, Q1, median (Q2, 50%), Q3, max (we can also add outliers and extreme values)

2.3 Association between variables

e.g. P1 = P(internet | young), P2 = P(internet | senior)
- Relative risk: RR = P1/P2
- Odds ratio: OR = (P1/(1-P1)) / (P2/(1-P2))
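The relative risk and odds ratio can be worked through on a small example. The two probabilities below are invented for illustration, not taken from any survey.

```python
def relative_risk(p1, p2):
    """Ratio of the two conditional probabilities."""
    return p1 / p2

def odds_ratio(p1, p2):
    """Ratio of the odds p / (1 - p) of the two groups."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

p1 = 0.8  # hypothetical P(internet | young)
p2 = 0.4  # hypothetical P(internet | senior)

rr = relative_risk(p1, p2)
odds = odds_ratio(p1, p2)
print(rr, odds)
```

With these numbers the relative risk is 2 while the odds ratio is about 6, which shows that the two measures can differ a lot when the probabilities are not small.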

2.4 Relation between variables

Correlation: association between quantitative variables

r = (1/(n-1)) Sigma ((Xi - Xmean)/Sx) ((Yi - Ymean)/Sy), where Sx and Sy are the standard deviations of X and Y

r is in [-1, 1]
- r < 0: negative relation
- r > 0: positive relation
- r = 0: no (linear) relation
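The correlation coefficient can be computed directly from its definition: standardize each variable by its mean and sample standard deviation, multiply the pairs, and average with 1/(n-1). The function name `pearson_r` and the data are assumptions for this sketch.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from standardized values."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_mean) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_mean) ** 2 for y in ys) / (n - 1))
    return sum(
        ((x - x_mean) / s_x) * ((y - y_mean) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)

xs = [1, 2, 3, 4, 5]        # hypothetical data
ys_up = [2, 4, 6, 8, 10]    # increases with xs
ys_down = [10, 8, 6, 4, 2]  # decreases with xs

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
print(r_up, r_down)
```

A perfectly increasing line gives r close to 1 and a perfectly decreasing line gives r close to -1, the two ends of the [-1, 1] range.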

2.6 Probability:

Random process (randomness):
- unpredictable
- trend

Describing randomness:
- write down all outcomes (sample space)
- chance, probability (of each outcome)

The probability of an event is the sum of the probabilities of the outcomes included in the definition of the event.
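As a small worked example of the last rule, take a fair six-sided die (an assumption chosen for illustration): the sample space has six outcomes of probability 1/6 each, and the probability of the event "even number" is the sum over its outcomes.

```python
from fractions import Fraction

# Sample space of a fair die: every outcome has probability 1/6.
sample_space = {face: Fraction(1, 6) for face in range(1, 7)}

def event_probability(event, space):
    """Sum the probabilities of the outcomes that make up the event."""
    return sum(space[outcome] for outcome in event)

even = {2, 4, 6}
p_even = event_probability(even, sample_space)
print(p_even)
```

Using `Fraction` keeps the arithmetic exact, so the probabilities of all outcomes sum to exactly 1.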

  1. Data Analysis

Five steps in the overall data science process: data acquisition, data preparation, data analysis, presenting and reporting insights, and turning insights into data-driven action.
- Data acquisition: raw data from different sources.
- Data preparation: clean data (remove noise), transform data (filter, put in range).
- Data analysis: build the model from the data. Techniques: classification, regression, clustering, graph analysis, association analysis. Classification predicts the category of data (sunny, cloudy, rainy, ...). When a model has to predict a numeric value, we have regression. Clustering organizes similar items into groups.
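The data preparation step can be sketched as a small pipeline: drop missing readings, filter out implausible values, then rescale what remains into the range [0, 1]. The readings, the valid range of 0 to 100, and the choice of min-max scaling are all assumptions for the example.

```python
# Hypothetical raw sensor readings; None marks a missing value.
raw = [12.0, None, 15.5, 990.0, 14.2, 13.1]

# Clean: remove missing values.
cleaned = [x for x in raw if x is not None]

# Filter: keep only readings inside an assumed valid range.
filtered = [x for x in cleaned if 0 <= x <= 100]

# Put in range: min-max scaling so the smallest value maps to 0 and the largest to 1.
lowest, highest = min(filtered), max(filtered)
scaled = [(x - lowest) / (highest - lowest) for x in filtered]

print(filtered)
print(scaled)
```

The scaled list is the kind of input a classification, regression, or clustering model would then be built on.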

  1. Reporting Insight
