Vaitybharati's Projects
Hypothesis-Testing-2-Proportion-T-test-Students-Jobs-in-2-States. Assume the null hypothesis Ho: p1 - p2 = 0, i.e. p1 = p2. The alternate hypothesis is then Ha: p1 ≠ p2. Explanation of a Bernoulli RV: np.random.binomial(n=1, p=p, size=size). Suppose you perform an experiment with two possible outcomes: success or failure. Success happens with probability p and failure with probability 1-p. A random variable that takes the value 1 on success and 0 on failure is called a Bernoulli random variable. Here n = 1, because each student is checked once for success or failure (placed or not placed, i.e. 1 trial); p = probability of success; size = number of times the check is made (e.g. 247 students checked once each gives size = 247). Explanation of a Binomial RV: np.random.binomial(n=n, p=p, size=size), where n = number of trials when the variable is not Bernoulli. For example, to count how many sixes appear when a die is rolled 10 times, n = 10 and p = 1/6; size is the number of repetitions of the experiment 'die rolled 10 times', so repeating it 18 times gives size = 18. As (p_value = 0.7255) > (α = 0.05), we fail to reject the null hypothesis, i.e. p1 = p2: there is no significant difference between the population proportions of state1 and state2 who report that they were placed immediately after education.
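As a rough illustration of the two-proportion comparison, the sketch below draws Bernoulli samples with the standard library and runs a pooled two-sided z-test of H0: p1 = p2. The success probabilities, the second sample size (308), and the helper names `bernoulli_sample` and `two_proportion_ztest` are assumptions made for this example; they are not taken from the repository, which may rely on a library routine instead.

```python
import math
import random


def bernoulli_sample(p, size, rng):
    """Draw `size` Bernoulli(p) values: 1 = success (placed), 0 = failure."""
    return [1 if rng.random() < p else 0 for _ in range(size)]


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


rng = random.Random(0)  # fixed seed so the run is repeatable
state1 = bernoulli_sample(0.37, 247, rng)  # hypothetical placement rate
state2 = bernoulli_sample(0.39, 308, rng)  # hypothetical placement rate

z, p_value = two_proportion_ztest(sum(state1), len(state1), sum(state2), len(state2))
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.4f}, p_value = {p_value:.4f} -> {decision}")
```

With statsmodels installed, `statsmodels.stats.proportion.proportions_ztest` performs the same kind of test on counts and sample sizes.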
Hypothesis Testing Anova Test - Iris Flower dataset. ANOVA F-test statistic: analysis of variance across more than 2 samples or columns. Assume the null hypothesis Ho (no variance between groups): all sample population means are the same. The alternate hypothesis is then Ha (variance between groups): at least one population mean is different. As (p_value ≈ 0) < (α = 0.05), reject the null hypothesis, i.e. at least one population mean is different, so the samples differ in their means.
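The F statistic behind a one-way ANOVA can be sketched by hand as the ratio of between-group to within-group mean squares. The three samples below are made-up numbers, not Iris measurements, and `one_way_anova_f` is a hypothetical helper; obtaining the p-value requires the F distribution (for instance via `scipy.stats.f_oneway`), which this sketch leaves out.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical samples, one list per group.
samples = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(samples)
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F means the group means are spread out relative to the noise inside each group, which is what leads to rejecting Ho.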
Hypothesis-Testing-Chi2-Test-Athletes-and-Smokers. Assume the null hypothesis Ho: the categorical variables are independent (Athlete and Smoking are not related). The alternate hypothesis is then Ha: the categorical variables are dependent (Athlete and Smoking are related). As (p_value = 0.00038) < (α = 0.05), reject the null hypothesis: the data indicate dependence between the categorical variables, so Athlete and Smoking are significantly related.
Hypothesis-Testing-Chi2-Test-Human-Gender-and-Choice-of-Pets. Assume the null hypothesis Ho: Human Gender and choice of pets are independent (not related). The alternate hypothesis is then Ha: Human Gender and choice of pets are dependent (related). As (p_value = 0.1031) > (α = 0.05), we fail to reject the null hypothesis of independence between the categorical variables. Thus there is no evidence of a relation between Human Gender and Choice of Pets.
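For the two chi-square tests of independence, the statistic compares observed counts with the counts expected if rows and columns were unrelated. The contingency table below is invented for illustration and `chi_square_independence` is a hypothetical helper. It returns the statistic and degrees of freedom only; `scipy.stats.chi2_contingency` would also give the p-value (note that it applies a continuity correction to 2x2 tables by default, which this sketch does not).

```python
def chi_square_independence(table):
    """Return (statistic, dof, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical 2x2 counts (rows: group A / group B, columns: yes / no).
observed = [[10, 20], [30, 40]]
statistic, dof, expected = chi_square_independence(observed)
print(f"chi2 = {statistic:.4f}, dof = {dof}")
```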
EDA (Exploratory Data Analysis) -1: Loading the Datasets, Data type conversions, Removing duplicate entries, Dropping the column, Renaming the column, Outlier Detection, Missing Values and Imputation (Numerical and Categorical), Scatter plot and Correlation analysis, Transformations, Automatic EDA Methods (Pandas Profiling and Sweetviz).
Supervised-ML---Simple-Linear-Regression---Newspaper-data. EDA and Visualization, Correlation Analysis, Model Building, Model Testing, Model predictions.
Supervised-ML---Simple-Linear-Regression---Waist-Circumference-Adipose-Tissue-Data. EDA and data visualization, Correlation Analysis, Model Building, Model Testing, Model Prediction.
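The model-building step of a simple linear regression can be sketched with the closed-form least-squares estimates. The data points and the helper name `fit_simple_linear_regression` are assumptions for this example; the repositories themselves may fit the model with a statistics library.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical predictor and response values lying exactly on y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_simple_linear_regression(x, y)
prediction = slope * 6.0 + intercept  # model prediction for a new x value
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, prediction = {prediction:.3f}")
```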
Supervised-ML---Multiple-Linear-Regression---Cars-dataset. Model MPG of a car based on other variables. EDA, Correlation Analysis, Model Building, Model Testing, Model Validation Techniques, Collinearity Problem Check, Residual Analysis, Model Deletion Diagnostics (checking Outliers or Influencers) Two Techniques : 1. Cook's Distance & 2. Leverage value, Improving the Model, Model - Re-build, Re-check and Re-improve - 2, Model - Re-build, Re-check and Re-improve - 3, Final Model, Model Predictions.
Supervised-ML---Multiple-Linear-Regression---Toyota-Cars. EDA, Correlation Analysis, Model Building, Model Testing, Model Validation Techniques, Collinearity Problem Check, Residual Analysis, Model Deletion Diagnostics (checking Outliers or Influencers) Two Techniques : 1. Cook's Distance & 2. Leverage value, Improving the Model, Model - Re-build, Re-check and Re-improve - 2, Model - Re-build, Re-check and Re-improve - 3, Final Model, Model Predictions.
Supervised-ML---Logistic-Regression---Appointing-Attorney-or-not. EDA, Model Building, Model Predictions, Testing Model Accuracy, ROC Curve plotting and finding AUC value.
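Accuracy and AUC can be computed directly from true labels and predicted probabilities. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, counting ties as one half). The labels, scores, and function names are made up for illustration and are not the attorney data.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Share of cases where the thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)


def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities from a fitted classifier.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(f"accuracy = {accuracy(y_true, y_score):.2f}, AUC = {roc_auc(y_true, y_score):.2f}")
```

The pairwise form is quadratic in the sample size, so it suits small examples; library routines use a sort-based calculation instead.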
Unsupervised-ML---Hierarchical-Clustering-University Data. Import libraries, Import dataset, Create Normalized data frame (considering only the numerical part of data), Create dendrograms, Create Clusters, Plot Clusters.
Unsupervised-ML---K-Means-Clustering-Non-Hierarchical-Clustering-Univ. Use an elbow graph to find the optimum number of clusters (K) from a range of K values. The K-means algorithm aims to choose centroids that minimise the inertia, i.e. the within-cluster sum-of-squares (WCSS) criterion. Plot the range of K values against WCSS to get the elbow graph for choosing K (number of clusters).
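The elbow procedure can be sketched with a small pure-Python version of Lloyd's K-means that reports WCSS for each K. The points, the seed, and the helper `kmeans_wcss` are assumptions for this example, not the university data; in practice a library such as scikit-learn would normally be used, and the values would be plotted rather than printed.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wcss(points, k, n_iter=50, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids picked from the data
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Hypothetical 2-D points forming three loose groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
    (1.0, 9.0), (0.5, 8.0), (1.5, 8.5),
]

wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(f"K = {k}: WCSS = {wcss:.3f}")
```

The K at which the WCSS curve stops dropping sharply is the "elbow"; because Lloyd's algorithm can stop at a local minimum, several seeds are usually tried.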
Unsupervised-ML---DBSCAN-Clustering-Wholesale-Customers. Import Libraries, Import Dataset, Normalize heterogeneous numerical data by applying StandardScaler fit_transform to the dataset, DBSCAN Clustering (noisy samples are given the label -1), Adding clusters to dataset.
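To show where the -1 label comes from, here is a compact pure-Python sketch of the DBSCAN idea: points with at least `min_samples` neighbours within `eps` (counting themselves) seed clusters, and points that no cluster reaches stay labelled -1. The points and parameter values are invented, the data is assumed to be already scaled, and this is a teaching sketch rather than the library implementation the repository uses.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    labels = [None] * len(points)  # None = not visited yet
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = -1  # noise, unless a later cluster reaches it
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical, already-scaled points: two dense groups and one outlier.
points = [
    (0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
    (10.0, 0.0),
]
labels = dbscan(points, eps=1.0, min_samples=3)
print(labels)
```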
Unsupervised-ML---Association-Rules-Data-Mining-Titanic. Data Preprocessing: as the data is in categorical format, One Hot Encoding is used to convert it into numerical format. Apriori Algorithm: frequent item sets & association rules. A leverage value of 0 indicates independence; its range is [-1, 1]. A high conviction value means that the consequent is highly dependent on the antecedent; its range is [0, inf]. A rule with Lift Ratio > 1 is a good, influential rule for selecting the associated transactions.
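The rule metrics named above follow from three support values. The sketch below computes them for a single rule over a handful of toy transactions; the item labels and the helper `rule_metrics` are hypothetical and unrelated to the Titanic data, and the formulas are the usual textbook definitions.

```python
import math


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, lift, leverage and conviction for antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    support_a = sum(a <= t for t in transactions) / n
    support_c = sum(c <= t for t in transactions) / n
    support_ac = sum((a | c) <= t for t in transactions) / n

    confidence = support_ac / support_a
    lift = confidence / support_c                    # > 1: positive association
    leverage = support_ac - support_a * support_c    # 0: independence
    conviction = math.inf if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {
        "support": support_ac,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Hypothetical transactions, each a set of item labels.
transactions = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"}]
metrics = rule_metrics(transactions, {"A"}, {"B"})
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```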
Unsupervised-ML---PCA-Data-Mining-Univ. Import Dataset, Converting data to numpy array, Normalizing the numerical data, Applying PCA Fit Transform to dataset, PCA Components matrix or covariance Matrix, Variance of each principal component, Final Dataframe, Visualization of PCAs, Eigenvectors and eigenvalues for a given matrix.
Unsupervised-ML-t-SNE-Data-Mining-Cancer. Import Libraries, Import Dataset, Convert data to array format, Separate array into input and output components, TSNE implementation, Cluster Visualization
Unsupervised-ML-Recommendation-System-Data-Mining-Movies. Recommend movies based on the ratings: Sort by User IDs, number of unique users in the dataset, number of unique movies in the dataset, Impute those NaNs with 0 values, Calculating Cosine Similarity between Users on array data, Store the results in a dataframe format, Set the index and column names to user ids, Slicing first 5 rows and first 5 columns, Nullifying diagonal values, Most Similar Users, extract the movies which userId 6 & 168 have watched.
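The user-similarity steps can be sketched on a tiny ratings table: unrated movies are filled with 0, cosine similarity is computed between every pair of users, the diagonal is set to 0, and the most similar user is read off per row. The user ids, ratings, and helper names are made up for illustration; the repository works on a pandas dataframe instead of plain dictionaries.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user with no ratings is similar to nobody
    return dot / (norm_u * norm_v)


# Hypothetical ratings: user id -> rating per movie, None = not rated.
ratings = {
    1: [5, None, 4],
    2: [4, None, 5],
    3: [None, 5, None],
    4: [5, 1, 4],
}

# Impute missing ratings with 0.
filled = {user: [0 if r is None else r for r in row] for user, row in ratings.items()}

# Pairwise similarity with the diagonal nullified, so a user never matches themselves.
users = list(filled)
similarity = {
    u: {v: 0.0 if u == v else cosine_similarity(filled[u], filled[v]) for v in users}
    for u in users
}

most_similar = {u: max(similarity[u], key=similarity[u].get) for u in users}
print(most_similar)
```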
Supervised-ML-Decision-Tree-C5.0-Entropy-Iris-Flower-Using Entropy Criteria - Classification Model. Import Libraries and data set, EDA, Apply Label Encoding, Model Building - Building/Training Decision Tree Classifier (C5.0) using Entropy Criteria. Validation and Testing Decision Tree Classifier (C5.0) Model
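The entropy criterion scores a candidate split by how much it reduces label entropy (the information gain). The sketch below shows that calculation on a toy label list; the counts and the helpers `entropy` and `information_gain` are assumptions for this example and do not reproduce the repository's tree-building code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node holding two classes in equal numbers.
parent = ["setosa", "setosa", "versicolor", "versicolor"]
left, right = parent[:2], parent[2:]  # a split that separates the classes perfectly

print(f"parent entropy = {entropy(parent):.3f} bits")
print(f"information gain = {information_gain(parent, left, right):.3f} bits")
```

A tree grown with the entropy criterion picks, at each node, the split with the largest information gain.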
Pandas Tutorial
PCA
Probability Calculations for Normal distribution
Probability Calculation in Python
R-Basics2 homework
R-code-1a
R-code-2
R Basics Tutorial-1
R2 - Decision Making statements in R
R3 - Joins and Applying Functions in R
Recommendation-Engine
Data Cleaning, N-gram, WordCloud, Applying naive bayes for classification, Using TFIDF