Tiffany Duong's Data Science Portfolio

This portfolio is a compilation of notebooks that I created over the course of my studies for my Master's in Data Science at Northwestern University, as well as personal side projects.

Data Cleansing and Preparation

BestDeal Transactional Data

Github | Nbviewer

An exercise in cleaning transactional data, loading it into a SQLite database, and executing SQL queries.
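
As a rough illustration of that clean, load, and query workflow (not the notebook's actual code), the sketch below uses Python's built-in `sqlite3` module. The table name, column names, and sample rows are all made up for the example.

```python
import sqlite3

# Hypothetical raw rows: (customer_id, product, price); some need cleaning.
raw_rows = [
    (" C001 ", "laptop", "899.99"),
    ("C002", "Monitor ", "199.50"),
    ("C001", "MOUSE", None),  # missing price: dropped during cleaning
    ("C003", "laptop", "949.00"),
]


def clean(rows):
    """Strip whitespace, lowercase product names, and drop rows without a price."""
    cleaned = []
    for customer_id, product, price in rows:
        if price is None:
            continue
        cleaned.append((customer_id.strip(), product.strip().lower(), float(price)))
    return cleaned


conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute("CREATE TABLE transactions (customer_id TEXT, product TEXT, price REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", clean(raw_rows))
conn.commit()

# Total spend per product, highest first.
totals = conn.execute(
    "SELECT product, ROUND(SUM(price), 2) FROM transactions "
    "GROUP BY product ORDER BY SUM(price) DESC"
).fetchall()
print(totals)
```

Parameterized `?` placeholders keep the cleaned values out of the SQL string itself, which avoids quoting problems in messy transactional data.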

Data Exploration and Analysis

MSPA Software Survey

Github | Nbviewer

I explored the results of a survey given to students in the MSPA program in 2016. I analyzed current students' course interests and programming language/software usage to inform future curriculum planning for the MSDS program.

NoSQL Databases

Chicago Food Inspection

Github | Nbviewer

I experimented with queries of varying precision and relevance against Elasticsearch, a document-oriented NoSQL search engine, to assess the current state of failed sanitary inspections in buildings categorized as "children's facilities", such as daycares, in the city of Chicago.
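
One way to picture "varying precision" is the difference between a `match` query and a `match_phrase` query in the Elasticsearch query DSL. The sketch below only builds the request bodies as Python dictionaries. The field names `facility_type` and `results`, and the search phrase, are assumptions for illustration and may not match the notebook's index mapping.

```python
import json


def build_query(phrase, exact):
    """Return an Elasticsearch query body for failed inspections.

    exact=False uses `match` (any analyzed term may hit, so recall is broader);
    exact=True uses `match_phrase` (terms must appear together and in order).
    """
    text_clause = "match_phrase" if exact else "match"
    return {
        "query": {
            "bool": {
                # Scored clause: how well the facility type matches the phrase.
                "must": [{text_clause: {"facility_type": phrase}}],
                # Unscored clause: keep only failed inspections.
                "filter": [{"match": {"results": "Fail"}}],
            }
        }
    }


loose = build_query("children's facility", exact=False)
strict = build_query("children's facility", exact=True)
print(json.dumps(strict, indent=2))
```

The loose form would typically return more documents with lower average relevance, while the strict form trades recall for precision.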

Supervised Learning

Classification Models

Evaluation of Logistic Regression and Naive Bayes

Github | Nbviewer

I used three features (loan, housing, and default) to predict whether a bank's client will subscribe to a term deposit. I employed two classification models (Logistic Regression and Naive Bayes) and evaluated them using k-fold cross-validation, with the area under the ROC curve as the index of model performance.
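
For readers unfamiliar with the two evaluation ideas named above, here is a minimal pure-Python sketch of each. It is not the notebook's code: the function names, the contiguous fold layout, and the toy labels and scores are assumptions made for the example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


# Toy data: 1 = subscribed, 0 = did not; scores are predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))

for train_idx, test_idx in k_fold_indices(10, 3):
    print(test_idx)
```

In a full evaluation, each model would be fit on `train_idx`, scored on `test_idx`, and the per-fold AUC values averaged.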

Linear Models

Evaluation of OLS, Ridge Regression, Lasso Regression, ElasticNet

Github | Nbviewer

I used twelve features to predict the median value (in thousands of dollars) of housing in the Boston metropolitan area during the 1970s from a dataset of 500+ observations. I employed four linear models (OLS and three regularized linear models: Ridge, Lasso, and ElasticNet) and evaluated them using k-fold cross-validation, with RMSE as the index of model performance.
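
The sketch below shows the RMSE metric and, in simplified form, how the three regularized models differ in the penalty they add to the least-squares loss. It is an illustration only: the `penalty` function uses a simplified parameterization (libraries scale these terms differently), and the sample values and weights are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def penalty(weights, alpha, l1_ratio):
    """Simplified regularization term added to the least-squares loss.

    l1_ratio=1.0 is a Lasso-style (L1) penalty, l1_ratio=0.0 a Ridge-style
    (squared L2) penalty, and values in between mix the two as ElasticNet does.
    OLS corresponds to alpha=0 (no penalty at all).
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2_squared)


# Made-up median home values (in thousands of dollars) and predictions.
actual = [24.0, 21.5, 34.5]
predicted = [25.0, 20.5, 33.5]
print(rmse(actual, predicted))

weights = [3.0, -4.0]  # hypothetical coefficients
for ratio in (0.0, 0.5, 1.0):
    print(ratio, penalty(weights, alpha=0.5, l1_ratio=ratio))
```

Because RMSE is expressed in the target's units, an RMSE of 1.0 here would mean predictions are off by roughly one thousand dollars.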

Tree-Based Models and Feature Selection

Evaluation of Decision Trees, Random Forests, Gradient Boosting

Github | Nbviewer

I used twelve features to predict the median value (in thousands of dollars) of housing in the Boston metropolitan area during the 1970s from a dataset of 500+ observations. I employed three tree-based models (Decision Trees, Random Forests, and Gradient Boosting with a learning rate of 0.1) and compared them using RMSE as the index of model performance. Then, I used a Gradient Boosting model to determine the importance of each feature when predicting the target variable (median value).

Multi-Class Classifiers

Principal Component Analysis + Random Forest Classifier

Github | Nbviewer

I used the MNIST dataset and employed two models: a random forest classifier as a benchmark for model performance, and a second random forest classifier trained on data reduced with principal component analysis (PCA), keeping enough components to preserve 95% of the variance. For the second model, I first applied the fit_transform method to the entire dataset (a deliberate mistake, since it leaks test-set information into the transform). I then applied fit_transform to the training set and only the transform method to the test set. I compared the incorrectly and correctly transformed data, as well as the benchmark model, using program runtime and the F1 score, which is the harmonic mean of precision and recall.
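
The F1 score is conventionally defined as the harmonic mean of precision and recall. The pure-Python sketch below computes it per class and then macro-averages across classes. The macro averaging and the toy labels are assumptions for illustration; the notebook may average differently.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)


# Toy three-class example standing in for digit labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(precision_recall_f1(y_true, y_pred, positive=1))
print(macro_f1(y_true, y_pred))
```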

Deep Learning

Multilayer Perceptron (MLP) - Finding Natural Feature Sets in the Hidden Layer

Github | Nbviewer

I investigated how changing the number of hidden layers and feature classes of an MLP affects which feature classes are naturally found in the hidden layers, and how the pre-determined input classes are then classified, using a 9x9 input grid of alphabet data (81 input nodes, 9 output nodes). The MLP uses backpropagation implemented in NumPy.

Multilayer Perceptron (MLP) - MNIST Digits Classification

Github | Nbviewer

I used the MNIST dataset and conducted a 2x2 factorial experiment, varying the number of hidden layers and the number of nodes per hidden layer within a multilayer perceptron architecture, to classify the MNIST digit images into their proper classes. This experiment was done in TensorFlow, an open-source ML library developed by Google. I used the Adam optimization algorithm to train the models. The models were compared on process execution time and accuracy.
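
A 2x2 factorial design simply crosses two levels of one factor with two levels of another, giving four experimental cells. A small sketch of enumerating such a design follows. The specific levels (2 vs. 5 layers, 10 vs. 20 nodes) are hypothetical placeholders, not the values used in the notebook.

```python
from itertools import product

hidden_layer_counts = [2, 5]  # hypothetical levels for factor 1
nodes_per_layer = [10, 20]    # hypothetical levels for factor 2


def factorial_design(**factors):
    """Return one settings dict per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


design = factorial_design(
    hidden_layers=hidden_layer_counts,
    nodes_per_layer=nodes_per_layer,
)
for cell in design:
    print(cell)
```

Each dictionary would then drive one model build and one timed training run, so that execution time and accuracy can be compared cell by cell.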

Convolutional Neural Networks (CNNs) - Cats and Dogs Binary Classification

Github | Nbviewer

I used a subset of Kaggle's cats vs. dogs dataset and conducted a 2x2 factorial experiment (MLP vs. CNN, grayscale vs. RGB images) to see which model design has higher classification accuracy when determining whether the subject of a given 64x64 image is a cat or a dog. This experiment was done in TensorFlow, an open-source ML library developed by Google. I used the Adam optimization algorithm to train the models. The models were compared on process execution time and accuracy.

Convolutional Neural Networks (CNNs) - Fashion MNIST Classification

Github | Nbviewer

I used the Fashion MNIST dataset to train a classification model using a convolutional neural network. The model was compiled with the Adam optimizer and categorical cross-entropy loss, and trained with early stopping for 6 epochs with a batch size of 64. This project was done using Keras with a TensorFlow backend.
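
Early stopping halts training once the validation loss stops improving for a set number of epochs. The sketch below shows the idea in plain Python rather than through the Keras callback. The class name, the `patience` value, and the validation losses are made up for illustration.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: record it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.39]  # made-up per-epoch values
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

Note that with these values training stops before the later, lower loss of 0.39 is ever seen, which is the usual trade-off when choosing a small patience.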

Convolutional Neural Networks (CNNs) - Distracted Driver Detection (Transfer Learning)

Github | Nbviewer

I used image data from State Farm's Distracted Driver Detection competition to build a LeNet-5 model from scratch to classify the images into 10 specified "driving behavior" classes. I also used transfer learning, building on top of a pre-trained VGG-16 model (all layers frozen except the last set of conv/max pooling layers and the fully connected layers) to classify the images and measure classification accuracy. The models were compiled with the Adam optimizer and categorical cross-entropy loss, and trained with early stopping for 20 epochs with a batch size of 64. This project was done using Keras with a TensorFlow backend.

Recurrent Neural Networks (RNNs) - Language Modeling (Movie Review Sentiment)

Github | Nbviewer

I used pretrained word vectors from GloVe embeddings (obtained with the Python package chakin) and conducted a 2x2 factorial experiment (different pretrained vectors, small vs. large vocabulary size) to see which model design has higher classification accuracy when determining whether the sentiment of a movie review is positive or negative. This experiment was done in TensorFlow, an open-source ML library developed by Google. I used the Adam optimization algorithm to train the RNN models (50 epochs with a batch size of 100). The models were compared on train and test accuracy.

Deep-Q Network - Ms.PacMan (Reinforcement Learning)

Github | Colab

I worked within OpenAI Gym's Ms.PacMan environment, using reinforcement learning to train a Deep-Q Network to maximize the number of points Ms.PacMan scores. The Deep-Q Network was trained for 500,000 training steps with a replay memory size of 5,000, a learning rate of 0.01, a discount rate of 0.99, and a batch size of 50. After training, I tested the agent, and it obtained a score of 690. This project was completed using TensorFlow, OpenAI Gym, and Colab.
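
Two pieces of a Deep-Q Network setup can be sketched without any deep learning library: the replay memory and the Bellman target the network is trained toward. The sizes and discount rate below come from the description above. The function names and the dummy transitions are assumptions for illustration, not the notebook's code.

```python
import random
from collections import deque

REPLAY_MEMORY_SIZE = 5_000  # from the description above
BATCH_SIZE = 50             # from the description above
DISCOUNT_RATE = 0.99        # from the description above

# A bounded deque silently discards the oldest transition once it is full.
replay_memory = deque(maxlen=REPLAY_MEMORY_SIZE)


def remember(state, action, reward, next_state, done):
    """Store one transition for later sampling."""
    replay_memory.append((state, action, reward, next_state, done))


def sample_batch(batch_size=BATCH_SIZE):
    """Draw a random minibatch, which breaks correlation between steps."""
    return random.sample(list(replay_memory), batch_size)


def q_target(reward, next_q_values, done, discount=DISCOUNT_RATE):
    """Bellman target: r if the episode ended, else r + discount * max Q(s', a')."""
    if done:
        return reward
    return reward + discount * max(next_q_values)


# Dummy transitions: integers stand in for game frames.
for step in range(6_000):
    remember(step, 0, 1.0, step + 1, False)

batch = sample_batch()
print(len(replay_memory), len(batch))
print(q_target(10.0, [1.0, 2.0, 3.0], done=False))
```

In a full agent, `next_q_values` would come from the network's prediction for the next state, and the target would feed the loss for the action actually taken.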

Computer Vision APIs

Google Cloud Vision API - Landmark Detection

Github | Colab

I used the Google Cloud Vision API to identify world landmarks from a directory of five photos. The results display each landmark's name, detection score, and latitude and longitude.
