
Tech-with-Vidhya's Projects

anomaly-detection-proximity-based-method-knn

This project was delivered as part of my Masters in Big Data Science (MSc BDS) programme, for the "Data Mining" module at Queen Mary University of London (QMUL), London, United Kingdom. It implements outlier detection with the proximity-based method of k-nearest neighbours, computing outlier scores on the "house prices" dataset after the data pre-processing steps of z-score normalisation and PCA dimensionality reduction. The implementation uses the Python libraries pandas, numpy, matplotlib, sklearn and scipy, and computes Euclidean distances to detect the top 3 outlier houses with the highest prices relative to the average house price. **NOTE:** To comply with QMUL's data privacy and data protection policies for students, the dataset and solution code are not published in this public GitHub profile.
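
Although the coursework code cannot be shared, the approach can be sketched. Below is a minimal, hypothetical reconstruction (the file name and feature columns are assumptions, not the project's): it z-score normalises the numeric features, reduces them with PCA, and scores each house by the Euclidean distance to its 3rd nearest neighbour.

```python
# Hypothetical sketch of kNN outlier scoring; not the original coursework code.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler   # z-score normalisation
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("house_prices.csv")               # hypothetical file name
X = StandardScaler().fit_transform(df.select_dtypes("number"))
X = PCA(n_components=2).fit_transform(X)           # dimensionality reduction

# Outlier score = Euclidean distance to the k-th nearest neighbour (k = 3).
nn = NearestNeighbors(n_neighbors=4, metric="euclidean").fit(X)
distances, _ = nn.kneighbors(X)                    # column 0 is the point itself
scores = distances[:, -1]

top3 = np.argsort(scores)[-3:][::-1]               # the 3 strongest outliers
print(df.iloc[top3])
```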

apache-hadoop-hdfs-mapreduce-jobs-e2e-implementation-linux-server-olympictweets2016rio-data

This project was delivered as part of my Masters in Big Data Science (MSc BDS) programme, for the "Big Data Processing" module at Queen Mary University of London (QMUL), London, United Kingdom. It develops Python MapReduce scripts from scratch, with and without a combiner, for a big-data job over the private "olympictweets2016rio" collection: multiple large files of Twitter messages gathered during the Rio 2016 Olympics from the Twitter Streaming API using keywords such as #Rio2016 and #rioolympics. The data files are stored in Hadoop HDFS and the jobs run on a Hadoop cluster, answering:

1. The day with the highest number of tweets
2. The total number of malformed lines of input
3. The average length of the tweets
4. The average number of hashtags

**NOTE:** To comply with QMUL's data privacy and data protection policies for students, the dataset and solution code are not published in this public GitHub profile.
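
The coursework scripts are private, but the first two questions can be sketched in Hadoop Streaming style. The mapper below assumes, hypothetically, tab-separated input lines of date, time and tweet text:

```python
#!/usr/bin/env python3
# mapper.py -- a sketch; the real field layout is an assumption.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        print("malformed\t1")     # question 2: count malformed input lines
    else:
        print(f"{fields[0]}\t1")  # question 1: one count per tweet, keyed by day
```

A matching reducer sums the counts per key; because summation is associative and commutative, the same script can be reused as the combiner:

```python
#!/usr/bin/env python3
# reducer.py (and combiner) -- sums the counts emitted by the mapper.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key and current_key is not None:
        print(f"{current_key}\t{total}")
        total = 0
    current_key = key
    total += int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```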

apache-hadoop-hdfs-mapreduce-jobs-with-replication-repartition-joins-e2e-implementation-nasaq-data

This project was delivered as part of my Masters in Big Data Science (MSc BDS) programme, for the "Big Data Processing" module at Queen Mary University of London (QMUL), London, United Kingdom. It develops Python MapReduce scripts from scratch, with and without a combiner, implementing both replication and repartition join strategies for a big-data job over the private "NASDAQ" dataset of daily stock variations between 1970 and 2010. The data files are stored in Hadoop HDFS and 2 jobs run on the Hadoop cluster, answering:

1. The companies with the most entries in the top 10
2. The appropriate join strategy for this problem, and its implementation
3. The year with the highest number of movements in the 'Technology' sector
4. The companies that grew the most per year, with their growth percentage, using a Top-10 algorithm

**NOTE:** To comply with QMUL's data privacy and data protection policies for students, the dataset and solution code are not published in this public GitHub profile.
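
Although the coursework code is private, the two join strategies are easy to sketch. A replication (map-side) join ships the small input to every mapper and streams the large one; the mapper below is a hypothetical illustration (file name and field layouts are assumptions):

```python
#!/usr/bin/env python3
# Hypothetical replication-join mapper for Hadoop Streaming.
# companies.txt (ticker<TAB>company) is a small side file distributed to every
# mapper; stdin streams the large record set (ticker<TAB>date<TAB>close).
import sys

lookup = {}
with open("companies.txt") as side_file:
    for line in side_file:
        ticker, company = line.rstrip("\n").split("\t")
        lookup[ticker] = company

for line in sys.stdin:
    ticker, date, close = line.rstrip("\n").split("\t")
    if ticker in lookup:                    # inner join on ticker
        print(f"{ticker}\t{lookup[ticker]}\t{date}\t{close}")
```

A repartition (reduce-side) join, by contrast, tags each record with its source table in the mapper and joins in the reducer, which is the right choice when neither input fits in a mapper's memory.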

apache-spark-rdd-computations-e2e-implementation-with-transformations-and-actions-gutenberg-data

This project was delivered as part of my Masters in Big Data Science (MSc BDS) programme, for the "Big Data Processing" module at Queen Mary University of London (QMUL), London, United Kingdom. It develops Spark RDD computations from scratch using Python's pyspark package and regular-expression functions over the private "Gutenberg" data files, which contain hundreds of books downloaded from Project Gutenberg in different languages. The solution applies basic transformations (flatMap, map, reduceByKey) and actions to the RDDs and submits the Spark jobs to the cluster, answering:

1. The total number of words
2. The total number of occurrences of each unique word
3. The top 10 words, computed with Spark's 'takeOrdered' function

**NOTE:** To comply with QMUL's data privacy and data protection policies for students, the dataset and solution code are not published in this public GitHub profile.
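
A minimal pyspark sketch of the three computations (the coursework code itself is private; the input path and tokenising regex are assumptions):

```python
# Hypothetical sketch of the Gutenberg word-count pipeline.
import re
from pyspark import SparkContext

sc = SparkContext(appName="gutenberg-wordcount")

words = (sc.textFile("books/*.txt")                   # hypothetical input path
           .flatMap(lambda line: re.findall(r"\w+", line.lower())))

print(words.count())                                  # 1. total number of words

counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))      # 2. occurrences per unique word

print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # 3. top 10 words
```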

audio-digits-classification-using-mfcc-and-convolutional-neural-network

This project was delivered as part of my Masters in Big Data Science (MSc BDS) programme, for the "Machine Learning" module at Queen Mary University of London (QMUL), London, United Kingdom. It covers a basic and an advanced solution built on the "Mel-frequency cepstral coefficients (MFCC)" audio feature-extraction method and a deep-learning convolutional neural network (CNN). Basic solution: designing, building, training, validating and testing a model that recognises the spoken numerals 0 to 9 in audio files. Advanced solution: predicting the numeral in a new, unseen audio test file. Such a model could be applied in a banking product, for example to predict a 4-digit passcode spoken by an authorised customer during on-call verification when logging in to an internet-banking account. **NOTE:** To comply with QMUL's data privacy and data protection policies for students, the dataset and solution code are not published in this public GitHub profile.
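
The coursework model is private, but the general MFCC-plus-CNN recipe looks like the hypothetical sketch below (librosa and Keras are assumed; the architecture and feature shapes are illustrative, not the project's):

```python
# Hypothetical sketch of an MFCC + CNN spoken-digit classifier.
import librosa
import numpy as np
from tensorflow.keras import layers, models

def mfcc_features(path, n_mfcc=40, frames=44):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc = librosa.util.fix_length(mfcc, size=frames, axis=1)  # pad/trim width
    return mfcc[..., np.newaxis]                               # add channel axis

model = models.Sequential([
    layers.Input(shape=(40, 44, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one class per spoken digit 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Advanced solution, after training: predict the digit in a new audio file.
# digit = model.predict(mfcc_features("test.wav")[np.newaxis]).argmax()
```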

automated_etl_finance_data_pipeline_with_aws_lambda_spark_transformation_job_python

This project builds an automated ETL data pipeline using Python and AWS services, with a Spark transformation job, for financial stock-trade transactions. The pipeline is automated by an AWS Lambda function with a defined trigger: whenever a new file is ingested into the AWS S3 bucket, the Lambda function fires and starts the AWS Glue Crawler and the ETL Spark transformation job. The Spark job, implemented in PySpark, transforms the trade-transaction data stored in S3, filtering the subset of transactions in which 100 or fewer shares were traded. Tools & technologies: Python, Boto3 (AWS SDK for Python), PySpark, AWS CLI, AWS Virtual Private Cloud (VPC), AWS VPC Endpoint, AWS S3, AWS Glue, AWS Glue Crawler, AWS Glue Jobs, AWS Athena, AWS Lambda, Spark
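
A hypothetical sketch of the Lambda trigger (resource names are placeholders, not the project's): on each S3 object-created event, the handler starts the Glue job; glue.start_crawler(Name=...) can be called the same way to refresh the data catalogue first.

```python
# Hypothetical Lambda handler; the job name and argument key are placeholders.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    for record in event["Records"]:          # S3 object-created events
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="trades-filter-job",
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
    return {"status": "triggered"}
```

Inside the Glue job itself, the filter is a single PySpark expression along the lines of df.filter(df.shares <= 100), assuming a hypothetical shares column.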

bank_credit_card_customers_segmentation_using_unsupervised_k_means_clustering_analysis

This project segments and groups bank credit-card customers using the unsupervised K-Means clustering algorithm. The implementation follows these life-cycle steps:

1. Data exploration, analysis and visualisation
2. Data cleaning
3. Data pre-processing and scaling
4. Model fitting
5. Model validation using performance quality metrics, namely WCSS, the elbow method and the silhouette coefficient/score
6. Selection of the optimised model, with the appropriate number of clusters, based on those metrics
7. Analysis insights and interpretation of 2 different business scenarios, with various visualisations
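
Steps 5 and 6, choosing the number of clusters from the WCSS elbow and the silhouette score, can be sketched as follows, with make_blobs standing in for the private, scaled customer features:

```python
# Hypothetical sketch of k selection for K-Means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # stand-in data

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # km.inertia_ is the WCSS the elbow method plots; the silhouette score
    # peaks near the natural number of clusters.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```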

bank_credit_card_transactions_fraud_detection_using_unsupervised_dbscan_clustering

This project segments and groups fraudulent bank credit-card transactions using the unsupervised Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The implementation follows these life-cycle steps:

1. Data exploration and analysis
2. Data pre-processing, scaling and normalisation
3. Dimensionality reduction using Principal Component Analysis (PCA)
4. Model fitting
5. Model hyper-parameter tuning
6. Model validation using performance quality metrics, namely the silhouette coefficient/score and the homogeneity score
7. Selection of the optimised model, with the appropriate number of clusters, based on those metrics
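
A hypothetical sketch of steps 2 to 6, with make_blobs standing in for the private transaction data; eps and min_samples are illustrative, not the project's tuned values:

```python
# Hypothetical sketch of the DBSCAN pipeline.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, homogeneity_score

X, y = make_blobs(n_samples=500, centers=3, random_state=0)   # stand-in data
X = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
mask = labels != -1                        # DBSCAN labels noise points as -1
print("noise points:", int((~mask).sum()))
print("silhouette:", silhouette_score(X[mask], labels[mask]))
print("homogeneity:", homogeneity_score(y[mask], labels[mask]))
```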

bank_customers_churn_prediction_exploring_7_different_classification_algorithms

This project classifies bank customers by whether or not they will leave the bank (i.e. churn), applying these steps of the data-science project life-cycle:

1. Data exploration, analysis and visualisation
2. Data pre-processing
3. Data preparation for modelling
4. Model training
5. Model validation
6. Selection of the optimised model based on various performance metrics
7. Deployment of the best optimised model on unseen test data
8. Evaluation of the optimised model's performance metrics

The churn business case is explored, trained and validated on 7 different classification algorithms/models, listed below, and the best optimised model is selected on accuracy metrics (see the sketch after this list):

1. Decision Tree classifier - CART (Classification and Regression Tree) algorithm
2. Decision Tree classifier - ID3 (Iterative Dichotomiser 3) algorithm
3. Ensemble Random Forest classifier algorithm
4. Ensemble Adaptive Boosting classifier algorithm
5. Ensemble Hist Gradient Boosting classifier algorithm
6. Ensemble Extreme Gradient Boosting (XGBoost) classifier algorithm
7. Support Vector Machine (SVM) classifier algorithm
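
Here is the promised sketch of the bake-off loop, with a synthetic dataset standing in for the private customer data; xgboost.XGBClassifier can be added to the dictionary in the same way if the xgboost package is installed.

```python
# Hypothetical sketch of training and comparing the classifiers.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier,
                              HistGradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "CART (gini)": DecisionTreeClassifier(criterion="gini"),
    "ID3-style (entropy)": DecisionTreeClassifier(criterion="entropy"),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Hist Gradient Boosting": HistGradientBoostingClassifier(),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```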

bitcoin_network_analytics_using_python_networkx_and_gephi

This 4-member group project was delivered as part of my Masters in Big Data Science (MSc BDS) programme, for the "Digital Media and Social Network" module at Queen Mary University of London (QMUL), London, United Kingdom. It covers network analysis for 4 different problem statements and use cases using Python's NetworkX package, the Gephi network-analysis tool and Microsoft Excel.

Dataset: Bitcoin trade transactions for the period 2011 to 2016, each with the attributes rater, ratee, rating and timestamp.
Network formation: for every trade between 2 users in the Bitcoin network, the rating and its timestamp are recorded, yielding a directed network.
Size of the dataset and network: 5,881 users/nodes and 35,592 transactions/edges, with ratings in the range -10 (lowest) to +10 (highest).
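
Loading the rating network into NetworkX is straightforward to sketch (the file name and column layout are assumptions based on the description above):

```python
# Hypothetical sketch of building the directed rating network.
import networkx as nx
import pandas as pd

df = pd.read_csv("bitcoin_ratings.csv",
                 names=["rater", "ratee", "rating", "timestamp"])

# MultiDiGraph keeps repeat ratings between the same pair of users.
G = nx.from_pandas_edgelist(df, source="rater", target="ratee",
                            edge_attr=["rating", "timestamp"],
                            create_using=nx.MultiDiGraph)

print(G.number_of_nodes(), G.number_of_edges())  # expected: 5881 and 35592
print(nx.density(G))

# A crude per-user trust score: the sum of incoming ratings.
trust = {n: sum(d["rating"] for _, _, d in G.in_edges(n, data=True)) for n in G}
```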

building_data_model_with_table_postgresql_python

This project covers creating a data model from scratch and building tables in a PostgreSQL database using Python, for bank-customer churn raw data supplied in structured CSV format.
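
A minimal sketch of the idea with psycopg2; the connection details and column list are placeholders, not the project's actual schema:

```python
# Hypothetical sketch: create a churn table and bulk-load the CSV.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="bank",
                        user="postgres", password="postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS customer_churn (
        customer_id  INTEGER PRIMARY KEY,
        credit_score INTEGER,
        country      VARCHAR(50),
        age          INTEGER,
        churned      BOOLEAN
    );
""")

with open("churn.csv") as f:   # hypothetical file matching the columns above
    cur.copy_expert("COPY customer_churn FROM STDIN WITH CSV HEADER", f)

cur.close()
conn.close()
```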

building_etl_data_pipeline_on-aws_emr_cluster_hive_tables_tableau_visualisation

This project builds an ETL batch data pipeline using Python and AWS services for sales data. The persisted batch sales data is stored in an AWS S3 bucket and ingested into an AWS Elastic MapReduce (EMR) cluster, where it is transformed using Apache Hive tables and finally consumed by Tableau, which displays the sales visualisations as a dashboard. Tools & technologies: Python, Boto3, AWS CLI, AWS S3, AWS EMR, Apache Hive, Tableau
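
One way to script the Hive transformation step with Boto3, shown as a sketch under assumed names (cluster id, bucket and script path), not the project's exact configuration:

```python
# Hypothetical sketch of submitting a Hive step to a running EMR cluster.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",            # placeholder cluster id
    Steps=[{
        "Name": "transform-sales-with-hive",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://my-bucket/scripts/transform_sales.hql"],
        },
    }],
)
```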

cc-flask-cms-api

This repository contains a web-based content (articles) management system covering my data-science learning and career journey, with user registration and login, built with Python, the Flask web framework, HTML and a PostgreSQL database, and deployed on the Heroku cloud platform.

cli-demo

Public resources for Databricks CLI demo

colored-text-in-python

This project displays text and strings in colour, with various foreground colours, background colours and styles.
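
Colouring of this kind is usually built on ANSI escape sequences; a minimal sketch:

```python
# Hypothetical sketch: ANSI escape codes for coloured terminal output.
RESET = "\033[0m"
FG_RED = "\033[31m"     # foreground colour
BG_YELLOW = "\033[43m"  # background colour
BOLD = "\033[1m"        # style

print(f"{FG_RED}red text{RESET}")
print(f"{BG_YELLOW}{BOLD}bold text on a yellow background{RESET}")
```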

coursera-deep-learning-specialization-2021

Notes, programming assignments and quizzes from all courses in the Coursera Deep Learning Specialization offered by deeplearning.ai: (i) Neural Networks and Deep Learning; (ii) Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization; (iii) Structuring Machine Learning Projects; (iv) Convolutional Neural Networks; and (v) Sequence Models.
