  • 👋 Hi, I'm @MojiBay
  • 👀 I'm interested in where data science and psychology meet.
  • 🌱 I'm an engineering graduate currently learning data analytics.
  • 💞️ I'm looking to collaborate on data analysis projects and Kaggle competitions.
  • 📫 How to reach me

mojibay's Projects

analyze-a-b-test-results

Analyze A/B Test Results

A/B tests are very commonly performed by data analysts and data scientists, and it is important to get some practice working with their difficulties. For this project, you will work to understand the results of an A/B test run by an e-commerce website. The company has developed a new web page intended to increase the number of users who "convert," meaning the number of users who decide to pay for the company's product. Your goal is to work through this notebook and help the company decide whether they should implement the new page, keep the old page, or perhaps run the experiment longer before making their decision.

The data and the Jupyter Notebook, which are all of the files you need to complete the project, are provided in a downloadable zip file in the resources tab (as well as under "Supporting Materials" below). Note that portions of the notebook reference quizzes linked in this lesson to ensure you are on the right track. The quizzes are helpful but not required; however, all of the items on the Project Rubric must meet specifications to successfully complete the project.
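The hypothesis test at the heart of a project like this can be sketched as a two-proportion z-test. This is a minimal sketch using only the standard library; the conversion counts below are made up, and the actual notebook may use simulation or statsmodels instead:

```python
import math

def two_proportion_ztest(conv_old, n_old, conv_new, n_new):
    """Two-sided z-test for the difference between two conversion rates."""
    p_old = conv_old / n_old
    p_new = conv_new / n_new
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_old + conv_new) / (n_old + n_new)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: old page vs. new page.
z, p = two_proportion_ztest(conv_old=1700, n_old=14500, conv_new=1750, n_new=14500)
```

A large p-value here would support keeping the old page or extending the experiment rather than rolling out the new one.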

data-modeling-with-cassandra

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate those results, since the data reside in a directory of CSV files on user activity on the app. They'd like a data engineer to create an Apache Cassandra database whose tables can be queried on song play data to answer their questions, and they wish to bring you onto the project. Your role is to create a database for this analysis. You'll be able to test your database by running queries given to you by Sparkify's analytics team.

Project Overview: In this project, you'll apply what you've learned about data modeling with Apache Cassandra and complete an ETL pipeline using Python. To complete the project, you will need to model the data by creating tables in Apache Cassandra to run queries against. You are provided with part of the ETL pipeline, which transfers data from a set of CSV files within a directory into a single streamlined CSV file used to insert data into the Apache Cassandra tables. A project template takes care of all the imports and provides a structure for the ETL pipeline you'd need to process this data.

data-modelling-with-postgre

Project: Data Modeling with Postgres

Introduction: A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis, and they bring you onto the project. Your role is to create a database schema and ETL pipeline for this analysis. You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team at Sparkify and comparing your results with their expected results.

Song Dataset: The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are the file paths to two files in this dataset:

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset: The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. The log files you'll be working with are partitioned by year and month. For example, here are the file paths to two files in this dataset:

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

If you would like to look at the JSON data within the log_data files, you will need to create a pandas DataFrame to read the data. Remember to first import the json and pandas libraries:

df = pd.read_json(filepath, lines=True)

For example, df = pd.read_json('data/log_data/2018/11/2018-11-01-events.json', lines=True) would read the data file 2018-11-01-events.json.
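Parsing one of the song files above into a row for a songs table can be sketched with the standard library's json module. The tuple's column order (song_id, title, artist_id, year, duration) is one reasonable star-schema choice, not necessarily the project's exact schema:

```python
import json

# The sample song record shown above, verbatim.
record = (
    '{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", '
    '"artist_latitude": null, "artist_longitude": null, '
    '"artist_location": "", "artist_name": "Line Renaud", '
    '"song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", '
    '"duration": 152.92036, "year": 0}'
)

song = json.loads(record)
# Pick out the fields a songs dimension table would store.
song_row = (
    song["song_id"],
    song["title"],
    song["artist_id"],
    song["year"],
    song["duration"],
)
```

The same record also feeds an artists table (artist_id, artist_name, artist_location, artist_latitude, artist_longitude), which is why normalizing it into separate tables avoids repeating artist metadata per song.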

deforestation---sql

Introduction: You're a data analyst for ForestQuery, a non-profit organization on a mission to reduce deforestation around the world and raise awareness about this important environmental topic. Your executive director and her leadership team are looking to understand which countries and regions around the world seem to have forests that have been shrinking in size, and also which countries and regions have the most significant forest area, both in terms of amount and percent of total area. The hope is that these findings can help inform initiatives, communications, and personnel allocation so the organization can achieve the largest impact with the precious few resources it has at its disposal. You've been able to find tables of data online dealing with forestation as well as total land area and region groupings, and you've brought these tables together into a database that you'd like to query to answer some of the most important questions in preparation for a meeting with the ForestQuery executive team coming up in a few days. Ahead of the meeting, you'd like to prepare and disseminate a report for the leadership team that uses complete sentences to help them understand the global deforestation overview between 1990 and 2016.

Steps to Complete: Create a View called "forestation" by joining all three tables in the workspace: forest_area, land_area, and regions. The forest_area and land_area tables join on both country_code AND year; the regions table joins on country_code only. In the forestation View, include all of the columns of the original tables, plus a new column that gives the percent of the land area that is designated as forest. Keep in mind that forest_area_sqkm in the forest_area table and land_area_sqmi in the land_area table are in different units (square kilometers and square miles, respectively), so an adjustment must be made in the calculation you write (1 sq mi = 2.59 sq km).

Instructions: You will be creating a report for the executive team in which you explain your results using complete sentences. The report has five sections to complete: Global Situation, Regional Outlook, Country-Level Detail, Recommendations, and Appendix (SQL queries used). You'll find further details on what to complete for these sections on the next few pages.
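The unit adjustment in the new percent-of-forest column boils down to one conversion. Here is a minimal sketch of the arithmetic the view's SELECT would perform; the function and argument names are illustrative:

```python
SQKM_PER_SQMI = 2.59  # 1 square mile = 2.59 square kilometers

def percent_forest(forest_area_sqkm, land_area_sqmi):
    """Percent of a country's land area covered by forest.

    forest_area is reported in sq km and land_area in sq mi,
    so the land area must be converted before dividing.
    """
    land_area_sqkm = land_area_sqmi * SQKM_PER_SQMI
    return 100.0 * forest_area_sqkm / land_area_sqkm
```

In the SQL view, the same expression appears as 100.0 * forest_area_sqkm / (land_area_sqmi * 2.59); forgetting the conversion inflates every percentage by a factor of 2.59.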

mojibay

Config files for my GitHub profile.

movie-dataset---spectacular-studios

Introduction: You are continuing your work for Spectacular Studios. The executive team would like you to walk through your recommendations and the risks in the analysis that might affect your proposed action items. Your job is to provide their team with the final presentation, including an executive summary of key next steps. Your role will be to develop a final presentation of roughly 10-15 slides and provide a detailed analysis that digs into potential limitations and biases of the dataset you're working with.

Overview: You will now bring together all you have learned about data storytelling by combining the ghost deck and its analyses to provide a final recommendation. You will also need to identify the limitations and biases in the data that affect the recommendations you provide. You will continue to use the same Movies Metadata CSV and conduct the EDA necessary to understand the dataset as a whole. The expected output is to surface whether the dataset is balanced, whether there are anomalies in the dataset that affect the applicability of the recommendation, and the final presentation itself, which will be used for a mock recommendation to a management team.

Datasets: For this project, you'll be working with a choice of datasets. The description above relies on Metadata_Movies.csv.
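Checking whether the dataset is balanced can start with a simple frequency table over a categorical column. This is a minimal standard-library sketch; in practice you would pull the column (genre is a hypothetical choice) from the Movies Metadata CSV with pandas:

```python
from collections import Counter

def class_shares(values):
    """Share of each category in a column; a heavily skewed
    distribution suggests the dataset is unbalanced."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

If one category dominates (say, 80% of rows), any recommendation derived from averages over the whole dataset mostly reflects that category, which is exactly the kind of limitation the final presentation should call out.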

pisa-data-eda

Project Overview This project has two parts that demonstrate the importance and value of data visualization techniques in the data analysis process. In Part I, Exploratory data visualization, you will use Python visualization libraries to systematically explore a selected dataset, starting from plots of single variables and building up to plots of multiple variables. In Part II, Explanatory data visualization, you will produce a short presentation that illustrates interesting properties, trends, and relationships that you discovered in your selected dataset. The primary method of conveying your findings will be through transforming your exploratory visualizations from the first part into polished, explanatory visualizations.
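The progression from single-variable to multi-variable plots can be sketched with matplotlib. The variables below are randomly generated stand-ins, since the real analysis reads the selected dataset:

```python
import os
import random
import tempfile

import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs without a display
import matplotlib.pyplot as plt

random.seed(0)
# Hypothetical stand-ins for PISA-style variables.
scores = [random.gauss(500, 100) for _ in range(500)]
hours = [random.uniform(0, 20) for _ in range(500)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(scores, bins=30)                    # univariate: one variable's distribution
ax1.set(title="Score distribution", xlabel="score", ylabel="count")
ax2.scatter(hours, scores, alpha=0.3, s=10)  # bivariate: relationship between two
ax2.set(title="Score vs. study hours", xlabel="hours studied", ylabel="score")

out_path = os.path.join(tempfile.gettempdir(), "eda_sketch.png")
fig.savefig(out_path)
```

The explanatory half of the project then takes plots like these and polishes the few that carry a finding: clearer titles, annotated takeaways, and consistent axes.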

us-bikeshare-data

Overview: In this project, you will use Python to explore data related to bike share systems for three major cities in the United States: Chicago, New York City, and Washington. You will write code to import the data and answer interesting questions about it by computing descriptive statistics. You will also write a script that takes in raw input to create an interactive experience in the terminal for presenting these statistics.

What Software Do I Need? To complete this project, the following software requirements apply:

  • Python 3, NumPy, and pandas, installed using Anaconda
  • A text editor, like Sublime or Atom
  • A terminal application (Terminal on Mac and Linux, or Cygwin on Windows)
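The descriptive-statistics core of such a script can be sketched with the standard library's statistics module; the field names and units below are illustrative:

```python
import statistics

def trip_stats(durations_min):
    """Descriptive statistics for a list of trip durations in minutes."""
    return {
        "count": len(durations_min),
        "total": sum(durations_min),
        "mean": statistics.mean(durations_min),
        "median": statistics.median(durations_min),
        "longest": max(durations_min),
    }
```

An interactive version would wrap calls like this in a loop that uses input() to ask for a city and optional month/day filters, then prints the resulting statistics.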

weratedogs

Introduction: Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, then showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5,000+ of their tweets as they stood on August 1, 2017. More on this soon.
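One typical wrangling step here is extracting the numerator/denominator rating from each tweet's text, which a small regular expression handles. The pattern below is an illustrative sketch, not the notebook's exact solution:

```python
import re

# Matches ratings like "13/10" or "9.75/10" anywhere in the text.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

def extract_rating(tweet_text):
    """Return (numerator, denominator) for the first rating found, or None."""
    m = RATING_RE.search(tweet_text)
    if not m:
        return None
    return float(m.group(1)), int(m.group(2))
```

Part of the assessment work is then spotting where a pattern like this misfires, e.g. on dates or fractions in the text that are not ratings at all.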
