Giter Club home page Giter Club logo

mariadb-corporation / dev-example-columnstore-developer-survey-data Goto Github PK

View Code? Open in Web Editor NEW
0.0 3.0 3.0 8.91 MB

This repository provides data and information that will enable you to transform, import and analyze the raw Stack Overflow Annual Developer Survey data with MariaDB ColumnStore.

License: MIT License

Python 100.00%
mariadb mariadb-database mariadb-server mariadb-columnstore python python3 jupyter-notebook jupyterlab plotly pandas

dev-example-columnstore-developer-survey-data's Introduction

Analyzing Stack Overflow Annual Developer Survey Data with MariaDB

⚠️ [UNMAINTAINED] This repository has been moved and is currently maintained here. ⚠️


This repository provides data and information that will enable you to transform, import and analyze the raw Stack Overflow Annual Developer Survey data with MariaDB ColumnStore.

Starting at Step 2, Parse and transform the data, this repository will walk you through the process of preparing, importing and, ultimately, being able to analyze the raw Stack Overflow Annual Developer Survey data for 2020 (which is included in the developer_survey_2020 folder).

Note: This repository will be updated to include the raw 2021 data once it becomes available here.

Table of Contents

  1. Requirements
  2. Parse and transform the data
  3. Prepare the database
    1. Docker Container
    2. MariaDB SkySQL
  4. Create the schema
  5. Import the data
  6. Analyze the data
    1. SQL
    2. MariaDB SkySQL
  7. Support and contribution
  8. License

Requirements

Parsing and transforming the data

In this sample you will use the Python script file, parse_and_transform.py, to parse and transform the Stack Overflow Annual Survey data for 2020.

Executing parse_and_transform.py will parse through the [survey results data] and split each row, which contains all the responses from an single survey respondent, into multiple rows, one per response. The result of turns ~66k rows of data into ~5 million rows.

To execute the parse_and_transform.py you need to perform the following steps.

  1. Open a new terminal window at this location.

  2. Create a new Python virtual environment.

$ python3 -m venv venv
  1. Activate the virtual environment.
$ . venv/bin/activate
  1. Install the pandas Python package, which will be use for data manipulation within parse_and_transform.py.
$ pip install pandas
  1. And, finally, execute the parse_and_transform.py script!
$ python3 parse_and_transform.py

Preparing the database

To be able to store and analyze the survey data you're going to need a place to put it. MariaDB to the rescue! Below includes instructions on setting up a local database, using the official MariaDB Docker image, or MariaDB SkySQL, the ultimate MariaDB database in the cloud.

Docker Container

Running a single instance (container) of MariaDB ColumnStore is incredibly simple using the MariaDB Community Server ColumnStore image.

Check out the instructions here.

MariaDB SkySQL

SkySQL is the first and only database-as-a-service (DBaaS) to bring the full power of MariaDB Platform to the cloud, including its support for transactional, analytical and hybrid workloads. Built on Kubernetes, and optimized for cloud infrastructure and services, SkySQL combines ease of use and self-service with enterprise reliability and world-class support – everything needed to safely run mission-critical databases in the cloud, and with enterprise governance.

Get started with SkySQL!

IMPORTANT: Once you've registered for MariaDB SkySQL you will need to create a new analytics service so that you can take advantage of the MariaDB columnar storage engine, ColumnStore. For more information on how to do this check out this walk-through, or check out this short video on launching a new SkySQL service - don't worry it only takes a couple of minutes!

Create the schema

The survey result data contained in newly created answers.csv file will need to be imported to MariaDB. To accomodate that you will need to create a new database, survey_data, that contains a single table, answers.

To create the new database and table you can either copy and execute the following code within a database client of your choice.

DROP DATABASE IF EXISTS survey_data;
CREATE DATABASE survey_data;

CREATE TABLE answers (
    respondent_id INT unsigned NOT NULL, 
    question_id VARCHAR(25) NOT NULL,
    answer VARCHAR(65) NOT NULL
) ENGINE=ColumnStore DEFAULT CHARSET=utf8;

or use the MariaDB Client to execute the schema.sql script contained within this repository.

For example:

Locally

$ mariadb --host 127.0.0.1 --user root -pPassword123! < schema.sql

SkySQL

mariadb --host analytics-1.mdb0001265.db.skysql.net --port 5001 --user DB00004537 -p --ssl-ca ~/Downloads/skysql_chain.pem < schema.sql

Note: Remember to update the command above with your database location, user and SSL information accordingly!

Import the data

After you've created the new schema, you can import the answers.csv data using the MariaDB Client.

For example:

Locally

mariadb --host 127.0.0.1 --port 3306 --user root -pPassword123! -e "LOAD DATA LOCAL INFILE 'answers.csv' INTO TABLE answers FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'" survey_data

SkySQL

mariadb --host analytics-1.mdb0001265.db.skysql.net --port 5001 --user DB00004537 -p --ssl-ca ~/Downloads/skysql_chain.pem -e "LOAD DATA LOCAL INFILE 'answers.csv' INTO TABLE answers FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'" survey_data

Note: Remember to update the command above with your database location, user and SSL information accordingly!

Analyze the data

Once the data has been successfully imported into MariaDB there are many ways that you can use the data!

SQL

You can use a database client to execute SQL queries directly on the results data contained in the answers table.

For exmple, using the MariaDB Client:

  1. Start by connecting to your MariaDB database instance.
$ mariadb --host 127.0.0.1 --user root -pPassword123!
  1. Execute the following query to SELECT the top 10 programming langauges that have been used by respondents that have also used MariaDB.
SELECT
	answer, COUNT(answer) AS respondent_count
FROM
	survey_data.answers
WHERE 
	question_id = "LanguageWorkedWith" AND 
	respondent_id IN (SELECT respondent_id FROM answers WHERE question_id = "DatabaseWorkedWith" AND answer = "MariaDB")
GROUP BY
	answer
ORDER BY
	COUNT(answer) DESC
LIMIT 10;
+-----------------------+------------------+
| answer                | respondent_count |
+-----------------------+------------------+
| JavaScript            |             6878 |
| HTML/CSS              |             6597 |
| SQL                   |             6239 |
| PHP                   |             5149 |
| Python                |             4204 |
| Java                  |             4028 |
| Bash/Shell/PowerShell |             3746 |
| TypeScript            |             2558 |
| C#                    |             2396 |
| C++                   |             2277 |
+-----------------------+------------------+

Python & Jupyter Lab

You can also use modern data analysis and visualization tools like Jupyter Lab, in combination with MariaDB Connector/Python and Python libraries like Plotly and Pandas.

For more information on how you can do this please check out the following resources:

Support and Contribution

Please feel free to submit PR's, issues or requests to this project project or projects within the official MariaDB Corporation GitHub organization.

If you have any other questions, comments, or looking for more information on MariaDB please check out:

Or reach out to us diretly via:

License

License

dev-example-columnstore-developer-survey-data's People

Contributors

rhedgpeth avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.