Giter Club home page Giter Club logo

udacity-project-1's Introduction

Write a Data Science Blog Post

Business Understanding

  • Objective of the project:
    Asking questions of interests that arise from the use of the Stack Overflow Developer data from 2019 to 2021

  • What is the expected effect of the project? What are the success criteria for the project?
    The anticipated payoff is just more knowledge about the most commonly used technologies by developers. If the technolgies vary for open-source contributors and less frequent open-source contributors. If this trend changes over time as well. The success criteria is seeing the trends honestly

Data Understanding

  • Where does the data come from? What are the data sources?
    The data come from Stack Overflow, which is a forum used for developers to share information. Post questions and answers based info. posted by users of the platform. It is a trusted sources and seen as a lifeline for many developers of various walks of life. The data sources are surveys from various years taken by developers.

  • How can we best describe the data at a high level? What is the metadata of our datasets?
    The data being used is Stack Overflow survey data that was taken by developers who use and are active on the platform. This has entries recording salary, education level, location, technologies used (databases used, platforms used, languages used, etc), Mental Health, Employment Type (i.e. Full-Time, Part-Time, Student, Unemployed, etc). This can be used to answer some questions in regard to developers based on these results.

  • How is our data quality?
    The data set between the years have some similarities, like languages worked with, ethnicity, salary and etc. But their were columns that had several NaN, another interesting thing that occurred was the different schemas and columns that appeared between the survey for different years. This did not affect the analysis though because the nature of the questions were geared towards specific things like language and etc between different groups (i.e. open-source contributers, etc)

Data Preparation

  • Did we select the right data?
    Yes, the Stack Overflow survey data is the right data for gaining general knowledge about programmers, technologies they use and prefer to work with, contributions to open-source projects and etc.

  • Do we need to generate new data?
    No, that would invalid the answers that are given from the survey data. The data in the survey is sufficient to answer the questions of interest. Even though no data is generated, it is important to note that their were rows that omitted based on the criteria that was being searched on. I did not attempt to imputed the data or create any categorial data because it would bias the results from the survey so I thought it would be best to ignore any rows with Nan instead of imputing them.

  • Do we need to derive new attributes? Yes, but it was only one, the Year attribute was added to different between the different survey results. This is not needed, just added for convenience

Modeling

The questions asked were able to be answered with only inferential statistics. So no modeling took place for this project.

Evaluation

Likewise, no model evaluation was needed for the questions asked for this project.

Deployment

N/A

Summary

between 2019 - 2021, developers seem to really be fond of

  • Javascript
  • HTML/CSS
  • SQL
  • Python

Developers who contributed more frequently to open-source projects desired to use technologies like Node.js, Tensorflow and NET Core. Developers who never contribute to open-source projects desired similar tools

Next Steps

I could add more survey data from previous years to get a better idea of the changing trend in desired technologies. Currently this notebook only uses 2019 - 2021, which is a small timeframe to be drawing any big conclusions from. Maybe after increasing the amount of time from 2019 - 2021 to maybe 2010 - 2021 (i.e. a decade of data) then maybe this can be used to create a model that can guess the trend for the next year.

Libraries

  • Numpy
  • Matplotlib
  • Pandas

Files

  • data/ - directory containing all the csv files used throughout the notebook in the repository.
  • README.md/ - currently reading it
  • stackoverflow_2021.ipynb/ - interactive python notebook that is used to explore the data in the data directory

udacity-project-1's People

Contributors

khayyon1 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.