date-a-scientist's People

Contributors

rvntone

date-a-scientist's Issues

Job mapping and inherent hierarchy

Mapping non-numeric values to unique numbers is usually no problem and is often the ideal approach. However, if the data has no inherent hierarchy, doing so can invalidate your results entirely. Smoking and drinking do have an inherent hierarchy: 0 = has never smoked and 5 = smokes a pack per day, for example. Job, however, does not (how can I say 0 = lawyer and 5 = mechanic?). This mapping may have produced some very inconsistent and useless results.
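One common alternative for a column with no hierarchy is one-hot encoding, where each category gets its own 0/1 column so that no ordering is implied. The sketch below is only an illustration: the job labels and the `one_hot` helper are hypothetical and not taken from the project. In pandas, `pandas.get_dummies` does the same job.

```python
# Hypothetical sample data; the job labels are illustrative, not from the dataset.
jobs = ["lawyer", "mechanic", "student", "lawyer"]


def one_hot(values):
    """Return one dict per value, with a 0/1 indicator for every category."""
    categories = sorted(set(values))
    return [{f"job_{c}": int(v == c) for c in categories} for v in values]


encoded = one_hot(jobs)
for row in encoded:
    print(row)
```

Each row has exactly one indicator set to 1, so a model cannot read "mechanic" as being larger than "lawyer".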

The issue of filtering out records

There are many occasions where a data set has some extreme outliers which need to be manipulated or taken out of the data set. Unfortunately, the approach you took is a bit problematic. First, you discarded all the records which reported income as -1. Does -1 mean no income, or that they didn't disclose their income? This we cannot know. If only 1% or less had this response, taking them out would be no problem. In this case a vast majority of the data points had a -1 for income (around 65-70%, I think). Throwing these data points out is problematic for two reasons. First, it makes the data very biased towards people who did disclose their income. For example, one could argue that people with lower incomes are less likely to report their income; that would make your data set very biased towards high-income people and could skew or invalidate your results. Second, by discarding each record just because the income column was not useful, you are also discarding about 65% of your otherwise useful data, which gives you a poorly formed dataset to work with.
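Before dropping rows, it can help to measure how many would be lost and whether they differ from the rows that stay. The following is a minimal sketch with made-up records; the `age` column is used purely as an example of another feature to compare.

```python
from statistics import mean

# Hypothetical records; -1 is the placeholder used in the income column.
records = [
    {"age": 24, "income": -1},
    {"age": 31, "income": 50000},
    {"age": 22, "income": -1},
    {"age": 45, "income": 120000},
    {"age": 27, "income": -1},
]

missing = [r for r in records if r["income"] == -1]
reported = [r for r in records if r["income"] != -1]

# Share of rows that a "drop every -1" filter would discard.
share_missing = len(missing) / len(records)
print(f"share with income == -1: {share_missing:.0%}")

# If the two groups differ on other columns, dropping one of them biases the sample.
print("mean age, income missing: ", mean(r["age"] for r in missing))
print("mean age, income reported:", mean(r["age"] for r in reported))
```

A large share, or a clear difference between the two groups, suggests the filter changes who is in the data rather than just cleaning it.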

The second issue with how you filtered the data is throwing out records with income higher than 100K. 100K is about the median salary for most tech/software jobs with a couple of years of experience; it is by no means an outlier. A common rule of thumb defines an outlier as a value which is 3 or more standard deviations from the mean of all incomes, but common sense also applies. In my opinion 100K is a valid income; maybe anything over about 300K is an outlier.
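The 3-standard-deviation rule of thumb can be sketched as follows. The income values are invented for illustration, and the sample standard deviation (`statistics.stdev`) is an assumption; the population version would work similarly here.

```python
from statistics import mean, stdev

# Hypothetical incomes: typical values plus one extreme entry.
incomes = [40000, 50000, 60000, 70000, 80000, 100000] * 3 + [1000000]

mu = mean(incomes)
sigma = stdev(incomes)  # sample standard deviation


def is_outlier(value, threshold=3.0):
    """True when value lies `threshold` or more standard deviations from the mean."""
    return abs(value - mu) / sigma >= threshold


outliers = [x for x in incomes if is_outlier(x)]
print(outliers)
```

In this made-up sample only the 1,000,000 entry is flagged, while 100,000 stays well inside the threshold. On very small samples the rule rarely flags anything, which is one reason common sense still matters.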

One solution to the income problem could be to use the mean of the reported incomes as a placeholder for the -1 values, so that the other useful information doesn't need to be thrown out.
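The placeholder idea can be sketched like this, assuming -1 always means "not reported". The `impute_income` helper and the sample values are hypothetical.

```python
from statistics import mean

MISSING = -1  # placeholder value used in the income column


def impute_income(incomes):
    """Replace MISSING entries with the mean of the reported incomes.

    Raises statistics.StatisticsError if no income was reported at all.
    """
    reported = [x for x in incomes if x != MISSING]
    fill = mean(reported)
    return [fill if x == MISSING else x for x in incomes]


incomes = [-1, 50000, -1, 70000]  # hypothetical values
filled = impute_income(incomes)
print(filled)
```

Note that the mean is computed only over reported values; including the -1 placeholders would drag it down. When most of the column is missing, the filled column will have very little variation, so it may be worth checking how much the income feature still contributes to the model.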

Summary

Rubric Score

Criteria 1: Valid Python Code

  • Score Level: 4 (Meets Expectations)
  • Comment(s): All code is valid without errors.

Criteria 2: Exploration of Data

  • Score Level: 4
  • Comment(s): Data is explored well and visualization is useful. Also, good job using the exploration part to develop hypotheses about the data.

Criteria 3: Machine Learning Techniques used correctly

  • Score Level: 3 (Meets Expectations)
  • Comment(s): ML techniques are used properly but have some flaws in the way data was used to develop them, see other issues for details.

Criteria 4: Report - Are conclusions clear and supported by data?

  • Score Level: 2 (Approaches Expectations)
  • Comment(s): Conclusions are present and clear but lack analysis and metrics. Your models produced some lackluster results. This is not a problem in itself, but it requires some deep analysis of why it happened. Good conclusions should include metrics, numbers, and in-depth consideration of the causes of the flaws in your models.

Criteria 5: Code formatting

  • Score Level: 3
  • Comment(s): Formatting is readable and not tedious; however, some comments about what is happening would help.

Overall Score: 16/20

Overall this project is done well but lacks some necessary metrics, results, and conclusions. Also, the way in which the data was processed is likely the source of your less-than-useful results. The code looks good, but try to include some comments. Keep it up and happy coding!
