
orie4741_project's Introduction

README for ORIE4741_Project

This is a git repository that is predominantly used for data and document storage for the ORIE 4741 Project at Cornell University

The main collaborators are Gloria D'Azevedo ([email protected]) and Pihu Yadav ([email protected])

In this project we are investigating and analyzing the Speed Dating Kaggle dataset (https://www.kaggle.com/annavictoria/speed-dating-experiment)

The full project proposal can be found in ORIE_4741_Project_Proposal.pdf

Thanks for reading!

Updated September 28, 2016

orie4741_project's People

Contributors

gloriadazevedo, pihuyadav

orie4741_project's Issues

Comments from Liyun Wang

Really interesting project! I am sure a lot of people would want to see the results of your project.

3 things I like:

  1. You selected the characteristics that people care about most, such as hometown, education, career, and religion. Those are important screening criteria during the initial part of the speed dating process.
  2. You considered how to define a successful speed date.
  3. You considered the different weights people assign based on their preferences. I think this is extremely important because a successful date is not only about criteria matching. People bring preferences and stereotypes into a date, so knowing how different people assign different weights to different characteristics is crucial to understanding what makes a successful match.

3 things to improve:

  1. Some of the description is not clear. I am not sure I understand the “inherent biases” you mentioned in paragraph two. More explanation would be helpful.
  2. The dataset comes from Columbia Business School, which means the people who participated in this event were most likely based in NYC. I am concerned about how well the results will generalize to speed dating events across the country.
  3. Speed dating is a very complicated setting. The decisions are usually so irrational that even the person who makes the decision can’t explain it. How do you control the error of your prediction?

Final report peer review (Clara Ong)

Things I liked:

  1. You tried various kinds of models, probably with knowledge gained from your STSCI Machine Learning class.
  2. Model choice was well substantiated using AIC and BIC, though you could provide the formula for AIC. For both AIC and BIC, you could also have explained what the terms in the equations mean (the textbook definitions are sketched after this list).
  3. Page 6: The BIC plots for the best subset selection models were very informative.
  4. You explained why you wanted to keep the models sparse (so that surveys will be shorter and have higher response rates).
  5. Page 7: Table in “Results and conclusion” provided a clear summary. I liked that you considered “model size” in addition to “misclassification rate”.
  6. You acknowledged the possible biases and issues regarding non-response. You also mentioned that data regarding actual dates (confirmation of matches) would be very useful.
  7. Apart from interpreting the results technically, you interpreted the results intuitively to check for face validity. It was interesting to me to see that Intelligence was not significant (LOL).
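
For reference (this is not from the report itself), the textbook definitions suggested in item 2 are, in LaTeX notation,

    \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}

where k is the number of fitted parameters, n is the number of observations, and \hat{L} is the maximized likelihood. Lower values are better, and BIC penalizes model size more heavily than AIC whenever ln n > 2.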

Things I was unsure of / could be improved:

  1. You wrote: “The goal of this analysis is to create a model that identifies whether or not a male and a female are a match in the speed dating context”. From what I gather, the model is predicting whether a participant will say yes or no to a partner (Page 4). I think this is slightly different from a “match”, where a male has to say “yes” to a female, and that female has to say “yes” to that male.
  2. You could provide a summary of the dataset by stating what variables there are upfront, instead of mentioning them when you introduce the models.
  3. Page 3: As you mentioned, the summary statistics weren’t helpful for understanding statistical significance. Perhaps you could have provided summary statistics right at the start, under a section called “exploratory data analysis”. The attractiveness histogram could also go under exploratory data analysis, and more data visualization graphs could be added.
  4. Page 5: You said “In linear models with interaction effect, we have to include the main effects as well as the interaction effects, even if the main effects are not significant to the model, for interpretation.” I didn’t think it was necessary to show the main effects that weren’t significant. That could have reduced your model size. You could say that race alone was not significant, which is why you discarded it from the eventual model.
  5. Apart from reporting results in terms of model size and misclassification rate, you could report the actual coefficients of the significant terms.
  6. Apart from the section on cross-validation and K-means, you didn’t mention how the data was split into training and test sets (was it split?). I also wasn’t sure whether your misclassification rates were on the training or test set.
  7. You could report errors on both the training and test sets to check for over- or under-fitting (a minimal sketch of this check appears after this list).
  8. Conclusion: "The best model that we recommend..." -- This part onwards is your recommendation. Would be better to write this in a new paragraph for clarity since it's the most important part.
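
As a minimal sketch of the check suggested in items 6 and 7, one could split the data and compare misclassification rates on both sides. The feature matrix X and yes/no labels y below are synthetic stand-ins, not the report's actual speed-dating data.

    # Sketch of the train/test over-fitting check (synthetic stand-in data).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6))                    # stand-in survey features
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)

    train_err = 1 - model.score(X_tr, y_tr)          # misclassification, training set
    test_err = 1 - model.score(X_te, y_te)           # misclassification, held-out set
    print(f"train error {train_err:.3f}, test error {test_err:.3f}")
    # Test error far above train error suggests over-fitting;
    # both errors high suggests under-fitting.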

Overall, I thought this report was very clearly and systematically written! I could see that effort was put into the models and the analysis.

Midterm Report Peer Review (mk872)

Firstly, I think the topic and question addressed are very relevant and interesting. It was good that you were able to explore the data, see patterns that arose for specific cases, and spot where missing data popped up. I was also curious as to how many observations were in the dataset (maybe I missed it, but make sure to include it in the final!).
I thought it would be helpful to include visualizations and graphs of the data, or even more specific results from the regression tests. I also thought your experimentation relied on heavy assumptions about what the data was saying (especially for the missing data that you concluded was equal to 0). Is there not enough data for you to simply keep those values as missing? And for the correlated variables, consider using just one of them in your models, since the coefficients can become unstable if you keep them all.

Final Report Review

Very well written final report!

Summary of the report
The team is aiming to develop a model that predicts whether or not two people will be attracted to one another. This is achieved by identifying what traits individuals look for and value most in their partners.

Things I really like about this project/report:

  1. It is well written. The report is easy to read and provides enough explanation where needed.
  2. The team designed the project well by clearly defining what characteristics they wanted from the final results, mentioning sparsity as a desired feature. They also showed a clear understanding of the class material by experimenting with the Lasso regularizer to achieve this.
  3. They are aware of the potential inherent biases in the missing fields and took appropriate action to deal with the problem.
  4. The team employed multiple algorithms learnt in class and compared the pros and cons of each one clearly at the end of the report.

Things I am confused about / don't like about this project

  1. There is a small typo in Equation 1: the summation index should be i instead of j.
  2. If the team wanted a simple model with the fifth model, why not increase lambda further or use a k-sparse algorithm to enforce that? Currently the team looks at different models with different lambdas but does not choose the one with the minimum error; rather, it considers model simplicity as well. These seem to be conflicting goals, and the choice feels somewhat arbitrary. If the team knows how sparse they want the model to be, a k-sparse algorithm might be a good idea (a sketch of such a lambda sweep follows this list).
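
As a hedged sketch of the lambda sweep described in item 2 (on synthetic stand-in data, not the report's dataset): with an l1-penalized logistic regression, shrinking scikit-learn's C parameter (i.e., increasing lambda) drives more coefficients to exactly zero, so one can read the sparsity/error trade-off directly and stop at the desired model size.

    # Sketch: sweep the l1 penalty and watch sparsity trade off against CV error.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 10))                     # stand-in features
    y = (X[:, 0] - X[:, 2] + rng.normal(size=400) > 0).astype(int)

    for C in [1.0, 0.3, 0.1, 0.03, 0.01]:              # smaller C = larger lambda
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        cv_err = 1 - cross_val_score(clf, X, y, cv=5).mean()
        n_nonzero = np.count_nonzero(clf.fit(X, y).coef_)
        print(f"C={C:<5} cv error={cv_err:.3f} nonzero coefficients={n_nonzero}")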

Things might be worthwhile trying

  1. Experiment with different regularizers, such as Huber, and different loss functions, such as hinge loss (a brief sketch follows this list).
  2. Implement a train/validate/test split for the first two methods.
  3. Incorporate more of the interaction effects, like the l2/l1 differences; they were mentioned but not implemented in the report.
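
For item 1, a minimal sketch of swapping in other losses and regularizers via scikit-learn's SGDClassifier (loss names as in recent scikit-learn versions; the data is again a synthetic stand-in). Note that SGDClassifier offers "modified_huber" as its Huber-style classification loss.

    # Sketch: trying hinge / modified Huber losses with l1 or l2 penalties.
    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    X = rng.normal(size=(400, 8))                      # stand-in features
    y = (X[:, 0] + X[:, 1] + rng.normal(size=400) > 0).astype(int)

    for loss in ["hinge", "modified_huber", "log_loss"]:
        for penalty in ["l1", "l2"]:
            clf = SGDClassifier(loss=loss, penalty=penalty, random_state=0)
            err = 1 - cross_val_score(clf, X, y, cv=5).mean()
            print(f"{loss:>14} + {penalty}: cv error {err:.3f}")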

Final Peer Review (xp33)

Summary:

This project aims to create a model that predicts male/female pairings in a speed dating context by identifying the important traits participants look for in a partner. The expectation is to use the specified data to achieve higher accuracy in pairing predictions. Again, as I mentioned in the last review, this is an interesting topic to explore and one that probably does not have much research behind it (maybe besides the Columbia professors who conducted these experiments). The results concluded that attractiveness, being fun, and having shared interests are the most important traits participants look for, based on the cross-validated Lasso model.

Positives:

  1. Interesting methods for cleaning the missing and "NA" values in your data set. I wonder what kind of results you would have gotten had you decided to give a missing entry of an ordinal value the average (say a 5, for all missing entries in "tv").
  2. Unsurprising but still interesting that the attractiveness score plays a huge role in determining whether participants said "yes" or "no".
  3. Nice use of the many models and techniques we learned in class!

Improvements:

  1. It would be useful to obtain follow-up data (i.e., data collected from couples after they have gone on dates, such as the success rate and duration of their relationship).
  2. There seems to be a lot of bias just from the initial data collection, so in order to build a reliable model that people will put their faith in, it is generally important to collect a lot of data with as few missing responses as possible. Of course, the datasets we work with are often prepared for us, so we have little control over some of these external factors.

Peer Review from Yangwen Wan(yw762)

3 things I like about your project/proposal:

  1. Your project goals are clearly stated, along with a detailed description of your dataset and how you can design the feature representations.
  2. I like your concern that some compatible couples do not return for subsequent events, making the dataset incomplete.
  3. I like your concern about the randomness of the population and the bias resulting from sample selection.

3 things to consider:

  1. I think the evaluation of compatibility needs further attention. It could be measured by different metrics, such as the length of the relationship, the level of satisfaction, or the number of events attended.
  2. Since the majority of the data comes from Columbia students, who are similar in education level and age, can your model generalize to people outside the school with more diverse backgrounds?
  3. It is possible that people of different ethnicities tend to place different weights on a certain factor. It might be good to have the model use a different coefficient vector (w) for different groups of people, assuming it is a regression model (a small sketch follows this list).
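
A minimal sketch of item 3's idea on synthetic stand-in data: fit a separate coefficient vector w per group rather than one shared vector (adding group-by-feature interaction terms to a single model is an equivalent alternative).

    # Sketch: one coefficient vector per group instead of a shared one.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    X = rng.normal(size=(600, 5))                  # stand-in features
    group = rng.integers(0, 3, size=600)           # stand-in group label
    # Simulate groups weighting features 0 and 1 differently.
    w_true = {0: (2, 0), 1: (0, 2), 2: (1, 1)}
    signal = np.array([w_true[g][0] * x[0] + w_true[g][1] * x[1]
                       for g, x in zip(group, X)])
    y = (signal + rng.normal(size=600) > 0).astype(int)

    for g in np.unique(group):
        mask = group == g
        w_hat = LogisticRegression().fit(X[mask], y[mask]).coef_[0]
        print(f"group {g}: fitted w on first two features = {np.round(w_hat[:2], 2)}")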

Peer review from Siyuan Wang

3 Things I like:

  • I like how you thought about different factors that could negatively impact your model, such as the randomness of the population sample and early success in the speed dating rounds.
  • You explored many different features that you could potentially use to construct your model, such as age, racial background, etc. You also mentioned some of the approaches you may want to experiment with, such as bucketing the age groups. That shows you have given serious thought to how to utilize your dataset.
  • The topic is very interesting and relatable. I believe you can potentially even join this Kaggle data with some other data set to explore more features.

3 Things I think need improvement:

  • You should briefly discuss how you are going to set aside a test dataset and verify your trained model against your hypothesis.
  • The last part of the last paragraph seems to focus too much on the details of the dataset itself (about the weights). It would be great if you could expand on how you plan to use this data to achieve your objective.
  • It would be helpful to briefly cover the structure of your data set before going into the details of how to use them.

Peer Review zw429

3 Things I like:

  1. This is a truly messy dataset; it aligns with our class objectives very well.
  2. The proposed analysis of the data's condition (whether it is truly random/representative) is reasonable and explained well.
  3. The project dataset and goal are well explained.

3 Things I noticed:

  1. It would be nice to format the proposal into sections with smaller paragraphs for easier interpretation.
  2. It would be nice to show the main features that will be used for the different data analysis methods and why they were chosen.
  3. It would be nice to have more discussion of the general plan and a little less detailed example.

Final Peer Review (zm223)

The layout of this project is really clear and it is well written. I really like that you included a brief executive summary, which helped me understand what you were trying to do. Your topic is also very interesting and certainly can be modeled. Another thing that is particularly good is that you clearly stated the reason for using each model or technique and your expectations for it, making the flow of the report very smooth. The project also covers many topics we learned in the course, showing that you have a really profound understanding of the concepts introduced by the professor.

Some things you may consider changing. First, you may want to include more data visualizations. Although you have described your dataset clearly in words, it would be better if you could show the patterns in your data through graphs. Second, consider adding more explanation of why the K-nearest-neighbor method yields such a bad result. Since KNN is a standard classification algorithm, does that imply that your dataset may not be suitable for classification simply because there is no clear boundary between the groups? Third, except for the KNN model, all the other models have relatively similar results (i.e., similar cross-validation results); you may want to use more comparison methods to compare these models (a small sketch follows).
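
A minimal sketch of that kind of side-by-side comparison, on synthetic stand-in data: score KNN and a linear model on identical cross-validation folds. If KNN lags badly while the linear model does well, that hints the classes are separated by a broad linear boundary rather than by local neighborhoods, which is one plausible reading of the report's KNN result.

    # Sketch: compare KNN and logistic regression on the same CV folds.
    import numpy as np
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 12))                 # stand-in features
    y = (X @ rng.normal(size=12) + rng.normal(size=500) > 0).astype(int)

    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    for name, clf in [("KNN (k=5)", KNeighborsClassifier(n_neighbors=5)),
                      ("logistic regression", LogisticRegression())]:
        err = 1 - cross_val_score(clf, X, y, cv=cv).mean()
        print(f"{name}: cv misclassification {err:.3f}")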

Final Peer Review (an533)

Summary

The purpose of the project is to improve the surveys used in online/speed dating events in order to increase the accuracy of predicting matches between people. The team tries to accomplish this by identifying which characteristics are important in determining which couples match. The team models what attributes men look for as well as what women look for in a person they date. The data that is explored consists of post-event surveys from Kaggle.

Positives

  1. The executive summary presents the problem and findings very clearly. It also explains the limitations of the model as well as the reasoning behind choosing the final model.

  2. I like how you addressed concerns the reader may have about the imputation of the data.

  3. The last few sentences in your conclusion section are super solid. Summarizing your findings in this way is very effective.

  4. The histogram of attractiveness scores is simple, yet effective. It shows that the "No" response is approximately normal while the "Yes" response is left-skewed.

Improvements

  1. In the paper, it is stated:
    "In this analysis (logistic) all the values with missing data have been removed from the data when computing the coefficients for the logistic regression, as it would be misleading to impute the values of scores given to partners or assign the average value when the answer has been left blank by the participant."
    For your logistic model, you retain about 83% of the data, but there can be a lot of information you are missing in the other 17%. For missing data (in the feature space), perhaps you could try using K-nearest neighbors to impute the values that seem to have been randomly left blank (see the sketch after this list). Of course, you should look at the data to see if this imputation makes sense (i.e., people with the same classification tend to have this particular feature). I do not think it is misleading if you provide legitimate reasoning for your assumptions.

  2. For the conclusion, when you state "the same for both training and making predictions so there’s a high probability that the model is over-fitting the data," are you talking about the logistic model and the fact that you trained and tested on the same data? Perhaps this should be made clearer.

  3. As an aside, there is one grammatical error in section 4 ("try" and "tried" are used in the same paragraph; the tenses should match).
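
A minimal sketch of the KNN imputation idea from item 1, using scikit-learn's KNNImputer on stand-in data with roughly 17% of entries missing at random (mirroring the share dropped in the report); the 1-10 score columns here are assumptions, not the actual survey fields.

    # Sketch: fill randomly missing survey scores from the k nearest respondents.
    import numpy as np
    from sklearn.impute import KNNImputer

    rng = np.random.default_rng(5)
    X = rng.integers(1, 11, size=(200, 4)).astype(float)  # stand-in 1-10 scores
    mask = rng.random(X.shape) < 0.17                     # ~17% missing at random
    X[mask] = np.nan

    imputer = KNNImputer(n_neighbors=5)  # each gap filled by averaging 5 neighbors
    X_filled = imputer.fit_transform(X)
    print(f"imputed {mask.sum()} of {X.size} entries; "
          f"first fill: {X_filled[mask][0]:.1f}")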

Midterm Proposal Review

This project uses data from speed dating experiments to predict the compatibility between two people, male and female. Very interesting topic, pretty applicable in the right situations, and a good read. Your description of your dataset is excellent and really leads me into your project without much effort – a great introduction to the project overall. It would be good to include some definition of compatibility from the experiment itself – perhaps it’s a number from a survey in the experiment or a yes/no response (I think this is mentioned later in the report, but it would be best in the beginning so we know what your model will be looking for).
The removed data is a bit tricky in this case, because you mentioned earlier that a lot of the data is open-ended and therefore much of it is missing or incomplete. It seems like a bit of an error to me that an “NA” or blank value should be interpreted as a 0, since a number like that can affect the model unintentionally. Honestly I don’t know how you’d fix this, or whether it’s even necessary to fix – wouldn’t “uninterested” be labeled more often as 1 instead of just left blank? This is a tough call.
It would also be nice if the regressions you mentioned were included in the write-up. Maybe you didn’t have time or couldn’t bring them up, so I don’t think it’s a big deal for now; just make sure you include them, as it would make the report a lot easier to read. I feel like a lot of your analysis at this point is a bit incomplete, or needs a lot of refining before you can say any patterns are present. But I do like the direction you’re headed in with the Next Steps portion of the project – well-crafted ideas based on the information you’ve already gathered.

Comments from Anqi Wang

Nice job! It's a really interesting idea, I enjoyed reading it.

3 things I like:

  1. Very thoughtful; you mentioned some biases that could make the prediction inaccurate. Great!
  2. This project is well organized. The problem statement is defined clearly.
  3. The dataset is closely related to our classroom material; I believe you will all learn a lot during this project!

3 things to improve:

  1. There are minor typos. Please make sure to proofread.
  2. I'm not sure if it's my problem - I could only see the first page of your proposal.
  3. To appeal to managers and clients in the business world, keep the language precise and easy for them to understand.

I quite like your proposal in general! Good job, Gloria and Pihu!

Peer Review from Adam Wang (yw287)

I found the speed dating problem really intriguing. After examining the data, I am convinced that it is sufficiently messy (with different scales, types, and missing data).

3 Things I like:

  1. You made the hypothesis that certain factors will automatically influence the compatibility. It would be interesting to see which factors are truly important for a speed date.
  2. You proposed how to pre-process the data such as rounding to the nearest 5. It's important to pre-process the data to normalize results and minimize noise.
  3. You brought up the fact that the data might not have been randomly selected. I think it's a valid point that this data was not collected from a 100% representative sample.

3 Things I noticed:

  1. I was wondering whether the metrics in the data are comprehensive enough to support a valid prediction. There are objective factors, but people can also be irrational about relationships.
  2. How would you pre-process and normalize the categorical data? Some of it is not very consistent (e.g., NYC vs. New York).
  3. How can you evaluate the design choices for pre-processing (e.g., is rounding to the nearest 5 better than rounding to the nearest 10)?

Overall, I really liked this problem. I'm very excited to see the results!
