
okcupid's Introduction

OkCupid

An unsupervised approach for detecting differences in self-presentation on OkCupid.

A final project for Applied Natural Language Processing, Fall 2015.

Proposed Approach

We will use the following methods to detect differences in self-presentation (a rough sketch of how they might fit together follows the list).

  • Pointwise mutual information for generating features
    • unigrams, bigrams, and trigrams used by at least 1% of users
  • Dimensionality reduction with Principal Component Analysis
  • Clustering with k-means
  • Keyphrase extraction for cluster "descriptions"
  • Look for differences in demographic distributions by cluster
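
A rough sketch of how these steps might chain together, assuming one concatenated essay string per user. The toy essays, the parameter values, and the PMI weighting shown here (log of observed over expected user-term co-occurrence, clipped at zero) are illustrative assumptions, not the project's actual code.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

def pmi_weight(counts):
    """Replace raw n-gram counts with positive PMI scores."""
    total = counts.sum()
    p_user = counts.sum(axis=1, keepdims=True) / total
    p_term = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log((counts / total) / (p_user * p_term))
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Toy stand-ins for the real per-user essay strings.
essays = ["making people laugh and meeting new people",
          "i spend a lot of time thinking about food",
          "on a typical friday night i am trying new things",
          "really good at pretty much anything outdoors"]

# Unigrams through trigrams used by at least 1% of users (min_df as a proportion).
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=0.01)
X = pmi_weight(vectorizer.fit_transform(essays).toarray().astype(float))
X_reduced = PCA(n_components=2, whiten=True).fit_transform(X)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_reduced)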

Essay Prompts

0: My self summary
1: What I'm doing with my life
2: I'm really good at
3: The first thing people notice about me
4: Favorite books, movies, tv, food
5: The six things I could never do without
6: I spend a lot of time thinking about
7: On a typical Friday night I am
8: The most private thing I am willing to admit
9: You should message me if

okcupid's People

Contributors

jnaras, juanshishido, matarhaller


okcupid's Issues

urls

I remember seeing a comment about there still being URLs in the text. I used the following on another project with some pretty good success: .apply(lambda x: re.sub(r'(http\S*|www\S*)', '', x)). There are probably ways to improve this. Could be useful if we want to filter those out.
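
For reference, a self-contained version of that snippet applied to a pandas column (the column name and example text are made up):

import re
import pandas as pd

url_pattern = re.compile(r'(http\S*|www\S*)')

df = pd.DataFrame({'essay0': ["check out http://example.com and www.example.org !"]})
df['essay0'] = df['essay0'].apply(lambda text: url_pattern.sub('', text))
print(df['essay0'][0])  # 'check out  and  !'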

Fix plotting

Currently, .plot() is called on a pd.Series resulting from .value_counts().

(tdf.ethnicity_.value_counts().sort_index() /
 tdf.ethnicity_.value_counts().sort_index().sum()).plot(alpha=0.75,
                                                        rot=90,
                                                        figsize=(8, 6))

The issue is that, if the indices aren't the same across clusters, the y-values will be plotted at incorrect x-axis positions.
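
One possible fix (a sketch): reindex each cluster's counts against a fixed list of categories before plotting. Here `tdf` is the per-cluster frame from the snippet above, and `df` stands for the full DataFrame (an assumed name), so that every cluster shares the same index.

all_ethnicities = sorted(df.ethnicity_.dropna().unique())  # built from the full data, not one cluster

counts = tdf.ethnicity_.value_counts()
shares = counts.reindex(all_ethnicities, fill_value=0) / counts.sum()
shares.plot(alpha=0.75, rot=90, figsize=(8, 6))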

determine distinctive tokens

Find ways to select distinctive words or phrases across groups and possibly evaluate their relative importance.

Potential ideas:

  • examine the $\hat{\beta}$s from something like a logistic regression model (a sketch follows this list)
  • surrogate analysis
  • permutation tests (this is possibly related to surrogate analysis)
  • others?
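
A minimal sketch of the first idea, assuming we already have a users-by-tokens matrix `X`, its vocabulary `vocab`, and the cluster assignments (none of these names come from our code):

import numpy as np
from sklearn.linear_model import LogisticRegression

def distinctive_tokens(X, cluster_labels, vocab, cluster, top_n=20):
    """Rank tokens by how strongly their coefficients push toward `cluster`."""
    y = (np.asarray(cluster_labels) == cluster).astype(int)  # one-vs-rest target
    coefs = LogisticRegression().fit(X, y).coef_[0]
    top = np.argsort(coefs)[::-1][:top_n]
    return [(vocab[i], coefs[i]) for i in top]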

lemmatizing

I'm not sure if this is an issue, so much as a question...
The output of the trigrams (after lemmatizing, removing stopwords, etc) is the following:

[(('making', 'people', 'laugh'), 3282), (('http', ':/', 'www'), 2616), (('spend', 'lot', 'time'), 2526), (('meeting', 'new', 'people'), 2468), (("i'm", 'really', 'good'), 2159), (('trying', 'new', 'thing'), 2036), (('meet', 'new', 'people'), 1880), (('pretty', 'much', 'anything'), 1771), (('www', 'youtube', 'com'), 1582), (('typical', 'friday', 'night'), 1574)]

Both (('meeting', 'new', 'people'), 2468) and (('meet', 'new', 'people'), 1880) are included.

If we lemmatized "meeting", shouldn't they be counted as the same (i.e., ('meet', 'new', 'people'))?
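
If we're using NLTK's WordNetLemmatizer (an assumption), this is probably why: it defaults to treating tokens as nouns, and "meeting" is a valid noun, so it only becomes "meet" when a verb POS tag is supplied.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('meeting'))           # 'meeting' (default pos='n')
print(lemmatizer.lemmatize('meeting', pos='v'))  # 'meet'

So we'd either need POS tagging before lemmatizing or have to accept near-duplicate n-grams like these.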

variance explained

Looks like PCA will be harder than we thought.
When I take the first 10 principal components (after whitening), they collectively only explain about 18% of the variance. I'm not sure if the problem is in the data we're putting in (maybe imputing isn't helping our cause?), but we might need to think about this a little bit.
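
For what it's worth, here's a quick way to see how far we'd have to go (a sketch; `data_matrix` is assumed to be the imputed users-by-features matrix):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(whiten=True).fit(data_matrix)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[9])                      # ~0.18 per the numbers above
print(np.argmax(cumulative >= 0.80) + 1)  # components needed for 80% of the variance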

summarize_essays.py

It needs to be edited to accept an essay column (currently it accepts the tokenized full essay).

describing topics

Thinking about ways to describe text topics.

The "topics" we'll try to explain are those defined by text in a given cluster. Text could be an individual or a group of (even all) essay responses—based on how the clusters were established.

This analysis may look at the original text or some cleaned version obtained in a previous step (such as when calculating the PMI). The basic approach I'm thinking of is:

  • keyphrase extraction
  • reduction using hypernyms

Tokens will be lemmatized and stopwords will be removed.

Key(word)phrase Extraction

This could be done in several ways.

  • token frequency
    • n-grams
  • tfidf
    • For a given cluster, create a single document—the concatenated text for all users in that cluster. This is for the tf portion. For the idf, use this single document as well as all of the other individual documents (essay responses).
  • co-occurrence
    • Build a co-occurrence matrix for a single document (the concatenated text for the cluster) and do not define the diagonal. Words that co-occur with other words more often (in sentences) than they would if they were randomly distributed could be thought of as "important." Use the chi-squared test to determine statistical significance and to "control" for words that occur infrequently. (Based on Matsuo and Ishizuka.)
  • subtracting token distributions
    • This would be a one-versus-rest-type approach. We calculate the normalized distribution of tokens for the cluster and subtract out the normalized distribution of the other clusters. The idea is to get the words that are most unique to the given cluster. (A sketch follows this section.)
  • rapid automatic keyword extraction

I'm not sure how any of these methods will perform.

We could also use "standard" keyphrase extraction techniques that look at noun phrases along with other tokens. This might be more difficult to reduce, though. Still, it should be explored.
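
A rough sketch of the subtraction idea, assuming `cluster_tokens` maps a cluster id to the flat list of (lemmatized, stopword-filtered) tokens for all users in that cluster (the name is made up):

from collections import Counter

def distinctive_by_subtraction(cluster_tokens, cluster, top_n=20):
    """Tokens whose normalized frequency in `cluster` most exceeds the rest."""
    in_counts = Counter(cluster_tokens[cluster])
    out_counts = Counter()
    for c, tokens in cluster_tokens.items():
        if c != cluster:
            out_counts.update(tokens)
    in_total = float(sum(in_counts.values()))
    out_total = float(sum(out_counts.values()))
    diffs = {t: in_counts[t] / in_total - out_counts[t] / out_total
             for t in in_counts}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]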

Hypernyms

Based on the first part, we could reduce the words to their higher-level categories. WordNet might be the way to go here (the hypernym_paths() method?). An example could be keywords such as baseball, basketball, football, hockey, etc. that would map to "sports."
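
A sketch of what that lookup could look like with NLTK's WordNet interface; deciding which level of the path to stop at (e.g., something like "sport") would still need a rule or a fixed list of target categories:

from nltk.corpus import wordnet as wn

for word in ['baseball', 'basketball', 'hockey']:
    synset = wn.synsets(word, pos=wn.NOUN)[0]  # most common noun sense
    path = synset.hypernym_paths()[0]          # root ... -> synset
    print(word, '->', [s.name() for s in path])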

Other

There are other ways to summarize documents, including Luhn and TextRank, both of which are implemented in sumy.
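
A sketch of the sumy route, assuming `cluster_text` is the concatenated text for a cluster; TextRankSummarizer can be swapped in for LuhnSummarizer the same way:

from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.luhn import LuhnSummarizer

parser = PlaintextParser.from_string(cluster_text, Tokenizer('english'))
summarizer = LuhnSummarizer()
for sentence in summarizer(parser.document, 3):  # three summary sentences
    print(sentence)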

nan in datamatrix

The data matrix has NaNs in it, which breaks PCA. I'm not completely sure why they're there, but do you think it's reasonable to just replace NaNs with 0?
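
If zero-filling is the call, it's a one-liner (a sketch; `data_matrix` is the array that feeds PCA), though we should double-check that zero is a sensible value for whatever those features are:

import numpy as np

print(np.isnan(data_matrix).sum())        # how many NaNs we're dealing with
data_matrix = np.nan_to_num(data_matrix)  # NaN -> 0.0 (also caps +/-inf)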

determining `k`

We talked about the within-cluster sum of squares as a metric, which we will try.

Another option is the silhouette coefficient.

In general, intrinsic methods[, such as the silhouette coefficient,] evaluate a clustering by examining how well the clusters are separated and how compact the clusters are.

(Han, Kamber, Pei)

Another potentially useful link: Selecting the number of clusters.

Both of these approaches concentrate on the clusters themselves. However, we can focus on "downstream" outcomes and, instead, choose k based on how distinct the resulting clusters are along the dimension (e.g., ethnicity) we're interested in investigating. This still requires a metric.
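
A sketch of the silhouette route (`X_reduced` would be the post-PCA matrix; the range of k values is arbitrary here):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X_reduced)
    print(k, silhouette_score(X_reduced, labels))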

passing objects and checksums

These are some thoughts that relate to how we pass data in each of the steps we're working on implementing. It was great that the precedent of using pickle files was established. This practice will help us be faster when working downstream by allowing us to sidestep creating those objects again.

What I'm imagining are "main" scripts for each step (e.g., pmi, pca, etc.) with lots of smaller functions for doing specific things. Each of these should both (potentially) create the pickle file and return the object. We can check if the pickle file already exists in a predefined location (some relative path, probably in data/). If it does, the script should just load and return that object instead of trying to recreate it.

Because we're iterating and making improvements to our process, our objects might change (e.g., the NaNs in data_matrix). To check that we have the "right" version of those objects, we should compute and check hashes. This can be done using the hashlib module (probably hashlib.md5()). We can create a checksums file for each object (and update it when we make changes to the way the object is created). We'll then use this to check whether the version of the pickle file we have locally is what we should have. If not, the script should recreate the object and the pickle file.
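
Here's roughly what I'm picturing (a sketch only; the function name and path conventions are made up, not existing code):

import hashlib
import os
import pickle

def load_or_create(path, create_fn, expected_md5=None):
    """Load the pickle at `path` if its checksum matches; otherwise
    (re)create the object with `create_fn` and pickle it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            raw = f.read()
        if expected_md5 is None or hashlib.md5(raw).hexdigest() == expected_md5:
            return pickle.loads(raw)
    obj = create_fn()
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return obj

Each "main" script (pmi, pca, etc.) could then wrap its object creation in a call like load_or_create('data/data_matrix.pkl', build_data_matrix, expected_md5=...), with the checksums read from the checksums file.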

Thoughts?

nmf_labels

In nonnegative_matrix_factorization.nmf_labels, the function seems to be fitting the model twice: once with .fit and once with .fit_transform.
I assume this is by accident?
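
If it is an accident, a single fit_transform should be enough (a sketch, not the module's actual code; the argument names are guesses):

from sklearn.decomposition import NMF

def nmf_labels(X, n_topics):
    """Assign each row of X to its highest-weight NMF component."""
    W = NMF(n_components=n_topics, random_state=0).fit_transform(X)
    return W.argmax(axis=1)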
