
opensourcecontributors's Introduction

OpenSourceContributo.rs

Note about name change: This project was formerly known as githubcontributions.io. GitHub requested that the name of the project be changed in order to avoid confusion about who owns and maintains this project.

This is a utility to find a list of all contributions a user has made to any public repository on GitHub from 2011-01-01 through yesterday.

The data from 2015-01-01 to the present comes from GitHub Archive. The data from before that uses a different schema and was obtained from Google BigQuery (see below).

As of 2015-08-28, it tracks a total of

% cd /github-archive/processed
% gzip -l *.json.gz | awk 'END{print $2}' | numfmt --to=iec-i --suffix=B --format="%3f"
93GiB
% zcat *.json.gz | wc -l
253027947

events.

db.contributions.stats():

{
  "ns" : "contributions.contributions",
  "count" : 284048099,
  "size" : 113714359272,
  "avgObjSize" : 400,
  "storageSize" : 47820357632,
  "capped" : false,
  "nindexes" : 4,
  "totalIndexSize" : 8810385408,
  "indexSizes" : {
    "_id_" : 2804744192,
    "_user_lower_1" : 2275647488,
    "_event_id_1" : 1029251072,
    "created_at_1" : 2700742656
  },
  "ok" : 1
}

(WiredTiger stats omitted)

Processing data archives

Processing the data archives involves 3 steps:

  1. Download the raw events files from GitHub Archive into the events directory
  2. Transform the events files by filtering non-contribution events (e.g., starring a repository) and adding necessary indexable keys (e.g., lowercased username)
  3. Load the transformed data into MongoDB

The archive-processor tool in the util directory handles all of this.

The transformed data from step 2 is compressed and saved just in case we need to re-load the entire database (these files are much smaller than the raw data).
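For reference, here is a minimal sketch of what the transform step might look like, assuming newline-delimited JSON input, the contribution event types from the BigQuery query below, and the post-2015 schema where the actor login lives at actor.login. The exact field names archive-processor uses may differ.

import gzip
import json

# Event types treated as contributions (mirrors the BigQuery filter below).
CONTRIBUTION_TYPES = {
    "GollumEvent", "IssuesEvent", "PushEvent", "CommitCommentEvent",
    "ReleaseEvent", "PublicEvent", "MemberEvent", "IssueCommentEvent",
}

def transform(in_path, out_path):
    """Filter out non-contribution events and add an indexable lowercased username."""
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
         gzip.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:
            event = json.loads(line)
            if event.get("type") not in CONTRIBUTION_TYPES:
                continue  # e.g., WatchEvent (starring a repository) is dropped
            # Post-2015 events: actor is an object with a "login" field.
            event["_user_lower"] = event.get("actor", {}).get("login", "").lower()
            dst.write(json.dumps(event) + "\n")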

All of this can be done automatically by setting the correct environment variables, then running archive-processor process, or it can be invoked differently to separate the steps or change the working directories. Run archive-processor --help for details.

Environment Variable    Meaning
GHC_EVENTS_PATH         Contains data from 2015-01-01 to present (.json.gz)
GHC_TIMELINE_PATH       Contains data before 2015-01-01 (.csv.gz)
GHC_TRANSFORMED_PATH    Contains output of "transform" operation (.json.gz)
GHC_LOADED_PATH         Links to files in GHC_TRANSFORMED_PATH when loaded to DB
GHC_LOG_PATH            Each invocation of archive-processor logs to here
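As an illustration, an automated run might look like the following sketch. The directory paths are hypothetical (only /github-archive/processed appears elsewhere in this README); archive-processor process is the documented entry point.

import os
import subprocess

# Hypothetical working directories wired up via the environment variables above.
env = dict(
    os.environ,
    GHC_EVENTS_PATH="/github-archive/events",
    GHC_TIMELINE_PATH="/github-archive/timeline",
    GHC_TRANSFORMED_PATH="/github-archive/processed",
    GHC_LOADED_PATH="/github-archive/loaded",
    GHC_LOG_PATH="/github-archive/logs",
)

# Run the full download / transform / load pipeline.
subprocess.run(["archive-processor", "process"], env=env, check=True)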

BigQuery Data Sets

For the data from 2011-2014 (actually, 2008-08-25 01:07:06 to 2014-12-31 23:59:59), the GitHub Archive project recorded data from the (now deprecated) Timeline API. This is in a different format and has many more quirks than the new GitHub Events API. To obtain this data, the following BigQuery query was used (which took only 47.5s to run):

SELECT
  -- common fields
  created_at, actor, repository_owner, repository_name, repository_organization, type, url,
  -- specific to type
  payload_page_html_url,     -- GollumEvent
  payload_page_summary,      -- GollumEvent
  payload_page_page_name,    -- GollumEvent
  payload_page_action,       -- GollumEvent
  payload_page_title,        -- GollumEvent
  payload_page_sha,          -- GollumEvent
  payload_number,            -- IssuesEvent
  payload_action,            -- MemberEvent, IssuesEvent, ReleaseEvent, IssueCommentEvent
  payload_member_login,      -- MemberEvent
  payload_commit_msg,        -- PushEvent
  payload_commit_email,      -- PushEvent
  payload_commit_id,         -- PushEvent
  payload_head,              -- PushEvent
  payload_ref,               -- PushEvent
  payload_comment_commit_id, -- CommitCommentEvent
  payload_comment_path,      -- CommitCommentEvent
  payload_comment_body,      -- CommitCommentEvent
  payload_issue_id,          -- IssueCommentEvent
  payload_comment_id         -- IssueCommentEvent
FROM (
  TABLE_QUERY(githubarchive:year,'true') -- All the years!
)
WHERE type IN (
  "GollumEvent",
  "IssuesEvent",
  "PushEvent",
  "CommitCommentEvent",
  "ReleaseEvent",
  "PublicEvent",
  "MemberEvent",
  "IssueCommentEvent"
)

If you actually want to use this data, there's no need to run that query; just ask me for the CSVs. When gzipped, they are about 19GB.

Erroneous data

There is lots of data in the archives that just doesn't make sense. Where I can, I've worked around it, for example by parsing needed data out of the event's URL. Here are some issues:

BigQuery exports CSV nulls weirdly

Example:

SELECT *
FROM [githubarchive:year.2014]
LIMIT 1000

In the results pane of Google's BigQuery page, you will notice the string "null" where a real null value is meant. That makes its way into the exported CSV, so export the table the proper way or you will have the literal string "null" for almost every missing value.
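If you are stuck with such a CSV anyway, one possible workaround (a sketch, not necessarily what this project does) is to treat the literal string "null" as a missing value while reading:

import csv
import gzip

def read_timeline_csv(path):
    """Yield rows from a gzipped BigQuery CSV export, mapping "null" strings to None."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        for row in csv.DictReader(fh):
            yield {k: (None if v == "null" else v) for k, v in row.items()}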

PushEvent with no repository name (Timeline API)

Example:

SELECT *
FROM [githubarchive:year.2014]
WHERE payload_head='8824ed4d86f587a2a556248d9abfac790a1cbd3f'
LIMIT 1

It seems like sometimes, the only way to get the real repository name (owner/project) is to parse it from the URL.
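A sketch of that URL fallback, assuming the event URL has the usual https://github.com/<owner>/<project>/... shape; it returns None for the degenerate URLs shown in the next example.

from urllib.parse import urlparse

def repo_from_url(url):
    """Return (owner, project) parsed from a GitHub URL, or None if it can't be determined."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        return None
    return parts[0], parts[1]

# repo_from_url("https://github.com/Jiyambi/WoW-Pro-Guides") -> ("Jiyambi", "WoW-Pro-Guides")
# repo_from_url("https://github.com//") -> None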

PushEvent with no way of figuring out the repository (Timeline API)

Example:

SELECT *
FROM [githubarchive:year.2011]
WHERE payload_head='32b2177f05be005df3542c14d9a9985be2b553f7'
LIMIT 5

repository_url is https://github.com// and repository_name is / for each of these. They actually push to: https://github.com/Jiyambi/WoW-Pro-Guides but I only know that by reading the commit messages.

Credits

Created by @hut8 and maintained by Tenex Developers (@tenex).

opensourcecontributors's People

Contributors

@hut8, @joshjordan, @sartoshi-foot-dao


opensourcecontributors's Issues

Allow custom definitions of "contribution"

I had difficulty determining what a "contribution" was when beginning this. I definitely got all the ones that I considered contributions, but I may have included too many. For example, I think commenting on an issue counts as a contribution, but other people don't. So add a selection list where people can define which event types they consider to be "contributions" before querying.

Monitor server somehow

For the past 2 weeks, the processor has been down because Arch changed the Python version and I didn't realize it, so I had to catch up processing a million or so records. I need the github-contributions-process to write the time of the last record processed to a www-accessible file, then have a service that polls for that every hour and makes sure that the output is reasonable.

Decent monitoring / logging solution

The way I monitor the server now is a mess. Recently the processor broke because a file was missing from githubarchive.org. Normally the server emails me every hour with the results of the archive-processor run, and separately it emails me if it notices that the latest event is more than a couple of hours old. At some point it stopped doing that because I chmod'd ssmtp.conf incorrectly, and the logs on disk aren't something I look at regularly until I realize something is wrong.

So I need to look into alternatives and implement one of them. Requirements:

  • Free (at least for this usage)
  • Easy to post arbitrary events through an API
  • Python and Go libraries (preferably drop-in replacements for the standard logger)
  • Must have a command line utility to post these events, preferably one with no dependencies

Final prep for 1.4

Because we decided against using DigitalOcean connected to another VPS over a WAN, there are a couple of minor issues:

  • Revert change that removed raw archive data from NGINX
  • Revert change that removed raw archive data links on index.html
  • Remove DO branding for the time being

Show errors to the user

Since I had already been warned that the site was slow, I didn't mind that I was still waiting for results after a minute. But when the site was still showing its cubic spinner after ten minutes, I looked in the Dev Tools, which showed a 500 error...

So the query probably failed quite early, but I was not warned. Handle the error and tell the user you are overloaded (or whatever).

Port server to Go

There's tons of unnecessary complexity in the Flask stuff, plus I want to get better at Go.

MongoDB: Enable WAN link

We need MongoDB to work over a WAN so that I can host the DB on Vultr and still have DO sponsor the frontend.

Provision new server

I don't know what I was thinking when I used Arch as the server. That was a bad idea and it's broken several times. Use the ansible script to provision a new box.

Deployment script: split DB, Archive, Web roles

DB: MongoDB , /github-archive stuff, archive-processor
Web: Everything else

This is because of DigitalOcean's offer to sponsor the project, but I can't fit all the data on their droplets.

Package assets sanely

Right now we serve some out of other CDNs, and the rest are individual, non-minified versions served locally. No attention was paid to optimizing anything.

Look into webpack

Use event_id as ID where possible

This just occurred to me. The pre-2015 events (in the timeline directory) don't have event_id attributes. However, the new ones all do. Maybe I could replace the MongoDB _id attribute with event_id for the post-2015 events. Dropping that index would likely result in a huge increase in insert performance, which we really need. Right now there are 4 indexes on that collection, and not being able to fit them in memory is what really slows things to a crawl.

Thoughts, @joshjordan ?
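For concreteness, a rough sketch of what the insert path might look like under that scheme, assuming pymongo and the contributions.contributions namespace from the stats above:

from pymongo import MongoClient

contributions = MongoClient().contributions.contributions

def insert_post_2015(event):
    """Reuse the GitHub event id as _id so the separate _event_id index can be dropped."""
    doc = dict(event)
    doc["_id"] = doc.pop("event_id")  # only post-2015 events carry event_id
    contributions.insert_one(doc)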

archive-processor should process files atomically

If the processor crashes, an incomplete gzip is left in the output directory. That stops the rest of the archive from being processed on the next invocation, because if the file is in the destination directory and not empty, it thinks it's done.

The output archive should be written to a temporary file, then moved atomically to the destination.
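A minimal sketch of that fix, assuming each output archive is written in one shot: write to a temporary file in the destination directory, then rename it into place with os.replace(), which is atomic on POSIX filesystems.

import os
import tempfile

def write_atomically(data: bytes, dest_path: str) -> None:
    """Write data to dest_path so a crash never leaves a partial file behind."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, dest_path)  # atomic rename within the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on any failure
        raise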

Mobile rendering terrible

This looks terrible on mobile devices. Need to fix:

  • Make header and hamburger fit inline
  • Move search out of hamburger menu
  • Make sure the top header isn't so big that it covers page content (should be fixed by the first item)
  • Get footer down to one line
  • Kill Fork me banner on mobile

Graph

Time series data would be cool to display here since it looks like lots more events are occurring recently.

Handle errors in which the server is down

When the server is down, the responseError handler is still called. But then rejection.data is null, which causes another error! 😢

    function ConfigureErrorHandler($httpProvider) {
        $httpProvider.interceptors.push(function($q, $rootScope, $log, $injector) {
            return {
                'responseError': function(rejection) {
                    $log.debug(rejection);
                    // rejection.data is null when the server is unreachable, so this line throws:
                    $rootScope.errorDescription = rejection.data.error;
                }
            };
        });
    }

Bug with switching users

Steps:

  • Search for someone
  • Look at their events
  • Search for someone else
  • Some of the columns (e.g., repository) are still from the first person. Mysterious!

ng-cloak

When loading the page, all the angular template code briefly flashes. Use ngCloak to stop that.

Sitemap

Google doesn't really care about this site, in part because it can't see it has any content. There's no list of users anywhere (since it's based around search), so it won't crawl users' pages. Fix that with a bunch of sitemap.txt files.

Experiment with static .json.gz

Since queries are only (currently) performed on the _user_lower key, try creating static files for each user. So the query "hut8" would end up opening "hut8.json.gz". Yes, this is insane, but maybe it would actually work better than MongoDB.
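A sketch of the lookup side of that experiment; the directory layout is hypothetical, and the lowercasing mirrors the _user_lower key:

import os

STATIC_ROOT = "/github-archive/static-users"  # hypothetical layout

def lookup(user: str) -> bytes:
    """Return the pre-built, gzipped contribution list for a user."""
    path = os.path.join(STATIC_ROOT, user.lower() + ".json.gz")
    with open(path, "rb") as fh:
        return fh.read()  # could be served as-is with Content-Encoding: gzip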

Aggregate digest tool for hourly data

Part of #24 involves simply counting events for each hour, which is a pretty decent-sized task, so it should be its own issue. It would be pretty crazy to just use aggregates over the entire collection.

Blocks #24
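A sketch of that digest as a MongoDB aggregation, assuming created_at is still the ISO8601 string described in the "MongoDB BSON Dates" issue below, so the first 13 characters, e.g. "2012-01-01T15", identify the hour:

from pymongo import MongoClient

contributions = MongoClient().contributions.contributions

def events_per_hour():
    """Count events per hour across the whole collection."""
    return contributions.aggregate([
        {"$group": {
            "_id": {"$substr": ["$created_at", 0, 13]},  # "YYYY-MM-DDTHH"
            "count": {"$sum": 1},
        }},
        {"$sort": {"_id": 1}},
    ], allowDiskUse=True)  # the collection is far too big for an in-memory sort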

TLS

Use letsencrypt!

Aggregate for per-repo contribution count

Right now the "repo" page just displays a list of repositories that the user contributed to without any indication of how many contributions were made to each. Fix that by making it a table, sortable by either column.

Cache results

Compared to the enormous amount of data you are working with, I would assume that the (presumably) small amount of traffic hitting you would be manageable to cache for each user (to some extent).

For interested developers that would poll your site regularly, this should lower the load quite a bit.

Upstream prematurely closed connection

From nginx error log:
712#0: *6020 upstream prematurely closed connection while reading response header from upstream ...

This seems to happen after some amount of time. I think I'm handling mgo sessions wrong and eventually it will just stop working.
Also, the logs aren't adequate at all; there are no errors whatsoever in there.

Statistics in footer

The /stats endpoint is unused, and it would be nice to be able to show some statistics about the whole collection on the page. So add a sticky footer and put it in there.

Script to build indexes

Right now there are no indexes on the new server. Definitely needs to be fixed.

The provision script should run this script (asynchronously?)
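A minimal sketch of such a script, using the same fields as the indexes reported by db.contributions.stats() above:

from pymongo import MongoClient

contributions = MongoClient().contributions.contributions

def build_indexes():
    """Recreate the indexes the queries rely on; background builds avoid blocking reads."""
    contributions.create_index("_user_lower", background=True)
    contributions.create_index("_event_id", background=True)
    contributions.create_index("created_at", background=True)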

GitHub going down breaks processor

An outage at GitHub causes githubarchive.org's .json.gz files not to be where the processor expects them (the requests return 404s), so currently the archive processor just errors out constantly.

Fix leaky events cache

Whoops, obvious bug with the events cache:

  • Search for one user
  • Visit events page 2
  • Search for another user
  • Go to the events tab
  • It displays the original user's events page

Use gb for builds

I think builds should have no external dependencies, and the built-in Go toolset introduces tons. See this for details:

https://getgb.io/

I think it's the best go build tool.

Split events directory into years

Because there are 8,760 hours in a year, there are starting to be quite a few files in a single directory. Several parts of the processor iterate over everything and this is causing performance issues.

MongoDB BSON Dates

Right now the dates exported by the archive processor are stored as strings in ISO8601. They should use this syntax:

{ "created_at": {"$date": "2012-01-01T15:00:00.000Z"} }

instead of:

{ "created_at": "2012-01-01T15:00:00.000Z"}

Deployment strategy

Ideally deployments should work like this:

  • Somehow get a binary in the right directory on the server (/home/ghc/ghc-app/bin/)
  • It should be named ghc-app-<sha1> where the sha1 is the first few hex-encoded bytes
  • Binding should always be done with SO_REUSEADDR
  • Side idea: bind each build to its own port that is derived from the SHA1 of the binary (Python 3):
import hashlib
app_bytes = open(r'path/to/ghc-app', 'rb').read()
app_hash = hashlib.sha1(app_bytes).digest()
# offset keeps the port in the unprivileged range 1024-65534
port = 1024 + int.from_bytes(app_hash, 'big') % (65535 - 1024)
  • Start the binary as a child process of the deployment thing listening on a test port on localhost (9001 maybe, since it's over 9000)
  • Run some basic tests on that to make sure it returns /stats for example
  • Kill the test process
  • Switch the link to ghc-app to the new build
  • Make a note of the old ghc-app's PID
  • Start the new ghc-app just like the old one, again using SO_REUSEADDR. So then two processes listen on the same port.
  • Kill old ghc-app
