
opensourcecontributors's Introduction

OpenSourceContributo.rs

Note about name change: This project was formerly known as githubcontributions.io. GitHub requested that the name of the project be changed in order to avoid confusion about who owns and maintains this project.

This is a utility to find a list of all contributions a user has made to any public repository on GitHub from 2011-01-01 through yesterday.

The data from 2015-01-01 to the present comes from GitHub Archive. The data from before that uses a different schema and was obtained from Google BigQuery (see below).

As of 2015-08-28, it tracks a total of

% cd /github-archive/processed
% gzip -l *.json.gz | awk 'END{print $2}' | numfmt --to=iec-i --suffix=B --format="%3f"
93GiB
% zcat *.json.gz | wc -l
253027947

events.

db.contributions.stats():

{
  "ns" : "contributions.contributions",
  "count" : 284048099,
  "size" : 113714359272,
  "avgObjSize" : 400,
  "storageSize" : 47820357632,
  "capped" : false,
  "nindexes" : 4,
  "totalIndexSize" : 8810385408,
  "indexSizes" : {
    "_id_" : 2804744192,
    "_user_lower_1" : 2275647488,
    "_event_id_1" : 1029251072,
    "created_at_1" : 2700742656
  },
  "ok" : 1
}

(WiredTiger stats omitted)

Processing data archives

Processing the data archives involves 3 steps:

  1. Download the raw events files from GitHub Archive into the events directory
  2. Transform the events files by filtering non-contribution events (e.g., starring a repository) and adding necessary indexable keys (e.g., lowercased username)
  3. Load the transformed data into MongoDB

The archive-processor tool in the util directory handles all of this.

The transformed data from step 2 is compressed and saved just in case we need to re-load the entire database (these files are much smaller than the raw data).
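For reference, here is a minimal sketch of what the transform step might look like, assuming newline-delimited JSON input, the contribution event types from the BigQuery query below, and the post-2015 schema where the actor login lives at actor.login. The exact field names archive-processor uses may differ.

import gzip
import json

# Event types treated as contributions (mirrors the BigQuery filter below).
CONTRIBUTION_TYPES = {
    "GollumEvent", "IssuesEvent", "PushEvent", "CommitCommentEvent",
    "ReleaseEvent", "PublicEvent", "MemberEvent", "IssueCommentEvent",
}

def transform(in_path, out_path):
    """Filter out non-contribution events and add an indexable lowercased username."""
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
         gzip.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:
            event = json.loads(line)
            if event.get("type") not in CONTRIBUTION_TYPES:
                continue  # e.g., WatchEvent (starring a repository) is dropped
            # Post-2015 events: actor is an object with a "login" field.
            event["_user_lower"] = event.get("actor", {}).get("login", "").lower()
            dst.write(json.dumps(event) + "\n")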

All of this can be done automatically by setting the correct environment variables, then running archive-processor process, or it can be invoked differently to separate the steps or change the working directories. Run archive-processor --help for details.

Environment Variable    Meaning
GHC_EVENTS_PATH         Contains data from 2015-01-01 to present (.json.gz)
GHC_TIMELINE_PATH       Contains data before 2015-01-01 (.csv.gz)
GHC_TRANSFORMED_PATH    Contains output of "transform" operation (.json.gz)
GHC_LOADED_PATH         Links to files in GHC_TRANSFORMED_PATH when loaded to DB
GHC_LOG_PATH            Each invocation of archive-processor logs to here
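As an illustration, an automated run might look like the following sketch. The directory paths are hypothetical (only /github-archive/processed appears elsewhere in this README); archive-processor process is the documented entry point.

import os
import subprocess

# Hypothetical working directories wired up via the environment variables above.
env = dict(
    os.environ,
    GHC_EVENTS_PATH="/github-archive/events",
    GHC_TIMELINE_PATH="/github-archive/timeline",
    GHC_TRANSFORMED_PATH="/github-archive/processed",
    GHC_LOADED_PATH="/github-archive/loaded",
    GHC_LOG_PATH="/github-archive/logs",
)

# Run the full download / transform / load pipeline.
subprocess.run(["archive-processor", "process"], env=env, check=True)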

BigQuery Data Sets

For the data from 2011-2014 (actually, 2008-08-25 01:07:06 to 2014-12-31 23:59:59), the GitHub Archive project recorded data from the (now deprecated) Timeline API. This is in a different format and has many more quirks than the new GitHub Events API. To obtain this data, the following BigQuery query was used (which took only 47.5s to run):

SELECT
  -- common fields
  created_at, actor, repository_owner, repository_name, repository_organization, type, url,
  -- specific to type
  payload_page_html_url,     -- GollumEvent
  payload_page_summary,      -- GollumEvent
  payload_page_page_name,    -- GollumEvent
  payload_page_action,       -- GollumEvent
  payload_page_title,        -- GollumEvent
  payload_page_sha,          -- GollumEvent
  payload_number,            -- IssuesEvent
  payload_action,            -- MemberEvent, IssuesEvent, ReleaseEvent, IssueCommentEvent
  payload_member_login,      -- MemberEvent
  payload_commit_msg,        -- PushEvent
  payload_commit_email,      -- PushEvent
  payload_commit_id,         -- PushEvent
  payload_head,              -- PushEvent
  payload_ref,               -- PushEvent
  payload_comment_commit_id, -- CommitCommentEvent
  payload_comment_path,      -- CommitCommentEvent
  payload_comment_body,      -- CommitCommentEvent
  payload_issue_id,          -- IssueCommentEvent
  payload_comment_id         -- IssueCommentEvent
FROM (
  TABLE_QUERY(githubarchive:year,'true') -- All the years!
)
WHERE type IN (
  "GollumEvent",
  "IssuesEvent",
  "PushEvent",
  "CommitCommentEvent",
  "ReleaseEvent",
  "PublicEvent",
  "MemberEvent",
  "IssueCommentEvent"
)

If you actually want to use this data, there's no need to run that query; just ask me for the CSVs. When gzipped, they are about 19GB.

Erroneous data

There is lots of data in the archives that just doesn't make sense. Where I can, I've worked around it, for example by parsing needed data out of the event's URL. Here are some issues:

BigQuery exports CSV nulls weirdly

Example:

SELECT *
FROM [githubarchive:year.2014]
LIMIT 1000

In the results pane of Google's BigQuery page, you will notice the string "null" where a real null value is meant. That makes its way into the exported CSV, so export the table the proper way or you will have the literal string "null" for almost every missing value.
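If you are stuck with such a CSV anyway, one possible workaround (a sketch, not necessarily what this project does) is to treat the literal string "null" as a missing value while reading:

import csv
import gzip

def read_timeline_csv(path):
    """Yield rows from a gzipped BigQuery CSV export, mapping "null" strings to None."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        for row in csv.DictReader(fh):
            yield {k: (None if v == "null" else v) for k, v in row.items()}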

PushEvent with no repository name (Timeline API)

Example:

SELECT *
FROM [githubarchive:year.2014]
WHERE payload_head='8824ed4d86f587a2a556248d9abfac790a1cbd3f'
LIMIT 1

It seems like sometimes, the only way to get the real repository name (owner/project) is to parse it from the URL.
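A sketch of that URL fallback, assuming the event URL has the usual https://github.com/<owner>/<project>/... shape; it returns None for the degenerate URLs shown in the next example.

from urllib.parse import urlparse

def repo_from_url(url):
    """Return (owner, project) parsed from a GitHub URL, or None if it can't be determined."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        return None
    return parts[0], parts[1]

# repo_from_url("https://github.com/Jiyambi/WoW-Pro-Guides") -> ("Jiyambi", "WoW-Pro-Guides")
# repo_from_url("https://github.com//") -> None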

PushEvent with no way of figuring out the repository (Timeline API)

Example:

SELECT *
FROM [githubarchive:year.2011]
WHERE payload_head='32b2177f05be005df3542c14d9a9985be2b553f7'
LIMIT 5

repository_url is https://github.com// and repository_name is / for each of these. They actually push to: https://github.com/Jiyambi/WoW-Pro-Guides but I only know that by reading the commit messages.

Credits

Created by @hut8 and maintained by Tenex Developers (@tenex).

opensourcecontributors's People

Contributors

@hut8, @joshjordan, @sartoshi-foot-dao


opensourcecontributors's Issues

Allow custom definitions of "contribution"

I had difficulty determining what a "contribution" was when beginning this. I definitely got all the ones that I considered contributions, but I may have included too many. For example, I think commenting on an issue counts as a contribution, but other people don't. So add a selection list where people can define which event types they consider to be "contributions" before querying.

Monitor server somehow

For the past 2 weeks, the processor has been down because Arch changed the Python version and I didn't realize it, so I had to catch up processing a million or so records. I need the github-contributions-process to write the time of the last record processed to a www-accessible file, then have a service that polls for that every hour and makes sure that the output is reasonable.

Decent monitoring / logging solution

The way I monitor the server now is a mess. Recently the processor broke because a file was missing from githubarchive.org. Normally the server emails me every hour with the results of the archive-processor run, and separately it emails me if it notices that the latest event is more than a couple of hours old. At some point it stopped doing that because I chmod'd ssmtp.conf incorrectly, and the logs on disk aren't something I look at regularly until I realize something is wrong.

So I need to look into alternatives and implement one of them. Requirements:

  • Free (at least for this usage)
  • Easy to post arbitrary events through an API
  • Python and Go libraries (preferably drop-in replacements for the standard logger)
  • Must have a command line utility to post these events, preferably one with no dependencies

Final prep for 1.4

Because we decided against using DigitalOcean connected to another VPS over a WAN, there are a couple of minor issues:

  • Revert change that removed raw archive data from NGINX
  • Revert change that removed raw archive data links on index.html
  • Remove DO branding for the time being

Show errors to the user

Since I had already been warned that the site was slow, I didn't mind that I was still waiting for results after a minute. But when the site was still showing its cubic spinner after ten minutes, I looked in the Dev Tools, which showed a 500 error...

So the query probably failed quite early, but I was not warned. Handle the error and tell the user you are overloaded (or whatever).

Port server to Go

There's tons of unnecessary complexity in the Flask stuff, plus I want to get better at Go.

MongoDB: Enable WAN link

We need MongoDB to work over a WAN so that I can host the DB on Vultr and still have DO sponsor the frontend.

Provision new server

I don't know what I was thinking when I used Arch as the server. That was a bad idea and it's broken several times. Use the ansible script to provision a new box.

Deployment script: split DB, Archive, Web roles

DB: MongoDB , /github-archive stuff, archive-processor
Web: Everything else

This is because of DigitalOcean's offer to sponsor the project, but I can't fit all the data on their droplets.

Package assets sanely

Right now we serve some out of other CDNs, and the rest are individual, non-minified versions served locally. No attention was paid to optimizing anything.

Look into webpack

Use event_id as ID where possible

This just occurred to me. The pre-2015 events (in the timeline directory) don't have event_id attributes. However, the new ones all do. Maybe I could replace the MongoDB _id attribute with event_id for the post-2015 events. Dropping that index would likely result in a huge increase in insert performance, which we really need. Right now there are 4 indexes on that collection, and not being able to fit them in memory is what really slows things to a crawl.

Thoughts, @joshjordan ?
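For concreteness, a rough sketch of what the insert path might look like under that scheme, assuming pymongo and the contributions.contributions namespace from the stats above:

from pymongo import MongoClient

contributions = MongoClient().contributions.contributions

def insert_post_2015(event):
    """Reuse the GitHub event id as _id so the separate _event_id index can be dropped."""
    doc = dict(event)
    doc["_id"] = doc.pop("event_id")  # only post-2015 events carry event_id
    contributions.insert_one(doc)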

archive-processor should process files atomically

If the processor crashes, an incomplete gzip is left in the output directory. That stops the rest of the archive from being processed on the next invocation, because if the file is in the destination directory and not empty, it thinks it's done.

The output archive should be written to a temporary file, then moved atomically to the destination.
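A minimal sketch of that fix, assuming each output archive is written in one shot: write to a temporary file in the destination directory, then rename it into place with os.replace(), which is atomic on POSIX filesystems.

import os
import tempfile

def write_atomically(data: bytes, dest_path: str) -> None:
    """Write data to dest_path so a crash never leaves a partial file behind."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, dest_path)  # atomic rename within the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on any failure
        raise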

Mobile rendering terrible

This looks terrible on mobile devices. Need to fix:

  • Make header and hamburger fit inline
  • Move search out of hamburger menu
  • Make sure the top header isn't so big that it covers page content (should be fixed by the first item)
  • Get footer down to one line
  • Kill Fork me banner on mobile

Graph

Time series data would be cool to display here since it looks like lots more events are occurring recently.

Handle errors in which the server is down

When the server is down, the responseError handler is still called. But then rejection.data is null, which causes another error! 😢

    function ConfigureErrorHandler($httpProvider) {
        $httpProvider.interceptors.push(function($q, $rootScope, $log, $injector) {
            return {
                'responseError': function(rejection) {
                    $log.debug(rejection);
                    // rejection.data is null when the server is unreachable, so this line throws:
                    $rootScope.errorDescription = rejection.data.error;
                }
            };
        });
    }

Bug with switching users

Steps:

  • Search for someone
  • Look at their events
  • Search for someone else
  • Some of the columns (e.g., repository) are still from the first person. Mysterious!

ng-cloak

When loading the page, all the angular template code briefly flashes. Use ngCloak to stop that.

Sitemap

Google doesn't really care about this site, in part because it can't see it has any content. There's no list of users anywhere (since it's based around search), so it won't crawl users' pages. Fix that with a bunch of sitemap.txt files.

Experiment with static .json.gz

Since queries are only (currently) performed on the _user_lower key, try creating static files for each user. So the query "hut8" would end up opening "hut8.json.gz". Yes, this is insane, but maybe it would actually work better than MongoDB.
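A sketch of the lookup side of that experiment; the directory layout is hypothetical, and the lowercasing mirrors the _user_lower key:

import os

STATIC_ROOT = "/github-archive/static-users"  # hypothetical layout

def lookup(user: str) -> bytes:
    """Return the pre-built, gzipped contribution list for a user."""
    path = os.path.join(STATIC_ROOT, user.lower() + ".json.gz")
    with open(path, "rb") as fh:
        return fh.read()  # could be served as-is with Content-Encoding: gzip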

Aggregate digest tool for hourly data

Part of #24 involves simply counting events for each hour, which is a pretty decent-sized task, so it should be its own issue. It would be pretty crazy to just use aggregates over the entire collection.

Blocks #24
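A sketch of that digest as a MongoDB aggregation, assuming created_at is still the ISO8601 string described in the "MongoDB BSON Dates" issue below, so the first 13 characters, e.g. "2012-01-01T15", identify the hour:

from pymongo import MongoClient

contributions = MongoClient().contributions.contributions

def events_per_hour():
    """Count events per hour across the whole collection."""
    return contributions.aggregate([
        {"$group": {
            "_id": {"$substr": ["$created_at", 0, 13]},  # "YYYY-MM-DDTHH"
            "count": {"$sum": 1},
        }},
        {"$sort": {"_id": 1}},
    ], allowDiskUse=True)  # the collection is far too big for an in-memory sort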

TLS

Use letsencrypt!

Aggregate for per-repo contribution count

Right now the "repo" page just displays a list of repositories that the user contributed to without any indication of how many contributions were made to each. Fix that by making it a table, sortable by either column.

Cache results

Compared to the enormous amount of data you are working with, I would assume that the (presumably) small amount of traffic hitting you would be manageable to cache for each user (to some extent).

For interested developers that would poll your site regularly, this should lower the load quite a bit.

Upstream prematurely closed connection

From nginx error log:
712#0: *6020 upstream prematurely closed connection while reading response header from upstream ...

This seems to happen after some amount of time. I think I'm handling mgo sessions wrong and eventually it will just stop working.
Also, the logs aren't adequate at all; there are no errors whatsoever in there.

Statistics in footer

The /stats endpoint is unused, and it would be nice to be able to show some statistics about the whole collection on the page. So add a sticky footer and put it in there.

Script to build indexes

Right now there are no indexes on the new server. Definitely needs to be fixed.

The provision script should run this script (asynchronously?)
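A minimal sketch of such a script, using the same fields as the indexes reported by db.contributions.stats() above:

from pymongo import MongoClient

contributions = MongoClient().contributions.contributions

def build_indexes():
    """Recreate the indexes the queries rely on; background builds avoid blocking reads."""
    contributions.create_index("_user_lower", background=True)
    contributions.create_index("_event_id", background=True)
    contributions.create_index("created_at", background=True)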

GitHub going down breaks processor

An outage at GitHub causes githubarchive.org's .json.gz files not to be where the processor expects them (the requests return 404s), so currently the archive processor just errors out constantly.

Fix leaky events cache

Whoops, obvious bug with the events cache:

  • Search for one user
  • Visit events page 2
  • Search for another user
  • Go to the events tab
  • It displays the original user's events page

Use gb for builds

I think builds should have no external dependencies, and the built-in Go toolset introduces tons. See this for details:

https://getgb.io/

I think it's the best go build tool.

Split events directory into years

Because there are 8,760 hours in a year, there are starting to be quite a few files in a single directory. Several parts of the processor iterate over everything and this is causing performance issues.

MongoDB BSON Dates

Right now the dates exported by the archive processor are stored as strings in ISO8601. They should use this syntax:

{ "created_at": {"$date": "2012-01-01T15:00:00.000Z"} }

instead of:

{ "created_at": "2012-01-01T15:00:00.000Z"}

Deployment strategy

Ideally deployments should work like this:

  • Somehow get a binary in the right directory on the server (/home/ghc/ghc-app/bin/)
  • It should be named ghc-app-<sha1> where the sha1 is the first few hex-encoded bytes
  • Binding should always be done with SO_REUSEADDR
  • Side idea: bind each build to its own port that is derived from the SHA1 of the binary (Python 3):
import hashlib
app_bytes = open(r'path/to/ghc-app', 'rb').read()
app_hash = hashlib.sha1(app_bytes).digest()
# offset keeps the port in the unprivileged range 1024-65534
port = 1024 + int.from_bytes(app_hash, 'big') % (65535 - 1024)
  • Start the binary as a child process of the deployment thing listening on a test port on localhost (9001 maybe, since it's over 9000)
  • Run some basic tests on that to make sure it returns /stats for example
  • Kill the test process
  • Switch the link to ghc-app to the new build
  • Make a note of the old ghc-app's PID
  • Start the new ghc-app just like the old one, again using SO_REUSEADDR. So then two processes listen on the same port.
  • Kill old ghc-app
