Git-Influencer

A platform that helps you make the most of GitHub by discovering the social influencers in the GitHub network who match your interests.

Project Idea

Launched in 2008, GitHub is now one of the most popular open source communities in the tech world. As of 2018, it had 28 million users and 57 million repositories, making it the largest host of source code in the world.

GitHub is also one of the best places to learn to code: we share code, publish new projects, and follow and learn from other users. But GitHub does not have a system that helps you find the right people and resources in a specific area.

Everyone wants to learn from the best. This project aims to create a platform that helps you find the social influencers in the GitHub network.

Data Source

  • GitHub Archive: GH Archive is a project that records the public GitHub timeline and stores every public GitHub event. Weighing in at over 3 TB in total, it is the largest BigQuery dataset available on Kaggle.
  • Data size: 80 to 100 GB per month since 2011
  • Update frequency: every hour
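Because the archive is updated hourly, an ingestion job needs one file per hour. The sketch below builds the list of files for a time window. It assumes the hourly dumps are served as `https://data.gharchive.org/YYYY-MM-DD-H.json.gz` with a non-zero-padded hour; the function name `hourly_archive_urls` is hypothetical and not taken from this project.

```python
from datetime import datetime, timedelta

# Assumed URL pattern for GH Archive hourly dumps (hour is not zero-padded).
GH_ARCHIVE_URL = "https://data.gharchive.org/{day}-{hour}.json.gz"


def hourly_archive_urls(start, end):
    """Return one archive URL per hour in the half-open range [start, end)."""
    urls = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        urls.append(
            GH_ARCHIVE_URL.format(day=current.strftime("%Y-%m-%d"), hour=current.hour)
        )
        current += timedelta(hours=1)
    return urls


# Three hourly files spanning midnight.
urls = hourly_archive_urls(datetime(2018, 1, 1, 22), datetime(2018, 1, 2, 1))
for url in urls:
    print(url)
```

A scheduler could call such a helper with the previous hour as the window and pass each URL to a download task.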

Tech Stack

  • Data Ingestion

    • Historical data: the raw data is stored in GH Archive. Python jobs scheduled with Airflow download it and save it to HDFS.
    • Incoming data: the GH Archive API and the Airflow scheduler are used to fetch and clean newly published data and append it to the historical data in HDFS.
    • HDFS: all data in HDFS is cleaned with Spark and saved for further processing in Spark.
  • Data Processing

    • Spark for batch processing of the data on HDFS
    • PageRank and other network analysis algorithms in GraphX
  • Database

    • MySQL, with 10 tables, one per language, holding user rank scores and user details.
  • User Interface

    • Dash: users select a language and get recommended users ranked by social influencer score.
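The per-language lookup behind the interface can be sketched as a top-N query. This is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server, and the table naming scheme (`rank_<language>`), the column names, and the `LANGUAGES` set are all hypothetical.

```python
import sqlite3

# Hypothetical subset of the supported languages; used as a whitelist because
# table names cannot be passed as bound query parameters.
LANGUAGES = {"go", "python"}


def top_influencers(conn, language, limit=10):
    """Return the highest-scoring (login, score) rows for one language."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    query = f"SELECT login, score FROM rank_{language} ORDER BY score DESC LIMIT ?"
    return conn.execute(query, (limit,)).fetchall()


# In-memory stand-in for one of the per-language tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rank_go (login TEXT PRIMARY KEY, score REAL)")
conn.executemany(
    "INSERT INTO rank_go VALUES (?, ?)",
    [("alice", 0.48), ("bob", 0.45), ("carol", 0.04)],
)
top = top_influencers(conn, "go", limit=2)
print(top)
```

One table per language keeps each lookup a simple ordered scan; a single table with a language column and an index would be an alternative design.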

Engineering Challenges

  • Data modeling: find clues in the raw JSON event data for mapping users to languages.
  • Data size and updates: process and clean 2.9 TB of GitHub event data, covering both the historical data and the incoming data, which arrives at 80 to 100 GB per month and is updated every hour. Airflow automates the whole pipeline.
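One way to approach the data modeling step is sketched below. It assumes that `PullRequestEvent` records carry the repository language at `payload.pull_request.base.repo.language`; the source does not say which fields this project actually relied on, and the function name `user_language_pairs` is hypothetical.

```python
import json


def user_language_pairs(lines):
    """Yield (login, language) pairs from newline-delimited JSON events."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of failing the batch
        if not isinstance(event, dict) or event.get("type") != "PullRequestEvent":
            continue
        login = (event.get("actor") or {}).get("login")
        # Assumed location of the repository language inside the payload.
        pull_request = (event.get("payload") or {}).get("pull_request") or {}
        repo = (pull_request.get("base") or {}).get("repo") or {}
        language = repo.get("language")
        if login and language:
            yield login, language


sample = [
    json.dumps({
        "type": "PullRequestEvent",
        "actor": {"login": "alice"},
        "payload": {"pull_request": {"base": {"repo": {"language": "Go"}}}},
    }),
    json.dumps({"type": "WatchEvent", "actor": {"login": "bob"}, "payload": {}}),
    "not json",
]
pairs = list(user_language_pairs(sample))
print(pairs)
```

The same extraction logic could be applied per record inside a Spark job; the plain-Python version here only shows the field handling.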

Algorithms

  • Centrality measures: PageRank
  • Only users who have used the selected language are included in the PageRank computation
  • Community detection: strongly connected components
  • More GraphX analysis is planned for the future.
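The ranking idea above can be illustrated with a small pure-Python power iteration. This is not the GraphX implementation the project uses; it is a minimal sketch that assumes edges point from a follower to the user being followed, and the names `pagerank` and `influencer_scores` are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def influencer_scores(follow_edges, language_users):
    """Rank only users who have used the language, as described above."""
    edges = [
        (src, dst)
        for src, dst in follow_edges
        if src in language_users and dst in language_users
    ]
    return pagerank(edges)


# Assumed edge direction: (follower, followed).
follows = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("alice", "bob"),
    ("erin", "alice"),  # erin has not used the language and is filtered out
]
go_users = {"alice", "bob", "carol", "dave"}
scores = influencer_scores(follows, go_users)
for user in sorted(scores, key=scores.get, reverse=True):
    print(user, round(scores[user], 3))
```

Filtering the edge list before ranking keeps the scores specific to one language community, at the cost of dropping followers from outside it.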

Business Value

  1. If you want to learn Golang or another language, this platform recommends the most valuable GitHub users to follow and learn from, based on network analysis results.
  2. For example, it can show N people to learn from based on the network analysis results and recommend communities for collaboration.

Further Improvement

  • Explore more efficient HDFS data storage, such as Parquet
  • Try different classification metrics to discover more user topics
  • Use more graph analysis algorithms in GraphX

Dashboard and PPT
