github-stars-by-topic's Introduction

Your GitHub stars sorted by topic/category

This is a Python script that fetches your GitHub stars and uses machine learning (buzzword, check) to extract a given number of topics. It then generates a folder structure so you can easily browse the topics as Markdown files.

The result can be viewed in the example folder.

Usage

Just run python3 main.py on the command line. It will ask you for your GitHub credentials to fetch your stars and then do its job. The result will be a folder in the main directory that you can copy or save in a GitHub repository for others to browse.

How it works

This section gives a brief overview of how the tool works.

Before starting to work, the tool asks you for your GitHub credentials to be able to use the API with 5000 instead of 60 requests per hour. It then starts to fetch the stars of the targeted user. This results in a list of repositories the user has starred.

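from collections import namedtuple

# Minimal stand-in for a repository object; the real objects carry more fields.
Repo = namedtuple('Repo', ['name'])

starred_repos = [
    Repo(name='Totally not Jarvis'),
    Repo(name='Laravel'),
    # and so on
]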
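The fetch step described above boils down to authenticated calls against the GitHub REST API. The tool itself relies on PyGithub for this; the sketch below uses only the standard library to show what such a request could look like. The function name `build_starred_request` and the use of a personal access token are assumptions made for illustration, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://api.github.com"


def build_starred_request(username, token, page=1, per_page=100):
    """Build (but do not send) a request for one page of a user's stars."""
    query = urlencode({"page": page, "per_page": per_page})
    url = f"{API_ROOT}/users/{username}/starred?{query}"
    headers = {
        "Accept": "application/vnd.github+json",
        # Authenticated requests get the higher hourly rate limit.
        "Authorization": f"Bearer {token}",
    }
    return Request(url, headers=headers)


# Hypothetical username and placeholder token, for illustration only.
request = build_starred_request("octocat", "<your-token>")
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call; repeating it with increasing `page` values until an empty list comes back would collect all stars.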

To apply the topic extraction later, we need a text describing each repo. To get one, the title, description, and README file of each starred repo are fetched and combined into a single text. This yields one text per starred repository. The example below shows how this would look for my personal assistant bot called Totally not Jarvis and for Laravel, the PHP framework.

readmes = [
    'totally not jarvis my personal assistant totally not jarvis a personal...',
    'laravel a php framework for web artisans about laravel laravel is a web...',
    # and so on
]
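How a title, description, and README could be turned into one such string can be sketched as follows. This is a simplified assumption, not the tool's actual code: the tool converts Markdown to HTML and extracts the text with BeautifulSoup, whereas this sketch only lowercases the input and strips punctuation. The function name `build_document` is hypothetical.

```python
import re


def build_document(title, description, readme_text):
    """Join the three text sources and normalize them into one lowercase string."""
    # Skip missing parts, e.g. a repo without a description.
    combined = " ".join(part for part in (title, description, readme_text) if part)
    # Replace every run of characters that are not ASCII letters or digits
    # with a single space (ASCII only, for brevity).
    cleaned = re.sub(r"[^a-z0-9]+", " ", combined.lower())
    return cleaned.strip()


print(build_document(
    "Totally not Jarvis",
    "My personal assistant",
    "# Totally not Jarvis\nA personal...",
))
```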

We then apply term frequency-inverse document frequency (tf-idf) pre-processing to the list to extract relevant keywords for each repo. This results in a matrix with one row per repo and one tf-idf weight per term. A high tf-idf value means the term is very relevant for this document; a low value means the term is irrelevant (i.e., absent or too common). The benefit of tf-idf values over plain term frequencies is that terms that are common across all documents receive low weights. Or, as Wikipedia puts it:

The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

                     php   framework   bot   web
Laravel              0.8   0.5         0     0.7
Totally not Jarvis   0     0.6         0.8   0.1
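To make the weighting concrete, here is a minimal hand-rolled tf-idf using the textbook formula tf * log(N / df). It is a sketch for illustration only: the tool uses scikit-learn, whose `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its numbers will differ. The function name `tfidf` and the sample documents are made up.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["php framework web framework", "bot framework bot"]
for doc_weights in tfidf(docs):
    print(doc_weights)
```

In this variant a term that occurs in every document, such as `framework` here, gets a weight of exactly zero, which is the "too common" case mentioned above.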

Afterwards, we apply Non-Negative Matrix Factorization (NMF) to extract the underlying topics defined by their most-relevant keywords. This gives us two results. Firstly, a list of topics defined by their most-important keywords (high value means more relevance for the topic):

          php   framework   bot   web
Topic 1   0.8   0.1         0     0.7
Topic 2   0     0.8         0.1   0.1

The example shows two resulting topics. Topic 1 is defined by the keywords php and web. Topic 2 is defined by the keyword framework.
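Reading the defining keywords off a topic row amounts to picking the terms with the highest weights. A small sketch, assuming the topic is given as a list of weights aligned with a vocabulary list; the helper name `top_keywords` is hypothetical:

```python
def top_keywords(topic_weights, vocabulary, n=2):
    """Return the n terms with the highest weight in one topic row."""
    # Sort term indices by weight, highest first; ties keep vocabulary order.
    ranked = sorted(range(len(vocabulary)),
                    key=lambda i: topic_weights[i],
                    reverse=True)
    return [vocabulary[i] for i in ranked[:n]]


vocabulary = ["php", "framework", "bot", "web"]
print(top_keywords([0.8, 0.1, 0.0, 0.7], vocabulary))       # ['php', 'web']
print(top_keywords([0.0, 0.8, 0.1, 0.1], vocabulary, n=1))  # ['framework']
```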

And secondly, NMF yields a list of repositories per topic:

          Laravel   Totally not Jarvis
Topic 1   0.9       0.1
Topic 2   0.8       0.7

We see Laravel fits into both topics. On the other hand, Totally not Jarvis just fits into Topic 2 (defined by framework) but not into Topic 1 (defined by php and web). This data is then used to create folders named after the most relevant keywords. Afterwards, a readme file with the most relevant repositories for each topic is generated. You can explore the pre-generated examples in the example folder.
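The factorization step itself can be sketched in plain Python. This is an illustration under assumptions, not the tool's implementation: the tool uses scikit-learn, while the code below uses the classic multiplicative update rules for NMF on the tf-idf matrix from the example. All function names are made up. Here `w` holds one row per repo and one column per topic (the transpose of the repositories-per-topic table), and `h` holds one row per topic and one column per keyword.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius distance between v and the product w @ h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize v (repos x terms) into w (repos x topics) and h (topics x terms)."""
    rng = random.Random(seed)
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in v]
    h = [[rng.random() + eps for _ in v[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(n_topics)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(len(w))]
    return w, h


# tf-idf matrix from the example: rows are Laravel and Totally not Jarvis,
# columns are php, framework, bot, web.
v = [[0.8, 0.5, 0.0, 0.7],
     [0.0, 0.6, 0.8, 0.1]]
w, h = nmf(v, n_topics=2)
print(frobenius_error(v, w, h))
```

Because the updates only multiply and divide non-negative numbers, both factors stay non-negative, which is what makes the rows of `h` readable as keyword weights.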

Dependencies

  • PyGithub to fetch your stars.
  • requests to fetch READMEs from GitHub.
  • Markdown to generate HTML from Markdown files.
  • BeautifulSoup to extract text from the generated HTML (the easiest way to get plain text from Markdown).
  • scikit-learn and the SciPy stack for the machine learning algorithms (topic extraction, tf-idf vectors, etc.).

github-stars-by-topic's People

Contributors

lorey, snoop2head
