Giter Club home page Giter Club logo

github-bigquery-exporter's Introduction

github-bigquery-exporter

GitHub BigQuery export utility, for those times when more granular PR and Issue queries are required. This is also good way to query data for periods longer than the GitHub max of 30 days.

Requirements

token

You can run the export script without the GitHub API token but you will be subject to much stricter rate limits. To avoid this (important for larger organizations) get a personal API tokens by following these instructions and define it in an GITHUB_ACCESS_TOKEN environment variable

export GITHUB_ACCESS_TOKEN="your long token string goes in here"

Remember, you have to be org admin for this to work

json2csv

GitHub API exports data in JSON format. The simplest way to import desired data elements is to convert the data into CSV using json2csv, a Node.js utility that converts JSON to CSV.

npm install -g json2csv

Configuration

To configure the export script you will need to define the organization and provide list of repositories in this organization.

declare -r org="my-org-name"
declare -a repos=("my-repo-1"
                  "my-repo-2"
                  "my-repo-3")

Optionally, to configure the import script you can edit the data-set name and configure the issue and pull table name. This step is only required if you for some reason have name conflicts in your BiqQuery project.

declare -r ds="github"
declare -r issues_table="issues"
declare -r pulls_table="pulls"

Export

To execute the GitHub export script run this command:

./export

The expected output should look something like this

Downloading issues for org/repo-1...
Downloading prs for org/repo-1...
Downloading issues for org/repo-2...
Downloading prs for org/repo-2...

Import

To execute the BigQuery import script run this command:

./import

The expected output should look something like this

Dataset 'project:github' successfully created.
Table 'project:github.issues' successfully created.
Table 'project:github.pulls' successfully created.
Waiting on bqjob_..._1 ... (0s) Current status: DONE
Waiting on bqjob_..._1 ... (0s) Current status: DONE

Query

When the above scripts completed successfully you should be able to query the imported data using SQL in BigQuery console. For example to find repositories with most issues over last 90 days:

select
  i.repo,
  count(*) num_of_issues
from gh.pulls i
where date_diff(CURRENT_DATE(), date(i.ts), day) < 90
group by
  i.repo
order by 2 desc

TODO

  • Add org user export/import
  • Sort out the 2nd run where tables have to be appended
  • Bash, really? Can I haz me a service?

Scratch

Users who have activity (pr/issue) but are NOT in the user table

with active_users as (
  select username
  from gh.issues
  group by username

  union all

  select p.username
  from gh.pulls p
  group by username
)
select *
from active_users
where username not in (SELECT username from gh.users)

Export results as CSV and use them as input in user-export which will download the GitHUb data for each one of those users. Then, when done, run user-import to bring those users into

Activity breakdown by company

select all_prs.company, all_prs.prs apr, coalesce(m3_prs.prs,0) rpr from (

  select
    COALESCE(u.company, 'Unknown') company,
    COUNT(*) prs
  from gh.pulls i
  join gh.users u on i.username = u.username
  group by company

) all_prs

left join (

  select
    COALESCE(u.company, 'Unknown') company,
    COUNT(*) prs
  from gh.pulls i
  join gh.users u on i.username = u.username
  where i.ts > "2018-10-30 23:59:59"
  group by company

) m3_prs on all_prs.company = m3_prs.company

order by 2 desc
select u.company, count(*)
from gh.pulls i join gh.users u on i.username = u.username
where u.company is not null
group by company order by 2 desc
select
  pr_month,
  sum(google_prs) as total_google_prs,
  sum(non_google_prs) as total_non_google_prs
from (
select
  case when u.company = 'Google' then 1 else 0 end as google_prs,
  case when u.company = 'Google' then 0 else 1 end as non_google_prs,
  TIMESTAMP_TRUNC(i.`on`, MONTH) as pr_month
from gh.pulls i
join gh.users u on i.username = u.username
where u.company  is not null
)
group by pr_month
order by 1

PRs

select
  pr_month,
  sum(google_prs) as total_google_prs,
  sum(non_google_prs) as total_non_google_prs
from (
select
  case when u.company = 'Google' then 1 else 0 end as google_prs,
  case when u.company = 'Google' then 0 else 1 end as non_google_prs,
  TIMESTAMP_TRUNC(i.ts, MONTH) as pr_month
from gh.pulls p
join gh.users u on p.username = u.username
where u.company <> ''
)
group by pr_month
order by 1

Issues

select
  pr_month,
  sum(google_prs) as total_google_prs,
  sum(non_google_prs) as total_non_google_prs
from (
select
  case when u.company = 'Google' then 1 else 0 end as google_prs,
  case when u.company = 'Google' then 0 else 1 end as non_google_prs,
  TIMESTAMP_TRUNC(i.ts, MONTH) as pr_month
from gh.issues i
join gh.users u on i.username = u.username
where u.company <> ''
)
group by pr_month
order by 1
select pr_month, repo, count(*) as prs
from (
select
  i.repo,
  TIMESTAMP_TRUNC(i.ts, MONTH) as pr_month
from gh.pulls i
join gh.users u on i.username = u.username
where u.company is not null
)
group by pr_month, repo
order by 1, 3 desc
select
  pr_month,
  repo,
  count(*) action
from (

  select
    repo,
    SUBSTR(CAST(TIMESTAMP_TRUNC(ts, MONTH) as STRING),0,7) as pr_month
  from gh.issues

  union all

  select
    repo,
    SUBSTR(CAST(TIMESTAMP_TRUNC(ts, MONTH) as STRING),0,7) as pr_month
  from gh.pulls

)
where repo = 'build' --'build-pipeline'
group by repo, pr_month
order by 1, 2
 select repo, count(*) from (
 select
    repo
  from gh.issues

  union all

  select
    repo
  from gh.pulls
)
group by repo
order by 2 desc

github-bigquery-exporter's People

Contributors

mchmarny avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

dmitri-lerko

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.