
github-mirror's People

Contributors

bhlowe, colelloa, deepsource-io[bot], dspinellis, gousiosg, hahnicity, jeffmcaffer, larsborn, notalex, pdegenportnoy, rtlee9, ryanfarr01, sbaltes, vmarkovtsev, ward, xchikux

github-mirror's Issues

Permission problem when restoring the backup

Hi,

I'm in the process of restoring the last backup (20160419). I was following the instructions described here, and I think something is missing to make it work.

The ght-restore-mysql command fails when running the LOAD DATA INFILE statements, because the user created for the new schema does not have the FILE privilege.

I solved it with something like:

GRANT FILE ON *.* TO 'ghtorrentuser'@'localhost';

Maybe the problem appears only with specific database configurations. In my case, I'm using MySQL Community Server 5.6.25 on Linux.

SQLite3::SQLException in the first attempt

Have I done something wrong?

E:\github\github-mirror>ght-retrieve-repo -c config.yaml gousiosg github-mirror
Overriding configuration mirror_history_pages_back=5 with new value 1000
Database empty, running migrations from D:/RailsInstaller/Ruby2.1.0/lib/ruby/gems/2.1.0/gems/ghtorrent-0.11.1/lib/ghtorrent/migrations
Creating table users
Creating table projects
Creating table commits
Creating table commit_parents
Creating table followers
Adding organization descriminator field to table users
Updating users with default values
Creating table organization-members
Adding table commit comments
Adding table project members
Adding table watchers
Adding table pull requests
Adding table pull request history
Adding table pull request commits
Adding table pull request comments
Adding unique(name, owner) constraint to table projects
Create table project_commits
Migrating data from commits to project_commits
Adding table forks
Adding table issues
Adding issue history
SQLite3::SQLException: AUTOINCREMENT is only allowed on an INTEGER PRIMARY KEY
D:/RailsInstaller/Ruby2.1.0/lib/ruby/gems/2.1.0/gems/sqlite3-1.3.11-x86-mingw32/lib/sqlite3/database.rb:91:in `initialize'

The result is that only the user appears in MongoDB.
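
The SQLite error in the log above can be reproduced outside GHTorrent. The following is a minimal sketch in Python using the standard `sqlite3` module. The table names are hypothetical and this is not the migration code that GHTorrent runs; it only illustrates that SQLite accepts AUTOINCREMENT solely on a column declared as INTEGER PRIMARY KEY.

```python
import sqlite3


def try_create(ddl):
    """Run a CREATE TABLE statement in a fresh in-memory database.

    Returns None on success, or the SQLite error message on failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(ddl)
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()


# Hypothetical table definitions, used only for illustration.
bad = "CREATE TABLE demo_bad (id BIGINT PRIMARY KEY AUTOINCREMENT, note TEXT)"
good = "CREATE TABLE demo_good (id INTEGER PRIMARY KEY AUTOINCREMENT, note TEXT)"

print("BIGINT :", try_create(bad))
print("INTEGER:", try_create(good))
```

If a migration declares an auto-incrementing key with any type other than INTEGER, SQLite will probably reject it in the same way, which may be why the run stops while other database backends are unaffected.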

Confusion in the schema of RDBMS

Hello,

This is regarding the schema given here:
http://ghtorrent.org/files/schema.pdf

The id key is used in many tables and I am unable to understand it clearly. Consider the following cases:

  1. Is the primary key "id" in the projects table the same field that is used in repo_labels or repo_milestones? I am unable to understand how projects and repos are related in the schema. Semantically, a project can consist of many repositories in GitHub.
  2. The repo_labels table has two fields: id and repo_id. The project_members table has repo_id as part of its primary key, while the projects table does not refer to repo_id at all.
  3. Watchers and issues are associated with repo_id, while commits are associated with project_id. The projects table does not have project_id in its attribute list.

I have imported the sql dump and had hoped that firing a few sql queries would resolve the confusion, but it hasn't. Please help!

Thanks and Regards,
Ayushi

Error or misunderstanding

Hi there,

I'm finishing my PhD in computer science and I will use GHTorrent in my thesis.

I'm having trouble understanding why there are differences between the data coming from GitHub and the data coming from GHTorrent.

Let's give an example:

1 - Clone https://github.com/vaadin/framework
2 - Take the first commit by running git rev-list --max-parents=0 HEAD : d0b04c7fb28acc39ceeb63ea0c22f8568e7ca81d
3 - Search for this sha in the project_commits table
4 - It is not associated with the https://github.com/vaadin/framework project in GHTorrent. It is associated with 2 other projects that are copies of the vaadin project.

Could you please help me understand why the commit d0b04c7fb28acc39ceeb63ea0c22f8568e7ca81d is not associated with the project at https://github.com/vaadin/framework inside GHTorrent?

Thank you very much.

trying to extract mysql dump gives an error: gzip: stdin: unexpected end of file

tar zxfv mysql-2017-04-01.tar.gz              
mysql-2017-04-01/                             
mysql-2017-04-01/commit_comments.csv          
mysql-2017-04-01/pull_requests.csv            
mysql-2017-04-01/followers.csv                
mysql-2017-04-01/watchers.csv                 
mysql-2017-04-01/pull_request_comments.csv    
                                              
gzip: stdin: unexpected end of file           
tar: Unexpected EOF in archive                
tar: Unexpected EOF in archive                
tar: Error is not recoverable: exiting now  
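
An "unexpected end of file" from gzip usually points to an incomplete download. As a minimal sketch, and not part of GHTorrent, the following Python function reads a gzip file to the end and reports whether the stream is complete. The function name and chunk size are assumptions made for illustration.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1024 * 1024):
    """Return True if the gzip stream at `path` decompresses to the end."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read and discard the data; only the stream integrity matters.
            while stream.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream ended before its end-of-stream marker.
        # OSError / zlib.error: missing file, not a gzip file, or corrupt data.
        return False
```

Calling `gzip_is_complete("mysql-2017-04-01.tar.gz")` before running tar would show whether the archive needs to be downloaded again. For a dump of this size the check reads the whole file, so it can take a while.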

Non-commit entities not stored in MySQL database

When running ght-retrieve-repo, while commits are successfully stored in the database, issues, pull_requests, etc. are fetched but not stored, even when providing the -y option. I notice in the logs that while ghtorrent.rb is being used to add commits to the database when retrieving them, this is not the case with the other entities.

commits:

...
INFO, 2018-02-27T16:13:45-08:00, ghtorrent -- api_client.rb: Successful request. URL: https://api.github.com/repos/twitter/twemoji/commits/72b5e44e092d910629547cbc6886127901fb81d8?per_page=100, Remaining: 3098, Total: 278 ms
INFO, 2018-02-27T16:13:45-08:00, ghtorrent -- retriever.rb: Added commit twitter/twemoji -> 72b5e44e092d910629547cbc6886127901fb81d8
INFO, 2018-02-27T16:13:45-08:00, ghtorrent -- ghtorrent.rb: Added commit twitter/twemoji -> 72b5e44e092d910629547cbc6886127901fb81d8 
INFO, 2018-02-27T16:13:45-08:00, ghtorrent -- ghtorrent.rb: Added commit_parent 72b5e44e092d910629547cbc6886127901fb81d8 to commit 2d8c1a7e7243c76aa53db8f018dcbdb994d22024
...

pull requests:

...
INFO, 2018-02-27T16:09:54-08:00, ghtorrent -- api_client.rb: Successful request. URL: https://api.github.com/repos/twitter/twemoji/pulls/225, Remaining: 3284, Total: 733 ms
INFO, 2018-02-27T16:09:54-08:00, ghtorrent -- retriever.rb: Added pull_requests twitter/twemoji -> 225
INFO, 2018-02-27T16:09:55-08:00, ghtorrent -- api_client.rb: Successful request. URL: https://api.github.com/repos/twitter/twemoji/pulls/219, Remaining: 3283, Total: 870 ms
INFO, 2018-02-27T16:09:55-08:00, ghtorrent -- retriever.rb: Added pull_requests twitter/twemoji -> 219
...

Schema and database fields of pull_requests do not agree

The field user_id listed in the schema is not part of the pull_requests table.

mysql> describe pull_requests;
+----------------+------------+------+-----+---------+----------------+
| Field          | Type       | Null | Key | Default | Extra          |
+----------------+------------+------+-----+---------+----------------+
| id             | int(11)    | NO   | PRI | NULL    | auto_increment |
| head_repo_id   | int(11)    | YES  | MUL | NULL    |                |
| base_repo_id   | int(11)    | NO   | MUL | NULL    |                |
| head_commit_id | int(11)    | YES  | MUL | NULL    |                |
| base_commit_id | int(11)    | NO   | MUL | NULL    |                |
| pullreq_id     | int(11)    | NO   | MUL | NULL    |                |
| intra_branch   | tinyint(1) | NO   |     | NULL    |                |
+----------------+------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)

Incorrect datetime value error when running ght-restore-mysql

I got this error message on an Ubuntu 16.04 system (MySQL 5.7.16):

....
Thu Jan 12 00:43:27 CET 2017 Creating indexes
Thu Jan 12 00:43:27 CET 2017 CREATE UNIQUE INDEX `login` ON `ghtorrent17_1`.`users` (`login` ASC)  COMMENT '';
Thu Jan 12 01:00:49 CET 2017 CREATE UNIQUE INDEX `sha` ON `ghtorrent17_1`.`commits` (`sha` ASC)  COMMENT '';
ERROR 1292 (22007) at line 1: Incorrect datetime value: '0000-00-00 00:00:00' for column 'created_at' at row 490174

This fixed the problem for me:

UPDATE commits SET created_at = NULL WHERE CAST(created_at AS CHAR(20)) = '0000-00-00 00:00:00';
UPDATE projects SET created_at = NULL WHERE CAST(created_at AS CHAR(20)) = '0000-00-00 00:00:00';
UPDATE projects SET updated_at = NULL WHERE CAST(updated_at AS CHAR(20)) = '0000-00-00 00:00:00';

Errors when using MySQL web

I encounter the following errors when trying to access MySQL web (http://ghtorrent.org/dblite/)

I log in as 'Guest' and encounter the following error when I try to expand a table in the Database Explorer pane:

SQLSTATE[HY000]: General error: 1021 Disk full (/mnt/#sql_1c7d9_0.MAI); waiting for someone to free some space... (errno: 28 "No space left on device

When I try to execute a select query (e.g., select * from users limit 5;) I encounter the following error:

SQLSTATE[42000]: Syntax error or access violation: 1142 SELECT command denied to user 'ghtro'@'web' for table 'users'

Thanks!

Incorrect opened_at in table pull_request_history

I found this problem when I compared the opened_at of a pull request in GHTorrent with the created_at of the same pull request accessed via the official GitHub API.

Here is an example:

  1. Data in table pull_request_history:

     id        pull_request_id  created_at           action  actor_id
     20920677  345020           2012-08-30 17:41:41  opened  654469

     The PR is #1266 in repository cocos2d/cocos2d-x.

  2. Data fetched via the official GitHub API:
     https://api.github.com/repos/cocos2d/cocos2d-x/pulls/1266
     "created_at": "2012-08-30T19:41:41Z"

2012-08-30 17:41:41 in GHTorrent is two hours earlier than 2012-08-30T19:41:41Z in GitHub.

At first, I thought this inconsistency was caused by a timezone difference. However, the created_at of a repository in GHTorrent equals the value fetched via the official GitHub API.
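
The gap between the two timestamps quoted above can be checked mechanically. The sketch below is in Python and its helper names are assumptions made for illustration. It only computes the difference between a value from the dump and a value from the API; it does not explain where the shift comes from.

```python
from datetime import datetime


def parse_dump_timestamp(value):
    """Parse a timestamp as it appears in the MySQL dump."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def parse_api_timestamp(value):
    """Parse an ISO 8601 UTC timestamp as returned by the GitHub API."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def offset_hours(dump_value, api_value):
    """Hours by which the API value is later than the dump value."""
    delta = parse_api_timestamp(api_value) - parse_dump_timestamp(dump_value)
    return delta.total_seconds() / 3600


print(offset_hours("2012-08-30 17:41:41", "2012-08-30T19:41:41Z"))  # 2.0
```

A whole-hour offset like this would be consistent with a timezone conversion somewhere in the pipeline, but running the same check over many rows would be needed to tell whether the shift is systematic.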

Data in the "User" table is not up-to-date

I've downloaded and restored the mysql dump from "mysql-2018-02-01".

When I query certain known users in the "User" table, I noticed that the information contained in the table is not up-to-date. For example, the "location" field for my record (i.e., login=shehan) is Null. This was correct when I created my profile, but I updated this profile attribute in 2016. So, the latest "User" table should not show a Null value.

Is this a known issue with the GHTorrent data? Can you let me know how I can obtain a User table with the most recent data?

Thanks!

SQL error for missing fork parents

When ensure_repo runs on a fork, it looks to ensure the parent repo is also present. If it is not present or is otherwise not available, then this line breaks with an error trying to relate foreign keys.

This scenario can happen if the parent cannot be loaded. For example, the key in use may not have permissions to that repo or there may be a transient error.

What is the right fix to do here? I have not looked at the database enough to grok all the relationships.

ght-restore-mysql: Errcode: 13 "Permission denied" (MariaDB 10.1)

Trying to restore a dump in MariaDB 10.1 I get the error:

acs@~/devel/ghtorrent/dump $ ./ght-restore-mysql -u root -d ghtorrent -p '' .
jue sep 20 22:39:06 CEST 2018 Creating the DB schema
jue sep 20 22:39:09 CEST 2018 Restoring table commit_comments
ERROR 13 (HY000) at line 1: Can't get stat of '/home/acs/devel/ghtorrent/dump/commit_comments.csv' (Errcode: 13 "Permission denied")

I fixed it by using LOAD DATA LOCAL INFILE inside the ght-restore-mysql script.

This has a performance cost: "When using LOCAL with LOAD DATA, a copy of the file is created in the directory where the MySQL server stores temporary files" (MySQL documentation), but it works.

Extremely high mongodb load

It seems the insert conditions put a very high load and some locks on MongoDB. Is there any suitable way to address this?

Last update too recent?

If a spider fails to get a response from the GitHub API because of an unstable network environment, or because of the GitHub API server itself, the program exits early with no error or exception. But when I run it again with the same config, it says:

WARN, 2016-05-04T00:11:20+08:00, ghtorrent -- full_repo_retriever.rb: Last update too recent

Is there any way to resume the polling? I found that the watcher data is pretty far from the real numbers (watchers + starrers).

Geocoded GitHub Data

Hello,
My organization just utilized the GitHub Mirror (more specifically the GHTorrent SQL Data from 4/2) to generate a dataset detailing programming language popularity by country. In order to do this, I developed a tool to geocode the entirety of the GitHub users database (affix an ISO-2-character country code to each user). I think that geo-tagged information could be a valuable addition to the datasets provided by the GHTorrent Site.

I would like to donate both the geocoded user database as well as the tool I developed to the github mirror community so that others can benefit. If you are interested, let me know the best way to approach providing the data.

Thanks!

Derek

restore script doesn't allow change of storage engine

When I try to use InnoDB I get an error.

(venv)greg@Ithilien:~/Documents/ecs260/dump/mod_dump$ ./ght-restore-mysql -uroot -proot -eInnoDB .
./ght-restore-mysql: illegal option -- e
Invalid option: -
Usage: ./ght-restore-mysql [-u dbuser ] [-p dbpasswd ] [-h dbhost] [-d database ] dump_dir

Am I doing something wrong here?

Incorrect number of Pull Requests in table pull_request_history

Using the query

SELECT action, COUNT(*) as freq, YEAR(created_at) as pr_yr, MONTH(created_at) as pr_mt
FROM
  [ghtorrent-bq:ght_2017_09_01.pull_request_history] 
GROUP BY pr_yr, pr_mt, action

I get a monthly digest of opened and merged PRs events on GitHub. Unfortunately, plotting this does not yield a very plausible graph:

[Figure: rplot-1]

I was able to trace the point where the number of merged pull requests seems to go wrong to the first month of 2014:

[Figure: rplot05-1]

Later, the sharp drop in mid-2016 also seems questionable.

I hope this helps you debug the issue.

(Possibly related to #19.)

Pull request history "merged" problem.

It seems that the "merged" actions are not associated with the correct user. Sometimes they are associated with the issue opener, sometimes with the closer. Using the GitHub /user/repo/issues API, the correct author login could be acquired.

For example the following issues are all merged by "juditacs" user (based on the github api), and in the sql dump:

SQLite3::SQLException: AUTOINCREMENT is only allowed on an INTEGER PRIMARY KEY

Hi, when running the command ght-retrieve-repo user repo, I received the following error:

Overriding configuration mirror_history_pages_back=5 with new value 1000
Database empty, running migrations from /var/lib/gems/2.1.0/gems/ghtorrent-0.11.1/lib/ghtorrent/migrations
Creating table users
Creating table projects
Creating table commits
Creating table commit_parents
Creating table followers
Adding organization descriminator field to table users
Updating users with default values
Creating table organization-members
Adding table commit comments
Adding table project members
Adding table watchers
Adding table pull requests
Adding table pull request history
Adding table pull request commits
Adding table pull request comments
Adding unique(name, owner) constraint to table projects
Create table project_commits
Migrating data from commits to project_commits
Adding table forks
Adding table issues
Adding issue history
SQLite3::SQLException: AUTOINCREMENT is only allowed on an INTEGER PRIMARY KEY
/var/lib/gems/2.1.0/gems/sqlite3-1.3.11/lib/sqlite3/database.rb:91:in `initialize'

Can I resolve this problem?

Labels for pull requests.

Pull requests are a special class of issue. However, some repositories seem to be missing the labels attached to pull requests. I would propose that we ensure that labels are attached to pull requests the same way they are attached to regular issues.

How to check if my schema creation (through ght_restore_mysql script) is complete

I ran the 'ght-restore-mysql' script on my remote server, then moved the process to the background and disowned it (I pressed Ctrl+Z, then ran 'bg' and 'disown'). The problem is that I think my schema creation has completed, but the bash script still shows up as a running process with status S. As I didn't use nohup or anything similar, I don't have an output log where I can check the exit code.

So I was wondering whether there is any way to check that my schema creation has completed; otherwise I'd have to run the script again to be completely sure.

I used the latest June dump (mysql-2018-06-01) and after running the script,
my ghtorrent database shows a size of 357.33 GB.

+-----------------------+---------------+
| DB Name               | DB size in MB |
+-----------------------+---------------+
| ghtorrent_june01_2018 |      365905.6 |

Can anyone confirm if the size I get is accurate?

Or is there any other way to check whether the schema creation is complete?

Metadata appears out of date on several repos in MongoDB

GHTorrent Team,

Black Duck Software has been hoping to use GHTorrent to keep up to date on all github metadata. When accessing using the MongoDB instance, we have noticed that several major repos appear to be somewhat out of date.

Using a few repos from owner 'google' as an example, but we've seen several major repos other than google being quite out of date:

  1. google/material-design-lite

https://api.github.com/repos/google/material-design-lite

"stargazers_count": 28104,
"watchers_count": 28104,

db.repos.find({ name: 'material-design-lite', 'owner.login': 'google' })

   "stargazers_count": 12936,
   "watchers_count": 12936,

  2. google/incremental-dom

https://api.github.com/repos/google/incremental-dom

"stargazers_count": 2684,
"watchers_count": 2684,

db.repos.find({ name: 'incremental-dom', 'owner.login': 'google' })

"stargazers_count":1314,
"watchers_count":1314,

  3. google/binnavi

https://api.github.com/repos/google/binnavi

"stargazers_count": 2187,
"watchers_count": 2187,

db.repos.find({ name: 'binnavi', 'owner.login': 'google' })

   "stargazers_count":1183,
   "watchers_count":1183,

We at Black Duck were wondering if there is a way to identify repositories that have not been updated recently, and we would like to help the GHTorrent team in any way possible to stay as up to date as possible. Please reach out to us here or email me directly at [email protected] so we can try to find a solution that can benefit both of our organizations.

Best,
Andrew Colello
Software Engineer, Knowledgebase Team
Black Duck Software

Why project_commits?

Hi!
Why does project_commits exist if there is a project_id in each row of commits?

mislabelled user type

There seems to be a great number of individual users that are labeled as 'Organization', e.g., SureShinde.

PTY allocation request

@gousiosg Hello,

After using the service for a while, I get the following response when running
ssh -L 3306:web.ghtorrent.org:3306 [email protected]
response:
PTY allocation request failed on channel 2

Therefore I can't connect to mysql database.
MongoDB works fine.

Thanks

issue_events action check constraint causes transaction to fail

During normal operation, ensure_issue_events fails with the following error:

ERROR, 2017-10-27T13:40:21-04:00, ghtorrent -- ghtorrent.rb: PG::CheckViolation: ERROR: new row for relation "issue_events" violates check constraint "issue_events_action_check" DETAIL: Failing row contains (831153782, 49, 27, head_ref_restored, null, 2016-10-20 20:01:36).

This error does not crash the app, but causes all later transactions to fail:

WARN, 2017-10-27T13:40:21-04:00, ghtorrent -- ghtorrent.rb: Transaction failed (3245 ms) ERROR, 2017-10-27T13:40:21-04:00, ghtorrent -- ghtorrent.rb: PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block

There are many different types of issue events: https://developer.github.com/v3/issues/events/#events-1

However, the type constraint only declares a few actions, but enforces a not null constraint regardless: https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/migrations/011_add_issues.rb#L30

The code itself does nothing to try to filter the "action" field: https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/ghtorrent.rb#L1597

My question:

Which is preferable:
a) expanding the constraint, or dropping it altogether, so that all issue events are allowed to be written, or
b) altering ensure_issue_events so that it won't attempt to write any events that would violate the constraint?
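
Option (b) can be sketched as a filter that runs before any write is attempted. The example is in Python for illustration and is not GHTorrent code. The set of allowed actions and the "action" dictionary key are hypothetical; the real list lives in the 011_add_issues.rb migration and is not reproduced here.

```python
# Hypothetical subset of actions accepted by the check constraint.
ALLOWED_ACTIONS = frozenset({"closed", "reopened", "merged"})


def split_by_constraint(events, allowed=ALLOWED_ACTIONS):
    """Partition events into (storable, skipped) according to their action."""
    storable, skipped = [], []
    for event in events:
        # Events whose action is missing or unknown would violate the check.
        target = storable if event.get("action") in allowed else skipped
        target.append(event)
    return storable, skipped


events = [
    {"id": 1, "action": "closed"},
    {"id": 2, "action": "head_ref_restored"},
]
storable, skipped = split_by_constraint(events)
print([e["id"] for e in storable], [e["id"] for e in skipped])
```

Filtering this way keeps the transaction from aborting, at the cost of silently dropping event types the schema does not know about. Logging the skipped events would at least make that loss visible.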

Download error

Hi!
I'm trying to download the 50 GB MySQL dump of February 2017, but I keep getting an error while reading bytes. I tried downloading from different networks. Is the server kicking me out?

ght-retrieve-repo breaks while fetching no longer existing user/fork

WARN, 2018-03-02T15:22:53-08:00, ghtorrent -- api_client.rb: Failed request. URL: https://api.github.com/repos/Crockchartering/twitter.github.com, Status code: 404, Status: Not Found, Access: 336331c85fc, IP: 0.0.0.0, Remaining: 4367
ERROR, 2018-03-02T15:22:53-08:00, ghtorrent -- full_repo_retriever.rb: Error in stage: ensure_forks, Repo: twitter/twitter.github.com, Message: no implicit conversion of String into Integer
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:694:in `[]='
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:694:in `block (3 levels) in repo_bound_items'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:683:in `each'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:683:in `block (2 levels) in repo_bound_items'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:679:in `each'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:679:in `block in repo_bound_items'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:670:in `each'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:670:in `repo_bound_items'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:354:in `retrieve_forks'
/home/gordonl/github-mirror/lib/ghtorrent/ghtorrent.rb:1388:in `ensure_forks'
/home/gordonl/github-mirror/lib/ghtorrent/commands/full_repo_retriever.rb:87:in `block in retrieve_full_repo'
/home/gordonl/github-mirror/lib/ghtorrent/commands/full_repo_retriever.rb:84:in `each'
/home/gordonl/github-mirror/lib/ghtorrent/commands/full_repo_retriever.rb:84:in `retrieve_full_repo'
/home/gordonl/github-mirror/lib/ghtorrent/commands/ght_retrieve_repo.rb:31:in `go'
/home/gordonl/github-mirror/lib/ghtorrent/command.rb:66:in `run'
bin/ght-retrieve-repo:6:in `<main>'

Cannot find user [email protected]:gousiosg/github-mirror.git.

ght-retrieve-repo -c ~/config.yaml -s 'username' -p 'password' [email protected]:gousiosg/github-mirror.git

fail to connect with

g/github-mirror.git
Overriding configuration github_username=username with new value username
Overriding configuration github_passwd=password with new value password
Overriding configuration mirror_history_pages_back=5 with new value 1000
WARN, 2017-07-19T14:23:37+02:00, ghtorrent -- ghtorrent.rb: Not a valid email address: [email protected]:gousiosg/github-mirror.git
Error: Cannot find user [email protected]:gousiosg/github-mirror.git.
Try --help for help.
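
Other invocations on this page pass the owner and the repository name as two separate arguments (for example, ght-retrieve-repo -c config.yaml gousiosg github-mirror), so a clone URL is probably being read as a user name here. The following Python sketch, which is not part of GHTorrent, splits a clone URL into those two parts. The function name and the sample URLs are assumptions made for illustration.

```python
import re

_REPO_PATTERN = re.compile(
    r"""(?:^|[:/])           # start of string, or the separator before the owner
        (?P<owner>[^/:\s]+)  # owner (user or organization)
        /
        (?P<repo>[^/\s]+?)   # repository name
        (?:\.git)?/?$        # optional '.git' suffix and trailing slash
    """,
    re.VERBOSE,
)


def split_repo_spec(spec):
    """Return (owner, repo) from 'owner/repo', an HTTPS URL, or an SSH URL."""
    match = _REPO_PATTERN.search(spec.strip())
    if match is None:
        raise ValueError(f"cannot find owner/repo in {spec!r}")
    return match.group("owner"), match.group("repo")


# Sample inputs are illustrative only.
print(split_repo_spec("git@github.com:gousiosg/github-mirror.git"))
print(split_repo_spec("https://github.com/gousiosg/github-mirror"))
```

The two returned values can then be passed to ght-retrieve-repo as its last two arguments.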

Need better api error handling

def api_request_raw(url, media_type = '') in api_client sometimes returns nil and sometimes throws an exception. This ripples up to pretty much all of the retrieve* and ensure* methods and beyond. Some of these are built to handle nil and others are not. We are seeing many secondary exceptions as a result of nil return values. This typically happens when GitHub returns a 403 Forbidden. While I suspect that is a throttling-related problem (subsequent calls appear to work), it is entirely realistic for this to happen, as there are a raft of different REST call statuses that will cause nil to be returned.

I started by putting .nil? checks in the appropriate places but:

  • there are quite a few places
  • makes the code look yucky
  • generally these just exit the method

The bonus of checking in the caller is that you can provide a somewhat more targeted error. Rather than "could not retrieve user XXX", the code could give "Failed ensure_commit because user XXX could not be retrieved".

If an exception is to be thrown, there are a number of "send() loops" that will need to be augmented with rescues.

I'm happy to help make the related changes but need to know

  • exceptions or nil? checks?
  • fix/change/remove existing nil? checks if exceptions are the chosen path?
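
To make the exception option concrete, here is a minimal sketch of the pattern, written in Python for illustration even though GHTorrent itself is Ruby. Every name in it (ApiRequestError, fetch_or_raise, the fetch callable and its return shape) is hypothetical and does not correspond to existing GHTorrent code.

```python
class ApiRequestError(Exception):
    """Raised when an API request does not yield a usable response."""

    def __init__(self, url, status):
        super().__init__(f"request to {url} failed with status {status}")
        self.url = url
        self.status = status


def fetch_or_raise(fetch, url):
    """Call fetch(url) -> (status, body) and raise instead of returning None."""
    status, body = fetch(url)
    if status != 200 or body is None:
        raise ApiRequestError(url, status)
    return body


def ensure_commit_author(fetch, user_url):
    """Hypothetical caller that adds its own context to a low-level failure."""
    try:
        return fetch_or_raise(fetch, user_url)
    except ApiRequestError as exc:
        raise RuntimeError(f"Failed ensure_commit because {exc}") from exc
```

With this shape the low-level client never returns nil, and a caller that wants a targeted message wraps the exception once instead of repeating nil checks. Any loop that dispatches stages dynamically would still need a rescue so that one failed stage does not stop the others.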

Issues importing mysql data -- Importing instructions

Hi,

Thanks for the awesome project!

I am having a hard time importing the massive mysql file. It starts out fast, but after 20% or so the import speed drops significantly. It seems like it will never complete because the rate of import drops faster than the progress increases.

I am wondering if you can include instructions in the readme/website on how to import the data. Are there mysql config changes that I need to make? What are the minimum system requirements?

Thanks!

Query a repository's license

Hi,

I'm currently doing research into developers' locations and the types of licenses they use. GHTorrent is a valuable source of information for me, but I noticed in the MySQL web interface that repositories' licenses are not currently retrieved / stored. Is this correct?

I can probably work around this by writing some code of my own, but I think it will be a valuable addition to GHTorrent. The only change that seems to be required is to provide a custom media type in the Accept header (application/vnd.github.drax-preview+json), see GitHub's documentation. I noticed that just recently you added support for custom media types, so I hope this wouldn't be too much work.

Thank you for providing GHTorrent, it is already a valuable resource for me!

Inconsistency on tables

I'm using GHTorrent on Google BigQuery (https://bigquery.cloud.google.com/table/ghtorrent-bq:ght_2017_04_01.project_languages?pli=1)

I've found an inconsistency between tables. I queried for the projects of a specific language with the most commits, using the table project_languages. But when I look up the same projects in the table projects, the column "language" sometimes shows a different language. Example: I queried the top Java projects ordered by number of commits. When I look up the project_id values from that query in the table projects, the column "language" shows another language, such as C.

Now I'm lost. Which field is more reliable? Likewise, there are a lot of commits from 1994. Are they real?

MongoDB dumps from tudelft.nl unreachable

It seems like the two initial MongoDB dumps (hosted at TU Delft) have been unreachable for the past two weeks:

http://dutihr.st.ewi.tudelft.nl/downloads/commits-dump.2015-08-03.tar.gz
http://dutihr.st.ewi.tudelft.nl/downloads/commits-1-dump.2015-08-04.tar.gz

We encountered timeouts (no error response) using either wget or Firefox from both university and private IPs, although the host responds to ping. We had previously (a few months ago) downloaded both files successfully.

Access denied on import

Importing data with the documented procedure produces an error, such as the following:

ERROR 1045 (28000) at line 1: Access denied for user 'ghtorrent'@'localhost' (using password: YES)

Not working when gem mongo upgraded to (2.2.1, 1.12.5)

$ ruby -Ilib bin/ght-retrieve-repo -c config.yaml gousiosg github-mirror
Overriding configuration mirror_history_pages_back=5 with new value 1000
WARN, 2016-01-25T09:46:19+00:00, ghtorrent -- ghtorrent.rb: Transaction failed (1 ms)
uninitialized constant Mongo::ConnectionFailure
/home/legend/github-mirror/lib/ghtorrent/adapters/mongo_persister.rb:199:in `rescue in rescue_connection_failure'

It seems that the version of the mongo gem is not pinned?

Missing torrents

Hi,

Under 'Available Downloads', it says 'List of available torrents (Last dump date: 2014-11-29)' but there are no torrents listed.

Can you update the site with a list of available .torrent files?

Thanks!
