Collect and display statistics of git repositories

License: Apache License 2.0

git-hammer's Introduction

Git Hammer

Git Hammer is a statistics tool for projects in git repositories. Its major feature is tracking the number of lines authored by each person for every commit. It currently includes some other useful statistics too, and the data that it collects could be used in multiple new ways.

Git Hammer is under active maintenance. New features appear when a need or desire for them exists. If Git Hammer lacks some feature you would like, all kinds of contributions are welcome, from simple feature suggestions to complete pull requests implementing the feature.

Setup

By default, Git Hammer stores the historical information from the repository in an SQLite database file in the current directory. If you wish to change this default, set the DATABASE_URL environment variable to a database URL according to the SQLAlchemy engine documentation. This database will be created if it does not already exist. Note that if you wish to use a database other than SQLite, you may need to install the appropriate Python module to connect to the database.
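
As a rough illustration, a SQLite URL pointing at a database file outside the current directory could be built and exported as below. The helper name sqlite_url and the file path are hypothetical; the only thing taken from SQLAlchemy is the URL form, sqlite:/// followed by the file path.

```python
import os
from pathlib import Path


def sqlite_url(db_file):
    """Build an SQLAlchemy-style SQLite URL for a file (hypothetical helper)."""
    # An absolute POSIX path starts with "/", so the result has four
    # slashes in a row: sqlite:////absolute/path/to/file
    return "sqlite:///" + Path(db_file).resolve().as_posix()


# Hypothetical location for the database file
os.environ["DATABASE_URL"] = sqlite_url("stats/baffle.sqlite")
```

Setting the variable from Python only affects that process and the processes it starts; in a shell you would export DATABASE_URL before running Git Hammer.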

You will need Python 3, at least version 3.5. It is a good idea to set up a virtual environment, like this:

python3 -m venv venv
source venv/bin/activate

Run these commands wherever you want to run Git Hammer. If you only want to use Git Hammer, you can install it with pip:

pip install git-hammer

If you want to use the latest development version or contribute to Git Hammer development, you need to clone this repository and run

pip install -r requirements.txt

in the directory where you cloned Git Hammer (in this case you should create the virtual environment above in that directory as well). The rest of the commands below assume that one of these has been done.

Creating a Project

Now pick some git repository to run Git Hammer on. The examples below use a hypothetical project called "baffle". You should replace the name with your own.

python -m githammer init-project baffle ~/projects/baffle

This will create the database containing the project baffle from the repository directory (here ~/projects/baffle; replace that with the path to your repository). Git Hammer will print out a progress report while it goes through all the commits in the repository.

Usually, you want your main development branch to be checked out in the repository, and not change the checked-out branch when updating Git Hammer data. This makes the statistics more relevant for the whole development team.

When the repository gets new development, first update the code in the repository to the latest version, and then run

python -m githammer update-project baffle

This will process all the new commits that were not yet seen into the database.

If the repository is very old, with much history, you might not be interested in capturing all of it. init-project has the option --earliest-commit-date that provides a date so that commits prior to that date are not included. This would be used like

python -m githammer init-project baffle ~/projects/baffle --earliest-commit-date 2018-01-01

It is currently not possible to later add commits that were excluded by date when the repository was added.
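
The effect of the date cutoff can be pictured with a small sketch. This is not Git Hammer's own code: the function name commits_since, the (sha, datetime) tuples, and the choice of midnight UTC as the cutoff are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def commits_since(commits, earliest_date):
    """Keep commits dated on or after earliest_date (e.g. '2018-01-01')."""
    # Assumption: the cutoff is midnight UTC at the start of the given day
    cutoff = datetime.strptime(earliest_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return [(sha, when) for sha, when in commits if when >= cutoff]


# Made-up commits: one just before the cutoff, one after it
commits = [
    ("a1", datetime(2017, 12, 31, 23, 0, tzinfo=timezone.utc)),
    ("b2", datetime(2018, 1, 1, 9, 30, tzinfo=timezone.utc)),
]
kept = commits_since(commits, "2018-01-01")
```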

Showing Statistics

After the project has been initialized and the repository added, you can show some information on it. First try out

python -m githammer summary baffle

This will print out three tables: the number of commits for each person, the number of lines of code written by each person in the head version, and the number of tests written by each person in the head version. The last table is printed only if the repository configuration includes test recognition (see below).

There are a few graphs that Git Hammer can display. To see the types of supported graphs, enter

python -m githammer graph --help

The graphs are

Type	Description
line-count	Number of lines in the project over time
line-author-count	Same as above, except split per author
test-count	Number of tests in the project over time
test-author-count	Same as above, except split per author
day-of-week	A histogram showing the number of commits for each day of the week
time-of-day	A histogram showing the number of commits for each hour of the day
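
To make the day-of-week histogram concrete, here is a minimal sketch of the underlying count. It only assumes a list of commit timestamps; the function name commits_per_weekday and the sample dates are made up, and Git Hammer's own graphing code may work differently.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def commits_per_weekday(commit_times):
    """Count commits per weekday, including days with no commits."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday
    counts = Counter(when.weekday() for when in commit_times)
    return {name: counts.get(index, 0) for index, name in enumerate(WEEKDAYS)}


# Made-up commit timestamps: two Mondays and one Tuesday
histogram = commits_per_weekday([
    datetime(2019, 1, 7, 10, 0),
    datetime(2019, 1, 8, 15, 30),
    datetime(2019, 1, 14, 9, 45),
])
```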

Configuring Sources and Tests

By default, Git Hammer assumes that every file in the repository is a source file and that there are no tests. This can be modified by creating a configuration file. The configuration file is JSON having some predefined keys:

{
  "sourceFiles": [
    "Sources/**/*.py",
    "Tests/**/*.py",
    ...
  ],
  "excludedSourceFiles": [
    "Sources/Contrib/**"
  ],
  "testFiles": [
    "Tests/**/*.py"
  ],
  "testLineRegex": "def test_"
}

Here, sourceFiles is a list of patterns that match the source files. Any file not matching one of these patterns is not considered by Git Hammer. If sourceFiles captures too many files, for instance autogenerated sources, excludedSourceFiles is a list of patterns that will not be considered source even if they match some sourceFiles pattern.

To include test counts, testFiles needs to be specified. This is, again, a list of patterns matching files that contain tests (it is up to you if you wish to define this to mean unit tests, integration tests, UI tests, etc.). Git Hammer will look inside each of the test files. Any line matching the Python regular expression testLineRegex is counted as one test. So testLineRegex should typically match whatever acts as the header of a test. Here, it is the definition of a function whose name starts with test_. Other projects, and especially other languages, will have different conventions.

All the file name patterns above (sourceFiles, excludedSourceFiles, testFiles) are glob patterns as defined by the globber library.
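
The interplay of sourceFiles and excludedSourceFiles can be sketched with a simplified glob matcher. This is an approximation written for illustration, not the globber library: the functions glob_to_regex, matches_any and is_source are hypothetical, and the real library may treat some patterns differently.

```python
import re


def glob_to_regex(pattern):
    """Translate a simple glob pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # anything, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")  # one character within a path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches_any(path, patterns):
    """True if the whole path matches at least one of the glob patterns."""
    return any(glob_to_regex(p).fullmatch(path) for p in patterns)


def is_source(path, config):
    """A file is source if it matches sourceFiles and no exclusion pattern."""
    return (matches_any(path, config["sourceFiles"])
            and not matches_any(path, config.get("excludedSourceFiles", [])))


config = {
    "sourceFiles": ["Sources/**/*.py", "Tests/**/*.py"],
    "excludedSourceFiles": ["Sources/Contrib/**"],
}
```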

The configuration file can be given as an option to the init-project command:

python -m githammer init-project baffle ~/projects/baffle --configuration ./baffle-config.json

If the --configuration option is not given, but the repository contains a file named git-hammer-config.json, this file will be read as the configuration. This way you can keep the Git Hammer configuration for a repository in that repository.
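
The lookup order described above could be sketched as follows. The function name load_configuration is hypothetical, and returning an empty dictionary to stand for the defaults (every file is a source file, no tests) is an assumption of this sketch.

```python
import json
import os
import tempfile


def load_configuration(repository_path, configuration_path=None):
    """Load an explicit configuration file, else the one in the repository."""
    if configuration_path is None:
        candidate = os.path.join(repository_path, "git-hammer-config.json")
        if not os.path.exists(candidate):
            return {}  # assumption: an empty dict stands for the defaults
        configuration_path = candidate
    with open(configuration_path) as config_file:
        return json.load(config_file)


# Demonstration in a throwaway directory standing in for a repository
with tempfile.TemporaryDirectory() as repo:
    before = load_configuration(repo)
    with open(os.path.join(repo, "git-hammer-config.json"), "w") as f:
        json.dump({"testLineRegex": "def test_"}, f)
    after = load_configuration(repo)
```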

Note: The configuration file path, as well as the repository path, will be stored in the database, so they should not be moved. If the configuration changes, data that was already in the database will not be reprocessed with the new configuration.

There is also a command to check the effects of a configuration. Run

python -m githammer list-sources ~/projects/baffle --configuration ./baffle-config.json

to print out a list of all files considered source or test files, and for each test file, the lines considered to be tests. A missing --configuration option is treated in the same way as with init-project above.

A partial output of the list-sources command on the Git Hammer repository looks like this:

S: githammer/dbtypes.py
S: githammer/frequency.py
S: githammer/hammer.py
T: tests/__init__.py
T: tests/check_regression.py
T: tests/hammer_test.py
T: tests/test_init.py
|---    def test_plain_init_does_not_create_database(self):
|---    def test_update_fails_when_database_not_created(self):

Source files are marked with S, test files with T, and after each test file, its test lines are printed indented with |---.

Multi-Repository Projects

Sometimes, a team works on multiple repositories that all still belong to the same project. For instance, it may be better to split a piece of functionality off into a library in an independent repository. Git Hammer supports such projects by not limiting the project data to a single repository.

To add another repository to an existing project, just use add-repository:

python -m githammer add-repository baffle ~/projects/baffle-common

This will process the new repository, adding it to the project database. After this, any summary information will include data from all repositories of the project. Like init-project, add-repository also accepts the --configuration and --earliest-commit-date options with the same semantics for the added repository.

Database Migrations

If you update Git Hammer, it is possible that the database schema is updated in the new version. This means that you will need to migrate any existing databases to the latest version. Migration is performed by running

HAMMER_DATABASE_URL=<URL of database to migrate> alembic upgrade head

(If you haven't installed Git Hammer with pip, run this command in the project directory and add PYTHONPATH=. at the beginning.)

It is safe to run this even if the database schema has not changed in the update, so there is no need to try and figure that out before running the migration.

License

Git Hammer is licensed under the Apache Software License, version 2.0. See the LICENSE file for precise license terms and conditions.

git-hammer's Issues

--earliest-commit-date returns unrecognized argument error

I am trying to run the following init command:
python3 -m githammer init-project test ~/dev/test --earliest-commit-date 2019-01-01

I get the following response:

usage: githammer [-h]
                 {init-project,update-project,add-repository,list-projects,list-sources,graph,summary}
                 ...
githammer: error: unrecognized arguments: --earliest-commit-date 2019-01-01

Generate static markdown reports?

It would be cool to be able to generate static reports in Markdown format and store them in a docs or reports folder in the local repository.

This way the reports could be easily ingested and sent to documentation sites for display.

Allow configuring how far back to go when first adding repository

It is not always necessary to have a full history available of a repository. It would be nice if it was possible to pass in a parameter that determines how far back to go when initially adding a repository to the database, and default this to some sensible value.

Commit processing slowing down as more commits are processed

It appears that, as the repository processing proceeds, later commits take more time to process than earlier commits. This leads to very large processing times on large repositories. This needs to be investigated and fixed.

Two things to check before making any changes:

Is the effect real? Commit processing time should depend only on the size of the commit. Verify first that the increased time required by later commits is not only caused by the later commits being more complex.
Where is the increase? What is causing a later commit to take more time than an earlier commit? Is it internal data structures, git, database, ...?

Commit with a newly created empty directory causes update-project to fail

Fails with traceback:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "~/git/git-hammer/githammer/__main__.py", line 98, in <module>
    parsed_args.func(parsed_args)
  File "~/git/git-hammer/githammer/__main__.py", line 32, in update_project
    hammer.update_data()
  File "~/git/git-hammer/githammer/hammer.py", line 312, in update_data
    self._process_repository(repository, session)
  File "~/git/git-hammer/githammer/hammer.py", line 242, in _process_repository
    parent_commit.test_counts)
  File "~/git/git-hammer/githammer/hammer.py", line 161, in _make_diffed_commit_stats
    current_test_counts)
  File "~/git/git-hammer/githammer/hammer.py", line 122, in _blame_blob_into_line_counts
    blame = repository.git_repository.blame(commit_to_blame, path, w=True)
  File "~/git/git-hammer/venv/lib/python3.6/site-packages/git/repo/base.py", line 775, in blame
    data = self.git.blame(rev, '--', file, p=True, stdout_as_string=False, **kwargs)
  File "~/git/git-hammer/venv/lib/python3.6/site-packages/git/cmd.py", line 548, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "~/git/git-hammer/venv/lib/python3.6/site-packages/git/cmd.py", line 1014, in _call_process
    return self.execute(call, **exec_kwargs)
  File "~/git/git-hammer/venv/lib/python3.6/site-packages/git/cmd.py", line 825, in execute
    raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git blame -p -w <commit> -- src/new_directory
  stderr: 'fatal: no such path src/new_directory in <commit>'

Support configuration updates

Even with the ability to list what files are considered "source" and "tests", it is not always possible to get the configuration file correct on the first try. Currently, if the configuration for already-processed commits needs to be changed, Git Hammer needs to be rerun from scratch. For large repositories, this will waste a lot of time.

It could be possible to support configuration updates for existing data. This would mean processing only files whose status changed between source and non-source, or test and non-test. Especially with small changes in the set of source and test files, this should save a significant amount of time compared to rerunning the whole process.

Create test suite

It would be nice to have some automated tests. An artificial git repository would be nice to have so that all kinds of corner cases can be tested without needing to have a proper repository where those appear.

Add count difference graphs in addition to current absolute value graphs

There was some discussion in issue #12 after it was closed to "ignore the first commit". Apparently the idea was to plot only the changes that had happened, ignoring the counts of the first commit. This makes sense, especially when the first commit was from a shallow clone or imported from somewhere else, when the commit maker incorrectly gets all of that commit assigned to themselves.

Making this kind of graph would not be too much work on top of the existing code. But it seems like a better idea to first refactor the summary data gathering to its own testable functions and only then implement this on top of that.

Try maintaining blame information internally

Most of the time of commit processing is spent on running git blame on the changed files. I've tried a few things to speed this up, but it seems like there are no easy major improvements to be had here.

One possibility would be to avoid git blame completely, except for the first commit. Since Git Hammer is processing the commits in order, it could track, in addition to the statistics it collects, also complete per-line author information for all the files. This would make it possible to determine a new commit's statistics based only on the diff information without having to run blame.

It's probably too much to save the complete line-by-line author information in the database. This would be more an optimization during processing a large number of commits. But it could be beneficial to save this information for the head commit, so that an update could start processing in the same faster way as the existing commits. This should be experimented with, but it's not mandatory to resolve this issue.
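
A minimal sketch of the idea, under stated assumptions: each file is tracked as a list holding one author per line, and a diff is reduced to hunks of (old_start, old_length, new_length) with a 0-based old_start. The function name apply_hunks and this hunk format are hypothetical; renames, binary files and merge commits are not handled.

```python
from collections import Counter


def apply_hunks(line_authors, hunks, commit_author):
    """Return the per-line author list after applying a commit's hunks."""
    updated = list(line_authors)
    # Apply hunks from the bottom of the file upwards so that earlier
    # indexes stay valid while later parts of the list change length
    for old_start, old_length, new_length in sorted(hunks, reverse=True):
        updated[old_start:old_start + old_length] = [commit_author] * new_length
    return updated


# A four-line file; carol replaces line 2 with two lines and deletes line 4
before = ["alice", "alice", "bob", "bob"]
after = apply_hunks(before, [(1, 1, 2), (3, 1, 0)], "carol")
line_counts = Counter(after)
```

The per-author line counts then follow from the list without running blame.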

remove a repo from the project

Hi,

I accidentally added a repo with hundreds of binary/verbose backup file commits in it. It was all added without problems, and kudos to your project for handling it, but now a lot of my stats are... bad.

Can I set an exclude */ glob and update the project? Does update only handle commits made since the last update, or does it check the history to make sure everything lines up?

Is there a way I can remove the one repo without corrupting the other 50 or so repos I added to the same project? If not through the module itself, is there a delete query I could run if I connect to the SQLite db?

Thanks and sorry for the trouble, this project is great and I'm glad it exists.

Group entries based on author name

Noticed following issue - all summaries / graphs usually show output like:

Author	Tests
Author B	2
Author A	12
Author B	5
Author A	7
Author C	12
Author C	5
Author C	3

I suppose the reason is that an author's email changed at some point.
Would it be possible to add an option to group results by name?
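
The requested grouping could look roughly like this. The function name group_by_name and the email addresses are made up for illustration; the counts are the ones from the table above.

```python
from collections import defaultdict


def group_by_name(rows):
    """Sum counts over (name, email, count) rows, ignoring the email."""
    totals = defaultdict(int)
    for name, _email, count in rows:
        totals[name] += count
    return dict(totals)


# Counts from the table above; the email addresses are hypothetical
rows = [
    ("Author B", "b@old.example", 2),
    ("Author A", "a@old.example", 12),
    ("Author B", "b@new.example", 5),
    ("Author A", "a@new.example", 7),
    ("Author C", "c@one.example", 12),
    ("Author C", "c@two.example", 5),
    ("Author C", "c@three.example", 3),
]
grouped = group_by_name(rows)
```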

Possibly process diffs faster

Running a full blame on every file that changed in a diff is safe but it can also be inefficient if there were only small changes in large files. It might be possible to use the information in the diff to limit the blame to only the changed parts. This should at least be investigated.
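
One possible direction is git blame's -L option, which restricts the output to a line range and can be given more than once. The sketch below only builds such a command line; the function name blame_command, the commit id and the path are hypothetical. Whether range-limited blame is enough to keep the statistics correct is exactly what would need investigating.

```python
def blame_command(commit, path, changed_ranges):
    """Build a git blame command limited to the given 1-based line ranges."""
    command = ["git", "blame", "-p", "-w"]
    for first, last in changed_ranges:
        # -L <start>,<end> limits blame to that range of lines
        command += ["-L", "{},{}".format(first, last)]
    return command + [commit, "--", path]


# Hypothetical commit id, path and changed line ranges
command = blame_command("abc123", "src/app.py", [(10, 12), (40, 40)])
```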

Installation Problem: ImportError: cannot import name '_ColumnEntity'

(venv) user@ubuntu:$ python -m githammer init-project foo /home/user/work/mgmt/
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.6/runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/home/user/venv/lib/python3.6/site-packages/githammer/__init__.py", line 2, in <module>
    from .hammer import Hammer, DatabaseNotInitializedError, OldDatabaseSchemaError
  File "/home/user/venv/lib/python3.6/site-packages/githammer/hammer.py", line 24, in <module>
    from sqlalchemy_utils import create_database, database_exists
  File "/home/user/venv/lib/python3.6/site-packages/sqlalchemy_utils/__init__.py", line 1, in <module>
    from .aggregates import aggregated  # noqa
  File "/home/user/venv/lib/python3.6/site-packages/sqlalchemy_utils/aggregates.py", line 372, in <module>
    from .functions.orm import get_column_key
  File "/home/user/venv/lib/python3.6/site-packages/sqlalchemy_utils/functions/__init__.py", line 1, in <module>
    from .database import (  # noqa
  File "/home/user/venv/lib/python3.6/site-packages/sqlalchemy_utils/functions/database.py", line 11, in <module>
    from .orm import quote
  File "/home/user/venv/lib/python3.6/site-packages/sqlalchemy_utils/functions/orm.py", line 14, in <module>
    from sqlalchemy.orm.query import _ColumnEntity
ImportError: cannot import name '_ColumnEntity'
(venv) user@ubuntu:$

Allow ignoring certain paths

I have a git repo with millions of lines of vendored Go code. The vendor directory should be ignored in the analysis. Right now the first 1000 commits were processed in 12 minutes, but I have 15000 commits, so the whole thing will take many hours to finish without the ability to ignore the vendor directory. #2 would also help with this.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 25: invalid start byte

I wanted to test it, so I cloned it, ran it against one of my projects, and got this:

Repository xxxxx
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/git-hammer/githammer/__main__.py", line 98, in <module>
    parsed_args.func(parsed_args)
  File "/git-hammer/githammer/__main__.py", line 37, in add_repository
    hammer.add_repository(options.repository, options.configuration)
  File "/git-hammer/githammer/hammer.py", line 312, in add_repository
    self._process_repository(dbrepo, session)
  File "/git-hammer/githammer/hammer.py", line 252, in _process_repository
    line_counts, test_counts = self._make_full_commit_stats(repository, commit)
  File "/git-hammer/githammer/hammer.py", line 143, in _make_full_commit_stats
    lines = [line.decode('utf-8') for line in io.BytesIO(git_object.data_stream.read()).readlines()]
  File "/git-hammer/githammer/hammer.py", line 143, in <listcomp>
    lines = [line.decode('utf-8') for line in io.BytesIO(git_object.data_stream.read()).readlines()]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 25: invalid start byte

Any advice?
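
One way to avoid this kind of crash, sketched here as an assumption rather than as the project's actual fix, is to decode with errors="replace" so that bytes that are not valid UTF-8 become U+FFFD instead of raising. The function name decode_lines is hypothetical. Byte 0x93 is a curly quote in Windows-1252, which suggests the file in question was not UTF-8 encoded.

```python
import io


def decode_lines(data):
    """Split bytes into lines and decode each, tolerating invalid UTF-8."""
    # Invalid bytes are replaced with U+FFFD, so the line count is preserved
    return [line.decode("utf-8", errors="replace")
            for line in io.BytesIO(data).readlines()]


# A made-up file with Windows-1252 curly quotes (0x93 and 0x94) on line 1
lines = decode_lines(b"print(\x93hello\x94)\nok = 1\n")
```

Since only line counts per author are collected, replacing undecodable characters should not change the statistics, although a test line regex could in principle fail to match a line altered this way.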

Improve error messages

Currently, Git Hammer will mostly just let Python exceptions propagate upwards, which can make it very difficult to understand what exactly the problem is. An error from somewhere in the processing should at a minimum identify the problematic commit, any relevant file(s) and lines there, and give a better reason for why the error happened. This is probably not too difficult work, but setting up everything, especially the tests, seems like it can take a while.

sort line-count-by-author list by line-count

This will also improve color distribution for the more significant committers. It is especially useful when you've got more than a dozen committers, so that only the 'a, b or c' names show up. In my case, some are dupes; my mailmap isn't perfect across all the repos at once yet. Also, I can't see the line count with the list like that.

Handle shallow clones properly

Currently Git Hammer requires a full clone of the repository, since it processes every commit that is referenced. It would be good to be able to support shallow clones, where history far enough back is no longer available.

This is somewhat related to issue #5 since the resolution to that issue would also require starting from one or more commits that have parents.

Error when initializing repo with submodules

python -m githammer init-project project_name ~/project_location throws an error:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/user/venv/lib/python3.7/site-packages/githammer/__main__.py", line 155, in <module>
    parsed_args.func(parsed_args)
  File "/Users/user/venv/lib/python3.7/site-packages/githammer/__main__.py", line 48, in add_repository
    hammer.add_repository(options.repository, options.configuration)
  File "/Users/user/venv/lib/python3.7/site-packages/githammer/hammer.py", line 362, in add_repository
    self._process_repository(dbrepo, session)
  File "/Users/user/venv/lib/python3.7/site-packages/githammer/hammer.py", line 295, in _process_repository
    line_counts, test_counts = self._make_full_commit_stats(repository, commit)
  File "/Users/user/venv/lib/python3.7/site-packages/githammer/hammer.py", line 170, in _make_full_commit_stats
    for git_object in commit.tree.traverse(visit_once=True):
  File "/Users/user/venv/lib/python3.7/site-packages/git/objects/util.py", line 327, in traverse
    if visit_once and item in visited:
  File "/Users/user/venv/lib/python3.7/site-packages/git/objects/submodule/base.py", line 162, in __hash__
    return hash(self._name)
  File "/Users/user/venv/lib/python3.7/site-packages/gitdb/util.py", line 253, in __getattr__
    self._set_cache_(attr)
  File "/Users/user/venv/lib/python3.7/site-packages/git/objects/submodule/base.py", line 132, in _set_cache_
    raise AttributeError("Cannot retrieve the name of a submodule if it was not set initially")
AttributeError: Cannot retrieve the name of a submodule if it was not set initially

Unfortunately, I cannot provide the repo itself for reproduction, but I can take some steps to narrow down the problem if needed.

Maybe gitpython-developers/GitPython#597 and the issue linked at the very bottom can be of help.