Weighted PageRank Graphs
Domain introduction
We have a heterogeneous graph G
which represents contributions/contributors
in SourceCred, and connections between them.
It's heterogeneous in the sense that there are many different "types" of nodes
and edges. In the interest of concreteness, here is a sample of node types
and edge types we have or expect to add.
Node types:
- GitHub users are identities we want to assign cred to
- GitHub issues contain documentation, design, bug reports, etc
- GitHub pull requests merge new code into the repository
- Git blobs contain source code
- Spotlights represent cred for contributions that don't otherwise fit
into the data model, e.g. a spotlight for participating in an offline
design discussion, or for giving a talk on SourceCred
Edge types:
- Authorship edges connect an issue or pull request with the users that
created it
- MergedAs edges connect a pull request to the git commit it created
- Import edges connect a Git blob containing source code to the blobs
containing the code it imports
- Spotlight edges connect a Spotlight node to the entities it is flowing cred
to
- Spotlight authorships flow some cred to the entity that created the spotlight
The need for weights
As currently implemented, SourceCred does not do a good job of distinguishing
valuable contributions. As an example, when run on the repository in its
current state, it believes that #292 is one of the most valuable pulls in the
project, outshining contributions like #252 and #468 which were actually much
more valuable. Even worse, it massively over-values Git objects; for example,
the blob containing our Prettier configuration is seen as >100x more valuable
than the most valuable pull request.
@wchargin and @decentralion believe that this problem results from the fact
that the graph is currently unweighted. As such, we need to implement a system
for applying weights to different nodes and to the src/dst of edges. There are
several different use cases for weights:
Kinds of weights:
Prior over types
Weights can provide prior information on the relative value of different types.
For example, we should have a weight that suggests that GitHub pull requests
are, on average, much more important than Git blobs.
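As a concrete (if hypothetical) sketch, a type-level prior could be as simple
as a maintainer-configurable table; the type names and numbers below are
illustrative, not part of our current data model:

```typescript
type NodeType =
  | "GITHUB_USER"
  | "GITHUB_ISSUE"
  | "GITHUB_PULL"
  | "GIT_BLOB"
  | "SPOTLIGHT";

// Relative prior weights per node type; a pull request is weighted well
// above a blob. The specific values are placeholders.
const typePrior: Record<NodeType, number> = {
  GITHUB_USER: 1.0,
  GITHUB_ISSUE: 1.0,
  GITHUB_PULL: 4.0,
  GIT_BLOB: 0.25,
  SPOTLIGHT: 2.0,
};
```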
Domain-specific evaluation
Weights can use domain-specific signal to differentiate the value of different
nodes of the same type. For example, # of lines of code is a useful signal for
the value of a pull request; a PR that modifies 1000 lines of code is probably
more valuable than one that modifies 3 lines.
Distinguish parts of the project
Weights can be used to differentiate different parts of the project for clearer
analysis. For example, suppose a project wants to separately measure cred for
"backend" and "frontend" work, without artificially partioning the project into
separate graphs. It would be nice to do this via a "weighting" that assigns a
high prior weight to the backend directory and key backend APIs. (This would
incorporate the project or "domain" concept via weighting).
Thus, implementing a well-designed weighting abstraction would allow us to
implement many important features in a cohesive way. What follows is a "Q&A"
style discussion of major open questions in how the weight abstraction should
work.
Open questions
What is the type signature of a weight function?
In some cases, it's natural to think of a weight function as providing a
relative weight over every node in the project. For example, the project may
have maintainer-configurable weight functions that give a prior over every type
of node in the graph.
In other cases, it's natural to think of a weight function as providing
relative weights for a subset of nodes in the graph. For example, the
lines-of-code weighting can differentiate between different pull request nodes,
but has no information to offer on the relative importance of different issues.
Some ideas on how we could express weight functions:
- `node \in Graph -> number`, where for some subset of nodes in the graph, the
  weight function assigns a weight. That weight might be seen as an update on
  the a priori importance of the node. (What would the formal specification
  be?)
- `Graph -> distribution over nodes`, where for every node in the graph, the
  weight function assigns an update on the prior importance of that node.
  (Would it assign 0 to every node that is out-of-scope? Would that compose
  nicely?)
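To make the two options concrete, here is how they might look as TypeScript
signatures (the `GraphNode` and `Graph` shapes are placeholders, not our
actual types):

```typescript
type GraphNode = {address: string};
type Graph = {nodes(): GraphNode[]};

// Option 1: a partial weight function, defined only for a subset of nodes.
// Returning null signals "no information about this node".
type PartialWeightFn = (node: GraphNode) => number | null;

// Option 2: a total weighting that assigns an update to every node in the
// graph (out-of-scope nodes might get a neutral weight of 1, or 0 if the
// weights form a distribution).
type TotalWeightFn = (graph: Graph) => Map<GraphNode, number>;
```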
How do we compose weight functions?
It's important that we can combine and compose weight functions. For example,
suppose we have one weight function on commits that gives the logarithm of the
number of lines changed by the commit, and another weight function measures the
[cyclomatic complexity]. At the very least, we should be able to use both
weight functions in tandem. It would be cool if we could express richer
relationships, such as that commits that add a lot of lines of code but do not
add any test code actually have negative signal. (Unless they are refactors?)
[cyclomatic complexity]: https://en.wikipedia.org/wiki/Cyclomatic_complexity
We should also be able to compose weight functions "cross-product-style".
Suppose we have one pair of weight functions that distinguishes the frontend
and backend components, and we have another pair of weight functions that
distinguishes code implementation from documentation.
Then it should be possible to answer who has the most documentation cred in the
backend, who has the most implementation cred in the frontend, and so forth.
One insight from @krzhang was to think of the weight functions as priors in a
sequence of Bayesian updates; he suggested that if we store the weights as
log-priors, then combining them is as simple as adding the log-priors.
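A minimal sketch of that idea, assuming weights are stored as log-priors so
that a value of 0 means "no update" and out-of-scope nodes compose harmlessly
(`linesChanged` is a hypothetical helper, not an existing API):

```typescript
type GraphNode = {address: string};

// A log-weight of 0 is a neutral update, so partial weight functions
// compose freely: combining weights is just summing log-priors.
type LogWeightFn = (node: GraphNode) => number;

function composeLogWeights(fns: LogWeightFn[]): LogWeightFn {
  return (node) => fns.reduce((sum, f) => sum + f(node), 0);
}

// Example: a lines-of-code signal for pull requests. Returns null for
// nodes that aren't pulls, which becomes a neutral log-weight of 0.
declare function linesChanged(node: GraphNode): number | null;
const locWeight: LogWeightFn = (node) => {
  const loc = linesChanged(node);
  return loc == null ? 0 : Math.log(1 + loc);
};
```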
How do we apply the weight functions to the PageRank algorithm?
Classical PageRank doesn't provide prior weighting on nodes. One way to add
weighting, found in the literature on personalized PageRank, is to add
personalization vectors. This modifies the "random surfing" or
"teleportation" of the PageRank algorithm so that rather than uniformly
teleporting to any node in the graph, the surfer teleports based on the
personalization vector.
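For reference, here is a minimal power-iteration sketch of that idea; the
dense-matrix representation and parameter defaults are assumptions for
illustration, not how SourceCred stores the graph:

```typescript
// Personalized PageRank via power iteration. `adj[i][j]` is the weight of
// the edge from node i to node j; `personalization` is nonnegative and
// sums to 1. Instead of teleporting uniformly, the surfer teleports
// according to the personalization vector.
function personalizedPageRank(
  adj: number[][],
  personalization: number[],
  alpha: number = 0.85,
  iterations: number = 100
): number[] {
  const n = adj.length;
  let rank = personalization.slice();
  for (let iter = 0; iter < iterations; iter++) {
    // Teleportation: (1 - alpha) of the mass flows via the personalization.
    const next = personalization.map((p) => (1 - alpha) * p);
    for (let i = 0; i < n; i++) {
      const outSum = adj[i].reduce((a, b) => a + b, 0);
      for (let j = 0; j < n; j++) {
        // Dangling nodes (no out-edges) redistribute via personalization.
        const transition =
          outSum === 0 ? personalization[j] : adj[i][j] / outSum;
        next[j] += alpha * rank[i] * transition;
      }
    }
    rank = next;
  }
  return rank;
}
```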
In offline discussion, @krzhang suggested a different approach. We can apply a
node weight by applying a weighting to the in-and-out edge weights for that
node; i.e., by modifying that node's row and column in the adjacency matrix.
He suggested that given an adjacency matrix `M`, this could be applied as
`W * M * W'`, where `W` is a diagonal matrix with node weights on the
diagonal.
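Concretely, since `W` is diagonal, entry `(i, j)` of `W * M * W'` is just
`w[i] * M[i][j] * w[j]`; a tiny sketch:

```typescript
// Apply node weights w to adjacency matrix m as W * M * W', where W is the
// diagonal matrix diag(w). Entry (i, j) becomes w[i] * m[i][j] * w[j], so a
// node's weight scales both its incoming and outgoing edge weights.
function applyNodeWeights(m: number[][], w: number[]): number[][] {
  return m.map((row, i) => row.map((mij, j) => w[i] * mij * w[j]));
}
```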
This approach seems more powerful and principled to me, since it is easier to
reason about how it affects the final distribution. We may want to empirically
try both approaches.
Let's use this thread to discuss this model further; eventually we'll likely want to commit the final document as a Markdown or LaTeX document.
Thanks to @krzhang for helping to clarify the problem domain, and providing a number of suggestions that informed this post.