
sourcecred's Introduction

Packages in this repository

This repository is a monorepo, meaning it is set up to contain multiple distinct packages, although for now it only contains one.

Package name: sourcecred
Description: A monolith package containing the CLI that supports instances, our JS library, and all of our supported platform plugins (see the package's subdirectory README).

sourcecred's People

Contributors

al0ysi0us, anthrocypher, ballenaeth, beanow, befitsandpiper, bensodenkamp, blueridger, brianbelhumeur, brianlitwin, chibie, credbot, crisog, dependabot[bot], greenkeeper[bot], hozzjss, jeremitraverse, m-aboelenein, magwalk, meta-dreamer, mkusold, pythonpete32, roveneliah, saintmedusa, teamdandelion, topocount, tylerdmace, vsoch, wchargin, yeqbfgxjiq, youngkidwarrior


sourcecred's Issues

Choose a serialized form for graphs

Graphs should have a serialized form so that we can use Jest’s snapshot testing. Two options seem reasonable to me, so I implemented them.

On branch wchargin-graph-json-arrays, we have

export type GraphJSON = {|
  +nodes: Node<mixed>[],
  +edges: Edge<mixed>[],
|};

which is a completely non-redundant format, but for which it is possible to have multiple JSON representations of logically-equal graphs, because the arrays are really sets and so are invariant under permutation. If we choose this method, we’ll provide an additional toCanonicalJSON method that sorts the nodes and edges by their addresses.
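
For concreteness, here is a minimal sketch of what toCanonicalJSON could look like, assuming the same addressToString helper used elsewhere for map keys:

// Sketch only: assumes the addressToString(address) helper referenced above.
function toCanonicalJSON(json: GraphJSON): GraphJSON {
  function compareByAddress(a, b) {
    const ka = addressToString(a.address);
    const kb = addressToString(b.address);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  }
  return {
    nodes: json.nodes.slice().sort(compareByAddress),
    edges: json.edges.slice().sort(compareByAddress),
  };
}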

On branch wchargin-graph-json-objects, we have

export type GraphJSON = {|
  +nodes: {[nodeAddress: string]: Node<mixed>},
  +edges: {[edgeAddress: string]: Edge<mixed>},
|};

which satisfies that if a.equals(b) then deepEqual(a.toJSON(), b.toJSON()) at the cost of a bit of redundancy and potential for mismatch: i.e., we should require that addressToString(json.nodes[key].address) === key and likewise for edges, and need to guard against bad inputs. On the other hand, this form is somewhat more useful to external clients that want to directly consume the JSON.
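
A sketch of the corresponding input validation (again assuming addressToString), which we would run when deserializing:

// Sketch: throw on malformed input when deserializing the objects form.
function validateGraphJSON(json: GraphJSON): void {
  Object.keys(json.nodes).forEach((key) => {
    if (addressToString(json.nodes[key].address) !== key) {
      throw new Error(`node key ${key} does not match its address`);
    }
  });
  Object.keys(json.edges).forEach((key) => {
    if (addressToString(json.edges[key].address) !== key) {
      throw new Error(`edge key ${key} does not match its address`);
    }
  });
}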

On our advancedMealGraph, the latter is about 15% bigger in terms of file size:

$ filename=src/backend/__snapshots__/graph.test.js.snap
$ git show origin/wchargin-graph-json-arrays:"${filename}" | wc -c
5602
$ git show origin/wchargin-graph-json-objects:"${filename}" | wc -c
6476

This factor will increase as addresses grow longer relative to the payloads; for our advancedMealGraph, the addresses are reasonably short.

It’s worth reading the diffs at 574f916 (arrays) and 60771c5 (objects). In particular, note that using the objects form gets us free interop with Jest and JSON.stringify.

I’m inclined to go with the objects solution due to its simplicity in not requiring an additional canonical form. This has already paid off in the Jest and stringify interops, and it seems totally plausible that similar dividends will continue to arise down the road.

Once we come to a consensus, I can PR one of the two branches.

cc @dandelionmane

Add webpack to Node backend

We can’t use ES6 modules with our backend scripts, because we refuse to
acquiesce to the .mjs autocracy that Node is trying to impose.
We’ve been using the old module.exports system, but this doesn’t let
us import and export Flow types. We’re certainly going to want to set up
Webpack eventually, so... we should do that.

cc @dandelionmane

Consider using a hand-written tokenizer for GitHub references

Our current GitHub reference detection uses regular expressions to
extract known reference patterns from a block of text. It is possible
(if not easy) to reason about the regular expressions used to match a
particular kind of reference. But it is much harder to reason about how
these interact. For instance, in the current system, the string

https://github.com/sourcecred/sourcecred/issues/1#@foo

parses to two references (!):

[
  {
    "refType": "BASIC",
    "ref": "https://github.com/sourcecred/sourcecred/issues/1"
  },
  {
    "ref": "@foo",
    "refType": "BASIC"
  }
]

This is because a URL is allowed to end in a fragment specifier #,
which is not a word character [A-Za-z0-9_], and a username reference
is required to be preceded by a non-word character.

Because the regular expressions used to match the different reference
types all execute independently, it is hard to reason about scenarios
like this one. The reader must consider the cross product of “the end of
one regular expression” and “the start of another”, for instance. Of
course, attempting to combine all the different reference types into one
giant regular expression would likely be a bad idea, especially
considering that paired-with references work by computing a difference
of regular expression matches.

However, suppose that we reconsider the premise of using a regular
expression to find references globally in a string. Instead, we may
convert the input to a stream of tokens, and then make a single pass
over that token stream, identifying all kinds of references at once. We
can use similar, but simpler, regular expressions to check whether a
particular token is a particular kind of reference. Then, these
expressions wouldn’t need to worry about adding non-capturing groups
(?:\W|^) and lookaheads (?=\W|$) for detecting word boundaries (and
the resulting subtleties that this entails).

We would also gain the ability to reason more clearly about interactions
by modifying state as we traverse the token stream: for instance, the
rule “if you see the token Paired followed by the token with, then
the next token is considered to be a paired-with reference instead of a
username reference” is easy to express, and eliminates the need for the
funky multiset-difference in the current implementation. For parsing URL
references, we would gain the ability to defer to an actual URL-parsing
library to handle all the special cases that we’re likely to forget.

In summary, I suspect that the resulting system would be a simple state
machine that is easy to understand.

Note that hand-writing a tokenizer is quite easy, and could plausibly
require not much more than (s) => s.split(" ").
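
As a rough illustration only (the token regexes and the Paired-with rule here are made up for the example), a single pass over whitespace-separated tokens might look like:

// Sketch: reference categories and patterns are illustrative, not final.
function findReferences(body: string): {refType: string, ref: string}[] {
  const tokens = body.split(/\s+/).filter((t) => t.length > 0);
  const refs = [];
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (token === "Paired" && tokens[i + 1] === "with" && tokens[i + 2] != null) {
      refs.push({refType: "PAIRED_WITH", ref: tokens[i + 2]});
      i += 2; // consume "with" and the username token
    } else if (/^@[A-Za-z0-9-]+$/.test(token)) {
      refs.push({refType: "BASIC", ref: token});
    } else if (/^https:\/\/github\.com\/\S+$/.test(token)) {
      refs.push({refType: "BASIC", ref: token});
    }
  }
  return refs;
}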

This change would be highly self-contained; only the parseReferences
module would need to change. It also probably falls pretty low on the
priority list: though I do believe that it would greatly improve the
robustness of that module, that module’s robustness simply isn’t
critical to us right now. In the long term, where we have the
resources to expend on polishing all aspects of the system, this is
definitely something that I would like to see. (Or, in other words:
contributions welcome!)

cc decentralion

Decide on a unicode symbol for Cred

It's important that Cred have a unicode symbol. This will save us from needing to write "SourceCredCred" and other such ungainly constructs.

I've been provisionally using ¤ as the symbol; I like that it is reminiscent of a network, something that is both a circle and has flows in/out... the circle represents egalitarian community and the edges represent new contributions? 😃

Here are some other options:
¢¢ ₡ ¢ ₢ ₵

Here are all the options in context:

SourceCred¤
SourceCred¢
SourceCred¢
SourceCred₡
SourceCred¢
SourceCred₢
SourceCred₵

My vote is for ¤. Please feel free to suggest characters I haven't thought of yet. There are a lot of them.

Also, yes this is the bikeshed.

Consider removing Address↔string conversion functions

understandabad (adj.): A seemingly confusing decision that makes sense only once you realize the implementation details behind a leaky abstraction.

ex. Our Address type requires that none of its components contain a literal dollar sign. Why? Because the addressToString and stringToAddress conversions use the dollar sign as a delimiter for the three components, and string forms are used as keys in the Graph’s maps.

We could relatively easily allow arbitrary addresses by using nested maps:

type AddressMap<T> = {
  [repositoryName: string]: {[pluginName: string]: {[id: string]: T}},
};
class Graph {
  nodes: AddressMap<Node<mixed>>;
  edges: AddressMap<Edge<mixed>>;
  /* ... */
}
/* private */ function addressMapGet<T>(map: AddressMap<T>, key: Address): ?T {
  const repositoryBucket = map[key.repositoryName];
  if (repositoryBucket === undefined)
    return undefined;
  const pluginNameBucket = repositoryBucket[key.pluginName];
  if (pluginNameBucket === undefined)
    return undefined;
  return pluginNameBucket[key.id];
}
/* private */ function addressMapSet<T>(map: AddressMap<T>, key: Address, value: T) {
  let repositoryBucket = map[key.repositoryName];
  if (repositoryBucket === undefined)
    repositoryBucket = map[key.repositoryName] = {};
  let pluginNameBucket = repositoryBucket[key.pluginName];
  if (pluginNameBucket === undefined)
    pluginNameBucket = repositoryBucket[key.pluginName] = {};
  pluginNameBucket[key.id] = value;
}

This would be a completely under-the-hood implementation change, and would remove the $-restriction. It may well be a faster implementation, too.
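
Hypothetical usage, with placeholder address components and payload:

// Hypothetical usage; the address components and payload are stand-ins.
const nodes: AddressMap<Node<mixed>> = {};
const address = {
  repositoryName: "sourcecred/example",
  pluginName: "github",
  id: "issue$1", // a literal dollar sign would now be fine
};
addressMapSet(nodes, address, {address, payload: {title: "Example"}});
const retrieved = addressMapGet(nodes, address); // returns the node set above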

Plugins: GitHub plugin

The GitHub plugin will be responsible for sourcing the following types of information:

Nodes:

  • Pull requests
  • Issues
  • Issue comments
    n.b. the body of an issue (or PR) is treated as its first "issue comment"
  • Commit comments
  • Review comments
  • GitHub user identities

Edges:

  • authorship
    users author the other nodes, e.g. user authors commit
  • containership
    issues and pull requests contain comments
    (might merge with reference below)
  • reference
    comments can reference other nodes, e.g. the line below where this issue comment references PR-35

This is based off the overall plugin architecture design (#26); implementation has started in 'Add types for GitHub plugin' (#35)

Bad failure message when retrieving graph with an invalid GitHub token

Consider the command
node bin/sourcecred.js sourcecred sourcecred github-token=BAD

Expected:
It will give a helpful error message stating that the github token was bad.

Actual:

    /home/dandelion/git/sourcecred/bin/commands/combine.js:116
    process.on("unhandledRejection",err=>{throw err;});class CombineCommand extends __WEBPACK_IMPORTED_MODULE_0__oclif_command__["Command"]{// We disable strict-mode so that we can accept a variable number of
                                          ^
    
    TypeError: Cannot read property 'repository' of undefined
        at continuationsFromQuery (/home/dandelion/git/sourcecred/bin/commands/plugin-graph.js:437:86)
        at postQueryExhaustive (/home/dandelion/git/sourcecred/bin/commands/plugin-graph.js:468:168)
        at <anonymous>
        at process._tickCallback (internal/process/next_tick.js:188:7)

Git{Hub} plugin: Add branch pointers

It would be nice to create a connection between the GitHub repository object and the HEAD commit of the master branch. This would tie the repository level directly to the code in a reasonable way.

A nice generalization here is to create nodes for each branch, and then give the master branch some extra weight in PageRank.

Add labels to the GitHub data model

See the GitHub v4 API docs for the Label and Issue objects. Note
that labels on an issue (resp. pull request) are provided as a
connection, not a simple list. Note also that a repository has a
top-level labels connection. Therefore, we might want to fetch for
each issue just the ids of its labels, and separately pull all labels
at top level. This will also allow us to detect labels that do not
appear on issues, but may still be (e.g.) referenced in text content.
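
Roughly, the query could be shaped like this (owner/name are placeholders; pagination via pageInfo/endCursor and exact field selections are elided):

// Sketch of the query shape only.
const labelsQuery = `
  query {
    repository(owner: "sourcecred", name: "example-github") {
      labels(first: 100) {
        nodes { id name }
      }
      issues(first: 100) {
        nodes {
          id
          labels(first: 100) {
            nodes { id }
          }
        }
      }
    }
  }
`;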

The Artifact System

The artifact plugin organizes the contribution graph according to the boundaries that are set by human interpretation. An artifact is a particular node that represents a group of contributions that all are organized towards some particular function, goal, or deliverable. I expect most projects will have a "root artifact" that describes the entire project as an artifact. That root artifact will then reference other artifacts which represent particular components or processes of that project.

For example, I expect SourceCred may define the following artifacts (this is intended to give a flavor of variety, and is not at all comprehensive):

  • SourceCred
    root artifact
  • Tooling
    maintaining good tools, including lint integrations, CI setup, efficient builds
  • Tooling/Typing
    maintaining high quality type signatures across the project
  • ContributionGraph
    design, APIs, and algorithms around the contribution graph
  • Community Building
    work around communicating the existence of SourceCred to others, and welcoming them into the community

Each artifact will be represented by a node, and its edges will be maintained by some human-controlled consensus mechanism. For the start, we will store the edges as data in the repository that controls the artifact, and the edges will be maintained via pull request.

We want to incentivize the maintenance and upkeep of the artifacts. For example, if people were incentivized to go and find every single pull request that contributed to Tooling/Typing and reference it in the artifact, that would be lovely. I propose that we do this by creating a meta-Artifact for each artifact that represents the work of upkeeping that artifact. Thus, tooling/typing would flow some credit to tooling/typing/meta. It should be very simple to algorithmically maintain tooling/typing/meta and so there will not be a need for a meta/meta artifact. :)

Formalize PluginAdapter semantics

  • Can two plugins have the same prefix?
  • Can two plugins have the same display name?
  • Can one plugin have two types with the same prefix?
  • Can one plugin have a prefix that is a prefix of another plugin's prefix?
  • Can two plugins have types with the same prefix?

Tune in next episode for answers (and invariant assertions) to all these questions, and more!

Consider parameterizing `Graph<N, E>`

A typical Foo plugin will have the FooNode and FooEdge types, and
construct a Graph<FooNode, FooEdge>—this is already true, but our type
system can't express it. Because Graph currently accepts Node<any>
and Edge<any> for additions, clients adding nodes and edges must
type-annotate them manually, or risk adding ill-typed data.

Merging becomes:

merge: (g1: Graph<N1, E1>, g2: Graph<N2, E2>) => Graph<N1 | N2, E1 | E2>

General graph algorithms will be able to operate parametrically,
yielding free theorems.

This sounds like a good idea. I hope that implementing it, and
expressing that these parameter types are covariant, is tractable within
Flow.
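
A rough sketch of what the parameterized signatures might look like (bodies elided; the variance annotations may need tuning):

// Sketch only: not the final API.
class Graph<N, E> {
  addNode(node: Node<N>): this {
    /* ... */
    return this;
  }
  addEdge(edge: Edge<E>): this {
    /* ... */
    return this;
  }
}

function mergeGraphs<N1, E1, N2, E2>(
  g1: Graph<N1, E1>,
  g2: Graph<N2, E2>
): Graph<N1 | N2, E1 | E2> {
  const result: Graph<N1 | N2, E1 | E2> = new Graph();
  // copy g1's and g2's nodes and edges into `result` ...
  return result;
}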

cc @dandelionmane; if you think that this is a good idea then I will be
happy to implement it.

Reconsider Git plugin graph structure

TL;DR: Remove the notion of “submodule commits” from the Git plugin,
observing that this makes for a more consistent “universal monorepo”
point of view.


Currently, the Git plugin has the following kinds of relationships
(among others):

  • A tree includes tree entries.
  • A tree entry may have a blob as contents.
  • A tree entry may have a tree as contents.
  • A tree entry may represent a submodule commit, in which case it has
    zero or more “submodule commit” nodes as contents.

This is all easy to understand until the last point. Submodule commits
change the has-contents relationship from “a tree entry has exactly one
node as contents” to “a tree entry has zero or more nodes as contents”.

Furthermore, the contents of a tree entry representing a submodule
commit cannot be determined by looking at the tree in Git. For a fixed
tree entry, the contents may even change over time as new commits are
added to the head of the repository—without force-pushes or anything.

This is all quite confusing, and of questionable value (which, of
course, is my fault).

The reason that I originally implemented the system like this is the
following. In Git, the actual tree entry for a submodule commit as
shown by git cat-file -p HEAD^{tree} or git cat-file tree HEAD | xxd
includes only the name of the submodule (as with all tree entries) and
the submodule commit hash. Armed with only the hash, one cannot find
more information about the remote commit, because one does not know how
to locate the repository to which it belongs. Including the URL helped
to remedy this; it gave one a place to look up the commit and find its
author, its contents, etc.

But, in the context of a SourceCred graph, I don’t think that this makes
much sense. Suppose that we have a cred graph for Repositories A and B,
and that Repository A has submodule commits from Repositories B and C.
Given just the hash of a submodule commit from Repository B, the
relevant commit is already in the graph, so we can get all the
information that we want. Given the hash of a submodule commit from
Repository C, we can’t get any more information about this commit—but we
don’t need to, either. What were we going to do with the URL—hope that
the repository is still mirrored there, go clone it, and generate a cred
graph on the fly? Doubtful.

Instead, we should consider moving to the following ontology, which is
simpler and a more faithful model of Git:

  • A tree includes tree entries.
  • A tree entry may have a blob as contents.
  • A tree entry may have a tree as contents.
  • A tree entry may have a commit as contents.

Here are some consequences:

  • We get invariants like “every tree entry has exactly one node as
    contents”. Consequently, both APIs and implementations become
    simpler.

  • We don’t have to read the .gitmodules file at every commit. This
    saves (a) for every commit: one git subprocess call that must read
    and parse a tree object, and (b) for commits that have at least one
    submodule, one additional git subprocess call that must read and
    parse a blob. It is probably a marginal performance improvement, but
    it makes the code significantly simpler.

  • We must relax the invariant that every commit have exactly one tree:
    in the case of a submodule commit, we might not know what the
    appropriate tree is. I posit that this is okay. The invariant
    changes from a uniqueness constraint to a subsingleton constraint.

  • It is increasingly apparent that the Git graph is not a graph of one
    Git repository, but of all Git repositories. Nodes are not scoped by
    owner or repository name; they simply exist in the universe,
    identified by their hashes.

cc @decentralion; thoughts appreciated.

Create a structured form for GraphQL queries

We will need to be able to structurally manipulate GraphQL queries so
that we can implement general-purpose pagination against the GitHub API.
Amazingly, I wasn’t able to find such a library after some searching. In
particular, the official graphql npm package has structured schemata
(of course), but only stringly-typed queries, as far as I can tell.

Happily, GraphQL has a proper spec:
https://facebook.github.io/graphql/October2016/

This makes it easy to correctly define the relevant API. Our needs are
very simple: we just need an ADT for a GraphQL request (queries and
fragment definitions). The Flow-level types should express all the
complexity of the system, and there will be a projection function to
take a query to a GraphQL-formatted string. Some helper functions will
assist creation of the types for a fluent API, but this should basically
be it.
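
To make this concrete, here is a rough sketch of the kind of ADT and projection function I have in mind; the type and field names are placeholders, and arguments, aliases, and fragment definitions are omitted:

// Sketch of the shape, not the final API.
type Field = {|
  +kind: "FIELD",
  +name: string,
  +selections: Selection[],
|};
type FragmentSpread = {|+kind: "FRAGMENT_SPREAD", +fragmentName: string|};
type Selection = Field | FragmentSpread;

// Projection from the structured form to GraphQL source text.
function stringify(selection: Selection): string {
  switch (selection.kind) {
    case "FIELD": {
      const body = selection.selections.map(stringify).join(" ");
      return body === "" ? selection.name : `${selection.name} { ${body} }`;
    }
    case "FRAGMENT_SPREAD":
      return "..." + selection.fragmentName;
    default:
      throw new Error("unknown selection kind");
  }
}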

cc @dandelionmane

Design core data models

Let’s use this space to design the core data models for SourceCred. Following is a brain-dump of the current state of affairs as of 2018-02-22-ish, as developed in conversations between @dandelionmane and myself.

First attempt: the project DAG

Originally, we considered the following scheme.

The project graph is a DAG whose nodes are primarily contributions: contributions include commits, pull requests, issues, comments on PRs and issues, design docs, etc. A weighted edge from a to b indicates that b provided value to a (in proportion to the edge weight). The leaf nodes of the graph correspond to users. There is only one instance of this graph, which serves as a reflection of the repository’s “official”/“objective” value function. Individuals can update the graph by standard pull requests (and maybe also via some lower-friction, continuous workflows; details to be fleshed out).

There is a root node in the graph representing the repository as a whole. The graph will likely also include “subprojects”, which may be functional areas of the application (“frontend”, “backend”, “plugins”), organizational divisions (“buildcop”, “triage”), or anything else that is perceived as important to the organization.

At allocation time, one unit of cred is minted to the root node. The cred “flows down” the graph, roughly according to the following algorithm. Let n be a node all of whose parents have been visited, and let c be the amount of cred that has flowed to n; then, to each child n′ of n with normalized edge-weight p, flow p·c cred to n′. Repeat until all nodes have been visited. Each user receives cred proportional to the amount of cred at their respective leaf node.
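
A sketch of that flow-down pass, over a toy adjacency-list representation (normalization details simplified):

// Sketch: `children[n]` lists n's outgoing edges as {child, weight} pairs,
// and `order` lists every node before any of its children (topological order).
function flowCred(
  order: string[],
  children: {[node: string]: {child: string, weight: number}[]},
  root: string
): {[node: string]: number} {
  const cred = {};
  order.forEach((n) => {
    cred[n] = 0;
  });
  cred[root] = 1;
  order.forEach((n) => {
    const outs = children[n] || [];
    const total = outs.reduce((acc, {weight}) => acc + weight, 0);
    if (total === 0) {
      return; // leaf node (e.g. a user): it keeps its cred
    }
    outs.forEach(({child, weight}) => {
      cred[child] += cred[n] * (weight / total);
    });
  });
  return cred;
}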

One particularly desirable property of this mechanism is that it is transparent in the following sense. Suppose that we want to understand the distribution of impact among users for some subcomponent of the project: e.g., we want to see who has contributed the most value to the backend APIs. We can rerun the algorithm on the same project graph, but mint 1¤ to the “backend APIs” node instead of the root project node. The results will be useful because the process is additive. For instance, if a project has a root node with two children—“backend”, with weight 0.6; and “frontend”, with weight 0.4—then a user’s cred will be c_total = 0.6 · c_backend + 0.4 · c_frontend, where c_u is the credit assigned to the user when the search starts from node u.

Second attempt: the project graph

A major problem with this approach is the requirement that the graph be acyclic. We expect cyclic project graphs to be quite common in reality. (TODO: Provide examples.) In such cases, the process described above no longer directly applies. Instead, we would like to continue flowing cred along edges until we reach some kind of equilibrium.

Note that if we interpret a project DAG as a Markov chain, then the process described above can also be formulated as, “find the distribution of the chain after a large number of steps when starting from a distribution that is at the root node with probability 1”. Markov chains generalize to graphs with cycles, so we will attempt to formulate an improved version of the system along these lines. An immediate problem is that the chain is not likely to be ergodic—users will be leaf nodes, so will not communicate with other states—so we will have to be careful when dealing with limiting and stationary distributions.

A natural first idea is to simply take the limiting distribution π₀Aᵏ for large k, where π₀ is the distribution that has 1 for the root node and 0 elsewhere, and A is the chain’s transition matrix. Then, we could restrict the support of the resulting distribution to the nodes corresponding to users, and after normalization we would have a valid allocation. While this seems reasonable so far, it is not clear how to achieve the transparency discussed above. The strategy of “assign 1¤ to an arbitrary project node and re-run as normal” does not obviously do the right thing: if the chain is ergodic, then changing the initial conditions does not affect the limiting distribution, and a densely connected graph may be “nearly ergodic” enough for the cred to become “all mixed up”. (As you might guess from reading this paragraph, the precise dynamics of what goes right and wrong, and how, are not clear to me. It might be helpful to run a simple version of these algorithms on some toy repos and some real-world repos to explore the behavior.)
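
For that kind of experimentation, a naive power-iteration sketch over a dense row-stochastic matrix would suffice:

// Sketch: A is a dense row-stochastic transition matrix (A[i][j] is the
// probability of stepping from node i to node j); pi0 is the starting
// distribution as a row vector.
function iterateChain(A: number[][], pi0: number[], steps: number): number[] {
  let pi = pi0.slice();
  for (let s = 0; s < steps; s++) {
    const next = new Array(pi.length).fill(0);
    for (let i = 0; i < pi.length; i++) {
      for (let j = 0; j < pi.length; j++) {
        next[j] += pi[i] * A[i][j];
      }
    }
    pi = next;
  }
  return pi;
}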

More info to be posted in this thread.

Refactor "NodeTypes" and "EdgeTypes" to be part of the Address

Right now, our Address only encodes

type Address = {|
  +id: string,
  +pluginName: string,
  +repositoryName: string,
|}

However, in the GitHub plugin, we found it helpful to keep track of NodeTypes, so we overloaded the id field with the stringification of an object

type IDType = {|
  +id: string,
  +type: string, // actually NodeType | Type
|}

We then realized that finding the type of an edge is generically useful across the project, so we made the pluginAdapter interface contain an extractType function which returns a string type for any node.

Given that we now expect to be able to retrieve types for any Node, I think we should take a more legible and explicit approach by moving the type into an explicit field on the Address. We can do this very simply by adding a fourth string field with name type onto the Address. However, there are two nontrivial considerations here:

  1. Can we make the type system aware of the type string? In the GitHub plugin, we had explicit types for every ID type (e.g. IssueNodeID, CommentNodeID) and it seems natural to move that to the address layer now - in which case we get
type Address<T: string> = {|
  +type: T,
  // ...
|};

It would be great if we could express that constructing a node whose payload does not match its address type (e.g. a "FOO" payload with a "BAR" address type) is a type error, but it's been difficult to properly express that in the GitHub plugin. We might wind up creating dual type signatures for each Node/Edge, as in the GitHub plugin, which I expect would look like this:

type Node<Type, PayloadType> = {|
  +address: Address<Type>,
  +payload: PayloadType,
|}

Then we could define a TYPES dict like

TYPES = {|
  FOOTYPE: FooPayloadType,
  BARTYPE: BarPayloadType,
|}

There's another question of whether type names should be implicitly scoped by plugin, or if there should be "generic" types that are shared across plugins. Having generic type names adds complexity because then we need to do namespacing / disambiguation of types across plugins; it would be simpler to just have each type name be specific to the plugin that generates it, because then we can depend on plugin namespacing to prevent conflicts. So I think we should take that approach, and possibly revisit if we have a compelling case where we need to share a type across multiple plugins.

@wchargin - if you don't object to the general approach, I'll likely take a stab at refactoring this into the codebase soon - the sooner we switch to the right object type for Address, the less refactoring we'll need to do down the line.

Add an authorless post to the sourcecred example github

We should make the SourceCred example github repository more realistic by ensuring that it has content authored by a deleted user.

Procedure:

  1. Create a fresh GitHub account for this purpose
  2. Unarchive the example-github repository
  3. Have the to-be-ghost account post issues, comments, pull requests, etc.
  4. Archive the example-github repository
  5. Delete the ghost Github account

Consider having edges refer to nodes rather than addresses

Currently, the Edge type is as follows:

export type Edge<T> = {|
  +address: Address,
  +src: Address,
  +dst: Address,
  +payload: T,
|};

What if we changed it so that src and dst directly point to Nodes, as follows:

export type Edge<T> = {|
  +address: Address,
  +src: Node,
  +dst: Node,
  +payload: T,
|};

If callers actually want the src Address, they can just write edge.src.address, which is actually more explicit. And in the general case where you are running a graph algorithm, this approach is more efficient and useful.

Also, we could consider renaming src to source and dst to target, as this would be consistent with d3's force layout api. However, I don't think this is a very important consideration, because the d3.forceSimulation API is already an imperfect fit for our Node and Edge data structures. (E.g. forceSimulation adds position properties, which we won't want in general.)

cc: @wchargin

Investigate moving to TypeScript

Factors to consider:

  • time/effort spent integrating TypeScript into the build system
  • time/effort spent porting the existing Flow types to TypeScript (can
    this be done incrementally?)
  • time/effort saved by better type information due to TypeScript
  • time/effort wasted by worse type information due to TypeScript
  • time/effort saved or wasted by build- and run-time performance
    differences
  • relative cost of switching now versus later

Implement versioning for all serializables

Currently, SourceCred has no consistent way to approach versioning. This hasn't been a problem thus far as we have no users, and it's reasonable to regenerate all graphs from scratch every session. Once we start persisting graphs (/ have users), we will need versioning and an approach for backwards compatibility with old data structures.

We should implement a consistent versioning system. Ideally it will have at least the following two properties:

  • Allow fine-grained versioning of composite objects, e.g. at present a Graph serializes using Address Maps, and the address maps should be independently versionable (h/t @wchargin )
  • Allow plugins to have separate data versioning, e.g. the GitHub plugin can change its graph representation without this leaking over to other plugins

Consider refactoring away the address–payload distinction

Currently, a node has both an address and a payload. The address ID
needs to uniquely identify the node, and the payload holds anything else
that we want to store.

This can be confusing: if you have a node and want to access a field,
you have to think about where the data lies. It’s a leaky abstraction.

Also, as the address ID must be a string, this often leads to “stringify
the whole shebang and put it in the payload”. If you want to get the
data back out of the node, you have to either additionally store it in
the payload (wasting space and causing extra confusion) or parse the ID
(slow, breaks typing, and generally considered a Questionable Idea).

We should consider eliminating this distinction by doing something more
reasonable.

(This all applies to edges, too.)

Implement GitHub action parsing

E.g. when a user closes a GitHub issue, we should create an edge representing that information.

In effect, the "AUTHORS" edge type should become a particular type of ACTION edge, where ACTION edges generally have a user acting on a GitHub post

design plugin architecture

since SourceCred aims to generalize to many types of "value", we will need to make it wonderfully pluggable. based on past experience, retro-converting a complicated system designed without plugins in mind into a plugin architecture is a lot of work!

the best time to design a plugin architecture is now. starting the discussion here; more thoughts on what kinds of APIs we need to follow.

cc: @wchargin
👋 @jart @chizeng

Weighted PageRank

Weighted PageRank Graphs

Domain introduction

We have a heterogeneous graph G which represents contributions/contributors
in SourceCred, and connections between them.

It's heterogeneous in the sense that there are many different "types" of nodes
and edges. In the interest of concreteness, here is a sample of node types
and edge types we have or expect to add.

Node types:

  • GitHub users are identities we want to assign cred to
  • GitHub issues contain documentation, design, bug reports, etc
  • GitHub pull requests merge new code into the repository
  • Git blobs contain source code
  • Spotlights represent cred for contributions that don't otherwise fit
    into the data model, e.g. a spotlight for participating in an offline
    design discussion, or for giving a talk on SourceCred

Edge types:

  • Authorship edges connect an issue or pull request with the users that
    created it
  • MergedAs edges connect a pull request to the git commit it created
  • Import edges connect a Git blob containing source code to the git blobs
    containing code that blob imports
  • Spotlight edges connect a Spotlight node to the entities it is flowing cred
    to
  • Spotlight authorships flow some cred to the entity that created the spotlight

The need for weights

As currently implemented, SourceCred does not do a good job of distinguishing
valuable contributions. As an example, when run on the repository in its
current state, it believes that #292 is one of the most valuable pulls in the
project, outshining contributions like #252 and #468 which were actually much
more valuable. Even worse, it massively over-values Git objects; for example,
the blob containing our Prettier configuration is seen as >100x more valuable
than the most valuable pull request.

@wchargin and @decentralion believe that this problem results from the fact
that the graph is currently unweighted. As such, we need to implement a system
for applying weights to different nodes and to the src/dst of edges. There are
several different use cases for weights:

Kinds of weights:

Prior over types

Weights can provide prior information on the relative value of different types.
For example, we should have a weight that suggests that GitHub pull requests
are, on average, much more important than Git blobs.

Domain-specific evaluation

Weights can use domain-specific signal to differentiate the value of different
nodes of the same type. For example, # of lines of code is a useful signal for
the value of a pull request; a PR that modifies 1000 lines of code is probably
more valuable than one that modifies 3 lines.

Distinguish parts of the project

Weights can be used to differentiate different parts of the project for clearer
analysis. For example, suppose a project wants to separately measure cred for
"backend" and "frontend" work, without artificially partioning the project into
separate graphs. It would be nice to do this via a "weighting" that assigns a
high prior weight to the backend directory and key backend APIs. (This would
incorporate the project or "domain" concept via weighting).

Thus, implementing a well-designed weighting abstraction would allow us to
implement many important features in a cohesive way. What follows is a "Q&A"
style discussion of major open questions in how the weight abstraction should
work.

Open questions

What is the type signature of a weight function?

In some cases, it's natural to think of a weight function as providing a
relative weight over every node in the project. For example, the project may
have maintainer-configurable weight functions that give a prior over every type
of node in the graph.

In other cases, it's natural to think of a weight function as providing
relative weights for a subset of nodes in the graph. For example, the
lines-of-code weighting can differentiate between different pull request nodes,
but has no information to offer on the relative importance of different issues.

Some ideas on how we could express weight functions:

node ∈ Graph -> number, where for some subset of nodes in the graph, the
weight function assigns a weight. That weight might be seen as an update on the
a priori importance of the node. (What would the formal specification be?)

Graph -> distribution over nodes, where for every node in the graph, the
weight function assigns an update on the prior importance of that node. (Would
it assign 0 to every node that is out-of-scope? Would that compose nicely?)

How do we compose weight functions?

It's important that we can combine and compose weight functions. For example,
suppose we have one weight function on commits that gives the logarithm of the
number of lines changed by the commit, and another weight function measures the
cyclomatic complexity (https://en.wikipedia.org/wiki/Cyclomatic_complexity). At the very least, we should be able to use both
weight functions in tandem. It would be cool if we could express richer
relationships, such as that commits that add a lot of lines of code but do not
add any test code actually have negative signal. (Unless they are refactors?)

We should also be able to compose weight functions "cross-product-style".
Suppose we have one pair of weight functions that distinguishes the frontend
and backend components, and we have another pair of weight functions that
distinguishes code implementation from documentation.

Then it should be possible to answer who has the most documentation cred in the
backend, who has the most implementation cred in the frontend, and so forth.

One insight from @krzhang was to think of the weight functions as priors in a
sequence of bayesian updates; he suggested that if we store the weights as
log-priors, then combining them is as simple as adding the log-priors.
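
That composition rule is easy to express; a sketch, assuming each weight function returns a log-prior, or null when it has nothing to say about a node:

type LogWeightFn = (node: Node<mixed>) => ?number;

// Combine weight functions by summing their log-priors; a function that
// abstains (returns null) contributes nothing for that node.
function composeLogWeights(fns: LogWeightFn[]): LogWeightFn {
  return (node) => {
    const parts = [];
    fns.forEach((f) => {
      const w = f(node);
      if (w != null) {
        parts.push(w);
      }
    });
    return parts.length === 0 ? null : parts.reduce((a, b) => a + b, 0);
  };
}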

How do we apply the weight functions to the PageRank algorithm?

Classical PageRank doesn't provide prior weighting on nodes. One way to add
weighting, as in personalized PageRank, is to add
personalization vectors. This modifies the "random surfing" or
"teleportation" of the PageRank algorithm so that rather than uniformly
teleporting to any node in the graph, the surfer teleports based on the
personalization vector.

In offline discussion, @krzhang suggested a different approach. We can apply a
node weight by applying a weighting to the in-and-out edge weights for that
node; i.e., by modifying that node's row and column in the adjacency matrix.

He suggested that given an adjacency matrix M, this could be applied by W * M * W', where W is a diagonal matrix with node weights on the diagonal.

This approach seems more powerful and principled to me - easier to reason about
how it affects the final distribution. We may want to empirically try both approaches.
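
A sketch of the diagonal-weighting transform on a dense adjacency matrix (renormalizing rows back into a stochastic matrix is left out here):

// Sketch: M is the adjacency matrix and w[i] the weight of node i.
// Returns W * M * W with W = diag(w).
function applyNodeWeights(M: number[][], w: number[]): number[][] {
  return M.map((row, i) => row.map((mij, j) => w[i] * mij * w[j]));
}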


Let's use this thread to further discuss this model; eventually we'll likely want to commit the final document as a markdown or latex document.

Thanks to @krzhang for helping to clarify the problem domain, and providing a number of suggestions that informed this post.

design a graph visualizer

Since SourceCred's main data structure is a graph, we're going to want to build a tool for visualizing it.

some properties of the problem

fractal graph, meaningful at different levels of scale

the graph will span levels of abstraction from the repository and project level (visualize GitHub projects as nodes, edges are dependencies between projects) down to the individual commit and comment level (a pull request is a node with edges to all its comments, and edges to its commits, and to the files it modifies).

the graph visualizer should be able to seamlessly transition between these levels of scale, and explore the boundaries in between (e.g. the specific artifacts that create connections across repository boundaries)

some nodes will have very high degree

for example, user nodes will have an edge to every contribution and comment they have ever authored. we'll need to abstract over cases like this, rather than overwhelm the display

some nodes are much more important than others

so we will want to surface highest-importance nodes within the viewing context, and maybe aggregate less important nodes together. fortunately, we will have a metric (cred) for determining the relative importance of nodes (that's the whole point! 😄)

heterogeneous node and edge types

the nodes and edges are provided by plugins (see: #26) and each plugin may have its own logic for how to display that node or edge. the UI will need a plugin area that provides this context.

my current vision

What I'm imagining right now is a zoom-and-pannable "map" depicting nodes and edges. To get some inspiration, please take a look at the TensorFlow graph explorer.

At any time, there is an active "scope" that determines which part of the graph is being visualized; the scope can be global (show most important K repos, and edges between them), repo specific (show most important K projects and contributions, with edges between them), user specific (show most important contributions that user has made, and edges...), etc.

For any given scope, most of the graph will be hidden, either by being abstracted into aggregate nodes, or just elided from view. Plugins should be able to define their own scoping logic; for example, the GitHub plugin may define that comments get scoped with their containing issue or pull request.

The user should be able to interactively explore the graph, both by drilling down into scopes (e.g. click on an aggregated node to expand it and explore its contents), and by traversing the graph to different locations (e.g. follow a path of package dependencies to core deps).

In contrast to the TensorFlow graph explorer, I don't expect that there will be a canonical fixed layout that the whole graph fits into; instead, the layout is determined dynamically, possibly via force directed layout or similar.

Also in contrast to the TF graph explorer, I'd like exploring the graph to have a continuous and organic feel. Consider how in Google maps, you can gently zoom in across different levels of scale, vs. in the TF graph explorer, nodes expand/collapse as single discrete events.

next steps: unblock an MVP

We're going to want an MVP of the graph explorer as soon as our GitHub plugin starts collecting data - it will be invaluable in the process of designing and testing cred allocation algorithms, as well as exploring the data our plugins collect and seeing what kinds of insights they surface.

choose our tech

To that end, we should decide on a basic technology stack, one that will enable fast iteration and prototyping while still putting us on sound footing for long term maintenance. I think starting with d3+svg for the frontend will be best for fast prototyping, although we should consider keeping the rendering cleanly decoupled so that we can consider switching to more performant options down the line.

Layout, node aggregation, and graph logic needs to be in well-tested, flow-typed javascript. That kind of logic can get quite hairy and we're gonna want to strive for great test coverage and modularity, lest we get caught in a complexity trap.

I plan to reach out to @doug, @kanitw, @dsmilkov, @nsthorat, and @chihuahua, all of whom were very influential in creating and maintaining the TF graph explorer, for their technical judgment on what approaches are best.

cc: wchargin

Parse Markdown of GitHub text content

Our current reference detection uses regular expressions to match
patterns anywhere in a string. This is a reasonable approximation, but
it yields false positives when the text in question is enclosed in a
code block, and may yield false negatives if the text is split across
multiple lines.

We should use a Markdown parser like CommonMark to extract the
Markdown AST, eliminate softbreaks, and run reference detection only in
the text nodes of the AST.
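
A hedged sketch of the extraction step, using the commonmark npm package; we would then run the existing parseReferences over the collected text:

// Sketch using commonmark.js; API per its documentation.
const commonmark = require("commonmark");

// Collect the literal contents of text nodes; code blocks and inline code
// store their contents in `literal` rather than child text nodes, so they
// are skipped automatically.
function textContents(markdown: string): string[] {
  const parser = new commonmark.Parser();
  const walker = parser.parse(markdown).walker();
  const texts = [];
  let event;
  while ((event = walker.next())) {
    if (event.entering && event.node.type === "text") {
      texts.push(event.node.literal);
    }
  }
  return texts;
}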

develop frontend testing practices

It's tempting to charge ahead with the frontend and wait until after the UI is more established to begin testing it. However, based on past experience, it's easy to get to a state where the system is so complex that adding tests is a daunting undertaking. It will be much better to test as we go.

I'd like to set up functional tests for the following behaviors:

  • expand/collapse FileExplorer UI
  • change selected filepath in FileExplorer
    • .. and UserExplorer receives it
      • ... and UserExplorer displays new sums

I'm not sure what the best way to structure these tests is. Should they be very tightly scoped unit tests of each individual primitive of the behavior, or should they be end-to-end, going straight from the user input on the FileExplorer to the updated sums in the UserExplorer?

I think we can have both, in a fairly labor efficient way, if we mix unit testing and snapshot testing. For each individual component-specific interaction, we can test the logic via unit tests. For end-to-end integration, we can describe a sequence of user inputs and then take a snapshot in the terminal state. We should be able to write these easily and quickly, and maintain them without too much effort.

cc @wchargin

Set up CI integration

We should set up CI integration. What I have in mind is pretty basic:

  • should be fast (build & test <1min would be ideal, slower than 3 mins is bad IMO)
  • should run yarn flow, yarn test, and a command to verify prettier is satisfied
  • should generalize nicely to testing our backend code too

Plugins: npm plugin

Let's create a plugin with package.json files as nodes, and npm dependencies as edges.

I believe this will give a very interesting view into meta-project dynamics, and developing it early will force us to keep the meta-scale in mind as we develop tooling and visualizations. Also, it should be fairly easy to discover this data. We can start by exploring all of SourceCred's transitive dependencies.

Need a list of available tasks

I'm thinking we need a shifting list of tasks available to work on (to encourage contributions).

Example: I've dropped off for a bit, and I have no idea what needs to be done anymore :)

For each available task, I think it should have

  • priority (P0 - P4)
  • description
  • reviewer
  • who's working on it (if anyone)
  • expected delivery date (if anyone is working on it)

GitHub plugin: Use urls as ids

Currently, the GitHub plugin uses opaque GitHub id strings as the edge/node ids. I believe we should change to using canonical GitHub urls as ids.

My reasoning is that it will make it possible to reference a GitHub asset without needing to introspect on GitHub's opaque, internal-only id structure. A simple example is if repo A wants to reference an issue in repo B. The user can do that, canonically, via a url. However, at present it isn't possible to parse that url into a graph reference without executing an additional query against GitHub to retrieve the id of the referenced issue.

Relevant: this issue in which I call for changing addresses to be real addresses in general, rather than opaque ids. #73. (I.e., the address type should be able to specify "object A in repository B at hash C", "url X", or "object in IPFS with hash X".)

cc: @wchargin

Investigate nondeterministic GitHub query rejections

In #349, we saw what appeared to be a bug in continuation resolution
when run on ipfs/go-ipfs:

TypeError: Cannot read property '_n0' of undefined
    at embeddings.reduce (/home/wchargin/git/sourcecred/bin/commands/plugin-graph.js:442:153)
    at Array.reduce (<anonymous>)
    at resolveContinuations (/home/wchargin/git/sourcecred/bin/commands/plugin-graph.js:442:82)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)

I’ve investigated this. It is not a bug in our code, per se. What is
happening is the following. Sometimes, when we post a query to the
GitHub GraphQL API, the response is (prettified):

{
  "documentation_url": "https://developer.github.com/v3/#abuse-rate-limits",
  "message": "You have triggered an abuse detection mechanism. Please wait a few minutes before you try again."
}

Note that this links to the v3 API, even though we’re using the v4 API.
Also note that we are not using the API to rapidly create content,
polling (at all), or making multiple concurrent requests, and it seems
surprising that the data that we are requesting should be
computationally expensive (e.g., we are not forcing computation of Git
diffs; we’re just fetching standard fields off the standard objects).

First of all, this is not a valid GraphQL response, so it is not
surprising that we do not have logic to handle it.

Second, whether this error is produced appears to be nondeterministic.
Running the same sequence of queries multiple times can lead to exactly
the same results (failures and all), or it can be the case that the
first sequence fails and the second succeeds. One potential explanation
for this is that the abuse detection depends on actual compute resources
used by GitHub’s servers, and so initial queries cause the caches to be
hot for the particular resources in question. Or, it could just be a
fluke.

We should probably, nevertheless, attempt to fix this. For instance, we
could detect this error and, as the message suggests, wait some time
before re-issuing the query.
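
One possible mitigation, sketched below; it assumes a postQuery helper that performs a single request, and the retry policy here is arbitrary:

// Sketch: retry when the response is the abuse-detection payload (a `message`
// field but no `data`) rather than a GraphQL result.
async function postQueryWithRetry(payload: any, token: string): Promise<any> {
  for (let attempt = 0; attempt < 5; attempt++) {
    const response = await postQuery(payload, token);
    const isAbuseRejection =
      response != null && response.data == null && response.message != null;
    if (!isAbuseRejection) {
      return response;
    }
    const delayMs = 60 * 1000 * (attempt + 1);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("GitHub abuse detection: retries exhausted");
}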

Switch to undirected edges

Forking from a discussion over at #81 (comment):

Right now, our graph encodes explicit directionality, via the "src" and "dst" fields on the edges. However, in many cases this directionality feels arbitrary (in the Issue<->Comment relationship, which is src and which is dst?), and it complicates usage of the API (to find the comment nodes from an issue, you need to remember the somewhat arbitrary distinction).

We think it would be better if the edges were treated as undirected at the core graph layer, but had edge-type-specific metadata that reveals the relationship between different sides of the edge. In the Issue-Comment edge example, rather than having a "src" and a "dst", the edge could have "parent" and "comment" attributes. At the level of graph flow, plugin-provided heuristics would help determine what the effective weight in each direction should be. This reasoning is pretty similar to why we decided to remove "weight" as an explicit field on the edges; see #47 for context.

Create a `sourcecred-ci` or `sourcecred-bot` user

Currently, Travis CI uses my personal auth key to run cron jobs that hit
the GitHub GraphQL API. This is okay as a concession for the very short
term, but I’d like to move away from it as soon as possible.

There is a StackOverflow question about this. The accepted answer
is from a GitHub employee, who suggests creating a machine account for
this purpose. Given that the API key would only be used from cron jobs,
running at most once daily and using just a handful of API tokens, this
seems easily in line with the terms of service. As further comfort, the
GitHub help page on managing deploy keys
says:

“… cannot automate the creation of accounts. But if you want to create a
single machine user for automating tasks such as deploy scripts in
your project or organization, that is totally cool.”

Add tests for Cred Explorer

In #265, we created a new UI for the Cred Explorer and merged it without tests. This was reasonable, as we were pushing to build out a lot of new features, and indeed succeeded in prototyping a much faster and more configurable PageRank implementation!

However, we should not forget to add tests, e.g. for the following properties:

  • Expand/collapse recursive tables
  • Nested tables have a darker color
  • Top-level filtering is possible via the selector
  • Scores (and log scores) are displayed
  • Top level table is sorted by score
  • And recursive ones are too
  • Dangling edges do not cause a crash
  • Changing the graph does not cause a crash
  • Nodes without a plugin adapter don't cause a crash

Write prototype Node backend

essentials

  • should be able to read git data, probably via NodeGit
    • (@wchargin I know you have some thoughts on this)
  • should use the same Flow type signatures as the frontend
  • should serve our frontend and supply the data

features

  • we keep a persisted copy of the backend's output in source control (a la src/data.json today), and use that to serve the development frontend
    • in the future case, this will be a db
    • we will update the persisted copy occasionally, maybe once a week
  • we should also have an end-to-end script that freshly loads the data from the backend, and serves it on the frontend
    • we can use it for E2E CI testing, assuming it's fast enough
  • also, it would be good to turn on backend performance testing early

Consider using babel-plugin-flow-react-proptypes

Per discussion earlier: Flow and PropTypes are both great, but neither is perfect by itself. Using only Flow, we’d miss runtime errors that could arise when we read from JSON and the schema doesn’t match what we thought it was. (This has happened already, albeit benignly!) Using only PropTypes, we have less expressive types (e.g., there’s only one function type, PropTypes.func), and we have less integration with the rest of our Flow-typed logic.

There exists a plugin, babel-plugin-flow-react-proptypes, that enables you to only write Flow types, and have the PropTypes generated from the Flow types at compilation time. We can’t use it directly before ejecting from create-react-app, but I tried it out anyway and it seems to do the job perfectly on our existing code. (See below to repro yourself.)

I think that this plugin is a good candidate, and that we should strongly consider using it after ejecting (and maybe consider ejecting in order to use it). What do you think?

cc: @dandelionmane

To try it out:

  1. Check out c1b37fa (master, at the time of writing).

  2. Apply the patch listed below, then yarn install.

  3. Manually edit the file node_modules/babel-preset-react-app/index.js, to add the line

    require.resolve('babel-plugin-flow-react-proptypes'),  // XXX

    at the top of the const plugins array.

  4. Manually edit the file node_modules/react-scripts/config/webpack.config.dev.js, changing the unique occurrence of cacheDirectory: true to cacheDirectory: false (may not be strictly necessary, depending on your state).

  5. Run yarn flow, and note that everything checks out. Run yarn start, and note that the app functions without error.

  6. Introduce a type error that would not cause a hard JS error in render: say, in UserExplorer.js, change <UserEntry userId={author} ... /> to <UserEntry userId={55} />.

  7. Run yarn flow, and note the helpful static errors. Run yarn start, and note the helpful runtime errors.

  8. Note that CI=true yarn test is still green.

The above workflow isn’t production-ready because of the step in which you manually edit the contents of some node modules. After ejecting, these steps would be replaced by editing some config files within our repo, which is totally fine.

The patch mentioned in step 2 appears below:

Contents of patch to automatically generate PropTypes from Flow types

Save the following block to a file /tmp/patch, then git apply </tmp/patch.

diff --git a/explorer/package.json b/explorer/package.json
index 4791b63..e698ff1 100644
--- a/explorer/package.json
+++ b/explorer/package.json
@@ -3,6 +3,7 @@
   "version": "0.1.0",
   "private": true,
   "dependencies": {
+    "babel-plugin-flow-react-proptypes": "^17.1.2",
     "flow-bin": "^0.65.0",
     "react": "^16.2.0",
     "react-dom": "^16.2.0",
diff --git a/explorer/src/App.js b/explorer/src/App.js
index 51d1081..19c182f 100644
--- a/explorer/src/App.js
+++ b/explorer/src/App.js
@@ -1,10 +1,12 @@
+// @flow
 import React, { Component } from 'react';
 import data from './data.json';
 import './App.css';
 import { FileExplorer } from './FileExplorer.js';
 import { UserExplorer } from './UserExplorer.js';
 
-class App extends Component {
+type AppState = {selectedPath: string, selectedUser: ?string};
+class App extends Component<{}, AppState> {
   constructor() {
     super();
     this.state = {
diff --git a/explorer/src/FileExplorer.js b/explorer/src/FileExplorer.js
index 41f11cc..24eaad6 100644
--- a/explorer/src/FileExplorer.js
+++ b/explorer/src/FileExplorer.js
@@ -1,15 +1,13 @@
+// @flow
 import React, { Component } from 'react';
-import PropTypes from 'prop-types';
 import {buildTree} from './commitUtils';
-import {propTypes as commitUtilsPropTypes} from './commitUtils';
-
-export class FileExplorer extends Component {
-  static propTypes = {
-    selectedPath: PropTypes.string,
-    onSelectPath: PropTypes.func.isRequired,
-    data: commitUtilsPropTypes.commitData.isRequired,
-  }
+import type {CommitData, FileTree} from './commitUtils';
 
+export class FileExplorer extends Component<{
+    selectedPath: string,
+    onSelectPath: (newPath: string) => void,
+    data: CommitData,
+}> {
   render() {
     // within the FileExplorer, paths start with "./", outside they don't
     // which is hacky and should be cleaned up
@@ -38,19 +36,16 @@ export class FileExplorer extends Component {
   }
 }
 
-class FileEntry extends Component {
-  static propTypes = {
-    name: PropTypes.string.isRequired,
-    path: PropTypes.string.isRequired,
-    alwaysExpand: PropTypes.bool.isRequired,
-
-    // The type for the tree is recursive, and is annoying to specify as
-    // a proptype. The Flow type definition is in commitUtils.js.
-    tree: PropTypes.object.isRequired,
-
-    selectedPath: PropTypes.string.isRequired,
-    onSelectPath: PropTypes.func.isRequired,
-  }
+class FileEntry extends Component<{
+    name: string,
+    path: string,
+    alwaysExpand: bool,
+    tree: FileTree,
+    selectedPath: string,
+    onSelectPath: (newPath: string) => void,
+}, {
+    expanded: bool,
+}> {
 
   constructor() {
     super();
diff --git a/explorer/src/UserExplorer.js b/explorer/src/UserExplorer.js
index 8da252d..6b7274f 100644
--- a/explorer/src/UserExplorer.js
+++ b/explorer/src/UserExplorer.js
@@ -1,19 +1,14 @@
+// @flow
 import React, { Component } from 'react';
-import PropTypes from 'prop-types';
-import {
-  commitWeight,
-  propTypes as commitUtilsPropTypes,
-  userWeightForPath,
-} from './commitUtils';
-
-export class UserExplorer extends Component {
-  static propTypes = {
-    selectedPath: PropTypes.string.isRequired,
-    selectedUser: PropTypes.string,
-    onSelectUser: PropTypes.func.isRequired,
-    data: commitUtilsPropTypes.commitData.isRequired,
-  }
+import {commitWeight, userWeightForPath} from './commitUtils';
+import type {CommitData, FileTree} from './commitUtils';
 
+export class UserExplorer extends Component<{
+    selectedPath: string,
+    selectedUser: ?string,
+    onSelectUser: (newUser: string) => void,
+    data: CommitData,
+}> {
   render() {
     const weights = userWeightForPath(this.props.selectedPath, this.props.data, commitWeight);
     const sortedUserWeightTuples =
@@ -34,12 +29,10 @@ export class UserExplorer extends Component {
 /**
  * Record the cred earned by the user in a given scope.
  */
-class UserEntry extends Component {
-  static propTypes = {
-    userId: PropTypes.string.isRequired,
-    weight: PropTypes.number.isRequired,
-  }
-
+class UserEntry extends Component<{
+    userId: string,
+    weight: number,
+}> {
   render() {
     return <div className="user-entry">
       <span> {this.props.userId} </span>
diff --git a/explorer/src/commitUtils.js b/explorer/src/commitUtils.js
index 4cca4a0..870ec0d 100644
--- a/explorer/src/commitUtils.js
+++ b/explorer/src/commitUtils.js
@@ -2,30 +2,12 @@
 
 import PropTypes from 'prop-types';
 
-type CommitData = {
+export type CommitData = {
   // TODO improve variable names
   fileToCommits: {[filename: string]: string[]};
   commits: {[commithash: string]: Commit};
-  authors: string[];
 }
 
-export const propTypes = {
-  commitData: PropTypes.shape({
-    fileToCommits: PropTypes.objectOf(
-      PropTypes.arrayOf(PropTypes.string.isRequired).isRequired,
-    ).isRequired,
-    commits: PropTypes.objectOf(PropTypes.shape({
-      author: PropTypes.string.isRequired,
-      stats: PropTypes.objectOf(PropTypes.shape({
-        lines: PropTypes.number.isRequired,
-        insertions: PropTypes.number.isRequired,
-        deletions: PropTypes.number.isRequired,
-      }).isRequired).isRequired,
-    }).isRequired).isRequired,
-  }),
-};
-
-
 type Commit = {
   author: string;
   stats: {[filename: string]: FileStats};
@@ -86,7 +68,7 @@ export function userWeightForPath(path: string, data: CommitData, weightFn: Weig
   return userWeightMap;
 }
 
-type FileTree = {[string]: FileTree};
+export type FileTree = {[string]: FileTree};
 
 export function buildTree(fileNames: string[]): FileTree {
   const sortedFileNames = fileNames.slice().sort();
diff --git a/explorer/yarn.lock b/explorer/yarn.lock
index 413d06d..bacc259 100644
--- a/explorer/yarn.lock
+++ b/explorer/yarn.lock
@@ -323,7 +323,7 @@ [email protected], babel-code-frame@^6.11.0, babel-code-frame@^6.22.0, bab
     esutils "^2.0.2"
     js-tokens "^3.0.2"
 
-[email protected], babel-core@^6.0.0, babel-core@^6.26.0:
+[email protected], babel-core@^6.0.0, babel-core@^6.25.0, babel-core@^6.26.0:
   version "6.26.0"
   resolved "https://registry.yarnpkg.com/babel-core/-/babel-core-6.26.0.tgz#af32f78b31a6fcef119c87b0fd8d9753f03a0bb8"
   dependencies:
@@ -514,6 +514,15 @@ [email protected]:
     babel-template "^6.26.0"
     babel-types "^6.26.0"
 
+babel-plugin-flow-react-proptypes@^17.1.2:
+  version "17.1.2"
+  resolved "https://registry.yarnpkg.com/babel-plugin-flow-react-proptypes/-/babel-plugin-flow-react-proptypes-17.1.2.tgz#89f75928a47ea869dab312605f42542dd7b6755c"
+  dependencies:
+    babel-core "^6.25.0"
+    babel-template "^6.25.0"
+    babel-traverse "^6.25.0"
+    babel-types "^6.25.0"
+
 babel-plugin-istanbul@^4.0.0:
   version "4.1.5"
   resolved "https://registry.yarnpkg.com/babel-plugin-istanbul/-/babel-plugin-istanbul-4.1.5.tgz#6760cdd977f411d3e175bb064f2bc327d99b2b6e"
@@ -913,7 +922,7 @@ [email protected], babel-runtime@^6.18.0, babel-runtime@^6.22.0, babel-runtim
     core-js "^2.4.0"
     regenerator-runtime "^0.11.0"
 
-babel-template@^6.16.0, babel-template@^6.24.1, babel-template@^6.26.0:
+babel-template@^6.16.0, babel-template@^6.24.1, babel-template@^6.25.0, babel-template@^6.26.0:
   version "6.26.0"
   resolved "https://registry.yarnpkg.com/babel-template/-/babel-template-6.26.0.tgz#de03e2d16396b069f46dd9fff8521fb1a0e35e02"
   dependencies:
@@ -923,7 +932,7 @@ babel-template@^6.16.0, babel-template@^6.24.1, babel-template@^6.26.0:
     babylon "^6.18.0"
     lodash "^4.17.4"
 
-babel-traverse@^6.18.0, babel-traverse@^6.23.1, babel-traverse@^6.24.1, babel-traverse@^6.26.0:
+babel-traverse@^6.18.0, babel-traverse@^6.23.1, babel-traverse@^6.24.1, babel-traverse@^6.25.0, babel-traverse@^6.26.0:
   version "6.26.0"
   resolved "https://registry.yarnpkg.com/babel-traverse/-/babel-traverse-6.26.0.tgz#46a9cbd7edcc62c8e5c064e2d2d8d0f4035766ee"
   dependencies:
@@ -937,7 +946,7 @@ babel-traverse@^6.18.0, babel-traverse@^6.23.1, babel-traverse@^6.24.1, babel-tr
     invariant "^2.2.2"
     lodash "^4.17.4"
 
-babel-types@^6.18.0, babel-types@^6.19.0, babel-types@^6.23.0, babel-types@^6.24.1, babel-types@^6.26.0:
+babel-types@^6.18.0, babel-types@^6.19.0, babel-types@^6.23.0, babel-types@^6.24.1, babel-types@^6.25.0, babel-types@^6.26.0:
   version "6.26.0"
   resolved "https://registry.yarnpkg.com/babel-types/-/babel-types-6.26.0.tgz#a3b073f94ab49eb6fa55cd65227a334380632497"
   dependencies:

Write a pagination API for GitHub GraphQL queries

When we request data from the GitHub API, we are limited to a maximum of
100 items per connection at a time: 100 issues per repository, 100
comments per issue, etc. When we want to fetch all data for a
repository, we’ll need to take advantage of GitHub’s pagination API to
fetch subsequent pages. Currently, our code assumes (and asserts) that
all data always fits on a single page, but clearly this is not a
scalable long-term solution: SourceCred itself will soon have more than
100 pull requests, for instance.
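
For reference, GitHub's pagination is cursor-based: each connection exposes pageInfo { hasNextPage, endCursor }, and you pass the cursor back as the after argument on the next request. A minimal sketch of the mechanism (the query shape and the postQuery helper are illustrative, not our actual query code; postQuery is assumed to return the data field of the GraphQL response):

// Minimal sketch of cursor-based pagination against GitHub's GraphQL API.
async function fetchAllIssues(postQuery, owner, name) {
  const query = `
    query Issues($owner: String!, $name: String!, $after: String) {
      repository(owner: $owner, name: $name) {
        issues(first: 100, after: $after) {
          nodes { number title }
          pageInfo { hasNextPage endCursor }
        }
      }
    }`;
  const issues = [];
  let after = null;
  do {
    const result = await postQuery(query, {owner, name, after});
    const connection = result.repository.issues;
    issues.push(...connection.nodes);
    after = connection.pageInfo.hasNextPage ? connection.pageInfo.endCursor : null;
  } while (after != null);
  return issues;
}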

Eventually, we would like to implement an “auto-pagination” function
that takes an arbitrary query against the GitHub API, executes it, and
keeps fetching subsequent pages until all data has been exhausted. I’ve
been working on this for a while (order of a week), and it is
fascinatingly difficult. Without going into too much detail: for a fully
general solution, we need to map each field in the response to the field
in the query and the associated GraphQL type, which gets complicated
when dealing with fragments, and further complicated when fragments have
overlapping support. Our existing GitHub query falls into this last
category, so the functionality, at least, is required.

The GraphQL spec is of excellent quality, and is very clear about the
semantics of what happens in the query. This makes it easy enough to
formulate a top-level API and contracts for the functions that we want.
But the implementation of these functions should clearly be decomposed,
and exactly how to decompose them—what types to use, where to cut the
recursion, which base cases to pick—is surprisingly challenging. I’m
confident that I could implement it, given enough time, but it would take
a lot of careful thought. (I'd currently estimate one programmer-week's
worth of design and implementation work, non-parallelizable.)

As our tolerance for hacky ad hoc solutions increases, the problem
becomes easier. In the short term, we should consider implementing a
pagination function that works just for our particular GitHub query.
This removes the greatest difficulty, type tracking across fragments,
because we can design the code around a particular query structure. This
will unblock us indefinitely, at the cost of increasing the friction
involved in changing the GitHub query. (Estimated work: one or two days,
non-parallelizable, including good test coverage; maybe one extra day if
we want really rigorous tests.)
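
To illustrate what "works just for our particular query" means in practice, here is a hedged sketch of one such hardcoded continuation: after the initial fetch, any issue whose comments ran past the first page gets its own follow-up query keyed by the issue's node id. None of this is our actual code; the shape is just indicative.

// Illustrative only: a continuation step hardcoded to one known query shape
// (an issue and its comments), rather than derived from an arbitrary query.
async function fetchRemainingComments(postQuery, issue) {
  const query = `
    query MoreComments($id: ID!, $after: String!) {
      node(id: $id) {
        ... on Issue {
          comments(first: 100, after: $after) {
            nodes { author { login } body }
            pageInfo { hasNextPage endCursor }
          }
        }
      }
    }`;
  let pageInfo = issue.comments.pageInfo;
  while (pageInfo.hasNextPage) {
    const result = await postQuery(query, {id: issue.id, after: pageInfo.endCursor});
    const connection = result.node.comments;
    issue.comments.nodes.push(...connection.nodes);
    pageInfo = connection.pageInfo;
  }
}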

In the long term, it would be nice to implement a more general-purpose,
but still GitHub-specific, pagination API. A good metric for generality
is that it should be releasable as a standalone npm package, with
dependencies only on the structured query language. We can keep this in
mind as it becomes more cumbersome to write queries with manual
pagination—and also as just a “nice-to-have” on the calendar.

Absent objections, I’ll start implementing this; I have the most context
on the existing query system and on the aforementioned pagination
efforts.

cc @decentralion

Address semantics and "Value Networks"

(This description will be improved)

We should think very carefully about Address semantics. Right now we implicitly have repositoryName/pluginName; however, I think we should instead think about "Value Networks", and plan for value networks to be forkable, upgradable, and sharded/distributed, where in general, for each value network, no one party has unrestricted write ACLs across that network.

For example, our "GitHub plugin" can instead be a "GitHub value network", which defines a shared contribution graph that incorporates all issues, comments, user identities, pull requests, etc., across GitHub. We will maintain an instance of the GitHub value network, but it should be very easy for others to maintain their own forks, and our addressing system should accommodate forks and network partitioning.

I think our addresses should include a precise description of how to locate the linked edge, even when it is not locally available, and should have reasonable failure semantics. I think "git addresses" (look up in repository at URL X) and IPFS content addresses should both be handled well by our addressing scheme.
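
To make this concrete, here is a rough sketch of what such an address might carry, written as a Flow type. Every name here (the type names, the location kinds, the fields) is hypothetical; it is meant to show the shape of the idea, not a schema proposal.

// Hypothetical sketch only, not the current Address implementation.
// The idea: an address carries enough information to locate the linked
// node or edge even when it is not available locally.
export type NetworkLocation =
  | {|+type: "GIT", +remoteUrl: string|}      // look up in the repository at URL X
  | {|+type: "IPFS", +contentHash: string|};  // content-addressed lookup

export type ForkableAddress = {|
  +networkName: string,       // e.g. "github", or someone's fork of it
  +location: NetworkLocation,
  +id: string,                // identifier within that value network
|};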

Note that this is an increase in flexibility over the "repositoryName/pluginName" system. It will still be the case that some ValueNets are owned by individual repositories (e.g. SourceCred owns the SourceCredArtifact net), but it is a more flexible abstraction.

Explicit goal: Someone should be able to fork any individual value network, and still play nicely with all the other networks, including the forked network.
