se-sic / coronet Goto Github PK

View Code? Open in Web Editor NEW

7.0 7.0 15.0 3.31 MB

coronet – the R library for configurable and reproducible construction of developer networks

License: GNU General Public License v2.0

R 100.00%

coronet's People

Contributors

Stargazers

Watchers

Forkers

clhunsen bockthom hechtlc ecklbarb klaraschlueter fehnkera miriyusifli nohadon nlschn jojodeveloping joba00002 kkristela maloefuds

coronet's Issues

Provide the possibility to split data into activity-based equally-sized windows

Currently, we have the possibility to split networks activity-based by either specifying the number of edges per network or by specifying the number of windows.

However, we do not have this possibility for data-based splitting, we can only specify the number of commits resp. e-mails, but not the number of windows.

So, I suggest to implement a function that computes the activity amount based on the number of wanted windows. Example:

get.size.of.equally.sized.windows <- function(input.size, number.windows.wanted) {
  size <- ceiling(input.size / number.windows)
}

In the case of activity-based network splitting, input.size is the overall number of edges.
In the case of activity-based data splitting, input.size is the overall number of commits resp. e-mails.

So, both functions split.data.activity.based and split.network.activity.based should provide a parameter number.windows and both call the above defined function get.size.of.equally.sized.windows when the parameter number.windows is given.

[In addition, one could think of providing a function for determining equally sized windows also for time-based splitting. In that case, there is no difference between network-based or data-based time-based splitting -- we only need the very first and very last date in the data source to determine a time-period for equally-sized windows given by the amount of windows wanted. However, this will only make sense after #38 is closed.]

How to deal with duplicate range names?

When using the activity-based discretisation, it is possible that two (or even more) subsequent ranges have exactly the same names.

Example: Split busybox-feature author networks into networks of 5000 edges each. Then you get the range 2006-12-26 01:30:59-2006-12-26 01:30:59 twice.

This can cause problems when you construct data.frames with the range as row name as row names have to be unique.

So, is it possible to somehow avoid duplicate range names?
Would it make sense to add a suffix which differentiates between the two range names (just in the case that we have duplicate range names)? Or should we just leave the ranges well enough alone and force the end user to deal with this problem appropriately?

Keep range name when constructing networks from range data objects

When splitting data into ranges, we receive a list of RangeData objects where each object has a name in the list which is equal to the range attribute of the object. So, we can get the range name either by querying the name in the list or calling get.range.

When splitting a network into ranges, we receive a list of networks with the ranges as names.

However, when splitting data and then constructing networks, the range is only accessible via the RangeData object, but not directly available on the network. Can we somehow add the range to the network so that we do not need the RangeData object any more? One solution would be to use
attr(network, "range") = rangeData$get.range() when constructing a network from a RangeData object. Are there any other solutions?

I am not sure whether it is worth to implement that, but it might be helpful in some cases.

Support multi-artifact networks

Especially, for "bipartite" networks, we should support multi-artifact networks, where, for example, functions and features are represented as vertices.

Introduce a network-configuration object

Essentially, we should use two configuration objects:

the project configuration, holding data paths, revisions etc. (right now CodefaceConf), and
the network configuration, holding vertex and edge configuration details (i.e., artifact types etc.), edge and vertex attributes, ... (right now passed as repetitive method parameters).

This way, we are able to initialize data objects by just replacing the network configuration and remove tons of delegated attributes to almost all data-object methods. Additionally, the configuration just need to be done once -- by passing the configuration object to the constructor.

The list of network configuration options should be the following, at least:

vertices and vertex attributes,
edges and edge attributes,
vertex relations for all parts of the networks ("mail", "cochange", "callgraph"),
synchronicity data (yes or no), synchronicity time-window,
directedness (see issue #6 on this),
network simplification (contract edges: yes or no),
network-construction details (naming not fixed):
- filter.artifact (only the exact artifact, e.g., filter feature expressions when using features as artifacts) [default: yes],
- artifact.filter.base (remove BASE_FEATURE and FILE_LEVEL) [default: yes],
- artifact.filter.empty (remove the "empty" artifact, i.e., remove all commits which does not changed a tracked artifact),
skip.threshold from function construct.dependency.network.from.list,
...

Of course, all possibilites should be documented properly.

Distinguish directedness of networks and edge-construction algorithm

Currently, the configured directedness of the networks defines the edge-construction methodology. We definitely need to distinguish here!

When constructing both author and artifact networks can either be directed or undirected. When constructing bipartite networks (where both are unioned to one single network), both need to be directed or undirected.

The edge-construction algorithm defines whether the timely occurrence events does define the networks structure. Edges can be directed or undirected. (Exception: Call-graph data is basically directed, so we need to be careful!)

We need to come up with a proper distinguishment between directedness of networks and the edge-construction algorithm, so that we are able to construct a bipartite network containing the data of, e.g., a time/order-respecting e-mail-based author network (basically directed) and a co-change-based artifact network (undirected) without any problems.

For problems that occur when we do not distinguish, refer to commit 49a9125.

[Further information might be added here.]

Store splitting information in project configuration

After splitting data, we loose the information regarding how the splitting was performed. In the project configuration, only the ranges get updated.

I would appreciate to also store the following information in the configuration:

Splitting type (time-based or activity-based)
Length of the ranges (time period resp. activity amount)
Split basis (commits or e-mails)

This information stored in the project configuration could then be used, e.g., to construct directory names directly from the configuration for saving split networks to disk.

Distinguish real bipartite networks and "pseudo" bipartite networks

Right now, the function get.bipartite.network() constructs a network that is not really bipartite, but a normal network with two types of vertices (edges are between any kinds of vertices).

We should do two things here:

Provide a function to construct real bipartite networks.
Rename the current function to something more appropriate, e.g., get.multi.network() or get.complete.network() or get.two.type.network(). Not sure exactly what's the right choice here.

In the meanwhile, we can obtain real bipartite networks in the following way:

net = codefaceData$get.bipartite.network(...)
net = igraph::delete.edges(net, E(net)[ type == TYPE.EDGES.INTRA ])

Bug in simplify.networks function

In commit 6ee3ec1 , which introduced some logging to the simplify.networks function, also a bug was introduced:

As the last line of the function is the logging::logdebug("simplify.networks: finished.") statement, the function returns NULL instead of returning the list nets.

Please add a return statement in the end: return(nets).

Sorry that I did not recognize that while reviewing the mentioned commit.

Error in construction of file-based author networks

When constructing file cochange networks, an error occurs (regardless which case study to analyze -- for busybox and openssl the error occurs sooner, for qemu it occurs later).

Here is the log of the busybox threemonth file cochange analysis:

2017-03-05 16:33:21 INFO::Construct configuration: starting.
2017-03-05 16:33:21 INFO::Attempting to load configuration file: /mnt/codeface-data/configurations/threemonth/busybox_proximity.conf
2017-03-05 16:33:21 INFO::Construct configuration: finished.
2017-03-05 16:33:21 INFO::Constructing author network.
2017-03-05 16:33:21 INFO::Getting artifact--author data.
2017-03-05 16:33:21 INFO::Getting commit data.
2017-03-05 16:33:21 INFO::Getting raw commit data.
Loading required package: tcltk
2017-03-05 16:33:21 INFO::Create edges.
2017-03-05 16:33:21 INFO::Construct network from edges.
2017-03-05 16:33:21 INFO::Constructing author network.
2017-03-05 16:33:21 INFO::Getting artifact--author data.
2017-03-05 16:33:21 INFO::Getting commit data.
2017-03-05 16:33:21 INFO::Getting raw commit data.
2017-03-05 16:33:21 INFO::Create edges.
2017-03-05 16:33:22 INFO::Construct network from edges.
Error in sum(c("3", "4", "3", "3", "7", "6", "5", "3", "3", "3", "3",  : 
  invalid 'type' (character) of argument
Calls: collect.author.networks ... simplify.network -> <Anonymous> -> .Call -> <Anonymous>

 *** caught segfault ***
address (nil), cause 'unknown'

Traceback:
 1: base::.Call(.NAME, ...)
 2: .Call("R_igraph_finalizer", PACKAGE = "igraph")
 3: igraph::simplify(network, edge.attr.comb = EDGE.ATTR.HANDLING,     remove.loops = TRUE)
 4: simplify.network(net)
 5: construct.dependency.network.from.list(artifact2author, directed = directed,     simple.network = simple.network, extra.edge.attr = extra.edge.attr)
 6: private$get.author.network.cochange(directed = directed, simple.network = simple.network)
 7: range.data$get.author.network(author.relation, directed = author.directed,     simple.network = simple.network)
 8: FUN(X[[i]], ...)
 9: lapply(ranges, function(range) {    range.data = CodefaceRangeData$new(conf, range)    author.network = range.data$get.author.network(author.relation,         directed = author.directed, simple.network = simple.network)    author.network = set.graph.attribute(author.network, "range",         range)    return(author.network)})
10: collect.author.networks(conf, author.relation = AUTHOR.RELATION,     author.directed = FALSE, simple.network = simplifyNetworks,     step = STEP)
An irrecoverable exception occurred. R is aborting now ...

Extend the README

The README file should contain more documentation of the project.

This task should be performed after the resolution of issue #8.

Provide a function to extract dates or hashes from range names

Currently, (e.g., after splitting networks), each network has a name which consists of the range begin and end dates. However, we cannot easily ask for the beginning or the ending of the range. As this information can be necessary sometimes, we should provide a function that extracts the begin and the end of a range from the range name.

For extracting dates, we can use the following regular expression:
regmatches(range, gregexpr(pattern=c("\\d{4}-\\d{2}-\\d{2}(\\s\\d{2}:\\d{2}:\\d{2})?"), range))[[1]]

In addition, we also should be able to extract commit hashes in the case the range names consists of commit hashes instead of dates.

Sliding-window splitting

Extend the splitting functionality with an approach that yields data/networks split into sliding-window bins.

When supplying a time-window parameter tw to the new splitting function, the bins should be constructed as such:

[0/2 -- 2/2 tw, 1/2 tw -- 3/2 tw, 2/2 tw -- 4/2 tw, ...]

Support arbitrarily defined revision ranges for splitting

[Edit by @bockthom on 2022-06-08: Some ideas from this issue are already outdated. Please find some updated information in today's comment below]

In the future, it may be the case that we only use the threemonth selection process of Codeface. Here the reason why: In the end, the selection process only affects the identification heuristics for artifacts within commits (commit_dependency table of Codeface DB), while we may want to analyze further and different revision ranges (e.g., six-month time windows). For this case, we developed the splitting functionality that operates on the project-level data. While we are able to split data by abstract time windows (such as 3 months or 6 months), the splitting with explicit bins is more complicated, as the timestamps for the bins need to be identified.

Here the idea: We should do a version selection for our case studies and store those with the Codeface data somehow. For example, we could provide version-based splitting bins for all case studies or bins for specific time periods where the maintainer is a different person etc. In the end, the user does not need to bother about retrieving special bins (except there is the need for a new one), but only needs to configure a file name or version selection process.

Ideas for predefined bins:

all major versions,
all versions,
time periods with different lead maintainer,
...

There are some questions now:

How do we integrate this into the code exactly?
Do we provide names for the selection process and the library does the rest (i.e., read the corresponding file from disk and pass the right parameters to the splitting functionality)?
Do we instruct the user to provide file names and proper calling of methods (a how-to would be needed anyway)?
Is this kind of a bulk mechanism? (which would support idea 2)

What are your thoughts? Are there questions?

Remove e-mails with wrongly parsed date

Should we provide the possibility to automatically remove mails which have a wrongly parsed date?

For example, the Busybox mail data contains one e-mail with the following date:
107-07-20 16:35:57
As 107 is definitively not the correct year this mail was sent in, in some cases, it can be important to remove such mails before data construction resp. network construction. (Notice that we already have fixed the date-parsing algorithm in Codeface last year, but there are some date formats which are un-parsable. So, for each of our current case studies we have up to 20 mails which have an incorrect date.)

For example, I tried to split the mail data of Busybox into "6 months" ranges. So, the first range contains just the one mail sent at 107-07-20 16:35:57. Thereafter, there are 3789 empty ranges. Range 3791 contains some first mails from 2003...

So my suggestion is, to automatically remove such e-mails with wrongly parsed dates while reading the mail data from disk.

I already had to handle this problem in my coordination-bursts scripts. There I completely remove all emails whose creation date is smaller than "1990-01-01 00:00:01" directly after reading the .list file.
I suggest to also perform this kind of data correction here, too. Let's do this in read.mails.

Possibility to build cumulative ranges

It may be useful to be able to construct cumulative ranges when splitting data or networks.
For instance, the first range should contain the first three months, the second range the first six month, the third range the first nine month, and so on...

To achieve this, we should split the data into equal ranges (three months in the example above) and then combine subsequent ranges to get the cumulative ranges.

Therefore, we need some additional functions for combining data or networks, which we can collect in a new file called util-combine.R, for example.

A function for combining networks does already exist in another repository, which can be moved from the other repository into this one (see #11).

Dynamically read e-mail addresses from 'authors.list' file

To be fault-tolerant regarding the change proposed in se-sic/codeface-extraction#5, we need to parse any further columns in the file dynamically if they exist, i.e., the authors' e-mail addresses.

Tests for identical outcome of different kinds of generating the same network

There are different ways to generate the same networks, in some cases. We should check that the outcome is the same for all ways.

Example: Generate bipartite networks and author networks using time-based splitting.

Network generation 1:

construct bipartite network
construct author network
split.networks.time.based (maybe here we have to make sure to use the same bins for both bipartite and author network splitting)

Network generation 2:

construct multi network
split.network.time.based
extract.author.network.from.network with remove.isolates = TRUE
extract.bipartite.network.from.network with remove.isolates = TRUE

Check whether the generated lists of bipartite and author networks are identical for both generations.

We should test this with different kinds of relations (mail, cochange, issue) and different kinds of networks (bipartite, author, artifacts).

In addition, are there similar cases where we can construct the same networks using different construction algorithms?

In addition, consider also this comment: #86 (comment)

Rename the project

codeface-extraction-r is probably a name not accurate enough, especially when we add bugtracker networks to the functionality. Right now, we also always refer to the "network framework" or "network library". But, we should think of a tool name. 😉

Any ideas?

Correctly incorporate committer data everywhere

As we now have committer data available (#35), we need to incorporate the committer data everywhere:

Rename several functions and configuration options (e.g., author.only.committers is not correctly named any more)
Provide access to aggregated committer data (e.g., committer2artifact (respect different kinds of artifacts), committer2commit, etc.)
Enable constructing committer networks (similar to all the other author networks)
Construct networks with authors and committers (possible for author networks, bipartite networks, multi-networks)

Anything to add here?

Merge simplify and contract.edges

Since simplify and contract.edges practically do the same thing we could discuss if we should merge them into one single attribute.

No synchronicity data in get.author2commit

kwT

Update core-peripheral module

Add contribution guide

With the integration of CONTRIBUTING files in the GitHub contribution process, we should set up such a guide.

What we should cover

Coding style
- use https://google.github.io/styleguide/Rguide.xml as a start
- basically anything not matching the Google style reference (e.g., 4 spaces, line length longer than 80 characters allowed, identifiers only with . and not camel-case names, assignment only with =)
- requireNamespace and package prefixes (e.g., igraph::)
- function and method documentation in roxygen2 (see https://support.rstudio.com/hc/en-us/articles/200532317-Writing-Package-Documentation)
- logging using the logging package
- double quotes instead of single quotes
- we use "networks", no "graphs"
- we talk about authors, no developers
- write TRUE instead of T (analogously for FALSE and F)
- write tests!
- ...
Commit style
- e.g., https://github.com/siemens/codeface/wiki/How-to-structure-commits
Merge policy
- tests must run successfully

Reference material

Distinguish author and committer

Basically, we can distinguish between authors of commits and committers, and, respectively, athor date and commit date.

Codeface does not extract the committer right now, but this is possible to do in the future, but after that is working, we can distinguish, too. I can think of a parameter in the NetworkConf class, for example.

Segfault in generation of "bipartite" networks (openssl, function, cochange)

During the generation of our "bipartite" networks for the openssl function cochange threemonth configuration, a reproducible segfault occurs within the call to igraph::simplify.

Interesting point: I have already generated the "bipartite" networks using exactly the same setting a few days ago without ending in a segfault. The only difference: Intermediately, commit abfec6f was pushed to this repository, which affected the generation of the "bipartite" networks. I am not sure, whether the changes contained in that commit cause the segfault or not.

Here is the concrete setting that leads to the segfault now: The function collect.bipartite.networks is called using the following parameters:
author.relation = "cochange"
artifact.relation = "cochange"
author.directed = FALSE
simple.network = TRUE

After getting 66 of 73 artifact networks, the following output appears in the log file (beginning omitted):

2017-04-29 16:17:12 DEBUG::get.artifact.network.cochange: finished.

 *** caught segfault ***
address 0x2ac1f6fd8, cause 'memory not mapped'

Traceback:
 1: base::.Call(.NAME, ...)
 2: .Call("R_igraph_simplify", graph, remove.multiple, remove.loops,     edge.attr.comb, PACKAGE = "igraph")
 3: igraph::simplify(network, edge.attr.comb = EDGE.ATTR.HANDLING,     remove.loops = TRUE)
 4: simplify.network(u)
 5: combine.networks(authors.net, artifacts.net, authors.to.artifacts,     simple.network = simple.network, extra.data = artifact.extra.edge.attr)
 6: range.data$get.bipartite.network(author.relation = author.relation,     artifact.relation = artifact.relation, simple.network = simple.network,     author.directed = author.directed, artifact.extra.edge.attr = artifact.extra.edge.attr,     artifact.filter = artifact.filter, artifact.filter.base = artifact.filter.base)
 7: FUN(X[[i]], ...)
 8: lapply(ranges, function(range) {    range.data = CodefaceRangeData$new(conf, range)    bp.network = range.data$get.bipartite.network(author.relation = author.relation,         artifact.relation = artifact.relation, simple.network = simple.network,         author.directed = author.directed, artifact.extra.edge.attr = artifact.extra.edge.attr,         artifact.filter = artifact.filter, artifact.filter.base = artifact.filter.base)    bp.network = set.graph.attribute(bp.network, "range", range)    return(bp.network)})
 9: collect.bipartite.networks(conf, author.relation = AUTHOR.RELATION,     artifact.relation = "cochange", author.directed = FALSE,     simple.network = simplifyNetworks, step = STEP)
An irrecoverable exception occurred. R is aborting now ...

I will try to find out next week in which threemonth range the segfault occurs... I hope that I can find out then where the problem arises from.

Issue communication networks

Right now, we create issue-based networks from all events occurring in the issues. But, it would be better to use plain communication networks . This way, we would stay consistent with the mail-based networks.

Make isolate removal optional during network splitting

Currently, we loose vertices and, accordingly, their vertex attributes if and when we split a network: Isolated vertices are removed during the splitting (see here).

Therefore, we need to make the isolate deletion configurable when we introduce vertex attributes!

The issue originates here.

Further vertex attributes

To improve over PR #67, we should consider adding further vertex attributes:

commit.count.author.and.committer and commit.count.author.or.committer (see here, fixed in PR #127)
Related to here, we need to improve the first.activity attribute by enabling the user to pass several data sources. (fixed in PR #135)
Related to here, we need to improve the active.ranges attribute by incorporating further data sources. Maybe, we should add a parameter activity.type as for add.vertex.attribute.first.activity.
Also count committers for add.vertex.attribute.artifact.editor.count (see here and #84, see PR #169)?
Currently, we only provide vertex attributes for commits, but we definitely should provide functions to add attributes for e-mails and issues. For some further information, see
here.
(moved into a separate issue, see #170)

Timezone initialization

Right now, we need to initialize some global options to guarantee correct behavior on all operating systems and system locales. For example, we set the system locale to en_US.UTF-8 and the default timezone to UTC.
For this purpose, we added the following snippet to both the files util-init.R and util-read.R:
https://github.com/se-passau/codeface-extraction-r/blob/ed4c1fe176dc2bf20a1ffdc304e8353a48ac26cd/util-init.R#L4-L11

During the implementation for PR #66 (see this comment), it became necessary to introduce the setting of the options to the file util-misc.R, too, as we do timestamp parsing there.

The question is: Do we want that? If yes, how do we approach this while minimizing code clones?

The main idea to add the option initialization to all file that require this because we may want to use only single files of the network library somewhere else but not all of it. In this case, we need to guarantee that the behavior of each file is encapsulated properly, so that its functionality is self-contained.
So, to the question above, we need to answer an additional one: Do we want to require the use to source the file util-init.R to guarantee consistent behavior also when the user only needs a part of the network library's functionality (i.e., single files)?

What is your opinion on this?

Adopt generic network functions from other projects

There are a bunch of functions from other projects that should be incorporated and adopted into this project.

Some functionalites that we should transfer here:

discretization of project-level networks (time-based or activity-based)
combination/addition of networks (see https://github.com/se-passau/dev-network-growth/blob/master/util.R#L64, #61, #15, and #98; PR #115)
core/peripheral developers (also needed for #34, added in PR #51)
universal network-construction function (see https://github.com/se-passau/dev-network-growth/pull/39)
motif identification (see PR #42)

How to handle incomplete ranges resp. missing data?

Since we extract commit data and e-mail data (and later also issue data) from different sources, the time ranges for available data are different.
For example, there may be a huge amount of time between the first extracted commit and the first extracted e-mail (and also analogously for the last commit resp. e-mail). This affects especially multi-networks, as they can be constructed on both commit and e-mail data simultaneously.
This may cause some problems for some of the analyses, even for analyses which do not use multi-networks, but to make the analyses comparable to other analyses that use multi-networks. As different analyses use different kinds of networks and have different prerequisites, this question is not easily answered.

In the following, I will mention some different use cases and possible solutions:

Global networks:

To build project-level networks, we could restrict the network generation to consider complete data, i.e., only consider the time range for which all data sources are available (globally cut incomplete time periods at beginning and end of the time series). However, some of the analyses may not care if there is incomplete data in the beginning or end, so we should make globally cutting configurable.

Range-level networks:
How to deal with incomplete ranges?

Remove ranges for which not all needed data are available?
Remove ranges for which all data sources are partly available but not for the hole range?
Only cut incomplete parts of a range? (This would also cut time periods without activity...)
Define a threshold for identifying incomplete ranges?
Globally cut incomplete time periods (analogously to 1.) before splitting?

Comparability of different analyses:

How to get the same analyzed global time period for global networks and range-level networks?
How to deal with artifact/author/bipartite networks? Also skip the time ranges with incomplete data even if the missing data is not used? Make that configurable? To many (contrary) configuration options may confuse the users...

There are many options, but it is difficult to make all the analyses and networks compatible with each other.

Any ideas on that?

Improve logging at the construction of networks

It would be nice to have some more detailed logging while networks are constructed in order to see the progress of the network construction.

Error in construct.networks.from.list for openssl function networks

After generating openssl function networks (threemonth) for more than 20 hours (using 2 cores), an error appeared:

Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) : 
  long vectors not supported yet: fork.c:376
Calls: <Anonymous> ... construct.dependency.network.from.list -> mclapply -> lapply -> FUN -> sendMaster
Error: All inputs to rbind.fill must be data.frames
In addition: Warning message:
In mclapply(list, function(set) { :
  scheduled core 1 encountered error in user code, all values of the job will be affected
Execution halted

Unfortunately, the error message is not that concrete since more than one core was used.

Use lubridate package for all POSIXct conversions

As a follow-up for commit e8b60f7, we should replace all date -parsing code with proper calls to functions of the lubridate package to have a consistent code base.

Provide functions for extracting networks from the multi network

In some cases, we loose information when we only consider author networks or bipartite networks separately. Therefore, the different network types cannot be used to capture the same time window in an identical way. The only solution to this is: Do not build and split the author networks and bipartite networks separately, but extract them from the (already split) multi network.

So, we need functions that extract the author network, the bipartite network, and the artifact network from a multi network.

Parameter "contract.edges" in function "get.bipartite.network" cannot be used

For some unknown reason, the parameter contract.edges in the function get.bipartite.network cannot be accessed within the function itself.
As this parameter is used within the function, the function call leads to an Error since contract.edges is always NULL.

Determine list of artifacts more reasonably

Current,y the list of artifacts is only determined by the commits, but actually, we need to determine it by the artifact.relation key in the network configuration. There are quite some places to fix here; for example:

https://github.com/se-passau/codeface-extraction-r/blob/22e1ba5eaa3bc40f89ac6a4409c41979515a472d/util-networks-covariates.R#L338-L345

https://github.com/se-passau/codeface-extraction-r/blob/22e1ba5eaa3bc40f89ac6a4409c41979515a472d/util-data.R#L571-L588

Provide possibility to filter in-active developers from range-level networks

When building multi-networks, we can make use of the configuration option author.only.committers. However, when we split the global multi-network into range-level networks, we can have developers in a certain time-range, that are only active on the mailing list in this range, but appear in the range-level multi-network as they appear in the global network since they are committer in, at least, one range.

Therefore, we could provide a function that removes the developers from the range level network that are not committers in the current range.

Vertex attributes

There should be an mechanism to add further vertex attributes to the networks. The idea is to add these after the network creation (e.g., in a file util-covariates.R) as they are likely generated from different data.

Examples:

Authors
- e-mail address,
- commit count,
- artifact count,
- committer?
- artifacts as maintainer,
- core/peripheral,
- active ranges,
- first activity
Artifacts
- t.b.d.

Any ideas for this?

Rework parallelization of edge construction

In reference to issue #3 and commit 91cb0aa, we need to come up with an idea how to properly re-introduce parallelization to the edge-construction process.

From commit 930af63:

As the parallelization of the edge construction involves quite an amount of memory for (de-)serialization is needed and this amount is restricted to 2GB [1,2], the parallelization breaks for artifact-based networks on project level.
[...]
[1] https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17137
[2] http://stackoverflow.com/q/23231183

Any idea to reduce the return value for each of the parallelized function calls in the function construct.dependency.network.from.list is welcome.

Add a license

After watching the presentation on open-source licenses in the seminar this week, I remembered that we still lack a license in the network-library project.

I propose GPL v2 as a license to be in compliance with Codeface, on which we basically build and which is our main data source.

Any thoughts on that?

Note to self: In the end, we need the permission from every contributor to add the license.

Add functions 'get.author2mail' and 'get.author2thread'

It would be nice to also have the function 'get.author2mail' and 'get.author2thread' in order to be able to access mail data in a similar way as commit data can be accessed.

I have already implemented that, see attached patch (.txt appended to file name as GitHub does not allow to upload .patch files):
0001-Add-functions-get.author2mail-and-get.author2thread.patch.txt

Check arguments properly when instantiating objects

Currently, we do not properly check the arguments passed to any class constructor. We definitely need to improve that. For example, when instantiating a ProjectData object (see below), you can omit the project.conf parameter, which results in a NULL reference and breaking analyses.

https://github.com/se-passau/codeface-extraction-r/blob/bb3d2719aaba98e977389f89f9ca9f3c92bb6732/util-data.R#L166-L176

As a solution, we would like to throw an error when no argument is passed for any parameter.

Provide function for unified splitting of several network types

To compare networks of different types (e.g., mail networks and co-change networks), sometimes unified splitting time periods are needed. That is, we need to split exactly at the same dates to get unified ranges for mail and co-change networks. If we split mail and co-change networks separately, the ranges can deviate as the first commit and the first mail may not take place at exactly the same time.

Hence, what we need is a function that takes a couple of networks and a time window for splitting. Then we can apply the same splitting bins (beginning with earliest time stamp in the data source) to all networks.

Network metrics

As part of issue #11, we need to add an extensive list of networks metrics to the library for any further network analysis. The basic implementation can be found here already.

The next steps are:

incorporate all comments from https://github.com/hechtlC/codeface-extraction-r/commit/3330ab5cff55fece4882b2ab896015c4e7943562,
merge everything on https://github.com/hechtlC/codeface-extraction-r (branch master), and
open a pull request from there.

Note: This issue is basically there for bookkeeping.

Combine networks and provide multi-relation networks

As already mentioned in #11, it would be helpful to have a function which combines networks and keeps edge types, node types and (possibly) attributes.

There are already some existing implementations to combine networks, but they either do not keep the edge types or node types (https://github.com/se-passau/dev-network-growth/blob/master/util.R#L64) or work only for disjoint networks (https://github.com/se-passau/codeface-extraction-r/blob/master/util-networks.R#L877). What we need is a universal function which can combine all kinds of networks and keeps all types (and possibly attributes).

As far as I know, @ecklbarb needs multi-edge-type networks, i.e., networks which consist of networks of different types. For example, think about author networks which contain cochange-based edges and mail-based edges together in one network. As one idea is to combine author networks with mail relation and author networks with cochange relation to achieve that goal, this is related to combining networks in general.

@ecklbarb Please keep us informed with your progress here.

As a side note, this issue is also somehow related to #15 (which could be based on the combine-networks function resulting from this issue).

Use File_Level and Base_Feature with file granularity?

When looking at the File_Level and Base_Feature artifacts, we only look at them with project granularity, i.e., we do not distinguish the different instances by the very files they are changed in. But, in the Conway analysis, I have seen such differentiation of file and project granularity.

Should we add a possibility for differentiation to the network configuration?
@ecklbarb, do you need that for your studies?

The corresponding code preventing the differentiation (for File_Level, at least) is the following: Here, we basically convert file granularity back to project granularity.
https://github.com/se-passau/codeface-extraction-r/blob/a53d04c745add2e30cbfa0d06450485596ca071e/util-read.R#L80

Add 'type' attribute to all networks

Currently, we only have a type attribute for vertices and edges of bipartite or multi networks. It would be nice to have the type attribute also for pure author networks or artifact networks.
Having such type attributes for every network will make the node color compatible with the legend when using the plot.network function.

Improvements on issue-reading functionality

To enhance the interoperability with the upcoming issue data from Jira trackers, we need to handle the issue statuses properly -- and later also the event types.

To give an example: In Jira, we have issue statues such as "Closed", "Open", "Resolved", "In Progress", and probably some more.
In GitHub, on the other side, we only have "CLOSED" and "OPEN".

Apart from the different set of values, we also ned to deal with the capitalization somehow.

This issue is only designated to remind us of potential problems, just in case, we need to deal with that.