blair-sambanis-replication's Issues

Are the PITF predictions lead by 1 time-period or 1 year?

The PITF predictions ("pred_prob_plus1") in the data are annual, i.e. for any given country the vector only changes value once per year. However, it appears that, at least in the 1-month version of the data, this indicator is only led by 1 month, not 1 year.

library(readr)
library(dplyr)

df <- read_rds("trafo-data/1mo_data.rds")
df %>%
  as_tibble() %>%
  filter(country_iso3=="AFG") %>%
  select(year, month, period, pred_prob_plus1) %>%
  print(n = 25)
# A tibble: 180 x 4
    year month period pred_prob_plus1
   <int> <int>  <dbl>           <dbl>
 1  2001     1      1           0.209
 2  2001     2      2           0.209
 3  2001     3      3           0.209
 4  2001     4      4           0.209
 5  2001     5      5           0.209
 6  2001     6      6           0.209
 7  2001     7      7           0.209
 8  2001     8      8           0.209
 9  2001     9      9           0.209
10  2001    10     10           0.209
11  2001    11     11           0.209
12  2001    12     12           0.228
13  2002     1     13           0.228
14  2002     2     14           0.228
15  2002     3     15           0.228
16  2002     4     16           0.228
17  2002     5     17           0.228
18  2002     6     18           0.228
19  2002     7     19           0.228
20  2002     8     20           0.228
21  2002     9     21           0.228
22  2002    10     22           0.228
23  2002    11     23           0.228
24  2002    12     24           0.234
25  2003     1     25           0.234
# … with 155 more rows
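
A quick way to check this across all countries, rather than just AFG, is to tabulate the month in which pred_prob_plus1 changes value (a sketch, using the df loaded above; this code is not in the repo). A full 1-year lead of an annual series should show changes in January, while a 1-month lead shows up as changes in December:

library(dplyr)

df %>%
  as_tibble() %>%
  group_by(country_iso3) %>%
  arrange(period, .by_group = TRUE) %>%
  # flag rows where the PITF prediction changes relative to the previous month
  mutate(changed = pred_prob_plus1 != lag(pred_prob_plus1)) %>%
  ungroup() %>%
  filter(changed) %>%
  count(month)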

This is problematic. If, e.g., the PITF model predictions for 2008 are created using data through but not beyond 2007, then there is no need to lag/lead the predictions. If, on the other hand, the 2008 predictions use 2008 data, then we should lag them by 1 year, expressed in the respective time aggregation (12 one-month periods or 2 six-month periods). I.e. for the 2008 rows in the data, we should use PITF predictions from 2007.

@rickmorgan2, if you have time, would you mind looking into this? Maybe I'm overthinking something here...

Adjust Table 2 for common case subset

The models in Table 2 have different test N's:

Table    Horizon    Model                 Escalation Only   With PITF Predictors   Weighted by PITF   PITF Split Population   PITF Only
Table 2  1 month    Base specification    13748             13155                  13461              13748                   13510
Table 2  6 months   Base specification    2366              2264                   2317               2366                    2265

Need to adjust the AUC-ROC calculations to operate on the common subset of cases, otherwise they are not directly comparable.
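
A sketch of what the adjustment could look like, assuming the five test-set prediction vectors and the test DV have been collected into a data frame called preds (the object and column names here are assumptions, not actual objects in the repo):

# hypothetical column names for the five models' test-set predictions
model_cols <- c("escalation_only", "with_pitf", "weighted_by_pitf",
                "pitf_split_pop", "pitf_only")

# rank-based (Wilcoxon) AUC-ROC, to avoid depending on a specific package
auc_roc <- function(obs, pred) {
  r  <- rank(pred)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# keep only cases where the DV and all five predictions are non-missing,
# then compute each model's AUC-ROC on that common subset
keep   <- complete.cases(preds[, c("test_DV_civil_ns", model_cols)])
common <- preds[keep, ]
sapply(model_cols, function(m) auc_roc(common$test_DV_civil_ns, common[[m]]))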

Running rep_nosmooth/2-model-runner.R fails due to memory issue

The 2-model-runner.R script trains/estimates all the models needed for the replication. When trying to run it now, memory usage blows up until the system eventually crashes. Looking at the process resources, it seems that memory is not freed and just keeps growing; I had one parallel worker using almost 50GB before things crashed.

Happens with plan("sequential") as well. According to the discussion by the doFuture package author, this probably indicates an issue with one of the package dependencies (randomForest?). https://github.com/HenrikBengtsson/future/issues/368 (not gonna link, don't want it to show up there)

Debug what data do the PITF Split Population models actually use

I'm a bit confused about what data the PITF split pop models actually use.

Conceptually, B&S split the data into two pieces, low- and high-risk cases, using the PITF model predictions, then train two separate RF models that each use the escalation feature specification, and combine the two sub-population predictions again at the end.

The models in 1mo_run_escalation_split_PITF.R use data in escalation_highriskPITF_train_frame_civil_ns and escalation_lowriskPITF_train_frame_civil_ns.

So far so good, but when I look in 1mo_define_models.R, I don't see that they are created any differently from the data frames for all the other models, i.e. they do not use the train_highrisk and train_lowrisk data frames that are set up in +master.R (lines 71-90).

Searching for "train_lowrisk" strings in all files shows they are only in +master.R. If the high and low risk models are actually essentially two versions of the same escalation RF (i.e. same specification), then the only difference in their predictions is due to RNG variation. Which, looking at the AUC-ROC values in Table 2, might be the case. Both different by 0.01, which could be due to RNG variation.
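
A quick check, assuming the 1-month scripts have been sourced so that the two training frames exist in the workspace:

# If the split-population models really use high-/low-risk subsets, these two
# frames should differ (at minimum in their row counts); identical() returning
# TRUE would confirm the suspicion that they are copies of the same frame.
identical(escalation_highriskPITF_train_frame_civil_ns,
          escalation_lowriskPITF_train_frame_civil_ns)

dim(escalation_highriskPITF_train_frame_civil_ns)
dim(escalation_lowriskPITF_train_frame_civil_ns)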

RNG variance for non-escalation models in Table 2?

The results in our Table 2 recreation seem to have changed quite a bit when I just dropped in new model run results (91d9da0), compared to what was there when I started interpreting the results.

Here's a screenshot of the git diff for the relevant file:

[Screenshot: git diff of the Table 2 results file, 2020-06-05]

Previously, the "With PITF Predictors" model was better both times, "Weighted by PITF" and "PITF Split Population" had mixed results, and "PITF Only" was worse both times.

Now, all three alternatives are better, each both times. And the PITF Only model is quite close to the Escalation model.

Incorrect weighting for Table 2 "Weighted by PITF" predictions?

The "Weighted by PITF" model is a combination of the Escalation model predictions multiplied by the PITF model predictions. From the paper:

The second (column 3) uses PITF predicted probabilities to weight the results of the escalation model, ensuring that high-risk and low-risk countries that happen to take similar values on ICEWS-based predictors are nonetheless assigned different predicted probabilities in most months.

I think the weighting is incorrect. The test set predictions are weighted using the training data PITF model predictions.

In 1mo_run_escalation_weighted_PITF.R:

            weight <- as.numeric(train$pred_prob_plus1)
            length(weight) <- length(prediction_escalation_inc_civil_ns)
            
            prediction = prediction_escalation_inc_civil_ns*weight
            prediction[is.na(escalation_test_frame_civil_ns$test_DV_civil_ns)] <- NA

train$pred_prob_plus1 references the training data, which is set up from line 43 onward in +master.R:

# Load data

data <- read.dta13("data/1mo_data.dta")

# Define training and testing sets for base specification

train_period = mean(data$period[which(data$month==12 & data$year==2007)])
end_period = mean(data$period[which(data$month==12 & data$year==2015)])
train <- data[data$period<=train_period,]
test <- data[data$period>train_period & data$period<=end_period,]

prediction_escalation_inc_civil_ns is created in 1mo_run_escalation.R line 205:

prediction_escalation_inc_civil_ns <- prediction

where prediction is created higher up in that same script; it is just the base specification escalation predictions for the test set:

prediction <- as.numeric(predict(train_model, newdata = data.matrix(escalation_test_frame_civil_ns), type="response"))
prediction[is.na(escalation_test_frame_civil_ns$test_DV_civil_ns)] <- NA

Check this. What are the object lengths? And if the weighting is indeed wrong, what is the AUC-ROC when using the test-set rather than the training-set PITF predictions? Check for both the 1- and 6-month horizons. This kind of thing would be good to document.
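
A sketch of the check, assuming the objects from +master.R and 1mo_run_escalation.R are in the workspace; the corrected weighting at the end is my reading of what the paper describes, not code that exists in the repo:

# Object lengths: if the weight vector comes from the training data, its
# length will not match the number of test-set predictions
length(train$pred_prob_plus1)
length(test$pred_prob_plus1)
length(prediction_escalation_inc_civil_ns)

# Presumably intended weighting: weight the test-set escalation predictions
# by the test-set PITF predictions
weight_test <- as.numeric(test$pred_prob_plus1)
prediction_weighted <- prediction_escalation_inc_civil_ns * weight_test
prediction_weighted[is.na(escalation_test_frame_civil_ns$test_DV_civil_ns)] <- NA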

Maybe just add model def info to model table?

Right now the model table basically only has horizon, row, and column info; the other model settings needed to run the models, like the DV used, features, HP settings, etc., are in separate lists. Not sure why I did that... maybe the model runner code would be simpler if I just added all that info to the model table (see the sketch below)? Maybe leave it out of the results table?
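
If I do fold the settings in, list-columns would be one way to keep everything per-model in one table. A sketch of a possible shape, assuming the feature-name vectors escalation and escalation_6mo from the replication scripts exist; the column names and example values are made up, not the current structure:

library(tibble)

model_table <- tibble(
  horizon  = c("1mo", "6mo"),
  row      = c("Base specification", "Base specification"),
  column   = c("Escalation Only", "Escalation Only"),
  dv       = c("incidence_civil_ns_plus1", "incidence_civil_ns_plus1"),
  # list-columns carry the per-model feature sets and hyperparameter settings
  features = list(escalation, escalation_6mo),
  hp       = list(list(ntree = 1e5, maxnodes = 5, sampsize = 100),
                  list(ntree = 1e5, maxnodes = 5, sampsize = 100))
)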

Check details of the forecast assessment.

Notes to future self to document some details about the 2016-H1 forecast assessment.

Are the forecasts for 2016-H1, or actually 2016-H2?

In 6mo_run_escalation_OOS.R, the actual forecast model:

train_model <- randomForest(escalation_6mo_train_formula_civil_ns,
                            type = regression, importance = TRUE, 
                            na.action = na.omit, ntree = 100000, maxnodes = 5, 
                            sampsize = 100, 
                            replace = FALSE, do.trace = FALSE , 
                            data = escalation_6mo_train_frame_civil_ns, forest = TRUE   
)	

escalation_6mo_train_formula_civil_ns is a formula that is created in 6mo_define_models_OOS.R.

escalation_6mo_train_formula_civil_ns <- as.formula(paste("train_DV_civil_ns",paste(escalation_6mo, collapse=" + "), sep = "~", collapse=NULL))

train_DV_civil_ns refers to the DV, which is created in +master.R line 524:

train_DV_civil_ns <- train$incidence_civil_ns_plus1

For the predictions, back to 6mo_run_escalation_OOS.R:

prediction <- as.numeric(predict(train_model, newdata = data.matrix(escalation_6mo_test_frame_civil_ns), type="response"))    

This is also turned into the pred_escalation_6mo_inc_civil_ns data frame in that file. The escalation_6mo_test_frame_civil_ns that is used to create the predictions is again created in 6mo_define_models.R, using as DV the vector test_DV_civil_ns, which is created in +master.R line 525 from test$incidence_civil_ns_plus1.

All this is created from a test data frame that is set up from line 513 onward in +master.R. It includes 2016-H1:

> range(test$year)
[1] 2008 2016

So, if the target of the forecast model is a 1-period lead version of the DV ("_plus1"), then feeding 2016-H1 data into the model generates predictions for 2016-H2, not H1.

In 6mo_make_confusion_matrix.do, the non-lagged "incidence_civil_ns" is used for scoring. So I think these are 2016-H2 forecasts scored using 2016-H1 outcomes. But it probably doesn't matter anyway, since there are very few onsets, only ongoing conflicts.

What are the ongoing / new conflicts in 2016-H1/H2?

Using the test data that is created in the OOS section at the bottom of +master.R.

foo <- test %>% 
  as_tibble() %>%
  select(year:incidence_civil_ns, incidence_civil_ns_plus1) %>%
  filter(year >= 2015)

Both DV variables have only 0 and NA values in these data; there are no 1 (onset) values.

There are 15 countries with ongoing conflicts:

> foo %>% filter(year==2016, is.na(incidence_civil_ns)) 

Afghanistan, Democratic Republic of Congo, Colombia, India, Iraq, Libya, Mali, Nigeria, Pakistan, Russia, Sudan, Somalia, Syria, Ukraine, Yemen

In the hand-coded alternative coding (6mo_make_confusion_matrix.do), the civil war in Colombia is coded as having ended in 2015-H1, with no civil war for the 2016 forecast period. Turkey and Burundi, however, are coded as seeing new civil wars in 2016, i.e. new onsets.

The 2 false positives in the B&S confusion matrices are DRC and Libya.

So with the original coding, they hit 13 ongoing civil wars and miss 2 ongoing civil wars. With the alternative coding, they hit 12 ongoing, miss 2 ongoing, and hit the 2 new onsets.

Re-reading the text, it now sounds to me like they do kind of acknowledge that they want to take credit for predicting ongoing civil wars, e.g.:

Importantly, we make no assumptions about the persistence of civil war between the second half of 2015 and the first half of 2016, meaning that all countries—even those with ongoing conflicts as of December 31, 2015—are included in the 2016 test set. As a result, 14 of the 30 countries on the list (and nine of the top 10) are cases of ongoing conflict.

Context: in the data, ongoing civil wars are otherwise NAs on the DV and thus drop out.

This is so weird.

  1. At the time the forecasts were made, in early 2016-H1 (the pre-registration date is in March, I think), we would have known the civil war state for 2015-H2.
  2. We would not yet have known the civil war state for 2016-H1.
  3. However, we are not scoring the forecasts until after 2016-H1 has concluded, when we can observe the civil war state and determine whether a conflict was ongoing, terminated, or newly started.
  4. Does it make sense to make forecasts for civil wars that are ongoing in 2015-H2, i.e. to put those cases back into our data as being at risk for a new onset?
    1. We know the 2015-H2 civil war state, including presumably termination, so it would make sense to include cases that terminated and are now at risk of another, new civil war outbreak.
    2. If the civil war coding were conflict-specific, i.e. a country could have multiple distinct conflicts at any given time, then it would make sense to not drop ongoing cases. But the Sambanis coding is at the country level, I think, so it doesn't work that way: if a country has an ongoing conflict, no new onset is possible. Maybe back-to-back termination-onsets are possible, but I don't think any of the conflicts above are that.
    3. Aside from (i) above, I don't see how it makes sense to include 2015-H2 ongoing conflicts back into the data for 2016-H1. You'd have to assume that all those conflicts end in 2015-H2 and the countries are at risk again, but we know they didn't end!
  5. Anyway, by the time we are scoring the 2016-H1 forecasts, we know what the state of the world is. Ongoing is ongoing; it's weird to take credit for that as if it were an onset.

Memory allocation errors in RF tuning experiments

I've been running tuning experiments that use the original training data to perform cross-validation over different hyperparameter values. I keep getting memory allocation errors when trying some combinations of hyperparameter values, especially for the CAMEO specification:

Error in { : task 3 failed - "vector memory exhausted (limit reached?)"
Calls: sourceWithProgress -> eval -> eval -> %dopar% -> <Anonymous>
Execution halted

The problem is that randomForest ends up exhausting the available memory. I think I read that one solution is to increase the default maximum vector size in R, but I don't really want to stray into that kind of territory. The other option would be to use ranger, which has more efficient memory management (or at least claims to). But then one could argue that any differences in results are due to using ranger instead of randomForest.
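
For reference, the rough ranger equivalent of the randomForest calls used in the tuning experiments would look something like this. The parameter mapping is approximate (ranger has min.node.size instead of nodesize and a sample.fraction instead of an absolute sampsize), and train_df and escalation stand for the tuning script's training data and feature names:

library(ranger)

fitted_mdl <- ranger(
  dependent.variable.name = "incidence_civil_ns_plus1",
  data            = train_df[, c("incidence_civil_ns_plus1", escalation)],
  num.trees       = 1000,                   # ntree
  mtry            = 3,
  min.node.size   = 1,                      # nodesize
  sample.fraction = 100 / nrow(train_df),   # roughly sampsize = 100
  replace         = FALSE,
  num.threads     = 1                       # keep memory use predictable
)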

My impression is that this is partly related to high mtry and high ntree values, but I also suspect that the sample size for each tree plays a role. So far I have been sampling with replacement using a sample size equal to the number of training data rows, and only varying mtry, ntree, and nodesize, but not sampsize (with replacement, sampsize by default equals the number of rows in the training data).

The reason for not changing the sampling and setting it smaller is that in the training-data cross-validation, even when splitting the training data in half, some folds end up with only 1 positive case. Each tree should have at least one positive case in its data sample, and sampling a full training set's worth of rows with replacement makes that work. This is why B&S need to use regression, not classification trees: I suspect that almost all of the trees in the forests that B&S train end up with only 0 values in the outcome.

On the other hand, it is not possible to reliably increase the number of positive cases in the training split by switching to 3-fold or higher CV: while the training split will then end up with more positive cases, sometimes the test fold will end up with 0 positive cases, which makes AUC undefined.

So far, only doing a half/half split via 2-fold CV seems to guarantee that both the training split and the validation split contain at least one of the 9 positive cases in the original B&S training data.

Two ways forward:

  • In the parallel loop in run-tune-experiments.R, catch errors (tryCatch) so that the worker can move on to the next task. Any hyperparameter set that produces an error can then simply be invalidated later (see the tryCatch sketch at the end of this issue).
  • In RF, try stratified sampling by the outcome so that sampsize can also become a tunable parameter while still ensuring that each data sample contains at least one positive case.

This would basically look like this:

library(randomForest)

# Outcome as a factor so randomForest runs in classification mode; putting
# the "1" level first means sampsize[1] refers to the positive cases.
y_fac <- factor(train_df$incidence_civil_ns_plus1, levels = c("1", "0"))

fitted_mdl <- randomForest(y = y_fac,
                           x = train_df[, escalation],
                           ntree = 1000,
                           mtry  = 3,
                           nodesize = 1,
                           # stratify the per-tree samples by the outcome
                           strata = y_fac,
                           # 1 positive ("1") and 1000 negative ("0") cases per tree
                           sampsize = c(1, 1000),
                           # stratified sampling apparently requires replace = FALSE
                           replace = FALSE,
                           do.trace = FALSE)

Some notes on this:

  • The sampsize vector defines how many samples to take from the "1" and "0" cases.
  • RF seems to only allow sampling without replacement when using stratified sampling.
  • As a result, in sampsize, we cannot sample more "1" cases than there are in any particular CV train data split. Sometimes this value will be 1, sometimes more.
  • So, either hold sampsize[1] constant at 1 and vary only sampsize[2], or let sampsize[1] vary up to the number of positives in a training split and vary sampsize[2] as well.

BTW, the prevalence in the training data is roughly 1 to 1300.

I think the first approach, where sampsize[1] is always 1, is easier. Otherwise there will be a dependency on the number of positives in a particular training split, and I don't want to optimize over that.
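
For the first bullet above, the tryCatch wrapper could look roughly like this; the foreach structure, hp_grid, and run_cv() are stand-ins for whatever run-tune-experiments.R actually does, not the real code:

library(foreach)

results <- foreach(i = seq_len(nrow(hp_grid)), .combine = rbind) %dopar% {
  hp  <- hp_grid[i, ]
  auc <- tryCatch(
    # run_cv() stands in for the function that does the 2-fold CV for one
    # hyperparameter combination and returns the validation AUC
    run_cv(mtry = hp$mtry, ntree = hp$ntree, nodesize = hp$nodesize,
           sampsize = hp$sampsize),
    # e.g. "vector memory exhausted": record NA and move on to the next task
    error = function(e) NA_real_
  )
  data.frame(hp, auc = auc)
}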

Verification of refactored replication

The replication script in rep_nosmooth is substantially different from the original replication code. Make sure that the results are "correct" with respect to the original replication code and results.

  • Table 2 N_test numbers
  • Look into why Table 2 weighted by PITF is not reproducing at all

Re-run original replication files

Do a complete run through the original replication files.

Add code to extract relevant model objects to use for verifying our refactored replication code.

I'm tagging all commits that alter the original replication code to this issue so that they are easier to see in one place.
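
The extraction could be as simple as dropping a few saveRDS() calls into the original scripts right after the relevant objects are created, e.g. (the file paths are placeholders):

# e.g. in 1mo_run_escalation.R, after train_model and prediction exist
saveRDS(train_model, "original-output/1mo_escalation_model.rds")
saveRDS(prediction,  "original-output/1mo_escalation_test_predictions.rds")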
