andybega / icews
Get the ICEWS event data
Home Page: https://www.andybeger.com/icews/
License: Other
What is the license of the ICEWS data? (Not the license of the ICEWS R package, which is MIT.)
I cannot find the license here: https://dataverse.harvard.edu/dataverse/icews
The latest individual file seems to imply "For Official Use Only (FOUO), government sponsored research activities"; see https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QI2T9A
I am unsure whether that applies to all of the data. In any case, the "terms" pane says "CC0 Public domain".
I am also unsure what FOUO means; some users on Wikipedia say it means public domain (https://en.wikipedia.org/wiki/Talk:For_Official_Use_Only), which is consistent with the "terms" pane.
Still, look at this:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075
Terms pane: RESTRICTIONS ON USE: THESE MATERIALS ARE SUBJECT TO COPYRIGHT PROTECTION AND MAY ONLY BE USED AND COPIED FOR RESEARCH AND EDUCATIONAL PURPOSES. THE MATERIALS MAY NOT BE USED OR COPIED FOR ANY COMMERCIAL PURPOSES. © 2015 Lockheed Martin Corporation and BBN-Raytheon. All rights reserved.
This would be helpful for conveying the general structure, and also in the vignettes and some examples.
dr_icews(dp_path = "", raw_file_dir = "")
versus
dr_icews()
Use a table "stats", for now only containing the tuple (events_n, [some number]), to store the number of rows in the main events table. This is one of the things that somewhat slows down dr_icews.
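A cached row count could be sketched like this; the table schema, names, and the use of an in-memory SQLite database are assumptions for illustration, not the package's actual internals.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE events (event_id INTEGER, event_date TEXT);")
dbExecute(con, "CREATE TABLE stats (name TEXT PRIMARY KEY, value INTEGER);")

# Recompute the events row count once and cache it in "stats", so a status
# check can read the cached value instead of running COUNT(*) every time.
update_stats <- function(con) {
  n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM events;")$n
  dbExecute(con,
            "INSERT OR REPLACE INTO stats (name, value) VALUES ('events_n', ?);",
            params = list(n))
  invisible(n)
}

update_stats(con)
dbGetQuery(con, "SELECT value FROM stats WHERE name = 'events_n';")
dbDisconnect(con)
```

SQLite's `INSERT OR REPLACE` on the primary key keeps `stats` to one row per statistic.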
inst/sql
Ingesting records from '20181031-icews-events.tab'
Error in result_bind(res@ptr, params) :
UNIQUE constraint failed: events.event_id, events.event_date
I.e. to allow this:
opts <- get_icews_opts()
old_opts <- unset_icews_opts()
get_icews_opts()
set_icews_opts(old_opts)
# should all be the same
get_icews_opts(); opts; old_opts
Makes testing easier, since I can just return the plan when dryrun = TRUE.
Testing the database related stuff will be difficult, but I can test the downloaders with the dry run option.
Functions to set up and eventually keep in sync a local event database. SQLite?
When adding events to the database, events with an already existing event ID are not added again. If all events in a ".tsv" source file are duplicates and thus none are added to the "events" table in the DB, the name of the source file is stored in the "null_source_files" table. This is because the "source_files" table is created in reference to the "source_file" column in the "events" table, and thus those files wouldn't show up. The DB state getter does not include the null source files.
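The bookkeeping described above could be sketched roughly like this; the function, table, and column names here are assumptions for illustration, not the package's actual internals.

```r
library(DBI)

# Sketch: ingest one source file's events. Rows whose event_id already exists
# are skipped; if no rows survive the duplicate check, record the file name in
# "null_source_files" so the state getter can still account for the file.
ingest_file <- function(con, file_name, events) {
  existing   <- dbGetQuery(con, "SELECT event_id FROM events;")$event_id
  new_events <- events[!events$event_id %in% existing, ]
  if (nrow(new_events) == 0L) {
    dbExecute(con, "INSERT INTO null_source_files (name) VALUES (?);",
              params = list(file_name))
  } else {
    dbWriteTable(con, "events", new_events, append = TRUE)
  }
  invisible(nrow(new_events))
}
```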
After a fresh install on Ubuntu 18.04, the following fails with an error:
library("icews")
library("DBI")
library("dplyr")
library("usethis")
# Note: do not end the data_dir with a slash
setup_icews(data_dir = "/home/mk/Documents/data/icews", use_db = TRUE, keep_files = TRUE, r_profile = TRUE)
update_icews(dryrun = TRUE)
update_icews(dryrun = FALSE)
Message I got after the last line of code is:
Downloading '20181004-icews-events.zip'
Ingesting records from '20181004-icews-events.tab'
Error in if (min(events$event_date) <= max_date_in_db) { :
valor ausente donde TRUE/FALSE es necesario
(the last line is more or less "missing value where TRUE/FALSE is necessary")
With keep_files = FALSE instead (after restarting R), this is the error
Ingesting records from '20181004-icews-events.tab'
Error in get_fileid.character(dataset, file, key = key, server = server, :
File not found
Same behaviour after updating all Ubuntu packages and running update.packages() in R.
R version 3.6.0 (2019-04-26)
For testing, examples, etc.
Often, the dataverse server is slow and update_icews() stops. It would be great to have an option to relaunch it automatically in these cases (maybe after a delay, specified in seconds). There are at least 2 types of errors for which relaunching works:
Gateway Timeout (HTTP 504).
parse error: premature EOF
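A relaunch option could look roughly like the wrapper below. This is a sketch, not part of the package; the error-message matching is an assumption based on the two error types listed above, and `delay` is the suggested pause in seconds.

```r
# Sketch: re-run update_icews() after transient dataverse errors, waiting
# `delay` seconds between attempts, up to `tries` attempts in total.
retry_update <- function(tries = 5, delay = 30, ...) {
  for (i in seq_len(tries)) {
    result <- tryCatch(
      { update_icews(...); "ok" },
      error = function(e) {
        transient <- grepl("HTTP 504|premature EOF", conditionMessage(e))
        if (!transient || i == tries) stop(e)
        message("Transient error, retrying in ", delay, "s: ",
                conditionMessage(e))
        Sys.sleep(delay)
        "retry"
      }
    )
    if (identical(result, "ok")) return(invisible(TRUE))
  }
}
```

Non-transient errors are re-thrown immediately, so only the gateway timeout and premature-EOF cases trigger a retry.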
> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) :
Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181129-icews-events.zip'
Ingesting records from '20181129-icews-events.tab'
Downloading '20181130-icews-events.zip'
Ingesting records from '20181130-icews-events.tab'
Downloading '20181203-icews-events.zip'
Ingesting records from '20181203-icews-events.tab'
Downloading '20181204-icews-events.zip'
Ingesting records from '20181204-icews-events.tab'
Downloading '20181205-icews-events.zip'
Ingesting records from '20181205-icews-events.tab'
Downloading '20181206-icews-events.zip'
Ingesting records from '20181206-icews-events.tab'
Downloading '20181207-icews-events.zip'
Ingesting records from '20181207-icews-events.tab'
Downloading '20181208-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) :
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) :
Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
parse error: premature EOF
(right here) ------^
> update_icews(dryrun = FALSE); date()
Downloading '20181208-icews-events.zip'
Ingesting records from '20181208-icews-events.tab'
Downloading '20181209-icews-events.zip'
Ingesting records from '20181209-icews-events.tab'
Downloading '20181210-icews-events.zip'
Ingesting records from '20181210-icews-events.tab'
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) :
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) :
Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) :
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) :
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) :
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Ingesting records from '20181211-icews-events.tab'
Downloading '20181212-icews-events.zip'
Ingesting records from '20181212-icews-events.tab'
Downloading '20181213-icews-events.zip'
Output should match that of the from-file version of read_icews.
While updating:
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) :
UNIQUE constraint failed: events.event_id, events.event_date
Sometimes a local file and associated event set will be superseded by a new version on DVN. E.g. this will most likely occur with the current 2008 file as it expands to cover more of the year.
The file name patterns are consistent: events.[year].[yyyymmddhhmmss].tab.
Separate that into an event set (events.[year]) and a version based on the date?
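That split could be sketched with a regular expression; `parse_events_file` is a hypothetical helper name, not a package function.

```r
# Sketch: split a file name of the form events.[year].[yyyymmddhhmmss].tab
# into its event-set and version parts, so a newer version can be detected
# as superseding an older file for the same year.
parse_events_file <- function(x) {
  m <- regmatches(x, regexec("^events\\.(\\d{4})\\.(\\d{14})\\.tab$", x))[[1]]
  if (length(m) == 0) return(NULL)  # not an annual-file name
  list(set = paste0("events.", m[2]), version = m[3])
}

parse_events_file("events.2008.20150313084156.tab")
# set: "events.2008", version: "20150313084156"
```

Two files with the same `set` but different `version` values would then be candidates for replacement.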
Getting the distinct source file list can take several seconds. Since it is often needed without any subsequent actions being performed, maybe keep the source file list in a second table that is updated automatically when there are changes in the events table. This would trade a relatively trivial additional amount of time when deleting or inserting rows--which already takes very long--for a much faster read when determining the DB/events state.
See
Event ID is not unique because there are duplicate events.
In all cases, the duplicate events can be distinguished by event date. And in all cases there are exactly 2 versions of each duplicate event.
events %>%
  group_by(`Event ID`) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(n > 1) %>%
  group_by(`Event ID`, `Event Date`) %>%
  dplyr::summarize(n = n()) -> foo
> foo
# A tibble: 290,624 x 3
# Groups: Event ID [?]
`Event ID` `Event Date` n
<int> <date> <int>
1 20718170 2013-11-12 1
2 20718170 2014-01-01 1
3 20718171 2013-11-12 1
4 20718171 2014-01-01 1
5 20718172 2013-11-12 1
6 20718172 2014-01-01 1
7 20718173 2013-11-12 1
8 20718173 2014-01-01 1
9 20718174 2013-11-12 1
10 20718174 2014-01-01 1
# ... with 290,614 more rows
foo %>% group_by(`Event ID`) %>% summarize(n = n()) %>% group_by(n) %>% summarize(cases = n())
# A tibble: 1 x 2
n cases
<int> <int>
1 2 145312
What to do with these? Silently drop and keep the later date version?
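The "keep the later date version" option could be sketched like this, on toy data shaped like the duplicates above (IDs and dates taken from the table; the real column names would come from the raw files).

```r
library(dplyr)

# Toy data mimicking the duplicate pattern shown above.
events <- tibble::tibble(
  `Event ID`   = c(20718170L, 20718170L, 20718171L),
  `Event Date` = as.Date(c("2013-11-12", "2014-01-01", "2013-11-12"))
)

# Keep only the later-dated version of each duplicated event ID; IDs that
# occur once are unaffected.
deduped <- events %>%
  group_by(`Event ID`) %>%
  filter(`Event Date` == max(`Event Date`)) %>%
  ungroup()
```

Since every duplicated ID has exactly two versions distinguishable by date, this resolves all of the cases in the table above.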
See globals in icews-package.R
Should print the setup options
Other tables are created from SQL files, but the "events" table is not. Path dependence, probably because it was the first table I set up, or maybe because it has indices--which in any case can be part of the create-table SQL file.
Then just run "events.sql" with "execute_sql()" like for the other tables.
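An events.sql file could look roughly like the string below; the columns shown are assumptions (only event_id and event_date are known from the errors above), with the remaining columns elided.

```r
# Hypothetical contents for inst/sql/events.sql: the indices live in the same
# script as the CREATE TABLE, so "events" can be created the same way as the
# other tables.
events_sql <- "
CREATE TABLE IF NOT EXISTS events (
  event_id   INTEGER NOT NULL,
  event_date TEXT    NOT NULL,
  -- ... remaining columns ...
  PRIMARY KEY (event_id, event_date)
);
CREATE INDEX IF NOT EXISTS idx_events_date ON events (event_date);
"
```

The composite primary key matches the UNIQUE constraint on (event_id, event_date) seen in the ingest errors above.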
Hi! I spotted this project on Twitter at https://twitter.com/andybeega/status/1103226111855607809
I noticed that you're using "DVN" in your README (e.g. "the current version on DVN") but in some contexts, "Harvard Dataverse" would be preferred. Let me try to summarize a few terms:
I'd be happy to make a pull request if you'd like. Please let me know. Thanks! Great project!
Also, if you're interested in helping with the "dataverse" R package, please leave a comment at IQSS/dataverse-client-r#21
Specifically, make sure event_date is Date.
Hi! First, thank you for developing the icews package. I am trying to use the minimalist functionality and am running into an error.
This error occurs for both the update_icews() and download_data() functions when dryrun is set to FALSE. My setup has use_db = F and keep_files = T.
update_icews(dryrun = F)
Downloading 'events.1995.20150313082510.tab.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : Not Found (HTTP 404).
I am hoping this is a common error and an answer is readily available. Thanks for your help.
Open files from R, or at least point to documentation location?
After a fresh install on Ubuntu 18.04, the following fails after downloading 151 files (73.1 MB) with an error:
library("icews")
library("DBI")
library("dplyr")
library("usethis")
setup_icews(data_dir = "/home/mk/Documents/data/icews", use_db = TRUE, keep_files = TRUE, r_profile = TRUE)
update_icews(dryrun = TRUE)
update_icews(dryrun = FALSE)
# (...... downloads 151 files, ingesting correctly 294687 rows in sqlite database)
Downloading '20190309-icews-events.zip'
Error in writeBin(as.vector(f), tmp) : can only write vector objects
Launching update_icews(dryrun = FALSE)
again and again does not solve the issue.
The following (launched after the error) might help:
> update_icews(dryrun = TRUE)
File system changes:
Found 151 local data file(s)
Downloading 84 file(s)
Removing 0 old file(s)
Database changes:
Deleting old records for 0 file(s)
Ingesting records from 84 file(s)
Plan:
Download '20190309-icews-events.zip'
Download '20190309-icews-events.zip'
Ingest records from '20190309-icews-events.tab'
Ingest records from '20190309-icews-events.tab'
Download '20190311-icews-events.zip'
Ingest records from '20190311-icews-events.tab'
Download '20190312-icews-events.zip'
Ingest records from '20190312-icews-events.tab'
Download '20190313-icews-events.zip'
Ingest records from '20190313-icews-events.tab'
Download '20190314-icews-events.zip'
(etc.)
Takes up half the space, and apparently the index is faster as well.
Any data updates become painfully slow, I think because each of the potentially millions of inserts/deletes is a separate transaction that fires the trigger, meaning the stats tables are also updated after every single write/remove.
Better to move this to R and manually rebuild the stats tables after the relevant operations.
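Moving it to R could be sketched as below; the table and function names are assumptions, not the package's actual schema.

```r
library(DBI)

# Sketch: do the bulk write inside one transaction, then rebuild the stats
# once afterwards in R, instead of a per-row trigger firing on every insert.
ingest_and_refresh <- function(con, new_events) {
  dbWithTransaction(con, {
    dbWriteTable(con, "events", new_events, append = TRUE)
  })
  n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM events;")$n
  dbExecute(con,
            "INSERT OR REPLACE INTO stats (name, value) VALUES ('events_n', ?);",
            params = list(n))
  invisible(n)
}
```

One COUNT(*) per file is cheap relative to the ingest itself, so the cost moves from every row to every operation.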
Something that takes the path arguments as input and returns normalized paths as output.
Why? Core behavior right now relies on having arg = NULL defaults, and each user facing function has arguments for the paths that requires a lot of duplicate code to substitute the correct paths if the environment variable option (ICEWS_DATA_DIR) is used.
Also use this for input validation (e.g. error if one path is NULL and the other is not).
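A single resolver could look roughly like this; `resolve_paths` and the subdirectory names are hypothetical, and only the ICEWS_DATA_DIR fallback and the one-NULL validation come from the text above.

```r
# Sketch: resolve user-supplied path arguments, falling back to the
# ICEWS_DATA_DIR environment variable, and validate the combination once.
resolve_paths <- function(db_path = NULL, raw_file_dir = NULL) {
  data_dir <- Sys.getenv("ICEWS_DATA_DIR", unset = NA)
  if (is.null(db_path) && !is.na(data_dir)) {
    db_path <- file.path(data_dir, "db", "icews.sqlite3")  # assumed layout
  }
  if (is.null(raw_file_dir) && !is.na(data_dir)) {
    raw_file_dir <- file.path(data_dir, "raw")             # assumed layout
  }
  # Error if exactly one of the two paths is set.
  if (xor(is.null(db_path), is.null(raw_file_dir))) {
    stop("Either set both paths or neither; see setup_icews().")
  }
  list(db_path = db_path, raw_file_dir = raw_file_dir)
}
```

Every user-facing function would then call this once at the top instead of duplicating the NULL-substitution logic.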
Add a check so this becomes more informative:
> set_icews_opts("foo", TRUE, TRUE)
> read_icews()
Error in `$<-.data.frame`(`*tmp*`, "year", value = NA_integer_) :
replacement has 1 row, data has 0
In addition: Warning messages:
1: Unknown or uninitialised column: 'event_date'.
2: In read_icews_raw(find_raw(), n_max) :
Can the indices for a table be specified at DB/table creation? This would make file-by-file ingestion easier (i.e. download, ingest, index, one file at a time).
Probably slower--what is the impact on speed of "ingest all at once then index" versus "ingest and index file by file"?
There are daily data drops (beta) at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QI2T9A.
Although this is still in testing, prepare for what syncing with those would have to look like.
2007 has an abnormally low number of events; maybe some other years/files do as well. The cause might be related to parsing failures. Compare the number of lines here:
tsv2007 <- read_tsv(file.path(find_raw(), "events.2007.20150313083959.tab"))
str2007 <- read_lines(file.path(find_raw(), "events.2007.20150313083959.tab"))
tsv2008 <- read_tsv(file.path(find_raw(), "events.2008.20150313084156.tab"))
str2008 <- read_lines(file.path(find_raw(), "events.2008.20150313084156.tab"))
The 2007 TSV fails to parse lines after the February events for that year. For 2008, the line count correctly matches the number of TSV records (plus 1 for the header row).
> nrow(tsv2007)
[1] 135693
> length(str2007)
[1] 1011162
> nrow(tsv2008)
[1] 980879
> length(str2008)
[1] 980880
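That comparison could be wrapped into a small checker; `check_file` is a hypothetical helper, and the parse details come from readr's `problems()`.

```r
library(readr)

# Sketch: flag a raw .tab file whose parsed row count falls short of its
# physical line count (minus one header line), as with the 2007 file above.
check_file <- function(path) {
  tsv   <- suppressWarnings(suppressMessages(read_tsv(path)))
  lines <- length(read_lines(path))
  list(rows     = nrow(tsv),
       lines    = lines,
       ok       = nrow(tsv) == lines - 1L,
       problems = readr::problems(tsv))
}
```

Running it over every file in find_raw() would show which years besides 2007 are affected, and `problems` points at the offending lines.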
When doing:
old_opts = unset_icews_opts()
old_opts
prints:
Options not set
data_dir: NULL
use_db: NULL
keep_files: NULL
even though old_opts has the correct values:
> str(old_opts)
List of 3
$ data_dir : chr "~/foo/icews_data"
$ use_db : logi TRUE
$ keep_files: logi TRUE
- attr(*, "class")= chr "icews_opts"
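The fix is presumably to make the print method report the object it is given rather than the current global option state; a sketch of such a method (layout assumed):

```r
# Sketch: print the values stored in the icews_opts object itself.
print.icews_opts <- function(x, ...) {
  fmt <- function(v) if (is.null(v)) "NULL" else format(v)
  cat("ICEWS options:\n")
  cat("  data_dir:   ", fmt(x$data_dir), "\n", sep = "")
  cat("  use_db:     ", fmt(x$use_db), "\n", sep = "")
  cat("  keep_files: ", fmt(x$keep_files), "\n", sep = "")
  invisible(x)
}
```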
Right now, getting gwcode-year counts and similar aggregates from the DB is kind of hard, since the data merging has to happen in R.
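A first step could be to push the aggregation into the database via dbplyr; in the sketch below, `con` stands for a DBI connection to the local events database, and the column names are assumptions rather than the package's actual schema.

```r
library(DBI)
library(dplyr)

# Sketch: aggregate country-year event counts inside the database and only
# collect() the (much smaller) result into R.
country_year_counts <- function(con) {
  tbl(con, "events") %>%
    mutate(year = as.integer(substr(event_date, 1, 4))) %>%
    count(country, year, name = "events") %>%
    collect()
}
```

The gwcode merge itself would still happen in R afterwards, but on a country-year table instead of millions of raw events.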
old_opts <- unset_icews_opts()
download_data(to_dir = "~/Downloads/icews_data", update = TRUE, dryrun = TRUE)
Error in find_path("raw") : Path argument is missing.
Consider setting the paths up globally with `setup_icews()`.
Ideally in your .Rprofile file; try running `dr_icews()` for help.
Make the DB sync without purging and rebuilding each time
Write up a short blog post introducing the package; it could also double as the package introduction.