A quick look at the public ICEWS data

The ICEWS data, including the underlying raw event data as well as some aggregations, were quietly posted on Dataverse the Friday before last. I’ve worked with the ICEWS data for several years now, first on the ICEWS package, for which we deliver updated forecasts for the ICEWS events of interest (EOIs) once a month, and more recently for the irregular leadership change forecasting project. The public data are formatted differently from the data I’ve worked with, so most of the code I have lying around is not that useful. In going through the public data, though, I did recreate a short overview, nowhere near as complete as David Masad’s first look (using Python), along with some code that might be useful for getting started in R.

One of the nice things about the public release of these data, aside from the hope that they will start to get used in modeling (repost), is that it is very interesting to read new takes by people whose perspectives are different from mine.

Now to the quick overview, using R rather than Python (link to code at end). The first figure below shows the daily event totals, as well as a 30-day moving average. The daily totals increase from around 500 in 1996 to a steady level of around 3,000 from 2005 on, before decreasing again around 2009/2010. As others have pointed out, this stability is a good feature to have since it makes it plausible to model without some kind of normalization to account for changes in the underlying event volume. This is in contrast to GDELT, where the story corpus and event counts increase dramatically over time.

Figure: Daily totals in the ICEWS event data.
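
As a rough illustration of the two series described above (daily totals and a 30-day moving average), here is a minimal sketch in base R. It runs on simulated events, not the ICEWS data, and the column name event_date is a made-up placeholder rather than the name used in the public files.

```r
# Simulated events only: these are NOT the ICEWS data, and `event_date`
# is a hypothetical column name chosen for this sketch.
set.seed(1)
dates  <- seq(as.Date("2005-01-01"), by = "day", length.out = 365)
events <- data.frame(event_date = sample(dates, size = 20000, replace = TRUE))

# Daily event totals. Converting to a factor with every date as a level
# keeps days without any events in the table, with a count of zero.
date_f <- factor(as.character(events$event_date), levels = as.character(dates))
daily  <- as.data.frame(table(date_f), stringsAsFactors = FALSE)
names(daily) <- c("date", "total")
daily$date <- as.Date(daily$date)

# Trailing 30-day moving average: each value is the mean of that day and
# the 29 days before it, so the first 29 entries are NA.
daily$ma30 <- as.numeric(stats::filter(daily$total, rep(1 / 30, 30), sides = 1))

head(daily[!is.na(daily$ma30), ])
```

Plotting daily$total against daily$date with plot(), and adding daily$ma30 with lines(), would give a figure in the spirit of the one above.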


Precision-recall curves

ROC curves are a fairly standard way to evaluate model fit with binary outcomes, like (civil) war onset. I would be willing to bet that most if not all quantitative political scientists know what they are and how to construct one. Unlike simpler fit statistics like accuracy or percentage reduction in error (PRE), they do not depend on the particular threshold value used to divide probabilistic predictions into binary predictions, and thus give a better sense of the tradeoff between true and false positives inherent in any probabilistic model. The area under a ROC curve (AUC) can summarize a model’s performance and has the somewhat intuitive alternative interpretation of representing the probability that a randomly picked positive outcome case will have been ranked higher by the model than a randomly picked negative outcome case. What I didn’t realize until more recently, though, is that ROC curves are a misleading indication of model performance with the kind of sparse data that happens to be the norm in conflict research.

Figure: Confusion matrix.
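
The rank interpretation of the AUC mentioned above can be checked directly. The sketch below uses simulated outcomes and predictions (nothing here comes from a real conflict model) and computes the AUC two ways: from the rank sum of the positive cases, and by brute force as the share of positive-negative pairs in which the positive case got the higher prediction. The function names auc_rank and auc_pairs are my own, not from any package.

```r
# Simulated sparse binary outcome with roughly 5% positives, and simulated
# predicted probabilities that are somewhat higher for the positive cases.
set.seed(42)
n <- 2000
y <- rbinom(n, 1, 0.05)
p <- plogis(-3 + 1.5 * y + rnorm(n))

# AUC from the rank sum of the positive cases (Mann-Whitney U statistic
# divided by the number of positive-negative pairs).
auc_rank <- function(y, p) {
  r  <- rank(p)                      # ties receive average ranks
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# The same quantity by brute force: the share of all positive-negative
# pairs in which the positive case has the higher prediction, with ties
# counted as one half.
auc_pairs <- function(y, p) {
  d <- outer(p[y == 1], p[y == 0], "-")
  mean((d > 0) + 0.5 * (d == 0))
}

auc_rank(y, p)
auc_pairs(y, p)
```

The two functions should agree up to floating point error. The pairwise version is only practical for small data, since it builds a matrix with one cell per positive-negative pair.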


Archigos leader turnovers by regime

This post is about the Archigos data, which you can find here.

Political scientists, and maybe historians as well, are familiar with coups, rebellions, and mass protests as distinct phenomena that occasionally lead to the fall of regimes. Another way to view these events is from the perspective of state leaders, and how these events affect transitions between political leaders. Selectorate theory does this by considering the sets of people within a regime that a leader must rely on to remain in power, and how their relative sizes shape behavior. We do this empirically by modeling irregular leadership changes, where we draw our dependent variable from the Archigos dataset. I’ve been vaguely aware of these data for a while, but honestly did not understand how useful they could be. In this post I’ll try to give a quick overview of the data.

Archigos is a dataset of the political leaders of states from 1875 on, collected by Hein Goemans, Kristian Skrede Gleditsch, and Giacomo Chiozza. The most recent version, 2.9, has more than 3,000 leaders through 2004, and an update to 2014 is in the works. Aside from identifying leaders and when they gained and lost office, it codes how they did so (from the Archigos codebook):

Archigos codes the manner in which transfers between rulers occur. Our main interest is whether transfers of power between leaders take place in a regular or irregular fashion. We code transfers as regular or irregular depending on the political institutions and selection mechanisms in place. We identify whether leaders are selected into and leave political office in a manner prescribed by either explicit rules or established conventions. In a democracy, a leader may come to power through direct election or establishing a sufficient coalition of representatives in the legislature. Although leaders may not be elected or selected in particularly competitive processes, many autocracies have similar implicit or explicit rules for transfers of executive power. Leader changes that occur through designation by an outgoing leader, hereditary succession in a monarchy, and appointment by the central committee of a ruling party would all be considered regular transfers of power from one leader to another in an autocratic regime.


Building rgdal from source

My limited knowledge of what happens in Terminal, and thus by extension the shell, is mostly driven by PostgreSQL/PostGIS/rgdal/RPostgreSQL install errors. In the latest variant of this, rgdal throws the following error when attempting to build from source:

checking PROJ.4: epsg found and readable... no
Error: proj/epsg not found
Either install missing proj support files, for example the proj-nad and 
proj-epsg RPMs on systems using RPMs, or if installed but not 
autodetected, set PROJ_LIB to the correct path, and if need be use 
the --with-proj-share=configure argument.

I have to build from source, by the way, because the default rgdal package for Mac does not include a PostgreSQL driver, meaning I have to build it against another version of GDAL that does. This was another fun thing to discover, but at least it is easy to diagnose by checking whether PostgreSQL shows up when you run ogrDrivers() in R. Anyway, as far as I can tell the problem was that I installed proj via Homebrew, a package manager for OS X. As a result, although rgdal could find the proj binary via a symlink, it could not find the epsg and related data files that were in a little dark corner by themselves. The solution was to build the package with an option providing the file location manually:

# Point configure at the proj data files; adjust the path to the proj version you have installed
install.packages("rgdal", type = "source",
  configure.args = "--with-proj-share=/usr/local/Cellar/proj/4.8.0/share/proj")

This is, I guess, exactly what the install error message told me to do.
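
The driver check mentioned earlier, whether PostgreSQL shows up in ogrDrivers(), can be wrapped in a small helper. This is only a sketch: has_ogr_driver is a name I made up, and it assumes the driver table returned by ogrDrivers() has a column called name. The call into rgdal only runs if the package is installed.

```r
# Hypothetical helper (not part of rgdal): TRUE if `driver` appears in the
# `name` column of a driver table. The column name is an assumption about
# what ogrDrivers() returns.
has_ogr_driver <- function(drivers, driver = "PostgreSQL") {
  driver %in% as.character(drivers$name)
}

# Only query rgdal if it is actually installed on this machine.
if (requireNamespace("rgdal", quietly = TRUE)) {
  print(has_ogr_driver(rgdal::ogrDrivers()))
}
```

If this prints FALSE, the installed rgdal was presumably built against a GDAL without the PostgreSQL driver, which is the situation described above.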


The right kind of variance

or, How I learned to stop worrying and love event data. (This post first appeared on Predictive Heuristics)

Nobody in their right mind would think that the chances of civil war in Denmark and Mauritania are the same. One is a well-established democracy with a GDP of $38,000 per person and which ranks in the top 10 by Human Development Index (HDI), while the other is a fledgling republic in which the current President gained power through a military coup, with a GDP of $2,000 per person and near the bottom of the HDI rankings. A lot of existing models of civil war do a good job at separating such countries on the basis of structural factors like those in this example: regime type, wealth, ethnic diversity, military spending. Ditto for similar structural models of other expressions of political conflict, like coups and insurgencies. What they fail to do well is to predict the timing of civil wars, insurgencies, etc. in places like Mauritania that we know are at risk because of their structural characteristics. And this gets worse as you leave the conventional country-year paradigm and try to predict over shorter time periods.

The reason for this is obvious when you consider the underlying variance structure. First, to predict something that changes, say dissident-government conflict, the nature of relationships between political parties, or political conflict, you need predictors that change.

Figure: Predictions for regime change in Thailand from a model based on reports of government-dissident interactions (top). White noise, with intrinsically high variance, is not helpful (middle), but neither is GDP per capita (bottom).


Baby steps with R Shiny

Shiny is a web application framework that lets you create interactive websites presenting R data visualizations and analyses. The gallery has some nice examples, and it looks like a great way to make R more accessible without having to know things like JavaScript or d3. I’ve been trying my hand at it, and it seems like a great way to visualize the models underlying the forecasts I work on in Ward Lab as well as the event data on which they are in part based.

It’s always easier to pick up new things like this with a strong motivating example, and for me it was visualizing the distribution of finish times in the SEB Tallinn Marathon in Estonia last weekend. My wife and I both ran and completed our first marathons, and one can look up the finish times and some other information on the event website. However, there was a post in the New York Times a few months ago that had a plot of the distribution of marathon times and which had spikes around the half hour marks as runners pushed themselves to meet arbitrary goals. So I was curious what the distribution of finish times was for the Tallinn Marathon. Along the way, it would also be nice to see where you fall in the distribution, and, since it is maybe not fair to lump all runners into one category, to do so by age and gender groups. Instead of producing dozens of separate plots in R, this seems like a candidate for something interactive, and hence Shiny. You can find the interactive results here, and they look like this:

Figure: Shiny interactive visualization of SEB Tallinn Marathon finish times. The highlighted time is for me. Yes, I am slow.


Modeling and predicting IEDs

This first appeared on Predictive Heuristics, my employer’s blog.

Improvised explosive devices, or IEDs, were extensively used during the US wars in Iraq and Afghanistan, causing half of all US and coalition casualties despite increasingly sophisticated countermeasures. Although both of these wars have come to a close, it is unlikely that the threat of IEDs will disappear. If anything, their success implies that US and European forces are more likely to face them in similar future conflicts. As a result there is value in understanding the process by which they are employed, and being able to predict where and when they will be used. This is a goal we have been working on for some time now as part of a project funded by the Office of Naval Research, using SIGACT event data on IEDs and other forms of violence in Afghanistan.

Figure: Explosive hazards, which include IEDs, for our SIGACT data.
