Andreas Beger

Data scientist

POLECAT Event Data


Tags: #Data

This page is about the POLECAT political event data.[1]

Resources

POLECAT Dataverse: this is where the data are.

PLOVER (GitHub): PLOVER is the ontology/codebook that describes what kinds of events are in POLECAT, e.g. the possible event types and their meanings, contexts, event modes, etc.

PLOVER and POLECAT: A New Political Event Ontology and Dataset (OSF): this is a data paper that discusses the PLOVER ontology and POLECAT data.

Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks (arXiv Paper): this paper describes the machine coder (NGEC) that produces the POLECAT data.

NGEC (GitHub): NGEC is the coder that produces the POLECAT event data.

FAQ

Some of the TXT files are misformatted

Several of the text (tab-separated values) files are misformatted. The R function below reads all of them without errors or vroom parsing warnings.

library("readr")
library("dplyr")

# Read one POLECAT TSV file, handling the known formatting problems.
#   file:         file name, relative to polecat_data_path (a directory path
#                 that must be defined before calling this function)
#   expected_col: character vector with the correct column names
#   ...:          further arguments passed on to read_tsv()
pc_read_tsv <- function(file, expected_col, ...) {
  path <- file.path(polecat_data_path, file)
  # read only the header to check for malformed files
  df <- read_tsv(path, show_col_types = FALSE, n_max = 0)
  # file with excess column names: skip the header, supply the names
  if ("Headline" %in% colnames(df)) {
    df <- read_tsv(path, show_col_types = FALSE,
                   skip = 1, col_names = expected_col, ...)
    return(df)
  }
  # file with excess tab characters: drop the trailing empty columns
  if (ncol(df) > length(expected_col)) {
    cat("Handling excess empty columns\n")
    df <- read_tsv(path, show_col_types = FALSE, ...)
    for (j in seq(ncol(df), length(expected_col) + 1)) {
      if (!all(is.na(df[[j]]))) {
        stop("Column ", j, " is not empty")
      }
      df[[j]] <- NULL
    }
    return(df)
  }

  # default, normal file
  df <- read_tsv(path, show_col_types = FALSE, ...)
  return(df)
}

The following files have excess column names in the first row. Skipping the first row and manually providing column names will fix this issue.

ngecEvents.20230728161146.Release610.DV.txt.gz
ngecEvents.20230728161306.Release611.DV.txt.gz
ngecEvents.20230728161432.Release612.DV.txt.gz
ngecEvents.20230728161606.Release613.DV.txt.gz
ngecEvents.20230728161754.Release614.DV.txt.gz
ngecEvents.20230728161954.Release615.DV.txt.gz
ngecEvents.20230728162154.Release616.DV.txt.gz
ngecEvents.20230728162404.Release617.DV.txt.gz
ngecEvents.20230728162615.Release618.DV.txt.gz
ngecEvents.20230728162845.Release619.DV.txt.gz
ngecEvents.20230728163218.Release620.DV.txt.gz
ngecEvents.20230728163550.Release621.DV.txt.gz
ngecEvents.20230728163916.Release622.DV.txt.gz
ngecEvents.20230728164255.Release623.DV.txt.gz
ngecEvents.20230728164702.Release624.DV.txt.gz
ngecEvents.20230728165115.Release625.DV.txt.gz
ngecEvents.20230728165700.Release626.DV.txt.gz
ngecEvents.20230728170202.Release627.DV.txt.gz
ngecEvents.20230728170742.Release628.DV.txt.gz
ngecEvents.20230728171335.Release629.DV.txt.gz
ngecEvents.20230728171900.Release630.DV.txt.gz
ngecEvents.20230728172518.Release631.DV.txt.gz
ngecEvents.20230728173251.Release632.DV.txt.gz
ngecEventsDV-2021-Apr.txt.gz
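
As a minimal sketch of the skip-and-rename fix, the snippet below builds a toy file with one column name too many and reads it with the correct names. The three column names and the toy rows are made up for illustration; the real POLECAT files have many more columns.

```r
library("readr")

# Hypothetical three-column layout, for illustration only.
expected_col <- c("event_id", "event_type", "country")

# Toy file whose header row has one name too many ("Headline").
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "event_id\tevent_type\tHeadline\tcountry",
  "1\tPROTEST\tUSA",
  "2\tCONSULT\tDEU"
), tmp)

# Skip the malformed header row and supply the correct names instead.
df <- read_tsv(tmp, skip = 1, col_names = expected_col,
               show_col_types = FALSE)
```

Because the header row is skipped entirely, the names in expected_col have to be in the same order as the columns in the data rows.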

The file below has excess tab characters that in R lead to the creation of four excess, empty columns. Dropping those columns fixes this issue.

ngecEventsDV-2020-Jun.txt.gz
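
The column-dropping fix can be sketched on a toy file in which every row ends with stray tab characters. The two column names and the rows are hypothetical, and the snippet checks that the surplus columns really are empty before removing them.

```r
library("readr")

# Hypothetical two-column layout, for illustration only.
expected_col <- c("event_id", "event_type")

# Toy file in which every row ends with two stray tab characters.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "event_id\tevent_type\t\t",
  "1\tPROTEST\t\t",
  "2\tCONSULT\t\t"
), tmp)

# readr reports how it names columns that have no header;
# that message is suppressed here.
df <- suppressMessages(read_tsv(tmp, show_col_types = FALSE))

# Drop the surplus columns, but only after confirming they hold no data.
extra <- seq_len(ncol(df)) > length(expected_col)
stopifnot(all(is.na(df[extra])))
df <- df[!extra]
```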

Reading the data with read.table in R throws an error

R’s read.table does not read the TSV data correctly by default. The problem is that the default quote argument treats single quotes (') as quoting characters, in addition to double quotes. Some of the field values in the data include single quotes, e.g. as part of a person’s name when used in the possessive form, like "Merkel's". To fix this, supply your own quote argument, as below:

read.table(file, sep = "\t", quote = "\"", header = TRUE)
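
To see the fix at work without the real data, here is a small self-contained sketch using only base R. The toy file and its two column names are made up for illustration.

```r
# Toy TSV file with an apostrophe inside a field value.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "event_id\theadline",
  "1\tMerkel's visit",
  "2\tTrade talks"
), tmp)

# Treat only double quotes as quoting characters, so the apostrophe
# is read as ordinary text.
df <- read.table(tmp, sep = "\t", quote = "\"", header = TRUE)
```

With the default quote setting, the apostrophe in the first data row would instead be read as the start of a quoted string that never closes.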

Who created POLECAT?

See the POLECAT Dataverse repo author list.

What is the license for the data?

See the notes section in the POLECAT Weekly Data page on Dataverse.

Glossary

POLECAT is the data itself. It replaces the ICEWS event data.

PLOVER is the ontology/codebook that describes the structure of event records, what the fields are, and what possible values they can take on. E.g. lists of all event types, modes, and contexts. Compare to CAMEO, which was the basis for the ICEWS event data.

NGEC, or Next Generation Event Coder, is the coder that produces the POLECAT data.


1. The POLECAT data are produced by the Program on Geostrategic Risk (formerly the Political Instability Task Force). The Program on Geostrategic Risk is funded by the Central Intelligence Agency. The views expressed are the authors’ alone and do not represent the views of the U.S. Government.