As with all statistical analysis, one of the biggest parts of text analysis is data preparation. We’re going to be using a subset of the data from Greene and Cross (2017), which are European Parliament speeches. These data are about as well cleaned and prepared as we could hope for, but we still have some work ahead of us. The data are available here.
# download speeches and metadata
download.file('http://erdos.ucd.ie/files/europarl/europarl-data-speeches.zip',
              'europarl-data-speeches.zip')
download.file('http://erdos.ucd.ie/files/europarl/europarl-metadata.zip',
              'europarl-metadata.zip')
# extract speeches and metadata
unzip('europarl-data-speeches.zip')
unzip('europarl-metadata.zip')
Often we get text data pre-processed so that each document is contained in an individual text file, with subdirectories representing authors, speakers, or some other meaningful grouping. In this case, we have speeches organized in directories by year, month, and day. Take a look at the readtext() function in the package of the same name, and combine it with list.files() to easily load all the speeches from 2009 and 2010 into a dataframe that quanteda can read.
library(readtext) # easily read in text in directories
# recursively get file paths for speeches from 2009-2010
speeches_paths <- list.files(path = c('europarl-data-speeches/2009',
                                      'europarl-data-speeches/2010'),
                             recursive = T, full.names = T)
# read in speeches
speeches <- readtext(speeches_paths)
The corpus()
function in quanteda
expects a dataframe where each row contains a document ID and the text of the document itself. Luckily, this is exactly what readtext()
has produced for us.
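Before converting to a corpus, it doesn't hurt to confirm that the object really does contain a document identifier column and a text column. This is a quick optional check, not part of the original workflow:
# confirm the columns readtext() created
names(speeches)
# peek at the first few document identifiers
head(speeches$doc_id)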
library(quanteda)
speeches <- corpus(speeches)
Quanteda supports two kinds of external information that we can attach to each document in a corpus: metadata and document variables. Metadata are not intended to be used in any analyses; they are often just notes to the researcher, such as the source URL for a document or the language it was originally written in.
metadoc(speeches, field = 'type') <- 'European Parliament Speech'
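To confirm the metadata took, we can read it back with the same function. Note that metadoc() belongs to the older quanteda API used throughout this tutorial; newer releases handle document-level metadata differently, so this assumes the same package version as above.
# read back the document metadata we just set (older quanteda API)
head(metadoc(speeches))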
Document variables, on the other hand, are the structure that we want to be able to use in our structural topic model. Read in the speech and speaker metadata files, merge the speaker data onto the speech data, subset the document variables to just 2009 and 2010, and then assign the combined data to our corpus as document variables.
library(lubridate) # year function
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# read in speech docvars
speeches_dv <- read.delim('europarl-documents-metadata.tsv', sep = '\t')
# subset metadata to 2009-2010
speeches_dv <- speeches_dv[year(speeches_dv$date) >= 2009 &
                             year(speeches_dv$date) <= 2010, ]
# read in MEP docvars
MEP_dv <- read.delim('europarl-meps-metadata.tsv', sep = '\t')
# merge MEP docvars onto speech metadata
dv <- merge(speeches_dv, MEP_dv, all.x = T,
            by.x = 'mep_ids', by.y = 'mep_id')
# merge docvars onto corpus
docvars(speeches) <- dv
# inspect first entries
head(docvars(speeches))
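Since the docvars assignment above matches rows to documents by position, one extra check worth running (not part of the original walkthrough) is that the merged data frame has exactly one row per document:
# docvars are assigned by position, so row count must equal document count
nrow(dv) == ndoc(speeches)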
In quanteda, a corpus is intended to be a static object from which other transformations of the text are extracted. In other words, you won't be altering the speeches corpus object any more; instead, you'll be creating new objects from it. This approach allows us to conduct analyses with different requirements, such as ones that require stemmed input and others that need punctuation retained, without having to recreate the corpus from scratch each time.
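For instance, we can derive several objects from the one corpus without altering it. The sketch below (object names are just illustrative) builds a tokens object with punctuation removed and a stemmed copy alongside it, leaving speeches untouched:
# tokenize the corpus, dropping punctuation, without modifying the corpus
speech_toks <- tokens(speeches, remove_punct = TRUE)
# a separately stemmed version for analyses that want stemmed input
speech_toks_stemmed <- tokens_wordstem(speech_toks)
# the original corpus is unchanged and remains available for other uses
ndoc(speeches)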
We can also subset corpora based on document-level variables using the corpus_subset() function. Create a new corpus object that is a subset of the full speeches corpus, containing only speeches made by members of the European People's Party.
EPP_corp <- corpus_subset(speeches, group_shortname == 'EPP')
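corpus_subset() also accepts combined conditions. As an illustration (treat this as a sketch, since the exact coding of the document variables may differ), we could keep only EPP speeches from 2009 using the date document variable read in earlier:
# EPP speeches delivered in 2009, combining two document variables
EPP_2009_corp <- corpus_subset(speeches,
                               group_shortname == 'EPP' & year(date) == 2009)
ndoc(EPP_2009_corp)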
We can use the texts()
function to access the actual text of each observation.
texts(EPP_corp)[5]
## 2009/01/12/TEXT_CRE_20090112_1-019.txt
## "European Parliament\nHannes Swoboda,\non behalf of the PSE Group.\n(DE)\nMr President, we have, of course, given this matter a great deal of thought. Perhaps Mr Cohn-Bendit overestimates the significance of a resolution, but with the Security Council's resolution we have a basis which we should support and, as the President of Parliament has already said, we should require both sides to seek peace, to lay down their arms and to comply with the Security Council's resolution. I would, however, just like to add that this must be the gist of our resolution. If this is so, we can support it. In this context we would cooperate and in this context we would support Mr Cohn-Bendit's motion.\n"
We can use the keyword-in-context function kwic() to quickly see the words surrounding a given term wherever it appears in our corpus. Try finding the context for every instance of a specific word in our corpus. You might want to pick a slightly uncommon word to keep the output manageable.
kwic(speeches, 'hockey', window = 7)
Right away, we can see two very different contexts that hockey appears in: actual ice hockey and the IPCC hockey stick graph. Any model of text as data that’s worthwhile needs to be able to distinguish between these two different uses.
We can also save the summary of a corpus as a dataframe and use the resulting object to plot some simple summary statistics. By default, the summary.corpus() function only returns the first 100 rows. Use the ndoc() function (which is analogous to nrow() or ncol(), but for the number of documents in a corpus) to get around this. Plot the density of each country's tokens on the same plot to explore whether MEPs from some countries are more verbose than others (you'll be able to see better if you limit the x axis to 1500; there are a handful of speeches with far more words that skew the graph).
# extract summary statistics
speeches_df <- summary(speeches, n = ndoc(speeches))
library(ggplot2) # ggplots
# plot density of tokens by country
ggplot(data = speeches_df, aes(x = Tokens, fill = country)) +
  geom_density(alpha = .25, linetype = 0) +
  theme_bw() +
  coord_cartesian(xlim = c(0, 1500)) +
  theme(legend.position = 'right',
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank())
We can also see whether earlier speeches differ substantively from later ones in terms of the number of tokens per speech. Break this plot down by party group rather than country, since with so many countries it would be unreadable.
ggplot(data = speeches_df, aes(x = ymd(date), y = Tokens, color = group_shortname)) +
  geom_smooth(alpha = .6, linetype = 1, se = F, method = 'loess') +
  theme_bw() +
  theme(legend.position = 'right',
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank())