Text as Data

Thomas Delcey, Aurelien Goutsmedt

2024-01-26

Introduction

Why quantify?

Figure 1: The evolution of data storage and computational power. Source: Bit by Bit: Social Science Research in the Digital Age, Salganik (2017)

The rise of publications

Figure 2: The evolution of documents in the Econlit database

New methods to analyze texts

  • A lot of progress in Natural Language Processing:
    • Translation models (e.g., DeepL)
    • Conversational models (e.g., ChatGPT)
  • Allows us to go beyond “counting words in a text”

  • Enables us to manipulate the “concepts and ideas” contained in a text

  • But also: easier access, especially through R and Python

Quantitative and qualitative methods

“everything can be counted”

“not everything that can be counted counts”

  • In the history of economic thought, the rise of quantitative methods reflects a “methodological turn”.

  • The rise of quantitative methods goes together with qualitative ones (archives, oral history, etc.)

A simple framework for analysing texts

Given a research question \(Y\), textual analysis involves four steps (Gentzkow, Kelly, and Taddy 2019):

  1. Corpus creation: \(C\) is a set of documents.
  2. Representation: transform \(C\) into a table \(W\) where each piece of textual data is associated with a number.
  3. Measurement: apply a measure \(f(W)\) to produce an estimate \(\hat{Y}\) of the research question.
  4. Interpretation: discuss \(\hat{Y}\) and compare it with \(Y\).

In this course, we will examine each step in detail.

Creating a corpus

Defining a corpus

  • A corpus \(C\) is a set of documents \(d\).
  • A document \(d\) is a generic unit: it can be a book, an article, a speech, a paragraph, an abstract, a tweet, etc.
  • A document’s text may be accompanied by metadata, i.e., other variables that help characterize the text (authors’ names, date of publication, etc.).

  • Generally, \(C\) is a table where the rows are documents and the columns are variables.

Illustration of a typical corpus

doc_id  text                                   Journal                         Year  variable_4
1       This is the first document             American Economic Review        1950  ...
2       This document is the second document   American Economic Review        1955  ...
3       This document is the third document    Quarterly Journal of Economics  1950  ...
...     ...                                    ...                             ...   ...

Where to find data?

  • Various sources:
    • Scientific publishers: Web of Science, Econlit, JSTOR/Constellate, etc.
    • Online library projects: internet archives, gutenberg.org, gallica/gallicagram, etc.
    • Institutions: American Economic Association, ANR, central banks, etc.
    • Social Networks: Twitter, Facebook, Reddit, etc.
    • Media: Europresse, Media Cloud, etc.
  • Corpus creation is the hardest part; the quantitative analysis is the “easy” part.

Optical Character Recognition

  • Conversion of images of handwritten or printed text into machine-encoded text
    • What makes a PDF “searchable”
  • OCR may be an essential step in corpus building
    • Highly dependent on the quality of initial pictures, but good progress with machine learning
  • How to perform OCR? See the sketch below.
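A minimal sketch in R, assuming the {tesseract} and {pdftools} packages are installed; the file names are purely illustrative:

```r
library(tesseract)
library(pdftools)

# Load the English language model and OCR a scanned image
eng  <- tesseract("eng")
text <- ocr("scanned_page.png", engine = eng)            # hypothetical file name
cat(text)

# For a scanned PDF, render each page as an image first, then OCR each page
pages <- pdf_convert("scanned_article.pdf", dpi = 300)   # hypothetical file name
texts <- sapply(pages, ocr, engine = eng)
```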

Web scraping

  • Web scraping is a method for extracting data available on the World Wide Web.
  • The World Wide Web, or “Web”, is a network of websites (online documents written in HTML).
  • A web scraper is a program, for instance written in R, that automatically reads the HTML structure of a website and extracts the relevant data.

Tim Berners-Lee
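A minimal sketch of a scraper in R, assuming the {rvest} package; the URL and CSS selectors are hypothetical and would need to be adapted to the targeted website:

```r
library(rvest)

page <- read_html("https://example.com/speeches")                  # hypothetical page listing speeches

titles <- page |> html_elements("h2.title") |> html_text2()        # extract the text content
links  <- page |> html_elements("a.speech") |> html_attr("href")   # extract the link attributes

data.frame(title = titles, url = links)
```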

Regular Expression (RegEx)

To make our corpus operational for quantitative analysis, we often need to “clean” it:

doc_id  text                               Author         year
text1   Ourfinancial markets are strong    Thomas Delcey  2007
text2   I told you a crisis was brewing!   T. Delcey      13 sept 2008
  • To manipulate large amounts of textual data, we use Regular Expressions (RegEx), i.e. a set of rules you can use in a program such as R to find patterns in a text (see the sketch below).
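A minimal sketch with the {stringr} package, cleaning the toy table above:

```r
library(stringr)

text <- c("Ourfinancial markets are strong", "I told you a crisis was brewing!")
date <- c("2007", "13 sept 2008")

str_detect(text, "crisis")                          # which documents mention "crisis"?
str_replace(text, "Ourfinancial", "Our financial")  # fix the missing space
str_extract(date, "\\d{4}")                         # keep only the four-digit year: "2007" "2008"
```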

Exercise: let’s practice!

Open “exercices.rmd” and do exercises 1 to 3.

Representation

What is representation?

  • Natural language, such as French or English, cannot be understood by a program or a computer.

  • Any quantitative measure takes as input numbers, not letters.

  • Before measuring, we need to somehow transform a text (richer in information but only understandable by humans) into numbers (poorer in information but understandable by a machine).
  • This crucial step is called representation.

An example

  • We have a corpus with two documents:
doc_id  text                                                       Author
text1   These documents discusses economics                        Thomas Piketty
text2   These documents discusses sociology, and again sociology   Pierre Bourdieu
  • Let’s apply a simple measure to illustrate that even a basic one implies a representation.
  • What are the important topics discussed in each document?

The measure tf-idf

  • In the tf-idf, the tf stands for Term Frequency.
  • It estimates the relative frequency of a word in a document:

\[tf(w, d) = \frac{f_{w,d}}{\sum\limits_{w \in d } f_{w, d}}\]

where \(f_{w,d}\) is the frequency of word \(w\) in document \(d\), and \(\sum\limits_{w \in d } f_{w, d}\) is the total number of words in document \(d\).

The measure tf-idf

  • In the tf-idf, the idf stands for Inverse Document Frequency.

  • It is the logarithm of the ratio between the number of documents in the corpus, noted \(N\), and the number of documents that contain the word \(w\):

\[idf(w) = \log\left(\frac{N}{|\{d \in C : w \in d\}|}\right)\]

The measure tf-idf

The tf-idf combines these two measures:

\[tf\text{-}idf(w,d) = \frac{f_{w,d}}{\sum\limits_{w \in d } f_{w, d}} \times \log\left(\frac{N}{|\{d \in C : w \in d\}|}\right)\]

  • The tf-idf gives more weight to words that appear several times in a document (tf) and to words that are specific to a document compared to other documents (idf).

  • It is therefore a measure of both the importance and the specificity of a word within a document relative to the corpus.

The measure tf-idf

Text 1:

word       tf_idf
discusses  0.0000000
documents  0.0000000
economics  0.1732868
these      0.0000000

Text 2:

word       tf_idf
again      0.0990210
and        0.0990210
discusses  0.0000000
documents  0.0000000
sociology  0.1980421
these      0.0000000
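A minimal sketch in R, assuming the {dplyr} and {tidytext} packages, that computes the word counts and tf-idf scores above from the two toy documents:

```r
library(dplyr)
library(tidytext)

corpus <- tibble(
  doc_id = c("text1", "text2"),
  text   = c("These documents discusses economics",
             "These documents discusses sociology, and again sociology")
)

corpus |>
  unnest_tokens(word, text) |>   # one row per (document, word) pair
  count(doc_id, word) |>         # word counts per document
  bind_tf_idf(word, doc_id, n)   # adds the tf, idf and tf_idf columns
```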

And the representation?

We did not apply the tf-idf directly to the corpus \(C\), but to another table, \(W\), which associates each textual unit with a numeric value:

doc_id  word       n
text1   discusses  1
text1   documents  1
text1   economics  1
text1   these      1

doc_id  word       n
text2   again      1
text2   and        1
text2   discusses  1
text2   documents  1
text2   sociology  2
text2   these      1

And the representation?

Let’s slightly transform the presentation of \(W\) so that the rows are now the documents and the columns are all the unique words in our corpus.

doc_id  these  documents  discusses  economics  sociology  ,  and  again
text1   1      1          1          1          0          0  0    0
text2   1      1          1          0          2          1  1    1
  • The first document is represented by the unique vector \((1,1,1,1,0,0,0,0)\).
  • The second document is represented by the unique vector \((1,1,1,0,2,1,1,1)\).
  • The length of the vectors equals the size of the corpus vocabulary, namely its number of unique words (and punctuation marks here); see the sketch below.
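A sketch of how such a document-term table can be built in R, assuming {dplyr}, {tidyr} and {tidytext} (note that unnest_tokens() drops punctuation by default, so the “,” column above would not appear):

```r
library(dplyr)
library(tidyr)
library(tidytext)

corpus <- tibble(
  doc_id = c("text1", "text2"),
  text   = c("These documents discusses economics",
             "These documents discusses sociology, and again sociology")
)

word_counts <- corpus |>
  unnest_tokens(word, text) |>
  count(doc_id, word)

# One row per document, one column per unique word, 0 when the word is absent
word_counts |> pivot_wider(names_from = word, values_from = n, values_fill = 0)

# Or as a sparse document-term matrix, the usual input of text-mining functions
word_counts |> cast_dtm(doc_id, word, n)
```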

The representation bag-of-words

  • This representation is called the bag-of-words (BoW) representation:
    • A text is represented as a “bag” filled with words
    • It does not take into account the order of words
  • The “curse of dimensionality”:
    • The dimension of \(W\) is equal to the size of the corpus vocabulary (often \(> 1000\) with real data)
    • Sparse matrix: most cells do not capture any information (value \(= 0\))
    • We can reduce the size of the vocabulary. This is called selection.

Selection

  • Tokenization: the process of dividing a string of characters into substrings (e.g., words).
  • Removing stopwords: stopwords are very common words shared by most documents that do not give any information about the document (“to be”, “the”, etc.).
  • Stemming: the process of merging similar words by reducing them to their root (“historian” and “historical” become “histor”).
  • Lemmatization: the process of merging similar words by identifying their lemma (“better” becomes “good”, “meeting” may become either “meet” or “meeting”).
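A minimal sketch of these selection steps in R, assuming the {dplyr}, {tidytext} and {SnowballC} packages (lemmatization would require an additional package such as {udpipe}):

```r
library(dplyr)
library(tidytext)
library(SnowballC)

corpus <- tibble(
  doc_id = "text1",
  text   = "The historian studies the historical evolution of economic thought"
)

corpus |>
  unnest_tokens(word, text) |>           # tokenization (also lowercases the text)
  anti_join(stop_words, by = "word") |>  # remove English stopwords ("the", "of", ...)
  mutate(stem = wordStem(word))          # stemming: e.g. "historical" is reduced to "histor"
```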

Exercise: let’s practice!

Open “exercices.rmd” and do exercise 4.

The limits of the representation bag-of-words

  • Beyond the curse of dimensionality, BoW has important shortcomings.
  • For instance, the two following sentences have the exact same BoW representation but opposite meanings:

“Hayek is wrong and Keynes is right !”

“Hayek is right and Keynes is wrong !”

The limits of the representation bag-of-words

The BoW representation does not capture information about:

  • The order of words:

“I said I never loved you.” “Never, I said. I loved you.”

  • The tone (irony, anger):

“Oh Great!”

  • The other words surrounding a word:

“The capital city of France” “The capital investment in France”

The importance of the context

  • Natural languages are inherently ambiguous. The meaning of a word crucially depends on the context in which it is used:

You shall know a word by the company it keeps (Firth 1957)

  • Yet, the bag-of-words approach removes all information about the context.

  • A simple solution is to look at word co-occurrences.

The co-occurrence

  • Imagine we want to predict the missing word in this sentence. Which words are the most likely?

My research contributes to the field of [ ? ] economics

  • Because the missing word is between “field” and “economics”, it is more likely that the missing word is ‘Monetary’ or ‘Health’ than ‘Astronomical’.

  • In other words, the semantic context can be estimated by the co-occurrence between a word and its neighbouring words.

The co-occurrence: an example

Imagine a new corpus:

doc_id  text
text1   Economic rationality drives efficiency
text2   Emotions influence economic rationality

The co-occurrence frequency

  • We can estimate the frequency of co-occurrences between each pair of words.
  • With BoW, tokens are single words.
  • Here, tokens are pairs of adjacent words (bigrams).
  • A generalization is the n-gram model, where n is the number of words.

bigram                n
economic rationality  2
rationality drives    1
drives efficiency     1
emotions influence    1
influence economic    1

The co-occurrence matrix

word         efficiency  drives  economic  rationality  emotions  influence
efficiency   0           1       1         1            0         0
drives       1           0       1         1            0         0
economic     1           1       0         2            1         1
rationality  1           1       2         0            1         1
emotions     0           0       1         1            0         1
influence    0           0       1         1            1         0
  • Each word is represented by a new vector.
  • The values are the number of times the two words co-occurred.
  • We can increase the context window by increasing the n of the n-grams.
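A minimal sketch in R, assuming {dplyr}, {tidytext} and {widyr}, applied to the toy corpus above:

```r
library(dplyr)
library(tidytext)
library(widyr)

corpus <- tibble(
  doc_id = c("text1", "text2"),
  text   = c("Economic rationality drives efficiency",
             "Emotions influence economic rationality")
)

# Bigrams: pairs of adjacent words
corpus |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  count(bigram, sort = TRUE)

# Document-level co-occurrences, as in the matrix above
corpus |>
  unnest_tokens(word, text) |>
  pairwise_count(word, doc_id)   # how often two words appear in the same document
```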

Word embedding representation

  • In 2013, a new method for representing texts emerged in NLP called word embedding, or word vectorization.

    • Word2vec (Mikolov et al. 2013)
    • Glove (Pennington et al. 2014)
    • FastText (Bojanowski et al. 2016)
    • and now transformers: BERT, GPT
  • Represent words as short dense vectors
  • Capturing linguistic context and semantic relationships between words

A look at word2vec

Word2Vec uses word co-occurrences. It is trained on very large corpora of texts and formulates two prediction tasks:

Skipgram: Given a focus word, predict its context words.

Continuous Bag of Words (CBOW): Given its context words, predict a focus word.


Self-supervision: no need for human coders; the model is simply trained on the “raw” text.

Tomas Mikolov, the leading author behind word2vec.

The word2vec structures

Skipgram approach:

  1. Assign a random embedding vector for each of the \(N\) vocabulary words.
  2. Treat the target word and L neighboring context words as positive examples.
  3. Randomly sample other words in the lexicon to get negative samples.
  4. Use logistic regression to train a classifier to distinguish those two cases.
  5. Use the learned weights as the embeddings.

Example from Jurafsky and Martin (2022)
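A minimal sketch of training such a model in R, assuming the {word2vec} package; the toy corpus below is purely illustrative (in practice the model is trained on a much larger corpus):

```r
library(word2vec)

texts <- c("inflation is rising and prices increase",
           "the central bank fights inflation with higher rates")   # toy corpus

# min_count = 1 keeps rare words for this toy example; the default threshold is higher
model <- word2vec(x = texts, type = "skip-gram", dim = 50, window = 5, min_count = 1)

embeddings <- as.matrix(model)                             # one row per word, 50 columns
predict(model, "inflation", type = "nearest", top_n = 5)   # most similar words
```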

Semantic similarities

  • This representation associates each word of the vocabulary with a vector of real numbers (generally of dimension 300).

  • These real numbers are arbitrary, but words that are semantically similar have close vectors.

Intuition in a 3-dimensional space

Intuition with vectors

Word semantic proximity

  • The standard measure to estimate the proximity between word vectors is the cosine similarity:

  • \[\text{Cosine Similarity}(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}\]

  • Score between -1 (opposite meanings) and +1 (similar meanings).

  • \(\text{Cosine Similarity}(\vec{price}, \vec{inflation}) = 0.7210443\) (source: Whatever it takes project)
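A minimal sketch of the cosine similarity in base R, applied to two illustrative low-dimensional vectors:

```r
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Illustrative 3-dimensional vectors; real embeddings typically have 100-300 dimensions
price     <- c(0.2, 0.7, 0.1)
inflation <- c(0.3, 0.6, 0.2)

cosine_similarity(price, inflation)   # close to 1: similar meanings
```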

Manipulating concepts and ideas

  • Word embedding models capture conceptual and analogical relationships between words:

\[\overrightarrow{king} - \overrightarrow{man} + \overrightarrow{woman} = \overrightarrow{queen}\]

\[\overrightarrow{paris} - \overrightarrow{france} + \overrightarrow{england} = \overrightarrow{london}\]

\[\overrightarrow{inflation} - \overrightarrow{hausse} + \overrightarrow{baisse} = \overrightarrow{deflation}\]

Word embeddings advantages and issues

Word embeddings advantages:

  • Encoding similarity
  • Automatic generalization
  • Measuring meaning (and its evolution over time)

Issues in building word embeddings:

  • Length of context window
  • Data source & training

Exercise: let’s practice!

Open “exercices.rmd” and read the section “Representation of documents” to discover different ways of representing a corpus in R.

Analysis

What is analysis?

  • Once we have built \(W\), a representation of the corpus \(C\), we can finally analyze the data to answer a research question \(Y\).
  • Formally, we can think of the analysis step as the application of a measure \(f\) to \(W\) to produce an estimate \(\hat{Y}\) of the research question.

Lexicon analysis

  • The dictionary approach is a supervised method based on a dictionary \(D\) made up of a set of pairs \((w_i,s_i)\), where \(w_i\) is a token associated with a measure \(s_i\).

  • A popular dictionary approach is sentiment analysis, where \(s_i \in [-1,1]\) represents a sentiment score:

    • Words that reveal a negative sentiment have a score close to -1 (“awful”, -0.9)
    • Words that reveal a positive sentiment are close to 1 (“good”, 0.8).
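A minimal sketch of a dictionary-based sentiment analysis in R, assuming {dplyr} and {tidytext} (the AFINN lexicon used here scores words from -5 to 5 and is downloaded via the {textdata} package on first use):

```r
library(dplyr)
library(tidytext)

corpus <- tibble(
  doc_id = c("text1", "text2"),
  text   = c("Our financial markets are strong",
             "An awful crisis is brewing")
)

corpus |>
  unnest_tokens(word, text) |>
  inner_join(get_sentiments("afinn"), by = "word") |>  # keep only words present in the dictionary
  group_by(doc_id) |>
  summarise(sentiment = sum(value))                    # document-level sentiment score
```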

An example

Boyer et al. (2020) apply this method to study the evolution of the “gilets jaunes” (Yellow Vests) movement:

Source: Mobilization without Consolidation: Social Media and the Yellow Vests Protests (Boyer et al. 2020)

A word embedding augmented lexicon analysis

  • The standard lexicon approach is simple but depends on the quality of the dictionary.
  • Lexicon analysis can be augmented with word embeddings:
    • You don’t need a complete dictionary, just a good set of word embeddings
  • Two examples: Ash, Chen, and Naidu (2022) and Goutsmedt et al. (2024)

Ash, Chen, and Naidu (2022): “Ideas Have Consequences”

Proximity of US judges’ decisions to economics & law vocabulary (Ash, Chen, and Naidu 2022)

Goutsmedt et al. (2024): Central Banks Scientization

Lexical fields representing technical discourse (Goutsmedt et al. 2024)

Evolution of technical discourse in Bank of England Speeches (Goutsmedt et al. 2024)

Comparing the usage of a word by different groups

Differences in the meaning of “equality” between Levellers and other groups (Schwartzberg and Spirling 2023)

Comparing the evolution of meaning over time

Differences in adjectives attributed to women over history (Garg et al. 2018)

Document embedded vectors

Similarity of central banks speeches to ECB (Zahner and Baumgartner 2022)

Topic modelling

  • Topic modelling = machine learning method to identify hidden themes in a large corpus

    • Identify k topics in a corpus of documents
    • Each document is represented by a probability distribution over topics
    • Each topic is represented by a probability distribution over vocabulary
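A minimal sketch in R, assuming the {topicmodels} and {tidytext} packages, fitted on the AssociatedPress example document-term matrix shipped with {topicmodels}; k = 10 is an arbitrary choice:

```r
library(topicmodels)
library(tidytext)

data("AssociatedPress", package = "topicmodels")   # an example document-term matrix

lda <- LDA(AssociatedPress, k = 10, control = list(seed = 1234))   # fit 10 topics

tidy(lda, matrix = "beta")    # word distribution of each topic
tidy(lda, matrix = "gamma")   # topic distribution of each document
```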

Topic modelling

Source: Six Decades of Economic Research at the Bank of England (Acosta et al. 2023)

Exercise: let’s practice!

Open “exercices.rmd” and read the section “Analysis” to discover examples of measures and their applications in R.

References

Acosta, Juan, Beatrice Cherrier, François Claveau, Clément Fontan, Aurélien Goutsmedt, and Francesco Sergi. 2023. “Six Decades of Economic Research at the Bank of England.” History of Political Economy.
Ash, Elliott, Daniel L. Chen, and Suresh Naidu. 2022. “Ideas Have Consequences: The Impact of Law and Economics on American Justice.” NBER Working Paper, no. 29788. https://doi.org/10.3929/ETHZ-B-000376884.
Boyer, Pierre C., Thomas Delemotte, Germain Gauthier, Vincent Rollet, and Benoît Schmutz. 2020. “Les déterminants de la mobilisation des Gilets jaunes.” Revue économique 71 (1): 109–38.
Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16): E3635–44. https://doi.org/10.1073/pnas.1720347115.
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74.
Goutsmedt, Aurélien, Francesco Sergi, François Claveau, and Clément Fontan. 2024. “The Different Paths of Central Banks Scientization: The Case of the Bank of England.” Working Paper for a Special Issue in Finance & Society. https://hal.science/hal-04267004.
Jurafsky, Daniel, and James H. Martin. 2022. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” Advances in Neural Information Processing Systems 26.
Schwartzberg, Melissa, and Arthur Spirling. 2023. “Equals, Peers and Free-born Englishmen.”
Zahner, Johannes, and Martin Baumgartner. 2022. “Whatever It Takes to Understand a Central Banker - Embedding Their Words Using Neural Networks.” 48.