MY459 Half Unit
Quantitative Text Analysis

This information is for the 2024/25 session.

Teacher responsible

Dr Ryan Hubert

Availability

This course is available on the MPA in Data Science for Public Policy, MSc in Applied Social Data Science, MSc in Data Science, MSc in Econometrics and Mathematical Economics, MSc in Human Geography and Urban Studies (Research), MSc in Political Science (Political Science and Political Economy), MSc in Social Research Methods, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

The course is also available to research students as MY559. This course is not controlled access. If you register for a place and meet the prerequisites, if any, you are likely to be given a place.

Pre-requisites

Applied Regression Analysis (MY452) or equivalent is required. Students should understand basic linear algebra and know at least one programming language. If this programming language is not R, students should take the Digital Skills Lab course in R before the start of term.

Course content

The course surveys methods for systematically extracting quantitative information from text for social scientific purposes. It starts with classical content analysis and dictionary-based methods, then covers classification methods and state-of-the-art scaling methods. It continues with probabilistic topic models and word embeddings, and concludes with an outlook on current neural network-based models for text. The course lays a theoretical foundation for text analysis but mainly takes a practical, applied approach, so that students learn how to use these methods in actual research. Many of the methods share a common three-step process: first, identifying texts and units of text for analysis; second, extracting quantitatively measured features from the texts (such as coded content categories, word counts, word types, dictionary counts, or parts of speech) and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The methods are presented in a logical progression, and each technique is applied to real texts using appropriate software.

Lectures, class exercises and homework will be based on the use of the R statistical software package but will assume no background knowledge of that language.
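
The three-step process described under Course content can be sketched in a few lines of code. The example below is a minimal illustration only and is not course material: it uses Python's standard library rather than the R software used in the course, and the toy documents, the `economy_terms` dictionary and all function names are hypothetical.

```python
import re
from collections import Counter

# Step 1: identify the texts and the unit of analysis (here, whole documents).
# These two toy documents are invented for illustration.
texts = {
    "doc_a": "Taxes must fall. Lower taxes help growth.",
    "doc_b": "Public spending on health and schools must rise.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_feature_matrix(docs):
    """Step 2: count word features per document and arrange them as a matrix.

    Returns the sorted feature names and one row of counts per document.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    features = sorted(set().union(*counts.values()))
    rows = {name: [counts[name][f] for f in features] for name in docs}
    return features, rows


def dictionary_share(features, row, dictionary):
    """Step 3: share of a document's tokens that match a dictionary category."""
    total = sum(row)
    hits = sum(n for f, n in zip(features, row) if f in dictionary)
    return hits / total if total else 0.0


features, dfm = document_feature_matrix(texts)
economy_terms = {"taxes", "growth", "spending"}  # hypothetical dictionary
shares = {
    name: dictionary_share(features, row, economy_terms)
    for name, row in dfm.items()
}
print(shares)
```

Each row of `dfm` corresponds to one document and each column to one word feature, so the dictionary share in step 3 is computed from the matrix alone. Other analyses, such as classification or scaling, would start from a matrix of this kind.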

Teaching

This course is delivered through a combination of classes and lectures totalling a minimum of 20 hours across Winter Term.

This course has a reading week in Week 6 of WT.

Formative coursework

Students will be expected to submit 1 problem set in the WT.

One structured problem set will be provided in the first weeks of the course. Students will start the problem set in the first computer workshop sessions and complete it outside of class.

Indicative reading

quanteda: An R package for quantitative text analysis. http://kbenoit.github.io/quanteda/

Benoit, Kenneth. 2020. “Text as Data: An Overview.” In Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage. pp. 461-497.

Assessment

Exam (100%, duration: 2 hours) in the spring exam period.

Key facts

Department: Methodology

Total students 2023/24: 59

Average class size 2023/24: 28

Controlled access 2023/24: Yes

Value: Half Unit

Personal development skills

  • Problem solving
  • Application of numeracy skills