Powerful new tools for text analysis

Research underpinned the development of new and improved methods of quantitative text analysis for social science research.

Research by Professor Ken Benoit

What was the problem?

Large bodies of text – in political speeches, social media posts, policy documents, and so on – are significant and useful data sources for social science research. Textual data analysis is a growing field of research, and there is huge potential in applying quantitative methods to this data.

However, there has been a lack of substantial academic work explaining the field of textual data analysis for the social sciences. Existing research has often relied on untested assumptions, has been of unproven applicability, and has tended to be based on short "proof-of-concept" demonstrations. There is scope for methodological innovation in textual data analysis, working with "big data" in advanced, automated ways.

What did we do?

Benoit’s research expertise lies in the development and application of automated, quantitative methods for processing "big data", including textual data, and in the methodology of text-mining.

As principal investigator for the "Quantitative Analysis of Textual Data for Social Sciences" project (QUANTESS, 2011 to 2017), Benoit sought to improve statistical methods for textual data analysis, leading a multidisciplinary team with backgrounds in statistical analysis and computer simulation, and knowledge of applied domains such as legislative politics.

The novelty of this project’s work lay in its statistical approach to extracting information from texts – treating texts as "data" to be analysed rather than as text to be read and interpreted for "meaning" and categorised or synthesised by humans. It also aimed to develop powerful but accessible free software tools to support the application of textual data analysis techniques.
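To illustrate what treating texts as "data" can mean in practice, here is a minimal sketch in Python. It is not quanteda (which is an R package) and not the project's own code. It simply turns two made-up sentences into a table of word counts, often called a document-feature matrix. The function names, the tokenisation rule, and the example texts are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (illustrative rule)."""
    return re.findall(r"[a-z']+", text.lower())


def document_feature_matrix(docs):
    """Return (features, rows): the sorted vocabulary and one row of counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is every distinct token seen in any document.
    features = sorted(set().union(*counts))
    # A Counter returns 0 for tokens a document does not contain.
    rows = [[count[feature] for feature in features] for count in counts]
    return features, rows


# Hypothetical example texts, invented for this sketch.
docs = [
    "We will cut taxes and cut spending.",
    "Spending on health will rise.",
]
features, rows = document_feature_matrix(docs)

for doc, row in zip(docs, rows):
    print(doc, dict(zip(features, row)))
```

Once texts are represented as rows of counts like this, they can be compared, scaled, or classified with standard statistical methods, without a human reading and interpreting each one.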

The research team created a complex and feature-rich software library, enabling users to implement newly developed text analysis methods along with dozens of existing methods, for which there was substantial demand but only limited tools. Together, these made up the new quanteda package and a family of companion packages – a library of software functions built on a simple application programming interface. Designed to complement existing packages, quanteda is an open-source platform, which allows it to be stress-tested many times by knowledgeable online users.

Since 2017, Benoit has extended the QUANTESS work across several research streams. In a 2019 American Journal of Political Science article with Dr Kevin Munger (Pennsylvania State University) and Professor Arthur Spirling (New York University), Benoit developed a measure to analyse the sophistication of political language, applied to a corpus of US State of the Union (SOTU) addresses. The research showed that the sophistication of these speeches had indeed declined over time, consistent with previously voiced concerns over a "dumbing down" of political discourse, but that this decline coincided with a shift from written to spoken delivery.

Another study, with Dr Alexander Herzog of Clemson University, used quanteda tools to examine politicians’ budget debate speeches between 1987 and 2013 in Ireland. By analysing speech, the research showed how politicians, fearful of being punished by their constituents for voting in support of austerity measures, were able to express their opposition in debates, even as they continued to vote along party lines. This analysis revealed an undermining of government cohesion in a way that scrutiny of voting behaviour could not.

In a final stream of the QUANTESS project (the European Union-funded "EUENGAGE"), Benoit and the team applied quanteda software to research attitudes to Brexit, analysing tens of millions of social media posts to classify and compare the language employed by pro-Leave and pro-Remain social media users. This revealed significant differences in tone and sentiment, with pro-Leave accounts generally more positive, using the language of reward, and more oriented towards the future. Pro-Remain messages, by comparison, adopted a more negative, less assertive emotional tone, more oriented towards the past.

What happened?

The innovation of the quanteda set of packages lay in their provision of new and existing text analysis tools, in an accessible way. To facilitate their wider use, Professor Benoit and Dr Kohei Watanabe (Beppu University) established the Quanteda Initiative (QI) – a non-profit, community interest company – in January 2018.

Since then, a series of QI-branded training workshops have been held internationally and tutorial materials prepared in five languages. Any revenues have been reinvested into the company to cover costs. The quanteda suite of tools, which is continuously updated and developed, remains free to use and open-source, and is designed to be accessible to non-expert users.

Uptake has been significant: the main quanteda package has been downloaded more than 870,000 times (with others, such as "stopwords", downloaded even more). Many data analytics companies, political analysts, international media organisations, and data science professionals now use it in their work. It has been endorsed by data science professionals globally, and has been adopted by universities and incorporated into research methods courses.

In February 2020, i24 news (an Israeli news platform) used quanteda in an experimental analysis which compared the texts of two significant peace initiatives. The first was the Peace to Prosperity plan, unveiled by the Donald Trump administration in January 2020 and embraced by Israel; the second was the Arab Peace Initiative of 2002, put forward by Saudi Arabia and garnering full Palestinian support. i24’s quanteda-enabled analysis generated new insights, such as the Trump plan’s focus on the language of economic initiatives and the glaring absence from it of the word "peace".

In July 2020, The Washington Post used the quanteda Lexicoder Sentiment Dictionary to analyse whether Russian disinformation campaigns were targeting African Americans. It analysed almost 40,000 purportedly divisive tweets sent from Ghana and Nigeria between June 2017 and March 2020, which were believed to be associated with Kremlin-backed sources. The newspaper’s analysis showed how these malicious accounts tweeted a mixture of sentiments to cultivate followers and manipulate US narratives about race and police conduct.

Facebook has also used quanteda. Its core data science team conducts large-scale, global, quantitative research to inform improvements to Facebook user experience. They have used quanteda to, for example, conduct text analysis on the linguistic diversity of recommendations on the platform.

In recognition of its academic utility, in 2020 quanteda received the Best Statistical Software Award from the Society for Political Methodology. The Society's committee wrote:

"[quanteda’s] extraordinary documentation not only makes it accessible for researchers from a variety of backgrounds, it also facilitates the further creation of packages and utilities, and supports its usage in teaching and training … quanteda’s innovation, accessible documentation, and functionality are testaments to the collaborative efforts of both junior and senior scholars that can serve as a model for future software development."
