
Computational Textual Analysis

This guide gives resources for finding, cleaning, and analyzing textual corpora.

Start with a question!

At the most basic level, most textual analysis methods work by counting and classifying words or phrases within texts. Usually, this involves comparing these counts or classifications across a few or many documents. Some of these tools and methods are more advanced than others, but even the most complicated methods often begin by counting words.
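To make the idea of counting and comparing concrete, here is a minimal sketch in Python. The two short "documents" are made up for illustration, and the helper name word_counts is our own; a real project would read full texts from files and would usually clean them first.

```python
from collections import Counter

def word_counts(text):
    """Count each whitespace-separated word in a text, ignoring case."""
    return Counter(text.lower().split())

# Two tiny made-up documents, for illustration only.
doc_a = "the whale swam and the ship followed the whale"
doc_b = "the ship sailed home"

counts_a = word_counts(doc_a)
counts_b = word_counts(doc_b)

# Compare how often the same words appear in each document.
# A Counter returns 0 for words it has never seen.
for word in ["the", "whale", "ship"]:
    print(word, counts_a[word], counts_b[word])
```

Even this simple comparison hints at a difference between the two texts: "whale" appears twice in the first and never in the second.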

A few tips:

  • Start with a question you want to answer, not with a specific tool or method. Just like with any other scholarship, the most important first step is to identify the critical gap that your research will address. Don't constrain yourself yet; be imaginative!
  • Dream big -- what can you imagine learning that is new, exciting, or useful?
  • Then, imagine what kinds of texts might contain insights into your critical gap or research question.
  • Finally, scale back a bit. Find out what resources (text data) relevant to your project are available to you. Are your texts in the public domain and already digitized? If not, your work isn’t impossible, but it can get tricky! Do you know what sort of corpus (what body of texts) you might want to work with? Once you know your data and research question, you can start to think about how you will conduct your research.

Introduction to Computer-Assisted Textual Analysis

Why use computers to assist you with text analysis?

You are doing textual analysis any time you look at a text and pull out a deeper meaning. Computer-assisted textual analysis just means that you use some type of computing to help.

We use computers to help with text analysis because they can help us spot trends or themes that we might miss otherwise. 

Computers can help you:

  1. Read more stuff than you could feasibly read yourself. It would take you weeks to count how many times every different word appears in even one text. Computers can count words almost instantly. These word counts are central to many powerful forms of computer-assisted text analysis. Computers can also analyze which words tend to appear near each other, or seem to be related. Some programs can pull out place names or proper names, and then give you information on them.
  2. Automatically classify texts and parts of texts. Computers can “learn” to recognize the work of different authors. They can then use this information to classify anonymous texts in a more rigorous and reproducible way. Computers can also make pretty accurate predictions about a text's genre, theme, or primary issue area. They are good at recognizing broad categories, like whether a text is fiction or non-fiction, based on things like the presence of dialog. Big companies use sophisticated textual analysis software for a technique called sentiment analysis. This sheds light on whether people are saying positive or negative things about their products.
  3. Find and represent networks. Computers offer us the potential to see patterns of influence, tracing ideas as they transition from news into fiction, or showing us repeated tropes or ideas. They’ve also been used to visualize networks of who knew and was communicating with whom, and to represent the migration of ideas across time and space.
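As one illustration of the first point, the sketch below counts which words appear near each other. It is a simplified example under our own assumptions: the function name cooccurrences, the window of two words, and the sample sentence are all made up for illustration, and dedicated text analysis tools use more refined measures.

```python
from collections import Counter

def cooccurrences(tokens, window=2):
    """Count unordered pairs of different words that appear
    within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        # Look only at the next `window` tokens so each pair is counted once.
        for neighbor in tokens[i + 1 : i + 1 + window]:
            if neighbor != word:
                pairs[tuple(sorted((word, neighbor)))] += 1
    return pairs

# A made-up sentence, already lowercased and split into words.
tokens = "the white whale and the white ship".split()
pairs = cooccurrences(tokens, window=2)

# Show the pairs that occur together most often.
print(pairs.most_common(3))
```

Pairs are stored in alphabetical order, so ("the", "white") and ("white", "the") are treated as the same pair. On a large corpus, counts like these can suggest which words or ideas tend to travel together.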

Tip: If you’ve got a question that can be answered by a small number of texts that you can realistically and easily read, it might not be the best question for computer-assisted textual analysis. Many computer-assisted textual analysis methods require a fairly large corpus in order to produce results that can reliably be interpreted. Results produced from smaller collections of texts, even if they look meaningful, are likely to be misleading.

Tip: Textual analysis software is changing very quickly. There is probably a new tool out there today that wasn’t there a few months or years ago. There are several internet-based tools that let you perform many of the analyses mentioned above - and more! Always be conscious of when the tools and techniques you use were published/shared, and look for newer versions. Things more than a few years old are likely to be out-of-date compared to cutting-edge recent advances.

Preparing a Corpus for Textual Analysis

Text data that you gather from databases or websites often needs to be prepared in specific ways before it can be analyzed. This isn't the most glamorous part of digital scholarship, but it is a crucial first step. There is no one correct way to prepare (often termed 'pre-process') your text data either, although some preparation steps are so common that they are almost ubiquitous, while others are only appropriate in certain specific circumstances.

Why would you need to prepare a corpus? Most textual analysis projects rely on plain-text digital versions of documents, which are easily read and analyzed by a computer. These documents need to be relatively free of spelling errors, clearly labeled, and all collected in one place for the results to be accurate. Computers are also very sensitive to capitalization, punctuation, and other kinds of formatting: to a computer, "Whale", "whale", and "whale," are three different words. These differences usually (but not always) need to be removed or standardized before computers can help read texts.
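A minimal sketch of this kind of cleanup in Python might look like the following. The function name normalize and the sample sentence are our own, and the approach is only one of many: it handles ASCII punctuation only (curly quotes would be left in place), and it also strips apostrophes, which may not suit every project.

```python
import string

def normalize(text):
    """Lowercase a text, strip ASCII punctuation, and split it into words."""
    lowered = text.lower()
    # Build a translation table that deletes every ASCII punctuation mark.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

print(normalize("The Whale! The whale, at last."))
# prints ['the', 'whale', 'the', 'whale', 'at', 'last']
```

After this step, both mentions of the whale are counted as the same word, which is usually what you want when comparing word counts.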

Use the tabs at the top left of this page to explore the most common steps of preparing a digital corpus: finding, cleaning, and analyzing.

Features of the Best Projects

  • Big datasets. You’re looking at so many texts that it would be impossible for you to read and process them comprehensively on your own.
  • Unique questions. You’re looking at things which are very hard to understand otherwise.
  • Transformative. This project offers the potential to change the way we understand something--a historical movement, a person, literary history...
