Computational Textual Analysis

This guide gives resources for finding and cleaning your texts.

Picking a Topic

Remember, on its most basic level, all textual analysis is counting and classifying words. Some of these tools are more advanced than others, of course!

A few tips:

  • Start with the question, not with the tool you want to use. As with any other scholarship, there needs to be a critical gap that your research will fill.
  • Start out by dreaming big--what can you imagine learning that is new, exciting, or useful?
  • Then, scale back a bit. Look at the resources available to you. Are your texts in the public domain and already digitized? (If not, your work isn’t impossible, but it can get tricky!) Do you know what sort of corpus you might want to work with?

Introduction to Computer-Assisted Textual Analysis

What does computer-assisted textual analysis mean?

Textual analysis means looking closely at a text to draw out a deeper meaning. Computer-assisted textual analysis just means that you use some type of computing to help.

The real reason to do this is that it can show you something you might not otherwise be able to see.

What does it help you do that you couldn’t otherwise do?

  1. Read more stuff than you can really read. It would take you weeks to count how many times different words appear in even one text. Computers can count words almost instantly. They can also analyze which words tend to appear near each other, or seem to be related. Some programs can pull out place names or proper names, and then give you information on them.
  2. Automatically classify texts and parts of texts. For example, computers can “learn” to recognize the work of different authors. They can then use this information to classify anonymous texts in a more rigorous and reproducible way. Computers can make pretty accurate predictions of genre. They are good at recognizing things like whether a text is fiction or non-fiction, based on things like the presence of dialog. Big companies use sophisticated textual analysis software for “Sentiment Analysis,” which sheds light on whether people are saying positive or negative things about their products.
  3. Find and represent networks. Computers offer us the potential to see patterns of influence, tracing ideas as they transition from news into fiction, or showing us repeated tropes or ideas. They’ve also been used to visualize networks of who knew and was communicating with whom, and to represent the migration of ideas across time and space.
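To make the first point concrete, here is a minimal sketch of counting words and finding which words appear near a word of interest. It uses only Python's standard library; the function names (tokenize, word_counts, neighbors), the sample sentence, and the very simple definition of a "word" are assumptions made for illustration, not part of any particular tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (letters, with an optional
    internal straight apostrophe, as in "don't"). A deliberately simple rule."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_counts(text):
    """Count how many times each word appears in the text."""
    return Counter(tokenize(text))


def neighbors(text, target, window=2):
    """Count the words found within `window` words of each occurrence of `target`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Hypothetical sample text, for illustration only.
sample = "The whale swam. The white whale dived, and the ship chased the whale."
counts = word_counts(sample)
near_whale = neighbors(sample, "whale", window=1)

print(counts.most_common(3))
print(near_whale.most_common(3))
```

On a real corpus you would probably want a more careful tokenizer (one that handles curly apostrophes, hyphenated words, and non-English characters), but the counting logic stays the same.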

Tip: If you’ve got a question that can be answered by a small number of texts that you can realistically and easily read, it might not be the best question for computer-assisted textual analysis. Many computer-assisted textual analysis methods require a fairly large corpus in order to produce results that can be reliably interpreted. Results produced from smaller collections of texts, even if they look meaningful, are likely to be misleading.

Tip: Textual analysis software is changing very quickly, and there is probably a new tool out there today that wasn’t there yesterday that will let you do all of this - and more! Always check when the tools and techniques you use were published or shared, and look for newer versions. Anything more than a few years old is likely to be out of date compared to recent advances.

Preparing a Corpus for Textual Analysis

This guide walks you through the steps and provides you with resources for finding and cleaning your texts--the necessary preliminaries to textual analysis projects. This isn't the most glamorous part of digital scholarship, but with these resources, you can move quickly and avoid the most common pitfalls.

Why would you need to prepare a corpus? Most textual analysis projects rely on plain-text digital versions of documents, which are easily read and analyzed by a computer. These documents need to be relatively clean of spelling errors, clearly labeled, and all collected in one place for the results to be accurate.
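As a rough illustration of what "collected in one place" and "relatively clean" can look like in practice, the sketch below reads every plain-text file in a folder, tidies it lightly, and labels each text by its file name. The function names (clean_text, load_corpus), the folder name in the usage comment, and the assumption that your files are UTF-8 encoded .txt files are all hypothetical choices for this example. It does not fix spelling or OCR errors, which usually need separate attention.

```python
import re
import unicodedata
from pathlib import Path


def clean_text(raw):
    """Apply light, conservative cleanup to one document."""
    # Normalize Unicode so that visually identical characters compare as equal.
    text = unicodedata.normalize("NFC", raw)
    # Collapse runs of spaces, tabs, and line breaks into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def load_corpus(folder):
    """Read every .txt file in `folder` into a dict keyed by file name (without extension)."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.stem] = clean_text(path.read_text(encoding="utf-8"))
    return corpus


# Example usage, assuming a hypothetical folder named "my_corpus":
# corpus = load_corpus("my_corpus")
# for label, text in corpus.items():
#     print(label, len(text.split()), "words")
```

Keying each text by its file name is one simple way to keep documents clearly labeled; descriptive file names such as an author and year make later results easier to interpret.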

Use the tabs at the top to explore the most common steps of preparing a digital corpus: finding, cleaning, and analyzing.

Features of the Best Projects

  • Big datasets. You’re looking at so many texts that it would be impossible to read and process them all comprehensively yourself.
  • Unique questions. You’re looking at things which are very hard to understand otherwise.
  • Transformative. This project offers the potential to change the way we understand something--a historical movement, a person, literary history...