Computational Textual Analysis

This guide offers resources for finding, cleaning, and analyzing textual corpora.

Step 1: Finding Your Texts

Sometimes you'll be starting with a ready-made corpus of texts you want to analyze, but in other cases, you're going to have to spend some time tracking down the texts you want to use.

Your top choice will be pre-digitized texts that someone has spell-checked or retyped manually. Your second choice will be texts that have already been OCRed (run through optical character recognition). Your third choice is texts that still require OCR. If none of your texts have been digitized, digitizing them is an additional step, and not one that we'll cover here.

Hints and Tricks

  • You won't find everything you look for. Some texts don't exist online. Others aren't available for download. If you can, build in extra time to tweak your corpus.
  • Be careful with your searches. Many database search engines are as flexible as the ones we're accustomed to, but others only work if you're very specific. In HathiTrust, for example, do an advanced search by title to get accurate results.
  • Keep good records. You'll want to develop consistent naming conventions for your files--for example, the date, the author's last name, and the title, e.g. 1870DickensEdwinDrood.txt. Starting with the date means the files sort chronologically, which can make it easier for a computer to track changes over time--MALLET, for example, gives you the option to preserve the order of the texts while performing an analysis.
  • Record relevant information in a separate spreadsheet, such as where the text came from, what you still need to do to it (clean it, OCR it, etc.), and any other notes you'll want later.
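
The naming and record-keeping tips above can be sketched in a few lines of Python. This is only a minimal illustration, not a required workflow: the function names (make_filename, write_log), the column names, and the sample record are all assumptions made up for the example, and you should adapt them to your own project.

```python
import csv
import io
import re

# Hypothetical column names for the tracking spreadsheet; rename to suit your project.
FIELDS = ["filename", "year", "author", "title", "source", "todo", "notes"]


def make_filename(year, author_last, title, ext="txt"):
    """Build a date-first name such as 1870DickensEdwinDrood.txt."""
    def squash(text):
        # Keep only runs of letters/digits and capitalize the first letter of each run.
        words = re.findall(r"[A-Za-z0-9]+", text)
        return "".join(w[:1].upper() + w[1:] for w in words)

    return f"{int(year):04d}{squash(author_last)}{squash(title)}.{ext}"


def write_log(records, handle):
    """Write one CSV row per text, filling in the filename column automatically."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for rec in records:
        row = dict(rec)
        row["filename"] = make_filename(rec["year"], rec["author"], rec["title"])
        writer.writerow(row)


# Made-up sample record, for illustration only.
records = [
    {"year": 1870, "author": "Dickens", "title": "Edwin Drood",
     "source": "HathiTrust", "todo": "clean", "notes": ""},
]

# Writing to an in-memory buffer here; pass an open file to save a real spreadsheet.
buf = io.StringIO()
write_log(records, buf)
print(buf.getvalue())
```

To save the log to disk instead, open a file with open("corpus_log.csv", "w", newline="") and pass it to write_log in place of the buffer (the file name is, again, just an example). Because every name starts with a four-digit year, sorting the filenames alphabetically also sorts them chronologically.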

Sources for Clean Texts (Top Choice!)

Sources for OCRed Text

Sources for PDFs