Computational Textual Analysis

This guide gives resources for finding, cleaning, and analyzing textual corpora.

Step 2: Cleaning Your Texts

In most cases, you're going to need to do some work on your texts, whether that's scanning them into Optical Character Recognition software or fixing up common OCR errors.

As you work, be sure to save multiple versions of your texts--including the original version! And keep careful records of your texts and where you found them, along with any metadata, in a separate spreadsheet.
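If you would rather keep that record with a script than by hand, the sketch below appends one row per text to a CSV file that any spreadsheet program can open. It is only an illustration: the column names in FIELDS, the function name append_record, and the sample values in the usage comment are assumptions, not part of this guide.

```python
import csv
from pathlib import Path

# Hypothetical column names; change them to match the metadata you collect.
FIELDS = ["filename", "title", "author", "year", "source", "notes"]

def append_record(log_path, record):
    """Add one row of metadata to the CSV log, writing a header row if the file is new."""
    log_path = Path(log_path)
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        # Columns missing from `record` are left blank.
        writer.writerow(record)

# Example call (placeholder values):
# append_record("corpus_metadata.csv",
#               {"filename": "example_text.txt", "source": "where you found it"})
```

Keeping this log outside the folder that holds your texts means a batch search-and-replace cannot touch it.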

Hints and Tricks

  • Think about the order in which you do things. If you start by running everything through Lexos, which eliminates hyphens and makes everything lower-case, it's going to be much harder to correct hyphenated words and to eliminate names that are also words (e.g. frank/Frank, bell/Bell).
  • Work carefully and keep saving different versions. It's inevitable that you'll make some mistakes while cleaning, and unfortunately it's not always as easy as you'd expect to correct them. Notepad++ lets you change hundreds of files at once, but to undo what you just did, you have to go to each file in your corpus individually.
  • If something's not working, check your settings. If a search-and-replace in Notepad++ doesn't seem to be working, make sure that the Search Mode ("Normal" or "Regular expression") is set the way you want it for that particular search.
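One way to make "keep saving different versions" routine is to copy the whole corpus folder before every batch change. The following is a minimal sketch, assuming all of your texts live in a single folder; the function name snapshot_corpus and the folder names in the usage comment are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot_corpus(corpus_dir, backup_root):
    """Copy every file in corpus_dir into a new timestamped folder under backup_root."""
    corpus_dir = Path(corpus_dir)
    # The timestamp keeps each snapshot separate from earlier ones.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    target = Path(backup_root) / f"{corpus_dir.name}-{stamp}"
    # Keep backup_root OUTSIDE corpus_dir so the copy does not include itself.
    shutil.copytree(corpus_dir, target)
    return target

# Example call (hypothetical folder names):
# snapshot_corpus("my_corpus", "my_corpus_backups")
```

If a Replace All goes wrong, you can then restore the most recent snapshot instead of repairing each file individually.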

Optical Character Recognition (OCR) Software


Some Useful Regular Expressions

These search-and-replace functions will work in Notepad++ or a similar editor that supports regular expressions.

Remove all page numbers

  • Find what: [0-9]+
  • Replace with:
  • Search mode: Regular expression
  • Replace All in All Opened Documents

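Why? This search saves you from searching individually for every number (0, 1, 2, 3... 326, 327, etc.). Page numbers appear in almost all texts, but tell us little about the text. Be aware that the pattern matches every run of digits, not just page numbers, so dates and other numbers in the body of the text will be deleted too.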
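The same search can be tried on a short sample before you run it on a whole corpus. The sketch below uses Python's re module rather than Notepad++, and the sample text and the function name remove_numbers are made up for illustration. It shows the pattern deleting every run of digits wherever it appears.

```python
import re

def remove_numbers(text):
    """Delete every run of digits, wherever it appears in the text."""
    return re.sub(r"[0-9]+", "", text)

# Invented sample: a chapter number, a year, and a page number on its own line.
sample = "CHAPTER 1\nIt was 1847 when she left.\n23\nThe bell rang."
cleaned = remove_numbers(sample)
print(repr(cleaned))
# prints 'CHAPTER \nIt was  when she left.\n\nThe bell rang.'
```

The year in the second line disappears along with the page number, which is why it is worth checking a sample first.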

Remove all the uppercase words

  • Find what: [A-Z]{2,}
  • Replace with:
  • Search mode: Regular expression
  • Match case checked
  • Replace All in All Opened Documents

Why? Many texts (particularly from Internet Archive) include the title of the book and/or of the chapter on EVERY SINGLE PAGE. This repetition can skew your results; fortunately, these titles are often in all-caps, so you can eliminate them with regular expressions.

This particular regular expression deletes all-caps words of two letters or more. (It leaves the one-letter words I and A in place. These are common stopwords that are usually filtered out later anyway, but leaving them in means you can still keep them if you're interested in, for example, first-person use in your texts.)

Don't forget to check "match case"! If you leave the box unchecked, the search will try to delete every word of two letters or more from your corpus (erasing your data and crashing the program).
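To see why "match case" matters, here is a rough Python equivalent. It is a sketch under stated assumptions: it uses the pattern [A-Z]{2,} (two or more capital letters in a row), treats Python's re.IGNORECASE flag as a stand-in for leaving "match case" unchecked, and runs on an invented sample; the function name remove_all_caps is hypothetical.

```python
import re

# Two or more capital letters in a row.
ALL_CAPS = r"[A-Z]{2,}"

def remove_all_caps(text, match_case=True):
    """Delete runs of capitals; with match_case=False the search ignores case."""
    flags = 0 if match_case else re.IGNORECASE
    return re.sub(ALL_CAPS, "", text, flags=flags)

# Invented sample: an all-caps running title followed by a line of body text.
page = "JANE EYRE\nI saw A bell; Frank rang it."

# Case-sensitive: only the running title is removed.
print(repr(remove_all_caps(page)))

# Case-insensitive: every run of two or more letters is removed.
print(repr(remove_all_caps(page, match_case=False)))
```

In the case-sensitive run, "Frank" survives because it contains only one capital letter in a row; in the case-insensitive run, almost nothing but punctuation, spaces, and the single letters I and A is left.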