
Web Scraping

Scraping, Visualizing, and Analyzing the Web

Mining Data

With the plethora of data available in digital spaces, it only makes sense that scholars want to access, visualize, and analyze this data. Many software packages are available to aid in this pursuit, and programming languages paired with social media APIs make the scraping process even more customized. This guide will focus on scraping, visualizing, and analyzing Twitter data.

Originally created by: Angela Cirucci

Updated by: Elizabeth Rodrigues

Currently updated & maintained by: Caroline Tynan

Why scrape the web?

Web scraping is the harvesting of data from websites, either through an API or directly from the pages a browser renders. The point of web scraping is to collect a large amount of information relatively easily and quickly. It works in two steps (see the sketch after this list):

  • Downloading a web page's content
  • Extracting that information en masse 
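A minimal sketch of those two steps in Python, assuming the requests and beautifulsoup4 packages are installed and that the target page permits scraping; the URL is just a placeholder:

    # Step 1: download a web page's content; step 2: extract information
    # en masse (here, every link on the page). The URL is a placeholder.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com")
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.find_all("a"):
        print(link.get("href"), link.get_text(strip=True))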

Data gathered through web scraping is often not useful in its original, unstructured form. Once collected, however, it can be converted into a more readable, organized format. For example, the program Twitterscraper goes through the Twitter API to collect as many tweets as you want based on search terms or hashtags of interest. Twitterscraper then provides this data as a JSON file, which is unformatted and difficult to read. With a simple Python script, it can be converted into a readable CSV file that opens in a spreadsheet program such as Excel.
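As a rough illustration of that conversion step, the sketch below turns a JSON file of scraped tweets into a CSV. The file name tweets.json and the field names (user, timestamp, text) are assumptions for the example; adjust them to match whatever your scraper actually outputs.

    # A minimal JSON-to-CSV conversion sketch. Assumes tweets.json holds
    # a JSON array of objects; the field names below are hypothetical.
    import csv
    import json

    with open("tweets.json", encoding="utf-8") as f:
        tweets = json.load(f)

    with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "timestamp", "text"])  # header row
        for tweet in tweets:
            writer.writerow([tweet.get("user"),
                             tweet.get("timestamp"),
                             tweet.get("text")])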

There are a number of different tools that allow you to scrape the web in different ways. In many cases, scraping is not necessary, because websites and apps increasingly make their Application Programming Interfaces (APIs) available to users. To see why web scraping might be useful compared with using an API, it helps to have a basic understanding of what exactly an API is.
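In brief, an API is an endpoint a service exposes so that programs can request structured data directly, with no HTML parsing required. A quick sketch using GitHub's public API (any public JSON API would illustrate the same point):

    # Requesting structured data from a public API (here, GitHub's)
    # instead of scraping HTML. The response is already organized JSON.
    import requests

    response = requests.get("https://api.github.com/users/octocat")
    response.raise_for_status()
    profile = response.json()  # a Python dict, not raw HTML
    print(profile["login"], profile["public_repos"])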

Tools

There is an ever-changing array of tools and platforms to mine Twitter data. Listed below are some software packages that are free, or at least offer a free version, and that are a good place to start.

We often advise students and researchers to start with NodeXL. It is easy to use and reliable, although it recently limited the capabilities of its free version. As summarized in the "Visualization and Analysis" section of this guide, NodeXL is a useful tool because it not only scrapes the data, but also helps with visualization and analysis.

This guide also covers scraping with programming scripts, in particular the Python package Twarc (a brief sketch follows). Although scripting methods require coding experience or a willingness to learn, for more in-depth or customized studies it is often worth learning enough code to use one of these approaches.
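A brief sketch of collecting tweets with Twarc, assuming its original (v1) Python interface; the credential strings and the hashtag are placeholders, and you must register your own Twitter developer app to obtain keys:

    # Collecting tweets with twarc (v1-style interface). The four
    # credential strings are placeholders for your own developer keys.
    from twarc import Twarc

    t = Twarc(consumer_key="...",
              consumer_secret="...",
              access_token="...",
              access_token_secret="...")

    # search() pages through the Twitter search API, yielding tweet dicts
    for tweet in t.search("#digitalhumanities"):
        print(tweet["id_str"], tweet["full_text"][:80])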

After scraping

Scraping is only the first step. Once you have your data, there will be much cleaning and organizing to do.

Mining the site only gives you raw information. Some scraping software packages come with visualization and analysis tools. You can also employ other methods and tools.

Some ways of visualizing and analyzing your data include comparing it to other online norms, to performances on other social networking sites, and to offline phenomena. For example, you may want to compare networks with quantitative methods (a toy sketch follows), or compare the content of tweets from two different hashtags with qualitative methods.
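As a toy illustration of the quantitative route, the sketch below compares two hashtag networks with the networkx package; the edge lists are invented stand-ins for real retweet or mention data:

    # Comparing two hypothetical hashtag networks with networkx.
    # Each edge stands for one user retweeting or mentioning another
    # (invented data for illustration only).
    import networkx as nx

    tag_a = nx.Graph([("ana", "ben"), ("ben", "cai"), ("cai", "ana")])
    tag_b = nx.Graph([("dee", "eli"), ("eli", "fay")])

    for name, g in [("#hashtagA", tag_a), ("#hashtagB", tag_b)]:
        print(name,
              "nodes:", g.number_of_nodes(),
              "edges:", g.number_of_edges(),
              "density:", round(nx.density(g), 2))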

The rest of this guide introduces our preferred tools and methods for scraping, visualizing, and analyzing Twitter data.
