Webscraping

Scraping, Visualizing, and Analyzing the Web

APIs

What is an API?

An Application Programming Interface (API) lets different software programs communicate with one another by exchanging data. In short, APIs allow for the sharing of data that apps need in order to function. This allows for:

  • Operating systems to provide shortcuts for programmers, so that new code does not need to be written each time
  • Companies to become more virtually connected with consumers, offering more services digitally
  • Researchers to collect data from the web in a more structured format

A researcher may write their own code to access an API directly. They may also use packages or third-party apps whose authors have already written the code needed to access the API. Twitter Archiver, for example, is a Google app that uses Twitter's API to automatically run Twitter searches and collect the results in a Google spreadsheet. Depending on what you are doing, APIs have their limits for researching and collecting data on the web, and Twitter is a great example of this. Twitter's API limits how often one can search and how far back one can search for tweets, which may be limited to the previous seven days.
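As a rough sketch of what "writing your own code" against an API can look like, the snippet below builds a search request URL and clamps the start date to a seven-day window like the one described above. The endpoint `https://api.example.com/search`, the parameter names `q` and `since`, and the function name `build_search_url` are all hypothetical; a real API documents its own URL and parameters, and nothing is sent over the network here.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.com/search"
MAX_LOOKBACK_DAYS = 7  # assumed search window, mirroring the seven-day limit


def build_search_url(query, since, today):
    """Return a search URL, clamping `since` to the allowed lookback window."""
    earliest = today - timedelta(days=MAX_LOOKBACK_DAYS)
    start = max(since, earliest)  # dates older than the window are clamped
    params = {"q": query, "since": start.isoformat()}
    return f"{BASE_URL}?{urlencode(params)}"


# Asking for data from January is clamped to seven days before "today".
url = build_search_url("open data", date(2020, 1, 1), today=date(2020, 3, 10))
print(url)
```

Clamping on the client side is only one possible design; an actual API would typically reject or silently truncate an out-of-range request on its own.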

API Access

Each API specifies who may access its data and for what purpose. Whether the request comes through a third-party app or directly through code, the data source will want to authenticate it. There are generally two types of authentication: project-based and user-based.

With project-based approaches, a researcher may be granted an API key tied to a specific project they've identified. Another researcher who has permission to use that key may use it to access the API, and all search results and queries will be saved to the project. The identity of whoever is using the API key might not be recorded.

For user-based approaches, a researcher may be granted a user-specific authentication token. With Google, for example, the authentication token consists of a "key" and a "secret." The token is tied to the user's Google account. This means that other researchers cannot use the authentication token but must apply for their own. Twitter's API uses a similar method for its 30-day searches.
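The difference between the two approaches can be sketched in code. In this minimal example, a project-based key is passed as a request parameter, while a user-based token is sent in an `Authorization` header. The endpoint, the `key` parameter name, the `Bearer` scheme, and the placeholder credentials are assumptions made for illustration; each real API documents exactly how its credentials must be supplied. The requests are only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, parameter name, and credentials, for illustration only.
BASE_URL = "https://api.example.com/search"


def project_request(query, api_key):
    """Project-based: a shared key travels with the request as a parameter."""
    return Request(f"{BASE_URL}?{urlencode({'q': query, 'key': api_key})}")


def user_request(query, user_token):
    """User-based: a personal token is sent in the Authorization header."""
    req = Request(f"{BASE_URL}?{urlencode({'q': query})}")
    req.add_header("Authorization", f"Bearer {user_token}")
    return req


shared = project_request("elections", "PROJECT_KEY")
personal = user_request("elections", "USER_TOKEN")
print(shared.full_url)
print(personal.get_header("Authorization"))
```

Because the project key is just a value anyone on the project can include, the API cannot tell team members apart, whereas the user token identifies one account.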

API versus Webscraping 

When no API is available, or the API is too limiting, webscraping is an option. It works by "scraping" data directly from a site's underlying HTML. One webscraping package that suits those with some familiarity with Python is Beautiful Soup. Like most scraping packages for Python, it parses a webpage's HTML and represents it as what is known as a parse tree, which can then be searched for the data of interest. Although it requires more knowledge of how to read HTML tags, and the data it returns is messier than what an API provides, Beautiful Soup can be used to get around the limits of APIs.
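To illustrate what scraping data out of HTML tags involves, here is a minimal sketch that pulls every link and its text from a small page. It deliberately uses Python's standard-library `html.parser` module rather than Beautiful Soup, so that it runs without installing anything; Beautiful Soup offers a higher-level interface for the same kind of task. The page content and the class name `LinkCollector` are made up for the example.

```python
from html.parser import HTMLParser

# Made-up page content, for illustration only.
PAGE = """
<html><body>
  <h1>Recent posts</h1>
  <a href="/posts/1">First post</a>
  <a href="/posts/2">Second post</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
# [('/posts/1', 'First post'), ('/posts/2', 'Second post')]
```

Note that the scraper depends entirely on the page's tag structure: if the site changes its HTML, the code must change too, which is part of why scraped data tends to be messier than API data.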

Webscraping does not always require writing your own code, as many programs have already been built on packages like Beautiful Soup. Returning to our Twitter example, there is a webscraping application, Twitter Scraper, that uses Beautiful Soup to extract data from Twitter without going through its Search API. Twitter Scraper returns as many search results, from as far back in Twitter's history, as one wishes to retrieve, with the only limiting factor being how fast one's server is running.