Accessing information to support reporting is easier than ever. Governments and corporations publish vast amounts of data online, and web scrapers can gather the data that’s relevant to them, for example by routing requests through a rotating residential proxy. But very little of that information comes in a structured form that lends itself to easy analysis. Reporters are often faced with lists or tables, usually in HTML format, that aren’t easily manipulated. That’s where web scraping comes in handy.

Web scraping for journalists involves a few key steps:

  • Getting information from somewhere
  • Storing it somewhere that can be accessed later
  • Putting it in a form that makes it easy (or easier) to analyze and interrogate

For instance, a web scraper could gather information from a local police department website and store it in a spreadsheet, where it can be sorted, averaged, totaled, filtered and so on.
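
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production scraper: it uses only the standard library, and the HTML table, its column names and its figures are invented sample data standing in for a page that a real scraper would first download. The names TableParser, scrape_table and rows_to_csv are hypothetical helpers made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page (invented data). A real scraper would
# download the HTML from the website before parsing it.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Offense</th><th>Arrests</th></tr>
  <tr><td>2024-01-05</td><td>Burglary</td><td>3</td></tr>
  <tr><td>2024-01-06</td><td>Theft</td><td>5</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, if any
        self._cell = None   # text pieces of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Step 1: get the information out of the HTML."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Step 2: store it in a form a spreadsheet can open (CSV text)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = scrape_table(SAMPLE_HTML)
csv_text = rows_to_csv(rows)

# Step 3: once the data is structured, analysis is straightforward.
total_arrests = sum(int(row[2]) for row in rows[1:])  # skip the header row
```

Writing csv_text to a file gives something any spreadsheet program can open. For messy real-world pages, a dedicated HTML parsing library is usually a more robust choice than a hand-written parser like this one.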

But those are just the initial aspects of web scraping. Scraping tools and customized scripts offer further benefits, including:

  • Scheduling a scraper to run at regular intervals
  • Re-formatting data to clarify it, filter it, or make it compatible with other sets of data (for example, converting lat-long coordinates to postcodes, or feet to meters)
  • Visualizing data (for example as a chart, or on a map)
  • Combining data from more than one source (for example, scraping a list of company directors and comparing that against a list of donors)
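
Two of these benefits, re-formatting and combining, lend themselves to a short illustration. The sketch below assumes the scraped data is already in Python lists. All names and values are invented, and the helpers feet_to_meters, normalize_name and find_overlap are hypothetical functions written for this example.

```python
FEET_TO_METERS = 0.3048  # one foot is exactly 0.3048 meters


def feet_to_meters(feet):
    """Re-format a measurement so it matches a metric data set."""
    return round(feet * FEET_TO_METERS, 2)


def normalize_name(name):
    """Lower-case a name and collapse repeated spaces before comparing."""
    return " ".join(name.lower().split())


def find_overlap(directors, donors):
    """Return the directors whose normalized names also appear among donors."""
    donor_keys = {normalize_name(donor) for donor in donors}
    return [d for d in directors if normalize_name(d) in donor_keys]


# Invented sample data standing in for two scraped lists.
directors = ["Jane Doe", "John  Smith", "Alex Lee"]
donors = ["john smith", "Maria Garcia", "JANE DOE"]

matches = find_overlap(directors, donors)
# matches is ["Jane Doe", "John  Smith"]
```

Exact matching on cleaned-up names is a deliberately simple choice. It misses spelling variants and can pair up different people who share a name, so any match it produces is a lead to verify rather than a finding.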

Journalists have used web scraping to tell a number of important stories in the public interest, and many investigations have relied on scrapers to pull and organize data from the web. Now some programmer-journalists are developing new tools to expand and redefine what web scrapers can do.

