The Register-Guard’s John Heasly on Web Scraping

An interview with The Register-Guard’s scraper John Heasly by JA’s Tram Whitehurst

John Heasly is the Web content editor at The Register-Guard in Eugene, OR.

What are some of the ways you’re using web scraping at your newspaper? How has it enhanced your work?

“Here’s one of my favorites. It updates every fifteen minutes and uses this page as its source. Another big scrape is elections… Here’s the full results page.

We also have a smaller customizable widget of selected races that we put on the homepage. It uses both county and state pages for data. The data also gets formatted into InDesign templates and is reverse-published into the paper.”
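As a rough illustration of the widget idea described above, here is a minimal Python sketch that picks selected races out of two sources. The race names, vote counts and the `build_widget` helper are all hypothetical; the interview does not describe how The Register-Guard's code is actually written.

```python
# Hypothetical data: in practice these dicts would be built by scraping
# the county and state results pages.
county_results = {
    "Mayor": {"Candidate A": 1200, "Candidate B": 950},
    "Measure 1": {"Yes": 800, "No": 1100},
}
state_results = {
    "Governor": {"Candidate C": 50000, "Candidate D": 48000},
}


def build_widget(selected_races, *sources):
    """Return (race, tallies) pairs for the selected races, in the order given.

    Later sources overwrite earlier ones if the same race appears in both.
    """
    merged = {}
    for source in sources:
        merged.update(source)
    return [(race, merged[race]) for race in selected_races if race in merged]


widget = build_widget(["Governor", "Mayor"], county_results, state_results)
for race, tallies in widget:
    leader = max(tallies, key=tallies.get)
    print(f"{race}: {leader} leads with {tallies[leader]} votes")
```

Re-running a script like this on a schedule (every fifteen minutes, say) would keep such a widget current, and the same merged data could feed a print template.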

Where do you see the technology headed? Are there new ways in which it can be used?

“I don’t see any wildly revolutionary changes in the field of screen-scraping, as it’s kind of a hack/workaround to begin with. I think as data publishers, news sources and governments get their acts together, there will be more APIs, so people can get at the data directly. I think the ease of geolocating events is going to continually increase.”

Who else is doing web scraping well?

“Well, any of the EveryBlock sites, of course. The L.A. Times’ crime map recently came to my attention. It seems pretty amazing.”

What is your advice to journalists looking to start using web scraping in their own work?

“Find a problem, attack it! Make sure it’s something you’re passionate about; otherwise, when you hit a bump (and you will hit bumps), you’ll get derailed. Ask questions in newsgroups, Google groups, help.hackshackers.com. The open-source software and the advice/help are free. All you need is a computer and an Internet connection and an appropriate pig-headedness and you’re set!”

What are your favorite Web scraping tools and guides?

“I like Python as a language. I like the Python module BeautifulSoup for taming what I’ve scraped and Django as a Web framework for serving the scrapings.”
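For readers who have not seen BeautifulSoup in action, here is a minimal sketch of "taming" a scraped results table. The HTML snippet, the `results` table id and the `parse_results` function are invented for illustration, and the sketch assumes the `beautifulsoup4` package is installed. It is not Heasly's code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a scraped results page.
SAMPLE_HTML = """
<table id="results">
  <tr><th>Race</th><th>Candidate</th><th>Votes</th></tr>
  <tr><td>Mayor</td><td>Candidate A</td><td>1,200</td></tr>
  <tr><td>Mayor</td><td>Candidate B</td><td>950</td></tr>
</table>
"""


def parse_results(html):
    """Turn the rows of the results table into a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")  # parser from the standard library
    table = soup.find("table", id="results")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        race, candidate, votes = [
            td.get_text(strip=True) for td in tr.find_all("td")
        ]
        rows.append({
            "race": race,
            "candidate": candidate,
            "votes": int(votes.replace(",", "")),  # "1,200" -> 1200
        })
    return rows


rows = parse_results(SAMPLE_HTML)
for row in rows:
    print(row)
```

Once the data is in plain Python dicts like these, it can be stored and served by a web framework such as Django, or handed to whatever template produces the print version.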