The Spokesman-Review’s Ryan Pitts on Web Scraping

An interview with The Spokesman-Review’s Ryan Pitts by JA’s Tram Whitehurst.

Ryan Pitts is the online director at The Spokesman-Review in Spokane, Wash.

“We primarily use data made available through various APIs, in either JSON or XML format, or we take data exports in CSV and then import them into our own system. As far as writing scripts to go out and scrape pages that don’t have corresponding data made available in one of those other ways, well, we don’t really do much of it. Not because we can’t, but because there are enough projects to keep us plenty busy with data that’s easier to get at.

That said, if a project came up where we could ONLY get what we needed via scraping, the tool du jour these days definitely seems to be ScraperWiki. There’s a tutorial page that can help you get going, even with a nominal understanding of programming.

There’s also a great tutorial by Ben Welsh of the Los Angeles Times.

As far as scraping advice, I think one of the key things to keep in mind is the same as for data use via an API: be considerate of your data source. Don’t hammer their web server with your script. However you’re taking in the data, build in enough delays that their web server doesn’t suffer and other people can still access it.”
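
To make the advice about delays concrete, here is a minimal sketch in Python of a fetch loop that pauses between requests. It is not code from The Spokesman-Review. The function names, the two-second default delay, and the User-Agent string are all assumptions chosen for illustration, and a sensible delay depends on the site you are reading from.

```python
import time
import urllib.request
from typing import Callable, Iterable, List

# Hypothetical identifier so the site owner can see who is calling.
USER_AGENT = "example-newsroom-scraper/0.1"


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Download one page using only the standard library."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def polite_fetch_all(
    urls: Iterable[str],
    delay_seconds: float = 2.0,  # assumed value; tune it for the source
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Fetch each URL in order, sleeping between consecutive requests."""
    pages: List[bytes] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first, so the
            # server never sees back-to-back hits from this script.
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

Calling `polite_fetch_all(list_of_urls, delay_seconds=5)` would space the requests five seconds apart. The `fetch` and `sleep` parameters exist only so the loop can be exercised without touching a real server.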