The Spokesman-Review’s Ryan Pitts on Web Scraping
An interview with The Spokesman-Review’s Ryan Pitts by JA’s Tram Whitehurst.
Ryan Pitts is the online director at The Spokesman-Review in Spokane, Wash.
“We primarily use data made available through various APIs, in either JSON or XML format, or we take data exports in CSV and then import them into our own system. As far as writing scripts to go out and scrape pages that don’t have corresponding data made available in one of those other ways, well, we don’t really do much of it. Not because we can’t, but because there are enough projects to keep us plenty busy with data that’s easier to get at.
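For readers who haven't worked with those formats, here's a quick sketch in Python of the two approaches Ryan describes: pulling JSON from an API and importing a CSV export. The URL and filename are placeholders, not The Spokesman-Review's actual sources.

```python
# Illustration of the two approaches mentioned above: pulling JSON from an
# API and importing a CSV export. The URL and filename are placeholders.
import csv
import json
from urllib.request import urlopen

# Fetch records from a (hypothetical) JSON API endpoint.
with urlopen("https://example.com/api/records.json") as response:
    api_records = json.loads(response.read().decode("utf-8"))

# Import rows from a CSV export into a list of dictionaries.
with open("export.csv", newline="") as f:
    csv_rows = list(csv.DictReader(f))

print(len(api_records), "records from the API,", len(csv_rows), "rows from the CSV")
```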
That said, if a project came up where we could ONLY get what we needed via scrape, the tool du jour these days definitely seems to be ScraperWiki. There’s a tutorial page that can help you get going, even with a minimal understanding of programming.
There’s also a great tutorial by Ben Welsh of the Los Angeles Times.
As far as scraping advice, I think one of the key things to keep in mind is the same as we do for data use via API — be considerate of your data source. Don’t hammer their web server with your script. In whatever way you’re taking in the data, build in enough delays that their web server doesn’t suffer and other people can still access it.”
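A minimal sketch of that advice in Python: put a delay between requests so the target server isn't hammered. The URLs here are placeholders.

```python
# Space out requests with a delay so the target server isn't hammered.
# The URLs below are placeholders.
import time
from urllib.request import urlopen

urls = [
    "https://example.com/results/page1.html",
    "https://example.com/results/page2.html",
]

pages = []
for url in urls:
    with urlopen(url) as response:
        pages.append(response.read().decode("utf-8"))
    time.sleep(2)  # wait a couple of seconds before the next request
```

Many sites also publish a preferred crawl delay in robots.txt, which is worth honoring in the same spirit.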
Weigh In:
ScraperWiki is a fantastic resource for getting data that doesn’t exist in a friendlier format. I first heard about it at the Open Gov West conference in Portland this year, and have since used it to get results data from the Multnomah County Elections office (http://scraperwiki.com/scrapers/multnomah_county_elections/) – having access to queryable data instead of plain text is huge.
Cool, Jeremy. So this pulled a summary of the longer “texty” pages on the County’s election site? I’m guessing it can summarize anything you, or a campaign for example, want to watch on election night.
Exactly – what this scraper does is pull down the plain text results (for example http://www.co.multnomah.or.us/dbcs/elections/2011-05/results.shtml), do some simple text processing to pull out the different fields, and then put that into a database. That way you can get the data in a format like JSON, which an app could use to summarize results (or watch a specific race and alert you if anything changes).
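For anyone curious what that pipeline looks like in code, here's a rough Python sketch (not Jeremy's actual scraper): fetch the results page, pull out candidate and vote fields with a simple pattern, store them in SQLite, and print JSON an app could watch on election night. The line format the regex expects is an assumption for illustration; the real page layout may differ.

```python
# Rough sketch of the pipeline described above: fetch plain-text results,
# parse out candidate/vote fields, store them, and emit JSON.
# The line format assumed by LINE_PATTERN is illustrative only.
import json
import re
import sqlite3
from urllib.request import urlopen

RESULTS_URL = "http://www.co.multnomah.or.us/dbcs/elections/2011-05/results.shtml"

# Assumed line format: a candidate name, a run of dots/spaces, then a vote count.
LINE_PATTERN = re.compile(r"^(?P<candidate>[A-Za-z .'-]+?)[. ]{3,}(?P<votes>[\d,]+)\s*$")

def scrape_results(url):
    with urlopen(url) as response:
        text = response.read().decode("utf-8", errors="replace")
    results = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            results.append({
                "candidate": match.group("candidate").strip(),
                "votes": int(match.group("votes").replace(",", "")),
            })
    return results

def save_results(results, db_path="elections.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (candidate TEXT, votes INTEGER)")
    conn.execute("DELETE FROM results")  # keep only the latest snapshot
    conn.executemany("INSERT INTO results VALUES (?, ?)",
                     [(r["candidate"], r["votes"]) for r in results])
    conn.commit()
    conn.close()

if __name__ == "__main__":
    results = scrape_results(RESULTS_URL)
    save_results(results)
    print(json.dumps(results, indent=2))  # JSON a front-end app could consume
```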