
ScraperWiki

May 31, 2011 in Resources, Technology

ScraperWiki is a site and platform for building data scrapers that can transform unstructured data such as plain text files or tables on web pages (or even in PDFs) into structured data that can be queried with an API in JSON or XML format. Scrapers can be written in Python, Ruby, or PHP, and can be edited by anyone who has registered for the site (the “wiki” in the name). Non-programmers can request a data set and let the community help by putting together a working scraper. Source: Journalism Accelerator
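To give a flavor of what such a scraper looks like, here is a rough Python sketch in the ScraperWiki style. The URL and table layout are hypothetical, and it assumes the scraperwiki helper library the platform provides for fetching pages and saving rows to its datastore; treat it as a sketch rather than a recipe.

```python
# A rough sketch of a ScraperWiki-style scraper. The URL and table layout are
# hypothetical, and it assumes the scraperwiki helper library the platform provides.
import scraperwiki
import lxml.html

html = scraperwiki.scrape("http://example.gov/inspections.html")  # hypothetical page
root = lxml.html.fromstring(html)

for row in root.xpath("//table[@id='results']//tr")[1:]:   # skip the header row
    cells = [td.text_content().strip() for td in row.xpath(".//td")]
    if len(cells) < 3:
        continue
    record = {"name": cells[0], "date": cells[1], "score": cells[2]}
    # Rows saved to the datastore become queryable through ScraperWiki's JSON/XML API.
    scraperwiki.sqlite.save(unique_keys=["name", "date"], data=record)
```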

The Spokesman-Review’s Ryan Pitts on Web Scraping

April 20, 2011 in Blog, Craft, Experiments, Interview, Technology


An interview with The Spokesman-Review’s Ryan Pitts by JA’s Tram Whitehurst.

Ryan Pitts is the online director at The Spokesman-Review in Spokane, Wash.

“We primarily use data made available through various APIs, in either JSON or XML format, or we take data exports in CSV and then import them into our own system. As far as writing scripts to go out and scrape pages that don’t have corresponding data made available in one of those other ways, well, we don’t really do much of it. Not because we can’t, but because there are enough projects to keep us plenty busy with data that’s easier to get at.

That said, if a project came up where we could ONLY get what we needed via scrape, the tool du jour these days definitely seems to be ScraperWiki. There’s a tutorial page that can help you get going, even with a nominal understanding of programming.

There’s also a great tutorial here by Ben Welsh of the Los Angeles Times.

As far as scraping advice, I think one of the key things to keep in mind is the same as we do for data use via API — be considerate of your data source. Don’t hammer their web server with your script. In whatever way you’re taking in the data, build in enough delays that their web server doesn’t suffer and you don’t keep other people from being able to access it.”
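That advice translates almost directly into code. Here is a minimal sketch of a considerate scraper, assuming Python's requests library; the URLs and contact address are made up, and the point is simply the pause between requests.

```python
# A minimal sketch of a "considerate" scraper: pause between requests so the
# source's web server isn't hammered. The URLs and contact address are made up.
import time
import requests

urls = [
    "http://example.gov/reports?page=1",
    "http://example.gov/reports?page=2",
]

pages = []
for url in urls:
    response = requests.get(
        url,
        headers={"User-Agent": "newsroom-scraper (contact: data@example.com)"},
    )
    response.raise_for_status()
    pages.append(response.text)
    time.sleep(2)  # a couple of seconds between hits keeps the load negligible
```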

The Register-Guard’s John Heasly on Web Scraping

April 18, 2011 in Blog, Craft, Experiments, Interview

An interview with The Register-Guard’s scraper John Heasly by JA’s Tram Whitehurst.

John Heasly is the Web content editor at The Register-Guard in Eugene, OR.

What are some of the ways you’re using web scraping at your newspaper? How has it enhanced your work?

“Here’s one of my favorites. It updates every fifteen minutes and uses this page as its source. Another big scrape is elections… Here’s the full results page.

We also have a smaller customizable widget of selected races that we put on the homepage. It uses both county and state pages for data. The data also gets formatted into InDesign templates and is reverse-published into the paper.”

Where do you see the technology headed? Are there new ways in which it can be used?

“I don’t see any wildly revolutionary changes in the field of screen-scraping, as it’s kind of a hack/workaround to begin with. I think as data publishers (news sources, governments) get their acts together, there will be more APIs, so people can get at the data directly. I think the ease of geolocating events is going to continually increase.”

Who else is doing web scraping well?

“Well, any of the EveryBlock sites, of course. The L.A. Times’ crime map recently came to my attention. It seems pretty amazing.”

What is your advice to journalists looking to start using web scraping in their own work?

“Find a problem, attack it! Make sure it’s something you’re passionate about, otherwise, when you hit a bump — and you will hit bumps — you’ll get de-railed. Ask questions in newsgroups, Google groups, help.hackshackers.com. The open-source software and the advice/help are free. All you need is a computer and an Internet connection and an appropriate pig-headedness and you’re set!”

What are your favorite Web scraping tools and guides?

“I like Python as a language. I like the Python module BeautifulSoup for taming what I’ve scraped and Django as a Web framework for serving the scrapings.”
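For anyone curious what that “taming” step looks like, here is a small BeautifulSoup sketch; the HTML table is invented purely for illustration.

```python
# A small BeautifulSoup sketch: pull rows out of an HTML table. The markup is
# invented purely for illustration.
from bs4 import BeautifulSoup

html = """
<table id="bookings">
  <tr><th>Name</th><th>Charge</th></tr>
  <tr><td>Doe, Jane</td><td>DUI</td></tr>
  <tr><td>Smith, John</td><td>Theft</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="bookings").find_all("tr")[1:]  # skip the header row
bookings = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
print(bookings)  # [['Doe, Jane', 'DUI'], ['Smith, John', 'Theft']]
```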

 

Web Scraping 411

March 7, 2011 in Experiments, Resources, Technology

Accessing information to support reporting is easier than ever. Governments and corporations publish countless bytes of data online. But very little information comes in a structured form that lends itself to easy analysis. Reporters often are faced with lists or tables — usually in HTML format — that aren’t so easily manipulated. That’s where web scraping comes in handy.

Web scraping for journalists involves a few key steps:

  • Getting information from somewhere
  • Storing it somewhere it can be accessed later
  • And storing it in a form that makes it easy (or easier) to analyze and interrogate

For instance, a web scraper could be used to gather information from a local police department website and store it in a spreadsheet where it can be sorted, averaged, totaled, filtered and so on.
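Those three steps fit in a short script. The sketch below runs against a hypothetical police-log page, parses its HTML table with Python's requests and BeautifulSoup, and writes the rows to a CSV file that any spreadsheet can open; the URL and column layout are assumptions.

```python
# A sketch of the get / store / structure steps: fetch a (hypothetical) police
# log page, parse its HTML table, and write the rows to a CSV spreadsheet.
import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("http://police.example.gov/daily-log.html")  # hypothetical URL
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr")[1:]:              # skip the header row
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

with open("police_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "address", "offense"])           # assumed column layout
    writer.writerows(rows)
```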

But those are just the initial aspects of web scraping. Scraping tools and customized scripts offer further benefits, including:

  • Scheduling a scraper to run at regular intervals
  • Re-formatting data to clarify it, filter it, or make it compatible with other sets of data (for example, converting lat-long coordinates to postcodes, or feet to meters)
  • Visualizing data (for example as a chart, or on a map)
  • Combining data from more than one source (for example, scraping a list of company directors and comparing that against a list of donors)
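As a small illustration of that last point, matching one scraped list against another often comes down to normalizing names and intersecting two sets; the file and column names in this sketch are made up.

```python
# A sketch of combining two scraped data sets: which company directors also
# appear on a donor list? File names and column names are hypothetical.
import csv

def names_from(path, column):
    """Read one column from a CSV and normalize the names for matching."""
    with open(path, newline="") as f:
        return {row[column].strip().lower() for row in csv.DictReader(f)}

directors = names_from("directors.csv", "name")
donors = names_from("donors.csv", "donor_name")

for name in sorted(directors & donors):
    print(name)  # people who appear in both lists
```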

Journalists have used web scraping to tell a number of important stories in the public interest. Many investigations have relied on scrapers to pull and organize data from the Web. Now some programmer journalists are working to develop new tools to expand and redefine what web scrapers can do.

@OWNIeu talks about the state of open data.