Screen Scraping Made Pretty Easy
Many public libraries offer a valuable but often overlooked resource: free or discounted museum passes. Here in Western Massachusetts, residents have access to the combined materials of the entire C/W MARS regional system of more than 140 libraries. Courtesy of our library cards, my family and I have made free trips to the Holyoke Children’s Museum, Amherst History Museum, Historic Deerfield, Magic Wings, Brattleboro Art Museum, Smith College Art Museum, Springfield Quadrangle and Boston Science Museum.
I became curious to know the full range of museum passes our library system had to offer. This information is available via the C/W MARS search engine, but the more than 50 results are spread out over 5 rather verbose pages. What I was really looking for was a concise, browsable list of available museum passes with links to detailed information.
A powerful, elegant PHP library called Simple HTML DOM Parser makes constructing such a page relatively easy. Contributed as open source by S. C. Chen, it features a find() method whose selector arguments are modeled after jQuery’s easy-to-use syntax. For instance, this small PHP snippet is virtually all that is needed to dump all the link destinations in a web page:
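// Load the parser (adjust the path to wherever the library is installed).
include_once('/simplehtmldom/simple_html_dom.php');
// Fetch the page and parse it into a DOM object.
$html = file_get_html('http://some-site.com/some-page.html');
// Print the destination of every link on the page.
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}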
I wrote a program that used Simple HTML DOM Parser to loop through each page of search results from a C/W MARS Library Catalog search query on the term "museum pass." The program parsed each page and extracted all HTML nodes with a class of "briefcitTitle" (find('.briefcitTitle')). Each node contained a link to an individual museum pass page, which was extracted and written back out as a table row. With a couple dozen lines of code, I was able to put together the list of museum passes I was after.
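The loop described above can be sketched roughly as follows. This is not the original program: so that it runs with no extra downloads, it swaps Simple HTML DOM Parser for PHP's built-in DOM extension, and it works on two made-up result pages instead of the live catalog. The function names (extractPassLinks, renderRows), the sample markup and the record URLs are all assumptions made for illustration.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension in place of Simple HTML DOM Parser.
// The sample markup below is an assumed shape for a results page, not the real catalog HTML.

/**
 * Collect title/href pairs from every link inside a node whose
 * class attribute includes "briefcitTitle".
 */
function extractPassLinks(string $pageHtml): array
{
    // Keep libxml quiet about sloppy real-world markup.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($pageHtml);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector ".briefcitTitle a".
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " briefcitTitle ")]//a[@href]';

    $links = [];
    foreach ($xpath->query($query) as $anchor) {
        $links[] = [
            'title' => trim($anchor->textContent),
            'href'  => $anchor->getAttribute('href'),
        ];
    }
    return $links;
}

/** Render the collected links as the rows of an HTML table. */
function renderRows(array $links): string
{
    $rows = '';
    foreach ($links as $link) {
        $rows .= sprintf(
            "<tr><td><a href=\"%s\">%s</a></td></tr>\n",
            htmlspecialchars($link['href']),
            htmlspecialchars($link['title'])
        );
    }
    return $rows;
}

// Hypothetical stand-ins for two pages of search results.
$samplePages = [
    '<html><body><span class="briefcitTitle"><a href="/record=1">Museum pass: Magic Wings</a></span></body></html>',
    '<html><body><span class="briefcitTitle"><a href="/record=2">Museum pass: Historic Deerfield</a></span></body></html>',
];

// Walk every page and accumulate the links found on each one.
$allLinks = [];
foreach ($samplePages as $pageHtml) {
    $allLinks = array_merge($allLinks, extractPassLinks($pageHtml));
}

$table = "<table>\n" . renderRows($allLinks) . "</table>\n";
echo $table;
```

To point a sketch like this at real search results, the sample strings would be replaced with the HTML of each results page, fetched for example with file_get_contents(), and the relative links would need the catalog's base URL prepended.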