An Automatic Data Grabber for Large Web Sites

https://doi.org/10.1016/B978-012088469-8.50137-6

Publisher Summary

This chapter investigates a system to automatically grab data from data-intensive Web sites. The system first infers a model that describes the Web site as a collection of classes. Each class represents a set of structurally homogeneous pages and is associated with a small set of representative pages. Based on the model, a library of wrappers, one per class, is then inferred with the help of an external wrapper generator. The model, together with the library of wrappers, can thus be used to navigate the site and extract the data. The inference process is performed incrementally. The system starts from a given entry point, which becomes the first member of the first class in the model. It then refines the model by exploring its boundaries to gather new pages. At each iteration, the system selects an outbound link collection from the model and iteratively fetches pages by following the links in the collection. To reduce the number of pages actually visited, after each download the system makes a guess about the class of the remaining pages. If the pages already downloaded provide sufficient evidence that the guess is correct, the remaining pages of the collection are assigned to classes without actually being fetched. The process iterates until all the link collections are typed with a known class.
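To make the incremental loop concrete, the following is a minimal, runnable sketch over an in-memory toy site. It is not the paper's algorithm: the structural "signature" used to compare pages, the fixed sample size used before guessing a collection's class, the flat list-of-links frontier, and the toy URLs are all illustrative assumptions standing in for the system's page-classification and link-collection machinery.

```python
# Hedged sketch of the incremental site-model inference loop described above.
# The signature function, sample size, and toy site are assumptions made for
# illustration only.

from collections import defaultdict

# Toy site: URL -> (structural signature, outbound link collection).
# In the real system the signature would be derived from the page's HTML
# structure, and links would be grouped into collections per the site model.
TOY_SITE = {
    "/index":    ("home",   ["/author/1", "/author/2", "/author/3"]),
    "/author/1": ("author", ["/paper/a", "/paper/b"]),
    "/author/2": ("author", ["/paper/c"]),
    "/author/3": ("author", ["/paper/d", "/paper/e"]),
    "/paper/a":  ("paper",  []),
    "/paper/b":  ("paper",  []),
    "/paper/c":  ("paper",  []),
    "/paper/d":  ("paper",  []),
    "/paper/e":  ("paper",  []),
}

SAMPLE_SIZE = 2  # pages fetched from a collection before guessing its class


def fetch(url):
    """Stand-in for an HTTP download; returns (signature, outbound links)."""
    return TOY_SITE[url]


def grab(entry_point):
    # The model maps a class (here, just a signature) to its member URLs;
    # the fetched pages serve as the class's representative samples.
    model = defaultdict(list)
    fetched = {}

    sig, links = fetch(entry_point)
    model[sig].append(entry_point)
    fetched[entry_point] = links

    # Frontier of untyped link collections on the model's boundary.
    frontier = [links] if links else []

    while frontier:
        collection = frontier.pop()
        sampled_sigs = []
        for i, url in enumerate(collection):
            if url in fetched:
                continue
            # Guess: if the pages sampled so far all belong to one class,
            # assign the rest of the collection without fetching them.
            if len(sampled_sigs) >= SAMPLE_SIZE and len(set(sampled_sigs)) == 1:
                model[sampled_sigs[0]].extend(collection[i:])
                break
            sig, links = fetch(url)
            fetched[url] = links
            sampled_sigs.append(sig)
            model[sig].append(url)
            if links:
                frontier.append(links)  # new boundary to explore later

    return model, fetched


if __name__ == "__main__":
    model, fetched = grab("/index")
    for cls, urls in model.items():
        print(cls, urls)
    print("pages actually fetched:", len(fetched), "of", len(TOY_SITE))
```

Running the sketch classifies all nine toy pages while downloading only six of them, which mirrors the intent of the guessing step: once a sampled prefix of a link collection is structurally homogeneous, the remaining links are typed without being fetched.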
