Abstract
One of the main limitations when accessing the web is the lack of explicit structure, whose presence may help in understanding data semantics. A schema for web data can be constructed at different levels, structuring a single page, a whole site, or a group of sites. Here we present an approach for giving a logical schema to a web site. We first define a model for a single page, in which its content is divided into “logical” sections, i.e. parts of a page each collecting related information. We then introduce a site model in which both physical and logical links among different page sections are represented: physical links are existing hyperlinks, while logical links connect sections containing semantically related information. We show how such links can be found and classified according to their relevance, and how the schema is used in a structure-aware browser to improve both browsing and searching.
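As an illustration only, the site model described above can be sketched as a small data structure: sections as nodes, with physical and logical links between them ranked by relevance. The abstract does not say how semantic relatedness or relevance is computed, so everything here is an assumption: the names `Section`, `Link`, `jaccard`, and `logical_links`, the keyword-set representation of a section, the Jaccard term overlap used as a stand-in relatedness measure, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Section:
    """A "logical" section: a part of a page collecting related information."""
    page_url: str
    name: str
    terms: frozenset  # assumption: a section is summarised by a set of keywords


@dataclass(frozen=True)
class Link:
    source: Section
    target: Section
    kind: str         # "physical" (existing hyperlink) or "logical"
    relevance: float  # assumption: a score in [0, 1]


def jaccard(a, b):
    """Term overlap of two sets; a stand-in for semantic relatedness."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def logical_links(sections, threshold=0.3):
    """Propose logical links between section pairs whose term overlap
    reaches the threshold, ordered from most to least relevant."""
    links = []
    for source, target in combinations(sections, 2):
        score = jaccard(source.terms, target.terms)
        if score >= threshold:
            links.append(Link(source, target, "logical", score))
    return sorted(links, key=lambda link: link.relevance, reverse=True)


# Hypothetical sections from three pages of a news site.
sections = [
    Section("/news", "headlines", frozenset({"election", "vote", "senate"})),
    Section("/politics", "analysis", frozenset({"election", "vote", "poll"})),
    Section("/sport", "scores", frozenset({"football", "league"})),
]
links = logical_links(sections)
for link in links:
    print(link.source.name, "->", link.target.name, link.kind, link.relevance)
```

In this sketch only the "headlines" and "analysis" sections share enough terms to be linked; physical links would be added to the same structure as `Link` objects with `kind="physical"`, taken directly from the hyperlinks found in each page.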
Carchiolo, V., Longheu, A. & Malgeri, M. Extracting Logical Schema from the Web. Applied Intelligence 18, 341–355 (2003). https://doi.org/10.1023/A:1023206322783
DOI: https://doi.org/10.1023/A:1023206322783