Hidden Schema Extraction in Web Documents

Carchiolo, Vincenza; Longheu, Alessandro; Malgeri, Michele

doi:10.1007/978-3-540-39845-5_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2822)

Included in the following conference series:

International Workshop on Databases in Networked Information Systems

Abstract

One of the main limitations when accessing the web is the lack of an explicit schema describing the logical organization of web pages and sites; such a schema would help in understanding data semantics. Here we present an approach that extracts a logical schema from web pages by analyzing their HTML source code. We define a set of primary tags, those actually used to give a page its structural and logical backbone. Primary tags are used to divide the page into collections, which represent distinct structural sections of the page; these are then mapped into logical sections according to their semantics, yielding a logical page schema. The structuring methodology is applied to some real web pages to test the approach.
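
The two steps named in the abstract (splitting a page into collections at primary tags, then mapping collections to logical sections) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the primary tags or the mapping rules, so the PRIMARY_TAGS set, the flat (non-nested) splitting, and the label_collection rule below are all assumptions chosen only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical primary-tag set; the abstract does not enumerate the real one.
PRIMARY_TAGS = {"h1", "h2", "h3", "p", "div", "table", "ul", "ol", "hr"}


class CollectionSplitter(HTMLParser):
    """Start a new collection each time a primary tag opens (flat, no nesting)."""

    def __init__(self):
        super().__init__()
        self.collections = []   # finished (primary_tag, text) pairs
        self._current = None    # (primary_tag, [text chunks]) being filled

    def handle_starttag(self, tag, attrs):
        if tag in PRIMARY_TAGS:
            self._flush()
            self._current = (tag, [])

    def handle_data(self, data):
        text = data.strip()
        # Text that appears before the first primary tag is ignored.
        if text and self._current is not None:
            self._current[1].append(text)

    def _flush(self):
        # Keep only collections that actually contain text.
        if self._current is not None and self._current[1]:
            tag, chunks = self._current
            self.collections.append((tag, " ".join(chunks)))
        self._current = None

    def close(self):
        super().close()
        self._flush()


def split_into_collections(html):
    """Return the page as a list of (primary_tag, text) collections."""
    splitter = CollectionSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.collections


def label_collection(tag):
    """Placeholder mapping from a collection to a logical section name.

    The paper maps collections by their semantics; this tag-based rule
    is only a stand-in for that step.
    """
    if tag in {"h1", "h2", "h3"}:
        return "heading"
    if tag in {"ul", "ol"}:
        return "list"
    return "content"


page = "<h1>News</h1><p>Hello <b>world</b></p><ul><li>a</li><li>b</li></ul>"
collections = split_into_collections(page)
schema = [(label_collection(tag), text) for tag, text in collections]
```

Here schema is a flat list of (logical section, text) pairs. A fuller treatment would presumably handle nested primary tags and use content-based rather than tag-based rules for the mapping.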

Author information

Authors and Affiliations

Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Facoltà di Ingegneria, V.le A. Doria 6, I95125, Catania
Vincenza Carchiolo, Alessandro Longheu & Michele Malgeri

Editor information

Editors and Affiliations

UCLIC, University College London, 31-32 Alfred Place, WC1E7DP, London, UK
Nadia Bianchi-Berthouze

About this paper

Cite this paper

Carchiolo, V., Longheu, A., Malgeri, M. (2003). Hidden Schema Extraction in Web Documents. In: Bianchi-Berthouze, N. (eds) Databases in Networked Information Systems. DNIS 2003. Lecture Notes in Computer Science, vol 2822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39845-5_5

DOI: https://doi.org/10.1007/978-3-540-39845-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20111-3
Online ISBN: 978-3-540-39845-5
eBook Packages: Springer Book Archive
