Automated Semantic Analysis of Schematic Data

Mukherjee, Saikat; Ramakrishnan, I. V.

doi:10.1007/s11280-008-0046-0

Published: 13 June 2008

Volume 11, pages 427–464, (2008)

World Wide Web

Saikat Mukherjee¹ &
I. V. Ramakrishnan²

Abstract

Content in numerous Web data sources, designed primarily for human consumption, is not directly amenable to machine processing. Automated semantic analysis of such content facilitates its transformation into machine-processable, richly structured, semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data, which are characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using light-weight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selective fragments of particular Web pages but are often performed using bookmarks, which fetch the contents of the entire page. This results in information overload for users of constrained interaction modality devices such as small-screen handheld devices. Fine-grained information extraction from Web pages, which is typically performed using page-specific syntactic expressions known as wrappers, suffers from a lack of scalability and robustness. We report on the application of our technique in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain.
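
The seed-then-annotate workflow described in the abstract can be illustrated with a small sketch. The abstract does not say which statistical model or which content features the technique uses, so everything below is an assumption made for illustration: a multinomial naive Bayes classifier with Laplace smoothing stands in for the "statistical models", lowercase word tokens stand in for the "light-weight content features", and the names `tokenize`, `train`, and `annotate` as well as the seed data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (assumed stand-in for content features)."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def train(seed):
    """Build per-concept word counts from hand-labeled (concept, text) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for label, text in seed:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def annotate(model, text):
    """Return the concept with the highest naive Bayes log-score for a text fragment."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # tokens never seen in the seed set carry no evidence
            # Laplace (add-one) smoothing over the seed vocabulary
            score += math.log((word_counts[label][tok] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical seed set: fragments from a few pages, hand-labeled with concepts.
seed = [
    ("headline", "Senate passes budget bill after long debate"),
    ("headline", "Markets rally as inflation cools"),
    ("price", "Price: $19.99 free shipping"),
    ("price", "Sale price $5.49 in stock"),
]
model = train(seed)
label = annotate(model, "List price $12.00")  # fragment from an unseen page
```

Because the sketch relies only on word content and not on page-specific markup, the same model can be applied to fragments from pages with different layouts, which is the property the abstract attributes to semantic wrappers and semantic bookmarks.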

Author information

Authors and Affiliations

Integrated Data Systems Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ, 08540, USA
Saikat Mukherjee
Computer Science Department, Stony Brook University, Stony Brook, NY, 11794, USA
I. V. Ramakrishnan

Corresponding author

Correspondence to Saikat Mukherjee.

Additional information

This work was conducted while the author was at Stony Brook University.

About this article

Cite this article

Mukherjee, S., Ramakrishnan, I.V. Automated Semantic Analysis of Schematic Data. World Wide Web 11, 427–464 (2008). https://doi.org/10.1007/s11280-008-0046-0

Received: 11 July 2007
Revised: 07 April 2008
Accepted: 07 April 2008
Published: 13 June 2008
Issue Date: December 2008
DOI: https://doi.org/10.1007/s11280-008-0046-0
