Abstract
To extract and represent domain-independent, web-scale data, we introduce a schema-less, self-describing data model called the Object-oriented Web Model (OWM), which is rich in semantics and flexible in structure. OWM represents web pages as objects with hierarchical structures and represents the links in a web page as relationships to other objects, so that the objects form a network. Using web segmentation techniques, data from data-intensive web pages can be extracted, represented, and integrated as OWM objects.
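As a rough illustration of the idea, the following minimal sketch models pages as schema-less objects with nested attributes and models links as labeled relationships to other objects. It is not the OWM definition from the paper: the names `WebObject`, `ObjectNetwork`, `attributes`, `relationships`, and `neighbors`, as well as the `example.org` URLs, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class WebObject:
    """One web page as a self-describing object (hypothetical name)."""
    url: str
    # Hierarchical, schema-less content: nested dicts/lists keyed by
    # whatever labels appear on the page; no fixed schema is assumed.
    attributes: Dict[str, Any] = field(default_factory=dict)
    # Links on the page, stored as relationship label -> target URLs.
    relationships: Dict[str, List[str]] = field(default_factory=dict)


class ObjectNetwork:
    """Objects keyed by URL; relationships make them a network."""

    def __init__(self) -> None:
        self.objects: Dict[str, WebObject] = {}

    def add(self, obj: WebObject) -> None:
        self.objects[obj.url] = obj

    def neighbors(self, url: str, label: str) -> List[WebObject]:
        """Follow one relationship label to objects already in the network."""
        targets = self.objects[url].relationships.get(label, [])
        return [self.objects[t] for t in targets if t in self.objects]


# Illustrative data only; the URLs are placeholders.
network = ObjectNetwork()
network.add(WebObject(
    url="http://example.org/paper/1",
    attributes={
        "title": "A Schema-Less Data Model for the Web",
        "venue": {"name": "ER 2015", "year": 2015},  # nested structure
    },
    relationships={"author": ["http://example.org/person/chen"]},
))
network.add(WebObject(
    url="http://example.org/person/chen",
    attributes={"name": "L. Chen"},
))

authors = network.neighbors("http://example.org/paper/1", "author")
```

Because each object carries its own attribute labels, two pages from different domains can sit in the same network without agreeing on a schema first; that is the property this sketch is meant to show.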
This work is supported by the National Natural Science Funds of China under grant No. 61202100.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Chen, L., Liu, M., Yu, T. (2015). A Schema-Less Data Model for the Web. In: Johannesson, P., Lee, M., Liddle, S., Opdahl, A., Pastor López, Ó. (eds) Conceptual Modeling. ER 2015. Lecture Notes in Computer Science, vol 9381. Springer, Cham. https://doi.org/10.1007/978-3-319-25264-3_44
DOI: https://doi.org/10.1007/978-3-319-25264-3_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25263-6
Online ISBN: 978-3-319-25264-3
eBook Packages: Computer Science (R0)