Abstract
With the rapid development of the Internet, data sources on the deep web store a large amount of high-quality structured data, which calls for methods to extract structured data. However, existing methods focus on the data rather than its structure, and some of them are difficult to maintain. To resolve these problems, this paper proposes a complete and effective method that supports both data extraction and schema recognition. To extract data, it adopts a novel clustering-based algorithm that remains effective when faced with complex data and excessive noise. A simple extraction rule model is defined to address the maintenance problem. In addition, the method performs deep mining for result schema recognition. Finally, experiments show satisfactory results.
This research is supported by the National Natural Science Foundation of China under Grant Nos. 60673139 and 60573090.
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Liu, W., Shen, D., Nie, T. (2008). An Effective Method Supporting Data Extraction and Schema Recognition on Deep Web. In: Zhang, Y., Yu, G., Bertino, E., Xu, G. (eds) Progress in WWW Research and Development. APWeb 2008. Lecture Notes in Computer Science, vol 4976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78849-2_42
DOI: https://doi.org/10.1007/978-3-540-78849-2_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78848-5
Online ISBN: 978-3-540-78849-2
eBook Packages: Computer Science (R0)