Abstract
As research on the blogosphere develops, extracting the post and the comments from a blog page becomes increasingly important for improving search performance. In this paper, we present a two-stage method. First, we combine visual information with effective text information to locate the main text, which represents the theme of the blog page. Second, we use the information quantity of separators to detect the boundary between the post and the comments. In our experiments, the method achieves good extraction performance and improves the performance of blog search.
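The second stage can be illustrated with a minimal sketch. The abstract does not define "information quantity", so the code below assumes it is Shannon self-information, -log2(p), where p is the relative frequency of a separator within the page. The function names, the representation of separators as tag names, and the example data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def separator_information(separators):
    """Map each distinct separator to its self-information -log2(p).

    ASSUMPTION: p is the separator's relative frequency in this page.
    The paper may define information quantity differently.
    """
    counts = Counter(separators)
    total = len(separators)
    return {sep: -math.log2(n / total) for sep, n in counts.items()}


def boundary_candidate(separators):
    """Return the index of the first separator with maximal self-information."""
    if not separators:
        raise ValueError("need at least one separator")
    info = separator_information(separators)
    return max(range(len(separators)), key=lambda i: info[separators[i]])


# Hypothetical separators found between consecutive text blocks of one page.
example = ["br", "br", "br", "hr", "div", "div", "div", "div"]
print(boundary_candidate(example))
```

Under this assumption a separator that occurs rarely in the page carries more information than one that repeats throughout the post or the comment list, which makes it a plausible candidate for the boundary between the two.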
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cao, D., Liao, X., Xu, H., Bai, S. (2008). Blog Post and Comment Extraction Using Information Quantity of Web Format. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_29
DOI: https://doi.org/10.1007/978-3-540-68636-1_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science (R0)