Abstract
Discourse data extraction is the problem of identifying comments and reviews on social networking websites. Existing approaches are either template-dependent or limited to discovering the comment-posting structure. We are not aware of any technique that extracts detailed comment information, such as the comment text, the commenter, and the discussion structure, from a comment page. In this paper, we present TiDE, a template-independent, two-step approach that extracts discourse data such as comments, reviews, and posts, along with the structural relationships among them. In the first step, we parse the input comment page into a Document Object Model (DOM) tree and locate the discourse data in the tree using the concept of Path-Strings. The outputs of the first step are Comment Blocks, which the second step uses to extract the comments, the commenters, and the discussion structure. Experiments on 19 well-known discourse websites with different templates show that our Comment Block discovery is more adaptable than the existing posting-structure discovery technique. TiDE extracts 97% of the comment text and 79% of the commenter information, which is significant compared to state-of-the-art techniques. We also demonstrate the usefulness of TiDE by building a news comment crawler.
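The first step can be illustrated with a small sketch. The abstract does not define a Path-String precisely, so the code below assumes it is the sequence of tag names from the DOM root down to a text-bearing node, and that paths repeated several times on a page are candidate locations for discourse data. The names `PathStringCollector`, `path_strings`, and `repeated_paths` are hypothetical and are not taken from the paper; this is an illustration under those assumptions, not the authors' implementation.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathStringCollector(HTMLParser):
    """Records an assumed Path-String (root-to-node tag path) per text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.paths = []   # list of (path_string, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag, tolerating
        # tags that were left unclosed in between.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def path_strings(html):
    """Return (path_string, text) pairs for every non-empty text node."""
    collector = PathStringCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def repeated_paths(html, min_count=2):
    """Group texts by path and keep paths seen at least min_count times."""
    groups = defaultdict(list)
    for path, text in path_strings(html):
        groups[path].append(text)
    return {p: texts for p, texts in groups.items() if len(texts) >= min_count}


if __name__ == "__main__":
    page = (
        "<html><body><h1>Title</h1>"
        "<div><span>alice</span><p>Nice post</p></div>"
        "<div><span>bob</span><p>I disagree</p></div>"
        "</body></html>"
    )
    for path, texts in repeated_paths(page).items():
        print(path, texts)
```

In this toy page the headline path occurs once and is dropped, while the `span` and `p` paths under `div` each occur twice and survive as candidates. A real page would need more than a frequency threshold to separate comments from menus and other repeated boilerplate.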
Notes
1. merge: append the text of all sibling tags to the first sibling, then remove every sibling except the first. Thus, only the Path-String of the first sibling is output.
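The merge operation in the note can be sketched as follows. The note does not say which siblings are merged or how nested markup is handled, so this sketch assumes that all child elements of one parent are merged and that the first sibling is flattened to plain text. The function name `merge_siblings` is hypothetical, and the standard-library `xml.etree.ElementTree` stands in for whatever DOM representation the authors used.

```python
import xml.etree.ElementTree as ET


def merge_siblings(parent):
    """Merge the text of all children of `parent` into the first child.

    Assumption: every child of `parent` counts as a sibling to merge.
    Returns the surviving first child, or None if there are no children.
    """
    children = list(parent)
    if not children:
        return None

    first = children[0]
    # Collect the text of the first sibling, then of each later sibling.
    parts = ["".join(first.itertext()).strip()]
    for sibling in children[1:]:
        parts.append("".join(sibling.itertext()).strip())
        parent.remove(sibling)  # keep only the first sibling in the tree

    # Flatten the first sibling so it holds the merged text only.
    for nested in list(first):
        first.remove(nested)
    first.text = " ".join(p for p in parts if p)
    return first


if __name__ == "__main__":
    block = ET.fromstring("<div><p>Great</p><p>article,</p><p>thanks!</p></div>")
    merged = merge_siblings(block)
    print(ET.tostring(block, encoding="unicode"))
```

After the call, the parent has a single child, so a Path-String traversal visits one node for the whole comment text instead of one node per paragraph.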
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Barua, J., Patel, D., Goyal, V. (2015). TiDE: Template-Independent Discourse Data Extraction. In: Madria, S., Hara, T. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2015. Lecture Notes in Computer Science, vol 9263. Springer, Cham. https://doi.org/10.1007/978-3-319-22729-0_12
DOI: https://doi.org/10.1007/978-3-319-22729-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22728-3
Online ISBN: 978-3-319-22729-0
eBook Packages: Computer Science, Computer Science (R0)