TiDE: Template-Independent Discourse Data Extraction

Barua, Jayendra; Patel, Dhaval; Goyal, Vikram

doi:10.1007/978-3-319-22729-0_12


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9263)

Included in the following conference series:

International Conference on Big Data Analytics and Knowledge Discovery


Abstract

Discourse Data Extraction is the problem of identifying comments and reviews on social networking websites. Existing approaches are either template-dependent or limited to discovering the comment-posting structure. We are not aware of any technique that extracts detailed comment information, such as the comment text, the commenter, and the discussion structure, from a comment page. In this paper, we present TiDE, a template-independent two-step approach that extracts discourse data such as comments, reviews, and posts, together with the structural relationships among them. In the first step, we parse the input comment page into a Document Object Model (DOM) tree and locate the discourse data in the tree using the concept of Path-Strings. The output of the first step is a set of Comment Blocks, which the second step uses to extract the comments, the commenters, and the discussion structure. Experimental studies on 19 well-known discourse websites with different templates show that our Comment Block discovery is more adaptable than the existing posting-structure discovery technique. We extract 97% of comment text and 79% of commenter information, which is significant compared to state-of-the-art techniques. We also show the usefulness of TiDE by building a news comment crawler.
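The first step described in the abstract can be illustrated with a small sketch. The abstract does not define a Path-String, so the code below assumes it is the slash-joined sequence of tag names from the document root to a text node; the class `PathStringCollector` and the function `group_by_path_string` are hypothetical names, not the authors' implementation. The idea shown is only that text nodes sharing the same Path-String are candidates for repeated units such as comments.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathStringCollector(HTMLParser):
    """Collect (path_string, text) pairs for every non-empty text node.

    Assumption: a Path-String is the root-to-node sequence of tag names.
    """

    def __init__(self):
        super().__init__()
        self.stack = []    # open tags from the root down to the current node
        self.records = []  # (path_string, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append(("/".join(self.stack), text))


def group_by_path_string(html):
    """Map each Path-String to the list of texts found under it."""
    collector = PathStringCollector()
    collector.feed(html)
    collector.close()
    groups = defaultdict(list)
    for path, text in collector.records:
        groups[path].append(text)
    return dict(groups)
```

In this sketch, a Path-String that occurs many times on a page (for example once per comment) stands out from one-off paths such as a headline. How TiDE actually scores or selects Comment Blocks is not stated in the abstract and is not modelled here.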

Notes

1. merge: append the text of all sibling tags to the first sibling, then remove every sibling tag except the first. As a result, only the Path-String of the first sibling is output.
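The merge operation in the note can be sketched as follows. This is a minimal illustration, assuming the siblings are leaf elements under one parent, that their texts are joined with a single space, and that a Path-String is the slash-joined tag path; the function `merge_siblings` and its `path_prefix` parameter are hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET


def merge_siblings(parent, path_prefix):
    """Merge all child tags of `parent` into the first child.

    Appends the text of every sibling to the first sibling, removes all
    siblings except the first, and returns the Path-String of the first
    sibling. Assumes leaf siblings; nested markup and tail text are ignored.
    """
    children = list(parent)
    if not children:
        return None
    first = children[0]
    texts = [(child.text or "").strip() for child in children]
    first.text = " ".join(t for t in texts if t)
    for extra in children[1:]:
        parent.remove(extra)
    # path_prefix is the (assumed) Path-String of the parent's own parent.
    return f"{path_prefix}/{parent.tag}/{first.tag}"
```

For a `div` holding `<p>Great</p><p>post</p><span>thanks</span>`, the sketch leaves a single `p` with the text "Great post thanks" and reports one Path-String ending in `div/p`.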

Author information

Authors and Affiliations

Indian Institute of Technology, Roorkee, India
Jayendra Barua & Dhaval Patel
Indraprastha Institute of Information Technology, Delhi, India
Vikram Goyal

Corresponding author

Correspondence to Jayendra Barua.

Editor information

Editors and Affiliations

University of Science and Technology, Rolla, Missouri, USA
Sanjay Madria
Osaka University, Osaka, Japan
Takahiro Hara

Cite this paper

Barua, J., Patel, D., Goyal, V. (2015). TiDE: Template-Independent Discourse Data Extraction. In: Madria, S., Hara, T. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2015. Lecture Notes in Computer Science, vol 9263. Springer, Cham. https://doi.org/10.1007/978-3-319-22729-0_12

DOI: https://doi.org/10.1007/978-3-319-22729-0_12
Published: 05 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22728-3
Online ISBN: 978-3-319-22729-0
eBook Packages: Computer Science, Computer Science (R0)
