TiDE: Template-Independent Discourse Data Extraction

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9263)

Abstract

Discourse Data Extraction is the problem of identifying comments and reviews on social networking websites. Existing approaches are either template-dependent or limited to discovering the comment-posting structure. We are not aware of any technique that extracts detailed comment information, such as the comment text, the commenter, and the discussion structure, from a comment page. In this paper, we present TiDE, a template-independent two-step approach that extracts discourse data such as comments, reviews, and posts, together with the structural relationships among them. In the first step, we parse the input comment page into a Document Object Model (DOM) tree and locate the discourse data in the tree using the concept of Path-Strings. The output of the first step is a set of Comment Blocks, which the second step uses to extract the comments, the commenters, and the discussion structure. Experimental studies on 19 well-known discourse websites with different templates show that our Comment Block discovery is more adaptable than the existing posting-structure discovery technique. We extract 97% of the comment text and 79% of the commenter information, which is significant compared to state-of-the-art techniques. We also show the usefulness of TiDE by building a news comment crawler.
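
The first step can be illustrated with a minimal Python sketch. The abstract does not define Path-Strings, so this sketch assumes a Path-String is the slash-joined sequence of tag names from the document root to a text-bearing node, and that a path whose text repeats several times on a page hints at a comment region. The names `PathStringCollector`, `path_strings`, `candidate_blocks`, and `min_repeats` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathStringCollector(HTMLParser):
    """Collects text snippets keyed by an assumed root-to-node tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # tags that are currently open
        self.texts = defaultdict(list)  # path string -> list of text snippets

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.texts["/".join(self.stack)].append(text)


def path_strings(html):
    """Map each assumed Path-String to the text found under it."""
    collector = PathStringCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.texts)


def candidate_blocks(html, min_repeats=2):
    """Keep only paths whose text occurs at least min_repeats times."""
    return {p: t for p, t in path_strings(html).items() if len(t) >= min_repeats}


page = (
    "<html><body><h1>Story</h1><div><ul>"
    "<li><p>First comment</p></li>"
    "<li><p>Second comment</p></li>"
    "</ul></div></body></html>"
)
blocks = candidate_blocks(page)
```

Here `blocks` holds the single repeated path `html/body/div/ul/li/p` with both comment texts, while the one-off headline path is filtered out. Grouping by repetition is only one plausible heuristic; the paper's actual Comment Block discovery may differ.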

Notes

  1. merge: append the text of all sibling tags to the first sibling, then remove every sibling except the first. Only the Path-String of the first sibling is output.
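
The merge operation in the note can be sketched as follows. The note does not say which siblings are merged, so this sketch assumes all element children of one parent, joins their text with single spaces, and ignores any tail text between the siblings. The function name `merge_siblings` and the `parent_path` argument are hypothetical, and the Path-String is again assumed to be a slash-joined tag path.

```python
import xml.etree.ElementTree as ET


def merge_siblings(parent, parent_path):
    """Fold the text of all children of `parent` into the first child.

    Returns (path_string_of_first_child, merged_text), or None when the
    parent has no element children.
    """
    siblings = list(parent)
    if not siblings:
        return None
    first, rest = siblings[0], siblings[1:]

    # Gather the text of every sibling, including text in nested tags.
    parts = ["".join(s.itertext()).strip() for s in siblings]

    # Drop nested tags of the first sibling so its text is not duplicated.
    for nested in list(first):
        first.remove(nested)
    first.text = " ".join(p for p in parts if p)

    # Remove every sibling except the first.
    for s in rest:
        parent.remove(s)

    return f"{parent_path}/{first.tag}", first.text


root = ET.fromstring("<div><p>Nice </p><b>article</b><i>thanks</i></div>")
path, text = merge_siblings(root, "html/body/div")
```

After the call, `root` has a single `p` child carrying the merged text, and only that child's path is reported.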

Corresponding author

Correspondence to Jayendra Barua.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Barua, J., Patel, D., Goyal, V. (2015). TiDE: Template-Independent Discourse Data Extraction. In: Madria, S., Hara, T. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2015. Lecture Notes in Computer Science, vol 9263. Springer, Cham. https://doi.org/10.1007/978-3-319-22729-0_12

  • DOI: https://doi.org/10.1007/978-3-319-22729-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22728-3

  • Online ISBN: 978-3-319-22729-0

  • eBook Packages: Computer Science, Computer Science (R0)
