Extracting Records and Posts from Forum Pages with Limited Supervision

Barbosa, Luciano; Ferreira, Guilherme

doi:10.1007/978-3-319-26187-4_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9419)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Internet forums are rich sources of human-generated content. Many applications, such as opinion mining and question answering, can benefit greatly from mining and exploring this content. An important step towards making user content from forums more accessible is to extract it from forum pages. We propose REPEX (REcord and Post EXtractor), a two-step solution that uses limited supervision to achieve this goal. Given a forum page, REPEX first extracts the data records that contain human-generated content and then extracts the user content from these records. The record extraction assumes that (1) a record is composed of an automatically generated part, which we call the record template, and a human-generated part; and (2) the structure of record templates is usually consistent across records. Based on these assumptions, the record extractor first locates the subtree that contains all records in the forum page, using an information-theoretic measure, and then identifies the template of the records in this subtree, modelling this as an outlier detection problem. Finally, starting from the templates, REPEX determines the boundaries of the records. For post extraction, REPEX applies an information extraction approach that identifies the posts’ string boundaries.
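
The abstract does not say which information-theoretic measure REPEX uses, so the following is only an illustrative sketch of the first step (locating the subtree that holds the records), not the authors' method. It assumes a well-formed page and uses the Shannon entropy of the children's structural signatures as a stand-in measure: a node whose children all share the same tag structure has entropy 0 and is a plausible record container. The names `shape`, `child_shape_entropy`, `find_record_container`, and the `min_children` threshold are hypothetical.

```python
import math
from collections import Counter
import xml.etree.ElementTree as ET


def shape(node):
    """Structural signature of a subtree: tag names only, text ignored."""
    return (node.tag, tuple(shape(child) for child in node))


def child_shape_entropy(node):
    """Shannon entropy (in bits) of the distribution of child signatures."""
    counts = Counter(shape(child) for child in node)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def find_record_container(root, min_children=3):
    """Return the node whose children are most structurally uniform.

    Assumed scoring (not from the paper): among nodes with at least
    `min_children` children, take the lowest entropy and break ties
    in favour of the node with more children.
    """
    candidates = [n for n in root.iter() if len(n) >= min_children]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (child_shape_entropy(n), -len(n)))


# Toy forum page: a navigation bar with mixed children and a thread
# whose three records share one template (<span> author, <p> post).
PAGE = """
<html><body>
  <div id="nav"><a>Home</a><span>Search</span><a>Login</a><img/></div>
  <div id="thread">
    <div><span>alice</span><p>First post</p></div>
    <div><span>bob</span><p>A reply</p></div>
    <div><span>carol</span><p>Another reply</p></div>
  </div>
</body></html>
""".strip()

container = find_record_container(ET.fromstring(PAGE))
print(container.get("id"))  # prints "thread"
```

In this toy page the navigation bar has four children with three distinct signatures (entropy 1.5 bits), while the thread's three children are structurally identical (entropy 0), so the thread is selected. Real forum pages are rarely well-formed XML and records are seldom perfectly identical, which is presumably why the paper treats template identification as an outlier detection problem.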

Notes

1. http://www.regxlib.com/, http://www.regular-expressions.info/

Author information

Authors and Affiliations

IBM Research – Brazil, Av. Pasteur, 138, Rio de Janeiro, Brazil
Luciano Barbosa & Guilherme Ferreira

Corresponding author

Correspondence to Luciano Barbosa.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Jianyong Wang
Poznan University of Economics, Poznan, Poland
Wojciech Cellary
Florida Atlantic University, Boca Raton, Florida, USA
Dingding Wang
Victoria University, Melbourne, Victoria, Australia
Hua Wang
Florida International University, Miami, Florida, USA
Shu-Ching Chen
Florida International University, Miami, Florida, USA
Tao Li
Victoria University, Melbourne, Victoria, Australia
Yanchun Zhang

About this paper

Cite this paper

Barbosa, L., Ferreira, G. (2015). Extracting Records and Posts from Forum Pages with Limited Supervision. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science, vol 9419. Springer, Cham. https://doi.org/10.1007/978-3-319-26187-4_19

DOI: https://doi.org/10.1007/978-3-319-26187-4_19
Published: 18 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26186-7
Online ISBN: 978-3-319-26187-4
eBook Packages: Computer Science, Computer Science (R0)
