Abstract
In a Web forum, the thread meta-information contained in the list of threads on each board page provides fundamental data for further forum mining. This paper describes a complete system named Juicer, which was developed as a subsystem of an industrial application that involves forum mining. The task of Juicer is to extract thread meta-information from the board pages of a great many large-scale online Web forums, which requires scalable extraction with high accuracy and speed and with minimal user effort for maintenance. Among the many existing approaches to information extraction, we could not find one that fully satisfies these requirements, so we present the simple, scalable extraction approach behind Juicer. Juicer consists of four modules: template generation, specifying the labeling setting, automatic extraction, and label assignment. Both experiments and practical use show that Juicer satisfies the requirements.
This work is partially supported by the National Grand Fundamental Research 973 Program of China under Grant 2004CB318109, and the National High Technology Development 863 Program of China under Grant 2007AA01Z147.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guo, Y., Wang, Y., Ding, G., Cao, D., Zhang, G., Lv, Y. (2009). Juicer: Scalable Extraction for Thread Meta-information of Web Forum. In: Chen, H., Yang, C.C., Chau, M., Li, SH. (eds) Intelligence and Security Informatics. PAISI 2009. Lecture Notes in Computer Science, vol 5477. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01393-5_15
DOI: https://doi.org/10.1007/978-3-642-01393-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01392-8
Online ISBN: 978-3-642-01393-5
eBook Packages: Computer Science, Computer Science (R0)