An information extraction engine for web discussion forums

Authors:
Hanny Yulius Limanto, Nguyen Ngoc Giang, Vo Tan Trung, Jun Zhang, Qi He, Nguyen Quang Huy

Nanyang Technological University, Nanyang Avenue, Singapore

WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web, May 2005, Pages 978–979
https://doi.org/10.1145/1062745.1062827

Published: 10 May 2005

ABSTRACT

In this poster, we present an information extraction engine for web-based forums. The engine analyzes the HTML files crawled from web forums, deduces the wrapper (template) of the pages, and extracts information about posts (e.g., author, title, content, and number of replies and views). Extraction is an important module for a forum search engine, since it helps the engine understand the content of a forum HTML page and facilitates ranking during retrieval. We discuss the system architecture of the extraction engine in the context of a forum search engine and present the various components of the extraction engine. We also briefly introduce the extraction process and discuss some implementation issues.
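
The abstract does not specify how the wrapper is deduced or applied, so the following is only a minimal sketch of the final extraction step, assuming a wrapper is already available. The class names (`post`, `post-author`, `post-title`, `post-body`), the `WRAPPER` mapping, and the `extract_posts` helper are hypothetical and chosen for illustration; they are not taken from the poster. The sketch uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical wrapper: maps a CSS class in the forum template to a post field.
# A real engine would deduce this mapping from crawled pages; here it is given.
WRAPPER = {"post-author": "author", "post-title": "title", "post-body": "content"}
POST_CLASS = "post"  # hypothetical class marking one post container
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag


class PostExtractor(HTMLParser):
    """Collects one dict per post by applying WRAPPER to a forum page."""

    def __init__(self):
        super().__init__()
        self.posts = []      # extracted posts, in page order
        self._field = None   # field currently being read, if any
        self._depth = 0      # open-tag depth inside the current field element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Treat a void tag inside a field (e.g. <br>) as whitespace.
            if self._field is not None:
                self.posts[-1][self._field] += " "
            return
        if self._field is not None:
            self._depth += 1  # nested markup inside a field, e.g. <b>
            return
        classes = (dict(attrs).get("class") or "").split()
        if POST_CLASS in classes:
            self.posts.append({})  # a new post starts here
            return
        for cls in classes:
            if cls in WRAPPER and self.posts:
                self._field = WRAPPER[cls]
                self._depth = 1
                self.posts[-1].setdefault(self._field, "")
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None  # the field element has been closed

    def handle_data(self, data):
        if self._field is not None:
            self.posts[-1][self._field] += data


def extract_posts(html):
    """Return a list of {field: text} dicts, with whitespace normalized."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return [{k: " ".join(v.split()) for k, v in p.items()} for p in parser.posts]
```

Calling `extract_posts(page_html)` on a page that follows the assumed template returns one dictionary per post. Fields missing from a post are simply absent from its dictionary, which leaves it to the caller to decide how to treat incomplete posts.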

Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Abstraction

Published in
WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web
May 2005
454 pages
ISBN:1595930515
DOI:10.1145/1062745
Conference Chairs: Allan Ellis (Southern Cross University, Australia), Tatsuya Hagino (Keio University, Japan)
Program Chairs: Fred Douglis (IBM Research), Prabhakar Raghavan (Verity, Inc.)
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
discussion board
forums
information extraction
information retrieval
search engine
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%