Abstract
Although RSS offers a promising way to track and personalize the flow of new Web information, many current Web sites do not yet provide RSS feeds. Convenient ways to “RSSify” suitable existing Web content are therefore urgently needed. This paper presents a system that translates semi-structured HTML pages into structured RSS feeds, choosing among three approaches according to the features of the page. For information items that carry a release time, the system provides an automatic approach based on time pattern discovery. For regular pages without a time pattern, a second automatic approach based on repeated tag pattern mining is applied. For irregular pages, or for specific sections of a Web page selected by the user, a semi-automatic approach based on labelling is available. Experimental results and practical applications show that the system generates RSS feeds efficiently and effectively.
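To make the time-pattern idea concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not describe in detail. It assumes that each news item appears as a date in the form YYYY-MM-DD (or with "/" or "." as separator), followed by a hyperlink whose text is the headline. The names DATE_RE, LinkCollector, extract_items, to_rss, and the SAMPLE page are all hypothetical and introduced only for illustration.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date, datetime, timezone
from email.utils import format_datetime
from html.parser import HTMLParser

# Assumed time pattern: YYYY-MM-DD, YYYY/MM/DD or YYYY.MM.DD (illustrative only).
DATE_RE = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")


class LinkCollector(HTMLParser):
    """Records plain-text chunks and hyperlinks in document order."""

    def __init__(self):
        super().__init__()
        self.events = []   # ("text", text) or ("link", href, anchor_text)
        self._href = None  # href of the <a> element currently open, if any
        self._buf = []     # text collected inside that <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._buf = []

    def handle_data(self, data):
        if self._href is not None:
            self._buf.append(data)
        elif data.strip():
            self.events.append(("text", data.strip()))

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._buf).strip()
            self.events.append(("link", self._href, title))
            self._href = None


def extract_items(html):
    """Pair each date found in the text with the next hyperlink after it."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    items, pending = [], None
    for event in parser.events:
        if event[0] == "text":
            match = DATE_RE.search(event[1])
            if match:
                year, month, day = (int(g) for g in match.groups())
                try:
                    pending = date(year, month, day).isoformat()
                except ValueError:
                    pending = None  # looked like a date but is not a valid one
        elif pending and event[2]:
            items.append({"date": pending, "link": event[1], "title": event[2]})
            pending = None
    return items


def to_rss(items, channel_title, channel_link):
    """Serialize the extracted items as an RSS 2.0 document (string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Feed generated from an HTML page"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # RSS 2.0 uses RFC 822 style dates; midnight UTC is an assumption here.
        stamp = datetime.strptime(item["date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
        ET.SubElement(node, "pubDate").text = format_datetime(stamp)
    return ET.tostring(rss, encoding="unicode")


# Hypothetical input page: two dated items and one undated navigation link.
SAMPLE = """
<ul>
  <li>2005-11-03 <a href="/news/1.html">First headline</a></li>
  <li>2005/11/02 <a href="/news/2.html">Second headline</a></li>
  <li><a href="/about.html">About us</a></li>
</ul>
"""

if __name__ == "__main__":
    print(to_rss(extract_items(SAMPLE), "Example site", "http://example.com/"))
```

In this sketch the undated "About us" link is skipped because no date precedes it. That is the simplification a time pattern buys: the date itself marks where an item begins. Pages without dates would need a different cue, such as the repeated tag patterns or user labelling mentioned in the abstract.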
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, J., Uchino, K., Takahashi, T., Okamoto, S. (2006). RSS Feed Generation from Legacy HTML Pages. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_115
DOI: https://doi.org/10.1007/11610113_115
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science, Computer Science (R0)