RSS Feed Generation from Legacy HTML Pages

  • Conference paper
Frontiers of WWW Research and Development - APWeb 2006 (APWeb 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Abstract

Although RSS is a promising way to track and personalize the flow of new Web information, many current Web sites do not yet offer RSS feeds. Convenient approaches to “RSSify” suitable existing Web content have therefore become a pressing need. This paper presents a system that translates semi-structured HTML pages into structured RSS feeds, using different approaches suited to different features of the HTML pages. For information items that carry a release time, the system provides an automatic approach based on time pattern discovery. For regular pages without a time pattern, it applies another automatic approach based on repeated tag pattern mining. A semi-automatic approach based on labelling is available to process irregular pages, or specific sections of Web pages, according to the user’s requirements. Experimental results and practical applications show that the system is efficient and effective in facilitating RSS feed generation.
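To make the idea of a time-pattern-based conversion concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not detail. It assumes that each news item appears as a date in plain text followed by a hyperlink, and that dates look like 2006-01-16 or 2006/1/16. The names DATE_RE, DatedLinkParser and html_to_rss are hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime
from html.parser import HTMLParser
from urllib.parse import urljoin
from xml.etree import ElementTree as ET

# Assumed time pattern: year, month, day separated by "-", "/" or ".".
DATE_RE = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")


class DatedLinkParser(HTMLParser):
    """Pairs each date found in page text with the next hyperlink."""

    def __init__(self):
        super().__init__()
        self.items = []            # list of (datetime, title, href)
        self._pending_date = None  # last date seen, not yet paired
        self._href = None          # href of the anchor being read
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title_parts = []

    def handle_data(self, data):
        if self._href is not None:
            # Text inside an anchor becomes the item title.
            self._title_parts.append(data)
            return
        match = DATE_RE.search(data)
        if match:
            year, month, day = (int(g) for g in match.groups())
            try:
                self._pending_date = datetime(
                    year, month, day, tzinfo=timezone.utc)
            except ValueError:
                pass  # digits matched but are not a valid calendar date

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._title_parts).strip()
            if self._pending_date is not None and title:
                self.items.append((self._pending_date, title, self._href))
                self._pending_date = None
            self._href = None


def html_to_rss(html, channel_title, channel_link):
    """Builds an RSS 2.0 document from the dated links found in html."""
    parser = DatedLinkParser()
    parser.feed(html)
    parser.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = (
        "Generated from " + channel_link)
    for when, title, href in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Resolve relative links against the page address.
        ET.SubElement(item, "link").text = urljoin(channel_link, href)
        # RSS 2.0 dates use the RFC 822 style produced by format_datetime.
        ET.SubElement(item, "pubDate").text = format_datetime(when)
    return ET.tostring(rss, encoding="unicode")
```

Calling html_to_rss(page_source, "News", "http://example.com/") returns the feed as a string. Links with no preceding date, such as navigation entries, are skipped. Those are the pages for which the abstract describes tag pattern mining or labelling instead.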

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, J., Uchino, K., Takahashi, T., Okamoto, S. (2006). RSS Feed Generation from Legacy HTML Pages. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_115

  • DOI: https://doi.org/10.1007/11610113_115

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31142-3

  • Online ISBN: 978-3-540-32437-9

  • eBook Packages: Computer Science, Computer Science (R0)
