RSS Feed Generation from Legacy HTML Pages

Wang, Jun; Uchino, Kanji; Takahashi, Tetsuro; Okamoto, Seishi

doi:10.1007/11610113_115

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

Although RSS is a promising way to track and personalize the flow of new Web information, many Web sites do not yet provide RSS feeds. Convenient approaches to “RSSify” suitable existing Web content have therefore become a pressing need. This paper presents a system that translates semi-structured HTML pages into structured RSS feeds, applying different approaches according to the features of each page. For information items that carry a release time, the system provides an automatic approach based on time pattern discovery. A second automatic approach, based on repeated tag pattern mining, converts regular pages that lack a time pattern. A semi-automatic approach based on labelling handles irregular pages, or specific sections of Web pages, according to the user’s requirements. Experimental results and practical applications show that the system is efficient and effective in facilitating RSS feed generation.
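
The abstract does not spell out the time pattern discovery algorithm, so the following is only a simplified illustration of the general idea, not the authors' method. It assumes that each item's release date appears as plain text shortly before the item's link, and that dates look like 2005-11-03 or 2005/11/3. The names `DATE_RE`, `LinkCollector`, `extract_items` and `to_rss` are hypothetical, and midnight UTC is assumed for every date.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime
from html.parser import HTMLParser

# Assumed date format: year, month, day separated by '-', '/' or '.'.
DATE_RE = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")


class LinkCollector(HTMLParser):
    """Record text chunks and links in document order."""

    def __init__(self):
        super().__init__()
        self.events = []   # ("text", str) or ("link", href, title)
        self._href = None  # href of the <a> currently open, if any
        self._buf = []     # text seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._buf).strip()
            self.events.append(("link", self._href, title))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._buf.append(data)
        elif data.strip():
            self.events.append(("text", data.strip()))


def extract_items(html):
    """Pair each date found in text with the next link that follows it."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    items, pending = [], None
    for event in parser.events:
        if event[0] == "text":
            match = DATE_RE.search(event[1])
            if match:
                year, month, day = (int(g) for g in match.groups())
                try:
                    pending = datetime(year, month, day, tzinfo=timezone.utc)
                except ValueError:
                    pending = None  # looked like a date but is not one
        elif pending is not None:
            _, href, title = event
            items.append({"date": pending, "link": href, "title": title})
            pending = None
    return items


def to_rss(items, channel_title, channel_link):
    """Serialize the extracted items as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Generated from an HTML page"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # RSS 2.0 uses RFC 822 style dates in pubDate.
        ET.SubElement(node, "pubDate").text = format_datetime(item["date"])
    return ET.tostring(rss, encoding="unicode")
```

Links that are not preceded by a date, such as navigation links, are skipped. A page without any time pattern would need a different strategy, which is the case the abstract assigns to repeated tag pattern mining.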

Author information

Authors and Affiliations

Fujitsu R&D Center Co., Ltd., B306, Eagle Run Plaza, No.26 Xiaoyun Rd., 100016, Beijing, China
Jun Wang
Fujitsu Laboratories, Ltd., 4-1-1 Kami-kodanaka, Nakahara-Kawasaki, Kanagawa, 211-8588, Japan
Kanji Uchino, Tetsuro Takahashi & Seishi Okamoto

Editor information

Editors and Affiliations

School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD, Australia
Heng Tao Shen
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
Victoria University, Australia
Yanchun Zhang

Cite this paper

Wang, J., Uchino, K., Takahashi, T., Okamoto, S. (2006). RSS Feed Generation from Legacy HTML Pages. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_115

DOI: https://doi.org/10.1007/11610113_115
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science, Computer Science (R0)