DOI: 10.1145/1963192.1963211
poster

HyLiEn: a hybrid approach to general list extraction on the web

Published: 28 March 2011

Abstract

We consider the problem of automatically extracting general lists from the Web. Existing approaches depend mostly on either the underlying HTML markup or the visual structure of the Web page. We present HyLiEn, an unsupervised, Hybrid approach for automatic List discovery and Extraction on the Web. It employs general assumptions about the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods.


    Published In

    WWW '11: Proceedings of the 20th international conference companion on World wide web
    March 2011
    552 pages
    ISBN:9781450306379
    DOI:10.1145/1963192

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. web information integration
    2. web lists
    3. web mining

    Qualifiers

    • Poster

    Conference

    WWW '11
    WWW '11: 20th International World Wide Web Conference
    March 28 - April 1, 2011
    Hyderabad, India

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Cited By

    • (2019) Combining URL and HTML Features for Entity Discovery in the Web. ACM Transactions on the Web 13:4 (1-27). DOI: 10.1145/3365574. Online publication date: 4-Dec-2019
    • (2018) Harnessing Twitter for Answering Opinion List Queries. IEEE Transactions on Computational Social Systems 5:4 (1083-1095). DOI: 10.1109/TCSS.2018.2881186. Online publication date: Dec-2018
    • (2018) STEM. Knowledge and Information Systems 55:2 (305-331). DOI: 10.1007/s10115-017-1062-0. Online publication date: 1-May-2018
    • (2017) Exploiting Web Sites Structural and Content Features for Web Pages Clustering. Foundations of Intelligent Systems (446-456). DOI: 10.1007/978-3-319-60438-1_44. Online publication date: 14-Jun-2017
    • (2016) Lossless Separation of Web Pages into Layout Code and Data. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1805-1814). DOI: 10.1145/2939672.2939858. Online publication date: 13-Aug-2016
    • (2016) Automatic Generation of Sitemaps Based on Navigation Systems. Machine Learning, Optimization, and Big Data (216-223). DOI: 10.1007/978-3-319-51469-7_18. Online publication date: 25-Dec-2016
    • (2014) Automatic Extraction of Logical Web Lists. Foundations of Intelligent Systems (365-374). DOI: 10.1007/978-3-319-08326-1_37. Online publication date: 2014
    • (2013) The parallel path framework for entity discovery on the web. ACM Transactions on the Web 7:3 (1-29). DOI: 10.1145/2516633.2516638. Online publication date: 30-Sep-2013
    • (2013) Exploring structure and content on the web. Proceedings of the sixth ACM international conference on Web search and data mining (779-780). DOI: 10.1145/2433396.2433499. Online publication date: 4-Feb-2013
    • (2013) Extracting the semantic content of web pages via repeated structures. 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW) (1-6). DOI: 10.1109/ICMEW.2013.6618450. Online publication date: Jul-2013
