Supporting More-Like-This Information Needs: Finding Similar Web Content in Different Scenarios

Hagen, Matthias; Glimm, Christiane

doi:10.1007/978-3-319-11382-1_6


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8685)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages


Abstract

We examine more-like-this information needs in different scenarios. A more-like-this information need occurs when a user sees an interesting document and wants to access other, similar documents. One of our foci is comparing strategies for identifying related web content. We compare three: following links (i.e., crawling), automatically generating keyqueries for the seen document (i.e., queries that return the document among their top-ranked results), and search engine operators that automatically display related results. Our experimental study shows that different strategies yield the most promising related results in different scenarios.
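To make the keyquery notion concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy corpus, the `search` function (conjunctive matching ranked by term frequency), and the parameters `k` and `max_len` are all assumptions introduced for illustration. The sketch enumerates small term combinations and keeps those for which the target document appears in the top `k` results, skipping supersets of queries already found.

```python
from itertools import combinations

# Hypothetical toy corpus standing in for a search engine index.
CORPUS = {
    "d1": "crawling web pages by following links",
    "d2": "keyqueries describe a document by queries that retrieve it",
    "d3": "search engine operators display related results",
    "d4": "queries and links on the web",
}

def search(query_terms, corpus):
    """Toy retrieval: return ids of documents containing all query terms,
    ranked by summed term frequency (ties broken by document id)."""
    results = []
    for doc_id, text in corpus.items():
        words = text.split()
        if all(term in words for term in query_terms):
            tf = sum(words.count(term) for term in query_terms)
            results.append((-tf, doc_id))
    return [doc_id for _, doc_id in sorted(results)]

def keyqueries(doc_id, vocabulary, corpus, k=1, max_len=2):
    """Return term combinations (up to max_len terms) that rank doc_id
    within the top k results. Supersets of found queries are skipped."""
    found = []
    for n in range(1, max_len + 1):
        for query in combinations(sorted(set(vocabulary)), n):
            if any(set(q) <= set(query) for q in found):
                continue  # a shorter query already retrieves the document
            if doc_id in search(query, corpus)[:k]:
                found.append(query)
    return found

print(keyqueries("d4", ["queries", "links", "web"], CORPUS))
```

In this toy setting no single term puts `d4` at rank one, but some two-term combinations do. Against a real search engine, `search` would be replaced by calls to that engine's query interface, and the candidate vocabulary would come from terms extracted from the seen document.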

One of our use cases is to automatically support people who monitor right-wing content on the web. In this scenario, crawling from a given set of seed documents turns out to be the best strategy for finding related pages with similar content; querying and the related operator yield far fewer good results. For news portals, however, crawling is a bad idea since hardly any news portal links to other news portals. Instead, a search engine’s related operator or querying is the better strategy. Finally, for identifying scientific publications related to a given paper, all three strategies yield good results.
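The crawling strategy can be sketched as a breadth-first traversal from the seed documents. The following Python example is only an illustration under stated assumptions: the link graph `LINKS`, the page names, and the `max_depth` parameter are hypothetical, and a real crawler would fetch pages over HTTP and would likely filter candidates by content similarity.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed-a": ["page-1", "page-2"],
    "page-1": ["page-3"],
    "page-2": ["seed-a"],
    "page-3": ["page-4"],
    "news-portal": [],  # a portal that links to no other site
}

def crawl(seeds, links, max_depth=2):
    """Breadth-first crawl: return pages reachable from the seeds within
    max_depth link hops, in discovery order, excluding the seeds."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    found = []
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                found.append(target)
                queue.append((target, depth + 1))
    return found

print(crawl(["seed-a"], LINKS))
print(crawl(["news-portal"], LINKS))
```

The second call returns an empty list, which mirrors the news portal observation: when seed pages do not link to comparable sites, crawling has nothing to follow.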



Author information

Authors and Affiliations

Bauhaus-Universität Weimar, Weimar, Germany
Matthias Hagen & Christiane Glimm


Editor information

Editors and Affiliations

Google Inc., Brandschenkestraße 110, 8002, Zurich, Switzerland
Evangelos Kanoulas
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstrasse 9-11, 1040, Vienna, Austria
Mihai Lupu
Information School, University of Sheffield, Sheffield, UK
Paul Clough
Department of Computer Science and IT, RMIT University, 3000, Melbourne, VIC, Australia
Mark Sanderson
Department of Computing, Edge Hill University, L39 4QP, Ormskirk, Lancashire, UK
Mark Hall
Vienna University of Technology, Austria
Allan Hanbury
Information School, University of Sheffield, Regent Court, 211 Portobello, S1 4DP, Sheffield, UK
Elaine Toms


About this paper

Cite this paper

Hagen, M., Glimm, C. (2014). Supporting More-Like-This Information Needs: Finding Similar Web Content in Different Scenarios. In: Kanoulas, E., et al. Information Access Evaluation. Multilinguality, Multimodality, and Interaction. CLEF 2014. Lecture Notes in Computer Science, vol 8685. Springer, Cham. https://doi.org/10.1007/978-3-319-11382-1_6


DOI: https://doi.org/10.1007/978-3-319-11382-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11381-4
Online ISBN: 978-3-319-11382-1
eBook Packages: Computer Science, Computer Science (R0)
