
Is Wikipedia link structure different?

Published: 09 February 2009

Abstract

In this paper, we investigate the difference between Wikipedia and Web link structure with respect to their value as indicators of the relevance of a page for a given topic of request. Our experimental evidence is from two IR test collections: the .GOV collection used at the TREC Web tracks and the Wikipedia XML Corpus used at INEX. We first perform a comparative analysis of Wikipedia and .GOV link structure and then investigate the value of link evidence for improving search on Wikipedia and on the .GOV domain. Our main findings are as follows. First, Wikipedia link structure is similar to that of the Web, but more densely linked. Second, Wikipedia's outlinks behave similarly to inlinks and both are good indicators of relevance, whereas on the Web the inlinks are more important. Third, when incorporating link evidence in the retrieval model, for Wikipedia the global link evidence fails and we have to take the local context into account.


Cited By

  • (2024) Periphoscape: Enhance Wikipedia Browsing by Presenting Diverse Aspects of Topics. Web Information Systems Engineering – WISE 2024, pages 352-366. DOI: 10.1007/978-981-96-0579-8_25. Online publication date: 29-Nov-2024
  • (2023) Large-Scale Analysis of Wikipedia’s Link Structure and its Applications in Learning Path Construction. 2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science (IRI), pages 254-260. DOI: 10.1109/IRI58017.2023.00051. Online publication date: Aug-2023
  • (2020) CycleRank, or there and back again: personalized relevance scores from cyclic paths on directed graphs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 476:2241. DOI: 10.1098/rspa.2019.0740. Online publication date: 9-Sep-2020


Reviews

Srini Ramaswamy

Link structure plays an important role in retrieving relevant information on the Web, and its characteristics are useful in estimating the relevance of a page. Three properties of a page in the page pool are treated as indicators of its relevance: its indegree (the number of pages linking to it), its outdegree (the number of pages it links to), and its length (the number of characters on the page).

In this paper, Kamps and Koolen analyze Wikipedia's link structure in order to determine whether it differs from that of the Web. They use the 2006-2007 INitiative for the Evaluation of XML-retrieval (INEX) test collection for Wikipedia, and the 2004 Text Retrieval Conference (TREC) test collection for general Web retrieval. Their comparison yields some interesting observations that go beyond this stated goal and that bear on the effectiveness of retrieving related documents. The paper consists of eight sections, but this review only covers the two most important ones, Sections 3 and 4.

Section 3 analyzes the link structures of Wikipedia and .GOV by comparing the indegree, outdegree, and length distributions of pages. The study suggests that Wikipedia has a very densely linked structure. A major portion of the Wikipedia collection belongs to strongly connected components (SCC), whereas this portion is smaller in the .GOV collection. Wikipedia is well organized, author guided, and dynamic, which makes its link structure comparatively complete; peer editing and automatic link detection enhance it further. Kamps and Koolen also show that, in Wikipedia, outdegree and indegree behave similarly, but this is not so in the .GOV collection. In Wikipedia, high values of both outdegree and indegree indicate a high probability of relevance, whereas in .GOV only a high indegree does.

The length of a document is not related to its probability of relevance in .GOV, but in Wikipedia, the length is directly proportional to the probability of relevance. The authors study the correlations between the indegree, outdegree, and length attributes, and show that in Wikipedia, outdegree and length are highly correlated.

Section 4 derives a method to incorporate the link evidence of a document: the score of a document $d$, given a query $q$, is represented as $P(d \mid q) = P(d) \cdot P(q \mid d)$, where $P(q \mid d)$ is the probability of generating $q$ from $d$, and $P(d)$ is the document prior. The authors use global and local link evidence in the form of a standard degree prior, a log degree prior, and a prior that combines global and local evidence. The rest of the paper discusses the experimental details regarding the data collections.

In conclusion, the paper presents a deep discussion of Kamps and Koolen's results. It is a highly relevant paper for readers interested in the assessment methods of link structures for Wikipedia and other Web sources.

Online Computing Reviews Service
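The scoring rule described in the review, a document prior multiplied by the query likelihood, can be illustrated with a minimal Python sketch. The review names a standard degree prior and a log degree prior without giving formulas, so the forms used here (1 + indegree and 1 + log(1 + indegree)) are assumptions, as are all function names, the toy link graph, and the toy likelihood values. "Local" evidence is taken to mean links counted only among the retrieved documents, which is also an assumption. By Bayes' rule the product equals P(d|q) only up to the query-dependent constant P(q), which does not affect the ranking.

```python
import math
from collections import Counter


def indegrees(links, within=None):
    """Count incoming links per page.

    links  : iterable of (source, target) pairs.
    within : optional set of pages; if given, only links whose source and
             target are both in the set are counted (local link evidence).
    """
    counts = Counter()
    for src, dst in links:
        if src == dst:
            continue  # ignore self-links
        if within is not None and (src not in within or dst not in within):
            continue
        counts[dst] += 1
    return counts


def degree_prior(indegree):
    # Assumed form of a "standard degree prior": proportional to 1 + indegree.
    return 1.0 + indegree


def log_degree_prior(indegree):
    # Assumed form of a "log degree prior": dampens very large indegrees.
    return 1.0 + math.log(1.0 + indegree)


def rerank(query_likelihood, links, prior=log_degree_prior, local=True):
    """Rank documents by prior(d) * P(q|d), best first.

    query_likelihood : dict mapping document id -> P(q|d) for the retrieved set.
    local            : count links only among retrieved documents if True,
                       over the whole link graph if False.
    """
    retrieved = set(query_likelihood)
    counts = indegrees(links, within=retrieved if local else None)
    scores = {d: prior(counts[d]) * p for d, p in query_likelihood.items()}
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy data (hypothetical): page "c" is popular globally, but none of its
# inlinks come from the documents retrieved for this query.
links = [("a", "b"), ("c", "b"), ("b", "a"),
         ("x", "c"), ("y", "c"), ("z", "c")]
query_likelihood = {"a": 0.30, "b": 0.25, "c": 0.20}

global_ranking = rerank(query_likelihood, links, prior=degree_prior, local=False)
local_ranking = rerank(query_likelihood, links, prior=degree_prior, local=True)
print(global_ranking)  # ['c', 'b', 'a']
print(local_ranking)   # ['b', 'a', 'c']
```

In this toy case the global prior promotes the globally popular page "c" to the top, while the local prior keeps it last because no retrieved document links to it. This only illustrates why the two kinds of evidence can disagree; it does not reproduce the paper's experiments.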



Information

Published In

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
February 2009
314 pages
ISBN: 9781605583907
DOI: 10.1145/1498759
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Wikipedia
  2. link evidence
  3. web information retrieval

Qualifiers

  • Research-article


Conference

WSDM'09

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Citations

  • (2020) Relating Wikipedia article quality to edit behavior and link structure. Applied Network Science 5:1. DOI: 10.1007/s41109-020-00305-y. Online publication date: 9-Sep-2020
  • (2020) Network Structure and Scheme Analysis of the Russian Language Segment of Wikipedia. Network Algorithms, Data Mining, and Applications, pages 129-142. DOI: 10.1007/978-3-030-37157-9_9. Online publication date: 23-Feb-2020
  • (2019) On the Relation of Edit Behavior, Link Structure, and Article Quality on Wikipedia. Complex Networks and Their Applications VIII, pages 242-254. DOI: 10.1007/978-3-030-36683-4_20. Online publication date: 25-Nov-2019
  • (2019) Characterising Social Machines. The Theory and Practice of Social Machines, pages 1-41. DOI: 10.1007/978-3-030-10889-2_1. Online publication date: 15-Feb-2019
  • (2018) Negapedia. Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 210-213. DOI: 10.5555/3382225.3382270. Online publication date: 28-Aug-2018
  • (2018) Finding High Quality Documents through Link and Click Graphs. 2018 7th International Congress on Advanced Applied Informatics (IIAI-AAI), pages 49-54. DOI: 10.1109/IIAI-AAI.2018.00020. Online publication date: Jul-2018
  • (2018) The Battle for Information: Exposing Wikipedia. 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pages 958-965. DOI: 10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.000-3. Online publication date: Aug-2018