ABSTRACT
Wikipedia is known for serving humans' informational needs. Over the past decade, the encyclopedic knowledge encoded in Wikipedia has also powerfully served computer systems. Leading algorithms in artificial intelligence, natural language processing, data mining, geographic information science, and many other fields analyze the text and structure of articles to build computational models of the world.
Many software packages extract knowledge from Wikipedia. However, existing tools either (1) provide Wikipedia data but not well-known Wikipedia-based algorithms, or (2) focus narrowly on one such algorithm.
This paper presents the WikiBrain software framework, an extensible Java-based platform that democratizes access to a range of Wikipedia-based algorithms and technologies. WikiBrain provides simple access to the diverse Wikipedia data needed for semantic algorithms and technologies, ranging from page views to Wikidata. In a few lines of code, a developer can use WikiBrain to access Wikipedia data and state-of-the-art algorithms. WikiBrain also enables researchers to extend Wikipedia-based algorithms and evaluate their extensions. WikiBrain promotes a new vision of the Wikipedia software ecosystem: every researcher and developer should have access to state-of-the-art Wikipedia-based technologies.