ABSTRACT
In this paper, we describe a methodology for estimating the geographic coverage of the web without secondary knowledge or complex geo-tagging. We randomly select toponyms from the Ordnance Survey 50K gazetteer, use them as search queries, and gather the resulting document counts from various web sources for Great Britain. The same gazetteer is then used to geo-code the results so that they can be mapped. To validate the approach, and to demonstrate the effects of geo/non-geo and geo/geo ambiguity, we mapped the selected toponyms to Geograph, a community project that contains user-generated geo-tagged photographs of the UK. Although success varies with resolution, the proposed approach is likely reliable enough for applications that explore the geographic coverage of the web in cases where references to settlements are likely to be common. In our case, we applied the method to produce maps of web coverage for a range of sources at a resolution of 30 km.
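The pipeline summarised above (sample toponyms, obtain a document count per toponym, geo-code with the same gazetteer, aggregate for mapping) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the gazetteer rows, the `fake_document_count` stand-in for a search-engine query, and the helper names `cell_of` and `coverage_by_cell` are all hypothetical, and the coordinates are placeholder values rather than Ordnance Survey data. Only the 30 km cell size comes from the abstract.

```python
import random
from collections import defaultdict

CELL_SIZE_M = 30_000  # 30 km grid cells, the resolution stated in the abstract

# Hypothetical gazetteer rows: (toponym, easting_m, northing_m).
# Placeholder values for illustration only, not real gazetteer entries.
GAZETTEER = [
    ("Placename A", 12_000, 8_000),
    ("Placename B", 25_000, 29_000),
    ("Placename C", 44_000, 15_000),
    ("Placename D", 61_000, 95_000),
]


def fake_document_count(toponym):
    """Assumed stand-in for a web source's hit count for a toponym query.

    Returns a deterministic dummy value so the sketch runs offline.
    """
    return len(toponym) * 10


def cell_of(easting, northing, size=CELL_SIZE_M):
    """Map a coordinate pair (in metres) to its grid cell index."""
    return (easting // size, northing // size)


def coverage_by_cell(gazetteer, sample_size, count_fn, seed=0):
    """Sample toponyms at random and sum their document counts per cell."""
    rng = random.Random(seed)
    sample = rng.sample(gazetteer, sample_size)
    totals = defaultdict(int)
    for name, easting, northing in sample:
        # Geo-code the count via the gazetteer coordinates of the toponym.
        totals[cell_of(easting, northing)] += count_fn(name)
    return dict(totals)


if __name__ == "__main__":
    for cell, total in sorted(coverage_by_cell(GAZETTEER, 3, fake_document_count).items()):
        print(cell, total)
```

In a real setting `count_fn` would wrap whichever web source is being measured, and ambiguous toponyms (geo/non-geo or geo/geo) would inflate or misplace counts, which is the effect the Geograph comparison is meant to expose.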