Extracting geographic features from the Internet: A geographic information mining framework
Introduction
Digital gazetteers are structured dictionaries of named places. As defined by Goodchild and Hill [1], a gazetteer is composed of three core components: place names, place types and geospatial locations. The primary purpose of a gazetteer is to translate informal place names and place categories into formal georeferences expressed in mathematical schemes and well-known types [2], [3]. With the explosion of web-based services on the Internet, more and more gazetteer applications require a very high level of completeness, but current gazetteers still lack the local place names used in everyday conversation [4], [5]. Moreover, existing structured dictionaries provide only static information and do not incorporate frequent updates. Thus, there is a strong need to enrich gazetteers with abundant, up-to-date local place entries to improve the timeliness and integrity of structured geographical data. One critical issue is how to build place-name datasets effectively and efficiently.
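To make the three core components concrete, the following is a minimal sketch of a gazetteer record and a name lookup. The class names, the case-insensitive matching and the sample entries are illustrative assumptions, not part of the cited definition.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    """One gazetteer record: place name, place type, geospatial location."""
    name: str
    place_type: str
    location: Tuple[float, float]  # (latitude, longitude) in decimal degrees


class Gazetteer:
    """Hypothetical in-memory gazetteer keyed by case-insensitive place name."""

    def __init__(self) -> None:
        self._by_name: Dict[str, List[GazetteerEntry]] = {}

    def add(self, entry: GazetteerEntry) -> None:
        self._by_name.setdefault(entry.name.casefold(), []).append(entry)

    def lookup(self, name: str) -> List[GazetteerEntry]:
        # Several entries may share one name, which is why disambiguation matters.
        return list(self._by_name.get(name.strip().casefold(), []))


# Made-up sample entries, for illustration only.
gazetteer = Gazetteer()
gazetteer.add(GazetteerEntry("Joe's Diner", "restaurant", (37.3382, -121.8863)))
gazetteer.add(GazetteerEntry("Joe's Diner", "restaurant", (40.7128, -74.0060)))
matches = gazetteer.lookup("joe's diner")
```

Here an informal name resolves to two candidate footprints, which illustrates why a name alone may be insufficient to georeference a place.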
Conventionally, place-name datasets can be obtained from structured data sources such as DBpedia, LinkedGeoData, Wikimapia, the Google Places API and OpenStreetMap. However, these structured sources provide only static information and do not always incorporate timely updates. Moreover, although some commercial sources, such as the Google Places API, maintain high-quality location data, they impose many restrictions on obtaining and using their data (e.g., usage limits). In contrast, the volume of vernacular information publicly available on the Web is enormous and grows rapidly. Meanwhile, up-to-date information about places is commonly released on the Web and is frequently renewed through unstructured web pages. The great utility of the Web for studies on data collection and extraction has been widely recognized [6], [7], [8].
In this paper, we first propose a Geographic Information Mining (GIM) framework, inspired by the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology [9], for extracting geographic information. The GIM framework lets us model geographic information mining processes along multiple dimensions, i.e., the phase-oriented model, the generic/specific task-oriented model and the instance-oriented model. In association with search engines, each model specifies and arranges the focal phases, tasks and instances as entries that integrate potential methodological solutions for mining geographic information from web pages. Under the phase-oriented modeling dimension, the information mining process consists of a series of phases. Each phase specifies a sequential activity that reflects the evolution from raw data to structured knowledge. Since information mining is not complete once a solution is deployed, we introduce an iterative scheme for structuring the procedural modules (i.e., tasks and instances) under the task-oriented and instance-oriented modeling dimensions. The iterative scheme reflects the continuing evolution of knowledge discovery, in which previously obtained mining results are reused during subsequent mining [10]; it also shows that information mining is an open process that can take advantage of various solutions to collect, extract and process data samples.
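The idea of sequential phases wrapped in an iterative scheme can be sketched as follows. This is only an illustration under stated assumptions: the `Phase` signature, the stopping rule, the toy `collect` and `clean` phases and the `MENTIONS` table are all hypothetical and are not taken from the GIM framework itself.

```python
from typing import Callable, Dict, Iterable, List, Set

# Assumed signature: a phase receives the working data and everything known so far.
Phase = Callable[[Set[str], Set[str]], Set[str]]


def mine_iteratively(phases: List[Phase], seeds: Iterable[str], max_rounds: int = 10) -> Set[str]:
    """Run the phases in order, repeating while a round discovers something new."""
    known: Set[str] = set(seeds)
    for _ in range(max_rounds):
        data = set(known)
        for phase in phases:       # phases run in a fixed sequence
            data = phase(data, known)
        discovered = data - known
        if not discovered:         # nothing new: stop iterating
            break
        known |= discovered        # earlier results feed the next round
    return known


# Toy stand-ins for real collection and cleaning steps (made-up data).
MENTIONS: Dict[str, Set[str]] = {
    "Main St": {"Blue Cup Cafe"},
    "Blue Cup Cafe": {"Elm Park"},
}


def collect(data: Set[str], known: Set[str]) -> Set[str]:
    return data | {m for item in data for m in MENTIONS.get(item, set())}


def clean(data: Set[str], known: Set[str]) -> Set[str]:
    return {item.strip() for item in data if item.strip()}


result = mine_iteratively([collect, clean], ["Main St"])
```

In this toy run, each round reuses the names found in the previous round as new starting points, and the loop ends once a round adds nothing.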
CRISP-DM, the foundation for constructing the GIM framework, bridges the gap between business problems and data mining objectives: it provides a standard reference model for translating business problems into a set of data mining tasks and is independent of technological aspects. It aims to improve the effectiveness and efficiency of managing data mining projects. In contrast, the GIM framework proposed in this paper is designed to organize potential methodological solutions to the problems of mining geographic information from web pages, and it supports loading compatible methods into the iterative modeling dimensions. In other words, GIM can be adopted as a technological roadmap for extracting accurate and timely geographic features, associated locations and feature types from freely available online data in order to build place-name datasets. It mainly concerns the following key issues:
The first issue is to collect and identify the appropriate place types for composing keywords in the format Street Names + City Names + Place Types, which are used to search for the initial geographic information.
The second issue is to filter out the “noisy” information mixed into the obtained search results, so that meaningful and valuable information is preserved.
The third issue is to extract the necessary geospatial locations from the various web pages [11], taking into account the distinct structures of different web pages.
The fourth issue is to address the problem of place name disambiguation (also called toponym disambiguation), i.e., to identify place names accurately and with as little ambiguity as possible.
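The first two issues can be illustrated with a short sketch: composing one query per combination of street name, city name and place type, then discarding noisy results. The space-separated query format, the substring-based noise test, the function names and the sample strings are assumptions made for illustration; the filtering method actually used in this work is not described in this section.

```python
from itertools import product
from typing import Iterable, List, Sequence


def compose_queries(streets: Iterable[str], cities: Iterable[str],
                    place_types: Iterable[str]) -> List[str]:
    """Build one query string per (street, city, place type) combination."""
    return [" ".join(part.strip() for part in combo)
            for combo in product(streets, cities, place_types)]


def drop_noisy(snippets: Iterable[str], noise_terms: Sequence[str]) -> List[str]:
    """Keep snippets that contain none of the noise terms (case-insensitive)."""
    lowered = [term.casefold() for term in noise_terms]
    return [s for s in snippets if not any(term in s.casefold() for term in lowered)]


# Made-up inputs, for illustration only.
queries = compose_queries(["Main St"], ["Springfield"], ["cafe", "pharmacy"])
kept = drop_noisy(
    ["Blue Cup Cafe, 12 Main St", "Homes for sale near Main St"],
    noise_terms=["for sale"],
)
```

A plain substring test like this is deliberately crude; it only shows where a filtering step would sit between searching and location extraction.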
To present our research progress fully and help readers understand the contribution of this paper, we briefly summarize our previous study in Section 2, where we also discuss related work on the relevant research topics. In Section 3, we elaborate the GIM framework and propose a set of methods, integrated into the framework, for building place-name datasets on the basis of our previous results. Then in Section 5, we describe a set of experiments that verify the proposed contribution and analyze the experimental results comprehensively. In the last section, we conclude by highlighting the characteristics of the proposed framework and outline future work on improving performance for enriching gazetteers.
The previous work
Our previous work [12] provided a set of preliminary solutions and results for the critical issues addressed by the proposed GIM framework.
- (1) To compose proper search keywords in the format Street Names + City Names + Place Types for web search engines, the previous work collected the place types from manual input.
- (2) To filter out web pages containing many real-estate listings and obtain more usable search results, the
The GIM framework
In this paper, we propose a Geographic Information Mining (GIM) framework that positions our contributed methodologies as a systematic solution to the various issues involved in building place-name datasets (see Fig. 1). The framework provides a set of modeling dimensions, derived from [9], organized by the hierarchical complexity of processing potential geographic information. The modeling dimensions are:
- (1) Phase-oriented modeling dimension: it provides the scheme of modeling
Automatic construction of place-name datasets under GIM framework
In this section, we present a set of methodologies that address the critical issues in automatically constructing place-name datasets and integrate them into the geographic information mining process of the GIM framework, thereby suggesting an application model of the framework.
Based on the objective definition of the proposed research work stated in Section 3.1, we then proceed to consider the corresponding critical issues and identify the initial generic tasks according to the actual
Verification and experiments
In this section, we evaluate whether the GIM framework, integrated with the proposed methods, meets the requirements and specifications raised by the problem of enriching place-name datasets by mining geographic information from the Internet. To answer the question “Are we building the things right?” [55], frequently posed during verification, we structure the verification paradigm by following the process proposed by Sargent [56], slightly modified to fit the work in this paper
Conclusion
In this paper, we presented a Geographic Information Mining (GIM) framework for building place-name datasets from the Internet to enrich gazetteers with local place names used in everyday life. This work consolidates and extends our previous work in [12]. The new work in this paper comprises the following parts:
- (1) We create a geographic information mining process under the GIM framework that provides multiple modeling dimensions for processing potential geographic information to resolve the
Acknowledgment
This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant 2018MS024, in part by the National Natural Science Foundation of China under Grant 61305056, and in part by the Overseas Expertise Introduction Program for Disciplines Innovation in Universities (Project 111) under Grant B13009.
References (58)
- et al. An automatic approach for building place-name datasets from the web
- et al. Named entity disambiguation for questions in community question answering. Knowl.-Based Syst. (2017)
- et al. Word sense disambiguation based sentiment lexicons for sentiment classification. Knowl.-Based Syst. (2016)
- et al. Introduction to digital gazetteer research. Int. J. Geogr. Inf. Sci. (2008)
- Core elements of digital gazetteers: placenames, categories, and footprints
- Geospatial semantics
- et al. User needs and implications for modelling vague named places. Spatial Cognition Comput. (2009)
- et al. Exploring place through user-generated content: using Flickr tags to describe city cores. J. Spatial Inf. Sci. (2010)
- et al. The web as a baseline: Evaluating the performance of unsupervised web-based models for a range of NLP tasks
- et al. Universality, language-variability and individuality: defining linguistic building blocks for spatial relations
- Stability of qualitative spatial relations between vernacular regions mined from web data
- CRISP-DM: Towards a standard process model for data mining
- A survey of data mining and knowledge discovery process models and methodologies. Knowl. Eng. Rev.
- Geospatial data mining on the web: Discovering locations of emergency service facilities
- Relative radiometric normalization of multitemporal images. Int. J. Interact. Multimedia Artif. Intell.
- Combining fuzzy AHP with GIS and decision rules for industrial site selection. Int. J. Interact. Multimedia Artif. Intell.
- Design a batched information retrieval system based on a concept-lattice-like structure. Knowl.-Based Syst.
- Extraction, integration and analysis of crowdsourced points of interest from multiple web sources
- Automatic acquisition of vernacular places
- Automatic gazetteer enrichment with user-geocoded data
- An agenda for the next generation gazetteer: geographic information contribution and retrieval
- Modelling vague places with knowledge from the web. Int. J. Geogr. Inf. Sci.
- Identifying imprecise regions for geographic information retrieval using the web
- Neighborhood restrictions in geographic IR
- Semi-supervised learning of geographical gazetteers from the internet
- A data driven approach to mapping urban neighbourhoods
- Gazetiki: automatic creation of a geographical gazetteer
- A prototype for linear features generalization. Int. J. Interact. Multimedia Artif. Intell.
- Spreading semantic information by word sense disambiguation. Knowl.-Based Syst.