
1 Introduction

The search functionality of SharePoint is very robust and brings a Google- or Bing-style experience to corporate documents and other content. It combines many components, complex search algorithms, and path-finding algorithms to ultimately produce quality search result sets.

SharePoint's search algorithm takes many factors into account when ranking search results and finding content for display. Some of these factors are inherent in the content being searched or determined by the way users select and reference content; developers cannot control them. Other SharePoint search ranking elements, however, can be influenced by the user.

The A* algorithm, on the other hand, is a computer algorithm that is widely used in path finding and graph traversal, the process of plotting an efficiently traversable path between points, called nodes. A* uses a best-first search and finds the least-cost path from a given initial node to one goal node (out of one or more possible goals). It uses a distance-plus-cost heuristic function f(x) to determine the order in which the search visits nodes in the tree.

It is possible to compute the shortest path from one location to another, and the distance between them, in many different ways. For example, addresses can be queried against SQL Server (if the correct data is available), or they can be used in conjunction with the Bing geocode services. A custom web part with logic to query one of those services with the user's or object's current location and all list items with location information takes little time to build, but performance issues can pop up in no time.

How is the performance with 200 items in a list? With 2,000? 20,000? Maybe 200,000? It is certainly possible to imagine smart solutions for sending 200,000 locations to the geocode service and receiving them back, yet extracting that information from a SharePoint list is not an easy task; it takes quite some time. It gets even harder when the data comes from several lists, let alone from several site collections, from external data, or from location information that resides inside documents. This is where efficient path-finding algorithms are required.

Throughout this paper, the authors discuss how to use SharePoint FAST Search and A* to find geolocation, and which of the two is most suitable.

2 Scenario - SharePoint Fast Search to Find Geolocation

SharePoint FAST Search is a very powerful search engine that can be customized in various ways. First of all, FAST Search can index all of the information that lives inside SharePoint or outside of it (using, for example, the Business Connectivity Service or a custom connector). Whenever a query is executed on a certain keyword, all indexed data (subject to security) can be checked against that keyword.

Second comes the ability to enrich the index with extra information. The source for this information can be existing metadata from site columns (address, city), data from inside the document, or data already extracted using the entity extractor. These sources can be used to query the geocode service to retrieve the corresponding spatial data, which can then be added to the FAST index. This metadata can be used to query the index and to determine distances to items within it.

2.1 Indexing the Data

FAST crawls data and puts it into the index. Web crawlers, also known as spiders, are used to crawl through the hundreds of millions of Web pages that exist, in order to gather the information (Fig. 1).

Fig. 1. Crawling site collections

According to [6, 8], one of the processes that happens during indexing is content processing. During this process, the data is traversed through a “pipeline”, which consists of several stages:

Format Conversion → Language Detection → Lemmatize → Tokenizer → Entity Extraction → Vectorizer → Web Analyzer → Properties Mapper

The stages shown above are just a small subset of the entire process; basically, the pipeline does the following:

  1. Normalize the document - the data of each input is normalized, so that no stage has problems processing the content.

  2. Language detection - determine the language of the document. This metadata is used in other stages to, for example, determine which dictionary should be used.

  3. Lemmatizer - based on the detected language, the lemmas and stems of words are determined.

  4. Entity extraction - extracts entities, based on a dictionary, from the data being processed. Out of the box, locations and persons are extracted.

However, this pipeline can be extended based on the requirements. For this scenario, it can be extended to identify the latitude and longitude.

2.2 Extending the Pipeline

As the authors want to work with spatial data, a custom pipeline stage needs to be created, which takes location data as input and outputs spatial data: latitude and longitude. All these properties need to be crawled properties.

In order to create the pipeline, it is essential to know what data is available, where it resides, and how it can be processed. This data must be available through crawled properties (Fig. 2).

Fig. 2. Extend pipeline
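In FAST Search Server 2010 for SharePoint, the pipeline is typically extended by registering an external executable in pipelineextensibility.xml; the executable is invoked per item with an input and an output XML file of crawled properties. What follows is a minimal C# sketch of such a spatial stage; the GeocodeAddress helper, the property-set GUID, and the exact crawled-property names are illustrative assumptions, not values prescribed by the product.

  using System;
  using System.Xml.Linq;

  // Minimal sketch of a pipeline extensibility stage. FAST calls it as:
  //   SpatialStage.exe %(input)s %(output)s   (configured in pipelineextensibility.xml)
  class SpatialStage
  {
      static int Main(string[] args)
      {
          // args[0] = path to the input XML, args[1] = path to the output XML.
          XDocument input = XDocument.Load(args[0]);
          XElement addressProperty = input.Root.Element("CrawledProperty");
          if (addressProperty == null || string.IsNullOrEmpty(addressProperty.Value))
              return 0; // nothing to geocode for this item

          double latitude, longitude;
          GeocodeAddress(addressProperty.Value, out latitude, out longitude);

          // Emit latitude and longitude as crawled properties (varType 5 = VT_R8, double).
          new XDocument(
              new XElement("Document",
                  SpatialProperty("latitude", latitude),
                  SpatialProperty("longitude", longitude))).Save(args[1]);
          return 0;
      }

      static XElement SpatialProperty(string name, double value)
      {
          return new XElement("CrawledProperty",
              new XAttribute("propertySet", "11111111-2222-3333-4444-555555555555"), // assumed custom property set
              new XAttribute("propertyName", name),
              new XAttribute("varType", 5),
              value);
      }

      static void GeocodeAddress(string address, out double lat, out double lon)
      {
          // Placeholder: call a geocoding service (for example, Bing Maps) here.
          lat = 0.0;
          lon = 0.0;
      }
  }

The emitted latitude and longitude crawled properties can then be mapped to managed properties so that they become queryable and sortable.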

2.3 SharePoint Fast Search Query to Find Location

After the custom pipeline extension has been implemented and all the data has been re-indexed, the index is enriched with latitude and longitude information. This information can be used for some interesting queries and some interesting sorting algorithms.

When working with spatial data, there are different approaches that can be used to retrieve the nearest locations and sort them. There is, however, one caveat to take care of when a custom sorting formula is used. For the current scenario, the authors query directly against the FAST Query Service Application using the code below, which also returns a set of three managed properties: title, latitude, and longitude.


The code gets the search service application proxy that will be used and instantiates a new KeywordQuery object.
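A minimal sketch of such a query, assuming the SharePoint 2010 server object model with FAST Search (the row limit is an illustrative choice):

  using Microsoft.Office.Server.Search.Administration;
  using Microsoft.Office.Server.Search.Query;
  using Microsoft.SharePoint;

  // Inside, for example, a web part running in the SharePoint farm:
  // get the Search Service Application proxy for the current site context
  // and instantiate a KeywordQuery against the FAST provider.
  SearchServiceApplicationProxy proxy =
      (SearchServiceApplicationProxy)SearchServiceApplicationProxy.GetProxy(
          SPServiceContext.GetContext(SPContext.Current.Site));

  KeywordQuery query = new KeywordQuery(proxy);
  query.ResultsProvider = SearchProvider.FASTSearch; // route the query to FAST
  query.QueryText = "#";  // per Sect. 2.4, "#" retrieves all items
  query.RowLimit = 50;    // illustrative
  query.SelectProperties.Clear();
  query.SelectProperties.Add("title");
  query.SelectProperties.Add("latitude");
  query.SelectProperties.Add("longitude");

  ResultTableCollection results = query.Execute();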

2.4 Retrieve All Results and Sort Them by Distance to a Certain Point

This query is easy to execute, as the query “#” will retrieve all items. But when the managed properties are used in a sorting formula, things change: as the managed properties are of type decimal, they are handled differently, as described in this paper. For the sort formula, different algorithms can be used; two popular algorithms, sketched in code after the list below, are the following:

  1. Euclidean distance: the shortest, unique distance between two points.

  2. Taxicab distance: the sum of the absolute differences of the two points’ coordinates. The path between the two points does not have to be unique.
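Either distance can be pushed into the sort specification of the query from Sect. 2.3. This is a hedged sketch, assuming the FAST sort-formula syntax with FQL enabled and sortable decimal managed properties; the reference point (latitude 50.85, longitude 4.35) is an illustrative assumption:

  // Sort all results by Euclidean distance to a fixed reference point
  // (sketch; assumes the FAST sort-formula syntax is available).
  query.EnableFQL = true; // sort formulas require FQL
  query.SortList.Add(
      "[formula:sqrt(pow(latitude-50.85,2)+pow(longitude-4.35,2))]",
      SortDirection.Ascending);

  // Taxicab variant: the sum of the absolute coordinate differences.
  // query.SortList.Add(
  //     "[formula:abs(latitude-50.85)+abs(longitude-4.35)]",
  //     SortDirection.Ascending);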

3 Euclidean Distance

Joo Ghee Lim and S. V. Rao described a grid-based location estimation scheme. The scheme uses hop counts between the nodes and the markers to determine a location in a square monitoring region. The scheme has several features, such as fast localization, energy saving, and strong robustness [3]. However, the location accuracy of the scheme above is unstable in simulation because of multi-solution and no-solution cases. The details are as follows [1, 7].

The grid-based location estimation scheme is improved by a distributed grid location estimation scheme based on the Euclidean distance; its location accuracy is higher than that of the Taxicab distance, and it is able to solve the shortest path problem (Fig. 3).

Fig. 3. Location based on grid extraction

Euclidean distance [2] is generally used as a measure function, with computational complexity O(d). It can be described as follows:

$$ D(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} $$

where \( x \) and \( y \) are d-dimensional vectors and \( D(x, y) \) is the distance between them. Considering the balance between accuracy and computational complexity, the authors contend that the Euclidean distance is better than any other measure function in this scheme (this is demonstrated in a later part of this paper).
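As an illustration, the formula maps directly to code; a minimal C# sketch:

  using System;

  static class Distance
  {
      // Euclidean distance D(x, y) between two d-dimensional vectors; O(d).
      public static double Euclidean(double[] x, double[] y)
      {
          double sum = 0.0;
          for (int i = 0; i < x.Length; i++)
              sum += (x[i] - y[i]) * (x[i] - y[i]);
          return Math.Sqrt(sum);
      }
  }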

4 Taxicab Geometry

Taxicab Geometry is a form of geometry in which the usual distance function or metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates.

The taxicab distance, \( d_{1} \), between two vectors \( p,q \) in an n-dimensional real vector space with fixed Cartesian coordinate system, is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes. More formally,

$$ d_1(p, q) = \left\| p - q \right\|_1 = \sum_{i=1}^{n} |p_i - q_i| $$

where \( p \) and \( q \) are vectors.
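For comparison, a minimal C# sketch of the taxicab distance, in the style of the Euclidean sketch in Sect. 3 (it would sit in the same Distance class):

  // Taxicab (L1) distance between two n-dimensional vectors; also O(n).
  public static double Taxicab(double[] p, double[] q)
  {
      double sum = 0.0;
      for (int i = 0; i < p.Length; i++)
          sum += Math.Abs(p[i] - q[i]);
      return sum;
  }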

The next figure shows the difference between the two algorithms (Euclidean distance and Taxicab distance). The green line represents the Euclidean path and is the unique, shortest path between the two points. The red, blue, and yellow paths represent variations of taxicab geometry and are, indeed, not unique (Fig. 4).

Fig. 4. Euclidean distance vs Taxicab distance (Color figure online)

5 A* Search Algorithm

In the field of heuristic search algorithms, the widely applied A* algorithm is a graph search algorithm that uses an evaluation function to order the nodes [9]. The basic idea of this algorithm is to avoid expanding paths that are already expensive. The algorithm relies mainly on an evaluation function [4].

The distance-plus-cost heuristic is a sum of two functions:

$$ f(x) = g(x) + h(x) $$

  • the path-cost function \( g(x) \), which is the cost from the starting node to the current node, and

  • an admissible “heuristic estimate” \( h(x) \) of the distance from the current node to the goal.

The \( h(x) \) part of the \( f(x) \) function must be an admissible heuristic; that is, it must not overestimate the distance to the goal. Thus, for an application like routing, \( h(x) \) might represent the straight-line distance to the goal, since that is physically the smallest possible distance between any two points or nodes.

Basically, A* search considers the path costs in order to calculate the shortest path, where \( g(x) \) is the path cost from the initial node to node x and \( h(x) \) is the estimated cost from x to the goal (Fig. 5).

Fig. 5. Pseudocode for the A* algorithm
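A minimal, self-contained C# sketch of the textbook A* algorithm described above; string node identifiers and the neighbors/heuristic delegates are illustrative choices, not taken from Fig. 5:

  using System;
  using System.Collections.Generic;

  static class AStar
  {
      // Returns the least-cost path from start to goal, or null if no path exists.
      // neighbors: node -> (neighbor, edge cost); heuristic: admissible h(x).
      public static List<string> FindPath(
          string start,
          string goal,
          Func<string, IEnumerable<KeyValuePair<string, double>>> neighbors,
          Func<string, double> heuristic)
      {
          var open = new HashSet<string>();
          var cameFrom = new Dictionary<string, string>();
          var g = new Dictionary<string, double>(); // g(x): cost from the start node
          var f = new Dictionary<string, double>(); // f(x) = g(x) + h(x)

          open.Add(start);
          g[start] = 0.0;
          f[start] = heuristic(start);

          while (open.Count > 0)
          {
              // Expand the open node with the lowest f(x); a linear scan keeps the sketch simple.
              string current = null;
              foreach (string n in open)
                  if (current == null || f[n] < f[current])
                      current = n;

              if (current == goal)
                  return Reconstruct(cameFrom, current);
              open.Remove(current);

              foreach (var edge in neighbors(current))
              {
                  double tentative = g[current] + edge.Value;
                  if (!g.ContainsKey(edge.Key) || tentative < g[edge.Key])
                  {
                      // This path to the neighbor is the best one seen so far.
                      cameFrom[edge.Key] = current;
                      g[edge.Key] = tentative;
                      f[edge.Key] = tentative + heuristic(edge.Key);
                      open.Add(edge.Key);
                  }
              }
          }
          return null; // the goal is unreachable
      }

      static List<string> Reconstruct(Dictionary<string, string> cameFrom, string current)
      {
          var path = new List<string> { current };
          string parent;
          while (cameFrom.TryGetValue(current, out parent))
          {
              path.Insert(0, parent);
              current = parent;
          }
          return path;
      }
  }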

6 Conclusion

As the authors have explained throughout the paper, both the Euclidean distance and the A* search algorithm are noticeably more optimal and efficient than the Taxicab distance calculation in terms of finding the shortest path. Yet, the implementation of the Euclidean distance is somewhat hard because of its complexity. The authors therefore propose a hybrid approach that combines the novel features of both algorithms, as it will enable a more accurate, optimal, and cost-effective mechanism to calculate the shortest path.