
1 Introduction

The search functionality of SharePoint is very robust and brings a Google- or Bing-style experience to corporate documents and other content. It combines many components, complex search algorithms, and path-finding algorithms to ultimately produce quality search result sets.

SharePoint's search algorithm takes many factors into account when ranking search results and finding content for display. Some of these factors are inherent in the content being searched or determined by the way users select and reference content; developers cannot control them. Other SharePoint search ranking elements, however, can be influenced by the user.

The A* algorithm, on the other hand, is a computer algorithm that is widely used in path finding and graph traversal, the process of plotting an efficiently traversable path between points, called nodes. A* uses a best-first search and finds the least-cost path from a given initial node to one goal node (out of one or more possible goals). It uses a distance-plus-cost heuristic function f(x) to determine the order in which the search visits nodes in the tree.

It is possible to compute the shortest path from one location to another, and the distance between them, in many different ways. For example, addresses can be queried against SQL Server (if the correct data is available), or they can be used in conjunction with the Bing geocode services. A custom web part with logic to query one of those services with the user's or object's current location and all list items with location information takes little time to build, but performance issues can pop up in no time.

How is the performance with 200 items in a list? With 2,000? 20,000? Maybe 200,000? It is certainly possible to imagine smart solutions for sending 200,000 locations to the geocode service and receiving them back, yet extracting that information from a SharePoint list is not an easy task; it takes quite some time. It gets even harder when the data comes from several lists, let alone from several site collections, from external data, or from location information that resides inside documents. This is where efficient path-finding algorithms are required.

Throughout this paper, the authors discuss how to use SharePoint FAST Search and A* to find geolocation, and which of the two is most suitable.

2 Scenario - SharePoint Fast Search to Find Geolocation

SharePoint FAST Search is a very powerful search engine that can be customized in various ways. First of all, FAST Search can index all of the information that lives inside SharePoint or outside of it (using, for example, the Business Connectivity Service or a custom connector). Whenever a query is executed on a certain keyword, all indexed data (subject to security) can be checked against that keyword.

Second comes the ability to enrich the index with extra information. The source for this information can be existing metadata from site columns (address, city), data from inside the document, or data already extracted using the entity extractor. These sources can be used to query the geocode service to retrieve the corresponding spatial data, which can then be added to the FAST index. This metadata can be used to query the index and to determine distances to items within it.

2.1 Indexing the Data

FAST crawls data and puts it into the index. Web crawlers, also known as spiders, are used to crawl through the hundreds of millions of Web pages that exist, in order to gather the information (Fig. 1).

Fig. 1. Crawling site collections

According to [6, 8], one of the processes that happens during indexing is content processing. During this process, the data is traversed through a “pipeline”, which consists of several stages:

Format Conversion → Language Detection → Lemmatize → Tokenizer → Entity Extraction → Vectorizer → Web Analyzer → Properties Mapper

The stages shown above are just a small subset of the entire process; basically, the pipeline does the following:

  1. Normalize the document - the data of each input is normalized, so that no stage has problems processing the content.

  2. Language detection - determine the language of the document. This metadata is used in other stages to, for example, determine which dictionary should be used.

  3. Lemmatizer - based on the detected language, the lemmas and stems of words are determined.

  4. Entity extraction - extracts entities, based on a dictionary, from the data being processed. Out of the box, locations and persons are extracted.

However, this pipeline can be extended based on the requirements. For this scenario, it can be extended to identify the latitude and longitude.

2.2 Extending the Pipeline

As the authors want to work with spatial data, a custom pipeline stage needs to be created, which takes location data as input and outputs spatial data: latitude and longitude. All these properties need to be crawled properties.

In order to create the pipeline, it is essential to know what data is available, where it resides, and how it can be processed. This data must be available through crawled properties (Fig. 2).

Fig. 2. Extend pipeline
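In FAST Search Server 2010 for SharePoint, the pipeline is typically extended by registering an external executable in pipelineextensibility.xml; the executable is invoked per item with an input and an output XML file of crawled properties. What follows is a minimal C# sketch of such a spatial stage; the GeocodeAddress helper, the property-set GUID, and the exact crawled-property names are illustrative assumptions, not values prescribed by the product.

  using System;
  using System.Xml.Linq;

  // Minimal sketch of a pipeline extensibility stage. FAST calls it as:
  //   SpatialStage.exe %(input)s %(output)s   (configured in pipelineextensibility.xml)
  class SpatialStage
  {
      static int Main(string[] args)
      {
          // args[0] = path to the input XML, args[1] = path to the output XML.
          XDocument input = XDocument.Load(args[0]);
          XElement addressProperty = input.Root.Element("CrawledProperty");
          if (addressProperty == null || string.IsNullOrEmpty(addressProperty.Value))
              return 0; // nothing to geocode for this item

          double latitude, longitude;
          GeocodeAddress(addressProperty.Value, out latitude, out longitude);

          // Emit latitude and longitude as crawled properties (varType 5 = VT_R8, double).
          new XDocument(
              new XElement("Document",
                  SpatialProperty("latitude", latitude),
                  SpatialProperty("longitude", longitude))).Save(args[1]);
          return 0;
      }

      static XElement SpatialProperty(string name, double value)
      {
          return new XElement("CrawledProperty",
              new XAttribute("propertySet", "11111111-2222-3333-4444-555555555555"), // assumed custom property set
              new XAttribute("propertyName", name),
              new XAttribute("varType", 5),
              value);
      }

      static void GeocodeAddress(string address, out double lat, out double lon)
      {
          // Placeholder: call a geocoding service (for example, Bing Maps) here.
          lat = 0.0;
          lon = 0.0;
      }
  }

The emitted latitude and longitude crawled properties can then be mapped to managed properties so that they become queryable and sortable.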

2.3 SharePoint Fast Search Query to Find Location

After the custom pipeline extension has been implemented and all the data has been re-indexed, the index is enriched with latitude and longitude information. This information can be used for some interesting queries and some interesting sorting algorithms.

When working with spatial data, there are different approaches that can be used to retrieve the nearest locations and sort them. There is, however, one caveat to take care of when a custom sorting formula is used. For the current scenario, the authors query directly against the FAST Query Service Application using the code below, which also returns a set of three managed properties: title, latitude, and longitude.


The code gets the search service application proxy that will be used and instantiates a new KeywordQuery object.
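A minimal sketch of such a query, assuming the SharePoint 2010 server object model with FAST Search (the row limit is an illustrative choice):

  using Microsoft.Office.Server.Search.Administration;
  using Microsoft.Office.Server.Search.Query;
  using Microsoft.SharePoint;

  // Inside, for example, a web part running in the SharePoint farm:
  // get the Search Service Application proxy for the current site context
  // and instantiate a KeywordQuery against the FAST provider.
  SearchServiceApplicationProxy proxy =
      (SearchServiceApplicationProxy)SearchServiceApplicationProxy.GetProxy(
          SPServiceContext.GetContext(SPContext.Current.Site));

  KeywordQuery query = new KeywordQuery(proxy);
  query.ResultsProvider = SearchProvider.FASTSearch; // route the query to FAST
  query.QueryText = "#";  // per Sect. 2.4, "#" retrieves all items
  query.RowLimit = 50;    // illustrative
  query.SelectProperties.Clear();
  query.SelectProperties.Add("title");
  query.SelectProperties.Add("latitude");
  query.SelectProperties.Add("longitude");

  ResultTableCollection results = query.Execute();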

2.4 Retrieve All Results and Sort Them by Distance to a Certain Point

This query is easy to execute, as the query “#” will retrieve all items. But when the managed properties are used in a sorting formula, things change: as the managed properties are of type decimal, they are handled differently, as described in this paper. For the sort formula, different algorithms can be used; two popular algorithms, sketched in code after the list below, are the following:

  1. Euclidean distance: the shortest, unique distance between two points.

  2. Taxicab distance: the sum of the absolute differences of the two points’ coordinates. The path between the two points does not have to be unique.
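Either distance can be pushed into the sort specification of the query from Sect. 2.3. This is a hedged sketch, assuming the FAST sort-formula syntax with FQL enabled and sortable decimal managed properties; the reference point (latitude 50.85, longitude 4.35) is an illustrative assumption:

  // Sort all results by Euclidean distance to a fixed reference point
  // (sketch; assumes the FAST sort-formula syntax is available).
  query.EnableFQL = true; // sort formulas require FQL
  query.SortList.Add(
      "[formula:sqrt(pow(latitude-50.85,2)+pow(longitude-4.35,2))]",
      SortDirection.Ascending);

  // Taxicab variant: the sum of the absolute coordinate differences.
  // query.SortList.Add(
  //     "[formula:abs(latitude-50.85)+abs(longitude-4.35)]",
  //     SortDirection.Ascending);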

3 Euclidean Distance

Joo Ghee Lim and S. V. Rao described a grid-based location estimation scheme. The scheme uses hop counts between the nodes and the markers to determine a location in a square monitoring region. The scheme has several features, such as fast localization, energy saving, and strong robustness [3]. However, the location accuracy of the scheme above is unstable in simulation because of multi-solution and no-solution cases. The details are as follows [1, 7].

The grid-based location estimation scheme is improved by a distributed grid location estimation scheme based on the Euclidean distance; its location accuracy is higher than that of the Taxicab distance, and it is able to solve the shortest path problem (Fig. 3).

Fig. 3. Location based on grid extraction

Euclidean distance [2] is generally used as a measure function, with computational complexity O(d). It can be described as follows:

$$ D(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} $$

where \( x \) and \( y \) are d-dimensional vectors and \( D(x, y) \) is the distance between them. Considering the balance between accuracy and computational complexity, the authors contend that the Euclidean distance is better than any other measure function in this scheme (this is demonstrated in a later part of this paper).
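As an illustration, the formula maps directly to code; a minimal C# sketch:

  using System;

  static class Distance
  {
      // Euclidean distance D(x, y) between two d-dimensional vectors; O(d).
      public static double Euclidean(double[] x, double[] y)
      {
          double sum = 0.0;
          for (int i = 0; i < x.Length; i++)
              sum += (x[i] - y[i]) * (x[i] - y[i]);
          return Math.Sqrt(sum);
      }
  }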

4 Taxicab Geometry

Taxicab Geometry is a form of geometry in which the usual distance function or metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates.

The taxicab distance, \( d_{1} \), between two vectors \( p,q \) in an n-dimensional real vector space with fixed Cartesian coordinate system, is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes. More formally,

$$ d_1(p, q) = \left\| p - q \right\|_1 = \sum_{i=1}^{n} |p_i - q_i| $$

where \( p \) and \( q \) are vectors.
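For comparison, a minimal C# sketch of the taxicab distance, in the style of the Euclidean sketch in Sect. 3 (it would sit in the same Distance class):

  // Taxicab (L1) distance between two n-dimensional vectors; also O(n).
  public static double Taxicab(double[] p, double[] q)
  {
      double sum = 0.0;
      for (int i = 0; i < p.Length; i++)
          sum += Math.Abs(p[i] - q[i]);
      return sum;
  }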

The next figure shows the difference between the two algorithms (Euclidean distance and Taxicab distance). The green line represents the Euclidean path and is the unique, shortest path between the two points. The red, blue, and yellow paths represent variations of taxicab geometry and are, indeed, not unique (Fig. 4).

Fig. 4. Euclidean distance vs Taxicab distance (Color figure online)

5 A* Search Algorithm

In the field of heuristic search algorithms, the widely applied A* algorithm is a graph search algorithm that uses an evaluation function to order the nodes [9]. The basic idea of this algorithm is to avoid expanding paths that are already expensive. The algorithm relies mainly on an evaluation function [4].

The distance-plus-cost heuristic is a sum of two functions:

$$ f(x) = g(x) + h(x) $$

  • the path-cost function \( g(x) \), which is the cost from the starting node to the current node, and

  • an admissible “heuristic estimate” \( h(x) \) of the distance from the current node to the goal.

The \( h(x) \) part of the \( f(x) \) function must be an admissible heuristic; that is, it must not overestimate the distance to the goal. Thus, for an application like routing, \( h(x) \) might represent the straight-line distance to the goal, since that is physically the smallest possible distance between any two points or nodes.

Basically, A* search considers the path costs in order to calculate the shortest path, where \( g(x) \) is the path cost from the initial node to node x and \( h(x) \) is the estimated cost from x to the goal (Fig. 5).

Fig. 5. Pseudocode for the A* algorithm
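A minimal, self-contained C# sketch of the textbook A* algorithm described above; string node identifiers and the neighbors/heuristic delegates are illustrative choices, not taken from Fig. 5:

  using System;
  using System.Collections.Generic;

  static class AStar
  {
      // Returns the least-cost path from start to goal, or null if no path exists.
      // neighbors: node -> (neighbor, edge cost); heuristic: admissible h(x).
      public static List<string> FindPath(
          string start,
          string goal,
          Func<string, IEnumerable<KeyValuePair<string, double>>> neighbors,
          Func<string, double> heuristic)
      {
          var open = new HashSet<string>();
          var cameFrom = new Dictionary<string, string>();
          var g = new Dictionary<string, double>(); // g(x): cost from the start node
          var f = new Dictionary<string, double>(); // f(x) = g(x) + h(x)

          open.Add(start);
          g[start] = 0.0;
          f[start] = heuristic(start);

          while (open.Count > 0)
          {
              // Expand the open node with the lowest f(x); a linear scan keeps the sketch simple.
              string current = null;
              foreach (string n in open)
                  if (current == null || f[n] < f[current])
                      current = n;

              if (current == goal)
                  return Reconstruct(cameFrom, current);
              open.Remove(current);

              foreach (var edge in neighbors(current))
              {
                  double tentative = g[current] + edge.Value;
                  if (!g.ContainsKey(edge.Key) || tentative < g[edge.Key])
                  {
                      // This path to the neighbor is the best one seen so far.
                      cameFrom[edge.Key] = current;
                      g[edge.Key] = tentative;
                      f[edge.Key] = tentative + heuristic(edge.Key);
                      open.Add(edge.Key);
                  }
              }
          }
          return null; // the goal is unreachable
      }

      static List<string> Reconstruct(Dictionary<string, string> cameFrom, string current)
      {
          var path = new List<string> { current };
          string parent;
          while (cameFrom.TryGetValue(current, out parent))
          {
              path.Insert(0, parent);
              current = parent;
          }
          return path;
      }
  }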

6 Conclusion

As the authors have explained throughout the paper, both the Euclidean distance and the A* search algorithm are noticeably more optimal and efficient than the Taxicab distance calculation in terms of finding the shortest path. Yet, the implementation of the Euclidean distance is somewhat hard because of its complexity. The authors therefore propose a hybrid approach that combines the novel features of both algorithms, as it will enable a more accurate, optimal, and cost-effective mechanism to calculate the shortest path.