
Two-dimensional indexing to provide one-integrated-memory view of distributed memory for a massively-parallel search engine

Published in: World Wide Web

A Correction to this article was published on 11 April 2019


Abstract

We propose two-dimensional indexing, a novel in-memory indexing architecture that operates over the distributed memory of a massively-parallel search engine. The goal of two-dimensional indexing is to provide a one-integrated-memory view, as in a single-node system using one large integrated memory. In two-dimensional indexing, we partition the entire index into n × m fragments and distribute them over the memories of multiple nodes in such a way that each fragment is stored entirely in the main memory of one node. The proposed architecture is not only scalable, since it uses a scaled-out shared-nothing architecture, but also achieves low query response time, since it processes queries in main memory. We also propose the concept of the one-memory point, which is the amount of memory space required to store the entire index completely in main memory while providing a one-integrated-memory view. We first demonstrate the effectiveness of two-dimensional indexing with single-keyword queries and then extend the notion to handle multiple-keyword queries. To handle multiple-keyword queries, we adopt pre-join, which materializes the results of a multiple-keyword query a priori, as well as a new notion of semi-memory join, which obviates the extensive communication overhead of performing joins across multiple nodes. In experiments using a real-life search query set over a database of 100 million crawled Web documents, we show that two-dimensional indexing effectively provides a one-integrated-memory view without requiring much additional memory compared with a single-node system using one large integrated memory. We also show that, with a six-node prototype, in an ideal case, it significantly improves query processing performance over a disk-based search engine with an equivalent amount of in-memory buffer but without two-dimensional indexing, by up to 535.54 times. This improvement is expected to grow as the system is scaled out with a larger number of machines.
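The n × m partitioning described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes documents are assigned to n shards by a simple modulo hash and each shard's index is split into m fragments by keyword range, with hypothetical boundary keywords.

```python
from bisect import bisect_right

N_SHARDS = 3      # document-axis partitions (index shards)
M_FRAGMENTS = 2   # keyword-axis fragments per shard

# hypothetical boundary keywords: fragment j holds keywords that sort
# before BOUNDARIES[j]; the last fragment holds the remaining keywords
BOUNDARIES = ["m"]            # len == M_FRAGMENTS - 1

def fragment_of(doc_id: int, keyword: str) -> tuple:
    """Map a (document, keyword) posting to its (shard, fragment) cell
    in the n x m grid; each cell is small enough to be pinned entirely
    in one node's main memory."""
    shard = doc_id % N_SHARDS                 # document axis
    frag = bisect_right(BOUNDARIES, keyword)  # keyword axis
    return shard, frag

print(fragment_of(12345, "apple"))  # -> (0, 0)
print(fragment_of(7, "zebra"))      # -> (1, 1)
```

Every posting thus lands in exactly one of the n × m fragments, which is what allows the system as a whole to present a one-integrated-memory view.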




Change history

  • 11 April 2019

    After our paper was published online, the authors learned of the existence of the paper by Feuerstein et al. [Feuerstein 2009].

Notes

  1. Since ODYS/2D-Indexing is a DBMS-based search engine supporting SQL, it naturally supports arbitrary queries including various operators such as AND, OR, and NOT.

  2. Many research studies [5, 28] focus on the index search time without including the document retrieval time.

  3. For simplicity, we assume that all slave machines have the same amount of main memory. However, in the case where slave machines have different amounts of main memory, we can partition the index shard in proportion to the amount of main memory each slave has.

  4. The number of LongFixed pages is limited to 100% of the size of the index fragment. We allocate an additional 2 GB of pages for the ordinary buffer without LongFix. Any page loaded after the LongFixed buffer is full is treated as an ordinary buffer page without LongFix; thus, it may be swapped out later if necessary.

  5. The disk space required for duplication is proportional to the number of fragments (m) used in the column. These replicas serve as replication servers for fault tolerance as well as for the semi-memory join. That is, even if a failure occurs on one machine in the column, the shard master can process any query on disk by adjusting the keyword range. We note that it is common for large-scale search engines to add machines in parallel as replication servers to handle large volumes of search queries; we exploit those redundant replication servers for two-dimensional indexing.

  6. The shard master initializes it by executing the query “SELECT keyword, nPostings FROM 〈inverted_index_name〉”, which reads the DFs of all posting lists (shown in Figure 3 in Section 2.2) in the slave database.

  7. The AOL (America Online) search query set is a collection of 35,020,000 search queries collected from 650,000 users over 3 months. The portion of single-keyword queries is 36.43%. For other specific application domains, a similar query set can be collected over a period of time.

  8. Specifically, we tested the maximum number of search results that Google provided and observed that it provided up to the top-500 search results.

  9. The goal of the experiment is to show the net effect of applying two-dimensional indexing, not to provide a comprehensive performance comparison of parallel search engines. Therefore, we used the same search engine (i.e., ODYS) to test the net effect of using two-dimensional indexing.

  10. Dynamic update of the multiple-keyword query set in the buffer is not incorporated in the experiments to obviate the need to recover the database to the initial state for a number of repeated experiments, which takes significant time. It takes only tens of milliseconds of additional time to dynamically update a multiple-keyword query in the in-memory MKSet as we have shown in Table 6.

  11. In ODYS [32], the authors show that a massively-parallel search engine can be built using a DBMS tightly integrated with IR features (Odysseus).

  12. The size is much smaller (maximum 51.74 GB) than those of the entire index (459.75 GB) and the buffer (192 GB).

  13. We use top-500 results to allow for additional query-dependent rankings (e.g., TF-IDF).

  14. Since we let every machine have the same amount of main memory, the total required memory space of ODYS/2D-Indexing (LB) is calculated as 93.7 GB × 6 = 562.2 GB, while that of the One-Memory-System is 459.7 GB.

  15. The values are calculated using multiple-keyword queries only. If we also consider single-keyword queries, which always achieve a 100% hit ratio, the overall hit ratio becomes 92.49%.

  16. We note that Google typically achieves a hit ratio of 70 to 90% using the Google Global Cache (GGC) [11].
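Note 6 above describes the shard master reading (keyword, nPostings) pairs from each slave. The sketch below shows how such statistics could be used to choose keyword-range boundaries that split a shard's index into m fragments of roughly equal posting volume; the function name and the greedy strategy are our own illustration, not necessarily the paper's algorithm.

```python
def split_keyword_ranges(stats, m):
    """Greedily choose m-1 boundary keywords so that the m contiguous
    keyword ranges carry roughly equal total posting counts.

    `stats` is a list of (keyword, nPostings) pairs sorted by keyword,
    as would be returned by the SELECT query described in note 6."""
    total = sum(n for _, n in stats)
    target = total / m
    boundaries, acc, cut = [], 0, 1
    for keyword, n in stats:
        acc += n
        # close the current range once its cumulative share is reached
        if acc >= cut * target and len(boundaries) < m - 1:
            boundaries.append(keyword)
            cut += 1
    return boundaries

# four keywords with equal posting counts split into two ranges at "b"
print(split_keyword_ranges([("a", 10), ("b", 10), ("c", 10), ("d", 10)], 2))
# -> ['b']
```

Balancing on posting counts rather than keyword counts keeps each fragment's memory footprint comparable, which matters because each fragment must fit entirely in one node's main memory.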

References

  1. Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., Silvestri, F.: The impact of caching on search engines. In: Proceedings of the 30th Int’l Conference on Information Retrieval (SIGIR), pp. 183–190 (2007)

  2. Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., Silvestri, F.: Design trade-offs for search engine caching. ACM Transactions on the Web (TWEB) 2(4), 1–28 (2008)


  3. Bernstein, P., Chiu, D.: Using semi-joins to solve relational queries. J. ACM 28(1), 25–40 (1981)


  4. Ceccarelli, D., Lucchese, C., Orlando, S., Perego, R., Silvestri, F.: Caching query-biased snippets for efficient retrieval. In: Proceedings of the 14th Int’l Conference on Extending Database Technology (EDBT), pp. 93–104 (2011)

  5. Culpepper, J., Petri, M., Scholer, F.: Efficient in-memory top-k document retrieval. In: Proceedings of the 35th Int’l Conference on Information Retrieval (SIGIR), pp. 225–234 (2012)

  6. Cutting, D., Pedersen, J.: Optimization for dynamic inverted index maintenance. In: Proceedings of the 13th ACM Int’l Conference on Information Retrieval (SIGIR), pp. 405–411 (1990)

  7. Dean, J.: Building Software Systems at Google and Lessons Learned, Stanford Computer Science Department Distinguished Computer Scientist Lecture, Nov. 2010. (presentation slides available at http://research.google.com/people/jeff/Stanford-DL-Nov-2010.pdf)

  8. Fagni, T., Perego, R., Silvestri, F., Orlando, S.: Boosting the performance of web search engines caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems (TOIS) 24(1), 51–78 (2006)


  9. Färber, F., et al.: The SAP HANA database - an architecture overview. IEEE Data Eng. Bull. 35(1), 28–33 (2012)


  10. Gan, Q., Suel, T.: Improved techniques for result caching in web search engines. In: Proceedings of the 18th Int’l Conference on World Wide Web (WWW), pp. 431–440 (2009)

  11. Google Global Cache: https://peering.google.com/about/ggc.html, https://peering.google.com/about/faq.html (referenced in Jan. 2016)

  12. IBM WebSphere eXtreme Scale: http://www.ibm.com/software/products/en/websphere-extreme-scale (referenced in Jan. 2018)

  13. Internet Live Stats: http://www.internetlivestats.com/google-search-statistics (referenced in Jan. 2018)

  14. Jung, B., Omiecinski, E.: Inverted file partitioning schemes in multiple disk systems. IEEE Trans. Parallel Distributed Syst. 6(2), 142–153 (1995)


  15. Kunder, M.: http://www.worldwidewebsize.com (referenced in Jan. 2018)

  16. Markatos, E.: On caching search engine query results. Comput. Commun. 24(2), 137–143 (2001)


  17. Memcached - A Distributed Memory Object Caching System, http://memcached.org

  18. Oracle Coherence, http://www.oracle.com/technetwork/middleware/coherence

  19. Ousterhout, J., et al.: The case for RAMClouds: scalable high-performance storage entirely in DRAM. In: ACM SIGOPS Operating Systems Review, vol. 43, pp. 92–105 (2010)


  20. Ozcan, R., Altingovde, I., Ulusoy, Ö.: Static query result caching revisited. In: Proceedings of the 17th Int’l Conference on World Wide Web (WWW), pp. 1169–1170 (2008)

  21. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the Web, technical report(SIDL-WP-1999-0120) Stanford University (1999)

  22. Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. In: Proceedings of the 1st ACM Int’l Conference on Scalable Information Systems, Article No. 1 (2006)

  23. Protic, J., Tomasevic, M., Milutinović, V.: Distributed Shared Memory: Concepts and Systems. Wiley, New York (1998)


  24. Samsung Semiconductor, http://www.samsung.com/semiconductor/global/file/insight/2015/08/DDR4_Brochure_July2015-0.pdf

  25. Seagate, http://www.seagate.com/internal-hard-drives/desktop-hard-drives/desktop-hdd/#specs

  26. Skobeltsyn, G., Junqueira, F., Plachouras, V., Baeza-Yates, R.: ResIn: a combination of results caching and index pruning for high-performance Web search engines. In: Proceedings of the 31st Int’l Conference on Information Retrieval (SIGIR), pp. 131–138 (2008)

  27. Stonebraker, M., Weisberg, A.: The voltDB main memory DBMS. IEEE Data Eng. Bull. 36(2), 21–27 (2013)


  28. Strohman, T., Croft, W.: Efficient document retrieval in main memory. In: Proceedings of the 30th Int’l Conference on Information Retrieval (SIGIR), pp. 175–182 (2007)

  29. Turpin, A., Tsegay, Y., Hawking, D., Williams, H.: Fast generation of result snippets in web search. In: Proceedings of the 30th Int’l Conference on Information Retrieval (SIGIR), pp. 127–134 (2007)

  30. Whang, K., Park, B., Han, W., Lee, Y.: An inverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems, U.S. Patent no. 6,349,308, Feb. 19, 2002, Application No. 09/250,487 (1999)

  31. Whang, K., Lee, M., Lee, J., Han, W.: Odysseus: a high-performance ORDBMS tightly-coupled with IR features. In: Proceedings of the 21st Int’l Conference on Data Engineering (ICDE), pp. 1104–1105 (2005)

  32. Whang, K., Yun, T., Yeo, Y., Song, I., Kwon, H., Kim, I.: ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality. In: Proceedings of the 2013 ACM Int’l Conference on Management of Data (SIGMOD), pp. 313–324 (2013)

  33. Whang, K., Lee, J., Lee, M., Han, W., Kim, M., Kim, J.: DB-IR Integration using tight-coupling in the Odysseus DBMS. The World Wide Web J 18 (3), 491–520 (2015)


  34. Xin, R., Xin, R., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I., et al.: Shark: SQL and rich analytics at scale. In: Proceedings of the 2013 ACM Int’l Conference on Management of Data (SIGMOD), pp. 13–24 (2013)

  35. Zaharia, M.: An architecture for fast and general data processing on large clusters, PhD Dissertation, University of California, Berkeley (2013)


Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (No. 2016R1A2B4015929).

Author information


Correspondence to Kyu-Young Whang.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yun, TS., Whang, KY., Kwon, HY. et al. Two-dimensional indexing to provide one-integrated-memory view of distributed memory for a massively-parallel search engine. World Wide Web 22, 2437–2467 (2019). https://doi.org/10.1007/s11280-018-0647-1

