
Two-dimensional indexing to provide one-integrated-memory view of distributed memory for a massively-parallel search engine

Published in: World Wide Web

A Correction to this article was published on 11 April 2019


Abstract

We propose two-dimensional indexing, a novel in-memory indexing architecture that operates over the distributed memory of a massively-parallel search engine. The goal of two-dimensional indexing is to provide a one-integrated-memory view, as in a single-node system using one large integrated memory. In two-dimensional indexing, we partition the entire index into n × m fragments and distribute them over the memories of multiple nodes in such a way that each fragment is stored entirely in the main memory of one node. The proposed architecture is not only scalable, since it uses a scaled-out shared-nothing architecture, but also achieves low query response time, since it processes queries in main memory. We also propose the concept of the one-memory point, which is the amount of memory space required to store the entire index completely in main memory while providing a one-integrated-memory view. We first demonstrate the effectiveness of two-dimensional indexing with single-keyword queries and then extend the notion to handle multiple-keyword queries. To handle multiple-keyword queries, we adopt pre-join, which materializes the results of a multiple-keyword query a priori, as well as a new notion of semi-memory join, which obviates the extensive communication overhead of performing joins across multiple nodes. In experiments using a real-life search query set over a database of 100 million crawled Web documents, we show that two-dimensional indexing effectively provides a one-integrated-memory view without requiring much additional memory compared with a single-node system using one large integrated memory. We also show that, with a six-node prototype, in an ideal case, it significantly improves query processing performance over a disk-based search engine with an equivalent amount of in-memory buffer but without two-dimensional indexing, by up to 535.54 times. This improvement is expected to grow as the system is scaled out with a larger number of machines.
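The n × m partitioning described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes documents are assigned to n shards by a simple modulo hash and each shard's index is split into m fragments by keyword range, with hypothetical boundary keywords.

```python
from bisect import bisect_right

N_SHARDS = 3      # document-axis partitions (index shards)
M_FRAGMENTS = 2   # keyword-axis fragments per shard

# hypothetical boundary keywords: fragment j holds keywords that sort
# before BOUNDARIES[j]; the last fragment holds the remaining keywords
BOUNDARIES = ["m"]            # len == M_FRAGMENTS - 1

def fragment_of(doc_id: int, keyword: str) -> tuple:
    """Map a (document, keyword) posting to its (shard, fragment) cell
    in the n x m grid; each cell is small enough to be pinned entirely
    in one node's main memory."""
    shard = doc_id % N_SHARDS                 # document axis
    frag = bisect_right(BOUNDARIES, keyword)  # keyword axis
    return shard, frag

print(fragment_of(12345, "apple"))  # -> (0, 0)
print(fragment_of(7, "zebra"))      # -> (1, 1)
```

Every posting thus lands in exactly one of the n × m fragments, which is what allows the system as a whole to present a one-integrated-memory view.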




Change history

  • 11 April 2019

    After our paper was published online, the authors learned of the existence of the paper by Feuerstein et al. [Feuerstein 2009].

Notes

  1. Since ODYS/2D-Indexing is a DBMS-based search engine supporting SQL, it naturally supports arbitrary queries including various operators such as AND, OR, and NOT.

  2. Many research studies [5, 28] focus on the index search time without including the document retrieval time.

  3. For simplicity, we assume that all slave machines have the same amount of main memory. However, in the case where slave machines have different amounts of main memory, we can partition the index shard in proportion to the amount of main memory each slave has.

  4. The number of LongFixed pages is limited to 100% of the size of the index fragment. We allocate an additional 2 GB of pages for the ordinary buffer without LongFix. Any page loaded after the LongFixed buffer is full is treated as an ordinary buffer page without LongFix; thus, it may be swapped out later if necessary.

  5. The disk space required for duplication is proportional to the number of fragments (m) used in the column. These replicas serve as replication servers for fault tolerance as well as for the semi-memory join. That is, even if a failure occurs on one machine in the column, the shard master can process any query on disk by adjusting the keyword range. We note that it is common for large-scale search engines to add machines in parallel as replication servers to handle large volumes of search queries; we exploit those redundant replication servers for two-dimensional indexing.

  6. The shard master initializes it by executing the query “SELECT keyword, nPostings FROM 〈inverted_index_name〉”, which reads the DFs of all posting lists (shown in Figure 3 in Section 2.2) in the slave database.

  7. The AOL (America Online) search query set is a collection of 35,020,000 search queries collected from 650,000 users over 3 months. The portion of single-keyword queries is 36.43%. For other specific application domains, a similar query set can be collected over a period of time.

  8. Specifically, we tested the maximum number of search results that Google provided and observed that it provided up to the top-500 search results.

  9. The goal of the experiment is to show the net effect of applying two-dimensional indexing, not to provide a comprehensive performance comparison of parallel search engines. Therefore, we used the same search engine (i.e., ODYS) to test the net effect of using two-dimensional indexing.

  10. Dynamic update of the multiple-keyword query set in the buffer is not incorporated in the experiments to obviate the need to recover the database to the initial state for a number of repeated experiments, which takes significant time. It takes only tens of milliseconds of additional time to dynamically update a multiple-keyword query in the in-memory MKSet as we have shown in Table 6.

  11. In ODYS [32], the authors show that a massively-parallel search engine can be built using a DBMS tightly integrated with IR features (Odysseus).

  12. The size is much smaller (maximum 51.74 GB) than those of the entire index (459.75 GB) and the buffer (192 GB).

  13. We use top-500 results to allow for additional query-dependent rankings (e.g., TF-IDF).

  14. Since we let every machine have the same amount of main memory, the total required memory space of ODYS/2D-Indexing (LB) is calculated as 93.7 GB × 6 = 562.2 GB, while that of the One-Memory-System is 459.7 GB.

  15. The values are calculated using multiple-keyword queries only. If we also consider single-keyword queries, which always achieve a 100% hit ratio, the overall hit ratio becomes 92.49%.

  16. We note that Google typically achieves a hit ratio of 70 to 90% using the Google Global Cache (GGC) [11].
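Note 6 above describes the shard master reading (keyword, nPostings) pairs from each slave. The sketch below shows how such statistics could be used to choose keyword-range boundaries that split a shard's index into m fragments of roughly equal posting volume; the function name and the greedy strategy are our own illustration, not necessarily the paper's algorithm.

```python
def split_keyword_ranges(stats, m):
    """Greedily choose m-1 boundary keywords so that the m contiguous
    keyword ranges carry roughly equal total posting counts.

    `stats` is a list of (keyword, nPostings) pairs sorted by keyword,
    as would be returned by the SELECT query described in note 6."""
    total = sum(n for _, n in stats)
    target = total / m
    boundaries, acc, cut = [], 0, 1
    for keyword, n in stats:
        acc += n
        # close the current range once its cumulative share is reached
        if acc >= cut * target and len(boundaries) < m - 1:
            boundaries.append(keyword)
            cut += 1
    return boundaries

# four keywords with equal posting counts split into two ranges at "b"
print(split_keyword_ranges([("a", 10), ("b", 10), ("c", 10), ("d", 10)], 2))
# -> ['b']
```

Balancing on posting counts rather than keyword counts keeps each fragment's memory footprint comparable, which matters because each fragment must fit entirely in one node's main memory.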

References

  1. Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., Silvestri, F.: The impact of caching on search engines. In: Proceedings of the 30th Int’l Conference on Information Retrieval (SIGIR), pp. 183–190 (2007)

  2. Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., Silvestri, F.: Design trade-offs for search engine caching. ACM Transactions on the Web (TWEB) 2(4), 1–28 (2008)


  3. Bernstein, P., Chiu, D.: Using semi-joins to solve relational queries. J. ACM 28(1), 25–40 (1981)


  4. Ceccarelli, D., Lucchese, C., Orlando, S., Perego, R., Silvestri, F.: Caching query-biased snippets for efficient retrieval. In: Proceedings of the 14th Int’l Conference on Extending Database Technology (EDBT), pp. 93–104 (2011)

  5. Culpepper, J., Petri, M., Scholer, F.: Efficient in-memory top-k document retrieval. In: Proceedings of the 35th Int’l Conference on Information Retrieval (SIGIR), pp. 225–234 (2012)

  6. Cutting, D., Pedersen, J.: Optimization for dynamic inverted index maintenance. In: Proceedings of the 13th ACM Int’l Conference on Information Retrieval (SIGIR), pp. 405–411 (1990)

  7. Dean, J.: Building Software Systems at Google and Lessons Learned, Stanford Computer Science Department Distinguished Computer Scientist Lecture, Nov. 2010. (presentation slides available at http://research.google.com/people/jeff/Stanford-DL-Nov-2010.pdf)

  8. Fagni, T., Perego, R., Silvestri, F., Orlando, S.: Boosting the performance of web search engines caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems (TOIS) 24(1), 51–78 (2006)


  9. Färber, F., et al.: The SAP HANA database - an architecture overview. IEEE Data Eng. Bull. 35(1), 28–33 (2012)


  10. Gan, Q., Suel, T.: Improved techniques for result caching in web search engines. In: Proceedings of the 18th Int’l Conference on World Wide Web (WWW), pp. 431–440 (2009)

  11. Google Global Cache: https://peering.google.com/about/ggc.html, https://peering.google.com/about/faq.html (referenced in Jan. 2016)

  12. IBM WebSphere eXtreme Scale: http://www.ibm.com/software/products/en/websphere-extreme-scale (referenced in Jan. 2018)

  13. Internet Live Stats: http://www.internetlivestats.com/google-search-statistics (referenced in Jan. 2018)

  14. Jung, B., Omiecinski, E.: Inverted file partitioning schemes in multiple disk systems. IEEE Trans. Parallel Distributed Syst. 6(2), 142–153 (1995)


  15. Kunder, M.: http://www.worldwidewebsize.com (referenced in Jan. 2018)

  16. Markatos, E.: On caching search engine query results. Comput. Commun. 24(2), 137–143 (2001)


  17. Memcached - A Distributed Memory Object Caching System, http://memcached.org

  18. Oracle Coherence, http://www.oracle.com/technetwork/middleware/coherence

  19. Ousterhout, J., et al.: The case for RAMClouds: scalable high-performance storage entirely in DRAM. In: ACM SIGOPS Operating Systems Review, vol. 43, pp. 92–105 (2010)


  20. Ozcan, R., Altingovde, I., Ulusoy, Ö.: Static query result caching revisited. In: Proceedings of the 17th Int’l Conference on World Wide Web (WWW), pp. 1169–1170 (2008)

  21. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the Web, technical report(SIDL-WP-1999-0120) Stanford University (1999)

  22. Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. In: Proceedings of the 1st ACM Int’l Conference on Scalable Information Systems, Article No. 1 (2006)

  23. Protic, J., Tomasevic, M., Milutinović, V.: Distributed Shared Memory: Concepts and Systems. Wiley, New York (1998)


  24. Samsung Semiconductor, http://www.samsung.com/semiconductor/global/file/insight/2015/08/DDR4_Brochure_July2015-0.pdf

  25. Seagate, http://www.seagate.com/internal-hard-drives/desktop-hard-drives/desktop-hdd/#specs

  26. Skobeltsyn, G., Junqueira, F., Plachouras, V., Baeza-Yates, R.: ResIn: a combination of results caching and index pruning for high-performance Web search engines. In: Proceedings of the 31st Int’l Conference on Information Retrieval (SIGIR), pp. 131–138 (2008)

  27. Stonebraker, M., Weisberg, A.: The voltDB main memory DBMS. IEEE Data Eng. Bull. 36(2), 21–27 (2013)


  28. Strohman, T., Croft, W.: Efficient document retrieval in main memory. In: Proceedings of the 30th Int’l Conference on Information Retrieval (SIGIR), pp. 175–182 (2007)

  29. Turpin, A., Tsegay, Y., Hawking, D., Williams, H.: Fast generation of result snippets in web search. In: Proceedings of the 30th Int’l Conference on Information Retrieval (SIGIR), pp. 127–134 (2007)

  30. Whang, K., Park, B., Han, W., Lee, Y.: An inverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems, U.S. Patent no. 6,349,308, Feb. 19, 2002, Application No. 09/250,487 (1999)

  31. Whang, K., Lee, M., Lee, J., Han, W.: Odysseus: a high-performance ORDBMS tightly-coupled with IR features. In: Proceedings of the 21st Int’l Conference on Data Engineering (ICDE), pp. 1104–1105 (2005)

  32. Whang, K., Yun, T., Yeo, Y., Song, I., Kwon, H., Kim, I.: ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality. In: Proceedings of the 2013 ACM Int’l Conference on Management of Data (SIGMOD), pp. 313–324 (2013)

  33. Whang, K., Lee, J., Lee, M., Han, W., Kim, M., Kim, J.: DB-IR Integration using tight-coupling in the Odysseus DBMS. The World Wide Web J 18 (3), 491–520 (2015)


  34. Xin, R., Xin, R., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I., et al.: Shark: SQL and rich analytics at scale. In: Proceedings of the 2013 ACM Int’l Conference on Management of Data (SIGMOD), pp. 13–24 (2013)

  35. Zaharia, M.: An architecture for fast and general data processing on large clusters, PhD Dissertation, University of California, Berkeley (2013)


Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (No. 2016R1A2B4015929).

Author information


Correspondence to Kyu-Young Whang.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yun, TS., Whang, KY., Kwon, HY. et al. Two-dimensional indexing to provide one-integrated-memory view of distributed memory for a massively-parallel search engine. World Wide Web 22, 2437–2467 (2019). https://doi.org/10.1007/s11280-018-0647-1

