Definitions
The major theme of this topic is building indexes, which are auxiliary data structures, on top of big datasets to speed up their retrieval and querying. The topic covers a wide range of index types and compares their structures and capabilities.
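As a minimal sketch of what "auxiliary data structure" means here, the Python example below builds a small lookup table beside a toy dataset and uses it to avoid a full scan. Everything in it is an illustrative assumption rather than a method taken from this entry: the dataset, the notion of fixed "splits", and the helper names build_split_index and lookup are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy dataset: each "split" is a list of (user_id, url) records.
splits = [
    [("u1", "/home"), ("u2", "/cart")],
    [("u3", "/home"), ("u4", "/help")],
    [("u2", "/checkout"), ("u5", "/home")],
]


def build_split_index(splits, key_pos):
    """Map each value of the chosen attribute to the ids of the splits containing it."""
    index = defaultdict(set)
    for split_id, records in enumerate(splits):
        for record in records:
            index[record[key_pos]].add(split_id)
    return index


def lookup(splits, index, key_pos, value):
    """Scan only the splits the index points to, instead of every split."""
    hits = []
    for split_id in sorted(index.get(value, ())):
        hits.extend(r for r in splits[split_id] if r[key_pos] == value)
    return hits


# The index is stored alongside the data; the data itself is unchanged.
user_index = build_split_index(splits, key_pos=0)
result = lookup(splits, user_index, key_pos=0, value="u2")
print(result)
```

In this sketch the index costs one extra pass over the data and some additional storage, and in return a lookup on the indexed attribute reads only the splits that can contain a match.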
Overview
Big data infrastructures such as Hadoop increasingly support applications that manage structured or semi-structured data. In many applications, including scientific applications, weblog analysis, click streams, transaction logs, and airline analytics, at least partial knowledge of the data's structure is available. For example, some attributes (columns in the data) may have known data types and value domains, while little may be known about other attributes. This knowledge, even if partial, can enable optimization techniques that would otherwise not be possible.
Query optimization is a core mechanism in data management systems. It enables executing users’ queries...
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this entry
Cite this entry
Eltabakh, M.Y. (2018). Big Data Indexing. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_255-1
DOI: https://doi.org/10.1007/978-3-319-63962-8_255-1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering