Big Data Indexing

Eltabakh, Mohamed Y.

doi:10.1007/978-3-319-63962-8_255-1

Living reference work entry
First Online: 09 May 2018

Definitions

The major theme of this topic is building indexes, which are auxiliary data structures, on top of big datasets to speed up data retrieval and querying. The topic covers a wide range of index types, along with a comparison of their structures and capabilities.
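
As a minimal sketch of what "auxiliary data structure" means here, assume a small in-memory list of records stands in for a big dataset. The Python below builds a hash-style index on one attribute and then answers an equality lookup through the index rather than by scanning every record. The function names `build_index` and `lookup` and the sample records are hypothetical illustrations, not taken from the entry.

```python
from collections import defaultdict

def build_index(records, attribute):
    """Map each value of `attribute` to the positions of the records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[attribute]].append(position)
    return index

def lookup(records, index, value):
    """Fetch matching records through the index instead of a full scan."""
    return [records[pos] for pos in index.get(value, [])]

# Hypothetical sample data standing in for a large dataset.
records = [
    {"flight": "AA10", "origin": "BOS"},
    {"flight": "UA22", "origin": "SFO"},
    {"flight": "DL35", "origin": "BOS"},
]
origin_index = build_index(records, "origin")
matches = lookup(records, origin_index, "BOS")
```

The index is stored separately from the records and holds only positions, so it can be dropped and rebuilt without touching the data itself.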

Overview

Big data infrastructures such as Hadoop increasingly support applications that manage structured or semi-structured data. In many applications, including scientific applications, weblog analysis, click streams, transaction logs, and airline analytics, the structure of the data is at least partially known. For example, some attributes (columns in the data) may have known data types and value domains, while little may be known about other attributes. This knowledge, even if partial, can enable optimization techniques that would otherwise not be possible.
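
One way to picture how partial knowledge helps is the following sketch. It assumes that a single attribute is known to be numeric and that the data is stored in non-empty splits. With only that knowledge, a system can keep the minimum and maximum of the attribute per split and skip the splits that cannot satisfy a range predicate. This is an illustration under those assumptions, not the technique the entry describes, and the names `summarize_splits` and `splits_to_scan` are hypothetical.

```python
def summarize_splits(splits, attribute):
    """Record the (min, max) of a numeric attribute for every non-empty split."""
    summaries = []
    for split in splits:
        values = [row[attribute] for row in split]
        summaries.append((min(values), max(values)))
    return summaries

def splits_to_scan(summaries, low, high):
    """Return the ids of splits whose value range overlaps [low, high]."""
    return [i for i, (lo, hi) in enumerate(summaries) if hi >= low and lo <= high]

# Hypothetical splits; only the "delay" attribute is assumed to be numeric.
splits = [
    [{"delay": 5}, {"delay": 12}],
    [{"delay": 40}, {"delay": 75}],
    [{"delay": 90}, {"delay": 130}],
]
summaries = summarize_splits(splits, "delay")
candidates = splits_to_scan(summaries, 60, 100)
```

A query for delays between 60 and 100 then reads only the candidate splits. Attributes with unknown types cannot be summarized this way, which is why even partial schema knowledge matters.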

Query optimization is a core mechanism in data management systems. It enables executing users’ queries...

Author information

Authors and Affiliations

Worcester Polytechnic Institute, Worcester, MA, USA
Mohamed Y. Eltabakh

Corresponding author

Correspondence to Mohamed Y. Eltabakh.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
Sch of Info Techno, Building J12, University of Sydney, Sydney, Australia
Albert Zomaya

Section Editor information

IBM Almaden Research Center, San Jose, CA, USA
Yuanyuan Tian
IBM Research - Almaden, San Jose, CA, USA
Fatma Özcan

About this entry

Cite this entry

Eltabakh, M.Y. (2018). Big Data Indexing. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_255-1

DOI: https://doi.org/10.1007/978-3-319-63962-8_255-1
Published: 09 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics, Reference Module Computer Science and Engineering
