Big Data Indexing

  • Living reference work entry

Definitions

The major theme of this topic is building indexes, which are auxiliary data structures, on top of big datasets to speed up their retrieval and querying. The topic covers a wide range of index types and compares their structures and capabilities.
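As a rough illustration of what "auxiliary data structure" means here, the following minimal sketch builds a hash-style index over a small in-memory list of records and compares an indexed lookup with a full scan. The record layout, the field names, and the helper names (`build_index`, `scan_lookup`, `index_lookup`) are hypothetical and chosen only for this example; they are not taken from the entry or from any specific system.

```python
from collections import defaultdict

def build_index(records, key):
    """Map each value of `key` to the positions of the records holding it."""
    index = defaultdict(list)
    for pos, rec in enumerate(records):
        index[rec[key]].append(pos)
    return index

def scan_lookup(records, key, value):
    """Baseline without an index: inspect every record."""
    return [rec for rec in records if rec[key] == value]

def index_lookup(records, index, value):
    """With an index: jump straight to the matching positions."""
    return [records[pos] for pos in index.get(value, [])]

# Hypothetical sample data for illustration only.
records = [
    {"flight": "AA10", "origin": "BOS"},
    {"flight": "UA22", "origin": "SFO"},
    {"flight": "DL35", "origin": "BOS"},
]
origin_index = build_index(records, "origin")
```

Both lookups return the same records; the index trades extra storage and build time for avoiding a pass over the whole dataset on each query.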

Overview

Big data infrastructures such as Hadoop increasingly support applications that manage structured or semi-structured data. In many applications, including scientific applications, weblog analysis, click streams, transaction logs, and airline analytics, at least partial knowledge of the data structure is available. For example, some attributes (columns in the data) may have known data types and value domains, while little may be known about other attributes. Even partial knowledge of this kind can enable optimization techniques that would otherwise not be possible.
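One way partial knowledge can pay off, sketched here under stated assumptions rather than as the method of any particular system: if a single attribute is known to be numeric, a per-split (minimum, maximum) summary of that attribute lets a range query skip splits that cannot contain matches, even when nothing is known about the other attributes. The split layout, the `delay` attribute, and the function names are hypothetical, and every split is assumed to be non-empty.

```python
def summarize_splits(splits, column):
    """Record (min, max) of `column` per split; assumes non-empty splits
    and a comparable (e.g., numeric) column type."""
    return [
        (min(rec[column] for rec in split), max(rec[column] for rec in split))
        for split in splits
    ]

def splits_to_read(summaries, low, high):
    """Indices of splits whose [min, max] range overlaps [low, high]."""
    return [
        i for i, (mn, mx) in enumerate(summaries)
        if mx >= low and mn <= high
    ]

# Hypothetical data: three splits of records with a numeric `delay` column.
splits = [
    [{"delay": 5}, {"delay": 12}],
    [{"delay": 40}, {"delay": 75}],
    [{"delay": 8}, {"delay": 90}],
]
summaries = summarize_splits(splits, "delay")
candidates = splits_to_read(summaries, 30, 60)
```

Here the first split is skipped for the range 30 to 60 because its maximum is below the lower bound; the remaining splits must still be read and filtered, since the summary only rules splits out.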

Query optimization is a core mechanism in data management systems. It enables executing users’ queries...

Author information

Corresponding author

Correspondence to Mohamed Y. Eltabakh.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this entry

Cite this entry

Eltabakh, M.Y. (2018). Big Data Indexing. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_255-1

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_255-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Springer Reference Mathematics, Reference Module Computer Science and Engineering
