Data Wrangling

Synonyms

Data preparation

Definition

Data wrangling is the process of profiling and transforming datasets to ensure they are actionable for a set of analysis tasks. One central goal is to make data usable: to put data in a form that can be parsed and manipulated by analysis tools. Another goal is to ensure that data is responsive to the intended analyses: that the data contains the necessary information, at an acceptable level of description and correctness, to support successful modeling and decision-making.

Overview

Despite significant advances in technologies for data management and analysis, it remains time-consuming to inspect a dataset and mold it into a form that allows meaningful analysis to begin. Analysts must regularly restructure data to make it palatable to databases, statistics packages, and visualization tools. To improve data quality, analysts must also identify and address issues such as misspellings, missing data, unresolved duplicates, and outliers.
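As a rough illustration of the profiling step described above, the following Python sketch counts missing values, flags repeated keys, and marks numeric outliers in a small table. It is only a sketch under stated assumptions: the function name profile, the sample records, the use of the median absolute deviation (MAD), and the cutoff of 5 are all choices made for this example and are not taken from the entry.

```python
from collections import Counter
from statistics import median


def profile(rows, numeric_field, key_fields, cutoff=5.0):
    """Summarize missing values, duplicate keys, and outliers in a list of dicts.

    The MAD-based outlier rule and the default cutoff are illustrative
    assumptions, not a method prescribed by the entry.
    """
    # Count empty or absent values per field.
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                missing[field] += 1

    # Keys that occur more than once are candidate duplicates.
    key_counts = Counter(tuple(row.get(f) for f in key_fields) for row in rows)
    duplicates = {key: n for key, n in key_counts.items() if n > 1}

    # Flag values far from the median, measured in units of the MAD.
    values = [row[numeric_field] for row in rows
              if isinstance(row.get(numeric_field), (int, float))]
    outliers = []
    if values:
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad > 0:
            outliers = [v for v in values if abs(v - med) / mad > cutoff]

    return {"missing": dict(missing), "duplicates": duplicates, "outliers": outliers}


# Hypothetical sample records, invented for this example.
rows = [
    {"id": 1, "city": "Berlin", "temp": 21.0},
    {"id": 2, "city": "", "temp": 19.5},
    {"id": 2, "city": "Paris", "temp": 20.0},
    {"id": 3, "city": "Rome", "temp": None},
    {"id": 4, "city": "Oslo", "temp": 250.0},
    {"id": 5, "city": "Bern", "temp": 20.5},
]
report = profile(rows, numeric_field="temp", key_fields=["id"])
print(report)
```

A report like this only surfaces candidate problems. Deciding whether a flagged value is an error or a legitimate extreme still requires the analyst's judgment.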

Data wrangling is...

Notes

  1. Normal forms beyond first normal form (second normal form, etc.) are often less desirable for analysis purposes: one might wish to denormalize data (e.g., by joining relations with primary-foreign key relationships) in order to more conveniently perform analysis over a single table.
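The denormalization mentioned in this note can be sketched in a few lines of Python. The example below joins a hypothetical orders relation to a hypothetical customers relation on a primary-foreign key pair so that later analysis can run over a single table. The function name denormalize, the field names, and the sample rows are assumptions made for illustration only.

```python
def denormalize(child_rows, parent_rows, foreign_key, primary_key):
    """Left-join child rows to parent rows on a primary-foreign key pair.

    Child rows with no matching parent are kept unchanged. If both relations
    share a non-key field name, the parent's value overwrites the child's;
    this is a simplification accepted for this sketch.
    """
    # Index the parent relation by its primary key for constant-time lookup.
    parents = {parent[primary_key]: parent for parent in parent_rows}

    joined = []
    for child in child_rows:
        row = dict(child)
        parent = parents.get(child[foreign_key], {})
        for field, value in parent.items():
            if field != primary_key:
                row[field] = value
        joined.append(row)
    return joined


# Hypothetical relations, invented for this example.
customers = [
    {"customer_id": 1, "name": "Ada", "region": "EU"},
    {"customer_id": 2, "name": "Lin", "region": "APAC"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 5.0},
]
flat = denormalize(orders, customers, "customer_id", "customer_id")
for row in flat:
    print(row)
```

The joined table repeats customer attributes on every order row. That redundancy is what normalization avoids, and it is accepted here as the price of working with one table.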

References

  • Carr DB, Littlefield RJ, Nicholson W, Littlefield J (1987) Scatterplot matrix techniques for large N. J Am Stat Assoc 82(398):424–436

  • Chiticariu L, Kolaitis PG, Popa L (2008) Interactive generation of integrated schemas. In: ACM SIGMOD, pp 833–846

  • Codd EF (1971b) Further normalization of the data base relational model. In: Courant computer science symposia 6, Data base systems, (New York, May 24–25) pp 33–64, Prentice-Hall

  • Dasu T, Johnson T (2003) Exploratory data mining and data cleaning. Wiley, New York

  • Dasu T, Johnson T, Muthukrishnan S, Shkapenyuk V (2002) Mining database structure; or, how to build a data quality browser. In: ACM SIGMOD, pp 240–251

  • Doan A, Halevy A, Ives Z (2012) Principles of data integration. Elsevier, Amsterdam

  • Eaton C, Plaisant C, Drizd T (2003) The challenge of missing and uncertain data. In: Proceedings of the IEEE visualization, p 100

  • Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE TKDE 19(1):1–16

  • Fisher K, Walker D (2011) The PADS project: an overview. In: International conference on database theory, Mar 2011

  • Galhardas H, Florescu D, Shasha D, Simon E (2000) AJAX: an extensible data cleaning tool. In: ACM SIGMOD, p 590

  • Gulwani S (2011) Automating string processing in spreadsheets using input-output examples. In: ACM POPL, pp 317–330

  • Guo PJ, Kandel S, Hellerstein J, Heer J (2011) Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In: ACM user interface software & technology (UIST)

  • Harris W, Gulwani S (2011) Spreadsheet table transformations from examples. In: ACM PLDI

  • Heer J, Hellerstein JM, Kandel S (2015) Predictive interaction for data transformation. In: CIDR

  • Hellerstein JM (2008) Quantitative data cleaning for large databases. White Paper, United Nations Economic Commission for Europe

  • Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126

  • Horvitz E (1999) Principles of mixed-initiative user interfaces. In: ACM CHI, pp 159–166

  • Huynh D, Mazzocchi S (2010) Google refine. http://code.google.com/p/google-refine/

  • Kang H, Getoor L, Shneiderman B, Bilgic M, Licamele L (2008) Interactive entity resolution in relational data: a visual analytic tool and its evaluation. IEEE TVCG 14(5):999–1014

  • Kandel S, Heer J, Plaisant C, Kennedy J, van Ham F, Riche NH, Weaver C, Lee B, Brodbeck D, Buono P (2011a) Research directions in data wrangling: visualizations and transformations for usable and credible data. Inf Vis J 10(4):271–288

  • Kandel S, Paepcke A, Hellerstein J, Heer J (2011b) Wrangler: interactive visual specification of data transformation scripts. In: ACM human factors in computing systems (CHI)

  • Kandel S, Paepcke A, Hellerstein J, Heer J (2012a) Enterprise data analysis and visualization: an interview study. In: IEEE visual analytics science & technology (VAST)

  • Kandel S, Parikh R, Paepcke A, Hellerstein J, Heer J (2012b) Profiler: integrated statistical analysis and visualization for data quality assessment. In: Advanced visual interfaces

  • Lakshmanan LVS, Sadri F, Subramanian SN (2001) SchemaSQL: an extension to SQL for multidatabase interoperability. ACM Trans Database Syst 26(4):476–519

  • Rahm E, Bernstein PA (2001) A survey of approaches to automatic schema matching. VLDB J 10:334–350

  • Raman V, Hellerstein JM (2001) Potter’s wheel: an interactive data cleaning system. In: VLDB, pp 381–390

  • Robertson GG, Czerwinski MP, Churchill JE (2005) Visualization of mappings between schemas. In: ACM CHI, pp 431–439

  • Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: ACM SIGKDD

  • Stonebraker M, Bruckner D, Ilyas IF, Beskales G, Cherniack M, Zdonik SB, Pagan A, Xu S (2013) Data curation at scale: the data tamer system. In: CIDR

  • Wickham H (2014) Tidy data. J Stat Softw 59(10):1–23

Author information

Corresponding author

Correspondence to Jeffrey Heer.

Copyright information

© 2018 Springer International Publishing AG

About this entry

Cite this entry

Heer, J., Hellerstein, J.M., Kandel, S. (2018). Data Wrangling. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_9-1

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_9-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering

Chapter history

  1. Latest: Data Wrangling. Published: 05 February 2018. DOI: https://doi.org/10.1007/978-3-319-63962-8_9-1

  2. Original: Data Wrangling. Published: 24 February 2012. DOI: https://doi.org/10.1007/978-3-319-63962-8_9-2