skip to main content
10.1145/3468791.3468811acmotherconferencesArticle/Chapter ViewAbstractPublication PagesssdbmConference Proceedingsconference-collections
research-article
Public Access

SDTA: An Algebra for Statistical Data Transformation

Published:11 August 2021Publication History

ABSTRACT

Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) and etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages.

Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformations frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance that maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository.

We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how functional-equivalent transformation programs could be converted to other functionally equivalent programs, in the same or different language, permitting code reuse and result reproducibility, We also illustrate the possibility of using of SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.

References

  1. George Alter, Darrell Donakowski, Jack Gager, Pascal Heus, Carson Hunter, Sanda Ionescu, Jeremy Iverson, HV Jagadish, Carl Lagoze, Jared Lyle, 2020. Provenance metadata for statistical data: An introduction to Structured Data Transformation Language (SDTL). IASSIST Quarterly 44, 4 (2020).Google ScholarGoogle Scholar
  2. Peter Baumann. 1994. Management of multidimensional discrete data. The VLDB Journal 3, 4 (Oct. 1994), 401–444.Google ScholarGoogle ScholarCross RefCross Ref
  3. P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. 1998. The Multidimensional Database System RasDaMan. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (Seattle, Washington, USA) (SIGMOD ’98). Association for Computing Machinery, New York, NY, USA, 575–577. https://doi.org/10.1145/276304.276386Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Peter Baumann and Sönke Holsten. 2011. A Comparative Analysis of Array Models for Databases.FGIT-DTA/BSBT (2011).Google ScholarGoogle Scholar
  5. Peter Baumann and Sönke Holsten. 2011. A comparative analysis of array models for databases. In Database theory and application, bio-science and bio-technology. Springer, 80–89.Google ScholarGoogle Scholar
  6. Peter A Boncz and Martin L Kersten. 1999. MIL primitives for querying a fragmented world. The VLDB Journal 8, 2 (1999), 101–119.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Inc Circle Systems. 2015. Stat/Transfer (version 14). https://stattransfer.comGoogle ScholarGoogle Scholar
  8. James Clark, Steve DeRose, 1999. XML path language (XPath).Google ScholarGoogle Scholar
  9. Edgar F Codd. 2002. A relational model of data for large shared data banks. In Software pioneers. Springer, 263–294.Google ScholarGoogle Scholar
  10. Colectica. 2017. Convention-based Ontology Generation System (COGS) 1.0. http://cogsdata.org/docs/.Google ScholarGoogle Scholar
  11. Roberto Cornacchia, Sándor Héman, Marcin Zukowski, Arjen P de Vries, and Peter A Boncz. 2008. Flexible and efficient IR using array databases.VLDB J. (2008).Google ScholarGoogle Scholar
  12. Catherine Dalzell. 2013. Calling R from SPSS. https://developer.ibm.com/tutorials/ba-call-r-spss/Google ScholarGoogle Scholar
  13. Boris Glavic and Klaus R Dittrich. 2007. Data Provenance: A Categorization of Existing Approaches.. In BTW, Vol. 7. 227–241.Google ScholarGoogle Scholar
  14. Noel Gorelick, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and Rebecca Moore. 2017. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment(2017). https://doi.org/10.1016/j.rse.2017.06.031Google ScholarGoogle Scholar
  15. Georg Gottlob and Roberto Zicari. 1988. Closed World Databases Opened Through Null Values.VLDB (1988).Google ScholarGoogle Scholar
  16. EF Haghish. 2019. Seamless interactive language interfacing between R and Stata. The Stata Journal 19, 1 (2019), 61–82.Google ScholarGoogle ScholarCross RefCross Ref
  17. SAS Institute Inc.2016. Calling Functions in the R Language. https://support.sas.com/ rnd/app/studio/statr.pdfGoogle ScholarGoogle Scholar
  18. Zhongjun Jin, Michael R Anderson, Michael J Cafarella, and H V Jagadish. 2017. Foofah - Transforming Data By Example.SIGMOD Conference (2017).Google ScholarGoogle Scholar
  19. Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. 2011. Wrangler - interactive visual specification of data transformation scripts.CHI (2011).Google ScholarGoogle Scholar
  20. Laks VS Lakshmanan, Fereidoon Sadri, and Iyer N Subramanian. 1996. SchemaSQL-a language for interoperability in relational multi-database systems. In VLDB, Vol. 96. Citeseer, 239–250.Google ScholarGoogle Scholar
  21. Alberto Lerner and Dennis E Shasha. 2003. AQuery - Query Language for Ordered Data, Optimization Techniques, and Experiments.VLDB (2003).Google ScholarGoogle Scholar
  22. Leonid Libkin, Rona Machlin, and Limsoon Wong. 1996. A Query Language for Multidimensional Arrays - Design, Implementation, and Optimization Techniques.SIGMOD Conference (1996).Google ScholarGoogle Scholar
  23. Nikos A Lorentzos and Roger G Johnson. 1988. Extending relational algebra to manipulate temporal data. Information Systems 13, 3 (1988), 289–296.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Arunprasad P Marathe and Kenneth Salem. 1997. A Language for Manipulating Arrays.VLDB (1997).Google ScholarGoogle Scholar
  25. Jim Melton. 2003. Advanced SQL: 1999: Understanding object-relational and other advanced features. Morgan Kaufmann.Google ScholarGoogle Scholar
  26. Chuck Murray, Janet Blowney, J Xie, T Xu, and S Yuditskaya. 2003. Oracle Spatial GeoRaster, 10 g Release 1. Oracle Corporation (2003), 2–31.Google ScholarGoogle Scholar
  27. G Özsoyoğlu, ZM Özsoyoğlu, and Victor Matos. 1987. Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Transactions on Database Systems (TODS) 12, 4 (1987), 566–592.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Gultekin Özsoyoglu and Z Meral Özsoyoglu. 1985. Statistical Database Query Languages.IEEE Trans. Software Eng.(1985).Google ScholarGoogle ScholarCross RefCross Ref
  29. Vijayshankar Raman and Joseph M Hellerstein. 2001. Potter’s Wheel - An Interactive Data Cleaning System.VLDB (2001).Google ScholarGoogle Scholar
  30. P Ramsey, VB Columbia Refractions Research Inc, and 2005. [n.d.]. Introduction to PostGIS. drm.cenn.ge ([n. d.]).Google ScholarGoogle Scholar
  31. Mark A Roth, Herry F Korth, and Abraham Silberschatz. 1988. Extended algebra and calculus for nested relational databases. ACM Transactions on Database Systems (TODS) 13, 4 (1988), 389–417.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. SDMX. 2018. Validation and Transformation Language (VTL). https://sdmx.org/?page_id=5096Google ScholarGoogle Scholar
  33. Arie Shoshani. 1982. Statistical Databases - Characteristics, Problems, and some Solutions.VLDB (1982).Google ScholarGoogle Scholar
  34. Arie Shoshani. 1997. OLAP and Statistical Databases - Similarities and Differences.PODS (1997).Google ScholarGoogle Scholar
  35. David Simmen, Eugene Shekita, and Timothy Malkemus. 1996. Fundamental techniques for order optimization. In the 1996 ACM SIGMOD international conference. ACM Press, New York, New York, USA, 57–67.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. SlivinskasGiedrius, JensenChristian S, and SnodgrassRichard Thomas. 2002. Bringing order to query optimization. ACM SIGMOD Record (June 2002).Google ScholarGoogle Scholar
  37. Jie Song, George Alter, and H. V. Jagadish. 2019. C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation. Proceedings of the 2019 International Conference on Management of Data (2019).Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Michael Stonebraker, Paul Brown, Alex Poliakov, and Suchi Raman. 2011. The Architecture of SciDB.SSDBM (2011).Google ScholarGoogle Scholar
  39. Alex R van Ballegooij. 2004. RAM: A Multidimensional Array DBMS. In Current Trends in Database Technology - EDBT 2004 Workshops. Springer, Berlin, Heidelberg, Berlin, Heidelberg, 154–165.Google ScholarGoogle Scholar
  40. Hadley Wickham. 2007. Reshaping Data with the reshape Package. Journal of Statistical Software 21, 12 (2007), 1–20. http://www.jstatsoft.org/v21/i12/Google ScholarGoogle ScholarCross RefCross Ref
  41. Hadley Wickham. 2017. tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse R package version 1.2.1.Google ScholarGoogle Scholar
  42. Hadley Wickham, Romain François, Lionel Henry, and Kirill MÃŒller. 2018. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr R package version 0.7.6.Google ScholarGoogle Scholar
  43. Catharine M Wyss and Edward L Robertson. 2005. Relational languages for metadata integration. ACM Transactions on Database Systems (TODS) 30, 2 (2005), 624–660.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Carlo Zaniolo. 1984. Database Relations with Null Values.J. Comput. Syst. Sci.(1984).Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Other conferences
    SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management
    July 2021
    275 pages
    ISBN:9781450384131
    DOI:10.1145/3468791

    Copyright © 2021 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 11 August 2021

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate56of146submissions,38%
  • Article Metrics

    • Downloads (Last 12 months)154
    • Downloads (Last 6 weeks)34

    Other Metrics

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format