DOI: 10.1145/3555041.3589678
research-article

DataChat: An Intuitive and Collaborative Data Analytics Platform

Published: 05 June 2023

Abstract

Enterprises invest in data platforms with the aim of extracting meaningful information through analytics. Typically, experts create analytics pipelines that feed into dashboards and provide answers to predetermined questions. This approach makes analytics a spectator sport for most people and introduces operational bottlenecks to leveraging those investments. To improve the value derived from data, many organizations are opting to open up their data assets and allow access to a wider range of users. However, using programming languages such as SQL and Python for analytics can be difficult for most enterprise users. DataChat provides a simplified data science approach that is intuitive, powerful, and accessible to all data users. The platform is built on a library of data functions that are cleanly abstracted to maximize efficiency and ease of use while maintaining a rich suite of tools necessary for data science. With these functions, users can create data analysis pipelines by using a simple point-and-click interface in a spreadsheet view or by using natural English interfaces. Modern sharing and collaboration features are central to all aspects of the platform, allowing teams to easily bridge expertise gaps. A deeper understanding of results is facilitated by providing automatically-generated English explanations of how they were derived. By enhancing these aspects of data science and human-to-human communication, the platform addresses the needs that many organizations are encountering as their analytics needs mature.

Supplemental Material

MP4 File
Presentation video of the DataChat platform including large language model integration to generate complex data analytics pipelines from natural language user requests. The presentation starts by explaining how the DataChat platform simplifies data analytics operations into Skills users can execute individually. It then proceeds to explain how these Skills can be used as discrete building blocks in repeatable data analysis Recipes. Finally, the presentation demonstrates how such a Recipe can be fully generated using a large language model based on simple natural language prompts from the user.
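
The building-block idea described above can be illustrated with a small sketch. This is not DataChat's API: the names `Skill`, `keep_rows`, `total_by`, `run_recipe`, and `explain` are hypothetical, and the sample data is invented for illustration. The sketch only shows how pairing each step with an English description lets a pipeline be both executed and explained.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict
Table = list[Row]


@dataclass(frozen=True)
class Skill:
    """Hypothetical building block: a table transformation plus its English description."""
    description: str
    apply: Callable[[Table], Table]


def keep_rows(column: str, minimum: float) -> Skill:
    """Build a step that keeps rows whose `column` value is at least `minimum`."""
    return Skill(
        f"Keep the rows where {column} is at least {minimum}",
        lambda table: [row for row in table if row[column] >= minimum],
    )


def total_by(group: str, value: str) -> Skill:
    """Build a step that sums `value` within each distinct `group`."""
    def run(table: Table) -> Table:
        totals: dict = {}
        for row in table:
            totals[row[group]] = totals.get(row[group], 0) + row[value]
        return [{group: key, value: total} for key, total in totals.items()]

    return Skill(f"Compute the total {value} for each {group}", run)


def run_recipe(recipe: list[Skill], table: Table) -> Table:
    """Apply each step in order, feeding every result into the next step."""
    for skill in recipe:
        table = skill.apply(table)
    return table


def explain(recipe: list[Skill]) -> str:
    """Render the recipe as numbered English sentences."""
    return "\n".join(f"{i}. {skill.description}" for i, skill in enumerate(recipe, 1))


# Invented sample data, for illustration only.
sales = [
    {"region": "East", "amount": 120},
    {"region": "West", "amount": 40},
    {"region": "East", "amount": 80},
]
recipe = [keep_rows("amount", 50), total_by("region", "amount")]

print(run_recipe(recipe, sales))
print(explain(recipe))
```

Because the explanation is derived from the same step list that produces the result, the two cannot drift apart, which is one plausible way to support the "how was this derived" explanations the abstract mentions.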

    Published In

    SIGMOD '23: Companion of the 2023 International Conference on Management of Data
    June 2023
    330 pages
    ISBN: 9781450395076
    DOI: 10.1145/3555041
    This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. analytics
    2. data science
    3. generative AI
    4. machine learning

    Qualifiers

    • Research-article

    Data Availability

    Presentation video (MP4), described under Supplemental Material: https://dl.acm.org/doi/10.1145/3555041.3589678#SIGMOD23-DataChat-3589678.mp4

    Funding Sources

    • National Science Foundation

    Conference

    SIGMOD/PODS '23
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 133
    • Downloads (last 6 weeks): 9
    Reflects downloads up to 01 Mar 2025.

    Cited By

    • (2024) Utopia: Automatic Pivot Table Assistant. Proceedings of the VLDB Endowment 17:12 (4305-4308). DOI: 10.14778/3685800.3685861. Online publication date: 8-Nov-2024
    • (2024) ReAcTable: Enhancing ReAct for Table Question Answering. Proceedings of the VLDB Endowment 17:8 (1981-1994). DOI: 10.14778/3659437.3659452. Online publication date: 31-May-2024
    • (2024) FHIRViz: Multi-Agent Platform for FHIR Visualization to Advance Healthcare Analytics. Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (1-7). DOI: 10.1145/3698587.3701392. Online publication date: 22-Nov-2024
    • (2023) GPT in Data Science: A Practical Exploration of Model Selection. 2023 IEEE International Conference on Big Data (BigData) (4325-4334). DOI: 10.1109/BigData59044.2023.10386503. Online publication date: 15-Dec-2023
