Abstract
Data analysis is now used in many areas to identify new trends, opportunities, or risks and to improve decision-making. In many cases, however, a meaningful analysis requires specific domain knowledge, which is why domain experts need to be involved. To this end, data mashups are a popular tool for modeling tailored analyses. Yet, given today’s data volumes from heterogeneous source systems, it is very difficult to identify beneficial data sources, in particular for explorative data analysis. In this paper, we first define requirements for user-centric analytics and then introduce SDRank, a deep-learning-based approach to identify beneficial data sources. In an extensive evaluation with three scenarios, we show that this approach is robust with respect to the training data used and can reliably identify beneficial data sources, even for previously unknown domains (i.e., in a transfer-learning setting).
Notes
1. DBPedia: https://www.dbpedia-spotlight.org/.
2. Mockaroo: https://www.mockaroo.com/.
3. Download Databases: https://database-downloads.com/.
4. The OpenSky Network: https://opensky-network.org/.
5. DBpedia: https://dbpedia.org/.
6. Keras: https://keras.io/.
7. \(P(\text{``select a correct data source, 1 draw''}) = \frac{\#\text{correct datasets}}{\#\text{all datasets}} = \frac{5}{20} = 0.25\).
8. \(P(\text{``select at least one correct data source, 4 draws''}) = 1 - P(\text{``select only incorrect data sources''}) = 1 - \left(\frac{15}{20} \cdot \frac{14}{19} \cdot \frac{13}{18} \cdot \frac{12}{17}\right) \approx 0.72\).
9. \(P(\text{``select a correct data source, 1 draw''}) = \frac{\#\text{correct datasets}}{\#\text{all datasets}} = \frac{5}{25} = 0.20\).
10. \(P(\text{``select at least one correct data source, 4 draws''}) = 1 - P(\text{``select only incorrect data sources''}) = 1 - \left(\frac{20}{25} \cdot \frac{19}{24} \cdot \frac{18}{23} \cdot \frac{17}{22}\right) \approx 0.62\).
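The arithmetic in notes 7 to 10 can be checked with a short script. This is only a sketch of the random-selection baseline those notes imply, assuming data sources are drawn uniformly at random without replacement; the function name `p_at_least_one_correct` is hypothetical and does not come from the paper.

```python
from fractions import Fraction
from math import comb


def p_at_least_one_correct(n_correct: int, n_total: int, draws: int) -> Fraction:
    """Chance that at least one of `draws` randomly chosen sources is correct.

    Assumption: sources are drawn uniformly at random without replacement.
    """
    if not 0 <= n_correct <= n_total or not 0 <= draws <= n_total:
        raise ValueError("counts must satisfy 0 <= n_correct, draws <= n_total")
    n_incorrect = n_total - n_correct
    # Complement rule: 1 - P(every drawn source is incorrect).
    # comb() returns 0 when draws > n_incorrect, so the result is then 1.
    p_only_incorrect = Fraction(comb(n_incorrect, draws), comb(n_total, draws))
    return 1 - p_only_incorrect


# Five correct data sources among 20 and among 25 candidates, as in the notes.
for n_total in (20, 25):
    p1 = p_at_least_one_correct(5, n_total, 1)
    p4 = p_at_least_one_correct(5, n_total, 4)
    print(f"{n_total} datasets: 1 draw = {float(p1):.2f}, 4 draws = {float(p4):.2f}")
# prints:
# 20 datasets: 1 draw = 0.25, 4 draws = 0.72
# 25 datasets: 1 draw = 0.20, 4 draws = 0.62
```

The ratio of binomial coefficients is equivalent to the product of fractions written out in notes 8 and 10; exact fractions are used so that rounding happens only when printing.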
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Behringer, M., Treder-Tschechlov, D., Voggesberger, J., Hirmer, P., Mitschang, B. (2024). Connecting Domain Experts and Data: Enriching User-Centric Data Analysis with Neural Network-Aided Data Source Suggestion. In: Filipe, J., Śmiałek, M., Brodsky, A., Hammoudi, S. (eds) Enterprise Information Systems. ICEIS 2023. Lecture Notes in Business Information Processing, vol 518. Springer, Cham. https://doi.org/10.1007/978-3-031-64748-2_14
DOI: https://doi.org/10.1007/978-3-031-64748-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64747-5
Online ISBN: 978-3-031-64748-2
eBook Packages: Computer Science (R0)