Abstract
Integrating frequently changing, volatile product data from different manufacturers into a single catalog is a significant challenge for small and medium-sized e-commerce companies. These companies depend on integrating product data promptly so that they can present it in aggregated form in an online shop, yet they typically do not know the manufacturers' format specifications, how manufacturers understand the underlying concepts, or the quality of the data. Furthermore, formats, concepts, and data quality may change at any time. Consequently, integrating product catalogs into a single standardized catalog is often a laborious manual task. Current strategies to streamline or automate catalog integration use techniques based on machine learning, word vectorization, or semantic similarity. However, most approaches struggle with low-quality or real-world data. We propose Attribute Label Ranking (ALR), a recommendation engine that simplifies, for practitioners, the integration of previously unknown, proprietary tabular formats into a standardized catalog. We evaluate ALR with a focus on the impact of different neural network architectures, language features, and semantic similarity. Additionally, we consider metrics for industrial application and present the impact of ALR in production, along with its limitations.
This work was supported by the German Federal Ministry for Economic Affairs and Energy (BMWi) within the Central Innovation Programme for SMEs (grant no. 16KN063729) and antibodies-online GmbH.
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Schmidts, O., Kraft, B., Winkens, M., Zündorf, A. (2021). Catalog Integration of Heterogeneous and Volatile Product Data. In: Hammoudi, S., Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2020. Communications in Computer and Information Science, vol 1446. Springer, Cham. https://doi.org/10.1007/978-3-030-83014-4_7
DOI: https://doi.org/10.1007/978-3-030-83014-4_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83013-7
Online ISBN: 978-3-030-83014-4
eBook Packages: Computer Science, Computer Science (R0)