Catalog Integration of Heterogeneous and Volatile Product Data

Schmidts, Oliver; Kraft, Bodo; Winkens, Marvin; Zündorf, Albert

doi:10.1007/978-3-030-83014-4_7

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1446)

Included in the following conference series:

International Conference on Data Management Technologies and Applications

Abstract

Integrating frequently changing, volatile product data from different manufacturers into a single catalog is a significant challenge for small and medium-sized e-commerce companies. To present aggregated product data in an online shop, they depend on integrating it promptly, without knowing the manufacturers' format specifications, their understanding of concepts, or the quality of their data. Furthermore, formats, concepts, and data quality may change at any time. Consequently, integrating product catalogs into a single standardized catalog is often a laborious manual task. Current strategies to streamline or automate catalog integration use techniques based on machine learning, word vectorization, or semantic similarity. However, most approaches struggle with low-quality or real-world data. We propose Attribute Label Ranking (ALR) as a recommendation engine that simplifies, for practitioners, the integration of previously unknown, proprietary tabular formats into a standardized catalog. We evaluate ALR by focusing on the impact of different neural network architectures, language features, and semantic similarity. Additionally, we consider metrics for industrial application and present the impact of ALR in production and its limitations.

This work was supported by the German Federal Ministry for Economic Affairs and Energy (BMWi) within the Central Innovation Programme for SMEs (grant no. 16KN063729) and antibodies-online GmbH.

Author information

Authors and Affiliations

FH Aachen, University of Applied Sciences, Jülich, Germany
Oliver Schmidts, Bodo Kraft & Marvin Winkens
University of Kassel, Kassel, Germany
Albert Zündorf

Corresponding author

Correspondence to Oliver Schmidts.

Editor information

Editors and Affiliations

MODESTE/ESEO, Angers, France
Slimane Hammoudi
Fraunhofer FIT and RWTH Aachen University, Aachen, Germany
Christoph Quix
University of Coimbra, Coimbra, Portugal
Jorge Bernardino

Cite this paper

Schmidts, O., Kraft, B., Winkens, M., Zündorf, A. (2021). Catalog Integration of Heterogeneous and Volatile Product Data. In: Hammoudi, S., Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2020. Communications in Computer and Information Science, vol 1446. Springer, Cham. https://doi.org/10.1007/978-3-030-83014-4_7

DOI: https://doi.org/10.1007/978-3-030-83014-4_7
Published: 23 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83013-7
Online ISBN: 978-3-030-83014-4
eBook Packages: Computer Science, Computer Science (R0)
