Automated Product-Attribute Mapping

Singh, Karamjit; Gupta, Garima; Shroff, Gautam; Agarwal, Puneet

doi:10.1007/978-3-319-67274-8_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10526)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains. Sometimes this is a missing-data issue, while in other cases it may be inherent, e.g., the records in different geographical databases may actually describe different product ‘SKUs’, or follow different norms for categorization. In practice, a tedious manual mapping process is often employed. We focus on improving such a process using machine-learning-driven automation. Record linkage techniques, such as [5], can be used to automatically map products in different data sources to a common set of global attributes, thereby enabling federated aggregation joins to be performed. Traditional record-linkage techniques are typically unsupervised, relying on textual similarity features across attributes to estimate matches. In this paper, we present an ensemble model combining minimal supervision using Bayesian network models together with unsupervised textual matching for automating such ‘attribute fusion’. We present results of our approach on a large volume of real-life data from a market-research scenario and compare with a standard record matching algorithm. Our approach is especially suited for practical implementation since we also provide confidence values for matches, enabling routing of items for human intervention where required.
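
The general idea of blending a lightly supervised probability with an unsupervised textual score, and routing low-confidence matches to a person, can be sketched as follows. This is an illustration only and not the authors' model: a smoothed frequency table over one attribute stands in for the Bayesian network, Python's difflib stands in for the textual matcher, and the brand names, categories, weight, and threshold are all hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

# Hypothetical global categories and a few labelled (brand, category) records.
CATEGORIES = ["soft drinks", "bottled water", "fruit juice"]
LABELLED = [
    ("acme", "soft drinks"),
    ("acme", "soft drinks"),
    ("acme", "fruit juice"),
    ("purewell", "bottled water"),
]


def text_score(description, category):
    """Unsupervised signal: string similarity in [0, 1] between the
    free-text product description and the category name."""
    return SequenceMatcher(None, description.lower(), category.lower()).ratio()


def fit_brand_model(labelled, categories, alpha=1.0):
    """Supervised signal: Laplace-smoothed estimate of P(category | brand).
    A simplified stand-in for a Bayesian network (assumption)."""
    counts = defaultdict(Counter)
    for brand, category in labelled:
        counts[brand][category] += 1

    def prob(brand, category):
        seen = counts.get(brand, Counter())
        total = sum(seen.values())
        return (seen[category] + alpha) / (total + alpha * len(categories))

    return prob


def map_record(brand, description, prob, weight=0.5, threshold=0.5):
    """Blend both signals with a fixed weight (assumption), pick the best
    category, and flag the record for human review when the blended
    score falls below the threshold."""
    scores = {
        category: weight * prob(brand, category)
        + (1.0 - weight) * text_score(description, category)
        for category in CATEGORIES
    }
    best = max(scores, key=scores.get)
    confidence = scores[best]
    return best, confidence, confidence < threshold


prob = fit_brand_model(LABELLED, CATEGORIES)
for brand, description in [("acme", "Soft Drinks 330ml can"), ("unknown", "xyz")]:
    category, confidence, needs_review = map_record(brand, description, prob)
    print(brand, description, category, round(confidence, 3), needs_review)
```

The fixed weighted average is the simplest possible way to combine the two scores; a real system would likely calibrate or learn the combination, and the review threshold would be tuned against the cost of manual checks.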

References

1. Acheson, E., Peto, R.: Record linkage and the identification of long-term environmental hazards [and discussion]. Proc. Roy. Soc. London B Biol. Sci. 205, 165–178 (1979)
2. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM (2003)
3. Brizan, D.G., Tansel, A.U.: A survey of entity resolution and record linkage methodologies. Commun. IIMA 6, 5 (2015)
4. Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14, 462–467 (1968)
5. Christen, P.: Febrl: a freely available record linkage system with a graphical user interface. In: 2nd Australasian Workshop on Health Data and Knowledge Management, vol. 80. Australian Computer Society, Inc. (2008)
6. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64, 1183–1210 (1969)
7. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian networks to analyze expression data. J. Comput. Biol. 7, 601–620 (2000)
8. Getoor, L., Machanavajjhala, A.: Entity resolution: theory, practice & open challenges. Proc. VLDB Endow. 5, 2018–2019 (2012)
9. Huang, T., Russell, S.: Object identification: a Bayesian analysis with application to traffic surveillance. Artif. Intell. 103, 77–93 (1998)
10. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
11. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3, 484–493 (2010)
12. Lam, W., Bacchus, F.: Learning Bayesian belief networks: an approach based on the MDL principle. Comput. Intell. 10, 269–293 (1994)
13. Li, X., Morie, P., Roth, D.: Semantic integration in text: from ambiguous names to identifiable entities. AI Mag. 26, 45 (2005)
14. Norén, G.N., Orre, R., Bate, A.: A hit-miss model for duplicate detection in the WHO drug safety database. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM (2005)
15. Poon, S., Poon, J., Lam, M., et al.: An ensemble approach for record matching in data linkage. Stud. Health Technol. Inf. 227, 113–119 (2016)
16. Shah, A., Woolf, P.: Python environment for Bayesian learning: inferring the structure of Bayesian networks from knowledge and data. J. Mach. Learn. Res. JMLR 10, 159–162 (2009)
17. Singh, K., Paneri, et al.: Visual Bayesian fusion to navigate a data lake. In: 2016 19th International Conference on Information Fusion (FUSION). ISIF (2016)
18. Singh, K., Shroff, G., Agarwal, P.: Predictive reliability mining for early warnings in populations of connected machines. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE (2015)
19. Uebersax, J.: Genetic Counseling and Cancer Risk Modeling: An Application of Bayes Nets. Ravenpack International, Marbella (2004)
20. Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst
21. Yadav, S., Shroff, G., Hassan, E., Agarwal, P.: Business data fusion. In: 2015 18th International Conference on Information Fusion (Fusion). IEEE (2015)

Author information

Authors and Affiliations

TCS Research, Gurgaon, 122003, India
Karamjit Singh, Garima Gupta, Gautam Shroff & Puneet Agarwal

Corresponding author

Correspondence to Garima Gupta.

Editor information

Editors and Affiliations

Seoul National University, Seoul, Korea (Republic of)
U Kang
School of Information Systems, Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
Kangwon National University, Chuncheon, Korea (Republic of)
Yang-Sae Moon

Cite this paper

Singh, K., Gupta, G., Shroff, G., Agarwal, P. (2017). Automated Product-Attribute Mapping. In: Kang, U., Lim, EP., Yu, J., Moon, YS. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science, vol 10526. Springer, Cham. https://doi.org/10.1007/978-3-319-67274-8_15

DOI: https://doi.org/10.1007/978-3-319-67274-8_15
Published: 07 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67273-1
Online ISBN: 978-3-319-67274-8
eBook Packages: Computer Science; Computer Science (R0)
