Abstract
Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains. Sometimes this is a missing-data issue, while in other cases it may be inherent, e.g., the records in different geographical databases may actually describe different product ‘SKUs’ or follow different norms for categorization. In practice, a tedious manual mapping process is often employed. We focus on improving such a process using machine-learning-driven automation. Record linkage techniques, such as [5], can be used to automatically map products in different data sources to a common set of global attributes, thereby enabling federated aggregation joins to be performed. Traditional record-linkage techniques are typically unsupervised, relying on textual similarity features across attributes to estimate matches. In this paper, we present an ensemble model that combines minimal supervision, using Bayesian network models, with unsupervised textual matching to automate such ‘attribute fusion’. We present results of our approach on a large volume of real-life data from a market-research scenario and compare it with a standard record-matching algorithm. Our approach is especially suited for practical implementation since we also provide confidence values for matches, enabling routing of items for human intervention where required.
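The idea of fusing a lightly supervised probabilistic model with unsupervised text matching, and routing low-confidence items to a person, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the method evaluated in the paper: naive Bayes stands in for the Bayesian network, `difflib.SequenceMatcher` stands in for the textual matcher, and the names `text_scores`, `NaiveBayes`, `fuse`, `weight` and `threshold` are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def text_scores(description, categories):
    """Unsupervised part: string similarity between a description and each category name."""
    raw = {c: SequenceMatcher(None, description.lower(), c.lower()).ratio()
           for c in categories}
    total = sum(raw.values()) or 1.0  # avoid division by zero when nothing matches
    return {c: s / total for c, s in raw.items()}


class NaiveBayes:
    """Supervised part: naive Bayes, the simplest Bayesian network (class -> each attribute).

    An illustrative stand-in, not the network structure used by the authors.
    """

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
        self.values = defaultdict(set)            # attribute -> values seen in training
        for row, label in zip(rows, labels):
            for attr, value in row.items():
                self.value_counts[(label, attr)][value] += 1
                self.values[attr].add(value)
        return self

    def posterior(self, row):
        n = sum(self.class_counts.values())
        scores = {}
        for c, count in self.class_counts.items():
            p = count / n  # class prior
            for attr, value in row.items():
                # Laplace smoothing so an unseen value does not zero out the class.
                k = len(self.values[attr] | {value})
                p *= (self.value_counts[(c, attr)][value] + 1) / (count + k)
            scores[c] = p
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}


def fuse(description, row, model, weight=0.5, threshold=0.6):
    """Average the two distributions and flag low-confidence matches for human review."""
    post = model.posterior(row)
    text = text_scores(description, list(post))
    combined = {c: weight * post[c] + (1 - weight) * text[c] for c in post}
    best = max(combined, key=combined.get)
    confidence = combined[best]
    return best, confidence, confidence < threshold


# A handful of labelled records plays the role of "minimal supervision".
rows = [
    {"brand": "Acme", "pack": "bottle"},
    {"brand": "Acme", "pack": "can"},
    {"brand": "Bolt", "pack": "box"},
    {"brand": "Bolt", "pack": "bag"},
]
labels = ["beverages", "beverages", "snacks", "snacks"]
model = NaiveBayes().fit(rows, labels)

best, confidence, needs_review = fuse(
    "Acme sparkling beverage 330ml", {"brand": "Acme", "pack": "can"}, model
)
print(best, round(confidence, 3), needs_review)
```

The third return value is what would drive routing: items whose combined score falls below `threshold` would go to a human mapper, while the rest are mapped automatically. The equal weighting and the threshold value are arbitrary choices here and would need tuning on held-out data.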
References
Acheson, E., Peto, R.: Record linkage and the identification of long-term environmental hazards [and discussion]. Proc. Roy. Soc. London B Biol. Sci. 205, 165–178 (1979)
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM (2003)
Brizan, D.G., Tansel, A.U.: A survey of entity resolution and record linkage methodologies. Commun. IIMA 6, 5 (2015)
Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14, 462–467 (1968)
Christen, P.: Febrl: a freely available record linkage system with a graphical user interface. In: 2nd Australasian Workshop on Health Data and Knowledge Management, vol. 80. Australian Computer Society, Inc. (2008)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64, 1183–1210 (1969)
Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using bayesian networks to analyze expression data. J. Comput. Biol. 7, 601–620 (2000)
Getoor, L., Machanavajjhala, A.: Entity resolution: theory, practice & open challenges. Proc. VLDB Endow. 5, 2018–2019 (2012)
Huang, T., Russell, S.: Object identification: a bayesian analysis with application to traffic surveillance. Artif. Intell. 103, 77–93 (1998)
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3, 484–493 (2010)
Lam, W., Bacchus, F.: Learning bayesian belief networks: an approach based on the MDL principle. Comput. Intell. 10, 269–293 (1994)
Li, X., Morie, P., Roth, D.: Semantic integration in text: from ambiguous names to identifiable entities. AI Mag. 26, 45 (2005)
Norén, G.N., Orre, R., Bate, A.: A hit-miss model for duplicate detection in the WHO drug safety database. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM (2005)
Poon, S., Poon, J., Lam, M., et al.: An ensemble approach for record matching in data linkage. Stud. Health Technol. Inf. 227, 113–119 (2016)
Shah, A., Woolf, P.: Python environment for bayesian learning: inferring the structure of bayesian networks from knowledge and data. J. Mach. Learn. Res. JMLR 10, 159–162 (2009)
Singh, K., Paneri, et al.: Visual bayesian fusion to navigate a data lake. In: 2016 19th International Conference on Information Fusion (FUSION). ISIF (2016)
Singh, K., Shroff, G., Agarwal, P.: Predictive reliability mining for early warnings in populations of connected machines. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE (2015)
Uebersax, J.: Genetic Counseling and Cancer Risk Modeling: An Application of Bayes Nets. Ravenpack International, Marbella (2004)
Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst
Yadav, S., Shroff, G., Hassan, E., Agarwal, P.: Business data fusion. In: 2015 18th International Conference on Information Fusion (Fusion). IEEE (2015)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Singh, K., Gupta, G., Shroff, G., Agarwal, P. (2017). Automated Product-Attribute Mapping. In: Kang, U., Lim, EP., Yu, J., Moon, YS. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science, vol 10526. Springer, Cham. https://doi.org/10.1007/978-3-319-67274-8_15
DOI: https://doi.org/10.1007/978-3-319-67274-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67273-1
Online ISBN: 978-3-319-67274-8
eBook Packages: Computer Science, Computer Science (R0)