Abstract
Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the data complexity on the Web, different queries may retrieve different answers to the same absent field value. To decide the most probable right answer to each absent filed value, existing method issues quite a few available imputation queries for each absent value, and then vote on deciding the most probable right answer. As a result, we have to issue a large number of imputation queries for filling all absent values in an incomplete data set, which brings a large overhead. In this paper, we work on reducing the cost of Web-based Data Imputation in two aspects: First, we propose a query execution scheme which can secure the most probable right answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune queries that probably will fail to return any answers a priori. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based Imputation without hurting its high imputation accuracy.
Keywords
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: ACM DL, pp. 85–94 (2000)
Barnard, J., Rubin, D.: Small-sample degrees of freedom with multiple imputation. Biometrika 86(4), 948–955 (1999)
Batista, G., Monard, M.: An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence 17(5-6), 519–533 (2003)
Chaudhuri, S.: What next?: A half-dozen data management research goals for big data and the cloud. In: PODS, pp. 1–4 (2012)
Cohen, W., Sarawagi, S.: Exploiting dictionaries in named entity extraction: Combining semi-markov extraction processes and data integration methods. In: ACM SIGKDD, pp. 89–98 (2004)
Cormode, G., Golab, L., Flip, K., McGregor, A., Srivastava, D., Zhang, X.: Estimating the confidence of conditional functional dependencies. In: SIGMOD, pp. 469–482 (2009)
Elmeleegy, H., Madhavan, J., Halevy, A.: Harvesting relational tables from lists on the web. PVLDB 2(1), 1078–1089 (2009)
Finkel, J., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by gibbs sampling. In: ACL, pp. 363–370 (2005)
Gomadam, K., Yeh, P.Z., Verma, K.: Data enrichment using data sources on the web. In: 2012 AAAI Spring Symposium Series (2012)
Grzymala-Busse, J., Grzymala-Busse, W., Goodwin, L.: Coping with missing attribute values based on closest fit in preterm birth data: A rough set approach. Computational Intelligence 17(3), 425–434 (2001)
Grzymała-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001)
Gummadi, R., Khulbe, A., Kalavagattu, A., Salvi, S., Kambhampati, S.: Smartint: Using mined attribute dependencies to integrate fragmented web databases. Journal of Intelligent Information Systems, 1–25 (2012)
Gupta, R., Sarawagi, S.: Answering table augmentation queries from unstructured lists on the web. PVLDB 2(1), 289–300 (2009)
Koutrika, G.: Entity Reconstruction: Putting the pieces of the puzzle back together. HP Labs, Palo Alto, USA (2012)
Li, Z., Sharaf, M.A., Sitbon, L., Sadiq, S., Indulska, M., Zhou, X.: Webput: Efficient web-based data imputation. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds.) WISE 2012. LNCS, vol. 7651, pp. 243–256. Springer, Heidelberg (2012)
Li, Z., Sharaf, M.A., Sitbon, L., Sadiq, S., Indulska, M., Zhou, X.: A web-based approach to data imputation. In: WWW (2013)
Lin, W., Yangarber, R., Grishman, R.: Bootstrapped learning of semantic classes from positive and negative examples. In: ICML Workshop on The Continuum from Labeled to Unlabeled Data, vol. 1, p. 21 (2003)
Littlestone, N.: Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2(4), 285–318 (1988)
Loshin, D.: The Data Quality Business Case: Projecting Return on Investment. Informatica (2006)
Pantel, P., Crestan, E., Borkovsky, A., Popescu, A., Vyas, V.: Web-scale distributional similarity and entity set expansion. In: EMNLP, pp. 938–947 (2009)
Quinlan, J.: C4. 5: Programs for machine learning. Morgan kaufmann (1993)
Wu, C., Wun, C., Chou, H.: Using association rules for completing missing data. In: HIS, pp. 236–241 (2004)
Yakout, M., Ganjam, K., Chakrabarti, K., Chaudhuri, S.: Infogather: Entity augmentation and attribute discovery by holistic matching with web tables. In: SIGMOD, pp. 97–108. ACM (2012)
Zhang, S.: Shell-neighbor method and its application in missing data imputation. Applied Intelligence 35(1), 123–133 (2011)
Zhang, S.: Nearest neighbor selection for iteratively knn imputation. Journal of Systems and Software 85(11), 2541–2552 (2012)
Zhu, X., Zhang, S., Jin, Z., Zhang, Z., Xu, Z.: Missing value estimation for mixed-attribute data sets. IEEE TKDE 23(1), 110–121 (2011)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Li, Z., Shang, S., Xie, Q., Zhang, X. (2014). Cost Reduction for Web-Based Data Imputation. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8422. Springer, Cham. https://doi.org/10.1007/978-3-319-05813-9_29
Download citation
DOI: https://doi.org/10.1007/978-3-319-05813-9_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05812-2
Online ISBN: 978-3-319-05813-9
eBook Packages: Computer ScienceComputer Science (R0)