Approximate entity extraction in temporal databases

Lu, Wei; Fung, Gabriel Pui Cheong; Du, Xiaoyong; Zhou, Xiaofang; Chen, Lijiang; Deng, Ke

doi:10.1007/s11280-011-0109-5

Published: 21 January 2011

World Wide Web, Volume 14, pages 157–186 (2011)

Wei Lu^1,2, Gabriel Pui Cheong Fung^3, Xiaoyong Du^1,2, Xiaofang Zhou^1,2,4, Lijiang Chen^5 & Ke Deng^4

Abstract

We study the problem of efficiently extracting the K entities in a temporal database that are most similar to a given search query. This problem is well studied in relational databases, where each entity is represented as a single record and a variety of methods exist for defining the similarity between a record and the search query. In temporal databases, however, each entity is represented as a sequence of historical records, and how to properly define the similarity of an entity to a query remains an open problem. The main challenge is that a user who issues a search query for an entity is prone to mix up information about the same entity from different time points. As a result, the record-granularity methods used in relational databases no longer work. Instead, we regard each entity as a set of “virtual records”, where the attribute values of a “virtual record” can come from different records of the same entity. In this paper, we propose a novel evaluation model that effectively quantifies the similarity between each “virtual record” and the query, and we take the maximum similarity over an entity's “virtual records” as the similarity of that entity. Because the number of “virtual records” of an entity is exponentially large, calculating the similarity of the entity is challenging. We therefore further propose a Dominating Tree Algorithm (DTA), based on a bounding-pruning-refining strategy, to efficiently extract the K entities with the greatest similarities. We conduct extensive experiments on both real and synthetic datasets. The encouraging results show that our model for defining the similarity between each entity and the search query is effective, and that the proposed DTA improves performance by at least two orders of magnitude compared with the naive approach.
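
To make the notion of “virtual records” concrete, the following is a minimal sketch of a naive, exhaustive approach of the kind the abstract contrasts with DTA. It is not the paper's evaluation model or the DTA itself: the abstract does not specify the scoring function, so `record_similarity` (an average of per-attribute token Jaccard scores) is a stand-in assumption, and all function names and the toy data are hypothetical.

```python
from itertools import product
import heapq


def token_jaccard(a, b):
    """Jaccard similarity between the word sets of two strings (assumed metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def virtual_records(history):
    """Yield every virtual record of one entity.

    `history` is the entity's list of historical records (dicts sharing the
    same keys). A virtual record takes each attribute value from any record
    of the history, so the count grows exponentially with the attribute count.
    """
    attrs = list(history[0].keys())
    columns = [sorted({rec[a] for rec in history}) for a in attrs]
    for combo in product(*columns):
        yield dict(zip(attrs, combo))


def record_similarity(record, query):
    """Placeholder score: mean token Jaccard over the queried attributes."""
    scores = [token_jaccard(record[a], query[a]) for a in query]
    return sum(scores) / len(scores)


def entity_similarity(history, query):
    """Entity similarity = maximum similarity over all of its virtual records."""
    return max(record_similarity(v, query) for v in virtual_records(history))


def top_k_entities(database, query, k):
    """Naive top-K: score every entity exhaustively, keep the K best."""
    scored = ((entity_similarity(h, query), eid) for eid, h in database.items())
    return heapq.nlargest(k, scored)


# Toy temporal database: entity id -> sequence of historical records.
database = {
    "e1": [
        {"name": "Alice Smith", "city": "Beijing", "employer": "Acme"},
        {"name": "Alice Jones", "city": "Brisbane", "employer": "Globex"},
    ],
    "e2": [
        {"name": "Bob Smith", "city": "Tempe", "employer": "Initech"},
    ],
}

# The query mixes a later name with an earlier city of the same entity.
query = {"name": "Alice Jones", "city": "Beijing"}

print(top_k_entities(database, query, k=1))  # [(1.0, 'e1')]
```

In this toy case no single historical record of `e1` matches the query fully, but the virtual record combining the name from the second record with the city from the first does. The exhaustive enumeration in `virtual_records` is what becomes infeasible as histories and attribute counts grow.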

Author information

Authors and Affiliations

School of Information, Renmin University of China, 100872, Beijing, China
Wei Lu, Xiaoyong Du & Xiaofang Zhou
Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China
Wei Lu, Xiaoyong Du & Xiaofang Zhou
Data Mining and Machine Learning Group, Arizona State University, 699 S. Mill Ave, Tempe, Arizona, USA
Gabriel Pui Cheong Fung
School of ITEE, The University of Queensland, GP78 South, St. Lucia, Brisbane, Australia
Xiaofang Zhou & Ke Deng
Department of Computer Science, Peking University, 100872, Beijing, China
Lijiang Chen

Corresponding author

Correspondence to Xiaoyong Du.

About this article

Cite this article

Lu, W., Fung, G.P.C., Du, X. et al. Approximate entity extraction in temporal databases. World Wide Web 14, 157–186 (2011). https://doi.org/10.1007/s11280-011-0109-5

Received: 25 December 2009
Revised: 25 November 2010
Accepted: 04 January 2011
Published: 21 January 2011
Issue Date: March 2011
DOI: https://doi.org/10.1007/s11280-011-0109-5
