Entity Resolution Approach of Data Stream Management Systems

Wireless Personal Communications

Abstract

Owing to technological advancements in the Semantic Web and sensor networks, a large amount of data has been produced in association with open data policies. However, data stream management systems that process stream data have focused on handling large data volumes, with little priority given to data identification, integration, and external linkage. Furthermore, entity resolution has focused mainly on technologies for static databases. In this study, we design a real-time stream data processing architecture that can integrate heterogeneous streaming input data, perform entity resolution on it, and interlink it with external data. To achieve this goal, a lightweight adapter that integrates heterogeneous data into a standard schema and a blocking technique that reduces the number of comparison candidates are applied. The implemented data adapters show four times higher throughput than open-source data parsers, and entity resolution on streaming data performs comparably to entity resolution on static data sets. The proposed streaming entity resolution architecture is expected to form the basis of data integration research that efficiently integrates various information sources and enriches internal data.
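The two techniques named in the abstract, an adapter that maps heterogeneous records onto a common schema and blocking that limits which records are compared, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the field names ("name", "label"), the three-character blocking key, the token-based Jaccard similarity, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def normalize(record):
    """Hypothetical adapter step: map a source-specific record onto a common schema."""
    # Assumption: sources carry the entity name under either "name" or "label".
    name = record.get("name") or record.get("label") or ""
    return {"id": record["id"], "name": " ".join(name.lower().split())}


def blocking_key(record):
    """Illustrative blocking key: first three characters of the normalized name."""
    return record["name"][:3]


def jaccard(a, b):
    """Jaccard similarity between the whitespace-token sets of two strings."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class StreamingResolver:
    """Resolve each arriving record against earlier records in the same block only."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold          # assumed match threshold
        self.blocks = defaultdict(list)     # blocking key -> records seen so far
        self.comparisons = 0                # number of pairwise comparisons made

    def process(self, raw):
        record = normalize(raw)
        block = self.blocks[blocking_key(record)]
        matches = []
        for other in block:
            self.comparisons += 1
            if jaccard(record["name"], other["name"]) >= self.threshold:
                matches.append(other["id"])
        block.append(record)                # index the record for later arrivals
        return matches


# Made-up example stream with two different source formats.
stream = [
    {"id": "a1", "name": "Seoul Station Sensor"},
    {"id": "b7", "label": "seoul  station sensor 2"},
    {"id": "c3", "name": "Busan Harbor Gauge"},
    {"id": "d9", "label": "Seoul Station Sensor"},
]
resolver = StreamingResolver(threshold=0.5)
results = {raw["id"]: resolver.process(raw) for raw in stream}
print(results)
print("comparisons:", resolver.comparisons)
```

On this toy stream, blocking performs 3 pairwise comparisons, whereas comparing every arriving record with all earlier ones would take 6. The saving comes at the risk of missing matches whose names differ in the first three characters, which is why the choice of blocking key matters in practice.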

Notes

  1. http://linkeddata.org/.

  2. http://stats.lod2.eu/.

  3. http://opengovernmentdata.org.

  4. http://opensensordata.net/.

  5. http://opensensordata.net/.

  6. Image of LOD from http://lod-cloud.net/.

Acknowledgments

This work was supported by the IT R&D program of MSIP (Ministry of Science, ICT and Future Planning)/IITP (Institute for Information and communications Technology Promotion) [B010-15-0353, High performance database solution development for Integrated big data monitoring and Analytics]. We thank our colleagues from the Institute for Information and communications Technology Promotion, who provided insight and expertise that greatly assisted the research, although they may not agree with all of the interpretations and conclusions of this paper.

Author information

Corresponding author

Correspondence to Do-Heon Jeong.

About this article

Cite this article

Kim, T., Hwang, MN., Kim, YM. et al. Entity Resolution Approach of Data Stream Management Systems. Wireless Pers Commun 91, 1621–1634 (2016). https://doi.org/10.1007/s11277-016-3275-z

  • DOI: https://doi.org/10.1007/s11277-016-3275-z
