Towards De-duplication Framework in Big Data Analysis. A Case Study

Maślankowski, Jacek

doi:10.1007/978-3-319-46642-2_7

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 264)

Included in the following conference series:

EuroSymposium on Systems Analysis and Design

Abstract

Big Data analysis gives access to a wider range of information. In particular, it allows unstructured and structured data to be processed together. However, having many data sources does not mean that the quality of the data is sufficient to provide reliable results. There are several different quality indicators related to Big Data analysis. In this paper we focus on the two that are most critical in the first phase of data processing: ambiguity and duplicates. The goal of this paper is to propose a framework for eliminating duplicates in large datasets acquired through Big Data analysis.

Author information

Authors and Affiliations

Department of Business Informatics, University of Gdańsk, Gdańsk, Poland
Jacek Maślankowski

Corresponding author

Correspondence to Jacek Maślankowski.

Editor information

Editors and Affiliations

Department of Business Informatics, University of Gdansk, Sopot, Poland
Stanislaw Wrycza

About this paper

Cite this paper

Maślankowski, J. (2016). Towards De-duplication Framework in Big Data Analysis. A Case Study. In: Wrycza, S. (eds) Information Systems: Development, Research, Applications, Education. SIGSAND/PLAIS 2016. Lecture Notes in Business Information Processing, vol 264. Springer, Cham. https://doi.org/10.1007/978-3-319-46642-2_7

DOI: https://doi.org/10.1007/978-3-319-46642-2_7
Published: 22 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46641-5
Online ISBN: 978-3-319-46642-2
eBook Packages: Business and Management, Business and Management (R0)
