Abstract
Data quality is a central concern in massive data processing. When we want to distill valuable knowledge from a massive dataset, the key question is whether the dataset is clean, so data cleaning should precede any attempt to extract useful information. Similarity search is an important method in data cleaning, and our data cleaning system uses MapReduce to perform it. However, its efficiency is very low: when massive data stored in HDFS is processed with the MapReduce programming model, every part of the dataset is scanned, which is very time-consuming, especially for large-scale datasets. In this paper, we filter the original data in hardware before applying similarity search for data cleaning.
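The filter-then-search idea can be illustrated with a minimal software sketch. It is not the paper's implementation: the hardware filter is replaced here by an ordinary Python predicate, and the similarity measure (Jaccard over word tokens), the threshold, and all function names (`prefilter`, `similar_pairs`, `tokens`, `jaccard`) are assumptions chosen only for illustration.

```python
from itertools import combinations


def tokens(record):
    """Split a record into a set of lowercase word tokens."""
    return set(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefilter(records, keep):
    """Software stand-in for the hardware filter (assumption):
    drop records that fail the predicate before the search runs."""
    return [r for r in records if keep(r)]


def similar_pairs(records, threshold):
    """Return (i, j, score) for every pair of records whose
    Jaccard score is at least the threshold (brute-force all pairs)."""
    sets = [tokens(r) for r in records]
    out = []
    for i, j in combinations(range(len(sets)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            out.append((i, j, score))
    return out


# Hypothetical sample data, invented for this sketch.
records = [
    "John Smith 12 Main Street",
    "john smith 12 main street",
    "Alice Wong 9 Lake Road",
    "",
    "n/a",
]

# Filter first, so the quadratic search only sees records worth comparing.
candidates = prefilter(records, keep=lambda r: len(r.split()) >= 2)
pairs = similar_pairs(candidates, threshold=0.8)
print(pairs)
```

Because the pairwise search grows quadratically with the number of records, every record discarded by the filter reduces the work of the later stage. That is the motivation for filtering before the similarity search rather than after it.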
Acknowledgements
This paper was partially supported by National Sci-Tech Support Plan 2015BAH10F01 and NSFC grants U1509216, 61472099, and 61133002.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, Y., Gao, H., Shi, S., Wang, H. (2016). Similarity Search on Massive Data Based on FPGA. In: Gao, H., Kim, J., Sakurai, Y. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9645. Springer, Cham. https://doi.org/10.1007/978-3-319-32055-7_28
DOI: https://doi.org/10.1007/978-3-319-32055-7_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32054-0
Online ISBN: 978-3-319-32055-7
eBook Packages: Computer Science, Computer Science (R0)