Non-structured Data Integration Access Policy Using Hadoop

Cai, Ting; Yang, Xuemei

doi:10.1007/s11277-017-5112-4

Published: 13 December 2017

Volume 102, pages 895–908, (2018)

Wireless Personal Communications


Abstract

The rapid growth of unstructured data has become a key driver of enterprise development. Gaining effective access to massive amounts of unstructured data requires addressing several problems: data are stored in scattered locations, access methods differ, and formats are not unified. In this article, we use Hadoop to build a distributed computing platform that stores unstructured data, and we improve the Hadoop scheduling algorithm on the basis of the end time of slow tasks. In a heterogeneous Hadoop environment, the improved algorithm can avoid the slow bulk-task execution caused by nodes of non-uniform speed and can improve operating efficiency and stability. Furthermore, we propose a classification index construction method that does not use training sets, and we improve the term frequency–inverse document frequency (TF-IDF) weight formula by introducing timeliness and entropy. On this basis, we propose a classification algorithm that follows the principles of document similarity and document classification and does not use training sets. Finally, we describe how the classification index that is based on the training set is constructed by combining Hadoop and Lucene. As a proof of concept, we implement a prototype system on a Hadoop platform that runs our improved scheduling algorithm, and we conduct experimental studies to demonstrate the feasibility and performance of our approach.
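
To make the similarity-based idea in the abstract concrete, the following is a minimal sketch of the standard TF-IDF weighting and cosine similarity that such an approach would start from. It is not the authors' improved formula: the abstract does not give the timeliness and entropy terms, so they are omitted here. The function names (`tf_idf_vectors`, `cosine_similarity`, `most_similar`) and the toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the textbook weight tf * log(N / df). This is the unmodified
    baseline, without the timeliness and entropy factors the paper adds.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, vectors):
    """Index of the other document closest to vectors[index]."""
    candidates = [i for i in range(len(vectors)) if i != index]
    return max(candidates,
               key=lambda i: cosine_similarity(vectors[index], vectors[i]))


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["hadoop", "scheduler", "task"],
    ["hadoop", "scheduler", "node"],
    ["lucene", "index", "search"],
]
vectors = tf_idf_vectors(docs)
nearest = most_similar(0, vectors)
```

In this sketch, documents that share weighted terms score above zero and documents with no terms in common score exactly zero, so a document can be grouped with its nearest neighbour without any labelled training data.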

Acknowledgements

The authors are supported by the Science and Technology Research Project of Chongqing Education Committee of China (KJ1602203) and the Scientific Research Programs in Higher Education of Chongqing Institute of Higher Education (CQGJ15203B).

Author information

Authors and Affiliations

College of Mobile Telecommunications, Chongqing University of Posts and Telecommunications, Chongqing, China
Ting Cai & Xuemei Yang

Corresponding author

Correspondence to Ting Cai.

About this article

Cite this article

Cai, T., Yang, X. Non-structured Data Integration Access Policy Using Hadoop. Wireless Pers Commun 102, 895–908 (2018). https://doi.org/10.1007/s11277-017-5112-4

Published: 13 December 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s11277-017-5112-4
