Non-structured Data Integration Access Policy Using Hadoop

Wireless Personal Communications

Abstract

The rapid growth of unstructured data has become a key driver of enterprise development. Gaining effective access to massive amounts of unstructured data requires addressing several problems: data stored in scattered locations, differing access methods, and non-unified data formats. In this article, we use Hadoop to build a distributed computing platform that stores unstructured data, and we improve the Hadoop scheduling algorithm on the basis of the end time of slow tasks. In a heterogeneous environment, the improved algorithm avoids the slow execution of batches of tasks caused by nodes that run at non-uniform speeds, and it improves operating efficiency and stability. Furthermore, we propose a method for constructing a classification index without training sets, in which the term frequency–inverse document frequency (TF-IDF) weight formula is improved by introducing timeliness and entropy. On this basis, we propose a classification algorithm that follows the principles of document similarity and document classification and does not use training sets. Finally, we describe the construction process of the classification index that is based on the training set by combining Hadoop and Lucene. As a proof of concept, we implement a prototype system on a Hadoop platform that uses our improved scheduling algorithm and conduct experimental studies to demonstrate the feasibility and performance of our approach.
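
The abstract does not state the improved scheduling rule, so the following is only a minimal sketch of the general idea of ranking running tasks by their estimated end time, using the common progress-rate heuristic for spotting slow tasks. The names `RunningTask`, `estimated_end_time`, and `pick_backup_candidate` are hypothetical and are not part of Hadoop's API; the progress figures are made-up illustration values, not results from the article.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    # Hypothetical record of a running task; not a Hadoop class.
    task_id: str
    progress: float   # fraction of work completed, between 0 and 1
    elapsed_s: float  # seconds since the task started


def estimated_end_time(task: RunningTask, now_s: float) -> float:
    """Estimate when a task will finish from its observed progress rate."""
    rate = task.progress / task.elapsed_s           # progress per second
    remaining_s = (1.0 - task.progress) / rate      # time left at that rate
    return now_s + remaining_s


def pick_backup_candidate(tasks, now_s):
    """Return the running task expected to finish last, or None."""
    # Skip tasks with no measurable progress yet and tasks already done.
    candidates = [
        t for t in tasks if 0.0 < t.progress < 1.0 and t.elapsed_s > 0.0
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda t: estimated_end_time(t, now_s))


# Illustrative values only: map_1 runs on a slow node.
tasks = [
    RunningTask("map_0", progress=0.8, elapsed_s=40.0),
    RunningTask("map_1", progress=0.2, elapsed_s=40.0),
    RunningTask("map_2", progress=0.5, elapsed_s=20.0),
]
slowest = pick_backup_candidate(tasks, now_s=100.0)
print(slowest.task_id)
```

Ranking by estimated end time rather than by raw progress means that a task which started late but runs quickly is not mistaken for a slow one, which matters when nodes differ in speed.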



Acknowledgements

The authors are supported by the Science and Technology Research Project of Chongqing Education Committee of China (KJ1602203) and the Scientific Research Programs in Higher Education of Chongqing Institute of Higher Education (CQGJ15203B).

Author information

Corresponding author

Correspondence to Ting Cai.

About this article

Cite this article

Cai, T., Yang, X. Non-structured Data Integration Access Policy Using Hadoop. Wireless Pers Commun 102, 895–908 (2018). https://doi.org/10.1007/s11277-017-5112-4

  • DOI: https://doi.org/10.1007/s11277-017-5112-4
