A Semi-automatic Data Extraction System for Heterogeneous Data Sources: a Case Study from Cotton Industry

  • Conference paper
Data Mining (AusDM 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1504)

Abstract

With recent developments in digitisation, an increasing number of documents are available online. Several information extraction tools exist for extracting information from digitised documents. However, identifying precise answers to a given query is often challenging, especially when the data source containing the relevant information is unknown. The task becomes more complex when the data sources come in multiple formats such as PDF, tables and HTML. In this paper, we propose a novel data extraction system that uses text mining approaches to discover relevant and focused information from diverse unstructured data sources. We perform a qualitative analysis to evaluate the proposed system and its suitability and adaptability, using the cotton industry as a case study.

Author information

Corresponding author

Correspondence to Thirunavukarasu Balasubramaniam.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Nayak, R., Balasubramaniam, T., Kutty, S., Banduthilaka, S., Peterson, E. (2021). A Semi-automatic Data Extraction System for Heterogeneous Data Sources: a Case Study from Cotton Industry. In: Xu, Y., et al. Data Mining. AusDM 2021. Communications in Computer and Information Science, vol 1504. Springer, Singapore. https://doi.org/10.1007/978-981-16-8531-6_15

  • DOI: https://doi.org/10.1007/978-981-16-8531-6_15

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-8530-9

  • Online ISBN: 978-981-16-8531-6

  • eBook Packages: Computer Science, Computer Science (R0)
