A Semi-automatic Data Extraction System for Heterogeneous Data Sources: a Case Study from Cotton Industry

Nayak, Richi; Balasubramaniam, Thirunavukarasu; Kutty, Sangeetha; Banduthilaka, Sachindra; Peterson, Erin

doi:10.1007/978-981-16-8531-6_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1504)

Included in the following conference series:

Australasian Conference on Data Mining

Abstract

With recent developments in digitisation, an increasing number of documents are available online. Several information extraction tools can extract information from digitised documents. However, identifying precise answers to a given query is often challenging, especially when the data source containing the relevant information is unknown. The problem becomes more complex when the data sources come in multiple formats such as PDF, table and HTML. In this paper, we propose a novel data extraction system, based on text mining approaches, to discover relevant and focused information from diverse unstructured data sources. We perform a qualitative analysis to evaluate the proposed system, including its suitability and adaptability, using a case study from the cotton industry.
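
As a rough illustration of the kind of query-driven retrieval the abstract describes, the sketch below ranks text passages from mixed sources against a query using TF-IDF weighting. This is not the system proposed in the paper: the abstract does not specify the method, so the choice of TF-IDF, the function names `tokenize` and `rank_passages`, and the sample passages and file names are all assumptions made for illustration. It also assumes the text has already been extracted from each format.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_passages(query, passages):
    """Rank (source, text) pairs by a simple TF-IDF score against the query.

    Returns a list of (score, source, text) tuples, highest score first.
    """
    term_counts = [Counter(tokenize(text)) for _, text in passages]
    n_docs = len(term_counts)

    # Document frequency: in how many passages each term appears.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    query_terms = set(tokenize(query))
    scored = []
    for (source, text), counts in zip(passages, term_counts):
        length = sum(counts.values()) or 1
        score = 0.0
        for term in query_terms:
            tf = counts[term] / length
            # Smoothed inverse document frequency.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            score += tf * idf
        scored.append((score, source, text))

    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical passages, assumed to be already extracted from each format.
passages = [
    ("report.pdf", "Cotton growers reported water use per hectare in the annual survey."),
    ("yields.csv", "region yield bales per hectare"),
    ("news.html", "The conference on data mining was held in Brisbane."),
]

ranked = rank_passages("cotton water use", passages)
for score, source, _ in ranked:
    print(f"{source}: {score:.3f}")
```

Normalising all sources to plain text before scoring keeps the ranking step independent of the original format, which is one simple way to handle heterogeneous inputs; a production system would more likely rely on a dedicated search engine.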

Author information

Authors and Affiliations

School of Computer Science and Centre for Data Science, Queensland University of Technology, Brisbane, Australia
Richi Nayak, Thirunavukarasu Balasubramaniam & Sangeetha Kutty
Redeye Apps Pvt Ltd, Brisbane, Australia
Sachindra Banduthilaka
Erin Peterson Consulting, Brisbane, Australia
Erin Peterson

Corresponding author

Correspondence to Thirunavukarasu Balasubramaniam.

Editor information

Editors and Affiliations

Queensland University of Technology, Brisbane, QLD, Australia
Yue Xu
Western Sydney University, Parramatta, NSW, Australia
Rosalind Wang
University of Queensland, Herston, Australia
Anton Lord
RMIT University, Melbourne, VIC, Australia
Yee Ling Boo
Queensland University of Technology, Brisbane, QLD, Australia
Richi Nayak
Data61, CSIRO, Canberra, ACT, Australia
Yanchang Zhao
Australian National University, Canberra, ACT, Australia
Graham Williams

About this paper

Cite this paper

Nayak, R., Balasubramaniam, T., Kutty, S., Banduthilaka, S., Peterson, E. (2021). A Semi-automatic Data Extraction System for Heterogeneous Data Sources: a Case Study from Cotton Industry. In: Xu, Y., et al. Data Mining. AusDM 2021. Communications in Computer and Information Science, vol 1504. Springer, Singapore. https://doi.org/10.1007/978-981-16-8531-6_15

DOI: https://doi.org/10.1007/978-981-16-8531-6_15
Published: 09 December 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-8530-9
Online ISBN: 978-981-16-8531-6
eBook Packages: Computer Science, Computer Science (R0)
