
A Comparison of Information Retrieval Pre-processing Algorithms Applied to African Historical Data

  • Conference paper
From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries (ICADL 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13636)

Abstract

African historical data presents unique challenges to search algorithms because much of it was produced by colonial authorities or by archivists far from its source. Contemporary datasets include descriptions of artefacts held in European museums and books written by colonial administrators, both of which encode African history. Both are arguably biased collections, and the information retrieval algorithms used to search them may not provide modern researchers with relevant results. The goal of this study was therefore to investigate the degree to which common text and image pre-processing algorithms affect the quality of search results when users search a current African historical data collection. Nine common algorithms were compared in terms of recall, precision and normalised discounted cumulative gain (NDCG). The results indicate that text retrieval performs better when stemming and stopping (stop-word removal) are used, whereas the benefit of a thesaurus may depend on the thesaurus chosen. Results from the image pre-processing experiment indicate that shape detectors generally work better than colour detectors.
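The three evaluation measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the cut-off parameter `k`, the document identifiers and the linear-gain form of DCG are all assumptions chosen for illustration, and the abstract does not say at which cut-offs the measures were computed.

```python
import math
from typing import Mapping, Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k returned documents that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc in top if doc in relevant)
    # Divides by the number of documents actually returned (at most k).
    return hits / len(top)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked: Sequence[str], gains: Mapping[str, float], k: int) -> float:
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    actual = [gains.get(doc, 0.0) for doc in ranked[:k]]
    ideal = sorted(gains.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical ranking and relevance judgements, for illustration only.
    ranked = ["d3", "d1", "d7", "d2"]
    gains = {"d1": 2.0, "d2": 1.0, "d5": 1.0}
    relevant = set(gains)
    print(precision_at_k(ranked, relevant, 4))
    print(recall_at_k(ranked, relevant, 4))
    print(ndcg_at_k(ranked, gains, 4))
```

Precision and recall treat relevance as binary, whereas NDCG also rewards placing the most relevant documents near the top of the ranking, which is why the three measures are usually reported together.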

Acknowledgements

This research was partially funded by the National Research Foundation of South Africa (Grant numbers: 105862, 119121 and 129253) and the University of Cape Town. The authors acknowledge that opinions, findings and conclusions or recommendations expressed in this publication are those of the authors, and that the NRF accepts no liability whatsoever in this regard.

We would like to acknowledge the Archive & Public Culture research initiative at the University of Cape Town for allowing this research to use the Five Hundred Year Archive data collection for the purposes of this study.

Author information

Corresponding author

Correspondence to Hussein Suleman.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Singh, S., Suleman, H. (2022). A Comparison of Information Retrieval Pre-processing Algorithms Applied to African Historical Data. In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_18

  • DOI: https://doi.org/10.1007/978-3-031-21756-2_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21755-5

  • Online ISBN: 978-3-031-21756-2

  • eBook Packages: Computer Science, Computer Science (R0)
