Genre identification for office document search and browsing

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to identification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.

Author information

Corresponding author

Correspondence to Francine Chen.

About this article

Cite this article

Chen, F., Girgensohn, A., Cooper, M. et al. Genre identification for office document search and browsing. IJDAR 15, 167–182 (2012). https://doi.org/10.1007/s10032-011-0163-7

  • DOI: https://doi.org/10.1007/s10032-011-0163-7
