
A More Effective Sentence-Wise Text Segmentation Approach Using BERT

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12824)

Abstract

Text segmentation is a Natural Language Processing task that aims to divide paragraphs and bodies of text into topical, semantically coherent blocks. It plays an important role in creating structured, searchable text-based representations, for example after digitizing paper-based documents. Traditionally, text segmentation has been approached with sub-optimal feature engineering and heuristic modelling. We propose a novel supervised training procedure over a pre-labeled text corpus, together with a neural Deep Learning model that yields improved predictions. Our results, evaluated with the Pk and WindowDiff metrics, show performance improvements beyond any currently available public text segmentation system. The proposed system uses Bidirectional Encoder Representations from Transformers (BERT) as an encoding mechanism that feeds several downstream layers and a final classification output layer, and it shows promise for further improvements with future iterations of BERT.
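
As a concrete illustration of the architecture the abstract describes, here is a minimal sketch, not the authors' released code: each sentence is encoded with BERT, and the sentence encoding feeds a small stack of downstream layers ending in a binary classification output that predicts whether the sentence opens a new segment. The checkpoint name, layer sizes, and helper names are all illustrative assumptions.

```python
# Minimal sketch of sentence-wise segmentation with BERT (an assumed design,
# not the paper's exact architecture): encode each sentence, then classify
# whether it begins a new topical segment.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

class SegmentBoundaryHead(nn.Module):
    """Downstream layers over BERT sentence encodings, ending in a
    single-unit classification output (1 = sentence starts a new segment)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden_size, 256),  # layer sizes are illustrative
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, sentence_vectors: torch.Tensor) -> torch.Tensor:
        # sentence_vectors: (num_sentences, hidden_size)
        return torch.sigmoid(self.layers(sentence_vectors)).squeeze(-1)

def encode_sentences(sentences):
    """Use each sentence's [CLS] vector as its fixed-size encoding."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        output = bert(**batch)
    return output.last_hidden_state[:, 0, :]  # [CLS] token embeddings

sentences = [
    "The match ended in a scoreless draw.",
    "Fans filed out of the stadium quietly.",
    "Quarterly revenue, meanwhile, rose by eight percent.",  # topic shift
]
head = SegmentBoundaryHead()
boundary_probs = head(encode_sentences(sentences))  # untrained, so arbitrary
print(boundary_probs)
```

In practice the head would be trained with a binary cross-entropy loss against pre-labeled segment boundaries, following the supervised procedure the abstract outlines.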
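
The evaluation metrics named in the abstract, Pk and WindowDiff, both slide a fixed-size window over the sentence sequence and count disagreements between hypothesized and reference boundaries, so lower values are better. NLTK ships reference implementations; the boundary strings below are made-up toy data.

```python
# Toy evaluation with NLTK's segmentation metrics; '1' marks a boundary.
from nltk.metrics.segmentation import pk, windowdiff

ref = "0100010000"  # reference segmentation of 10 sentences (toy data)
hyp = "0100000100"  # hypothesized segmentation of the same sentences

print("Pk:        ", pk(ref, hyp))  # k defaults to half the mean segment length
print("WindowDiff:", windowdiff(ref, hyp, k=3))
```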



Acknowledgements

The second author gratefully acknowledges the support of an NSERC Discovery Grant.

Author information

Corresponding author

Correspondence to Amit Maraj.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Maraj, A., Martin, M.V., Makrehchi, M. (2021). A More Effective Sentence-Wise Text Segmentation Approach Using BERT. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol. 12824. Springer, Cham. https://doi.org/10.1007/978-3-030-86337-1_16


  • DOI: https://doi.org/10.1007/978-3-030-86337-1_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86336-4

  • Online ISBN: 978-3-030-86337-1

  • eBook Packages: Computer Science, Computer Science (R0)
