
A More Effective Sentence-Wise Text Segmentation Approach Using BERT

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12824)

Abstract

Text segmentation is a Natural Language Processing task that aims to divide paragraphs and bodies of text into topical, semantically coherent blocks. It plays an important role in creating structured, searchable text-based representations, for example after digitizing paper-based documents. Traditionally, text segmentation has been approached with sub-optimal feature engineering and heuristic modelling. We propose a novel supervised training procedure over a pre-labeled text corpus, together with a neural Deep Learning model that yields improved predictions. Our results, evaluated with the Pk and WindowDiff metrics, show performance improvements beyond any currently available public text segmentation system. The proposed system uses Bidirectional Encoder Representations from Transformers (BERT) as an encoding mechanism that feeds several downstream layers and a final classification output layer, and it shows promise for further improvements with future iterations of BERT.
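
As a concrete illustration of the architecture the abstract describes, here is a minimal sketch, not the authors' released code: each sentence is encoded with BERT, and the sentence encoding feeds a small stack of downstream layers ending in a binary classification output that predicts whether the sentence opens a new segment. The checkpoint name, layer sizes, and helper names are all illustrative assumptions.

```python
# Minimal sketch of sentence-wise segmentation with BERT (an assumed design,
# not the paper's exact architecture): encode each sentence, then classify
# whether it begins a new topical segment.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

class SegmentBoundaryHead(nn.Module):
    """Downstream layers over BERT sentence encodings, ending in a
    single-unit classification output (1 = sentence starts a new segment)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden_size, 256),  # layer sizes are illustrative
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, sentence_vectors: torch.Tensor) -> torch.Tensor:
        # sentence_vectors: (num_sentences, hidden_size)
        return torch.sigmoid(self.layers(sentence_vectors)).squeeze(-1)

def encode_sentences(sentences):
    """Use each sentence's [CLS] vector as its fixed-size encoding."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        output = bert(**batch)
    return output.last_hidden_state[:, 0, :]  # [CLS] token embeddings

sentences = [
    "The match ended in a scoreless draw.",
    "Fans filed out of the stadium quietly.",
    "Quarterly revenue, meanwhile, rose by eight percent.",  # topic shift
]
head = SegmentBoundaryHead()
boundary_probs = head(encode_sentences(sentences))  # untrained, so arbitrary
print(boundary_probs)
```

In practice the head would be trained with a binary cross-entropy loss against pre-labeled segment boundaries, following the supervised procedure the abstract outlines.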
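
The evaluation metrics named in the abstract, Pk and WindowDiff, both slide a fixed-size window over the sentence sequence and count disagreements between hypothesized and reference boundaries, so lower values are better. NLTK ships reference implementations; the boundary strings below are made-up toy data.

```python
# Toy evaluation with NLTK's segmentation metrics; '1' marks a boundary.
from nltk.metrics.segmentation import pk, windowdiff

ref = "0100010000"  # reference segmentation of 10 sentences (toy data)
hyp = "0100000100"  # hypothesized segmentation of the same sentences

print("Pk:        ", pk(ref, hyp))  # k defaults to half the mean segment length
print("WindowDiff:", windowdiff(ref, hyp, k=3))
```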



Acknowledgements

The second author gratefully acknowledges the support of an NSERC Discovery Grant.

Author information

Corresponding author

Correspondence to Amit Maraj.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Maraj, A., Martin, M.V., Makrehchi, M. (2021). A More Effective Sentence-Wise Text Segmentation Approach Using BERT. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol. 12824. Springer, Cham. https://doi.org/10.1007/978-3-030-86337-1_16


  • DOI: https://doi.org/10.1007/978-3-030-86337-1_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86336-4

  • Online ISBN: 978-3-030-86337-1

  • eBook Packages: Computer Science, Computer Science (R0)
