Abstract
Topic-aware text segmentation (TATS) divides a text into cohesive segments and assigns a topic label to each segment. The TATS of documents has become increasingly important for business researchers seeking comprehensive insight into enterprise behavior. However, current models either fail to balance accuracy and generalization or cannot handle nested topics, limiting their usefulness in practice. This paper proposes STTS, a novel span-based approach for topic-aware text segmentation, which consists of two components: a sliding window encoder and a span-based NER module. First, we use the sliding window encoder to transform the input document into text spans, which are embedded with pre-trained language models. Second, we obtain coherent segments and assign a topic label to each segment with the span-based NER method Global Pointer. Experiments on four real-world business datasets demonstrate that STTS achieves state-of-the-art performance on both flat and nested TATS tasks. Consequently, our model provides an effective solution to TATS tasks with lengthy texts and nested topics, making it well suited to large-scale text processing in practice.
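The two stages described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names are hypothetical, the random score matrix stands in for the pre-trained encoder plus Global Pointer scoring head, and the greedy decoder is one plausible way to turn span scores into non-overlapping topic-labeled segments.

```python
# Sketch of the STTS pipeline: (1) a sliding window turns a long document
# into overlapping spans of sentences; (2) a Global Pointer-style head
# assigns a score to every (start, end) span for each topic label, and
# positively scored, non-overlapping spans become the final segments.
import numpy as np

def sliding_windows(sentences, window=4, stride=2):
    """Yield (start, end) sentence-index ranges covering the document."""
    spans, i = [], 0
    while i < len(sentences):
        spans.append((i, min(i + window, len(sentences))))
        if i + window >= len(sentences):
            break
        i += stride
    return spans

def global_pointer_scores(n_sents, n_topics, rng):
    """Stand-in for the span-scoring head: one score per (topic, start, end).

    In the real model these scores come from span embeddings produced by a
    pre-trained language model; here random values illustrate the shape.
    """
    scores = rng.normal(size=(n_topics, n_sents, n_sents))
    # Only upper-triangular entries (start <= end) denote valid spans.
    valid = np.triu(np.ones((n_sents, n_sents), dtype=bool))
    scores[:, ~valid] = -np.inf
    return scores

def decode_segments(scores):
    """Greedily keep the highest-scoring positive spans that don't overlap."""
    n_topics, n, _ = scores.shape
    candidates = [(scores[t, s, e], t, s, e)
                  for t in range(n_topics)
                  for s in range(n)
                  for e in range(s, n)
                  if scores[t, s, e] > 0]
    candidates.sort(reverse=True)
    taken, segments = set(), []
    for _, t, s, e in candidates:
        if all(i not in taken for i in range(s, e + 1)):
            segments.append((s, e, t))     # (first sent, last sent, topic id)
            taken.update(range(s, e + 1))
    return sorted(segments)

rng = np.random.default_rng(0)
print(sliding_windows(["s%d" % i for i in range(5)], window=4, stride=2))
print(decode_segments(global_pointer_scores(6, 3, rng)))
```

Because each span carries its own topic score, nested topics fall out naturally: a decoder that permits containment (rather than the strict non-overlap used above) can keep both an outer span and an inner span with different labels.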
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Cai, Y., Zhang, Y., Yang, Z. (2024). STTS: A Novel Span-Based Approach for Topic-Aware Text Segmentation. In: Huang, DS., Premaratne, P., Yuan, C. (eds) Applied Intelligence. ICAI 2023. Communications in Computer and Information Science, vol 2015. Springer, Singapore. https://doi.org/10.1007/978-981-97-0827-7_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0826-0
Online ISBN: 978-981-97-0827-7