
STTS: A Novel Span-Based Approach for Topic-Aware Text Segmentation

  • Conference paper
Applied Intelligence (ICAI 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2015)

Abstract

Topic-aware text segmentation (TATS) divides a text into cohesive segments and assigns a topic label to each segment. TATS has become increasingly important for business researchers who need comprehensive insights into the behavior of enterprises. However, current models either cannot balance accuracy and generalization or cannot handle nested topics, which limits their usefulness in practical applications. This paper proposes STTS, a novel Span-based approach for Topic-aware Text Segmentation, which consists of two components: a sliding window encoder and a span-based NER module. First, the sliding window encoder transforms the input document into text spans, which are encoded into embeddings with a pre-trained language model. Second, we obtain coherent segments and assign a topic label to each of them using the span-based NER method Global Pointer. Experiments on four real-world business datasets demonstrate that STTS achieves state-of-the-art performance on both flat and nested TATS tasks. Our model therefore offers an effective solution for TATS over lengthy texts with nested topics, making it well suited to large-scale text processing in practice.
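
The abstract describes a two-stage pipeline: a sliding window encoder turns a long document into encoded text spans, and a Global Pointer-style span head scores candidate segments per topic label. The sketch below illustrates that kind of pipeline with PyTorch and Hugging Face Transformers; the window sizes, projection head, checkpoint name, and all identifiers are illustrative assumptions, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of a sliding-window + span-scoring pipeline
# in the spirit of the abstract. All hyperparameters and the checkpoint
# name are placeholders, not the paper's actual configuration.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


def sliding_windows(sentences, window_size=8, stride=4):
    """Group sentences into overlapping windows so long documents fit the encoder."""
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(sentences[start:start + window_size])
        if start + window_size >= len(sentences):
            break
    return windows


class SpanTopicScorer(nn.Module):
    """Simplified Global Pointer-style head: score every span (i, j) for each topic."""

    def __init__(self, hidden_size, num_topics, head_size=64):
        super().__init__()
        self.num_topics = num_topics
        self.head_size = head_size
        # One query/key projection pair per topic label.
        self.qk_proj = nn.Linear(hidden_size, num_topics * 2 * head_size)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden)
        b, n, _ = token_embeddings.shape
        qk = self.qk_proj(token_embeddings)                    # (b, n, T*2*h)
        qk = qk.view(b, n, self.num_topics, 2, self.head_size)
        q, k = qk[..., 0, :], qk[..., 1, :]                    # each (b, n, T, h)
        # scores[b, t, i, j] = <q_i, k_j> for topic t; span (i, j) is a candidate segment.
        scores = torch.einsum("bith,bjth->btij", q, k) / self.head_size ** 0.5
        # Mask spans with end < start (keep the upper triangle only).
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=scores.device))
        return scores.masked_fill(~mask, float("-inf"))


# Usage: encode one window of sentences and score candidate topic segments.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")
scorer = SpanTopicScorer(encoder.config.hidden_size, num_topics=5)

sentences = ["Sentence one.", "Sentence two.", "Sentence three."]
window_text = " ".join(sliding_windows(sentences)[0])
inputs = tokenizer(window_text, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state
    span_scores = scorer(hidden)   # (batch, num_topics, seq_len, seq_len)
```

Spans with positive scores in a topic's score matrix would then be decoded into (start, end, topic) segments. Because overlapping spans can both score positively, a span-scoring head of this kind naturally accommodates nested topics, which is consistent with the nested TATS setting the abstract describes.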

Author information

Corresponding author

Correspondence to Zhouwang Yang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Cai, Y., Zhang, Y., Yang, Z. (2024). STTS: A Novel Span-Based Approach for Topic-Aware Text Segmentation. In: Huang, DS., Premaratne, P., Yuan, C. (eds) Applied Intelligence. ICAI 2023. Communications in Computer and Information Science, vol 2015. Springer, Singapore. https://doi.org/10.1007/978-981-97-0827-7_26

Download citation

  • DOI: https://doi.org/10.1007/978-981-97-0827-7_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0826-0

  • Online ISBN: 978-981-97-0827-7

  • eBook Packages: Computer Science, Computer Science (R0)
