
STTS: A Novel Span-Based Approach for Topic-Aware Text Segmentation

  • Conference paper
Applied Intelligence (ICAI 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2015)

Abstract

Topic-aware text segmentation (TATS) divides a text into cohesive segments and assigns a topic label to each segment. TATS has become increasingly important for business researchers who need comprehensive insights into the behavior of enterprises. However, current models either cannot balance accuracy and generalization or cannot handle nested topics, which limits their usefulness in practical applications. This paper proposes STTS, a novel Span-based approach for Topic-aware Text Segmentation, which consists of two components: a sliding window encoder and a span-based NER module. First, the sliding window encoder transforms the input document into text spans, which are encoded into embeddings with a pre-trained language model. Second, we obtain coherent segments and assign a topic label to each of them using the span-based NER method Global Pointer. Experiments on four real-world business datasets demonstrate that STTS achieves state-of-the-art performance on both flat and nested TATS tasks. Our model therefore offers an effective solution for TATS over lengthy texts with nested topics, making it well suited to large-scale text processing in practice.
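
The abstract describes a two-stage pipeline: a sliding window encoder turns a long document into encoded text spans, and a Global Pointer-style span head scores candidate segments per topic label. The sketch below illustrates that kind of pipeline with PyTorch and Hugging Face Transformers; the window sizes, projection head, checkpoint name, and all identifiers are illustrative assumptions, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of a sliding-window + span-scoring pipeline
# in the spirit of the abstract. All hyperparameters and the checkpoint
# name are placeholders, not the paper's actual configuration.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


def sliding_windows(sentences, window_size=8, stride=4):
    """Group sentences into overlapping windows so long documents fit the encoder."""
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(sentences[start:start + window_size])
        if start + window_size >= len(sentences):
            break
    return windows


class SpanTopicScorer(nn.Module):
    """Simplified Global Pointer-style head: score every span (i, j) for each topic."""

    def __init__(self, hidden_size, num_topics, head_size=64):
        super().__init__()
        self.num_topics = num_topics
        self.head_size = head_size
        # One query/key projection pair per topic label.
        self.qk_proj = nn.Linear(hidden_size, num_topics * 2 * head_size)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden)
        b, n, _ = token_embeddings.shape
        qk = self.qk_proj(token_embeddings)                    # (b, n, T*2*h)
        qk = qk.view(b, n, self.num_topics, 2, self.head_size)
        q, k = qk[..., 0, :], qk[..., 1, :]                    # each (b, n, T, h)
        # scores[b, t, i, j] = <q_i, k_j> for topic t; span (i, j) is a candidate segment.
        scores = torch.einsum("bith,bjth->btij", q, k) / self.head_size ** 0.5
        # Mask spans with end < start (keep the upper triangle only).
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=scores.device))
        return scores.masked_fill(~mask, float("-inf"))


# Usage: encode one window of sentences and score candidate topic segments.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")
scorer = SpanTopicScorer(encoder.config.hidden_size, num_topics=5)

sentences = ["Sentence one.", "Sentence two.", "Sentence three."]
window_text = " ".join(sliding_windows(sentences)[0])
inputs = tokenizer(window_text, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state
    span_scores = scorer(hidden)   # (batch, num_topics, seq_len, seq_len)
```

Spans with positive scores in a topic's score matrix would then be decoded into (start, end, topic) segments. Because overlapping spans can both score positively, a span-scoring head of this kind naturally accommodates nested topics, which is consistent with the nested TATS setting the abstract describes.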

Author information

Corresponding author

Correspondence to Zhouwang Yang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Cai, Y., Zhang, Y., Yang, Z. (2024). STTS: A Novel Span-Based Approach for Topic-Aware Text Segmentation. In: Huang, DS., Premaratne, P., Yuan, C. (eds) Applied Intelligence. ICAI 2023. Communications in Computer and Information Science, vol 2015. Springer, Singapore. https://doi.org/10.1007/978-981-97-0827-7_26

Download citation

  • DOI: https://doi.org/10.1007/978-981-97-0827-7_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0826-0

  • Online ISBN: 978-981-97-0827-7

  • eBook Packages: Computer Science, Computer Science (R0)
