An Empirical Analysis of Text Segmentation for BERT Classification in Extended Documents | IEEE Conference Publication | IEEE Xplore

Abstract:

In natural language processing and text analysis, Bidirectional Encoder Representations from Transformers (BERT) has emerged as a powerful tool for capturing the nuances of textual data. However, BERT’s inherent limit of 512 tokens poses a challenge for very long documents. Such documents are common in legal document review and often exceed the 512-token constraint. In response, this study empirically compares two ways of applying BERT, using real-world legal data from the construction industry: applying BERT to the entire document, and applying it to segmented text portions of each document. In the latter approach, the highest-scoring text segment represents the document’s score. This research offers practical insights for using BERT effectively when document length exceeds the token limit. Our results allow practitioners and researchers to make informed choices when working with long documents, contributing to a more effective application of BERT for text analysis.
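
The segment-level approach described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how segments are formed, so consecutive non-overlapping windows are an assumption here, and `split_into_segments`, `document_score`, and `toy_score` are hypothetical names. `toy_score` is a stand-in for a fine-tuned BERT classifier that returns one score per segment.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 512  # BERT's input limit, as stated in the abstract


def split_into_segments(tokens: Sequence[str],
                        max_tokens: int = MAX_TOKENS) -> List[List[str]]:
    """Cut a token sequence into consecutive, non-overlapping segments.

    Assumption: fixed-size windows without overlap. A real pipeline would
    also reserve room for BERT's special tokens ([CLS], [SEP]).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [list(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def document_score(tokens: Sequence[str],
                   score_segment: Callable[[List[str]], float],
                   max_tokens: int = MAX_TOKENS) -> float:
    """Score every segment and let the highest-scoring one represent the document."""
    segments = split_into_segments(tokens, max_tokens)
    if not segments:
        raise ValueError("document has no tokens")
    return max(score_segment(segment) for segment in segments)


def toy_score(segment: List[str]) -> float:
    """Hypothetical stand-in for a BERT classifier: fraction of keyword tokens."""
    keywords = {"delay", "claim"}
    return sum(token in keywords for token in segment) / len(segment)


if __name__ == "__main__":
    doc = ["the"] * 600 + ["delay", "claim"]
    print(len(split_into_segments(doc)), document_score(doc, toy_score))
```

Taking the maximum means a single strongly relevant passage is enough to flag an otherwise unremarkable long document, which is the behaviour the abstract describes for the segmented approach.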
Date of Conference: 15-18 December 2023
Date Added to IEEE Xplore: 22 January 2024
Conference Location: Sorrento, Italy
