DOI: 10.1145/2063576.2063636
research-article

Legal document clustering with built-in topic segmentation

Published: 24 October 2011

Abstract

Clustering is a useful tool for helping users navigate, summarize, and organize large quantities of textual documents available on the Internet, in news sources, and in digital libraries. A variety of clustering methods have also been applied to the legal domain, with varying degrees of success. Some unique characteristics of legal content, as well as the nature of the legal domain, present a number of challenges. For example, legal documents are often multi-topical, contain carefully crafted, professional, domain-specific language, and cover a broad and unevenly distributed range of legal issues. Moreover, unlike widely accessible documents on the Internet, where search and categorization services are generally free, the legal profession is still largely a fee-for-service field, which makes quality (e.g., in terms of both recall and precision) a key differentiator of the services provided. This paper introduces a classification-based recursive soft clustering algorithm with built-in topic segmentation. The algorithm integrates existing legal document metadata, such as topical classifications, document citations, and click stream data from user behavior databases, into a comprehensive clustering framework. Techniques associated with the algorithm have been applied successfully to very large databases of legal documents, which include judicial opinions, statutes, regulations, administrative materials, and analytical documents. Extensive evaluations were conducted to determine the efficiency and effectiveness of the proposed algorithm. Subsequent evaluations conducted by legal domain experts have demonstrated that the quality of the clusters produced by this algorithm is similar to that of clusters created by domain experts.
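To make the idea of soft clustering over topic segments concrete, the following is a minimal illustrative sketch, not the algorithm described in the paper. It assumes a document has already been split into segments, and it uses hypothetical keyword profiles (`topic_profiles`), bag-of-words cosine similarity, and an arbitrary `threshold`, none of which come from the abstract. A document receives every topic label that matches at least one of its segments, so a multi-topical document can belong to several clusters at once.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def soft_assign(segments, topic_profiles, threshold=0.3):
    """Return every topic label whose profile is similar to at least one segment.

    The threshold is an assumed, illustrative value.
    """
    labels = set()
    for segment in segments:
        seg_vec = vectorize(segment)
        for label, profile in topic_profiles.items():
            if cosine(seg_vec, vectorize(profile)) >= threshold:
                labels.add(label)
    return labels


# Hypothetical topic profiles; a real system would derive these from
# classifiers or metadata rather than hand-written keyword lists.
topic_profiles = {
    "contract": "contract breach damages agreement",
    "evidence": "evidence hearsay testimony witness",
    "tax": "tax income deduction revenue",
}

# A two-topic document, already split into segments (assumed input).
segments = [
    "the defendant committed a breach of the contract and owes damages",
    "the witness testimony was excluded as hearsay evidence",
]

labels = soft_assign(segments, topic_profiles)
print(sorted(labels))
```

Scoring segments rather than whole documents keeps a short passage on a secondary topic from being drowned out by the dominant topic, which is one plausible motivation for combining segmentation with soft assignment.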


Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. topic segmentation
  3. unsupervised learning

Qualifiers

  • Research-article

Conference

CIKM '11
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
