
Legal document clustering with built-in topic segmentation

Authors:

Jack G. Conrad,

Khalid Al-Kofahi,

William Keenan

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

Pages 383 - 392

https://doi.org/10.1145/2063576.2063636

Published: 24 October 2011

Abstract

Clustering is a useful tool for helping users navigate, summarize, and organize the large quantities of textual documents available on the Internet, in news sources, and in digital libraries. A variety of clustering methods have also been applied to the legal domain, with varying degrees of success. Some unique characteristics of legal content, as well as the nature of the legal domain itself, present a number of challenges. For example, legal documents are often multi-topical, contain carefully crafted, professional, domain-specific language, and possess a broad and unevenly distributed coverage of legal issues. Moreover, unlike widely accessible documents on the Internet, where search and categorization services are generally free, the legal profession is still largely a fee-for-service field, which makes quality (e.g., in terms of both recall and precision) a key differentiator of the services provided. This paper introduces a classification-based recursive soft clustering algorithm with built-in topic segmentation. The algorithm integrates existing legal document metadata, such as topical classifications, document citations, and click stream data from user behavior databases, into a comprehensive clustering framework. Techniques associated with the algorithm have been applied successfully to very large databases of legal documents, which include judicial opinions, statutes, regulations, administrative materials, and analytical documents. Extensive evaluations were conducted to determine the efficiency and effectiveness of the proposed algorithm. Subsequent evaluations conducted by legal domain experts have demonstrated that the quality of the clusters produced by this algorithm is similar to that of clusters created by domain experts.

Index Terms

Legal document clustering with built-in topic segmentation
1. Information systems

Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

October 2011

2712 pages

ISBN:9781450307178

DOI:10.1145/2063576

Editors:
Bettina Berendt,
Arjen de Vries,
Wenfei Fan,
Craig Macdonald (University of Glasgow, UK),
Iadh Ounis (University of Glasgow, UK),
Ian Ruthven (University of Strathclyde, UK)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM '11

CIKM '11: International Conference on Information and Knowledge Management

October 24 - 28, 2011

Glasgow, Scotland, UK

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

Sargeant H, Izzidien A, Steffek F (2025). Topic classification of case law using a large language model and a new taxonomy for UK law: AI insights into summary judgment. Artificial Intelligence and Law. Online publication date: 25-Feb-2025. https://doi.org/10.1007/s10506-025-09434-0
Hammami E, Faiz R (2024). European Union’s Legislative Proposals Clustering Based on Multiple Hidden Layers Representation. Digital Business and Intelligent Systems, pages 106-119. Online publication date: 23-Jun-2024. https://doi.org/10.1007/978-3-031-63543-4_8
Zadgaonkar A, Agrawal A (2023). An Approach for Analyzing Unstructured Text Data Using Topic Modeling Techniques for Efficient Information Extraction. New Generation Computing, 42:1, pages 109-134. Online publication date: 27-Aug-2023. https://doi.org/10.1007/s00354-023-00230-5
Hananto V, Serdült U, Kryssanov V (2022). A Text Segmentation Approach for Automated Annotation of Online Customer Reviews, Based on Topic Modeling. Applied Sciences, 12:7 (3412). Online publication date: 27-Mar-2022. https://doi.org/10.3390/app12073412
Lopez De Luise M, Pascal A, Alvarez C, Vales E (2022). A Mining approach for Automatic Processing of Regulatory Document. 2022 IEEE Biennial Congress of Argentina (ARGENCON), pages 1-8. Online publication date: 7-Sep-2022. https://doi.org/10.1109/ARGENCON55245.2022.9939668
Jung S, Ka S (2022). GAE-Based Document Embedding Method for Clustering. IEEE Access, 10, pages 130089-130096. Online publication date: 2022. https://doi.org/10.1109/ACCESS.2022.3228548
Taran M, Revunkov G, Gapanyuk Y (2022). Creating a Brief Review of Judicial Practice Using Clustering Methods. Advances in Neural Computation, Machine Learning, and Cognitive Research VI, pages 466-475. Online publication date: 19-Oct-2022. https://doi.org/10.1007/978-3-031-19032-2_48
Novotná T (2021). Síťová analýza v právu: Síťové metody a jejich využití pro získávání a vyhledávání právních informací. Revue pro právo a technologie, 12:24, pages 39-76. Online publication date: 31-Dec-2021. https://doi.org/10.5817/RPT2021-2-2
Krasadakis P, Sakkopoulos E, Verykios V (2021). A Natural Language Processing Survey on Legislative and Greek Documents. Proceedings of the 25th Pan-Hellenic Conference on Informatics, pages 407-412. Online publication date: 26-Nov-2021. https://dl.acm.org/doi/10.1145/3503823.3503898
Aumiller D, Almasian S, Lackner S, Gertz M, Maranhão J, Wyner A (2021). Structural text segmentation of legal documents. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 2-11. Online publication date: 21-Jun-2021. https://dl.acm.org/doi/10.1145/3462757.3466085
