Clustering and Categorization of Brazilian Portuguese Legal Documents

de Colla Furquim, Luis Otávio; de Lima, Vera Lúcia Strube

doi:10.1007/978-3-642-28885-2_31

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7243)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language

Abstract

This study explores the use of machine learning for case law search in electronic trials. We clustered case law documents to automatically generate classes for a categorizer; these classes are then used when a user uploads new documents to an electronic trial. We selected the TClus algorithm, created by Aggarwal, Gates and Yu, removing its document/group discarding features and adding a cluster division feature. In place of the “bag of words” representation, we introduced a “bag of terms and law references” paradigm, generating attributes by using a law domain thesaurus to detect legal terms and regular expressions to detect law references. We clustered a case law corpus and evaluated the results with the Relative Hardness Measure (RH) and the \(\bar{\rho}\)-Measure (RHO), testing their significance with both Wilcoxon’s Signed-ranks Test and the Count of Wins and Losses Test. The categorization results were evaluated by human specialists. We compared true/false positives against document similarity with the centroid, cluster size, the quantity and type of attributes in the centroids, and cluster cohesion. The article also discusses attribute generation and its implications for the classification results.
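As an illustration of the “bag of terms and law references” idea, the following is a minimal sketch and not the authors' implementation. The thesaurus entries in LEGAL_TERMS, the LAW_REF pattern, and the function names are all hypothetical; the abstract does not give the thesaurus contents, the actual regular expressions, or the similarity function used by the modified TClus, so cosine similarity is assumed here.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical mini-thesaurus; the paper relies on a full law-domain thesaurus.
LEGAL_TERMS = ["habeas corpus", "dano moral", "recurso especial"]

# Hypothetical pattern for references such as "Lei 8.078/90" or "Lei nº 9099/1995".
LAW_REF = re.compile(
    r"\blei\s+(?:n[ºo°]?\.?\s*)?(\d{1,3}(?:\.\d{3})+|\d+)(?:/\d{2,4})?",
    re.IGNORECASE,
)


def extract_attributes(text):
    """Count thesaurus terms and law references instead of raw words."""
    lowered = text.lower()
    counts = Counter()
    for term in LEGAL_TERMS:
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        if hits:
            counts["term:" + term] = hits
    for match in LAW_REF.finditer(lowered):
        # Normalize "8.078" and "8078" to the same attribute.
        counts["law:" + match.group(1).replace(".", "")] += 1
    return counts


def centroid(vectors):
    """Average a non-empty list of attribute vectors into a cluster centroid."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {key: value / len(vectors) for key, value in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse attribute vectors (assumed metric)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc = "O recurso especial invoca a Lei nº 8.078/90 e pede dano moral."
attrs = extract_attributes(doc)
similarity = cosine(attrs, centroid([attrs]))
```

Under these assumptions, a newly uploaded document would be assigned to the cluster whose centroid gives the highest similarity, which is the quantity the abstract compares against true and false positives.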

References

Aggarwal, C.C., Gates, S.C., Yu, P.S.: On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 245–255 (2004)
Chakrabarti, S., Dom, B., Agrawal, R., Raghavan, P.: Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal: The International Journal on Very Large Data Bases 7(3), 163–178 (1998)
Cong, G., Lee, W.S., Wu, H., Liu, B.: Semi-supervised Text Classification Using Partitioned EM. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 482–493. Springer, Heidelberg (2004)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38 (1977)
Elidan, G., Friedman, N.: Learning the dimensionality of hidden variables. In: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, Citeseer, pp. 144–151 (2001)
Elidan, G., Lotner, N., Friedman, N., Koller, D.: Discovering hidden variables: A structure-based approach. Pattern Recognition Letters 21, 779–786 (2000)
Feinerer, I., Hornik, K.: Text mining of supreme administrative court jurisdictions. Data Analysis, Machine Learning and Applications, 569–576 (2008)
Friedman, M., Kandel, A.: Introduction to pattern recognition: Statistical, structural, neural and fuzzy logic approaches (1999)
Friedman, N.: Learning belief networks in the presence of missing values and hidden variables. In: ICML, pp. 125–133 (1997)
Friedman, N.: The Bayesian structural EM algorithm. In: Proc. UAI, Citeseer, vol. 98 (1998)
Gonzalez, M.: Termos e relacionamentos em evidência na recuperação de informação. Ph.D. thesis, Universidade Federal do Rio Grande do Sul, Porto Alegre, RS (2005)
Gonçalves, T., Quaresma, P.: A Preliminary Approach to the Multilabel Classification Problem of Portuguese Juridical Documents. In: Pires, F.M., Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 435–444. Springer, Heidelberg (2003)
Gonçalves, T., Quaresma, P.: Is linguistic information relevant for the classification of legal texts? In: ICAIL 2005: Proceedings of the 10th International Conference on Artificial Intelligence and Law, pp. 168–176. ACM, New York (2005)
Hao, P.Y., Chiang, J.H., Tu, Y.K.: Hierarchically SVM classification based on support vector clustering method and its application to document categorization. Expert Systems with Applications 33(3), 627–635 (2007)
Ingaramo, D., Pinto, D., Rosso, P., Errecalde, M.: Evaluation of internal validity measures in short-text corpora. Computational Linguistics and Intelligent Text Processing, 555–567 (2008)
Jaegger, F., et al.: Vocabulário controlado básico. Serviço de Gerência da Rede Virtual de Bibliotecas, Congresso Nacional, RVBI. Brasília, DF (junho 2007)
Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of Finnish text documents. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 625–633. ACM (2004)
Li, B., Chi, M., Fan, J., Xue, X.: Support cluster machine. In: Proceedings of the 24th International Conference on Machine Learning, pp. 505–512. ACM (2007)
Maarek, Y., Fagin, R., Ben-Shaul, I., Pelleg, D.: Ephemeral document clustering for web applications. Tech. Rep. RJ 10186, IBM Research (2000)
Muniz, M., Nunes, M.: A Construção de Recursos Linguístico-computacionais para o Português do Brasil: o Projeto de Unitex-PB. Master’s thesis, Universidade de São Paulo. Instituto de Ciências Matemáticas e de Computação, São Carlos, SP (2004)
Pinto, D., Rosso, P.: On the Relative Hardness of Clustering Corpora. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 155–161. Springer, Heidelberg (2007)
Porath, E.B., Gilboa, I.: Linear measures, the Gini index, and the income-equality trade-off. Journal of Economic Theory 64, 443–467 (1994)
Raskutti, B., Ferrá, H., Kowalczyk, A.: Combining clustering and co-training to enhance text classification using unlabelled data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 620–625. ACM (2002)
Sordi, N., et al.: Tesauro jurídico da justiça federal. Conselho da Justiça Federal (Fev 2007)
Stein, B., zu Eissen, S.M., Potthast, M.: Syntax versus semantics. In: 3rd International Workshop on Text-Based Information Retrieval (TIR 2006), Citeseer, p. 47 (2006)
Toutanova, K., Chen, F., Popat, K., Hofmann, T.: Text classification in a hierarchical mixture model for small training sets. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 105–113. ACM (2001)
Zeng, H.J., Wang, X.H., Chen, Z., Lu, H., Ma, W.Y.: CBC: Clustering based text classification requiring minimal labeled data. In: Third IEEE International Conference on Data Mining, ICDM 2003, pp. 443–450. IEEE (2003)

Author information

Authors and Affiliations

Pontifícia Universidade Católica do Rio Grande do Sul, Av Ipiranga 6681, Porto Alegre, Brazil
Luis Otávio de Colla Furquim & Vera Lúcia Strube de Lima

Editor information

Editors and Affiliations

UFSCAR, Rod. Washington Luís, 13565-905, São Carlos, Brazil
Helena Caseli
UFRGS, Av. Bento Gonçalves, 9500, 91501-970, Porto Alegre, Brazil
Aline Villavicencio
DETI/IEETA, Universidade de Aveiro, Campus Universitário de Santiago, 3810-193, Aveiro, Portugal
António Teixeira
UC/ IT, DEEC, Universidade de Coimbra, Polo 2, 3030-290, Coimbra, Portugal
Fernando Perdigão

Cite this paper

de Colla Furquim, L.O., de Lima, V.L.S. (2012). Clustering and Categorization of Brazilian Portuguese Legal Documents. In: Caseli, H., Villavicencio, A., Teixeira, A., Perdigão, F. (eds) Computational Processing of the Portuguese Language. PROPOR 2012. Lecture Notes in Computer Science, vol 7243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28885-2_31

DOI: https://doi.org/10.1007/978-3-642-28885-2_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28884-5
Online ISBN: 978-3-642-28885-2
eBook Packages: Computer Science, Computer Science (R0)