Abstract
This study explores the use of machine learning for case law search in electronic trials. We clustered case law documents to automatically generate classes for a categorizer; these classes are then applied when a user uploads new documents to an electronic trial. We selected the TClus algorithm, created by Aggarwal, Gates and Yu, removed its document and group discarding features, and added a cluster division feature. Instead of the usual “bag of words”, we introduced a “bag of terms and law references” paradigm, generating attributes with a law domain thesaurus to detect legal terms and with regular expressions to detect law references. We clustered a case law corpus and evaluated the results with the Relative Hardness Measure (RH) and the \(\bar{\rho}\)-Measure (RHO). The significance of the results was tested with both Wilcoxon’s Signed-ranks Test and the Count of Wins and Losses Test. The categorization results were evaluated by human specialists. We compared true/false positives against document similarity with the centroid, cluster size, quantity and type of the attributes in the centroids, and cluster cohesion. The article also discusses attribute generation and its implications for the classification results.
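The “bag of terms and law references” idea can be illustrated with a minimal sketch. It is not the authors' implementation: the term list `LEGAL_TERMS`, the pattern `LAW_REF`, the `term:`/`law:` attribute prefixes, and the use of cosine similarity against a mean-vector centroid are all assumptions made for illustration. The paper relies on a full law domain thesaurus and its own regular expressions, which the abstract does not reproduce.

```python
import math
import re
from collections import Counter

# Hypothetical mini-thesaurus; the paper uses a full law domain thesaurus.
LEGAL_TERMS = ["habeas corpus", "prisão preventiva", "recurso especial"]

# Hypothetical pattern for references such as "Lei 8.078/90" or "Lei nº 9.099/1995".
LAW_REF = re.compile(
    r"\blei\s+(?:n[ºo°]?\.?\s*)?(\d{1,3}(?:\.?\d{3})?)/(\d{2,4})",
    re.IGNORECASE,
)


def extract_attributes(text):
    """Count thesaurus terms and law references instead of raw words."""
    lowered = text.lower()
    counts = Counter()
    for term in LEGAL_TERMS:
        hits = re.findall(r"\b" + re.escape(term) + r"\b", lowered)
        if hits:
            counts["term:" + term] = len(hits)
    for number, year in LAW_REF.findall(lowered):
        # Normalize "9.099" and "9099" to the same attribute.
        counts["law:" + number.replace(".", "") + "/" + year] += 1
    return counts


def centroid(vectors):
    """Mean attribute vector of a cluster (assumed centroid definition)."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {key: value / len(vectors) for key, value in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse attribute vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    docs = [
        "Habeas corpus concedido. Prisão preventiva revogada, Lei nº 9.099/1995.",
        "Recurso especial provido com base na Lei 8.078/90.",
    ]
    vectors = [extract_attributes(doc) for doc in docs]
    center = centroid(vectors)
    for doc, vector in zip(docs, vectors):
        print(round(cosine(vector, center), 3), doc)
```

Prefixing attributes with their type keeps legal terms and law references distinguishable in a centroid, which is the kind of information the abstract says was compared against true/false positives.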
References
Aggarwal, C.C., Gates, S.C., Yu, P.S.: On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 245–255 (2004)
Chakrabarti, S., Dom, B., Agrawal, R., Raghavan, P.: Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal: The International Journal on Very Large Data Bases 7(3), 163–178 (1998)
Cong, G., Lee, W.S., Wu, H., Liu, B.: Semi-supervised Text Classification Using Partitioned EM. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 482–493. Springer, Heidelberg (2004)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38 (1977)
Elidan, G., Friedman, N.: Learning the dimensionality of hidden variables. In: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, Citeseer, pp. 144–151 (2001)
Elidan, G., Lotner, N., Friedman, N., Koller, D.: Discovering hidden variables: A structure-based approach. Pattern Recognition Letters 21, 779–786 (2000)
Feinerer, I., Hornik, K.: Text mining of supreme administrative court jurisdictions. Data Analysis, Machine Learning and Applications, 569–576 (2008)
Friedman, M., Kandel, A.: Introduction to pattern recognition: Statistical, structural, neural and fuzzy logic approaches (1999)
Friedman, N.: Learning belief networks in the presence of missing values and hidden variables. In: ICML, pp. 125–133 (1997)
Friedman, N.: The Bayesian structural EM algorithm. In: Proc. UAI, Citeseer, vol. 98 (1998)
Gonzalez, M.: Termos e relacionamentos em evidência na recuperação de informação. Ph.D. thesis, Universidade Federal do Rio Grande do Sul, Porto Alegre, RS (2005)
Gonçalves, T., Quaresma, P.: A Preliminary Approach to the Multilabel Classification Problem of Portuguese Juridical Documents. In: Pires, F.M., Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 435–444. Springer, Heidelberg (2003)
Gonçalves, T., Quaresma, P.: Is linguistic information relevant for the classification of legal texts? In: ICAIL 2005: Proceedings of the 10th International Conference on Artificial Intelligence and Law, pp. 168–176. ACM, New York (2005)
Hao, P.Y., Chiang, J.H., Tu, Y.K.: Hierarchically SVM classification based on support vector clustering method and its application to document categorization. Expert Systems with Applications 33(3), 627–635 (2007)
Ingaramo, D., Pinto, D., Rosso, P., Errecalde, M.: Evaluation of internal validity measures in short-text corpora. Computational Linguistics and Intelligent Text Processing, 555–567 (2008)
Jaegger, F., et al.: Vocabulário controlado básico. Serviço de Gerência da Rede Virtual de Bibliotecas, Congresso Nacional, RVBI. Brasília, DF (junho 2007)
Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of Finnish text documents. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 625–633. ACM (2004)
Li, B., Chi, M., Fan, J., Xue, X.: Support cluster machine. In: Proceedings of the 24th International Conference on Machine Learning, pp. 505–512. ACM (2007)
Maarek, Y., Fagin, R., Ben-Shaul, I., Pelleg, D.: Ephemeral document clustering for web applications. Tech. Rep. RJ 10186, IBM Research (2000)
Muniz, M., Nunes, M.: A Construção de Recursos Linguístico-computacionais para o Português do Brasil: o Projeto de Unitex-PB. Master’s thesis, Universidade de São Paulo. Instituto de Ciências Matemáticas e de Computação, São Carlos, SP (2004)
Pinto, D., Rosso, P.: On the Relative Hardness of Clustering Corpora. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 155–161. Springer, Heidelberg (2007)
Porath, E.B., Gilboa, I.: Linear measures, the gini index, and the income-equality trade-off. Journal of Economic Theory 64, 443–467 (1994)
Raskutti, B., Ferrá, H., Kowalczyk, A.: Combining clustering and co-training to enhance text classification using unlabelled data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 620–625. ACM (2002)
Sordi, N., et al.: Tesauro jurídico da justiça federal. Conselho da Justiça Federal (Fev 2007)
Stein, B., zu Eissen, S.M., Potthast, M.: Syntax versus semantics. In: 3rd International Workshop on Text-Based Information Retrieval (TIR 2006), Citeseer, p. 47 (2006)
Toutanova, K., Chen, F., Popat, K., Hofmann, T.: Text classification in a hierarchical mixture model for small training sets. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 105–113. ACM (2001)
Zeng, H.J., Wang, X.H., Chen, Z., Lu, H., Ma, W.Y.: CBC: Clustering based text classification requiring minimal labeled data. In: Third IEEE International Conference on Data Mining, ICDM 2003, pp. 443–450. IEEE (2003)
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
de Colla Furquim, L.O., de Lima, V.L.S. (2012). Clustering and Categorization of Brazilian Portuguese Legal Documents. In: Caseli, H., Villavicencio, A., Teixeira, A., Perdigão, F. (eds) Computational Processing of the Portuguese Language. PROPOR 2012. Lecture Notes in Computer Science, vol 7243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28885-2_31
DOI: https://doi.org/10.1007/978-3-642-28885-2_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28884-5
Online ISBN: 978-3-642-28885-2
eBook Packages: Computer Science, Computer Science (R0)