Clustering and Categorization of Brazilian Portuguese Legal Documents

  • Conference paper
Computational Processing of the Portuguese Language (PROPOR 2012)

Abstract

This study explores the use of machine learning to support case law search in electronic trials. We clustered case law documents to automatically generate classes for a categorizer. These classes are then used when a user uploads new documents to an electronic trial. As the clustering algorithm we selected TClus, created by Aggarwal, Gates and Yu, removing its document- and group-discarding features and adding a cluster division feature. Instead of the usual “bag of words”, we introduce a “bag of terms and law references” paradigm: attributes are generated by using a law-domain thesaurus to detect legal terms and regular expressions to detect law references. We clustered a case law corpus and evaluated the results with the Relative Hardness Measure (RH) and the \(\bar{\rho}\)-Measure (RHO). Their significance was tested with both Wilcoxon’s Signed-Ranks Test and the Count of Wins and Losses Test. The categorization results were evaluated by human specialists. We compared true/false positives against document similarity with the centroid, cluster size, quantity and type of the attributes in the centroids, and cluster cohesion. The article also discusses attribute generation and its implications for the classification results.
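
The “bag of terms and law references” idea from the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the term list `LEGAL_TERMS` stands in for a real law-domain thesaurus, the pattern `LAW_REF` is one hypothetical regular expression for references such as "Lei nº 8.078/90", and the function name `extract_attributes` and the `term:`/`law:` key prefixes are chosen here only for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus; a real system would load a law-domain thesaurus.
LEGAL_TERMS = ["habeas corpus", "dano moral", "recurso especial"]

# Hypothetical pattern for references like "Lei 8.078/90" or "Lei nº 9.099/1995".
LAW_REF = re.compile(r"\blei\s+(?:n[ºo°]\.?\s*)?(\d[\d.]*)/(\d{2,4})", re.IGNORECASE)


def extract_attributes(text):
    """Count thesaurus terms and law references found in a document."""
    lowered = text.lower()
    counts = Counter()
    # Legal terms: whole-phrase matches against the thesaurus entries.
    for term in LEGAL_TERMS:
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        if hits:
            counts["term:" + term] = hits
    # Law references: normalize "8.078/90" to "8078/90" so variants collapse.
    for number, year in LAW_REF.findall(lowered):
        counts["law:" + number.replace(".", "") + "/" + year] += 1
    return counts


if __name__ == "__main__":
    sample = ("O recurso especial discute dano moral com base na "
              "Lei nº 8.078/90. A Lei 8.078/90 prevê dano moral.")
    print(extract_attributes(sample))
```

Words that are neither thesaurus terms nor law references are simply ignored, which is the point of replacing the bag of words: the resulting attribute vector is much smaller and contains only domain-relevant features.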

References

  1. Aggarwal, C.C., Gates, S.C., Yu, P.S.: On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 245–255 (2004)

  2. Chakrabarti, S., Dom, B., Agrawal, R., Raghavan, P.: Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal The International Journal on Very Large Data Bases 7(3), 163–178 (1998)

  3. Cong, G., Lee, W.S., Wu, H., Liu, B.: Semi-supervised Text Classification Using Partitioned EM. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 482–493. Springer, Heidelberg (2004)

  4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38 (1977)

  5. Elidan, G., Friedman, N.: Learning the dimensionality of hidden variables. In: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, Citeseer, pp. 144–151 (2001)

  6. Elidan, G., Lotner, N., Friedman, N., Koller, D.: Discovering hidden variables: A structure-based approach. Pattern Recognition Letters 21, 779–786 (2000)

  7. Feinerer, I., Hornik, K.: Text mining of supreme administrative court jurisdictions. Data Analysis, Machine Learning and Applications, 569–576 (2008)

  8. Friedman, M., Kandel, A.: Introduction to pattern recognition: Statistical, structural, neural and fuzzy logic approaches (1999)

  9. Friedman, N.: Learning belief networks in the presence of missing values and hidden variables. In: ICML, pp. 125–133 (1997)

  10. Friedman, N.: The Bayesian structural EM algorithm. In: Proc. UAI, Citeseer, vol. 98 (1998)

  11. Gonzalez, M.: Termos e relacionamentos em evidência na recuperação de informação. Ph.D. thesis, Universidade Federal do Rio Grande do Sul, Porto Alegre, RS (2005)

  12. Gonçalves, T., Quaresma, P.: A Preliminary Approach to the Multilabel Classification Problem of Portuguese Juridical Documents. In: Pires, F.M., Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 435–444. Springer, Heidelberg (2003)

  13. Gonçalves, T., Quaresma, P.: Is linguistic information relevant for the classification of legal texts? In: ICAIL 2005: Proceedings of the 10th International Conference on Artificial Intelligence and Law, pp. 168–176. ACM, New York (2005)

  14. Hao, P.Y., Chiang, J.H., Tu, Y.K.: Hierarchically SVM classification based on support vector clustering method and its application to document categorization. Expert Systems with Applications 33(3), 627–635 (2007)

  15. Ingaramo, D., Pinto, D., Rosso, P., Errecalde, M.: Evaluation of internal validity measures in short-text corpora. Computational Linguistics and Intelligent Text Processing, 555–567 (2008)

  16. Jaegger, F., et al.: Vocabulário controlado básico. Serviço de Gerência da Rede Virtual de Bibliotecas, Congresso Nacional, RVBI. Brasília, DF (junho 2007)

  17. Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of Finnish text documents. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 625–633. ACM (2004)

  18. Li, B., Chi, M., Fan, J., Xue, X.: Support cluster machine. In: Proceedings of the 24th International Conference on Machine Learning, pp. 505–512. ACM (2007)

  19. Maarek, Y., Fagin, R., Ben-Shaul, I., Pelleg, D.: Ephemeral document clustering for web applications. Tech. Rep. RJ 10186, IBM Research (2000)

  20. Muniz, M., Nunes, M.: A Construção de Recursos Linguístico-computacionais para o Português do Brasil: o Projeto de Unitex-PB. Master’s thesis, Universidade de São Paulo. Instituto de Ciências Matemáticas e de Computação, São Carlos, SP (2004)

  21. Pinto, D., Rosso, P.: On the Relative Hardness of Clustering Corpora. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 155–161. Springer, Heidelberg (2007)

  22. Porath, E.B., Gilboa, I.: Linear measures, the Gini index, and the income-equality trade-off. Journal of Economic Theory 64, 443–467 (1994)

  23. Raskutti, B., Ferrá, H., Kowalczyk, A.: Combining clustering and co-training to enhance text classification using unlabelled data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 620–625. ACM (2002)

  24. Sordi, N., et al.: Tesauro jurídico da justiça federal. Conselho da Justiça Federal (Fev 2007)

  25. Stein, B., zu Eissen, S.M., Potthast, M.: Syntax versus semantics. In: 3rd International Workshop on Text-Based Information Retrieval (TIR 2006), Citeseer, p. 47 (2006)

  26. Toutanova, K., Chen, F., Popat, K., Hofmann, T.: Text classification in a hierarchical mixture model for small training sets. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 105–113. ACM (2001)

  27. Zeng, H.J., Wang, X.H., Chen, Z., Lu, H., Ma, W.Y.: CBC: Clustering based text classification requiring minimal labeled data. In: Third IEEE International Conference on Data Mining, ICDM 2003, pp. 443–450. IEEE (2003)

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

de Colla Furquim, L.O., de Lima, V.L.S. (2012). Clustering and Categorization of Brazilian Portuguese Legal Documents. In: Caseli, H., Villavicencio, A., Teixeira, A., Perdigão, F. (eds) Computational Processing of the Portuguese Language. PROPOR 2012. Lecture Notes in Computer Science, vol 7243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28885-2_31

  • DOI: https://doi.org/10.1007/978-3-642-28885-2_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28884-5

  • Online ISBN: 978-3-642-28885-2

  • eBook Packages: Computer Science, Computer Science (R0)
