Abstract
Compiling corpora of Languages for Specific Purposes (LSP) is costly in time and human effort, largely because it is not easy to distinguish specialized from non-specialized text. The aim of this work is to study the automatic differentiation of specialized and non-specialized sentences. The experiments are carried out on two corpora of sentences extracted from specialized and non-specialized texts: one on economics (academic publications and newspaper articles) and another on sexuality (academic publications and texts from forums and blogs). First, we show the feasibility of the task using a statistical n-gram classifier. Then we show that grammatical features can also be used to classify sentences from the first corpus; for this purpose we use association rule mining.
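To make the idea of a statistical n-gram classifier concrete, the following is a minimal sketch and not the system used in the paper, whose details the abstract does not give. It assumes a multinomial naive Bayes model over word n-grams with add-one smoothing. The class name `NgramClassifier`, the labels `"specialized"` and `"general"`, and the four training sentences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(sentence, n=2):
    """Return the word n-grams of a lowercased sentence, with boundary markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NgramClassifier:
    """Hypothetical naive Bayes classifier over word n-grams (add-one smoothing)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = {}        # label -> Counter of n-grams seen with that label
        self.totals = {}        # label -> total number of n-grams for that label
        self.docs = Counter()   # label -> number of training sentences
        self.vocab = set()      # all distinct n-grams seen in training

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            grams = ngrams(sentence, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, sentence):
        grams = ngrams(sentence, self.n)
        n_docs = sum(self.docs.values())
        v = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / n_docs)
            for g in grams:
                score += math.log((counter[g] + 1) / (self.totals[label] + v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data, not taken from the corpora described in the abstract.
train_sentences = [
    "inflation reduces the real interest rate",
    "monetary policy affects aggregate demand",
    "prices went up again this week",
    "people are worried about their jobs",
]
train_labels = ["specialized", "specialized", "general", "general"]

clf = NgramClassifier(n=2).fit(train_sentences, train_labels)
print(clf.predict("monetary policy reduces inflation"))
print(clf.predict("people are worried about prices"))
```

Word bigrams are used here only as one plausible choice; character n-grams or a different order `n` would fit the same interface, and a real experiment would need far more training data than this toy set.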
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
da Cunha, I., Cabré, M.T., SanJuan, E., Sierra, G., Torres-Moreno, J.M., Vivaldi, J. (2011). Automatic Specialized vs. Non-specialized Sentence Differentiation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19437-5_22
DOI: https://doi.org/10.1007/978-3-642-19437-5_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19436-8
Online ISBN: 978-3-642-19437-5
eBook Packages: Computer Science, Computer Science (R0)