Abstract
Constructing a large-scale corpus is a hard task. This paper presents a novel trustworthiness-based approach that automatically builds a large-scale text corpus at low cost and within a short building period. It addresses two problems: how to build a large-scale text corpus from the Web automatically, and how to correct mistakes in that corpus. Because the Grid provides infrastructure for processing large-scale data, our approach first uses the Grid to collect and process language materials from the Web. It then picks out untrustworthy language materials in the corpus according to their trustworthiness and has users check them manually. After the check finishes, our approach computes the trustworthiness of each checked result and selects the results with the highest trustworthiness as the correct ones.
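The final selection step can be illustrated with a small sketch. The abstract does not say how trustworthiness is computed, so the scoring rule below (reliability-weighted voting over the users' checked results) is an assumption made only for illustration, not the authors' method. The function names, the per-user `reliability` weights, and the 0.8 threshold are all hypothetical.

```python
from collections import defaultdict


def select_trusted_label(checks, reliability, default_reliability=0.5):
    """Pick the checked result with the highest aggregated trustworthiness.

    checks: list of (user_id, label) pairs for one language material.
    reliability: dict mapping user_id to a weight in [0, 1].
    Both the weights and the voting rule are assumptions, not from the paper.
    """
    if not checks:
        raise ValueError("no checked results to choose from")

    # Sum the reliability of every user who proposed each label.
    totals = defaultdict(float)
    for user_id, label in checks:
        totals[label] += reliability.get(user_id, default_reliability)

    weight_sum = sum(totals.values())
    if weight_sum == 0:
        raise ValueError("all reliability weights are zero")

    # Normalise so the scores of the candidate labels sum to 1.
    scores = {label: weight / weight_sum for label, weight in totals.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def needs_manual_check(auto_score, threshold=0.8):
    """Flag a material as untrustworthy when its score is below a threshold."""
    return auto_score < threshold


# Three hypothetical users check the category of one document.
checks = [("u1", "sports"), ("u2", "finance"), ("u3", "sports")]
reliability = {"u1": 0.9, "u2": 0.6, "u3": 0.3}
label, score = select_trusted_label(checks, reliability)
```

Here `needs_manual_check` stands in for the earlier step that routes low-trust materials to users, and `select_trusted_label` for the step that chooses among their answers. Any real system would substitute the paper's own trustworthiness measure for the weighted vote.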
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, P., Zhu, Q., Qian, P., Fox, G.C. (2007). Constructing a Large Scale Text Corpus Based on the Grid and Trustworthiness. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2007. Lecture Notes in Computer Science, vol 4629. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74628-7_10
DOI: https://doi.org/10.1007/978-3-540-74628-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74627-0
Online ISBN: 978-3-540-74628-7
eBook Packages: Computer Science, Computer Science (R0)