Constructing a Large Scale Text Corpus Based on the Grid and Trustworthiness

Li, Peifeng; Zhu, Qiaoming; Qian, Peide; Fox, Geoffrey C.

doi:10.1007/978-3-540-74628-7_10

Peifeng Li¹,², Qiaoming Zhu¹, Peide Qian¹ & Geoffrey C. Fox²

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4629)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Constructing a large-scale corpus is a hard task. We present a novel trustworthiness-based approach that automatically builds a large-scale text corpus at low cost and within a short building period. It addresses two main problems: how to automatically build a large-scale text corpus from the Web, and how to correct mistakes in that corpus. Because the Grid provides infrastructure for processing large-scale data, our approach uses the Grid in the first stage to collect and process language materials from the Web. It then picks out untrustworthy language materials in the corpus according to their trustworthiness and has users check them manually. Once checking finishes, our approach computes the trustworthiness of each checked result and selects the results with the highest trustworthiness as the correct ones.
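The abstract does not state how trustworthiness is computed, so the following is only a minimal sketch of the checking stage under loudly stated assumptions. It assumes each language material already carries a trustworthiness score in [0, 1], that materials below a hypothetical threshold are sent to users, and that the trustworthiness of a checked result is the reliability-weighted share of the users who proposed it. The function names, the threshold of 0.6, the default reliability of 0.5, and the sample data are all illustrative and are not taken from the paper.

```python
from collections import defaultdict


def flag_untrustworthy(scores, threshold=0.6):
    """Return the ids of materials whose score falls below the threshold.

    `scores` maps a material id to an assumed trustworthiness in [0, 1].
    The threshold value is a hypothetical choice, not the paper's.
    """
    return sorted(item for item, score in scores.items() if score < threshold)


def select_correct_result(checks, reliability, default_reliability=0.5):
    """Pick the checked result with the highest aggregated trustworthiness.

    `checks` is a list of (checker, result) pairs for one material.
    `reliability` maps a checker to an assumed weight in [0, 1].
    Returns (best_result, trust), where trust is the weighted share
    of checkers who proposed best_result (an assumed definition).
    """
    if not checks:
        raise ValueError("at least one checked result is required")
    totals = defaultdict(float)
    for checker, result in checks:
        totals[result] += reliability.get(checker, default_reliability)
    weight_sum = sum(totals.values())
    best = max(totals, key=totals.get)
    trust = totals[best] / weight_sum if weight_sum > 0 else 0.0
    return best, trust


# Illustrative data only: three materials and three hypothetical checkers.
scores = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.55}
reliability = {"alice": 0.9, "bob": 0.6, "carol": 0.7}
doc2_checks = [("alice", "sports"), ("bob", "finance"), ("carol", "sports")]

if __name__ == "__main__":
    print(flag_untrustworthy(scores))
    print(select_correct_result(doc2_checks, reliability))
```

Weighting each check by a per-user reliability is one simple way to let consistent checkers count for more than occasional ones; a plain majority vote is the special case where every weight is equal.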

Author information

Authors and Affiliations

School of Computer Science & Technology, Soochow University, Suzhou, 215006, China
Peifeng Li, Qiaoming Zhu & Peide Qian
Community Grids Lab, Indiana University, Bloomington, IN 47404, USA
Peifeng Li & Geoffrey C. Fox

Editor information

Václav Matoušek, Pavel Mautner

About this paper

Cite this paper

Li, P., Zhu, Q., Qian, P., Fox, G.C. (2007). Constructing a Large Scale Text Corpus Based on the Grid and Trustworthiness. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2007. Lecture Notes in Computer Science, vol 4629. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74628-7_10

DOI: https://doi.org/10.1007/978-3-540-74628-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74627-0
Online ISBN: 978-3-540-74628-7
eBook Packages: Computer Science, Computer Science (R0)
