
Constructing a Large Scale Text Corpus Based on the Grid and Trustworthiness

  • Conference paper
Text, Speech and Dialogue (TSD 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4629)


Abstract

Constructing a large-scale corpus is a hard task. We present a novel approach, based on trustworthiness, that automatically builds a large-scale text corpus at low cost and in a short building period. It addresses two main problems: how to automatically build a large-scale text corpus from the Web, and how to correct mistakes in the corpus. Because the Grid provides an infrastructure for processing large-scale data, our approach uses the Grid in the first stage to collect and process language materials from the Web. It then picks out untrustworthy language materials in the corpus according to their trustworthiness and has users check them manually. After the check finishes, our approach computes the trustworthiness of each checked result and selects the results with the highest trustworthiness as the correct ones.
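The abstract does not say how trustworthiness is computed, so the following is only a minimal sketch of the two correction steps under stated assumptions: each language material carries a score in [0, 1], materials below a threshold are sent for manual checking, and the trustworthiness of a checked result is the reliability-weighted share of users who submitted it. The function names, the threshold, and the per-user reliability weights are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def flag_untrustworthy(scores, threshold=0.6):
    """Return ids of materials whose score is below the threshold.

    scores: dict mapping material id -> score in [0, 1].
    The threshold value is an assumption for illustration only.
    """
    return [item_id for item_id, score in scores.items() if score < threshold]


def select_result(checks, reliability, default_weight=0.5):
    """Pick the checked result with the highest aggregate trustworthiness.

    checks: list of (user, result) pairs submitted for one material.
    reliability: dict mapping user -> weight in [0, 1] (hypothetical).
    Returns (best_result, share of the total weight backing it).
    """
    if not checks:
        raise ValueError("no checked results to choose from")
    totals = defaultdict(float)
    for user, result in checks:
        # Unknown users fall back to a neutral default weight.
        totals[result] += reliability.get(user, default_weight)
    weight_sum = sum(totals.values())
    best = max(totals, key=totals.get)
    share = totals[best] / weight_sum if weight_sum > 0 else 0.0
    return best, share


if __name__ == "__main__":
    scores = {"doc1": 0.92, "doc2": 0.41, "doc3": 0.58}
    print(flag_untrustworthy(scores))

    checks = [("alice", "sports"), ("bob", "finance"), ("carol", "sports")]
    reliability = {"alice": 0.9, "bob": 0.6, "carol": 0.7}
    print(select_result(checks, reliability))
```

Weighting each result by who submitted it, rather than counting raw votes, lets one reliable checker outweigh several unreliable ones. Whether the paper uses a comparable weighting is not stated in the abstract.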




Editor information

Václav Matoušek, Pavel Mautner


Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, P., Zhu, Q., Qian, P., Fox, G.C. (2007). Constructing a Large Scale Text Corpus Based on the Grid and Trustworthiness. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2007. Lecture Notes in Computer Science, vol 4629. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74628-7_10


  • DOI: https://doi.org/10.1007/978-3-540-74628-7_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74627-0

  • Online ISBN: 978-3-540-74628-7

  • eBook Packages: Computer Science, Computer Science (R0)
