Constructing Web Corpora through Topical Web Partitioning for Term Recognition

  • Conference paper
AI 2008: Advances in Artificial Intelligence (AI 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5360)

Abstract

The need for on-demand discovery of very large, incremental text corpora covering an unrestricted range of domains, for use in term recognition during ontology learning, is becoming increasingly pressing. In this paper, we introduce a new 3-phase web partitioning approach for automatically constructing web corpora to support term recognition. An evaluation of the web corpora constructed using this approach demonstrated high precision in term recognition, a result comparable to that obtained with manually created local corpora.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wong, W., Liu, W., Bennamoun, M. (2008). Constructing Web Corpora through Topical Web Partitioning for Term Recognition. In: Wobcke, W., Zhang, M. (eds) AI 2008: Advances in Artificial Intelligence. AI 2008. Lecture Notes in Computer Science, vol 5360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89378-3_7

  • DOI: https://doi.org/10.1007/978-3-540-89378-3_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89377-6

  • Online ISBN: 978-3-540-89378-3

  • eBook Packages: Computer Science, Computer Science (R0)
