Abstract
The role of the Web in text corpus construction is becoming increasingly significant. However, the Web's contribution has largely been confined to building general virtual corpora or low-quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engine employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques on the task of term recognition.


Notes
Google’s Web search interface serves up to 1,000 results per query. However, automated crawling and scraping of the results pages for URLs leads to the blocking of IP addresses. Google’s SOAP API, which allowed up to 1,000 queries per day, was permanently phased out in August 2009. Refer to http://www.googleajaxsearchapi.blogspot.com/2007/12/search-result-limit-increase.html for more information.
Certain websites, such as news sites and hosting sites, have contents that are heterogeneous in nature. Such sites are, however, automatically and systematically identified and removed by the proposed technique during the corpus construction process.
A generalised version of the Normalised Google Distance (NGD) by Cilibrasi and Vitanyi (2007).
This page count and all subsequent page counts derived from Google and Yahoo were obtained on 2 April 2009.
Other commonly-used search engines such as AltaVista and AlltheWeb were not included for comparison since they use the same search index as Yahoo.
A demo is available at http://www.ontology.csse.uwa.edu.au/research/algorithm_hercules.pl.
The terms are ranked using the technique by Basili et al. (2001).
The download speed was tested using http://www.ozspeedtest.com/.
More information on Yahoo! Search, including API key registration, is available at http://www.developer.yahoo.com/search/web/V1/webSearch.html.
A demo is available at http://www.ontology.csse.uwa.edu.au/research/data_virtualcorpus.pl.
A demo is available at http://www.ontology.csse.uwa.edu.au/research/data_localcorpus.pl.
Note that this estimate is highly conjectural but serves as an interesting point of discussion and future work. If linear extrapolation were used instead, a precision of 99.21% may require only 85 seeds. Linear extrapolation is, however, less plausible: it implies that even with zero seeds, in other words an empty corpus, the precision would remain at an improbably high 94.12%.
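The back-of-envelope linear extrapolation in this note can be sketched as follows. The intercept (94.12% precision at 0 seeds) and the 99.21% target come from the note; the slope is a hypothetical value derived here purely for illustration, not a figure reported in the paper.

```python
# Sketch of the note's linear extrapolation (illustrative only).
intercept = 94.12            # precision (%) predicted at 0 seeds, per the note
slope = (99.21 - 94.12) / 85 # hypothetical: ~0.06 precision points per seed,
                             # chosen so that 85 seeds reach the 99.21% target

def predicted_precision(num_seeds):
    """Linear model: precision grows by `slope` points per seed."""
    return intercept + slope * num_seeds

def seeds_for(target_precision):
    """Invert the linear model to estimate the seeds needed."""
    return (target_precision - intercept) / slope

print(round(seeds_for(99.21)))  # 85 seeds under the linear model
```

The improbably high value of `predicted_precision(0)` (94.12% for an empty corpus) is precisely why the note doubts the linear model.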
References
Adamic, L., & Huberman, B. (2002). Zipf’s law and the internet. Glottometrics, 3(1), 143–150.
Agbago, A., & Barriere, C. (2005). Corpus construction for terminology. In Proceedings of the corpus linguistics conference, Birmingham, UK.
Baroni, M., & Bernardini, S. (2004). Bootcat: Bootstrapping corpora and terms from the web. In Proceedings of the 4th language resources and evaluation conference (LREC), Lisbon, Portugal.
Baroni, M., & Bernardini, S. (2006). Wacky! working papers on the web as corpus. Bologna, Italy: GEDIT.
Baroni, M., & Ueyama, M. (2006). Building general- and special-purpose corpora by web crawling. In Proceedings of the 13th NIJL international symposium on language corpora: Their compilation and application.
Baroni, M., Kilgarriff, A., Pomikalek, J., & Rychly, P. (2006). Webbootcat: Instant domain-specific corpora to support human translators. In Proceedings of the 11th annual conference of the European association for Machine Translation (EAMT), Norway.
Basili, R., Moschitti, A., Pazienza, M., & Zanzotto, F. (2001). A contrastive approach to term extraction. In Proceedings of the 4th terminology and artificial intelligence conference (TIA), France.
Blair, I., Urland, G., & Ma, J. (2002). Using internet search engines to estimate word frequency. Behavior Research Methods, Instruments, & Computers, 34(2), 286–290.
Cavaglia, G., & Kilgarriff, A. (2001). Corpora from the web. In Proceedings of the 4th annual CLUCK colloquium, Sheffield, UK.
Cilibrasi, R., & Vitanyi, P. (2007). The google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370–383.
Evert, S. (2007). Stupidos: A high-precision approach to boilerplate removal. In Proceedings of the 3rd web as corpus workshop, Belgium.
Evert, S. (2008). A lightweight and efficient tool for cleaning web pages. In Proceedings of the 4th web as corpus workshop (WAC), Morocco.
Fetterly, D., Manasse, M., Najork, M., & Wiener, J. (2003). A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on world wide web, Budapest, Hungary.
Fletcher, W. (2007). Implementing a bnc-comparable web corpus. In Proceedings of the 3rd web as corpus workshop, Belgium.
Francis, W., & Kucera, H. (1979). Brown corpus manual. http://icame.uib.no/brown/bcm.html.
Girardi, C. (2007). Htmlcleaner: Extracting the relevant text from the web pages. In Proceedings of the 3rd web as corpus workshop, Belgium.
Halliday, M., Teubert, W., Yallop, C., & Cermakova, A. (2004). Lexicology and corpus linguistics: An introduction. London: Continuum.
Henzinger, M., & Lawrence, S. (2004). Extracting knowledge from the world wide web. PNAS, 101(1), 5186–5191.
Jock, F. (2009). An overview of the importance of page rank. http://www.associatedcontent.com/article/1502284/an_overview_of_the_importance_of_page.html?cat=15. Accessed 9 March 2009.
Keller, F., Lapata, M., & Ourioupina, O. (2002). Using the web to overcome data sparseness. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), Philadelphia.
Kida, M., Tonoike, M., Utsuro, T., & Sato, S. (2007). Domain classification of technical terms using the web. Systems and Computers in Japan, 38(14), 11–19.
Kilgarriff, A. (2001). Web as corpus. In Proceedings of the corpus linguistics (CL), Lancaster University, UK.
Kilgarriff, A. (2007). Googleology is bad science. Computational Linguistics, 33(1), 147–151.
Kilgarriff, A., & Grefenstette, G. (2003). Web as corpus. Computational Linguistics, 29(3), 1–15.
Kim, J., Ohta, T., Teteisi, Y., & Tsujii, J. (2003). Genia corpus: A semantically annotated corpus for bio-textmining. Bioinformatics, 19(1), 180–182.
Lapata, M., & Keller, F. (2005). Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1), 1–30.
Liberman, M. (2005). Questioning reality. http://www.itre.cis.upenn.edu./myl/languagelog/archives/001837.html. Accessed 26 March 2009.
Liu, V., & Curran, J. (2006). Web text corpus for natural language processing. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics (EACL), Italy.
McEnery, T., Xiao, R., & Tono, Y. (2005). Corpus-based language studies: An advanced resource book. London, UK: Taylor & Francis Group Plc.
Nakov, P., & Hearst, M. (2005). A study of using search engine page hits as a proxy for n-gram frequencies. In Proceedings of the international conference on recent advances in natural language processing (RANLP), Bulgaria.
O’Neill, E., McClain, P., & Lavoie, B. (2001). A methodology for sampling the world wide web. Journal of Library Administration, 34(3), 279–291.
Ravichandran, D., Pantel, P., & Hovy, E. (2005). Randomized algorithms and nlp: Using locality sensitive hash function for high speed noun clustering. In Proceedings of the 43rd annual meeting on association for computational linguistics, Michigan, USA.
Renouf, A., Kehoe, A., & Banerjee, J. (2007). Webcorp: An integrated system for web text search. In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus linguistics and the web. Amsterdam: Rodopi.
Resnik, P., & Smith, N. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.
Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus. Bologna: GEDIT.
Thelwall, M., & Stuart, D. (2006). Web crawling ethics revisited: Cost, privacy and denial of service. Journal of the American Society for Information Science and Technology, 57(13), 1771–1779.
Turney, P. (2001). Mining the web for synonyms: Pmi-ir versus lsa on toefl. In Proceedings of the 12th European conference on machine learning (ECML). Freiburg, Germany.
Wong, W., Liu, W., & Bennamoun, M. (2007). Tree-traversing ant algorithm for term clustering based on featureless similarities. Data Mining and Knowledge Discovery, 15(3), 349–381.
Wong, W., Liu, W., & Bennamoun, M. (2008a). Constructing web corpora through topical web partitioning for term recognition. In Proceedings of the 21st Australasian joint conference on artificial intelligence (AI). Auckland, New Zealand.
Wong, W., Liu, W., & Bennamoun, M. (2008b). Determination of unithood and termhood for term recognition. In M. Song & Y. Wu (Eds.), Handbook of research on text and web mining technologies. Hershey: IGI Global.
Wong, W., Liu, W., & Bennamoun, M. (2009). A probabilistic framework for automatic term recognition. Intelligent Data Analysis, 13(4), 499–539.
Acknowledgments
This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the UWA Research Development Award 2009 from the University of Western Australia. The authors would like to thank the anonymous reviewers for their invaluable comments.
Cite this article
Wong, W., Liu, W. & Bennamoun, M. Constructing specialised corpora through analysing domain representativeness of websites. Lang Resources & Evaluation 45, 209–241 (2011). https://doi.org/10.1007/s10579-011-9141-4