Abstract
The role of the Web in text corpus construction is becoming increasingly significant. However, the Web's contribution has largely been confined to building general virtual corpora or low-quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engine employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques on the task of term recognition.


Notes
Google’s Web search interface serves up to 1,000 results per query. However, automated crawling and scraping of the results pages for URLs leads to the blocking of IP addresses. Google’s SOAP API, which allowed up to 1,000 queries per day, was permanently phased out in August 2009. Refer to http://www.googleajaxsearchapi.blogspot.com/2007/12/search-result-limit-increase.html for more information.
Certain websites, such as news sites and hosting sites, have contents that are heterogeneous in nature. Such sites are, however, automatically and systematically identified and removed by the proposed technique during the corpus construction process.
A generalised version of the Normalised Google Distance (NGD) by Cilibrasi and Vitanyi (2007).
This page count and all subsequent page counts derived from Google and Yahoo were obtained on 2 April 2009.
Other commonly-used search engines such as AltaVista and AlltheWeb were not included for comparison since they use the same search index as Yahoo.
A demo is available at http://www.ontology.csse.uwa.edu.au/research/algorithm_hercules.pl.
The terms are ranked using the technique by Basili et al. (2001).
The download speed was tested using http://www.ozspeedtest.com/.
More information on Yahoo! Search, including API key registration, is available at http://www.developer.yahoo.com/search/web/V1/webSearch.html.
A demo is available at http://www.ontology.csse.uwa.edu.au/research/data_virtualcorpus.pl.
A demo is available at http://www.ontology.csse.uwa.edu.au/research/data_localcorpus.pl.
Note that this estimate is highly conjectural but serves as an interesting point of discussion and future work. If linear extrapolation were used instead, a precision of 99.21% may require only 85 seeds. Linear extrapolation is, however, less plausible: it implies that even with zero seeds, in other words an empty corpus, the precision would remain at an improbably high 94.12%.
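The back-of-envelope linear extrapolation in this note can be sketched as follows. The intercept (94.12% precision at 0 seeds) and the 99.21% target come from the note; the slope is a hypothetical value derived here purely for illustration, not a figure reported in the paper.

```python
# Sketch of the note's linear extrapolation (illustrative only).
intercept = 94.12            # precision (%) predicted at 0 seeds, per the note
slope = (99.21 - 94.12) / 85 # hypothetical: ~0.06 precision points per seed,
                             # chosen so that 85 seeds reach the 99.21% target

def predicted_precision(num_seeds):
    """Linear model: precision grows by `slope` points per seed."""
    return intercept + slope * num_seeds

def seeds_for(target_precision):
    """Invert the linear model to estimate the seeds needed."""
    return (target_precision - intercept) / slope

print(round(seeds_for(99.21)))  # 85 seeds under the linear model
```

The improbably high value of `predicted_precision(0)` (94.12% for an empty corpus) is precisely why the note doubts the linear model.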
References
Adamic, L., & Huberman, B. (2002). Zipf’s law and the internet. Glottometrics, 3(1), 143–150.
Agbago, A., & Barriere, C. (2005). Corpus construction for terminology. In Proceedings of the corpus linguistics conference, Birmingham, UK.
Baroni, M., & Bernardini, S. (2004). Bootcat: Bootstrapping corpora and terms from the web. In Proceedings of the 4th language resources and evaluation conference (LREC), Lisbon, Portugal.
Baroni, M., & Bernardini, S. (2006). Wacky! working papers on the web as corpus. Bologna, Italy: GEDIT.
Baroni, M., & Ueyama, M. (2006). Building general- and special-purpose corpora by web crawling. In Proceedings of the 13th NIJL international symposium on language corpora: Their compilation and application.
Baroni, M., Kilgarriff, A., Pomikalek, J., & Rychly, P. (2006). Webbootcat: Instant domain-specific corpora to support human translators. In Proceedings of the 11th annual conference of the European association for Machine Translation (EAMT), Norway.
Basili, R., Moschitti, A., Pazienza, M., & Zanzotto, F. (2001). A contrastive approach to term extraction. In Proceedings of the 4th terminology and artificial intelligence conference (TIA), France.
Blair, I., Urland, G., & Ma, J. (2002). Using internet search engines to estimate word frequency. Behavior Research Methods, Instruments, & Computers, 34(2), 286–290.
Cavaglia, G., & Kilgarriff, A. (2001). Corpora from the web. In Proceedings of the 4th annual CLUCK colloquium, Sheffield, UK.
Cilibrasi, R., & Vitanyi, P. (2007). The google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370–383.
Evert, S. (2007). Stupidos: A high-precision approach to boilerplate removal. In Proceedings of the 3rd web as corpus workshop, Belgium.
Evert, S. (2008). A lightweight and efficient tool for cleaning web pages. In Proceedings of the 4th web as corpus workshop (WAC), Morocco.
Fetterly, D., Manasse, M., Najork, M., & Wiener, J. (2003). A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on world wide web, Budapest, Hungary.
Fletcher, W. (2007). Implementing a bnc-comparable web corpus. In Proceedings of the 3rd web as corpus workshop, Belgium.
Francis, W., & Kucera, H. (1979). Brown corpus manual. http://icame.uib.no/brown/bcm.html.
Girardi, C. (2007). Htmlcleaner: Extracting the relevant text from the web pages. In Proceedings of the 3rd web as corpus workshop, Belgium.
Halliday, M., Teubert, W., Yallop, C., & Cermakova, A. (2004). Lexicology and corpus linguistics: An introduction. London: Continuum.
Henzinger, M., & Lawrence, S. (2004). Extracting knowledge from the world wide web. PNAS, 101(1), 5186–5191.
Jock, F. (2009). An overview of the importance of page rank. http://www.associatedcontent.com/article/1502284/an_overview_of_the_importance_of_page.html?cat=15. Accessed 9 March 2009.
Keller, F., Lapata, M., & Ourioupina, O. (2002). Using the web to overcome data sparseness. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), Philadelphia.
Kida, M., Tonoike, M., Utsuro, T., & Sato, S. (2007). Domain classification of technical terms using the web. Systems and Computers in Japan, 38(14), 11–19.
Kilgarriff, A. (2001). Web as corpus. In Proceedings of the corpus linguistics (CL), Lancaster University, UK.
Kilgarriff, A. (2007). Googleology is bad science. Computational Linguistics, 33(1), 147–151.
Kilgarriff, A., & Grefenstette, G. (2003). Web as corpus. Computational Linguistics, 29(3), 1–15.
Kim, J., Ohta, T., Teteisi, Y., & Tsujii, J. (2003). Genia corpus: A semantically annotated corpus for bio-textmining. Bioinformatics, 19(1), 180–182.
Lapata, M., & Keller, F. (2005). Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1), 1–30.
Liberman, M. (2005). Questioning reality. http://www.itre.cis.upenn.edu./myl/languagelog/archives/001837.html. Accessed 26 March 2009.
Liu, V., & Curran, J. (2006). Web text corpus for natural language processing. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics (EACL), Italy.
McEnery, T., Xiao, R., & Tono, Y. (2005). Corpus-based language studies: An advanced resource book. London, UK: Taylor & Francis Group Plc.
Nakov, P., & Hearst, M. (2005). A study of using search engine page hits as a proxy for n-gram frequencies. In Proceedings of the international conference on recent advances in natural language processing (RANLP), Bulgaria.
O’Neill, E., McClain, P., & Lavoie, B. (2001). A methodology for sampling the world wide web. Journal of Library Administration, 34(3), 279–291.
Ravichandran, D., Pantel, P., & Hovy, E. (2005). Randomized algorithms and nlp: Using locality sensitive hash function for high speed noun clustering. In Proceedings of the 43rd annual meeting on association for computational linguistics, Michigan, USA.
Renouf, A., Kehoe, A., & Banerjee, J. (2007). Webcorp: An integrated system for web text search. In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus linguistics and the web. Amsterdam: Rodopi.
Resnik, P., & Smith, N. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.
Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus. Bologna: GEDIT.
Thelwall, M., & Stuart, D. (2006). Web crawling ethics revisited: Cost, privacy and denial of service. Journal of the American Society for Information Science and Technology, 57(13), 1771–1779.
Turney, P. (2001). Mining the web for synonyms: Pmi-ir versus lsa on toefl. In Proceedings of the 12th European conference on machine learning (ECML). Freiburg, Germany.
Wong, W., Liu, W., & Bennamoun, M. (2007). Tree-traversing ant algorithm for term clustering based on featureless similarities. Data Mining and Knowledge Discovery, 15(3), 349–381.
Wong, W., Liu, W., & Bennamoun, M. (2008a). Constructing web corpora through topical web partitioning for term recognition. In Proceedings of the 21st Australasian joint conference on artificial intelligence (AI). Auckland, New Zealand.
Wong, W., Liu, W., & Bennamoun, M. (2008b). Determination of unithood and termhood for term recognition. In M. Song & Y. Wu (Eds.), Handbook of research on text and web mining technologies. Hershey: IGI Global.
Wong, W., Liu, W., & Bennamoun, M. (2009). A probabilistic framework for automatic term recognition. Intelligent Data Analysis, 13(4), 499–539.
Acknowledgments
This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the UWA Research Development Award 2009 from the University of Western Australia. The authors would like to thank the anonymous reviewers for their invaluable comments.
Cite this article
Wong, W., Liu, W. & Bennamoun, M. Constructing specialised corpora through analysing domain representativeness of websites. Lang Resources & Evaluation 45, 209–241 (2011). https://doi.org/10.1007/s10579-011-9141-4