Abstract
The Internet is a global phenomenon. To support broad use of Internet applications such as the World Wide Web, character encodings have been developed for many scripts of the world’s languages and there are standard mechanisms for indicating that content is in a particular language and/or tailored to a particular region. In this paper we study the empirical characteristics of language tags used in HTTP transactions and in web pages to indicate the language of the content and possibly the script, region, and other information. To support our analysis, we develop a new algorithm to infer the value of a missing language tag for elements used to link to alternative language content. We analyze the top-level page for websites in the Alexa Top 1 Million, from six geographic perspectives. We find that one third of all pages do not include any language tags, that half of the remaining sites are tagged with English (en), and that about 10 K sites have malformed tags. We observe that 80 K sites are multilingual, and that there are hundreds of sites that offer content in the tens of languages. Besides malformed tags, we find numerous instances of correctly formed but likely erroneous language tags by using a Naïve Bayes-based language detection library and comparing its output with a given page’s language tag(s). Lastly, we comment on differences in language tags observed for the same site but from different geographic vantage points or by using different client language preferences via the HTTP Accept-Language header.
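To make the objects of study concrete, the following is a minimal sketch of how language tags might be pulled from a page and screened for obviously malformed values. It is not the measurement pipeline or the inference algorithm used in this paper. The names `looks_well_formed`, `LangTagCollector`, and `collect_language_tags` are hypothetical, and the regular expression is a deliberately simplified stand-in for the full BCP 47 grammar (for example, it rejects private-use tags such as `x-foo`).

```python
import re
from html.parser import HTMLParser

# Simplified check (an assumption for illustration, NOT the full BCP 47
# grammar): a 2-3 letter primary language subtag followed by zero or more
# hyphen-separated alphanumeric subtags of 1-8 characters each.
SIMPLE_TAG = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*$")


def looks_well_formed(tag):
    """Return True if the tag passes the simplified pattern above."""
    return bool(SIMPLE_TAG.match(tag))


class LangTagCollector(HTMLParser):
    """Collect the <html lang> value and hreflang values on alternate links."""

    def __init__(self):
        super().__init__()
        self.page_lang = None   # value of lang on the <html> element, if any
        self.alternates = []    # hreflang values from <link rel="alternate">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and self.page_lang is None:
            self.page_lang = attrs.get("lang")
        elif tag == "link" and attrs.get("hreflang"):
            rel = (attrs.get("rel") or "").lower().split()
            if "alternate" in rel:
                self.alternates.append(attrs["hreflang"])


def collect_language_tags(html_text):
    """Return (page_lang, alternates) for one HTML document."""
    parser = LangTagCollector()
    parser.feed(html_text)
    return parser.page_lang, parser.alternates


if __name__ == "__main__":
    sample = (
        '<html lang="en-US"><head>'
        '<link rel="alternate" hreflang="fr" href="/fr/">'
        '<link rel="alternate" hreflang="en_GB" href="/uk/">'
        "</head><body></body></html>"
    )
    page_lang, alternates = collect_language_tags(sample)
    print(page_lang, alternates)
    for tag in alternates:
        print(tag, looks_well_formed(tag))
```

In the sample document, the underscore in `en_GB` fails the simplified check, which is one kind of malformed tag a screen like this would flag. A correctly formed tag that does not match the page's actual language would pass it unnoticed.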
Acknowledgments
We thank Alex Nie ’20 and Ryan Rios ’20, who contributed to earlier stages of this work as part of their summer research. We also thank Ram Durairajan for insightful comments on this work, as well as the anonymous reviewers. Lastly, we thank the Colgate Research Council, which provided partial support for this research.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Sommers, J. (2018). On the Characteristics of Language Tags on the Web. In: Beverly, R., Smaragdakis, G., Feldmann, A. (eds) Passive and Active Measurement. PAM 2018. Lecture Notes in Computer Science, vol 10771. Springer, Cham. https://doi.org/10.1007/978-3-319-76481-8_2
DOI: https://doi.org/10.1007/978-3-319-76481-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76480-1
Online ISBN: 978-3-319-76481-8
eBook Packages: Computer Science (R0)