research-article

Segmenting web-domains and hashtags using length specific models

Authors:

Sriram Srinivasan,

Sourangshu Bhattacharya,

Rudrasis Chakraborty

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

Pages 1113 - 1122

https://doi.org/10.1145/2396761.2398410

Published: 29 October 2012

Abstract

Segmentation of a string of English language characters into a sequence of words has many applications. Here, we study two applications in the internet domain. The first is web domain segmentation, which is crucial for monetization of broken URLs. Second, we propose and study a novel application: Twitter hashtag segmentation for increasing recall on Twitter searches. Existing methods for word segmentation use unsupervised language models. We find that, when multiple corpora are available, the joint probability model over all corpora performs significantly better than a model built from any individual corpus. Motivated by this, we propose a weighted joint probability model, with weights specific to each corpus. We propose to train the weights in a supervised manner using max-margin methods. The supervised probability models improve segmentation accuracy over joint probability models. Finally, we observe that the length of segments is an important parameter for word segmentation, and incorporate length-specific weights into our model. The length-specific models further improve segmentation accuracy over supervised probability models. For all models proposed here, the inference problem can be solved using a dynamic programming algorithm. We test our methods on five different datasets: two from web domains data, and three from news headlines data from an LDC dataset. The supervised length-specific models show significant improvements over unsupervised single-corpus and joint probability models. Cross-testing between the datasets confirms that supervised probability models trained on all datasets, and length-specific models trained on news headlines data, generalize well. Segmentation of hashtags results in a significant improvement in recall on searches for Twitter trends.
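The inference step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy unigram tables, the `UNSEEN` floor, the `MAX_LEN` cap, and the function names are all assumptions made for illustration, and the weights are set by hand rather than learned with a max-margin method. Each candidate segment is scored by a weighted sum of per-corpus log-probabilities, with one weight per corpus and segment length, and the best segmentation is found by dynamic programming.

```python
import math

# Toy unigram probabilities for two hypothetical corpora (illustrative only).
CORPORA = [
    {"new": 0.02, "york": 0.01, "times": 0.01, "news": 0.005, "time": 0.01},
    {"new": 0.03, "york": 0.02, "times": 0.02, "now": 0.01},
]
UNSEEN = 1e-9  # assumed floor probability for out-of-vocabulary segments
MAX_LEN = 10   # assumed maximum segment length


def segment_score(word, weights):
    """Weighted sum of per-corpus log-probabilities for one segment.

    weights[k][l] is the weight of corpus k for segments of length l,
    so each corpus can count differently at each segment length.
    """
    return sum(
        weights[k][len(word)] * math.log(corpus.get(word, UNSEEN))
        for k, corpus in enumerate(CORPORA)
    )


def segment(text, weights):
    """Return the highest-scoring segmentation of text by dynamic programming."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last segment in text[:i]
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            score = best[j] + segment_score(text[j:i], weights)
            if score > best[i]:
                best[i], back[i] = score, j
    # Follow the back-pointers from the end of the string to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Uniform weights reduce the score to a plain joint log-probability.
uniform = [[1.0] * (MAX_LEN + 1) for _ in CORPORA]
print(segment("newyorktimes", uniform))
```

With these toy tables the call prints `['new', 'york', 'times']`. Replacing `uniform` with weights that vary by corpus and by segment length gives the shape of the supervised and length-specific models; how those weights are actually trained is outside the scope of this sketch.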


Index Terms

Segmenting web-domains and hashtags using length specific models
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published In

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

October 2012

2840 pages

ISBN:9781450311564

DOI:10.1145/2396761

General Chair: Xuewen Chen (Wayne State University, USA)

Program Chairs: Guy Lebanon (Georgia Institute of Technology), Haixun Wang (Microsoft Research Asia), Mohammed J. Zaki (Rensselaer Polytechnic Institute)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM'12: 21st ACM International Conference on Information and Knowledge Management

October 29 - November 2, 2012

Maui, Hawaii, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

Total Citations: 15
Total Downloads: 366
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Feb 2025

Cited By

Li H, Hao Y, Lyu M, Yu X, Yang B, Peng L (2023). An Early Stage Identification of Cryptomining Behavior with DNS Requests. Advanced Data Mining and Applications, pages 30-44. Online publication date: 5-Nov-2023.
https://doi.org/10.1007/978-3-031-46677-9_3
Wang Y (2022). Malicious URL Detection An Evaluation of Feature Extraction and Machine Learning Algorithm. Highlights in Science, Engineering and Technology, 23:117-123. Online publication date: 3-Dec-2022.
https://doi.org/10.54097/hset.v23i.3209
Vitale P, Pelosi S, Falco M (2021). #andràtuttobene: Images, Texts, Emojis and Geodata in a Sentiment Analysis Pipeline. Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020, pages 450-456. Online publication date: 3-Sep-2021.
https://doi.org/10.4000/books.aaccademia.8954
Zangerle E, Chen C, Tsai M, Yang Y (2021). Leveraging Affective Hashtags for Ranking Music Recommendations. IEEE Transactions on Affective Computing, 12(1):78-91. Online publication date: 1-Jan-2021.
https://doi.org/10.1109/TAFFC.2018.2846596
Aggarwal S, Houshmand S, Mukherjee T, Parsons J (2020). An Empirical Study on Efficiency of a Dictionary Based Viterbi Algorithm for Word Segmentation. 2020 IEEE International Conference on Big Data (Big Data), pages 3702-3710. Online publication date: 10-Dec-2020.
https://doi.org/10.1109/BigData50022.2020.9377762
G. P. A, R. G, S. K, Gladston A (2020). A machine learning framework for domain generating algorithm based malware detection. Security and Privacy, 3(6). Online publication date: 4-Nov-2020.
https://dl.acm.org/doi/10.1002/spy2.127
Li Y, Xiong K, Chin T, Hu C (2019). A Machine Learning Framework for Domain Generation Algorithm (DGA)-Based Malware Detection. IEEE Access, pages 1-1. Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2019.2891588
Doval Y, Gómez‐Rodríguez C (2019). Comparing neural‐ and N‐gram‐based language models for word segmentation. Journal of the Association for Information Science and Technology, 70(2):187-197. Online publication date: 4-Jan-2019.
https://dl.acm.org/doi/10.1002/asi.24082
West A, Fox P, McGuinness D, Poirer L, Boldi P, Kinder-Kurlanda K (2017). Analyzing the Keystroke Dynamics of Web Identifiers. Proceedings of the 2017 ACM on Web Science Conference, pages 181-190. Online publication date: 25-Jun-2017.
https://dl.acm.org/doi/10.1145/3091478.3091482
Çelebi A, Özgür A (2017). Segmenting hashtags and analyzing their grammatical structure. Journal of the Association for Information Science and Technology, 69(5):675-686. Online publication date: 8-Dec-2017.
https://doi.org/10.1002/asi.23989
