DOI: 10.1145/2396761.2398410
research-article

Segmenting web-domains and hashtags using length specific models

Published: 29 October 2012

Abstract

Segmenting a string of English characters into a sequence of words has many applications. Here, we study two applications in the internet domain. The first is web domain segmentation, which is crucial for the monetization of broken URLs. The second is a novel application that we propose and study: Twitter hashtag segmentation for increasing recall on Twitter searches. Existing methods for word segmentation use unsupervised language models. We find that, when multiple corpora are available, a joint probability model built from all of them performs significantly better than a model built from any individual corpus. Motivated by this, we propose a weighted joint probability model with weights specific to each corpus, and we train the weights in a supervised manner using max-margin methods. The supervised probability models improve segmentation accuracy over joint probability models. Finally, we observe that segment length is an important parameter for word segmentation, and we incorporate length-specific weights into our model. The length-specific models further improve segmentation accuracy over the supervised probability models. For all models proposed here, the inference problem can be solved using dynamic programming. We test our methods on five datasets: two from web domain data and three from news headline data from an LDC dataset. The supervised length-specific models show significant improvements over unsupervised single-corpus and joint probability models. Cross-testing between the datasets confirms that supervised probability models trained on all datasets, and length-specific models trained on news headline data, generalize well. Segmenting hashtags results in a significant improvement in recall on searches for Twitter trends.
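
The inference step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes unigram models, a segment score equal to a weighted sum of per-corpus log-probabilities, and a weight looked up by segment length. The corpus names, probabilities, `FLOOR`, `MAX_LEN`, and the `weights` layout are all hypothetical, and the max-margin training of the weights is not shown.

```python
import math

# Toy unigram probabilities for two hypothetical corpora (not from the paper).
CORPORA = {
    "news": {"new": 0.02, "york": 0.01, "times": 0.015, "newyork": 0.0001},
    "web": {"new": 0.03, "york": 0.005, "times": 0.01, "ny": 0.002},
}
FLOOR = 1e-9   # assumed probability for words unseen in a corpus
MAX_LEN = 12   # assumed cap on the length of a single segment


def segment_score(word, weights):
    """Weighted sum of per-corpus log-probabilities for one candidate segment.

    weights[corpus] maps a segment length to a weight; lengths that are not
    listed fall back to weights[corpus]["default"].
    """
    total = 0.0
    for name, probs in CORPORA.items():
        w = weights[name].get(len(word), weights[name]["default"])
        total += w * math.log(probs.get(word, FLOOR))
    return total


def segment(text, weights):
    """Return the highest-scoring segmentation of text via dynamic programming."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last segment
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            score = best[j] + segment_score(text[j:i], weights)
            if score > best[i]:
                best[i], back[i] = score, j
    # Follow the back-pointers from the end of the string to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Equal weights for every corpus and length give the plain joint-probability model.
uniform = {"news": {"default": 1.0}, "web": {"default": 1.0}}
print(segment("newyorktimes", uniform))
```

In this sketch, a length-specific model only changes the `weights` table (for example, adding an entry such as `{"default": 1.0, 7: 2.0}` for one corpus); the dynamic program itself is unchanged, which mirrors the abstract's statement that all proposed models share the same inference procedure.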

    Published In

    CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
    October 2012
    2840 pages
    ISBN:9781450311564
    DOI:10.1145/2396761

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. compound splitting
    2. hashtag segmentation
    3. structured learning
    4. web domain segmentation
    5. word segmentation

    Qualifiers

    • Research-article

    Conference

    CIKM'12

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2023) An Early Stage Identification of Cryptomining Behavior with DNS Requests. Advanced Data Mining and Applications, pages 30-44. DOI: 10.1007/978-3-031-46677-9_3. Online publication date: 5-Nov-2023
    • (2022) Malicious URL Detection An Evaluation of Feature Extraction and Machine Learning Algorithm. Highlights in Science, Engineering and Technology, 23, pages 117-123. DOI: 10.54097/hset.v23i.3209. Online publication date: 3-Dec-2022
    • (2021) #andràtuttobene: Images, Texts, Emojis and Geodata in a Sentiment Analysis Pipeline. Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020, pages 450-456. DOI: 10.4000/books.aaccademia.8954. Online publication date: 3-Sep-2021
    • (2021) Leveraging Affective Hashtags for Ranking Music Recommendations. IEEE Transactions on Affective Computing, 12:1, pages 78-91. DOI: 10.1109/TAFFC.2018.2846596. Online publication date: 1-Jan-2021
    • (2020) An Empirical Study on Efficiency of a Dictionary Based Viterbi Algorithm for Word Segmentation. 2020 IEEE International Conference on Big Data (Big Data), pages 3702-3710. DOI: 10.1109/BigData50022.2020.9377762. Online publication date: 10-Dec-2020
    • (2020) A machine learning framework for domain generating algorithm based malware detection. Security and Privacy, 3:6. DOI: 10.1002/spy2.127. Online publication date: 4-Nov-2020
    • (2019) A Machine Learning Framework for Domain Generation Algorithm (DGA)-Based Malware Detection. IEEE Access, pages 1-1. DOI: 10.1109/ACCESS.2019.2891588. Online publication date: 2019
    • (2019) Comparing neural- and N-gram-based language models for word segmentation. Journal of the Association for Information Science and Technology, 70:2, pages 187-197. DOI: 10.1002/asi.24082. Online publication date: 4-Jan-2019
    • (2017) Analyzing the Keystroke Dynamics of Web Identifiers. Proceedings of the 2017 ACM on Web Science Conference, pages 181-190. DOI: 10.1145/3091478.3091482. Online publication date: 25-Jun-2017
    • (2017) Segmenting hashtags and analyzing their grammatical structure. Journal of the Association for Information Science and Technology, 69:5, pages 675-686. DOI: 10.1002/asi.23989. Online publication date: 8-Dec-2017
