DOI: 10.1145/2815675.2815693
research-article

Who is .com?: Learning to Parse WHOIS Records

Published: 28 October 2015

Abstract

WHOIS is a long-established protocol for querying information about the 280M+ registered domain names on the Internet. Unfortunately, while such records are accessible in a "human-readable" format, they do not follow any consistent schema and thus are challenging to analyze at scale. Existing approaches, which rely on manual crafting of parsing rules and per-registrar templates, are inherently limited in coverage and fragile to ongoing changes in data representations. In this paper, we develop a statistical model for parsing WHOIS records that learns from labeled examples. Our model is a conditional random field (CRF) with a small number of hidden states, a large number of domain-specific features, and parameters that are estimated by efficient dynamic-programming procedures for probabilistic inference. We show that this approach can achieve extremely high accuracy (well over 99%) using modest amounts of labeled training data, that it is robust to minor changes in schema, and that it can adapt to new schema variants by incorporating just a handful of additional examples. Finally, using our parser, we conduct an exhaustive survey of the registration patterns found in 102M .com domains.
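To make the modeling idea concrete, the following is a minimal sketch of a linear-chain CRF applied to a WHOIS record, not the paper's implementation. It assumes that each line of a record is one position in the sequence; the label set, the features, the hand-set weights, and the sample record are all hypothetical, chosen only for illustration (a real model would learn its weights from labeled records). The sketch shows the two dynamic-programming routines such a model relies on: Viterbi decoding for the most likely labeling, and the forward recursion for the log-partition function that normalizes path scores into probabilities.

```python
import math

# Hypothetical label set and hand-set weights, for illustration only.
# A real CRF would estimate these weights from labeled WHOIS records.
LABELS = ["registrar", "registrant", "date", "other"]
EMIT = {  # (feature, label) -> weight
    ("bias", "other"): 0.6,
    ("no_colon", "other"): -0.5,
    ("key=registrar", "registrar"): 2.0,
    ("key=registrant", "registrant"): 2.0,
    ("key=date", "date"): 1.0,
    ("has_digit", "date"): 3.0,
}
TRANS = {(a, a): 0.5 for a in LABELS}  # mild bonus for staying in a block


def features(line):
    """Binary features of one WHOIS line (hypothetical feature set)."""
    feats = {"bias"}
    if ":" in line:
        key, value = line.split(":", 1)
        feats.update("key=" + w for w in key.lower().split())
        if any(ch.isdigit() for ch in value):
            feats.add("has_digit")
    else:
        feats.add("no_colon")
    return feats


def local_score(prev, label, feats):
    """Emission score plus transition score (no transition at position 0)."""
    s = sum(EMIT.get((f, label), 0.0) for f in feats)
    return s + (TRANS.get((prev, label), 0.0) if prev is not None else 0.0)


def viterbi(lines):
    """Most likely label sequence for a non-empty list of lines."""
    feats = [features(l) for l in lines]
    best = {y: local_score(None, y, feats[0]) for y in LABELS}
    back = []
    for f in feats[1:]:
        step, ptr = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[p] + local_score(p, y, f))
            step[y] = best[prev] + local_score(prev, y, f)
            ptr[y] = prev
        best = step
        back.append(ptr)
    y = max(LABELS, key=lambda l: best[l])
    path = [y]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        y = ptr[y]
        path.append(y)
    return path[::-1]


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_partition(lines):
    """Forward recursion: log of the summed exp-scores of all labelings."""
    feats = [features(l) for l in lines]
    alpha = {y: local_score(None, y, feats[0]) for y in LABELS}
    for f in feats[1:]:
        alpha = {y: logsumexp([alpha[p] + local_score(p, y, f) for p in LABELS])
                 for y in LABELS}
    return logsumexp(list(alpha.values()))


def path_score(lines, path):
    """Unnormalized log-score of one complete labeling."""
    feats = [features(l) for l in lines]
    prevs = [None] + list(path[:-1])
    return sum(local_score(p, y, f) for p, y, f in zip(prevs, path, feats))


# Made-up record; the last line is a continuation line with no "key:" prefix.
RECORD = [
    "Domain Name: EXAMPLE.COM",
    "Registrar: Example Registrar, Inc.",
    "Creation Date: 2015-10-28",
    "Registrant Name: Jane Doe",
    "    Tokyo, Japan",
]

if __name__ == "__main__":
    labels = viterbi(RECORD)
    prob = math.exp(path_score(RECORD, labels) - log_partition(RECORD))
    print(labels)  # ['other', 'registrar', 'date', 'registrant', 'registrant']
    print(round(prob, 3))
```

In this toy setup the final line has no informative key of its own, so it inherits the `registrant` label from the preceding line through the transition weight. That is the kind of contextual decision a sequence model can make and a line-by-line rule cannot.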

Published In

IMC '15: Proceedings of the 2015 Internet Measurement Conference
October 2015
550 pages
ISBN:9781450338486
DOI:10.1145/2815675
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. information extraction
  2. machine learning
  3. named entity recognition
  4. whois

Qualifiers

  • Research-article

Conference

IMC '15
IMC '15: Internet Measurement Conference
October 28 - 30, 2015
Tokyo, Japan

Acceptance Rates

IMC '15 Paper Acceptance Rate: 31 of 96 submissions, 32%
Overall Acceptance Rate: 277 of 1,083 submissions, 26%
