Clustering and the Weekend Effect: Recommendations for the Use of Top Domain Lists in Security Research

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 11419)

Abstract

Top domain rankings (e.g., Alexa) are commonly used in security research, such as to survey security features or vulnerabilities of “relevant” websites. Due to their central role in selecting a sample of sites to study, an inappropriate choice or use of such domain rankings can introduce unwanted biases into research results. We quantify various characteristics of three top domain lists that have not been reported before. For example, the weekend effect in Alexa and Umbrella causes these rankings to change their geographical diversity between the workweek and the weekend. Furthermore, up to 91% of ranked domains appear in alphabetically sorted clusters containing up to 87k domains of presumably equivalent popularity. We discuss the practical implications of these findings, and propose novel best practices regarding the use of top domain lists in the security community.
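The notion of alphabetically sorted clusters can be illustrated with a minimal sketch. It assumes that a cluster is a maximal run of consecutively ranked domains in non-decreasing alphabetical order; the function names and the `min_size` threshold (used to ignore short runs that arise by chance) are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def sorted_clusters(domains: List[str], min_size: int = 3) -> List[Tuple[int, int]]:
    """Return (start_rank, size) for each maximal alphabetically sorted run.

    `domains` is ordered by rank (index 0 holds rank 1). `min_size` is an
    assumed threshold that filters out short runs occurring by chance.
    """
    clusters = []
    start = 0
    for i in range(1, len(domains) + 1):
        # A run ends at the end of the list or where alphabetical order breaks.
        if i == len(domains) or domains[i] < domains[i - 1]:
            size = i - start
            if size >= min_size:
                clusters.append((start + 1, size))
            start = i
    return clusters


def clustered_fraction(domains: List[str], min_size: int = 3) -> float:
    """Fraction of ranked domains that fall inside a sorted cluster."""
    if not domains:
        return 0.0
    covered = sum(size for _, size in sorted_clusters(domains, min_size))
    return covered / len(domains)


# Toy ranking (made-up data): ranks 4 to 8 form one sorted cluster.
ranking = ["google.com", "youtube.com", "facebook.com",
           "alpha.org", "beta.net", "delta.io", "gamma.com",
           "zeta.com", "apple.com"]
clusters = sorted_clusters(ranking)
fraction = clustered_fraction(ranking)
```

Within such a run, the relative order carries no popularity information beyond the alphabet, which is why the position of a domain inside a cluster should not be read as a finer-grained rank.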

Acknowledgements

This work was supported by Secure Business Austria and the National Science Foundation under grants CNS-1563320, CNS-1703454, and IIS-1553088.

Author information

Corresponding author

Correspondence to Walter Rweyemamu.

Appendix

Fig. 6. Alexa and Umbrella changes over time in exponentially increasing list intervals, using Sunday 4 February as the reference day. See Fig. 1 for full legend.

Fig. 7. Changes in Majestic over time in exponentially increasing list intervals, using Wednesday 24 March as the reference day. See Fig. 1 for full legend. Majestic is remarkably stable.

Fig. 8. Scatterplot of each alphabetically sorted cluster’s size by its highest rank in Majestic. Partially visible downsampling of small clusters. See Fig. 5 for full legend. Majestic has only small clusters.

Table 4. Top 10 domains on Wed. 4 and Sun. 8 April 2018 in Alexa and Umbrella.

Fig. 9. Heatmap showing Majestic domain extensions’ mean Wednesday market share ± the difference to the mean Sunday market share (also used to colour each cell) in exponentially increasing list intervals 1–10, 11–100, 101–1,000, etc., from March to May 2018. Extensions ordered by Wednesday top 1 M mean market share. Due to Majestic’s high list stability, differences are not visible. (Color figure online)

Fig. 10. Heatmap showing Majestic website categories’ mean Wednesday market share ± the difference to the mean Sunday market share (also used to colour each cell) in exponentially increasing list intervals 1–10, 11–100, 101–1,000, etc., from March to April 2018. Categories ordered by Wednesday top 1 M mean market share. Due to Majestic’s high list stability, differences are not visible. (Color figure online)
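The interval-wise market shares described in the captions of Figs. 9 and 10 could be computed roughly as sketched below. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the extension is simplified to the label after the final dot (multi-label suffixes such as co.uk are not handled), and the difference is taken as Wednesday minus Sunday.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def rank_intervals(list_size: int) -> List[Tuple[int, int]]:
    """Intervals 1-10, 11-100, 101-1,000, ... as inclusive (first, last) ranks."""
    intervals, first, last = [], 1, 10
    while first <= list_size:
        intervals.append((first, min(last, list_size)))
        first, last = last + 1, last * 10
    return intervals


def extension_share(domains: List[str], first: int, last: int) -> Dict[str, float]:
    """Share of each extension among ranks first..last (1-indexed, inclusive)."""
    chunk = domains[first - 1:last]
    # Assumption: the label after the final dot stands in for the extension.
    counts = Counter(d.rsplit(".", 1)[-1] for d in chunk)
    return {ext: n / len(chunk) for ext, n in counts.items()}


def mean_share(days: List[List[str]], first: int, last: int) -> Dict[str, float]:
    """Mean per-extension share over several daily lists for one interval."""
    totals: Dict[str, float] = defaultdict(float)
    for domains in days:
        for ext, share in extension_share(domains, first, last).items():
            totals[ext] += share
    return {ext: total / len(days) for ext, total in totals.items()}


def weekday_vs_weekend(wednesdays: List[List[str]], sundays: List[List[str]],
                       first: int, last: int) -> Dict[str, Tuple[float, float]]:
    """Map extension -> (mean Wednesday share, Wednesday minus Sunday share)."""
    wed = mean_share(wednesdays, first, last)
    sun = mean_share(sundays, first, last)
    return {ext: (wed.get(ext, 0.0), wed.get(ext, 0.0) - sun.get(ext, 0.0))
            for ext in set(wed) | set(sun)}


# Made-up ten-entry lists: 8 of 10 are .com on Wednesday, 6 of 10 on Sunday.
wednesday = [f"w{i}.com" for i in range(8)] + ["w8.de", "w9.de"]
sunday = [f"s{i}.com" for i in range(6)] + [f"s{i}.de" for i in range(6, 10)]
cells = weekday_vs_weekend([wednesday], [sunday], 1, 10)
```

Each entry of `cells` corresponds to one heatmap cell: the first value is the share shown, the second the weekday-to-weekend difference that would determine its colour.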

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Rweyemamu, W., Lauinger, T., Wilson, C., Robertson, W., Kirda, E. (2019). Clustering and the Weekend Effect: Recommendations for the Use of Top Domain Lists in Security Research. In: Choffnes, D., Barcellos, M. (eds) Passive and Active Measurement. PAM 2019. Lecture Notes in Computer Science, vol 11419. Springer, Cham. https://doi.org/10.1007/978-3-030-15986-3_11

  • DOI: https://doi.org/10.1007/978-3-030-15986-3_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-15985-6

  • Online ISBN: 978-3-030-15986-3

  • eBook Packages: Computer Science, Computer Science (R0)
