DOI: 10.1145/3485983.3494857
research-article

Discovering obscure looking glass sites on the web to facilitate internet measurement research

Published: 03 December 2021

Abstract

Although researchers have recognized that Looking Glass (LG) vantage points (VPs) are valuable for Internet measurement research, they can only exploit VPs from well-known LG sites published on several LG portal pages. Many LG sites are likely not published on these portal pages; we call these obscure LG sites, and they are hard for researchers to find and exploit. In this paper, we design an efficient focused crawler that discovers as many LG sites as possible while avoiding unnecessary resource consumption on analyzing irrelevant pages. The crawler performs a similarity-guided search that exploits well-developed search engines and comprehensively mines the common features shared by known LG sites to discover more LG pages. Moreover, it uses a two-step positive-unlabeled (PU) learning classifier based on carefully selected LG features to efficiently discard irrelevant URLs, thus avoiding substantial unnecessary resource consumption. To the best of our knowledge, we are the first to develop a method to discover obscure LG sites on the web. Experimental results show the effectiveness of our focused crawler. To facilitate practical applications, we further develop an automation tool, which successfully retrieves 910 obscure automatable LG VPs from relevant pages obtained through our focused crawler. These 910 LG VPs significantly increase the geographic and network coverage of available VPs, and a simple case study shows their potential value in improving the completeness of the AS-level Internet topology. Our method and the final VP list are beneficial to the measurement community.
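
The two-step PU learning idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' classifier: the abstract does not specify the feature set or the learning algorithm, so the keyword list, the `negative_fraction` parameter, and the nearest-centroid rule used in step 2 are all assumptions chosen only to keep the example dependency-free. Step 1 picks "reliable negatives" from the unlabeled pages; step 2 trains a classifier on the known positives versus those reliable negatives.

```python
# Hypothetical keyword features; the paper's actual LG features are not
# given in the abstract. Substring matching is deliberately naive.
KEYWORDS = ("looking glass", "traceroute", "ping", "bgp", "router")


def featurize(text):
    """Binary vector: 1.0 where the keyword occurs in the lowercased text."""
    lower = text.lower()
    return [1.0 if kw in lower else 0.0 for kw in KEYWORDS]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_step_pu(positive_texts, unlabeled_texts, negative_fraction=0.5):
    """Return a function text -> bool built with a two-step PU scheme."""
    pos = [featurize(t) for t in positive_texts]
    unl = [featurize(t) for t in unlabeled_texts]
    pos_c = centroid(pos)

    # Step 1: treat the unlabeled pages farthest from the positive
    # centroid as reliable negatives (assumed heuristic).
    ranked = sorted(unl, key=lambda v: sq_dist(v, pos_c), reverse=True)
    k = max(1, int(len(ranked) * negative_fraction))
    neg_c = centroid(ranked[:k])

    # Step 2: nearest-centroid classifier on positives vs. reliable
    # negatives (a stand-in for whatever classifier the paper uses).
    def is_lg(text):
        v = featurize(text)
        return sq_dist(v, pos_c) < sq_dist(v, neg_c)

    return is_lg
```

Calling `two_step_pu(known_lg_page_texts, crawled_page_texts)` returns a predicate that a crawler could apply to a candidate page before spending further resources on it. In a real crawler the features would more plausibly come from URLs, anchor text, and page structure, and step 2 would use a stronger learner.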

Supplementary Material

MP4 File (3494857-presentation.mp4)
Discovering Obscure Looking Glass Sites on the Web to Facilitate Internet Measurement Research Presentation video

    Published In

    CoNEXT '21: Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies
    December 2021
    507 pages
    ISBN:9781450390989
    DOI:10.1145/3485983
    General Chairs: Georg Carle, Jörg Ott
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. focused crawler
    2. internet measurement
    3. looking glass

    Qualifiers

    • Research-article

    Funding Sources

    • the National Natural Science Foundation of China
    • the National Key Research and Development Program of China

    Conference

    CoNEXT '21
    Acceptance Rates

    Overall Acceptance Rate 198 of 789 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 50
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 28 Feb 2025

    Cited By

    • (2024) Poster: Investigating Traffic Engineering Properties at Internet eXchange Points. Proceedings of the 2024 ACM on Internet Measurement Conference, 779-780. DOI: 10.1145/3646547.3689676. Online publication date: 4-Nov-2024
    • (2024) metAScritic: Reframing AS-Level Topology Discovery as a Recommendation System. Proceedings of the 2024 ACM on Internet Measurement Conference, 337-364. DOI: 10.1145/3646547.3688429. Online publication date: 4-Nov-2024
    • (2024) Collecting Self-reported Semantics of BGP Communities and Investigating Their Consistency with Real-world Usage. Proceedings of the 2024 ACM on Internet Measurement Conference, 314-327. DOI: 10.1145/3646547.3688414. Online publication date: 4-Nov-2024
    • (2024) ProbeGeo: A Comprehensive Landmark Mining Framework Based on Web Content. IEEE/ACM Transactions on Networking 32:5, 4398-4413. DOI: 10.1109/TNET.2024.3422089. Online publication date: Oct-2024
    • (2024) Peaking Beyond the Best Route: An Extensive Dataset for Looking Glasses. NOMS 2024-2024 IEEE Network Operations and Management Symposium, 1-7. DOI: 10.1109/NOMS59830.2024.10575856. Online publication date: 6-May-2024
    • (2023) A cheap and accurate delay-based IP Geolocation method using Machine Learning and Looking Glass. 2023 IFIP Networking Conference (IFIP Networking), 1-9. DOI: 10.23919/IFIPNetworking57963.2023.10186436. Online publication date: 12-Jun-2023
    • (2023) TinyG: Accurate IP Geolocation Using a Tiny Number of Probers. 2023 19th International Conference on Network and Service Management (CNSM), 1-9. DOI: 10.23919/CNSM59352.2023.10327884. Online publication date: 30-Oct-2023
    • (2023) Top AS Router Geolocation in Databases: Performance and Techniques. GLOBECOM 2023 - 2023 IEEE Global Communications Conference, 2117-2122. DOI: 10.1109/GLOBECOM54140.2023.10437266. Online publication date: 4-Dec-2023
    • (2023) A hunger-based scheduling strategy for distributed crawler. Expert Systems with Applications: An International Journal 222:C. DOI: 10.1016/j.eswa.2023.119798. Online publication date: 15-Jul-2023
