ABSTRACT
The back-end databases of web-based applications are a major data security concern for enterprises, and the problem becomes more critical as enterprise-hosted web applications proliferate in the cloud. Prior work has concentrated on malicious attacks that exploit vulnerabilities in web applications to break into the database. Little work has addressed the threat of data harvesting through web form interfaces, in which an attacker iteratively submits legitimate queries and analyzes the returned results to design new queries, and in this way harvests large collections of the underlying data and learns sensitive information. To defend against data harvesting without compromising usability, we take a detection approach. We summarize the characteristics of data harvesting and propose the notions of query correlation and result coverage for detecting it. We design a detection system called HengHa, in which Heng examines the correlation among the queries in a session and Ha evaluates the data coverage of the results of the queries in the same session. Experimental results verify the effectiveness and efficiency of HengHa for data harvesting detection.
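The two notions named in the abstract can be illustrated with a small sketch. The abstract does not define how HengHa computes query correlation or result coverage, so the measures below are illustrative stand-ins and not the authors' method. Correlation is approximated as the mean pairwise Jaccard similarity of the attribute sets used by the queries in a session, and coverage as the fraction of database records returned at least once in that session. The function names, the query and result representations, and both thresholds are hypothetical.

```python
from itertools import combinations


def query_correlation(queries):
    """Mean pairwise Jaccard similarity of the attributes used by each query.

    Assumption: each query is a dict mapping attribute name -> value.
    This is a stand-in measure, not the definition used by Heng.
    """
    if len(queries) < 2:
        return 0.0
    sims = []
    for a, b in combinations(queries, 2):
        attrs_a, attrs_b = set(a), set(b)
        union = attrs_a | attrs_b
        sims.append(len(attrs_a & attrs_b) / len(union) if union else 0.0)
    return sum(sims) / len(sims)


def result_coverage(result_sets, database_size):
    """Fraction of the database returned at least once in the session.

    Assumption: each result is a set of record identifiers.
    This is a stand-in measure, not the definition used by Ha.
    """
    if database_size <= 0:
        raise ValueError("database_size must be positive")
    seen = set().union(*result_sets)
    return len(seen) / database_size


def looks_like_harvesting(queries, result_sets, database_size,
                          corr_threshold=0.8, cov_threshold=0.5):
    """Flag a session only when both signals are high (hypothetical thresholds)."""
    return (query_correlation(queries) >= corr_threshold
            and result_coverage(result_sets, database_size) >= cov_threshold)


# A session that sweeps one attribute while keeping the same query shape.
sweep_queries = [
    {"make": "Honda", "year": 2008},
    {"make": "Toyota", "year": 2008},
    {"make": "Ford", "year": 2008},
]
sweep_results = [{1, 2, 3}, {3, 4, 5}, {6, 7}]
print(looks_like_harvesting(sweep_queries, sweep_results, database_size=10))
```

Requiring both signals reflects the abstract's goal of detecting harvesting without compromising usability: under these assumptions, a session that repeats similar queries but sees little of the data, or one that sees much of the data through unrelated queries, is not flagged.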
Index Terms
- HengHa: data harvesting detection on hidden databases