Assessing relevance and trust of the deep web sources and results based on inter-source agreement

Published: 29 May 2013

Abstract

Deep web search engines face the formidable challenge of retrieving high-quality results from a vast collection of searchable databases. Deep web search is a two-step process: selecting high-quality sources, then ranking the results from the selected sources. Existing methods for both steps assess the relevance of sources and results using query-result similarity. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the results. Second, query-based relevance does not consider the importance of the results and sources. Both considerations are essential for the deep web and for open collections in general. Since a number of deep web sources provide answers to any query, we conjecture that the agreement between these answers is helpful in assessing the importance and trustworthiness of the sources and the results. To assess source quality, we compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between the sources. This adjusted agreement is modeled as a graph with sources as the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. For ranking results, we analyze the second-order agreement between the results. Extending SourceRank to multidomain search, we propose a source ranking sensitive to the query domains: multiple domain-specific rankings of a source are computed, and these ranks are combined for the final ranking. We perform extensive evaluations on online sources and on hundreds of Google Base sources spanning multiple domains. The proposed result and source rankings are implemented in the deep web search engine Factal. We demonstrate that the agreement analysis tracks source corruption. Further, our relevance evaluations show that our methods improve precision significantly over Google Base and the other baseline methods. The result ranking and the domain-specific source ranking are evaluated separately.
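
The core idea of scoring sources by a random walk on an agreement graph can be illustrated with a small sketch. This is not the authors' implementation. It assumes exact-match Jaccard overlap as a stand-in for the agreement measure, omits the collusion adjustment entirely, and uses a uniform teleport step (the `damping` parameter) to keep the walk well defined. The function names `jaccard`, `agreement_graph`, and `source_rank`, and the toy movie data, are all hypothetical.

```python
def jaccard(a, b):
    """Toy agreement measure (an assumption): overlap of two answer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agreement_graph(results):
    """Build a weighted graph: one vertex per source, edge weight = agreement."""
    sources = sorted(results)
    return {
        s: {t: jaccard(results[s], results[t]) for t in sources if t != s}
        for s in sources
    }


def source_rank(graph, damping=0.85, tol=1e-12, max_iter=1000):
    """Approximate the stationary visit probability by power iteration.

    From source s the walker moves to t with probability proportional to
    the agreement weight on edge (s, t). With probability 1 - damping it
    jumps to a uniformly chosen source instead.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {s: 1.0 / n for s in nodes}
    for _ in range(max_iter):
        new = {s: (1.0 - damping) / n for s in nodes}
        for s in nodes:
            total = sum(graph[s].values())
            if total == 0.0:
                # A source that agrees with nobody spreads its mass uniformly.
                for t in nodes:
                    new[t] += damping * rank[s] / n
            else:
                for t, w in graph[s].items():
                    new[t] += damping * rank[s] * w / total
        delta = sum(abs(new[s] - rank[s]) for s in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical answers returned by four sources for the same query.
results = {
    "A": {"godfather 1972", "casablanca 1942", "vertigo 1958"},
    "B": {"godfather 1972", "casablanca 1942", "psycho 1960"},
    "C": {"godfather 1972", "casablanca 1942", "vertigo 1958"},
    "D": {"godfather 1999", "casablanca 2001"},
}
ranks = source_rank(agreement_graph(results))
for source, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{source}: {score:.3f}")
```

In this toy data, source D agrees with no other source and therefore receives the lowest score, while A and C tie at the top. Note that A and C return identical answers, which is exactly the kind of pattern a collusion adjustment would need to examine; the sketch above does not attempt that.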


      Published In

      ACM Transactions on the Web  Volume 7, Issue 2
      May 2013
      244 pages
      ISSN:1559-1131
      EISSN:1559-114X
      DOI:10.1145/2460383

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 May 2013
      Accepted: 01 December 2012
      Revised: 01 September 2012
      Received: 01 September 2011
      Published in TWEB Volume 7, Issue 2

      Author Tags

      1. Deep web search
      2. agreement analysis
      3. database integration
      4. deep web integration
      5. source rank
      6. web database search
      7. web trust

      Qualifiers

      • Research-article
      • Research
      • Refereed
