Formal concept analysis approach for data extraction from a limited deep web database

Journal of Intelligent Information Systems

Abstract

Few studies have addressed the problem of extracting data from a limited deep web database. We apply formal concept analysis to this problem and propose a novel algorithm called EdaliwdbFCA. Before a query Y is sent, the algorithm analyzes the local formal context K_L, which consists of the most recently extracted data, and predicts the size of the query result from the cardinality of the extent X of the formal concept (X,Y) derived from K_L. Whether Y qualifies as a query can therefore be determined in advance. Candidate query concepts are generated dynamically from the lower cover of the current concept (X,Y), so the method avoids building a concrete concept lattice during extraction. In addition, two pruning rules are adopted to reduce redundant queries. Experiments were performed on controlled data sets and on real applications. The results confirm that the theory behind the algorithm is correct and that the algorithm can be applied effectively in real-world settings.
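The formal concept analysis notions used in the abstract can be illustrated with a minimal sketch. This is not the EdaliwdbFCA algorithm itself, whose details are not given here; it only shows, under stated assumptions, how a concept (X,Y) is derived from a local context, how |X| gives a predicted result size, and how the lower cover yields candidate concepts. The context `ctx`, its record identifiers, the attribute strings, and all function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical local context K_L: record id -> set of "attribute=value" items.
Context = Dict[str, Set[str]]
Concept = Tuple[FrozenSet[str], FrozenSet[str]]


def extent(ctx: Context, attrs: Set[str]) -> FrozenSet[str]:
    """All records in the context that carry every attribute in attrs."""
    return frozenset(g for g, items in ctx.items() if attrs <= items)


def intent(ctx: Context, objs: FrozenSet[str]) -> FrozenSet[str]:
    """All attributes shared by every record in objs."""
    shared = set().union(*ctx.values())
    for g in objs:
        shared &= ctx[g]
    return frozenset(shared)


def concept_of(ctx: Context, attrs: Set[str]) -> Concept:
    """Formal concept (X, Y) generated by an attribute set."""
    x = extent(ctx, attrs)
    return x, intent(ctx, x)


def lower_cover(ctx: Context, concept: Concept) -> List[Concept]:
    """Lower neighbours of (X, Y): close Y plus one new attribute,
    then keep only the candidates whose extent is maximal."""
    _, y = concept
    all_attrs = set().union(*ctx.values())
    candidates = {concept_of(ctx, set(y) | {m}) for m in all_attrs - y}
    return [c for c in candidates
            if not any(c[0] < d[0] for d in candidates)]


# Toy context; values are made up for illustration only.
ctx: Context = {
    "r1": {"buying=low", "safety=high"},
    "r2": {"buying=low", "safety=low"},
    "r3": {"buying=high", "safety=high"},
    "r4": {"buying=low", "safety=high", "doors=4"},
}

x, y = concept_of(ctx, {"buying=low"})
predicted_size = len(x)  # size estimate taken from the local context only
children = lower_cover(ctx, (x, y))
```

In this sketch the lower cover is computed on demand from the current concept, so no full lattice is ever materialized. How the predicted size is compared against the database's result limit, and how the two pruning rules work, are not specified in the abstract and are left out.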

Notes

  1. The DBLP Computer Science Bibliography. http://www.informatik.uni-trier.de/~ley/db/index.html, November, 2011.

  2. http://web-harvest.sourceforge.net.

  3. http://archive.ics.uci.edu/ml/datasets/Car+Evaluation.

Acknowledgements

I thank Dr. Lin Gan at the School of Computing in Wuhan University for providing helpful suggestions. I also thank the anonymous referees and editor for their constructive comments on earlier versions of the paper.

Author information

Corresponding author

Correspondence to Zhuo Zhang.

About this article

Cite this article

Zhang, Z., Du, J. & Wang, L. Formal concept analysis approach for data extraction from a limited deep web database. J Intell Inf Syst 41, 211–234 (2013). https://doi.org/10.1007/s10844-013-0242-y

  • DOI: https://doi.org/10.1007/s10844-013-0242-y
