A hybrid quantum approach to leveraging data from HTML tables

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

The Web provides a great deal of data encoded in HTML tables. This encoding makes the data easy to render, but it obscures their structure and makes it difficult for automated business processes to leverage them. This has motivated many authors to work on proposals to extract the data as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach in which a standard computer performs the pre- and post-processing tasks and a quantum computer performs the core task: guessing whether the cells contain labels or values. The problem is addressed using a clustering approach that is known to be NP-hard on standard computers, but our proposal can solve it in polynomial time, which implies a significant performance improvement. It is novel in that it relies on an entropy-preservation metaphor that has proven to work very well on two large collections of real-world tables from Wikipedia and the Dresden Web Table Corpus. Our experiments show that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.
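To make the core task more concrete, the following is a minimal sketch of how splitting cells into two groups can be written as an Ising-style optimisation, which is the kind of objective a quantum annealer accepts. It is not the objective used in the article: the entropy-preservation formulation is not given in this abstract, so the sketch swaps in a generic Gram-matrix objective for binary clustering. The cell features (uses a header tag, is numeric, lies in the first row), the sample cells, and all function names are hypothetical, and an exhaustive search on a standard computer stands in for the quantum step.

```python
from itertools import product
from math import inf

# Hypothetical cells: (text, (uses_header_tag, is_numeric, in_first_row)).
CELLS = [
    ("Country",    (1.0, 0.0, 1.0)),
    ("Population", (1.0, 0.0, 1.0)),
    ("Spain",      (0.0, 0.0, 0.0)),
    ("47400000",   (0.0, 1.0, 0.0)),
    ("France",     (0.0, 0.0, 0.0)),
    ("67800000",   (0.0, 1.0, 0.0)),
]


def center(vectors):
    """Subtract the mean vector so that the features sum to zero."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors]


def gram(vectors):
    """Pairwise dot products; these play the role of Ising couplings."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors]
            for u in vectors]


def ising_energy(spins, couplings):
    """Energy -s^T Q s; lower energy means better separated groups."""
    n = len(spins)
    return -sum(couplings[i][j] * spins[i] * spins[j]
                for i in range(n) for j in range(n))


def binary_cluster(vectors):
    """Exhaustive search over spin assignments (stand-in for annealing).

    The first spin is fixed to +1 because flipping every spin gives the
    same partition. The loop visits 2^(n-1) assignments, which is the
    exponential cost that motivates the quantum step.
    """
    couplings = gram(center(vectors))
    best, best_energy = None, inf
    for rest in product((1, -1), repeat=len(vectors) - 1):
        spins = (1,) + rest
        energy = ising_energy(spins, couplings)
        if energy < best_energy:
            best, best_energy = spins, energy
    return best


if __name__ == "__main__":
    spins = binary_cluster([features for _, features in CELLS])
    for (text, _), spin in zip(CELLS, spins):
        print(f"{text:12s} group {'A' if spin == 1 else 'B'}")
```

On these six sample cells, the search places the two header cells in one group and the four body cells in the other. Deciding which group holds the labels and which holds the values is left to a post-processing step in this sketch.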

Acknowledgements

This work was supported by the Spanish and the Andalusian R&D programmes through Grants TIN2016-75394-R, PID2020-112540RB-C44 (MCIN/AEI/10.13039/501100011033), and P18-RT-1060 (FEDER funds from the EU). The standard computer used to perform the experimentation was partially supported by Dinamic Area, S.L. and the quantum computer was fully supported by D-Wave Systems, Inc.

Author information

Corresponding author

Correspondence to Rafael Corchuelo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiménez, P., Roldán, J.C. & Corchuelo, R. A hybrid quantum approach to leveraging data from HTML tables. Knowl Inf Syst 64, 441–474 (2022). https://doi.org/10.1007/s10115-021-01636-7

  • DOI: https://doi.org/10.1007/s10115-021-01636-7
