Abstract
The Web provides a great deal of data encoded in HTML tables. This encoding makes the data easy to render, but it obscures their structure and makes it difficult for automated business processes to leverage them, which has motivated many authors to work on proposals to extract such data as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach: a standard computer performs the pre- and post-processing tasks, and a quantum computer performs the core task, namely guessing whether each cell holds a label or a value. The core task is addressed with a clustering approach that is known to be NP-hard on standard computers, whereas our proposal can solve it in polynomial time, which implies a significant performance improvement. The proposal is novel in that it relies on an entropy-preservation metaphor, which has proven to work very well on two large collections of real-world tables from Wikipedia and the Dresden Web Table Corpus. Our experiments show that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.
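To make the core task more concrete, the following is a minimal sketch of binary clustering cast as an energy-minimisation problem over spins in {-1, +1}, which is the general form that quantum annealers accept. It is an illustration under stated assumptions, not the formulation used in this article: the entropy-preservation metaphor is not reproduced, the two cell features are hypothetical, and the exhaustive search merely stands in for the quantum step to show why the classical cost grows as 2^n.

```python
from itertools import product


def centre(points):
    """Subtract the mean vector so the Gram matrix reflects deviations from it."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    return [[p[k] - mean[k] for k in range(dim)] for p in points]


def ising_energy(spins, gram):
    """Energy -sum_{i,j} s_i * s_j * gram[i][j]; lower means better separation."""
    n = len(spins)
    return -sum(spins[i] * spins[j] * gram[i][j]
                for i in range(n) for j in range(n))


def binary_cluster(points):
    """Classical stand-in for the quantum step: try all 2^n spin assignments."""
    x = centre(points)
    gram = [[sum(a * b for a, b in zip(u, v)) for v in x] for u in x]
    best = min(product((-1, 1), repeat=len(points)),
               key=lambda s: ising_energy(s, gram))
    return list(best)


# Hypothetical features per cell: (fraction of alphabetic characters, bold flag).
# These features are an assumption made only for this illustration.
cells = [(0.9, 1.0), (1.0, 1.0), (0.1, 0.0), (0.0, 0.0), (0.2, 0.0)]
assignment = binary_cluster(cells)
```

The sign of each spin is arbitrary: the sketch only separates the cells into two groups, and deciding which group holds labels and which holds values would be left to a post-processing step on the standard computer.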
Acknowledgements
This work was supported by the Spanish and the Andalusian R&D programmes through Grants TIN2016-75394-R, PID2020-112540RB-C44 (MCIN/AEI/10.13039/501100011033), and P18-RT-1060 (FEDER funds from the EU). The standard computer used to perform the experimentation was partially supported by Dinamic Area, S.L. and the quantum computer was fully supported by D-Wave Systems, Inc.
Cite this article
Jiménez, P., Roldán, J.C. & Corchuelo, R. A hybrid quantum approach to leveraging data from HTML tables. Knowl Inf Syst 64, 441–474 (2022). https://doi.org/10.1007/s10115-021-01636-7
DOI: https://doi.org/10.1007/s10115-021-01636-7