research-article

Multimodal Web Page Segmentation Using Self-organized Multi-objective Clustering

Published: 07 March 2022

Abstract

Web page segmentation (WPS) aims to break a web page into different segments with coherent intra- and inter-semantics. By exposing the morpho-dispositional semantics of a web page, WPS has traditionally been used to separate informative from non-informative content, and it has also proven key to non-linear access to web information for visually impaired people. For that purpose, many ad hoc solutions have been proposed that rely on visual, logical, and/or text cues. However, such methods depend heavily on manually tuned heuristics and parameters. To overcome these drawbacks, principled frameworks have been proposed that provide the theoretical bases to achieve optimal solutions. However, existing methods combine only a few discriminant features and define no strategy for automatically selecting the optimal number of segments. In this article, we present a multi-objective clustering technique called MCS that relies on \(K\)-means, in which (1) visual, logical, and text cues are all combined in an early fusion manner and (2) an evolutionary process automatically discovers the optimal number of clusters (segments) as well as the correct positioning of seeds. As such, our proposal is parameter-free, combines many different modalities, does not depend on manually tuned heuristics, and can be run on any web page without any constraint. An exhaustive evaluation over two different tasks, where (1) the number of segments must be discovered or (2) the number of clusters is fixed with respect to the task at hand, shows that MCS drastically improves over most competitive and up-to-date algorithms for a wide variety of external and internal validation indices. In particular, the results clearly show the impact of the visual and logical modalities on segmentation performance.
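As a rough illustration of the two ingredients named in the abstract, the sketch below fuses visual, logical, and text features by concatenation (early fusion) and then runs K-means for several values of K, keeping the K with the best silhouette score. It is not the MCS algorithm: the exhaustive scan over K and the farthest-first seeding are simple stand-ins for the evolutionary multi-objective search, and every function name and feature value here is a hypothetical chosen for the example.

```python
import math

def fuse(visual, logical, text):
    """Early fusion: concatenate the per-modality vectors of one page element."""
    return list(visual) + list(logical) + list(text)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def farthest_first(points, k):
    """Deterministic seeding (a stand-in, not the paper's evolved seeds)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=50):
    """Lloyd-style K-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels

def silhouette(points, labels):
    """Mean silhouette width, an internal validation index (higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def segment(points, k_max):
    """Scan K = 2..k_max and keep the labelling with the best silhouette."""
    best = None
    for k in range(2, k_max + 1):
        labels = kmeans(points, k)
        score = silhouette(points, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best

# Toy page: three blocks of four elements each. Per element (all values invented):
# visual = (x, y) position, logical = DOM depth, text = a one-dimensional topic score.
blocks = [(0, 0, 1, 0), (10, 0, 3, 5), (0, 10, 5, 10)]
elements = [fuse([x + 0.1 * j, y + 0.1 * j], [depth], [topic])
            for (x, y, depth, topic) in blocks for j in range(4)]
score, k, labels = segment(elements, k_max=6)
print(k, labels)
```

On this toy page the scan selects three segments, one per block. In practice the modalities would need comparable scales before concatenation, otherwise the feature with the largest numeric range dominates the Euclidean distance.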

Cited By

  • (2023) Web Page Content Block Identification with Extended Block Properties. Applied Sciences 13:9 (5680). DOI: 10.3390/app13095680. Online publication date: 5-May-2023

Published In

ACM Transactions on Information Systems, Volume 40, Issue 3
July 2022
650 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/3498357

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 March 2022
Accepted: 01 August 2021
Revised: 01 July 2021
Received: 01 November 2020
Published in TOIS Volume 40, Issue 3

Author Tags

  1. Web page segmentation
  2. multimodal early fusion
  3. multi-objective optimization
  4. self-organizing maps
  5. evolutionary computation

Qualifiers

  • Research-article
  • Refereed
