course

Dealing with Data from Multiple Web Sources

Authors:

Natércia A. Batista,

Michele A. Brandão,

Michele B. Pinheiro,

Daniel H. Dalip,

Mirella M. Moro

WebMedia '18: Proceedings of the 24th Brazilian Symposium on Multimedia and the Web

Pages 3 - 6

https://doi.org/10.1145/3243082.3264609

Published: 16 October 2018

Abstract

Web data are heterogeneous and unstructured, which poses challenges for data crawling, integration and preprocessing. Many studies are "data-oriented" (i.e., based on the data that happens to be available), so their results are restricted to that specific data. In contrast, many problems are posed before it is known what data is needed to solve them, and they often require multiple data sources. In this context, crawling, integrating and preprocessing data appropriately makes it possible to create datasets for solving such problems. This short course therefore addresses these three activities by discussing their challenges and practical solutions.


Index Terms

1. Information systems
  1. Data management systems
    1. Information integration
      1. Extraction, transformation and loading
      2. Mediators and data integration


Published In


WebMedia '18: Proceedings of the 24th Brazilian Symposium on Multimedia and the Web

October 2018

437 pages

ISBN: 9781450358675

DOI: 10.1145/3243082

General Chairs:
Manoel Carvalho Marques Neto (IFBA)
Renato Lima Novais (IFBA)
Carlos Ferraz (UFPE)
Windson Viana (UFC)

Copyright © 2018 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SBC: Brazilian Computer Society
SIGMM: ACM Special Interest Group on Multimedia
CNPq: Conselho Nacional de Desenvolvimento Cientifico e Tecn
CGIBR: Comite Gestor da Internet no Brazil
CAPES: Brazilian Higher Education Funding Council

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Course
Research
Refereed limited

Conference

WebMedia '18

WebMedia '18: Brazilian Symposium on Multimedia and the Web

October 16 - 19, 2018

Salvador, BA, Brazil

Acceptance Rates

WebMedia '18 Paper Acceptance Rate: 37 of 111 submissions, 33%

Overall Acceptance Rate: 270 of 873 submissions, 31%
