DOI:10.1145/3243082.3264609
course

Dealing with Data from Multiple Web Sources

Published: 16 October 2018

Abstract

Web data are heterogeneous and unstructured, which poses challenges for data crawling, integration and preprocessing. Many studies are "data-oriented" (i.e. driven by whatever data is available), so their results are restricted to that specific data. In contrast, many problems are posed before it is known what data is needed to solve them, and solving them often requires multiple data sources. In this context, crawling, integrating and preprocessing data appropriately makes it possible to build datasets for solving such problems. This short course therefore addresses these three activities by discussing their challenges and practical solutions.
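
As a rough illustration of the integration and preprocessing activities named above, the following minimal sketch merges records from two sources after normalizing the field used to match them. It is not taken from the course: the record layout, the field names (`name`, `repos`, `papers`) and the sample values are hypothetical, and matching on a normalized name is only one simple strategy among many.

```python
import re
import unicodedata


def normalize_name(name):
    """Preprocess a name: strip accents, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).strip().lower()


def integrate(source_a, source_b):
    """Merge records from two sources, keyed on the normalized name."""
    merged = {}
    for record in list(source_a) + list(source_b):
        key = normalize_name(record["name"])
        entry = merged.setdefault(key, {})
        for field, value in record.items():
            # Skip missing values; the first non-missing value seen wins.
            if value is not None:
                entry.setdefault(field, value)
    return merged


# Hypothetical records standing in for the output of two crawlers.
code_hosting = [{"name": "João  Silva", "repos": 12}]
publications = [{"name": "joao silva", "papers": 3, "repos": None}]

dataset = integrate(code_hosting, publications)
```

Here both records collapse into a single entry because their names are identical after normalization. Real sources usually need more robust matching (for example, deduplication over several attributes), which this sketch does not attempt.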

Published In

WebMedia '18: Proceedings of the 24th Brazilian Symposium on Multimedia and the Web
October 2018
437 pages
ISBN:9781450358675
DOI:10.1145/3243082
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Practical perspective
  2. Web data sources

Qualifiers

  • Course
  • Research
  • Refereed limited

Conference

WebMedia '18: Brazilian Symposium on Multimedia and the Web
October 16 - 19, 2018
Salvador, BA, Brazil

Acceptance Rates

WebMedia '18 Paper Acceptance Rate 37 of 111 submissions, 33%;
Overall Acceptance Rate 270 of 873 submissions, 31%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 101
  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 30 Jan 2025
