Abstract
The rapid growth of social networks such as Facebook, Twitter and Instagram has generated, and continues to generate, a large amount of data. This data can be regarded as a gold mine for business analysts and researchers, who need to derive from it insights that are useful and essential for effective decision making. However, multiple problems and challenges affect decision support systems, especially at the level of the Extraction–Transformation–Loading (ETL) processes. These processes are responsible for selecting, filtering and normalizing data from the sources so that relevant decisions can be made. In this paper, we focus on adapting the transformation phase to the MapReduce paradigm in order to process data in a distributed and parallel environment. We then propose a conceptual model of this second phase, composed of several operations that handle NoSQL structures, which are suitable for Big Data storage. Finally, we implement our new components in Talend for Big Data; they help the designer apply selection, projection and join operations to the data extracted from social media.
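To make the idea of expressing selection, projection and join (SPJ) in the MapReduce style more concrete, the following is a minimal in-memory sketch in plain Python. It is not the authors' Talend components or their conceptual model: the record layout (`tweet_id`, `user_id`, `lang`, `text`, `name`), the sample data and all function names are assumptions chosen for illustration, and the shuffle step that a real framework performs across nodes is simulated here with a dictionary.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample records standing in for extracted tweets and user profiles.
TWEETS = [
    {"tweet_id": 1, "user_id": "u1", "lang": "en", "text": "big data"},
    {"tweet_id": 2, "user_id": "u2", "lang": "fr", "text": "bonjour"},
    {"tweet_id": 3, "user_id": "u1", "lang": "en", "text": "etl"},
]
USERS = [
    {"user_id": "u1", "name": "alice"},
    {"user_id": "u2", "name": "bob"},
]


def map_tweet(record):
    """Map step for tweets: selection (lang == 'en'), then projection."""
    if record["lang"] == "en":  # selection
        projected = {"tweet_id": record["tweet_id"], "text": record["text"]}
        yield record["user_id"], ("tweet", projected)  # keyed by the join attribute


def map_user(record):
    """Map step for users: projection only, tagged with its source."""
    yield record["user_id"], ("user", {"name": record["name"]})


def shuffle(pairs):
    """Group mapped values by key (simulates the framework's shuffle phase)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Reduce step: inner join of tweet and user values sharing the same key."""
    tweets = [v for tag, v in values if tag == "tweet"]
    users = [v for tag, v in values if tag == "user"]
    for tweet, user in product(tweets, users):
        yield {"user_id": key, **user, **tweet}


def run_spj(tweets, users):
    """Run the map, shuffle and reduce phases sequentially in one process."""
    pairs = []
    for record in tweets:
        pairs.extend(map_tweet(record))
    for record in users:
        pairs.extend(map_user(record))
    groups = shuffle(pairs)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


if __name__ == "__main__":
    for row in run_spj(TWEETS, USERS):
        print(row)
```

In this sketch, selection and projection are applied in the map step so that less data reaches the shuffle, and the join is done on the reduce side by tagging each value with its source. This is one common way to arrange these operations, not necessarily the arrangement used in the paper.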
Author information
Contributions
H. Mallek and F. Ghozzi wrote the main manuscript text, and F. Gargouri prepared all figures. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mallek, H., Ghozzi, F. & Gargouri, F. Conceptual modeling of big data SPJ operations with Twitter social medium. Soc. Netw. Anal. Min. 13, 105 (2023). https://doi.org/10.1007/s13278-023-01112-w
DOI: https://doi.org/10.1007/s13278-023-01112-w