ABSTRACT
Creating a data analytics pipeline is a tedious task. The algorithm search space for building a suitable solution for a given goal on a given constrained infrastructure is generally very large. The exploratory work needed to choose the best possible solution is effort-, time- and intellect-intensive. Current industry practice relies largely on domain experts for this work. To improve a domain expert’s productivity, we propose a model- and rule-based system that automates the creation of data analytics pipelines. The proposed system provides a mechanism to specify domain knowledge in the form of an object model and a set of rules defined over it. Based on the problem context, it recommends suitable algorithms for carrying out the various data analytics tasks. On successful creation of the pipeline, the system generates the pipeline code. The system also generates trace data to support cognitive knowledge upgrade. We discuss the approach using a case study of a sensor data-based health monitoring system and showcase its efficacy and the lessons learnt.
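The flow described in the abstract (an object model, rules defined over it, algorithm recommendations, generated pipeline code, and a trace) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, fields, rule conditions, and algorithm labels below are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ProblemContext:
    """Hypothetical object model describing the problem context."""
    task: str        # e.g. "anomaly_detection"
    data_kind: str   # e.g. "time_series"
    labelled: bool   # whether labelled data is available
    n_sensors: int   # number of sensor channels


@dataclass(frozen=True)
class Rule:
    """A rule defined over the object model: a condition and a recommendation."""
    name: str
    condition: Callable[[ProblemContext], bool]
    recommendation: str


# Illustrative rules only; real domain knowledge would come from experts.
RULES: List[Rule] = [
    Rule(
        "unlabelled-multisensor-ts",
        lambda c: (c.task == "anomaly_detection"
                   and c.data_kind == "time_series"
                   and not c.labelled
                   and c.n_sensors > 1),
        "lstm_encoder_decoder",
    ),
    Rule(
        "labelled-classification",
        lambda c: c.task == "classification" and c.labelled,
        "decision_tree",
    ),
]


def recommend(context: ProblemContext,
              rules: List[Rule]) -> Tuple[List[str], List[Tuple[str, bool]]]:
    """Evaluate every rule; return recommendations and a trace of outcomes."""
    recommendations: List[str] = []
    trace: List[Tuple[str, bool]] = []
    for rule in rules:
        fired = rule.condition(context)
        trace.append((rule.name, fired))  # record each rule, fired or not
        if fired:
            recommendations.append(rule.recommendation)
    return recommendations, trace


def generate_pipeline_code(steps: List[str]) -> str:
    """Emit source text for a function that applies the chosen steps in order."""
    lines = ["def run_pipeline(data):"]
    for step in steps:
        lines.append(f"    data = {step}(data)")
    lines.append("    return data")
    return "\n".join(lines)


if __name__ == "__main__":
    ctx = ProblemContext("anomaly_detection", "time_series", False, 4)
    algorithms, trace = recommend(ctx, RULES)
    print(algorithms)
    print(trace)
    print(generate_pipeline_code(algorithms))
```

In this sketch the trace records every rule that was evaluated, including those that did not fire, so a reader can see why an algorithm was or was not recommended. A production system would more likely use a dedicated rule engine than plain Python predicates.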
Index Terms
Re-Imagining data analytics software development