Data adaptive functional outlier detection: Analysis of the Paris bike sharing system data

doi:10.1016/j.ins.2022.04.029

Information Sciences

Volume 602, July 2022, Pages 13-42

https://doi.org/10.1016/j.ins.2022.04.029

Abstract

Bike sharing systems (BSSs) have become an increasingly popular means of sustainable transportation and have been implemented in many cities worldwide. Our approach contributes to the identification of abnormal patterns using real-time occupancy data from Paris. In particular, we propose a novel functional outlier detection algorithm based on a two-step approach. In the first stage, a clean dataset is obtained based on the combined effect of two extreme statistics calculated from random sampling; in the second stage, a multiple testing approach based on the clean dataset is proposed, in which the false discovery rate (FDR) control procedure is used to adaptively choose the thresholds for the hypothesis tests. Extensive numerical simulations were conducted to compare the outlier detection performance with that of other state-of-the-art methods. The proposed approach is then applied to the Paris Vélib’ bike sharing system dataset to identify abnormal patterns that are of particular interest to BSS operators for identifying system inefficiencies and updating policies.

Introduction

With the development of positioning devices, Internet-of-Things (IoT) modules, and sensor networks, massive amounts of transportation data that contain considerable undiscovered knowledge have been collected. The exploration of traffic monitoring data can provide deep insight into public transportation facility planning, route optimization, and surveillance. For example, traffic volume data are useful for estimating the design–hour volumes and predicting travel times [18], and traffic planners trace vehicle trajectories to optimize the locations of charging stations. There is a wide variety of research streams on the processing of large datasets in the transportation sector. Zhu et al. [43] predicted the future state of road traffic using a clustering method combined with a recurrent convolutional neural network. Deb and Liew [10] proposed an algorithm for identifying and correcting noisy categorical attribute values in large traffic accident datasets, with a probability measure inferred from statistics within the dataset. Teng et al. [34] developed a multi-step forecasting method for online car-hailing demand.

This research was motivated by the real-world application of discovering patterns in the usage of stations in the Vélib’ bicycle sharing system (BSS) in Paris, France. The Vélib’ BSS has approximately 14,500 bicycles at 1,230 rental stations located within the Paris metropolitan area [15]. Users are allowed to retrieve bikes from and return them to any available station, and staff members redistribute bicycles among rental stations overnight. The data contain hourly occupancy information in terms of the available bicycles and docks at each station. In this context, the task of outlier detection is to discover occupancy profiles that differ substantially from or are inconsistent with the remaining set. For instance, in the case of an unexpected system disruption, operators can identify the potential root causes after identifying the outlying trajectory [30]. Outlier detection can also contribute to the adjustment of pricing policies and improve the efficiency of redistributing bikes across stations.

The bike rental occupancy data are continuously measured for all rental stations in real time, which may result in high-dimensional data given a long observation period. Consequently, functional data analysis (FDA) has become a popular statistical tool that assumes that measurement vectors are realizations of functions defined on some continuous domains. Modeling data into functions is a viable way to recover the true nature of the underlying data generation process, and the smoothness characteristic of the functions is robust to measurement noise. Functional outlier detection is important for several reasons. Outliers pose significant challenges in obtaining coherent statistical analyses, resulting in biased estimations, misleading inferences, and poor predictions [29]. However, they may also carry important information regarding the nature of the underlying data-generation process; hence, the identified outliers require further investigation.

In terms of outlier detection methods for functional data, previous studies have extended outlier detection methods for multivariate data to functional settings; however, they encounter some apparent drawbacks [12]. Given that functional data are intrinsically infinite-dimensional, outlier detection methods for multivariate data are affected by the curse of dimensionality, and plots of such data are difficult to assess visually. Only a limited number of methods explicitly address the detection of functional outliers. One of the most popular families is based on graphical approaches utilizing the concept of functional depth, as in [12], [32]. Functional depth provides a center-outward ordering score of the set of curves, where curves with a higher depth are close to the center of the functional distribution, whereas functional outliers are far from the center of the data and have a significantly lower depth. Another type of approach is based on statistical measures, which include the robust principal component analysis method developed by Hyndman and Ullah [22], the high-breakdown mean function estimator created by Ren et al. [29], and the successive likelihood ratio test and smoothed bootstrapping developed by Febrero et al. [11].
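To make the center-outward ordering concrete, the following is a minimal sketch of one common depth notion, the modified band depth with bands formed by pairs of curves. The choice of this particular depth, the function name `modified_band_depth`, and the synthetic curves are assumptions made for illustration; the text does not state which depth the cited methods use.

```python
import numpy as np

def modified_band_depth(curves):
    """Modified band depth (bands from pairs of curves) on a common grid.

    curves: array of shape (n_curves, n_points).
    Returns one depth value per curve; lower means more outlying.
    """
    n, _ = curves.shape
    # Rank of each curve at each grid point (1 = lowest value).
    ranks = curves.argsort(axis=0).argsort(axis=0) + 1
    below = ranks - 1   # curves strictly below at that point
    above = n - ranks   # curves strictly above at that point
    # Pairs whose band contains the curve at a point: one curve below and
    # one above, plus every pair that includes the curve itself.
    containing = below * above + (n - 1)
    n_pairs = n * (n - 1) / 2
    return containing.mean(axis=1) / n_pairs

# Illustrative data: 19 noisy sine curves and one vertically shifted curve.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
curves = np.sin(2 * np.pi * grid) + 0.1 * rng.standard_normal((20, 50))
curves[-1] += 5.0  # magnitude outlier (hypothetical)
depths = modified_band_depth(curves)
print("least deep curve:", int(np.argmin(depths)))
```

A curve that lies above all others at every grid point is contained only in bands that it delimits itself, so it receives the smallest possible depth, 2/n.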

In this study, we propose a novel functional outlier detection method for identifying outlying usage patterns in BSS data in Paris, France. Given the infinite-dimensional nature of functional data, this method is based on a functional principal component analysis (FPCA) to provide a low-dimensional representation of the original data. The outlier detection process is formulated as a statistical hypothesis testing procedure that has competitive power for detecting outliers with controlled sizes. Because the method used to select a clean set is based on the joint effect of the maximum and minimum statistics, the proposed algorithm is called the max–min functional outlier detection algorithm (MM-FOD). The code used to reproduce the results of this study is available at https://github.com/LC8736/MM-FOD.git.
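The low-dimensional representation mentioned above can be sketched as follows. This is a simplified illustration that assumes curves observed on a common, equally spaced grid and computes the FPCA through a singular value decomposition of the centered data matrix; it is not taken from the MM-FOD repository, and the function name `fpca` and the synthetic data are hypothetical.

```python
import numpy as np

def fpca(curves, n_components=2):
    """Discretized FPCA via SVD (assumes a common, equally spaced grid).

    curves: array of shape (n_curves, n_points).
    Returns the mean curve, the leading discretized eigenfunctions,
    the component scores, and the fraction of variance explained.
    """
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    # Rows of vt are the discretized eigenfunctions of the sample covariance.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return mean_curve, components, scores, explained

# Illustrative data: 30 curves on a 24-point grid built from two basis shapes.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 24)
coef = rng.standard_normal((30, 2))
curves = (np.outer(coef[:, 0], np.sin(2 * np.pi * grid))
          + np.outer(coef[:, 1], np.cos(2 * np.pi * grid)))
mean_curve, components, scores, explained = fpca(curves, n_components=2)
print("variance explained by two components:", explained.sum())
```

Each curve is then summarized by its row in `scores`, and a test statistic can be computed on these few coordinates rather than on the full discretized curve.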

To summarize, the main contributions of this study are as follows:

• We propose a novel two-stage functional outlier detection approach, which can avoid the masking and swamping effects in outlier detection. This algorithm is summarized in Algorithm 1.
• A novel clean set construction method is proposed based on the joint effect of the maximum and minimum statistics, which is easy to implement and achieves competitive accuracy in comparison with the existing benchmark methods.
• The FDR control procedure is used to adaptively determine the threshold of the multiple hypothesis testing procedure in the second stage, which controls the false discovery rate at the specified significance level.
• The proposed outlier detection method is applied to the Vélib’ BSS data in Paris, and the identified outliers provide system operators with additional insight to improve the efficiency of the system.
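As an illustration of how an FDR control procedure chooses its rejection threshold adaptively, the sketch below implements the Benjamini-Hochberg step-up rule. The text does not name the specific procedure used in MM-FOD, so Benjamini-Hochberg is an assumption here, and the p-values are made up for the example.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule.

    Returns a boolean array marking the hypotheses that are rejected
    (here: the curves that would be flagged as outliers).
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest index meeting its threshold
        reject[order[:k + 1]] = True
    return reject

# Hypothetical p-values, one per tested curve.
p_values = [0.20, 0.001, 0.04, 0.008, 0.6]
reject = benjamini_hochberg(p_values, alpha=0.05)
print(reject)
```

The cutoff depends on the observed p-values themselves, which is what makes the threshold data adaptive rather than fixed in advance.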

The remainder of this paper is organized as follows. In Section 2, we briefly review the preliminaries of functional outlier detection. The proposed MM-FOD algorithm is presented in detail in Section 3. The simulation study and real-world case study are provided in Sections 4 and 5, respectively, and Section 6 provides some concluding remarks regarding this research.

Section snippets

Background

In this section, we review related studies on classical outlier detection methods and state-of-the-art functional outlier detection methods. Owing to the curse of dimensionality and highly correlated characteristics, outlier detection methods for multivariate data cannot be directly applied to functional data. In this study, we consider a functional outlier detection approach to deal with BSS data in Paris.

Methodology

In this section, the proposed MM-FOD algorithm is presented. Specifically, in Section 3.1, the functional outlier detection problem is formulated, and some key issues are identified. In Section 3.2, based on the novel outlying statistic, a two-step procedure is developed in which a clean dataset is first constructed; then, the FDR control procedure is employed by treating the outlier detection problem as a multiple statistical hypothesis testing problem. In Section 3.3, a toy example is

Simulation study

In this section, we conduct numerical simulation experiments to evaluate the performance of the proposed functional outlier detection approach against several recent benchmark algorithms for functional outlier detection. In Section 4.1, the data generation schemes are presented, showing the different types of outlyingness considered in this study. The performance of our proposed MM-FOD procedure is discussed in Section 4.2, where the proposed MM-FOD method is compared with six state-of-the-art

Analysis of bike-sharing systems

Since 2001, the city of Paris has introduced policies to promote green modes of transportation, such as bicycles and walking. In this context, the Vélib’ bike sharing system was launched in July 2007. Bikes were available at Vélib’ stations docked electronically to docking points; based on the company’s official website, as of the end of 2020, the entire network included 1,400 docking points in the greater Paris area and 20,000 bicycles in the fleet. After users registered, bicycles could

Conclusion

This study was motivated by the interest in analyzing BSS data to identify abnormal usage patterns and help system operators determine practical solutions to system inefficiencies. To this end, we propose a two-step outlier detection method for functional BSS data generated by the system. The first step is to obtain a clean subset of data based on the intersection of the maximum and minimum statistics. In the second step, a multiple testing procedure is proposed. Based on the clean dataset

CRediT authorship contribution statement

Chao Liu: Conceptualization, Methodology, Software, Formal analysis, Funding acquisition. Xiao Gao: Writing – review & editing. Xiaokang Wang: Supervision, Project administration, Resources, Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was financially supported by the China Postdoctoral Science Foundation, No. 2021M691443, No. 2021TQ0141, and SUSTC Presidential Postdoctoral Fellowship.

References (43)

Aurélie Fischer, On the number of groups in clustering, Stat. Prob. Lett. (2011)
Yong He et al., High dimensional Gaussian copula graphical model with FDR control, Comput. Stat. Data Anal. (2017)
Zengyou He et al., Discovering cluster-based local outliers, Pattern Recogn. Lett. (2003)
Rob J. Hyndman et al., Robust forecasting of mortality and fertility rates: a functional data approach, Comput. Stat. Data Anal. (2007)
Mennatallah Amer, Markus Goldstein, Nearest-neighbor and clustering based anomaly detection algorithms for rapidminer,...
Fabrizio Angiulli et al., Fast outlier detection in high dimensional spaces
Ana Arribas-Gil et al., Shape outlier detection and visualization for functional data: the outliergram, Biostatistics (2014)
Anthony Bagnall, Jason Lines, William Vickers, Eamonn Keogh, The UEA & UCR time series classification repository....
Vic Barnett et al., Outliers in statistical data (1984)
Charles Bouveyron et al., The discriminative functional mixture model for a comparative analysis of bike sharing systems, Ann. Appl. Stat. (2015)
Markus M. Breunig et al., LOF: identifying density-based local outliers
Andrea Cerioli, Multivariate outlier detection with high-breakdown estimators, J. Am. Stat. Assoc. (2010)
Wenlin Dai, Marc G. Genton, Directional outlyingness for multivariate functional data, Comput. Stat. Data Anal. 131...
Rupam Deb, Alan Wee-Chung Liew, Noisy values detection and correction of traffic accident data, Inf. Sci. 476 (2019)...
Manuel Febrero et al., A functional analysis of NOx levels: location and scale estimation and outlier detection, Comput. Stat. (2007)
Manuel Febrero et al., Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels, Environmetrics (2008)
T.L. Fei et al., Isolation forest
Antoine Godichon-Baggioni et al., Clustering transformed compositional data using k-means, with applications in gene expression and bicycle sharing system data, J. Appl. Stat. (2019)
Frank E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics (1969)
J. Trevor Harris et al., Elastic depths for detecting shape anomalies in functional data, Technometrics (2020)
Peilan He, Guiyuan Jiang, Siew-Kei Lam, Yidan Sun, Learning heterogeneous traffic patterns for travel time prediction...
