Elsevier

Information Sciences

Volume 602, July 2022, Pages 13-42

Data adaptive functional outlier detection: Analysis of the Paris bike sharing system data

https://doi.org/10.1016/j.ins.2022.04.029

Abstract

Bike sharing systems (BSSs) have become an increasingly popular means of sustainable transportation and have been implemented in many cities worldwide. Our approach contributes to the identification of abnormal usage patterns by analyzing real-time occupancy data from Paris. In particular, we propose a novel functional outlier detection algorithm based on a two-stage approach. In the first stage, a clean dataset is obtained based on the combined effect of two extreme statistics calculated from random sampling. In the second stage, a multiple testing approach based on the clean dataset is proposed, in which the false discovery rate (FDR) control procedure is used to adaptively choose the thresholds for the hypothesis tests. Extensive numerical simulations were conducted to compare the outlier detection performance with that of other state-of-the-art methods. The proposed approach is then applied to the Paris Vélib’ bike sharing system dataset to identify abnormal patterns that are of particular interest to BSS operators for identifying system inefficiencies and updating policies.

Introduction

With the development of positioning devices, Internet-of-Things (IoT) modules, and sensor networks, massive volumes of transportation data that contain considerable undiscovered knowledge have been collected. The exploration of traffic monitoring data can provide deep insight into public transportation facility planning, route optimization, and surveillance. For example, traffic volume data are useful for estimating design-hour volumes and predicting travel times [18], and traffic planners trace vehicle trajectories to optimize the locations of charging stations. There is a wide variety of research streams on the processing of large datasets in the transportation sector. Zhu et al. [43] predicted the future state of road traffic using a clustering method combined with a recurrent convolutional neural network; Deb and Liew [10] proposed an algorithm for identifying and correcting noisy categorical attribute values in large traffic accident datasets, with a probability measure inferred from statistics within the dataset; and Teng et al. [34] developed a multi-step forecasting method for online car-hailing demand.

This research was motivated by the real-world application of discovering patterns in the usage of stations in the Vélib’ bicycle sharing system (BSS) in Paris, France. The Vélib’ BSS has approximately 14,500 bicycles at 1,230 rental stations located within the Paris metropolitan area [15]. Users are allowed to retrieve bikes from, and return them to, any available station, and staff members redistribute bicycles among rental stations overnight. The data contain hourly occupancy information in terms of the available bicycles and docks at each station. In this context, the task of outlier detection is to discover occupancy profiles that differ substantially from, or are inconsistent with, the remaining set. For instance, in the case of an unexpected system disruption, operators can investigate the potential root causes after identifying the outlying trajectory [30]. Outlier detection can also contribute to the adjustment of pricing policies and improve the efficiency of redistributing bikes across stations.

The bike rental occupancy data are continuously measured for all rental stations in real time, which may result in high-dimensional data given a long observation period. Consequently, functional data analysis (FDA), which treats measurement vectors as realizations of functions defined on a continuous domain, has become a popular statistical tool. Modeling data as functions is a viable way to recover the true nature of the underlying data generation process, and the smoothness of the functions makes the representation robust to measurement noise. Functional outlier detection is important for several reasons. Outliers pose significant challenges in obtaining coherent statistical analyses, resulting in biased estimations, misleading inferences, and poor predictions [29]. However, they may also carry important information regarding the nature of the underlying data-generation process; hence, the identified outliers require further investigation.

In terms of outlier detection methods for functional data, previous studies have extended outlier detection methods for multivariate data to functional settings; however, these extensions have some apparent drawbacks [12]. Because functional data are intrinsically infinite-dimensional, outlier detection methods for multivariate data are affected by the curse of dimensionality, and plot-based methods become difficult to assess visually. Only a limited number of methods explicitly address the detection of functional outliers. One of the most popular families consists of graphical approaches that use the concept of functional depth, as in [12], [32]. Functional depth provides a center-outward ordering of the set of curves: curves with a higher depth are close to the center of the functional distribution, whereas functional outliers lie far from the center of the data and have a significantly lower depth. Another type of approach is based on statistical measures, which include the robust principal component analysis method developed by Hyndman and Ullah [22], the high-breakdown mean function estimator created by Ren et al. [29], and the successive likelihood ratio test and smoothed bootstrapping developed by Febrero et al. [11].
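
To make the idea of a center-outward ordering concrete, the following is a minimal sketch of one depth notion, a Fraiman-Muniz-style integrated depth, computed for curves sampled on a shared grid. It is an illustration only and is not claimed to be the depth used in [12], [32] or in MM-FOD; the function name `integrated_depth` and the synthetic sine curves are assumptions introduced here.

```python
import numpy as np

def integrated_depth(curves):
    """Fraiman-Muniz-style integrated depth for curves on a shared grid.

    curves: array of shape (n_curves, n_points).
    Returns one depth value per curve; larger means more central.
    """
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    # counts[i, t] = number of curves whose value at grid point t is <= curve i's value.
    counts = (curves[:, None, :] >= curves[None, :, :]).sum(axis=1)
    ecdf = counts / n
    # Pointwise depth 1 - |1/2 - F_t(x(t))|, averaged over the grid.
    return (1.0 - np.abs(0.5 - ecdf)).mean(axis=1)

# Synthetic illustration (made-up data): 20 noisy sine curves plus one shifted curve.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
regular = np.sin(2 * np.pi * grid) + 0.1 * rng.standard_normal((20, grid.size))
shifted = np.sin(2 * np.pi * grid) + 3.0
depths = integrated_depth(np.vstack([regular, shifted]))
print(int(np.argmin(depths)))  # index of the least central curve
```

In this toy setting the shifted curve lies above all others at every grid point, so it receives the lowest depth. Shape outliers that stay inside the bulk of the data are harder for a pointwise depth of this kind to separate.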

In this study, we propose a novel functional outlier detection method for identifying outlying usage patterns in BSS data in Paris, France. Given the infinite-dimensional nature of functional data, this method is based on a functional principal component analysis (FPCA) to provide a low-dimensional representation of the original data. The outlier detection process is formulated as a statistical hypothesis testing procedure that has competitive power for detecting outliers at a controlled size. Because the method used to select a clean set is based on the joint effect of the maximum and minimum statistics, the proposed algorithm is called the max–min functional outlier detection algorithm (MM-FOD). The code used to reproduce the results of this study is available at https://github.com/LC8736/MM-FOD.git.
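
As a rough illustration of how FPCA yields a low-dimensional representation, the sketch below applies a singular value decomposition to centered curves sampled on a common, equally spaced grid. This is a simplified discretized version under stated assumptions (no smoothing, no quadrature weights) and is not taken from the MM-FOD repository; the name `fpca_scores` and the synthetic curves are hypothetical.

```python
import numpy as np

def fpca_scores(curves, n_components=2):
    """Discretized FPCA via SVD for curves on a common, equally spaced grid.

    curves: array of shape (n_curves, n_points).
    Returns (scores, components, explained_ratio).
    """
    curves = np.asarray(curves, dtype=float)
    centered = curves - curves.mean(axis=0)
    # Rows of vt approximate the eigenfunctions evaluated on the grid.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Each curve is summarized by its projections onto the leading components.
    scores = centered @ components.T
    explained_ratio = (s ** 2)[:n_components] / (s ** 2).sum()
    return scores, components, explained_ratio

# Synthetic illustration (made-up data): curves spanned by two basis functions.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 100)
a = rng.normal(0.0, 2.0, size=(30, 1))
b = rng.normal(0.0, 0.5, size=(30, 1))
curves = a * np.sin(2 * np.pi * grid) + b * np.cos(2 * np.pi * grid)
scores, components, explained_ratio = fpca_scores(curves, n_components=2)
print(scores.shape, explained_ratio.sum())
```

Each 100-point curve is reduced to two scores here. Any test statistic built on such scores works in a space of dimension `n_components` rather than on the full grid.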

To summarize, the main contributions of this study are as follows:

  • We propose a novel two-stage functional outlier detection approach, which can avoid the masking and swamping effects in outlier detection. This algorithm is summarized in Algorithm 1.

  • A novel clean set construction method is proposed based on the joint effect of the maximum and minimum statistics; it is easy to implement and achieves competitive accuracy in comparison with existing benchmark methods.

  • The FDR control procedure is used to adaptively determine the threshold of the multiple hypothesis testing procedure in the second stage, which keeps the false discovery rate within the specified significance level.

  • The proposed outlier detection method is applied to the Vélib’ BSS data in Paris, and the identified outliers provide system operators with additional insight to improve the efficiency of the system.
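
The adaptive thresholding mentioned in the contributions above can be illustrated with a standard FDR procedure. The text here does not say which FDR procedure MM-FOD uses, so the sketch below assumes the Benjamini-Hochberg step-up rule; the function name `benjamini_hochberg` and the p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule (assumed here, not confirmed by the text).

    Returns a boolean array in the input order: True marks a rejected
    null hypothesis, i.e. a curve flagged as an outlier.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value is compared with alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank that meets its threshold
        reject[order[: k + 1]] = True   # reject everything up to that rank
    return reject

# Hypothetical p-values, one per curve being tested against the clean set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p_values, alpha=0.05)
print(flags)
```

The cutoff is data adaptive in the sense that it depends on the whole sorted list of p-values, not on a single fixed critical value chosen in advance.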

The remainder of this paper is organized as follows. In Section 2, we briefly review the preliminaries of functional outlier detection. The proposed MM-FOD algorithm is presented in detail in Section 3. The simulation study and real-world case study are provided in Sections 4 and 5, respectively, and Section 6 provides some concluding remarks regarding this research.

Section snippets

Background

In this section, we review related studies on classical outlier detection methods and state-of-the-art functional outlier detection methods. Owing to the curse of dimensionality and the highly correlated nature of the data, outlier detection methods for multivariate data cannot be directly applied to functional data. In this study, we consider a functional outlier detection approach to deal with BSS data in Paris.

Methodology

In this section, the proposed MM-FOD algorithm is presented. Specifically, in Section 3.1, the functional outlier detection problem is formulated, and some key issues are identified. In Section 3.2, based on the novel outlying statistic, a two-step procedure is developed in which a clean dataset is first constructed; then, the FDR control procedure is employed by treating the outlier detection problem as a multiple statistical hypothesis testing problem. In Section 3.3, a toy example is

Simulation study

In this section, we conduct numerical simulation experiments to evaluate the performance of the proposed functional outlier detection approach against several recent benchmark algorithms for functional outlier detection. In Section 4.1, the data generation schemes are presented, showing the different types of outlyingness considered in this study. The performance of the proposed MM-FOD procedure is discussed in Section 4.2, where the proposed MM-FOD method is compared with six state-of-the-art

Analysis of bike-sharing systems

Since 2001, the city of Paris has introduced policies to promote green modes of transportation, such as cycling and walking. In this context, the Vélib’ bike sharing system was launched in July 2007. Bikes were available at Vélib’ stations, docked electronically to docking points; according to the company’s official website, as of the end of 2020, the entire network included 1,400 docking points in the greater Paris area and 20,000 bicycles in the fleet. After users registered, bicycles could

Conclusion

This study was motivated by the interest in analyzing BSS data to identify abnormal usage patterns and help system operators determine practical solutions to system inefficiencies. To this end, we propose a two-step outlier detection method for functional BSS data generated by the system. The first step is to obtain a clean subset of data based on the intersection of the maximum and minimum statistics. In the second step, a multiple testing procedure is proposed. Based on the clean dataset

CRediT authorship contribution statement

Chao Liu: Conceptualization, Methodology, Software, Formal analysis, Funding acquisition. Xiao Gao: Writing – review & editing. Xiaokang Wang: Supervision, Project administration, Resources, Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was financially supported by the China Postdoctoral Science Foundation, No. 2021M691443, No. 2021TQ0141, and SUSTC Presidential Postdoctoral Fellowship.

References (43)

  • Markus M. Breunig et al., LOF: identifying density-based local outliers
  • Andrea Cerioli, Multivariate outlier detection with high-breakdown estimators, J. Am. Stat. Assoc. (2010)
  • Wenlin Dai, Marc G. Genton, Directional outlyingness for multivariate functional data, Comput. Stat. Data Anal. 131...
  • Rupam Deb, Alan Wee-Chung Liew, Noisy values detection and correction of traffic accident data, Inf. Sci. 476 (2019)...
  • Manuel Febrero et al., A functional analysis of NOx levels: location and scale estimation and outlier detection, Comput. Stat. (2007)
  • Manuel Febrero et al., Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels, Environmetrics (2008)
  • T.L. Fei et al., Isolation forest
  • Antoine Godichon-Baggioni et al., Clustering transformed compositional data using k-means, with applications in gene expression and bicycle sharing system data, J. Appl. Stat. (2019)
  • Frank E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics (1969)
  • J. Trevor Harris et al., Elastic depths for detecting shape anomalies in functional data, Technometrics (2020)
  • Peilan He, Guiyuan Jiang, Siew-Kei Lam, Yidan Sun, Learning heterogeneous traffic patterns for travel time prediction...