ABSTRACT
Data from Global Positioning Systems (GPS) and fare-meters in for-hire vehicles (FHVs) have been used for various applications, both in research and in organizational decision-making. The utility of such exercises depends largely on the accuracy of the data. This study looks at an environment where the data is partially mislabeled. Specifically, we take a common real-world setting where vehicle operators choose to render transportation services to customers without using the fare-meter, often by negotiating a fixed rate with the customer. This practice, which has been observed and documented to varying degrees in urban areas around the world, leads to various undesirable effects. In this study, we seek to identify cases of such behavior in the dataset. Typically, a supervised learning classifier could be built to predict the occupancy status from GPS traces, which can then be used to look for anomalies between the predicted and stated behaviors. However, in our case the training dataset also contains instances of incorrect tagging. We address this problem by casting it as one of learning from Positive and Unlabeled instances (PU Learning). This is because we observe one-sided label noise: trips tagged ‘vacant’ by the taximeter could be truly vacant or occupied, whereas trips tagged ‘occupied’ are expected to be occupied in reality as well. To support this novel formulation, we apply three state-of-the-art PU Learning algorithms to a real-world trajectory data set from an organization plying 170 active vehicles over a period of two months. We compare these to the baselines of standard supervised learning. Validation is carried out by the organization through alternate channels of investigation that are not recorded in the data set. The results show that the PU Learners provide a significant improvement in classification across a range of metrics when compared to the baseline approaches. This translates to a significant increase in the number of mislabeled rides that are identified or reclassified.
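One way to picture the PU formulation is the following minimal sketch of PU bagging, a generic technique in which random subsets of the unlabeled (‘vacant’-tagged) trips are repeatedly treated as provisional negatives. It is an illustration only and is not necessarily one of the three algorithms evaluated in the study; the two-dimensional synthetic features, the nearest-centroid base classifier, and all names and parameter values are assumptions made for the example.

```python
import random
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=50, sample_size=30, seed=0):
    """Score each unlabeled point by how often it looks positive.

    Each round draws a bootstrap sample of unlabeled points, treats it as a
    provisional negative class, and votes on the out-of-bag unlabeled points
    with a nearest-centroid rule (an assumed, deliberately simple base model).
    """
    rng = random.Random(seed)
    pos_c = centroid(positives)
    votes = [0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        bag = [rng.randrange(len(unlabeled)) for _ in range(sample_size)]
        in_bag = set(bag)
        neg_c = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue  # only score points the round did not train on
            counts[i] += 1
            if sq_dist(x, pos_c) < sq_dist(x, neg_c):
                votes[i] += 1
    # Fraction of out-of-bag rounds in which the point was closer to the positives.
    return [v / c if c else 0.0 for v, c in zip(votes, counts)]


def make_trips(rng, center, n):
    """Hypothetical 2-D trip features drawn around a class center."""
    return [(rng.gauss(center[0], 0.5), rng.gauss(center[1], 0.5)) for _ in range(n)]


# Synthetic stand-in for the one-sided label noise described in the abstract.
data_rng = random.Random(42)
occupied_tagged = make_trips(data_rng, (2.0, 2.0), 60)    # positives: metered trips
truly_vacant = make_trips(data_rng, (-2.0, -2.0), 80)     # tagged vacant, really vacant
hidden_occupied = make_trips(data_rng, (2.0, 2.0), 20)    # tagged vacant, really occupied
vacant_tagged = truly_vacant + hidden_occupied            # the unlabeled set

scores = pu_bagging_scores(occupied_tagged, vacant_tagged)
vacant_mean = mean(scores[:80])
hidden_mean = mean(scores[80:])
print("mean score, truly vacant:", round(vacant_mean, 3))
print("mean score, hidden occupied:", round(hidden_mean, 3))
```

Trips tagged ‘vacant’ that receive a high averaged score resemble metered, occupied trips and would be the candidates to pass on for further investigation.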