Inferring customer occupancy status in for-hire vehicles using PU Learning

Authors:
Vaishnavi Muralidharan

Ford Motor Private Limited Chennai, India

Nandan Sudarsanam

Department of Management Studies Robert Bosch Center for Data Science and AI (RBCDSAI) IIT, India

Balaraman Ravindran

Department of Computer Science and Engineering Robert Bosch Center for Data Science and AI (RBCDSAI) IIT, India


CODS-COMAD '21: Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD), January 2021, Pages 290–298. https://doi.org/10.1145/3430984.3430996

Published: 02 January 2021

ABSTRACT

Data from Global Positioning Systems (GPS) and fare-meters in for-hire vehicles (FHVs) have been used for various applications, both in research and in organizational decision-making. The utility of such exercises depends largely on the accuracy of the data. This study looks at an environment where the data is partially mislabeled. Specifically, we take a common real-world setting where vehicle operators choose to render transportation services to customers without using the fare-meter, often by negotiating a fixed rate with the customer. This practice, which has been observed and documented to varying degrees in urban areas across the world, leads to various undesirable effects. In this study, we seek to identify cases of such behavior in the dataset. Typically, a supervised learning classifier could be built to predict the occupancy status from GPS traces, which can then be used to look for anomalies between the predicted and stated behaviors. However, in our case the training dataset also contains instances of incorrect tagging. We address this problem by casting it as one of learning from Positive and Unlabeled instances (PU Learning). This is because we observe one-sided label noise: trips tagged ‘vacant’ by the taximeter could be truly vacant or occupied, whereas trips tagged ‘occupied’ are expected to be occupied in reality as well. To support this novel formulation, we apply three state-of-the-art PU Learning algorithms to a real-world trajectory dataset from an organization plying 170 active vehicles over a period of two months. We compare these to baselines of standard supervised learning. Validation is carried out by the organization through alternate channels of investigation that are not indicated in the dataset. The results show that the PU learners provide a significant improvement in classification across a range of metrics when compared to the baseline approaches. This translates to a significant increase in identifying or reclassifying the mislabeled rides.
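
The one-sided label noise described in the abstract can be illustrated with a small sketch. The abstract does not name the three PU Learning algorithms, so the code below is not the authors' method: it is a minimal bagging-style PU learner with a nearest-centroid base classifier, run on synthetic data. The single feature, the 30% hidden-occupancy rate, and all function names (make_trips, pu_bagging_scores, flag_hidden_occupied, precision_recall) are assumptions made for illustration only.

```python
import random


def make_trips(n=2000, hidden_rate=0.3, seed=0):
    """Generate synthetic trips as (feature, truly_occupied, meter_on) tuples.

    The single feature is a hypothetical trajectory score (higher values are
    more typical of occupied driving); it is not a feature from the paper.
    """
    rng = random.Random(seed)
    trips = []
    for _ in range(n):
        occupied = rng.random() < 0.5
        feature = rng.gauss(2.0 if occupied else 0.0, 0.7)
        # One-sided noise: an occupied trip may run with the meter off,
        # but a vacant trip is never tagged 'occupied'.
        meter_on = occupied and rng.random() >= hidden_rate
        trips.append((feature, occupied, meter_on))
    return trips


def pu_bagging_scores(trips, rounds=50, seed=1):
    """Return {trip index: fraction of out-of-bag votes for 'occupied'}."""
    rng = random.Random(seed)
    positives = [f for f, _, meter_on in trips if meter_on]
    unlabeled = [i for i, (_, _, meter_on) in enumerate(trips) if not meter_on]
    pos_centroid = sum(positives) / len(positives)
    bag_size = min(len(positives), len(unlabeled) // 2)
    votes = {i: [0, 0] for i in unlabeled}  # [occupied votes, total votes]
    for _ in range(rounds):
        # Treat a random bag of unlabeled trips as provisional negatives.
        bag = set(rng.sample(unlabeled, bag_size))
        neg_centroid = sum(trips[i][0] for i in bag) / bag_size
        # 1-D nearest-centroid rule; assumes positives score higher on average.
        threshold = (pos_centroid + neg_centroid) / 2
        # Score only the unlabeled trips left out of this bag.
        for i in unlabeled:
            if i not in bag:
                votes[i][0] += trips[i][0] > threshold
                votes[i][1] += 1
    return {i: yes / total for i, (yes, total) in votes.items() if total}


def flag_hidden_occupied(trips, cutoff=0.5):
    """Indices of meter-off trips that the ensemble scores as occupied."""
    return {i for i, s in pu_bagging_scores(trips).items() if s > cutoff}


def precision_recall(trips, flagged):
    """Precision and recall of the flagged set against the synthetic truth."""
    hidden = {i for i, (_, occ, on) in enumerate(trips) if occ and not on}
    hits = len(flagged & hidden)
    return hits / len(flagged), hits / len(hidden)


if __name__ == "__main__":
    trips = make_trips()
    flagged = flag_hidden_occupied(trips)
    print(precision_recall(trips, flagged))
```

Each round treats a different random subset of the meter-off trips as provisional negatives and scores only the trips left out of that subset, so no single mislabeled trip is always used as a negative example. A real trajectory dataset would need richer features and a stronger base classifier than this sketch uses.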

Published in
CODS-COMAD '21: Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD)
January 2021
453 pages
ISBN: 9781450388177
DOI: 10.1145/3430984
Editors: Jayant Haritsa, Shourya Roy, Manish Gupta, Sharad Mehrotra, Balaji Vasan Srinivasan, Yogesh Simmhan
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Driving frauds
For-hire vehicles
Mislabeled data
PU Learning
Positive and Unlabeled learning
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 197 of 680 submissions, 29%