A comparison of learning methods over raw data: forecasting cab services market share in New York City

Multimedia Tools and Applications

Abstract

Cab services, present in most cities, are among the most widely used options for passenger transportation. Their business model is now threatened by the entry of emerging third parties powered by modern technologies. Using New York City cab data, we compare several machine learning techniques (linear regression, support vector machines and random forests) for forecasting the amount of dollars spent on cab services. The comparison focuses on forecast accuracy under three conditions: real data for all features; noisy data (real data with uniformly distributed noise added) for several key features; and estimated data (obtained from other statistical estimators) for the key features. The main goal is to provide evidence on how these methods perform when they are used in conjunction with other estimators.
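The three evaluation conditions described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the data are synthetic, a one-feature least-squares fit stands in for the learning methods, the noise is relative with a hypothetical width of 10%, and the "estimated" feature is simply the training mean. The function names (`fit_simple_ols`, `add_uniform_noise`, `evaluate`) are likewise hypothetical.

```python
import math
import random


def fit_simple_ols(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def add_uniform_noise(values, rel_width, rng):
    """Scale each value by (1 + u), with u drawn uniformly from [-rel_width, rel_width]."""
    return [v * (1.0 + rng.uniform(-rel_width, rel_width)) for v in values]


rng = random.Random(0)

# Synthetic stand-in data (assumption): daily trip counts as the key feature,
# dollars spent per day as the forecasting target.
trips = [rng.uniform(300_000, 500_000) for _ in range(400)]
revenue = [14.0 * t + rng.gauss(0, 100_000) for t in trips]

train_x, test_x = trips[:300], trips[300:]
train_y, test_y = revenue[:300], revenue[300:]

# The model is always fitted on real data; only the test-time feature changes.
a, b = fit_simple_ols(train_x, train_y)


def evaluate(features):
    """RMSE of the fitted model when fed the given test-time feature values."""
    return rmse(test_y, [a + b * x for x in features])


# Condition 1: real feature values.
rmse_real = evaluate(test_x)
# Condition 2: real values with uniformly distributed noise (hypothetical 10% width).
rmse_noisy = evaluate(add_uniform_noise(test_x, 0.10, rng))
# Condition 3: feature replaced by a crude estimate (the training mean).
mean_train_x = sum(train_x) / len(train_x)
rmse_estimated = evaluate([mean_train_x] * len(test_x))

print(f"real: {rmse_real:.0f}  noisy: {rmse_noisy:.0f}  estimated: {rmse_estimated:.0f}")
```

Repeating the same three evaluations for each learning method would give a comparison of how gracefully each one degrades as the quality of its inputs drops.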

Acknowledgements

This research was supported by Sungshin Women’s University. In addition, L.J.G.V. and A.L.S.O. thank the European Commission Horizon 2020 5G-PPP Programme (Grant Agreement number H2020-ICT-2014-2/671672-SELFNET - Framework for Self-Organized Network Management in Virtualized and Software-Defined Networks).

Author information

Corresponding author

Correspondence to Luis Javier García Villalba.

About this article

Cite this article

Turrado García, F., García Villalba, L.J., Sandoval Orozco, A.L. et al. A comparison of learning methods over raw data: forecasting cab services market share in New York City. Multimed Tools Appl 78, 29783–29804 (2019). https://doi.org/10.1007/s11042-018-6285-x

  • DOI: https://doi.org/10.1007/s11042-018-6285-x
