A data-driven optimization approach to baseball roster management

  • Original Research
Annals of Operations Research

Abstract

Each year, Major League Baseball (MLB) teams face complex decisions about which players to retain and which players to recruit. In addition to operational, team and budget constraints, these decisions are further complicated by the fact that an athlete’s future performance and its impact on the team are both uncertain. In this paper, we combine prediction modeling with decision optimization to study the MLB free agent market. We develop optimization models for the allocation of a team’s recruitment budget using six different metrics that evaluate a player’s contributions to a team’s success. We consider both an ideal case, where each team can choose among all free agents, and a sequential case, where we assume that teams with stronger appeal (big market) are more successful in attracting talent, while teams with less pull must optimize their rosters over a much smaller pool of remaining players. Using the best-performing metric, which takes into account both players’ positions and their positional flexibility, we develop a series of quantitative tools that help teams, especially those with small budgets, identify (1) the players who deliver a key competitive advantage to their teams, appearing in both their ideal and sequential rosters and (2) the players who are in many ideal rosters and thus are likely to be hired by teams with big budgets, perhaps at a substantial salary premium. In order to gain and maintain an edge in the fiercely competitive free agent market, teams need to continuously adapt their strategies, and our models represent a first step towards prescriptive (not just predictive) analytics designed to help them do so. Further, our analysis indicates that a few players are in high demand from many teams (for instance, in every year of the period considered, the ten most in-demand players appear in the ideal rosters of at least seven teams), while most players appear in one ideal roster or none at all.
Our models go beyond players’ individual performance metrics to help teams understand which players will be in high demand due to teams’ position needs in a given year. The results further emphasize the increasing importance of contract extensions as a strategy to bypass the free agent market.
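
The distinction between the ideal and sequential cases can be illustrated with a small sketch. It is not the authors' model: it assumes a hypothetical six-player pool with made-up salaries and predicted values, requires exactly one signing per needed position, ignores positional flexibility, and uses brute-force enumeration in place of an optimization solver. The names `POOL`, `best_roster` and `sequential_rosters` are invented for this example.

```python
from itertools import combinations

# Hypothetical free-agent pool: (name, position, salary in $m, predicted value).
# All numbers are made up for illustration; none come from the paper.
POOL = [
    ("A", "SP", 9.0, 3.5),
    ("B", "SP", 4.0, 2.0),
    ("C", "C", 2.0, 1.2),
    ("D", "C", 1.0, 0.8),
    ("E", "OF", 6.0, 2.6),
    ("F", "OF", 1.5, 0.9),
]


def best_roster(pool, budget, needs):
    """Most valuable roster with one player per needed position within budget."""
    best_value, best_names = 0.0, ()
    for combo in combinations(pool, len(needs)):
        # Positions signed must match the positions needed.
        if sorted(p[1] for p in combo) != sorted(needs):
            continue
        # Total salary must fit the recruitment budget.
        if sum(p[2] for p in combo) > budget:
            continue
        value = sum(p[3] for p in combo)
        if value > best_value:
            best_value, best_names = value, tuple(p[0] for p in combo)
    return best_value, best_names


def sequential_rosters(pool, teams):
    """Teams pick in the given order (strongest appeal first); signed players leave the pool."""
    remaining = list(pool)
    rosters = {}
    for team, budget, needs in teams:
        _, names = best_roster(remaining, budget, needs)
        rosters[team] = names
        remaining = [p for p in remaining if p[0] not in names]
    return rosters


# Two hypothetical teams with the same needs but different budgets ($m).
TEAMS = [("Big", 20.0, ["SP", "C", "OF"]), ("Small", 8.0, ["SP", "C", "OF"])]

# Ideal case: every team chooses from the full pool.
ideal = {team: best_roster(POOL, budget, needs)[1] for team, budget, needs in TEAMS}
# Sequential case: the big-market team signs first.
sequential = sequential_rosters(POOL, TEAMS)

print("ideal:", ideal)
print("sequential:", sequential)
# Players the small team gets in both cases.
print("small-team core:", sorted(set(ideal["Small"]) & set(sequential["Small"])))
```

In this toy setting, player C is in both ideal rosters and is lost to the larger-budget team, while B and F appear in the small team's ideal and sequential rosters alike. These are the two kinds of player the abstract distinguishes.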

References

  • Albert, J. (2006). Pitching statistics, talent and luck, and the best strikeout seasons of all-time. Journal of Quantitative Analysis in Sports, 2(1).

  • Barnes, S. L., & Bjarnadóttir, M. V. (2016). Great expectations: An analysis of major league baseball free agent performance. Statistical Analysis and Data Mining, 9(5), 295–309.

  • Baumer, B., & Zimbalist, A. (2014). The sabermetric revolution: Assessing the growth of analytics in baseball. University of Pennsylvania Press.

  • Bendtsen, M. (2017). Regimes in baseball players’ career data. Data Mining and Knowledge Discovery, 31, 1580–1621.

  • Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust optimization. Princeton series in applied mathematics. Princeton University Press.

  • Bertsimas, D., & Sim, M. (2003). Price of robustness. Operations Research, 52, 35–53.

  • Brave, S. A., Butters, R. A., & Roberts, K. A. (2019). Uncovering the sources of team synergy: Player complementarities in the production of wins. Journal of Sports Analytics, 5(4), 247–279.

  • Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroskedasticity and random coefficient variation. Econometrica, 47(5), 1287–1294.

  • Busing, C., Koster, A., & Kutschka, M. (2011). Recoverable robust knapsacks: The discrete scenario case. Optimization Letters, 5, 379–392.

  • Chan, T. C. Y., & Fearing, D. S. (2013). The value of flexibility in baseball roster construction. In MIT Sloan Sports Analytics Conference.

  • Chan, T. C. Y., & Fearing, D. S. (2019). Process flexibility in baseball: The value of positional flexibility. Management Science, 65(4), 1642–1666.

  • Chung, D. J. (2017). How much is a win worth? An application to intercollegiate athletics. Management Science, 63, 548–565.

  • Cot’s Baseball Contracts. Highest paid players. https://legacy.baseballprospectus.com/compensation/cots/league-info/highest-paid-players/

  • DeBrock, L., Hendricks, W., & Koenker, R. (2004). Pay and performance. The impact of salary distribution on firm-level outcomes in baseball. Journal of Sports Economics, 5(3), 243–261.

  • Depken, C. A. (2000). Wage disparity and team productivity: Evidence from major league baseball. Economics Letters, 67, 87–92.

  • Elitzur, R. (2020). Data analytics effects in major league baseball. Omega, 90, 102001. https://doi.org/10.1016/j.omega.2018.11.010

  • Farrar, A., & Bruggink, T. H. (2011). A new test of the Moneyball hypothesis. The Sport Journal, 14(1), 1–9.

  • Frick, B., Prinz, J., & Winkelmann, K. (2003). Pay inequalities and team performance: Empirical evidence from the North American major leagues. International Journal of Manpower, 24(4), 472–488.

  • Gross, A., & Link, C. (2017). Does option theory hold for Major League Baseball contracts? Economic Inquiry, 55(1), 425–433.

  • Hakes, J. K., & Sauer, R. D. (2006). An economic evaluation of the Moneyball hypothesis. Journal of Economic Perspectives, 20(3), 173–185.

  • Hall, S., Szymanski, S., & Zimbalist, A. S. (2002). Testing causality between team performance and payroll. The cases of major league baseball and English soccer. Journal of Sports Economics, 3, 149–168.

  • Humphrey, S. E., Morgeson, F. P., & Mannor, M. J. (2009). Developing a theory of the strategic core of teams: A role composition model of team performance. Journal of Applied Psychology, 94(1), 48–60.

  • Humphreys, B. R., & Pyun, H. (2017). Monopsony exploitation in professional sport: Evidence from major league baseball position players, 2000–2011. Managerial and Decision Economics, 28, 676–688.

  • Kahn, L. M. (1993). Managerial quality, team success, and individual player performance in major league baseball. ILR Review, 46, 531–547.

  • Kasperski, A., & Zielinski, P. (2016). Robust discrete optimization under discrete and interval uncertainty: A survey. In Robustness analysis in decision aiding, optimization and analytics. Springer.

  • Kim, J. W., & King, B. G. (2014). Seeing stars: Matthew effects and status bias in major league baseball umpiring. Management Science, 60(11), 2619–2644.

  • Koop, G. (2002). Comparing the performance of baseball players. Journal of the American Statistical Association, 97(459), 710–720. https://doi.org/10.1198/016214502388618456

  • Koseler, K., & Stephan, M. (2017). Machine learning applications in baseball: A systematic literature review. Applied Artificial Intelligence, 31(9–10), 745–763. https://doi.org/10.1080/08839514.2018.1442991

  • Krautmann, A. C. (1990). Shirking or stochastic productivity in major league baseball? Southern Economic Journal, 5(4), 961–968.

  • Krautmann, A. C. (2016). Contract extensions: The case of major league baseball. Journal of Sports Economics, 19, 1–16.

  • Lackritz, J. R. (1990). Salary evaluation for professional baseball players. The American Statistician, 44(1), 4–8. https://doi.org/10.1080/00031305.1990.10475682

  • Lesaege, C., & Poss, M. (2016). The partial choice recoverable knapsack problem. Computational Management Science, 1, 189–194.

  • Lewis, M. (2004). Moneyball: The art of winning an unfair game. W. W. Norton & Company.

  • Liebchen, C., Lubbecke, M., Mohring, R., & Stiller, S. (2009). The concept of recoverable robustness, linear programming recovery, and railway applications. In Robust and online large-scale optimization (pp. 1–27). Springer.

  • MacKinnon, J. G., & White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3), 305–325.

  • Monaci, M., Pferschy, U., & Serafini, P. (2013). Exact solution of the robust knapsack problem. Computers and Operations Research, 40, 2625–2631.

  • Nasrabadi, E., & Orlin, J. (2013). Robust optimization with incremental recourse. Technical report. MIT Sloan School of Management.

  • Raimondo, H. J. (1983). Free agents’ impact on the labor market for baseball players. Journal of Labor Research, 4(2), 183–193.

  • Rockerbie, D. W. (2009). Strategic free agency in baseball. Journal of Sports Economics, 10(3), 278–291.

  • Schall, T., & Smith, G. (2000). Do baseball players regress toward the mean? The American Statistician, 54(4), 231–235.

  • Schultz, R., & Curnow, C. (1988). Peak performance and age among superathletes: Track and field, swimming, baseball, tennis and golf. Journal of Gerontology, 43(5), 113–120.

  • Scully, G. W. (1974). Pay and performance in major league baseball. The American Economic Review, 64(6), 915–930.

  • Silver, N. (2012). The signal and the noise. Penguin.

  • Spotrac. MLB offseason spending. Online tool. https://www.spotrac.com/mlb/tools/offseason/

  • Timmerman, T. A. (2000). Racial diversity, age diversity, interdependence, and team performance. Small Group Research, 13(5), 592–606.

  • Turvey, J. (2013). The future of baseball contracts: A look at the growing trend in long-term contracts. The Baseball Research Journal, 42(2), 101–107.

  • Tymkovich, J. L. (2012). A study of minor league baseball prospects and their expected future value. CMC Senior Theses (p. 442). http://scholarship.claremont.edu/cmc_theses/442

  • van den Akker, J., Bouman, P., Hoogeveen, J., & Tonissen, D. (2014). Decomposition approaches for recoverable robust optimization problems. Technical report, Utrecht University, Utrecht.

  • Wiseman, F., & Chatterjee, S. (2003). Team payroll and team performance in major league baseball: 1985–2002. Economics Bulletin, 1(2), 1–10.

Author information

Corresponding author

Correspondence to Margrét Bjarnadóttir.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 503 KB)

Appendices

Appendix A: Additional examples: ideal rosters exceeding and falling short of actual rosters

Best performance of our metric.

  • Kansas City Royals (KCR) in 2009: The actual roster is more balanced than the optimized one, with a maximum salary of $4.7m and five players paid over $1m, while in the optimized roster, the maximum is $9.2m, the next highest salary is $1.1m, and the other 10 players are all paid less than $1m, with nine of them at base salary. The overperformance of the optimization approach is due to the outstanding performance of Kenshin Kawakami, who was selected by the optimization model despite his WAR performance prediction of 0. Hence, in this case the overperformance may be due to chance rather than systematic advantages.

  • Washington Nationals (WSN) in 2011: Five players in the actual roster and four in the optimized roster are paid over $1m, but in the actual roster, two players are paid at least $10m and the third highest salary drops to $1.6m. In contrast, in the optimized roster, the maximum salary is $10.5m and the two next highest are $5.5m and $3.7m. While the highest paid player in the optimized roster had an actual WPA of \(-\)0.9, the second and third highest paid players had WPAs of 2.1 and 1.2, respectively. In contrast, in the actual roster, the highest WPA was 1.1 and the second highest was 0.4.

  • Washington Nationals (WSN) in 2012: This is a case where the maximum salary is higher in the optimized roster than in the actual one ($13.4m, Carlos Beltran, vs. $11.3m, Edwin Jackson). In this case, Beltran did very well with an actual WAR of 3.9 and an actual WPA of 2.4. The optimized roster was also helped by the presence of Fernando Rodney, with an actual WAR of 3.8 and WPA of 5.1 (with an adjusted salary of $1.8m).

  • Atlanta Braves (ATL) in 2013: The main reason the optimized roster of five new players performs better than the actual one is that instead of signing B.J. Upton at $12.7m, who underperformed (actual WAR \(-\)1.3, actual WPA \(-\)2.8), the optimization approach signs two players in the $6.1–6.7m adjusted salary range, who both performed quite well. In addition, the optimization approach signs Dioner Navarro, who also exceeded predictions, at a $1.8m adjusted salary.

  • Milwaukee Brewers (MIL) in 2013: Both the optimized and actual rosters sign Kyle Lohse, who has the highest adjusted salary at $11.2m and had an actual WAR of 3.3 and WPA of 1.1. However, in the actual roster, none of the other WPAs are positive while four of the other WPAs in the optimized roster are positive, leading to a cumulative WPA of \(-\)0.3 in the optimized case vs. \(-\)4.9 in the real world. Because the salary distribution is not fundamentally altered, the overperformance of the optimization approach might to some degree be due to luck.

Worst performance of our metric.

  • Houston Astros (HOU) in 2010: The optimization approach results in a roster of 11 new players with a maximum of $7.6m in adjusted salary, two players in the $0.71–0.76m range and the remaining eight at base salary, while the actual roster has two players in the $3.3–4.6m range, two in the $0.76–0.87m range, and the remaining seven at base salary. Hence, the star in the optimized roster is Carl Pavano with a salary of $7.6m, with all the other salaries being much lower, while the actual roster splits his salary over two players. With an actual WAR of 4 and WPA of 0.6, Pavano did quite well, but his performance is counterbalanced by that of Rodrigo Lopez, with a WAR of \(-\)0.7 and WPA of \(-\)3.2. The worst WPA of the actual roster is \(-\)1.1 (Gustavo Chacin).

  • Arizona Diamondbacks (ARI) in 2012: The maximum salary in this roster of 11 new players is $7.7m in the actual roster and $8.2m in the optimized one. The second highest salary in the actual roster is $5.6m, with the third highest dropping to $2.0m. Six players in the actual roster are paid over $1m. In the optimized roster, four players are paid above $1m, with all of those being paid at least $2m. The cumulative WPA of the players paid over $1m was 3.4 in the actual roster and \(-\)5 in the optimized roster. Particularly detrimental to the performance of the optimized roster was the selection of Francisco Rodriguez, who is the highest paid player but had an actual WAR of \(-\)0.2 and WPA of \(-\)1.3.

  • Baltimore Orioles (BAL) in 2012: This is another case where the optimization approach leads to an overemphasis on very expensive players that backfires. In this case, the optimized approach signs Casey Kotchman at $3.1m, but his actual WAR was \(-\)0.9 and his WPA was \(-\)2.8.

  • Seattle Mariners (SEA) in 2013: The underperformance of the optimization approach is due to the signing of Hisashi Iwakuma to the actual team, who far exceeded predictions with a WAR of 7 and WPA of 3.5.

  • San Francisco Giants (SFG) in 2013: The underperformance of the optimization approach is due to the signing of B.J. Upton, who underperformed, to the optimized team at an adjusted salary of $12.7m. The maximum adjusted salary of the actual roster was $8.4m, allowing two other salaries in the $6.1–6.8m range. In the optimization approach, the next highest salaries are $8.1m and $1.4m.

Appendix B: Heteroskedasticity in team performance models

We investigate potential misspecification of the models in Sect. 3.1 by testing for heteroskedasticity. We apply the Breusch–Pagan Lagrange multiplier test (Breusch & Pagan, 1979) to each model; the results are shown in Supplementary Table 1. All of the p-values are below 0.01 for Models 1–3, indicating the presence of heteroskedasticity. In Supplementary Figure 1 we highlight the heteroskedasticity of Model 1. The model tends to predict closer to the mean, causing a pattern of under-prediction for high-performing teams and over-prediction for low-performing teams.
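
For readers unfamiliar with the test, the following is a minimal sketch of a Breusch–Pagan-style check for a regression with a single predictor. It runs on simulated data, not on the paper's team data, and it assumes the common \(n \cdot R^2\) (Koenker-studentized) form of the LM statistic, which is computed slightly differently from the original 1979 statistic. The function names and the simulated data are hypothetical.

```python
import math
import random


def ols_simple(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, residuals)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, resid


def breusch_pagan_lm(x, y):
    """Return (LM statistic, p-value) using the n * R^2 form with one regressor."""
    n = len(x)
    _, _, resid = ols_simple(x, y)
    sq = [e * e for e in resid]
    # Auxiliary regression: squared residuals on the same predictor.
    _, _, aux_resid = ols_simple(x, sq)
    mean_sq = sum(sq) / n
    sst = sum((s - mean_sq) ** 2 for s in sq)
    ssr = sum(e * e for e in aux_resid)
    r2 = 1.0 - ssr / sst
    lm = max(n * r2, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Simulated data (an assumption for illustration): noise grows with x.
random.seed(0)
x = [random.uniform(1.0, 10.0) for _ in range(500)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 0.3 * xi) for xi in x]

lm, p = breusch_pagan_lm(x, y)
print(f"LM = {lm:.1f}, p-value = {p:.2g}")
```

A small p-value, as reported for Models 1–3, rejects the hypothesis that the residual variance is constant across the range of the predictor.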

Heteroskedasticity commonly results in inconsistent estimates of the standard errors of linear regression models, leading to confidence intervals that are either too wide or too narrow. To investigate this effect, we reran Models 1–3 with robust standard errors, using the HC1 estimator (MacKinnon & White, 1985). The robust standard errors for each model were in fact close to or lower than the original standard errors: in all three models, the intercept standard error decreased and the standard error for WPA and/or WAR increased by less than 10% (with p-values remaining highly significant). More importantly, because in this paper we use the models as predictive inputs to the decision models, it is important to note that the regression estimates are not affected by the use of robust standard errors.
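
The last point can be made concrete with a short sketch, under the simplifying assumption of a regression with one predictor and an intercept. The coefficient is computed once; the classical and HC1 estimators differ only in the standard error attached to it. The toy data and the function name are hypothetical and unrelated to Models 1–3.

```python
import math


def slope_with_standard_errors(x, y):
    """Return (slope, classical SE, HC1 SE) for y = a + b*x fitted by OLS."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - mean_y) for d, yi in zip(dx, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Classical variance: one common error variance for all observations.
    classical_var = sum(e * e for e in resid) / (n - 2) / sxx
    # HC1: each squared residual weights its own observation, with an
    # n / (n - k) small-sample correction (k = 2 coefficients here).
    hc1_var = (n / (n - 2)) * sum(d * d * e * e for d, e in zip(dx, resid)) / sxx ** 2
    return slope, math.sqrt(classical_var), math.sqrt(hc1_var)


# Toy data, chosen only so the arithmetic is easy to follow.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 6.0]

slope, se_classical, se_hc1 = slope_with_standard_errors(x, y)
print(f"slope = {slope:.3f}")
print(f"classical SE = {se_classical:.3f}, HC1 SE = {se_hc1:.3f}")
```

Because the slope does not depend on which standard error is reported, predictions passed to a downstream decision model are identical under either estimator.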

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Barnes, S., Bjarnadóttir, M., Smolyak, D. et al. A data-driven optimization approach to baseball roster management. Ann Oper Res 335, 33–58 (2024). https://doi.org/10.1007/s10479-023-05725-4

  • DOI: https://doi.org/10.1007/s10479-023-05725-4
