Every year since 1956, at least one pitcher has won the Cy Young Award, starting with the Brooklyn Dodgers’ Don Newcombe. That history invites many questions. How many pitchers won the award more than once? Which pitchers achieved that feat? Who won it the most times?
In this post, I focus on just one question: How many players won the Cy Young Award? I will share how I got the answer using the R programming language. I will also be using RStudio and Sean Lahman’s Baseball Database, an excellent resource. A basic familiarity with R, dplyr, and RStudio is assumed. (Note: As you progress through this post, have RStudio open.)
Here is an introduction to the database.
The database contains multiple files in table format. The table containing the player awards data is AwardsPlayers.RData. It has six variables:
playerID: Player ID code
awardID: Name of award won
yearID: Year
lgID: League
tie: Award was a tie (Y or N)
notes: Notes about the award
AwardsPlayers.RData does not contain the players’ names, only each player’s ID. The names are in People.RData. The People table contains 24 variables. In the partial display of its variables, notice that it too contains playerID.
playerID: A unique code assigned to each player
birthYear: Year player was born
birthMonth: Month player was born
birthDay: Day player was born
birthCountry: Country where player was born
birthState: State where player was born
birthCity: City where player was born
nameFirst: Player's first name
nameLast: Player's last name
weight: Player's weight in pounds
height: Player's height in inches
bats: Player's batting hand (left, right, or both)
throws: Player's throwing hand (left or right)
Fortunately, the playerID field links together the data in the tables.
For this tutorial, you need to download the “2019 – R Package” from Lahman’s database. When the webpage appears, click “data.”
You will see a list of downloadable files. The files are in RData format, a format created for use in R. For this tutorial, download these two files: AwardsPlayers.RData and People.RData. The latter is not shown in the image below, which contains a partial file list.
After you click AwardsPlayers.RData, what is shown below will appear. Click View Raw. The file will download to your device. (Note: The way I show you to do something in this tutorial is often not the only way to do it.)
Click the downloaded file, which in this case is AwardsPlayers.RData. On my Mac, it downloaded into the Downloads folder.
When RStudio asks whether to load the data file into the global environment, click Yes.
This should appear in your RStudio Console:
Here is how the AwardsPlayers.RData looks in RStudio’s Global Environment. It is now available for you to work on.
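If you would rather not click through the dialog, the same file can be loaded from code. This is a minimal sketch that assumes the file was saved to a Downloads folder under your home directory. The path is an assumption, so adjust it to your machine.

```r
# Assumed location of the downloaded file; change this to match your system.
awards_path <- file.path(path.expand("~"), "Downloads", "AwardsPlayers.RData")

if (file.exists(awards_path)) {
  # load() restores the saved objects and returns their names.
  loaded_objects <- load(awards_path)
  print(loaded_objects)
} else {
  message("File not found at: ", awards_path)
}
```

The same pattern works for People.RData.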
Now, in RStudio the two downloaded tables are R data frames. Next, I created an R Markdown file and made copies of both data frames.
AP <- AwardsPlayers
P <- People
Merge AP and P into a new data frame: AP_P. The column common to both, playerID, serves as the link.
AP_P <- merge(AP, P, by="playerID")
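To see what merge() does with IDs that appear in only one table, here is a small sketch on made-up data. The data frames awards and people and their rows are hypothetical stand-ins, not Lahman data.

```r
# Toy stand-ins for the two tables (made-up rows, not Lahman data).
awards <- data.frame(
  playerID = c("p1", "p1", "p2", "p9"),
  awardID  = c("Award A", "Award B", "Award A", "Award A"),
  stringsAsFactors = FALSE
)
people <- data.frame(
  playerID  = c("p1", "p2", "p3"),
  nameFirst = c("Ann", "Bob", "Cal"),
  nameLast  = c("Lee", "Ray", "Fox"),
  stringsAsFactors = FALSE
)

# Default merge(): keeps only playerIDs present in both tables (inner join).
inner <- merge(awards, people, by = "playerID")
nrow(inner)   # 3: the "p9" award row is dropped, and so is player "p3"

# all.x = TRUE keeps every award row, filling missing names with NA.
left <- merge(awards, people, by = "playerID", all.x = TRUE)
nrow(left)    # 4
```

Because the default is an inner join, an award row whose playerID has no match in the People table silently disappears from the merged result, so it is worth comparing row counts before and after the merge.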
View the merged data frame’s variables.
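Two base R functions that list a data frame's variables are names() and str(). The sketch below uses a made-up data frame called toy. In the tutorial you would pass AP_P instead.

```r
# Toy data frame (made-up rows) standing in for AP_P.
toy <- data.frame(
  playerID = c("p1", "p2"),
  awardID  = c("Award A", "Award B"),
  yearID   = c(2001L, 2002L),
  stringsAsFactors = FALSE
)

names(toy)  # column names only
str(toy)    # column names, types, and the first few values
```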
To view the data, type the following in the Console (View() opens the data frame in RStudio's data viewer):
View(AP_P)
Load the tidyverse, reduce the number of columns, and display the last 10 rows. Note: If you have not used the tidyverse before, install it first by running install.packages("tidyverse") once in the Console.
library(tidyverse)
AP_P %>% select(nameFirst, nameLast, playerID, yearID, awardID) %>% tail(10)
Some rows from that output, with the five selected columns (the first column is the row number in AP_P):
|row|nameFirst|nameLast|playerID|yearID|awardID|
|---|---|---|---|---|---|
|6228|Ryan|Zimmerman|zimmery01|2011|Lou Gehrig Memorial Award|
|6233|Barry|Zito|zitoba01|2002|Cy Young Award|
|6234|Barry|Zito|zitoba01|2002|TSN Pitcher of the Year|
|6236|Ben|Zobrist|zobribe01|2016|World Series MVP|
The next step is to combine the nameFirst and nameLast columns in a new column, fullname. The paste function automatically inserts a space between the names.
AP_P$fullname <- paste(AP_P$nameFirst, AP_P$nameLast)
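As a quick illustration of how paste() handles the separator, here is a sketch on made-up names. The vectors first and last are hypothetical.

```r
first <- c("Ann", "Bob")
last  <- c("Lee", "Ray")

# Default separator is a single space.
full <- paste(first, last)

# A custom separator, here for a "Last, First" layout.
last_first <- paste(last, first, sep = ", ")

# paste0() joins with no separator at all.
joined <- paste0(first, last)

full
last_first
joined
```

One thing to watch for: if a name is missing, paste() turns the NA into the text "NA" in the combined string.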
Select the five columns to display and show the last 15 observations in the AP_P data frame.
AP_P %>% select(playerID, fullname, awardID, yearID, lgID) %>% tail(15)
|playerID|fullname|awardID|yearID|lgID|
|---|---|---|---|---|
|zimmery01|Ryan Zimmerman|Lou Gehrig Memorial Award|2011|ML|
|zimmery01|Ryan Zimmerman|Silver Slugger|2009|NL|
|ziskri01|Richie Zisk|TSN All-Star|1981|AL|
|ziskri01|Richie Zisk|TSN All-Star|1974|NL|
|zitoba01|Barry Zito|Hutch Award|2012|ML|
|zitoba01|Barry Zito|Cy Young Award|2002|AL|
|zitoba01|Barry Zito|TSN Pitcher of the Year|2002|AL|
|zitoba01|Barry Zito|TSN All-Star|2002|AL|
|zobribe01|Ben Zobrist|World Series MVP|2016|ML|
In the View window, apply the settings shown in the first row below. Notice that the most recent Cy Young Award in the data is from 2017, so the 2018 and 2019 winners are missing.
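The same check can be made in code rather than in the View window. This is a sketch on a made-up data frame called toy. With the real data you would start the pipeline from AP_P.

```r
library(dplyr)

# Toy rows (made up) standing in for the merged award data.
toy <- data.frame(
  awardID = c("Cy Young Award", "Cy Young Award", "Other Award", "Cy Young Award"),
  yearID  = c(2015L, 2017L, 2019L, 2016L),
  stringsAsFactors = FALSE
)

# Keep one award, then report the earliest and latest year on record.
year_span <- toy %>%
  filter(awardID == "Cy Young Award") %>%
  summarise(first_year = min(yearID), last_year = max(yearID))

year_span
```

If last_year is earlier than the season you expect, the table is missing recent winners.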
Add the missing Cy Young Award winners, plus two TSN All-Star rows, to the AP dataset. Use NA (unquoted) for tie so that it is stored as a missing value rather than as the text "NA".
# 2018 winners
AP <- add_row(AP, playerID = "degroja01", awardID = "Cy Young Award", yearID = 2018, lgID = "NL", tie = NA, notes = "P")
AP <- add_row(AP, playerID = "snellbl01", awardID = "Cy Young Award", yearID = 2018, lgID = "AL", tie = NA, notes = "P")
# 2019 winners
AP <- add_row(AP, playerID = "degroja01", awardID = "Cy Young Award", yearID = 2019, lgID = "NL", tie = NA, notes = "P")
AP <- add_row(AP, playerID = "verlaju01", awardID = "Cy Young Award", yearID = 2019, lgID = "AL", tie = NA, notes = "P")
# TSN All-Star rows
AP <- add_row(AP, playerID = "verlaju01", awardID = "TSN All-Star", yearID = 2019, lgID = "AL", tie = NA, notes = "P")
AP <- add_row(AP, playerID = "snellbl01", awardID = "TSN All-Star", yearID = 2018, lgID = "AL", tie = NA, notes = "P")
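After appending rows, it is worth confirming that they landed where expected and that missing values are stored as real NA values. The sketch below uses a made-up tibble called toy, and its column values are hypothetical.

```r
library(tibble)

# Toy table (made-up rows) with a character column that contains a missing value.
toy <- tibble(
  playerID = c("p1", "p2"),
  yearID   = c(2016, 2017),
  tie      = c(NA, "Y")
)

# Append one row with a true missing value in tie.
toy <- add_row(toy, playerID = "p3", yearID = 2018, tie = NA)

tail(toy, 1)          # the new row sits at the bottom
sum(is.na(toy$tie))   # 2 missing values: the original NA and the new one

# The quoted string "NA" is ordinary text, not a missing value.
is.na("NA")           # FALSE
```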
Exercise: Update the AP_P data frame with the new data added to the AP dataset.
How many players won the Cy Young Award? We pipe the AP_P data frame to select(), then filter the observations to keep only Cy Young Award winners. The result is sorted by year in ascending order, and count() then returns the number of rows for that award.
AP_P %>% select(playerID, fullname, yearID, lgID, awardID) %>% filter(awardID == "Cy Young Award") %>% arrange(yearID) %>% count(awardID)
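Through 2019, the Cy Young Award has been won 118 times. Because count() tallies rows, a pitcher who won the award more than once is counted once for each win, so the number of distinct winners is smaller than 118.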
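Note that count(awardID) counts award rows, so a repeat winner contributes one row per win. To separate the number of awards from the number of distinct pitchers, one option is n_distinct() on playerID. The sketch below runs on a made-up data frame called toy. With the real data you would start from AP_P.

```r
library(dplyr)

# Toy rows (made up): player p1 wins three times, p2 and p3 once each.
toy <- data.frame(
  playerID = c("p1", "p2", "p1", "p3", "p1"),
  awardID  = "Cy Young Award",
  yearID   = c(2001L, 2002L, 2003L, 2004L, 2005L),
  stringsAsFactors = FALSE
)

cy <- toy %>% filter(awardID == "Cy Young Award")

# awards = number of rows; winners = number of different playerIDs.
summary_counts <- cy %>%
  summarise(awards = n(), winners = n_distinct(playerID))
summary_counts   # awards = 5, winners = 3

# Wins per pitcher, most wins first.
wins <- cy %>% count(playerID, name = "wins", sort = TRUE)
wins             # p1 has 3 wins; p2 and p3 have 1 each
```

The wins table also speaks to the other questions raised at the start of the post: filtering it to wins > 1 lists the repeat winners, and its first row is the pitcher with the most awards.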
If you find any errors in this post, please let me know. If you have any technical questions about R, please ask them on stackoverflow.com.