Exploring Baseball with R #1

Every year since 1956, at least one pitcher has won the Cy Young Award, starting with the Brooklyn Dodgers’ Don Newcombe. Many questions can be asked. How many pitchers won the award more than once? Which pitchers achieved that feat? Who won it the most times?

In this post, I focus on just one question: How many players won the Cy Young Award? I will share how I got the answer using the R programming language. I will also be using RStudio and Sean Lahman’s Baseball Database, an excellent resource. A basic familiarity with R, dplyr, and RStudio is assumed. (Note: As you progress through this post, have RStudio open.)

Here is an introduction to the database.

The database contains multiple files in table format. The table containing the player awards data is AwardsPlayers.RData. It has six variables:

playerID       Player ID code
awardID        Name of award won
yearID         Year
lgID           League
tie            Award was a tie (Y or N)
notes          Notes about the award

AwardsPlayers.RData does not contain the players’ names, only each player’s ID. The names are in People.RData. The People table contains 24 variables. In the partial display of its variables below, notice that it too contains playerID.

playerID       A unique code assigned to each player
birthYear      Year player was born
birthMonth     Month player was born
birthDay       Day player was born
birthCountry   Country where player was born
birthState     State where player was born
birthCity      City where player was born
nameFirst      Player's first name
nameLast       Player's last name
weight         Player's weight in pounds
height         Player's height in inches
bats           Player's batting hand (left, right, or both)        
throws         Player's throwing hand (left or right)

Fortunately, the playerID field links together the data in the tables.

For this tutorial, you need to download the “2019 – R Package” from Lahman’s database. When the webpage appears, click “data.”

You will see a list of downloadable files. The files are in RData format, a format created for use in R. For this tutorial, download these two files: AwardsPlayers.RData and People.RData. The latter is not shown in the image below, which contains a partial file list.

After you click AwardsPlayers.RData, what is shown below will appear. Click View Raw. The file will download to your device. (Note: The way I show you to do something in this tutorial is often not the only way to do it.)

Click the downloaded file, which in this case is AwardsPlayers.RData. On my Mac, it downloaded into the Downloads folder.

When this appears, click Yes.

This should appear in your RStudio Console:


Here is how the AwardsPlayers.RData looks in RStudio’s Global Environment. It is now available for you to work on.
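Clicking the file and choosing Yes is roughly equivalent to calling load() yourself. The sketch below is self-contained, so it first writes a tiny stand-in file to a temporary folder. The two rows and the temporary path are illustrative assumptions. On your machine you would instead point load() at wherever AwardsPlayers.RData was downloaded.

```r
# Build a tiny stand-in for the real table (illustrative rows only).
AwardsPlayers <- data.frame(
  playerID = c("zitoba01", "degroja01"),
  awardID  = c("Cy Young Award", "Cy Young Award"),
  yearID   = c(2002L, 2018L),
  stringsAsFactors = FALSE
)

# Save it to a temporary .RData file, then remove it from the workspace.
rdata_path <- file.path(tempdir(), "AwardsPlayers.RData")
save(AwardsPlayers, file = rdata_path)
rm(AwardsPlayers)

# load() restores every object stored in the file under its original name
# and invisibly returns the names of the objects it created.
loaded <- load(rdata_path)
print(loaded)
print(exists("AwardsPlayers"))
```

Because load() recreates objects under their saved names, the data frame appears as AwardsPlayers without any assignment on your part.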

Now, in RStudio the two downloaded tables are R data frames. Next, I created an R Markdown file and made copies of both data frames.

AP <- AwardsPlayers
P <- People

Merge AP and P into a new data frame: AP_P. The column common to both, playerID, serves as the link.

AP_P <- merge(AP, P, by="playerID")
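It may help to know what merge() does with IDs that appear in only one table. By default it keeps only rows whose playerID is found in both data frames. The sketch below uses toy tables, and the ID "unknown01" is a hypothetical value invented to show an unmatched row.

```r
# Toy stand-ins for the two tables (illustrative rows, not the real data).
awards <- data.frame(
  playerID = c("zitoba01", "zitoba01", "unknown01"),  # "unknown01" is hypothetical
  awardID  = c("Cy Young Award", "TSN All-Star", "Cy Young Award"),
  stringsAsFactors = FALSE
)
people <- data.frame(
  playerID  = c("zitoba01", "zobribe01"),
  nameFirst = c("Barry", "Ben"),
  nameLast  = c("Zito", "Zobrist"),
  stringsAsFactors = FALSE
)

# Default merge(): keeps only playerIDs found in BOTH tables.
inner <- merge(awards, people, by = "playerID")

# all.x = TRUE keeps every award row; names for unmatched IDs become NA.
left <- merge(awards, people, by = "playerID", all.x = TRUE)

print(nrow(inner))
print(nrow(left))
print(left[is.na(left$nameLast), "playerID"])
```

If an award row carries a playerID that is missing from People, the default merge silently drops it, so comparing row counts before and after the merge is a cheap sanity check.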

View the merged data frame’s variables.
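The command used at this step is not reproduced in the text. As an assumption, two base R functions commonly used for this purpose are names() and str(). Here is a minimal sketch on a one-row toy data frame standing in for AP_P.

```r
# One-row stand-in for the merged data frame (illustrative only).
merged <- data.frame(
  playerID  = "zitoba01",
  awardID   = "Cy Young Award",
  yearID    = 2002L,
  nameFirst = "Barry",
  nameLast  = "Zito",
  stringsAsFactors = FALSE
)

# names() returns just the column names.
print(names(merged))

# str() adds each column's type and its first few values.
str(merged)
```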


To view the data, type View(AP_P) in the Console.


Activate the Tidyverse library. Note: If you have not used it before, you may need to install it first by running the commented-out line below.

# install.packages("tidyverse")
library(tidyverse)

Select five columns in AP_P, and display the last 10 rows.

AP_P %>% select(nameFirst, nameLast, playerID, yearID, awardID) %>% tail(10)

      nameFirst  nameLast   playerID   yearID  awardID
6227  Ryan       Zimmerman  zimmery01  2010    Silver Slugger
6228  Ryan       Zimmerman  zimmery01  2011    Lou Gehrig Memorial Award
6229  Ryan       Zimmerman  zimmery01  2009    Silver Slugger
6230  Richie     Zisk       ziskri01   1981    TSN All-Star
6231  Richie     Zisk       ziskri01   1974    TSN All-Star
6232  Barry      Zito       zitoba01   2012    Hutch Award
6233  Barry      Zito       zitoba01   2002    Cy Young Award
6234  Barry      Zito       zitoba01   2002    TSN Pitcher of the Year
6235  Barry      Zito       zitoba01   2002    TSN All-Star
6236  Ben        Zobrist    zobribe01  2016    World Series MVP

The next step is to combine the nameFirst and nameLast columns in a new column, fullname. The paste function automatically inserts a space between the names.

AP_P$fullname <- paste(AP_P$nameFirst, AP_P$nameLast)
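The space comes from paste()'s sep argument, which defaults to a single space. A short sketch of how the separator changes the result, using two names from the output above:

```r
first <- c("Barry", "Ben")
last  <- c("Zito", "Zobrist")

# Default separator is a single space.
full_default <- paste(first, last)

# A custom separator, here for "Last, First" style.
full_comma <- paste(last, first, sep = ", ")

# paste0() uses no separator at all.
full_glued <- paste0(first, last)

print(full_default)
print(full_comma)
print(full_glued)
```

Both functions are vectorized, which is why one call can build the fullname column for every row at once.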

Select the five columns to display, then show the last 15 observations in the AP_P data frame.

AP_P %>% select(playerID, fullname, awardID, yearID, lgID) %>% tail(15)

playerID   fullname        awardID                    yearID  lgID
zimmery01  Ryan Zimmerman  Lou Gehrig Memorial Award  2011    ML
zimmery01  Ryan Zimmerman  Silver Slugger             2009    NL
ziskri01   Richie Zisk     TSN All-Star               1981    AL
ziskri01   Richie Zisk     TSN All-Star               1974    NL
zitoba01   Barry Zito      Hutch Award                2012    ML
zitoba01   Barry Zito      Cy Young Award             2002    AL
zitoba01   Barry Zito      TSN Pitcher of the Year    2002    AL
zitoba01   Barry Zito      TSN All-Star               2002    AL
zobribe01  Ben Zobrist     World Series MVP           2016    ML

In this table only the last nine observations are shown.

Go into the View window and apply the settings shown below in the first row. Notice that the last year for the Cy Young Award is 2017, so two years (2018 and 2019) are missing.

Partial screenshot of View window
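The same check can be done in code rather than in the View window. The sketch below is one way to find the most recent year for a given award with dplyr. The three-row table is a toy stand-in, so its answer (2002) differs from the 2017 you would see in the real data.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy stand-in for the awards table (illustrative rows only).
awards <- data.frame(
  playerID = c("zitoba01", "zitoba01", "zobribe01"),
  awardID  = c("Cy Young Award", "Hutch Award", "World Series MVP"),
  yearID   = c(2002L, 2012L, 2016L),
  stringsAsFactors = FALSE
)

# Keep only Cy Young rows, then take the largest year.
latest_cy <- awards %>%
  filter(awardID == "Cy Young Award") %>%
  summarise(last_year = max(yearID)) %>%
  pull(last_year)

print(latest_cy)
```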

Add the missing data to the AP dataset.

# NA_character_ marks tie as truly missing; the string "NA" would be stored as text.
# Blake Snell's playerID is snellbl01.
AP <- add_row(AP, playerID = "degroja01", awardID = "Cy Young Award", yearID = 2018, lgID = "NL", tie = NA_character_, notes = "P")
AP <- add_row(AP, playerID = "snellbl01", awardID = "Cy Young Award", yearID = 2018, lgID = "AL", tie = NA_character_, notes = "P")
AP <- add_row(AP, playerID = "degroja01", awardID = "Cy Young Award", yearID = 2019, lgID = "NL", tie = NA_character_, notes = "P")
AP <- add_row(AP, playerID = "verlaju01", awardID = "Cy Young Award", yearID = 2019, lgID = "AL", tie = NA_character_, notes = "P")
AP <- add_row(AP, playerID = "verlaju01", awardID = "TSN All-Star", yearID = 2019, lgID = "AL", tie = NA_character_, notes = "P")
AP <- add_row(AP, playerID = "snellbl01", awardID = "TSN All-Star", yearID = 2018, lgID = "AL", tie = NA_character_, notes = "P")

Exercise: Update the AP_P data frame with the new data added to the AP dataset.
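One possible approach to the exercise, sketched on toy tables: because AP_P was built from the old AP, rerun the merge after adding rows and then rebuild the fullname column. The tiny AP and P below are illustrative stand-ins, not the real tables.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy stand-ins for AP and P (illustrative rows only).
AP <- data.frame(
  playerID = "zitoba01", awardID = "Cy Young Award",
  yearID = 2002, lgID = "AL", stringsAsFactors = FALSE
)
P <- data.frame(
  playerID  = c("zitoba01", "degroja01"),
  nameFirst = c("Barry", "Jacob"),
  nameLast  = c("Zito", "deGrom"),
  stringsAsFactors = FALSE
)

# Add a new award row, as in the tutorial.
AP <- add_row(AP, playerID = "degroja01", awardID = "Cy Young Award",
              yearID = 2018, lgID = "NL")

# Rebuild the merged table so it picks up the new row,
# then recreate the derived fullname column.
AP_P <- merge(AP, P, by = "playerID")
AP_P$fullname <- paste(AP_P$nameFirst, AP_P$nameLast)

print(AP_P[, c("fullname", "yearID")])
```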

How many players won the Cy Young Award? After piping the AP_P data frame to the select function, we filter it to keep only the observations for Cy Young Award winners. That result is then sorted by year (in ascending order), and the number of observations for that awardID is counted.

AP_P %>% select(playerID, fullname, yearID, lgID, awardID) %>% filter(awardID == "Cy Young Award") %>% arrange(yearID) %>% count(awardID)

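Through 2019, the Cy Young Award has been won 118 times. Note that count() tallies rows, so a pitcher who won more than once is counted once per award. The figure is therefore the number of awards, not the number of distinct winners.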
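To separate awards from distinct winners, count unique playerID values instead of rows. The sketch below shows the difference on a four-row toy table. It is illustrative only, and the same calls could be applied to the filtered AP_P data.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy Cy Young rows (a small illustrative subset, not the full data).
cy <- data.frame(
  playerID = c("degroja01", "degroja01", "verlaju01", "zitoba01"),
  yearID   = c(2018, 2019, 2019, 2002),
  awardID  = "Cy Young Award",
  stringsAsFactors = FALSE
)

# Number of awards = number of rows.
n_awards <- nrow(cy)

# Number of distinct winners = number of unique player IDs.
n_winners <- n_distinct(cy$playerID)

# Pitchers with more than one award, most wins first.
repeat_winners <- cy %>%
  count(playerID, name = "wins", sort = TRUE) %>%
  filter(wins > 1)

print(n_awards)
print(n_winners)
print(repeat_winners)
```

The repeat_winners table also speaks to the questions raised at the top of the post about who won the award more than once.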

If you find any errors in this post, please let me know. If you have any technical questions about R, please ask them on stackoverflow.com.

Cy Young Lives On

For more than 60 years, baseball has recognized its mound stars with a plaque that memorializes the achievements of a man whose 22-year career began in 1890.

When he retired at age 45, Denton True “Cy” Young had won 511 games and, until his next-to-last year in 1910, never lost more games in a season than he won.

Young’s greatest achievement may have come on May 5, 1904, when at the age of 37 he pitched the first perfect game in American League history – just the third in the major leagues and the first from the 60-foot-6-inch pitching distance. 


Young died in 1955. A year later, Major League Baseball awarded the first Cy Young Award to Don Newcombe, who got 10 of the 16 first-place votes.

Since Newcombe, the prize has been awarded 117 more times.

During the award’s first 11 years, just one pitcher won it more than once: Sandy Koufax, who won it three times. While others have since won it three times, only Roger Clemens (7), Randy Johnson (5), Greg Maddux (4), and Steve Carlton (4) won it more than three times.

Koufax was also one of five winners who pitched for the Los Angeles Dodgers, the Dodgers winning the prize five years in a row.

Further, five of the first 11 winners made it to the Hall of Fame: Warren Spahn, Early Wynn, Whitey Ford, Don Drysdale, and Sandy Koufax.

In 1966, Koufax was the last sole winner in a season. After his final receipt of the award, it was given to the best pitcher in each league.

Cy Young won 477 complete games, fully 60 more than Walter Johnson. Only five times did he win a game that he had started but not completed.


At first, those winning it were starters who had won at least 20 games. That ended in 1973 when members of the Baseball Writers’ Association of America awarded it to Tom Seaver though he had a 19-10 record.

A year later, the first reliever won the Cy Young. Mike Marshall, pitching for the Los Angeles Dodgers, finished the 1974 season with a 15-12 record. He took the mound in 106 games, 30 more than Rollie Fingers, and pitched 208.1 innings, a starter’s workload. So he averaged just under two innings a game. Despite all those appearances, his ERA was just 2.42.
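The “just under two innings” figure can be checked with quick arithmetic. In baseball notation the digit after the decimal point counts outs, so this sketch assumes 208.1 means 208 and one-third innings.

```r
# 208.1 innings pitched is read as 208 full innings plus one out (1/3 inning).
innings     <- 208 + 1 / 3
appearances <- 106

# Average innings per appearance, roughly 1.97.
innings_per_game <- innings / appearances
print(round(innings_per_game, 2))
```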

Though Marshall had 21 saves, he did not lead the league in that category, coming in second to Terry Forster, who had 24 in 59 games. Despite being the runner-up in saves, Marshall was named The Sporting News’ Fireman of the Year.

After Marshall, eight other relievers have won the Cy Young, but none since another Dodger, Eric Gagne, received it in 2003.

Among all pitchers, both starters and relievers, only 11 have won the award in back-to-back seasons. Among them is Mets standout Jacob deGrom, who received it in both 2018 and 2019, winning 21 games. However, unlike Gaylord Perry’s, those wins came not in one season but two: 10 in 2018 and 11 in 2019. Those two win totals were the lowest ever for a Cy Young Award winner who was a starter.

Cy Young threw a baseball until his right arm could no longer obey his mind’s commands.

“All us Youngs could throw,” he said. “I used to kill squirrels with a stone when I was a kid, and my granddad once killed a turkey buzzard on the fly with a rock.”


He was inducted into baseball’s Hall of Fame in 1937.

On his tombstone, above his name and that of his wife, is a winged baseball.