I like to read books, especially baseball books, and today I found one that appears to be a standout. I say standout though I have only read about 8% of the book — according to the Kindle app — but at least I’m not judging it just by its cover. In fact, I don’t even remember what the cover looks like.
It’s written by Phil Pepe, a longtime sportswriter who covered the Yankees beat. He wrote for the World-Telegram and Sun, a New York newspaper that I once delivered by bicycle, and the New York Daily News, a paper that had a great sports department until 2018, when it downsized the department from 30 to 9, and then in 2019 had the chutzpah to write a story headlined “The Sports Illustrated layoffs are disgraceful.”
But this post is not a bearer of bad news. Instead, it is about a company that deserves a shoutout: Sports Publishing.
Today, while browsing through Amazon’s baseball books, I found Phil Pepe’s Yankee Doodles: Inside the Locker Room with Mickey, Yogi, Reggie, and Derek. Though published in 2015, the book is not dated. It’s about Yankees like Yogi Berra and Mickey Mantle, Yankee greats who played the game when baseball was still America’s pastime.
On its Amazon page, the book’s print list price of $24.99 is crossed out. Its Kindle price is not. It is $0.00.
That is not a typo.
I don’t know for how long it has been free nor for how long it will continue to be. What I do know is that it appears to be a great read. So even if you are not a Yankees fan, just being a sports fan is a good enough reason to treat yourself to this Sports Publishing giveaway.
Every year since 1956, at least one pitcher has won the Cy Young Award, starting with the Brooklyn Dodgers’ Don Newcombe. Many questions can be asked. How many pitchers won the award more than once? Which pitchers achieved that feat? Who won it the most times?
In this post, I focus on just one question: How many players won the Cy Young Award? I will share how I got the answer using the R programming language. I will also be using both RStudio and Sean Lahman’s Baseball Database, an excellent resource. A basic familiarity with R, dplyr, and RStudio is assumed. (Note: As you progress through this post, have RStudio open.)
The database contains multiple files in table format. The table containing the player awards data is AwardsPlayers.RData. It has six variables:
playerID Player ID code
awardID Name of award won
yearID Year
lgID League
tie Award was a tie (Y or N)
notes Notes about the award
AwardsPlayers.RData does not have the players’ names, though it has each player’s ID. Their names are in People.RData. The People table contains 24 variables. In the partial display of its variables, notice that it too contains playerID.
playerID A unique code assigned to each player.
birthYear Year player was born
birthMonth Month player was born
birthDay Day player was born
birthCountry Country where player was born
birthState State where player was born
birthCity City where player was born
nameFirst Player's first name
nameLast Player's last name
weight Player's weight in pounds
height Player's height in inches
bats Player's batting hand (left, right, or both)
throws Player's throwing hand (left or right)
Fortunately, the playerID field links together the data in the tables.
For this tutorial, you need to download the “2019 – R Package” from Lahman’s database. When the webpage appears, click “data.”
You will see a list of downloadable files. The files are in RData format, a format created for use in R. For this tutorial, download these two files: AwardsPlayers.RData and People.RData. The latter is not shown in the image below, which contains a partial file list.
After you click AwardsPlayers.RData, what is shown below will appear. Click View Raw. The file will download to your device. (Note: The way I show you to do something in this tutorial is often not the only way to do it.)
Click the downloaded file, which in this case is AwardsPlayers.RData. On my Mac, it downloaded into the Downloads folder.
When this appears, click Yes.
This should appear in your RStudio Console:
Here is how the AwardsPlayers.RData looks in RStudio’s Global Environment. It is now available for you to work on.
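If you would rather load the file from code than by clicking it, base R's load() does the same job. What follows is only a minimal sketch: the tiny data frame and the temporary file path are stand-ins I made up so the example runs by itself, and your real path is wherever your browser saved the download.

```r
# Toy stand-in for the downloaded table (the real AwardsPlayers is much larger).
AwardsPlayers <- data.frame(
  playerID = c("newcodo01", "spahnwa01"),
  awardID  = c("Cy Young Award", "Cy Young Award"),
  yearID   = c(1956, 1957),
  stringsAsFactors = FALSE
)

# Save it in RData format to a temporary location (hypothetical path).
rdata_path <- file.path(tempdir(), "AwardsPlayers.RData")
save(AwardsPlayers, file = rdata_path)
rm(AwardsPlayers)

# load() restores the saved objects under their original names
# and returns those names as a character vector.
loaded_names <- load(rdata_path)
print(loaded_names)
str(AwardsPlayers)
```

After load() runs, the data frame shows up in the Global Environment just as it does when you click the file.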
Now, in RStudio the two downloaded tables are R data frames. Next, I created an R Markdown file and made copies of both data frames.
AP <- AwardsPlayers
P <- People
Merge AP and P into a new data frame: AP_P. The column common to both, playerID, serves as the link.
AP_P <- merge(AP, P, by="playerID")
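To see what merge() does with the shared column, here is a small sketch on toy data. The two data frames below are cut-down stand-ins I built for illustration, not the real Lahman tables, and they carry only a few of the columns.

```r
# Toy awards table: one row per award won.
AP <- data.frame(
  playerID = c("newcodo01", "koufasa01", "koufasa01"),
  awardID  = c("Cy Young Award", "Cy Young Award", "Cy Young Award"),
  yearID   = c(1956, 1963, 1965),
  stringsAsFactors = FALSE
)

# Toy people table: one row per player, including one with no row in AP.
P <- data.frame(
  playerID  = c("newcodo01", "koufasa01", "mantlmi01"),
  nameFirst = c("Don", "Sandy", "Mickey"),
  nameLast  = c("Newcombe", "Koufax", "Mantle"),
  stringsAsFactors = FALSE
)

# By default merge() keeps only the playerIDs found in both tables,
# so each award row picks up the matching name columns.
AP_P <- merge(AP, P, by = "playerID")
print(AP_P)
```

Because the default is an inner join, the player with no award rows in the toy AP table drops out of the result. Adding all.x = TRUE would instead keep every row of AP even when no name is found.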
View the merged data frame’s variables.
To view the data, while in the Console type
Activate the Tidyverse library, reduce the number of columns, and display the last 10 rows. Note: If you have not used it before, you may need to install it using the R code on the next line.
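The step above might look something like the sketch below. The toy AP_P data frame and the particular columns I keep are assumptions made for illustration; pick whichever columns matter to you.

```r
# install.packages("tidyverse")  # one-time install, if you have not used it before
library(dplyr)  # dplyr is part of the tidyverse; library(tidyverse) also attaches it

# Toy stand-in for the merged data frame, with made-up values.
AP_P <- data.frame(
  playerID  = sprintf("id%02d", 1:12),
  awardID   = rep("Cy Young Award", 12),
  yearID    = 2001:2012,
  nameFirst = rep("First", 12),
  nameLast  = sprintf("Last%02d", 1:12),
  birthCity = rep("Somewhere", 12),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest, then look at the last 10 rows.
AP_P_small <- AP_P %>%
  select(playerID, awardID, yearID, nameFirst, nameLast)

last_ten <- tail(AP_P_small, 10)
print(last_ten)
```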
In this table only the last eight observations are shown.
Go into the View window and set the settings you see below in the first row. Notice that the last year for the Cy Young Award is 2017, thus two years are missing.
Add the missing data to the AP dataset.
# Add the 2018 and 2019 rows that are missing from the downloaded data.
AP <- add_row(AP, playerID = "degroja01", awardID = "Cy Young Award", yearID = 2018, lgID = "NL", tie = "NA", notes = "P")
AP <- add_row(AP, playerID = "snellbl01", awardID = "Cy Young Award", yearID = 2018, lgID = "AL", tie = "NA", notes = "P")
AP <- add_row(AP, playerID = "degroja01", awardID = "Cy Young Award", yearID = 2019, lgID = "NL", tie = "NA", notes = "P")
AP <- add_row(AP, playerID = "verlaju01", awardID = "Cy Young Award", yearID = 2019, lgID = "AL", tie = "NA", notes = "P")
AP <- add_row(AP, playerID = "verlaju01", awardID = "TSN All-Star", yearID = 2019, lgID = "AL", tie = "NA", notes = "P")
AP <- add_row(AP, playerID = "snellbl01", awardID = "TSN All-Star", yearID = 2018, lgID = "AL", tie = "NA", notes = "P")
Exercise: Update the AP_P data frame with the new data added to the AP dataset.
How many players won the Cy Young Award? After piping what is in the AP_P data frame to the select function, we filter it to limit the observations just to those players who won the Cy Young Award. That result is then sorted (in ascending order), and the number of observations in the awardID column is counted.
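A pipeline along those lines could be sketched as follows. The toy AP_P data frame is a stand-in with only a handful of rows, and sorting by yearID is my assumption about the sort key. Note that counting rows gives the number of awards handed out, so I also use n_distinct() to count each player once.

```r
library(dplyr)

# Toy stand-in for the merged data frame.
AP_P <- data.frame(
  playerID = c("newcodo01", "koufasa01", "koufasa01", "koufasa01", "mantlmi01"),
  nameLast = c("Newcombe", "Koufax", "Koufax", "Koufax", "Mantle"),
  awardID  = c("Cy Young Award", "Cy Young Award", "Cy Young Award",
               "Cy Young Award", "Most Valuable Player"),
  yearID   = c(1956, 1963, 1965, 1966, 1956),
  stringsAsFactors = FALSE
)

# Select a few columns, keep only Cy Young rows, sort in ascending order.
cy <- AP_P %>%
  select(playerID, nameLast, awardID, yearID) %>%
  filter(awardID == "Cy Young Award") %>%
  arrange(yearID)

# Number of observations (awards) for the awardID value.
n_awards <- cy %>% count(awardID)

# Number of different players, counting repeat winners once.
n_players <- cy %>% summarise(players = n_distinct(playerID))

print(n_awards)
print(n_players)
```

In the toy data there are four Cy Young rows but only two distinct winners, which is the difference between counting awards and counting players.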
In the 1950s, when Topps became the leading baseball card company, card-collecting gained popularity among kids. The cards were interesting to look at and easy to understand.
Among their main draws were the stats on the card’s back. By quantifying a player’s performance they strengthened our connection to the game.
Henry Louis Aaron’s 1954 Topps card, his rookie card, contained the following column headings: Games, At Bat, Runs, Hits, Doubles, Triples, Home Runs (H.R.), Runs Batted In (R.B.I.), Batting Average (B. Avg.), plus Putouts (P.O.), Assists, Errors, and Fielding Average (F. Avg.). Given that it is his rookie card, the numbers are for his minor league play.
Only two calculations were required, one for batting average, the other for fielding average. Both calculations were easy for kids to do, and all the card’s stats were easy to embed in arguments with friends about who was the better player.
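For anyone who wants to redo that card-back math, here is a short sketch in R using the standard definitions: batting average is hits divided by at bats, and fielding average is putouts plus assists divided by total chances. The numbers are made up for illustration and are not from Aaron's card.

```r
# Batting average: hits per at bat.
batting_average <- function(hits, at_bats) {
  hits / at_bats
}

# Fielding average: successful chances (putouts + assists) over all chances.
fielding_average <- function(putouts, assists, errors) {
  (putouts + assists) / (putouts + assists + errors)
}

# Hypothetical season line, not taken from any real card.
b_avg <- batting_average(hits = 150, at_bats = 500)
f_avg <- fielding_average(putouts = 300, assists = 20, errors = 10)

print(round(b_avg, 3))
print(round(f_avg, 3))
```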
MAJOR LEAGUE BATTING RECORD (LEAGUE LEADER IN ITALICS, TIE*)
H 2B 3B HR RBI SB BB SLG OPS AVG WAR
Notice the absence of the three fielding stats and the additions of SB, BB, SLG, OPS, and WAR. Two of those stats are new to this era: OPS and WAR.
If you are unfamiliar with OPS or WAR or want to learn more about them, a good starting point is Anthony Castrovince’s book, A Fan’s Guide to Baseball Analytics, subtitled “Why WAR, WHIP, wOBA, and Other Advanced Sabermetrics Are Essential to Understanding Modern Baseball.” Castrovince dives deep into his statistical subjects without making readers want to run for a hyperbaric chamber to replace the oxygen used up by the effort of getting through the content.
Here is a taste of what Castrovince has to say about OPS, which is the combination of On-Base Percentage (OBP) and Slugging Percentage (SLG):
OPS tells us how well a player gets on base and how well he hits for power. Now, while it has some major flaws in relaying that information (we’ll get into those in just a sec), it’s still a major step forward from batting average and RBIs. (57)
To understand OPS, you need to know about both SLG and OBP. SLG has been available for years, but OBP has yet to show any gray hairs. SLG is a hitter’s total bases divided by at bats, where total bases is the sum of a batter’s hits + doubles + triples (times 2) + home runs (times 3). The OBP calculation is a bit more complex, but it involves just addition and division.
OBP = (Hits + Walks + Hit By Pitch) / (At Bats + Walks + Hit By Pitch + Sac Flies)
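The SLG and OBP definitions above, plus OPS as their sum, can be sketched in R like this. The function names and the season line are hypothetical, chosen only to show the arithmetic.

```r
# Total bases as described above: hits + doubles + 2 * triples + 3 * home runs.
total_bases <- function(hits, doubles, triples, home_runs) {
  hits + doubles + 2 * triples + 3 * home_runs
}

# Slugging percentage: total bases per at bat.
slg <- function(hits, doubles, triples, home_runs, at_bats) {
  total_bases(hits, doubles, triples, home_runs) / at_bats
}

# On-base percentage: times on base over the plate appearances that count.
obp <- function(hits, walks, hbp, at_bats, sac_flies) {
  (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)
}

# Hypothetical season line, not a real player's.
tb_value  <- total_bases(hits = 150, doubles = 30, triples = 5, home_runs = 25)
slg_value <- slg(hits = 150, doubles = 30, triples = 5, home_runs = 25, at_bats = 500)
obp_value <- obp(hits = 150, walks = 60, hbp = 5, at_bats = 500, sac_flies = 5)

# OPS is simply OBP plus SLG.
ops_value <- obp_value + slg_value

print(round(c(SLG = slg_value, OBP = obp_value, OPS = ops_value), 3))
```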
In a nutshell, OBP reveals how often a batter got on base from a hit, walk, or hit by pitch. Further, for each statistic, Castrovince introduces it by sharing this information:
What it is
What it is not
How it is calculated
[Gives an] [e]xample
Why it matters
Where you can find it
Back to OPS: The highest OPS since 1884, according to stathead.com, for batters appearing in at least 100 games is Barry Bonds’ 1.422. In fact, two of the top three spots belong solely to Bonds. However, because Bonds’ substance abuse issues affect the credibility of his numbers, I consider the OPS of the player who tied Bonds for third to be more valid.
In 1920, in his debut season as a Yankee, Babe Ruth’s OPS of 1.379 matched Bonds’ third-best, but the Babe did it in 11 fewer games. Further, Ruth’s career OPS of 1.164 is MLB’s best: Bonds is in fourth place with 1.051 behind both Ted Williams and Lou Gehrig.
The book does not begin, fortunately, with the latest stats. Instead, it starts with discussions of baseball’s old-timers: batting average and RBIs. After pushing both of them off the top of the baseball mountain where both had reigned for years, Castrovince attacks one of America’s most cherished words, “win,” in the chapter “WINNING ISN’T EVERYTHING.” I knew that, but not what its subtitle stated: “How the Win Came to Be Baseball’s Most Deceptive Pitching Stat.”
America had brainwashed me into believing that “winning is everything,” starting with my last Little League manager, who pulled me from a game because I had tried to stretch a double into a triple — and failed. As I slid into the bag, the third baseman’s gloved hand was waiting for me, the ball I had socked down the third-base line locked in leather.
Though it was not a game-changing play, it was a life-changing one. The manager, who also coached third base, removed me from the lineup because I had failed to see his efforts to get me to stay at second base. What worsened my embarrassment was that the game was the only one in which my father was able to see me play. He and I left before the game ended, which my team won.
Though my team, the Fairyland Flyers — sponsored by a small amusement park — won the League’s championship that season, each player rewarded with a plastic trophy and the opportunity to get sick for free on the park’s rides, the most significant lesson I learned is that how you play a game is much less important than winning the game.
You can be a loser even in a game your team wins.
The win stat has lost even more.
Castrovince states, “The win stat died—effectively if not officially—on November 14, 2018.” That was the day that Jacob deGrom, the Mets’ best pitcher since Tom Seaver, won the National League Cy Young Award though for that season he won only 10 games. Before then, no starter with fewer than 18 wins ever won the award in either league.
Castrovince does a decent job arguing that the win stat does not deserve its long-held reputation, labeling the stat as fatuous. His conclusion: “it is far too unreliable to be taken seriously.”
Was I convinced that it is “far too unreliable”? No. I am convinced that the win stat is not as reliable an indicator of a pitcher’s skill as I thought it was before I read Castrovince’s thoughts about it.
In conclusion, his book increased my knowledge of baseball stats, both old and new. His writing style is readable, his expertise is evident (he writes for MLB.com), and his anecdotes and examples are well chosen.
The book now serves as one of my go-to resources about baseball statistics. Though it does not discuss every stat (one of my favorites, RE24, is not mentioned), it gives enough information about the ones it covers to justify its acquisition.
For more than 60 years, baseball has recognized its mound stars with a plaque that memorializes the achievements of a man whose 22-year career began in 1890.
When he retired at age 45, Denton True “Cy” Young had won 511 games and, until his next-to-last year in 1910, never lost more games in a season than he won.
Young’s greatest achievement may have come on May 5, 1904, when at the age of 37 he pitched the first perfect game in American League history – just the third in the major leagues and the first from the 60-foot-6-inch pitching distance.
Since Newcombe, 117 more pitchers have won the prize.
During those first 11 years, just one pitcher won it more than once. Sandy Koufax won it three times. While others won it three times, only Roger Clemens (7), Randy Johnson (5), Greg Maddux (4), and Steve Carlton (4) won it more than three times.
Koufax was also one of five winners who pitched for the Los Angeles Dodgers, the Dodgers winning the prize five years in a row.
Further, five of the first 11 winners made it to the Hall of Fame: Warren Spahn, Early Wynn, Whitey Ford, Don Drysdale, and Sandy Koufax.
In 1966, Koufax was the last sole winner in a season. After his final receipt of the award, it was given to the best pitcher in each league.
Cy Young won 477 complete games, fully 60 more than Walter Johnson. Only five times did he win a game that he had started but not completed.
At first, those winning it were starters who had won at least 20 games. That ended in 1973 when members of the Baseball Writers’ Association of America awarded it to Tom Seaver though he had a 19-10 record.
A year later, the first reliever won the Cy Young. Mike Marshall, pitching for the Los Angeles Dodgers, finished the 1974 season with a 15-12 record. He stood on the mound in 106 games, 30 more than Rollie Fingers, and pitched in 208.1 innings, starter’s numbers. So he averaged just under two innings a game. Despite all those appearances, his ERA was just 2.42.
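The innings-per-game figure is easy to check in R. The one wrinkle is that baseball writes innings with ".1" meaning one out (a third of an inning) and ".2" meaning two outs, so 208.1 is really 208 and a third. The helper below is a hypothetical name of my own, written only for this check.

```r
# Convert baseball's innings notation (".1" = one out, ".2" = two outs)
# into a true decimal number of innings.
innings_to_decimal <- function(ip) {
  whole <- floor(ip)
  outs  <- round((ip - whole) * 10)
  whole + outs / 3
}

marshall_ip    <- innings_to_decimal(208.1)  # 208 and 1/3 innings
marshall_games <- 106

innings_per_game <- marshall_ip / marshall_games
print(round(innings_per_game, 2))
```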
Though Marshall had 21 saves, he did not lead the league in that category, coming in second to Terry Forster, who had 24 in 59 games. Despite being the runner-up in saves, Marshall was named The Sporting News’ Fireman of the Year.
After Marshall, eight other relievers have won the Cy Young, but none since another Dodger, Eric Gagne, received it in 2003.
Among all pitchers, both starters and relievers, only 11 have won the award in back-to-back seasons. Among them is Mets standout Jacob deGrom, who received it in both 2018 and 2019, winning 21 games. However, unlike Gaylord Perry’s, those wins came not in one season but two: 10 in 2018 and 11 in 2019. Those two win totals were the lowest ever for a Cy Young Award winner who was a starter.
Cy Young threw a baseball until his right arm could no longer obey his mind’s commands.
“All us Youngs could throw,” he said. “I used to kill squirrels with a stone when I was a kid, and my granddad once killed a turkey buzzard on the fly with a rock.”