Unpuzzling R: Consecutive Years

While doing a statistical analysis involving baseball, I needed to find out how many consecutive years a player had played for a team. In this article, I show one way of doing that in R. One of the R programming language’s amazing capabilities is how much you can accomplish with just a small amount of code.¹

In the diagram below, data is shown for three players. The stint (number of years) shown in the year column is consecutive for only Player 3. Player 1 did not play in the season after his second year, and Player 2 skipped a season after his third year.

Input

Below is how the output looks. Player 1 skipped a year after his 1963 season, which is why there is a 2 in the yrDiff column for 1965. Player 2 played continuously from his first through third years, so a “1” is in the yrDiff column for each of those years, but he did not play in 1969, so there is a two-year gap between 1968 and 1970. Player 3 played continuously during his two years with the team.

Output

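library(dplyr)  # provides mutate(), lag(), and %>%
player_data %>%
  # the first row has no previous year (NA), so yrDiff is set to 1 there
  mutate(yrDiff = ifelse(is.na(year - lag(year)), 1, year - lag(year)))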

In this code, dplyr is used. The mutate function creates a new variable named yrDiff. To compute its value, it subtracts the previous row’s year, retrieved with lag, from the current row’s year. Because 1962 is the first item in the column, nothing precedes it, so lag(year) returns NA and the subtraction also yields NA. The is.na check, which asks, “Is the result Not Available?”, therefore returns TRUE. When the is.na result is TRUE, yrDiff is set to 1; when it is FALSE, which means lag(year) found a number, yrDiff is set to the year - lag(year) result.

To represent the input in R, you need the code below.

Input Code

player_data <- data.frame(player = c(1,1,1,2,2,2,2,3,3), year = c(1962,1963,1965,1966,1967,1968,1970,1971,1972))
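The input and the yrDiff step can be combined into one self-contained sketch. Two things here are my assumptions rather than part of the code above: group_by(player) is added so that lag() never reaches back into a previous player's rows, and coalesce() is swapped in as a shorter alternative to ifelse(is.na(...)). On this sample the ungrouped version gives the same numbers, but only because each player's first year happens to fall exactly one year after the previous player's last year.

```r
library(dplyr)

player_data <- data.frame(
  player = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  year   = c(1962, 1963, 1965, 1966, 1967, 1968, 1970, 1971, 1972)
)

result <- player_data %>%
  group_by(player) %>%
  # Within each player, lag(year) is NA on the first row;
  # coalesce() replaces that NA difference with 1.
  mutate(yrDiff = coalesce(year - lag(year), 1)) %>%
  ungroup()

print(result)
```

With real data, where one player's seasons do not line up neatly with the next player's, the grouping step is what keeps the first row of each player from picking up a meaningless difference.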

Let’s look at a real-world example. I recently investigated several baseball-related questions that show the power of R. I obtained the data from stathead.com, formatted it in Apple Numbers, and then imported it into RStudio. The dataset contained 657 observations.

Among the results I obtained was how many games each pitcher started. This R code accomplished that:

allStarters |>
  group_by(Player) |>
  summarize(SumSt = sum(GS)) |>
  arrange(desc(SumSt))
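To try that pipeline without the Stathead export, here is a minimal sketch on a stand-in data frame. The name allStarters_demo, the player labels, and every GS value are invented for illustration; only the column names Player, Year, and GS follow the code above.

```r
library(dplyr)

# Hypothetical stand-in for allStarters: one row per pitcher-season.
# All names and values below are made up.
allStarters_demo <- data.frame(
  Player = c("Pitcher A", "Pitcher A", "Pitcher B", "Pitcher B", "Pitcher B"),
  Year   = c(1970, 1971, 1970, 1972, 1973),
  GS     = c(30, 32, 10, 25, 28)
)

starts <- allStarters_demo |>
  group_by(Player) |>
  summarize(SumSt = sum(GS)) |>   # total games started per pitcher
  arrange(desc(SumSt))            # most starts first

print(starts)
```

The result has one row per pitcher, which is why the real query can rank Seaver, Koosman, and Gooden directly.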

Tom Seaver started 395 games, followed by Jerry Koosman with 346 starts and Dwight Gooden with 303. No other Mets pitcher had 300-plus starts.

To learn how many starts each pitcher had, I grouped each one’s data.

allStarters |>
   group_by(Player)

Two hundred ninety-two pitchers were grouped by season with the years they started games arranged in ascending order. The diagram below contains a sample of the output.

Next, I included the previously discussed mutate code to determine for each pitcher which years were consecutive.

allStarters |>
   arrange(Player, Year) |>
   group_by(Player) |>
   mutate(yrDiff = ifelse(is.na(Year - lag(Year)), 1, Year - lag(Year))) |>
   relocate(yrDiff, .after = Year)

Here is a sample of that code’s output:

In the first yrDiff column for Al Jackson, the “1” means that he started games in 1965 and that 1965 was either the first season he started for the Mets or that he also started games in 1964; whereas, the “3” in the yrDiff column for 1968 means that it had been three seasons since he last started a Mets game.
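The yrDiff column marks where a gap occurs, but it does not yet say how long each unbroken run of seasons was. One possible next step, sketched here on the three-player sample rather than on the Mets data, is to start a new run whenever yrDiff is greater than 1 and then count the rows in each run. The names run_id, run_length, and longest_run are my own, and this is only one of several ways to do it.

```r
library(dplyr)

player_data <- data.frame(
  player = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  year   = c(1962, 1963, 1965, 1966, 1967, 1968, 1970, 1971, 1972)
)

longest_runs <- player_data |>
  arrange(player, year) |>
  group_by(player) |>
  mutate(
    yrDiff = ifelse(is.na(year - lag(year)), 1, year - lag(year)),
    run_id = cumsum(yrDiff > 1)   # increases by 1 at every gap
  ) |>
  group_by(player, run_id) |>
  summarize(run_length = n(), .groups = "drop") |>  # seasons in each run
  group_by(player) |>
  summarize(longest_run = max(run_length))

print(longest_runs)
```

For the sample, Player 1's longest run is two seasons (1962 and 1963), Player 2's is three (1966 through 1968), and Player 3's is two.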

R is a great tool for those interested in doing the statistical analysis of baseball data. To use R effectively, there is a lot to learn; however, I have found the payoff to be well worth the effort expended to get it.


1 It is assumed that you have had some interaction with R or another programming language.

Mets Best Starters by Decade

To win a baseball game, a team needs to outscore its opponents. To do that, it needs to prevent the other team from scoring as many runs as it does. The leader of the prevention part is the pitcher.

No batter leads the offense the same way that a pitcher leads the defense. He — and the catcher — are involved in the most plays in a game, but the pitcher plays a bigger role because what he does initiates the majority of a game’s plays.

A measure of a pitcher’s success in limiting other teams’ run scoring is the RE24 stat. An RE24 of zero means the player is average. On some websites, the higher a pitcher’s RE24, AKA run value, the better the pitcher performed, so a value of +24 would be much better than -24.

Sites that express it that way are Baseball Reference, FanGraphs, and Stathead, with Baseball Reference now calling the RE24 for pitchers “Base-Out Runs Saved.” On other sites, such as Baseball Savant, it is the opposite: the lower a pitcher’s run value, the better. There, a value of -24 would be much better than +24.

Further, the complexity of the RE24 calculation has increased substantially since its early days when it was based on just base/out states and outs. For example, today on Baseball Savant, there is a Pitch Arsenal Stats Leaderboard giving a pitcher’s run value based on pitch type (e.g., changeup) “and on the runners on base, out, [and] ball and strike count,” and a Swing & Take Leaderboard giving for a pitcher a run value based on a pitch’s “outcome (ball, strike, home run, etc).”

In the chart below, the Mets top two starters in each decade based upon their RE24 totals (base-out state) in that decade are shown. The decade leaders are Tom Seaver (twice), Dwight Gooden, Rick Reed, Al Leiter, and (so far in this decade) Jacob deGrom (twice). Those five would make a starting rotation that few Mets fans would complain about.

The second-place finishers include Jerry Koosman, Jon Matlack, Sid Fernandez, Bret Saberhagen, Johan Santana, R. A. Dickey, and Marcus Stroman. Further, Matlack had a higher RE24 than did the first-place finisher in two other full decades: the 1990s and 2000s. Even the second-place finishers would make a strong starting rotation.

One pitcher yet to throw a pitch for the Mets, but who is now a member of the team, Max Scherzer, has in his 14 years in Major League Baseball accumulated an RE24 of 318.5. In that timespan, only two other pitchers have accumulated a higher RE24: Justin Verlander is at 327.22, and Clayton Kershaw is at 431.64.

And in the decade from 2010 to 2019, Scherzer remains in third place with Jacob deGrom in eighth and Carlos Carrasco 33rd.

Mets All-Time Top Catcher

The Mets have had a lot of players behind the plate, “the game’s most demanding position,” according to Jesse Yomtov, starting with Hobie Landrith who, on April 11, 1962, caught the first pitch thrown by a Mets’ starter (Roger Craig).

Five catchers have stood out.

To choose them, five statistics were primarily used: WAR, WPA, RE24, Total Bases, and Times on Base (excluding by error), with WAR and WPA the two dominant ones in that order. In addition, their selection was based solely on their time with the Mets, not on their overall career, as a player could have played for multiple teams.

Among the Mets top five catchers, two are in the Hall of Fame: Mike Piazza and Gary Carter. Piazza played eight seasons for the Mets after playing seven on the Dodgers, Carter five after playing 11 for Montreal. Filling out the list are Jerry Grote, who played 12 seasons in the Big Apple, John Stearns, who played 10, and Todd Hundley, who played nine.

Sources: Stathead Baseball and Baseball Reference

Grote came closest to Piazza in Times on Base, only 91 apart; however, as a Met, Grote played four more seasons than Piazza, who averaged getting on base 183.6 times a season versus 114.8 for Grote.

Based only on their Mets WAR number, the top two are Piazza and Stearns; however, when WPA and RE24 are taken into account, the difference between the two becomes quite significant. And Piazza separates himself even more from the others in Total Bases, having 607 more than the second-most — Grote’s 1278. But then, in his Mets career, Piazza amassed a .542 SLG. No one else in the group came within 100 points of that number.

  • Piazza had the third-highest JAWS rating among all catchers.

Twitter Poll

I found the tweet below after I completed the above write-up and was not surprised by Piazza’s landslide victory. He was one of the Mets most popular players.

Another stat, TOB/TB, helps lengthen Piazza’s lead over the rest of the field. Written about in 2016 by Rob Mains, the TOB/TB Number is calculated using this formula:

  • Multiply Times on Base by Total Bases.
  • Double it.
  • Divide the result by the sum of Times on Base and Total Bases.

Piazza’s TOB/TB number of 1,651 was 325 points ahead of Grote’s, and the average for the top five catchers was 1,170.
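The three steps of the formula amount to the harmonic mean of Times on Base and Total Bases, which can be sketched as a small R function. The inputs below are not taken from the source table; they are approximations I derived from figures in the text (Piazza: about 183.6 times on base over eight seasons, and 607 more total bases than Grote's 1278; Grote: 91 fewer times on base than Piazza), so treat them as illustrative. The function name tob_tb is my own.

```r
# TOB/TB number: multiply Times on Base by Total Bases, double it,
# and divide by the sum of the two (a harmonic mean).
tob_tb <- function(tob, tb) {
  2 * tob * tb / (tob + tb)
}

# Approximate inputs reconstructed from the article's figures (assumptions)
piazza <- tob_tb(tob = 1469, tb = 1885)
grote  <- tob_tb(tob = 1378, tb = 1278)

print(round(c(piazza = piazza, grote = grote, gap = piazza - grote)))
```

Because a harmonic mean is pulled toward the smaller of its two inputs, a player has to be strong in both Times on Base and Total Bases to score well.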

Others’ Views

Tim Boyle, in his catcher comparison, made this comment about Mike Piazza:

“Piazza didn’t have a reputation for playing well defensively. As the years went on, he got worse. I’m not so sure anyone holds this against him. Piazza was far too amazing at the plate for anyone to criticize him for his weaknesses behind it.”

In contrast, Jennifer Khedaroo viewed Piazza’s defensive skill differently, writing:

“In terms of defense, Piazza played well year after year. He was consistently in the top five for putouts, assists, double plays turned and runners caught stealing.”

And though Harold Friend agreed that Piazza was a better hitter than Gary Carter, he still pushed Piazza into second place among the best Mets catchers, Carter’s defensive skill giving him the edge:

“Gary Carter was the most valuable Mets catcher. Piazza will always be rated as the greater player, but Carter was more valuable to the Mets. Gary Carter was (and is) a world champion.

Piazza was the greatest hitting catcher ever. Although he was a good defensive player his first few seasons with the Los Angeles Dodgers, he was a defensive liability during his tenure with the Mets.”

Overall, Friend wrote, “Carter provided great defense, handled an excellent pitching staff magnificently and was a timely clutch hitter.”

In response to Friend, in my opinion the best measure of clutch hitting is WPA. For that stat, Piazza’s score was more than 10 times higher than Carter’s.

With regard to Piazza’s ability behind the plate, Brendan Kuty wrote in an nj.com article that Hall of Famer Tom Glavine “said Piazza’s reputation as a bad defensive catcher is undeserved.”

“He did a lot of things well behind the plate,” Glavine said. “Yeah, he wasn’t the greatest thrower. That unfortunately translated into people thinking that some of this other game wasn’t as good as it was. He called a good game. He received the ball fine. He blocked balls fine.

But so often catchers are defined defensively on how well they throw and there’s much more that goes into just being a good defensive catcher than being able to throw. That aspect of his game, for whatever reason, garnered the extra attention and overshadowed the other aspects of his game.” (from Kuty article)

New Statcast leaderboard hits a grand slam

The latest feature added to Baseball Savant focuses on one of baseball’s most exciting plays: the home run. However, its creator, Daren Willman, tweeted, “Not all home runs are created equal.”

The leaderboard’s startup screen shows all those batters in 2020 who hit at least one long ball that would have been a home run in at least one of Major League Baseball’s 30 ballparks.

On August 9, before any of the day’s games have been played, Yankees slugger Aaron Judge is Major League Baseball’s home run leader with eight. In the Home Runs Leaderboard, if you click anywhere on a player’s row except on his name, details on all his homers in the season you choose will appear, each homer listed on a separate row.

Click on Judge’s row. Below his name should be a table showing the ballparks in which each long ball that Judge hit on the given date would have been a homer. For example, on August 8 in Tampa Bay, the first long ball that Judge hit (against Sean Gilmartin) would have been a four-bagger in every ballpark, but the second long ball he hit (against Nick Anderson) would have been a homer in only 18 parks.

Therefore, for a long ball to qualify for (be included in) the Home Runs Leaderboard it must have been able to be a home run in at least one MLB stadium even if it was not a homer in the ballpark in which it was hit. Those batted balls are labelled as “Doubters,” “Mostly Gone,” or “No Doubters.”

  • If a batted ball would be a homer in fewer than 8 ballparks, it is a “doubter.”
  • If it would be a homer in 8 to 29 parks, it is “mostly gone.”
  • If it would be a home run in every stadium, it is a “no doubter.”

That is why the sum of those three columns (“Doubters,” “Mostly Gone,” “No Doubters”) can exceed the figure in the “Actual HR” column, which is the total number of homers the player actually hit, as occurs with Fernando Tatis Jr.’s numbers. He had six actual homers, but one “doubter,” three “mostly gone,” and six “no doubters,” for a total of 10.
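The three labels can be written as a small R function of the number of ballparks in which a batted ball would have left the yard. This is only a sketch of the thresholds described above, not Baseball Savant's own code; the function name classify_long_ball and its parks argument are hypothetical.

```r
# parks: how many of the 30 MLB ballparks the batted ball would be a homer in.
# A ball must clear at least one park to appear on the leaderboard.
classify_long_ball <- function(parks) {
  stopifnot(all(parks >= 1 & parks <= 30))
  ifelse(parks < 8, "Doubter",
         ifelse(parks < 30, "Mostly Gone", "No Doubter"))
}

labels <- classify_long_ball(c(1, 7, 8, 18, 29, 30))
print(labels)
```

Under these thresholds, Judge's second long ball on August 8, a homer in 18 parks, would be labelled “Mostly Gone,” while his first, a homer in all 30, would be a “No Doubter.”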

Finally, home run data is available for batters, pitchers, and teams for both 2019 and 2020.

Here is a sample of the kinds of questions that Savant’s Home Runs Leaderboard can answer.

Which player has the most “could-be” homers that would have been a home run in only one stadium?

Which Mets player has hit the most actual and “almost” homers so far in 2020? Notice that one of Davis’ “homers” was a non-homer. I label that one a “Could Be” homer.

Who has hit the most “no doubt” home runs this season?

In 2020, which pitcher has given up the most “no doubters”?

The Home Runs Leaderboard is a great resource with eye-catching visuals for statistically-minded baseball fans. One thing that could make it even better is if you could get team data by both division and league. For example, now if I select “Mets” and “Pitchers,” I only get the results for the qualifying Mets pitchers.