Unpuzzling R: Consecutive Years

When doing a statistical analysis involving baseball, I needed to find out for how many consecutive years a player has played for a team. In this article, I reveal one way of doing that using R. One of the R programming language’s amazing capabilities is how much you can accomplish with just a small amount of code [1].

In the diagram below, data is shown for three players. The stint (number of years) shown in the year column is consecutive for only Player 3. Player 1 did not play in the season after his second year, and Player 2 skipped a season after his third year.

Input

Below is how the output looks. Player 1 skipped a year after his 1963 season, which is why there is a 2 in the yrDiff column for 1965. Player 2 played continuously from his first through third years, so a “1” is in the yrDiff column for each of those years, but he did not play in 1969, so there is a two-year gap between 1968 and 1970. Player 3 played continuously during his two years with the team.

Output

library(dplyr)  # provides %>%, mutate, and lag

player_data %>%
  mutate(yrDiff = ifelse(is.na(year - lag(year)), 1, year - lag(year)))

In this code, dplyr is used. The mutate function creates a new variable named yrDiff. To compute yrDiff for a row, the code subtracts the previous row’s year, obtained with lag, from the current row’s year. As 1962 is the first data item in the column, nothing precedes it, so lag returns NA and the subtraction also yields NA. Therefore, the is.na check, which asks, “Is the value Not Available?”, returns TRUE. When the is.na result is TRUE, yrDiff is set to 1; when it is FALSE, which means lag(year) found a number, yrDiff is set to the result of year - lag(year).

To represent the input in R, you need the code below.

Input Code

player_data <- data.frame(player = c(1,1,1,2,2,2,2,3,3), year = c(1962,1963,1965,1966,1967,1968,1970,1971,1972))
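The mutate call shown earlier compares each row with the row before it regardless of player. With this sample data that happens to give the right answer, because each player's first season immediately follows the previous player's last one (1966 follows 1965, and 1971 follows 1970). A minimal sketch, assuming the dplyr package is installed, that adds group_by(player) so the comparison restarts for each player (the name result is only illustrative):

```r
library(dplyr)

# Sample input from the article
player_data <- data.frame(
  player = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  year   = c(1962, 1963, 1965, 1966, 1967, 1968, 1970, 1971, 1972)
)

result <- player_data %>%
  group_by(player) %>%   # lag() now looks back only within the same player
  mutate(yrDiff = ifelse(is.na(year - lag(year)), 1, year - lag(year))) %>%
  ungroup()

print(result)
```

Grouping matters as soon as the data contains players whose seasons are not back to back, which is the case in the real-world example later in this article.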

Let’s look at a real-world example. I recently investigated several baseball-related questions that show the power of R. I obtained the data from stathead.com, formatted it in Apple Numbers, and then imported it into RStudio. The dataset contained 657 observations.

Among the results I obtained was how many games each pitcher started. This R code accomplished that:

allStarters |>
  group_by(Player) |>
  summarize(SumSt = sum(GS)) |>
  arrange(desc(SumSt))

Tom Seaver started 395 games, followed by Jerry Koosman with 346 starts and Dwight Gooden with 303. No other Mets’ pitcher had 300-plus starts.

To learn how many starts each pitcher had, I grouped each one’s data.

allStarters |>
   group_by(Player)

Two hundred ninety-two pitchers were grouped by season with the years they started games arranged in ascending order. The diagram below contains a sample of the output.

Next, I included the previously discussed mutate code to determine for each pitcher which years were consecutive.

allStarters |>
   arrange(Player, Year) |>
   group_by(Player) |>
   mutate(yrDiff = ifelse(is.na(Year - lag(Year)), 1, Year - lag(Year))) |>
   relocate(yrDiff, .after = Year)

Here is a sample of that code’s output:

In the first yrDiff row for Al Jackson, the “1” means that he started games in 1965 and that 1965 was either the first season he started for the Mets or that he also started games in 1964. The “3” in the yrDiff column for 1968 means that it had been three seasons since he last started a Mets game.
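The yrDiff values can be taken one step further to answer the original question of how many consecutive years a player played. The sketch below is one possible approach, not necessarily the one used for the Mets data: it starts a new streak whenever yrDiff is not 1 and then counts the seasons in each streak. It uses the small player_data sample from earlier in the article, and the names streakId, firstYear, lastYear, and seasons are made up for illustration.

```r
library(dplyr)

# Sample input from the article
player_data <- data.frame(
  player = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  year   = c(1962, 1963, 1965, 1966, 1967, 1968, 1970, 1971, 1972)
)

streaks <- player_data %>%
  arrange(player, year) %>%
  group_by(player) %>%
  mutate(
    yrDiff   = ifelse(is.na(year - lag(year)), 1, year - lag(year)),
    streakId = cumsum(yrDiff != 1) + 1   # increments each time a gap appears
  ) %>%
  group_by(player, streakId) %>%
  summarize(
    firstYear = min(year),
    lastYear  = max(year),
    seasons   = n(),                     # consecutive seasons in this streak
    .groups   = "drop"
  )

print(streaks)
```

For the sample data, Player 2 gets a three-season streak (1966 to 1968) followed by a one-season streak (1970).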

R is a great tool for those interested in doing the statistical analysis of baseball data. To use R effectively, there is a lot to learn; however, I have found the payoff to be well worth the effort expended to get it.


1 It is assumed that you have had some interaction with R or another programming language.

Mets Best Starters by Decade

To win a baseball game, a team needs to outscore its opponents. To do that, it needs to prevent the other team from scoring as many runs as it does. The leader of the prevention part is the pitcher.

No batter leads the offense the same way that a pitcher leads the defense. The pitcher and the catcher are involved in the most plays in a game, but the pitcher plays the bigger role because what he does initiates the majority of a game’s plays.

A measure of a pitcher’s success in limiting other teams’ run scoring is the RE24 stat. An RE24 of zero means the player is average. On some websites, the higher a pitcher’s RE24, AKA run value, the better the pitcher performed, so a value of +24 would be much better than -24.

Sites that express it that way are Baseball Reference, FanGraphs, and Stathead, with Baseball Reference now calling the RE24 for pitchers “Base-Out Runs Saved.” On other sites, such as Baseball Savant, it is the opposite: the lower a pitcher’s run value, the better. There, a value of -24 would be much better than +24.

Further, the complexity of the RE24 calculation has increased substantially since its early days when it was based on just base/out states and outs. For example, today on Baseball Savant, there is a Pitch Arsenal Stats Leaderboard giving a pitcher’s run value based on pitch type (e.g., changeup) “and on the runners on base, out, [and] ball and strike count,” and a Swing & Take Leaderboard giving for a pitcher a run value based on a pitch’s “outcome (ball, strike, home run, etc).”

In the chart below, the Mets’ top two starters in each decade based upon their RE24 totals (base-out state) in that decade are shown. The decade leaders are Tom Seaver (twice), Dwight Gooden, Rick Reed, Al Leiter, and (so far in this decade) Jacob deGrom (twice). Those five would make a starting rotation that few Mets fans would complain about.
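A ranking like the one in the chart can be produced with a few lines of dplyr. The sketch below is only an illustration of the idea: the data frame pitcher_seasons, its players A, B, and C, and its RE24 values are all made up, and the column name RE24 is an assumption about how the data would be laid out. It totals RE24 per pitcher per decade and keeps the two highest totals in each decade, using the convention in which a higher RE24 is better.

```r
library(dplyr)

# Hypothetical season-level data; the RE24 values are invented for illustration
pitcher_seasons <- data.frame(
  Player = c("A", "A", "B", "B", "C", "C", "A", "B", "C"),
  Year   = c(1971, 1972, 1971, 1972, 1971, 1972, 1981, 1981, 1981),
  RE24   = c(30, 25, 10, 12, -5, 8, 5, 20, 15)
)

top_two <- pitcher_seasons %>%
  mutate(Decade = (Year %/% 10) * 10) %>%          # 1971 becomes 1970, and so on
  group_by(Decade, Player) %>%
  summarize(RE24 = sum(RE24), .groups = "drop") %>% # decade total per pitcher
  group_by(Decade) %>%
  slice_max(RE24, n = 2, with_ties = FALSE) %>%     # two highest totals per decade
  ungroup()

print(top_two)
```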

The second-place finishers include Jerry Koosman, Jon Matlack, Sid Fernandez, Bret Saberhagen, Johan Santana, R. A. Dickey, and Marcus Stroman. Further, Matlack had a higher RE24 than did the first-place finisher in two other full decades: the 1990s and 2000s. Even the second-place finishers would make a strong starting rotation.

One pitcher yet to throw a pitch for the Mets, but who is now a member of the team, Max Scherzer, has in his 14 years in Major League Baseball accumulated an RE24 of 318.5. In that timespan, only two other pitchers have accumulated a higher RE24: Justin Verlander is at 327.22, and Clayton Kershaw is at 431.64.

And in the decade from 2010 to 2019, Scherzer remains in third place with Jacob deGrom in eighth and Carlos Carrasco 33rd.

Stathead School: Get Home Game Info

For this search, you need to use Stathead Baseball’s Split Finders tool. It can be used to get both player and team data for both batting and pitching. When used for team batting, you can search one or multiple seasons. As I just wanted the data for a single season, I did a single season search.

Here is how to use Stathead Baseball to get the results in the above tweet.

  1. Go to Split Finders > Team Batting.
  2. Set Sort By to Descending and OBP.
  3. Make Seasons “2021 to 2021”.
  4. For Choose Split Type, select Home or Away.
  5. For Choose A Split, select Home.
  6. Under Team Filters, click or tap Choose a Team Filter.
  7. Then, click or tap Team.
  8. Click or tap Any Team.
  9. Select New York Mets.
  10. Click or tap Get Results.

Under Current Search, you should see this:
▶︎ In the Regular Season, in 2021, For NYM, In the AL or NL or FL, Home (within Home or Away), sorted by greatest On-Base%.

To get the away game data, repeat the above steps, making this one change:
5. For Choose A Split, select Away.

View the Stathead results for home games and away games.

Mets trade for oldest MLB pitcher

The Mets just traded two players to the Tampa Bay Rays for Rich Hill who, at 41, is the oldest pitcher in Major League Baseball and the second-oldest player, the oldest being Albert Pujols.

In 19 games this season, all starts, Hill has pitched best the first and third times through an opposing team’s lineup. In those trips, batters hit only .190 and .164 against him, but during his second time through a lineup they hit .276 with a .530 slugging percentage. See table below.

Source: https://www.baseball-reference.com/players/split.fcgi?id=hillri01&year=2021&t=p

The data in the table reveal that his best role could be either as an opener or a reliever. However, it is likely that the Mets will mainly use him as a starter.

Hill has done better against left-handed batters, holding them to a .158 batting average. In contrast, right-handed batters hit .232 against him, still low but 74 points higher.

Against his fastball, his most frequently thrown pitch, batters hit .257, 71 points higher than their BA (.186) against his breaking pitches — but they were not his most effective pitches. Those were his off-speed pitches with a .143 BA.

His average pitch velocity this season of 80.6 mph is his second-lowest since 2015. His hard-hit% is higher than in any previous Statcast season: 6.4%.

In 2021, he has thrown 785 fastballs, 623 breaking balls, and 41 off-speed pitches.

Hill has been much more effective when pitching without any runners on base. Then, batters are hitting just .203. With runners on base, opposing batters’ BA jumps 47 points to .250.

He last pitched on July 18 against the Braves. In 4 IP, he gave up six hits, three earned runs, and two walks, while striking out four.