Tidyeval and functional programming
TL;DR
Inside data-masking functions (action verbs), we can use injection operators:
+ `{{` embracing operator (`rlang`) OR
+ [`enquo()` followed by] the `!!` operator (also `rlang`, not `base`)
+ `...` forwarded as `...`
+ var_name `:=` function(`{{var}}`)
+ `.data` pronoun.
+ `.env` pronoun.
Learning resources
MORE
https://jonthegeek.com/2018/06/04/writing-custom-tidyverse-functions/
Premise
Premise: tidyverse functions use tidy evaluation (= they don’t evaluate the value of a variable right away = Non-Standard Evaluation).
- (+) This means the variable can be transformed in the abstract before it is evaluated (e.g. treated as a generic “column”)
- (-) It’s hard to refer to variables indirectly, and hence harder to program with
In contrast, normal/base/custom R functions DO evaluate their arguments (e.g. a + b) as soon as they are used = Standard Evaluation.
So, to take full advantage of Non-Standard Evaluation (more interactivity, but also writing custom functions), I need a sort of METAVARIABLE (a “quosure”), i.e. something that doesn’t get evaluated until I say so.
Non-Standard evaluation in tidyverse
- DEFUSING (DELAYING) function arguments: I can create a “quosure” with rlang::enquo() / rlang::enquos(), so an expression can be examined, modified, and injected into other expressions.
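To make the idea of defusing concrete, here is a minimal sketch. The helper name `capture_arg()` and the toy data frame are my own assumptions for illustration; only `enquo()`, `is_quosure()`, `as_label()` and `eval_tidy()` come from `rlang`.

```r
library(rlang)

# Hypothetical helper: defuses (captures) its argument instead of evaluating it
capture_arg <- function(x) {
  enquo(x)
}

# `height` does not exist anywhere yet, and that is fine: nothing is evaluated
q <- capture_arg(height / 100)

is_quosure(q)  # the expression is stored, not computed
as_label(q)    # the captured expression as text

# Evaluate only when we decide to, here against a small toy data frame
toy <- data.frame(height = c(172, 180))
in_metres <- eval_tidy(q, data = toy)
in_metres
```

The quosure carries both the expression and the environment it came from, which is why it can be injected later into a data-masking verb.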
Two (complementary) forms of NSE used in tidyverse
1) TIDY SELECTION is used in SELECTION VERBS
e.g. dplyr::select(), across(), relocate(), rename() and pull() use tidy selection, where expressions are either interpreted in the context of the data frame (e.g. c(cyl, am)) or evaluated in the user environment (e.g. all_of(), starts_with()).
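A small sketch of tidy selection inside a custom function. The wrapper `pick_cols()` and the vector `wanted` are hypothetical names I made up for the example; `select()`, `all_of()` and `starwars` are from `dplyr`.

```r
library(dplyr)

# Hypothetical wrapper around select(): `cols` is a tidy-selection argument
pick_cols <- function(df, cols) {
  df %>% select({{ cols }})
}

# Bare column names are interpreted in the context of the data frame
by_name <- pick_cols(starwars, c(name, height))

# A character vector living in the user environment goes through all_of()
wanted <- c("name", "mass")
by_string <- pick_cols(starwars, all_of(wanted))

names(by_name)
names(by_string)
```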
2) DATA MASKING is used in ACTION VERBS
ACTION VERBS = dplyr::mutate(), arrange(), count(), filter(), group_by(), and summarise(), plus ggplot2::aes().
Normal interactive programming (tidyverse) uses data-masking, which allows you to use variables in the “current” data frame without any extra syntax. This:
- (+) makes it nicer to work interactively (no extra typing for data$column, just column), but
- (-) makes it harder to create your own functions (it can be ambiguous what is a data-variable and what is an env-variable).
SOOOOO we need some way to add $ back into the picture. Passing data-masked arguments to functions requires INJECTION (= quasiquotation), i.e. TO INJECT A FUNCTION ARGUMENT IN A DATA-MASKED CONTEXT, YOU EMBRACE IT WITH {{ }}
Inside data-masking functions (action verbs), we can use injection operators:
+ `{{` embracing operator (`rlang`) OR
+ [`enquo()` followed by] the `!!` operator (also `rlang`, not `base`)
+ `...` forwarded as `...`
+ var_name `:=` function(`{{var}}`)
+ `.data` pronoun.
+ `.env` pronoun.
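The two pronouns are the easiest way to see the data-variable vs env-variable ambiguity. This is only a sketch: the helper `taller_than()` and the global `height` are assumptions invented to create a name clash on purpose.

```r
library(dplyr)

height <- 100  # env-variable that collides with the `height` column

# Hypothetical helper: keep rows whose `height` column exceeds `cutoff`
taller_than <- function(df, cutoff) {
  df %>%
    # .data$ forces the column, .env$ forces the function argument
    filter(.data$height > .env$cutoff)
}

tall <- taller_than(starwars, 200)
tall$height
```

With the pronouns there is no doubt about which `height` is meant, whatever objects happen to exist in the calling environment.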
Different options
1 Defuse (nothing!) + Inject {{ (inside custom f)
library(dplyr)

grouped_mean_1 <- function(df, group_var, summarize_var) {
  df %>%
    # Defuse and inject in a single step with the embracing operator
    group_by({{ group_var }}) %>%
    summarize(mean = mean({{ summarize_var }}, na.rm = TRUE))
}
# call
grouped_mean_1(
  df = starwars,
  group_var = sex,
  summarize_var = height
)
Without {{group_var}} I would get the error “! Must group by variables found in .data. ✖ Column group_var is not found.”
2 Defuse enquo + Inject !! (inside custom f)
# We can tell group_by() not to quote by using !! (pronounced “bang bang”). !! says something like “evaluate me!” or “unquote!”
grouped_mean_2 <- function(df, group_var, summarize_var) {
  ## -- Defuse the user expression in `*_var`
  group_var <- enquo(group_var)
  summarize_var <- enquo(summarize_var)
  df %>%
    ## -- Inject the expression contained in `*_var`
    group_by(!!group_var) %>%
    summarize(mean = mean(!!summarize_var, na.rm = TRUE))
}
# call
grouped_mean_2(
  df = starwars,
  group_var = sex,
  summarize_var = height
)
3 Defuse ... + Inject ...
In this case, summarize_var goes in front and ... last
- ... can stand for multiple variables
# ---- func
grouped_mean_3 <- function(df, summarize_var, ...) {
  ## -- Defuse summarize_var with enquo()
  ## -- The grouping variables travel in ... >>>> NO NEED FOR ENQUO with ... !
  summarize_var <- enquo(summarize_var)
  df %>%
    group_by(...) %>%
    summarize(mean = mean(!!summarize_var, na.rm = TRUE))
}
# ---- call
grouped_mean_3(
  df = starwars,
  sex, homeworld, # (...)
  summarize_var = height
)
With ... we are basically saying “everything I throw at the function will be carried along until I want to evaluate it”
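Forwarding `...` is the short form. If the dots need to be inspected or modified first, they can be defused explicitly with `enquos()` and spliced back with `!!!`. A sketch of that variant follows; `count_by()` is a hypothetical helper, not one of the functions above.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: count rows for any number of grouping columns
count_by <- function(df, ...) {
  # Explicit defusing: one quosure per expression passed through `...`
  group_vars <- enquos(...)
  df %>%
    group_by(!!!group_vars) %>%  # splice the quosures back in with !!!
    summarize(n = n(), .groups = "drop")
}

one_var  <- count_by(starwars, sex)
two_vars <- count_by(starwars, sex, gender)
names(two_vars)
```

When nothing has to happen to the dots in between, plain `group_by(...)` does the same job with less typing.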
Different options with NAMING
For technical reasons, the R language doesn’t support complex expressions on the left of =, but we can use := as a workaround: it allows glue and curly-curly syntax on the left-hand side.
1b (nothing!) + {{ & left side :=
- Super compact left side syntax with "sometext_{{group_var}}" :=
# --- func
grouped_mean_1b <- function(df, group_var, summarize_var) {
  df %>%
    # Defuse and inject in a single step with the embracing operator
    group_by({{ group_var }}) %>%
    summarize("BY_{{group_var}}" := mean({{ summarize_var }}, na.rm = TRUE))
}
# --- call
grouped_mean_1b(
  df = starwars,
  group_var = sex,
  summarize_var = height
)
2b enquo + !! & left side :=
2 things needed here:
+ `as_label(enquo(____var))`
+ left side syntax with `!!str_c("Mean_", ____var) :=`
# --- func
library(stringr) # for str_c()

grouped_mean_2b <- function(df, group_var, summarize_var) {
  ## -- Defuse the user expression in `*_var`
  group_var <- enquo(group_var)
  summarize_var <- enquo(summarize_var)
  ## -- Keep the label in its own variable: injecting the label itself
  ## -- would compute mean("height") instead of the mean of the column
  summarize_var_name <- as_label(summarize_var)
  df %>%
    ## -- Inject the expression contained in `*_var`
    group_by(!!group_var) %>%
    summarize(!!str_c("Mean_", summarize_var_name) := mean(!!summarize_var, na.rm = TRUE))
}
# --- call
grouped_mean_2b(
  df = starwars,
  group_var = sex,
  summarize_var = height
)
3b ... + ... & left side :=
## -- define function
grouped_mean_3b <- function(df, summarize_var, ...) {
  # group_var = ... NO NEED FOR ENQUO!
  summarize_var <- enquo(summarize_var)
  # summarize_var is already a quosure here, so label it directly
  summarize_var_name <- as_label(summarize_var)
  df %>%
    group_by(...) %>%
    # summarize(!!summarize_var_name := mean(!!summarize_var, na.rm = TRUE))
    # or
    summarize(!!str_c("My_mean_", summarize_var_name) := mean(!!summarize_var, na.rm = TRUE))
    # ERROR: the left-hand side below is a call to str_c(), not a string
    # summarize(str_c("Mean_", !!summarize_var_name) := mean(!!summarize_var, na.rm = TRUE))
}
## -- call function
grouped_mean_3b(
  df = starwars,
  sex, homeworld, # group_var
  summarize_var = height
)
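WATCH OUT!!! Strangely enough, the unquoting must apply to the WHOLE left-hand side of :=, not just to the quoted variable as I thought:
+ !!summarize_var_name := ... OK
+ !!str_c("Mean_", summarize_var_name) := ... OK: after !! the left-hand side evaluates to a string
+ str_c("Mean_", !!summarize_var_name) := ... WRONG: the left-hand side is still a call to str_c(), and := only accepts a string, a symbol, or a whole expression unquoted with !! there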
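The rule can be checked on a tiny example. This is a sketch with a made-up data frame (`df`, `col`, `ok` and `bad` are all illustrative names) and `paste0()` standing in for `str_c()`; the failing call is wrapped in `tryCatch()` so the script keeps running.

```r
library(dplyr)

df  <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))
col <- "x"

# Works: the whole left-hand side is unquoted, so `:=` receives a string
ok <- df %>%
  group_by(g) %>%
  summarize(!!paste0("Mean_", col) := mean(.data[[col]]))

# Fails: the left-hand side is a call to paste0(), not a string or symbol
bad <- tryCatch(
  df %>% summarize(paste0("Mean_", !!col) := mean(.data[[col]])),
  error = function(e) "rejected"
)

names(ok)
bad
```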
Adding the .data syntax
It’s good practice to prefix named arguments with a . (e.g. .data) to reduce the risk of conflicts between your arguments and the arguments passed to ...
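A quick sketch of why the dot prefix helps. The helper `add_columns()` is hypothetical: because its own argument is called `.data`, a user can still create a column literally named `data` through `...` without any clash.

```r
library(dplyr)

# Hypothetical helper: the fixed argument carries a dot prefix,
# so names coming in through `...` are unlikely to collide with it
add_columns <- function(.data, ...) {
  .data %>% mutate(...)
}

# `data = x * 2` travels through `...` and becomes a new column named `data`
out <- add_columns(data.frame(x = 1:3), data = x * 2)
out
```

Had the argument been called `data`, the same call would have tried to use `x * 2` as the data frame itself.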
1.c Inject {{ + specify .data
Adding a new generic function argument data (up until now I was hardcoding it when executing the function)
Simple data
With .data pronoun
The .data pronoun is a tidy eval feature that is enabled in all data-masked arguments, just like {{
With .data[[var]] pronoun
- If you have a variable name in a string, use the .data pronoun to subset that variable with [[.
## -- write function
my_sum3 <- function(data, num_var, group_var) {
  # the argument is called `data`, with no `.` here!!!
  data %>%
    # group_var is a string, so it goes through .data[[ ]]
    dplyr::group_by(.data[[group_var]]) %>%
    # the output name needs {{num_var}}; a bare num_var will not work!
    dplyr::summarise(
      "weighted_count_{{num_var}}" := sum({{ num_var }}, na.rm = TRUE))
}
group_var <- "homeworld"
## -- call function
my_sum3(starwars, mass, group_var)
Iteration purrr on top of (dplyr) NSE
Now I want to iteratively execute my custom function my_mean over a vector of grouping variables group_var_vec.
Here the question is how to specify the function
Option 1) purrr + existing function .f …
- my big “gotcha” here is that I must not specify .x in the arguments after .f, as I kept doing… 🤯
Option 2) purrr + anonymous function as formula ~ fun()
- 🤔🤷🏻♀️ not quite sure why it should be “anonymous” given that I gave it a name… but it works nonetheless (the anonymous part is the ~ wrapper around my_mean, which has no name of its own)
BTW map can output a row-bound df here
# map_dfr() returns data frames bound by row
groupmeans_dfr <- map_dfr(
  .x = group_var_vec, # x
  ~ my_mean(          # function call YESS!! with ()
    data = starwars, num_var = height, group_var = .x # ALL arguments
  )
)
groupmeans_dfr
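Since the definitions of my_mean and group_var_vec are not shown here, the following is a self-contained sketch under assumptions: this `my_mean()` is my own guess at a compatible function (string grouping variable via `.data[[ ]]`, embraced numeric variable), and the values in `group_var_vec` are just examples. It uses `map()` plus `bind_rows()` as an alternative to `map_dfr()`.

```r
library(dplyr)
library(purrr)

# Assumed definition: a guess at what my_mean() could look like
my_mean <- function(data, num_var, group_var) {
  data %>%
    group_by(group = .data[[group_var]]) %>%
    summarise(mean = mean({{ num_var }}, na.rm = TRUE), .groups = "drop") %>%
    # keep track of which grouping variable produced each row
    mutate(group_var = .env$group_var, .before = 1)
}

group_var_vec <- c("sex", "gender") # assumed example values

# Formula lambda: .x is each element of group_var_vec in turn
groupmeans <- map(
  group_var_vec,
  ~ my_mean(data = starwars, num_var = height, group_var = .x)
) %>%
  bind_rows()

groupmeans
```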
_______
Iteration purrr with (purrr) NSE
Here I am trying a different recipe: i.e. putting NSE inside the purrr function, instead of inside the dplyr function. For example
—- 1 masked group_var in purrr::map
—- 2a masked num_var in purrr::map
- using syms
- 🤯 Note: I need .x = syms(num_var) instead of .x = num_var, because map would understand the vector as characters and not as symbols
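num_var <- c("height", "birth_year")
NSE_map_n <- map(
  # .x = num_var,      # WRONG, map gets a vector of characters
  # .x = {{num_var}},  # also WRONG: outside a data-masking call {{ }} is just two pairs of braces, so this is still a character vector
  .x = syms(num_var),  # RIGHT, map gets a list of symbols
  .f = ~ {
    starwars %>%
      group_by(species) %>%
      # .x holds a symbol as a value, so it is injected with !!
      summarise(!!paste0("mean_", rlang::as_string(.x)) := mean(!!.x, na.rm = TRUE))
  }
)
pluck(NSE_map_n, 1)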
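An alternative sketch that avoids symbols altogether: keep the plain strings and subset with `.data[[ ]]`, building the name with glue syntax. The names `v` and `means_by_species` are my own, and this is just one possible variant, not necessarily better than `syms()`.

```r
library(dplyr)
library(purrr)

num_var <- c("height", "birth_year")

# Each element of num_var arrives as a string `v`
means_by_species <- map(num_var, function(v) {
  starwars %>%
    group_by(species) %>%
    # "{v}" interpolates the string into the column name,
    # .data[[v]] looks the column up by that string
    summarise("mean_{v}" := mean(.data[[v]], na.rm = TRUE))
})

pluck(means_by_species, 1)
```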
—- 2b masked num_var in purrr::map
- using A) Defuse enquo + B) Inject !!
_______ WRONG (start)_____
—- 3 masked group_var + num_var vars in purrr::map
DOESN'T REALLY WORK !!!!
_______ WRONG (end)_____
Tidyeval in purrr + custom function
1/2 single (group, num) function
my_mean_2 <- function(data, group_var, num_var) {
  quo_group_var <- enquo(group_var)
  quo_num_var <- enquo(num_var)
  # the argument is called `data`, with no `.` here!!!
  data %>%
    # inject the quosure directly; .data[[ ]] is only for column names stored as strings
    dplyr::group_by(!!quo_group_var) %>%
    dplyr::summarise(
      !!paste0("mean_of_", as_label(quo_num_var)) := mean(!!quo_num_var, na.rm = TRUE))
}
# test
test1 <- my_mean_2(data = starwars, group_var = sex, num_var = height)
2/2 map function
https://stackoverflow.com/questions/47870838/how-to-loop-over-a-tidy-eval-function-using-purrr
_______
[digression] pluck
# build list
movies <- c("A New Hope",
"The Empire Strikes Back",
"Return of the Jedi",
"Phantom Menace",
"Attack of the Clones",
"Revenge of the Sith",
"The Force Awakens",
"The Last Jedi",
"Rise of Skywalker")
years <- c(1977, 1980, 1983, 1999, 2002, 2005, 2015, 2017, 2019)
film_l <- list(movies, years,
preference = c(2, 1, 3, 7, 8, 9, 4, 6, 5))
Here is an easier way to extract elements from lists
# 1st element of the list and 1st item of it
film_l[[1]][1]
pluck(film_l, 1, 1)
… which can go inside a pipe
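For instance, a minimal sketch with a shortened, fully named version of the list (the name `films_small` and the element names `movies` / `years` are assumptions made for this example):

```r
library(purrr) # also provides the %>% pipe

films_small <- list(
  movies = c("A New Hope", "The Empire Strikes Back"),
  years  = c(1977, 1980)
)

# Second element of the list, then its first item
first_year <- films_small %>% pluck(2, 1)

# Names work as well as positions
second_movie <- films_small %>% pluck("movies", 2)

first_year
second_movie
```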