Synopsis

R users often write code that looks deeply peculiar to others. Ever wondered how the following could possibly work:

  1. variable names show up as plot labels with a minimal plot() call
  2. operations on data frames refer to columns by name without quotes
  3. model formulas fit against any data, and operators like * inside them are no longer arithmetic

Metaprogramming underlies many of the peculiarities of R. Among the most intriguing designs of the R language, it is arguably the key that allows R users to write expressive code centered on data analysis, rather than buried in minute programming details.

This post provides an introduction to the R metaprogramming design. After discussing basic concepts, I end with an introduction to metaprogramming patterns supported by the tidyverse, most notably in the rlang package.

TL;DR: Refer to the example at the end which uses many features discussed in this post.

The R’ posts are my study notes of the Advanced R book. Inaccurate information here is in all likelihood my fault.

Note: code in this post often requires R >= 4.1.0, which provides the shorthand function notation \(x) and the native pipe |>.

This work is licensed under CC BY-SA 4.0

What is metaprogramming

In the broadest sense, metaprogramming is the process where a computer program manipulates another computer program [1]. This umbrella term could take different meanings:

  1. Programs that modify/generate other programs are metaprograms. This category includes parsers, compilers, etc. For instance, high-level language compilers translate source code into low-level intermediate representations [2].
  2. Programs that modify themselves are also metaprograms. Many languages offer tools that allow a program to inspect and modify its own structure at run-time and/or at compile-time. This is often called reflection.

In R, metaprogramming means both. You can write R programs that allow users to generate programs in other languages using R alone (without knowledge of the target languages); on the other hand, R programs have full read/write access to their internal runtime structure.

To further understand how metaprogramming is used in language designs, we need to think about ‘code’ versus ‘execution/evaluation of code’. To keep it simple, we henceforth restrict our discussion to the R language.

Metaprogramming in R

Parser and evaluator

We as R users write plain-text code either in the interactive console or in source files. When we “run” our code:

  1. The R parser translates plain-text code into an R object representation. The parser interprets plain-text code by syntax [3]. Therefore, at this stage, code only has to be syntactically correct.
  2. The R evaluator evaluates parsed code. At this stage, the evaluator resolves values of symbols in the parsed code and executes function calls to yield results [4].

Importantly, the parser only needs the source code. However, to evaluate the parsed code, the evaluator needs an environment with which to resolve the symbols.

Note: there are many (mostly) interchangeable terms for ‘parsed code’, including abstract syntax tree and concrete syntax tree (aka parse tree).

With this model in mind, R supports metaprogramming by allowing users to:

  1. Parse code and not evaluate (aka “quoting”). This generates a parse tree, thereby enabling subsequent inspection and modification of the tree. Quoting converts code to data by running the parser but not the evaluator.
  2. Evaluate the parse tree in an arbitrary environment (aka “evaluation”, or a close relative “unquoting”). Evaluation generates computation results of the parsed code [5].
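The two stages can be separated by hand in base R. The sketch below parses a string whose symbols do not exist yet, then evaluates the resulting call only after an environment supplies them. The names undefined_fn, y and env are made up for illustration.

```r
# Stage 1: parsing only checks syntax, so undefined symbols are fine.
parsed <- parse(text = "undefined_fn(y) + 1")[[1]]
is.call(parsed)   # the parsed code is an ordinary R object (a call)

# Without definitions for its symbols, evaluation fails.
failed <- inherits(
  try(eval(parsed, envir = baseenv()), silent = TRUE),
  "try-error"
)

# Stage 2: evaluation succeeds once an environment resolves every symbol.
env <- new.env()
env$undefined_fn <- function(v) v * 2
env$y <- 10
result <- eval(parsed, envir = env)   # computes (10 * 2) + 1
```

The same parsed object could be evaluated again in a different environment and give a different result, which is what makes "evaluate in an arbitrary environment" useful.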

Next, we discuss the quoting process, which is the mapping code -> expression.

Quoting: code -> expression

Parsed code segments are stored as language objects, which come in three base types: symbol (aka name), call and constant. Expression, as an R object type, is a list-like type which may contain multiple symbols, constants and/or calls. For the sake of simplicity, we subsequently refer to parsed code as expression. When we talk about an expression, in most cases that refers to a single symbol or a single call [6].

symbol/name

A symbol refers to R objects by name. To define a symbol:

t <- quote(x) # Define a symbol `x` and assign to t
t <- as.symbol("x") # Alternative base
t <- rlang::expr(x) # Equivalent rlang
t <- rlang::sym("x") # Alternative rlang

call

A call refers to an unevaluated function call. To define a function call:

fc <- quote(max(1,2,x)) # Define a call max(1,2,x) where x is expression
fc <- call("max",1,2,quote(x)) # Alternative base
fc <- rlang::expr(max(1,2,x)) # Equivalent rlang
fc <- rlang::call2("max",1,2,rlang::expr(x)) # Alternative rlang

Notes:

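  1. The class() of the result of quote() or expr() depends on the input: "name" for a symbol, "call" for most calls, and the constant's own class for a constant.
  2. The function symbol max and parameter x can be undefined (again, quoting only checks syntax). One caveat: rlang::call2() takes the function as a string, a symbol or a function object, so a bare unquoted name such as call2(max, ...) is evaluated like any ordinary argument and must therefore be defined.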
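Because a call is data, it can be inspected and edited like a list before anything runs. A minimal base R sketch of that idea follows; the replacement function min and the value bound to x are arbitrary choices for illustration.

```r
fc <- quote(max(1, 2, x))    # neither evaluated nor checked: x is undefined here

fc[[1]]                      # the first element is the function symbol `max`
fc[[1]] <- as.symbol("min")  # swap the function
fc$na.rm <- TRUE             # append a named argument

modified <- deparse(fc)          # turn the edited call back into text
value <- eval(fc, list(x = -5))  # supply x only at evaluation time
```

The symbol x only has to resolve at the very last step, when eval() is given something to look it up in.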

parse()

It is natural to think that one should use parse(), which runs the R parser. After all, quoting is parsing but not executing.

However, parse() is not the best choice for most applications because it is not quite user-friendly. It always returns an expression list, because by design it is meant to parse script files.

We do not use parse() for quoting. To show the simplest case of parse(), consider the following example:

exprs <- parse(text="x;mean(1,2,x)") # expression list
exprs[[1]] # is symbol x
exprs[[2]] # is call mean(1,2,x)
exprs[1] # expression list of length-1
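When the code really does arrive as a string, a single expression is often easier to work with than an expression list. As a hedged aside (this helper is not used elsewhere in this post), base R has offered str2lang() for that purpose since R 3.6.0, and indexing the result of parse() gives the same kind of object.

```r
single <- str2lang("mean(1, 2, x)")               # one call, not an expression list
from_parse <- parse(text = "mean(1, 2, x)")[[1]]  # first element of the list

class(single)                # "call"
same_code <- identical(deparse(single), deparse(from_parse))

sym <- str2lang("x")         # a lone name comes back as a symbol
```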

defusing: quoting function arguments

Previous sections showed some examples of quoting: parsing and not evaluating. It might surprise you that the same code can NOT perform quoting when put in a function body:

f1 <- \(x) quote(x)
f2 <- \(x) rlang::expr(x) # Equivalent rlang

f1(a+b+c) # return symbol `x`
f2(a+b+c) # also yield symbol `x`

In fact, the input parameter is never read or evaluated at all! quote(x) always returns the symbol x because it quotes the plain-text code it is given (i.e., text as read by the parser), not whatever the caller passed as x.

To allow quoting inside a function, we use quoting helpers that take advantage of delayed evaluation of R function arguments. Refer to my previous post for more details [7].

# Quoting function parameter
f1 <- \(x) substitute(x) # Quoting AND substitution
f2 <- \(x) rlang::enexpr(x) # Equivalent rlang

f1(a+b+c) # return call `a+b+c`
f2(a+b+c) # also yield call `a+b+c`
# Note: a+b+c is a nested call (more than 1 call)
#		check lobstr::ast(f1(a+b+c))
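The same mechanism explains the first puzzle from the synopsis. A minimal sketch, assuming a hypothetical helper label_of(): combining substitute() with deparse() turns whatever the caller typed into a string, which is broadly how base plotting functions derive their default axis labels.

```r
# Hypothetical helper: recover the caller's code for an argument as text.
label_of <- function(x) deparse(substitute(x))

label_a <- label_of(mtcars$mpg)
label_b <- label_of(log(wt))   # `wt` is never evaluated, so it need not exist
```

Since the argument is only quoted and never forced, the second call works even though no object named wt is defined.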

rlang provides a few more function parameter quoting helpers:

# Quoting ...
f1 <- \(...) rlang::enexprs(...)
f1(a=1,b=x+y,5) # returns a named list of expressions
# Explicitly asks for symbol and reject call&const
f2 <- \(x) rlang::ensym(x)
f2(t) # works, because t is a symbol
f2(a+b) # errors, because a+b is a call
f3 <- \(...) rlang::ensyms(...)
f3(a,b,c) # works, because all are symbols
f3(1,b,x+y) # fails, because `1` is const and `x+y` is call

Evaluation: expression -> result

Quoting translates plain-text code to an expression with the parser. Evaluation yields the result of an expression by resolving the values of all its symbols with the evaluator.

eval()

eval() is the key interface to the R evaluator and is used by multiple evaluation helper functions in base R, including local() and source().

To resolve symbols, you need to provide an environment (i.e., symbol table). eval() provides the following options:

  1. providing an environment by eval(expr, envir = ENV).
  2. providing a data frame that eval() will first use to resolve symbols, eval(expr, envir = DF, enclos = ENV). ENV is the environment in which to look up symbols not found in DF. DF is also intuitively called a “data mask” [8].
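Both options can be tried with base R alone. In the sketch below, scale_by and env are illustrative names; the point is only to show where each symbol ends up being found.

```r
ex <- quote(head(mpg, 3) / scale_by)

# `scale_by` lives in a separate environment, not in the data frame.
env <- new.env()
env$scale_by <- 10

# Option 2: mpg is found in the data mask first; scale_by and head() are
# then looked up in `enclos` and its parent environments.
masked <- eval(ex, envir = mtcars, enclos = env)

# Option 1: a plain environment, with no data frame involved.
plain <- eval(quote(scale_by * 2), envir = env)
```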

Quasiquotation: partial evaluation during quotation

In previous sections, I showed how we can convert between code and data by quoting and evaluation. Quasiquotation is the combination of quoting an expression while allowing immediate evaluation (unquoting) of part of that expression [9].

This feature is implemented by the rlang package of the tidyverse, which defines two syntactic operators, !! and !!!. These two operators are ONLY valid when used in arguments of rlang quoting functions [10].

require(rlang)
# !! (unquote of different objects)
#		of function: put function object
my_mean <- \(x) exp(mean(log(x))) # geometric mean
expr(m <- (!!my_mean)(x))
#		of constant: put constant (object)
my_bias <- 100
expr(m <- mean(x) + !!my_bias)
#		of expression: put the expr "content"
my_bias <- expr(mean(runif(y)))
expr(m <- mean(x) + !!my_bias) # put `mean(runif(y))`
#		of call (not expr of class 'call'): evaluate and put results
expr(!!mean(1:5)) # put result = 3

# !!! (unquote-splice, ONLY applicable in call exprs)
params <- list(expr(x),1,2,3)
expr(mean(!!!params))
named_params <- list(expr(x),1,2,3,na.rm = TRUE)
expr(mean(!!!named_params))

The unquoting operator !! takes a single operand (most often a symbol) and directs the quoting function to do the following:

  1. Evaluate the operand in the environment from which the quoting function was called.
  2. If the result is an ordinary object (a function, a constant, a vector), put that object in the expression. This also covers explicit function calls such as !!mean(1:5), which are evaluated and replaced by their result.
  3. If the result is itself an expression, put the expression content in place without further evaluation. This also includes expressions of class ‘call’.

The unquote-splice operator !!! takes a list (typically bound to a symbol) and directs the quoting function to do the following:

  1. Evaluate the operand in the environment from which the quoting function was called, yielding a list.
  2. For each element of the list, behave as the unquoting operator !! would, yielding an optionally named list of unquoting results.
  3. Put that list in place as arguments (formally, an argument pairlist) of the call expression, keeping any names as argument names.
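The rules above imply that unquoting happens while the expression is being built, so the result is a finished call that already contains the values. A small sketch, assuming rlang is installed; my_bias and extra_args are illustrative names, and bquote() is shown only as the nearest base R counterpart for single-value unquoting.

```r
library(rlang)

my_bias <- 100
extra_args <- list(1, 5, 9)

# !! inserts one value; !!! splices each list element as its own argument.
built <- expr(max(!!!extra_args, !!my_bias))
built_text <- deparse(built)
value <- eval(built)

# Base R counterpart for a single value: bquote() with .()
base_built <- bquote(max(1, 5, 9, .(my_bias)))
base_text <- deparse(base_built)
```

Changing my_bias after built has been created has no effect on it, because the value 100 has already been inlined into the call.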

It should be clear that unquote-splice only makes sense among the arguments of a function call. In fact, trying to use this operator at the top level of an expression will trigger an error:

params <- list(1,2,3)
expr(!!!params) # Error: Can't use `!!!` at top level.

Quosure: expression with environment

In previous sections, we never specified the environment for evaluation and relied on the default behavior:

  1. eval() evaluation uses the environment and/or data mask explicitly provided.
  2. unquoting with !! uses the environment from which the quoting function is called.

However, that might not be what we want. More specifically, for R package developers things can easily get very confusing. Consider the following example (adapted and simplified from the rlang quosure topic).

# In package `foo`
normalize <- function(vec, func){
  # Normalizes a numerical vector by a summary func
  #   not exported; internal use only
  stopifnot(is.numeric(vec))
  factor <- func(vec)
  stopifnot(length(factor) == 1 && is.numeric(factor))
  vec / factor
}
computeMS <- function(data, column, auto.norm = TRUE){
  # Computes mean and sem of a data column
  #   exported
  # 1. Quoting
  column <- rlang::enexpr(column)
  # 2. Auto normalization
  #   if column is a symbol, normalize by sum
  #   auto.norm implemented w/ normalize in this package.
  if (auto.norm && is.symbol(column))
    column <- rlang::expr(normalize(!!column, sum))
  # 3. Evaluate
  data <- eval(column, envir = data)
  # 4. Compute mean and sem
  mean <- mean(data)
  sem <- sd(data) / sqrt(length(data))
  c(mean = mean, sem = sem)
}

In this example, function computeMS in the foo package computes mean and sem of a data column provided by the user. While seemingly promising, this design does not always work:

# In user environment .Global
library(foo)
#		use `computeMS` from `foo`
computeMS(mtcars, mpg, auto.norm=TRUE) # Ok
computeMS(mtcars, mpg/sum(mpg), auto.norm=FALSE) # Ok
#		custom user normalize function
normalize <- \(x) (x - min(x))/(max(x) - min(x))
computeMS(mtcars, normalize(mpg), auto.norm=FALSE) # Errors

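While the user intent is clearly to use the normalize function defined in the user environment, computeMS(mtcars, normalize(mpg), auto.norm=FALSE) errors. This is because eval() in computeMS resolves symbols that are not in the data mask starting from the execution environment of computeMS, whose parent is the namespace of the foo package, where a different normalize (one that requires a func argument) is already defined.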

Quosure will solve the name conflict issue:

# In package `foo`
computeMS2 <- function(data, column, auto.norm = TRUE){
  # computes mean and sem of a data column
  # exported
  # 1. Quoting w/ quosure
  column <- rlang::enquo(column)
  # 2. Auto normalization
  #   a quosure is never a bare symbol, so inspect its expression
  if (auto.norm && rlang::quo_is_symbol(column))
    column <- rlang::expr(normalize(!!column, sum))
  # 3. Evaluate w/ rlang::eval_tidy
  data <- rlang::eval_tidy(column, data = data)
  # 4. Compute mean and sem
  mean <- mean(data)
  sem <- sd(data) / sqrt(length(data))
  c(mean = mean, sem = sem)
}

Quoting with enquo creates a quosure instead of an expression. A quosure is essentially an expression bundled with a default environment for evaluation [11].

To define a quosure, similar to an expression:

  1. use quo() in interactive mode or scripts
  2. use enquo() in a function body.

To use a quosure:

  1. Use rlang::eval_tidy() instead of eval() for evaluation.
  2. For symbols in the expression, eval_tidy will follow the standard eval rules.
  3. For quosures in the expression, eval_tidy will use quosure environments for evaluation.
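To see the two parts of a quosure at work, here is a small sketch that assumes rlang is installed. The helper make_quo() and the variable offset are hypothetical and exist only to create two competing definitions of the same name.

```r
library(rlang)

# Hypothetical helper: the quosure remembers the environment it was created in.
make_quo <- function() {
  offset <- 100
  quo(mpg + offset)
}
q <- make_quo()
offset <- 1   # a different `offset` where eval_tidy() is called

quo_expr_text <- deparse(quo_get_expr(q))   # the expression part
quo_env <- quo_get_env(q)                   # the environment part

# mpg comes from the data mask; offset comes from the quosure
# environment (100), not from the calling environment (1).
res <- eval_tidy(q, data = mtcars[1:2, ])
```

This is the property that rescues the user-defined normalize in the package example: symbols typed by the user are resolved where the user typed them.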

Common metaprogramming examples and patterns

In this last section, I discuss a few common metaprogramming functions and patterns provided by base R or the tidyverse ecosystem.

First, we discuss the base function subset and tidyselect of tidyverse. Both share one motivation: to allow expressive syntax for subsetting data.

base::subset()

What: subset rows and columns of data that meet conditions.

Examples:

# See how flexible select can be
subset(airquality, Temp > 80, select = c(Ozone, Temp))
subset(airquality, Temp > 80, select = c(Ozone:Wind, Month))

Annotated base::subset.data.frame():

# base::subset.data.frame
function (x, subset, select, drop = FALSE, ...) 
{
    # I) process row *condition*
    r <- if (missing(subset))
        # subset not provided -> take all rows
        rep_len(TRUE, nrow(x))
    else {
        # subset provided
        #   first quote
        e <- substitute(subset)
        #   then eval with x as data mask
        r <- eval(e, x, parent.frame())
        #   check eval yields logical
        if (!is.logical(r)) 
            stop("'subset' must be logical")
        #   change NA to FALSE
        r & !is.na(r)
    }
    # II) process column *selection*
    vars <- if (missing(select))
        # select not provided -> take all cols
        rep_len(TRUE, ncol(x))
    else {
        # select provided
        #    generate named sequence list nl
        nl <- as.list(seq_along(x)) # value = seq_along(x)
        names(nl) <- names(x) # name = names(x)
        #    then eval select with nl as data mask
        #		 brilliant trick indeed...
        eval(substitute(select), nl, parent.frame())
    }
    x[r, vars, drop = drop]
}
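The column-selection trick in part II can be replayed outside the function. The sketch below rebuilds the named list nl for airquality and evaluates a select expression against it, so that column names stand for their positions.

```r
# Same construction as in subset.data.frame: name -> column position
nl <- as.list(seq_along(airquality))
names(nl) <- names(airquality)

# With nl as the data mask, Ozone:Wind becomes 1:3 and Month becomes 5.
idx <- eval(quote(c(Ozone:Wind, Month)), nl)
selected <- names(airquality)[idx]
```

Because names evaluate to integers, ordinary operators such as : and functions such as c() do the work of a selection language without any special parsing.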

Tidyselect - DSL for selection

The elegant base::subset() motivated the development of the selection syntax of the tidyverse, implemented in the tidyselect package.

Formally, tidyselect implements a domain-specific language for making selections from named subsettable objects, including vectors and data frames [12]. It has the following components:

  1. DSL syntax as defined in topic page tidyselect::language.
  2. Evaluation of the DSL with eval_select(), eval_rename(), or eval_relocate(). Evaluation always yields a named vector of numeric locations of the selection.

Below, I provide a few simple examples of tidyselect.

evaluation of the DSL

require(tidyselect)
require(rlang)
# eval_* functions are evaluators of DSL
sel <- expr(drat:wt)
eval_select(sel, mtcars)
sel <- expr(drat:wt | starts_with("c"))
eval_select(sel, mtcars)
rename_sel <- expr(c(my_drat = drat)) # See `eval_rename`
eval_rename(rename_sel, mtcars)
reloc_sel <- expr(drat:wt | starts_with("c"))
reloc_after <- expr(gear)
eval_relocate(reloc_sel, mtcars, after = reloc_after)

tidyselect in tidyverse functions

tidyselect is used by tidyverse package functions involving selection. For example:

  1. dplyr functions incl. select, rename and relocate
  2. tidyr functions incl. pivot_longer

Whenever you see parameters taking type <tidy-select>, that means that you may use the tidyselect DSL for selection. Below, I show source code of dplyr:::select.data.frame which accepts the tidyselect syntax.

# dplyr:::select.data.frame is dispatched if
#  you provide data.frame argument to dplyr::select
dplyr:::select.data.frame <- 
 function (.data, ...) 
  {
    # For user-friendly report if error happens
    error_call <- dplyr_error_call()
    # Dots for selection is forwarded in `expr(c(...))`
    loc <- tidyselect::eval_select(expr(c(...)), data = .data, 
        error_call = error_call)
    # Sanity check - group variables must all be selected
    loc <- ensure_group_vars(loc, .data, notify = TRUE)
    # Use the loc returned by eval_select
    out <- dplyr_col_select(.data, loc)
    # Set selection names
    out <- set_names(out, names(loc))
    out
  }

dplyr operations

dplyr::mutate is probably among the first functions a typical R user would learn. When applied to data frames, its formals look like this:

## S3 method for class 'data.frame'
mutate(
  .data,
  ...,
  .by = NULL,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)

While .by, .keep and .before, .after are intuitive modifiers, the ellipsis ... accepts an arbitrary number of name-value pairs and allows sophisticated yet mostly intuitive “mutation operations” of a data frame. For example:

require(dplyr)
#		?mtcars to learn more about data
vs <- c("V", "S")
mutate(mtcars,
  # Following three are equivalent
  vs_type = ifelse(vs == 0, "V", "S"), # just `eval` with mask?
  vs_type = ifelse(.data$vs == 0, "V", "S"),  # what's `.data`?
  vs_type = ifelse(vs == 0, .env$vs[1], .env$vs[2]) # `.env`?
)

Interfacing with dplyr functions in custom code is a more interesting case. Below I define a function groupThenSummarize(data, var.group, ...) that:

  1. Groups a data frame by a single variable var.group.
  2. For each variable selected with tidyselect syntax in the ellipsis ..., generates two statistics columns, mean and sd.

require(dplyr)
groupThenSummarize <- function(data, var.group, ...){
  data |>
    # Grouping is simple (defuse-then-inject)
    #		`{{` is not standard R syntax
    group_by({{var.group}}) |>
    # Forwarding `...` selection to `across`
    #		`c(...)` just works
    summarize(across(
      c(...), list(mean = mean, sd = sd),
      .names = "{.fn}_{.col}"
      ), .groups = "drop")
}
# Example, note how:
#		1. `var.group` uses natural symbol
#		2. `...` have full tidyselect support
groupThenSummarize(mtcars, vs, disp:wt)

The above code looks remarkably expressive and intuitive at first glance:

  1. Group the data with group_by and summarize by group with summarize.
  2. In summarize we use across to apply summary functions across multiple columns (our selection in ...)

However, from the R language point of view, it is indeed deeply perplexing. To understand how this code works, we need to get familiar with tidyverse metaprogramming patterns.

Metaprogramming patterns of tidyverse

While base R provides data masking evaluation, the tidyverse provides an “extended” version. Under the hood, the tidyverse flavor of data masking is implemented in the rlang package, mostly by the following features:

  1. quosures and quasiquotation for partial evaluation (unquoting)
  2. special pronouns for context-dependent references
  3. dynamic dots as an extension of the base R ellipsis ...

It shall be noted that while the tidyverse data masking exploits all three features, quosures and dynamic dots can be readily used for metaprogramming problems in general.

Brace operator: defuse-and-inject

Many tidyverse functions quote their data-related arguments using the rlang quosure. While this gives intuitive behavior in interactive consoles and scripts, it is sometimes unwanted in functions and packages. For example:

# Works well in script (GlobalEnv.)
require(dplyr)
mtcars |>
  group_by(cyl) |>
  summarize(mean = mean(disp))

# Function abstraction errors
groupThenMean <- function(data, by, what){
  data |>
    group_by(by) |>
    summarize(mean = mean(what))
}
groupThenMean(mtcars, cyl, disp)

The function abstraction errors because both the group_by and summarize verbs of dplyr quote their arguments. When the function is executed, group_by tries to group by a variable literally named by, which does not exist in the data, so the function exits with an error.

This issue happens so often that its solution (a pattern) merits a dedicated name, “defuse-and-inject”. Consider:

groupThenMean <- function(data, by, what){
  data |>
    group_by({{by}}) |>
    summarize(mean = mean({{what}}))
}

The defuse-and-inject pattern has a dedicated metaprogramming operator, the embrace operator {{ (written with double curly braces). It does exactly what the pattern name says, and is equivalent to the following:

require(rlang)
groupThenMean <- function(data, by, what){
  # First defuse (i.e., quoting) w/ rlang::enquo()
  by <- enquo(by)
  what <- enquo(what)
  data |>
    # Then inject (i.e., unquoting) w/ !!
    group_by(!!by) |>
    summarize(mean = mean(!!what))
}
groupThenMean(mtcars, cyl, disp)

To make sense of it, consider the group_by part.

  1. by is quoted with enquo(by), yielding a quosure with expression cyl and environment <fn_exec_env> [13].
  2. group_by quotes its arguments using enquos(...) [14].
  3. The R evaluator hence evaluates enquos(!!by). Unquoting a quosure yields the quosure itself, following the standard rules of quasiquotation [15].
  4. Therefore, group_by recovers the quoted by created in Step#1, whose expression cyl is used as the grouping variable.
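These steps can be observed without dplyr. In the sketch below, capture_dots() is a hypothetical stand-in for a verb that quotes its dots the way group_by does, and the two wrappers (also hypothetical) use the long and the short form of the pattern. It assumes rlang is installed.

```r
library(rlang)

# Hypothetical stand-in for a verb that quotes its dots, as group_by() does.
capture_dots <- function(...) enquos(...)

wrapper_long <- function(by) {
  by <- enquo(by)       # defuse
  capture_dots(!!by)    # inject
}
wrapper_short <- function(by) capture_dots({{ by }})

q_long <- wrapper_long(cyl)[[1]]
q_short <- wrapper_short(cyl)[[1]]

quo_get_expr(q_long)    # the symbol `cyl`, not `by`
```

In both cases the inner function receives what the user typed (cyl) together with the environment the user typed it in, rather than the wrapper's own argument name.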

Takeaway: tidyverse functions quote their arguments using the rlang quoting mechanism. Therefore, to forward a single argument to these functions, use defuse-and-inject {{.

Pronouns: context-dependent references

The pronoun is another neat concept of tidyverse metaprogramming. It stems from the need for name disambiguation. Consider the following:

require(dplyr)
disp <- 100
which <- "disp"
mutate(mtcars, disp = disp - disp) # Can't ref `disp` in the env
mutate(mtcars, disp = disp - !!disp) # Works but not expressive
mutate(mtcars, disp = .data$disp - .env$disp) # Expressive
mutate(mtcars, disp = .data[["disp"]] - .env[["disp"]]) # Same
mutate(mtcars, disp = .data[[!!which]] - .env[[!!which]]) # Same

For data-masking type arguments, you can use the .data and .env pronouns. This is because:

  1. Tidyverse data-masking arguments are first quoted with rlang::enquo and then evaluated by rlang::eval_tidy [16].
  2. When a data mask is provided, eval_tidy will recognize the .data and .env pronouns and evaluate either in the data frame or in the environment accordingly.

How to use the pronouns is quite self-evident; nevertheless, it should be noted that they are not real data frames or environments but metaprogramming features.
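Since the pronouns belong to eval_tidy rather than to dplyr, they can be tried with rlang alone. A minimal sketch, assuming rlang is installed; no_such_column is a deliberately nonexistent name used to show how .data differs from a bare symbol.

```r
library(rlang)

disp <- 100

# .data$disp is the column; .env$disp is the variable defined above.
res <- eval_tidy(quo(.data$disp[1] - .env$disp), data = mtcars)

# Unlike a bare symbol, .data never falls back to the environment:
# asking for a column that does not exist is an error.
strict <- tryCatch(
  eval_tidy(quo(.data$no_such_column), data = mtcars),
  error = function(e) "error"
)
```

The strictness of .data is useful in package code, where silently picking up a same-named object from the environment would be a hard-to-find bug.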

Dynamic dots: extension of the ellipsis

The ellipsis ... is a reserved word of the R parser for passing arbitrary “extra” arguments from a caller function down the stack. If and only if the formals of a function include ..., one can use ... in its body to pass the extra arguments on to a called function. Consider the following example:

f <- \(x, ...) g(...)
g <- \(y, ...) h(...)
h <- \(...) list(...)

f(1,2,3) # 3
f(x=1,y=2,z=3) # z=3
f(2,3,1) # 1
f(y=2,z=3,x=1) # z=3

In this example, f calls g calls h; f takes a “fixed” argument x and g takes a “fixed” argument y. h collects content of the ellipsis into a list and returns.

Arguments forwarded through the ellipsis can only be referred to by name if that name appears in the formals of the receiving function:

f <- \(x, ...) g(...)
g <- \(...) y
f(x=1,y=2) # Error; `y` not found
g <- \(y) y # Must put `y` in formals
f(x=1,y=2) # Returns 2

Finally, as long as an argument is not captured explicitly in the formals of a called function, it keeps being forwarded.

Tidyverse functions often follow an extended ellipsis syntax known as dynamic dots. Dynamic dots are implemented in the rlang package [17], and the following functions support this syntax:

  1. Functions that collect dots with rlang::list2() or rlang::dots_list().
  2. Functions that collect dots with rlang::enquos(), rlang::quos() or expression list equivalents enexprs() and exprs().

It should be noted that tidyverse functions that adopt data-masking dots, including dplyr::mutate and all functions having a <data-masking> type ellipsis, support the dynamic dots syntax. This is because they collect dots by rlang::quos().

Dynamic dots support the following features:

  1. Argument splicing with !!!. This is conceptually identical to unquoting call arguments [18].
  2. Name injection with :=. This is also known as the “glue syntax” [19], which allows you to perform string interpolation on argument names.
  3. Trailing commas are ignored for easier copy-paste of argument lines.

Consider the following example:

require(rlang)
# dyn-dots list2()
f <- \(...) list2(...)
my_params <- list(x=1,y=2)
my_param2 <- "z"
f(!!!my_params, "my_{my_param2}" := 3)

# The following also supports dyn-dots
f <- \(...) enexprs(...)
f <- \(...) enquos(...)

# Data-masking dots are dyn-dots
require(dplyr)
my_ops <- list(
  # Define a bunch of mutate ops
  kpl = expr(mpg*0.425144), #km/L
  disp = expr(disp*0.0163871), #L
  kw = expr(hp*0.7457), #kW
  wt = expr(wt*0.453592) #ton
)
mutate(mtcars, !!!my_ops, )

Example: groupThenSummarize

In this last section, I show an example that uses many of the tidyverse metaprogramming patterns. Function groupThenSummarize slightly improves the experience of a common data analysis task: group-then-summarize, see below:

# Interface definition
require(rlang)
require(dplyr)
#' Customizable summary of data frame variables
#'
#' @param .data a data frame (extension)
#' @param var.group variables to group, either single name or gvars()
#' @param .fns named list of summary functions
#' @param ... <[`tidy-select`][dplyr::dplyr_tidy_select]> Selection of variables to summarize on
#'
#' @return a data frame (extension) of summary
#' @export
#'
#' @examples
#' # Single group variable
#' groupThenSummarize(mtcars, cyl, list(m=mean, s=sd), disp:wt)
#' # Multiple group variable
#' groupThenSummarize(mtcars, gvars(cyl,vs), list(m=mean, s=sd), disp:wt)
groupThenSummarize <- function(.data, var.group, .fns, ...){
  # S1. Quote grouping variable
  var.group <- enexpr(var.group)
  if (is.symbol(var.group)){
    # If symbol (single group variable)
    var.group <- exprs(!!var.group)
  } else{
    # Otherwise call
    if (!is_call(var.group, "gvars"))
      abort("Use `gvars()` to specify multiple variables.")
    var.group <- call_args(var.group)
  }
  # S2. Perform group-then-summarize
  .data |>
    # Grouping with group_by
    #		var.group is a list, splice
    group_by(!!!var.group) |>
    # Forwarding `...` selection to `across`
    summarize(across(
      c(...), .fns,
      .names = "{.fn}_{.col}"
      ), .groups = "drop")
}

It has the following features:

  1. var.group may accept a single symbol for grouping, or gvars(...) to accept multiple symbols for grouping.
  2. ... ellipsis supports the full tidyselect syntax.
  3. .fns is a named list of summary functions. Names will be used as output column names.

Usage examples:

fns <- list(m=mean, s=sd)
groupThenSummarize(mtcars, cyl, fns, mpg, disp:qsec)
groupThenSummarize(mtcars, gvars(cyl, vs), fns, mpg, disp:qsec)
groupThenSummarize(mtcars, gvars(cyl, vs), fns, mpg | disp:qsec)

A few more notes about this example:

  1. It serves as a tiny example and by no means covers all edge cases.
  2. roxygen2 annotation is used to describe the function interface.
  3. group_by uses rlang::enquos(...) for argument quoting, and therefore supports splicing (and other dynamic dots syntax). However, name injection is meaningless for grouping.
  4. Tidyselect dots were forwarded to across with c(...). This is because the first argument of across has type <tidy-select>, which means that it will be quoted and evaluated with the tidyselect syntax. c() is used for concatenation of selections in the tidyselect DSL.
  5. It does not reflect good programming practices; in fact, if putting this function in a package is desirable, one should at least separate S1 and S2 steps and implement generalized forms.

Footnotes

  1. Stackoverflow has an insightful answer about this. Metaprogramming is a ‘relatively new’ concept; in assembly, one mostly does not distinguish code from data.

  2. The R parser translates R code to a parse tree, which is then passed to the evaluator, which translates the parse tree to executable instructions.

  3. Comments and spaces are ignored by the parser and omitted in the parse tree. Still, spaces are often necessary for disambiguation. For example, x<-1 (assignment) versus x< -1 (comparison). The R parser is also capable of processing parse-time directives.

  4. Functions are first-class objects in R. During evaluation of function definitions, the evaluator does NOT evaluate inside the argument list nor the function body. Instead, it creates the function object by finding the environment(), filling the argument list formals(), and calling the parser to parse the function body body(). Refer to my previous post on R functions.

  5. It should be noted that R allows partial evaluation of a parse tree. This is called quasiquotation and is more intuitively supported by rlang.

  6. A constant is either NULL or an atomic scalar (length-1 vector). This is the simplest language type and we do not discuss it further.

  7. rlang::enexpr() works by looking into the promise objects of the parameters.

  8. Conceptually, you can think that functions that support a “data mask” work by evaluating the expression in a special environment chain, where the first link defined by the data mask is connected to the second link defined by the environment (i.e., the “enclosure” as mentioned in eval()).

  9. For more details refer to a well-written topic in a previous version of the rlang package.

  10. They are not really R operators (! is logical not and !! is nothing but double negation). In fact, their special behavior is implemented by defusing.

  11. If you are familiar with R functions, quosures are conceptually similar to promises. On a separate note, quosures are also conceptually similar to formulas.

  12. Technically, it supports any vector with names() and “[[” implementations (see documentation of eval_select).

  13. The equivalent version is a harmless fiction to simplify the evaluation process. In fact, quoting of by and what is not performed in the execution environment of groupThenMean.

  14. Tidyverse verbs quote using the quosure mechanism instead of expression. Our example here is about data-masked evaluation and therefore using quosures or expressions does not matter.

  15. In general, expr(!!<expr>) == <expr> and quo(!!<quo>) == <quo>. That is, unquoting an expression/quosure yields the same expression/quosure. Refer to the quasiquotation section for details.

  16. It might help to recall that unquoting happens during rlang::enquo. Therefore, the unquoting part has no data mask. That is why mutate(mtcars, disp = disp - !!disp) works. This is generally true for data-masking arguments in tidyverse.

  17. Actually it is implemented in C for the most part. Refer to the rlang source code for more details.

  18. In reality they are of course different; when using the dynamic dots, we are executing a function call. When unquoting call arguments, we are making a call expression.

  19. It was initially implemented in the glue package, which focuses on string interpolation. If you are familiar with Python, it is conceptually identical to f-strings.