Synopsis
R users often write code that looks deeply peculiar to others. Ever wondered how the following could possibly work?
- Variable names show up as plot labels with a minimal `plot()` call.
- Operations on data frames refer to columns by name without quotes.
- Model formulas fit against any data, and operators like `*` are no longer arithmetic.
Metaprogramming underlies many of the peculiarities of R. Among the most intriguing designs of the R language, it is arguably the key that allows R users to write expressive code centered on data analysis, rather than buried in minute programming details.
This post provides an introduction to the R metaprogramming design. After discussing basic concepts, I end with an introduction to metaprogramming patterns supported by the tidyverse, most notably in the `rlang` package.
TL;DR: Refer to the example at the end which uses many features discussed in this post.
The R’ posts are my study notes of the Advanced R book. Inaccurate information here is in all likelihood my fault.
Note: code in this post often requires R >= 4.1.0, which provides the shorthand function notation `\(x)`.
This work is licensed under CC BY-SA 4.0
What is metaprogramming
In the broadest sense, metaprogramming is the process where a computer program manipulates another computer program[1]. This umbrella term could take different meanings:
- Programs that modify/generate other programs are metaprograms. This category includes parsers, compilers, etc. For instance, high-level language compilers translate source code into low-level intermediate representations[2].
- Programs that modify themselves are also metaprograms. Many languages offer tools that allow a program to inspect and modify its own structure at run-time and/or at compile-time. This is often called reflection.
In R, metaprogramming means both. You can write R programs that allow users to generate programs in other languages using R alone (without knowledge of the target languages); on the other hand, R programs have full read/write access to their internal runtime structure.
To further understand how metaprogramming is used in language designs, we need to think about ‘code’ versus ‘execution/evaluation of code’. To keep it simple, we henceforth restrict our discussion to the R language.
Metaprogramming in R
Parser and evaluator
We as R users write plain-text code either in the interactive console or in source files. When we “run” our code:
- The R parser translates plain-text code into an R object representation. The parser interprets plain-text code by syntax[3]. Therefore, at this stage, code only has to be syntactically correct.
- The R evaluator evaluates parsed code. At this stage, the evaluator resolves values of symbols in the parsed code and executes function calls to yield results[4].
Importantly, the parser only needs the source code. However, to evaluate the parsed code, the evaluator needs an environment with which to resolve the symbols.
Note: there are many (mostly) interchangeable terms for ‘parsed code’, including abstract syntax tree and concrete syntax tree (aka parse tree).
With this model in mind, R supports metaprogramming by allowing users to:
- Parse code and not evaluate (aka “quoting”). This generates a parse tree, thereby enabling subsequent inspection and modification of the tree. Quoting converts code to data by running the parser but not the evaluator.
- Evaluate the parse tree in an arbitrary environment (aka “evaluation”, or a close relative “unquoting”). Evaluation generates the computation results of the parsed code[5].
Next, we discuss the quoting process, which is the mapping code -> expression.
Quoting: code -> expression
Parsed code segments are stored as language objects, which come in three base types: symbol (aka name), call and constant. Expression, as an R object type, is a list-like type which may contain multiple symbols, constants and/or calls. For the sake of simplicity, we subsequently refer to parsed code as an expression. When we talk about an expression, in most cases that refers to a single symbol or a single call[6].
symbol/name
A symbol refers to R objects by name. To define a symbol:
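A minimal sketch of defining a symbol, assuming the rlang package is installed. The name `x` is arbitrary and does not need to exist as an object:

```r
library(rlang)

# Base R: quote() captures the code without evaluating it.
s1 <- quote(x)
# Base R: build a symbol from a string.
s2 <- as.name("x")
# rlang equivalents.
s3 <- expr(x)
s4 <- sym("x")

class(s1)           # "name"
is.symbol(s3)       # TRUE
identical(s1, s4)   # TRUE
```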
call
A call refers to an unevaluated function call. To define a function call:
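A minimal sketch of defining a call, assuming rlang is installed. Neither `max` nor `x` is looked up while the call is being built:

```r
library(rlang)

# Quote a complete call as written.
c1 <- quote(max(x))
c2 <- expr(max(x))
# Build the same call from its parts.
c3 <- call("max", quote(x))
c4 <- call2("max", quote(x))

class(c1)           # "call"
identical(c1, c4)   # TRUE

# A call is list-like: the function comes first, then the arguments.
as.list(c1)
```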
Notes:
- `class()` of the result of `quote()` and `expr()` depends on the input and can be either symbol or call.
- The function symbol `max` and parameter `x` can be undefined (again, quoting only checks syntax). The only exception is `rlang::call2()`, which evaluates its arguments: a function given as a bare symbol must be defined, whereas a string or a quoted symbol need not be.
parse()
It is natural to think that one should use `parse()`, which runs the R parser. After all, quoting is parsing without executing.
However, `parse()` is not the best choice for most applications because it is not very user-friendly. It always returns an expression list, because by design it is meant to parse script files.
We do not use `parse()` for quoting. To show the simplest case of `parse()`, consider the following example:
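A small sketch of what `parse()` returns, using the `text` argument instead of a file. The input strings are made up for illustration:

```r
# parse() returns an expression vector, even for a single statement.
p <- parse(text = "max(x)")
class(p)        # "expression"
length(p)       # 1

# The quoted call is the first element.
p[[1]]
class(p[[1]])   # "call"

# Several statements give several elements.
length(parse(text = "a <- 1; b <- a + 1"))   # 2
```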
defusing: quoting function arguments
Previous sections showed some examples of quoting: parsing without evaluating. It might surprise you that the same code can NOT perform quoting when put in a function body:
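A minimal sketch of the problem; `naive_quote` is a hypothetical name used only for illustration:

```r
# quote() sees the literal text `x`, not what the caller passed in.
naive_quote <- function(x) quote(x)

naive_quote(a + b)    # x
naive_quote(max(y))   # x
```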
In fact, the input parameter is never read nor evaluated at all! `quote(x)` always returns the symbol `x` because it captures the code exactly as written (i.e., text as read by the parser).
To allow quoting inside a function, we use quoting helpers that take advantage of delayed evaluation of R function arguments. Refer to my previous post for more details[7].
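A minimal sketch of quoting a function argument, with base R `substitute()` and `rlang::enexpr()` side by side. The function names `capture_base` and `capture_rlang` are hypothetical:

```r
library(rlang)

# Base R: substitute() reads the code stored in the argument's promise.
capture_base <- function(x) substitute(x)
# rlang: enexpr() does the same and also supports unquoting.
capture_rlang <- function(x) enexpr(x)

capture_base(a + b)     # a + b
capture_rlang(max(y))   # max(y)
```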
`rlang` provides a few more function parameter quoting helpers:
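Two such helpers, `ensym()` and `enexprs()`, can be sketched as follows. The wrapper names are hypothetical, and this is not an exhaustive list:

```r
library(rlang)

# ensym(): the argument must be a symbol (or a string).
capture_sym <- function(x) ensym(x)
# enexprs(): capture every argument in `...` as a list of expressions.
capture_all <- function(...) enexprs(...)

capture_sym(mpg)                      # mpg
res <- capture_all(a + b, n = f(x))
res[[1]]                              # a + b
res$n                                 # f(x)
```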
Evaluation: expression -> result
Quoting translates plain-text code to an expression with the parser. Evaluation yields the result of the expression by resolving the values of all symbols with the evaluator. The key interface function to the R evaluator is `eval()`.
eval()
To evaluate an expression, we need to provide values of all symbols in the expression. `eval()` is the key interface to the evaluator and is used by multiple evaluation helper functions in base R, including `local()` and `source()`.
To resolve symbols, you need to provide an environment (i.e., symbol table). `eval()` provides the following options:
- Provide an environment with `eval(expr, envir = ENV)`.
- Provide a data frame that `eval()` will first use to resolve symbols, `eval(expr, envir = DF, enclos = ENV)`. `ENV` is the environment in which to look up symbols not found in `DF`. `DF` is also intuitively called a “data mask”[8].
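Both options can be sketched with the built-in `mtcars` data. The variable `k` and the environments `env` and `encl` are made up for illustration:

```r
e <- quote(mpg * k)

# 1. Environment only: both symbols are found in `env`.
env <- new.env()
env$mpg <- c(10, 20)
env$k <- 2
eval(e, envir = env)    # 20 40

# 2. Data mask plus enclosure: `mpg` comes from the data frame;
#    `k` is not a column, so it is found in the enclosure.
encl <- new.env()
encl$k <- 0.5
head(eval(e, envir = mtcars, enclos = encl), 3)   # 10.5 10.5 11.4
```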
Quasiquotation: partial evaluation during quotation
In previous sections, I showed how we can convert between code and data by quoting and evaluation. Quasiquotation is the combination of quoting an expression while allowing immediate evaluation (unquoting) of part of that expression[9].
This feature is implemented by the rlang package of tidyverse, which defines two syntactic operators, `!!` and `!!!`. These two operators are ONLY valid when used in arguments of rlang quoting functions[10].
The unquoting operator `!!` takes a single symbol and directs the quoting function to do the following:
- Look up the symbol in the environment from which the quoting function was called.
- If the symbol refers to an object, put the object in the expression. This also includes explicit function calls, which are evaluated and the result object obtained.
- If the symbol refers to an expression, put the expression content without further evaluation. This also includes expressions of class ‘call’.
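The three cases above can be sketched as follows, assuming rlang is installed. The names `n` and `rhs` are made up:

```r
library(rlang)

n <- 10              # an ordinary object
rhs <- expr(a + b)   # an expression

# Without unquoting, everything stays symbolic.
expr(x * n)          # x * n
# `!!n` injects the value of `n`.
expr(x * !!n)        # x * 10
# `!!rhs` injects the expression itself, not its value.
expr(x * !!rhs)      # x * (a + b)
# A call after `!!` is evaluated and its result is injected.
expr(x * !!(n + 1))  # x * 11
```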
The unquote-splice operator `!!!` takes a list and directs the quoting function to do the following:
- Look up the list in the environment from which the quoting function was called.
- Treat each element of the list as the unquoting operator `!!` would, yielding an (optionally named) list of unquoting results.
- Put the elements of that list in as arguments (formally, an argument pairlist) of the call expression.
It should be clear that unquote-splice only makes sense inside the arguments of a function call. In fact, trying to use this operator at the top level of an expression will trigger an error:
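A minimal sketch of splicing and of the top-level error, assuming rlang is installed. The list `args` is made up, and the error is caught with `tryCatch()` so the snippet runs to the end:

```r
library(rlang)

args <- list(quote(x), na.rm = TRUE)

# Each list element becomes one argument of the surrounding call.
expr(max(!!!args))       # max(x, na.rm = TRUE)
call2("max", !!!args)    # the same call, built with call2()

# At the top level there is no call to splice into.
res <- tryCatch(expr(!!!args), error = function(e) e)
inherits(res, "error")   # TRUE
```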
Quosure: expression with environment
In previous sections, we never specified the environment for evaluation and relied on the default behavior:
- `eval()` evaluation uses the environment and/or data mask explicitly provided.
- Unquoting with `!!` uses the environment from which the quoting function was called.
However, that might not be what we want. More specifically, for R package developers things can easily get very confusing. Consider the following example (adapted and simplified from the rlang quosure topic).
In this example, function `computeMS` in the `foo` package computes the mean and sem of a data column provided by the user. While seemingly promising, this design does not always work:
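The situation can be reproduced without building a package. In the sketch below, the environment `foo_ns` is a stand-in for the namespace of `foo`, and the bodies of `computeMS`, `sem` and both `normalize` functions are assumptions made for illustration:

```r
# Stand-in for the namespace of the hypothetical package `foo`.
foo_ns <- new.env(parent = globalenv())
local({
  # Internal helper of `foo` that shares its name with a user function.
  normalize <- function(x, center, scale) (x - center) / scale
  sem <- function(x) sd(x) / sqrt(length(x))
  computeMS <- function(data, col) {
    # Symbols missing from `data` are looked up from this function's
    # environment, so `foo` is searched before the user's workspace.
    x <- eval(substitute(col), data)
    c(mean = mean(x), sem = sem(x))
  }
}, envir = foo_ns)

foo_ns$computeMS(mtcars, mpg)   # works: only a column is involved

# User code with its own normalize().
normalize <- function(x) (x - mean(x)) / sd(x)
res <- tryCatch(foo_ns$computeMS(mtcars, normalize(mpg)),
                error = function(e) e)
inherits(res, "error")          # TRUE: foo's normalize() was picked up
```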
While the user intent is clearly to use the `normalize` function in the user environment, `computeMS(mtcars, normalize(mpg))` errors. This is because `eval` in the `computeMS` function will first look at the namespace of the `foo` package, where `normalize` is already defined.
Quosure will solve the name conflict issue:
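A sketch of the quosure-based fix, again with the environment `foo_ns` standing in for the namespace of the hypothetical package `foo`. All function bodies are assumptions made for illustration:

```r
library(rlang)

foo_ns <- new.env(parent = globalenv())
local({
  normalize <- function(x, center, scale) (x - center) / scale
  sem <- function(x) sd(x) / sqrt(length(x))
  computeMS <- function(data, col) {
    col <- enquo(col)            # expression plus the caller's environment
    x <- eval_tidy(col, data)    # data mask first, then the quosure env
    c(mean = mean(x), sem = sem(x))
  }
}, envir = foo_ns)

# The user's normalize() lives where the call is written.
normalize <- function(x) (x - mean(x)) / sd(x)

res <- foo_ns$computeMS(mtcars, normalize(mpg))
abs(res[["mean"]]) < 1e-8        # TRUE: the user's normalize() was used
```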
Quoting using `enquo` will create a quosure instead of an expression. A quosure is essentially an expression with a default environment for evaluation[11].
To define a quosure, similar to an expression:
- Use `quo()` in interactive mode or scripts.
- Use `enquo()` in a function body.
To use a quosure:
- Use `rlang::eval_tidy()` instead of `eval()` for evaluation.
- For symbols in the expression, `eval_tidy` will follow the standard `eval` rules.
- For quosures in the expression, `eval_tidy` will use the quosure environments for evaluation.
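A small sketch of a quosure carrying its own environment; `make_quo` is a hypothetical helper:

```r
library(rlang)

make_quo <- function() {
  x <- 10
  quo(x + 1)   # captures the expression and this function's environment
}
x <- 1
q <- make_quo()

quo_get_expr(q)         # x + 1
eval_tidy(q)            # 11: `x` is resolved in the quosure's environment
eval(quo_get_expr(q))   # 2: the bare expression sees the `x` defined here
```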
Common metaprogramming examples and patterns
In this last section, I discuss a few common metaprogramming functions and patterns provided by base R or the tidyverse ecosystem.
First, we discuss the base function `subset` and tidyselect of tidyverse. Both share one motivation: to allow expressive syntax for subsetting data.
base::subset()
What: subset rows and columns of data that meet conditions.
Examples:
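A few illustrative calls on the built-in `mtcars` data. The particular conditions are arbitrary:

```r
# Rows: a condition on bare column names. Columns: bare names in `select`.
high_mpg <- subset(mtcars, mpg > 30, select = c(mpg, cyl))
high_mpg

# Column ranges and exclusions also work inside `select`.
head(subset(mtcars, cyl == 4, select = mpg:disp), 3)
head(subset(mtcars, select = -c(vs, am)), 3)
```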
Annotated `base::subset.data.frame()`:
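The sketch below paraphrases the logic in a hypothetical `my_subset()`. It is a simplified re-implementation for illustration, not a copy of the base R source:

```r
my_subset <- function(x, subset, select) {
  # Rows: quote the condition, then evaluate it with the data frame as the
  # data mask and the caller's environment as the enclosure.
  rows <- if (missing(subset)) {
    rep(TRUE, nrow(x))
  } else {
    r <- eval(substitute(subset), x, parent.frame())
    r & !is.na(r)
  }
  # Columns: map each column name to its position, so that expressions
  # such as `mpg:disp` or `-c(vs, am)` evaluate to integer indices.
  cols <- if (missing(select)) {
    rep(TRUE, ncol(x))
  } else {
    pos <- as.list(seq_along(x))
    names(pos) <- names(x)
    eval(substitute(select), pos, parent.frame())
  }
  x[rows, cols, drop = FALSE]
}

identical(my_subset(mtcars, mpg > 30, mpg:disp),
          subset(mtcars, mpg > 30, mpg:disp))   # TRUE
```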
Tidyselect - DSL for selection
The elegant `base::subset()` motivated the development of the selection syntax of tidyverse, implemented in the tidyselect package.
Formally, tidyselect implements a domain-specific language (DSL) for making selections from named subsettable objects, including vectors and data frames[12]. It has the following components:
- DSL syntax, as defined in the topic page `tidyselect::language`.
- Evaluation of the DSL with `eval_select()`, `eval_rename()`, or `eval_relocate()`. Evaluation always yields a named vector of numeric locations of the selection.
Below, I provide a few simple examples of tidyselect.
evaluation of the DSL
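A minimal sketch, assuming the tidyselect and rlang packages are installed. The selections are arbitrary examples on `mtcars`:

```r
library(tidyselect)
library(rlang)

# Selection: names are the output names, values are column positions.
eval_select(expr(c(mpg, starts_with("d"))), mtcars)
#>  mpg disp drat
#>    1    3    5

# Renaming inside a selection.
eval_select(expr(c(miles = mpg, cyl)), mtcars)
#> miles   cyl
#>     1     2

# eval_rename() returns only the renamed locations.
eval_rename(expr(c(miles = mpg)), mtcars)
#> miles
#>     1
```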
tidyselect in tidyverse functions
tidyselect is used by tidyverse package functions involving selection. For example:
- dplyr functions, incl. `select`, `rename` and `relocate`
- tidyr functions, incl. `pivot_longer`
Whenever you see parameters taking type `<tidy-select>`, that means that you may use the tidyselect DSL for selection. Below, I show the source code of `dplyr:::select.data.frame`, which accepts the tidyselect syntax.
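The essential mechanism can be sketched with a hypothetical `my_select()`. This is a stripped-down stand-in without grouping support or error context, not dplyr's actual source:

```r
library(rlang)

my_select <- function(.data, ...) {
  # Wrap the dots in c() so the whole selection is one tidyselect expression.
  loc <- tidyselect::eval_select(expr(c(...)), data = .data)
  out <- .data[loc]          # pick columns by position
  names(out) <- names(loc)   # apply any renaming
  out
}

names(my_select(mtcars, miles = mpg, tidyselect::starts_with("d")))
# "miles" "disp" "drat"
```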
dplyr operations
`dplyr::mutate` is probably among the first functions a typical R user would learn. When applied to data frames, its formals look like this:
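One way to inspect them yourself, assuming dplyr is installed (the exact set of arguments depends on the installed dplyr version):

```r
library(dplyr)

# Argument names of the data frame method of mutate().
names(formals(getS3method("mutate", "data.frame")))
```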
While `.by`, `.keep` and `.before`, `.after` are intuitive modifiers, the ellipsis `...` accepts an arbitrary number of name-value pairs and allows sophisticated yet mostly intuitive “mutation operations” on a data frame. For example:
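A small illustration on `mtcars`, assuming dplyr is installed. The new column names and conversion factors are chosen only for the example:

```r
library(dplyr)

res <- mtcars |>
  mutate(
    kpl = mpg * 0.425,     # new column from an existing one
    wt_kg = wt * 453.6,    # wt is recorded in units of 1000 lb
    ratio = kpl / wt_kg,   # later pairs can use columns created earlier
    .keep = "used"         # keep only the columns that were referenced
  )
head(res, 3)
```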
Interfacing with dplyr functions for custom applications is more interesting. Below I define a function `groupThenSummarize(data, var.group, ...)` that:
- Groups a data frame by a single variable `var.group`.
- For variables selected with tidyselect syntax in the ellipsis `...`, generates two statistics columns, mean and sd, for each of them.
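A minimal sketch of such a function, assuming dplyr is installed. The body is one plausible implementation of the two bullet points above:

```r
library(dplyr)

groupThenSummarize <- function(data, var.group, ...) {
  data |>
    group_by({{ var.group }}) |>
    summarize(across(c(...), list(mean = mean, sd = sd)), .groups = "drop")
}

res <- groupThenSummarize(mtcars, cyl, mpg, starts_with("d"))
names(res)
# "cyl" "mpg_mean" "mpg_sd" "disp_mean" "disp_sd" "drat_mean" "drat_sd"
```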
The above code looks remarkably expressive and intuitive at first glance:
- Group the data with `group_by` and summarize by group with `summarize`.
- In `summarize` we use `across` to apply summary functions across multiple columns (our selection in `...`).
However, from the R language point of view, it is indeed deeply perplexing. To understand how this code works, we need to get familiar with tidyverse metaprogramming patterns.
Metaprogramming patterns of tidyverse
While base R provides data masking evaluation, tidyverse provides an “extended” version. Under the hood, the tidyverse flavor of data masking is implemented in the rlang package, mostly by the following features:
- quosures and quasiquotation for partial evaluation (unquoting)
- special pronouns for context-dependent references
- dynamic dots as an extension of the base R ellipsis `...`
It shall be noted that while the tidyverse data masking exploits all three features, quosures and dynamic dots can be readily used for metaprogramming problems in general.
Brace operator: defuse-and-inject
Many tidyverse functions quote their data-related arguments using the rlang quosure. While this gives intuitive behavior in interactive consoles and scripts, it is sometimes unwanted in functions and packages. For example:
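A sketch of a function abstraction that fails for this reason, assuming dplyr is installed. The name `groupThenMean` and its arguments `by` and `what` follow the names used in the footnotes; the body is an assumption:

```r
library(dplyr)

# Naive abstraction: `by` and `what` are passed straight to quoting verbs.
groupThenMean <- function(data, by, what) {
  data |>
    group_by(by) |>
    summarize(mean = mean(what))
}

res <- tryCatch(groupThenMean(mtcars, cyl, mpg), error = function(e) e)
inherits(res, "error")   # TRUE
```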
The function abstraction errors because both the `group_by` and `summarize` verbs of dplyr quote their arguments. When the function is executed, `group_by` will try to perform grouping with a variable named `by`, and that gives an error causing the function to exit.
This issue happens so often that its solution (a pattern) merits a dedicated name “defuse-and-inject”. Consider:
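Written out explicitly, the pattern looks roughly like this, assuming dplyr and rlang are installed. `groupThenMean` is the same illustrative function name, and the body is a sketch:

```r
library(dplyr)
library(rlang)

groupThenMean <- function(data, by, what) {
  by <- enquo(by)       # defuse: capture the user's code as a quosure
  what <- enquo(what)
  data |>
    group_by(!!by) |>   # inject: hand the captured code to the verb
    summarize(mean = mean(!!what))
}

res <- groupThenMean(mtcars, cyl, mpg)
res
```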
The defuse-and-inject pattern has a dedicated metaprogramming operator called the brace `{{`. It works exactly as its name suggests:
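The same illustrative function with the brace operator, assuming dplyr is installed. Each `{{ arg }}` defuses the argument and injects it in one step:

```r
library(dplyr)

groupThenMean <- function(data, by, what) {
  data |>
    group_by({{ by }}) |>
    summarize(mean = mean({{ what }}))
}

res <- groupThenMean(mtcars, cyl, mpg)
res
```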
To make sense of it, consider the `group_by` part.
1. `by` is quoted with `enquo(by)`, yielding a quosure with expression `cyl` and environment `<fn_exec_env>`[13].
2. `group_by` quotes its arguments using `enquos(...)`[14].
3. The R evaluator hence evaluates `enquos(!!by)`. Unquoting a quosure yields the quosure itself, following the standard rules of quasiquotation[15].
4. Therefore, `group_by` recovers the quoted `by` created in Step #1, whose expression `cyl` is used as the grouping variable.
Takeaway: tidyverse functions quote their arguments using the rlang quoting mechanism. Therefore, to forward a single argument to these functions, use defuse-and-inject `{{`.
Pronouns: context-dependent references
The pronoun is another neat concept of tidyverse metaprogramming. It stems from the need for name disambiguation. Consider the following:
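A minimal sketch of the ambiguity and of the pronouns that resolve it, assuming dplyr is installed. The outer variable `disp` is made up so that it clashes with a column of `mtcars`:

```r
library(dplyr)

disp <- 100   # a variable in the calling environment, same name as a column

res <- mtcars |>
  mutate(
    ambiguous = disp - disp,             # both refer to the column: all zeros
    explicit  = .data$disp - .env$disp   # column minus the outer variable
  )
head(res[c("disp", "ambiguous", "explicit")], 3)
```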
For data-masking type arguments, you can use the `.data` and `.env` pronouns. This is because:
- Tidyverse data-masking arguments are first quoted with `rlang::enquo` and then evaluated by `rlang::eval_tidy`[16].
- When a data mask is provided, `eval_tidy` will recognize the `.data` and `.env` pronouns and evaluate either in the data frame or in the environment accordingly.
How to use the pronouns is quite self-evident; nevertheless, it shall be noted that they are not real data frames but metaprogramming features.
Dynamic dots: extension of the ellipsis
The ellipsis `...` is a reserved word of the R parser for passing arbitrary “extra” arguments from a caller function down the stack. If and only if the formals of a function include `...`, one can use `...` in its body to pass the extra arguments on to a function it calls. Consider the following example:
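A minimal sketch with the three functions `f`, `g` and `h`. Their bodies are assumptions consistent with the description that follows:

```r
h <- function(...) list(...)   # collect whatever arrives
g <- function(y, ...) h(...)   # takes `y`, forwards the rest
f <- function(x, ...) g(...)   # takes `x`, forwards the rest

str(f(x = 1, y = 2, z = 3))
# List of 1
#  $ z: num 3
```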
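In this example, `f` calls `g`, which calls `h`; `f` takes a “fixed” argument `x` and `g` takes a “fixed” argument `y`. `h` collects the content of the ellipsis into a list and returns it.
Forwarded arguments are matched against the formals of the receiving function by the usual rules, so a named argument is captured by the formal of the same name regardless of its position: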
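A short sketch of how forwarded arguments are matched, repeating the assumed definitions of `f`, `g` and `h` so that the snippet is self-contained:

```r
h <- function(...) list(...)
g <- function(y, ...) h(...)
f <- function(x, ...) g(...)

# `y = 2` is captured by g() although it comes after `z`.
names(f(1, z = 3, y = 2))   # "z"

# Without names, g() takes the first forwarded value as `y`.
f(1, 2, 3)                  # a list holding only 3
```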
Finally, as long as an argument is not captured explicitly by the formals of a called function, it keeps being forwarded.
Tidyverse functions often follow an extended ellipsis syntax known as the dynamic dots. The dynamic dots are implemented in the rlang package[17], and the following functions support this syntax:
- Functions that collect dots with `rlang::list2()` or `rlang::dots_list()`.
- Functions that collect dots with `rlang::enquos()`, `rlang::quos()` or the expression list equivalents `enexprs()` and `exprs()`.
It shall be noted that tidyverse functions that adopt data-masking dots, including `dplyr::mutate` and all functions having a `<data-masking>` type ellipsis, support the dynamic dots syntax. This is because they collect dots with `rlang::enquos()`.
Dynamic dots support the following unquoting syntax:
- Argument splicing with `!!!`. This is conceptually identical to unquoting call arguments[18].
- Name injection with `:=`. This is also known as the “glue syntax”[19], which allows you to perform string interpolation on argument names.
- Trailing commas are ignored for easier copy-paste of argument lines.
Consider the following example:
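All three features can be sketched with `rlang::list2()`, assuming rlang and glue are installed. The names `args` and `nm` are made up:

```r
library(rlang)

args <- list(a = 1, b = 2)
nm <- "c"

res <- list2(
  !!!args,          # splice a list of arguments
  !!nm := 3,        # inject the argument name from a variable
  "{nm}_x" := 4,    # glue syntax for the name
)                   # the trailing comma is ignored
names(res)          # "a" "b" "c" "c_x"
```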
Example: groupThenSummarize
In this last section, I show an example that uses many of the tidyverse metaprogramming patterns. Function `groupThenSummarize` slightly improves the experience of a common data analysis task, group-then-summarize; see below:
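One possible implementation is sketched below, assuming dplyr and rlang are installed. The handling of `gvars(...)`, the default for `.fns`, and the S1/S2 step labels are assumptions reconstructed from the feature list and notes in this section:

```r
library(dplyr)
library(rlang)

#' Group a data frame, then summarize selected columns.
#' @param data A data frame.
#' @param var.group A bare column name, or `gvars(...)` with several names.
#' @param ... Columns to summarize, in tidyselect syntax.
#' @param .fns A named list of summary functions.
groupThenSummarize <- function(data, var.group, ...,
                               .fns = list(mean = mean, sd = sd)) {
  # S1: defuse `var.group` and turn it into a list of grouping quosures.
  var.group <- enquo(var.group)
  grp_expr <- quo_get_expr(var.group)
  grp_env <- quo_get_env(var.group)
  groups <- if (is_call(grp_expr, "gvars")) {
    # `gvars` is never called; it only marks a set of grouping symbols.
    lapply(call_args(grp_expr), new_quosure, env = grp_env)
  } else {
    list(var.group)
  }
  # S2: splice the groups into group_by(), forward the selection to across().
  data |>
    group_by(!!!groups) |>
    summarize(across(c(...), .fns), .groups = "drop")
}

r1 <- groupThenSummarize(mtcars, cyl, mpg, hp)
r2 <- groupThenSummarize(mtcars, gvars(cyl, am), starts_with("d"),
                         .fns = list(med = median))
names(r1)   # "cyl" "mpg_mean" "mpg_sd" "hp_mean" "hp_sd"
names(r2)   # "cyl" "am" "disp_med" "drat_med"
```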
It has the following features:
- `var.group` may accept a single symbol for grouping, or `gvars(...)` to accept multiple symbols for grouping.
- The `...` ellipsis supports the full tidyselect syntax.
- `.fns` is a named list of summary functions. Names will be used as output column names.
Usage examples:
A few more notes about this example:
- It serves as a tiny example and by no means covers all edge cases.
- roxygen2 annotation is used to describe the function interface.
- `group_by` uses `rlang::enquos(...)` for argument quoting, and therefore supports splicing (and other dynamic dots syntax). However, name injection is meaningless for grouping.
- The tidyselect dots were forwarded to `across` with `c(...)`. This is because the first argument of `across` has type `<tidy-select>`, which means that it will be quoted and evaluated with the tidyselect syntax. `c()` is used for concatenation of selections in the tidyselect DSL.
- It does not reflect good programming practices; in fact, if putting this function in a package is desirable, one should at least separate the S1 and S2 steps and implement generalized forms.
[1] Stack Overflow has an insightful answer about this. Metaprogramming is a ‘relatively new’ concept; in assembly, one mostly doesn’t distinguish code from data.
[2] The R parser translates R code to a parse tree, which is then passed to the evaluator, which translates the parse tree to executable instructions.
[3] Comments and spaces are ignored by the parser and omitted in the parse tree. Still, spaces are often necessary for disambiguation. For example, `x<-1` (assignment) versus `x< -1` (logical comparison). The R parser is also capable of processing parse-time directives.
[4] Functions are first-class objects in R. During evaluation of function definitions, the evaluator does NOT evaluate inside the argument list nor the function body. Instead, it creates the function object by finding the `environment()`, filling the argument list `formals()`, and storing the already parsed, still unevaluated function body as `body()`. Refer to my previous post on R functions.
[5] It shall be noted that R allows partial evaluation of a parse tree. This is called quasiquotation and is more intuitively supported by `rlang`.
[6] A constant is either `NULL` or an atomic scalar (length-1 vector). This is the simplest language type and we do not discuss it further.
[7] `rlang::enexpr()` works by looking into the promise objects of the parameters.
[8] Conceptually, you can think that functions that support a “data mask” work by evaluating the expression in a special environment chain, where the first chain defined by the data mask is linked to the second chain defined by the environment (i.e., the “enclosure” as mentioned in `eval()`).
[9] For more details refer to a well-written topic in a previous rlang package (link).
[10] They are not really R operators (`!` is logical not and `!!` is nothing but double negation). In fact, their special behavior is implemented by defusing.
[11] If you are familiar with R functions, quosures are conceptually similar to promises. On a separate note, quosures are also conceptually similar to formulas.
[12] Technically, it supports any vector with `names()` and `[[` implementations (see the documentation of `eval_select`).
[13] The equivalent version is a harmless fiction to simplify the evaluation process. In fact, quoting of `by` and `what` is not performed in the execution environment of `groupThenMean`.
[14] Tidyverse verbs quote using the quosure mechanism instead of expressions. Our example here is about data-masked evaluation and therefore using quosures or expressions does not matter.
[15] In general, `expr(!!<expr>) == <expr>` and `quo(!!<quo>) == <quo>`. That is, unquoting an expression/quosure yields the same expression/quosure. Refer to the quasiquotation section for details.
[16] It might help to remember that unquoting happens during `rlang::enquo`. Therefore, the unquoting part has no data mask. That is why `mutate(mtcars, disp = disp - !!disp)` works. This is generally true for data-masking arguments in tidyverse.
[17] Actually it is implemented in C for the most part. Refer to the rlang source code for more details.
[18] In reality they are of course different: when using the dynamic dots, we are executing a function call; when unquoting call arguments, we are making a call expression.
[19] It is initially implemented in the glue package, which focuses on string interpolation. If you are familiar with Python, it is conceptually similar to f-strings.