Use any rsample split function on nested data, where nests act as strata. This almost guarantees that every split will contain data from every nested data frame.
Usage
nested_resamples(
data,
resamples,
nesting_method = NULL,
size_action = c("truncate", "recycle", "recycle-random", "combine", "combine-random",
"combine-end", "error"),
...
)
Arguments
- data
A data frame.
- resamples
An expression, function, formula or string that can be evaluated to produce an rset or rsplit object.
- nesting_method
A recipe, workflow or NULL, used to nest data if data is not already nested (see Details).
- size_action
If different numbers of splits are produced in each nest, how should sizes be matched? (see Details)
- ...
Extra arguments to pass into resamples.
Details
This function breaks down a data frame into smaller, nested data frames.
Resampling is then performed within these nests, and the results are
combined together at the end. This ensures that each split contains
data from every nest. However, this function does not perform any
pooling (unlike rsample::make_strata()), so you may run into issues
if a nest is too small.
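The idea of resampling within nests and then combining the results can be sketched in base R. This is only an illustration of the concept, not the package's implementation: the data frame df, the 25% hold-out proportion and the variable names are all hypothetical.

```r
set.seed(1)

# Hypothetical data: three nests of different sizes, identified by `id`.
df <- data.frame(
  id = rep(c("a", "b", "c"), times = c(6, 4, 8)),
  x = seq_len(18)
)

# Row indices belonging to each nest.
rows_by_nest <- split(seq_len(nrow(df)), df$id)

# Hold out roughly 25% of the rows within each nest (at least one row),
# then combine the held-out rows from all nests into one test set.
test_rows <- unlist(lapply(rows_by_nest, function(rows) {
  n_test <- max(1, round(length(rows) * 0.25))
  rows[sample.int(length(rows), size = n_test)]
}), use.names = FALSE)

# Everything that was not held out forms the training set.
train_rows <- setdiff(seq_len(nrow(df)), test_rows)

# Both sides of the split contain rows from every nest.
table(df$id[train_rows])
table(df$id[test_rows])
```

Because no pooling is done, a nest with very few rows contributes very few rows (here, at least one) to the held-out side, which is where the issues with small nests come from.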
Nesting Data
data can be nested in several ways:
- If nesting_method is NULL and data is grouped (using dplyr::group_by()), the data will be nested (see tidyr::nest() for how this works).
- If data is not grouped, it is assumed to already be nested, and nested_resamples will try to find a column that contains nested data frames.
- If nesting_method is a workflow or recipe, and the recipe has a step created using step_nest(), data will be nested using the step in the recipe. This is convenient if you've already created a recipe or workflow, as it saves a line of code.
Resample Evaluation
The resamples argument can take many forms:
- A function call, such as vfold_cv(v = 5). This is similar to the format of rsample::nested_cv().
- A function, such as rsample::vfold_cv.
- A purrr-style anonymous function, which will be converted to a function using rlang::as_function().
- A string, which will be evaluated using rlang::exec().
Every method will be evaluated with data passed in as the first
argument (with name 'data').
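As a rough sketch, the four forms might look like this when each is asked for 5-fold cross-validation. It assumes the nestedmodels package (which provides nested_resamples() and example_nested_data) is installed, and that v = 5 reaches vfold_cv() through the ... argument in the function and string forms, as described under Arguments.

```r
library(nestedmodels)
library(rsample)
library(tidyr)

# Nest the example data so that each id has its own data frame.
nested_data <- nest(example_nested_data, data = -id)

# 1. A function call.
by_call <- nested_resamples(nested_data, vfold_cv(v = 5))

# 2. A function, with extra arguments passed through `...`.
by_function <- nested_resamples(nested_data, vfold_cv, v = 5)

# 3. A purrr-style anonymous function; `.` stands for the data.
by_formula <- nested_resamples(nested_data, ~ vfold_cv(., v = 5))

# 4. A string naming the function, again with arguments in `...`.
by_string <- nested_resamples(nested_data, "vfold_cv", v = 5)
```

The fold assignments are random, so the four results will generally differ in which rows land in which fold unless the seed is reset before each call.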
Size Matching
Before the set of resamples created in each nest can be combined, they
must contain the same number of splits. For most resampling methods,
this will not be an issue. rsample::vfold_cv(), for example, reliably
creates the number of splits defined in its 'v' argument. However,
other resampling methods, like rsample::rolling_origin(), depend on
the size of their 'data' argument, and therefore may produce different
numbers of resamples when presented with differently sized nests.
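A small sketch of how this mismatch arises, using two hypothetical nests of 8 and 10 rows and arbitrary initial and assess values chosen only for illustration:

```r
library(rsample)

# Two hypothetical nests of different sizes.
small_nest <- data.frame(x = 1:8)
large_nest <- data.frame(x = 1:10)

# The same rolling_origin() settings applied to each nest.
n_small <- nrow(rolling_origin(small_nest, initial = 5, assess = 1))
n_large <- nrow(rolling_origin(large_nest, initial = 5, assess = 1))

# The smaller nest yields fewer splits, so the two sets cannot be
# combined split-by-split without some form of size matching.
c(small = n_small, large = n_large)
```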
The size_action argument defines many ways of matching the sizes of
resample sets with different numbers of splits. These methods will either try
to reduce the number of splits in each set until each rset is the same
length as the set with the lowest number of splits; or the opposite,
where each rset will have the same number of splits as the largest set.
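The two directions can be mimicked on plain character vectors in base R. This is a sketch of the ideas behind the methods listed below, not the package's own code; the split labels and the target length of 5 are hypothetical.

```r
# Hypothetical split labels standing in for the resamples of two nests.
long_set <- paste0("split", 1:12)
short_set <- paste0("split", 1:3)
target <- 5

# Reducing by truncation: drop everything beyond the required length.
truncated <- head(long_set, target)

# Extending by recycling: repeat elements until the required length
# is reached, as in vector recycling.
recycled <- rep_len(short_set, target)

# Reducing by systematic combining: position 1 is grouped with
# positions 6 and 11, position 2 with 7 and 12, and so on.
combine_groups <- split(long_set, (seq_along(long_set) - 1) %% target)

# Reducing by combining at the end: every excess split joins the
# last non-excess split.
end_groups <- split(long_set, pmin(seq_along(long_set), target))
```

Truncation discards splits, while recycling keeps every split at the cost of duplicates; the combining variants keep every row but change what a single split means.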
"truncate", the default, means that all splits beyond the required length will be removed.
"recycle" means that sets of splits will be extended by repeating elements until the required length has been reached, mimicking the process of vector recycling. The advantage of this method is that all created splits will be preserved.
"recycle-random" is a similar process to recycling, but splits will be copied at random to spaces in the output, which may be important if the order of resamples matters. This process is not completely random, and the program makes sure that every split is copied roughly the same number of times.
"combine" gets rid of excess splits by combining them with previous ones. This means the training and testing rows are merged into one split. Combining is done systematically: if a set of splits needs to be compacted down to a set of 5, the first split is combined with the sixth split, then the eleventh, then the sixteenth, etc. This approach is not recommended, since it is not clear what the benefit of a combined split is.
"combine-random" combines each split with a random set of other splits, instead of the systematic process described in the previous method. Once again, this process is not actually random, and each split will be combined with roughly the same number of other splits.
"combine-end" combines every excess split with the last non-excess split.
"error" throws an error if each nest does not produce the same number of splits.
See also
rsample::initial_split() for an example of the strata argument.
Examples
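# nestedmodels provides nested_resamples(), step_nest() and example_nested_data
library(nestedmodels)
library(tidyr)
library(recipes)
library(workflows)
library(rsample)
library(dplyr)

# Already nested: one row per id, with a list-column of data frames
nested_data <- example_nested_data %>%
  nest(data = -id)

# Grouped: nested_resamples() will nest this by id
grouped_data <- example_nested_data %>%
  group_by(id)

# A recipe (and workflow) carrying a step_nest() step
recipe <- recipe(example_nested_data, z ~ .) %>%
  step_nest(id)

wf <- workflow() %>%
  add_recipe(recipe)
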
nested_resamples(nested_data, vfold_cv())
#> # 10-fold cross-validation
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [900/100]> Fold01
#> 2 <split [900/100]> Fold02
#> 3 <split [900/100]> Fold03
#> 4 <split [900/100]> Fold04
#> 5 <split [900/100]> Fold05
#> 6 <split [900/100]> Fold06
#> 7 <split [900/100]> Fold07
#> 8 <split [900/100]> Fold08
#> 9 <split [900/100]> Fold09
#> 10 <split [900/100]> Fold10
nested_resamples(
  group_by(example_nested_data, id),
  ~ initial_split(.)
)
#> <Training/Testing/Total>
#> <740/260/1000>
nested_resamples(example_nested_data, bootstraps,
  times = 25, nesting_method = wf
)
#> # Bootstrap sampling
#> # A tibble: 25 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1000/367]> Bootstrap01
#> 2 <split [1000/378]> Bootstrap02
#> 3 <split [1000/359]> Bootstrap03
#> 4 <split [1000/384]> Bootstrap04
#> 5 <split [1000/361]> Bootstrap05
#> 6 <split [1000/376]> Bootstrap06
#> 7 <split [1000/355]> Bootstrap07
#> 8 <split [1000/382]> Bootstrap08
#> 9 <split [1000/359]> Bootstrap09
#> 10 <split [1000/355]> Bootstrap10
#> # ℹ 15 more rows
# nested nested resamples
nested_resamples(nested_data, nested_cv(
  vfold_cv(),
  bootstraps()
))
#> # Nested resampling:
#> # outer: 10-fold cross-validation
#> # inner: Bootstrap sampling
#> # A tibble: 10 × 3
#> splits id inner_resamples
#> <list> <chr> <list>
#> 1 <split [900/100]> Fold01 <boot [25 × 2]>
#> 2 <split [900/100]> Fold02 <boot [25 × 2]>
#> 3 <split [900/100]> Fold03 <boot [25 × 2]>
#> 4 <split [900/100]> Fold04 <boot [25 × 2]>
#> 5 <split [900/100]> Fold05 <boot [25 × 2]>
#> 6 <split [900/100]> Fold06 <boot [25 × 2]>
#> 7 <split [900/100]> Fold07 <boot [25 × 2]>
#> 8 <split [900/100]> Fold08 <boot [25 × 2]>
#> 9 <split [900/100]> Fold09 <boot [25 × 2]>
#> 10 <split [900/100]> Fold10 <boot [25 × 2]>