Create splits with nested data

Use any rsample split function on nested data, where nests act as strata. This almost guarantees that every split will contain data from every nested data frame.

Usage

nested_resamples(
  data,
  resamples,
  nesting_method = NULL,
  size_action = c("truncate", "recycle", "recycle-random", "combine", "combine-random",
    "combine-end", "error"),
  ...
)

Arguments

data: A data frame.
resamples: An expression, function, formula or string that can be evaluated to produce an rset or rsplit object.
nesting_method: A recipe, workflow or NULL, used to nest data if data is not already nested (see Details).
size_action: If different numbers of splits are produced in each nest, how should sizes be matched? (see Details)
...: Extra arguments to pass into resamples.

Value

Either an rsplit object or an rset object, depending on resamples.

Details

This function breaks down a data frame into smaller, nested data frames. Resampling is then performed within each nest, and the results are combined at the end, so that each split should contain data from every nest. However, unlike rsample::make_strata(), this function does not pool small nests together, so you may run into issues if a nest is too small.
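The idea can be sketched in base R. This is only an illustration of the concept under stated assumptions, not the package's implementation: the data frame `toy`, its `id` column and the 75% training proportion are all hypothetical.

```r
# Toy data: three nests of different sizes (hypothetical, for illustration only)
toy <- data.frame(
  id = rep(c("a", "b", "c"), times = c(8, 12, 20)),
  x = seq_len(40)
)

set.seed(1)

# Resample within each nest: draw 75% of that nest's rows for training
rows_by_nest <- split(seq_len(nrow(toy)), toy$id)
train_by_nest <- lapply(rows_by_nest, function(rows) {
  rows[sample.int(length(rows), size = floor(0.75 * length(rows)))]
})

# Combine the per-nest results into a single split
train_rows <- sort(unlist(train_by_nest, use.names = FALSE))
test_rows <- setdiff(seq_len(nrow(toy)), train_rows)

# Every nest appears on both sides of the split
table(toy$id[train_rows])
table(toy$id[test_rows])
```

Because rows are drawn nest by nest, a very small nest can end up with no rows on one side of a split, which is the kind of issue the paragraph above warns about.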

Nesting Data

data can be nested in several ways:

If nesting_method is NULL and data is grouped (using dplyr::group_by()), the data will be nested (see tidyr::nest() for how this works).
If nesting_method is NULL and data is not grouped, it is assumed to already be nested, and nested_resamples will try to find a column that contains nested data frames.
If nesting_method is a workflow or recipe, and the recipe has a step created using step_nest(), data will be nested using the step in the recipe. This is convenient if you've already created a recipe or workflow, as it saves a line of code.
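As a rough illustration of the first two input shapes, the sketch below builds an already-nested and a grouped version of the same data using only dplyr and tidyr. The tibble `toy` and its columns are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a nest identifier `id`
toy <- tibble(
  id = rep(c("a", "b"), each = 3),
  x = 1:6,
  y = x * 2
)

# Already nested: one row per id, with a list-column of data frames
nested <- nest(toy, data = -id)

# Grouped: the grouping variable marks which rows belong to which nest
grouped <- group_by(toy, id)

nested
group_vars(grouped)
```

Either object could then be passed as data with nesting_method left as NULL.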

Resample Evaluation

The resamples argument can take many forms:

A function call, such as vfold_cv(v = 5). This is similar to the format of rsample::nested_cv().
A function, such as rsample::vfold_cv.
A purrr-style anonymous function (a formula such as ~ initial_split(.)), which will be converted to a function using rlang::as_function().
A string, which will be evaluated using rlang::exec().

Every method will be evaluated with data passed in as the first argument (with name 'data').
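The function, formula and string forms can be sketched with rlang alone. Here `count_rows()` is a hypothetical stand-in for a resampling function, used so that the example does not depend on any particular resampler; the sketch shows how each form ends up being called with `data` as a named argument and is not the package's internal code.

```r
library(rlang)

# Hypothetical stand-in for a resampling function: its first argument is `data`
count_rows <- function(data, ...) nrow(data)

# A function is called directly with `data` as a named argument
count_rows(data = mtcars)

# A purrr-style formula becomes a function; `.x` (or `.`) is the first argument
as_fn <- as_function(~ count_rows(.x))
as_fn(data = mtcars)

# A string names the function to call
exec("count_rows", data = mtcars)
```

All three calls return the same value, since each one passes the same data frame to the same function.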

Size Matching

Before the set of resamples created in each nest can be combined, they must contain the same number of splits. For most resampling methods, this will not be an issue. rsample::vfold_cv(), for example, reliably creates the number of splits defined in its 'v' argument. However, other resampling methods, like rsample::rolling_origin(), depend on the size of their 'data' argument, and therefore may produce different numbers of resamples when presented with differently sized nests.
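The difference can be seen by applying both methods to two data frames of different sizes. The data frames `small_nest` and `large_nest` are hypothetical stand-ins for two nests, and the argument values are chosen only for illustration.

```r
library(rsample)

# Two hypothetical nests of different sizes
small_nest <- data.frame(x = 1:8)
large_nest <- data.frame(x = 1:12)

# vfold_cv() returns `v` splits regardless of the number of rows
n_vfold_small <- nrow(vfold_cv(small_nest, v = 4))
n_vfold_large <- nrow(vfold_cv(large_nest, v = 4))

# rolling_origin() returns more splits when given more rows
n_roll_small <- nrow(rolling_origin(small_nest, initial = 5, assess = 1))
n_roll_large <- nrow(rolling_origin(large_nest, initial = 5, assess = 1))

c(n_vfold_small, n_vfold_large)
c(n_roll_small, n_roll_large)
```

In the second case the two sets of resamples cannot be combined split by split until their sizes are matched.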

The size_action argument defines several ways of matching the sizes of resample sets with different numbers of splits. Each method either reduces the number of splits in every set to match the set with the fewest splits, or increases it to match the set with the most splits.

"truncate", the default, means that all splits beyond the required length will be removed.

"recycle" means that sets of splits will be extended by repeating elements until the required length has been reached, mimicking the process of vector recycling. The advantage of this method is that all created splits will be preserved.

"recycle-random" is similar to recycling, but splits are copied at random to spaces in the output, which may be important if the order of resamples matters. The process is not completely random: every split is copied roughly the same number of times.

"combine" gets rid of excess splits by combining them with previous ones. This means the training and testing rows are merged into one split. Combining is done systematically: if a set of splits needs to be compacted down to a set of 5, the first split is combined with the sixth split, then the eleventh, then the sixteenth, etc. This approach is not recommended, since it is not clear what the benefit of a combined split is.

"combine-random" combines each split with a random set of other splits, instead of the systematic process described in the previous method. Once again, this process is not completely random: each split is combined with roughly the same number of other splits.

"combine-end" combines every excess split with the last non-excess split.

"error" throws an error if each nest does not produce the same number of splits.
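The index arithmetic behind these options can be sketched in base R. Splits are represented only by their position, and the sizes (5 and 12) are hypothetical. The sketch follows the descriptions above and is not the package's own code; in particular, shuffling a recycled vector is just one assumed way to obtain a balanced random fill.

```r
# Hypothetical example: one nest produced 5 splits, another produced 12.
# Each split is represented by its index only.
few <- 1:5
many <- 1:12

# "truncate": drop splits beyond the shortest length
truncate_idx <- many[seq_along(few)]

# "recycle": repeat the shorter set until it reaches the longest length
recycle_idx <- rep_len(few, length(many))

# "recycle-random": same copy counts as recycling, but in random order
# (assumed approach: shuffle the recycled indices)
set.seed(1)
recycle_random_idx <- sample(rep_len(few, length(many)))

# "combine": split k is merged with splits k + 5, k + 10, ...
combine_groups <- split(many, rep_len(few, length(many)))

# "combine-end": every excess split is merged into the last kept split
combine_end_groups <- split(many, pmin(many, length(few)))

combine_groups[["1"]]      # the first, sixth and eleventh splits
combine_end_groups[["5"]]  # the fifth split plus every excess split
```

With "truncate" and the "combine" variants the result has 5 splits per nest; with the "recycle" variants it has 12.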

Examples


library(nestedmodels)
library(tidyr)
library(recipes)
library(workflows)
library(rsample)
library(dplyr)

nested_data <- example_nested_data %>%
  nest(data = -id)

grouped_data <- example_nested_data %>%
  group_by(id)

# Named `rec` to avoid masking the recipe() function
rec <- recipe(example_nested_data, z ~ .) %>%
  step_nest(id)

wf <- workflow() %>%
  add_recipe(rec)

nested_resamples(nested_data, vfold_cv())
#> #  10-fold cross-validation 
#> # A tibble: 10 × 2
#>    splits            id    
#>    <list>            <chr> 
#>  1 <split [900/100]> Fold01
#>  2 <split [900/100]> Fold02
#>  3 <split [900/100]> Fold03
#>  4 <split [900/100]> Fold04
#>  5 <split [900/100]> Fold05
#>  6 <split [900/100]> Fold06
#>  7 <split [900/100]> Fold07
#>  8 <split [900/100]> Fold08
#>  9 <split [900/100]> Fold09
#> 10 <split [900/100]> Fold10

nested_resamples(
  group_by(example_nested_data, id),
  ~ initial_split(.)
)
#> <Training/Testing/Total>
#> <740/260/1000>

nested_resamples(example_nested_data, bootstraps,
  times = 25, nesting_method = wf
)
#> # Bootstrap sampling 
#> # A tibble: 25 × 2
#>    splits             id         
#>    <list>             <chr>      
#>  1 <split [1000/367]> Bootstrap01
#>  2 <split [1000/378]> Bootstrap02
#>  3 <split [1000/359]> Bootstrap03
#>  4 <split [1000/384]> Bootstrap04
#>  5 <split [1000/361]> Bootstrap05
#>  6 <split [1000/376]> Bootstrap06
#>  7 <split [1000/355]> Bootstrap07
#>  8 <split [1000/382]> Bootstrap08
#>  9 <split [1000/359]> Bootstrap09
#> 10 <split [1000/355]> Bootstrap10
#> # ℹ 15 more rows

# Nested resampling (rsample::nested_cv()) applied to nested data

nested_resamples(nested_data, nested_cv(
  vfold_cv(),
  bootstraps()
))
#> # Nested resampling:
#> #  outer: 10-fold cross-validation
#> #  inner: Bootstrap sampling
#> # A tibble: 10 × 3
#>    splits            id     inner_resamples
#>    <list>            <chr>  <list>         
#>  1 <split [900/100]> Fold01 <boot [25 × 2]>
#>  2 <split [900/100]> Fold02 <boot [25 × 2]>
#>  3 <split [900/100]> Fold03 <boot [25 × 2]>
#>  4 <split [900/100]> Fold04 <boot [25 × 2]>
#>  5 <split [900/100]> Fold05 <boot [25 × 2]>
#>  6 <split [900/100]> Fold06 <boot [25 × 2]>
#>  7 <split [900/100]> Fold07 <boot [25 × 2]>
#>  8 <split [900/100]> Fold08 <boot [25 × 2]>
#>  9 <split [900/100]> Fold09 <boot [25 × 2]>
#> 10 <split [900/100]> Fold10 <boot [25 × 2]>