Advanced usage of selenider

selenider exposes some advanced features to allow for more complex automation.

Customizing the session creation

library(selenider)

selenider_session() is really just a wrapper around either chromote::ChromoteSession$new(), or selenium::selenium_server() and selenium::SeleniumSession$new(). selenider exposes the arguments to these functions (plus some additional options) via the options argument.

The most common argument that you are going to want to use is headless in chromote_options(): setting it to FALSE runs chromote in non-headless mode, meaning that the browser you are controlling will be displayed:

session <- selenider_session(
  "chromote",
  options = chromote_options(headless = FALSE)
)

Managing selenium options is a bit more complex, since you can provide options to both the client (selenium_client_options()) and the server (selenium_server_options()). One useful trick is to pass NULL into the server_options parameter of selenium_options() to stop selenider from creating its own server. This is useful if you have created a server manually (using docker, for example):

session <- selenider_session(
  "selenium",
  options = selenium_options(
    server_options = NULL, # Stop selenider from creating a server
    client_options = selenium_client_options(
      host = "localhost", # Use the host and port of your manually created server
      port = 4444L
    )
  )
)

Accessing the underlying session

While selenider provides a high level interface, sometimes you need to access the underlying chromote::ChromoteSession or selenium::SeleniumSession to perform more advanced tasks. The driver field of a selenider_session() can be used to do this.

This is especially useful for chromote, since much of the configuration is done after the session is created:

session <- selenider_session()

chromote_session <- session$driver

chromote_session$Browser$setDownloadBehavior(
  behavior = "allow",
  downloadPath = "<path_to_folder>"
)

Accessing underlying elements

Much like you can access the underlying chromote/selenium session behind a selenider session, you can access the chromote/selenium element represented by a selenider_element/selenider_elements object using get_actual_element() and get_actual_elements(), respectively.

If you are using chromote, the backendNodeId of the element is returned, while in selenium’s case, the element is returned as a selenium::WebElement. It’s important to note that the element in this form is no longer lazy, so it should be used as soon as possible to avoid errors as the page changes.
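As a minimal sketch of how these pieces can fit together with chromote, the example below retrieves the backendNodeId of the first link on the R Project's website and hands it straight to the underlying session. The choice of DOM$describeNode() (a Chrome DevTools Protocol command exposed by chromote) is only an illustration, not something selenider prescribes, and the example assumes a local Chrome installation and network access.

```r
library(selenider)

session <- selenider_session("chromote")
open_url("https://www.r-project.org/")

# With chromote, get_actual_element() returns the element's backendNodeId.
node_id <- s("a") |> get_actual_element()

# The id is not lazy, so use it right away with the underlying
# ChromoteSession (illustrative call: any CDP command that accepts a
# backendNodeId could be used here instead).
node <- session$driver$DOM$describeNode(backendNodeId = node_id)

# Tag name of the node that was found
node$node$nodeName
```

If the page changes between the call to get_actual_element() and the call that uses the id, the id may no longer refer to a valid node, which is why the two calls are kept next to each other.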

Element collections

Let’s use selenider to get every link element in the R Project’s website.

open_url("https://www.r-project.org/")
#> Can't find an existing selenider session.
#> ℹ Creating a new session.

links <- ss("a")

links
#> { selenider_elements (61) }
#> [1] <a href="/"><img src="/Rlogo.png" width="100" height="78" alt="R"></a>
#> [2] <a href="/">[Home]</a>
#> [3] <a href="https://cran.r-project.org/mirrors.html">CRAN</a>
#> [4] <a href="/about.html">About R</a>
#> [5] <a href="/logo/">Logo</a>
#> [6] <a href="/contributors.html">Contributors</a>
#> [7] <a href="/news.html">What’s New?</a>
#> [8] <a href="/bugs.html">Reporting Bugs</a>
#> [9] <a href="/conferences/">Conferences</a>
#> [10] <a href="/search.html">Search</a>
#> [11] <a href="/mail.html">Get Involved: Mailing Lists</a>
#> [12] <a href="https://contributor.r-project.org/">Get Involved: Contributing</a>
#> [13] <a href="https://developer.R-project.org/">Developer Pages</a>
#> [14] <a href="https://blog.r-project.org/">R Blog</a>
#> [15] <a href="/foundation/">Foundation</a>
#> [16] <a href="/foundation/board.html">Board</a>
#> [17] <a href="/foundation/members.html">Members</a>
#> [18] <a href="/foundation/donors.html">Donors</a>
#> [19] <a href="/foundation/donations.html">Donate</a>
#> [20] <a href="/help.html">Getting Help</a>
#> ...

But what actually is links? In some ways, it acts like a list:

links[[1]]
#> { selenider_element }
#> <a href="/">
#>   <img src="/Rlogo.png" width="100" height="78" alt="R">
#> </a>

links[1:2]
#> { selenider_elements (2) }
#> [1] <a href="/"><img src="/Rlogo.png" width="100" height="78" alt="R"></a>
#> [2] <a href="/">[Home]</a>

length(links)
#> [1] 61

But assuming it is a list in all scenarios can result in surprising behavior:

names(links)
#> [1] "session"     "driver"      "driver_id"   "element"     "timeout"    
#> [6] "selectors"   "to_be_found"

To reveal why this is, let’s emulate adding a new link to the page using JavaScript.

execute_js_expr("
  const link = document.createElement('a');
  link.href = 'https://ashbythorpe.github.io/selenider/';
  link.innerText = 'Selenider';
  document.body.appendChild(link);
")
#> NULL

Now let’s look at links again:

links
#> { selenider_elements (62) }
#> [1] <a href="/"><img src="/Rlogo.png" width="100" height="78" alt="R"></a>
#> [2] <a href="/">[Home]</a>
#> [3] <a href="https://cran.r-project.org/mirrors.html">CRAN</a>
#> [4] <a href="/about.html">About R</a>
#> [5] <a href="/logo/">Logo</a>
#> [6] <a href="/contributors.html">Contributors</a>
#> [7] <a href="/news.html">What’s New?</a>
#> [8] <a href="/bugs.html">Reporting Bugs</a>
#> [9] <a href="/conferences/">Conferences</a>
#> [10] <a href="/search.html">Search</a>
#> [11] <a href="/mail.html">Get Involved: Mailing Lists</a>
#> [12] <a href="https://contributor.r-project.org/">Get Involved: Contributing</a>
#> [13] <a href="https://developer.R-project.org/">Developer Pages</a>
#> [14] <a href="https://blog.r-project.org/">R Blog</a>
#> [15] <a href="/foundation/">Foundation</a>
#> [16] <a href="/foundation/board.html">Board</a>
#> [17] <a href="/foundation/members.html">Members</a>
#> [18] <a href="/foundation/donors.html">Donors</a>
#> [19] <a href="/foundation/donations.html">Donate</a>
#> [20] <a href="/help.html">Getting Help</a>
#> ...

links[[length(links)]]
#> { selenider_element }
#> <a href="https://ashbythorpe.github.io/selenider/">
#>   Selenider
#> </a>

links has been updated to include the new link!

A lazy list

The core reason behind this strange behavior is selenider’s promise of laziness. This means that elements are only ever collected from the page right before they are used by an eager function (print(), elem_text(), elem_click(), etc.). The only thing a selenider element actually stores is the path to an element (i.e. the set of steps you specified to reach the element), rather than the element itself.

This property offers an array of benefits when compared with the eager approach. It is a far more suitable representation of a constantly changing webpage, and as such side-steps many common errors encountered during web automation. It also powers selenider’s automatic waiting feature.
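The laziness described above can be illustrated with a short sketch: an element is described before it exists on the page, and the description only gets resolved once an eager function needs it. The id selenider-demo-link is made up for this example, and a working chromote setup with network access is assumed.

```r
library(selenider)

open_url("https://www.r-project.org/")

# Describing an element does not query the page, so this succeeds even
# though nothing with this (made-up) id exists yet.
new_link <- s("#selenider-demo-link")

# Add a matching element to the page after the fact.
execute_js_expr("
  const link = document.createElement('a');
  link.id = 'selenider-demo-link';
  link.innerText = 'Selenider';
  document.body.appendChild(link);
")

# The path is only followed here, when an eager function needs the element.
link_text <- elem_text(new_link)
```

An eagerly collected element would have failed at the s() call, since the link did not exist at that point.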

The element collection, then, is a generalisation of this concept to sets of elements. A selenider_elements object stores the path to its elements, but not the elements themselves. It therefore cannot be represented by a list; for one thing, as seen above, it is necessarily unaware of its length.

For all of the advantages of lazy elements, this choice of structure does come with some caveats. The major one is that many list operations will not work on an element collection; in fact, you should assume that any operation that works on a list will not work on a selenider_elements object. This is in part due to the fact that R does not natively support custom iterators.

So, what can I do?

selenider provides an API for working with element collections. All of the methods below preserve the laziness of the element collection, meaning that none of them will actually fetch any elements from the page until the resulting element is used.

elems[[x]] and elems[x] work with numeric indices, including negative numbers, allowing you to filter elements by position.
elem_filter() and elem_find() allow you to filter an element collection or find a single element based on a condition.
elem_flatten() allows you to combine multiple elements or element collections into a single collection.
find_each_element() and find_all_elements() allow you to easily find children of all the elements in a collection.
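The functions listed above might be combined as in the following sketch, which reuses the R Project's website. The link texts used in the conditions come from the output shown earlier; the li selector is an assumption about the page's structure, and nothing here is fetched from the page until an element is actually used.

```r
library(selenider)

open_url("https://www.r-project.org/")

links <- ss("a")

# Positional subsetting, including negative indices
first_two <- links[1:2]
all_but_first <- links[-1]

# Keep every link matching a condition, or find the first one that does
involved <- links |> elem_filter(has_text("Get Involved"))
board <- links |> elem_find(has_text("Board"))

# Combine elements and collections into a single collection
combined <- elem_flatten(first_two, board)

# Children of every element in a collection (assumes the page contains
# list items with links inside them)
first_link_per_item <- ss("li") |> find_each_element("a")
all_links_in_items <- ss("li") |> find_all_elements("a")
```

Each result is itself a lazy selenider_element or selenider_elements object, so these calls can be chained freely before anything is collected.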

As seen before, length() can be used on element collections to get the number of elements. This is not lazy: the page is queried when length() is called, so the value may no longer be accurate once the page changes.

However, sometimes you want to perform more complex operations on a set of elements. One common example is iteration, either in a for loop or using lapply()/purrr::map(). Iteration is an operation that goes against the idea of a lazy collection: how do you iterate over a set that is constantly changing?

In this situation, if you are willing to sacrifice some of the lazy properties of an element collection, use as.list(). This function, when called on an element collection elems, converts it to the following:

list(elems[[1]], elems[[2]], ..., elems[[n]])

Where n is length(elems).

Notably, the elements of the list are still lazy, since [[ preserves laziness on element collections. However, the length of the list is not, since the call to length() is not lazy.

Since this is an actual list, it supports a much wider range of operations. For example, in selenider’s README, as.list() is used to iterate over a collection of links to find their hyperlinks. Take a look at as.list.selenider_elements() for more examples.
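As a small sketch of this pattern, the example below converts the collection of links on the R Project's website into a list and collects the href attribute of each one. The use of vapply() and the NA default are choices made for this illustration, not something selenider requires.

```r
library(selenider)

open_url("https://www.r-project.org/")

links <- ss("a")

# The length of the list is fixed at this point, but each item is still
# a lazy selenider_element.
link_list <- as.list(links)

# Iterate with ordinary list tools; fall back to NA for links without
# an href attribute.
hrefs <- vapply(
  link_list,
  function(link) elem_attr(link, "href", default = NA_character_),
  character(1)
)
```

If links are added to the page after as.list() is called, link_list will not include them, since its length was computed eagerly.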

Forcing eager behaviour

Sometimes it may be desirable to avoid the lazy behaviour of selenider’s elements. This is usually for performance reasons: you may have an element represented by a long, complex set of steps, which needs to be used many times. By default, selenider will follow the path every time the element is used, which can end up being very slow, and may be redundant if you know the element’s position is unlikely to change.

elem_cache() can be used to force an element or set of elements to be retrieved from the DOM and stored, creating an “eager” element. Note the caveat in the docs: further elements created using this element will not also be eager, but will use this eager element as a starting point.
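A minimal sketch of caching, assuming the R Project's website and a deliberately multi-step path to its first link:

```r
library(selenider)

open_url("https://www.r-project.org/")

# A lazy element reached through more than one step
first_link <- s("body") |> find_element("a")

# Follow the path once and store the resulting element
cached_link <- elem_cache(first_link)

# These calls reuse the stored element rather than following the path again
link_text <- elem_text(cached_link)
link_href <- elem_attr(cached_link, "href")

# Elements derived from the cached element are lazy again, but start
# from the stored element
logo <- cached_link |> find_element("img")
```

Because the cached element is no longer re-fetched, it can become stale if the page changes, so caching is best kept to parts of the page that are known to be stable.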