--- title: "Common Table Expressions (CTEs)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Common Table Expressions (CTEs)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = RPresto::presto_has_default() ) options(pillar.max_dec_width=20) ``` The [Common Table Expressions][1] or CTE’s for short are used within SQL databases to simplify complex joins and subqueries. You can think of them as named subqueries that can be referenced in other parts of the query, including other CTEs (i.e. recursive CTEs). In Presto, CTEs take the form of [`WITH` clauses][2]. There can be multiple named CTE within the single `WITH` clause. They need to be defined before the main `SELECT` query. At the time of this writing (late 2022), `DBI` APIs don't have an official way of implementing CTEs yet. `dbplyr` just begins to introduce CTEs into its APIs as an experimental feature. So our implementation of CTE support in `RPresto` is very much avant-garde and should be used with discretion. We attach CTEs to the `PrestoConnection` so that they are available across the queries executed via the connection. ## Setup ```{r setup, message=FALSE} library(RPresto) library(DBI) library(dplyr) library(dbplyr) ``` You can check your `RPresto` version by running the `packageVersion()` function. **You need version 1.4.0 or later to use the CTE feature.** ```{r version} packageVersion("RPresto") ``` ## Define CTEs while creating a `PrestoConnection` You can define and attach CTEs while creating a `PrestoConnection`. Here we assume that the user already have a Presto server with a memory connector set up. If you don't have such a server set up, refer to the [Presto documentation][2] for instructions if you want to follow along. ```{r PrestoConnection} con <- DBI::dbConnect( drv = RPresto::Presto(), host = "http://localhost", port = 8080, user = Sys.getenv("USER"), catalog = "memory", schema = "default", # Define a testing CTE using dummy VALUES ctes = list( "dummy_values" = "SELECT * FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c') ) AS t (id, name)" ) ) ``` Now `dummy_values` is not an existing permanent table available in the `PrestoConnection`. It only exists as a temporary feature for the connection. ```{r exists_dummy_values} db_has_table(con, "dummy_values") ``` We can read the content of the CTE. ```{r read_dummy_values} dbReadTable(con, "dummy_values") ``` We can also execute arbitrary `SELECT` queries on top of the CTE. ```{r getquery_dummy_values} dbGetQuery(con, "SELECT id * 2 AS id_2, name FROM dummy_values") ``` ## Incorporating CTEs with `dplyr` backend Another way of leveraging CTEs in your workflow is to incorporate them into the `dplyr` workflow. ```{r setup_dplyr} # We first copy mtcars to Presto and create a remote table on it tbl.mtcars <- copy_to(con, mtcars, "test_mtcars", overwrite = TRUE) tbl.mtcars %>% colnames() ``` We call a few `dplyr` verbs on the remote table to mimic a typical analysis work flow. ```{r dplyr_transformation} tbl.mtcars.transform <- tbl.mtcars %>% mutate(hp2 = pow(hp, 2)) %>% group_by(cyl) %>% mutate(mean_mpg_by_cyl = mean(mpg, na.rm = TRUE)) ``` We can see the underlying SQL query generated so far. ```{r dplyr_showquery} tbl.mtcars.transform %>% show_query() ``` For illustration, let's say we filter the same transformed table twice on the `cyl` field and `UNION ALL` them together in the next step. ```{r unionall} tbl.mtcars.union <- union( filter(tbl.mtcars.transform, cyl == 4L), filter(tbl.mtcars.transform, cyl == 8L), all = TRUE ) tbl.mtcars.union %>% show_query() ``` The underlying SQL query of the transformed table (i.e. `tbl.mtcars.transform`) has to be replicated twice in this step and thus makes the resulting query long and repetitive. It offers a prime opportunity to simplify using CTEs. We can "save" the underlying SQL query of the transformed table into a CTE and use that in the union step by calling the `compute()` function with `cte = TRUE`. ```{r compute_cte} tbl.mtcars.transform <- tbl.mtcars.transform %>% compute(name = "mtcars_transform", cte = TRUE) tbl.mtcars.transform %>% show_query() ``` Here the content of `tbl.mtcars.transform` hasn't changed at all and we can use the remote table as it is just like before. The only change underneath is that the underlying logic is now captured and stored in a CTE. You can almost think of it as saving `tbl.mtcars.transform` as a temporary table named `mtcars_transform` and pointing the new remote table on that temporary table. The difference is that no query has actually been executed yet. Now we've leveraged CTE, the query for the union step looks more clean and readable. ```{r unionall_cte} tbl.mtcars.union <- union( filter(tbl.mtcars.transform, cyl == 4L), filter(tbl.mtcars.transform, cyl == 8L), all = TRUE ) tbl.mtcars.union %>% show_query() ``` We can even create nested CTEs that depend on other CTEs (Presto calls it chained CTEs). Below we call `compute()` on `tbl.mtcars.union` which already utilizies the `mtcars_transform` CTE. ```{r compute_cte_2} tbl.mtcars.union <- tbl.mtcars.union %>% compute(name = "mtcars_union", cte = TRUE) tbl.mtcars.union %>% show_query() ``` Now the underlying query of the previous `tbl.mtcars.union` is saved into the `mtcars_union` CTE which in turn depends on the `mtcars_transform` CTE. ```{r teardown_dplyr, echo=FALSE} # Clean up dbRemoveTable(con, "test_mtcars") ``` [1]: https://www.essentialsql.com/introduction-common-table-expressions-ctes/ [2]: https://prestodb.io/docs/current/sql/select.html#with-clause