class: center, middle, hide-count, title <img src="images/data.table_hex.png" width="25%" /><img src="images/dtplyr_hex.png" width="25%" /> <br> <br> .content-box-title[ ## .code[data.table] and .code[dtplyr] ### R packages for high-efficiency data processing ] .pull-right[ .rightalign[ <small>Will Simmons</small><br> <small>Data Science I</small><br> <small>December 8, 2021</small> ] ] --- class: center, middle ### `data.table` <br><br> introduction --- ## What is .code[data.table]? <br> <ul> -- ### <li>High-performance version of base .code[R]’s .code[data.frame]</li> -- ### <li>Similar to .code[dplyr] in purpose and scope</li> -- ### <li>Biggest advantage over .code[tidyverse] and base .code[R]: **speed**</li> </ul> --- ## How much faster is .code[data.table]? <br> <ul> -- ### <li>Almost always faster than .code[dplyr], depending on task -- ### <li>Advantage more apparent with larger data</li> -- ### <li>With enough rows/groups, .code[dplyr] not even usable 😥 </ul> .footnote[Source: *Database-like ops benchmark.* H2O.ai (https://h2oai.github.io/db-benchmark/).] ??? Performing grouped operations or joins on large data, data.table can outperform dplyr by a factor of 30 in terms of speed --- ## Basic `data.table` syntax .center[ <br> # .code[.golden[DT][.green[i], .blue[j], .cardinal[by]]] <br> ] -- .center[ Take `data.table` **.code[.golden[DT]]**,<br> subset rows using **.code[.green[i]]**<br> and manipulate columns with **.code[.blue[j]]**,<br> grouped according to **.code[.cardinal[by]]**. ] .footnote[Source: *Data Transformation with data.table :: CHEAT SHEET.* Erik Petrovski. [Link](https://rdatatable.gitlab.io/data.table/#cheatsheets).] ??? If we want to think about it in terms of dplyr verbs: Take data.table DT, **filter** and **arrange** using i, and **select**, **mutate**, or **summarize** columns with j, **grouped by** by. We'll get briefly into more complex things we've learned with dplyr, too, like piping multiple commands and joining data. --- class: center, middle ### `data.table` <br><br> basics --- ## Basics with `data.table` ### Create a `data.table` .golden[from scratch] -- - Similar to `data.frame` or `tibble`, we can manually build a `data.table` -- ```r test <- data.table( a = 1:3, b = letters[1:3], # letters and LETTERS are predefined vectors in R c = LETTERS[1:3] ) test ``` ``` ## a b c ## 1: 1 a A ## 2: 2 b B ## 3: 3 c C ``` --- ## Basics with `data.table` ### Create a `data.table` .golden[from an existing object] -- * We can also coerce an existing object into a `data.table` using `setDT()` * `mtcars` is loaded as a `data.frame` by default; let's make it usable with `data.table` -- ```r mtcars <- mtcars # if loaded before data.table, loaded as data.frame library(data.table) mtcars <- setDT(mtcars) # use setDT() to coerce to data.table class(mtcars) ``` ``` ## [1] "data.table" "data.frame" ``` -- <br> **Note:** `data.table` operations won't work on non-`data.table` objects. --- ## Basics with `data.table` data ### .golden[Print data] -- * If `nrow` > 100, calling a .code[data.table] prints `head(5)` and `tail(5)` -- ```r as.data.table(quakes) # can also coerce to DT with as.data.table() ``` ``` ## lat long depth mag stations ## 1: -20.42 181.62 562 4.8 41 ## 2: -20.62 181.03 650 4.2 15 ## 3: -26.00 184.10 42 5.4 43 ## 4: -17.97 181.66 626 4.1 19 ## 5: -20.42 181.96 649 4.0 11 ## --- ## 996: -25.93 179.54 470 4.4 22 ## 997: -12.28 167.06 248 4.7 35 ## 998: -20.13 184.20 244 4.5 34 ## 999: -17.40 187.80 40 4.5 14 ## 1000: -21.59 170.56 165 6.0 119 ``` ??? Since mtcars only contains 32 rows, data.table would print the whole thing --- ## Basics with `data.table` data ### .golden[Read and write data] -- * Fast delimited file reader: **.code[fread()]** * Fast file writer: **.code[fwrite()]** -- ```r # read mydata <- fread(here("data/mydata.csv")) # much faster than read_csv(), etc. # wrangle mydata_cleaned <- ... # write fwrite(mydata, here("data/mydata_cleaned.csv")) # create an updated CSV ``` --- class: center, middle ### `data.table` <br><br> simple data operations .code[filter, arrange, select, rename] --- ## Reminder: .code[data.table] syntax .center[ # .code[.golden[DT][.green[i], .blue[j], .cardinal[by]]] <br> ] -- .center[ ### approximate .code[dplyr] translation: .code[.golden[data] %>%] <br> .code[.blue[select] %>% ]<br> .code[.green[filter | arrange] %>% ]<br> .code[.cardinal[group_by] %>% ]<br> .code[.blue[mutate | summarize]] ] <br> ] ??? One quick aside here -- you'll see that all data.table code is contained within brackets; this allows data.table to know exactly what data is required for each computation, which usually makes it much faster than its dplyr equivalent --- .syntax-header[.code[.golden[DT][.green[i], .fade[.blue[j], .cardinal[by]]]]] <br> ## Subset rows in .code[.green[i]] -- By logical condition -- ```r mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 ## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 ``` -- By row index -- ```r mtcars[1:3] ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ``` --- .syntax-header[.code[.golden[DT][.green[i], .fade[.blue[j], .cardinal[by]]]]] <br> ## Reorder rows in .code[.green[i]] using .code[order()] -- Ascending order (default) -- ```r ordered <- mtcars[order(hp)] head(ordered) ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 ## 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ## 3: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 ## 4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 ## 5: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 ## 6: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 ``` --- .syntax-header[.code[.golden[DT][.green[i], .fade[.blue[j], .cardinal[by]]]]] <br> ## Reorder rows in .code[.green[i]] using .code[order()] Descending order using hyphen (`-`) -- ```r ordered_desc <- mtcars[order(-hp)] head(ordered_desc) ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 ## 2: 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4 ## 3: 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 ## 4: 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 ## 5: 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 ## 6: 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 ``` --- .syntax-header[.code[.golden[DT][.fade[.green[i]], .blue[j].fade[,.cardinal[by]]]]] <br> ## Select columns in .code[.blue[j]] * Wrap desired columns in .code[list()] or .code[.()] (an alias for .code[list()]); the latter is more common -- ```r selection <- mtcars[, .(mpg, hp)] # note the comma before j head(selection) ``` ``` ## mpg hp ## 1: 21.0 110 ## 2: 21.0 110 ## 3: 22.8 93 ## 4: 21.4 110 ## 5: 18.7 175 ## 6: 18.1 105 ``` --- .syntax-header[.code[.golden[DT][.fade[.green[i]], .blue[j].fade[,.cardinal[by]]]]] <br> ## Rename columns in .code[.blue[j]] * As in .code[dplyr], can select and rename at the same time -- ```r renamed <- mtcars[, .(miles_per_gal = mpg, horsepower = hp)] head(renamed) ``` ``` ## miles_per_gal horsepower ## 1: 21.0 110 ## 2: 21.0 110 ## 3: 22.8 93 ## 4: 21.4 110 ## 5: 18.7 175 ## 6: 18.1 105 ``` --- class: center, middle ### `data.table` <br><br> simple data manipulation .code[mutate, summarize] without groups --- .syntax-header[.code[.golden[DT][.fade[.green[i]], .blue[j].fade[,.cardinal[by]]]]] <br> ## .code[.blue[j]] is powerful! <br> <ul> -- ### <li>Contains capabilities of .code[dplyr]'s .code[select], .code[mutate], .code[summarize], and more</li> -- ### <li>When .code[.blue[j]] returns a list, each list-element will become a column in the resulting .code[data.table]</li> -- ### <li>When .code[.blue[j]] contains an expression, it will be evaluated ??? When j returns a list, each list element will be returned as a column - allows operations similar to dplyr across, but much more flexible --- ## Computation within .code[.blue[j]]: operations on columns -- What is the mean weight of the cars in the `mtcars` dataset? -- ```r mtcars[, mean(wt)] ``` ``` ## [1] 3.21725 ``` -- How many cars in the `mtcars` dataset have a 4-cylinder engine? -- ```r mtcars[, sum(cyl == 4)] ``` ``` ## [1] 11 ``` -- ### Remember: .code[.blue[j]] can accept column names **and** expressions! --- .syntax-header[.code[.golden[DT][.green[i], .blue[j].fade[, .cardinal[by]]]]] <br> ## Combining .code[.green[i]] and .code[.blue[j]] -- Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders. -- First, get the mean .code[mpg] for cars with 4 cylinders: -- ```r mtcars[cyl == 4, mean(mpg)] ``` ``` ## [1] 26.66364 ``` -- Then the mean .code[mpg] for cars with more than 4: -- ```r mtcars[cyl > 4, mean(mpg)] ``` ``` ## [1] 16.64762 ``` -- ### Next: .code[.green[i]] and .code[.blue[j]] become even more flexible with .code[.cardinal[by]] --- class: center, middle ### `data.table` <br><br> complex data manipulation grouped .code[mutate] and .code[summarize] --- .syntax-header[.code[.golden[DT][.green[i], .blue[j], .cardinal[by]]]] <br> ## .code[.cardinal[by]]: group by 1+ variables -- What are the median quarter-mile race times for 4- and 5-gear cars in `mtcars`? -- ```r mtcars[gear > 3, .(median_qtr_mile = median(qsec)), gear] ``` ``` ## gear median_qtr_mile ## 1: 4 18.755 ## 2: 5 15.500 ``` -- Here, we: * Subset to cars with .code[gear > 3] in .code[.green[i]] * Calculated .code[median_qtr_mile] using an expression in .code[.blue[j]] * Grouped by .code[gear] in .code[.cardinal[by]] --- ## Introducing .code[.N] <ul> -- ### <li>Special .code[data.table] variable .code[.N]: number of observations in the current group</li> -- ### <li>Similar use to .code[dplyr::count()] or .code[base::nrow()]</li> -- ### <li>Example:</li> How many cars are automatic, and how many are manual, in the .code[mtcars] dataset? -- ```r mtcars[, .N, by = am] ``` ``` ## am N ## 1: 1 13 ## 2: 0 19 ``` --- ## More examples with .code[.cardinal[by]] -- What's the fastest quarter-mile time by number of cylinders? -- ```r mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results ``` ``` ## cyl fastest_time ## 1: 4 16.7 ## 2: 6 15.5 ## 3: 8 14.5 ``` -- What's the highest horsepower in automatic versus manual cars? -- ```r mtcars[, (max_hp = max(hp)), am] ``` ``` ## am V1 ## 1: 1 335 ## 2: 0 245 ``` --- ## Chaining .code[data.table] expressions (like .code[%>%]) <br> -- What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars? -- ```r mtcars[, .(mpg, weight_cat = fifelse(wt >= 2, "heavy", "light")) # fifelse = "fast if-else" ][, .(mean_mpg = mean(mpg)), by = weight_cat] ``` ``` ## weight_cat mean_mpg ## 1: heavy 18.60357 ## 2: light 30.50000 ``` -- Here, we: * Selected .code[mpg] and the newly-created .code[weight_cat] in our first .code[.blue[j]] statement * Calculated .code[mean_mpg], grouped by .code[weight_cat], using our second .code[.blue[j]] statement and .cardinal[by] --- ## Chaining .code[data.table] expressions (like .code[%>%]) <br> -- Complex operations can have many chains - organize vertically -- ```r DT[ ... ][ ... ][ ... ][ ... ][ ... ][ ... ] ``` --- class: center, middle ### `dtplyr` <br><br> quick overview --- ## What is .code[dtplyr]? <br> <ul> -- ### <li>.code[data.table] backend (fast) + .code[dplyr] code (familiar) -- ### <li>Translates .code[dplyr] code to .code[data.table] code for us! -- ### <li>So...why learn .code[data.table] in the first place? ??? Why learn data.table in the first place? - dplyr can't do everything that data.table can, so there will be missing features - still some sacrifice in speed for certain features of dplyr, mainly dplyr's copy-on-modify behavior --- ## Using .code[dtplyr] -- * We'll need to load .code[dplyr], .code[dtplyr], and .code[data.table] -- ```r library(dplyr) library(dtplyr) library(data.table) ``` --- ## Using .code[dtplyr] -- * Then, we'll write our .code[dplyr] code, adding two elements: -- * Apply .code[lazy_dt()] to input data -- * Use .code[collect()], .code[as_tibble()], or .code[as.data.frame()] to collect results -- ```r palmerpenguins::penguins %>% # penguins dataset is tibble by default lazy_dt() %>% group_by(species) %>% summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE)) %>% collect() ``` ``` ## # A tibble: 3 x 2 ## species mean_bill_length ## <fct> <dbl> ## 1 Adelie 38.8 ## 2 Chinstrap 48.8 ## 3 Gentoo 47.5 ``` --- ## .code[dtyplr]: usefulness <br> -- * If you **don't** want to learn .code[data.table] but would like performance gain with .code[dplyr] -- * If you **do** want to learn .code[data.table], helpful to see .code[dtyplr]'s translations of your .code[dplyr] code -- * Drawback: potentially limited to simpler analytic procedures --- ## Things I didn't cover here <br> -- * The secret ingredient of .code[data.table]: modification in place -- * Advanced .code[data.table] [reference semantics](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html) using .code[:=] and .code[set*] functions -- * .code[.SD] and .code[.SDcols]: similar use-cases as .code[dplyr::across()] -- * Things .data[data.table] can do that .code[dplyr] and .code[dtplyr] [may not be able to](https://stackoverflow.com/a/59081276) --- class: center, middle ## questions?