+ - 0:00:00
Notes for current slide
Notes for next slide



data.table and dtplyr

R packages for high-efficiency data processing

Will Simmons
Data Science I
December 8, 2021

1 / 34

data.table

introduction

2 / 34

What is data.table?


3 / 34

What is data.table?


  • High-performance version of base R’s data.frame
3 / 34

What is data.table?


  • High-performance version of base R’s data.frame
  • Similar to dplyr in purpose and scope
3 / 34

What is data.table?


  • High-performance version of base R’s data.frame
  • Similar to dplyr in purpose and scope
  • Biggest advantage over tidyverse and base R: speed
 
3 / 34

How much faster is data.table?


4 / 34

How much faster is data.table?


  • Almost always faster than dplyr, depending on task
4 / 34

How much faster is data.table?


  • Almost always faster than dplyr, depending on task
  • Advantage more apparent with larger data
4 / 34

How much faster is data.table?


  • Almost always faster than dplyr, depending on task
  • Advantage more apparent with larger data
  • With enough rows/groups, dplyr not even usable 😥
 

Source: Database-like ops benchmark. H2O.ai (https://h2oai.github.io/db-benchmark/).

4 / 34

Performing grouped operations or joins on large data, data.table can outperform dplyr by a factor of 30 in terms of speed

Basic data.table syntax


DT[i, j, by]


5 / 34

Basic data.table syntax


DT[i, j, by]


Take data.table DT,
subset rows using i
and manipulate columns with j,
grouped according to by.

Source: Data Transformation with data.table :: CHEAT SHEET. Erik Petrovski. Link.

5 / 34

If we want to think about it in terms of dplyr verbs: Take data.table DT, filter and arrange using i, and select, mutate, or summarize columns with j, grouped by by.

We'll get briefly into more complex things we've learned with dplyr, too, like piping multiple commands and joining data.

data.table

basics

6 / 34

Basics with data.table

Create a data.table from scratch

7 / 34

Basics with data.table

Create a data.table from scratch

  • Similar to data.frame or tibble, we can manually build a data.table
7 / 34

Basics with data.table

Create a data.table from scratch

  • Similar to data.frame or tibble, we can manually build a data.table
test <- data.table(
a = 1:3,
b = letters[1:3], # letters and LETTERS are predefined vectors in R
c = LETTERS[1:3]
)
test
## a b c
## 1: 1 a A
## 2: 2 b B
## 3: 3 c C
7 / 34

Basics with data.table

Create a data.table from an existing object

8 / 34

Basics with data.table

Create a data.table from an existing object

  • We can also coerce an existing object into a data.table using setDT()

  • mtcars is loaded as a data.frame by default; let's make it usable with data.table

8 / 34

Basics with data.table

Create a data.table from an existing object

  • We can also coerce an existing object into a data.table using setDT()

  • mtcars is loaded as a data.frame by default; let's make it usable with data.table

mtcars <- mtcars # if loaded before data.table, loaded as data.frame
library(data.table)
mtcars <- setDT(mtcars) # use setDT() to coerce to data.table
class(mtcars)
## [1] "data.table" "data.frame"
8 / 34

Basics with data.table

Create a data.table from an existing object

  • We can also coerce an existing object into a data.table using setDT()

  • mtcars is loaded as a data.frame by default; let's make it usable with data.table

mtcars <- mtcars # if loaded before data.table, loaded as data.frame
library(data.table)
mtcars <- setDT(mtcars) # use setDT() to coerce to data.table
class(mtcars)
## [1] "data.table" "data.frame"


Note: data.table operations won't work on non-data.table objects.

8 / 34

Basics with data.table data

Print data

9 / 34

Basics with data.table data

Print data

  • If nrow > 100, calling a data.table prints head(5) and tail(5)
9 / 34

Basics with data.table data

Print data

  • If nrow > 100, calling a data.table prints head(5) and tail(5)
as.data.table(quakes) # can also coerce to DT with as.data.table()
## lat long depth mag stations
## 1: -20.42 181.62 562 4.8 41
## 2: -20.62 181.03 650 4.2 15
## 3: -26.00 184.10 42 5.4 43
## 4: -17.97 181.66 626 4.1 19
## 5: -20.42 181.96 649 4.0 11
## ---
## 996: -25.93 179.54 470 4.4 22
## 997: -12.28 167.06 248 4.7 35
## 998: -20.13 184.20 244 4.5 34
## 999: -17.40 187.80 40 4.5 14
## 1000: -21.59 170.56 165 6.0 119
9 / 34

Since mtcars only contains 32 rows, data.table would print the whole thing

Basics with data.table data

Read and write data

10 / 34

Basics with data.table data

Read and write data

  • Fast delimited file reader: fread()

  • Fast file writer: fwrite()

10 / 34

Basics with data.table data

Read and write data

  • Fast delimited file reader: fread()

  • Fast file writer: fwrite()

# read
mydata <- fread(here("data/mydata.csv")) # much faster than read_csv(), etc.
# wrangle
mydata_cleaned <- ...
# write
fwrite(mydata, here("data/mydata_cleaned.csv")) # create an updated CSV
10 / 34

data.table

simple data operations

filter, arrange, select, rename

11 / 34

Reminder: data.table syntax

DT[i, j, by]


12 / 34

Reminder: data.table syntax

DT[i, j, by]


approximate dplyr translation:

data %>%
select %>%
filter | arrange %>%
group_by %>%
mutate | summarize


]

12 / 34

One quick aside here -- you'll see that all data.table code is contained within brackets; this allows data.table to know exactly what data is required for each computation, which usually makes it much faster than its dplyr equivalent

DT[i, j, by]

Subset rows in i

13 / 34

DT[i, j, by]

Subset rows in i

By logical condition

13 / 34

DT[i, j, by]

Subset rows in i

By logical condition

mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
13 / 34

DT[i, j, by]

Subset rows in i

By logical condition

mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

By row index

13 / 34

DT[i, j, by]

Subset rows in i

By logical condition

mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

By row index

mtcars[1:3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
13 / 34

DT[i, j, by]

Reorder rows in i using order()

14 / 34

DT[i, j, by]

Reorder rows in i using order()

Ascending order (default)

14 / 34

DT[i, j, by]

Reorder rows in i using order()

Ascending order (default)

ordered <- mtcars[order(hp)]
head(ordered)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 3: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 5: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 6: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
14 / 34

DT[i, j, by]

Reorder rows in i using order()

Descending order using hyphen (-)

15 / 34

DT[i, j, by]

Reorder rows in i using order()

Descending order using hyphen (-)

ordered_desc <- mtcars[order(-hp)]
head(ordered_desc)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
## 2: 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
## 3: 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## 4: 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## 5: 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## 6: 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
15 / 34

DT[i, j,by]

Select columns in j

  • Wrap desired columns in list() or .() (an alias for list()); the latter is more common
16 / 34

DT[i, j,by]

Select columns in j

  • Wrap desired columns in list() or .() (an alias for list()); the latter is more common
selection <- mtcars[, .(mpg, hp)] # note the comma before j
head(selection)
## mpg hp
## 1: 21.0 110
## 2: 21.0 110
## 3: 22.8 93
## 4: 21.4 110
## 5: 18.7 175
## 6: 18.1 105
16 / 34

DT[i, j,by]

Rename columns in j

  • As in dplyr, can select and rename at the same time
17 / 34

DT[i, j,by]

Rename columns in j

  • As in dplyr, can select and rename at the same time
renamed <- mtcars[, .(miles_per_gal = mpg, horsepower = hp)]
head(renamed)
## miles_per_gal horsepower
## 1: 21.0 110
## 2: 21.0 110
## 3: 22.8 93
## 4: 21.4 110
## 5: 18.7 175
## 6: 18.1 105
17 / 34

data.table

simple data manipulation

mutate, summarize without groups

18 / 34

DT[i, j,by]

j is powerful!


    19 / 34

    DT[i, j,by]

    j is powerful!


    • Contains capabilities of dplyr's select, mutate, summarize, and more
    19 / 34

    DT[i, j,by]

    j is powerful!


    • Contains capabilities of dplyr's select, mutate, summarize, and more
    • When j returns a list, each list-element will become a column in the resulting data.table
    19 / 34

    DT[i, j,by]

    j is powerful!


    • Contains capabilities of dplyr's select, mutate, summarize, and more
    • When j returns a list, each list-element will become a column in the resulting data.table
    • When j contains an expression, it will be evaluated
    19 / 34

    When j returns a list, each list element will be returned as a column - allows operations similar to dplyr across, but much more flexible

    Computation within j: operations on columns

    20 / 34

    Computation within j: operations on columns

    What is the mean weight of the cars in the mtcars dataset?

    20 / 34

    Computation within j: operations on columns

    What is the mean weight of the cars in the mtcars dataset?

    mtcars[, mean(wt)]
    ## [1] 3.21725
    20 / 34

    Computation within j: operations on columns

    What is the mean weight of the cars in the mtcars dataset?

    mtcars[, mean(wt)]
    ## [1] 3.21725

    How many cars in the mtcars dataset have a 4-cylinder engine?

    20 / 34

    Computation within j: operations on columns

    What is the mean weight of the cars in the mtcars dataset?

    mtcars[, mean(wt)]
    ## [1] 3.21725

    How many cars in the mtcars dataset have a 4-cylinder engine?

    mtcars[, sum(cyl == 4)]
    ## [1] 11
    20 / 34

    Computation within j: operations on columns

    What is the mean weight of the cars in the mtcars dataset?

    mtcars[, mean(wt)]
    ## [1] 3.21725

    How many cars in the mtcars dataset have a 4-cylinder engine?

    mtcars[, sum(cyl == 4)]
    ## [1] 11

    Remember: j can accept column names and expressions!

    20 / 34

    DT[i, j, by]

    Combining i and j

    21 / 34

    DT[i, j, by]

    Combining i and j

    Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.

    21 / 34

    DT[i, j, by]

    Combining i and j

    Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.

    First, get the mean mpg for cars with 4 cylinders:

    21 / 34

    DT[i, j, by]

    Combining i and j

    Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.

    First, get the mean mpg for cars with 4 cylinders:

    mtcars[cyl == 4, mean(mpg)]
    ## [1] 26.66364
    21 / 34

    DT[i, j, by]

    Combining i and j

    Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.

    First, get the mean mpg for cars with 4 cylinders:

    mtcars[cyl == 4, mean(mpg)]
    ## [1] 26.66364

    Then the mean mpg for cars with more than 4:

    21 / 34

    DT[i, j, by]

    Combining i and j

    Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.

    First, get the mean mpg for cars with 4 cylinders:

    mtcars[cyl == 4, mean(mpg)]
    ## [1] 26.66364

    Then the mean mpg for cars with more than 4:

    mtcars[cyl > 4, mean(mpg)]
    ## [1] 16.64762
    21 / 34

    DT[i, j, by]

    Combining i and j

    Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.

    First, get the mean mpg for cars with 4 cylinders:

    mtcars[cyl == 4, mean(mpg)]
    ## [1] 26.66364

    Then the mean mpg for cars with more than 4:

    mtcars[cyl > 4, mean(mpg)]
    ## [1] 16.64762

    Next: i and j become even more flexible with by

    21 / 34

    data.table

    complex data manipulation

    grouped mutate and summarize

    22 / 34

    DT[i, j, by]

    by: group by 1+ variables

    23 / 34

    DT[i, j, by]

    by: group by 1+ variables

    What are the median quarter-mile race times for 4- and 5-gear cars in mtcars?

    23 / 34

    DT[i, j, by]

    by: group by 1+ variables

    What are the median quarter-mile race times for 4- and 5-gear cars in mtcars?

    mtcars[gear > 3, .(median_qtr_mile = median(qsec)), gear]
    ## gear median_qtr_mile
    ## 1: 4 18.755
    ## 2: 5 15.500
    23 / 34

    DT[i, j, by]

    by: group by 1+ variables

    What are the median quarter-mile race times for 4- and 5-gear cars in mtcars?

    mtcars[gear > 3, .(median_qtr_mile = median(qsec)), gear]
    ## gear median_qtr_mile
    ## 1: 4 18.755
    ## 2: 5 15.500

    Here, we:

    • Subset to cars with gear > 3 in i
    • Calculated median_qtr_mile using an expression in j
    • Grouped by gear in by
    23 / 34

    Introducing .N

    24 / 34

    Introducing .N

    • Special data.table variable .N: number of observations in the current group
    24 / 34

    Introducing .N

    • Special data.table variable .N: number of observations in the current group
    • Similar use to dplyr::count() or base::nrow()
    24 / 34

    Introducing .N

    • Special data.table variable .N: number of observations in the current group
    • Similar use to dplyr::count() or base::nrow()
    • Example:
    • How many cars are automatic, and how many are manual, in the mtcars dataset?

    24 / 34

    Introducing .N

    • Special data.table variable .N: number of observations in the current group
    • Similar use to dplyr::count() or base::nrow()
    • Example:
    • How many cars are automatic, and how many are manual, in the mtcars dataset?

      mtcars[, .N, by = am]
      ## am N
      ## 1: 1 13
      ## 2: 0 19
    24 / 34

    More examples with by

    25 / 34

    More examples with by

    What's the fastest quarter-mile time by number of cylinders?

    25 / 34

    More examples with by

    What's the fastest quarter-mile time by number of cylinders?

    mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
    ## cyl fastest_time
    ## 1: 4 16.7
    ## 2: 6 15.5
    ## 3: 8 14.5
    25 / 34

    More examples with by

    What's the fastest quarter-mile time by number of cylinders?

    mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
    ## cyl fastest_time
    ## 1: 4 16.7
    ## 2: 6 15.5
    ## 3: 8 14.5

    What's the highest horsepower in automatic versus manual cars?

    25 / 34

    More examples with by

    What's the fastest quarter-mile time by number of cylinders?

    mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
    ## cyl fastest_time
    ## 1: 4 16.7
    ## 2: 6 15.5
    ## 3: 8 14.5

    What's the highest horsepower in automatic versus manual cars?

    mtcars[, (max_hp = max(hp)), am]
    ## am V1
    ## 1: 1 335
    ## 2: 0 245
    25 / 34

    Chaining data.table expressions (like %>%)


    26 / 34

    Chaining data.table expressions (like %>%)


    What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?

    26 / 34

    Chaining data.table expressions (like %>%)


    What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?

    mtcars[, .(mpg, weight_cat = fifelse(wt >= 2, "heavy", "light")) # fifelse = "fast if-else"
    ][, .(mean_mpg = mean(mpg)), by = weight_cat]
    ## weight_cat mean_mpg
    ## 1: heavy 18.60357
    ## 2: light 30.50000
    26 / 34

    Chaining data.table expressions (like %>%)


    What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?

    mtcars[, .(mpg, weight_cat = fifelse(wt >= 2, "heavy", "light")) # fifelse = "fast if-else"
    ][, .(mean_mpg = mean(mpg)), by = weight_cat]
    ## weight_cat mean_mpg
    ## 1: heavy 18.60357
    ## 2: light 30.50000

    Here, we:

    • Selected mpg and the newly-created weight_cat in our first j statement
    • Calculated mean_mpg, grouped by weight_cat, using our second j statement and by
    26 / 34

    Chaining data.table expressions (like %>%)


    27 / 34

    Chaining data.table expressions (like %>%)


    Complex operations can have many chains - organize vertically

    27 / 34

    Chaining data.table expressions (like %>%)


    Complex operations can have many chains - organize vertically

    DT[ ...
    ][ ...
    ][ ...
    ][ ...
    ][ ...
    ][ ... ]
    27 / 34

    dtplyr

    quick overview

    28 / 34

    What is dtplyr?


    29 / 34

    What is dtplyr?


    • data.table backend (fast) + dplyr code (familiar)
    29 / 34

    What is dtplyr?


    • data.table backend (fast) + dplyr code (familiar)
    • Translates dplyr code to data.table code for us!
    29 / 34

    What is dtplyr?


    • data.table backend (fast) + dplyr code (familiar)
    • Translates dplyr code to data.table code for us!
    • So...why learn data.table in the first place?
    29 / 34

    Why learn data.table in the first place?

    • dplyr can't do everything that data.table can, so there will be missing features
    • still some sacrifice in speed for certain features of dplyr, mainly dplyr's copy-on-modify behavior

    Using dtplyr

    30 / 34

    Using dtplyr

    • We'll need to load dplyr, dtplyr, and data.table
    30 / 34

    Using dtplyr

    • We'll need to load dplyr, dtplyr, and data.table
    library(dplyr)
    library(dtplyr)
    library(data.table)
    30 / 34

    Using dtplyr

    31 / 34

    Using dtplyr

    • Then, we'll write our dplyr code, adding two elements:
    31 / 34

    Using dtplyr

    • Then, we'll write our dplyr code, adding two elements:

      • Apply lazy_dt() to input data
    31 / 34

    Using dtplyr

    • Then, we'll write our dplyr code, adding two elements:

      • Apply lazy_dt() to input data
      • Use collect(), as_tibble(), or as.data.frame() to collect results
    31 / 34

    Using dtplyr

    • Then, we'll write our dplyr code, adding two elements:

      • Apply lazy_dt() to input data
      • Use collect(), as_tibble(), or as.data.frame() to collect results
    palmerpenguins::penguins %>% # penguins dataset is tibble by default
    lazy_dt() %>%
    group_by(species) %>%
    summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE)) %>%
    collect()
    ## # A tibble: 3 x 2
    ## species mean_bill_length
    ## <fct> <dbl>
    ## 1 Adelie 38.8
    ## 2 Chinstrap 48.8
    ## 3 Gentoo 47.5
    31 / 34

    dtyplr: usefulness


    32 / 34

    dtyplr: usefulness


    • If you don't want to learn data.table but would like performance gain with dplyr
    32 / 34

    dtyplr: usefulness


    • If you don't want to learn data.table but would like performance gain with dplyr

    • If you do want to learn data.table, helpful to see dtyplr's translations of your dplyr code

    32 / 34

    dtyplr: usefulness


    • If you don't want to learn data.table but would like performance gain with dplyr

    • If you do want to learn data.table, helpful to see dtyplr's translations of your dplyr code

    • Drawback: potentially limited to simpler analytic procedures

    32 / 34

    Things I didn't cover here


    33 / 34

    Things I didn't cover here


    • The secret ingredient of data.table: modification in place
    33 / 34

    Things I didn't cover here


    • The secret ingredient of data.table: modification in place

    • Advanced data.table reference semantics using := and set* functions

    33 / 34

    Things I didn't cover here


    • The secret ingredient of data.table: modification in place

    • Advanced data.table reference semantics using := and set* functions

    • .SD and .SDcols: similar use-cases as dplyr::across()

    33 / 34

    Things I didn't cover here


    • The secret ingredient of data.table: modification in place

    • Advanced data.table reference semantics using := and set* functions

    • .SD and .SDcols: similar use-cases as dplyr::across()

    • Things data.table can do that dplyr and dtplyr may not be able to

    33 / 34

    questions?

    34 / 34

    data.table

    introduction

    2 / 34
    Paused

    Help

    Keyboard shortcuts

    , , Pg Up, k Go to previous slide
    , , Pg Dn, Space, j Go to next slide
    Home Go to first slide
    End Go to last slide
    Number + Return Go to specific slide
    b / m / f Toggle blackout / mirrored / fullscreen mode
    c Clone slideshow
    p Toggle presenter mode
    t Restart the presentation timer
    ?, h Toggle this help
    oTile View: Overview of Slides
    Esc Back to slideshow