Will Simmons
Data Science I
December 8, 2021
data.table
Source: Database-like ops benchmark. H2O.ai (https://h2oai.github.io/db-benchmark/).
Performing grouped operations or joins on large data, data.table can outperform dplyr by a factor of 30 in terms of speed
data.table
syntaxdata.table
syntaxTake data.table
DT,
subset rows using i
and manipulate columns with j,
grouped according to by.
Source: Data Transformation with data.table :: CHEAT SHEET. Erik Petrovski. Link.
If we want to think about it in terms of dplyr verbs: Take data.table DT, filter and arrange using i, and select, mutate, or summarize columns with j, grouped by by.
We'll get briefly into more complex things we've learned with dplyr, too, like piping multiple commands and joining data.
data.table
data.table
data.table
from scratchdata.table
data.table
from scratchdata.frame
or tibble
, we can manually build a data.table
data.table
data.table
from scratchdata.frame
or tibble
, we can manually build a data.table
test <- data.table( a = 1:3, b = letters[1:3], # letters and LETTERS are predefined vectors in R c = LETTERS[1:3])test
## a b c## 1: 1 a A## 2: 2 b B## 3: 3 c C
data.table
data.table
from an existing objectdata.table
data.table
from an existing objectWe can also coerce an existing object into a data.table
using setDT()
mtcars
is loaded as a data.frame
by default; let's make it usable with data.table
data.table
data.table
from an existing objectWe can also coerce an existing object into a data.table
using setDT()
mtcars
is loaded as a data.frame
by default; let's make it usable with data.table
mtcars <- mtcars # if loaded before data.table, loaded as data.framelibrary(data.table)mtcars <- setDT(mtcars) # use setDT() to coerce to data.tableclass(mtcars)
## [1] "data.table" "data.frame"
data.table
data.table
from an existing objectWe can also coerce an existing object into a data.table
using setDT()
mtcars
is loaded as a data.frame
by default; let's make it usable with data.table
mtcars <- mtcars # if loaded before data.table, loaded as data.framelibrary(data.table)mtcars <- setDT(mtcars) # use setDT() to coerce to data.tableclass(mtcars)
## [1] "data.table" "data.frame"
Note: data.table
operations won't work on non-data.table
objects.
data.table
datadata.table
datanrow
> 100, calling a data.table prints head(5)
and tail(5)
data.table
datanrow
> 100, calling a data.table prints head(5)
and tail(5)
as.data.table(quakes) # can also coerce to DT with as.data.table()
## lat long depth mag stations## 1: -20.42 181.62 562 4.8 41## 2: -20.62 181.03 650 4.2 15## 3: -26.00 184.10 42 5.4 43## 4: -17.97 181.66 626 4.1 19## 5: -20.42 181.96 649 4.0 11## --- ## 996: -25.93 179.54 470 4.4 22## 997: -12.28 167.06 248 4.7 35## 998: -20.13 184.20 244 4.5 34## 999: -17.40 187.80 40 4.5 14## 1000: -21.59 170.56 165 6.0 119
Since mtcars only contains 32 rows, data.table would print the whole thing
data.table
datadata.table
dataFast delimited file reader: fread()
Fast file writer: fwrite()
data.table
dataFast delimited file reader: fread()
Fast file writer: fwrite()
# readmydata <- fread(here("data/mydata.csv")) # much faster than read_csv(), etc.# wranglemydata_cleaned <- ...# writefwrite(mydata, here("data/mydata_cleaned.csv")) # create an updated CSV
data.table
filter, arrange, select, rename
data %>%
select %>%
filter | arrange %>%
group_by %>%
mutate | summarize
]
One quick aside here -- you'll see that all data.table code is contained within brackets; this allows data.table to know exactly what data is required for each computation, which usually makes it much faster than its dplyr equivalent
DT[i, j, by]
DT[i, j, by]
By logical condition
DT[i, j, by]
By logical condition
mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
DT[i, j, by]
By logical condition
mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
By row index
DT[i, j, by]
By logical condition
mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
By row index
mtcars[1:3]
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
DT[i, j, by]
DT[i, j, by]
Ascending order (default)
DT[i, j, by]
Ascending order (default)
ordered <- mtcars[order(hp)]head(ordered)
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2## 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2## 3: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1## 4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1## 5: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1## 6: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
DT[i, j, by]
Descending order using hyphen (-
)
DT[i, j, by]
Descending order using hyphen (-
)
ordered_desc <- mtcars[order(-hp)]head(ordered_desc)
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8## 2: 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4## 3: 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4## 4: 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4## 5: 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4## 6: 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
DT[i, j,by]
DT[i, j,by]
selection <- mtcars[, .(mpg, hp)] # note the comma before jhead(selection)
## mpg hp## 1: 21.0 110## 2: 21.0 110## 3: 22.8 93## 4: 21.4 110## 5: 18.7 175## 6: 18.1 105
DT[i, j,by]
DT[i, j,by]
renamed <- mtcars[, .(miles_per_gal = mpg, horsepower = hp)]head(renamed)
## miles_per_gal horsepower## 1: 21.0 110## 2: 21.0 110## 3: 22.8 93## 4: 21.4 110## 5: 18.7 175## 6: 18.1 105
data.table
mutate, summarize without groups
DT[i, j,by]
DT[i, j,by]
DT[i, j,by]
DT[i, j,by]
When j returns a list, each list element will be returned as a column - allows operations similar to dplyr across, but much more flexible
What is the mean weight of the cars in the mtcars
dataset?
What is the mean weight of the cars in the mtcars
dataset?
mtcars[, mean(wt)]
## [1] 3.21725
What is the mean weight of the cars in the mtcars
dataset?
mtcars[, mean(wt)]
## [1] 3.21725
How many cars in the mtcars
dataset have a 4-cylinder engine?
What is the mean weight of the cars in the mtcars
dataset?
mtcars[, mean(wt)]
## [1] 3.21725
How many cars in the mtcars
dataset have a 4-cylinder engine?
mtcars[, sum(cyl == 4)]
## [1] 11
What is the mean weight of the cars in the mtcars
dataset?
mtcars[, mean(wt)]
## [1] 3.21725
How many cars in the mtcars
dataset have a 4-cylinder engine?
mtcars[, sum(cyl == 4)]
## [1] 11
DT[i, j, by]
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
mtcars[cyl == 4, mean(mpg)]
## [1] 26.66364
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
mtcars[cyl == 4, mean(mpg)]
## [1] 26.66364
Then the mean mpg for cars with more than 4:
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
mtcars[cyl == 4, mean(mpg)]
## [1] 26.66364
Then the mean mpg for cars with more than 4:
mtcars[cyl > 4, mean(mpg)]
## [1] 16.64762
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
mtcars[cyl == 4, mean(mpg)]
## [1] 26.66364
Then the mean mpg for cars with more than 4:
mtcars[cyl > 4, mean(mpg)]
## [1] 16.64762
data.table
grouped mutate and summarize
DT[i, j, by]
DT[i, j, by]
What are the median quarter-mile race times for 4- and 5-gear cars in mtcars
?
DT[i, j, by]
What are the median quarter-mile race times for 4- and 5-gear cars in mtcars
?
mtcars[gear > 3, .(median_qtr_mile = median(qsec)), gear]
## gear median_qtr_mile## 1: 4 18.755## 2: 5 15.500
DT[i, j, by]
What are the median quarter-mile race times for 4- and 5-gear cars in mtcars
?
mtcars[gear > 3, .(median_qtr_mile = median(qsec)), gear]
## gear median_qtr_mile## 1: 4 18.755## 2: 5 15.500
Here, we:
How many cars are automatic, and how many are manual, in the mtcars dataset?
How many cars are automatic, and how many are manual, in the mtcars dataset?
mtcars[, .N, by = am]
## am N## 1: 1 13## 2: 0 19
What's the fastest quarter-mile time by number of cylinders?
What's the fastest quarter-mile time by number of cylinders?
mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
## cyl fastest_time## 1: 4 16.7## 2: 6 15.5## 3: 8 14.5
What's the fastest quarter-mile time by number of cylinders?
mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
## cyl fastest_time## 1: 4 16.7## 2: 6 15.5## 3: 8 14.5
What's the highest horsepower in automatic versus manual cars?
What's the fastest quarter-mile time by number of cylinders?
mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
## cyl fastest_time## 1: 4 16.7## 2: 6 15.5## 3: 8 14.5
What's the highest horsepower in automatic versus manual cars?
mtcars[, (max_hp = max(hp)), am]
## am V1## 1: 1 335## 2: 0 245
What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?
What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?
mtcars[, .(mpg, weight_cat = fifelse(wt >= 2, "heavy", "light")) # fifelse = "fast if-else" ][, .(mean_mpg = mean(mpg)), by = weight_cat]
## weight_cat mean_mpg## 1: heavy 18.60357## 2: light 30.50000
What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?
mtcars[, .(mpg, weight_cat = fifelse(wt >= 2, "heavy", "light")) # fifelse = "fast if-else" ][, .(mean_mpg = mean(mpg)), by = weight_cat]
## weight_cat mean_mpg## 1: heavy 18.60357## 2: light 30.50000
Here, we:
Complex operations can have many chains - organize vertically
Complex operations can have many chains - organize vertically
DT[ ... ][ ... ][ ... ][ ... ][ ... ][ ... ]
dtplyr
Why learn data.table in the first place?
library(dplyr)library(dtplyr)library(data.table)
Then, we'll write our dplyr code, adding two elements:
Then, we'll write our dplyr code, adding two elements:
Then, we'll write our dplyr code, adding two elements:
palmerpenguins::penguins %>% # penguins dataset is tibble by default lazy_dt() %>% group_by(species) %>% summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE)) %>% collect()
## # A tibble: 3 x 2## species mean_bill_length## <fct> <dbl>## 1 Adelie 38.8## 2 Chinstrap 48.8## 3 Gentoo 47.5
If you don't want to learn data.table but would like performance gain with dplyr
If you do want to learn data.table, helpful to see dtyplr's translations of your dplyr code
If you don't want to learn data.table but would like performance gain with dplyr
If you do want to learn data.table, helpful to see dtyplr's translations of your dplyr code
Drawback: potentially limited to simpler analytic procedures
The secret ingredient of data.table: modification in place
Advanced data.table reference semantics using := and set* functions
The secret ingredient of data.table: modification in place
Advanced data.table reference semantics using := and set* functions
.SD and .SDcols: similar use-cases as dplyr::across()
The secret ingredient of data.table: modification in place
Advanced data.table reference semantics using := and set* functions
.SD and .SDcols: similar use-cases as dplyr::across()
Things data.table can do that dplyr and dtplyr may not be able to
data.table
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
Will Simmons
Data Science I
December 8, 2021
data.table
Source: Database-like ops benchmark. H2O.ai (https://h2oai.github.io/db-benchmark/).
Performing grouped operations or joins on large data, data.table can outperform dplyr by a factor of 30 in terms of speed
data.table
syntaxdata.table
syntaxTake data.table
DT,
subset rows using i
and manipulate columns with j,
grouped according to by.
Source: Data Transformation with data.table :: CHEAT SHEET. Erik Petrovski. Link.
If we want to think about it in terms of dplyr verbs: Take data.table DT, filter and arrange using i, and select, mutate, or summarize columns with j, grouped by by.
We'll get briefly into more complex things we've learned with dplyr, too, like piping multiple commands and joining data.
data.table
data.table
data.table
from scratchdata.table
data.table
from scratchdata.frame
or tibble
, we can manually build a data.table
data.table
data.table
from scratchdata.frame
or tibble
, we can manually build a data.table
test <- data.table( a = 1:3, b = letters[1:3], # letters and LETTERS are predefined vectors in R c = LETTERS[1:3])test
## a b c## 1: 1 a A## 2: 2 b B## 3: 3 c C
data.table
data.table
from an existing objectdata.table
data.table
from an existing objectWe can also coerce an existing object into a data.table
using setDT()
mtcars
is loaded as a data.frame
by default; let's make it usable with data.table
data.table
data.table
from an existing objectWe can also coerce an existing object into a data.table
using setDT()
mtcars
is loaded as a data.frame
by default; let's make it usable with data.table
mtcars <- mtcars # if loaded before data.table, loaded as data.framelibrary(data.table)mtcars <- setDT(mtcars) # use setDT() to coerce to data.tableclass(mtcars)
## [1] "data.table" "data.frame"
data.table
data.table
from an existing objectWe can also coerce an existing object into a data.table
using setDT()
mtcars
is loaded as a data.frame
by default; let's make it usable with data.table
mtcars <- mtcars # if loaded before data.table, loaded as data.framelibrary(data.table)mtcars <- setDT(mtcars) # use setDT() to coerce to data.tableclass(mtcars)
## [1] "data.table" "data.frame"
Note: data.table
operations won't work on non-data.table
objects.
data.table
datadata.table
datanrow
> 100, calling a data.table prints head(5)
and tail(5)
data.table
datanrow
> 100, calling a data.table prints head(5)
and tail(5)
as.data.table(quakes) # can also coerce to DT with as.data.table()
## lat long depth mag stations## 1: -20.42 181.62 562 4.8 41## 2: -20.62 181.03 650 4.2 15## 3: -26.00 184.10 42 5.4 43## 4: -17.97 181.66 626 4.1 19## 5: -20.42 181.96 649 4.0 11## --- ## 996: -25.93 179.54 470 4.4 22## 997: -12.28 167.06 248 4.7 35## 998: -20.13 184.20 244 4.5 34## 999: -17.40 187.80 40 4.5 14## 1000: -21.59 170.56 165 6.0 119
Since mtcars only contains 32 rows, data.table would print the whole thing
data.table
datadata.table
dataFast delimited file reader: fread()
Fast file writer: fwrite()
data.table
dataFast delimited file reader: fread()
Fast file writer: fwrite()
# readmydata <- fread(here("data/mydata.csv")) # much faster than read_csv(), etc.# wranglemydata_cleaned <- ...# writefwrite(mydata, here("data/mydata_cleaned.csv")) # create an updated CSV
data.table
filter, arrange, select, rename
data %>%
select %>%
filter | arrange %>%
group_by %>%
mutate | summarize
]
One quick aside here -- you'll see that all data.table code is contained within brackets; this allows data.table to know exactly what data is required for each computation, which usually makes it much faster than its dplyr equivalent
DT[i, j, by]
DT[i, j, by]
By logical condition
DT[i, j, by]
By logical condition
mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
DT[i, j, by]
By logical condition
mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
By row index
DT[i, j, by]
By logical condition
mtcars[mpg >= 31] # as with dplyr, no need to use $ to refer to variables
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1## 2: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
By row index
mtcars[1:3]
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
DT[i, j, by]
DT[i, j, by]
Ascending order (default)
DT[i, j, by]
Ascending order (default)
ordered <- mtcars[order(hp)]head(ordered)
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2## 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2## 3: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1## 4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1## 5: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1## 6: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
DT[i, j, by]
Descending order using hyphen (-
)
DT[i, j, by]
Descending order using hyphen (-
)
ordered_desc <- mtcars[order(-hp)]head(ordered_desc)
## mpg cyl disp hp drat wt qsec vs am gear carb## 1: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8## 2: 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4## 3: 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4## 4: 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4## 5: 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4## 6: 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
DT[i, j,by]
DT[i, j,by]
selection <- mtcars[, .(mpg, hp)] # note the comma before jhead(selection)
## mpg hp## 1: 21.0 110## 2: 21.0 110## 3: 22.8 93## 4: 21.4 110## 5: 18.7 175## 6: 18.1 105
DT[i, j,by]
DT[i, j,by]
renamed <- mtcars[, .(miles_per_gal = mpg, horsepower = hp)]head(renamed)
## miles_per_gal horsepower## 1: 21.0 110## 2: 21.0 110## 3: 22.8 93## 4: 21.4 110## 5: 18.7 175## 6: 18.1 105
data.table
mutate, summarize without groups
DT[i, j,by]
DT[i, j,by]
DT[i, j,by]
DT[i, j,by]
When j returns a list, each list element will be returned as a column - allows operations similar to dplyr across, but much more flexible
What is the mean weight of the cars in the mtcars
dataset?
What is the mean weight of the cars in the mtcars
dataset?
mtcars[, mean(wt)]
## [1] 3.21725
What is the mean weight of the cars in the mtcars
dataset?
mtcars[, mean(wt)]
## [1] 3.21725
How many cars in the mtcars
dataset have a 4-cylinder engine?
What is the mean weight of the cars in the mtcars
dataset?
mtcars[, mean(wt)]
## [1] 3.21725
How many cars in the mtcars
dataset have a 4-cylinder engine?
mtcars[, sum(cyl == 4)]
## [1] 11
What is the mean weight of the cars in the mtcars
dataset?
mtcars[, mean(wt)]
## [1] 3.21725
How many cars in the mtcars
dataset have a 4-cylinder engine?
mtcars[, sum(cyl == 4)]
## [1] 11
DT[i, j, by]
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
mtcars[cyl == 4, mean(mpg)]
## [1] 26.66364
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
mtcars[cyl == 4, mean(mpg)]
## [1] 26.66364
Then the mean mpg for cars with more than 4:
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
mtcars[cyl == 4, mean(mpg)]
## [1] 26.66364
Then the mean mpg for cars with more than 4:
mtcars[cyl > 4, mean(mpg)]
## [1] 16.64762
DT[i, j, by]
Compare mean fuel efficiency (in miles per gallon) of cars with 4 cylinders versus those with more than 4 cylinders.
First, get the mean mpg for cars with 4 cylinders:
mtcars[cyl == 4, mean(mpg)]
## [1] 26.66364
Then the mean mpg for cars with more than 4:
mtcars[cyl > 4, mean(mpg)]
## [1] 16.64762
data.table
grouped mutate and summarize
DT[i, j, by]
DT[i, j, by]
What are the median quarter-mile race times for 4- and 5-gear cars in mtcars
?
DT[i, j, by]
What are the median quarter-mile race times for 4- and 5-gear cars in mtcars
?
mtcars[gear > 3, .(median_qtr_mile = median(qsec)), gear]
## gear median_qtr_mile## 1: 4 18.755## 2: 5 15.500
DT[i, j, by]
What are the median quarter-mile race times for 4- and 5-gear cars in mtcars
?
mtcars[gear > 3, .(median_qtr_mile = median(qsec)), gear]
## gear median_qtr_mile## 1: 4 18.755## 2: 5 15.500
Here, we:
How many cars are automatic, and how many are manual, in the mtcars dataset?
How many cars are automatic, and how many are manual, in the mtcars dataset?
mtcars[, .N, by = am]
## am N## 1: 1 13## 2: 0 19
What's the fastest quarter-mile time by number of cylinders?
What's the fastest quarter-mile time by number of cylinders?
mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
## cyl fastest_time## 1: 4 16.7## 2: 6 15.5## 3: 8 14.5
What's the fastest quarter-mile time by number of cylinders?
mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
## cyl fastest_time## 1: 4 16.7## 2: 6 15.5## 3: 8 14.5
What's the highest horsepower in automatic versus manual cars?
What's the fastest quarter-mile time by number of cylinders?
mtcars[, .(fastest_time = min(qsec)), keyby = cyl] # keyby (instead of by) sorts results
## cyl fastest_time## 1: 4 16.7## 2: 6 15.5## 3: 8 14.5
What's the highest horsepower in automatic versus manual cars?
mtcars[, (max_hp = max(hp)), am]
## am V1## 1: 1 335## 2: 0 245
What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?
What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?
mtcars[, .(mpg, weight_cat = fifelse(wt >= 2, "heavy", "light")) # fifelse = "fast if-else" ][, .(mean_mpg = mean(mpg)), by = weight_cat]
## weight_cat mean_mpg## 1: heavy 18.60357## 2: light 30.50000
What is the mean fuel efficiency (in miles per gallon) of heavy (>= 2000 lbs) versus light (< 2000 lbs) cars?
mtcars[, .(mpg, weight_cat = fifelse(wt >= 2, "heavy", "light")) # fifelse = "fast if-else" ][, .(mean_mpg = mean(mpg)), by = weight_cat]
## weight_cat mean_mpg## 1: heavy 18.60357## 2: light 30.50000
Here, we:
Complex operations can have many chains - organize vertically
Complex operations can have many chains - organize vertically
DT[ ... ][ ... ][ ... ][ ... ][ ... ][ ... ]
dtplyr
Why learn data.table in the first place?
library(dplyr)library(dtplyr)library(data.table)
Then, we'll write our dplyr code, adding two elements:
Then, we'll write our dplyr code, adding two elements:
Then, we'll write our dplyr code, adding two elements:
palmerpenguins::penguins %>% # penguins dataset is tibble by default lazy_dt() %>% group_by(species) %>% summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE)) %>% collect()
## # A tibble: 3 x 2## species mean_bill_length## <fct> <dbl>## 1 Adelie 38.8## 2 Chinstrap 48.8## 3 Gentoo 47.5
If you don't want to learn data.table but would like performance gain with dplyr
If you do want to learn data.table, helpful to see dtyplr's translations of your dplyr code
If you don't want to learn data.table but would like performance gain with dplyr
If you do want to learn data.table, helpful to see dtyplr's translations of your dplyr code
Drawback: potentially limited to simpler analytic procedures
The secret ingredient of data.table: modification in place
Advanced data.table reference semantics using := and set* functions
The secret ingredient of data.table: modification in place
Advanced data.table reference semantics using := and set* functions
.SD and .SDcols: similar use-cases as dplyr::across()
The secret ingredient of data.table: modification in place
Advanced data.table reference semantics using := and set* functions
.SD and .SDcols: similar use-cases as dplyr::across()
Things data.table can do that dplyr and dtplyr may not be able to