
Data Manipulation

After the data is cleaned, the dplyr package can be used for data manipulation. If you are familiar with SQL, then you may find many similar functionalities when using dpylr functions. Throughout this page examples will be shown using the mtcars dataset, which has 32 observations and 11 variables. Though not a large dataset, it comes pre-loaded in R and it's contents can be used to nicely illustrate how dpylr functions can manipulate data.
tbl_df
When a datasets contains many rows and/or columns, it can be difficult to understand what you are actually working with. The tbl_df()
function turns a dataframe structure into a tbl structure (short for "tibble"). Using a tibble structure can make it much easier to navigate, view, and manipulate the contents of a dataset because every column is known to be associated with a variable and every row is known to be associated with an observation.
Example:
library(dplyr) # load package
data("mtcars") # load mtcars dataset using data() function
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fletwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
tbl_df(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 2 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 4 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 5 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 6 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## # with 26 more rows
glimpse()
As mentioned above, tbl stands for "tibble." Besides the data structure, there is also a package called tibble, which contains a glimpse()
function. This is another way to look at a dataset without printing out all the contents.
Example:
glimpse(mtcars)
## Observations: 32
## Variables: 11
## $ mpg 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
6 Key Data Manipulation Functions in dplyr
The six functions are:
select()
mutate()
filter()
arrange()
group_by()
summarize()
select()
and mutate()
are used on variables (columns) within a dataset. filter()
and arrange()
are used on the observations (rows) in a dataset. arrange()
and summarize()
are used across a number of rows and columns.
Note: For all six of these functions, the data must be converted to a tibble format.
In addition to these functions, we will discuss a short-form way to code in R. Piping (i.e., using the symbolic notation %>%) is a coding tool that does not require you to explicitly reference the object upon which you are wanting to perform some action. This tool will be further discussed at the end of the page.
select()
The select()
function is used to extract only certain columns from a table, just like in SQL.
Example:
select(mtcars, cyl, mpg)
## cyl mpg
## Mazda RX4 6 21.0
## Mazda RX4 Wag 6 21.0
## Datsun 710 4 22.8
## Hornet 4 Drive 6 21.4
## Hornet Sportabout 8 18.7
## Valiant 6 18.1
## Duster 360 8 14.3
## Merc 240D 4 24.4
## Merc 230 4 22.8
## Merc 280 6 19.2
## Merc 280C 6 17.8
## Merc 450SE 8 16.4
## Merc 450SL 8 17.3
## Merc 450SLC 8 15.2
## Cadillac Fleetwood 8 10.4
## Lincoln Continental 8 10.4
## Chrysler Imperial 8 14.7
## Fiat 128 4 32.4
## Honda Civic 4 30.4
## Toyota Corolla 4 33.9
## Toyota Corona 4 21.5
## Dodge Challenger 8 15.5
## AMC Javelin 8 15.2
## Camaro Z28 8 13.3
## Pontiac Firebird 8 19.2
## Fiat X1-9 4 27.3
## Porsche 914-2 4 26.0
## Lotus Europa 4 30.4
## Ford Pantera L 8 15.8
## Ferrari Dino 6 19.7
## Maserati Bora 8 15.0
## Volvo 142E 4 21.4
The variables cyl and mpg are selected as the only columns from the mtcars dataset.
Note: The dash (-) sign can be used with select()
to take out columns in a range that are not needed.
Below are arguments that can be used with select()
.
starts_with()
Selects the columns that start with a given string.
Example:
select(mtcars, starts_with("c"))
## cyl carb
## Mazda RX4 6 4
## Mazda RX4 Wag 6 4
## Datsun 710 4 1
## Hornet 4 Drive 6 1
## Hornet Sportabout 8 2
## Valiant 6 1
## Duster 360 8 4
## Merc 240D 4 2
## Merc 230 4 2
## Merc 280 6 4
## Merc 280C 6 4
## Merc 450SE 8 3
## Merc 450SL 8 3
## Merc 450SLC 8 3
## Cadillac Fleetwood 8 4
## Lincoln Continental 8 4
## Chrysler Imperial 8 4
## Fiat 128 4 1
## Honda Civic 4 2
## Toyota Corolla 4 1
## Toyota Corona 4 1
## Dodge Challenger 8 2
## AMC Javelin 8 2
## Camaro Z28 8 4
## Pontiac Firebird 8 2
## Fiat X1-9 4 1
## Porsche 914-2 4 2
## Lotus Europa 4 2
## Ford Pantera L 8 4
## Ferrari Dino 6 6
## Maserati Bora 8 8
## Volvo 142E 4 2
ends_with()
Selects the columns that end with a given string.
Example:
select(mtcars, ends_with("p"))
## disp hp
## Mazda RX4 160.0 110
## Mazda RX4 Wag 160.0 110
## Datsun 710 108.0 93
## Hornet 4 Drive 258.0 110
## Hornet Sportabout 360.0 175
## Valiant 225.0 105
## Duster 360 360.0 245
## Merc 240D 146.7 62
## Merc 230 140.8 95
## Merc 280 167.6 123
## Merc 280C 167.6 123
## Merc 450SE 275.8 180
## Merc 450SL 275.8 180
## Merc 450SLC 275.8 180
## Cadillac Fleetwood 472.0 205
## Lincoln Continental 460.0 215
## Chrysler Imperial 440.0 230
## Fiat 128 78.7 66
## Honda Civic 75.7 52
## Toyota Corolla 71.1 65
## Toyota Corona 120.1 97
## Dodge Challenger 318.0 150
## AMC Javelin 304.0 150
## Camaro Z28 350.0 245
## Pontiac Firebird 400.0 175
## Fiat X1-9 79.0 66
## Porsche 914-2 120.3 91
## Lotus Europa 95.1 113
## Ford Pantera L 351.0 264
## Ferrari Dino 145.0 175
## Maserati Bora 301.0 335
## Volvo 142E 121.0 109
contains()
Selects the columns that contain a given string.
Example:
select(mtcars, contains("s"))
## disp qsec vs
## Mazda RX4 160.0 16.46 0
## Mazda RX4 Wag 160.0 17.02 0
## Datsun 710 108.0 18.61 1
## Hornet 4 Drive 258.0 19.44 1
## Hornet Sportabout 360.0 17.02 0
## Valiant 225.0 20.22 1
## Duster 360 360.0 15.84 0
## Merc 240D 146.7 20.00 1
## Merc 230 140.8 22.90 1
## Merc 280 167.6 18.30 1
## Merc 280C 167.6 18.90 1
## Merc 450SE 275.8 17.40 0
## Merc 450SL 275.8 17.60 0
## Merc 450SLC 275.8 18.00 0
## Cadillac Fleetwood 472.0 17.98 0
## Lincoln Continental 460.0 17.82 0
## Chrysler Imperial 440.0 17.42 0
## Fiat 128 78.7 19.47 1
## Honda Civic 75.7 18.52 1
## Toyota Corolla 71.1 19.90 1
## Toyota Corona 120.1 20.01 1
## Dodge Challenger 318.0 16.87 0
## AMC Javelin 304.0 17.30 0
## Camaro Z28 350.0 15.41 0
## Pontiac Firebird 400.0 17.05 0
## Fiat X1-9 79.0 18.90 1
## Porsche 914-2 120.3 16.70 0
## Lotus Europa 95.1 16.90 1
## Ford Pantera L 351.0 14.50 0
## Ferrari Dino 145.0 15.50 0
## Maserati Bora 301.0 14.60 0
## Volvo 142E 121.0 18.60 1
matches()
matches()
is very similar to contains()
; however, whereas contains()
searches for a given string that can also contain other characters before and after the given string, matches()
only looks for an exact match.
Example:
select(mtcars, matches("vs"))
## vs
## Mazda RX4 0
## Mazda RX4 Wag 0
## Datsun 710 1
## Hornet 4 Drive 1
## Hornet Sportabout 0
## Valiant 1
## Duster 360 0
## Merc 240D 1
## Merc 230 1
## Merc 280 1
## Merc 280C 1
## Merc 450SE 0
## Merc 450SL 0
## Merc 450SLC 0
## Cadillac Fleetwood 0
## Lincoln Continental 0
## Chrysler Imperial 0
## Fiat 128 1
## Honda Civic 1
## Toyota Corolla 1
## Toyota Corona 1
## Dodge Challenger 0
## AMC Javelin 0
## Camaro Z28 0
## Pontiac Firebird 0
## Fiat X1-9 1
## Porsche 914-2 0
## Lotus Europa 1
## Ford Pantera L 0
## Ferrari Dino 0
## Maserati Bora 0
## Volvo 142E 1
num_range()
Selects the entries in a column within the given range.
Example:
newCol = mtcars
names(newCol) = c('x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11')
select(newCol, num_range("x", 2:5))
## x2 x3 x4 x5
## Mazda RX4 6 160.0 110 3.90
## Mazda RX4 Wag 6 160.0 110 3.90
## Datsun 710 4 108.0 93 3.85
## Hornet 4 Drive 6 258.0 110 3.08
## Hornet Sportabout 8 360.0 175 3.15
## Valiant 6 225.0 105 2.76
## Duster 360 8 360.0 245 3.21
## Merc 240D 4 146.7 62 3.69
## Merc 230 4 140.8 95 3.92
## Merc 280 6 167.6 123 3.92
## Merc 280C 6 167.6 123 3.92
## Merc 450SE 8 275.8 180 3.07
## Merc 450SL 8 275.8 180 3.07
## Merc 450SLC 8 275.8 180 3.07
## Cadillac Fleetwood 8 472.0 205 2.93
## Lincoln Continental 8 460.0 215 3.00
## Chrysler Imperial 8 440.0 230 3.23
## Fiat 128 4 78.7 66 4.08
## Honda Civic 4 75.7 52 4.93
## Toyota Corolla 4 71.1 65 4.22
## Toyota Corona 4 120.1 97 3.70
## Dodge Challenger 8 318.0 150 2.76
## AMC Javelin 8 304.0 150 3.15
## Camaro Z28 8 350.0 245 3.73
## Pontiac Firebird 8 400.0 175 3.08
## Fiat X1-9 4 79.0 66 4.08
## Porsche 914-2 4 120.3 91 4.43
## Lotus Europa 4 95.1 113 3.77
## Ford Pantera L 8 351.0 264 4.22
## Ferrari Dino 6 145.0 175 3.62
## Maserati Bora 8 301.0 335 3.54
## Volvo 142E 4 121.0 109 4.11
one_of()
This function displays only unique values.
Example:
select(mtcars, one_of(c('cyl', 'hp', 'drat')))
## cyl hp drat
## Mazda RX4 6 110 3.90
## Mazda RX4 Wag 6 110 3.90
## Datsun 710 4 93 3.85
## Hornet 4 Drive 6 110 3.08
## Hornet Sportabout 8 175 3.15
## Valiant 6 105 2.76
## Duster 360 8 245 3.21
## Merc 240D 4 62 3.69
## Merc 230 4 95 3.92
## Merc 280 6 123 3.92
## Merc 280C 6 123 3.92
## Merc 450SE 8 180 3.07
## Merc 450SL 8 180 3.07
## Merc 450SLC 8 180 3.07
## Cadillac Fleetwood 8 205 2.93
## Lincoln Continental 8 215 3.00
## Chrysler Imperial 8 230 3.23
## Fiat 128 4 66 4.08
## Honda Civic 4 52 4.93
## Toyota Corolla 4 65 4.22
## Toyota Corona 4 97 3.70
## Dodge Challenger 8 150 2.76
## AMC Javelin 8 150 3.15
## Camaro Z28 8 245 3.73
## Pontiac Firebird 8 175 3.08
## Fiat X1-9 4 66 4.08
## Porsche 914-2 4 91 4.43
## Lotus Europa 4 113 3.77
## Ford Pantera L 8 264 4.22
## Ferrari Dino 6 175 3.62
## Maserati Bora 8 335 3.54
## Volvo 142E 4 109 4.11
mutate()
Adds a new column using an equation or logical test that incorporates data from other columns.
Example:
mutate(mtcars, hpCyl = hp + cyl)
## mpg cyl disp hp drat wt qsec vs am gear carb hpCyl
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 116
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 116
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 97
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 116
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 183
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 111
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 253
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 66
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 99
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 129
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 129
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 188
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 188
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 188
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 213
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 223
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 238
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 70
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 56
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 69
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 101
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 158
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 158
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 253
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 183
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 70
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 95
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 117
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 272
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 181
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 343
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 113
A new column named hpCyl will be made that adds hp and cyl.
filter()
Filters out entries in a column based on a logical criterion; this chooses rows the same way that select()
chooses columns.
Example:
filter(mtcars, cyl<8)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 6 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 7 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 8 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 9 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 10 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 11 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 12 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 13 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 14 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 15 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 16 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 17 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 18 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
arrange()
arrange()
sorts a dataset by the variable specified. If one or more extra variables are specified, then they are used for sub-sorting (i.e. tie breakers), with the variables entered first taking priority over those that come later.
Example:
arrange(mtcars, cyl, hp)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 5 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 6 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 7 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 8 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 9 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 10 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## 11 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 12 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 13 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 14 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 15 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 16 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 17 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 18 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 19 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 20 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 21 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 22 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 23 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 24 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 25 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 26 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 27 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 28 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 29 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 30 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 31 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 32 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Notice how cyl is in sorted in order, and how hp is sorted within each group of cyl. This will play into our next function.
group_by()
Similar to group by in SQL, dplyr's group_by()
rearranges the rows to be grouped by one or more variables provided, giving priority to those entered first. It is useful for rearranging the entire dataset by a variable without changing any values. It is primarily used in combination with our next function, summarize()
.
summarize()
Get a summary value for a column. Note that summarise()
and summarize()
do the same thing.
Example:
summarise(mtcars, mean(disp))
## mean(disp)
## 1 230.7219
summarize(mtcars, mean(disp))
## mean(disp)
## 1 230.7219
Now, look at what happens when we combine summarize()
with group_by()
:
group_by(mtcars, cyl, hp) %>% summarise()
## Source: local data frame [23 x 2]
## Groups: cyl [?]
##
## cyl hp
##
## 1 4 52
## 2 4 62
## 3 4 65
## 4 4 66
## 5 4 91
## 6 4 93
## 7 4 95
## 8 4 97
## 9 4 109
## 10 4 113
## # ... with 13 more rows
Note: For a useful dplyr reference, see the RStudio cheatsheet at Help -> Cheatsheets -> Data Transformation with dplyr
Piping
In the previous example shown above, noticed we used a weird series of symbols, %>%, in between functions instead of nesting them like View(summarise(group_by(mtcars, cyl, hp)))
. This was an example of piping. Piping is used to connect lines of code to allow a series of commands to be executed more quickly and in a less memory-intensive manner. It can also be much easier to read, especially when dealing with more complex queries where each function can have multiple arguments, like those below. The lines are connected so that the result of the prior statement is the input into the next statement. This saves time and computer resources that would otherwise be spent defining many variables separately and require R to store a great deal of unnecessary information.
Example:
library(dplyr)
mtcars %>%
filter(hp < 500) %>%
mutate(hp1_1 = hp / 5) %>%
summarise(mean(hp))
## mean(hp)
## 1 146.6875
As shown above, the piping syntax can also be incorporated into a mixture of base R and dplyr functions. In the following example, the last instance of either transmission type for each size engine is created using the negation of the base R command duplicated()
in combination with the dplyr commands group_by()
and filter()
.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
filter(!duplicated(am, fromLast = T))
## Source: local data frame [6 x 11]
## Groups: cyl [3]
##
## mpg cyl disp hp drat wt qsec vs am gear carb
##
## 1 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 2 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 4 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 5 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 6 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The tidyverse Package
The tidyverse package is a collection of packages that work together for data management. All the packages in tidyverse use the same kinds of objects to manipulate data.
The packages in tidyverse are:
- ggplot2
- tibble
- tidyr
- readr
- purr
- dplyr
To look more into the tidyverse package, visit tidyverse.org.