Tidyverse (dplyr, tidyr)

2019. 3. 17. 20:33 · Analysis R

library(tidyverse)

1.1 Data Loading

# library("readr")
raw <- read.csv("../Data/climate.csv")
raw1 <- read_csv("../Data/climate.csv")

## Parsed with column specification:
## cols(
##   Source = col_integer(),
##   Year = col_integer(),
##   `Anomaly 1y` = col_double(),
##   `Anomaly 5y` = col_double(),
##   `Anomaly 10y` = col_double(),
##   `Unc 10y` = col_double()
## )

raw2 <- read_csv("../Data/climate.csv", 
                col_types = 
                  cols(
                    Source = col_integer(),
                    Year = col_integer(),
                    `Anomaly 1y` = col_double(),
                    `Anomaly 5y` = col_double(),
                    `Anomaly 10y` = col_double(),
                    `Unc 10y` = col_double() ) )
head(raw)

##   Source Year Anomaly.1y Anomaly.5y Anomaly.10y Unc.10y
## 1      1 1901      0.015      0.010      -0.162   0.109
## 2      1 1902      0.028     -0.017      -0.177   0.108
## 3      1 1903      0.049     -0.040      -0.199   0.104
## 4      1 1904      0.068     -0.040      -0.223   0.105
## 5      1 1905      0.128     -0.032      -0.241   0.107
## 6      1 1906      0.158     -0.022      -0.294   0.106

all(raw == raw1)

## [1] TRUE

all(raw == raw2)

## [1] TRUE

all(raw1 == raw2)

## [1] TRUE
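The three objects compare equal value by value, but their column names differ: read.csv() passes names through make.names() by default (so `Anomaly 1y` becomes Anomaly.1y, as in the head(raw) output above), while read_csv() keeps them as written. A minimal sketch of that difference, assuming a small temporary file as a stand-in for climate.csv (the file contents and object names here are hypothetical):

```r
library(readr)

# Hypothetical two-row stand-in for ../Data/climate.csv
path <- tempfile(fileext = ".csv")
writeLines(c("Source,Year,Anomaly 1y",
             "1,1901,0.015",
             "1,1902,0.028"), path)

base_df  <- read.csv(path)                     # check.names = TRUE by default
readr_df <- suppressMessages(read_csv(path))   # names kept verbatim

names(base_df)    # syntactic names, spaces replaced by dots
names(readr_df)   # original names, spaces preserved
```

With read_csv(), a name containing a space has to be quoted with backticks, which is why the col_types specification above writes `Anomaly 1y`.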

1.2 Reshaping Data with the tidyr package

1.2.1 gather()

Objective : Reshaping wide format to long format

gather Function : gather(data, key, value, …)

# library(tidyr)
long_raw <- gather(raw, type, temp, -Source, -Year)
head(long_raw)

##   Source Year       type  temp
## 1      1 1901 Anomaly.1y 0.015
## 2      1 1902 Anomaly.1y 0.028
## 3      1 1903 Anomaly.1y 0.049
## 4      1 1904 Anomaly.1y 0.068
## 5      1 1905 Anomaly.1y 0.128
## 6      1 1906 Anomaly.1y 0.158

long_raw <- gather(raw,type,temp,3:6)
head(long_raw)

##   Source Year       type  temp
## 1      1 1901 Anomaly.1y 0.015
## 2      1 1902 Anomaly.1y 0.028
## 3      1 1903 Anomaly.1y 0.049
## 4      1 1904 Anomaly.1y 0.068
## 5      1 1905 Anomaly.1y 0.128
## 6      1 1906 Anomaly.1y 0.158
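Since tidyr 1.0.0, gather() is superseded by pivot_longer(); gather() still works, but new code usually uses the newer function. A sketch of the equivalent call, assuming a small stand-in data frame built from the first rows printed above (the object name demo is hypothetical):

```r
library(tidyr)

# Stand-in for `raw`, using the first three rows shown by head(raw)
demo <- data.frame(Source = 1L,
                   Year = 1901:1903,
                   Anomaly.1y = c(0.015, 0.028, 0.049),
                   Anomaly.5y = c(0.010, -0.017, -0.040))

# Same idea as gather(demo, type, temp, -Source, -Year)
long_demo <- pivot_longer(demo,
                          cols = -c(Source, Year),
                          names_to = "type",
                          values_to = "temp")
long_demo
```

One visible difference: gather() stacks the result column by column, while pivot_longer() keeps the values from each original row together.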

1.2.2 spread()

Objective : Reshaping long format to wide format

spread Function : spread(data, key, value)

wide_raw <- spread(long_raw, type, temp)
head(wide_raw, 10)

##    Source Year Anomaly.10y Anomaly.1y Anomaly.5y Unc.10y
## 1       1 1901      -0.162      0.015      0.010   0.109
## 2       1 1902      -0.177      0.028     -0.017   0.108
## 3       1 1903      -0.199      0.049     -0.040   0.104
## 4       1 1904      -0.223      0.068     -0.040   0.105
## 5       1 1905      -0.241      0.128     -0.032   0.107
## 6       1 1906      -0.294      0.158     -0.022   0.106
## 7       1 1907      -0.312      0.167      0.012   0.105
## 8       1 1908      -0.328      0.193      0.007   0.103
## 9       1 1909      -0.281      0.186      0.002   0.101
## 10      1 1910      -0.247      0.217      0.002   0.099
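Likewise, spread() is superseded by pivot_wider() in tidyr 1.0.0 and later. A sketch on a hypothetical long-format stand-in (values taken from the outputs above):

```r
library(tidyr)

# Stand-in for `long_raw`: two years, two measurement types
long_demo <- data.frame(Source = 1L,
                        Year = rep(1901:1902, times = 2),
                        type = rep(c("Anomaly.1y", "Anomaly.5y"), each = 2),
                        temp = c(0.015, 0.028, 0.010, -0.017))

# Same idea as spread(long_demo, type, temp)
wide_demo <- pivot_wider(long_demo, names_from = type, values_from = temp)
wide_demo
```

spread() orders the new columns alphabetically (which is why Anomaly.10y comes first in the output above), whereas pivot_wider() orders them by first appearance in the data.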

1.2.3 separate() and unite()

sep_long_raw <- separate(long_raw , type, c("type_head", "type_year"))
head(sep_long_raw)

##   Source Year type_head type_year  temp
## 1      1 1901   Anomaly        1y 0.015
## 2      1 1902   Anomaly        1y 0.028
## 3      1 1903   Anomaly        1y 0.049
## 4      1 1904   Anomaly        1y 0.068
## 5      1 1905   Anomaly        1y 0.128
## 6      1 1906   Anomaly        1y 0.158
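By default separate() splits on any run of non-alphanumeric characters, which is why Anomaly.1y breaks at the dot without a sep argument. A sketch with the separator spelled out, on a hypothetical stand-in data frame:

```r
library(tidyr)

# Stand-in with two of the `type` values seen above
demo <- data.frame(type = c("Anomaly.1y", "Unc.10y"),
                   temp = c(0.015, 0.109))

# sep is a regular expression, so the literal dot must be escaped
sep_demo <- separate(demo, type, c("type_head", "type_year"), sep = "\\.")
sep_demo
```

Stating sep explicitly is safer when values may contain other punctuation that should not trigger a split.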

long_raw <- unite(sep_long_raw, type, type_head, type_year, sep=".")
head(long_raw)

##   Source Year       type  temp
## 1      1 1901 Anomaly.1y 0.015
## 2      1 1902 Anomaly.1y 0.028
## 3      1 1903 Anomaly.1y 0.049
## 4      1 1904 Anomaly.1y 0.068
## 5      1 1905 Anomaly.1y 0.128
## 6      1 1906 Anomaly.1y 0.158

long_raw <- unite(sep_long_raw, type, type_head, type_year)
head(long_raw)

##   Source Year       type  temp
## 1      1 1901 Anomaly_1y 0.015
## 2      1 1902 Anomaly_1y 0.028
## 3      1 1903 Anomaly_1y 0.049
## 4      1 1904 Anomaly_1y 0.068
## 5      1 1905 Anomaly_1y 0.128
## 6      1 1906 Anomaly_1y 0.158

wide_raw <- spread(long_raw, type, temp)
head(wide_raw, 10)

##    Source Year Anomaly_10y Anomaly_1y Anomaly_5y Unc_10y
## 1       1 1901      -0.162      0.015      0.010   0.109
## 2       1 1902      -0.177      0.028     -0.017   0.108
## 3       1 1903      -0.199      0.049     -0.040   0.104
## 4       1 1904      -0.223      0.068     -0.040   0.105
## 5       1 1905      -0.241      0.128     -0.032   0.107
## 6       1 1906      -0.294      0.158     -0.022   0.106
## 7       1 1907      -0.312      0.167      0.012   0.105
## 8       1 1908      -0.328      0.193      0.007   0.103
## 9       1 1909      -0.281      0.186      0.002   0.101
## 10      1 1910      -0.247      0.217      0.002   0.099

1.2.4 Summary of tidyr package

1.3 Manipulating Data with the dplyr package

1.3.1 select()

Objective : Reduce dataframe size to only desired variables for current task

select Function : select(data, …)

data : data frame
... : call variables by name or by function

# library(dplyr)
sub_raw <- select(wide_raw, Source, Anomaly_1y:Anomaly_10y)
head(sub_raw)

##   Source Anomaly_1y Anomaly_10y
## 1      1      0.015      -0.162
## 2      1      0.028      -0.177
## 3      1      0.049      -0.199
## 4      1      0.068      -0.223
## 5      1      0.128      -0.241
## 6      1      0.158      -0.294

sub_raw <- select(wide_raw,1,3:5 )
head(sub_raw)

##   Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1      1      -0.162      0.015      0.010
## 2      1      -0.177      0.028     -0.017
## 3      1      -0.199      0.049     -0.040
## 4      1      -0.223      0.068     -0.040
## 5      1      -0.241      0.128     -0.032
## 6      1      -0.294      0.158     -0.022

sub_raw <- select(wide_raw,Source,starts_with("Anomaly"))
head(sub_raw)

##   Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1      1      -0.162      0.015      0.010
## 2      1      -0.177      0.028     -0.017
## 3      1      -0.199      0.049     -0.040
## 4      1      -0.223      0.068     -0.040
## 5      1      -0.241      0.128     -0.032
## 6      1      -0.294      0.158     -0.022

1.3.2 filter()

Objective : Reduce rows/observations with matching conditions

filter Function : filter(data, …)

data : data frame
... : conditions to be met

fil_raw <- filter(sub_raw, Source == 1)
head(fil_raw)

##   Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1      1      -0.162      0.015      0.010
## 2      1      -0.177      0.028     -0.017
## 3      1      -0.199      0.049     -0.040
## 4      1      -0.223      0.068     -0.040
## 5      1      -0.241      0.128     -0.032
## 6      1      -0.294      0.158     -0.022

tail(fil_raw)

##    Source Anomaly_10y Anomaly_1y Anomaly_5y
## 27      1      -0.020     -0.026      0.167
## 28      1      -0.018     -0.014      0.193
## 29      1      -0.026     -0.047      0.186
## 30      1      -0.014     -0.035      0.217
## 31      1      -0.047     -0.017      0.235
## 32      1      -0.035      0.020      0.270

tmp <- filter(sub_raw, Anomaly_10y > 0)
head(tmp)

##   Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1      2       0.020      0.063      0.344
## 2      2       0.053      0.048      0.004
## 3      2       0.063      0.073     -0.028
## 4      2       0.048      0.113     -0.006
## 5      2       0.073      0.113     -0.024
## 6      2       0.113     -0.268     -0.041

tail(tmp)

##    Source Anomaly_10y Anomaly_1y Anomaly_5y
## 53      3       0.734     -0.182      0.167
## 54      3       0.748     -0.193      0.193
## 55      3       0.793     -0.167      0.186
## 56      3       0.856     -0.128      0.217
## 57      3       0.869     -0.075      0.235
## 58      3       0.884     -0.064      0.270
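filter() accepts several conditions at once: arguments separated by commas are combined with AND, and | expresses OR. A small sketch on a stand-in data frame (values copied from the outputs above, object names hypothetical):

```r
library(dplyr)

demo <- data.frame(Source = c(1L, 1L, 2L, 3L),
                   Anomaly_10y = c(-0.162, -0.020, 0.053, 0.734))

# Both conditions must hold (comma means AND)
both <- filter(demo, Source != 1, Anomaly_10y > 0.1)

# Either condition may hold
either <- filter(demo, Source == 2 | Anomaly_10y > 0.5)

both
either
```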

1.3.3 group_by()

Objective : Group data by categorical variables

group_by Function : group_by(data, …)

data : data frame
... : variables to group_by

group_raw <- group_by(sub_raw, Source)
head(group_raw)

## # A tibble: 6 x 4
## # Groups:   Source [1]
##   Source Anomaly_10y Anomaly_1y Anomaly_5y
##    <int>       <dbl>      <dbl>      <dbl>
## 1      1      -0.162      0.015      0.01 
## 2      1      -0.177      0.028     -0.017
## 3      1      -0.199      0.049     -0.04 
## 4      1      -0.223      0.068     -0.04 
## 5      1      -0.241      0.128     -0.032
## 6      1      -0.294      0.158     -0.022

1.3.4 summarise()

Objective : Perform summary statistics on variables

summarise Function : summarise(data, …)

data : data frame
... : Name-value pairs of summary functions like min(), mean(), max() etc.

summarise(sub_raw, Mean5y = mean(Anomaly_5y))

##       Mean5y
## 1 0.01148077

summarise(sub_raw, Mean = mean(Anomaly_5y), Min = min(Anomaly_5y), Median = median(Anomaly_5y), 
    Max = max(Anomaly_5y), SD = sd(Anomaly_5y), Var = var(Anomaly_5y), N = n())

##         Mean    Min Median   Max        SD        Var   N
## 1 0.01148077 -0.328  0.004 0.352 0.1632643 0.02665522 104

1.3.5 %>% Operator

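The pipe operator, provided by the magrittr package. It passes the result on its left as the first argument of the function on its right.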

head(sub_raw)

##   Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1      1      -0.162      0.015      0.010
## 2      1      -0.177      0.028     -0.017
## 3      1      -0.199      0.049     -0.040
## 4      1      -0.223      0.068     -0.040
## 5      1      -0.241      0.128     -0.032
## 6      1      -0.294      0.158     -0.022

sub_raw %>% group_by(Source) %>% summarise(Mean1y = mean(Anomaly_1y), Mean5y = mean(Anomaly_5y))

## # A tibble: 3 x 3
##   Source  Mean1y  Mean5y
##    <int>   <dbl>   <dbl>
## 1      1  0.117   0.0480
## 2      2  0.0664 -0.0241
## 3      3 -0.217   0.0256

sub_raw %>% gather(Anomaly, tmp, 2:4) %>% filter(Source == 2) %>% summarise(Mean = mean(tmp), 
    SD = sd(tmp))

##         Mean        SD
## 1 0.02335606 0.1859076
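x %>% f(y) is evaluated as f(x, y), so a pipeline is just a nested call written left to right. A sketch of the equivalence, using the built-in mtcars data (chosen here only so the example runs without climate.csv):

```r
library(dplyr)

# Nested form: read from the inside out
nested <- summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg))

# Piped form: the same steps, read left to right
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

all.equal(nested, piped)
```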

1.3.6 arrange()

Objective : Sort rows by the values of one or more variables (ascending by default; wrap a variable in desc() for descending order)

arrange Function : arrange(data, …)

data : data frame
... : Variable(s) to order

sub_raw %>% group_by(Source) %>% summarise(Mean1y = mean(Anomaly_1y), Mean5y = mean(Anomaly_5y)) %>% 
    arrange(Mean1y)

## # A tibble: 3 x 3
##   Source  Mean1y  Mean5y
##    <int>   <dbl>   <dbl>
## 1      3 -0.217   0.0256
## 2      2  0.0664 -0.0241
## 3      1  0.117   0.0480

1.3.7 join()

Objective : Join two datasets together

join Function : inner_join(x, y, by = NULL), left_join(x, y, by = NULL), semi_join(x, y, by = NULL), anti_join(x, y, by = NULL)

x,y : data frames to join
by : a character vector of variables to join by

x <- data.frame(name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"), 
    instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"))

y <- data.frame(name = c("John", "Paul", "George", "Ringo", "Brian"), band = c("TRUE", 
    "TRUE", "TRUE", "TRUE", "FALSE"))

x

##     name instrument
## 1   John     guitar
## 2   Paul       bass
## 3 George     guitar
## 4  Ringo      drums
## 5 Stuart       bass
## 6   Pete      drums

y

##     name  band
## 1   John  TRUE
## 2   Paul  TRUE
## 3 George  TRUE
## 4  Ringo  TRUE
## 5  Brian FALSE

inner_join() : Include only rows in both x and y that have a matching value

inner_join(x, y)

## Joining, by = "name"

##     name instrument band
## 1   John     guitar TRUE
## 2   Paul       bass TRUE
## 3 George     guitar TRUE
## 4  Ringo      drums TRUE

left_join() : Include all of x, and matching rows of y

left_join(x, y)

## Joining, by = "name"

##     name instrument band
## 1   John     guitar TRUE
## 2   Paul       bass TRUE
## 3 George     guitar TRUE
## 4  Ringo      drums TRUE
## 5 Stuart       bass <NA>
## 6   Pete      drums <NA>

semi_join() : Include rows of x that match y but only keep the columns from x

semi_join(x, y)

## Joining, by = "name"

##     name instrument
## 1   John     guitar
## 2   Paul       bass
## 3 George     guitar
## 4  Ringo      drums

anti_join() : Include rows of x that have no match in y, keeping only the columns from x (the opposite of semi_join)

anti_join(x, y)

## Joining, by = "name"

##     name instrument
## 1 Stuart       bass
## 2   Pete      drums
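The calls above leave by unset, so dplyr joins on every column name the two tables share and prints a "Joining, by = ..." message. A sketch that names the key explicitly, on smaller hypothetical stand-ins for x and y:

```r
library(dplyr)

members <- data.frame(name = c("John", "Paul", "Stuart"),
                      instrument = c("guitar", "bass", "bass"))
status  <- data.frame(name = c("John", "Paul", "Brian"),
                      band = c("TRUE", "TRUE", "FALSE"))

# Naming the key documents the intent and silences the message
matched   <- inner_join(members, status, by = "name")
unmatched <- anti_join(members, status, by = "name")

matched
unmatched
```

Setting by explicitly also guards against accidental joins on extra columns that happen to share a name.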

1.3.8 mutate()

Objective : Creates new variables

mutate Function : mutate(data, …)

data : data frame
... : Expression(s)

head(wide_raw)

##   Source Year Anomaly_10y Anomaly_1y Anomaly_5y Unc_10y
## 1      1 1901      -0.162      0.015      0.010   0.109
## 2      1 1902      -0.177      0.028     -0.017   0.108
## 3      1 1903      -0.199      0.049     -0.040   0.104
## 4      1 1904      -0.223      0.068     -0.040   0.105
## 5      1 1905      -0.241      0.128     -0.032   0.107
## 6      1 1906      -0.294      0.158     -0.022   0.106

mu_raw <- mutate(wide_raw, Adj = Anomaly_1y/Anomaly_5y)
head(mu_raw)

##   Source Year Anomaly_10y Anomaly_1y Anomaly_5y Unc_10y       Adj
## 1      1 1901      -0.162      0.015      0.010   0.109  1.500000
## 2      1 1902      -0.177      0.028     -0.017   0.108 -1.647059
## 3      1 1903      -0.199      0.049     -0.040   0.104 -1.225000
## 4      1 1904      -0.223      0.068     -0.040   0.105 -1.700000
## 5      1 1905      -0.241      0.128     -0.032   0.107 -4.000000
## 6      1 1906      -0.294      0.158     -0.022   0.106 -7.181818

rank_raw <- wide_raw %>% mutate(Adj = Anomaly_1y/Anomaly_5y) %>% arrange(desc(Adj)) %>% 
    mutate(Rank = row_number())

head(rank_raw)

##   Source Year Anomaly_10y Anomaly_1y Anomaly_5y Unc_10y       Adj Rank
## 1      1 1910      -0.247      0.217      0.002   0.099 108.50000    1
## 2      1 1909      -0.281      0.186      0.002   0.101  93.00000    2
## 3      1 1914      -0.257      0.344      0.004   0.097  86.00000    3
## 4      1 1919      -0.182      0.433      0.010   0.097  43.30000    4
## 5      3 1986       0.352     -0.257     -0.006   0.012  42.83333    5
## 6      1 1908      -0.328      0.193      0.007   0.103  27.57143    6

Exercise 0. In the mtcars data, reshape cyl, vs, am, gear, and carb into key and value columns. (gather)

##    mpg disp  hp drat    wt  qsec key value
## 1 21.0  160 110 3.90 2.620 16.46 cyl     6
## 2 21.0  160 110 3.90 2.875 17.02 cyl     6
## 3 22.8  108  93 3.85 2.320 18.61 cyl     4
## 4 21.4  258 110 3.08 3.215 19.44 cyl     6
## 5 18.7  360 175 3.15 3.440 17.02 cyl     8
## 6 18.1  225 105 2.76 3.460 20.22 cyl     6
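The post shows only the expected output for this exercise. One possible solution, offered as a sketch rather than the author's own code (the object name long_cars is hypothetical):

```r
library(tidyr)

# Stack the five listed columns into key/value pairs;
# the remaining columns are repeated for each key
long_cars <- gather(mtcars, key, value, cyl, vs, am, gear, carb)
head(long_cars)
```

Because gather() drops row names, the car names from mtcars do not appear in the result, which matches the output above.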

Exercise 1. Select the cyl and mpg variables, then compute the mean, variance, and median of mpg for each cyl.

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

## # A tibble: 3 x 4
##     cyl mean_  var_ median_
##   <dbl> <dbl> <dbl>   <dbl>
## 1     4  26.7 20.3     26  
## 2     6  19.7  2.11    19.7
## 3     8  15.1  6.55    15.2
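One possible solution, as a sketch (the column names mean_, var_, and median_ are taken from the output above; the object name by_cyl is hypothetical):

```r
library(dplyr)

by_cyl <- mtcars %>%
  select(cyl, mpg) %>%           # keep only the two variables needed
  group_by(cyl) %>%              # one group per cylinder count
  summarise(mean_ = mean(mpg),
            var_ = var(mpg),
            median_ = median(mpg))
by_cyl
```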

Exercise 2. Compute the mean, variance, and median of mpg for each combination of cyl and gear.

## # A tibble: 8 x 5
## # Groups:   cyl [?]
##     cyl  gear mean_  var_ median_
##   <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1     4     3  21.5 NA       21.5
## 2     4     4  26.9 23.1     25.8
## 3     4     5  28.2  9.68    28.2
## 4     6     3  19.8  5.44    19.8
## 5     6     4  19.8  2.41    20.1
## 6     6     5  19.7 NA       19.7
## 7     8     3  15.0  7.70    15.2
## 8     8     5  15.4  0.32    15.4
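One possible solution, as a sketch (the object name by_cyl_gear is hypothetical):

```r
library(dplyr)

by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%        # one group per cyl/gear combination
  summarise(mean_ = mean(mpg),
            var_ = var(mpg),
            median_ = median(mpg))
by_cyl_gear
```

The NA values in var_ come from groups containing a single car, for which the sample variance is undefined.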

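Exercise 3. Create a variable that separates rows where am and vs are both 1 from all other rows, then compute the mean, variance, and median of hp for each value of that variable.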

## # A tibble: 2 x 4
##   new_var mean_  var_ median_
##     <dbl> <dbl> <dbl>   <dbl>
## 1       0 165.  4294.     175
## 2       1  80.6  583.      66
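One possible solution, as a sketch (new_var follows the column name in the output above; the object name by_flag is hypothetical):

```r
library(dplyr)

by_flag <- mtcars %>%
  # 1 when both am and vs equal 1, otherwise 0
  mutate(new_var = ifelse(am == 1 & vs == 1, 1, 0)) %>%
  group_by(new_var) %>%
  summarise(mean_ = mean(hp),
            var_ = var(hp),
            median_ = median(hp))
by_flag
```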

References

dplyr 50 examples
