2019. 3. 17. 20:33ㆍ분석 R
library(tidyverse)
1.1 Data Loading
# library("readr")
raw <- read.csv("../Data/climate.csv")
raw1 <- read_csv("../Data/climate.csv")
## Parsed with column specification:
## cols(
## Source = col_integer(),
## Year = col_integer(),
## `Anomaly 1y` = col_double(),
## `Anomaly 5y` = col_double(),
## `Anomaly 10y` = col_double(),
## `Unc 10y` = col_double()
## )
raw2 <- read_csv("../Data/climate.csv",
col_types =
cols(
Source = col_integer(),
Year = col_integer(),
`Anomaly 1y` = col_double(),
`Anomaly 5y` = col_double(),
`Anomaly 10y` = col_double(),
`Unc 10y` = col_double() ) )
head(raw)
## Source Year Anomaly.1y Anomaly.5y Anomaly.10y Unc.10y
## 1 1 1901 0.015 0.010 -0.162 0.109
## 2 1 1902 0.028 -0.017 -0.177 0.108
## 3 1 1903 0.049 -0.040 -0.199 0.104
## 4 1 1904 0.068 -0.040 -0.223 0.105
## 5 1 1905 0.128 -0.032 -0.241 0.107
## 6 1 1906 0.158 -0.022 -0.294 0.106
all(raw == raw1)
## [1] TRUE
all(raw == raw2)
## [1] TRUE
all(raw1 == raw2)
## [1] TRUE
1.2 Reshaing Data by tidyr packge
1.2.1 gather()
Objective : Reshaping wide format to long format
gather
Function : gather(data, key, value, …)
# library(tidyr)
long_raw <- gather(raw, type, temp, -Source, -Year)
head(long_raw)
## Source Year type temp
## 1 1 1901 Anomaly.1y 0.015
## 2 1 1902 Anomaly.1y 0.028
## 3 1 1903 Anomaly.1y 0.049
## 4 1 1904 Anomaly.1y 0.068
## 5 1 1905 Anomaly.1y 0.128
## 6 1 1906 Anomaly.1y 0.158
long_raw <- gather(raw,type,temp,3:6)
head(long_raw)
## Source Year type temp
## 1 1 1901 Anomaly.1y 0.015
## 2 1 1902 Anomaly.1y 0.028
## 3 1 1903 Anomaly.1y 0.049
## 4 1 1904 Anomaly.1y 0.068
## 5 1 1905 Anomaly.1y 0.128
## 6 1 1906 Anomaly.1y 0.158
1.2.2 spread()
Objective : Reshaping long format to wide format
spread
Function : spread(data, key, value)
wide_raw <- spread(long_raw, type, temp)
head(wide_raw, 10)
## Source Year Anomaly.10y Anomaly.1y Anomaly.5y Unc.10y
## 1 1 1901 -0.162 0.015 0.010 0.109
## 2 1 1902 -0.177 0.028 -0.017 0.108
## 3 1 1903 -0.199 0.049 -0.040 0.104
## 4 1 1904 -0.223 0.068 -0.040 0.105
## 5 1 1905 -0.241 0.128 -0.032 0.107
## 6 1 1906 -0.294 0.158 -0.022 0.106
## 7 1 1907 -0.312 0.167 0.012 0.105
## 8 1 1908 -0.328 0.193 0.007 0.103
## 9 1 1909 -0.281 0.186 0.002 0.101
## 10 1 1910 -0.247 0.217 0.002 0.099
1.2.3 seperate() and unite()
sep_long_raw <- separate(long_raw , type, c("type_head", "type_year"))
head(sep_long_raw)
## Source Year type_head type_year temp
## 1 1 1901 Anomaly 1y 0.015
## 2 1 1902 Anomaly 1y 0.028
## 3 1 1903 Anomaly 1y 0.049
## 4 1 1904 Anomaly 1y 0.068
## 5 1 1905 Anomaly 1y 0.128
## 6 1 1906 Anomaly 1y 0.158
long_raw <- unite(sep_long_raw, type, type_head, type_year, sep=".")
head(long_raw)
## Source Year type temp
## 1 1 1901 Anomaly.1y 0.015
## 2 1 1902 Anomaly.1y 0.028
## 3 1 1903 Anomaly.1y 0.049
## 4 1 1904 Anomaly.1y 0.068
## 5 1 1905 Anomaly.1y 0.128
## 6 1 1906 Anomaly.1y 0.158
long_raw <- unite(sep_long_raw, type, type_head, type_year)
head(long_raw)
## Source Year type temp
## 1 1 1901 Anomaly_1y 0.015
## 2 1 1902 Anomaly_1y 0.028
## 3 1 1903 Anomaly_1y 0.049
## 4 1 1904 Anomaly_1y 0.068
## 5 1 1905 Anomaly_1y 0.128
## 6 1 1906 Anomaly_1y 0.158
wide_raw <- spread(long_raw, type, temp)
head(wide_raw, 10)
## Source Year Anomaly_10y Anomaly_1y Anomaly_5y Unc_10y
## 1 1 1901 -0.162 0.015 0.010 0.109
## 2 1 1902 -0.177 0.028 -0.017 0.108
## 3 1 1903 -0.199 0.049 -0.040 0.104
## 4 1 1904 -0.223 0.068 -0.040 0.105
## 5 1 1905 -0.241 0.128 -0.032 0.107
## 6 1 1906 -0.294 0.158 -0.022 0.106
## 7 1 1907 -0.312 0.167 0.012 0.105
## 8 1 1908 -0.328 0.193 0.007 0.103
## 9 1 1909 -0.281 0.186 0.002 0.101
## 10 1 1910 -0.247 0.217 0.002 0.099
1.2.4 Summary of tidyr package
1.3 Manupulating Data by dplyr packge
1.3.1 select()
Objective : Reduce dataframe size to only desired variables for current task
select
Function : select(data, …)
data
: data frame
...
: call variables by name or by function
# library(dplyr)
sub_raw <- select(wide_raw, Source, Anomaly_1y:Anomaly_10y)
head(sub_raw)
## Source Anomaly_1y Anomaly_10y
## 1 1 0.015 -0.162
## 2 1 0.028 -0.177
## 3 1 0.049 -0.199
## 4 1 0.068 -0.223
## 5 1 0.128 -0.241
## 6 1 0.158 -0.294
sub_raw <- select(wide_raw,1,3:5 )
head(sub_raw)
## Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1 1 -0.162 0.015 0.010
## 2 1 -0.177 0.028 -0.017
## 3 1 -0.199 0.049 -0.040
## 4 1 -0.223 0.068 -0.040
## 5 1 -0.241 0.128 -0.032
## 6 1 -0.294 0.158 -0.022
sub_raw <- select(wide_raw,Source,starts_with("Anomaly"))
head(sub_raw)
## Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1 1 -0.162 0.015 0.010
## 2 1 -0.177 0.028 -0.017
## 3 1 -0.199 0.049 -0.040
## 4 1 -0.223 0.068 -0.040
## 5 1 -0.241 0.128 -0.032
## 6 1 -0.294 0.158 -0.022
1.3.2 filter()
Objective : Reduce rows/observations with matching conditions
filter
Function : filter(data, …)
data
: data frame
...
: conditions to be met
fil_raw <- filter(sub_raw, Source == 1)
head(fil_raw)
## Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1 1 -0.162 0.015 0.010
## 2 1 -0.177 0.028 -0.017
## 3 1 -0.199 0.049 -0.040
## 4 1 -0.223 0.068 -0.040
## 5 1 -0.241 0.128 -0.032
## 6 1 -0.294 0.158 -0.022
tail(fil_raw)
## Source Anomaly_10y Anomaly_1y Anomaly_5y
## 27 1 -0.020 -0.026 0.167
## 28 1 -0.018 -0.014 0.193
## 29 1 -0.026 -0.047 0.186
## 30 1 -0.014 -0.035 0.217
## 31 1 -0.047 -0.017 0.235
## 32 1 -0.035 0.020 0.270
tmp <- filter(sub_raw, Anomaly_10y > 0)
head(tmp)
## Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1 2 0.020 0.063 0.344
## 2 2 0.053 0.048 0.004
## 3 2 0.063 0.073 -0.028
## 4 2 0.048 0.113 -0.006
## 5 2 0.073 0.113 -0.024
## 6 2 0.113 -0.268 -0.041
tail(tmp)
## Source Anomaly_10y Anomaly_1y Anomaly_5y
## 53 3 0.734 -0.182 0.167
## 54 3 0.748 -0.193 0.193
## 55 3 0.793 -0.167 0.186
## 56 3 0.856 -0.128 0.217
## 57 3 0.869 -0.075 0.235
## 58 3 0.884 -0.064 0.270
1.3.3 group_by()
Objective : Group data by categorical variables
group_by
Function : group_by(data, …)
data
: data frame
...
: variables to group_by
group_raw <- group_by(sub_raw, Source)
head(group_raw)
## # A tibble: 6 x 4
## # Groups: Source [1]
## Source Anomaly_10y Anomaly_1y Anomaly_5y
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.162 0.015 0.01
## 2 1 -0.177 0.028 -0.017
## 3 1 -0.199 0.049 -0.04
## 4 1 -0.223 0.068 -0.04
## 5 1 -0.241 0.128 -0.032
## 6 1 -0.294 0.158 -0.022
1.3.4 summarise()
Objective : Perform summary statistics on variables
summarise
Function : summarise(data, …)
data
: data frame
...
: Name-value pairs of summary functions like min(), mean(), max() etc.
summarise(sub_raw, Mean5y = mean(Anomaly_5y))
## Mean5y
## 1 0.01148077
summarise(sub_raw, Mean = mean(Anomaly_5y), Min = min(Anomaly_5y), Median = median(Anomaly_5y),
Max = max(Anomaly_5y), SD = sd(Anomaly_5y), Var = var(Anomaly_5y), N = n())
## Mean Min Median Max SD Var N
## 1 0.01148077 -0.328 0.004 0.352 0.1632643 0.02665522 104
1.3.5 %>% Operator
- Pipe line operator, named
magrittr
head(sub_raw)
## Source Anomaly_10y Anomaly_1y Anomaly_5y
## 1 1 -0.162 0.015 0.010
## 2 1 -0.177 0.028 -0.017
## 3 1 -0.199 0.049 -0.040
## 4 1 -0.223 0.068 -0.040
## 5 1 -0.241 0.128 -0.032
## 6 1 -0.294 0.158 -0.022
sub_raw %>% group_by(Source) %>% summarise(Mean1y = mean(Anomaly_1y), Mean5y = mean(Anomaly_5y))
## # A tibble: 3 x 3
## Source Mean1y Mean5y
## <int> <dbl> <dbl>
## 1 1 0.117 0.0480
## 2 2 0.0664 -0.0241
## 3 3 -0.217 0.0256
sub_raw %>% gather(Anomaly, tmp, 2:4) %>% filter(Source == 2) %>% summarise(Mean = mean(tmp),
SD = sd(tmp))
## Mean SD
## 1 0.02335606 0.1859076
1.3.6 arrange()
Objective : Order variable values
arrange
Function : arrange(data, …)
data
: data frame
...
: Variable(s) to order
sub_raw %>% group_by(Source) %>% summarise(Mean1y = mean(Anomaly_1y), Mean5y = mean(Anomaly_5y)) %>%
arrange(Mean1y)
## # A tibble: 3 x 3
## Source Mean1y Mean5y
## <int> <dbl> <dbl>
## 1 3 -0.217 0.0256
## 2 2 0.0664 -0.0241
## 3 1 0.117 0.0480
1.3.7 join()
Objective : Join two datasets together
join
Function : inner_join(x, y, by = NULL) left_join(x, y, by = NULL) semi_join(x, y, by = NULL) anti_join(x, y, by = NULL)
x,y
: data frames to join
by
: a character vector of variables to join by
x <- data.frame(name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"))
y <- data.frame(name = c("John", "Paul", "George", "Ringo", "Brian"), band = c("TRUE",
"TRUE", "TRUE", "TRUE", "FALSE"))
x
## name instrument
## 1 John guitar
## 2 Paul bass
## 3 George guitar
## 4 Ringo drums
## 5 Stuart bass
## 6 Pete drums
y
## name band
## 1 John TRUE
## 2 Paul TRUE
## 3 George TRUE
## 4 Ringo TRUE
## 5 Brian FALSE
inner_join() : Include only rows in both x and y that have a matching value
inner_join(x, y)
## Joining, by = "name"
## name instrument band
## 1 John guitar TRUE
## 2 Paul bass TRUE
## 3 George guitar TRUE
## 4 Ringo drums TRUE
left_join() : Include all of x, and matching rows of y
left_join(x, y)
## Joining, by = "name"
## name instrument band
## 1 John guitar TRUE
## 2 Paul bass TRUE
## 3 George guitar TRUE
## 4 Ringo drums TRUE
## 5 Stuart bass <NA>
## 6 Pete drums <NA>
semi_join() : Include rows of x that match y but only keep the columns from x
semi_join(x, y)
## Joining, by = "name"
## name instrument
## 1 John guitar
## 2 Paul bass
## 3 George guitar
## 4 Ringo drums
anti_join() : Opposite of semi_join
anti_join(x, y)
## Joining, by = "name"
## name instrument
## 1 Stuart bass
## 2 Pete drums
1.4.8 mutate()
Objective : Creates new variables
mutate
Function : mutate(data, …)
data
: data frame
...
: Expression(s)
head(wide_raw)
## Source Year Anomaly_10y Anomaly_1y Anomaly_5y Unc_10y
## 1 1 1901 -0.162 0.015 0.010 0.109
## 2 1 1902 -0.177 0.028 -0.017 0.108
## 3 1 1903 -0.199 0.049 -0.040 0.104
## 4 1 1904 -0.223 0.068 -0.040 0.105
## 5 1 1905 -0.241 0.128 -0.032 0.107
## 6 1 1906 -0.294 0.158 -0.022 0.106
mu_raw <- mutate(wide_raw, Adj = Anomaly_1y/Anomaly_5y)
head(mu_raw)
## Source Year Anomaly_10y Anomaly_1y Anomaly_5y Unc_10y Adj
## 1 1 1901 -0.162 0.015 0.010 0.109 1.500000
## 2 1 1902 -0.177 0.028 -0.017 0.108 -1.647059
## 3 1 1903 -0.199 0.049 -0.040 0.104 -1.225000
## 4 1 1904 -0.223 0.068 -0.040 0.105 -1.700000
## 5 1 1905 -0.241 0.128 -0.032 0.107 -4.000000
## 6 1 1906 -0.294 0.158 -0.022 0.106 -7.181818
rank_raw <- wide_raw %>% mutate(Adj = Anomaly_1y/Anomaly_5y) %>% arrange(desc(Adj)) %>%
mutate(Rank = 1:nrow(raw))
head(rank_raw)
## Source Year Anomaly_10y Anomaly_1y Anomaly_5y Unc_10y Adj Rank
## 1 1 1910 -0.247 0.217 0.002 0.099 108.50000 1
## 2 1 1909 -0.281 0.186 0.002 0.101 93.00000 2
## 3 1 1914 -0.257 0.344 0.004 0.097 86.00000 3
## 4 1 1919 -0.182 0.433 0.010 0.097 43.30000 4
## 5 3 1986 0.352 -0.257 -0.006 0.012 42.83333 5
## 6 1 1908 -0.328 0.193 0.007 0.103 27.57143 6
- 문제 0. mtcars 데이터에서 cyl , vs , am , gear , carb 을 key , value 로 바꿔서 표현하시오. (gather)
## mpg disp hp drat wt qsec key value
## 1 21.0 160 110 3.90 2.620 16.46 cyl 6
## 2 21.0 160 110 3.90 2.875 17.02 cyl 6
## 3 22.8 108 93 3.85 2.320 18.61 cyl 4
## 4 21.4 258 110 3.08 3.215 19.44 cyl 6
## 5 18.7 360 175 3.15 3.440 17.02 cyl 8
## 6 18.1 225 105 2.76 3.460 20.22 cyl 6
- 문제 1. cyl, mpg 변수 선택 후 cyl별로 mpg의 평균, 분산 , 중앙값을 구해라.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## # A tibble: 3 x 4
## cyl mean_ var_ median_
## <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 20.3 26
## 2 6 19.7 2.11 19.7
## 3 8 15.1 6.55 15.2
- 문제 2. cyl , gear별로 mpg의 평균 , 분산 중앙값을 구하여라
## # A tibble: 8 x 5
## # Groups: cyl [?]
## cyl gear mean_ var_ median_
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 3 21.5 NA 21.5
## 2 4 4 26.9 23.1 25.8
## 3 4 5 28.2 9.68 28.2
## 4 6 3 19.8 5.44 19.8
## 5 6 4 19.8 2.41 20.1
## 6 6 5 19.7 NA 19.7
## 7 8 3 15.0 7.70 15.2
## 8 8 5 15.4 0.32 15.4
- 문제 3. am , vs 가 모두 1인 것과 아닌 것들 나눠서 변수를 생성하고, 생성한 변수별로 hp의 평균 , 분산 , 중앙 값을 구해라.
## # A tibble: 2 x 4
## new_var mean_ var_ median_
## <dbl> <dbl> <dbl> <dbl>
## 1 0 165. 4294. 175
## 2 1 80.6 583. 66
참고자료
'분석 R' 카테고리의 다른 글
[R] magick package 설치 에러 (ubuntu) (0) | 2020.05.26 |
---|---|
R 최신 버전 설치 관련 자료 (0) | 2020.04.29 |
[ R ] roc curve 패키지 비교 (0) | 2019.05.01 |
Kaggle 데이터를 활용한 DataTable 문서화. (0) | 2019.03.17 |
알고리즘 체인과 파이프라인 (0) | 2018.01.25 |