Data Handling Practice

leesungreong

Library Load

library(tidyverse)
library(mice)
library(VIM)
library(knitr)
library(RColorBrewer)

Data Load: Rain in Australia

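## stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), matching the output below
data <- read.csv("./weatherAUS.csv", stringsAsFactors = TRUE)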
str(data)

## 'data.frame':    142193 obs. of  24 variables:
##  $ Date         : Factor w/ 3436 levels "2007-11-01","2007-11-02",..: 397 398 399 400 401 402 403 404 405 406 ...
##  $ Location     : Factor w/ 49 levels "Adelaide","Albany",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ MinTemp      : num  13.4 7.4 12.9 9.2 17.5 14.6 14.3 7.7 9.7 13.1 ...
##  $ MaxTemp      : num  22.9 25.1 25.7 28 32.3 29.7 25 26.7 31.9 30.1 ...
##  $ Rainfall     : num  0.6 0 0 0 1 0.2 0 0 0 1.4 ...
##  $ Evaporation  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Sunshine     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WindGustDir  : Factor w/ 16 levels "E","ENE","ESE",..: 14 15 16 5 14 15 14 14 7 14 ...
##  $ WindGustSpeed: int  44 44 46 24 41 56 50 35 80 28 ...
##  $ WindDir9am   : Factor w/ 16 levels "E","ENE","ESE",..: 14 7 14 10 2 14 13 11 10 9 ...
##  $ WindDir3pm   : Factor w/ 16 levels "E","ENE","ESE",..: 15 16 16 1 8 14 14 14 8 11 ...
##  $ WindSpeed9am : int  20 4 19 11 7 19 20 6 7 15 ...
##  $ WindSpeed3pm : int  24 22 26 9 20 24 24 17 28 11 ...
##  $ Humidity9am  : int  71 44 38 45 82 55 49 48 42 58 ...
##  $ Humidity3pm  : int  22 25 30 16 33 23 19 19 9 27 ...
##  $ Pressure9am  : num  1008 1011 1008 1018 1011 ...
##  $ Pressure3pm  : num  1007 1008 1009 1013 1006 ...
##  $ Cloud9am     : int  8 NA NA NA 7 NA 1 NA NA NA ...
##  $ Cloud3pm     : int  NA NA 2 NA 8 NA NA NA NA NA ...
##  $ Temp9am      : num  16.9 17.2 21 18.1 17.8 20.6 18.1 16.3 18.3 20.1 ...
##  $ Temp3pm      : num  21.8 24.3 23.2 26.5 29.7 28.9 24.6 25.5 30.2 28.2 ...
##  $ RainToday    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ RISK_MM      : num  0 0 0 1 0.2 0 0 0 1.4 0 ...
##  $ RainTomorrow : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...

Duplicate check

sum(duplicated(data))

## [1] 0

Keep only the dates on which every location has weather data

UNI_loc <- length(unique(data$Location))
sample_dat <- data %>% 
  group_by(Date) %>%
  summarise(n=n()) %>%
  filter( n == UNI_loc )

Date_49N <- sample_dat$Date

data2 <- data %>% 
  filter(Date %in% Date_49N)

data2$Date <- factor(data2$Date , levels =  Date_49N)
str(data2)

## 'data.frame':    22638 obs. of  24 variables:
##  $ Date         : Factor w/ 462 levels "2013-03-02","2013-03-03",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Location     : Factor w/ 49 levels "Adelaide","Albany",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ MinTemp      : num  14.3 12 12.8 14.4 16.6 19.7 20.1 19.4 17.7 15.5 ...
##  $ MaxTemp      : num  29.2 31.8 31 31.3 33.8 35.1 35.7 33.7 33.9 30.7 ...
##  $ Rainfall     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Evaporation  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Sunshine     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WindGustDir  : Factor w/ 16 levels "E","ENE","ESE",..: 3 2 2 1 11 10 4 5 14 7 ...
##  $ WindGustSpeed: int  41 37 24 28 30 24 26 35 35 31 ...
##  $ WindDir9am   : Factor w/ 16 levels "E","ENE","ESE",..: 10 3 2 3 1 9 3 11 6 16 ...
##  $ WindDir3pm   : Factor w/ 16 levels "E","ENE","ESE",..: 11 5 3 3 3 15 11 7 8 14 ...
##  $ WindSpeed9am : int  17 13 2 7 4 4 4 7 22 7 ...
##  $ WindSpeed3pm : int  24 17 15 15 9 7 13 13 17 17 ...
##  $ Humidity9am  : int  46 52 64 65 64 65 56 66 47 63 ...
##  $ Humidity3pm  : int  28 23 23 34 29 24 32 32 28 26 ...
##  $ Pressure9am  : num  1022 1022 1022 1020 1018 ...
##  $ Pressure3pm  : num  1019 1018 1018 1016 1016 ...
##  $ Cloud9am     : int  NA NA NA NA NA 1 5 NA NA NA ...
##  $ Cloud3pm     : int  NA NA NA NA NA NA NA NA 7 NA ...
##  $ Temp9am      : num  18.9 20 18.2 20.2 22 22.5 24.2 23.2 24.3 18.4 ...
##  $ Temp3pm      : num  27.4 29.7 29.4 29.9 32.3 34.5 33.5 32.9 32.5 28.6 ...
##  $ RainToday    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RISK_MM      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ RainTomorrow : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
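
The filtering step above can be sketched on a toy table using base R only. The data frame `toy` and its values are made up for illustration, and the sketch assumes there is at most one row per (Date, Location) pair, which the duplicate check earlier suggests is the case here.

```r
# Toy data: 3 locations, 3 dates; date "d2" has no row for location "C"
toy <- data.frame(
  Date     = c("d1", "d1", "d1", "d2", "d2", "d3", "d3", "d3"),
  Location = c("A",  "B",  "C",  "A",  "B",  "A",  "B",  "C")
)

n_loc <- length(unique(toy$Location))

# Rows per date; a date is complete when it has one row per location
rows_per_date  <- table(toy$Date)
complete_dates <- names(rows_per_date)[rows_per_date == n_loc]

# Keep only the rows that belong to a complete date
toy_complete <- toy[toy$Date %in% complete_dates, ]
print(complete_dates)
```

Only "d1" and "d3" survive, because "d2" is reported by two of the three locations.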

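Why do missing values occur?

Types

Missing Completely At Random (MCAR)
- The missingness of a variable is unrelated to any other variable, whether observed or unobserved.
Missing At Random (MAR)
- The missingness of a variable is related to other observed variables, but not to the unobserved values of the variable itself.
Not Missing At Random (NMAR)
- The missingness of a variable is neither MCAR nor MAR.
- Example: people with low incomes are more likely than people with high incomes to have a missing income value (assumption: people with low incomes tend to avoid income-related survey questions).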
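
To make the three mechanisms concrete, here is a small simulation sketch. The variables `age` and `income`, the sample size, and the missingness probabilities are all hypothetical choices for illustration and do not come from the weather data.

```r
set.seed(1)
n      <- 1000
age    <- round(runif(n, 20, 70))
income <- 1000 + 50 * age + rnorm(n, sd = 500)

# MCAR: every income value has the same 20% chance of being missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: missingness depends only on the observed variable age
income_mar <- ifelse(runif(n) < ifelse(age < 40, 0.4, 0.1), NA, income)

# NMAR: missingness depends on the (unobserved) income value itself
income_nmar <- ifelse(runif(n) < ifelse(income < median(income), 0.4, 0.1),
                      NA, income)

# Mean of the observed values under each mechanism, next to the true mean
round(c(true = mean(income),
        MCAR = mean(income_mcar, na.rm = TRUE),
        MAR  = mean(income_mar,  na.rm = TRUE),
        NMAR = mean(income_nmar, na.rm = TRUE)))
```

Under NMAR the low incomes are dropped more often, so the mean of what remains overstates the true mean, and nothing in the observed data reveals by how much.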

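In typical data analysis, missing values are treated as MCAR or MAR, and modeling proceeds after imputing or deleting them.
Things to check
1. How much of the data is missing?
2. Is the missingness random, or does it follow a pattern?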
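
Both questions can be answered with a few lines of base R. The sketch below uses a tiny made-up data frame `df` in place of the real data; the column names are borrowed from the weather data, but the values are hypothetical.

```r
# Small stand-in for the real data (hypothetical values)
df <- data.frame(
  Sunshine = c(NA, 7.1, NA, NA),
  MinTemp  = c(13.4, NA, 12.9, 9.2),
  Location = c("Albury", "Albury", "Cobar", "Cobar")
)

# 1. How much is missing? Proportion of NA per column, largest first
miss_rate <- sort(colMeans(is.na(df)), decreasing = TRUE)
print(miss_rate)

# 2. Is there a pattern? Label each row by the set of columns that are missing
pattern <- apply(is.na(df), 1,
                 function(r) paste(names(df)[r], collapse = "+"))
print(table(pattern))
```

A pattern table dominated by a few combinations hints that the missingness is structured rather than scattered at random.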

Check the dimensions of the rows that contain missing values

data2[!complete.cases(data2),] %>% dim()

## [1] 15600    24

Inspecting the missing data

Check the overall proportion of missing values.

mice_plot <- aggr(data2 , col=c('navyblue','yellow'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(data2), cex.axis=.7,
                  gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##       Variable       Count
##       Sunshine 0.593117767
##    Evaporation 0.497393763
##       Cloud3pm 0.455870660
##       Cloud9am 0.419162470
##    Pressure9am 0.098551109
##    Pressure3pm 0.098286068
##     WindDir9am 0.071251877
##    WindGustDir 0.054907677
##  WindGustSpeed 0.054377595
##    Humidity3pm 0.035206290
##     WindDir3pm 0.032025797
##        Temp3pm 0.029198692
##   WindSpeed3pm 0.024825515
##       Rainfall 0.010866684
##      RainToday 0.010866684
##    Humidity9am 0.009320611
##   WindSpeed9am 0.007642018
##        MinTemp 0.004594045
##        Temp9am 0.004019790
##        MaxTemp 0.002782931
##           Date 0.000000000
##       Location 0.000000000
##        RISK_MM 0.000000000
##   RainTomorrow 0.000000000

How many missing values does each variable have?

mice_plot

## 
##  Missings in variables:
##       Variable Count
##        MinTemp   104
##        MaxTemp    63
##       Rainfall   246
##    Evaporation 11260
##       Sunshine 13427
##    WindGustDir  1243
##  WindGustSpeed  1231
##     WindDir9am  1613
##     WindDir3pm   725
##   WindSpeed9am   173
##   WindSpeed3pm   562
##    Humidity9am   211
##    Humidity3pm   797
##    Pressure9am  2231
##    Pressure3pm  2225
##       Cloud9am  9489
##       Cloud3pm 10320
##        Temp9am    91
##        Temp3pm   661
##      RainToday   246

`marginplot`: distribution of missing values between Pressure9am and Pressure3pm

Red points: observations in which at least one of the two values is missing.

#scattmatrixMiss(data2, interactive = F, highlight = c("Sunshine"))
colSums(is.na(data2[c("Pressure9am","Pressure3pm")]))

## Pressure9am Pressure3pm 
##        2231        2225

sum(rowSums(is.na(data2[c("Pressure9am","Pressure3pm")])) > 1 )

## [1] 2210

marginplot(data2[c("Pressure9am","Pressure3pm")],
           pch=20)
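
The same question (do the two pressure readings go missing together?) can also be answered numerically with a cross-tabulation. This is a minimal sketch on made-up vectors `p9` and `p3` standing in for Pressure9am and Pressure3pm.

```r
# Hypothetical readings standing in for Pressure9am / Pressure3pm
p9 <- c(1008, NA, 1011, NA, 1018, NA)
p3 <- c(1007, NA, 1008, NA, NA, 1013)

# Rows: is the 9am value missing? Columns: is the 3pm value missing?
miss_tab <- table(p9_missing = is.na(p9), p3_missing = is.na(p3))
print(miss_tab)

# Number of observations where both readings are missing
both_missing <- sum(is.na(p9) & is.na(p3))
print(both_missing)
```

When the TRUE/TRUE cell holds most of the missing rows, as it does in the counts above (2210 of roughly 2230), the two variables are mostly missing on the same observations.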

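Causes of missing data and how to handle each

Is the missing data concentrated in a few specific variables?
1. If it is concentrated in specific variables, should those variables be removed?
2. If the proportion of missing values is small, should only the affected observations be deleted?
3. Should the missing values be replaced with a summary statistic?
4. Should the missing values be filled in by multiple imputation (MICE)?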
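
One way to turn the first three options into code is to decide per column from its missing rate. The sketch below is only an illustration: the 50% cut-off is an arbitrary assumption, and the data frame `df` holds made-up values.

```r
df <- data.frame(
  Sunshine = c(NA, NA, NA, 7.1, NA),
  MinTemp  = c(13.4, NA, 12.9, 9.2, 17.5),
  MaxTemp  = c(22.9, 25.1, 25.7, 28.0, 32.3)
)

miss_rate <- colMeans(is.na(df))

# Assumed cut-off: drop variables that are more than half missing
drop_cols <- names(miss_rate)[miss_rate > 0.5]
df_kept   <- df[, !(names(df) %in% drop_cols), drop = FALSE]

# The remaining gaps are sparse, so fill them with the column mean
for (col in names(df_kept)) {
  x <- df_kept[[col]]
  df_kept[[col]][is.na(x)] <- mean(x, na.rm = TRUE)
}
print(df_kept)
```

Deleting the affected rows instead (option 2) would be `na.omit(df_kept)` applied before the fill loop.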

If the missing values carry no meaning, remove them all

## Remove every row that contains a missing value
na.omit(data2)

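Replacing `Numeric` missing values with per-column statistics

Fill each column with a statistic computed from its non-missing values.
Use a variable that is a strong indicator for the missing value as the key, and replace with a statistic computed within each group of that key.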

na_check <- data2 %>% 
  select_if(is.numeric) %>%
  is.na() %>% colSums()
na_col <- na_check[na_check > 0] %>%
  names()
data3 <- data2
colSums(is.na(data3))

##          Date      Location       MinTemp       MaxTemp      Rainfall 
##             0             0           104            63           246 
##   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am 
##         11260         13427          1243          1231          1613 
##    WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm 
##           725           173           562           211           797 
##   Pressure9am   Pressure3pm      Cloud9am      Cloud3pm       Temp9am 
##          2231          2225          9489         10320            91 
##       Temp3pm     RainToday       RISK_MM  RainTomorrow 
##           661           246             0             0

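## Replace with the per-location mean
## (across() replaces the deprecated mutate_at() + funs(); needs dplyr >= 1.0)
data3 <- data3 %>%
  group_by(Location) %>%
  mutate(across(all_of(na_col),
                ~ ifelse(is.na(.x),
                         mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()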



colSums(is.na(data3))

##          Date      Location       MinTemp       MaxTemp      Rainfall 
##             0             0             0             0             0 
##   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am 
##          7854          9702          1243           924          1613 
##    WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm 
##           725             0             0             0             0 
##   Pressure9am   Pressure3pm      Cloud9am      Cloud3pm       Temp9am 
##          1848          1848          5544          5544             0 
##       Temp3pm     RainToday       RISK_MM  RainTomorrow 
##             0           246             0             0

## Locations with no observed value are still missing; fill those with the overall mean
# method1
data3 <- data3 %>%
  mutate(across(all_of(na_col),
                ~ ifelse(is.na(.x),
                         mean(.x, na.rm = TRUE), .x)))
# method2
#for(i in na_col){
#  data3[[i]][is.na(data3[[i]])] <- mean(data3[[i]], na.rm = TRUE)
#}
colSums(is.na(data3))

##          Date      Location       MinTemp       MaxTemp      Rainfall 
##             0             0             0             0             0 
##   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am 
##             0             0          1243             0          1613 
##    WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm 
##           725             0             0             0             0 
##   Pressure9am   Pressure3pm      Cloud9am      Cloud3pm       Temp9am 
##             0             0             0             0             0 
##       Temp3pm     RainToday       RISK_MM  RainTomorrow 
##             0           246             0             0
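
The two-stage fill above (per-location mean first, overall mean for whatever is left) can be traced on a toy example with base R. The data frame `df` is made up for illustration; location "B" has no observed value at all, which is the situation that left thousands of Evaporation and Sunshine values missing after the first stage.

```r
df <- data.frame(
  Location    = c("A", "A", "A", "B", "B"),
  Evaporation = c(4, NA, 6, NA, NA)
)

# Stage 1: per-location mean; location B has nothing observed, so its mean is NaN
loc_mean <- ave(df$Evaporation, df$Location,
                FUN = function(x) mean(x, na.rm = TRUE))
filled <- ifelse(is.na(df$Evaporation), loc_mean, df$Evaporation)

# Stage 2: whatever is still missing (is.na() is TRUE for NaN) gets the overall mean
filled[is.na(filled)] <- mean(filled, na.rm = TRUE)
print(filled)
```

Note that the stage-2 mean is taken over the column after stage 1, so it already includes the values filled in for location "A", as in the pipeline above.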

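Replacing `factor` missing values with per-column statistics

Impute the missing values of the factor variable WindGustDir.
Replace each missing value with the most frequent value within its Location.
If a Location has no observed value, use the most frequent value across all Locations.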

## Count WindGustDir values per location, ignoring the missing ones
## (filtering on WindGustDir only, so NAs in other columns do not drop rows)
Wind_summ <- data3 %>%
  filter(!is.na(WindGustDir)) %>%
  group_by(Location , WindGustDir) %>% summarise(n=n()) %>%
  ungroup()
## Keep a single most frequent direction per location
## (which.max takes the first on ties, so the join below cannot duplicate rows)
Wind_summ_max <-  Wind_summ  %>% group_by(Location) %>%
  slice(which.max(n)) %>%
  ungroup() %>%
  select(-n)
## Among the per-location modes, the direction that appears most often
max_bin <- Wind_summ_max %>% 
  group_by(WindGustDir) %>% 
  summarise(n=n()) %>%
  filter(n == max(n)) %>%
  sample_n(1)
max_bin <- max_bin$WindGustDir %>% as.character()
## Left join the per-location mode onto the actual data
se <- data3 %>% select(Location , WindGustDir) %>% 
  left_join(Wind_summ_max , by = "Location")
## Locations with no observed direction fall back to the overall choice
se$WindGustDir.y[is.na(se$WindGustDir.y)] <- max_bin

## Replace the missing values
na_idx <- is.na(se$WindGustDir.x)
data3$WindGustDir[na_idx] <- se$WindGustDir.y[na_idx]
## Confirm that no missing values remain
data3[!complete.cases(data3$WindGustDir),]

## # A tibble: 0 x 24
## # ... with 24 variables: Date <fct>, Location <fct>, MinTemp <dbl>,
## #   MaxTemp <dbl>, Rainfall <dbl>, Evaporation <dbl>, Sunshine <dbl>,
## #   WindGustDir <fct>, WindGustSpeed <dbl>, WindDir9am <fct>,
## #   WindDir3pm <fct>, WindSpeed9am <dbl>, WindSpeed3pm <dbl>,
## #   Humidity9am <dbl>, Humidity3pm <dbl>, Pressure9am <dbl>,
## #   Pressure3pm <dbl>, Cloud9am <dbl>, Cloud3pm <dbl>, Temp9am <dbl>,
## #   Temp3pm <dbl>, RainToday <fct>, RISK_MM <dbl>, RainTomorrow <fct>
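
The same idea, reduced to base R on a toy table: fill each missing category with the mode of its group, and fall back to a global mode when the group has nothing observed. The helper `mode_value()` and the data are hypothetical, and the fallback here is the mode over all observations, which is a slightly simpler rule than the "mode of the per-location modes" used above.

```r
# Most frequent non-missing value of a vector (first one on ties)
mode_value <- function(x) {
  counts <- table(x)                        # table() ignores NA by default
  if (length(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

df <- data.frame(
  Location    = c("A", "A", "A", "A", "B", "B"),
  WindGustDir = c("N", "N", "W", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Per-location mode, then the overall mode as a fallback
loc_mode     <- tapply(df$WindGustDir, df$Location, mode_value)
overall_mode <- mode_value(df$WindGustDir)

fill <- loc_mode[df$Location]
fill[is.na(fill)] <- overall_mode

df$WindGustDir <- ifelse(is.na(df$WindGustDir), fill, df$WindGustDir)
print(df)
```

Location "B" has no observed direction, so both of its rows receive the overall mode "N".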

When missing data can be filled in mathematically or logically

## Illustrative formula only (WEIGHT and Height are not columns of this dataset)
## BMI = WEIGHT / Height^2
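
As a runnable version of the BMI idea, the sketch below fills a missing BMI from the two variables that define it. The data frame `people`, its column names, and its values are hypothetical; height is assumed to be in metres and weight in kilograms.

```r
# Hypothetical survey records; height in metres, weight in kilograms
people <- data.frame(
  Weight = c(70, 82, 55),
  Height = c(1.75, 1.80, 1.60),
  BMI    = c(NA, 25.3, NA)
)

# Fill a missing BMI from the variables that define it
need <- is.na(people$BMI)
people$BMI[need] <- people$Weight[need] / people$Height[need]^2
print(round(people$BMI, 1))
```

Values that were already recorded are left untouched; only the gaps are derived.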

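Imputing missing values with MICE

Multiple imputation
(multiple imputation by chained equations)
Simulation is used to produce 3 to 10 completed datasets in which the missing data are filled in.
The datasets are generated with the Monte Carlo method.
Related packages: Amelia, mi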
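
A minimal sketch of that workflow, using the small `nhanes` example dataset that ships with the mice package instead of the weather data. The choice of `m = 5` and the model `bmi ~ age` are arbitrary assumptions for illustration.

```r
library(mice)

# nhanes: small example dataset bundled with mice (25 rows, 4 variables)
imp <- mice(nhanes, m = 5, method = "pmm", seed = 500, printFlag = FALSE)

# Each of the m completed datasets can be extracted on its own
first_completed <- complete(imp, 1)
print(colSums(is.na(first_completed)))

# Or fit the same model on all m datasets and pool the estimates
fit <- with(imp, lm(bmi ~ age))
print(summary(pool(fit)))
```

Pooling across the completed datasets is what distinguishes multiple imputation from filling the gaps once: the spread between the m fits feeds into the reported standard errors.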

## Generate the simulated (imputed) datasets
imputation_data <- mice( data2  , m=2, maxit = 2 , 
                         method = 'pmm', seed = 500 ,
                         printFlag = FALSE)

## Replace the missing values with the first imputed dataset
for( i in colnames(data2)){
  print(i)
  if(sum(!complete.cases(data2[i])) == 0 ){
    next
  }else{
    data2[i][ !complete.cases(data2[i]), ] <- as.data.frame(imputation_data$imp[i])[,1]
  }
}

impute_data <- read.csv("./weatherAUS_impute.csv", stringsAsFactors = TRUE)
colSums(is.na(impute_data))

##             X          Date      Location       MinTemp       MaxTemp 
##             0             0             0             0             0 
##      Rainfall   Evaporation      Sunshine   WindGustDir WindGustSpeed 
##             0             0             0             0             0 
##    WindDir9am    WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am 
##             0             0             0             0             0 
##   Humidity3pm   Pressure9am   Pressure3pm      Cloud9am      Cloud3pm 
##             0             0             0             0             0 
##       Temp9am       Temp3pm     RainToday       RISK_MM  RainTomorrow 
##             0             0             0             0             0

## Drop the first column, X, which only holds a row index
impute_data <- impute_data[,-1]

Data Wrangling

Q0. Mean and standard deviation of WindGustSpeed, WindSpeed9am, and WindSpeed3pm by WindGustDir

wind_summary <- impute_data %>% group_by(WindGustDir) %>% 
  summarise(mean_Gust = mean(WindGustSpeed) , std_Gust = sd(WindGustSpeed),
            mean_9AM = mean(WindSpeed9am) , std_9AM = sd(WindSpeed9am) ,
            mean_3PM = mean(WindSpeed3pm) , std_3PM = sd(WindSpeed3pm))
wind_summary

## # A tibble: 16 x 7
##    WindGustDir mean_Gust std_Gust mean_9AM std_9AM mean_3PM std_3PM
##    <fct>           <dbl>    <dbl>    <dbl>   <dbl>    <dbl>   <dbl>
##  1 E                36.2     11.4     13.5    8.28     16.4    7.60
##  2 ENE              34.2     11.1     12.5    7.94     16.0    8.34
##  3 ESE              36.1     11.1     13.2    8.25     16.4    7.35
##  4 N                40.4     14.9     14.9   10.7      18.2    8.96
##  5 NE               33.9     11.0     11.4    7.30     16.8    8.82
##  6 NNE              37.2     12.6     12.7    8.65     17.6    8.98
##  7 NNW              39.7     14.7     12.6    8.66     18.2    8.56
##  8 NW               43.1     16.7     13.7    9.03     19.7    9.56
##  9 S                39.9     13.1     15.1    9.05     18.8    9.09
## 10 SE               37.7     11.1     14.4    8.61     18.3    8.45
## 11 SSE              38.0     11.2     15.0    8.67     18.2    8.69
## 12 SSW              39.3     13.1     13.9    8.78     18.4    8.56
## 13 SW               39.2     13.2     14      8.73     18.2    8.11
## 14 W                44.4     17.1     15.2    9.33     20.3    9.82
## 15 WNW              45.3     17.1     15.0    9.40     21.0   10.1 
## 16 WSW              42.2     15.1     14.2    8.79     19.5    8.65

Q1. Visualize the summary above

colourCount = length(unique(impute_data$WindGustDir))
getPalette = colorRampPalette(brewer.pal(9, "Set3"))

ggplot(wind_summary , aes(WindGustDir , mean_Gust, fill =WindGustDir )) +
  geom_bar(stat="identity" , color= "black" ) +
  geom_errorbar(aes( ymin = mean_Gust-std_Gust, ymax = mean_Gust+std_Gust),
                width = 0.2) +
  scale_fill_manual(values = getPalette(colourCount)) +
  coord_flip() +
  theme_classic() +
  labs(x="Wind direction", y = "Mean gust speed" , title = "Mean gust speed by wind direction (error bars)")

Q2. Error bars of WindGustSpeed, WindSpeed9am, and WindSpeed3pm by WindGustDir

## Wind speed summary: reshape the means and the standard deviations to long format
a <- gather(wind_summary , mean_key , mean_value , starts_with("mean")) %>% as.data.frame() %>% 
  select(WindGustDir , mean_key , mean_value )
b <- gather(wind_summary , std_key , std_value , starts_with("std")) %>% as.data.frame() %>%
  select(WindGustDir , std_key , std_value )

## The join pairs every mean_key with every std_key within each direction
Data <- inner_join(a,b, by ="WindGustDir")
uni_mean_key = unique(Data$mean_key)
uni_std_key = unique(Data$std_key)
## Keep only the matching pairs (Gust with Gust, 9AM with 9AM, 3PM with 3PM)
a <- Data %>% group_by(WindGustDir) %>% filter(mean_key == uni_mean_key[1] & std_key == uni_std_key[1] )
b <- Data %>% group_by(WindGustDir) %>% filter(mean_key == uni_mean_key[2] & std_key == uni_std_key[2] )
c <- Data %>% group_by(WindGustDir) %>% filter(mean_key == uni_mean_key[3] & std_key == uni_std_key[3] )
summary <- do.call(rbind , list(a,b,c))

ggplot(summary , aes(WindGustDir , mean_value , fill = mean_key))  +
  geom_bar(stat="identity", position = "dodge") +
  geom_errorbar(aes(x = WindGustDir , 
                    ymin = mean_value-std_value ,
                    ymax = mean_value+std_value ), 
                color = "black" , 
                position=position_dodge(0.9),width=0.5)  +
  theme_classic() + 
  coord_flip() +
  labs(x = "Wind direction" ,  y = "Mean wind speed", title = "Mean wind speed by wind direction (error bars)")
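
The long table of matched mean/std pairs can also be produced in one step with `tidyr::pivot_longer()` (tidyr 1.0 or later). This is an alternative sketch, not the code used for the plot above; `wind_small` copies two rows of the summary table printed earlier, and the column name `speed` is an arbitrary choice.

```r
library(tidyr)

# Two rows in the same shape as wind_summary (values copied from the table above)
wind_small <- data.frame(
  WindGustDir = c("E", "N"),
  mean_Gust = c(36.2, 40.4), std_Gust = c(11.4, 14.9),
  mean_9AM  = c(13.5, 14.9), std_9AM  = c(8.28, 10.7),
  mean_3PM  = c(16.4, 18.2), std_3PM  = c(7.60, 8.96)
)

# ".value" turns the part before "_" (mean / std) into column names
# and moves the part after "_" (Gust / 9AM / 3PM) into the `speed` column
wind_long <- pivot_longer(
  wind_small,
  cols      = -WindGustDir,
  names_to  = c(".value", "speed"),
  names_sep = "_"
)
print(wind_long)
```

Each row of `wind_long` then carries a direction, a speed type, and its `mean` and `std`, which is the shape `geom_errorbar()` needs.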

Mean MinTemp by location and month, as a table

Split Date into year, month, and day

impute_data$Date <- as.character(impute_data$Date)
impute_data2 <- impute_data %>% separate(Date , c("Year", "Month", "Day"))


impute_data3 <- impute_data2 %>% 
  group_by( Location , Month) %>% 
  summarise(mean_min_temp = mean(MinTemp))

impute_data3 %>% spread(Month , mean_min_temp) %>% ungroup()

## # A tibble: 49 x 13
##    Location  `01`  `02`  `03`  `04`  `05`  `06`  `07`  `08`  `09`  `10`
##    <fct>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Adelaide  17.7  17.3  17.8 13.5  10.8   8.16  8.44  8.1  11    11.6 
##  2 Albany    16.4  16.4  16.6 14.8  11.7  10.2  10.1   9.24  9.99 11.5 
##  3 Albury    17.0  15.3  14.9  9.01  6.41  4.08  4.89  4.61  6.92  8.00
##  4 AliceSp~  22.0  19.0  19.0 13.1   8.65  6.03  3.89  6.00 10.5  13.9 
##  5 Badgery~  17.5  17.6  15.4 11.0   8.02  7.15  4.71  5.78  8.11 10.8 
##  6 Ballarat  12.0  11.2  12.0  7.61  5.67  3.73  3.53  4.56  6.25  6.01
##  7 Bendigo   15.1  13.9  14.2  9.61  6.23  3.89  4.19  4.82  6.96  6.85
##  8 Brisbane  21.9  21.8  20.6 16.6  14.8  12.0  11.5  11.6  14.3  16.4 
##  9 Cairns    24.1  24.4  23.5 22.1  21.2  18.8  18.0  17.4  19.5  20.9 
## 10 Canberra  13.5  12.4  10.7  5.54  2.76  2.08  1.53  2.06  3.67  5.45
## # ... with 39 more rows, and 2 more variables: `11` <dbl>, `12` <dbl>
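
For comparison, the date split and the location-by-month table can be sketched in base R by parsing the strings as real dates. The dates and temperatures below are made up for illustration.

```r
dates <- as.Date(c("2013-03-02", "2013-03-03", "2014-11-15"))

# Extract the parts with format(); each comes back as a zero-padded string
parts <- data.frame(
  Year  = format(dates, "%Y"),
  Month = format(dates, "%m"),
  Day   = format(dates, "%d")
)
print(parts)

# Mean per (Location, Month) as a wide table, using hypothetical readings
obs <- data.frame(
  Location = c("Albury", "Albury", "Cobar"),
  Month    = as.character(parts$Month),
  MinTemp  = c(14.3, 12.0, 18.5)
)
wide <- tapply(obs$MinTemp, list(obs$Location, obs$Month), mean)
print(wide)
```

Combinations with no observation come out as NA in the wide table, which makes gaps in coverage easy to spot.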

Data Visualization

Visualize only the first 10 locations

filter_n <- unique(impute_data3$Location)[1:10]

filter_n <- impute_data3 %>% filter(Location %in% filter_n)

ggplot(filter_n, aes(x = Location , y = mean_min_temp , fill = Month)) +
  geom_bar(stat="identity" , position = "dodge") + 
  coord_flip()  +
  theme_classic() + 
  ggtitle("Mean minimum temperature by month") +
  labs(y= "Monthly mean minimum temperature") +
  theme(plot.title = element_text(color="black", size=16, face="bold.italic")) +
  scale_fill_brewer(palette="Set3")

Minimum temperature trend by year for 10 cities

impute_data3 <- impute_data2 %>% 
  group_by( Location , Year) %>% 
  summarise(mean_min_temp = mean(MinTemp))

filter_n <- unique(impute_data3$Location)[1:10]

filter_n <- impute_data3 %>% filter(Location %in% filter_n)

ggplot(filter_n, aes(x = Location , y = mean_min_temp , fill = Year)) +
  geom_bar(stat="identity" , position = "dodge") + 
  coord_flip()  +
  theme_classic() + 
  ggtitle("Mean minimum temperature by year") +
  labs(y= "Yearly mean minimum temperature") +
  theme(plot.title = element_text(color="black", size=16, face="bold.italic")) +
  scale_fill_brewer(palette="Accent")

Density plot of daily sunshine hours by rain occurrence

ggplot(impute_data2, aes(x = Sunshine  , fill =  RainToday, color = RainTomorrow)) +
  geom_density(alpha = 0.5) +
  scale_fill_brewer(palette="Set3") +
  theme_classic()

As common sense suggests, daily sunshine hours and the occurrence of rain appear to be related.

