Kaggle 올림픽 데이터를 활용한 EDA 2번째

2019. 3. 16. 17:41분석 R/EDA

728x90
Test5

html이나 rmd가 필요하다면 댓글에 글 남겨주세요.

  • 관련 파일 : athlete_events.csv, noc_regions.csv
  • 필요 패키지 :tidyverse



Library Load

library(tidyverse)


0. Data Load 및 이전 시험 진행과정

athlete <- read.csv("./olympic_data/athlete_events.csv")
Region <- read.csv("./olympic_data/noc_regions.csv")
athlete <- athlete %>% select(-ID)
Region <- Region %>% select(-notes )
athlete$Medal <- as.character(athlete$Medal)
athlete$Medal[is.na(athlete$Medal)] <- "None"
athlete$Medal <- factor(athlete$Medal, levels = c("Gold", "Silver", "Bronze" , "None"))
athlete <- na.omit(athlete)
Region  <- na.omit(Region)



1. athlete data set 에다가 Region data set 을 Left Join

  • 앞에서 5개만 출력.
Total <- athlete %>% left_join( Region , by = "NOC")
## Warning: Column `NOC` joining factors with different levels, coercing to
## character vector
head(Total)
##                       Name Sex Age Height Weight        Team NOC
## 1                A Dijiang   M  24    180     80       China CHN
## 2                 A Lamusi   M  23    170     60       China CHN
## 3 Christine Jacoba Aaftink   F  21    185     82 Netherlands NED
## 4 Christine Jacoba Aaftink   F  21    185     82 Netherlands NED
## 5 Christine Jacoba Aaftink   F  25    185     82 Netherlands NED
## 6 Christine Jacoba Aaftink   F  25    185     82 Netherlands NED
##         Games Year Season        City         Sport
## 1 1992 Summer 1992 Summer   Barcelona    Basketball
## 2 2012 Summer 2012 Summer      London          Judo
## 3 1988 Winter 1988 Winter     Calgary Speed Skating
## 4 1988 Winter 1988 Winter     Calgary Speed Skating
## 5 1992 Winter 1992 Winter Albertville Speed Skating
## 6 1992 Winter 1992 Winter Albertville Speed Skating
##                                Event Medal      region
## 1        Basketball Men's Basketball  None       China
## 2       Judo Men's Extra-Lightweight  None       China
## 3   Speed Skating Women's 500 metres  None Netherlands
## 4 Speed Skating Women's 1,000 metres  None Netherlands
## 5   Speed Skating Women's 500 metres  None Netherlands
## 6 Speed Skating Women's 1,000 metres  None Netherlands



2. 겨울에 열린 경기에서 금메달의 비율 구하기


Step 1 : MedalNone 값을 제외하고 YearMedal 별로 개수 출력

  • 앞에서 5개만 출력.
survey <- Total %>% filter(Medal != "None") %>% group_by( Year , Medal) %>% summarise(n = n())
head(survey)
## # A tibble: 6 x 3
## # Groups:   Year [2]
##    Year Medal      n
##   <int> <fct>  <int>
## 1  1896 Gold      11
## 2  1896 Silver     6
## 3  1896 Bronze     3
## 4  1900 Gold      19
## 5  1900 Silver    10
## 6  1900 Bronze     9


Step 2 : 위에 나온 결과물을 활용해 Year 별로 Medal총개수비율 변수를 생성

  • 앞에서 5개만 출력.
survey2 <- survey %>% group_by(Year) %>% mutate( sum_ = sum(n) , ratio = n/sum_)
head(survey2)
## # A tibble: 6 x 5
## # Groups:   Year [2]
##    Year Medal      n  sum_ ratio
##   <int> <fct>  <int> <int> <dbl>
## 1  1896 Gold      11    20 0.55 
## 2  1896 Silver     6    20 0.3  
## 3  1896 Bronze     3    20 0.15 
## 4  1900 Gold      19    38 0.5  
## 5  1900 Silver    10    38 0.263
## 6  1900 Bronze     9    38 0.237


Step 3 : 위에 나온 결과물을 활용해 MedalGold 이면서 SeasonWinter 인 것만 추출

  • 앞에서 5개만 출력.
sel <- Total %>% filter(Season == "Winter") %>% arrange(Year)
winter_ <- unique(sel$Year) 
survey3 <- survey2 %>% filter(Medal == "Gold" & Year %in% winter_) 
head(survey3)
## # A tibble: 6 x 5
## # Groups:   Year [6]
##    Year Medal     n  sum_ ratio
##   <int> <fct> <int> <int> <dbl>
## 1  1924 Gold     72   185 0.389
## 2  1928 Gold     67   170 0.394
## 3  1932 Gold     66   176 0.375
## 4  1936 Gold     87   193 0.451
## 5  1948 Gold     80   195 0.410
## 6  1952 Gold    124   302 0.411



3. 특정조건을 만족하는 팀 뽑기


Step 1 : Team별로 평균 Height 가 180 초과이고 평균 Weight 이 80이상인팀을 보이기.

  • 평균 Height 별로 내림차순.
  • 앞에서 5개만 출력.
survey <- Total %>% group_by(Team) %>% summarise( mean_hei = mean(Height) , mean_we = mean(Weight) ) %>% filter( mean_hei > 180 , mean_we >= 80) %>% arrange(desc(mean_hei))
head(survey)
## # A tibble: 6 x 3
##   Team               mean_hei mean_we
##   <fct>                 <dbl>   <dbl>
## 1 Puerto Rico-1          196     96  
## 2 Angelita               193     80  
## 3 England-1              193     89  
## 4 Ireland-1              190.    97.7
## 5 Serbia-2               189     83  
## 6 Sydney Rowing Club     189     85


Step 2 : 위의 나온 결과물을 활용해 해당하는 Team 만 뽑고
Team별로 가장 키가 큰사람과 가장 작은 사람의 키 , 두 키의 차이를 출력

  • 앞에서 5개만 출력.
team <- survey$Team

survey2 <- Total %>% filter(Team %in% team ) %>% group_by(Team) %>%
  summarise(max_hei = max(Height), min_hei = min(Height), dif = max(Height) - min(Height)) 

head(survey2)
## # A tibble: 6 x 4
##   Team        max_hei min_hei   dif
##   <fct>         <dbl>   <dbl> <dbl>
## 1 Akatonbo        188     178    10
## 2 Angelita        193     193     0
## 3 Aphrodite       183     180     3
## 4 Ardilla         182     179     3
## 5 Argentina-1     198     175    23
## 6 Argentina-2     194     178    16


Step 3 : 위의 결과물을 활용하여 다음과 같은 처리를 진행

  • Team별로 가장 키가 큰 사람과 가장 작은 사람의 차이가 30이상으로 차이나는 팀만 뽑기
  • 차이나는 정도에 따라서 내림차순으로 정렬.
  • 팀의 문자열의 수가 10이상인 팀들만 뽑기.
survey3 <- survey2 %>%  mutate(team_length =  nchar(as.character(Team))) %>% 
  filter( dif >= 30 & team_length >= 10) %>% arrange(desc(dif)) 

survey3
## # A tibble: 5 x 5
##   Team                  max_hei min_hei   dif team_length
##   <fct>                   <dbl>   <dbl> <dbl>       <int>
## 1 Serbia and Montenegro     214     163    51          21
## 2 Montenegro                199     157    42          10
## 3 Switzerland-1             203     161    42          13
## 4 East Germany-1            192     154    38          14
## 5 Switzerland-2             203     168    35          13



4. SportSex 별로 가장 많은 금메달을 뽑은 사람을 한명씩 뽑기.


step 1 : MedalNone 인 사람은 제외하고 SportName 별로 Medal의 개수를 세기.

  • 앞에서 5개만 출력.
survey <- Total %>% filter( Medal != "None") %>% group_by(Sport, Name) %>% mutate( medal_n= n()) 
head(survey)
## # A tibble: 6 x 16
## # Groups:   Sport, Name [2]
##   Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
##   <fct> <fct> <int>  <int>  <dbl> <fct> <chr> <fct> <int> <fct>  <fct>
## 1 Juha~ M        28    184     85 Finl~ FIN   2014~  2014 Winter Sochi
## 2 Paav~ M        28    175     64 Finl~ FIN   1948~  1948 Summer Lond~
## 3 Paav~ M        28    175     64 Finl~ FIN   1948~  1948 Summer Lond~
## 4 Paav~ M        28    175     64 Finl~ FIN   1948~  1948 Summer Lond~
## 5 Paav~ M        28    175     64 Finl~ FIN   1948~  1948 Summer Lond~
## 6 Paav~ M        32    175     64 Finl~ FIN   1952~  1952 Summer Hels~
## # ... with 5 more variables: Sport <fct>, Event <fct>, Medal <fct>,
## #   region <fct>, medal_n <int>


step 2 : 위의 결과를 활용하여 SportSex 별로 가장 많은 Medal을 얻은 사람을 정렬.

  • Sport , Sex , Name, 메달의 수 만 선택
  • 앞에서 5개만 출력.
survey2 <- survey %>% group_by(Sport , Sex) %>% arrange(desc(medal_n)) %>% select(Sport,Sex,  Name , medal_n)
head(survey2)
## # A tibble: 6 x 4
## # Groups:   Sport, Sex [1]
##   Sport    Sex   Name                    medal_n
##   <fct>    <fct> <fct>                     <int>
## 1 Swimming M     Michael Fred Phelps, II      28
## 2 Swimming M     Michael Fred Phelps, II      28
## 3 Swimming M     Michael Fred Phelps, II      28
## 4 Swimming M     Michael Fred Phelps, II      28
## 5 Swimming M     Michael Fred Phelps, II      28
## 6 Swimming M     Michael Fred Phelps, II      28


step 3 : 위의 결과를 활용하여 SportSex 별로 정렬했을 시 가장 첫번째 열만 뽑기.

  • SportSex 알파벳순서로 정렬.
  • 앞에서 5개만 출력.
survey3 <- survey2 %>% filter( row_number() == 1L)  %>% arrange( Sport , Sex)
head(survey3)
## # A tibble: 6 x 4
## # Groups:   Sport, Sex [6]
##   Sport            Sex   Name                    medal_n
##   <fct>            <fct> <fct>                     <int>
## 1 Alpine Skiing    F     Janica Kosteli                6
## 2 Alpine Skiing    M     Kjetil Andr Aamodt            8
## 3 Archery          F     Kim Su-Nyeong                 6
## 4 Archery          M     Brady Lee Ellison             3
## 5 Art Competitions M     Charles Anthoine Gonnet       1
## 6 Athletics        F     Allyson Michelle Felix        9



728x90

'분석 R > EDA' 카테고리의 다른 글

Data Handling Practice  (0) 2019.04.18
Tidyverse (ggplot)  (0) 2019.03.19
Kaggle BlackFriday 데이터를 활용한 EDA  (0) 2019.03.16
Kaggle 올림픽 데이터를 활용한 EDA 1번째  (0) 2019.03.16