EDA with the Kaggle BlackFriday Data

LeeSungRyeong

Related file: BlackFriday.csv
Required packages: tidyverse, gridExtra

Tips

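How to adjust figure size and alignment
- Set the options in the R Markdown chunk header: {r, fig.align='center', fig.width=12, fig.height=9}
How to suppress warnings and messages when loading libraries
- Set the options in the R Markdown chunk header: {r, warning=FALSE, message=FALSE}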

Library Load

library(tidyverse)
library(gridExtra)

0. Data Load

The BlackFriday data can be downloaded from Kaggle.

Data used: BlackFriday.csv

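# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0.0),
# which is what the factor handling (droplevels) further down expects
dt <- read.csv("../Data/BlackFriday.csv", stringsAsFactors = TRUE)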

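1. Visualizing purchase counts per user (1)

[Required steps]
Use data deduplicated per user (User_ID), restricted to users with 600 or more purchases

Create a new purchase-count variable for each User_ID
Sort User_ID by Age and purchase count (order), then plot
Gender Color
- M = "grey50"
- F = "pink"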

Hint

Remove unneeded levels from factor variables
scale_colour_manual
scale_fill_brewer
fill color = use Spectral
geom_bar size = 1

# Count purchases per user and keep users with 600 or more
dt2 <- dt %>% group_by(User_ID) %>% mutate( buy_n = n()) %>% filter( buy_n >= 600 )

# Keep one row per user
dt2 <- dt2[!duplicated(dt2$User_ID) , ]

# Drop factor levels that no longer occur
dt2 <- droplevels(dt2)

# Order users by Age, then by purchase count, and fix that order as the factor levels
user_order <- dt2$User_ID[order(dt2$Age  , dt2$buy_n)]

dt2$User_ID <- factor(dt2$User_ID , levels = user_order)

# Horizontal bars: fill by Age, outline colour by Gender
dt2 %>% ggplot( aes(x = User_ID , y= buy_n , fill = Age , col = Gender)) +
  geom_bar(stat="identity" , position ="dodge", size=1) + coord_flip() + 
  scale_colour_manual(values=c("F" = "pink", "M"="grey50") ) +
  scale_fill_brewer(palette="Spectral") + 
  labs(y="Purchase Amount" , x = "User ID" , 
       title = "Number of purchases per user(>600)")
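
The count, deduplicate, and reorder steps above can be tried on a few rows without the CSV. The sketch below is only an illustration: the `toy` data frame and the threshold of 3 are made up, and `distinct()` stands in for the `!duplicated()` indexing used above.

```r
library(dplyr)

# Hypothetical toy data: one row per purchase (not real BlackFriday rows)
toy <- data.frame(
  User_ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  Age     = c(rep("26-35", 4), rep("18-25", 2), rep("26-35", 3)),
  stringsAsFactors = FALSE
)

# Count purchases per user, keep users with at least 3 purchases,
# then keep a single row per user
users <- toy %>%
  group_by(User_ID) %>%
  mutate(buy_n = n()) %>%
  ungroup() %>%
  filter(buy_n >= 3) %>%
  distinct(User_ID, .keep_all = TRUE)

# Sort users by Age, then by purchase count, and store that order in the factor levels;
# ggplot2 draws a discrete axis in factor-level order
user_order <- users$User_ID[order(users$Age, users$buy_n)]
users$User_ID <- factor(users$User_ID, levels = user_order)

levels(users$User_ID)
```

Because the axis follows the factor levels rather than the row order, changing `levels` is what moves the bars.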

2. Visualizing purchase counts per user (2)

[Required steps]
Use data deduplicated per user, restricted to users with 600 or more purchases

600 or more purchases
Sort by Age and Gender (order), then plot

Hint

point size = 8
title size = 30
scale_shape_manual (0 : 18 / 1 : 16)
geom_segment size = 1
theme

# Re-order users by Age, then Gender, then purchase count
user_order <- dt2$User_ID[order(dt2$Age , dt2$Gender , dt2$buy_n)]
dt2$User_ID <- factor(dt2$User_ID , levels = user_order)

dt2 %>% ggplot( aes(x =buy_n  , y= User_ID  )) + 
  geom_segment( aes(yend = User_ID , linetype = Gender ) , xend = 0 , size = 1 ) + 
  geom_point(aes( col = Age ,   shape = factor(Marital_Status) ) ,size=8 ) + 
  guides(shape = guide_legend(title='Marital_Status')) + 
  labs(title = "Number of purchases per user(>600)" , 
       x="Purchase Amount" , y = "User ID") + 
  theme(plot.title = element_text(size =30 , hjust = 0.5)) + 
  scale_shape_manual( values = c(18 , 16 ) )

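3. Visualization based on insights from the plots above

(Plot above) Among users with 600 or more purchases, the Age [26-35] group is larger than any other Age group.
Let's analyze the customers in this Age [26-35] group.

[Required steps]
Extract only the Age 26-35 group from the full data
Create a mean purchase price variable for each user (User_ID)
Remove duplicates by user (User_ID)

Plot[1]
- Compute the ratio for each Marital Status
- Place each ratio label at the middle of its own segment
- Reproduce the figure exactly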

Hint

text size = 5
geom_text
theme
x = factor(1)
paste or paste0

Plot[2]
- Create a mean purchase variable for each User_ID
- Draw the density of mean_purchase by Gender
- alpha = 0.2
Plot[3]
- Visualize the Occupation ratio by Gender

Hint

Create an Occupation count variable for each Occupation
scale_x_continuous
position = "fill"
arrangeGrob
grid.arrange
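
The ratio that position = "fill" draws for Plot[3] can be checked numerically. The sketch below uses base R only and a made-up `toy` data frame, so the numbers are illustrative and not taken from BlackFriday.csv.

```r
# Hypothetical toy data: one row per user
toy <- data.frame(
  Occupation = c(0, 0, 0, 0, 1, 1, 4, 4, 4),
  Gender     = c("M", "M", "M", "F", "F", "M", "M", "M", "M")
)

# Cross-tabulate users by Occupation (rows) and Gender (columns)
counts <- table(toy$Occupation, toy$Gender)

# Divide each row by its total, so every Occupation sums to 1,
# which is the rescaling a filled bar shows
ratio <- prop.table(counts, margin = 1)
ratio
```

Each row of `ratio` corresponds to one bar in a filled bar chart, with the Gender shares as its segments.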

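# Keep only the 26-35 age group
Age_25 <- dt %>% filter( Age == "26-35" )

# Drop unused levels and treat User_ID and Marital_Status as factors
Age_25 <- droplevels(Age_25)
Age_25$User_ID <- factor(Age_25$User_ID)
Age_25$Marital_Status <- factor(Age_25$Marital_Status)

# Mean purchase price per user
Age_25 <- Age_25 %>% group_by(User_ID) %>% mutate( mean_purchase = mean(Purchase))

# Keep one row per user
Age_25 <- Age_25[!duplicated(Age_25$User_ID) , ]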

## plot1
# Share of users per Marital_Status, plus the y position of each label (middle of its segment)
output2 <- Age_25 %>% group_by(Marital_Status) %>% summarise(n=n()) %>% 
  mutate(ratio = n/sum(n) , location = ifelse(ratio > min(ratio) , min(ratio) + ratio/2 , ratio/2 ) )

plot1 <- output2 %>% ggplot(aes(x=factor(1), y = ratio, fill = Marital_Status)) + geom_bar(stat="identity") + 
  geom_text(aes(x= factor(1), y= location, label = paste("Marital_Status = " ,Marital_Status," and " , round(ratio*100,2),"%",sep="")), size=5) + 
  labs(x="Marital Status" , y = "Ratio" , title = "[26-35 Age] Marital Status ratio") + 
  theme(axis.text.x = element_blank() , axis.title.y=element_blank()) + guides(fill="none", color="none")

## plot2
# Density of the per-user mean purchase price, one curve per Gender
plot2 <- Age_25 %>% ggplot(aes(x= mean_purchase , fill = Gender)) + 
  geom_density(alpha = 0.2) + guides(fill = guide_legend(title='Gender')) + 
  labs(x="Mean Purchase" , title = "[26-35 Age] Density of mean purchase price by Gender") + 
  theme(axis.title.y=element_blank())


## plot3
# Number of users in each Occupation
survey <- Age_25 %>% group_by(Occupation ) %>% mutate(Occupation_n = n())

survey2 <- survey %>% ggplot( aes(x = Occupation , y=Occupation_n , fill = Gender)) 

# position = "fill" rescales every bar to 1, so segments show the Gender ratio
plot3 <- survey2 + 
  geom_bar(stat="identity" , position = "fill") + 
  scale_x_continuous( breaks =sort(unique(survey$Occupation))) +
  labs(x="Occupation" , y ="Ratio of male and female" , 
       title = "[26-35 Age] Occupation ratio by Gender")

# plot1 and plot2 side by side on the first row, plot3 on the second row
grid.arrange(arrangeGrob(plot1 , plot2, ncol =2 ) , plot3 , nrow =2 )
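
The `location` formula in plot1 covers the two-group case. For more groups, a label midpoint can be derived from cumulative sums. The sketch below is an illustration only: the `ratio` values are made up, and it assumes ggplot2's default stacking order, in which the first factor level sits at the top of the bar.

```r
# Hypothetical ratios for three groups, given in factor-level order
ratio <- c(a = 0.5, b = 0.3, c = 0.2)

# Upper edge of each segment: the last level sits at the bottom of the stack,
# so accumulate from the last level upward and flip back to level order
top <- rev(cumsum(rev(ratio)))

# Midpoint of each segment = upper edge minus half of the segment height
location <- top - ratio / 2
location
```

With only two groups and the smaller share in the second level, this gives the same positions as the `ifelse()` formula in plot1. ggplot2 can also place labels at segment midpoints directly through `geom_text(position = position_stack(vjust = 0.5))`, which avoids computing `location` by hand.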

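4. Visualizing purchase volume per product (Product_ID)

Create a new purchase-volume variable for each product (Product_ID)
Find the 20 best-selling products (Product_ID)
Filter the data down to those 20 best-selling products
Using the data filtered in the previous step, compute the mean Purchase for each product (Product_ID)
Create a new variable that is Expensive if mean Purchase > 10000 and Cheap otherwise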

Hint

row_number() == 1L
title size = 20
fill color = Set3
use scales, space
NA -> "None"
rows = Product_Category_1(facet_grid)
cols = the newly created variable (facet_grid)

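# Count rows per product, sort by that count, and keep the first row of each product
survey <- dt %>% group_by(Product_ID) %>% mutate(n=n()) %>% arrange(desc(n)) %>% 
  filter(row_number() == 1L ) 

# IDs of the 20 best-selling products
top20_prod <- survey$Product_ID %>% head(20) %>% as.character() 

# Keep only rows for those products and drop unused factor levels
survey2 <- dt %>% filter(Product_ID %in% top20_prod)

survey2 <- droplevels(survey2)

# Per product: purchase count, mean purchase, and an Expensive / Cheap label
survey2 <- survey2 %>% group_by(Product_ID) %>% mutate(n=n() , mean_purchase = mean(Purchase)) %>% 
  mutate(Price = ifelse( mean_purchase > 10000 , "Expensive", "Cheap"))

# Replace missing Product_Category_2 values with "None"
survey2$Product_Category_2[is.na(survey2$Product_Category_2)] <- "None"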

survey2 %>% ggplot( aes(x= reorder(Product_ID , -n) , y = n   , fill = factor(Product_Category_2) )) + theme_bw() + 
  geom_bar( position = "dodge", stat="identity") + coord_flip() + 
  facet_grid(Product_Category_1~Price , scales = "free", space = "free") +
  labs(x = "Product_ID", y ="Purchase amount" , title ="Top Product 20") + 
  theme(plot.title = element_text(size =20 , hjust = 0.5 )) +
  guides(fill = guide_legend(title='Product_Category_2')) + 
  scale_fill_brewer(palette="Set3")
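
The top-N and Expensive / Cheap steps can also be written as one summary table per product. The sketch below is an assumption-laden illustration: the `toy` data is made up, N is 2 instead of 20, and `summarise()` replaces the `mutate()` plus `row_number() == 1L` pattern used above.

```r
library(dplyr)

# Hypothetical toy data: one row per purchase (not real BlackFriday rows)
toy <- data.frame(
  Product_ID = c("P1", "P1", "P1", "P2", "P2", "P3"),
  Purchase   = c(12000, 11000, 13000, 5000, 7000, 20000),
  stringsAsFactors = FALSE
)

top_products <- toy %>%
  group_by(Product_ID) %>%
  # One row per product: purchase count and mean purchase
  summarise(n = n(), mean_purchase = mean(Purchase)) %>%
  # Best sellers first, then keep the top 2
  arrange(desc(n)) %>%
  head(2) %>%
  # Label products by their mean purchase
  mutate(Price = ifelse(mean_purchase > 10000, "Expensive", "Cheap"))

top_products
```

This form yields one row per product, which is enough for the ranking itself. The plot above keeps every purchase row because it also needs Product_Category_1 and Product_Category_2 for faceting and fill.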
