Python) text content based recommendation

Objective

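This post practices writing code that makes recommendations from text content.

Specifically, it uses a pretrained BERT model to turn text into vectors and then, given those vectors, uses cosine similarity (one of several possible similarity measures) to find recommendation candidates.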

Implementation

data

The data comes from the Kaggle dataset below.

https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

read data

Here the sentence_transformers library is used to embed sentences with a pretrained BERT model.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from pathlib import Path


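# Install sentence-transformers first if it is missing:
# !pip install sentence-transformers
# Optional: ipywidgets, used for progress bars in notebooks
## conda
# conda install -c conda-forge ipywidgets
## pip
# pip install ipywidgets
# jupyter nbextension enable --py widgetsnbextension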
from sentence_transformers import SentenceTransformer

data_dir = "./"
# Path to the CSV file downloaded from Kaggle
csv_path = Path(data_dir) / "imdb_top_1000.csv"
data = pd.read_csv(csv_path)
# The overview text of each title is what gets embedded later
X = np.array(data.Overview)

data head

The dataset has 1,000 rows in total, each with a genre, a title, and an overview of that title.

data = data[['Genre','Overview','Series_Title']]
data.head()

Embedding

The embedding here is done with BERT.

Since the pretrained model was presumably trained on English text, it is used as-is, with no additional training.

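X = np.array(data.Overview)
text_data = X
# Load the pretrained sentence-embedding model (downloaded on first use)
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
# One fixed-length vector per overview
embeddings = model.encode(text_data, show_progress_bar=True)
embeddings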
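
As the "mean-tokens" part of the model name suggests, the sentence vector is obtained by averaging the token vectors that BERT produces. The following is only a rough NumPy sketch of that pooling step on made-up numbers, not the library's actual code; `mean_pool`, `toy_tokens`, and `toy_mask` are hypothetical names introduced for illustration.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, tokens, dim) array of per-token vectors.
    attention_mask:   (batch, tokens) array, 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(float)   # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Toy input (assumed values): 2 sentences, 3 token slots, 2-dimensional vectors.
toy_tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # last slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
toy_mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = mean_pool(toy_tokens, toy_mask)
print(sentence_vectors)
```

Each sentence ends up as a single vector whose length does not depend on how many tokens it had, which is what makes the later similarity computation possible.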

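Similarity

The vectorized data is then used to compute similarities and build the recommendation candidates.

Dimensionality reduction

If the embedded data has too many dimensions, a dimensionality reduction method can be applied.

PCA is used here.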

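# Use the embeddings computed above as the input matrix
X = np.array(embeddings)
n_comp = 5
pca = PCA(n_components=n_comp)
pca.fit(X)
# Project every movie onto the first n_comp principal components
pca_data = pd.DataFrame(pca.transform(X))
pca_data.head()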
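
One way to judge whether a given number of components is reasonable is to look at how much variance the kept components explain. The sketch below does this on random stand-in data, so the numbers themselves mean nothing; `fake_embeddings` and its shape are assumptions made only so the example runs without downloading the model.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the embedding matrix: 200 rows, 50 columns (hypothetical sizes).
fake_embeddings = rng.normal(size=(200, 50))

pca_check = PCA(n_components=5)
reduced = pca_check.fit_transform(fake_embeddings)

# Fraction of the total variance captured by each kept component.
ratios = pca_check.explained_variance_ratio_
print(reduced.shape)
print(ratios.round(3), "total:", ratios.sum().round(3))
```

With real embeddings, a low total would suggest that five components discard most of the information in the vectors.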

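Computing cosine similarity

Compute the cosine similarity between every pair of movies. Note that X here is the full embedding matrix; the PCA output above is not used for the similarity.

The result holds the similarity of all 1,000 movies against each other.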

cos_sim_data = pd.DataFrame(cosine_similarity(X))
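
To make the measure concrete, here is a small sketch that computes cosine similarity by hand and compares it with scikit-learn's `cosine_similarity`. The three vectors are made-up values chosen so the result is easy to check by eye.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0
    [0.0, 3.0],   # orthogonal to row 0
])

# cos(a, b) = (a . b) / (|a| * |b|), computed for every pair of rows.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
manual = (vectors @ vectors.T) / (norms @ norms.T)

print(manual)
print(np.allclose(manual, cosine_similarity(vectors)))
```

Rows 0 and 1 get a similarity of 1 even though their lengths differ, because only the direction of the vectors matters.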

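Designing the recommendation logic

This code designs the recommendation logic using cosine similarity.

Once a movie is selected, its cosine similarities to the other movies are sorted.

Then we decide how many movies to take from the top of that list.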

def give_recommendations(index, print_recommendation=False, print_recommendation_plots=False, print_genres=False, top_n=5):
    # Sort the selected movie's similarities, drop the movie itself, keep the top_n most similar
    index_recomm = cos_sim_data.loc[index].drop(index).sort_values(ascending=False).index.tolist()[:top_n]
    movies_recomm = data['Series_Title'].loc[index_recomm].values
    result = {'Movies': movies_recomm, 'Index': index_recomm}
    if print_recommendation:
        print('The watched movie is this one: %s \n' % (data['Series_Title'].loc[index]))
        for k, movie in enumerate(movies_recomm, start=1):
            print('The number %i recommended movie is this one: %s \n' % (k, movie))
    if print_recommendation_plots:
        print('The plot of the watched movie is this one:\n %s \n' % (data['Overview'].loc[index]))
        for k, idx in enumerate(index_recomm, start=1):
            plot_q = data['Overview'].loc[idx]
            print('The plot of the number %i recommended movie is this one:\n %s \n' % (k, plot_q))
    if print_genres:
        print('The genres of the watched movie is this one:\n %s \n' % (data['Genre'].loc[index]))
        for k, idx in enumerate(index_recomm, start=1):
            genre_q = data['Genre'].loc[idx]
            print('The genres of the number %i recommended movie is this one:\n %s \n' % (k, genre_q))
    return result
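
The ranking step at the heart of give_recommendations can be tried on a tiny hand-made similarity matrix. This is only an illustrative sketch: `toy_sim`, `titles`, and `top_n_similar` are hypothetical names, and the similarity values are invented.

```python
import pandas as pd

titles = pd.Series(["A", "B", "C", "D"])  # hypothetical movie titles
toy_sim = pd.DataFrame([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

def top_n_similar(sim, index, top_n=2):
    """Return the row labels of the top_n most similar items, excluding `index`."""
    scores = sim.loc[index].drop(index)   # remove the item's similarity to itself
    return scores.sort_values(ascending=False).index.tolist()[:top_n]

picked = top_n_similar(toy_sim, 0)
print(titles.loc[picked].tolist())
```

For movie "A" the two highest remaining scores are 0.9 and 0.5, so the sketch picks "B" and "D".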

Visualization

The code below takes one movie, plots its cosine similarity to every other movie on the y-axis, and highlights the movies selected as recommendations.

plt.figure(figsize=(20,20))
# Draw four panels, each for one randomly chosen "watched" movie
for q in range(1,5):
    plt.subplot(2,2,q)
    index = np.random.choice(np.arange(0,len(X)))
    # Similarity of the chosen movie to all others (its own column is dropped)
    to_plot_data = cos_sim_data.drop(index,axis=1)
    plt.plot(to_plot_data.loc[index],'.',color='firebrick')
    recomm_index = give_recommendations(index)
    x = recomm_index['Index']
    y = cos_sim_data.loc[index][x].tolist()
    m = recomm_index['Movies']
    # Highlight and label the recommended movies
    plt.plot(x,y,'.',color='navy',label='Recommended Movies')
    plt.title('Movie Watched: '+data['Series_Title'].loc[index])
    plt.xlabel('Movie Index')
    for k, x_i in enumerate(x):
        plt.annotate('%s'%(m[k]),(x_i,y[k]),fontsize=10)
    plt.ylabel('Cosine Similarity')
    plt.ylim(0,1)
    plt.legend()

Generating next-movie recommendation candidates for each movie

recomm_list = []
for i in range(len(X)):
    recomm_i = give_recommendations(i)  # five most similar movies for movie i
    recomm_list.append(recomm_i['Movies'])

# One row per watched movie, one column per recommendation rank
columns = ['First Recommendation','Second Recommendation','Third Recommendation','Fourth Recommendation','Fifth Recommendation']
recomm_data = pd.DataFrame(recomm_list,columns=columns)
recomm_data['Watched Movie'] = data['Series_Title']
recomm_data = recomm_data[['Watched Movie'] + columns]

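Conclusion

This post tried out code that recommends using the content alone.

Because this approach recommends without any user information, it seems able to serve first-time customers, or customers with little data, without running into the cold-start problem.

However, because it relies on content alone, I suspect it will be hard to satisfy every user.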

Reference

https://medium.com/towards-data-science/hands-on-content-based-recommender-system-using-python-1d643bf314e4

https://jovian.ai/piero-paialunga/contentbased/v/1?utm_source=embed#C16
