level1_bookratingprediction_recsys-level1-recsys-02 created by GitHub Classroom

Python 100.00%

level1_bookratingprediction_recsys-level1-recsys-02's People

Contributors

jeong-junhwan 41ow1ives kimeunh3 jwj51720 darrenkwondev leehanjeong

level1_bookratingprediction_recsys-level1-recsys-02's Issues

Test1

ㅁㅁㅁㅁ

Test3

Test2

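Error occurred: error message

NCF experiment summary

Hyperparameter tuning of the baseline NCF model

Problem with null image data

I wondered whether the image data was actually intact, so I took a look.
I have not checked this rigorously, but I found 41,802 images of size (1, 1).
Since there are about 150,000 books in total, roughly a quarter of the image data appears to carry no information.
Apart from the 1-by-1 images, the remaining images vary somewhat in size but do not seem to have any major problems.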

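from collections import Counter

import pandas as pd
from PIL import Image

books = pd.read_csv('books.csv')

# Count how many cover images exist for each (width, height) size.
size_counts = Counter()
for path in books['img_path']:
    with Image.open(path) as img:  # close each file after reading its size
        size_counts[img.size] += 1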

This code is enough for a rough check.
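Once the sizes are counted, the share of placeholder covers follows directly. The sketch below is a minimal illustration, assuming the sizes are already collected in a mapping keyed by `(width, height)`; the function name `placeholder_share` and the sample counts are hypothetical and are not the real dataset.

```python
from collections import Counter


def placeholder_share(size_counts, placeholder=(1, 1)):
    """Return the fraction of images whose size equals `placeholder`."""
    total = sum(size_counts.values())
    if total == 0:
        return 0.0
    return size_counts.get(placeholder, 0) / total


# Hypothetical counts, only to show the call; not the real dataset.
example_counts = Counter({(1, 1): 3, (60, 90): 7, (75, 75): 2})
share = placeholder_share(example_counts)
```

With the figures reported above (41,802 out of about 150,000), the same ratio comes out at a little over a quarter.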

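FM/FFM results after data preprocessing

October 31

I used FM/FFM to test how the preprocessing changes affect performance. Because these models use both ratings and features, I expected the effect of preprocessing to be most visible with them.
The comparison below is between the original baseline code and the modified code.
After preprocessing, I wrote the data to new CSV files and ran the experiments on those. Because the features changed, parts of the baseline code conflicted with the new data and had to be modified.

I took a new approach to data preprocessing.

users (same as before)
- Uses user_id, location_state, location_country
- We should also test the result after dropping state.

books

Uses isbn, book_title, book_author, year_of_publication, publisher, img_url, img_path, category_high, isbn_country
I judged that the category structure needed an overhaul, so I rebuilt category_high.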

groupings = {'Fiction': ['fiction'], # very broad, so keep it at the top
    'Literature & Poem': ['liter', 'poem', 'poetry'],
    'Science & Math': ['science', 'math', 'logy'], # 'science' and 'logy' match very broadly, so keep them near the top
    'Parenting & Relationships': ['baby', 'parent', 'family', 'tionship', 'brother', 'sister'], # fairly large
    'Medical Books': ['medi', 'psycho'], # 'psy' could be split further
    'Animal & Nature': ['animal', 'ecolo', 'plant', 'nature'],
    'Arts & Photography': ['art', 'photo'], # 'art' appears inside too many other words
    'Biographies & Memoirs': ['biog', 'memo'],
    'Business & Money': ['busi', 'money', 'econo'],
    'Calendars': ['calen'],
    'Children\'s Books': ['child', 'baby'],
    'Christian Books & Bibles': ['christi', 'bible'], # 'christi' because of 'christmas'
    'Comics & Graphic Novels': ['comics', 'graphic novel'],
    'Computers & Technology': ['computer', 'techno', 'archi'],
    'Cookbooks, Food & Wine': ['cook'],
    'Crafts, Hobbies & Home': ['crafts'],
    'Education & Teaching': ['educa', 'teach'],
    'Engineering & Transportation': ['engine', 'transp'],
    'Health, Fitness & Dieting': ['health', 'fitness', 'diet'],
    'History': ['histo'],
    
    'Humor & Entertainment': ['humor', 'entertai', 'comed', 'game'],
    'Law': ['law'],
    'LGBTQ+ Books': ['lesbian', 'gay', 'bisex'],
    'Mystery, Thriller & Suspense': ['myste', 'thril', 'suspen'],
    'Politics & Social Sciences': ['politic', 'social'],
    'Reference': ['reference'],
    'Religion & Spirituality': ['religi'],
    'Romance': ['romance'],
    'Science Fiction & Fantasy': ['science fiction', 'fantasy'],
    'Self-Help': ['self'], # every match for 'self' was self-help related
    'Sports & Outdoors': ['exerc','sport','outdoor'],
    'Teen & Young Adult': ['teen', 'adol', 'juven'], # the word 'nonfiction' appears only in juvenile-related categories
    'Test Preparation': ['test', 'school', 'examina'],
    'Travel': ['travel'],
     }

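As before, category_high is created as a copy of category; any category containing one of the listed substrings is then merged into the corresponding higher-level category_high.
After that, null category_high values were set to 'Unclassified'. I also considered detecting keywords in the title or summary to fill in the category, but did not do it.
In addition, any category_high with fewer than 10 books was merged into 'Unclassified' (about 5,000 books).
Even before this merge, 'Unclassified' already had 68,851 books, more than twice the Fiction category, so I judged that adding about 5,000 more would not make a big difference. (This is an assumption and needs discussion.)
The idea is that category only needs to capture a rough preference tendency.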

# Move the 5,033 rows whose category_high has fewer than 10 books into 'Unclassified'.
books.loc[books['count'] < 10, 'category_high'] = 'Unclassified'

# Recount the books per category_high; success if 'Unclassified' grew by 5,033.
books['count'] = books.groupby('category_high')['isbn'].transform('count')
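The substring grouping described above can be sketched in plain Python. This is a minimal illustration and not the code used for the experiments: it assumes groups are checked in dictionary order and that a later match overrides an earlier one, which is one reading of the "keep it at the top" comments in `groupings`. The function name `assign_category_high` and the reduced `groupings_subset` are hypothetical.

```python
def assign_category_high(category, groupings, unclassified="Unclassified"):
    """Map a raw category string to a higher-level group by substring match.

    Groups are checked in dictionary order and a later match overrides an
    earlier one (an assumption), so broad groups such as 'Fiction' go first.
    """
    if not category:  # None or empty string
        return unclassified
    result = category  # category_high starts as a copy of category
    lowered = category.lower()
    for group, keywords in groupings.items():
        if any(keyword in lowered for keyword in keywords):
            result = group
    return result


# A small subset of the groupings dict, kept in the same relative order.
groupings_subset = {
    "Fiction": ["fiction"],
    "Science & Math": ["science", "math", "logy"],
    "History": ["histo"],
    "Science Fiction & Fantasy": ["science fiction", "fantasy"],
    "Teen & Young Adult": ["teen", "adol", "juven"],
}

labels = [
    assign_category_high(c, groupings_subset)
    for c in ["fiction", "science fiction", "juvenile fiction", None]
]
```

Under this ordering, "science fiction" first matches Fiction and Science & Math and finally lands in Science Fiction & Fantasy, while a category that matches nothing keeps its original value.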

train (same as before)
- Kept the step that removes users who have only one rating.
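Removing users with a single rating can be sketched as follows. This is a minimal illustration, assuming the training rows are `(user_id, isbn, rating)` tuples; the function name `drop_single_rating_users` and the sample rows are hypothetical.

```python
from collections import Counter


def drop_single_rating_users(ratings):
    """Keep only the rows whose user has at least two ratings.

    `ratings` is a list of (user_id, isbn, rating) tuples.
    """
    counts = Counter(user_id for user_id, _, _ in ratings)
    return [row for row in ratings if counts[row[0]] > 1]


# Hypothetical rows, only to show the call.
train_rows = [(1, "a", 8), (1, "b", 5), (2, "a", 7), (3, "c", 9), (3, "a", 4)]
filtered = drop_single_rating_users(train_rows)
```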

FM results (validation RMSE per epoch, EPOCHS 5)
- (baseline) 2.7827692883223065 / 2.483012513959215 / 2.391505288473704 / 2.356708161405602 / 2.3483070920794575
- (modified) 2.803336829979397 / 2.4609553445409977 / 2.356840684131076 / 2.3118379927375354 / 2.2923522574542092
FFM results (validation RMSE per epoch, EPOCHS 5)
- (baseline) 2.8432382145623225 / 2.4690110892899186 / 2.4217004251018803 / 2.4197879972859884 / 2.4297987551294873
- (modified) 2.847368891183112 / 2.4110501505639736 / 2.3463181364768313 / 2.3577231505614713 / 2.374070133978513
As a result, FM reached about 2.29. With 6 epochs it drops to 2.290.
However, the AIstages submission scored 2.3336, which was disappointing. (Used one submission on October 31.)
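For reference, the validation metric reported above is the root mean squared error. A minimal sketch of the computation, with the function name `rmse` chosen for illustration:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because errors are squared before averaging, a few badly mispredicted ratings weigh on the score more than many small misses.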
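Conclusions and discussion
- The preprocessing above is meaningful for models that use features.
- We should probably do cross validation. (I tried, but it was hard to apply, so it is on hold for now.)
- Personally, I would like to use the age feature. I think it is as important as the books category.
  - Missing (NA) values need to be handled
- I plan to try item-based CF (also attempted, but on hold because it was difficult) and deep learning models, and then ensemble them.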
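One way to approach the cross validation mentioned above is to split row indices into k folds and train once per fold. The sketch below is a generic k-fold splitter under that assumption, not the baseline's own code; the function name `k_fold_indices` and its defaults are hypothetical.

```python
import random


def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_idx, valid_idx) pairs for k-fold cross validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle reproducibly
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i, valid_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx


splits = list(k_fold_indices(10, k=5))
```

Each row is used for validation exactly once, so averaging the per-fold RMSE may give a steadier estimate than a single validation split.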

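Summary of data preprocessing issues

Users

Missing values are filled on the assumption that rows with the same city share the same state and country, but we confirmed that vancouver, for example, exists in both the USA and Canada.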
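One possible way around the ambiguity is to fill only from cities that map to a single (state, country) pair. The sketch below assumes rows are `(city, state, country)` tuples with `None` for missing values; the function names `build_city_lookup` and `fill_location` and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_city_lookup(rows):
    """Map city -> (state, country), but only for unambiguous cities."""
    seen = defaultdict(set)
    for city, state, country in rows:
        if city and state and country:
            seen[city].add((state, country))
    # Cities seen with more than one (state, country) pair are left out.
    return {city: next(iter(pairs)) for city, pairs in seen.items() if len(pairs) == 1}


def fill_location(row, lookup):
    """Fill a missing state or country from the lookup when the city is known."""
    city, state, country = row
    if city in lookup and (state is None or country is None):
        known_state, known_country = lookup[city]
        return (city, state or known_state, country or known_country)
    return row


# Hypothetical rows, only to show the behaviour.
rows = [
    ("vancouver", "britishcolumbia", "canada"),
    ("vancouver", "washington", "usa"),
    ("seattle", "washington", "usa"),
]
lookup = build_city_lookup(rows)
```

With these rows, a `seattle` row with missing fields gets filled, while a `vancouver` row is left untouched because the city is ambiguous.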

Preprocessing that keeps only the top few values of each column

1. books (books_1102.csv)

Summary: final columns of books
- isbn
- book_title
- book_author
- year_of_publication
- publisher
- img_url
- img_path
- category_high
- isbn_country

1.1 Categories

The original top 10 categories are as follows
- Unclassified 68851 (books with an empty category are the overwhelming majority)
- fiction 33016
- juvenile fiction 5835
- biography autobiography 3326
- history 1927
- religion 1818
- juvenile nonfiction 1418
- social science 1231
- humor 1161
- body mind spirit 1113
After grouping, the top 10 categories are as follows (for now I will keep my own grouping and proceed)
- Unclassified 68851
- Fiction 33842
- Teen & Young Adult 7351
- Biographies & Memoirs 3368
- Religion & Spirituality 3009
- History 2058
- Humor & Entertainment 2053
- Animal & Nature 1403
- Parenting & Relationships 1388
- Arts & Photography 1368
When every category with fewer than 10 books is moved into Unclassified, the result is as follows (so far this grouping has performed best, so I will keep it)
- Unclassified 72151
- Fiction 33842
- Teen & Young Adult 7351
- Biographies & Memoirs 3368
- Religion & Spirituality 3009
- History 2058
- Humor & Entertainment 2053
- Animal & Nature 1403
- Parenting & Relationships 1388
- Arts & Photography 1368

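1.2 isbn_country

The top 10 values when classifying language by ISBN country are as follows
- english 134405 (USA, Canada, UK, Australia; the overwhelming majority)
- german 6706 (Germany)
- franch 3405 (France): there are far more French books than the user countries would suggest, so French users should be kept as a separate group as well
- espanol 3399 (Spain): barely different from third place, so split up to here and keep Spanish users as a separate group to match
- italy 482 (Italy)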
- argentina 242
- netherlands 176
- portugal 106
- mexico 95
- japan 80
Result
- english 134405
- german 6706
- franch 3405
- espanol 3399
- others 1655
Name: isbn_country, dtype: int64
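Keeping only the most frequent values and folding the rest into a single label can be sketched as follows. This is a minimal illustration in plain Python; the function name `keep_top_k` and the sample values are hypothetical.

```python
from collections import Counter


def keep_top_k(values, k, other_label="others"):
    """Replace every value outside the k most frequent ones with `other_label`."""
    top = {label for label, _ in Counter(values).most_common(k)}
    return [value if value in top else other_label for value in values]


# Hypothetical values, only to show the call.
sample = ["english"] * 5 + ["german"] * 3 + ["italy", "japan"]
bucketed = Counter(keep_top_k(sample, k=2))
```

For isbn_country the cut-off used above corresponds to `k=4`, with everything from italy downward becoming `others`.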

2. users (users_1102.csv)

Summary: final columns of users
- user_id
- location_country

location_country

The top 10 location_country values are as follows
- usa 45301
- canada 6538
- germany 3609
- unitedkingdom 3148
- australia 1821
- spain 1692
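- italy 830 (there are more Italian users than French users, but Italy was dropped on the books side, so it is not kept as a separate group)
- france 829 (French users are kept as a separate group)
- newzealand 462
- switzerland 459
Notes
- state depends on country, and when only the top 3 countries are taken, only US or Canadian states would remain, so I judged it meaningless and dropped it
- others are values that were originally empty or entered incorrectly; there were 285 of them originally
- Everything from italy downward is merged into anycountry (except france)
Result
- usa 45301
- canada 6538
- anycountry 5154
- germany 3609
- unitedkingdom 3148
- australia 1821
- spain 1692
- france 829
Name: location_country, dtype: int64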
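Because france is kept even though italy has more users, this step is an explicit keep list and not a pure frequency cut-off. A minimal sketch, assuming the keep list is exactly the seven countries in the result above; the function name `bucket_countries` and the sample values are hypothetical.

```python
# Countries kept as their own value, taken from the result listed above.
KEEP = {"usa", "canada", "germany", "unitedkingdom", "australia", "spain", "france"}


def bucket_countries(values, keep=KEEP, other_label="anycountry"):
    """Keep listed countries; map everything else (including None) to `other_label`."""
    return [value if value in keep else other_label for value in values]


# Hypothetical values, only to show the call.
bucketed_countries = bucket_countries(["usa", "italy", "france", None, "newzealand"])
```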
