

KoBERT Sentiment Classification of Twitter Data: Summary of Results

1. Introduction to KoBERT

KoBERT is the Korean version of BERT.

 

BERT (Bidirectional Encoder Representations from Transformers) is an AI language model released by Google.

It scored higher than humans on some benchmarks and reached state-of-the-art (SOTA) results in natural language processing (NLP) at the end of 2018.

 

BERT has three main characteristics.


- A new method for pretraining language representations

The model is first pretrained on large unlabeled corpora (data without ground-truth answers) such as Wikipedia and BooksCorpus,

and is then adapted through transfer learning on labeled data for a specific task.

 

- Bidirectional

Earlier models were unidirectional, so they did not take the full context of a sentence into account and their language representations were limited.

BERT, in contrast, is bidirectional: it learns from both directions rather than only one,

which improves the quality of its language representations.

 

- Pretrained models for many languages

BERT provides pretrained language models for English and 103 other languages, which can be fine-tuned as needed.
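
The difference between unidirectional and bidirectional context described above can be illustrated with a small sketch. It is not BERT itself; it only lists which tokens a given position is allowed to see under a left-to-right restriction versus a bidirectional one. The function name `visible_context` and the example sentence are hypothetical.

```python
def visible_context(tokens, position, bidirectional):
    """Return the tokens that the given position may use as context.

    Unidirectional (left-to-right): only tokens before the position.
    Bidirectional: every other token in the sentence.
    """
    if bidirectional:
        return tokens[:position] + tokens[position + 1:]
    return tokens[:position]


tokens = ["the", "movie", "was", "not", "good"]
target = 2  # index of "was"

left_only = visible_context(tokens, target, bidirectional=False)
both_sides = visible_context(tokens, target, bidirectional=True)
print(left_only)   # tokens to the left of "was"
print(both_sides)  # tokens on both sides of "was"
```

With the left-to-right restriction, the words "not good" are invisible at the target position, which is the kind of missing context the bidirectional design is meant to address.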

 

2. Dataset

We collected Twitter data with the language set to Korean and the location set to Korea.

The data was collected through the Twitter API.

 

The table below shows the number of tweets collected in each month.

              Dec 2019   Jan 2020   Feb 2020   Mar 2020   Total
Tweet count   166,268    175,226    167,009    306,402    814,905

 

 

Next is a graph of the number of collected tweets.

The number of tweets spiked on March 28.

We do not yet know the reason, so this needs a closer look.
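
One way to start looking into such a spike is to count tweets per day and flag days that are far above the typical level. The sketch below is only an illustration with made-up timestamps; the helper `find_spike_days` and the threshold of twice the median are assumptions, not part of the original analysis.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def find_spike_days(timestamps, factor=2.0):
    """Return (day, count) pairs whose count exceeds factor * median daily count."""
    days = [
        datetime.strptime(t, "%Y-%m-%d %H:%M:%S").strftime("%Y%m%d")
        for t in timestamps
    ]
    daily_counts = Counter(days)
    typical = median(daily_counts.values())
    return sorted(
        (day, n) for day, n in daily_counts.items() if n > factor * typical
    )


# Made-up sample: two tweets on most days, six on 2020-03-28.
sample = (
    ["2020-03-26 09:00:00"] * 2
    + ["2020-03-27 09:00:00"] * 2
    + ["2020-03-28 09:00:00"] * 6
    + ["2020-03-29 09:00:00"] * 2
)
spikes = find_spike_days(sample)
print(spikes)
```

Once a day is flagged, the tweets from that day can be inspected directly (for example, by looking at the most frequent words) to find what drove the increase.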

 

3. Model Parameter

Train data : Naver sentiment movie corpus (positive, negative) 150K

Test data : Naver sentiment movie corpus (positive, negative) 50K

Model : KoBERT-nsmc

Learning_rate : 5e-5

Train_epochs : 5

Optimizer : Adam (epsilon : 1e-8)

Train_batch_size : 32

Eval_batch_size : 64

Model Accuracy  : 0.89 (measured on the Naver sentiment movie corpus)

Data – Tweet

Term : 2019.12.01 ~  2020.03.31.

Location : Korea

Language : Korean
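
As a rough worked example of what the training settings listed above imply, the number of optimizer steps can be estimated from the training-set size, batch size, and epoch count. This assumes exactly 150,000 training examples (the post says 150K), one optimizer step per batch, and no gradient accumulation; none of these details are stated in the settings themselves.

```python
import math

train_examples = 150_000  # assumed exact size; the post states "150K"
train_batch_size = 32
train_epochs = 5

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_steps = steps_per_epoch * train_epochs
print(steps_per_epoch, total_steps)
```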

 

4. Tweet Sentiment Classification Results with KoBERT

import pandas as pd

# Load the December 2019 tweets that were labeled by KoBERT
df = pd.read_csv('./201912_tweet_sent.csv', encoding='euc-kr')
df

 

 

From the dataset that KoBERT labeled as positive or negative,

we count the number of each label per date.

import datetime

# Extract the date (yyyymmdd) from a timestamp string
def to_yyyymmdd(timestamp):
    a = datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    return a.strftime('%Y%m%d')

 

 

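# Count positive and negative tweets per date (rows are assumed to be sorted by timestamp)
pos = 0
neg = 0
cur = None  # date (yyyymmdd) currently being counted
sent_list = []
for timestamp, label in zip(df['timestamp'], df['label']):
    day = to_yyyymmdd(timestamp)
    if cur is not None and day != cur:
        # The date changed: store the finished day's counts and reset
        sent_list.append([cur, pos, neg])
        pos = 0
        neg = 0
    cur = day
    if label == 0:  # label 0 is negative, anything else is positive
        neg += 1
    else:
        pos += 1
if cur is not None:
    sent_list.append([cur, pos, neg])  # store the last day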
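
The same per-day tally can also be written without relying on the rows being sorted by date. The following is a minimal sketch using only the standard library; the function `count_by_day` and the sample rows are hypothetical, and it assumes, as in the code above, that label 0 means negative and any other label means positive.

```python
from collections import defaultdict
from datetime import datetime


def count_by_day(rows):
    """Tally positive/negative labels per day.

    rows: iterable of (timestamp, label) pairs, where timestamp looks like
    'YYYY-MM-DD HH:MM:SS' and label 0 is negative, anything else positive.
    Returns a list of [yyyymmdd, pos, neg] sorted by date.
    """
    counts = defaultdict(lambda: [0, 0])  # day -> [pos, neg]
    for timestamp, label in rows:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").strftime("%Y%m%d")
        if label == 0:
            counts[day][1] += 1
        else:
            counts[day][0] += 1
    return [[day, pos, neg] for day, (pos, neg) in sorted(counts.items())]


# Hypothetical rows, deliberately out of order and with a missing day.
rows = [
    ("2019-12-03 08:15:00", 1),
    ("2019-12-01 10:00:00", 0),
    ("2019-12-01 23:59:59", 1),
    ("2019-12-03 09:30:00", 0),
    ("2019-12-01 12:00:00", 0),
]
result = count_by_day(rows)
print(result)
```

Because the counts are keyed by date rather than by a running day counter, days with no tweets are simply absent from the result instead of shifting later dates.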

 

 

# Convert the list to a DataFrame with named columns
sent_count_result = pd.DataFrame(sent_list, columns=['date', 'pos', 'neg'])
sent_count_result  # display the result

 

 

 

# Save as a CSV file
sent_count_result.to_csv("./kobert_sent_count_result_feb.csv",header=True, index=False)

 

 

 

We then organized the results into a table for readability.

The following table shows the number of positive and negative tweets per month.

           Dec 2019   Jan 2020   Feb 2020   Mar 2020   Total
Positive   78,457     79,872     74,485     137,731    370,545
Negative   87,811     95,354     92,524     168,671    444,360
Total      166,268    175,226    167,009    306,402    814,905

In every month, negative tweets outnumber positive tweets.
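
To make the comparison concrete, the share of negative tweets in each month can be computed from the counts in the table. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Monthly counts copied from the table above.
months = ["Dec 2019", "Jan 2020", "Feb 2020", "Mar 2020"]
positive = [78_457, 79_872, 74_485, 137_731]
negative = [87_811, 95_354, 92_524, 168_671]

# Fraction of tweets labeled negative in each month.
shares = [neg / (pos + neg) for pos, neg in zip(positive, negative)]
for month, share in zip(months, shares):
    print(f"{month}: negative share = {share:.1%}")

overall = sum(negative) / (sum(positive) + sum(negative))
print(f"Overall: negative share = {overall:.1%}")
```

The negative share stays in a narrow band of roughly 53% to 55% across the four months, so the imbalance is consistent rather than driven by a single month.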

 

 

To see this more intuitively, we plotted the monthly trend of tweet sentiment.

To see how people responded to the coronavirus through Twitter data,

we ran sentiment classification on the collected tweets with the KoBERT model.

 

As the next step, we plan to run LDA topic modeling and then, using only the tweets assigned to the 'corona' topic,

draw the sentiment trend graph once more.

 

Thank you :)