모두야

Deep Learning from Scratch (밑.시.딥) / Vol. 2

CH3) word2vec - (2)

미미밍2 2021. 9. 20. 01:29


Previous post: CH3) word2vec - simple ver. (1) (meme2.tistory.com)

Preparing word2vec training data

contexts -> neural network model -> target

A function that builds the contexts and targets from a corpus

import numpy as np


def create_contexts_target(corpus, window_size=1):
    '''Create contexts and targets.

    :param corpus: corpus (list of word IDs)
    :param window_size: window size (with 1, one word on each side of the target is included in the contexts)
    :return: tuple of NumPy arrays (contexts, target)
    '''
    target = corpus[window_size:-window_size]  # targets, e.g. corpus[1:-1]
    contexts = []  # contexts

    # for an 8-word corpus with window_size=1: idx = 1..6, i.e. range(1, 7)
    for idx in range(window_size, len(corpus) - window_size):
        cs = []
        for t in range(-window_size, window_size + 1):  # t = -1, 0, 1
            if t == 0:
                continue  # skip the target itself
            cs.append(corpus[idx + t])  # context word
        contexts.append(cs)

    return np.array(contexts), np.array(target)
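
As a cross-check of the loop above, here is a minimal sketch of the same windowing done without explicit loops. The helper name `create_contexts_target_vectorized` is made up for this illustration, and it assumes NumPy 1.20 or later for `sliding_window_view`; it is not the book's implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def create_contexts_target_vectorized(corpus, window_size=1):
    # Hypothetical helper: same result as the loop version, built from sliding windows.
    corpus = np.asarray(corpus)
    # Each row is one window of length 2 * window_size + 1 centered on a target.
    windows = sliding_window_view(corpus, 2 * window_size + 1)
    target = windows[:, window_size].copy()             # center column
    contexts = np.delete(windows, window_size, axis=1)  # every column except the center
    return contexts, target


# Word IDs for "you say goodbye and i say hello ."
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
contexts, target = create_contexts_target_vectorized(corpus, window_size=1)
print(contexts.tolist())  # [[0, 2], [1, 3], [2, 4], [3, 1], [4, 5], [1, 6]]
print(target.tolist())    # [1, 2, 3, 4, 1, 5]
```

With an 8-word corpus and `window_size=1` there are 6 targets, each with 2 context words.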

At this point the data still consists of word IDs. Let's convert it to one-hot representations.

def convert_one_hot(corpus, vocab_size):
    '''Convert to one-hot representation.

    :param corpus: word IDs (1-D or 2-D NumPy array)
    :param vocab_size: vocabulary size
    :return: one-hot representation (2-D or 3-D NumPy array)
    '''
    N = corpus.shape[0]

    if corpus.ndim == 1:
        one_hot = np.zeros((N, vocab_size), dtype=np.int32)
        for idx, word_id in enumerate(corpus):
            one_hot[idx, word_id] = 1

    elif corpus.ndim == 2:
        C = corpus.shape[1]
        one_hot = np.zeros((N, C, vocab_size), dtype=np.int32)
        for idx_0, word_ids in enumerate(corpus):
            for idx_1, word_id in enumerate(word_ids):
                one_hot[idx_0, idx_1, word_id] = 1

    return one_hot
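
The same conversion can also be sketched with integer indexing into an identity matrix, which handles the 1-D and 2-D cases in one line. `one_hot_by_indexing` is a name invented for this example, not a function from the book.

```python
import numpy as np


def one_hot_by_indexing(ids, vocab_size):
    # Hypothetical helper: row i of the identity matrix is the one-hot vector for ID i,
    # so indexing with an array of IDs appends a vocab_size axis to its shape.
    return np.eye(vocab_size, dtype=np.int32)[np.asarray(ids)]


target = np.array([1, 2, 3, 4, 1, 5])
contexts = np.array([[0, 2], [1, 3], [2, 4], [3, 1], [4, 5], [1, 6]])

target_oh = one_hot_by_indexing(target, 7)
contexts_oh = one_hot_by_indexing(contexts, 7)
print(target_oh.shape)    # (6, 7)
print(contexts_oh.shape)  # (6, 2, 7)
print(contexts_oh[0])     # one-hot rows for word IDs 0 and 2
```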

# Data preparation

from common.util import preprocess  # assumed to come from the book's common/util.py

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

contexts, target = create_contexts_target(corpus, window_size=1)

vocab_size = len(word_to_id)
target = convert_one_hot(target, vocab_size)  # (word IDs, vocabulary size)
contexts = convert_one_hot(contexts, vocab_size)

The training data is now ready.
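
`preprocess` is not defined in this post. The following is a rough stand-in, assuming it lowercases the text, splits off the final period as its own token, and assigns IDs in order of first appearance. That assumption matches the word order printed later in this post (you, say, goodbye, and, i, hello, .), but `preprocess_sketch` is only an illustration.

```python
import numpy as np


def preprocess_sketch(text):
    # Hypothetical stand-in for preprocess(); behavior is assumed, see the note above.
    text = text.lower().replace('.', ' .')
    words = text.split(' ')

    word_to_id, id_to_word = {}, {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)  # IDs are assigned in order of first appearance
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word


corpus, word_to_id, id_to_word = preprocess_sketch('You say goodbye and I say hello.')
print(corpus.tolist())  # [0, 1, 2, 3, 4, 1, 5, 6]
print(len(word_to_id))  # 7
```

So the corpus has 8 tokens and a vocabulary of 7 words, because "say" appears twice.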

Implementing the CBOW model

import numpy as np
from common.layers import MatMul, SoftmaxWithLoss


class SimpleCBOW:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size  # vocabulary size, number of hidden units

        # Initialize the weights
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(H, V).astype('f')

        # Create the layers
        self.in_layer0 = MatMul(W_in)  # input layer for the first context word
        self.in_layer1 = MatMul(W_in)  # input layer for the second context word (shares W_in)
        self.out_layer = MatMul(W_out)  # output layer
        self.loss_layer = SoftmaxWithLoss()  # softmax + cross-entropy loss

        # Collect all weights and gradients into lists
        layers = [self.in_layer0, self.in_layer1, self.out_layer]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params  # all weights
            self.grads += layer.grads  # all gradients

        # Store the distributed word representations in an instance variable
        self.word_vecs = W_in

    # Forward pass
    def forward(self, contexts, target):  # one-hot contexts and targets
        h0 = self.in_layer0.forward(contexts[:, 0])
        h1 = self.in_layer1.forward(contexts[:, 1])
        h = (h0 + h1) * 0.5  # average the two hidden vectors
        score = self.out_layer.forward(h)
        loss = self.loss_layer.forward(score, target)
        return loss

    # Backward pass
    def backward(self, dout=1):  # the upstream gradient starts at 1
        ds = self.loss_layer.backward(dout)  # backprop through SoftmaxWithLoss
        da = self.out_layer.backward(ds)  # backprop through the output MatMul
        da *= 0.5  # gradient of the 0.5 scaling used for the average
        self.in_layer1.backward(da)
        self.in_layer0.backward(da)
        return None
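
The post does not show the training step itself. As a rough, self-contained sketch of what training a CBOW model involves, the following uses plain NumPy with full-batch gradient descent. The learning rate, step count, random seed, and the function name `loss_and_grads` are all assumptions made for this illustration; this is not the training setup used to produce the vectors shown in this post.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])              # "you say goodbye and i say hello ."
contexts = np.stack([corpus[:-2], corpus[2:]], axis=1)   # (6, 2) left/right neighbors
target = corpus[1:-1]                                    # (6,) center words
N, V, H = len(target), 7, 5
rows = np.arange(N)

W_in = 0.01 * rng.standard_normal((V, H))
W_out = 0.01 * rng.standard_normal((H, V))


def loss_and_grads(W_in, W_out):
    # Forward: a one-hot input times W_in is a row lookup, so index W_in directly.
    h = 0.5 * (W_in[contexts[:, 0]] + W_in[contexts[:, 1]])  # (N, H)
    p = softmax(h @ W_out)                                   # (N, V)
    loss = -np.log(p[rows, target] + 1e-7).mean()

    # Backward: softmax with cross-entropy gives (p - one_hot) / N.
    ds = p.copy()
    ds[rows, target] -= 1
    ds /= N
    dW_out = h.T @ ds
    dh = (ds @ W_out.T) * 0.5                # the 0.5 comes from the averaging
    dW_in = np.zeros_like(W_in)
    np.add.at(dW_in, contexts[:, 0], dh)     # accumulate, since word IDs can repeat
    np.add.at(dW_in, contexts[:, 1], dh)
    return loss, dW_in, dW_out


learning_rate = 0.1  # assumed value
losses = []
for step in range(3000):
    loss, dW_in, dW_out = loss_and_grads(W_in, W_out)
    W_in -= learning_rate * dW_in
    W_out -= learning_rate * dW_out
    losses.append(loss)

print(losses[0], losses[-1])  # the loss should go down over training
```

After training, the rows of `W_in` play the role of `word_vecs` in the class above.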

# Inspect the parameters after training
# (model is assumed to be an already-trained SimpleCBOW instance)

word_vecs = model.word_vecs
for word_id, word in id_to_word.items():
    print(word, word_vecs[word_id])
    
>>
you [-1.4306136  -0.9282636  -1.4215512  -0.97909945  0.9885742 ]
say [ 0.11043634  1.1731244  -1.2457107   1.129241   -1.1233394 ]
goodbye [ 0.6736937  -1.061283    0.42636234 -1.0763363   1.1216726 ]
and [-1.830728    0.9921114  -0.97907877  1.0121546  -1.0293248 ]
i [ 0.69612706 -1.0427264   0.41979566 -1.0752864   1.1073269 ]
hello [-1.4401419  -0.94659406 -1.4275233  -0.991375    0.97422284]
. [ 1.5909597   0.96310425 -1.1559176   0.9106169  -0.82853013]

Each row of the parameter matrix holds the distributed representation (a dense vector) of the word with the corresponding ID.
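
Why a row of the weight matrix is the word's vector can be checked in a few lines: multiplying a one-hot vector by the input weights simply selects one row. The matrix below is random and only stands in for a trained `W_in`.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 7, 5
W_in = rng.standard_normal((V, H))  # stand-in for trained input weights

word_id = 2                          # e.g. "goodbye" in the example corpus
one_hot = np.zeros(V)
one_hot[word_id] = 1

h = one_hot @ W_in                   # what a MatMul input layer computes
print(np.allclose(h, W_in[word_id]))  # True: the product is just row 2 of W_in
```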

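Problems with the CBOW model

When the CBOW model is trained, the goal is to make the value of the loss function as small as possible.

The weight parameters obtained at that point are the distributed word representations we are after.

word2vec proposes two models: the CBOW model and the skip-gram model.

CBOW model: given several context words, it predicts the center word (target).

skip-gram model: given the center word (target), it predicts the surrounding words (contexts).
: This is a harder prediction task, but the resulting distributed representations are expected to be correspondingly better.

Output: there is one output layer per context word. The loss is computed separately for each output layer, and the sum of these losses is the final loss.

In terms of the precision of the distributed representations, the skip-gram model gives better results.
As the corpus grows, the skip-gram model performs better on low-frequency words and on analogy tasks.
On the other hand, the CBOW model trains faster.
The skip-gram model has to compute one loss per context word, so its computational cost is higher.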
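
The summed loss described above can be sketched in plain NumPy as follows. The scores are random stand-ins for the output layer's result, and `skip_gram_loss` is a name made up for this example. Because the losses are added, the gradients with respect to the shared scores add as well.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def skip_gram_loss(scores, contexts):
    # Hypothetical helper: sum of one cross-entropy loss per context position.
    n = scores.shape[0]
    rows = np.arange(n)
    p = softmax(scores)
    loss = 0.0
    ds = np.zeros_like(scores)
    for c in range(contexts.shape[1]):       # one loss term per context word
        ids = contexts[:, c]
        loss += -np.log(p[rows, ids]).mean()
        d = p.copy()
        d[rows, ids] -= 1                    # softmax + cross-entropy gradient
        ds += d / n                          # gradients from each loss add up
    return loss, ds


rng = np.random.default_rng(0)
scores = rng.standard_normal((6, 7))         # (batch, vocabulary), stand-in values
contexts = np.array([[0, 2], [1, 3], [2, 4], [3, 1], [4, 5], [1, 6]])

loss, ds = skip_gram_loss(scores, contexts)
print(loss)
print(ds.shape)  # (6, 7)
```

With a window size of 1 there are two loss terms per target, so the loss computation costs roughly twice as much as in CBOW, and the cost grows with the window size.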

import numpy as np
from common.layers import MatMul, SoftmaxWithLoss


class SimpleSkipGram:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size

        # Initialize the weights
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(H, V).astype('f')

        # Create the layers (one loss layer per context word)
        self.in_layer = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer1 = SoftmaxWithLoss()
        self.loss_layer2 = SoftmaxWithLoss()

        # Collect all weights and gradients into lists
        layers = [self.in_layer, self.out_layer]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # Store the distributed word representations in an instance variable
        self.word_vecs = W_in

    def forward(self, contexts, target):
        h = self.in_layer.forward(target)
        s = self.out_layer.forward(h)
        l1 = self.loss_layer1.forward(s, contexts[:, 0])
        l2 = self.loss_layer2.forward(s, contexts[:, 1])
        loss = l1 + l2
        return loss

    def backward(self, dout=1):
        dl1 = self.loss_layer1.backward(dout)
        dl2 = self.loss_layer2.backward(dout)
        ds = dl1 + dl2
        dh = self.out_layer.backward(ds)
        self.in_layer.backward(dh)
        return None

you [-0.0396911  -0.0021567  -0.0161098  -0.01683809  0.00338232]
say [-0.7061453   0.663761   -0.54495835  1.3671494   0.10923416]
goodbye [ 0.64752996 -0.6906047   0.8307198  -0.801251    1.1820267 ]
and [-1.0617508   1.0760398  -1.1251715  -0.37340426 -1.2822948 ]
i [ 0.66208416 -0.66618854  0.83902276 -0.7916126   1.1767272 ]
hello [ 1.0252911  -1.0213617   0.66982704 -1.0089456  -0.9862875 ]
. [ 0.01681329  0.00023315 -0.01464089 -0.00483291  0.01023407]

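	Count-based methods	Inference-based methods
Training	Processes the entire corpus once to obtain the distributed representations.	Learns from a small part of the corpus at a time, over many iterations (mini-batch learning).
Adding new words	The SVD has to be recomputed from scratch.	Training resumes with the already-learned weights as initial values and updates them.
Nature / precision of the representations	Word similarity	Word similarity plus more complex patterns between words
In actual evaluations of word similarity, neither approach is clearly better than the other.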
