[Python] Dimensionality Reduction of Word2vec Results (PCA)

1. Data loading: Quora question data

In [5]:
import os
os.chdir("/home/ajoumis2/quara/csv")
In [6]:
import numpy as np
import pandas as pd
raw = pd.read_csv('stop_words_data.csv', header=0) 
len(raw)
Out[6]:
404288
In [7]:
raw.isnull().sum().sum()
Out[7]:
157
In [4]:
raw[pd.isnull(raw.index)]  # no rows have a null index, so this comes back empty
Out[4]:
Unnamed: 0 question1 question2 full_question label
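The 157 missing cells are left in place here. If you would rather drop the affected rows before building the Word2vec input, a minimal sketch (the subset columns are taken from the header above):

# hypothetical cleanup step, not part of the original notebook
raw = raw.dropna(subset=['question1', 'question2']).reset_index(drop=True)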

2. Building the Word2vec input data

In [3]:
q1_list = list(raw['question1'])
q2_list = list(raw['question2'])
q_list = q1_list + q2_list
len(q_list)
Out[3]:
808576
In [4]:
w2v_input = []

for w2v_sentence in q_list:
    w2v_wordlist = str(w2v_sentence).split()  # split into words
    w2v_input.append(w2v_wordlist)  # collect the word lists
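The str() cast is what keeps the NaN entries counted in section 1 from breaking the split: a missing question simply becomes the single token 'nan'.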

3. Training the Word2vec model

3.1 All words, window 5, 300 features

In [37]:
def hash32(value):
    # fold Python's built-in hash() down to 32 bits; passed as hashfxn below
    # so word-vector seeding stays in 32-bit range on 64-bit builds
    return hash(value) & 0xffffffff
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

num_features = 300    # Word vector dimensionality                      
min_word_count = 1   # Minimum word count                        
num_workers = 50     # Number of threads to run in parallel
context = 5          # Context window size                                                                                    
downsampling = 1e-3  # Downsample setting for frequent words

# Initialize and train the model 
from gensim.models import word2vec
print ("Training model...")
model = word2vec.Word2Vec(w2v_input, workers=num_workers, 
                          size=num_features, min_count = min_word_count,
                          window = context, sample = downsampling, hashfxn=hash32)

model_name = "stop_300features_5context"
model.save(model_name)
2017-06-01 19:59:57,291 : INFO : collecting all words and their counts
2017-06-01 19:59:57,294 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-01 19:59:57,332 : INFO : PROGRESS: at sentence #10000, processed 54675 words, keeping 11984 word types
2017-06-01 19:59:57,376 : INFO : PROGRESS: at sentence #20000, processed 109532 words, keeping 17458 word types
2017-06-01 19:59:57,409 : INFO : PROGRESS: at sentence #30000, processed 164063 words, keeping 21574 word types
2017-06-01 19:59:57,448 : INFO : PROGRESS: at sentence #40000, processed 218212 words, keeping 24966 word types
2017-06-01 19:59:57,486 : INFO : PROGRESS: at sentence #50000, processed 273032 words, keeping 28132 word types
Training model...

...PROGRESS: at sentence #180000, processed 983726 words, keeping 52850 word types
2017-06-01 19:59:57,965 : INFO : PROGRESS: at sentence #190000, processed 1038420 words, keeping 54248 word types
2017-06-01 20:00:39,468 : INFO : training on 22264380 raw words (21482397 effective words) took 36.6s, 587481 effective words/s
2017-06-01 20:00:39,553 : INFO : saving Word2Vec object under stop_300features_5context, separately None
2017-06-01 20:00:39,555 : INFO : storing np array 'syn0' to stop_300features_5context.wv.syn0.npy
2017-06-01 20:00:39,649 : INFO : storing np array 'syn1neg' to stop_300features_5context.syn1neg.npy
2017-06-01 20:00:39,743 : INFO : not storing attribute syn0norm
2017-06-01 20:00:39,745 : INFO : not storing attribute cum_table
2017-06-01 20:00:40,099 : INFO : saved stop_300features_5context
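With training done, a quick sanity check of the vectors is possible; a minimal sketch, assuming the gensim-1.x-era API used above (the query words are just examples from the vocabulary):

model.wv.most_similar('india', topn=5)  # nearest neighbours in vector space
model.wv['best'].shape                  # (300,) feature vector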

In [38]:
model.save_word2vec_format('stop_300.txt', binary=False)
2017-06-01 20:00:52,854 : INFO : storing 103674x300 projection weights into stop_300.txt
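(In gensim releases after this post, save_word2vec_format lives on the KeyedVectors object instead: model.wv.save_word2vec_format('stop_300.txt', binary=False).)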

4. Dimensionality reduction

In [39]:
wordvec = pd.read_csv('stop_300.txt',
                      names=np.arange(0, 301),
                      sep=' ')
In [43]:
wordvec = wordvec[1:]
wordvec.head()
Out[43]:
0 1 2 3 4 5 6 7 8 9 ... 291 292 293 294 295 296 297 298 299 300
1 best -0.577359 0.328978 0.360973 1.348463 -0.458925 0.562232 -1.066533 0.508290 -0.095075 ... 0.090041 0.504813 0.952245 0.747423 -0.945117 -0.405453 -1.577781 0.969832 0.489959 -0.786234
2 get -0.162952 -0.328455 0.423307 0.877100 0.103763 0.486245 -1.039555 1.097103 -0.616748 ... -1.342751 -0.532536 -0.123314 0.359062 -0.654478 0.068062 -0.286974 -0.245059 -0.336912 0.388487
3 india 0.158580 -1.731864 -0.656390 -0.110334 1.109474 -0.227837 -0.952561 0.614622 0.284045 ... -0.612629 0.768930 0.101692 0.735992 -0.384114 -0.681974 -0.807551 1.598546 0.887515 -0.467995
4 people -0.504831 0.333736 -0.536498 0.509518 0.430136 -0.586826 -0.544074 -0.751538 -0.163667 ... 0.397640 0.568188 -0.726848 -0.628043 0.557408 1.604163 -0.339792 0.856929 -0.455798 0.445229
5 like -0.223943 -0.641955 -0.061691 -0.046772 1.272367 -0.687771 -0.276923 -0.574637 0.292714 ... 0.849850 -0.616438 -0.530650 0.171090 0.036058 0.318488 0.043461 0.683442 0.402371 0.481683

5 rows × 301 columns

In [113]:
pca_data = wordvec[np.arange(1, 301,1)]
len(pca_data)
Out[113]:
103674
In [102]:
from sklearn.decomposition import PCA
pca = PCA(n_components=40)
pca.fit(pca_data)
Out[102]:
PCA(copy=True, iterated_power='auto', n_components=40, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [103]:
%matplotlib inline
import matplotlib.pyplot as plt
var = pca.explained_variance_ratio_
var1 = np.cumsum(np.round(var, decimals=4) * 100)
plt.plot(var1)
Out[103]:
[<matplotlib.lines.Line2D at 0x7fbdb33b1048>]
In [104]:
pca.fit_transform(pca_data)
Out[104]:
array([[  1.53447146e+00,   3.39023873e+00,   1.08127260e-01, ...,
          2.01060983e+00,   1.28154938e+00,  -1.75942159e+00],
       [  6.18207164e-01,   3.42729725e+00,   2.81663213e+00, ...,
          2.55504965e-01,   1.54289260e+00,   7.97892561e-01],
       [  1.88674743e+00,   3.41535169e+00,   4.73622261e+00, ...,
          1.86097684e+00,   1.52052731e+00,   1.40849692e+00],
       ..., 
       [ -3.77297627e-01,   6.06470989e-02,  -3.92401039e-02, ...,
         -1.30437759e-02,   4.79383547e-03,  -8.41005516e-03],
       [ -3.98396248e-01,   1.31898811e-02,  -3.59753547e-02, ...,
         -1.45744723e-02,   6.89606871e-03,   1.00892222e-02],
       [ -2.46134171e-01,  -7.74727566e-03,  -1.64871864e-02, ...,
         -3.79921235e-03,  -4.07326473e-03,  -1.65871335e-02]])
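Note that In [105] below re-runs fit_transform only to bind the result to a name; since the PCA is already fitted, pca.transform(pca_data) would return the same projection without refitting.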
In [105]:
pca_40 = pca.fit_transform(pca_data)
In [146]:
pca_40 = pd.DataFrame(pca_40)
pca_40.index = np.arange(1,len(pca_40)+1)
pca_40.head()
Out[146]:
0 1 2 3 4 5 6 7 8 9 ... 30 31 32 33 34 35 36 37 38 39
1 1.534470 3.390230 0.108129 -0.031619 0.521591 0.308070 1.658023 3.601194 0.983245 -1.524013 ... 1.188226 -0.867652 0.107822 2.070843 1.505776 -0.464603 -0.171775 2.164998 -0.017432 -2.602023
2 0.618207 3.427290 2.816625 3.679018 -1.492741 0.433359 0.730821 1.118609 -1.361869 -1.731525 ... 0.131332 -0.135671 -0.281151 -2.794766 1.373580 -1.282340 -0.610911 0.003826 1.359621 0.503281
3 1.886747 3.415355 4.736232 -0.436743 -2.702472 -3.970952 -0.713557 0.097076 2.549259 1.023274 ... -0.508882 1.363599 0.347434 -0.137511 0.157223 -0.677026 0.056873 2.721850 1.255052 1.461079
4 1.040738 -1.763100 5.145316 3.539930 1.515109 -1.726618 -1.751083 1.819226 2.211810 0.779057 ... 0.902022 -1.365069 0.417042 -0.227497 3.568700 -0.367797 -2.057030 -1.143381 -1.344948 -0.674561
5 0.655346 0.275508 1.940825 1.300419 0.356013 -0.408101 0.531542 1.178360 1.325056 -2.677399 ... 0.290439 0.867025 -2.645487 1.205273 -1.179467 1.082302 0.105457 0.536885 0.650930 -0.990342

5 rows × 40 columns

In [159]:
word_list = pd.DataFrame(wordvec[0])
word_list = word_list.rename(columns = {0:'word'})
word_list.head()
Out[159]:
word
1 best
2 get
3 india
4 people
5 like
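pd.concat aligns rows by index, which is why pca_40.index was shifted earlier to start at 1: it has to match word_list, whose index begins at 1 because the header row was sliced off. A quick guard before concatenating (a hypothetical check, not in the original):

assert (word_list.index == pca_40.index).all()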
In [165]:
w2v_pca40 = pd.concat([word_list,  pca_40], axis=1)
w2v_pca40.head()
Out[165]:
word 0 1 2 3 4 5 6 7 8 ... 30 31 32 33 34 35 36 37 38 39
1 best 1.534470 3.390230 0.108129 -0.031619 0.521591 0.308070 1.658023 3.601194 0.983245 ... 1.188226 -0.867652 0.107822 2.070843 1.505776 -0.464603 -0.171775 2.164998 -0.017432 -2.602023
2 get 0.618207 3.427290 2.816625 3.679018 -1.492741 0.433359 0.730821 1.118609 -1.361869 ... 0.131332 -0.135671 -0.281151 -2.794766 1.373580 -1.282340 -0.610911 0.003826 1.359621 0.503281
3 india 1.886747 3.415355 4.736232 -0.436743 -2.702472 -3.970952 -0.713557 0.097076 2.549259 ... -0.508882 1.363599 0.347434 -0.137511 0.157223 -0.677026 0.056873 2.721850 1.255052 1.461079
4 people 1.040738 -1.763100 5.145316 3.539930 1.515109 -1.726618 -1.751083 1.819226 2.211810 ... 0.902022 -1.365069 0.417042 -0.227497 3.568700 -0.367797 -2.057030 -1.143381 -1.344948 -0.674561
5 like 0.655346 0.275508 1.940825 1.300419 0.356013 -0.408101 0.531542 1.178360 1.325056 ... 0.290439 0.867025 -2.645487 1.205273 -1.179467 1.082302 0.105457 0.536885 0.650930 -0.990342

5 rows × 41 columns

In [166]:
w2v_pca40.to_csv('w2v_pca40.csv')
In [168]:
wordvec.to_csv('w2v_300.csv')
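Both to_csv calls write the DataFrame index as an unnamed first column (the same 'Unnamed: 0' seen when the input was loaded in section 1); pass index=False if the next stage doesn't need it.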
