[Python] Bag of Words + Sentiment Analysis

Bag of Words + Sentiment Analysis

Part 1: For Beginners – Bag of Words

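  • Bag of Words is one of the methods for automatically classifying documents.
  • It judges what kind of document a text is by looking at the distribution of the words it contains.
  • For example, if words such as ‘exchange rate’, ‘stock price’, and ‘interest rate’ appear often in a document, it is classified as an economics document; if words such as ‘backlight’, ‘exposure’, and ‘composition’ appear often, it is classified as a photography document.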
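
The idea in the bullets above can be sketched in a few lines of plain Python. The keyword lists and the `guess_topic` helper are hypothetical and chosen only for illustration; a real classifier learns which words matter from labeled data instead of using a hand-written list.

```python
from collections import Counter

# Hypothetical keyword lists, chosen only for illustration.
TOPIC_KEYWORDS = {
    "economics": {"exchange", "stock", "interest", "rate", "price"},
    "photography": {"backlight", "exposure", "composition", "lens"},
}

def guess_topic(text):
    """Return the topic whose keywords occur most often in the text."""
    counts = Counter(text.lower().split())
    scores = {
        topic: sum(counts[word] for word in keywords)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

doc = "the stock price fell as the interest rate and the exchange rate rose"
print(guess_topic(doc))
```

Word order plays no role here: only the counts matter, which is where the name "bag" of words comes from.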

Setting the working directory and loading the data

In [1]:
# Check and change the current working directory
import os
os.getcwd()  # check the current working directory
Out[1]:
'C:\\Users\\user'
In [2]:
os.chdir(r"D:\study\Kaggle\w2v")  # raw string so the backslashes are not read as escapes
os.getcwd()
Out[2]:
'D:\\study\\Kaggle\\w2v'
In [3]:
# Import pandas to work with DataFrames
import pandas as pd
train = pd.read_csv("labeledTrainData.tsv",
                    header=0,
                    delimiter="\t",
                    quoting=3)
In [4]:
# Movie review data
train[0:5]
Out[4]:
id sentiment review
0 “5814_8” 1 “With all this stuff going down at the moment …
1 “2381_9” 1 “\”The Classic War of the Worlds\” by Timothy …
2 “7759_3” 0 “The film starts with a manager (Nicholas Bell…
3 “3630_4” 0 “It must be assumed that those who praised thi…
4 “9495_8” 1 “Superbly trashy and wondrously unpretentious …
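
The double quotes around the id and review values are kept because `quoting=3` tells pandas not to treat quote characters specially; the value 3 corresponds to `csv.QUOTE_NONE` in the standard library. A minimal sketch with the `csv` module, using a made-up two-line stand-in for the TSV file:

```python
import csv
import io

# quoting=3 in pandas corresponds to csv.QUOTE_NONE.
assert csv.QUOTE_NONE == 3

# A two-line stand-in for the TSV file (hypothetical content).
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"A "great" film"\n'

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[1])
```

With quote handling switched off, a review that contains unbalanced double quotes cannot confuse the parser; the quotes simply stay in the text and are removed later by the cleaning steps.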
In [5]:
# 25,000 movie reviews
train.shape
Out[5]:
(25000, 3)
In [6]:
# Check the column names
train.columns.values
Out[6]:
array(['id', 'sentiment', 'review'], dtype=object)
In [7]:
train["review"][0]
Out[7]:
'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'

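Text preprocessing process

Removing meaningless stop words

  • BeautifulSoup4 package: extract the text with the get_text() function
  • Use a regular expression to remove symbols
  • NLTK package (Python Natural Language Toolkit): stopwords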
In [8]:
# Import the BeautifulSoup4 package for preprocessing
from bs4 import BeautifulSoup 
In [9]:
# get_text() returns only the text, with tags and markup removed
example1 = BeautifulSoup(train["review"][0], "lxml")  # name the parser explicitly
example1.get_text()
Out[9]:
u'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? 
Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'
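
In the output above, text on either side of a removed `<br />` tag is fused together (for example "m'kay.Visually"). That does no harm here, because the next step turns every non-letter into a space. For environments where bs4 is not installed, the same tag stripping can be sketched with the standard library. `TextExtractor` and `strip_tags` are hypothetical helpers written for this example, not part of BeautifulSoup, and they make no attempt to match its behavior on malformed HTML.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text that is not part of a tag.
        self.chunks.append(data)

def strip_tags(markup, separator=" "):
    """Return the text content of markup, joined with separator."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return separator.join(parser.chunks)

html = "drugs are bad m'kay.<br /><br />Visually impressive"
print(strip_tags(html))
```

Passing `separator=""` reproduces the fused form seen in the notebook output.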
In [10]:
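# Use a regular expression to remove symbols, digits, and other non-letters
import re
letters_only = re.sub("[^a-zA-Z]",           # pattern to replace: anything that is not an upper- or lowercase letter
                      " ",                   # replacement: a space
                      example1.get_text() )  # text to process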
letters_only
Out[10]:
u' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord  Why he wants MJ dead so bad is beyond me  Because MJ overheard his plans  Nah  Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno  maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence  Also  the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line  this movie is for people who like MJ on one level or another  which i think is most people   If not  then stay away  It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl  Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty  Well  
with all the attention i ve gave this subject    hmmm well i don t know because people can be different behind closed doors  i know this for a fact  He is either an extremely nice but stupid guy or one of the most sickest liars  I hope he is not the latter  '
In [11]:
lower_case = letters_only.lower()        # convert everything to lowercase
words = lower_case.split()               # split into words
words
Out[11]:
[u'with',
 u'all',
 u'this',
 u'stuff',
 u'going',
 u'down',
 u'at',
 u'the',
 u'moment',
 u'with',
 u'mj',
 u'i',
 u've',
 u'started',
 u'i',
 u'hope',
 u'he',
 u'is',
 u'not',
 u'the',
 u'latter'....]

Tokenizing documents with the NLTK package

  • Tokenizing means splitting a document into individual words
  • Remove stop words with the nltk package
In [12]:
import nltk
nltk.download()  # download the text datasets and the stop words
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[12]:
True
In [13]:
from nltk.corpus import stopwords # import the stop word list
stopwords.words("english") 
Out[13]:
[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your',
 u'yours',
 u'yourself',
 u'yourselves',
 u'he',
 u'him',
 u'his',
 u'himself',
 u'she',
 u'her',
 u'hers',
 u'herself',
 u'it',
 u'its',
 u'itself',
 u'they',
 u'them',....]
In [14]:
# Remove stop words from "words"
words = [w for w in words if w not in stopwords.words("english")]
words
Out[14]:
[u'stuff',
 u'going',
 u'moment',
 u'mj',
 u'started',
 u'listening',
 u'music',
 u'watching',
 u'odd',
 u'documentary',
 u'watched',
 u'wiz',
 u'watched',
 u'moonwalker',
 u'maybe',
 u'want',
 u'get',
 u'certain',
 u'insight',
 u'guy',
 u'thought',
 u'really',
 u'cool',
 u'eighties',
 u'maybe',
 u'make',
 u'mind',
 u'whether',
 u'guilty',
 u'innocent',
 u'moonwalker',
 u'part',
 u'biography',
 u'part',
 u'feature',
 u'film',
 u'remember',
 u'going',
 u'see',
 u'cinema',
 u'originally',
 u'released',
 u'subtle',
 u'messages',
 u'mj',
 u'feeling',
 u'towards',
 u'press',
 u'also',
 u'obvious',
 u'message',
 u'drugs',....]

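Putting the text preprocessing together

  • 1) Extract the text with the get_text() function of the BeautifulSoup package
  • 2) Apply a regular expression with re.sub from the re module to remove symbols
  • 3) Convert all characters to lowercase with lower(), then split into words with split()
  • 4) Build a set of stop words
  • 5) Remove the stop words with a loop (here, a list comprehension)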
In [15]:
def review_to_words(raw_review):
    # 1. Remove HTML (name the parser explicitly to avoid the bs4 warning)
    review_text = BeautifulSoup(raw_review, "lxml").get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case and split into words
    words = letters_only.lower().split()
    # 4. Convert the stop word list to a set for fast lookups
    stops = set(stopwords.words("english"))
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    # 6. Join the words back into one string separated by spaces
    return " ".join(meaningful_words)
In [16]:
clean_review = review_to_words( train["review"][0] )
clean_review
Out[16]:
u'stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working one kid let alone whole bunch performing complex dance scene bottom line movie people like mj one level another think people stay away try give wholesome message ironically mj bestest buddy movie girl michael jackson truly one talented people ever grace planet guilty well attention gave subject hmmm well know people different behind closed doors know fact either extremely nice stupid guy one sickest liars hope latter'
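
Step 4 converts the stop word list to a set because a membership test on a set is a hash lookup, while the same test on a list scans it element by element for every word. The sketch below shows steps 2 to 5 without the bs4 and NLTK dependencies; the tiny `STOPS` list and the `clean` helper are hypothetical stand-ins, not the NLTK list used in the notebook.

```python
import re

# A tiny hypothetical stop list; the notebook uses NLTK's English list instead.
STOPS = frozenset(["the", "is", "a", "of", "and", "this", "to", "i"])

def clean(text, stops=STOPS):
    """Keep letters only, lowercase, split, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stops)

print(clean("This is the BEST part of the movie, and I love it!"))
```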
In [17]:
# Get the number of documents (25000)
num_reviews = train["review"].size

# Create an empty list
clean_train_reviews = []

# Loop over the documents, run review_to_words on each, and append the result to clean_train_reviews
for i in range( 0, num_reviews ):
    clean_train_reviews.append( review_to_words( train["review"][i] ) )

To check that it is working properly, you can print progress like this (optional).

In [18]:
print("Cleaning and parsing the training set movie reviews...\n")
clean_train_reviews = []
for i in range(0, num_reviews):
    # If the review number is evenly divisible by 1000, print a message
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_train_reviews.append(review_to_words(train["review"][i]))
Cleaning and parsing the training set movie reviews...

Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000

Creating Features from a Bag of Words (Using scikit-learn)

  • A way to convert documents into numbers so that machine learning can be applied
  • Count how many times each word appears in a given document
  • Sentence 1: “The cat sat on the hat”
  • Sentence 2: “The dog ate the cat and the hat”
  • Vocabulary of the two sentences above: { the, cat, sat, on, hat, dog, ate, and }
  • Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
  • Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
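
The two count vectors above can be reproduced with the standard library alone. This is a minimal sketch; it orders the vocabulary by first appearance to match the list above, which is a choice made for this example (scikit-learn sorts its vocabulary alphabetically).

```python
from collections import Counter

sentences = [
    "The cat sat on the hat",
    "The dog ate the cat and the hat",
]

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

def to_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(s) for s in sentences]
print(vocabulary)
print(vectors)
```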

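Extracting Bag of Words features with scikit-learn

  • Use scikit-learn's “CountVectorizer” class
  • Build the feature vectors from the 5,000 most frequent words only (limiting the vocabulary keeps the curse of dimensionality in check)
  • Its arguments can also handle tokenizing, preprocessing, and stop word removal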
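
What limiting the vocabulary to the most frequent words means can be sketched in plain Python. `build_vocabulary` is a hypothetical helper illustrating the idea behind `max_features`; it is not scikit-learn's implementation, and the alphabetical tie-breaking is an assumption made so that the result is deterministic.

```python
from collections import Counter

def build_vocabulary(documents, max_features):
    """Return the max_features most frequent words, sorted alphabetically."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])

docs = ["good movie good plot", "bad movie", "good acting bad plot"]
print(build_vocabulary(docs, max_features=3))
```

Words outside the kept vocabulary are simply not counted, so every document becomes a vector of fixed length `max_features`.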
In [19]:
# Creating the bag of words
from sklearn.feature_extraction.text import CountVectorizer

# scikit-learn's CountVectorizer is the class that builds the bag of words
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 5000) 

# fit_transform() does two things:
# 1) It fits the model and learns the vocabulary.
# 2) It transforms the text data into feature vectors. The input should be
#    a list of strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to array
train_data_features = train_data_features.toarray()

Checking the results

  • Shape of the data (array)
  • Number of extracted word features (5,000 words from 25,000 documents)
  • List of the 5,000 words
In [20]:
train_data_features
Out[20]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
In [21]:
train_data_features.shape
Out[21]:
(25000L, 5000L)
In [22]:
# In scikit-learn 1.2 and later, use vectorizer.get_feature_names_out() instead
vocab = vectorizer.get_feature_names()
vocab
Out[22]:
[u'abandoned',
 u'abc',
 u'abilities',
 u'ability',
 u'able',
 u'abraham',
 u'absence',
 u'absent',
 u'absolute',
 u'absolutely',
 u'absurd',
 u'abuse',
 u'abusive',
 u'abysmal',
 u'academy',
 u'accent',
 u'accents',
 u'accept',
 u'acceptable',
 u'accepted',
 u'access',
 u'accident',
 u'accidentally',
 u'accompanied',
 u'accomplished',
 u'according',
 u'account',
 u'accuracy',
 u'accurate',
 u'accused',
 u'achieve',
 u'achieved',
 u'achievement',
 u'acid',
 u'across',
 u'act',
 u'acted',
 u'acting',
 u'action',
 u'actions',
 u'activities',
 u'actor',
 u'actors',
 u'actress',
 u'actresses',
 u'acts',
 u'actual',
 u'actually',
 u'ad',
 u'adam',
 u'adams',
 u'adaptation',
 u'adaptations',
 u'adapted',
 u'add',
 u'added',
 u'adding',
 u'addition',
 u'adds',
 u'adequate',
 u'admire',
 u'admit',
 u'admittedly',
 u'adorable',
 u'adult',
 u'adults',...]
  • To see the total count of each word, the code below can be used (optional)
In [23]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print("%d %s" % (count, tag))
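
To see the most frequent words first, the totals can be sorted before printing. A minimal sketch in plain Python; `toy_vocab` and `toy_features` are made-up stand-ins for `vocab` and `train_data_features`.

```python
# Toy stand-ins for vocab and train_data_features (hypothetical values).
toy_vocab = ["bad", "good", "movie"]
toy_features = [
    [0, 2, 1],
    [1, 0, 1],
    [1, 1, 0],
]

# Column sums give the total count of each word across all documents.
totals = [sum(column) for column in zip(*toy_features)]

# Pair each word with its total and list the most frequent first.
ranking = sorted(zip(toy_vocab, totals), key=lambda pair: pair[1], reverse=True)
for word, count in ranking:
    print(count, word)
```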

Running machine learning on the training data

  • Classification: random forest, an ensemble method (sklearn package)
  • Classify the sentiment of each document from its bag of words

train

In [24]:
from sklearn.ensemble import RandomForestClassifier

# Train the model with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

# Fit the random forest to the bag of words features
forest = forest.fit( train_data_features, train["sentiment"] )

test

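  • For the training data, features are extracted with the “fit_transform” function
  • Because the vocabulary has already been learned from the training data, the test set uses the “transform” function
  • Preprocess the test data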
In [25]:
# Load the data
test = pd.read_csv("testData.tsv", header=0, delimiter="\t",
                   quoting=3)

# Check the data
print(test.shape)

# Count the reviews and create an empty list
num_reviews = len(test["review"])
clean_test_reviews = []

# Append the cleaned reviews to the empty list
for i in range(0, num_reviews):
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_review = review_to_words(test["review"][i])
    clean_test_reviews.append(clean_review)
  • Feature extraction and prediction on the test set
In [26]:
# Extract the features with vectorizer.transform
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# Predict with the random forest
result = forest.predict(test_data_features)

# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )

# Use pandas to write the comma-separated output file
output.to_csv( "Bag_of_Words_model.csv", index=False, quoting=3 )
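
Why the test set goes through transform rather than fit_transform can be sketched with a toy vectorizer in plain Python. `TinyCountVectorizer` is a hypothetical class written for this example and only mimics the fit/transform split; it is not scikit-learn's CountVectorizer.

```python
from collections import Counter

class TinyCountVectorizer:
    """A toy stand-in for the fit/transform idea (hypothetical class)."""

    def fit(self, documents):
        # Learn the vocabulary from these documents only.
        words = {word for doc in documents for word in doc.split()}
        self.vocabulary = sorted(words)
        return self

    def transform(self, documents):
        # Count words against the vocabulary that was already learned.
        rows = []
        for doc in documents:
            counts = Counter(doc.split())
            rows.append([counts[word] for word in self.vocabulary])
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

train_docs = ["good movie", "bad movie"]
test_docs = ["good good film"]

vec = TinyCountVectorizer()
train_rows = vec.fit_transform(train_docs)
test_rows = vec.transform(test_docs)
print(vec.vocabulary)
print(train_rows)
print(test_rows)
```

The test row has the same three columns as the training rows, and the unseen word "film" is ignored. Fitting again on the test set would produce different columns, and the trained classifier could no longer interpret them.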
  • Checking the results
In [27]:
pd.read_csv("Bag_of_Words_model.csv",
            header=0,
            quoting=3)
Out[27]:
id sentiment
0 “12311_10” 1
1 “8348_2” 0
2 “5828_4” 1
3 “7186_2” 1
4 “12128_7” 1
5 “2913_8” 0
6 “4396_1” 0
7 “395_2” 1
8 “10616_1” 0
9 “9074_9” 1
10 “9252_3” 1
11 “9896_9” 1
12 “574_4” 1
13 “11182_8” 1
14 “11656_4” 0
15 “2322_4” 0
16 “8703_1” 0
17 “7483_1” 0
18 “6007_10” 1
19 “12424_4” 0
20 “4672_1” 0
21 “10841_3” 0
22 “8954_7” 1
23 “7392_1” 0
24 “10288_8” 1
25 “5343_4” 0
26 “4950_1” 0
27 “9257_4” 0
28 “8689_3” 0
29 “4480_2” 1
24970 “6857_10” 1
24971 “11091_8” 1
24972 “4167_2” 1
24973 “679_4” 0
24974 “10147_1” 0
24975 “6875_1” 0
24976 “923_10” 1
24977 “6200_8” 0
24978 “7208_8” 1
24979 “5363_8” 1
24980 “4067_8” 0
24981 “1773_7” 1
24982 “1498_10” 1
24983 “10497_10” 1
24984 “3444_10” 1
24985 “588_2” 0
24986 “9678_9” 1
24987 “1983_9” 0
24988 “5012_3” 0
24989 “12240_2” 1
24990 “5071_2” 0
24991 “5078_2” 0
24992 “10069_3” 1
24993 “7407_8” 1
24994 “7207_1” 0
24995 “2155_10” 1
24996 “59_10” 1
24997 “2531_1” 0
24998 “7772_8” 1
24999 “11465_10” 0

25000 rows × 2 columns
