[Python] Term Frequency – Inverse Document Frequency (TF-IDF)

Text feature extraction

The Bag of Words representation

  • Text analysis is a major application area of machine learning.
  • Sequences of symbols such as characters cannot be fed to algorithms directly, so they must be converted into numeric vectors.
  • To this end, scikit-learn provides utilities to extract numeric vectors from text:

    • tokenizing : splitting strings into tokens on whitespace and punctuation
    • counting : counting the occurrences of each token in each document
    • normalizing : weighting the token occurrence counts
  • This approach defines features and samples as follows:

    • the frequency of each individual token is a feature
    • the vector of all token frequencies for a given document is treated as a multivariate sample
  • A corpus of documents is thus represented as a matrix of numeric vectors with one row per document.
  • This approach is called the Bag of Words or “Bag of n-grams” representation.
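The three steps above can be sketched in plain Python before reaching for scikit-learn. This is a minimal illustration (not scikit-learn's actual implementation); the regex mirrors the default token pattern used later in this post:

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'And the third one',
]

# tokenizing: lowercase, then extract tokens of two or more word characters
def tokenize(doc):
    return re.findall(r'\b\w\w+\b', doc.lower())

# counting: one Counter of token occurrences per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# the vocabulary (feature set) is the union of all tokens; each document
# becomes a count vector over it -- one row of the document-term matrix
vocabulary = sorted(set(tok for c in counts for tok in c))
matrix = [[c[tok] for tok in vocabulary] for c in counts]
```

Each column of `matrix` corresponds to one vocabulary entry, which is exactly the layout CountVectorizer produces below.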

Common Vectorizer usage

  • CountVectorizer performs both tokenization and occurrence counting:
In [1]:
from sklearn.feature_extraction.text import CountVectorizer
In [2]:
vectorizer = CountVectorizer(min_df=1)
vectorizer
Out[2]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
In [3]:
corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
]
x = vectorizer.fit_transform(corpus)
print(x) 
  (0, 8)    1
  (0, 3)    1
  (0, 6)    1
  (0, 2)    1
  (0, 1)    1
  (1, 8)    1
  (1, 3)    1
  (1, 6)    1
  (1, 1)    1
  (1, 5)    1
  (2, 6)    1
  (2, 0)    1
  (2, 7)    1
  (2, 4)    1
  (3, 8)    1
  (3, 3)    1
  (3, 6)    1
  (3, 2)    1
  (3, 1)    1
In [4]:
# the default analyzer keeps only tokens of two or more characters
analyze = vectorizer.build_analyzer()
analyze("This is a text document to analyze.") == ['this', 'is', 'text', 'document', 'to', 'analyze']
Out[4]:
True
  • Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix.
  • This interpretation of the columns can be retrieved as follows:
In [5]:
vectorizer.get_feature_names()  # feature names, ordered by column index
Out[5]:
[u'and',
 u'document',
 u'first',
 u'is',
 u'one',
 u'second',
 u'the',
 u'third',
 u'this']
In [6]:
x.toarray() 
Out[6]:
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
  • The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:
In [7]:
vectorizer.vocabulary_.get('document')
Out[7]:
1
  • Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method (words absent from the fitted vocabulary simply do not appear in the output):
In [8]:
vectorizer.transform(['Something completely new.']).toarray()
Out[8]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
  • Note that in the previous corpus, the first and the last documents have exactly the same words hence are encoded in equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words)
In [9]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b', 
                                    min_df=1)
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')
Out[9]:
[u'bi', u'grams', u'are', u'cool', u'bi grams', u'grams are', u'are cool']
  • The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns:
In [10]:
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2
Out[10]:
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)
  • In particular the interrogative form “Is this” is only present in the last document:
In [11]:
feature_index = bigram_vectorizer.vocabulary_.get('is this')
X_2[:, feature_index]  
Out[11]:
array([0, 0, 0, 1], dtype=int64)

Tf–idf term weighting

  • In a large text corpus, some words will be very frequent (e.g. “the”, “a”, “is” in English) and hence carry very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.
  • In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.
  • TF means term frequency: the raw count f(t, d) of term t in document d
    • measures how often a particular term appears within a document
    • binary (boolean) scale: 1 if the term appears at least once, otherwise 0
    • log scale: log(count + 1)
    • augmented frequency: the count adjusted for the length of the document
  • IDF means inverse document frequency: log(total number of documents / number of documents containing the term)
  • TF-IDF(t, d) = TF(t, d) × IDF(t)
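The plain formulas above can be computed by hand as a sanity check. This is a minimal sketch using the same toy counts matrix as the cells below; note that scikit-learn's TfidfTransformer uses a slightly different idf, as shown later:

```python
import math

# toy document-term counts: rows are documents, columns are terms
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

n_docs = len(counts)

def idf(t):
    # IDF(t) = log(total number of documents / number of documents containing t)
    df = sum(1 for doc in counts if doc[t] > 0)
    return math.log(n_docs / df)

def tf_idf(t, d):
    # TF-IDF(t, d) = TF(t, d) * IDF(t), using the raw count as TF
    return counts[d][t] * idf(t)
```

Term 0 appears in every document, so its idf (and hence its tf-idf) is 0: it is maximally down-weighted, which is exactly the point of the scheme.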
In [12]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)  # smooth_idf adds 1 to the idf numerator and denominator; disabled here
transformer   
Out[12]:
TfidfTransformer(norm=u'l2', smooth_idf=False, sublinear_tf=False,
         use_idf=True)
In [13]:
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tfidf = transformer.fit_transform(counts)
tfidf                         
Out[13]:
<6x3 sparse matrix of type '<type 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>
In [14]:
tfidf.toarray() 
Out[14]:
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])
In [15]:
transformer = TfidfTransformer()
transformer.fit_transform(counts).toarray()
Out[15]:
array([[ 0.85151335,  0.        ,  0.52433293],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.55422893,  0.83236428,  0.        ],
       [ 0.63035731,  0.        ,  0.77630514]])
  • The weights of each feature computed by the fit method call are stored in a model attribute:
In [16]:
transformer.idf_
Out[16]:
array([ 1.        ,  2.25276297,  1.84729786])
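The idf_ values above can be reproduced by hand. With the default smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, as if one extra document containing every term had been seen:

```python
import math

# document frequencies for the 6x3 counts matrix above:
# term 0 appears in all 6 documents, term 1 in one, term 2 in two
n_docs, dfs = 6, [6, 1, 2]

# smoothed idf used by TfidfTransformer(smooth_idf=True):
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
idfs = [math.log((1 + n_docs) / (1 + df)) + 1 for df in dfs]
```

The smoothing prevents zero divisions for unseen terms and keeps every idf strictly positive, so even ubiquitous terms (like term 0 here, with idf 1.0) are not zeroed out entirely.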
  • TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer in a single model:
In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit_transform(corpus)
Out[17]:
<4x9 sparse matrix of type '<type 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse Row format>
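As a sketch of that equivalence (assuming default parameters on both sides), fitting TfidfVectorizer directly should give the same matrix as running CountVectorizer followed by TfidfTransformer:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
]

# two-step pipeline: raw counts, then tf-idf re-weighting
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# single step with the combined vectorizer
one_step = TfidfVectorizer().fit_transform(corpus).toarray()

same = np.allclose(two_step, one_step)
```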
