[Python] Ch5. Categorizing and Tagging Words (nltk Books)

3. Processing Raw Text

  • 1.How can we write programs to access text from local files and from the Web
  • 2.How can we split documents up into individual words
  • 3.How can we write programs to produce formatted output and save it
  • 토크나이즈 + 패턴찾아 바꾸기(정규식)
  • http://www.nltk.org/book/ch03.html

4. Writing Structured Programs

  • How can you write well-structured, readable programs that you and others will be able to re-use easily?
  • How do the fundamental building blocks work, such as loops, functions and assignment?
  • What are some of the pitfalls with Python programming and how can you avoid them?
  • 함수짜기, For문(절차식, 선언식)짜기
  • http://www.nltk.org/book/ch04.html

5. Categorizing and Tagging Words

1. Using a Tagger

  • A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word
  • pos_tagger는 각 단어에 맞는 POS를 붙여준다
In [1]:
import nltk, re, pprint
from nltk import word_tokenize
In [2]:
text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)
Out[2]:
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
  • another example, this time including some homonyms
  • refuse and permit both appear as a present tense verb (VBP) and a noun (NN)
  • 동음이의어의 경우 다른 형태소를 갖기도 한다.
In [3]:
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)
Out[3]:
[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
  • Many of these categories arise from superficial analysis the distribution of words in text.
  • Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner).
  • The text.similar() method takes a word w, finds all contexts w1w w2, then finds all words w’ that appear in the same context, i.e. w1w’w2.
  • 동일한 맥락에서 단어가 나타난 단어들을 보여줌
In [4]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words()) #소문자로 변경
text.similar('woman') 
man time day year car moment world family house country child boy
state job way war girl place word work
In [5]:
text.similar('bought')
made said put done seen had found left given heard brought got been
was set told took in felt that
In [6]:
text.similar('over')
in on to of and for with from at by that into as up out down through
is all about
In [7]:
text.similar('the')
a his this their its her an that our any all one these my in your no
some other and

2. Tagged Corpora

2.1 Representing Tagged Tokens

  • 튜플을 사용해서 문자에 형태소를 태깅할 수 있음
In [8]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token
Out[8]:
('fly', 'NN')
In [9]:
sent = '''The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.'''
In [10]:
[nltk.tag.str2tuple(t) for t in sent.split()]
Out[10]:
[('The', 'AT'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('commented', 'VBD'),
 ('on', 'IN'),
 ('a', 'AT'),
 ('number', 'NN'),
 ('of', 'IN'),
 ('other', 'AP'),
 ('topics', 'NNS'),
 (',', ','),
 ('AMONG', 'IN'),
 ('them', 'PPO'),
 ('the', 'AT'),
 ('Atlanta', 'NP'),
 ('and', 'CC'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('purchasing', 'VBG'),
 ('departments', 'NNS'),
 ('which', 'WDT'),
 ('it', 'PPS'),
 ('said', 'VBD'),
 ('``', '``'),
 ('ARE', 'BER'),
 ('well', 'QL'),
 ('operated', 'VBN'),
 ('and', 'CC'),
 ('follow', 'VB'),
 ('generally', 'RB'),
 ('accepted', 'VBN'),
 ('practices', 'NNS'),
 ('which', 'WDT'),
 ('inure', 'VB'),
 ('to', 'IN'),
 ('the', 'AT'),
 ('best', 'JJT'),
 ('interest', 'NN'),
 ('of', 'IN'),
 ('both', 'ABX'),
 ('governments', 'NNS'),
 ("''", "''"),
 ('.', '.')]

2.2 Reading Tagged Corpora

  • Note that part-of-speech tags have been converted to uppercase, since this has become standard practice since the Brown Corpus was published.
  • 브라운 코퍼스가 제작된 후로 형태소는 대문자료표기하는 것이 표준이 됨
In [11]:
nltk.corpus.brown.tagged_words()
Out[11]:
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]
In [12]:
nltk.corpus.brown.tagged_words(tagset='universal')
Out[12]:
[(u'The', u'DET'), (u'Fulton', u'NOUN'), ...]
In [13]:
print(nltk.corpus.nps_chat.tagged_words())
[(u'now', 'RB'), (u'im', 'PRP'), (u'left', 'VBD'), ...]
In [14]:
 nltk.corpus.conll2000.tagged_words()
Out[14]:
[(u'Confidence', u'NN'), (u'in', u'IN'), ...]
In [15]:
nltk.corpus.treebank.tagged_words()
Out[15]:
[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), ...]
  • Not all corpora employ the same set of tags
  • Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to the “Universal Tagset
  • 코퍼스 마다 다른 태그셋을 사용/ 이를 통일하기 위해 tagset=’universal’을 사용
In [16]:
nltk.corpus.brown.tagged_words(tagset='universal')
Out[16]:
[(u'The', u'DET'), (u'Fulton', u'NOUN'), ...]
In [17]:
nltk.corpus.treebank.tagged_words(tagset='universal')
Out[17]:
[(u'Pierre', u'NOUN'), (u'Vinken', u'NOUN'), ...]

2.3 A Universal Part-of-Speech Tagset

Tag Meaning English Examples
ADJ adjective new, good, high, special, big, local
ADP adposition on, of, at, with, by, into, under
ADV adverb really, already, still, early, now
CONJ conjunction and, or, but, if, while, although
DET determiner, article the, a, some, most, every, no, which
NOUN noun year, home, costs, time, Africa
NUM numeral twenty-four, fourth, 1991, 14:24
PRT particle at, on, out, over per, that, up, with
PRON pronoun he, their, her, its, my, I, us
VERB verb is, say, told, given, playing, would
. punctuation marks . , ; !
X other ersatz, esprit, dunno, gr8, univeristy

브라운 코퍼스에서 가장 많이 출현한 형태소확인

In [18]:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()
Out[18]:
[(u'NOUN', 30654),
 (u'VERB', 14399),
 (u'ADP', 12355),
 (u'.', 11928),
 (u'DET', 11389),
 (u'ADJ', 6706),
 (u'ADV', 3349),
 (u'CONJ', 2717),
 (u'PRON', 2535),
 (u'PRT', 2264),
 (u'NUM', 2166),
 (u'X', 92)]

2.4 Nouns

word After a determiner Subject od the verb
woman the woman who I saw yesterday … the woman sat down
Scotland the Scotland I remember as a child Scotland has five million people
book the book I bought yesterday this book recounts the colonization of Australia
intelligence the intelligence displayed by the child … Mary’s intelligence impressed her teachers
  • 2개 단어 연쇄를 바탕으로 태깅을 할수 있다.
  • To begin with, we construct a list of bigrams whose members are themselves word-tag pairs such as ((‘The’, ‘DET’), (‘Fulton’, ‘NP’)) and ((‘Fulton’, ‘NP’), (‘County’, ‘N’)).
  • Then we construct a FreqDist from the tag parts of the bigrams.
  • 2개 단어연쇄에서 형태소를 빼와서 NOUN 앞에 오는 형태소를 뽑아봄
In [19]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
[tag for (tag, _) in fdist.most_common()]
Out[19]:
[u'NOUN',
 u'DET',
 u'ADJ',
 u'ADP',
 u'.',
 u'VERB',
 u'CONJ',
 u'NUM',
 u'ADV',
 u'PRT',
 u'PRON',
 u'X']

2.5 Verbs

2.6 Adjectives and Adverbs

2.7 Unsimplified Tags

In [20]:
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())
In [21]:
tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
(u'NN', [(u'year', 137), (u'time', 97), (u'state', 88), (u'week', 85), (u'man', 72)])
(u'NN$', [(u"year's", 13), (u"world's", 8), (u"state's", 7), (u"nation's", 6), (u"company's", 6)])
(u'NN$-HL', [(u"Golf's", 1), (u"Navy's", 1)])
(u'NN$-TL', [(u"President's", 11), (u"University's", 3), (u"League's", 3), (u"Gallery's", 3), (u"Army's", 3)])
(u'NN-HL', [(u'cut', 2), (u'Salary', 2), (u'condition', 2), (u'Question', 2), (u'business', 2)])
(u'NN-NC', [(u'eva', 1), (u'ova', 1), (u'aya', 1)])
(u'NN-TL', [(u'President', 88), (u'House', 68), (u'State', 59), (u'University', 42), (u'City', 41)])
(u'NN-TL-HL', [(u'Fort', 2), (u'City', 1), (u'Commissioner', 1), (u'Grove', 1), (u'House', 1)])
(u'NNS', [(u'years', 101), (u'members', 69), (u'people', 52), (u'sales', 51), (u'men', 46)])
(u'NNS$', [(u"children's", 7), (u"women's", 5), (u"men's", 3), (u"janitors'", 3), (u"taxpayers'", 2)])
(u'NNS$-HL', [(u"Dealers'", 1), (u"Idols'", 1)])
(u'NNS$-TL', [(u"Women's", 4), (u"States'", 3), (u"Giants'", 2), (u"Officers'", 1), (u"Bombers'", 1)])
(u'NNS-HL', [(u'years', 1), (u'idols', 1), (u'Creations', 1), (u'thanks', 1), (u'centers', 1)])
(u'NNS-TL', [(u'States', 38), (u'Nations', 11), (u'Masters', 10), (u'Rules', 9), (u'Communists', 9)])
(u'NNS-TL-HL', [(u'Nations', 1)])

2.8 Exploring Tagged Corpora

  • use the tagged_words() method to look at the part-of-speech tag of the following words:
  • 특정 단어 다음에 오는 태그를 확인할 수 있다.
  • often다음에오는 단어들의 태그
In [22]:
brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()
VERB  ADV  ADP  ADJ    .  PRT 
  37    8    7    6    4    2 

3. Mapping Words to Properties Using Python Dictionaries

  • 3.1 Indexing Lists vs Dictionaries
  • 3.2 Dictionaries in Python
  • 3.3 Defining Dictionaries
  • 3.4 Default Dictionaries
  • 3.5 Incrementally Updating a Dictionary
  • 3.6 Complex Keys and Values
  • 3.7 Inverting a Dictionary

4. Automatic Tagging

  • 태그는 문장에서 단어가 출현한 맥락에 의존하기 때문에 문장을 살펴본다.
  • the tag of a word depends on the word and its context within a sentence.
  • For this reason, we will be working with data at the level of (tagged) sentences rather than words.
In [23]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

4.1 The Default Tagger

  • The simplest possible tagger assigns the same tag to each token.
  • 문서에서 가장 많이 나온 태그로 다 똑같이 붙여 버림
In [24]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
nltk.FreqDist(tags).max()
Out[24]:
u'NN'
In [25]:
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)
Out[25]:
[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('and', 'NN'),
 ('ham', 'NN'),
 (',', 'NN'),
 ('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('them', 'NN'),
 ('Sam', 'NN'),
 ('I', 'NN'),
 ('am', 'NN'),
 ('!', 'NN')]
In [26]:
default_tagger.evaluate(brown_tagged_sents)
Out[26]:
0.13089484257215028

4.2 The Regular Expression Tagger

  • The regular expression tagger assigns tags to tokens on the basis of matching patterns.
  • 특정 패턴이 발견되는 단어를 찾아 태깅한다.
In [27]:
patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                     # nouns (default)
]
In [28]:
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
Out[28]:
[(u'``', 'NN'),
 (u'Only', 'NN'),
 (u'a', 'NN'),
 (u'relative', 'NN'),
 (u'handful', 'NN'),
 (u'of', 'NN'),
 (u'such', 'NN'),
 (u'reports', 'NNS'),
 (u'was', 'NNS'),
 (u'received', 'VBD'),
 (u"''", 'NN'),
 (u',', 'NN'),
 (u'the', 'NN'),
 (u'jury', 'NN'),
 (u'said', 'NN'),
 (u',', 'NN'),
 (u'``', 'NN'),
 (u'considering', 'VBG'),
 (u'the', 'NN'),
 (u'widespread', 'NN'),
 (u'interest', 'NN'),
 (u'in', 'NN'),
 (u'the', 'NN'),
 (u'election', 'NN'),
 (u',', 'NN'),
 (u'the', 'NN'),
 (u'number', 'NN'),
 (u'of', 'NN'),
 (u'voters', 'NNS'),
 (u'and', 'NN'),
 (u'the', 'NN'),
 (u'size', 'NN'),
 (u'of', 'NN'),
 (u'this', 'NNS'),
 (u'city', 'NN'),
 (u"''", 'NN'),
 (u'.', 'NN')]
In [29]:
regexp_tagger.evaluate(brown_tagged_sents)
Out[29]:
0.20326391789486245

4.3 The Lookup Tagger

  • Let’s find the hundred most frequent words and store their most likely tag.
  • We can then use this information as the model for a “lookup tagger”
  • 이미 태깅된 단어들의 정보를 바탕으로 가장 가능성이 높은 것으로 태깅함
In [30]:
fd = nltk.FreqDist(brown.words(categories='news')) #브라운코퍼스 빈출워드
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news')) #태그된 단어들의 조건부 빈도
most_freq_words = fd.most_common(100) #제일 많이 나온 단어 100
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words) #많이 나온 단어 100의 POS 
baseline_tagger = nltk.UnigramTagger(model=likely_tags) #100개에서 추출한 
baseline_tagger.evaluate(brown_tagged_sents)
Out[30]:
0.45578495136941344
In [31]:
sent = brown.sents(categories='news')[3]
baseline_tagger.tag(sent)
Out[31]:
[(u'``', u'``'),
 (u'Only', None),
 (u'a', u'AT'),
 (u'relative', None),
 (u'handful', None),
 (u'of', u'IN'),
 (u'such', None),
 (u'reports', None),
 (u'was', u'BEDZ'),
 (u'received', None),
 (u"''", u"''"),
 (u',', u','),
 (u'the', u'AT'),
 (u'jury', None),
 (u'said', u'VBD'),
 (u',', u','),
 (u'``', u'``'),
 (u'considering', None),
 (u'the', u'AT'),
 (u'widespread', None),
 (u'interest', None),
 (u'in', u'IN'),
 (u'the', u'AT'),
 (u'election', None),
 (u',', u','),
 (u'the', u'AT'),
 (u'number', None),
 (u'of', u'IN'),
 (u'voters', None),
 (u'and', u'CC'),
 (u'the', u'AT'),
 (u'size', None),
 (u'of', u'IN'),
 (u'this', u'DT'),
 (u'city', None),
 (u"''", u"''"),
 (u'.', u'.')]
  • 베이스라인 태거를 평가해봅시다.
In [34]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
baseline_tagger.evaluate(brown_tagged_sents)
Out[34]:
0.5817769556656125
  • 베이스라인 태거를 써봅시다.
In [35]:
sent = brown.sents(categories='news')[3]
baseline_tagger.tag(sent)
Out[35]:
[(u'``', u'``'),
 (u'Only', 'NN'),
 (u'a', u'AT'),
 (u'relative', 'NN'),
 (u'handful', 'NN'),
 (u'of', u'IN'),
 (u'such', 'NN'),
 (u'reports', 'NN'),
 (u'was', u'BEDZ'),
 (u'received', 'NN'),
 (u"''", u"''"),
 (u',', u','),
 (u'the', u'AT'),
 (u'jury', 'NN'),
 (u'said', u'VBD'),
 (u',', u','),
 (u'``', u'``'),
 (u'considering', 'NN'),
 (u'the', u'AT'),
 (u'widespread', 'NN'),
 (u'interest', 'NN'),
 (u'in', u'IN'),
 (u'the', u'AT'),
 (u'election', 'NN'),
 (u',', u','),
 (u'the', u'AT'),
 (u'number', 'NN'),
 (u'of', u'IN'),
 (u'voters', 'NN'),
 (u'and', u'CC'),
 (u'the', u'AT'),
 (u'size', 'NN'),
 (u'of', u'IN'),
 (u'this', u'DT'),
 (u'city', 'NN'),
 (u"''", u"''"),
 (u'.', u'.')]
In [36]:
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

display()

5. N-Gram Tagging

5.1 Unigram Tagging

  • Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
  • For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective
In [38]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])
Out[38]:
[(u'Various', u'JJ'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'apartments', u'NNS'),
 (u'are', u'BER'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'terrace', u'NN'),
 (u'type', u'NN'),
 (u',', u','),
 (u'being', u'BEG'),
 (u'on', u'IN'),
 (u'the', u'AT'),
 (u'ground', u'NN'),
 (u'floor', u'NN'),
 (u'so', u'QL'),
 (u'that', u'CS'),
 (u'entrance', u'NN'),
 (u'is', u'BEZ'),
 (u'direct', u'JJ'),
 (u'.', u'.')]
In [39]:
unigram_tagger.evaluate(brown_tagged_sents)
Out[39]:
0.9349006503968017

5.2 Separating the Training and Testing Data

  • we are training a tagger on some data
  • A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would also be useless for tagging new text.
In [41]:
size = int(len(brown_tagged_sents) * 0.9)
size
Out[41]:
4160
In [42]:
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
Out[42]:
0.8120203329014253

5.3 General N-Gram Tagging

  • n그램 태거는 유니그램태거의 일반화된 방법으로 1그램 태거는 유니그램/2그램태거는 바이그램 /3그램 태거는 트라이그램태거와 같다
  • n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens
  • 1-gram tagger is another term for a unigram tagger 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers
  • In the example of an n-gram tagger shown in 5.1, we have n=3
  • we consider the tags of the two preceding words in addition to the current word.
  • An n-gram tagger picks the tag that is most likely in the given context.

In [44]:
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])
Out[44]:
[(u'Various', u'JJ'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'apartments', u'NNS'),
 (u'are', u'BER'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'terrace', u'NN'),
 (u'type', u'NN'),
 (u',', u','),
 (u'being', u'BEG'),
 (u'on', u'IN'),
 (u'the', u'AT'),
 (u'ground', u'NN'),
 (u'floor', u'NN'),
 (u'so', u'CS'),
 (u'that', u'CS'),
 (u'entrance', u'NN'),
 (u'is', u'BEZ'),
 (u'direct', u'JJ'),
 (u'.', u'.')]
In [45]:
unseen_sent = brown_sents[4203]
bigram_tagger.tag(unseen_sent)
Out[45]:
[(u'The', u'AT'),
 (u'population', u'NN'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'Congo', u'NP'),
 (u'is', u'BEZ'),
 (u'13.5', None),
 (u'million', None),
 (u',', None),
 (u'divided', None),
 (u'into', None),
 (u'at', None),
 (u'least', None),
 (u'seven', None),
 (u'major', None),
 (u'``', None),
 (u'culture', None),
 (u'clusters', None),
 (u"''", None),
 (u'and', None),
 (u'innumerable', None),
 (u'tribes', None),
 (u'speaking', None),
 (u'400', None),
 (u'separate', None),
 (u'dialects', None),
 (u'.', None)]
In [46]:
bigram_tagger.evaluate(test_sents)
Out[46]:
0.10276088906608193

5.4 Combining Taggers

  • combine the results of a bigram tagger, a unigram tagger, and a default tagger
    • 1) Try tagging the token with the bigram tagger.
    • 2) If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
    • 3) If the unigram tagger is also unable to find a tag, use a default tagger.
In [47]:
t0 = nltk.DefaultTagger('NN')  #다 명사로 붙임
t1 = nltk.UnigramTagger(train_sents, backoff=t0) #유니그램 태거
t2 = nltk.BigramTagger(train_sents, backoff=t1)  #바이그램 태거
t2.evaluate(test_sents)
Out[47]:
0.844911791089405

5.5 Tagging Unknown Words

  • 모르는 단어는 일단 Unknown으로(UNK) 태그 붙인담에 n그램 태그 학습후 태깅

5.6 Storing Taggers

  • 매번 학습하는 것보다 학습된 태거를 저장했다가 쓰면 편함
  • 태거를 저장해봅시다.
  • Let’s save our tagger t2 to a file t2.pkl
In [48]:
# 저장
from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()
In [49]:
# 로드
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()
In [50]:
text = """The board's action shows what free enterprise is up against in our complex maze of regulatory laws ."""
tokens = text.split()
tagger.tag(tokens)
Out[50]:
[('The', u'AT'),
 ("board's", u'NN$'),
 ('action', 'NN'),
 ('shows', u'NNS'),
 ('what', u'WDT'),
 ('free', u'JJ'),
 ('enterprise', 'NN'),
 ('is', u'BEZ'),
 ('up', u'RP'),
 ('against', u'IN'),
 ('in', u'IN'),
 ('our', u'PP$'),
 ('complex', u'JJ'),
 ('maze', 'NN'),
 ('of', u'IN'),
 ('regulatory', 'NN'),
 ('laws', u'NNS'),
 ('.', u'.')]

5.7 Performance Limitations

  • n그램 태거의 성능의 한계는 얼마일까?
  • 예재(3그램)에서는 5%저도 모호한 상황이 생긴다.
  • 잘못태그된 빈도를 컨퓨전매트릭스로 확인할수 있다.

6. Transformation-Based Tagging

  • A potential issue with n-gram taggers is the size of n-gram table
  • A second issue concerns context.() ### Brill tagging
  • Brill tagging, an inductive tagging method
  • Brill tagging performs very well using only a tiny fraction of the size of n-gram taggers
  • We will examine the operation of two rules
    • (a) Replace NN with VB when the previous word is TO
    • (b) Replace TO with IN when the next tag is NNS.
    • first tagging with the unigram tagger, then applying the rules to fix the errors.
  • Brill taggers have another interesting property: the rules are linguistically interpretable.
  • Brill taggers have another interesting property: the rules are linguistically interpretable.
Phrase to increase grants to states for vocational rehabilitation
Unigram TO NN NNS TO NNS IN JJ NN
Rule 1 VB
Rule 2 IN
Output TO VB NNS IN NNS IN JJ NN
Gold TO VB NNS IN NNS IN JJ NN
In [ ]:
nltk.tag.brill.demo()

7. How to Determine the Category of a Word

7.1 Morphological Clues : -ness, -ment

7.2 Syntactic Clues : before a noun

7.3 Semantic Clues : 신조어는 명사

7.4 New Words :

7.5 Morphology in Part of Speech Tagsets

Form Category Tag
go base VB
goes 3rd singular present VBZ
gone past participle VBN
going gerund VBG
went simple past VBD

댓글 남기기

이메일은 공개되지 않습니다. 필수 입력창은 * 로 표시되어 있습니다