[Python] Ch2. Accessing Text Corpora and Lexical Resources (NLTK Book)

2. Accessing Text Corpora and Lexical Resources

Practical work in natural language processing typically uses large bodies of linguistic data, or corpora.

  • Goals of this chapter
    1. What are some useful text corpora and lexical resources, and how can we access them with Python?
    2. Which Python constructs are most helpful for this work?
    3. How do we avoid repeating ourselves when writing Python code?

1. Accessing Text Corpora

1.1 Gutenberg Corpus

  • NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which holds some 25,000 free electronic books
In [1]:
import nltk
In [2]:
from nltk.corpus import gutenberg
In [3]:
emma = gutenberg.words("austen-emma.txt")
In [4]:
[Emma by Jane Austen 1816]



Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o
  • Indexing the text
    • Unlike in Chapter 1, words loaded through nltk.corpus must be wrapped in nltk.Text before the concordance method can be used
  • Example
In [5]:
emma1 = nltk.Text(emma)
emma1.concordance("surprize")
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; 
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr . 
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
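  • A quick look at the files in the Gutenberg selection
In [6]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # average word length, average sentence length, average uses per vocabulary item
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)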
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt
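The three figures in each row above can be read as rounded ratios of the counts computed in the loop: characters per word, words per sentence, and words per distinct vocabulary item. A minimal sketch on toy data, assuming that reading (the function name corpus_stats and the sample text are made up for illustration):

```python
def corpus_stats(raw, words, sents):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item)."""
    vocab = set(w.lower() for w in words)
    return (
        round(len(raw) / len(words)),     # characters per token; raw text includes spaces
        round(len(words) / len(sents)),   # tokens per sentence
        round(len(words) / len(vocab)),   # how often each distinct word is used on average
    )

# Toy stand-ins for gutenberg.raw(), gutenberg.words(), and gutenberg.sents().
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
words = [w for s in sents for w in s]
raw = "The cat sat. The cat ran."

print(corpus_stats(raw, words, sents))
```

Because the raw text counts whitespace, the first figure slightly overstates the true average word length.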
  • gutenberg.sents
In [7]:
macbeth_sentences = gutenberg.sents("shakespeare-macbeth.txt")
longest_len = max(len(s) for s in macbeth_sentences)
print([s for s in macbeth_sentences if len(s) == longest_len])
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...] 

['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';', 'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble'] 

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]
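  • Besides words(), raw(), and sents(), most NLTK corpus readers offer further access methods
    • some corpora provide richer content such as part-of-speech tags, dialogue tags, and syntactic trees
    • these are covered in later chapters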

1.2 Web and Chat Text

  • Web data: six files
    1. a Firefox discussion forum
    2. a King Arthur film script (grail.txt, Monty Python and the Holy Grail)
    3. conversations overheard in New York
    4. the movie script of Pirates of the Caribbean
    5. personal advertisements
    6. wine reviews
In [8]:
from nltk.corpus import webtext as web
print(web.fileids(), "\n")
# show the first 65 characters of each file
for fileid in web.fileids():
    print(fileid, web.raw(fileid)[:65],"\n")
['firefox.txt', 'grail.txt', 'overheard.txt', 'pirates.txt', 'singles.txt', 'wine.txt'] 

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se 

grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop 

overheard.txt White guy: So, do you have any plans for this evening?
Asian girl 

pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr 

singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun 

wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb 

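  • Chat data
    • Data source
      • instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators
    • Over 10,000 posts; usernames are replaced with generic names and personal information is removed
    • 15 files in total, each holding the posts collected on one date from one age-specific chatroom
    • File names encode the date, the chatroom, and the number of posts
      • e.g., 10-19-20s_706posts.xml holds 706 posts from the 20s chatroom on 2006/10/19
    • All data were collected in 2006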
In [9]:
from nltk.corpus import nps_chat as chat
print("FILE: ",chat.fileids(),"\n")
chatroom = chat.posts('10-19-20s_706posts.xml')
print("Example: ",chatroom[123])
FILE:  ['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml'] 

Example:  ['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
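Since each chat file name encodes the date, the chatroom, and the post count, the metadata can be recovered from the name alone. A small sketch using only the re module; the function name parse_chat_fileid and the regular expression are illustrative assumptions, and the year is supplied separately because the file names do not contain it:

```python
import re

# Pattern for names like "10-19-20s_706posts.xml": month-day-room_<count>posts.xml
FILEID_RE = re.compile(r"(\d{2})-(\d{2})-(\w+?)_(\d+)posts\.xml")

def parse_chat_fileid(fileid, year=2006):
    """Split an NPS Chat file name into date, chatroom, and post count."""
    match = FILEID_RE.fullmatch(fileid)
    if match is None:
        raise ValueError(f"unexpected file name: {fileid}")
    month, day, room, count = match.groups()
    return {"date": f"{year}-{month}-{day}", "room": room, "posts": int(count)}

print(parse_chat_fileid("10-19-20s_706posts.xml"))
print(parse_chat_fileid("11-09-teens_706posts.xml"))
```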

1.3 Brown Corpus

  • first million-word electronic corpus of English, created in 1961 at Brown University
  • Contains text from 500 sources, categorized by genre
  • Example document for each section of the corpus:
ID   File  Genre            Description
A16  ca16  news             Chicago Tribune: Society Reportage
B02  cb02  editorial        Christian Science Monitor: Editorials
C17  cc17  reviews          Time Magazine: Reviews
D12  cd12  religion         Underwood: Probing the Ethics of Realtors
E36  ce36  hobbies          Norling: Renting a Car in Europe
F25  cf25  lore             Boroff: Jewish Teenage Culture
G22  cg22  belles_lettres   Reiner: Coping with Runaway Technology
H15  ch15  government       US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17  cj19  learned          Mosteller: Probability with Statistical Applications
K04  ck04  fiction          W.E.B. Du Bois: Worlds of Color
L13  cl13  mystery          Hitchens: Footsteps in the Night
M01  cm01  science_fiction  Heinlein: Stranger in a Strange Land
N14  cn15  adventure        Field: Rattlesnake Ridge
P12  cp12  romance          Callaghan: A Passion in Rome
R06  cr06  humor            Thurber: The Future, If Any, of Comedy
In [10]:
from nltk.corpus import brown
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] 

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] 

['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...] 

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
  • The Brown Corpus is a convenient resource for studying systematic differences between genres
    • Counting how often particular modal verbs are used
In [11]:
news_text = brown.words(categories="news")
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ["can","could","may","might","must","will"]
for m in modals:
    print(m+":",fdist[m],end = " ")
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
  • Counts for each genre
In [12]:
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories = genre))
genres = ["news","religion","hobbies","science_fiction","romance","humor"]
# modals is the list defined in In [11]
cfd.tabulate(conditions=genres, samples=modals)
                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 
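In [11] lowercases every token before counting, while the table from In [12] counts tokens exactly as they appear. That is probably why the news figures differ between the two outputs (may: 93 versus 66, for example): capitalized forms such as a sentence-initial "May" only count in the lowercased version. A toy illustration with collections.Counter, using made-up tokens rather than Brown Corpus data:

```python
from collections import Counter

tokens = ["May", "I", "go", "?", "You", "may", ".", "In", "May", ",", "we", "may", "."]

raw_counts = Counter(tokens)                         # case-sensitive counts
folded_counts = Counter(t.lower() for t in tokens)   # counts after lowercasing

print(raw_counts["may"], folded_counts["may"])
```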

1.4 Reuters Corpus

  • 10,788 news documents totaling 1.3 million words
  • Classified into 90 topics and split into training and test sets
In [13]:
from nltk.corpus import reuters as rt
['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839'] 

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
  • Unlike the Brown Corpus, categories overlap: one document can belong to several topics
In [14]:
['barley', 'corn', 'grain', 'wheat'] 


['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', 'test/18024', 'test/18263', 'test/18908', 'test/19275', 'test/19668', 'training/10175', 'training/1067', 'training/11208', 'training/11316', 'training/11885', 'training/12428', 'training/13099', 'training/13744', 'training/13795', 'training/13852', 'training/13856', 'training/1652', 'training/1970', 'training/2044', 'training/2171', 'training/2172', 'training/2191', 'training/2217', 'training/2232', 'training/3132', 'training/3324', 'training/395', 'training/4280', 'training/4296', 'training/5', 'training/501', 'training/5467', 'training/5610', 'training/5640', 'training/6626', 'training/7205', 'training/7579', 'training/8213', 'training/8257', 'training/8759', 'training/9865', 'training/9958'] 
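Overlapping categories mean the document-to-topic relation is many-to-many, so lookups are useful in both directions. A sketch with hypothetical toy labels (doc1, doc2, doc3 are not real Reuters file ids) showing how one mapping can be inverted to answer the opposite question:

```python
from collections import defaultdict

# Hypothetical toy labels, not real Reuters data: each document lists its topics.
doc_topics = {
    "doc1": ["barley", "corn", "grain", "wheat"],
    "doc2": ["barley", "grain"],
    "doc3": ["gold"],
}

# Invert the mapping so we can also ask which documents cover a given topic.
topic_docs = defaultdict(list)
for doc, topics in doc_topics.items():
    for topic in topics:
        topic_docs[topic].append(doc)

print(doc_topics["doc1"])     # topics of one document
print(topic_docs["barley"])   # documents for one topic
```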

  • Each document also includes its title
    • titles are stored in uppercase
In [15]:
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']

1.5 Inaugural Address Corpus

  • One text per inaugural address; the listing below shows 56 files, from 1789 to 2009
In [16]:
import matplotlib.pyplot as plt
%matplotlib nbagg
In [17]:
>>> from nltk.corpus import inaugural
>>> print(inaugural.fileids(),"\n")

>>> print([fileid[:4] for fileid in inaugural.fileids()])
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt'] 

['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']
  • Plotting how often words beginning with america and citizen occur in each address
In [18]:
>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))
>>> cfd.plot()

1.6 Annotated Text Corpora

  • Examples of corpora accessible through NLTK
Corpus                                 Compiler               Contents
Brown Corpus                           Francis, Kucera        15 genres, 1.15M words, tagged, categorized
CESS Treebanks                         CLiC-UB                1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files                     Pereira & Warren       World Geographic Database
CMU Pronouncing Dictionary             CMU                    127k entries
CoNLL 2000 Chunking Data               CoNLL                  270k words, tagged and chunked
CoNLL 2002 Named Entity                CoNLL                  700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel)  CoNLL                  150k words, dependency parsed (Basque, Catalan)
Dependency Treebank                    Narad                  Dependency parsed version of Penn Treebank sample
FrameNet                               Fillmore, Baker et al  10k word senses, 170k manually annotated sentences
Floresta Treebank                      Diana Santos et al     9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists                        Various                Lists of cities and countries
Genesis Corpus                         Misc web sources       6 texts, 200k words, 6 languages
Gutenberg (selections)                 Hart, Newby, et al     18 texts, 2M words
Inaugural Address Corpus               CSpan                  US Presidential Inaugural Addresses (1789-present)
Indian POS-Tagged Corpus               Kumaran et al          60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus                       NILC, USP, Brazil      1M words, tagged (Brazilian Portuguese)
Movie Reviews                          Pang, Lee              2k movie reviews with sentiment polarity classification
Names Corpus                           Kantrowitz, Ross       8k male and female names
NIST 1999 Info Extr (selections)       Garofolo               63k words, newswire and named-entity SGML markup
Nombank                                Meyers                 115k propositions, 1400 noun frames
NPS Chat Corpus                        Forsyth, Martell       10k IM chat posts, POS-tagged and dialogue-act tagged
Open Multilingual WordNet              Bond et al             15 languages, aligned to English WordNet
PP Attachment Corpus                   Ratnaparkhi            28k prepositional phrases, tagged as noun or verb modifiers
Proposition Bank                       Palmer                 113k propositions, 3300 verb frames
Question Classification                Li, Roth               6k questions, categorized
Reuters Corpus                         Reuters                1.3M words, 10k news documents, categorized
Roget’s Thesaurus                      Project Gutenberg      200k words, formatted text
RTE Textual Entailment                 Dagan et al            8k sentence pairs, categorized
SEMCOR                                 Rus, Mihalcea          880k words, part-of-speech and sense tagged
Senseval 2 Corpus                      Pedersen               600k words, part-of-speech and sense tagged
SentiWordNet                           Esuli, Sebastiani      sentiment scores for 145k WordNet synonym sets
Shakespeare texts (selections)         Bosak                  8 books in XML format
State of the Union Corpus              CSPAN                  485k words, formatted text
Stopwords Corpus                       Porter et al           2,400 stopwords for 11 languages
Swadesh Corpus                         Wiktionary             comparative wordlists in 24 languages
Switchboard Corpus (selections)        LDC                    36 phonecalls, transcribed, parsed
Univ Decl of Human Rights              United Nations         480k words, 300+ languages
Penn Treebank (selections)             LDC                    40k words, tagged and parsed
TIMIT Corpus (selections)              NIST/LDC               audio files and transcripts for 16 speakers
VerbNet 2.1                            Palmer et al           5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus                        OpenOffice.org et al   960k words and 20k affixes for 8 languages
WordNet 3.0 (English)                  Miller, Fellbaum       145k synonym sets

1.7 Corpora in Other Languages

  • NLTK also comes with corpora in many other languages
In [19]:
['El', 'grupo', 'estatal', 'Electricité_de_France', ...] 

['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...] 

['पूर्ण', 'प्रतिबंध', 'हटाओ', ':', 'इराक', 'संयुक्त', ...] 

['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1', 'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1', 'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1'] 

['세', '계', '인', '권', '선', '언', '전', '문', '모든', '인류', '구성원의', '천부의', '존엄성과', '동등하고'] 

['Saben', 'umat', 'manungsa', 'lair', 'kanthi', 'hak', ...] 

  • udhr
    • the Universal Declaration of Human Rights in over 300 languages
In [20]:
>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...           (lang, len(word))
...           for lang in languages
...           for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)

1.8 Text Corpus Structure

  • Basic corpus functionality in NLTK
Example                    Description
fileids()                  the files of the corpus
fileids([categories])      the files of the corpus corresponding to these categories
categories()               the categories of the corpus
categories([fileids])      the categories of the corpus corresponding to these files
raw()                      the raw content of the corpus
raw(fileids=[f1,f2,f3])    the raw content of the specified files
raw(categories=[c1,c2])    the raw content of the specified categories
words()                    the words of the whole corpus
words(fileids=[f1,f2,f3])  the words of the specified fileids
words(categories=[c1,c2])  the words of the specified categories
sents()                    the sentences of the whole corpus
sents(fileids=[f1,f2,f3])  the sentences of the specified fileids
sents(categories=[c1,c2])  the sentences of the specified categories
abspath(fileid)            the location of the given file on disk
encoding(fileid)           the encoding of the file (if known)
open(fileid)               open a stream for reading the given corpus file
root                       the path to the root of the locally installed corpus
readme()                   the contents of the README file of the corpus

1.9 Loading your own Corpus

  • Using PlaintextCorpusReader
    1. Set corpus_root to the directory that holds the files
    2. Create a PlaintextCorpusReader for that directory
In [21]:
from nltk.corpus import PlaintextCorpusReader as pcr
corpus_root = "./"
wordlists = pcr(corpus_root,".*")
In [22]:
['.ipynb_checkpoints/nltk chap2-checkpoint.ipynb', 'nltk chap2.ipynb', 'thesis.txt'] 

['국문', '요약', '최근', '주요언론에서도', '다룰', '정도로', '부정적인', ...]
In [23]:
[['{', '"', 'cells', '":', '[', '{', '"', 'cell_type', '":', '"', 'markdown', '",', '"', 'metadata', '":', '{},', '"', 'source', '":', '[', '"#', '2', '.'], ['Accessing', 'Text', 'Corpora', 'and', 'Lexical', 'Resources', '\\', 'n', '",', '"\\', 'n', '",', '"', '자연어처리의', '실질적', '작업은', '보통', 'large', 'bodies', 'of', 'linguistic', 'data나', 'corpora를', '사용함', '.\\', 'n', '",', '"\\', 'n', '",', '"+', 'Goal', 'of', 'this', 'chapter', '\\', 'n', '",', '"', '1', '.'], ...]
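The pattern ".*" matches every file under corpus_root, which is why the notebook files appear in the listing above and why their JSON ends up tokenized as sentences. A narrower pattern such as r".*\.txt" would likely keep only the plain-text files. The sketch below imitates the filtering with the re module alone; it assumes the pattern has to match the whole relative file name, which is an assumption about how the reader applies it, not something confirmed here:

```python
import re

# File ids copied from the listing above.
fileids = [
    ".ipynb_checkpoints/nltk chap2-checkpoint.ipynb",
    "nltk chap2.ipynb",
    "thesis.txt",
]

def matching_fileids(pattern, candidates):
    """Keep the candidates whose whole name matches the regular expression."""
    return [f for f in candidates if re.fullmatch(pattern, f)]

print(matching_fileids(r".*", fileids))        # everything, notebooks included
print(matching_fileids(r".*\.txt", fileids))   # plain-text files only
```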
  • Using BracketParseCorpusReader
In [ ]:
>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()

['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]

>>> len(ptb.sents())


>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]

['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',
'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',
'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',
'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']

2. Conditional Frequency Distributions

  • Instead of a single overall frequency count, a conditional frequency distribution compares how often words occur across different genres or categories

2.1 Conditions and Events

  • Every event is observed under some condition
    • so the input is a sequence of (condition, event) pairs
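To make the pairing concrete, here is a minimal stand-in built from the standard library. It is a sketch of the idea, not NLTK's ConditionalFreqDist implementation; the name conditional_counts and the sample pairs are made up:

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Group events by condition: condition -> Counter of events."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table

pairs = [
    ("news", "The"), ("news", "jury"), ("news", "The"),
    ("romance", "afraid"), ("romance", "not"),
]
toy_cfd = conditional_counts(pairs)

print(sorted(toy_cfd))             # the conditions
print(toy_cfd["news"]["The"])      # count of one event under one condition
```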

2.2 Counting Words by Genre

In [24]:
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
  • The full corpus is large, so select only specific genres
In [25]:
genre_word = [(genre,word)
             for genre in ["news","romance"]
             for word in brown.words(categories=genre)]
In [26]:
print(genre_word[:4], "\n")
print(genre_word[-4:])
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] 

[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
  • Building the CFD from just these pairs
In [27]:
cfd = nltk.ConditionalFreqDist(genre_word)
<ConditionalFreqDist with 2 conditions>
['news', 'romance']
In [28]:
<FreqDist with 14394 samples and 100554 outcomes>
<FreqDist with 8452 samples and 70022 outcomes>
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993)]

2.3 Plotting and Tabulating Distributions

  • The inaugural address example from above
In [29]:
>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))
  • The Universal Declaration of Human Rights example from above
In [30]:
>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...           (lang, len(word))
...           for lang in languages
...           for word in udhr.words(lang + '-Latin1'))
  • Comparing English and German
    • conditions
      • selects which conditions to display; defaults to all of them
    • samples
      • limits which samples are shown (here, word lengths)
In [31]:
>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...              samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9 
       English    0  185  525  883  997 1166 1283 1440 1558 1638 
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275 
  • These figures count word tokens, not distinct words (no set() was applied)
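With cumulative=True, each cell in the table is a running total: the number of word tokens whose length is at most the column value. A sketch of that computation on toy word lists (not the UDHR text), using only the standard library; cumulative_length_counts is a made-up helper name:

```python
from collections import Counter
from itertools import accumulate

def cumulative_length_counts(words, max_len):
    """Count tokens of each length 0..max_len-1, then turn the counts into running totals."""
    lengths = Counter(len(w) for w in words)
    per_length = [lengths[n] for n in range(max_len)]
    return list(accumulate(per_length))

# Toy word lists; the real input would be udhr.words(lang + '-Latin1').
texts = {
    "English": ["all", "human", "beings", "are", "born", "free"],
    "German_Deutsch": ["alle", "Menschen", "sind", "frei", "geboren"],
}
for lang, words in texts.items():
    print(lang, cumulative_length_counts(words, 10))
```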

2.4 Generating Random Text with Bigrams

In [32]:
from nltk.util import bigrams
>>> sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
...   'and', 'the', 'earth', '.']
>>> list(nltk.bigrams(sent))
[('In', 'the'),
 ('the', 'beginning'),
 ('beginning', 'God'),
 ('God', 'created'),
 ('created', 'the'),
 ('the', 'heaven'),
 ('heaven', 'and'),
 ('and', 'the'),
 ('the', 'earth'),
 ('earth', '.')]
In [33]:
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
In [34]:
cfd["living"]
FreqDist({',': 1,
          '.': 1,
          'creature': 7,
          'soul': 1,
          'substance': 2,
          'thing': 4})
In [35]:
generate_model(cfd, "living")
living creature that he said , and the land of the land of the land
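The generated text falls into a loop ("the land of the land of the land") because the model always steps to the single most frequent follower, so revisiting a word replays the same path. The standard-library sketch below reproduces the effect on a toy sentence; build_bigram_table and generate are illustrative names, not NLTK functions:

```python
from collections import Counter, defaultdict

def build_bigram_table(words):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table

def generate(table, word, num=8):
    """Greedy generation: always step to the most frequent follower."""
    output = []
    for _ in range(num):
        output.append(word)
        followers = table.get(word)
        if not followers:          # dead end: nothing ever followed this word
            break
        word = followers.most_common(1)[0][0]
    return output

words = "in the land of the land of the sea".split()
table = build_bigram_table(words)
print(" ".join(generate(table, "in")))
```

Choosing the next word at random, weighted by frequency, is one common way to avoid such loops.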
  • Commonly used CFD methods
Example                               Description
cfdist = ConditionalFreqDist(pairs)   create a conditional frequency distribution from a list of pairs
cfdist.conditions()                   the conditions
cfdist[condition]                     the frequency distribution for this condition
cfdist[condition][sample]             frequency for the given sample for this condition
cfdist.tabulate()                     tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)  tabulation limited to the specified samples and conditions
cfdist.plot()                         graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)      graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                     test if samples in cfdist1 occur less frequently than in cfdist2

3. More Python: Reusing Code

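  1. Go back to the code you want and re-run it, one line at a time
    • Code entered at the >>> prompt runs interactively in the interpreter
  2. Save the code in a .py file and import it
  3. Wrap the code in a function and call it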
In [36]:
>>> def lexical_diversity(my_text_data):
...     word_count = len(my_text_data)
...     vocab_size = len(set(my_text_data))
...     diversity_score = vocab_size / word_count
...     return diversity_score
In [37]:
>>> from nltk.corpus import genesis
>>> kjv = genesis.words('english-kjv.txt')
>>> lexical_diversity(kjv)
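Option 2 from the list (saving code in a .py file and importing it) could look roughly like the sketch below. The module name text_utils and the temporary directory are hypothetical choices made only so the example is self-contained; in practice the file would simply sit next to the notebook or script:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents, written to disk only for this sketch.
module_source = '''
def lexical_diversity(my_text_data):
    """Ratio of distinct items to total items."""
    return len(set(my_text_data)) / len(my_text_data)
'''

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "text_utils.py").write_text(module_source)
    sys.path.insert(0, tmp)              # make the directory importable
    importlib.invalidate_caches()
    try:
        text_utils = importlib.import_module("text_utils")
    finally:
        sys.path.remove(tmp)

print(text_utils.lexical_diversity(["a", "b", "a", "c"]))
```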

4. Lexical Resources

  • Lexicon
    • collection of words and/or phrases along with associated information such as part of speech and sense definitions.
  • e.g., a typical lexicon entry (figure not included)

4.1 Wordlist Corpora

  • Corpora that consist of nothing more than lists of words
    • useful for finding unusual or misspelled words in a text
In [38]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)
In [39]:
['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused']
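The sample output includes ordinary inflected forms such as abilities and abused, which suggests the reference word list holds mainly base forms, so a missing word is not necessarily a misspelling. The sketch below shows the same set difference on toy data; the two-argument signature and the tiny word list are assumptions made to keep the example free of corpus downloads:

```python
def unusual_words(text, known_words):
    """Return alphabetic tokens from `text` that are missing from `known_words`."""
    text_vocab = set(w.lower() for w in text if w.isalpha())
    known_vocab = set(w.lower() for w in known_words)
    return sorted(text_vocab - known_vocab)

# Toy word list standing in for nltk.corpus.words.words().
known = ["ability", "abuse", "the", "of"]
tokens = ["The", "abilities", "of", "Abbeyland", ",", "abused"]

print(unusual_words(tokens, known))
```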
  • stopwords
In [40]:
from nltk.corpus import stopwords
print(stopwords.words("english"))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
In [41]:
>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
>>> content_fraction(nltk.corpus.reuters.words())
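On a corpus the size of Reuters, testing membership in a list for every token can be slow; converting the stopword list to a set first is a common adjustment. A toy version of the same calculation, with a handful of stopwords taken from the English list shown earlier (the second parameter is an assumption added so the sketch runs without NLTK data):

```python
def content_fraction(text, stopword_list):
    """Fraction of tokens that are not stopwords."""
    stopword_set = set(stopword_list)   # set lookups are constant time on average
    content = [w for w in text if w.lower() not in stopword_set]
    return len(content) / len(text)

toy_stopwords = ["the", "a", "of", "and", "is"]
tokens = ["The", "price", "of", "gold", "is", "rising"]

print(content_fraction(tokens, toy_stopwords))
```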

4.2 A Pronouncing Dictionary

In [42]:
>>> entries = nltk.corpus.cmudict.entries()
>>> len(entries)
In [43]:
>>> for entry in entries[42371:42379]:
...     print(entry)
('fir', ['F', 'ER1'])
('fire', ['F', 'AY1', 'ER0'])
('fire', ['F', 'AY1', 'R'])
('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])
('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z'])
('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z'])
('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L'])
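Each entry pairs a word with a list of phones, and the digit attached to a vowel marks its stress (1 primary, 2 secondary, 0 unstressed). A small sketch that extracts the stress pattern from entries copied from the output above; the helper name stress is an illustrative choice:

```python
def stress(pron):
    """Collect the stress digits (0, 1, or 2) from a list of phones."""
    return [char for phone in pron for char in phone if char.isdigit()]

# Entries copied from the output above.
entries = [
    ("fire", ["F", "AY1", "ER0"]),
    ("fire", ["F", "AY1", "R"]),
    ("firearm", ["F", "AY1", "ER0", "AA2", "R", "M"]),
]
for word, pron in entries:
    print(word, len(pron), stress(pron))
```

Note that a word can appear more than once when it has several pronunciations.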

4.3 Comparative Wordlists

  • Swadesh wordlists
    • about 200 common words, listed in several languages
In [44]:
from nltk.corpus import swadesh
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk'] 

['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short', 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bird', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rope', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth', 'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat', 'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'die', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', 'turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'float', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'cloud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black', 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smooth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name']
In [45]:
>>> fr2en = swadesh.entries(['fr', 'en'])
>>> print(fr2en[:6],"\n")

>>> translate = dict(fr2en)
>>> print(translate['chien'],"\n")

>>> print(translate['jeter'])
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they')] 
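Turning the pairs into a dict gives a simple word-for-word translator, but each key is the whole source string, so an entry like 'tu, vous' cannot be looked up as 'tu'. The sketch below uses the pairs shown in the output above; splitting the variants apart is an optional refinement assumed for illustration, not something NLTK does:

```python
# Pairs copied from the fr2en output above.
fr2en = [
    ("je", "I"),
    ("tu, vous", "you (singular), thou"),
    ("il", "he"),
    ("nous", "we"),
    ("vous", "you (plural)"),
    ("ils, elles", "they"),
]
translate = dict(fr2en)

print(translate["nous"])
print(translate.get("tu"))     # None: the key is the full string "tu, vous"

# Optional refinement (an assumption): index every comma-separated variant,
# keeping the first translation seen for each one.
expanded = {}
for source, target in fr2en:
    for variant in source.split(", "):
        expanded.setdefault(variant, target)

print(expanded["tu"])
```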



4.4 Shoebox and Toolbox Lexicons

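  • Toolbox (formerly Shoebox) is a tool linguists use to manage their data. An entry is a series of field marker/value pairs, for example:
    • ps: part of speech
    • ge: gloss into English
    • ex: example sentence in Rotokas
    • xp: translation of the example into Tok Pisin
    • xe: translation of the example into English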
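
Because a field marker such as `ex` can occur more than once in an entry, the fields arrive as a list of (marker, value) pairs rather than a dictionary. The sketch below illustrates this with a hand-copied subset of the Rotokas entries shown in this section. `field_values` is a hypothetical helper, and the headword `'kaa'` is an assumption taken from the NLTK book's version of this output.

```python
# Hand-copied subset of two Rotokas entries: (headword, [(marker, value), ...])
sample_entries = [
    ('kaa', [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'),
             ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),
             ('xe', 'Apoka is gagging from food while talking.')]),
    ('kaa', [('ps', 'V'), ('pt', 'B'), ('ge', 'strangle'),
             ('ex', 'Rera rauroro rera kaarevoi.'),
             ('xe', 'He is holding him and strangling him.'),
             ('ex', 'Iroiro-ia oirato okoearo kaaivoi uvare rirovira kaureoparoveira.'),
             ('xe', "They strangled the man's neck with rope because he was very stubborn and arrogant.")]),
]


def field_values(fields, marker):
    """Return every value stored under the given field marker, in order."""
    return [value for key, value in fields if key == marker]


for headword, fields in sample_entries:
    glosses = field_values(fields, 'ge')
    # Pair each Rotokas example with its English translation
    examples = list(zip(field_values(fields, 'ex'), field_values(fields, 'xe')))
    print(headword, glosses, len(examples), 'example(s)')

# Converting to a dict keeps only the last value of a repeated marker
collapsed = dict(sample_entries[1][1])
print(collapsed['ex'])
```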
In [47]:
>>> from nltk.corpus import toolbox
>>> toolbox.entries('rotokas.dic')[:2]
[('kaa',
  [('ps', 'V'),
   ('pt', 'A'),
   ('ge', 'gag'),
   ('tkp', 'nek i pas'),
   ('dcsv', 'true'),
   ('vx', '1'),
   ('sc', '???'),
   ('dt', '29/Oct/2005'),
   ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),
   ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
   ('xe', 'Apoka is gagging from food while talking.')]),
 ('kaa',
  [('ps', 'V'),
   ('pt', 'B'),
   ('ge', 'strangle'),
   ('tkp', 'pasim nek'),
   ('arg', 'O'),
   ('vx', '2'),
   ('dt', '07/Oct/2006'),
   ('ex', 'Rera rauroro rera kaarevoi.'),
   ('xp', 'Em i holim pas em na nekim em.'),
   ('xe', 'He is holding him and strangling him.'),
   ('ex', 'Iroiro-ia oirato okoearo kaaivoi uvare rirovira kaureoparoveira.'),
   ('xp', 'Ol i pasim nek bilong man long rop bikos em i save bikhet tumas.'),
   ('xe',
    "They strangled the man's neck with rope because he was very stubborn and arrogant."),
   ('ex',
    'Oirato okoearo kaaivoi iroiro-ia. Uva viapau uvuiparoi ra vovouparo uva kopiiroi.'),
   ('xp',
    'Ol i pasim nek bilong man long rop. Olsem na em i no pulim win olsem na em i dai.'),
   ('xe',
    "They strangled the man's neck with a rope. And he couldn't breathe and he died.")])]

5. WordNet

  • A semantically oriented dictionary
    • NLTK includes the English WordNet
    • It contains 155,287 words and 117,659 synonym sets (synsets)

5.1 Senses and Synonyms

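  • e.g., the two sentences below differ only in motorcar/automobile and mean the same thing, so the two words are synonyms:
    • a. Benz is credited with the invention of the motorcar.
    • b. Benz is credited with the invention of the automobile.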
In [48]:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
In [49]:
>>> wn.synset('car.n.01').lemma_names()
['car', 'auto', 'automobile', 'machine', 'motorcar']
In [50]:
>>> print(wn.synset('car.n.01').definition(),"\n")
>>> print(wn.synset('car.n.01').examples())
a motor vehicle with four wheels; usually propelled by an internal combustion engine 

['he needs a car to get to work']
In [51]:
[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')] 



In [52]:
>>> print(wn.synsets('car'),"\n")
>>> for synset in wn.synsets('car'):
...     print(synset.lemma_names())
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')] 

['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']

5.2 The WordNet Hierarchy

  • WordNet makes it easy to look up hyponyms (more specific terms)
In [55]:
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())
  • Finding hypernym paths (the routes from a synset up to the root)
In [56]:
paths = motorcar.hypernym_paths()
print([synset.name() for synset in paths[0]],"\n")
print([synset.name() for synset in paths[1]],"\n")


['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01'] 

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01'] 
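
The two paths printed above are identical except for a short stretch in the middle. A small sketch, using only the synset names copied from that output, makes the divergence explicit. `common_prefix` is a hypothetical helper written for this illustration.

```python
# The two hypernym paths for car.n.01, copied from the output above
path_a = ['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
          'artifact.n.01', 'instrumentality.n.03', 'container.n.01',
          'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01',
          'motor_vehicle.n.01', 'car.n.01']
path_b = ['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
          'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03',
          'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01',
          'motor_vehicle.n.01', 'car.n.01']


def common_prefix(a, b):
    """Return the leading items that both sequences share."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


shared = common_prefix(path_a, path_b)
only_a = [name for name in path_a if name not in path_b]
only_b = [name for name in path_b if name not in path_a]

print('paths split after:', shared[-1])
print('only in first path:', only_a)
print('only in second path:', only_b)
```

The paths rejoin at `wheeled_vehicle.n.01`, which is reached both through `container.n.01` and through `vehicle.n.01`. That is why `hypernym_paths()` returns more than one path here.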

  • Finding the root
In [57]:
>>> motorcar.root_hypernyms()

5.3 More Lexical Relations

  • Meronyms and holonyms can also be looked up
    • The parts of a tree are its trunk, crown, and so on; the part_meronyms().
    • The substance a tree is made of includes heartwood and sapwood; the substance_meronyms().
    • A collection of trees forms a forest; the member_holonyms()
  • An everyday way to read the two terms
    • meronym (the part): in "I saw a welcome face", the face is a part of a person
    • holonym (the whole): the face is the whole that contains the eyes, nose, mouth, and so on
In [58]:
[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')] 

[Synset('heartwood.n.01'), Synset('sapwood.n.01')] 
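
Meronymy and holonymy are two directions of the same link: if a trunk is a part meronym of a tree, the tree is a part holonym of the trunk. The sketch below shows this by inverting the two result lists printed above. It uses plain dictionaries of synset names rather than WordNet itself, and `invert` is a hypothetical helper.

```python
from collections import defaultdict

# whole -> parts, with names copied from the two outputs above
part_meronyms = {
    'tree.n.01': ['burl.n.02', 'crown.n.07', 'limb.n.02', 'stump.n.01', 'trunk.n.01'],
}
substance_meronyms = {
    'tree.n.01': ['heartwood.n.01', 'sapwood.n.01'],
}


def invert(relation):
    """Turn a whole -> parts mapping into a part -> wholes mapping."""
    inverted = defaultdict(list)
    for whole, parts in relation.items():
        for part in parts:
            inverted[part].append(whole)
    return dict(inverted)


part_holonyms = invert(part_meronyms)
substance_holonyms = invert(substance_meronyms)

print(part_holonyms['trunk.n.01'])
print(substance_holonyms['sapwood.n.01'])
```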

  • To see just how intricate things can get, consider the word mint, which has several closely-related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.
In [59]:
>>> for synset in wn.synsets('mint', wn.NOUN):
...     print(synset.name() + ':', synset.definition())
batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government
In [60]:
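
The relationships described for mint above can be checked with the `part_holonyms()` and `substance_holonyms()` methods of a synset. This is a sketch under the assumption that NLTK and its WordNet data are installed. `mint_holonyms` is a hypothetical wrapper that returns `None` when they are not.

```python
def mint_holonyms():
    """Return holonym names for mint.n.04, or None if WordNet is unavailable."""
    try:
        from nltk.corpus import wordnet as wn
        leaves = wn.synset('mint.n.04')
        return {
            'part_holonyms': [s.name() for s in leaves.part_holonyms()],
            'substance_holonyms': [s.name() for s in leaves.substance_holonyms()],
        }
    except (ImportError, LookupError):
        # NLTK or its WordNet data is not installed in this environment
        return None


results = mint_holonyms()
print(results)
```

With WordNet available, the result should be consistent with the statement above: mint.n.04 is part of mint.n.02 and is the substance of mint.n.05.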

  • There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:
In [61]:
>>> print(wn.synset('walk.v.01').entailments(),"\n")
>>> print(wn.synset('eat.v.01').entailments(),"\n")
>>> print(wn.synset('tease.v.03').entailments(),"\n")

[Synset('chew.v.01'), Synset('swallow.v.01')] 

[Synset('arouse.v.07'), Synset('disappoint.v.01')] 

  • Antonyms
In [62]:
>>> print(wn.lemma('supply.n.02.supply').antonyms(),"\n")
>>> print(wn.lemma('rush.v.01.rush').antonyms(),"\n")
>>> print(wn.lemma('horizontal.a.01.horizontal').antonyms(),"\n")
>>> print(wn.lemma('staccato.r.01.staccato').antonyms(),"\n")


[Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')] 


5.4 Semantic Similarity

In [63]:
>>> right = wn.synset('right_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
In [64]:



In [65]:
>>> print(wn.synset('baleen_whale.n.01').min_depth(),"\n")
>>> print(wn.synset('whale.n.02').min_depth(),"\n")
>>> print(wn.synset('vertebrate.n.01').min_depth(),"\n")
>>> print(wn.synset('entity.n.01').min_depth())



In [66]:
>>> right.path_similarity(minke)
In [67]:
>>> right.path_similarity(orca)
In [68]:
>>> right.path_similarity(tortoise)
In [69]:
>>> right.path_similarity(novel)
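
`path_similarity` scores two synsets by the shortest path that connects them in the hypernym hierarchy, computed as 1 / (1 + path length). The sketch below reproduces that idea on a tiny hand-made taxonomy. The links in `PARENT` are a simplified assumption and not the real WordNet graph, so the numbers will not match what NLTK returns for the calls above.

```python
# Hypothetical child -> parent links; a toy stand-in for WordNet's hierarchy
PARENT = {
    'right_whale': 'baleen_whale',
    'minke_whale': 'baleen_whale',
    'baleen_whale': 'whale',
    'orca': 'toothed_whale',
    'toothed_whale': 'whale',
    'whale': 'vertebrate',
    'tortoise': 'reptile',
    'reptile': 'vertebrate',
    'vertebrate': 'entity',
    'novel': 'book',
    'book': 'entity',
}


def ancestors(node):
    """Return {ancestor: edges from node}, with the node itself at distance 0."""
    distances, steps = {}, 0
    while node is not None:
        distances[node] = steps
        node = PARENT.get(node)
        steps += 1
    return distances


def path_similarity(a, b):
    """Score 1 / (1 + length of the shortest path through a shared ancestor)."""
    dist_a, dist_b = ancestors(a), ancestors(b)
    shortest = min(dist_a[n] + dist_b[n] for n in dist_a if n in dist_b)
    return 1 / (1 + shortest)


for other in ['minke_whale', 'orca', 'tortoise', 'novel']:
    print(other, round(path_similarity('right_whale', other), 3))
```

The score falls as the second concept moves further from the right whale in the hierarchy, which is the pattern the four calls above are meant to show.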
