In this post I extract all the words spoken by each character in eight of Shakespeare's plays. Then I construct a topic model to see which characters are generally speaking about similar things. In Part II I look into the information revealed by the topic model. Download notebook.
import nltk
import pandas as pd
from collections import defaultdict
from gensim import corpora, models, similarities
The nltk library includes eight of Shakespeare's plays in XML format, which makes it easy to split the lines up by speaker. Listing the available play files:
nltk.corpus.shakespeare.fileids()
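Each play's XML nests ACT, SCENE, and SPEECH elements, with SPEAKER and LINE children inside each speech. Here's a quick way to peek at that structure (an illustrative snippet; I'm assuming 'hamlet.xml' as the file id, and the printed text depends on the corpus contents):
root = nltk.corpus.shakespeare.xml('hamlet.xml')
speech = root.find('ACT/SCENE/SPEECH')
print(speech.find('SPEAKER').text)  # the speaker of the play's first speech
print(speech.find('LINE').text)     # the first line of that speech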
parse_plays returns two dictionaries, mapping each speaker in each play to the words they say and the number of lines they have.
def parse_plays(file_ids,
                tokenizer=nltk.tokenize.RegexpTokenizer(r'\w+'),
                stopwords=set(nltk.corpus.stopwords.words('english'))):
    """Return two dictionaries, mapping each speaker in each play to the
    words they say and the number of lines they have.

    :param file_ids: the nltk file_ids of play xml files
    :param tokenizer: tokenizer to split words within the lines
        default: nltk.tokenize.RegexpTokenizer(r'\w+')
    :param stopwords: set of words to exclude
        default: set(nltk.corpus.stopwords.words('english'))
    """
    lines = defaultdict(list)
    linecounts = defaultdict(int)
    for file_id in file_ids:
        raw_data = nltk.corpus.shakespeare.xml(file_id)
        for child in raw_data.findall('ACT/SCENE/SPEECH'):
            speaker = (child.find('SPEAKER').text, file_id.replace('.xml', ''))
            for line in child.findall('LINE'):
                if line.text is not None:
                    for word in tokenizer.tokenize(line.text):
                        word_lower = word.lower()
                        if word_lower not in stopwords and len(word) > 2:
                            lines[speaker].append(word_lower)
                    linecounts[speaker] += 1
    return lines, linecounts
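As a quick sanity check, the function can be run on a single play (a hypothetical example; the key assumes the speaker tag reads 'HAMLET', and the counts depend on the tokenizer and stopword settings):
hamlet_lines, hamlet_counts = parse_plays(['hamlet.xml'])
print(hamlet_counts[('HAMLET', 'hamlet')])     # how many lines Hamlet speaks
print(hamlet_lines[('HAMLET', 'hamlet')][:5])  # a few of his lowercased words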
To make cleaning up and manipulating the data easier, I put the relevant data into a pandas DataFrame.
min_lines = 100
lines, linecounts = parse_plays(nltk.corpus.shakespeare.fileids())
word_data = [(speaker[0], speaker[1], count, lines[speaker])
             for speaker, count in linecounts.items()
             if count >= min_lines]
word_data_df = pd.DataFrame(word_data, columns=['persona', 'play', 'linecount', 'words'])
word_data_df = word_data_df.sort_values('linecount', ascending=False).reset_index(drop=True)
word_data_df.iloc[:, :3].to_csv('data/word_data_df.csv')
word_data_df.head()
Here I make a gensim dictionary, which creates a mapping of words to integer ids. The integer ids are used by gensim in the later steps to extract a topic model.
line_list = word_data_df['words'].values
dictionary = corpora.Dictionary(line_list)
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items()
            if docfreq == 1]  # ids of words that appear for only one character
dictionary.filter_tokens(once_ids)
dictionary.compactify()
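To see what the dictionary does, doc2bow converts a list of words into sparse (id, count) pairs, silently dropping any word it doesn't know (an illustrative example, assuming 'lord' and 'king' survived the filtering; the actual ids are arbitrary):
print(dictionary.doc2bow(['lord', 'lord', 'king']))
# e.g. [(812, 1), (1754, 2)] -- (word id, count) pairs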
The step below creates, for each character, a sparse vector mapping integer word ids to word counts, along with a TF-IDF model. The TF-IDF model converts the raw word counts into values more indicative of each word's importance.
corpus = [dictionary.doc2bow(words) for words in line_list]
corpora.mmcorpus.MmCorpus.serialize('data/shkspr.mm', corpus)
tfidf = models.TfidfModel(corpus)
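Applying the model to a single character's bag of words shows the reweighting (a sketch; the actual weights depend on the whole corpus):
doc_bow = corpus[0]         # (word id, raw count) pairs for one character
doc_tfidf = tfidf[doc_bow]  # (word id, tf-idf weight) pairs
print(doc_bow[:3])
print(doc_tfidf[:3])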
Finally, the model is constructed.
lsi = models.lsimodel.LsiModel(corpus=tfidf[corpus], id2word=dictionary)
lsi.save('data/shkspr.lsi')
for i, topic in lsi.print_topics(num_topics=3):
    print('Topic {}:'.format(i))
    print(topic.replace(' + ', '\n'))
    print('')
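A character's TF-IDF vector can also be projected into the topic space to get its mixture of topics, the kind of representation the analysis will dig into (a sketch; topic ids and weights vary between runs):
topic_mix = lsi[tfidf[corpus[0]]]  # (topic id, weight) pairs for one character
print(topic_mix[:5])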
The topic model is now constructed. In Part II I'll analyze the results.