In Part I of this post I made a topic model of the speech of Shakespeare characters from eight plays. Here in Part II I'll analyze the results of the model. Download notebook.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from collections import defaultdict
from gensim import corpora, models, similarities
from pprint import pprint
Here I load data from Part I. You can find the data here.
word_data_df = pd.read_csv('data/word_data_df.csv', index_col=0)
personae = [tuple(character) for character in word_data_df[['persona', 'play']].values]
plays = word_data_df['play'].unique()
corpus = corpora.mmcorpus.MmCorpus('data/shkspr.mm')
tfidf = models.TfidfModel(corpus)
lsi = models.lsimodel.LsiModel.load('data/shkspr.lsi')
gensim can calculate a similarity value between each pair of characters using a cosine similarity metric. The input is the model and the corpus of each character's speech.
matsim = similarities.MatrixSimilarity(lsi[tfidf[corpus]], num_best=6)
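The index can also be queried with a single character's vector; with num_best=6 the result is a list of (document index, cosine similarity) pairs, most similar first. Here's a minimal sketch, assuming corpus[0] lines up with the first entry of personae (the variable names are just for illustration):

# Query the similarity index with one character's LSI vector.
# The first hit is normally the character itself, with similarity ~1.0.
query_vec = lsi[tfidf[corpus[0]]]
for doc_index, score in matsim[query_vec]:
    print '{:<30}{:.3f}'.format('|'.join(personae[doc_index]), score)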
For each of the ten characters with the most lines, this prints the most similar characters along with their similarity scores. Most characters are most similar to other characters in their own play. Under this model, Mark Antony in Antony and Cleopatra is more closely related to three other characters from Antony and Cleopatra than to himself in Julius Caesar. This doesn't seem too far-fetched, as characters in different plays are concerned with different people and problems.
for sims in list(matsim)[:10]:
    persona_index = sims[0][0]
    print '|'.join(personae[persona_index])
    for other_persona_index, score in sims[1:]:
        print '\t{:<30}{:.3f}'.format('|'.join(personae[other_persona_index]), score)
Latent Semantic Indexing (LSI) creates a lower-dimensional subspace of the space spanned by all words (i.e. a space in which each word represents one orthogonal dimension). The speech of each character can be projected into this lower-dimensional space. Below is the projection of Hamlet's speech into the first 10 dimensions. Because of the way the space is constructed, the first dimensions contain the most information.
lsi[tfidf[corpus[0]]][:10]
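Each entry of that projection is a (topic index, weight) pair; the weights are the coordinates the plotting function below reads with [x][1] and [y][1]. A quick sketch to print the pairs and to check that the leading dimensions really do dominate; lsi.projection.s (where gensim keeps the singular values) and the variable names here are my own additions:

# Print Hamlet's projection as explicit (topic index, weight) pairs.
hamlet_vec = lsi[tfidf[corpus[0]]]
for topic_index, weight in hamlet_vec[:10]:
    print 'topic {:>2}: {:+.3f}'.format(topic_index, weight)

# The model's singular values decrease, which is why the first dimensions
# carry the most information.
print lsi.projection.s[:10]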
The functions below plot the projection of each character's speech onto two of the axes (topics) defined by the LSI model. This is useful for visualizing the result of the model. The 10 most important words in each topic are printed above the graph.
def format_topic_coeffs(topic):
    """Return a list of (coefficient, word) tuples with the coefficient truncated
    to 3 decimal places.
    """
    return [('{0:.3f}'.format(coeff), word) for coeff, word in topic]
def plot_axes(x=0, y=1, model=lsi, corpus=corpus,
              tfidf=tfidf, personae=personae, plays=plays):
    """Plot each character in personae according to the projection of their
    speech into the given x and y topic axes of model.

    Points are colored according to play and labeled with the character.

    :param x: the index of the x axis to plot
    :param y: the index of the y axis to plot
    :param model: the gensim model to project into
    :param corpus: the gensim corpus of documents
    :param tfidf: a tfidf model for converting documents into tfidf space
    :param personae: a list of (character, play) tuples; the order must correspond
        to the order of documents in the corpus
    :param plays: a list of all the plays existing in the data
    """
    x_data = defaultdict(list)
    y_data = defaultdict(list)
    chars = defaultdict(list)

    print 'x topic:'
    pprint(format_topic_coeffs(model.show_topic(x)))
    print ''
    print 'y topic:'
    pprint(format_topic_coeffs(model.show_topic(y)))

    for persona, doc in zip(personae, corpus):
        play = persona[1]
        x_data[play].append(model[tfidf[doc]][x][1])
        y_data[play].append(model[tfidf[doc]][y][1])
        chars[play].append(persona[0])

    plt.figure(figsize=(10, 10))
    ax = plt.gca()
    cmap = plt.get_cmap('Paired')
    play_index = {play: i for i, play in enumerate(plays)}
    for play in play_index:
        color_index = play_index[play] / float(len(play_index))
        plt.scatter(x_data[play], y_data[play], color=cmap(color_index),
                    label=play, alpha=.5, s=40)
        for char, x_coord, y_coord in zip(chars[play], x_data[play], y_data[play]):
            ax.annotate(char, xy=(x_coord, y_coord), xycoords='data', xytext=(1, 1),
                        textcoords='offset points', size=10)
    plt.legend(loc=1, ncol=2, scatterpoints=1)
Here the y-axis separates the plays about Romans from other plays. Looking at the list of words that make up this topic, we can see that the Romans talk a lot about "Caesar", "Antony", and "Rome", but not much about "Romeo" or "Tybalt". The characters from Romeo and Juliet are the opposite, and they extend the other way along the y-axis.
plot_axes(x=0, y=1)
The next two axes separate out several of the other plays. Characters from Romeo and Juliet, Othello, and A Midsummer Night's Dream extend along the axes in different directions, while, to a lesser extent, the characters from The Merchant of Venice have some projection onto the y-axis.
plot_axes(x=2, y=3)
In the next set of axes the characters from The Merchant of Venice are well separated from those in the other plays along the x-axis, while characters from Hamlet and Macbeth extend along the y-axis. That these characters cluster together is no surprise: in the similarity listing above, Hamlet and Macbeth (the characters themselves) each appear among the other's three most similar characters.
plot_axes(x=4, y=5)
Back to Part I.