Literary Characters

[In ]:
from datascience import Table
import numpy as np
path_data = '../../../'
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

from urllib.request import urlopen
import re

def read_url(url):
    """Download the text at url and collapse each run of whitespace into one space."""
    return re.sub('\\s+', ' ', urlopen(url).read().decode())
[In ]:
# Read two books, fast (again)!

huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]

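The Adventures of Huckleberry Finn describes a journey that Huck and Jim take down the Mississippi River. Tom Sawyer joins them towards the end, as the action heats up. Having loaded the text, we can quickly visualize the number of times each of these characters has been mentioned at any point in the book.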

[In ]:
# Get the cumulative number of times the names Jim, Tom, and Huck appear in each chapter.

counts = Table().with_columns([
        'Jim', np.cumsum(np.char.count(huck_finn_chapters, 'Jim')),
        'Tom', np.cumsum(np.char.count(huck_finn_chapters, 'Tom')),
        'Huck', np.cumsum(np.char.count(huck_finn_chapters, 'Huck'))
    ])

# Plot the cumulative counts:
# how many times in Chapter 1, how many times in Chapters 1 and 2, and so on.

cum_counts = counts.with_column('Chapter', np.arange(1, 44, 1))
cum_counts.plot(column_for_xticks=3)
plots.title('Cumulative Number of Times Each Name Appears', y=1.08);
<Cumulative Number of Times Each Name Appears, The Adventures of Huckleberry Finn.>

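In the plot above, the horizontal axis shows chapter numbers and the vertical axis shows how many times each character has been mentioned up to and including that chapter.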
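To make "up to and including that chapter" concrete, here is a minimal sketch of the same two-step computation on toy data. The three strings in `toy_chapters` are invented for illustration and are not taken from the book; only `np.char.count` and `np.cumsum` are the calls used in the cell above.

```python
import numpy as np

# Hypothetical toy "chapters", invented for illustration only.
toy_chapters = np.array([
    'Jim and Tom went out. Jim came back.',
    'Tom stayed home.',
    'Jim, Jim, Jim!',
])

# Step 1: count occurrences of the substring 'Jim' in each chapter.
per_chapter = np.char.count(toy_chapters, 'Jim')

# Step 2: running total, i.e. mentions up to and including each chapter.
cumulative = np.cumsum(per_chapter)

print(per_chapter)   # [2 0 3]
print(cumulative)    # [2 2 5]
```

A chapter with no mentions leaves the running total unchanged, which is why a character's line in the plot is flat over chapters where the name does not appear.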

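You can see that Jim is a central character from the large number of times his name appears. Notice how Tom is hardly mentioned for most of the book until he arrives and joins Huck and Jim, after Chapter 30. At that point the action involving the two of them intensifies, and both Tom's curve and Jim's rise sharply. As for Huck, his name hardly appears at all, because he is the narrator.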

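Little Women is a story of four sisters growing up together during the American Civil War. In this book, chapter numbers are spelled out and chapter titles are written in all capital letters.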

[In ]:
# The chapters of Little Women, in a table

Table().with_column('Chapters', little_women_chapters)
Chapters
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho ...
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr ...
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me ...
FOUR BURDENS "Oh, dear, how hard it does seem to take up ...
FIVE BEING NEIGHBORLY "What in the world are you going t ...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr ...
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect ...
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as ...
NINE MEG GOES TO VANITY FAIR "I do think it was the most ...
TEN THE P.C. AND P.O. As spring came on, a new set of am ...
... (37 rows omitted)
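Each row in the table above begins with a spelled-out number and an upper-case title because the text was split on the marker `'CHAPTER '`, which removes the marker itself and keeps whatever follows it. The sketch below shows that behavior on a short string; `toy_text` is invented for illustration and is not the book's actual text.

```python
# Hypothetical miniature text, invented for illustration only.
toy_text = ('Title page CHAPTER ONE PLAYING PILGRIMS Text one. '
            'CHAPTER TWO A MERRY CHRISTMAS Text two.')

# str.split drops the separator and returns the pieces around it.
pieces = toy_text.split('CHAPTER ')

# The first piece is whatever came before the first marker (front matter),
# so slicing with [1:] keeps only the chapters.
front_matter = pieces[0]
toy_chapters = pieces[1:]

print(repr(front_matter))  # 'Title page '
print(toy_chapters[0])     # ONE PLAYING PILGRIMS Text one.
print(len(toy_chapters))   # 2
```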

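We can also track the mentions of the main characters to learn about the plot of this book. The protagonist, Jo, interacts regularly with her sisters Meg, Beth, and Amy, up until Chapter 27, when she moves to New York alone.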

[In ]:
# Get the cumulative counts of the names in the chapters of Little Women

counts = Table().with_columns([
        'Amy', np.cumsum(np.char.count(little_women_chapters, 'Amy')),
        'Beth', np.cumsum(np.char.count(little_women_chapters, 'Beth')),
        'Jo', np.cumsum(np.char.count(little_women_chapters, 'Jo')),
        'Meg', np.cumsum(np.char.count(little_women_chapters, 'Meg')),
        'Laurie', np.cumsum(np.char.count(little_women_chapters, 'Laurie'))
    ])

# Plot the cumulative counts.

cum_counts = counts.with_column('Chapter', np.arange(1, 48, 1))
cum_counts.plot(column_for_xticks=5)
plots.title('Cumulative Number of Times Each Name Appears', y=1.08);
<Cumulative Number of Times Each Name Appears, Little Women. Jo's name appears more often than any other name at every point in the book. Before Chapter 10, each of the other sisters' names appears, with no clear leader apart from Jo. Meg then gets the most mentions until her line plateaus around Chapter 27. Amy follows, and from around Chapter 35 the shape of her line closely mirrors Laurie's. Beth gets the fewest mentions.>

Laurie is a young man who marries one of the girls in the end. See if you can use the plot to guess which one.
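One caveat when reading these counts: `np.char.count` counts substrings, not whole words, so a short name such as 'Jo' is also counted inside any longer word that contains it. The sketch below contrasts the substring count with a whole-word count based on a regular expression. The toy strings and the whole-word variant are assumptions added for illustration; they are not the method used in the cells above.

```python
import re
import numpy as np

# Hypothetical toy "chapters", invented for illustration only.
toy_chapters = ['Jo met John. Jo laughed.', 'John wrote to Jo.']

# Substring count, as in the cells above: 'Jo' also matches inside 'John'.
substring_counts = np.char.count(toy_chapters, 'Jo')

# Whole-word count (an alternative, not the method used above):
# \b marks a word boundary, so 'John' is not counted.
whole_word_counts = np.array(
    [len(re.findall(r'\bJo\b', chapter)) for chapter in toy_chapters]
)

print(substring_counts)   # [3 2]
print(whole_word_counts)  # [2 1]
```

Whether the difference matters depends on how often such longer words occur in the text, so it is worth checking before drawing fine-grained conclusions from a curve.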