Another Kind of Character - Computational and Inferential Thinking

[In ]:

from datascience import *
import numpy as np
path_data = '../../../../data/'
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

from urllib.request import urlopen 
import re
def read_url(url): 
    return re.sub('\\s+', ' ', urlopen(url).read().decode())

[In ]:

# Read two books, fast (again)!

huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]

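In some situations, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information, and will develop methods for combining multiple sources of uncertain information to make decisions.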

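As an example of visualizing information derived from multiple sources, let us first use the computer to get some information that would be tedious to acquire by hand. In the context of novels, the word "character" has a second meaning: a printed symbol such as a letter, a number, or a punctuation mark. Here, we ask the computer to count the number of characters and the number of periods in each chapter of both Huckleberry Finn and Little Women.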

[In ]:

# In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods.

chars_periods_huck_finn = Table().with_columns([
        'Huck Finn Chapter Length', [len(s) for s in huck_finn_chapters],
        'Number of Periods', np.char.count(huck_finn_chapters, '.')
    ])
chars_periods_little_women = Table().with_columns([
        'Little Women Chapter Length', [len(s) for s in little_women_chapters],
        'Number of Periods', np.char.count(little_women_chapters, '.')
    ])
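To make the two counts concrete, here is a minimal sketch of the same two operations on made-up strings. The name `toy_chapters` and its contents are hypothetical stand-ins and are not taken from either novel; only `len` and `np.char.count` come from the cell above.

```python
import numpy as np

# Hypothetical stand-ins for chapter texts (not from either novel).
toy_chapters = ['It was late. We went home.', 'Why? Nobody knew.']

# "Length" counts every character, including spaces and punctuation.
toy_lengths = [len(s) for s in toy_chapters]

# np.char.count counts non-overlapping occurrences of '.' in each string.
toy_periods = np.char.count(toy_chapters, '.')

print(toy_lengths)
print(toy_periods)
```

Note that the second toy string ends one of its sentences with a question mark, so it contains two sentences but only one period. This is one reason the period count is only a rough proxy for the number of sentences.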

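Here are the data for Huckleberry Finn. Each row of the table corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in that chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods: the shorter the chapter, the fewer sentences it tends to contain, and vice versa. The relation is not entirely predictable, however, because sentences vary in length and can end in other punctuation such as question marks.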

[In ]:

chars_periods_huck_finn

Huck Finn Chapter Length | Number of Periods
7026                     | 66
11982                    | 117
8529                     | 72
6799                     | 84
8166                     | 91
14550                    | 125
13218                    | 127
22208                    | 249
8081                     | 71
7036                     | 70
... (33 rows omitted)

Here are the corresponding data for Little Women.

[In ]:

chars_periods_little_women

Little Women Chapter Length | Number of Periods
21759                       | 189
22148                       | 188
20558                       | 231
25526                       | 195
23395                       | 255
14622                       | 140
14431                       | 131
22476                       | 214
33767                       | 337
18508                       | 185
... (37 rows omitted)

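You can see that the chapters of Little Women are in general longer than those of Huckleberry Finn. Let us see whether these two simple variables, the length of each chapter and its number of periods, can tell us anything more about the two books. One way to do this is to plot both sets of data on the same axes.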

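In the plot below, there is a dot for each chapter of each book. Blue dots correspond to Huckleberry Finn and gold dots to Little Women. The horizontal axis represents the number of periods and the vertical axis represents the number of characters.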

[In ]:

plots.figure(figsize=(6, 6))
plots.scatter(chars_periods_huck_finn.column(1), 
              chars_periods_huck_finn.column(0), 
              color='darkblue')
plots.scatter(chars_periods_little_women.column(1), 
              chars_periods_little_women.column(0), 
              color='gold')
plots.xlabel('Number of periods in chapter')
plots.ylabel('Number of characters in chapter');

Scatterplot with x axis labeled 'Number of periods in chapter' and y axis labeled 'Number of characters in chapter'. The points are dark blue or gold. The dark blue points lie mainly in the lower-left region of the plot and have a positive relationship: points with a higher number of periods tend to have a higher number of characters. The gold points extend from the lower left to the upper right and also have a positive relationship. Some gold points have higher x and y values than the highest x and y values among the blue points.

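The plot shows that many but not all of the chapters of Little Women are longer than those of Huckleberry Finn, as we had observed just by looking at the numbers. But it also shows us something more. Notice how the blue points are roughly clustered around a straight line, as are the gold points. Moreover, it looks as though both colors of points might be clustered around the same straight line.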
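One rough way to check the "same straight line" impression numerically is to compare characters per period for the two books. The sketch below uses only the ten rows displayed in each table above, so the omitted chapters are not reflected, and the ratio of totals is a simple summary chosen here for illustration rather than a fitted line. The variable names are assumptions introduced for this example.

```python
import numpy as np

# The ten rows displayed above for each book (omitted rows are not included).
huck_lengths = np.array([7026, 11982, 8529, 6799, 8166,
                         14550, 13218, 22208, 8081, 7036])
huck_periods = np.array([66, 117, 72, 84, 91, 125, 127, 249, 71, 70])

lw_lengths = np.array([21759, 22148, 20558, 25526, 23395,
                       14622, 14431, 22476, 33767, 18508])
lw_periods = np.array([189, 188, 231, 195, 255, 140, 131, 214, 337, 185])

# Characters per period over the displayed chapters of each book.
huck_ratio = huck_lengths.sum() / huck_periods.sum()
lw_ratio = lw_lengths.sum() / lw_periods.sum()

print(round(huck_ratio, 1), round(lw_ratio, 1))
```

If the two ratios come out close to each other, that is consistent with points from both books lying near a common line through the origin, at least for the displayed chapters.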

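Now look at all the chapters that contain about 100 periods. The plot shows that those chapters contain roughly 10,000 to 15,000 characters. That works out to about 100 to 150 characters per period.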

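Indeed, it appears from the plot that on average both books tend to have somewhere between 100 and 150 characters between periods, as a very rough estimate. Perhaps these two great 19th-century novels were signaling something so very familiar to us now: the 140-character limit of Twitter.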
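The estimate read off the plot can be compared with a direct per-chapter calculation. This is a minimal sketch that uses only the ten Huckleberry Finn rows displayed earlier in this section, so it says nothing about the 33 omitted chapters. The array names are assumptions introduced for illustration.

```python
import numpy as np

# First ten rows of the Huckleberry Finn table displayed earlier.
chapter_lengths = np.array([7026, 11982, 8529, 6799, 8166,
                            14550, 13218, 22208, 8081, 7036])
period_counts = np.array([66, 117, 72, 84, 91, 125, 127, 249, 71, 70])

# Elementwise division gives characters per period for each chapter.
chars_per_period = chapter_lengths / period_counts

print(np.round(chars_per_period))
```

For these ten chapters the values fall between roughly 80 and 120 characters per period. That is the same order of magnitude as the range read off the plot, though somewhat lower for several chapters, which fits the caveat that the estimate is very rough.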