中心极限定理

在本课程中，我们看到的数据直方图很少有钟形的。当我们遇到钟形分布时，几乎总是基于随机样本的统计量的经验直方图。

下面的示例展示了两种截然不同的情况，在这些情况下，此类直方图中出现了近似的钟形。

[In ]:

from datascience import *
%matplotlib inline
path_data = '../../../assets/data/'
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import math
import numpy as np
from scipy import stats

[In ]:

colors = Table.read_table(path_data + 'roulette_wheel.csv').column('Color')
pockets = make_array('0','00')
for i in np.arange(1, 37):
    pockets = np.append(pockets, str(i)) 

wheel = Table().with_columns(
    'Pocket', pockets,
    'Color', colors
)

轮盘赌的净收益

在较早的一节中，钟形出现在如果我们反复在轮盘赌的不同旋转中下注相同金额时我们总赢金额的粗略形状中。

[In ]:

wheel

Pocket | Color
0      | green
00     | green
1      | red
2      | black
3      | red
4      | black
5      | red
6      | black
7      | red
8      | black
... (28 rows omitted)

回想一下，押注红色是1比1的等额赔付。我们定义了函数 red_winnings，该函数返回在红色上押注\$1的净赢利。具体来说，该函数接受一个颜色作为参数，如果颜色是红色则返回1，对于所有其他颜色返回-1。

[In ]:

def red_winnings(color):
    if color == 'red':
        return 1
    else:
        return -1

表格 red 显示了每个口袋在红色上的赢利。

[In ]:

red = wheel.with_column(
    'Winnings: Red', wheel.apply(red_winnings, 'Color')
    )
red

Pocket | Color | Winnings: Red
0      | green | -1
00     | green | -1
1      | red   | 1
2      | black | -1
3      | red   | 1
4      | black | -1
5      | red   | 1
6      | black | -1
7      | red   | 1
8      | black | -1
... (28 rows omitted)

一次下注的净收益是从 Winnings: Red 列中随机抽取的一个值。有18/38的概率获得\$1，20/38的概率获得-\$1。这个概率分布如下直方图所示。

[In ]:

red.select('Winnings: Red').hist(bins=np.arange(-1.5, 1.6, 1))

Histogram with 'Winnings: Red' on the x-axis and 'Percent per uniit' on the y-axis. Two bars are shown. The bar centered at x=-1 has a height just above 50. The bar centered at x=1 has a height just below 50.

现在假设你多次押注红色。你的净赢利将是从上述分布中多次有放回随机抽取的总和。

要列出你净赢利的所有可能值及其所有概率，需要一些数学计算。我们不会这样做；相反，我们将像本课程一直做的那样，通过模拟来近似概率分布。

下面的代码模拟了如果你在轮盘赌的400次不同旋转中每次押注红色\$1，你的净收益是多少。

[In ]:

num_bets = 400
repetitions = 10000

net_gain_red = make_array()

for i in np.arange(repetitions):
    spins = red.sample(num_bets)
    new_net_gain_red = spins.column('Winnings: Red').sum()
    net_gain_red = np.append(net_gain_red, new_net_gain_red)


results = Table().with_column(
    'Net Gain on Red', net_gain_red
    )

[In ]:

results.hist(bins=np.arange(-80, 50, 6))

Histogram with 'Net Gain on Red' on the x-axis and 'Percent per unit' on the y-axis. The data is symmetrical and bell shaped with a peak about x=-20 and ranging from x=-80 to x=40.

这是一个大致钟形的直方图，尽管我们从中抽取的分布远非钟形。

中心。 该分布大致以-20美元为中心。要理解原因，请注意你在约18/38的下注中会赢\$1，在其余的20/38中会输\$1（赢得-\$1）。因此你每美元下注的平均赢利约为-5.26美分：

[In ]:

average_per_bet = 1*(18/38) + (-1)*(20/38)
average_per_bet

-0.05263157894736842

因此在400次下注中，你预期净收益约为-\$21：

[In ]:

400 * average_per_bet

-21.052631578947366

为了确认，我们可以计算10,000次模拟净收益的均值：

[In ]:

np.mean(results.column(0))

-20.9586

分散程度。 从中心开始沿着曲线移动视线，注意拐点接近0。在钟形曲线上，SD是从中心到拐点的距离。中心大致为-\$20，这意味着分布的SD大约为\$20。

在下一节中，我们将看到\$20从何而来。现在，让我们通过直接计算10,000次模拟净收益的SD来确认我们的观察：

[In ]:

np.std(results.column(0))

20.029115957525438

总结。 400次下注的净收益是每次单独下注赢得的400个金额的总和。该总和的概率分布近似正态，其平均值和SD我们可以近似得到。

平均航班延误

表格 united 包含2015年夏季从旧金山机场出发的13,825次美联航国内航班的起飞延误数据。正如我们之前看到的，延误分布有一个长长的右尾。

[In ]:

united = Table.read_table(path_data + 'united_summer2015.csv')

[In ]:

united.select('Delay').hist(bins=np.arange(-20, 300, 10))

Histogram with 'Delay' on the x-axis and 'Percent per unit' on the y-axis. The histogram has the tallest bar on the left centered just to the left of x=0 with a dramatic decrease in height until the bars become so small they're not visible until about x=150, however the graph extends to x=300.

平均延误约为16.6分钟，SD约为39.5分钟。注意SD相对于均值来说有多大。右侧的那些大离差产生了影响，尽管它们只占数据的很小比例。

[In ]:

mean_delay = np.mean(united.column('Delay'))
sd_delay = np.std(united.column('Delay'))

mean_delay, sd_delay

(16.658155515370705, 39.480199851609314)

现在假设我们有放回地随机抽取400个延误。如果你愿意，也可以无放回抽样，但结果与有放回抽样非常相似。如果你从13,825中无放回地抽取几百个，每次抽出一个值时几乎不会改变总体。

在样本中，平均延误可能是多少？我们预期它在16或17左右，因为那是总体平均值；但很可能会有些偏差。让我们看看抽样得到的结果。我们将使用只包含延误列的表格 delay。

[In ]:

delay = united.select('Delay')

[In ]:

np.mean(delay.sample(400).column('Delay'))

15.59

样本平均值随样本结果而变化，因此我们将重复模拟抽样过程并绘制样本平均值的经验直方图。那将是对样本平均值概率直方图的近似。

[In ]:

sample_size = 400
repetitions = 10000

means = make_array()

for i in np.arange(repetitions):
    sample = delay.sample(sample_size)
    new_mean = np.mean(sample.column('Delay'))
    means = np.append(means, new_mean)

results = Table().with_column(
    'Sample Mean', means
)

[In ]:

results.hist(bins=np.arange(10, 25, 0.5))

Histogram with 'Sample Mean' on the x-axis and 'Percent per unit' on the y-axis. The x axis ranges from x=10 to x=24. The graph is mostly symmetric about approx x=16 with a slight tail extending to the right.

再一次，我们看到了一个粗略的钟形，尽管我们是从一个非常偏斜的分布中抽取的。钟形如我们预期，中心在16和17之间的某个位置。

中心极限定理

钟形出现在这种情况下的原因是概率论的一个显著结果，称为中心极限定理（Central Limit Theorem）。

中心极限定理指出，有放回抽取的大随机样本的总和或平均值的概率分布将大致为正态，无论从中抽取样本的总体分布如何。

正如我们在学习切比雪夫界时指出的，可以应用于随机样本无论总体分布如何的结果非常强大，因为在数据科学中我们很少知道总体的分布。

中心极限定理使得在几乎不了解总体的条件下进行推断成为可能，前提是我们有一个大的随机样本。这就是为什么它处于统计推断领域的核心位置。

紫花比例

回顾一下孟德尔关于某种豌豆植物花色的概率模型。该模型认为，植物的花色如同从{紫色, 紫色, 紫色, 白色}中有放回随机抽取的结果。

在一个大的植物样本中，大约有多大比例会开紫花？我们预期答案约为0.75，即模型中紫色的比例。而且，由于比例是均值，中心极限定理表明样本中紫色植物的比例分布大致为正态。

我们可以通过模拟来确认这一点。让我们模拟200株植物样本中开紫花植物的比例。

[In ]:

colors = make_array('Purple', 'Purple', 'Purple', 'White')

model = Table().with_column('Color', colors)

model

Color
Purple
Purple
Purple
White

[In ]:

props = make_array()

num_plants = 200
repetitions = 10000

for i in np.arange(repetitions):
    sample = model.sample(num_plants)
    new_prop = np.count_nonzero(sample.column('Color') == 'Purple')/num_plants
    props = np.append(props, new_prop)
    
results = Table().with_column('Sample Proportion: 200', props)

[In ]:

results.hist(bins=np.arange(0.65, 0.85, 0.01))

Histogram with 'Sample Proportion: 200' on the x-axis and 'Percent per unit' on the y-axis. The x-axis ranges from about 0.65 to above 0.8. The data is fairly symmetric and bell shaped, centered around 0.75.

又是那条正态曲线，正如中心极限定理所预测的，如你所料，中心在0.75左右。

如果我们增加样本量，这个分布会如何变化？让我们以800的样本量再次运行代码，并将模拟结果收集到与基于200样本量的模拟相同的表中。我们将保持 repetitions 的次数与之前相同，以便两列具有相同的长度。

[In ]:

props2 = make_array()

num_plants = 800

for i in np.arange(repetitions):
    sample = model.sample(num_plants)
    new_prop = np.count_nonzero(sample.column('Color') == 'Purple')/num_plants
    props2 = np.append(props2, new_prop)
    
results = results.with_column('Sample Proportion: 800', props2)

[In ]:

results.hist(bins=np.arange(0.65, 0.85, 0.01))

Two overlaid histograms. Both histograms are bell shaped and symmetrical, centered about 0.75. The wider histogram is in dark blue and labeled 'Sample Proportion: 200'. The wider histogram extends from approximately x=0.67 to 0.81 and has a shorter peak. The less wide histogram is in gold and labeled 'Sample Proportion: 800.' The gold histogram extends from about 0.7 to 0.8 and has a taller peak.

两个分布都近似正态，但一个比另一个更窄。基于800样本量的比例比基于200样本量的比例更紧密地聚集在0.75附近。增加样本量降低了样本比例的变异性。

这应该不令人惊讶。我们多次依赖于一个直觉：较大的样本量通常会降低统计量的变异性。然而，对于样本平均值的情况，我们可以量化样本量与变异性之间的关系。

样本量究竟如何影响样本平均值或比例的变异性？这正是我们将在下一节中考察的问题。