Sampling from a Population

The law of averages also holds when the random sample is drawn from the individuals in a large population.

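As an example, we will study a population of flight delay times. The table united contains data for United Airlines domestic flights departing from San Francisco in the summer of 2015. The data are made publicly available by the Bureau of Transportation Statistics in the United States Department of Transportation.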

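There are 13,825 rows, each corresponding to one flight. The columns are the date of the flight, the flight number, the destination airport code, and the departure delay in minutes. Some delays are negative: those flights left early.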

[In ]:
from datascience import *
path_data = '../../../assets/data/'
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np
[In ]:
united = Table.read_table(path_data + 'united_summer2015.csv')
united
Date   | Flight Number | Destination | Delay
6/1/15 | 73            | HNL         | 257
6/1/15 | 217           | EWR         | 28
6/1/15 | 237           | STL         | -3
6/1/15 | 250           | SAN         | 0
6/1/15 | 267           | PHL         | 64
6/1/15 | 273           | SEA         | -6
6/1/15 | 278           | SEA         | -8
6/1/15 | 292           | EWR         | 12
6/1/15 | 300           | HNL         | 20
6/1/15 | 317           | IND         | -10
... (13815 rows omitted)

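One flight departed 16 minutes early, and one was 580 minutes late. Almost all of the other delays were between -10 minutes and 200 minutes, as the histogram below shows.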

[In ]:
united.column('Delay').min()
-16
[In ]:
united.column('Delay').max()
580
[In ]:
delay_bins = np.append(np.arange(-20, 301, 10), 600)
united.hist('Delay', bins = delay_bins, unit = 'minute')
Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The tallest bars are on the left-hand side of the graph, close to 0. The height of the bars quickly decreases, but there is a long right tail that extends to 600.
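
To see what the bins argument above contains, here is a minimal sketch that rebuilds the same edges in plain Python. The list arithmetic is only an illustrative stand-in for the NumPy call in the cell above. It shows that every bin is 10 minutes wide except the last, which runs from 300 to 600. Unequal widths like these are one reason the vertical axis is measured in percent per minute rather than in plain percent.

```python
# Plain-Python stand-in for np.append(np.arange(-20, 301, 10), 600).
edges = list(range(-20, 301, 10)) + [600]

# Width of each bin: the gap between consecutive edges.
widths = [right - left for left, right in zip(edges, edges[1:])]

print(edges[:3], edges[-3:])          # first and last few edges
print(len(widths))                    # number of bins
print(set(widths[:-1]), widths[-1])   # common width, then the last bin's width
```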

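For the purposes of this section, it is enough to zoom in on the bulk of the data and ignore the 0.8% of flights that were delayed by more than 200 minutes. This restriction is only for visual convenience; the table still retains all the data.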

[In ]:
united.where('Delay', are.above(200)).num_rows/united.num_rows
0.008390596745027125
[In ]:
delay_bins = np.arange(-20, 201, 10)
united.hist('Delay', bins = delay_bins, unit = 'minute')
Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The tallest bars are between -10 and 10, and the height of the bars drops off quickly after that. Bars are visible, but very small, until about x=140. The graph continues to extend to x=200.

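The height of the [0, 10) bar is just under 3% per minute. Because the bar is 10 minutes wide, its area is just under 3% per minute × 10 minutes, which means that just under 30% of the flights had delays between 0 and 10 minutes. Counting rows confirms this: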

[In ]:
united.where('Delay', are.between(0, 10)).num_rows/united.num_rows
0.2935985533453888
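
The link between bar height and proportion used above is the area principle: the area of a bar (height × width) is the percent of values in that bin. The following is a minimal sketch of that arithmetic. The list of delays is made up for illustration (it is not the united data), and bar_height is a hypothetical helper, not part of the datascience library.

```python
def bar_height(values, left, right):
    """Height, in percent per unit, of the histogram bar over [left, right)."""
    in_bin = sum(1 for v in values if left <= v < right)
    percent = 100 * in_bin / len(values)
    return percent / (right - left)

# Made-up delays in minutes, for illustration only.
delays = [-5, -2, 0, 3, 7, 9, 12, 15, 40, 120]

height = bar_height(delays, 0, 10)   # 4 of the 10 values fall in [0, 10)
percent_in_bin = height * 10         # area = height * width
print(height, percent_in_bin)
```

Multiplying the height by the bin width recovers the percent of values in the bin, which is the same step that turns "just under 3% per minute" into "just under 30%" in the text.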

Empirical Distribution of the Sample

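Let us now think of the 13,825 flights as a population, and draw random samples from it with replacement. It helps to package our code into a function. The function empirical_hist_delay takes the sample size as its argument and draws an empirical histogram of the results.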

[In ]:
def empirical_hist_delay(n):
    """Draw the empirical histogram of delays in a random sample of n flights."""
    united.sample(n).hist('Delay', bins = delay_bins, unit = 'minute')
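
To make "with replacement" concrete, here is a minimal sketch in plain Python. It assumes that the sampling step draws rows uniformly at random with replacement, as the text describes. The short population list and the helper sample_with_replacement are hypothetical and exist only for illustration.

```python
import random

def sample_with_replacement(population, n, seed=None):
    """Draw n values uniformly at random, allowing repeats."""
    rng = random.Random(seed)
    return rng.choices(population, k=n)

# Made-up delays in minutes, standing in for a population.
population = [-6, -3, 0, 12, 28]

# The sample may be larger than the population because values can repeat.
sample = sample_with_replacement(population, 8, seed=0)
print(sample)
```

Because each draw is made from the full population, the same flight can appear more than once in a sample, and the sample size is not limited by the population size.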

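As we saw with the dice, the empirical histogram of the sample resembles the population histogram more and more closely as the sample size increases. Compare these histograms to the population histogram above.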

[In ]:
empirical_hist_delay(10)
Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The x-axis extends from about -10 to 200. There are three bars with non-zero height, from -10 to 0 with middle height, 0 to 10 with the tallest height, and 10 to 20 with the shortest height.
[In ]:
empirical_hist_delay(100)
Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The tallest bars are between -10 and 20. There are a number of short, non-zero height bars between 20 and 110.

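The most consistently visible discrepancies are among the values that are rare in the population. In our example, those values are in the right-hand tail of the distribution. But as the sample size increases, even those values begin to appear in the sample in roughly the correct proportions.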

[In ]:
empirical_hist_delay(1000)
Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The tallest bars are again between -10 and 20 with short, non-zero height bars extending to about 150.
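
The behavior of the rare values can also be checked numerically. The sketch below is an illustration under stated assumptions: the population is simulated from a skewed distribution rather than read from the united table, and the 60-minute cutoff and the helper tail_proportion are hypothetical choices made for this example.

```python
import random

rng = random.Random(42)

# Hypothetical skewed population: mostly small values, a few very large ones.
population = [rng.expovariate(1 / 20) - 10 for _ in range(10_000)]

def tail_proportion(values, cutoff=60):
    """Proportion of values that lie above the cutoff."""
    return sum(1 for v in values if v > cutoff) / len(values)

pop_tail = tail_proportion(population)

# Tail proportion in samples of increasing size, drawn with replacement.
results = {}
for n in (10, 100, 1000, 10_000):
    sample = rng.choices(population, k=n)
    results[n] = tail_proportion(sample)
    print(n, round(results[n], 4), "population:", round(pop_tail, 4))
```

In small samples the tail proportion is often 0 or far from the population value; in large samples it tends to settle near it.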

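Convergence of the Empirical Histogram of the Sample

What we have observed in this section can be summarized as follows:

For a large random sample, the empirical histogram of the sample resembles the histogram of the population, with high probability.

This justifies the use of large random samples in statistical inference. The idea is that since a large random sample is likely to resemble the population from which it is drawn, quantities computed from the values in the sample are likely to be close to the corresponding quantities in the population.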
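
As a rough illustration of that last point, the sketch below compares the median and mean of one large random sample with those of its population. The population is simulated (it is not the united data), and the distribution, seed, and sample size are arbitrary assumptions made for this example.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical skewed "population" of delays in minutes (not the united data).
population = [rng.expovariate(1 / 20) - 10 for _ in range(10_000)]

# One large random sample, drawn with replacement.
sample = rng.choices(population, k=5_000)

pop_median = statistics.median(population)
sample_median = statistics.median(sample)
pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print("median:", round(pop_median, 2), "vs", round(sample_median, 2))
print("mean:  ", round(pop_mean, 2), "vs", round(sample_mean, 2))
```

A single run is not a proof, but repeating it with different seeds should show the sample quantities staying close to the population quantities most of the time.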