预测区间

回归的主要用途之一是对不在原始样本中但与样本个体相似的新个体进行预测。用模型的语言来说，我们想要估计新 $x$ 值下的 $y$ 值。

我们的估计值是真实直线在 $x$ 处的高度。当然，我们不知道真实直线。我们用作替代的是穿过样本点的回归线。

给定 $x$ 值处的拟合值是基于该 $x$ 值的 $y$ 的回归估计值。换句话说，给定 $x$ 值处的拟合值就是回归线在该 $x$ 处的高度。

[In ]:

from datascience import *
path_data = '../../../assets/data/'
import numpy as np
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

[In ]:

baby = Table.read_table(path_data + 'baby.csv')

[In ]:

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  

def correlation(t, x, y):
    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y))/np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

def fit(table, x, y):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

def residual(table, x, y):
    return table.column(y) - fit(table, x, y)

def scatter_fit(table, x, y):
    plots.scatter(table.column(x), table.column(y), s=20)
    plots.plot(table.column(x), fit(table, x, y), lw=2, color='gold')
    plots.xlabel(x)
    plots.ylabel(y)

假设我们尝试根据孕天数来预测婴儿的出生体重。正如我们在上一节中看到的，数据与回归模型拟合得相当好，真实直线斜率的 95% 置信区间不包含 0。因此，进行我们的预测似乎是合理的。

下图显示了预测值在回归线上的位置。红色线位于 $x = 300$ 处。

[In ]:

scatter_fit(baby, 'Gestational Days', 'Birth Weight')
s = slope(baby, 'Gestational Days', 'Birth Weight')
i = intercept(baby, 'Gestational Days', 'Birth Weight')
fit_300 = s*300 + i
plots.scatter(300, fit_300, color='red', s=20)
plots.plot([300,300], [0, fit_300], color='red', lw=2)
plots.ylim([0, 200]);

Scatterplot with 'Gestational Days' on the x-axis and 'Birth Weight' on the y-axis. The data points are shown in blue and generally have x values greater than x=225 and a weak to moderate positive correlation. Some outliers exist with low x values, around 150, 180, and just above 200. Each of these outlier data points have y values below 120. A gold line is shown with a positive slope through the majority of the datapoints. A vertical red line is drawn from y=0 to about y=125, until it reaches the gold line at x=300.

红色线与回归线相交点的高度就是 300 孕天数处的拟合值。

函数 fitted_value 计算这个高度。与函数 correlation、slope 和 intercept 一样，它的参数包括表的名称以及 $x$ 和 $y$ 列的标签。但它还需要第四个参数，即进行估计的 $x$ 值。

[In ]:

def fitted_value(table, x, y, given_x):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * given_x  + b

300 孕天数处的拟合值约为 129.2 盎司。换句话说，对于持续 300 孕天数的妊娠期，我们对婴儿体重的估计约为 129.2 盎司。

[In ]:

fit_300 = fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)
fit_300

129.2129241703143

预测的变异性

我们已经开发了一种方法，使用样本中的数据，基于孕天数对一个新生儿的出生体重进行一次预测。但作为数据科学家，我们知道样本可能不同。如果样本不同，回归线也会不同，我们的预测也会不同。要了解我们的预测有多好，我们必须了解预测的可变性。

为此，我们必须生成新样本。我们可以像上一节那样通过对散点图进行自助法重抽样来实现。然后，我们将在每次重复中对散点图拟合回归线，并基于每条线进行预测。下图显示了 10 条这样的线，以及相应的 300 孕天数处的预测出生体重。

[In ]:

x = 300

lines = Table(['slope','intercept'])
for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

lines['prediction at x='+str(x)] = lines.column('slope')*x + lines.column('intercept')

xlims = np.array([291, 309])
left = xlims[0]*lines[0] + lines[1]
right = xlims[1]*lines[0] + lines[1]
fit_x = x*lines['slope'] + lines['intercept']

for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x, fit_x[i], s=30)

A plot with x ranging from 292.5 to 307.5 nd y from 124 to 136. Ten different colored lines are drawn at with an intersecting dot at x=300. The lines intersect with x=300 at y values between 126 and 131.

预测值因线条不同而变化。下表显示了 10 条线中每条线的斜率和截距，以及预测值。

[In ]:

lines

slope    | intercept | prediction at x=300
0.431704 | -1.10973  | 128.402
0.512942 | -23.6503  | 130.232
0.450939 | -6.73161  | 128.55
0.447632 | -5.68583  | 128.604
0.353555 | 20.1531   | 126.22
0.469301 | -11.6276  | 129.163
0.445023 | -5.01298  | 128.494
0.540571 | -31.3711  | 130.8
0.487377 | -16.8051  | 129.408
0.451658 | -5.45954  | 130.038

自助法预测区间

如果我们增加重抽样过程的重复次数，就可以生成预测值的经验直方图。这将使我们能够创建一个预测区间，使用与我们为斜率创建自助法置信区间相同的百分位数方法。

让我们定义一个名为 bootstrap_prediction 的函数来实现这一功能。该函数接受五个参数： - 表的名称 - 预测变量和响应变量的列标签（按此顺序） - 进行预测的 $x$ 值 - 所需的自助法重复次数

在每次重复中，该函数对原始散点图进行自助法重抽样，并根据指定的 $x$ 值找到 $y$ 的预测值。具体来说，它调用我们在本节前面定义的函数 fitted_value 来查找指定 $x$ 处的拟合值。

最后，它绘制所有预测值的经验直方图，并打印由预测值的“中间 95%”组成的区间。它还会打印基于原始散点图的回归线的预测值。

[In ]:

# Bootstrap prediction of variable y at new_x
# Data contained in table; prediction by regression of y based on x
# repetitions = number of bootstrap replications of the original scatter plot

def bootstrap_prediction(table, x, y, new_x, repetitions):
    
    # For each repetition:
    # Bootstrap the scatter; 
    # get the regression prediction at new_x; 
    # augment the predictions list
    predictions = make_array()
    for i in np.arange(repetitions):
        bootstrap_sample = table.sample()
        bootstrap_prediction = fitted_value(bootstrap_sample, x, y, new_x)
        predictions = np.append(predictions, bootstrap_prediction)
        
    # Find the ends of the approximate 95% prediction interval
    left = percentile(2.5, predictions)
    right = percentile(97.5, predictions)
    
    # Prediction based on original sample
    original = fitted_value(table, x, y, new_x)
    
    # Display results
    Table().with_column('Prediction', predictions).hist(bins=20)
    plots.xlabel('predictions at x='+str(new_x))
    plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=8);
    print('Height of regression line at x='+str(new_x)+':', original)
    print('Approximate 95%-confidence interval:')
    print(left, right)

[In ]:

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 300, 5000)

Height of regression line at x=300: 129.2129241703143
Approximate 95%-confidence interval:
127.30986480617071 131.3039886526053

Histogram with 'predictions at x=300' on the x-axis and 'Percent per unit' on the y-axis. The histogram is roughly bell shaped and symmetric about x~129, ranging from x=126 to x=132. A gold bar is drawn at the bottom of the histogram from about x=127 to about x=131.

上图显示了基于 5000 次自助法重复的 300 孕天数处婴儿预测出生体重的自助法经验直方图。经验分布大致为正态分布。

通过取预测值的“中间 95%”构建了近似 95% 的预测区间，即从预测值的第 2.5 百分位数到第 97.5 百分位数的区间。该区间范围约为 127 到 131。基于原始样本的预测值约为 129，接近区间的中心。

改变预测变量值的影响

下图显示了 5,000 次自助法在 285 孕天数处的预测直方图。基于原始样本的预测约为 122 盎司，区间范围约为 121 盎司到 123 盎司。

[In ]:

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 285, 5000)

Height of regression line at x=285: 122.21457101607608
Approximate 95%-confidence interval:
121.15559627774567 123.29019769886106

Histogram with 'predictions at x=285' on the x-axis and 'Percent per unit' on the y-axis. The histogram is roughly bell shaped and symmetric about x~122.25, ranging from x=121 to x=124. A gold bar is drawn at the bottom of the histogram from about x=121 to about x=123.

注意这个区间比 300 孕天数处的预测区间更窄。让我们研究一下原因。

孕天数的平均值约为 279 天：

[In ]:

np.mean(baby.column('Gestational Days'))

279.1013628620102

所以 285 比 300 更接近分布的中心。通常，基于自助法样本的回归线在预测变量分布中心附近彼此更接近。因此，所有的预测值也更接近。这解释了预测区间宽度更窄的原因。

你可以在下图中看到这一点，该图显示了十次自助法重复在 $x = 285$ 和 $x = 300$ 处的预测值。通常，线在 $x = 300$ 处的间距比在 $x = 285$ 处更大，因此 $x = 300$ 处的预测值变异性更大。

[In ]:

x1 = 300
x2 = 285

lines = Table(['slope','intercept'])
for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

xlims = np.array([260, 310])
left = xlims[0]*lines[0] + lines[1]
right = xlims[1]*lines[0] + lines[1]
fit_x1 = x1*lines['slope'] + lines['intercept']
fit_x2 = x2*lines['slope'] + lines['intercept']

plots.xlim(xlims)
for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x1, fit_x1[i], s=30)
    plots.scatter(x2, fit_x2[i], s=30)

10 lines are drawn in different colors with data points highlighted on x=285 and x=300.

注意事项

我们在本章中执行的所有预测和检验都假设回归模型成立。具体来说，这些方法假设散点图类似于以下方式生成的点：从直线上的点开始，然后通过添加随机正态噪声将其推离直线。

如果散点图看起来不是这样，那么模型可能不适用于这些数据。如果模型不成立，那么假定模型为真的计算是无效的。

因此，在我们开始基于模型进行预测或检验关于模型参数的假设之前，我们必须首先判断回归模型是否适用于我们的数据。一个简单的方法是做我们在本节中所做的：绘制两个变量的散点图，观察它是否大致呈线性且均匀地分布在一条线周围。我们还应该使用上一节开发的基于残差图的诊断方法。