真实斜率的推断

我们的模拟表明,如果回归模型成立且样本量较大,那么回归线很可能接近真实直线。这使我们能够估计真实直线的斜率。

我们将使用熟悉的母亲及其新生儿样本,来开发一种估计真实直线斜率的方法。首先,让我们看看是否相信回归模型是一套描述出生体重与孕天数之间关系的适当假设。

[In ]:
from datascience import *
path_data = '../../../assets/data/'
import numpy as np
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
np.set_printoptions(legacy='1.13')
[In ]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  

def correlation(t, x, y):
    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y))/np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

def fit(table, x, y):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

def residual(table, x, y):
    return table.column(y) - fit(table, x, y)

def scatter_fit(table, x, y):
    plots.scatter(table.column(x), table.column(y), s=20)
    plots.plot(table.column(x), fit(table, x, y), lw=2, color='gold')
    plots.xlabel(x)
    plots.ylabel(y)
[In ]:
def draw_and_compare(true_slope, true_int, sample_size):
    x = np.random.normal(50, 5, sample_size)
    xlims = np.array([np.min(x), np.max(x)])
    eps = np.random.normal(0, 6, sample_size)
    y = (true_slope*x + true_int) + eps
    tyche = Table().with_columns(
        'x', x,
        'y', y
    )

    plots.figure(figsize=(6, 16))
    plots.subplot(4, 1, 1)
    plots.scatter(tyche['x'], tyche['y'], s=20)
    plots.plot(xlims, true_slope*xlims + true_int, lw=2, color='green')
    plots.title('True Line, and Points Created')

    plots.subplot(4, 1, 2)
    plots.scatter(tyche['x'],tyche['y'], s=20)
    plots.title('What We Get to See')

    plots.subplot(4, 1, 3)
    scatter_fit(tyche, 'x', 'y')
    plots.xlabel("")
    plots.ylabel("")
    plots.title('Regression Line: Estimate of True Line')

    plots.subplot(4, 1, 4)
    scatter_fit(tyche, 'x', 'y')
    plots.ylabel("")
    xlims = np.array([np.min(tyche['x']), np.max(tyche['x'])])
    plots.plot(xlims, true_slope*xlims + true_int, lw=2, color='green')
    plots.title("Regression Line and True Line")
[In ]:
baby = Table.read_table(path_data + 'baby.csv')
[In ]:
scatter_fit(baby, 'Gestational Days', 'Birth Weight')
Scatterplot with 'Gestational Days' on the x-axis and 'Birth Weight' on the y-axis. The data points are shown in blue and generally have x values greater than x=225 and a weak to moderate positive correlation. Some outliers exist with low x values, around 150, 180, and just above 200. Each of these outlier data points have y values below 120. A gold line is shown with a positive slope through the majority of the datapoints.
[In ]:
correlation(baby, 'Gestational Days', 'Birth Weight')
0.40754279338885108

总的来说,散点在线条周围分布得相当均匀,尽管有一些点散布在主云团的外围。相关性为 0.4,回归线具有正斜率。

这是否反映了真实直线具有正斜率这一事实?为了回答这个问题,让我们看看能否估计真实斜率。我们当然有一个估计值:回归线的斜率,约为 0.47 盎司/天。

[In ]:
slope(baby, 'Gestational Days', 'Birth Weight')
0.46655687694921522

但如果散点图不同,回归线也会不同,斜率也可能不同。我们如何估算斜率可能有多大的差异?

我们需要另一个样本点,以便我们可以穿过新的散点图绘制回归线并找到其斜率。但我们从哪里得到另一个样本呢?

你已经猜到了——我们将“对原始样本进行自助法重抽样”。这将给我们一个自助法散点图,我们可以穿过它绘制一条回归线。

对散点图进行自助法重抽样

我们可以通过对原始样本进行有放回的随机抽样来模拟新样本,抽样次数与原始样本大小相同。每个新样本都会给我们一个散点图。我们称之为“自助法散点图”,简而言之,我们将整个过程称为“对散点图进行自助法重抽样”。

以下是来自样本的原始散点图,以及自助法重抽样过程的四个重复。注意重抽样后的散点图通常比原始图稍微稀疏一些。这是因为原始样本中的一些点没有被抽中。

[In ]:
plots.figure(figsize=(8, 18))
plots.subplot(5, 1, 1)
plots.scatter(baby[1], baby[0], s=10)
plots.xlim([150, 400])
plots.title('Original sample')

for i in np.arange(1, 5, 1):
    plots.subplot(5,1,i+1)
    rep = baby.sample(with_replacement=True)
    plots.scatter(rep[1], rep[0], s=10)
    plots.xlim([150, 400])
    plots.title('Bootstrap sample '+str(i))
Five scatterplots are shown. The first is titled 'Original sample,' and the remaining are titled 'Bootstrap sample 1' to 'Bootstrap sample 4.' The 'Original sample' is the same data as before. The bootstrap samples resemble the original sample but have slight differences. For example, some have one or two of the outliers, but not all of them.

估计真实斜率

我们可以对散点图进行大量自助法重抽样,并在每个自助法散点图上绘制回归线。这些直线各自都有一个斜率。我们可以简单地收集所有斜率并绘制它们的经验直方图。回顾一下,默认情况下,sample 方法进行有放回的随机抽样,抽样次数与表中的行数相同。也就是说,sample 默认生成一个自助法样本。

[In ]:
slopes = make_array()
for i in np.arange(5000):
    bootstrap_sample = baby.sample()
    bootstrap_slope = slope(bootstrap_sample, 'Gestational Days', 'Birth Weight')
    slopes = np.append(slopes, bootstrap_slope)
Table().with_column('Bootstrap Slopes', slopes).hist(bins=20)
A histogram with 'Bootstrap Slopes' on the x-axis and 'Percent per unit' on the y-axis. The histogram is symmetric and bell shaped ranging from about 0.35 to 0.6 and centered around 0.475.

然后,我们可以使用自助法百分位数方法为真实直线的斜率构建一个近似 95% 的置信区间。该置信区间从 5000 个自助法斜率的第 2.5 百分位数延伸到第 97.5 百分位数。

[In ]:
left = percentile(2.5, slopes)
right = percentile(97.5, slopes)
left, right
(0.38169251837495338, 0.55839374914417184)

真实斜率的近似 95% 置信区间从约 0.38 盎司/天延伸到约 0.56 盎司/天。

用于自助法估计斜率的函数

让我们将估计斜率方法的所有步骤收集起来,定义一个执行这些步骤的函数 bootstrap_slope。它的参数是表的名称、预测变量和响应变量的标签,以及所需的自助法重复次数。在每次重复中,该函数对原始散点图进行自助法重抽样,并计算所得回归线的斜率。然后,它绘制所有生成斜率的直方图,并打印由斜率的“中间 95%”组成的区间。

[In ]:
def bootstrap_slope(table, x, y, repetitions):
    
    # For each repetition:
    # Bootstrap the scatter, get the slope of the regression line,
    # augment the list of generated slopes
    slopes = make_array()
    for i in np.arange(repetitions):
        bootstrap_sample = table.sample()
        bootstrap_slope = slope(bootstrap_sample, x, y)
        slopes = np.append(slopes, bootstrap_slope)
    
    # Find the endpoints of the 95% confidence interval for the true slope
    left = percentile(2.5, slopes)
    right = percentile(97.5, slopes)
    
    # Slope of the regression line from the original sample
    observed_slope = slope(table, x, y)
    
    # Display results
    Table().with_column('Bootstrap Slopes', slopes).hist(bins=20)
    plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=8);
    print('Slope of regression line:', observed_slope)
    print('Approximate 95%-confidence interval for the true slope:')
    print(left, right)

当我们调用 bootstrap_slope 来寻找真实斜率的置信区间(响应变量为出生体重,预测变量为孕天数)时,我们得到一个与之前非常接近的区间:大约为 0.38 盎司/天到 0.56 盎司/天。

[In ]:
bootstrap_slope(baby, 'Gestational Days', 'Birth Weight', 5000)
Slope of regression line: 0.466556876949
Approximate 95%-confidence interval for the true slope:
0.380373502308 0.558711930778
The previous histogram has been reproduced with a gold bar extending along the bottom of the histogram from about 0.375 to 0.55. The previous histogram description is: A histogram with 'Bootstrap Slopes' on the x-axis and 'Percent per unit' on the y-axis. The histogram is symmetric and bell shaped ranging from about 0.35 to 0.6 and centered around 0.475.

现在我们有了一个函数,可以自动完成回归模型中真实直线斜率的估计过程,我们也可以将其用于其他变量。

例如,让我们考察出生体重与母亲身高之间的关系。身高较高的女性是否倾向于生育更重的婴儿?

基于散点图,回归模型似乎合理,但相关性不高,仅为 0.2 左右。

[In ]:
scatter_fit(baby, 'Maternal Height', 'Birth Weight')
A scatterplot with 'Maternal Height' on the x-axis and 'Birth Weight' on the y-axis. The data points are in light blue and have discrete x values. For a single x-value, there are generally multiple y-values plotted. Most of the data points have x>55. A gold line is drawn with a slight positive slope.
[In ]:
correlation(baby, 'Maternal Height', 'Birth Weight')
0.20370417718968034

与之前一样,我们可以使用 bootstrap_slope 来估计回归模型中真实直线的斜率。

[In ]:
bootstrap_slope(baby, 'Maternal Height', 'Birth Weight', 5000)
Slope of regression line: 1.47801935193
Approximate 95%-confidence interval for the true slope:
1.06617289982 1.91237018475
A histogram with 'Boostrap Slopes' on the x-axis and 'Percent per unit' on the y-axis. The histogram is symmetric and bell shaped, extending from about 0.75 to 2.25 and centered around 1.50. A gold bar is drawn on the bottom of the histogram extending from just above 1.00 to just below 2.00.

真实斜率的 95% 置信区间从约 1 盎司/英寸延伸到约 1.9 盎司/英寸。

真实斜率可能为 0 吗?

假设我们相信数据遵循回归模型,并且我们拟合回归线来估计真实直线。如果回归线不完全平坦(几乎总是如此),我们将在散点图中观察到一些线性关联。

但如果这种观察是虚假的呢?换句话说,如果真实直线是平坦的——即两个变量之间没有线性关系——而我们观察到的关联仅仅是由于生成样本点时产生的随机性呢?

以下是一个模拟,说明了为什么会提出这个问题。我们将再次调用函数 draw_and_compare,这次要求真实直线斜率为 0。我们的目标是看看我们的回归线是否显示出非零的斜率。

记住函数 draw_and_compare 的参数是真实直线的斜率和截距,以及要生成的点的数量。

[In ]:
draw_and_compare(0, 10, 25)
Four graphs are drawn. The first graph is titled 'True Line, and Points Created,' with a green line drawn at y=10 and light blue data points drawn above and below the line. On the left hand side of the graph there are more data points below the green line than above it and on the right hand side of the graph there are more data points above the green line than below it. The second graph is titled 'What We Get to See' and only has the data points drawn. The third graph is titled 'Regression Line: Estimate of True Line' and has the same data points drawn along with a gold line with a slight positive slope. The final graph is titled 'Regression Line and True Line' and has the data points and both the gold and green lines drawn.

运行模拟几次,每次保持真实直线的斜率为 0。你会注意到,虽然真实直线的斜率为 0,但回归线的斜率通常不为 0。回归线有时向上倾斜,有时向下倾斜,每次都会给我们一种两个变量相关的错误印象。

为了判断我们看到的斜率是否真实,我们需要检验以下假设:

原假设。 真实直线的斜率为 0。

备择假设。 真实直线的斜率不为 0。

我们完全有能力进行这样的检验。既然我们可以为真实斜率构建一个 95% 的置信区间,我们只需要看该区间是否包含 0。

如果不包含 0,那么我们可以拒绝原假设(使用 P 值的 5% 截断值)。

如果真实斜率的置信区间确实包含 0,那么我们没有足够的证据拒绝原假设。也许我们看到的斜率是虚假的。

让我们在一个例子中使用这个方法。假设我们尝试根据母亲的年龄来估计婴儿的出生体重。基于样本,通过母亲年龄估计出生体重的回归线斜率为正,约为 0.08 盎司/年。

[In ]:
slope(baby, 'Maternal Age', 'Birth Weight')
0.085007669415825132

虽然斜率为正,但非常小。回归线非常接近平坦,这引发了真实直线是否平坦的问题。

[In ]:
scatter_fit(baby, 'Maternal Age', 'Birth Weight')
A scatterplot with 'Maternal Age' on the x-axis and 'Birth Weight' on the y-axis. The data points are shown in light blue and have discrete x values. There is no apparent shape to the data. A gold line is drawn with a very slight positive slope.

我们可以使用 bootstrap_slope 来估计真实直线的斜率。计算表明,真实斜率的近似 95% 自助法置信区间具有负的左端点和正的右端点——换句话说,该区间包含 0。

[In ]:
bootstrap_slope(baby, 'Maternal Age', 'Birth Weight', 5000)
Slope of regression line: 0.0850076694158
Approximate 95%-confidence interval for the true slope:
-0.104436940365 0.276512953946
A histogram with 'Bootstrap Slopes' on the x-axis and 'Percent per unit' on the y-axis. The histogram is centered about 0.1, symmetric, and bell shaped ranging from about -0.2 to 0.35. A gold bar is shown at the bottom of the histogram ranging from 0.1 to about 0.3

因为区间包含 0,我们不能拒绝原假设,即母亲年龄与婴儿出生体重之间真实线性关系的斜率为 0。基于这一分析,使用以母亲年龄为预测变量的回归模型来预测出生体重是不明智的。