Visual Diagnostics

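Suppose a data scientist has decided to use linear regression to estimate values of one variable (called the response variable) based on another variable (called the predictor). To see how well this method of estimation performs, the data scientist must measure how far off the estimates are from the actual values. These differences are called "residuals".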

$$ \mbox{residual} ~=~ \mbox{observed value} ~-~ \mbox{regression estimate} $$

A residual is what is left over after estimation: the residue.

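Residuals are the signed vertical distances of the points from the regression line: positive for points above the line and negative for points below it. There is one residual for each point in the scatter plot. The residual is the difference between the observed value of $y$ and the fitted value of $y$, so for the point $(x, y)$,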

$$ \mbox{residual} ~~ = ~~ y ~-~ \mbox{fitted value of }y ~~ = ~~ y ~-~ \mbox{height of regression line at }x $$
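As a small worked illustration of this formula, the sketch below fits a least-squares line to five made-up points and computes each residual as the observed $y$ minus the height of the line at $x$. The data and variable names are hypothetical, and NumPy's `np.polyfit` is used here only as a stand-in for the regression functions defined in this section.

```python
import numpy as np

# Hypothetical toy data, not taken from any table in this section.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares line; for degree 1, np.polyfit returns slope, then intercept.
line_slope, line_intercept = np.polyfit(x, y, 1)

fitted = line_slope * x + line_intercept   # height of the regression line at each x
residuals = y - fitted                     # observed value minus regression estimate

print(np.round(residuals, 3))
```

Each residual is positive where the point lies above the line and negative where it lies below.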

The function residual calculates the residuals. The calculation assumes all the relevant functions we have already defined: standard_units, correlation, slope, intercept, and fit.

[In ]:

from datascience import *
path_data = '../../../assets/data/'
import numpy as np
from scipy import stats

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

[In ]:

family_heights = Table.read_table(path_data + 'family_heights.csv')
heights = family_heights.select('midparentHeight', 'childHeight')
heights = heights.relabel(0, 'MidParent').relabel(1, 'Child')
hybrid = Table.read_table(path_data + 'hybrid.csv')

[In ]:

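def standard_units(x):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (x - np.mean(x))/np.std(x)

def correlation(table, x, y):
    """Correlation coefficient r between columns x and y of table."""
    x_in_standard_units = standard_units(table.column(x))
    y_in_standard_units = standard_units(table.column(y))
    return np.mean(x_in_standard_units * y_in_standard_units)

def slope(table, x, y):
    """Slope of the regression line for estimating y based on x."""
    r = correlation(table, x, y)
    return r * np.std(table.column(y))/np.std(table.column(x))

def intercept(table, x, y):
    """Intercept of the regression line for estimating y based on x."""
    a = slope(table, x, y)
    return np.mean(table.column(y)) -  a * np.mean(table.column(x))

def fit(table, x, y):
    """Fitted values: the height of the regression line at each x."""
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b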

[In ]:

def residual(table, x, y):
    return table.column(y) - fit(table, x, y)

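Continuing our example of estimating the heights of adult children (the response) based on the midparent height (the predictor), let us calculate the fitted values and the residuals.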

[In ]:

heights = heights.with_columns(
        'Fitted Value', fit(heights, 'MidParent', 'Child'),
        'Residual', residual(heights, 'MidParent', 'Child')
    )
heights

MidParent | Child | Fitted Value | Residual
75.43     | 73.2  | 70.7124      | 2.48763
75.43     | 69.2  | 70.7124      | -1.51237
75.43     | 69    | 70.7124      | -1.71237
75.43     | 69    | 70.7124      | -1.71237
73.66     | 73.5  | 69.5842      | 3.91576
73.66     | 72.5  | 69.5842      | 2.91576
73.66     | 65.5  | 69.5842      | -4.08424
73.66     | 65.5  | 69.5842      | -4.08424
72.06     | 71    | 68.5645      | 2.43553
72.06     | 68    | 68.5645      | -0.564467
... (924 rows omitted)

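When there are so many variables to work with, it is always helpful to start with visualization. The function scatter_fit draws the scatter plot of the data, as well as the regression line.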

[In ]:

def scatter_fit(table, x, y):
    table.scatter(x, y, s=15)
    plots.plot(table.column(x), fit(table, x, y), lw=4, color='gold')
    plots.xlabel(x)
    plots.ylabel(y)

[In ]:

scatter_fit(heights, 'MidParent', 'Child')

Scatterplot with 'MidParent' on the x-axis and 'Child' on the y-axis. The data points are in dark blue and have a weak to moderate positive association. A gold line is drawn through the center of the data with a positive slope.

A "residual plot" can be drawn by plotting the residuals against the predictor variable. The function residual_plot does just that.

[In ]:

def residual_plot(table, x, y):
    x_array = table.column(x)
    t = Table().with_columns(
            x, x_array,
            'residuals', residual(table, x, y)
        )
    t.scatter(x, 'residuals', color='r')
    xlims = make_array(min(x_array), max(x_array))
    plots.plot(xlims, make_array(0, 0), color='darkblue', lw=4)
    plots.title('Residual Plot')

[In ]:

residual_plot(heights, 'MidParent', 'Child')

Scatterplot labeled 'Residual Plot' with 'MidParent' on the x-axis and 'residuals' on the y-axis. The x-axis has the same range as in the previous graph, x=64 to x=74, but the y-axis now ranges from y=-10 to y=10. The data points are in red and have no association; they are a blob. A dark blue horizontal line is drawn at y=0.

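The midparent heights are on the horizontal axis, as in the original scatter plot. But now the vertical axis shows the residuals. Notice that the plot appears to be centered around the horizontal line at the level 0 (shown in dark blue). Notice also that the plot shows no upward or downward trend. We will observe later that this lack of trend is true of all regressions.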
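Both observations can be checked numerically as well as visually. The following is a minimal sketch on made-up data, not the heights table: it rebuilds the regression line from the correlation, as the functions in this section do, and then computes the average of the residuals and the correlation between the residuals and the predictor. The simulated data, the seed, and the array-based variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed so the run is repeatable
x = rng.normal(68, 2, size=500)                   # hypothetical predictor values
y = 0.6 * x + 25 + rng.normal(0, 3, size=500)     # hypothetical response values

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

# Same recipe as the table-based functions, applied directly to arrays.
r = np.mean(standard_units(x) * standard_units(y))
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (slope * x + intercept)

# Both quantities should be zero up to floating-point rounding.
print(np.mean(residuals))
print(np.mean(standard_units(x) * standard_units(residuals)))
```

A zero average corresponds to the plot being centered on the horizontal line at 0, and a zero correlation with the predictor corresponds to the absence of an upward or downward trend.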

Regression Diagnostics

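Residual plots help us make visual assessments of the quality of a linear regression analysis. Such assessments are called "diagnostics". The function regression_diagnostic_plots draws the original scatter plot as well as the residual plot for ease of comparison.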

[In ]:

def regression_diagnostic_plots(table, x, y):
    scatter_fit(table, x, y)
    residual_plot(table, x, y)

[In ]:

regression_diagnostic_plots(heights, 'MidParent', 'Child')

A previous scatterplot is reproduced here without changes: Scatterplot with 'MidParent' on the x-axis and 'Child' on the y-axis. The data points are in dark blue and have a weak to moderate positive association. A gold line is drawn through the center of the data with a positive slope.

A previous scatterplot is reproduced here without changes: Scatterplot labeled 'Residual Plot' with 'MidParent' on the x-axis and 'residuals' on the y-axis. The x-axis has the same range as in the previous graph, x=64 to x=74, but the y-axis now ranges from y=-10 to y=10. The data points are in red and have no association; they are a blob. A dark blue horizontal line is drawn at y=0.

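This residual plot indicates that linear regression was a reasonable method of estimation. Notice how the residuals are distributed fairly symmetrically above and below the horizontal line at 0, corresponding to the original scatter plot being roughly symmetrical above and below the regression line. Notice also that the vertical spread of the plot is fairly even across the most common values of the midparent heights. In other words, apart from a few outlying points, the plot isn't narrower in some places and wider in others.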

That is, the accuracy of the regression appears to be about the same across the observed range of the predictor variable.

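The residual plot of a good regression shows no pattern. The residuals look about the same, above and below the horizontal line at 0, across the range of the predictor variable.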

Detecting Nonlinearity

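Drawing the scatter plot of the data usually gives an indication of whether the relation between the two variables is non-linear. Often, however, it is easier to spot non-linearity in a residual plot than in the original scatter plot. This is usually because of the scales of the two plots: the residual plot allows us to zoom in on the errors and hence makes it easier to spot patterns.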

A dugong, a large marine mammal, underwater.

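Our data are a dataset on the age and length of dugongs, which are marine mammals related to manatees and sea cows (image from Wikimedia Commons). The data are in a table called dugong. Age is measured in years and length in meters. Because dugongs tend not to keep track of their birthdays, ages are estimated based on variables such as the condition of their teeth.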

[In ]:

dugong = Table.read_table(path_data + 'dugongs.csv')
dugong = dugong.move_to_start('Length')
dugong

Length | Age
1.8    | 1
1.85   | 1.5
1.87   | 1.5
1.77   | 1.5
2.02   | 2.5
2.27   | 4
2.15   | 5
2.26   | 5
2.35   | 7
2.47   | 8
... (17 rows omitted)

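If we could measure the length of a dugong, what could we say about its age? Let's examine what our data say. Here is a regression of age (the response) on length (the predictor). The correlation between the two variables is substantial, at 0.83.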

[In ]:

correlation(dugong, 'Length', 'Age')

0.8296474554905714

High correlation notwithstanding, the plot shows a curved pattern that is much more visible in the residual plot.

[In ]:

regression_diagnostic_plots(dugong, 'Length', 'Age')

Scatterplot with 'Length' on the x-axis and 'Age' on the y-axis. There is a fairly strong positive association; as x values increase so do y values. Some outliers exist for data points with large x values that have higher y values. A gold line is drawn through the data; the outliers are far from the gold line.

Scatterplot titled 'Residual Plot' with 'Length' on the x-axis and 'residuals' on the y-axis. The y-axis ranges from -5 to 15. Data points are in red, and tend to have y values lower than 5, but two data points with high x values have y values greater than 5. A dark blue horizontal line sits above the center of the data at y=0, excluding the outliers.

While you can spot the non-linearity in the original scatter plot, it is more clearly evident in the residual plot.

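At the low end of the lengths, the residuals are almost all positive; then they are almost all negative; then positive again at the high end of the lengths. Because a residual is the observed value minus the estimate, a positive residual means the estimate is too low. In other words, the regression estimates have a pattern of being too low, then too high, then too low. That means it would have been better to use a curve instead of a straight line to estimate the ages.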

When a residual plot shows a pattern, there may be a non-linear relation between the variables.
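To see how such a pattern arises, here is a minimal sketch on a made-up, noise-free curved relation rather than the dugong data. It fits a straight line, looks at the signs of the residuals at the low end, the middle, and the high end of the predictor, and then compares the size of the residuals with those from a quadratic fit. The data are hypothetical, and the quadratic is just one possible choice of curve, used here only for illustration.

```python
import numpy as np

# Hypothetical curved relation with no noise, so the pattern is easy to see.
x = np.linspace(0, 4, 41)
y = x ** 2

# Straight-line fit and its residuals.
line_coeffs = np.polyfit(x, y, 1)
line_resid = y - np.polyval(line_coeffs, x)

# Quadratic fit and its residuals, for comparison.
curve_coeffs = np.polyfit(x, y, 2)
curve_resid = y - np.polyval(curve_coeffs, x)

# Signs of the line's residuals at the low end, middle, and high end of x.
print(np.sign(line_resid[[0, 20, -1]]))

# Typical size of the residuals under each fit.
print(np.std(line_resid), np.std(curve_resid))
```

For this data the line's residuals are positive at both ends and negative in the middle, the same kind of sign pattern described for the dugongs, while the curve leaves residuals that are essentially zero.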

Detecting Heteroscedasticity

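"Heteroscedasticity" is a word that will surely be of interest to those who are preparing for spelling bees. For data scientists, its interest lies in its meaning, which is "uneven spread".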

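Recall the table hybrid that contains data on hybrid cars in the U.S. Here is a regression of fuel efficiency on the rate of acceleration. The association is negative: cars that accelerate quickly tend to be less efficient.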

[In ]:

regression_diagnostic_plots(hybrid, 'acceleration', 'mpg')

Scatterplot with 'acceleration' on the x-axis and 'mpg' on the y-axis. Data points are in dark blue and data points with smaller x values tend to have a wider range of possible y values. Data points with larger x values tend to have smaller y values. A gold line is drawn with a negative slope through the data.

Scatterplot titled 'Residual Plot' with 'acceleration' on the x-axis and 'residuals' on the y-axis. A horizontal dark blue line is drawn at y=0. On the left hand side of the plot, the red data points are further away from the dark blue line, both above and below it, than the red data points on the right hand side of the graph.

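Notice how the residual plot flares out towards the low end of the accelerations. In other words, the variability in the size of the errors is greater for low values of acceleration than for high values. Uneven variation is often more easily noticed in a residual plot than in the original scatter plot.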

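If the residual plot shows uneven variation about the horizontal line at 0, the regression estimates are not equally accurate across the range of the predictor variable.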
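Uneven variation can also be given a rough numerical summary. The sketch below is one possible check, not a method from this section: it simulates data whose noise is larger at low values of the predictor, fits a line, and compares the SD of the residuals in the lower and upper halves of the predictor's range. The simulated data, the seed, and the split at the median are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
x = np.linspace(1, 10, 400)               # hypothetical predictor values
noise_sd = 12 - x                         # hypothetical: more spread at low x
y = 50 - 3 * x + rng.normal(0, 1, size=x.size) * noise_sd

# Least-squares line and its residuals.
a, b = np.polyfit(x, y, 1)
residuals = y - (a * x + b)

# Compare the spread of the residuals in the two halves of the predictor range.
low_half = x <= np.median(x)
sd_low = np.std(residuals[low_half])
sd_high = np.std(residuals[~low_half])

print(sd_low, sd_high)
```

A clearly larger SD in one half than in the other is the numerical counterpart of a residual plot that flares out on that side.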