多元回归

[In ]:
from datascience import *
path_data = '../../../assets/data/'
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

np.set_printoptions(suppress=True)
[In ]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  

def correlation(t, x, y):
    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))

既然我们已经探索了使用多个属性预测分类变量的方法,让我们回到预测定量变量的问题。预测一个数值量称为回归,而使用多个属性进行回归的常用方法称为“多元线性回归”。

房屋价格

以下房屋价格和属性数据集是在多年间为爱荷华州埃姆斯市收集的。数据集描述可在网上找到。我们将只关注列的一个子集。我们将尝试根据其他列预测销售价格列。

[In ]:
all_sales = Table.read_table(path_data + 'house.csv')
sales = all_sales.where('Bldg Type', '1Fam').where('Sale Condition', 'Normal').select(
    'SalePrice', '1st Flr SF', '2nd Flr SF', 
    'Total Bsmt SF', 'Garage Area', 
    'Wood Deck SF', 'Open Porch SF', 'Lot Area', 
    'Year Built', 'Yr Sold')
sales.sort('SalePrice')
SalePrice | 1st Flr SF | 2nd Flr SF | Total Bsmt SF | Garage Area | Wood Deck SF | Open Porch SF | Lot Area | Year Built | Yr Sold
35000     | 498        | 0          | 498           | 216         | 0            | 0             | 8088     | 1922       | 2006
39300     | 334        | 0          | 0             | 0           | 0            | 0             | 5000     | 1946       | 2007
40000     | 649        | 668        | 649           | 250         | 0            | 54            | 8500     | 1920       | 2008
45000     | 612        | 0          | 0             | 308         | 0            | 0             | 5925     | 1940       | 2009
52000     | 729        | 0          | 270           | 0           | 0            | 0             | 4130     | 1935       | 2008
52500     | 693        | 0          | 693           | 0           | 0            | 20            | 4118     | 1941       | 2006
55000     | 723        | 363        | 723           | 400         | 0            | 24            | 11340    | 1920       | 2008
55000     | 796        | 0          | 796           | 0           | 0            | 0             | 3636     | 1922       | 2008
57625     | 810        | 0          | 0             | 280         | 119          | 24            | 21780    | 1910       | 2009
58500     | 864        | 0          | 864           | 200         | 0            | 0             | 8212     | 1914       | 2010
... (1992 rows omitted)

销售价格的直方图显示了大量的变异性和一个明显不服从正态分布的分布。右侧的長尾包含少数价格非常高的房屋。短的左尾不包含任何售价低于 35,000 美元的房屋。

[In ]:
sales.hist('SalePrice', bins=32, unit='$')
Histogram with 'SalePrice ($)' on the x-axis and 'Percent per $' on the y-axis. The histogram has bars of increasing size up to about x=120000 and then the bars have decreasing height visible through about 500000 with the graph extending out to 700000.

相关性

没有任何单一属性足以预测销售价格。例如,以平方英尺度量的一楼面积与销售价格相关,但只能解释其部分变异性。

[In ]:
sales.scatter('1st Flr SF', 'SalePrice')
Scatterplot with '1st Flr SF' on the x-axis and 'Sale Price' on the y-axis. The data have a positive association, fanning out as the x values increase.
[In ]:
correlation(sales, 'SalePrice', '1st Flr SF')
0.6424662541030225

事实上,没有任何单个属性与销售价格的相关性超过 0.7(除了销售价格本身)。

[In ]:
for label in sales.labels:
    print('Correlation of', label, 'and SalePrice:\t', correlation(sales, label, 'SalePrice'))
Correlation of SalePrice and SalePrice:	 1.0
Correlation of 1st Flr SF and SalePrice:	 0.6424662541030225
Correlation of 2nd Flr SF and SalePrice:	 0.3575218942800824
Correlation of Total Bsmt SF and SalePrice:	 0.652978626757169
Correlation of Garage Area and SalePrice:	 0.6385944852520443
Correlation of Wood Deck SF and SalePrice:	 0.3526986661950492
Correlation of Open Porch SF and SalePrice:	 0.3369094170263733
Correlation of Lot Area and SalePrice:	 0.2908234551157694
Correlation of Year Built and SalePrice:	 0.5651647537135916
Correlation of Yr Sold and SalePrice:	 0.02594857908072111

然而,组合属性可以提供更高的相关性。特别是,如果我们将一楼和二楼面积相加,结果的相关性高于任何一个单独的属性。

[In ]:
both_floors = sales.column(1) + sales.column(2)
correlation(sales.with_column('Both Floors', both_floors), 'SalePrice', 'Both Floors')
0.7821920556134877

这种高相关性表明我们应该尝试使用多个属性来预测销售价格。在一个包含多个观测属性和一个待预测数值(本例中为销售价格)的数据集中,多元线性回归可以是一种有效的技术。

多元线性回归

在多元线性回归中,数值输出是通过将每个属性值乘以不同的斜率,然后将结果求和,从数值输入属性中预测出来的。在这个例子中,1st Flr SF 的斜率将代表房屋一楼的每平方英尺面积应在我们的预测中使用的美元金额。

在开始预测之前,我们将数据随机分成大小相等的训练集和测试集。

[In ]:
train, test = sales.split(1001)
print(train.num_rows, 'training and', test.num_rows, 'test instances.')
1001 training and 1001 test instances.

多元回归中的斜率是一个数组,对于示例中的每个属性都有一个斜率值。预测销售价格涉及将每个属性乘以斜率并求和。

[In ]:
def predict(slopes, row):
    return sum(slopes * np.array(row))

example_row = test.drop('SalePrice').row(0)
print('Predicting sale price for:', example_row)
example_slopes = np.random.normal(10, 1, len(example_row))
print('Using slopes:', example_slopes)
print('Result:', predict(example_slopes, example_row))
Predicting sale price for: Row(1st Flr SF=1207, 2nd Flr SF=0, Total Bsmt SF=1135.0, Garage Area=264.0, Wood Deck SF=0, Open Porch SF=240, Lot Area=9510, Year Built=1962, Yr Sold=2006)
Using slopes: [ 9.52065867  8.58939769 11.48702417  9.50389131  9.09151019  9.86944284
 10.71929443 10.88966608  8.33339346]
Result: 169429.7032316262

结果是一个估计的销售价格,可以与实际销售价格进行比较,以评估斜率是否提供了准确的预测。由于上面的 example_slopes 是随机选择的,我们不应该期望它们能提供准确的预测。

[In ]:
print('Actual sale price:', test.column('SalePrice').item(0))
print('Predicted sale price using random slopes:', predict(example_slopes, example_row))
Actual sale price: 147000
Predicted sale price using random slopes: 169429.7032316262

最小二乘回归

执行多元回归的下一步是定义最小二乘目标。我们对训练集中的每一行进行预测,然后计算预测值与实际价格的均方根误差(RMSE)。

[In ]:
train_prices = train.column(0)
train_attributes = train.drop(0)

def rmse(slopes, attributes, prices):
    errors = []
    for i in np.arange(len(prices)):
        predicted = predict(slopes, attributes.row(i))
        actual = prices.item(i)
        errors.append((predicted - actual) ** 2)
    return np.mean(errors) ** 0.5

def rmse_train(slopes):
    return rmse(slopes, train_attributes, train_prices)

print('RMSE of all training examples using random slopes:', rmse_train(example_slopes))
RMSE of all training examples using random slopes: 68153.56796456952

最后,我们使用 minimize 函数来寻找具有最低 RMSE 的斜率。由于我们要最小化的函数 rmse_train 接受一个数组而不是一个数字,我们必须向 minimize 传递 array=True 参数。使用此参数时,minimize 还需要斜率的初始猜测值,以便它知道输入数组的维度。最后,为了加速优化,我们使用 smooth=True 属性指示 rmse_train 是一个平滑函数。最佳斜率的计算可能需要几分钟。

[In ]:
best_slopes = minimize(rmse_train, start=example_slopes, smooth=True, array=True)
print('The best slopes for the training set:')
Table(train_attributes.labels).with_row(list(best_slopes)).show()
print('RMSE of all training examples using the best slopes:', rmse_train(best_slopes))
The best slopes for the training set:
<IPython.core.display.HTML object>
RMSE of all training examples using the best slopes: 29311.117940347867

解释多元回归

让我们解释这些结果。最佳斜率为我们提供了一种根据房屋属性估算价格的方法。一楼一平方英尺面积价值约 75 美元(第一个斜率),而二楼一平方英尺面积价值约 70 美元(第二个斜率)。最后的负值描述了市场状况:后几年的平均价格较低。

约 30,000 美元的 RMSE 意味着,基于所有属性的最佳线性预测在训练集上的平均误差约为 30,000 美元。我们在测试集上预测价格时发现了类似的误差,这表明我们的预测方法将推广到来自同一总体的其他样本。

[In ]:
test_prices = test.column(0)
test_attributes = test.drop(0)

def rmse_test(slopes):
    return rmse(slopes, test_attributes, test_prices)

rmse_linear = rmse_test(best_slopes)
print('Test set RMSE for multiple linear regression:', rmse_linear)
Test set RMSE for multiple linear regression: 33025.064938240575

如果预测是完美的,那么预测值与实际值的散点图将是一条斜率为 1 的直线。我们看到大多数点落在该线附近,但预测中存在一些误差。

[In ]:
def fit(row):
    return sum(best_slopes * np.array(row))

test.with_column('Fitted', test.drop(0).apply(fit)).scatter('Fitted', 0)
plots.plot([0, 5e5], [0, 5e5]);
Scatterplot with 'Fitted' on the x-axis and 'Sale Price' on the y-axis. The data appears to have a fairly strong positive correlation and a light blue line is drawn over most of the data. As the x values get higher however, the data points begin to be further away from the line.

多元回归的残差图通常将误差(残差)与预测变量的实际值进行比较。我们在下面的残差图中看到,系统性地低估了昂贵房屋的价值,如图右侧许多正的残差值所示。

[In ]:
test.with_column('Residual', test_prices-test.drop(0).apply(fit)).scatter(0, 'Residual')
plots.plot([0, 7e5], [0, 0]);
Scatterplot with 'Sale Price' on the x-axis and 'Residual' on the y-axis. There is a horizontal light blue line drawn at y=0. The data points below x=300000 generally are spread evenly above and below the light blue line. However, above x=300000, there are more data points above the line than below it. By x=500000, all data points are well above the line.

与简单线性回归一样,解释预测变量的结果至少与进行预测同样重要。关于解释多元回归的许多内容本教科书并未涵盖。完成本教材后,自然的下一步是更深入地研究线性建模和回归。

用于回归的最近邻

预测房屋销售价格的另一种方法是使用类似房屋的价格。这种“最近邻”方法与我们的分类器非常相似。为了加速计算,我们将只使用最初分析中与销售价格相关性最高的属性。

[In ]:
train_nn = train.select(0, 1, 2, 3, 4, 8)
test_nn = test.select(0, 1, 2, 3, 4, 8)
train_nn.show(3)
<IPython.core.display.HTML object>

最近邻的计算与最近邻分类器完全相同。在这种情况下,我们将从距离计算中排除 'SalePrice' 列,而不是 'Class' 列。第一个测试行的五个最近邻如下所示。

[In ]:
def distance(pt1, pt2):
    """The distance between two points, represented as arrays."""
    return np.sqrt(sum((pt1 - pt2) ** 2))

def row_distance(row1, row2):
    """The distance between two rows of a table."""
    return distance(np.array(row1), np.array(row2))

def distances(training, example, output):
    """Compute the distance from example for each row in training."""
    dists = []
    attributes = training.drop(output)
    for row in attributes.rows:
        dists.append(row_distance(row, example))
    return training.with_column('Distance', dists)

def closest(training, example, k, output):
    """Return a table of the k closest neighbors to example."""
    return distances(training, example, output).sort('Distance').take(np.arange(k))

example_nn_row = test_nn.drop(0).row(0)
closest(train_nn, example_nn_row, 5, 'SalePrice')
SalePrice | 1st Flr SF | 2nd Flr SF | Total Bsmt SF | Garage Area | Year Built | Distance
137000    | 1176       | 0          | 1158          | 303         | 1958       | 55.0182
146000    | 1150       | 0          | 1150          | 288         | 1961       | 63.6475
126175    | 1163       | 0          | 1162          | 220         | 1955       | 68.1909
157900    | 1188       | 0          | 1188          | 312         | 1962       | 73.9865
150000    | 1144       | 0          | 1169          | 286         | 1956       | 75.1332

预测价格的一种简单方法是取最近邻价格的平均值。

[In ]:
def predict_nn(example):
    """Return the majority class among the k nearest neighbors."""
    return np.average(closest(train_nn, example, 5, 'SalePrice').column('SalePrice'))

predict_nn(example_nn_row)
143415.0

最后,我们可以检查我们的预测是否接近一个测试样本的真实销售价格。看起来合理!

[In ]:
print('Actual sale price:', test_nn.column('SalePrice').item(0))
print('Predicted sale price using nearest neighbors:', predict_nn(example_nn_row))
Actual sale price: 147000
Predicted sale price using nearest neighbors: 143415.0

评估

为了评估这种方法在整个测试集上的表现,我们将 predict_nn 应用于每个测试样本,然后计算预测的均方根误差。预测的计算可能需要几分钟。

[In ]:
nn_test_predictions = test_nn.drop('SalePrice').apply(predict_nn)
rmse_nn = np.mean((test_prices - nn_test_predictions) ** 2) ** 0.5

print('Test set RMSE for multiple linear regression: ', rmse_linear)
print('Test set RMSE for nearest neighbor regression:', rmse_nn)
Test set RMSE for multiple linear regression:  33025.064938240575
Test set RMSE for nearest neighbor regression: 36067.116043510105

对于这些数据,两种技术的误差相当相似!对于不同的数据集,一种技术可能优于另一种。通过在相同数据上计算两种技术的 RMSE,我们可以公正地比较方法。一个注意事项:性能差异可能根本不是由于技术本身造成的;可能最初是由于训练集和测试集抽样造成的随机变异。

最后,我们可以为这些预测绘制残差图。我们仍然低估了最昂贵房屋的价格,但偏差似乎不那么系统化。然而,非常接近零的残差较少,表明以非常高准确率预测的价格较少。

[In ]:
test.with_column('Residual', test_prices-nn_test_predictions).scatter(0, 'Residual')
plots.plot([0, 7e5], [0, 0]);
Scatterplot with 'Sale Price' on the x-axis and 'Residual' on the y-axis. The data points have a positive association, but most exist within a blob on the left hand side of the graph. A horizontal light blue line is drawn at y=0. The right hand side of the graph has more data points above the line than below it.