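Rows of Tables

Now that we have a qualitative understanding of nearest neighbor classification, it's time to implement our classifier.

Until this chapter, we have worked mostly with single columns of tables. But now we have to decide whether one "individual" is "close" to another. Data for individuals are contained in the "rows" of tables.

So let's start by taking a closer look at rows.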

[In ]:

import matplotlib
#matplotlib.use('Agg')
path_data = '../../../assets/data/'
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import math
import scipy.stats as stats
plots.style.use('fivethirtyeight')

[In ]:

def standard_units(x):
    return (x - np.mean(x))/np.std(x)

Here is the original table ckd, containing data on patients who were tested for chronic kidney disease.

[In ]:

ckd = Table.read_table(path_data + 'ckd.csv').relabeled('Blood Glucose Random', 'Glucose')

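The data corresponding to the first patient is in row 0 of the table, consistent with Python's indexing system. The Table method row accesses a row by taking the index of the row as its argument: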

[In ]:

ckd.row(0)

Row(Age=48, Blood Pressure=70, Specific Gravity=1.005, Albumin=4, Sugar=0, Red Blood Cells='normal', Pus Cell='abnormal', Pus Cell clumps='present', Bacteria='notpresent', Glucose=117, Blood Urea=56, Serum Creatinine=3.8, Sodium=111, Potassium=2.5, Hemoglobin=11.2, Packed Cell Volume=32, White Blood Cell Count=6700, Red Blood Cell Count=3.9, Hypertension='yes', Diabetes Mellitus='no', Coronary Artery Disease='no', Appetite='poor', Pedal Edema='yes', Anemia='yes', Class=1)

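Rows have their very own data type: they are "row objects". Notice that the display shows not only the values in the row but also the labels of the corresponding columns.

Rows are in general not arrays, because their elements can be of different types. For example, some of the elements of the row above are strings (like 'abnormal') and some are numerical. So the row can't be converted into an array of numbers.

However, rows share some characteristics with arrays. You can use item to access a particular element of a row. For example, to access the Albumin level of Patient 0, we can look at the labels in the printout of the row above and find that it is item 3, counting from 0: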

[In ]:

ckd.row(0).item(3)

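Converting Rows to Arrays (When Possible)

Rows whose elements are all numerical (or all strings) can be converted to arrays. Converting a row to an array gives us access to arithmetic operations and other convenient NumPy functions, so it is often useful.

Recall that in the previous section we tried to classify patients as "CKD" or "not CKD" based on two attributes, Hemoglobin and Glucose, both measured in standard units.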

[In ]:

ckd = Table().with_columns(
    'Hemoglobin', standard_units(ckd.column('Hemoglobin')),
    'Glucose', standard_units(ckd.column('Glucose')),
    'Class', ckd.column('Class')
)

color_table = Table().with_columns(
    'Class', make_array(1, 0),
    'Color', make_array('darkblue', 'gold')
)
ckd = ckd.join('Class', color_table)
ckd

Class | Hemoglobin | Glucose     | Color
0     | 0.456884   | 0.133751    | gold
0     | 1.153      | -0.947597   | gold
0     | 0.770138   | -0.762223   | gold
0     | 0.596108   | -0.190654   | gold
0     | -0.239236  | -0.49961    | gold
0     | -0.0304002 | -0.159758   | gold
0     | 0.282854   | -0.00527964 | gold
0     | 0.108824   | -0.623193   | gold
0     | 0.0740178  | -0.515058   | gold
0     | 0.83975    | -0.422371   | gold
... (148 rows omitted)

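Here is a scatter plot of the two attributes, along with a red point corresponding to Alice, a new patient. Her hemoglobin value is 0 (that is, at the average) and her glucose value is 1.1 (that is, 1.1 standard deviations above the average).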

[In ]:

alice = make_array(0, 1.1)
ckd.scatter('Hemoglobin', 'Glucose', group='Color')
plots.scatter(alice.item(0), alice.item(1), color='red', s=30);

Scatterplot with 'Hemoglobin' on the x-axis and 'Glucose' on the y-axis. Data points are shown in dark blue or in gold. The dark blue data points exist all over the graph, but not where the gold data points do, from about x=0 to x=1.5 and y values between -1 and just above 0. A red data point exists at about (0, 1).

To find the distance between Alice's point and any of the other points, we only need the values of the attributes:

[In ]:

ckd_attributes = ckd.select('Hemoglobin', 'Glucose')

[In ]:

ckd_attributes

Hemoglobin | Glucose
0.456884   | 0.133751
1.153      | -0.947597
0.770138   | -0.762223
0.596108   | -0.190654
-0.239236  | -0.49961
-0.0304002 | -0.159758
0.282854   | -0.00527964
0.108824   | -0.623193
0.0740178  | -0.515058
0.83975    | -0.422371
... (148 rows omitted)

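Each row consists of the coordinates of one point in our training sample. Because the rows now consist only of numerical values, it is possible to convert them to arrays. For this, we use the function np.array, which converts any kind of sequential object, such as a row, to an array. (Our old friend make_array is for creating arrays, not for converting other kinds of sequences to arrays.)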

[In ]:

ckd_attributes.row(3)

Row(Hemoglobin=0.5961076648232668, Glucose=-0.19065363034327712)

[In ]:

np.array(ckd_attributes.row(3))

array([ 0.59610766, -0.19065363])

This is very handy because we can now use array operations on the data in each row.
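As a small illustration of what the conversion buys us, the sketch below uses a plain Python tuple as a stand-in for a numerical row. This is an assumption made so that the example runs without the ckd data, and the names row_like, point, shifted, and squared are hypothetical. It shows elementwise arithmetic that a tuple alone would not support.

```python
import numpy as np

# A plain tuple stands in for a numerical row (hypothetical stand-in;
# values are rounded from row 3 of ckd_attributes shown above).
row_like = (0.596108, -0.190654)

# np.array converts a sequence of numbers into an array.
point = np.array(row_like)

# Elementwise arithmetic now works on the whole point at once.
shifted = point - np.array([0, 1.1])  # coordinate differences from (0, 1.1)
squared = shifted ** 2                # square each coordinate difference
print(shifted)
print(squared)
```

Subtracting a tuple from a tuple raises an error in Python, which is why the conversion step matters before doing any distance arithmetic.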

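Distance Between Points When There Are Two Attributes

The main calculation we need to do is to find the distance between Alice's point and any other point. For this, the first thing we need is a way to compute the distance between any pair of points.

How do we do this? In 2-dimensional space, it's pretty easy. If we have a point at coordinates $(x_0,y_0)$ and another at $(x_1,y_1)$, the distance between them is

$$ D = \sqrt{(x_0-x_1)^2 + (y_0-y_1)^2} $$

(Where did this come from? It comes from the Pythagorean theorem: we have a right triangle with side lengths $x_0-x_1$ and $y_0-y_1$, and we want to find the length of the hypotenuse.)

In the next section we'll see that this formula has a straightforward extension when there are more than two attributes. For now, let's use the formula and array operations to find the distance between Alice and the patient in row 3.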

[In ]:

patient3 = np.array(ckd_attributes.row(3))
alice, patient3

(array([0. , 1.1]), array([ 0.59610766, -0.19065363]))

[In ]:

distance = np.sqrt(np.sum((alice - patient3)**2))
distance

1.421664918881847
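As a check by hand, substituting Alice's coordinates $(0, 1.1)$ and the coordinates of the patient in row 3, rounded to four decimal places, into the formula gives approximately the same value:

$$ D = \sqrt{(0 - 0.5961)^2 + (1.1 - (-0.1907))^2} = \sqrt{0.5961^2 + 1.2907^2} \approx \sqrt{0.3553 + 1.6659} \approx 1.4217 $$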

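We're going to need the distance between Alice and many points, so let's write a function called distance that computes the distance between any pair of points. The function will take two arrays, each containing the $(x, y)$ coordinates of a point. (Remember, those are really the Hemoglobin and Glucose levels of a patient.)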

[In ]:

def distance(point1, point2):
    """Returns the Euclidean distance between point1 and point2.
    
    Each argument is an array containing the coordinates of a point."""
    return np.sqrt(np.sum((point1 - point2)**2))

[In ]:

distance(alice, patient3)

1.421664918881847

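We have begun to build our classifier: the distance function is the first building block. Now let's work on the next piece.

Using `apply` on an Entire Row

Recall that if you want to apply a function to each element of a column of a table, one way to do that is by calling table_name.apply(function_name, column_label). This evaluates to an array consisting of the return values of the function when it is called on each element of the column. So each entry of the array is based on the corresponding row of the table.

If you use apply without specifying a column label, then the entire row is passed to the function. Let's see how this works on a very small table t containing the information about the first five patients in the training sample.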

[In ]:

t = ckd_attributes.take(np.arange(5))
t

Hemoglobin | Glucose
0.456884   | 0.133751
1.153      | -0.947597
0.770138   | -0.762223
0.596108   | -0.190654
-0.239236  | -0.49961

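Just as an example, suppose that for each patient we want to know how unusual their most unusual attribute is. Concretely, if a patient's hemoglobin level is farther from the average than her glucose level, we want to know how far the hemoglobin level is from the average. If her glucose level is farther from the average than her hemoglobin level, we want to know how far the glucose level is from the average instead.

That's the same as taking the maximum of the absolute values of the two quantities. To do this for a particular row, we can convert the row to an array and use array operations.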

[In ]:

def max_abs(row):
    return np.max(np.abs(np.array(row)))

[In ]:

max_abs(t.row(4))

0.4996102825918697

And now we can apply max_abs to each row of the table t:

[In ]:

t.apply(max_abs)

array([0.4568837 , 1.15300352, 0.77013762, 0.59610766, 0.49961028])
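As a mental model only, calling apply without a column label behaves roughly like a loop that hands each row to the function and collects the results in an array. The sketch below imitates that with NumPy alone. The helper apply_to_rows is hypothetical and is not how the datascience library is implemented, and tuples of rounded values stand in for the rows of t.

```python
import numpy as np

def max_abs(row):
    """Largest absolute value among the entries of a row."""
    return np.max(np.abs(np.array(row)))

def apply_to_rows(rows, function):
    """Hypothetical helper: call function on each row and
    collect the return values in an array."""
    return np.array([function(row) for row in rows])

# Rounded values of the five rows of t shown above;
# plain tuples stand in for row objects.
t_rows = [
    (0.456884, 0.133751),
    (1.153, -0.947597),
    (0.770138, -0.762223),
    (0.596108, -0.190654),
    (-0.239236, -0.49961),
]

results = apply_to_rows(t_rows, max_abs)
print(results)
```

Up to rounding, the entries agree with the array returned by t.apply(max_abs).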

This way of using apply will help us create the next building block of our classifier.

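Alice's $k$ Nearest Neighbors

If we want to classify Alice using a $k$-nearest neighbor classifier, we have to identify her $k$ nearest neighbors. What are the steps in this process? Suppose $k = 5$. Then the steps are:

- Step 1. Find the distance between Alice and each point in the training sample.
- Step 2. Sort the data table in increasing order of the distances.
- Step 3. Take the top 5 rows of the sorted table.

Steps 2 and 3 seem straightforward, provided we have the distances. So let's focus on Step 1.

Here's Alice: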

[In ]:

alice

array([0. , 1.1])

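What we need is a function that finds the distance between Alice and another point whose coordinates are contained in a row. The function distance returns the distance between any two points whose coordinates are in arrays. We can use it to define distance_from_alice, which takes a row as its argument and returns the distance between that row and Alice.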

[In ]:

def distance_from_alice(row):
    """Returns distance between Alice and a row of the attributes table"""
    return distance(alice, np.array(row))

[In ]:

distance_from_alice(ckd_attributes.row(3))

1.421664918881847

Now we can apply the function distance_from_alice to each row of ckd_attributes, and augment the table ckd with the distances. Step 1 is complete!

[In ]:

distances = ckd_attributes.apply(distance_from_alice)
ckd_with_distances = ckd.with_column('Distance from Alice', distances)

[In ]:

ckd_with_distances

Class | Hemoglobin | Glucose     | Color | Distance from Alice
0     | 0.456884   | 0.133751    | gold  | 1.06882
0     | 1.153      | -0.947597   | gold  | 2.34991
0     | 0.770138   | -0.762223   | gold  | 2.01519
0     | 0.596108   | -0.190654   | gold  | 1.42166
0     | -0.239236  | -0.49961    | gold  | 1.6174
0     | -0.0304002 | -0.159758   | gold  | 1.26012
0     | 0.282854   | -0.00527964 | gold  | 1.1409
0     | 0.108824   | -0.623193   | gold  | 1.72663
0     | 0.0740178  | -0.515058   | gold  | 1.61675
0     | 0.83975    | -0.422371   | gold  | 1.73862
... (148 rows omitted)

For Step 2, let's sort the table in increasing order of distance:

[In ]:

sorted_by_distance = ckd_with_distances.sort('Distance from Alice')
sorted_by_distance

Class | Hemoglobin | Glucose   | Color    | Distance from Alice
1     | 0.83975    | 1.2151    | darkblue | 0.847601
1     | -0.970162  | 1.27689   | darkblue | 0.986156
0     | -0.0304002 | 0.0874074 | gold     | 1.01305
0     | 0.14363    | 0.0874074 | gold     | 1.02273
1     | -0.413266  | 2.04928   | darkblue | 1.03534
0     | 0.387272   | 0.118303  | gold     | 1.05532
0     | 0.456884   | 0.133751  | gold     | 1.06882
0     | 0.178436   | 0.0410639 | gold     | 1.07386
0     | 0.00440582 | 0.025616  | gold     | 1.07439
0     | -0.169624  | 0.025616  | gold     | 1.08769
... (148 rows omitted)

Step 3: The top 5 rows correspond to Alice's 5 nearest neighbors; you can replace 5 by any other positive integer.

[In ]:

alice_5_nearest_neighbors = sorted_by_distance.take(np.arange(5))
alice_5_nearest_neighbors

Class | Hemoglobin | Glucose   | Color    | Distance from Alice
1     | 0.83975    | 1.2151    | darkblue | 0.847601
1     | -0.970162  | 1.27689   | darkblue | 0.986156
0     | -0.0304002 | 0.0874074 | gold     | 1.01305
0     | 0.14363    | 0.0874074 | gold     | 1.02273
1     | -0.413266  | 2.04928   | darkblue | 1.03534

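Three of Alice's five nearest neighbors are blue points and two are gold. So a 5-nearest neighbor classifier would classify Alice as blue: it would predict that Alice has chronic kidney disease.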
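The three steps and the final vote can also be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the classifier the chapter goes on to build: it uses only eight training points copied (rounded) from the tables above rather than the full training sample, it replaces apply and sort with array broadcasting and np.argsort, and names such as points, classes, and prediction are hypothetical.

```python
import numpy as np

alice = np.array([0, 1.1])

# Eight training points copied (rounded) from the tables above, in no
# particular order. Columns: Hemoglobin, Glucose (standard units).
points = np.array([
    [0.456884, 0.133751],
    [-0.413266, 2.04928],
    [0.14363, 0.0874074],
    [0.83975, 1.2151],
    [0.387272, 0.118303],
    [-0.0304002, 0.0874074],
    [1.153, -0.947597],
    [-0.970162, 1.27689],
])
classes = np.array([0, 1, 0, 1, 0, 0, 0, 1])  # 1 = CKD (blue), 0 = not CKD (gold)

k = 5
# Step 1: distance from Alice to every training point.
distances = np.sqrt(np.sum((points - alice) ** 2, axis=1))
# Step 2: indices that order the points by increasing distance.
order = np.argsort(distances)
# Step 3: keep the k closest points and look at their classes.
nearest_classes = classes[order[:k]]
# Vote: predict class 1 if more than half of the k neighbors are class 1.
prediction = int(np.sum(nearest_classes) > k / 2)
print(nearest_classes, prediction)
```

Choosing an odd value of k, as here, avoids ties in the vote between the two classes.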

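The graph below zooms in on Alice and her five nearest neighbors. The two gold ones are just inside the circle, directly below the red point. The classifier says Alice is more like the three blue ones around her.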

[In ]:

plots.figure(figsize=(8,8))
plots.scatter(ckd.column('Hemoglobin'), ckd.column('Glucose'), c=ckd.column('Color'), s=40)
#ckd.scatter('Hemoglobin', 'Glucose', group='Color')
plots.scatter(alice.item(0), alice.item(1), color='red', s=40)
radius = sorted_by_distance.column('Distance from Alice').item(4)+0.014
theta = np.arange(0, 2*np.pi+1, 2*np.pi/200)
plots.plot(radius*np.cos(theta)+alice.item(0), radius*np.sin(theta)+alice.item(1), color='g', lw=1.5);
plots.xlim(-2, 2.5)
plots.ylim(-2, 2.5);

The previous scatterplot is zoomed in to the area surrounding the red data point at roughly (0, 1). A green circle is drawn around this data point with the data point in the center. The green circle contains 3 dark blue data points and 2 gold data points.

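We are well on our way to implementing our $k$-nearest neighbor classifier. In the next two sections we will put this together and assess its accuracy.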
