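Cross-Classifying by More than One Variable

When individuals have multiple features, there are many different ways to classify them. For example, if we have a population of college students, and for each student we have recorded a major and the number of years in college, then the students could be classified by major, or by year, or by a combination of major and year.

The group method also allows us to classify individuals according to multiple variables. This is called cross-classifying.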

[In ]:

from datascience import *
path_data = '../../../assets/data/'
import numpy as np

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Two Variables: Counting the Number in Each Paired Category

The table more_cones records the flavor, color, and price of six ice cream cones.

[In ]:

more_cones = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'),
    'Color', make_array('pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'),
    'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)

more_cones

Flavor     | Color       | Price
strawberry | pink        | 3.55
chocolate  | light brown | 4.75
chocolate  | dark brown  | 5.25
strawberry | pink        | 5.25
chocolate  | dark brown  | 5.25
bubblegum  | pink        | 4.75

We know how to use group to count the number of cones of each flavor:

[In ]:

more_cones.group('Flavor')

Flavor     | count
bubblegum  | 1
chocolate  | 3
strawberry | 2

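But now each cone has a color as well. To classify the cones by both flavor and color, we will pass a list of labels as an argument to group. The resulting table has one row for every unique combination of values that appear together in the grouped columns. As before, a single argument (a list in this case, but an array would work too) gives row counts.

Although there are six cones, there are only four unique combinations of flavor and color. Two of the cones are dark brown chocolate, and two are pink strawberry.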

[In ]:

more_cones.group(['Flavor', 'Color'])

Flavor     | Color       | count
bubblegum  | pink        | 1
chocolate  | dark brown  | 2
chocolate  | light brown | 1
strawberry | pink        | 2
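The counting that group performs with a list of labels can be sketched in plain Python. This is only an illustration of the idea, not how the datascience library implements it; the variable names `cones` and `pair_counts` are assumptions introduced here.

```python
from collections import Counter

# The six cones from more_cones, written as (flavor, color) pairs.
cones = [
    ('strawberry', 'pink'),
    ('chocolate', 'light brown'),
    ('chocolate', 'dark brown'),
    ('strawberry', 'pink'),
    ('chocolate', 'dark brown'),
    ('bubblegum', 'pink'),
]

# Count how many cones share each (flavor, color) combination.
pair_counts = Counter(cones)

# One line per unique combination, sorted by flavor and then color.
for (flavor, color), count in sorted(pair_counts.items()):
    print(flavor, '|', color, '|', count)
```

Only the combinations that actually occur get an entry, which is why six cones produce four rows.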

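Two Variables: Finding a Characteristic of Each Paired Category

A second argument aggregates all other columns that are not in the list of grouped columns. In the cell below, sum adds up the Price values within each pair of flavor and color.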

[In ]:

more_cones.group(['Flavor', 'Color'], sum)

Flavor     | Color       | Price sum
bubblegum  | pink        | 4.75
chocolate  | dark brown  | 10.5
chocolate  | light brown | 4.75
strawberry | pink        | 8.8
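To see where the numbers in the Price sum column come from, here is a minimal plain-Python sketch of the same aggregation. It is an illustration under the assumption that each cone is a (flavor, color, price) tuple; the names `cones` and `price_sums` are hypothetical and do not come from the datascience library.

```python
from collections import defaultdict

# The six cones from more_cones, written as (flavor, color, price) tuples.
cones = [
    ('strawberry', 'pink', 3.55),
    ('chocolate', 'light brown', 4.75),
    ('chocolate', 'dark brown', 5.25),
    ('strawberry', 'pink', 5.25),
    ('chocolate', 'dark brown', 5.25),
    ('bubblegum', 'pink', 4.75),
]

# Add each price to the running total for its (flavor, color) pair.
price_sums = defaultdict(float)
for flavor, color, price in cones:
    price_sums[(flavor, color)] += price

for pair in sorted(price_sums):
    print(pair, round(price_sums[pair], 2))
```

For example, the two pink strawberry cones cost 3.55 and 5.25, so their cell holds 8.8.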

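Three or more variables. You can use group to classify rows by three or more categorical variables. Just include them all in the list that is the first argument. But cross-classifying by many variables can become complex, because the number of distinct combinations of categories can be quite large.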
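A short sketch of why the number of combinations grows quickly: the number of possible combinations is the product of the number of categories in each variable. The example below uses the categories of more_cones and, purely for illustration, treats each distinct price as a category.

```python
from itertools import product

# Distinct values of each variable in more_cones.
flavors = ['bubblegum', 'chocolate', 'strawberry']
colors = ['dark brown', 'light brown', 'pink']
prices = [3.55, 4.75, 5.25]  # treated as categories only for this illustration

# Every possible combination of two variables, then of three.
possible_pairs = list(product(flavors, colors))
possible_triples = list(product(flavors, colors, prices))

print(len(possible_pairs))    # 3 * 3 combinations
print(len(possible_triples))  # 3 * 3 * 3 combinations
```

Each additional variable multiplies the number of possible combinations by its number of categories, even though only some of those combinations may appear in the data.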

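Pivot Tables: Rearranging the Output of group

Many uses of cross-classification involve just two categorical variables, like Flavor and Color in the example above. In these cases it is possible to display the results of the classification in a different kind of table, called a pivot table. Pivot tables, also known as contingency tables, make it easier to work with data that have been classified according to two variables.

Recall the use of group to count the number of cones in each paired category of flavor and color: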

[In ]:

more_cones.group(['Flavor', 'Color'])

Flavor     | Color       | count
bubblegum  | pink        | 1
chocolate  | dark brown  | 2
chocolate  | light brown | 1
strawberry | pink        | 2

The same data can be displayed differently using the Table method pivot. Ignore the code for a moment, and just look at the resulting table.

[In ]:

more_cones.pivot('Flavor', 'Color')

Color       | bubblegum | chocolate | strawberry
dark brown  | 0         | 2         | 0
light brown | 0         | 1         | 0
pink        | 1         | 0         | 2

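Notice that this table displays all nine possible pairs of flavor and color, including pairs like "dark brown bubblegum" that don't exist in our data. Notice also that the count for each pair appears in the body of the table: to find the number of light brown chocolate cones, run your eye along the row light brown until it meets the column chocolate.

The group method takes a list of two labels because it is flexible: it could take one label, or three or more. On the other hand, pivot always takes two column labels, one to determine the columns and one to determine the rows.

pivot

The pivot method is closely related to the group method: it groups together rows that share a combination of values. It differs from group because it organizes the resulting values in a grid. The first argument to pivot is the label of a column that contains the values that will be used to form new columns in the result. The second argument is the label of a column used for the rows. The result gives the count of all rows of the original table that share the combination of column and row values.

Like group, pivot can be used with additional arguments to find a characteristic of each paired category. An optional third argument called values indicates a column of values that will replace the counts in each cell of the grid. Not all of these values will be displayed, however; the fourth argument collect indicates how to collect them all into one aggregated value to be displayed in the cell.

An example will help clarify this. Here is pivot being used to find the total price of the cones in each cell.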

[In ]:

more_cones.pivot('Flavor', 'Color', values='Price', collect=sum)

Color       | bubblegum | chocolate | strawberry
dark brown  | 0         | 10.5      | 0
light brown | 0         | 4.75      | 0
pink        | 4.75      | 0         | 8.8
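The grid arrangement can be sketched in plain Python. The helper `pivot_sketch` below is hypothetical: it imitates the behavior described above (counts by default, or values combined by collect, with 0 in empty cells) and is not the datascience implementation.

```python
def pivot_sketch(rows, col_label, row_label, values=None, collect=sum):
    """Hypothetical helper: build a grid keyed by (row value, column value)."""
    col_values = sorted({r[col_label] for r in rows})
    row_values = sorted({r[row_label] for r in rows})
    # Start every possible cell with an empty list, so absent pairs still appear.
    cells = {(rv, cv): [] for rv in row_values for cv in col_values}
    for r in rows:
        # With no values column, each row contributes 1, so sum gives a count.
        cells[(r[row_label], r[col_label])].append(1 if values is None else r[values])
    # Empty cells are shown as 0, matching the tables above.
    return {cell: (collect(vals) if vals else 0) for cell, vals in cells.items()}

# The six cones from more_cones, one dictionary per row.
cones = [
    {'Flavor': 'strawberry', 'Color': 'pink', 'Price': 3.55},
    {'Flavor': 'chocolate', 'Color': 'light brown', 'Price': 4.75},
    {'Flavor': 'chocolate', 'Color': 'dark brown', 'Price': 5.25},
    {'Flavor': 'strawberry', 'Color': 'pink', 'Price': 5.25},
    {'Flavor': 'chocolate', 'Color': 'dark brown', 'Price': 5.25},
    {'Flavor': 'bubblegum', 'Color': 'pink', 'Price': 4.75},
]

counts = pivot_sketch(cones, 'Flavor', 'Color')
price_totals = pivot_sketch(cones, 'Flavor', 'Color', values='Price', collect=sum)

print(counts[('pink', 'strawberry')])
print(price_totals[('dark brown', 'chocolate')])
```

The key difference from the grouping sketch is the first step: every row value is paired with every column value before any data is read, so all nine cells exist even when no cone falls in them.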

And here is group doing the same thing.

[In ]:

more_cones.group(['Flavor', 'Color'], sum)

Flavor     | Color       | Price sum
bubblegum  | pink        | 4.75
chocolate  | dark brown  | 10.5
chocolate  | light brown | 4.75
strawberry | pink        | 8.8

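Though the numbers in both tables are the same, the table produced by pivot is easier to read and lends itself more easily to analysis. The advantage of pivot is that it places grouped values into adjacent columns, so that they can be combined and compared.

Example: Education and Income of Californian Adults

The State of California's Open Data Portal is a rich source of information about the lives of Californians. It is our source of a dataset on educational attainment and personal income among Californians over the years 2008 to 2014. The data are derived from the U.S. Census Bureau's Current Population Survey.

For each year, the table records the Population Count of Californians in many different combinations of age, gender, educational attainment, and personal income. We will study only the data for the year 2014.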

[In ]:

full_table = Table.read_table(path_data + 'educ_inc.csv')
ca_2014 = full_table.where('Year', are.equal_to('1/1/14 0:00')).where('Age', are.not_equal_to('00 to 17'))
ca_2014

Year        | Age       | Gender | Educational Attainment         | Personal Income     | Population Count
1/1/14 0:00 | 18 to 64  | Female | No high school diploma         | H: 75,000 and over  | 2058
1/1/14 0:00 | 65 to 80+ | Male   | No high school diploma         | H: 75,000 and over  | 2153
1/1/14 0:00 | 65 to 80+ | Female | No high school diploma         | G: 50,000 to 74,999 | 4666
1/1/14 0:00 | 65 to 80+ | Female | High school or equivalent      | H: 75,000 and over  | 7122
1/1/14 0:00 | 65 to 80+ | Female | No high school diploma         | F: 35,000 to 49,999 | 7261
1/1/14 0:00 | 65 to 80+ | Male   | No high school diploma         | G: 50,000 to 74,999 | 8569
1/1/14 0:00 | 18 to 64  | Female | No high school diploma         | G: 50,000 to 74,999 | 14635
1/1/14 0:00 | 65 to 80+ | Male   | No high school diploma         | F: 35,000 to 49,999 | 15212
1/1/14 0:00 | 65 to 80+ | Male   | College, less than 4-yr degree | B: 5,000 to 9,999   | 15423
1/1/14 0:00 | 65 to 80+ | Female | Bachelor's degree or higher    | A: 0 to 4,999       | 15459
... (117 rows omitted)

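Each row of the table corresponds to a combination of age, gender, educational level, and income. There are 127 such combinations in all!

As a first step it is a good idea to start with just one or two variables. We will focus on just one pair: educational attainment and personal income.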

[In ]:

educ_inc = ca_2014.select('Educational Attainment', 'Personal Income', 'Population Count')
educ_inc

Educational Attainment         | Personal Income     | Population Count
No high school diploma         | H: 75,000 and over  | 2058
No high school diploma         | H: 75,000 and over  | 2153
No high school diploma         | G: 50,000 to 74,999 | 4666
High school or equivalent      | H: 75,000 and over  | 7122
No high school diploma         | F: 35,000 to 49,999 | 7261
No high school diploma         | G: 50,000 to 74,999 | 8569
No high school diploma         | G: 50,000 to 74,999 | 14635
No high school diploma         | F: 35,000 to 49,999 | 15212
College, less than 4-yr degree | B: 5,000 to 9,999   | 15423
Bachelor's degree or higher    | A: 0 to 4,999       | 15459
... (117 rows omitted)

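Let's start by looking at educational level alone. The categories of this variable have been subdivided by the different levels of income. So we will group the table by Educational Attainment and sum the Population Count in each category.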

[In ]:

education = educ_inc.select('Educational Attainment', 'Population Count')
educ_totals = education.group('Educational Attainment', sum)
educ_totals

Educational Attainment         | Population Count sum
Bachelor's degree or higher    | 8525698
College, less than 4-yr degree | 7775497
High school or equivalent      | 6294141
No high school diploma         | 4258277

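There are only four categories of educational attainment. The counts are so large that it is more helpful to look at percents. For this, we will use the function percents that we defined in an earlier section. It converts an array of numbers to an array of percents out of the total in the input array.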

[In ]:

def percents(array_x):
    return np.round( (array_x/sum(array_x))*100, 2)

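We now have the distribution of educational attainment among adult Californians. More than 30% have a Bachelor's degree or higher, while almost 16% lack a high school diploma.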

[In ]:

educ_distribution = educ_totals.with_column(
    'Population Percent', percents(educ_totals.column(1))
)
educ_distribution

Educational Attainment         | Population Count sum | Population Percent
Bachelor's degree or higher    | 8525698              | 31.75
College, less than 4-yr degree | 7775497              | 28.96
High school or equivalent      | 6294141              | 23.44
No high school diploma         | 4258277              | 15.86
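As a check on the arithmetic, the percents in the table above can be reproduced from the four counts in plain Python. This is a minimal sketch; the names `counts`, `total`, and `population_percent` are introduced here for illustration.

```python
# Population Count sum for each category, copied from the table above.
counts = {
    "Bachelor's degree or higher": 8525698,
    'College, less than 4-yr degree': 7775497,
    'High school or equivalent': 6294141,
    'No high school diploma': 4258277,
}

# Each percent is the category's count divided by the total, times 100.
total = sum(counts.values())
population_percent = {
    label: round(count / total * 100, 2) for label, count in counts.items()
}

for label, pct in population_percent.items():
    print(label, pct)
```

Because each percent is rounded to two decimal places separately, the four rounded values need not add up to exactly 100.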

By using pivot, we can get a contingency table (a table of counts) of adult Californians cross-classified by Educational Attainment and Personal Income.

[In ]:

totals = educ_inc.pivot('Educational Attainment', 'Personal Income', values='Population Count', collect=sum)
totals

Personal Income     | Bachelor's degree or higher | College, less than 4-yr degree | High school or equivalent | No high school diploma
A: 0 to 4,999       | 575491                      | 985011                         | 1161873                   | 1204529
B: 5,000 to 9,999   | 326020                      | 810641                         | 626499                    | 597039
C: 10,000 to 14,999 | 452449                      | 798596                         | 692661                    | 664607
D: 15,000 to 24,999 | 773684                      | 1345257                        | 1252377                   | 875498
E: 25,000 to 34,999 | 693884                      | 1091642                        | 929218                    | 464564
F: 35,000 to 49,999 | 1122791                     | 1112421                        | 782804                    | 260579
G: 50,000 to 74,999 | 1594681                     | 883826                         | 525517                    | 132516
H: 75,000 and over  | 2986698                     | 748103                         | 323192                    | 58945

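Here you see the power of pivot over other cross-classification methods. Each column of counts is a distribution of personal income at a specific level of educational attainment. Converting the counts to percents allows us to compare the four distributions.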

[In ]:

distributions = totals.select(0).with_columns(
    "Bachelor's degree or higher", percents(totals.column(1)),
    'College, less than 4-yr degree', percents(totals.column(2)),
    'High school or equivalent', percents(totals.column(3)),
    'No high school diploma', percents(totals.column(4))
)

distributions

Personal Income     | Bachelor's degree or higher | College, less than 4-yr degree | High school or equivalent | No high school diploma
A: 0 to 4,999       | 6.75                        | 12.67                          | 18.46                     | 28.29
B: 5,000 to 9,999   | 3.82                        | 10.43                          | 9.95                      | 14.02
C: 10,000 to 14,999 | 5.31                        | 10.27                          | 11                        | 15.61
D: 15,000 to 24,999 | 9.07                        | 17.3                           | 19.9                      | 20.56
E: 25,000 to 34,999 | 8.14                        | 14.04                          | 14.76                     | 10.91
F: 35,000 to 49,999 | 13.17                       | 14.31                          | 12.44                     | 6.12
G: 50,000 to 74,999 | 18.7                        | 11.37                          | 8.35                      | 3.11
H: 75,000 and over  | 35.03                       | 9.62                           | 5.13                      | 1.38
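The conversion is done separately within each column: every count is divided by the total of its own column, not by the grand total. Here is a plain-Python sketch using two of the four columns of counts shown earlier. The dictionary name `totals_by_education` and the list-based `percents` are assumptions made for this illustration; they are not the datascience table or the array-based function used in the cells above.

```python
# Counts for income brackets A through H, copied from two columns of the totals table.
totals_by_education = {
    "Bachelor's degree or higher": [575491, 326020, 452449, 773684,
                                    693884, 1122791, 1594681, 2986698],
    'No high school diploma': [1204529, 597039, 664607, 875498,
                               464564, 260579, 132516, 58945],
}

def percents(counts):
    """Convert a list of counts to percents of the list's own total."""
    total = sum(counts)
    return [round(c / total * 100, 2) for c in counts]

# Apply the same conversion to every column with one loop.
distributions = {
    label: percents(counts) for label, counts in totals_by_education.items()
}

for label, column in distributions.items():
    print(label, column)
```

Looping over the columns gives the same result as writing one line per column, and it avoids repeating the column labels by hand.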

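At a glance, you can see that over 35% of those with Bachelor's degrees or higher had incomes of $\$75,000$ and over, whereas fewer than 10% of the people in each of the other education categories had that level of income.

The bar chart below compares the personal income distributions of adult Californians who have no high school diploma with those who have completed a Bachelor's degree or higher. The difference between the distributions is striking. There is a clear positive association between educational attainment and personal income.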

[In ]:

distributions.select(0, 1, 4).barh(0)

A bar plot with 'Personal Income' on the y-axis. Each income category has two bars, one dark blue for 'Bachelor's degree or higher' and one gold for 'No high school diploma.' As the income categories increase, the gold bars get shorter and the dark blue bars get longer.
