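Properties of the Mean

In this course, we have used the words "average" and "mean" interchangeably, and will continue to do so. The definition of the mean will be familiar to you from high school or even earlier.

Definition. The average or mean of a collection of numbers is the sum of all the elements of the collection, divided by the number of elements in the collection. The functions np.average and np.mean return the mean of an array.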

[In ]:

from datascience import *
%matplotlib inline
path_data = '../../../assets/data/'
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import pylab as pl
import numpy as np

[In ]:

not_symmetric = make_array(2, 3, 3, 9)

[In ]:

np.average(not_symmetric)

4.25

[In ]:

np.mean(not_symmetric)

4.25
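As a quick check of the definition, the same number can be computed by hand. This is a minimal sketch in plain Python (no datascience or numpy needed); the names values and mean_by_definition are illustrative and not part of the original notebook.

```python
# The mean is the sum of the elements divided by the number of elements.
values = [2, 3, 3, 9]                      # same entries as not_symmetric
mean_by_definition = sum(values) / len(values)
print(mean_by_definition)                  # 4.25
```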

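Basic Properties

The definition and the example above point to some properties of the mean.

It need not be an element of the collection.
It need not be an integer even if all the elements of the collection are integers.
It is somewhere between the smallest and largest values in the collection.
It need not be halfway between the two extremes; in general, it is not true that half the elements in a collection are above the mean.
If the collection consists of values of a variable measured in specified units, then the mean has the same units too.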
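The first four properties can be verified on the collection {2, 3, 3, 9}. The sketch below uses plain Python lists instead of the notebook's arrays, and all variable names are assumptions made for illustration.

```python
values = [2, 3, 3, 9]
mean = sum(values) / len(values)                            # 4.25

in_collection = mean in values                              # False: 4.25 is not an element
is_whole_number = mean == int(mean)                         # False: all entries are integers, the mean is not
within_range = min(values) <= mean <= max(values)           # True: 2 <= 4.25 <= 9
midpoint = (min(values) + max(values)) / 2                  # 5.5, not equal to the mean
share_above = sum(v > mean for v in values) / len(values)   # 0.25, not one half

print(in_collection, is_whole_number, within_range, midpoint, share_above)
```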

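We will now study some other properties that are helpful in understanding the mean and its relation to other statistics.

The Mean is a "Smoother"

You can think of taking the mean as an "equalizing" or "smoothing" operation. For example, imagine the entries in not_symmetric above as the dollars in the pockets of four different people. To get the mean, you first put all of the money into one big pot and then divide it evenly among the four people. They had started out with different amounts of money in their pockets (\$2, \$3, \$3, and \$9), but now each person has \$4.25, the mean amount.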
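The pot-and-share picture can be written out in a few lines. This is only an illustrative sketch; the names pockets, pot, and equal_share are hypothetical. The point is that smoothing changes who holds the money but not the total.

```python
pockets = [2, 3, 3, 9]                     # dollars each person starts with
pot = sum(pockets)                         # pool all the money: 17
equal_share = pot / len(pockets)           # divide evenly: 4.25 each
after = [equal_share] * len(pockets)       # everyone now holds the mean amount

print(after)                               # [4.25, 4.25, 4.25, 4.25]
print(sum(after) == pot)                   # True: the total is unchanged
```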

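Proportions are Means

If a collection consists only of ones and zeroes, then the sum of the collection is the number of ones in it, and the mean of the collection is the proportion of ones.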

[In ]:

zero_one = make_array(1, 1, 1, 0)
sum(zero_one)

3

[In ]:

np.mean(zero_one)

0.75

You can replace 1 by the Boolean True and 0 by False:

[In ]:

np.mean(make_array(True, True, True, False))

0.75

Because proportions are a special case of means, results about random sample means apply to random sample proportions as well.
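In practice the Booleans usually come from a comparison. As a small sketch in plain Python (the names values, is_three, and proportion_of_threes are illustrative), the proportion of entries equal to 3 is the mean of the True/False results, because True counts as 1 and False as 0 when summed.

```python
values = [2, 3, 3, 9]
is_three = [v == 3 for v in values]                    # [False, True, True, False]
proportion_of_threes = sum(is_three) / len(is_three)   # mean of the Booleans

print(proportion_of_threes)                            # 0.5
```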

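The Mean and the Histogram

The mean of the collection {2, 3, 3, 9} is 4.25, which is not the "halfway point" of the data. So then what does the mean measure?

To see this, notice that the mean can be calculated in different ways.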

$$\begin{align}
\mbox{mean} ~ &=~ 4.25 \\ \\
&=~ \frac{2 + 3 + 3 + 9}{4} \\ \\
&=~ 2 \cdot \frac{1}{4} ~~ + ~~ 3 \cdot \frac{1}{4} ~~ + ~~ 3 \cdot \frac{1}{4} ~~ + ~~ 9 \cdot \frac{1}{4} \\ \\
&=~ 2 \cdot \frac{1}{4} ~~ + ~~ 3 \cdot \frac{2}{4} ~~ + ~~ 9 \cdot \frac{1}{4} \\ \\
&=~ 2 \cdot 0.25 ~~ + ~~ 3 \cdot 0.5 ~~ + ~~ 9 \cdot 0.25
\end{align}$$

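The last expression is an example of a general fact: when we calculate the mean, each distinct value in the collection is weighted by the proportion of times it appears in the collection.

This has an important consequence. The mean of a collection depends only on the distinct values and their proportions, not on the number of elements in the collection. In other words, the mean of a collection depends only on the distribution of values in the collection.

Therefore, if two collections have the same distribution, then they have the same mean.

For example, here is another collection that has the same distribution as not_symmetric and hence the same mean.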

[In ]:

not_symmetric

array([2, 3, 3, 9])

[In ]:

same_distribution = make_array(2, 2, 3, 3, 3, 3, 9, 9)
np.mean(same_distribution)

4.25
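The weighting by proportions can be made explicit. The following is a minimal sketch using only the Python standard library; the helper names distribution and mean_from_distribution are hypothetical and are not functions of the datascience library. Exact fractions are used so that the two distributions can be compared without rounding concerns.

```python
from collections import Counter
from fractions import Fraction

def distribution(values):
    """Map each distinct value to the proportion of entries equal to it."""
    counts = Counter(values)
    n = len(values)
    return {value: Fraction(count, n) for value, count in counts.items()}

def mean_from_distribution(dist):
    """Sum of the distinct values, each weighted by its proportion."""
    return sum(value * proportion for value, proportion in dist.items())

small = [2, 3, 3, 9]
large = [2, 2, 3, 3, 3, 3, 9, 9]

same = distribution(small) == distribution(large)            # True
mean_small = mean_from_distribution(distribution(small))     # Fraction(17, 4)
mean_large = mean_from_distribution(distribution(large))     # Fraction(17, 4)

print(same, float(mean_small), float(mean_large))            # True 4.25 4.25
```

The two collections differ in size, but the helper sees only values and proportions, so it returns the same mean for both.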

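The mean is a physical attribute of the histogram of the distribution. Here is the histogram of the distribution of not_symmetric or equivalently the distribution of same_distribution.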

[In ]:

t1 = Table().with_columns('not symmetric', not_symmetric)
t1.hist(bins=np.arange(1.5, 9.6, 1))

Histogram with 'not symmetric' on the x-axis and 'Percent per unit' on the y-axis. Three bars are visible. The first is centered at 2 with a height of about 25. The second is centered at 3 with a height of about 50. The last bar is centered at 9 with a height of about 25.

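Imagine the histogram as a figure made out of cardboard attached to a wire that runs along the horizontal axis, and imagine the bars as weights attached at the values 2, 3, and 9. Suppose you try to balance this figure on a point on the wire. If the point is near 2, the figure will tip over to the right. If the point is near 9, the figure will tip over to the left. Somewhere in between is the point where the figure will balance; that point is 4.25, the mean.

The mean is the center of gravity or balance point of the histogram.

To understand why that is, it helps to know some physics. The center of gravity is calculated exactly as we calculated the mean, by using the distinct values weighted by their proportions.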
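The balancing argument can be checked numerically. In this sketch, each bar's turning effect about a pivot is taken to be its proportion times its signed distance from the pivot; the function name net_torque and the sign convention (positive means the figure tips to the right) are assumptions made for illustration.

```python
def net_torque(dist, pivot):
    """Sum of (distance from pivot) * weight; zero means the figure balances."""
    return sum((value - pivot) * weight for value, weight in dist.items())

# Distinct values of not_symmetric with their proportions.
dist = {2: 0.25, 3: 0.5, 9: 0.25}

print(net_torque(dist, 2))       # 2.25  -> positive, tips to the right
print(net_torque(dist, 9))       # -4.75 -> negative, tips to the left
print(net_torque(dist, 4.25))    # 0.0   -> balances at the mean
```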

Because the mean is a balance point, it is sometimes displayed as a fulcrum or triangle at the base of the histogram.

[In ]:

mean_ns = np.mean(not_symmetric)
t1.hist(bins=np.arange(1.5, 9.6, 1))
plots.scatter(mean_ns, -0.009, marker='^', color='darkblue', s=60)
plots.plot([1.5, 9.5], [0, 0], color='grey')
plots.ylim(-0.05, 0.5);

The same histogram as above has been reproduced. Along the bottom of the graph a gray line is shown with a triangle underneath just to the right of x=4.

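The Mean and the Median

If a student's score on a test is below average, does that imply that the student is in the bottom half of the class on that test?

Happily for the student, the answer is "Not necessarily." The reason has to do with the relation between the average, which is the balance point of the histogram, and the median, which is the "halfway point" of the data.

The relationship is easy to see in a simple example. Here is a histogram of the collection {2, 3, 3, 4}, which is in the array symmetric. The distribution is symmetric about 3. The mean and the median are both equal to 3.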

[In ]:

symmetric = make_array(2, 3, 3, 4)

[In ]:

t2 = Table().with_columns('symmetric', symmetric)
mean_s = np.mean(symmetric)

t2.hist(bins=np.arange(1.5, 4.6, 1))
plots.scatter(mean_s, -0.009, marker='^', color='darkblue', s=60)
plots.xlim(1, 10)
plots.ylim(-0.05, 0.5);

Histogram with 'symmetric' on the x-axis and 'Percent per unit' on the y-axis. The x-axis extends from 2 to 10. Three bars are present. Two bars have height of about 25, one is centered at x=2 and another is centered at x=4. A third bar at x=3 with height of about 50. A triangle is shown below the histogram at x=3.

[In ]:

np.mean(symmetric)

3.0

[In ]:

percentile(50, symmetric)

3

In general, for symmetric distributions, the mean and the median are equal.

What if the distribution is not symmetric? Let's compare symmetric and not_symmetric.

[In ]:

t3 = t2.with_column(
        'not_symmetric', not_symmetric
)

t3.hist(bins=np.arange(1.5, 9.6, 1))
plots.scatter(mean_s, -0.009, marker='^', color='darkblue', s=60)
plots.scatter(mean_ns, -0.009, marker='^', color='gold', s=60)
plots.ylim(-0.05, 0.5);

Histogram with no x-axis label and 'Percent per unit' on the y-axis. Two histograms are shown, 'symmetric' in dark blue and 'not_symmetric' in gold. The histogram shapes are the same as previously produced. The symmetric distribution is bell shaped centered around 3. The not symmetric distribution has the same first two bars as the symmetric distribution, but its third bar has been shifted up and centered around x=9.

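The blue histogram represents the original symmetric distribution. The gold histogram of not_symmetric starts out the same as the blue at the left end, but its rightmost bar has slid over to the value 9. The brown part is where the two histograms overlap.

The median and mean of the blue distribution are both equal to 3. The median of the gold distribution is also equal to 3, though the right half is distributed differently from the left.

But the mean of the gold distribution is not 3: the gold histogram would not balance at 3. The balance point has shifted to the right, to 4.25.

In the gold distribution, 3 out of 4 entries (75%) are below average. The student with a below-average score can therefore take heart. He or she might be in the majority of the class.

In general, if the histogram has a tail on one side (the formal term is "skewed"), then the mean is pulled away from the median in the direction of the tail.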
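The comparison of the two collections can be summarized in one place. This sketch uses statistics.median from the Python standard library rather than the notebook's percentile function; the two can differ in general (statistics.median averages the two middle values of an even-sized collection), but they agree on these two collections. The helper name summarize is hypothetical.

```python
import statistics

def summarize(values):
    """Return (mean, median, share of entries strictly below the mean)."""
    mean = sum(values) / len(values)
    median = statistics.median(values)
    share_below_mean = sum(v < mean for v in values) / len(values)
    return mean, median, share_below_mean

symmetric_summary = summarize([2, 3, 3, 4])
not_symmetric_summary = summarize([2, 3, 3, 9])

print(symmetric_summary)        # (3.0, 3.0, 0.25)
print(not_symmetric_summary)    # (4.25, 3.0, 0.75)
```

Sliding the rightmost value from 4 to 9 leaves the median at 3 but moves the mean to 4.25, so that three of the four entries now sit below the mean.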

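Example

The table sf2015 contains salary and benefits data for San Francisco City employees in 2015. As before, we will restrict our analysis to those who had the equivalent of at least half-time employment for the year.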

[In ]:

sf2015 = Table.read_table(path_data + 'san_francisco_2015.csv').where('Salaries', are.above(10000))

As we saw earlier, the highest compensation was above \$600,000 but the vast majority of employees had compensations below \$300,000.

[In ]:

sf2015.select('Total Compensation').hist(bins = np.arange(10000, 700000, 25000))

Histogram with 'Total Compensation' on the x-axis and 'Percent per unit' on the y-axis. The Histogram has its tallest bars around 100000, after that the bars decrease in height dramatically until about 300000, and the histogram extends out to 700000. The three bars before 100000 are variable, but just above about half the height of the tallest bars.

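This histogram is skewed to the right; it has a right-hand tail.

The mean gets pulled away from the median in the direction of the tail. So we expect the mean compensation to be larger than the median, and that is indeed the case.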

[In ]:

compensation = sf2015.column('Total Compensation')
percentile(50, compensation)

110305.79

[In ]:

np.mean(compensation)

114725.98411824222

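Distributions of incomes of large populations tend to be right skewed. When the bulk of a population has middle to low incomes, but a very small proportion has very high incomes, the histogram has a long, thin tail to the right.

The mean income is affected by this tail: the farther the tail stretches to the right, the larger the mean becomes. But the median is not affected by values at the extremes of the distribution. That is why economists often summarize income distributions by the median instead of the mean.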
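The effect of stretching the tail can be seen on a toy example. The incomes below are hypothetical numbers (in thousands of dollars) invented for illustration, not taken from the sf2015 table. Only the largest income changes from one row to the next.

```python
import statistics

base_incomes = [30, 40, 50, 60]            # hypothetical incomes, in thousands
results = []
for top in (70, 700, 7000):                # stretch the right tail farther each time
    incomes = base_incomes + [top]
    median = statistics.median(incomes)
    mean = sum(incomes) / len(incomes)
    results.append((top, median, mean))
    print(top, median, mean)

# 70 50 50.0
# 700 50 176.0
# 7000 50 1436.0
```

The median stays at 50 throughout, while the mean follows the largest value.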