7. Statistical Functions of the pandas Series

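pandas is a Python library commonly used for large-scale data processing and analysis, work that is at its core mathematical statistics. This chapter therefore takes a brief look at some of pandas' statistical functions, using Series as the example.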

7.1 The sum function

The sum function returns the sum of the values in a Series. NaN values are skipped by default. $$ s = \sum_{i = 1}^{n}x_i $$

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, None, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.sum())
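
The sample data above contains a None, which pandas stores as NaN. The following is a minimal sketch of how sum treats that value, assuming the same Series; the variable names total_skip, total_keep and manual are introduced here only for illustration.

```python
import pandas as pd

idx = "hello the cruel world".split()
val = [1000, 201, None, 104]
t = pd.Series(val, index=idx)  # None becomes NaN, so the dtype is float64

total_skip = t.sum()              # NaN is skipped by default (skipna=True)
total_keep = t.sum(skipna=False)  # NaN propagates into the result
manual = t.dropna().sum()         # dropping NaN first gives the same total

print(total_skip)
print(total_keep)
print(manual)
```

With skipna=False the result is NaN, which can be useful when a missing value should invalidate the whole total.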

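7.2 The mean function

The mean function returns the mean $\mu$. Take care when the values contain NaN: by default skipna=True, so NaN entries are skipped and the mean is taken over the remaining values only. With skipna=False the NaN entry takes part in the computation and the result is itself NaN. In practice NaN values are usually ignored. $$ \mu = \frac{1}{n}\sum_{i = 1}^{n} x_i $$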

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, None, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.mean())
print(t.mean(skipna=False))
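
To make the effect of skipna concrete, the sketch below recomputes the mean by hand over the non-NaN values only. It assumes the same sample data; n_valid, mean_manual, mean_default and mean_strict are illustrative names.

```python
import pandas as pd

t = pd.Series([1000, 201, None, 104], index="hello the cruel world".split())

n_valid = t.count()                 # number of non-NaN values
mean_manual = t.sum() / n_valid     # sum over valid values divided by their count
mean_default = t.mean()             # skipna=True by default
mean_strict = t.mean(skipna=False)  # NaN, because one value is missing

print(n_valid, mean_manual, mean_default, mean_strict)
```

Note that n in the formula is then the number of valid values (3 here), not the length of the Series.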

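7.3 The quantile function

A quantile is a concept from statistics: the q-quantile is the value below which a fraction q of the data falls. Called without an argument, quantile uses q=0.5, i.e. the median.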

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.quantile())
print(t.quantile(0.5))
print(t.quantile(0.25))
print(t.quantile(0.75))
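
By default pandas interpolates linearly between neighbouring sorted values. The sketch below reproduces that rule by hand and compares it with quantile; linear_quantile is a hypothetical helper written for this illustration, not part of pandas.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Hypothetical helper: quantile by linear interpolation between sorted values."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)   # fractional position in the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return xs[lo] + frac * (xs[hi] - xs[lo])

val = [1000, 201, 333, 104]
t = pd.Series(val, index="hello the cruel world".split())

for q in (0.25, 0.5, 0.75):
    print(q, linear_quantile(val, q), t.quantile(q))
```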

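7.4 The describe function

describe returns a set of summary statistics: count, mean, standard deviation, minimum, the 25%, 50% and 75% quantiles, and maximum.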

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.describe())

Program output:

count       4.000000
mean      409.500000
std       404.699477
min       104.000000
25%       176.750000
50%       267.000000
75%       499.750000
max      1000.000000
dtype: float64
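
The result of describe is itself a Series indexed by the statistic names, so single entries can be looked up by label. The sketch below, under the assumption of the same sample data, checks a few entries against the individual functions; d is an illustrative name.

```python
import pandas as pd

t = pd.Series([1000, 201, 333, 104], index="hello the cruel world".split())
d = t.describe()

# Each entry corresponds to one of the functions covered in this chapter
print(d["count"], t.count())
print(d["mean"], t.mean())
print(d["std"], t.std())
print(d["25%"], t.quantile(0.25))
print(d["max"], t.max())
```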

7.5 The max and idxmax functions

max returns the largest value in the Series, while idxmax returns the index label of that value.

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.max())
print(t.idxmax())

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
1000
hello

The min and idxmin functions work the same way for the smallest value.
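
A short sketch of min and idxmin on the same data follows. It also shows that the label returned by idxmax or idxmin can be used to look the value up again.

```python
import pandas as pd

t = pd.Series([1000, 201, 333, 104], index="hello the cruel world".split())

smallest = t.min()         # smallest value
smallest_at = t.idxmin()   # index label of the smallest value

print(smallest)
print(smallest_at)
print(t[t.idxmax()] == t.max())  # looking up the label gives the value back
```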

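7.6 Variance-related functions in statistics

  • The var function computes the variance. Variance measures how far the values spread around their expected value (the mean); in a modeling context it reflects the error between each model output and the expected output, i.e. the stability of the model. For a pandas Series it is computed with var.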
import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.var(), "\t<- var")

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
163781.66666666666  <- var

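The variance is computed as: $$ \delta^2 = \frac{1}{n - 1}\sum_{i = 1}^{n} (x_i - \mu)^2 $$ Here $\mu$ is the mean, which can be obtained with the mean function. We can therefore check in Python that var agrees with this formula: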

import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val =  [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.var(), "\t<- var")
x = val
mu = t.mean()
# squared deviations from the mean
y = [np.square(v - mu) for v in x]
# divide by n - 1 = 3 (sample variance)
print(np.sum(y) / 3)

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
163781.66666666666  <- var
163781.66666666666

Both computations give $\delta^2$ = 163781.66666666666.
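
The division by n - 1 corresponds to the ddof (delta degrees of freedom) parameter of var, which defaults to 1 in pandas. The sketch below contrasts it with the population variance (division by n); the names sample_var, population_var and numpy_var are illustrative. Note that NumPy's np.var defaults to ddof=0, so the two libraries differ unless ddof is set explicitly.

```python
import numpy as np
import pandas as pd

val = [1000, 201, 333, 104]
t = pd.Series(val, index="hello the cruel world".split())

sample_var = t.var()             # ddof=1: divide by n - 1
population_var = t.var(ddof=0)   # divide by n
numpy_var = np.var(val)          # NumPy default is ddof=0

print(sample_var)
print(population_var)
print(numpy_var)
```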

  • The std function computes the standard deviation, which is the square root of the variance, i.e. $\delta$.
import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val =  [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
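print(t, "<- t")
print(t.var(), "\t<- var")
x = val
mu = t.mean()
# squared deviations from the mean
y = [np.square(v - mu) for v in x]
# sample variance: divide by n - 1 = 3
delta2 = np.sum(y) / 3
print(delta2)
# standard deviation is the square root of the variance
print(np.sqrt(delta2))
print(t.std(), "\t<- std")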

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
163781.66666666666  <- var
163781.66666666666
404.6994769784941
404.6994769784941   <- std
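  • The mad function computes the mean absolute deviation, which measures the dispersion of the data as the average absolute distance of the samples from their mean. Series.mad was deprecated in pandas 1.5 and removed in pandas 2.0, so the call to t.mad() in the code only runs on older versions. $$ M_d = \frac{1}{n}\sum_{i = 1}^{n} |x_i - \mu| $$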
import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val =  [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
x =  val
mu = t.mean()
y = [np.abs(v - mu) for v in x]
md = np.sum(y) / 4
print(md)
# Series.mad() is only available in pandas versions before 2.0
print(t.mad(), "\t<- mad")

Program output:

295.25
295.25  <- mad
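
The formula for $M_d$ can also be written directly with Series operations, which does not depend on the mad method being available. This is a minimal sketch; mean_abs_deviation is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def mean_abs_deviation(s):
    """Hypothetical helper: average absolute distance of the values from their mean."""
    return (s - s.mean()).abs().mean()

t = pd.Series([1000, 201, 333, 104], index="hello the cruel world".split())
md = mean_abs_deviation(t)
print(md, "\t<- manual mad")
```
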
  • The cov function computes the covariance of two Series. $$ cov(x, y) = \frac{1}{n-1} \sum_{i = 1}^{n}(x_i - \mu_x)(y_i - \mu_y) $$
import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
x = pd.Series(val, index = idx)
van = [1100, 221, 303, 84]
y = pd.Series(van, index = idx)
xt =  val
mux = x.mean()
yt = van
muy = y.mean()
xx = [v - mux for v in xt]
yy = [v - muy for v in yt]
print(xx)
print(yy)
# dot product of the deviations, divided by n - 1 = 3
print(np.sum(np.array(xx).dot(np.array(yy))) / 3)
print(x.cov(y), "\t<- cov")

Program output:

[590.5, -208.5, -76.5, -305.5]
[673.0, -206.0, -124.0, -343.0]
184876.66666666666
184876.66666666666  <- cov

The covariance computed by hand with numpy agrees with the result of cov.
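
NumPy also offers np.cov, which returns the full covariance matrix and, like the formula above, divides by n - 1 by default. The sketch below compares its off-diagonal entry with Series.cov and checks that the covariance of a Series with itself is its variance; cov_matrix is an illustrative name.

```python
import numpy as np
import pandas as pd

idx = "hello the cruel world".split()
val = [1000, 201, 333, 104]
van = [1100, 221, 303, 84]
x = pd.Series(val, index=idx)
y = pd.Series(van, index=idx)

cov_matrix = np.cov(val, van)   # 2x2 matrix: [[var(x), cov], [cov, var(y)]]
print(cov_matrix[0, 1], "\t<- np.cov")
print(x.cov(y), "\t<- Series.cov")
print(x.cov(x), x.var())        # covariance with itself equals the variance
```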

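  • The corr function computes the Pearson correlation coefficient, which measures how closely two data sets fall on a straight line, i.e. the strength of the linear relationship between interval-scaled variables. $$ r = \frac{\sum_{i = 1}^{n}(x_i -\mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i = 1}^{n}(x_i - \mu_x)^2} \sqrt{\sum_{i = 1}^{n}(y_i - \mu_y)^2}} $$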
import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
x = pd.Series(val, index = idx)
van = [1100, 221, 303, 84]
y = pd.Series(van, index = idx)
xt =  val
mux = x.mean()
yt = van
muy = y.mean()
xx = [v - mux for v in xt]
yy = [v - muy for v in yt]
xx2 = [np.square(v - mux) for v in xt]
yy2 = [np.square(v - muy) for v in yt]
cov = np.sum(np.array(xx).dot(np.array(yy)))
muxy = np.sqrt(np.sum(xx2)) * np.sqrt(np.sum(yy2))
print(cov / muxy)
print(x.corr(y), "\t<- corr")

Program output:

0.998149178876946
0.9981491788769461  <- corr

Here the statement xx = [v - mux for v in xt] builds $x_i -\mu_x$, and the statement yy = [v - muy for v in yt] builds $y_i -\mu_y$; the statement cov = np.sum(np.array(xx).dot(np.array(yy))) then implements $\sum_{i = 1}^{n}(x_i -\mu_x)(y_i - \mu_y)$. The following statement xx2 = [np.square(v - mux) for v in xt] implements $(x_i - \mu_x)^2$, and yy2 = [np.square(v - muy) for v in yt] implements $(y_i - \mu_y)^2$.
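
Because the factors 1/(n - 1) cancel between numerator and denominator, the same coefficient can be obtained from the functions already introduced: the covariance divided by the product of the two standard deviations. The sketch below checks this, and also compares against np.corrcoef; r_from_cov and r_numpy are illustrative names.

```python
import numpy as np
import pandas as pd

idx = "hello the cruel world".split()
val = [1000, 201, 333, 104]
van = [1100, 221, 303, 84]
x = pd.Series(val, index=idx)
y = pd.Series(van, index=idx)

r_from_cov = x.cov(y) / (x.std() * y.std())  # cov(x, y) / (delta_x * delta_y)
r_numpy = np.corrcoef(val, van)[0, 1]        # off-diagonal entry of the correlation matrix

print(r_from_cov)
print(r_numpy)
print(x.corr(y), "\t<- corr")
```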

  • The skew (skewness) and kurt (kurtosis) functions. The population skewness is defined as $$ w =\frac{E (X - EX)^3}{\left( \text{Var}(X) \right)^{3/2}} $$

The population kurtosis is defined as: $$ k = \frac{E (X - EX)^4}{\left( \text{Var}(X) \right)^{2}} - 3 $$ that is, $$ k = \frac{\frac{1}{n}\sum_{i = 1}^{n}(x_i - \mu)^4}{\left(\frac{1}{n}\sum_{i = 1}^{n}(x_i - \mu)^2\right)^2} - 3 $$ The skewness and kurtosis computed from a sample $x_1, x_2, \dots, x_n$ are defined as (see (S. I. Inc. 2010) 328–329): $$ w = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^n \left( \frac{x_i - \overline x }{\delta} \right)^3 $$ $$ k = \frac{n(n + 1)}{(n - 1)(n - 2)(n -3)} \sum_{i=1}^n \left( \frac{ x_i - \overline x}{\delta} \right)^4 - \frac{3(n - 1)^2 }{(n - 2)(n - 3)} $$ where $\overline x$ is the sample mean and $\delta$ is the sample standard deviation.

1). Computing the kurtosis with kurt.

import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
x = pd.Series(val, index = idx)
n = 4
mu = x.mean()
delta = x.std()
xu = [np.power((v - mu), 4) for v in val]
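# sample kurtosis; the correction term divides by the whole product (n-2)*(n-3)
print((1.0 * n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)) * np.sum(xu) / delta ** 4 - 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)), "<-python")
print(x.kurt(), "<- kurt")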

Program output:

2.93023293658 <-python
2.93023293658 <- kurt

2). Computing the skewness with skew.

import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
x = pd.Series(val, index = idx)
n = 4
mu = x.mean()
delta = x.std()
xu = [np.power((v - mu), 3) for v in val]
print((1.0 * n) / ((n - 1) * (n - 2)) * np.sum(xu) / delta ** 3, "<- python")
print(x.skew(), "<- skew")

Program output:

1.68850911034 <- python
1.68850911034 <- skew
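
The sample formulas can also be reached from the population-style moment ratios by applying small-sample correction factors. The sketch below is one way to do this, using the algebraically equivalent forms $w = \frac{\sqrt{n(n-1)}}{n-2} g_1$ and $k = \frac{n-1}{(n-2)(n-3)}\left((n+1) g_2 + 6\right)$; the names m2, m3, m4, g1, g2, w and k are chosen here for illustration.

```python
import numpy as np
import pandas as pd

val = [1000, 201, 333, 104]
x = pd.Series(val, index="hello the cruel world".split())
n = len(val)

d = np.array(val, dtype=float) - x.mean()   # deviations from the mean
m2 = np.mean(d ** 2)                        # central moments with divisor n
m3 = np.mean(d ** 3)
m4 = np.mean(d ** 4)

g1 = m3 / m2 ** 1.5        # population-style skewness
g2 = m4 / m2 ** 2 - 3.0    # population-style excess kurtosis

# small-sample corrections, equivalent to the sample formulas above
w = np.sqrt(n * (n - 1)) / (n - 2) * g1
k = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(w, x.skew())
print(k, x.kurt())
```

Both corrections require n > 3 for the kurtosis and n > 2 for the skewness, since the denominators vanish otherwise.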

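7.7 Cumulative functions in statistics

The cumulative functions return a Series of the same length in which each entry aggregates all values up to and including that position: cumsum is the running sum, cumprod the running product, and cummin the running minimum.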

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")

print(t.cumsum(), "\t<- cumsum")
print(t.cumprod(), "\t<- cumprod")
print(t.cummin(), "\t<- cummin")

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
hello    1000
the      1201
cruel    1534
world    1638
dtype: int64    <- cumsum
hello          1000
the          201000
cruel      66933000
world    6961032000
dtype: int64    <- cumprod
hello    1000
the       201
cruel     201
world     104
dtype: int64    <- cummin
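
Alongside cummin there is also cummax, the running maximum. The sketch below shows it on the same data and checks that the last entry of each cumulative result equals the corresponding overall aggregate; running_max is an illustrative name.

```python
import pandas as pd

t = pd.Series([1000, 201, 333, 104], index="hello the cruel world".split())

running_max = t.cummax()   # largest value seen so far at each position
print(running_max, "\t<- cummax")

# The final entry of a cumulative result is the overall aggregate
print(t.cumsum().iloc[-1] == t.sum())
print(t.cumprod().iloc[-1] == t.prod())
print(t.cummin().iloc[-1] == t.min())
```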