Pandas Statistical Analysis
Basic statistical analysis of pandas data
These functions closely mirror their NumPy counterparts.
import pandas as pd
dates = pd.date_range('20130101', periods=10)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08', '2013-01-09', '2013-01-10'], dtype='datetime64[ns]', freq='D')
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df
| A | B | C | D |
2013-01-01 | -1.587560 | -0.198819 | 0.720054 | 1.921686 |
2013-01-02 | 0.296288 | 1.876570 | 0.338344 | 0.597835 |
2013-01-03 | -1.832852 | 0.752045 | 2.184984 | -0.157722 |
2013-01-04 | -0.650829 | 1.690322 | -1.145963 | -0.798702 |
2013-01-05 | -0.729986 | -0.494417 | 2.166254 | 1.131232 |
2013-01-06 | -1.759444 | -1.104058 | 0.462934 | 2.050315 |
2013-01-07 | 0.760111 | -1.753986 | 0.104831 | 1.075343 |
2013-01-08 | 0.096572 | 0.383660 | 0.604831 | 0.715224 |
2013-01-09 | 0.126292 | 1.025429 | 0.019330 | -0.417396 |
2013-01-10 | -0.179047 | 0.175366 | 0.826219 | -0.451984 |
df.describe()  # quick summary statistics
| A | B | C | D |
count | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
mean | -0.546045 | 0.235211 | 0.628182 | 0.566583 |
std | 0.923341 | 1.164277 | 0.985506 | 1.001821 |
min | -1.832852 | -1.753986 | -1.145963 | -0.798702 |
25% | -1.373167 | -0.420517 | 0.163209 | -0.352477 |
50% | -0.414938 | 0.279513 | 0.533883 | 0.656529 |
75% | 0.118862 | 0.957083 | 0.799678 | 1.117260 |
max | 0.760111 | 1.876570 | 2.184984 | 2.050315 |
df.mean()  # mean of each column
A -0.546045
B 0.235211
C 0.628182
D 0.566583
dtype: float64
df.mean(axis=1)  # mean of each row
2013-01-01 0.213840
2013-01-02 0.777259
2013-01-03 0.236614
2013-01-04 -0.226293
2013-01-05 0.518271
2013-01-06 -0.087563
2013-01-07 0.046575
2013-01-08 0.450072
2013-01-09 0.188414
2013-01-10 0.092638
Freq: D, dtype: float64
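To make the NumPy comparison concrete, here is a minimal sketch on a small hand-made frame (the frame and variable names are illustrative, not from the original notebook). The means match exactly, but the default standard deviations do not: pandas uses the sample form (ddof=1), while np.std defaults to the population form (ddof=0).

```python
import numpy as np
import pandas as pd

# Small fixed frame so the numbers are easy to check by hand.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 6.0, 8.0]})

col_means_pd = df.mean()                       # pandas: one mean per column
col_means_np = np.mean(df.to_numpy(), axis=0)  # NumPy: same reduction along axis 0

# pandas defaults to the sample standard deviation (ddof=1),
# np.std defaults to the population standard deviation (ddof=0).
std_pd = df.std()
std_np_default = np.std(df.to_numpy(), axis=0)
std_np_sample = np.std(df.to_numpy(), axis=0, ddof=1)

print(np.allclose(col_means_pd.to_numpy(), col_means_np))  # True
print(np.allclose(std_pd.to_numpy(), std_np_sample))       # True
print(np.allclose(std_pd.to_numpy(), std_np_default))      # False
```

Passing ddof=1 to np.std (or ddof=0 to df.std) brings the two into agreement.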
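Basic statistical functions
- .describe() summary statistics along axis 0 (one summary per column): count, mean, standard deviation, minimum, quartiles, and maximum
- .sum() total of the data; computed along axis 0 by default (down the rows, giving one result per column), and the same default holds for the functions below; pass axis=1 to compute across the columns instead (one result per row)
- .count() number of non-NaN values
- .mean() .median() .mode() arithmetic mean / median / mode
- .var() .std() variance / standard deviation
- .min() .max() minimum / maximum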
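The axis convention in the list above is easy to mix up, so here is a small sketch on an illustrative frame with one missing value (the column names x and y are made up for this example). It shows the default axis-0 reduction, the axis=1 variant, and how NaN is skipped by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, 30.0]})

col_sum = df.sum()           # axis=0 (default): one total per column, NaN skipped
row_sum = df.sum(axis=1)     # axis=1: one total per row
non_nan = df.count()         # number of non-NaN values in each column
col_median = df.median()     # median of each column, NaN skipped

print(col_sum.tolist())      # [4.0, 60.0]
print(row_sum.tolist())      # [11.0, 20.0, 33.0]
print(non_nan.tolist())      # [2, 3]
print(col_median.tolist())   # [2.0, 20.0]
```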
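Locating the minimum and maximum (.argmin()/.argmax() apply to Series only; .idxmin()/.idxmax() are also defined on DataFrame, where they return one label per column):
- .argmin(), .argmax() integer position of the minimum / maximum value (the automatic 0-based index; useful because positions are easy to use for slicing and similar operations)
- .idxmin(), .idxmax() index label of the minimum / maximum value (the custom index)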
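The difference between a position and a label is easiest to see side by side. A minimal sketch, assuming pandas 1.0 or later (where Series.argmin and Series.argmax return integer positions):

```python
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

pos_min = s.argmin()   # integer position of the smallest value: 3
lab_min = s.idxmin()   # index label of the smallest value: "d"
pos_max = s.argmax()   # integer position of the largest value: 0
lab_max = s.idxmax()   # index label of the largest value: "a"

# A position plugs straight into positional slicing with .iloc.
up_to_min = s.iloc[: pos_min + 1]
print(up_to_min.tolist())   # [9, 8, 7, 6]
```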
a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
a
a 9
b 8
c 7
d 6
dtype: int64
b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])
b
| 0 | 1 | 2 | 3 | 4 |
c | 0 | 1 | 2 | 3 | 4 |
a | 5 | 6 | 7 | 8 | 9 |
d | 10 | 11 | 12 | 13 | 14 |
b | 15 | 16 | 17 | 18 | 19 |
a.describe()
count 4.000000
mean 7.500000
std 1.290994
min 6.000000
25% 6.750000
50% 7.500000
75% 8.250000
max 9.000000
dtype: float64
type(a.describe())  # a Series object
pandas.core.series.Series
a.describe()['count']
4.0
b.describe()  # computed along axis 0 by default
| 0 | 1 | 2 | 3 | 4 |
count | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
mean | 7.500000 | 8.500000 | 9.500000 | 10.500000 | 11.500000 |
std | 6.454972 | 6.454972 | 6.454972 | 6.454972 | 6.454972 |
min | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 4.000000 |
25% | 3.750000 | 4.750000 | 5.750000 | 6.750000 | 7.750000 |
50% | 7.500000 | 8.500000 | 9.500000 | 10.500000 | 11.500000 |
75% | 11.250000 | 12.250000 | 13.250000 | 14.250000 | 15.250000 |
max | 15.000000 | 16.000000 | 17.000000 | 18.000000 | 19.000000 |
type(b.describe())  # a DataFrame object
pandas.core.frame.DataFrame
# returns one row as a Series
b.describe().loc['max']
0 15.0
1 16.0
2 17.0
3 18.0
4 19.0
Name: max, dtype: float64
b.describe().iloc[7]
0 15.0
1 16.0
2 17.0
3 18.0
4 19.0
Name: max, dtype: float64
# returns one column as a Series; here the column labeled 2
b.describe()[2]
count 4.000000
mean 9.500000
std 6.454972
min 2.000000
25% 5.750000
50% 9.500000
75% 13.250000
max 17.000000
Name: 2, dtype: float64
b.describe().loc[:,2]
count 4.000000
mean 9.500000
std 6.454972
min 2.000000
25% 5.750000
50% 9.500000
75% 13.250000
max 17.000000
Name: 2, dtype: float64
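Since the describe() result of a DataFrame is itself a DataFrame, a single statistic can be read by giving a row label and a column label together. A short sketch that rebuilds the same frame b as above (the variable name summary is just for illustration):

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
summary = b.describe()

mean_col2 = summary.loc["mean", 2]   # row label first, then column label
max_col4 = summary.at["max", 4]      # .at reads exactly one scalar

print(mean_col2, max_col4)           # 9.5 19.0
```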
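Cumulative statistics
- Apply a running computation over the first 1 to n elements of a sequence
- Reduces the need for explicit for loops
Cumulative functions, available on both Series and DataFrame:
- .cumsum() running sum of the first 1/2/…/n elements
- .cumprod() running product of the first 1/2/…/n elements
- .cummax() running maximum of the first 1/2/…/n elements
- .cummin() running minimum of the first 1/2/…/n elements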
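As a rough illustration of the "fewer for loops" point, the sketch below computes a running total by hand and compares it with .cumsum(); the sample values are arbitrary. The other three cumulative functions are shown on the same Series.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

# Explicit loop: keep a running total and record it at every step.
loop_result, total = [], 0
for value in s:
    total += value
    loop_result.append(total)

cum_sum = s.cumsum().tolist()    # [3, 4, 8, 9, 14]
cum_prod = s.cumprod().tolist()  # [3, 3, 12, 12, 60]
cum_max = s.cummax().tolist()    # [3, 3, 4, 4, 5]
cum_min = s.cummin().tolist()    # [3, 1, 1, 1, 1]

print(cum_sum == loop_result)    # True
```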
b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])
b
| 0 | 1 | 2 | 3 | 4 |
c | 0 | 1 | 2 | 3 | 4 |
a | 5 | 6 | 7 | 8 | 9 |
d | 10 | 11 | 12 | 13 | 14 |
b | 15 | 16 | 17 | 18 | 19 |
b.cumsum()  # cumulative sum down each column
| 0 | 1 | 2 | 3 | 4 |
c | 0 | 1 | 2 | 3 | 4 |
a | 5 | 7 | 9 | 11 | 13 |
d | 15 | 18 | 21 | 24 | 27 |
b | 30 | 34 | 38 | 42 | 46 |
b.cumprod()  # cumulative product down each column
| 0 | 1 | 2 | 3 | 4 |
c | 0 | 1 | 2 | 3 | 4 |
a | 0 | 6 | 14 | 24 | 36 |
d | 0 | 66 | 168 | 312 | 504 |
b | 0 | 1056 | 2856 | 5616 | 9576 |
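Rolling (window) functions
Available on both Series and DataFrame:
- .rolling(w).sum() sum of each window of w consecutive elements
- .rolling(w).mean() arithmetic mean of each window of w consecutive elements
- .rolling(w).var() variance of each window of w consecutive elements
- .rolling(w).std() standard deviation of each window of w consecutive elements
- .rolling(w).min() .rolling(w).max() minimum / maximum of each window of w consecutive elements
b.rolling(2).sum()  # moving down each column, sum every window of 2 consecutive elements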
| 0 | 1 | 2 | 3 | 4 |
c | NaN | NaN | NaN | NaN | NaN |
a | 5.0 | 7.0 | 9.0 | 11.0 | 13.0 |
d | 15.0 | 17.0 | 19.0 | 21.0 | 23.0 |
b | 25.0 | 27.0 | 29.0 | 31.0 | 33.0 |
b.rolling(3).sum()
| 0 | 1 | 2 | 3 | 4 |
c | NaN | NaN | NaN | NaN | NaN |
a | NaN | NaN | NaN | NaN | NaN |
d | 15.0 | 18.0 | 21.0 | 24.0 | 27.0 |
b | 30.0 | 33.0 | 36.0 | 39.0 | 42.0 |
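The leading NaN rows appear because a window of w needs w values before it can produce a result. The sketch below shows this with .rolling(3).mean() on a small illustrative Series, and also the min_periods parameter of rolling, which lets shorter leading windows produce a value. This is an optional extra that the examples above do not use.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: the first two positions have fewer than 3 values, so they are NaN.
roll_mean = s.rolling(3).mean()

# With min_periods=1, a window is evaluated as soon as it holds one value.
roll_partial = s.rolling(3, min_periods=1).mean()

# Cross-check the last window by hand: it covers the final three elements.
last_by_hand = s.iloc[2:5].mean()

print(roll_mean.isna().tolist())   # [True, True, False, False, False]
print(np.isclose(roll_mean.iloc[-1], last_by_hand))  # True
```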