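To work with numerical data, Pandas provides several variants of window statistics: rolling, expanding, and exponentially weighted windows. The statistics available include sum, mean, median, variance, covariance, correlation, and so on.
Let's look at how to apply each of these to a DataFrame object.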
import pandas as pd
import numpy as np

# Ten days of random data in four columns (the values differ on every run)
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2020', periods=10),
                  columns=['A', 'B', 'C', 'D'])
df
| | A | B | C | D |
|---|---|---|---|---|
2020-01-01 | -0.490340 | 1.578133 | 0.680092 | -1.027103 |
2020-01-02 | 0.313825 | -1.008855 | -0.936955 | 1.335417 |
2020-01-03 | -1.246481 | -1.252494 | -0.813885 | -0.035055 |
2020-01-04 | 0.032313 | 0.214091 | -0.353881 | -0.146269 |
2020-01-05 | 1.073790 | -0.121087 | 0.197245 | 0.555992 |
2020-01-06 | 1.061563 | -0.607518 | 0.313361 | -0.926115 |
2020-01-07 | -0.051963 | 0.986390 | 0.198641 | -2.281789 |
2020-01-08 | -0.748859 | -0.032408 | -1.946981 | 0.357054 |
2020-01-09 | -0.617018 | 0.142933 | -0.998220 | 1.623923 |
2020-01-10 | -0.490075 | 0.445266 | -0.417166 | -2.848563 |
df.rolling(window=3).mean()
| | A | B | C | D |
|---|---|---|---|---|
2020-01-01 | NaN | NaN | NaN | NaN |
2020-01-02 | NaN | NaN | NaN | NaN |
2020-01-03 | -0.474332 | -0.227739 | -0.356916 | 0.091086 |
2020-01-04 | -0.300115 | -0.682419 | -0.701574 | 0.384698 |
2020-01-05 | -0.046793 | -0.386497 | -0.323507 | 0.124889 |
2020-01-06 | 0.722555 | -0.171505 | 0.052242 | -0.172130 |
2020-01-07 | 0.694463 | 0.085928 | 0.236416 | -0.883971 |
2020-01-08 | 0.086914 | 0.115488 | -0.478327 | -0.950283 |
2020-01-09 | -0.472613 | 0.365638 | -0.915520 | -0.100271 |
2020-01-10 | -0.618650 | 0.185264 | -1.120789 | -0.289195 |
Note: Because the window size is 3 (`window=3`), the first two rows are NaN. From the third row on, each value is the mean of elements n, n-1, and n-2.
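The note above can be checked by hand on a small series. The following is a minimal sketch with made-up values (the series `s` is purely illustrative, not data from this tutorial); it also shows `min_periods`, which relaxes the requirement of a full window.

```python
import numpy as np
import pandas as pd

# Small deterministic series so the result can be verified by hand.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=pd.date_range("2020-01-01", periods=5))

# With window=3, the first two positions do not have enough observations.
rolled = s.rolling(window=3).mean()

# The third value is the mean of elements n, n-1 and n-2: (1 + 2 + 3) / 3.
manual_third = (s.iloc[0] + s.iloc[1] + s.iloc[2]) / 3

# min_periods=1 emits a result as soon as one observation is in the window,
# so no leading NaN values are produced.
relaxed = s.rolling(window=3, min_periods=1).mean()

print(rolled)
print(relaxed)
```

Whether to keep the leading NaN values or fill them with `min_periods` depends on whether averages over partial windows are meaningful for the data at hand.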
The other functions mentioned above can be applied in the same way.
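As an illustration, the sketch below applies a few of those statistics to the same kind of rolling window. The small DataFrame is a hypothetical example (column `B` is deliberately twice column `A`, so the rolling correlation should come out close to 1).

```python
import numpy as np
import pandas as pd

# Hypothetical data: B is exactly 2 * A.
df = pd.DataFrame(
    {"A": [1.0, 2.0, 4.0, 7.0, 11.0],
     "B": [2.0, 4.0, 8.0, 14.0, 22.0]},
    index=pd.date_range("2020-01-01", periods=5),
)

window = df.rolling(window=3)

rolling_sum = window.sum()        # sum over each 3-row window
rolling_median = window.median()  # median over each 3-row window
rolling_var = window.var()        # sample variance (ddof=1 by default)

# Several statistics at once for a single column.
summary = df["A"].rolling(window=3).agg(["sum", "mean", "median", "var"])

# Rolling correlation between two columns.
rolling_corr = df["A"].rolling(window=3).corr(df["B"])

print(summary)
print(rolling_corr)
```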
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2018', periods=10),
                  columns=['A', 'B', 'C', 'D'])
# Expanding window: mean of all rows seen so far, once at least 3 are available
print(df.expanding(min_periods=3).mean())
| | A | B | C | D |
|---|---|---|---|---|
2018-01-01 | NaN | NaN | NaN | NaN |
2018-01-02 | NaN | NaN | NaN | NaN |
2018-01-03 | -0.755036 | -0.393182 | -0.085258 | 0.154460 |
2018-01-04 | -0.296575 | -0.399950 | -0.085478 | 0.175750 |
2018-01-05 | 0.135282 | -0.492159 | 0.000776 | 0.242496 |
2018-01-06 | 0.290829 | -0.200922 | -0.048882 | 0.276829 |
2018-01-07 | 0.491330 | -0.233410 | -0.102857 | 0.395980 |
2018-01-08 | 0.644808 | -0.123002 | 0.033552 | 0.363564 |
2018-01-09 | 0.392850 | -0.096239 | 0.062968 | 0.089894 |
2018-01-10 | 0.193132 | -0.180371 | -0.078381 | 0.148937 |
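An expanding window grows from the first row to the current one, so its mean is simply a running average. The sketch below checks this on a small made-up series by comparing against a cumulative sum divided by the number of rows seen so far.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# NaN until at least 3 observations are available, then the mean of all rows so far.
expanding_mean = s.expanding(min_periods=3).mean()

# The same running average computed by hand: cumulative sum / count.
running_mean = s.cumsum() / np.arange(1, len(s) + 1)

print(expanding_mean)
print(running_mean)
```

Unlike `rolling`, old observations never drop out of an expanding window, so later values change more and more slowly as new rows arrive.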
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2019', periods=10),
                  columns=['A', 'B', 'C', 'D'])
# Exponentially weighted mean; com=0.5 corresponds to alpha = 1 / (1 + com) = 2/3
print(df.ewm(com=0.5).mean())
| | A | B | C | D |
|---|---|---|---|---|
2019-01-01 | 1.385715 | 0.069459 | -1.262960 | 0.090088 |
2019-01-02 | 0.207468 | -1.808902 | 0.862542 | -1.505067 |
2019-01-03 | 0.460619 | 0.008610 | -0.666949 | -0.775913 |
2019-01-04 | 0.246850 | 0.122823 | -0.658795 | -0.037191 |
2019-01-05 | -0.366619 | 0.122734 | 0.418713 | -1.431486 |
2019-01-06 | -0.520607 | 0.922862 | 0.366691 | -1.236476 |
2019-01-07 | -0.695104 | 1.071023 | 0.411275 | -0.521211 |
2019-01-08 | 0.006274 | -0.513299 | -0.312668 | 1.152060 |
2019-01-09 | 0.066221 | -0.359152 | -0.267678 | 1.429472 |
2019-01-10 | -0.082413 | -0.993964 | -0.349745 | -0.079075 |
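To make the weighting concrete, here is a rough sketch that reproduces `ewm(com=0.5).mean()` by hand on a small made-up series. It assumes the pandas defaults (`adjust=True`), under which each value is a weighted average with weights `(1 - alpha) ** i` for the observation `i` steps back, where `alpha = 1 / (1 + com)`. The helper `manual_ewm` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative values only
com = 0.5
alpha = 1.0 / (1.0 + com)            # 2/3 for com=0.5

ewm_mean = s.ewm(com=com).mean()     # adjust=True is the default

def manual_ewm(values, alpha):
    """Weighted average with weights (1 - alpha) ** i, i = steps back in time."""
    out = []
    for t in range(len(values)):
        # The oldest observation gets the highest power, the newest gets power 0.
        weights = (1.0 - alpha) ** np.arange(t, -1, -1)
        out.append(np.sum(weights * values[: t + 1]) / np.sum(weights))
    return np.array(out)

manual = manual_ewm(s.to_numpy(), alpha)

# With adjust=False, pandas uses the recursive form instead:
# y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
recursive = [s.iloc[0]]
for x in s.iloc[1:]:
    recursive.append((1.0 - alpha) * recursive[-1] + alpha * x)
ewm_recursive = s.ewm(com=com, adjust=False).mean()

print(ewm_mean)
print(ewm_recursive)
```

A smaller `com` (larger `alpha`) puts more weight on recent observations, so the result follows the data more closely and smooths less.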
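Window functions are mainly used to find trends in data graphically by smoothing the curve. If daily data varies a lot and many data points are available, taking samples and plotting them is one approach, and applying window computations and plotting the results is another. Either approach smooths the curve or the trend.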
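The smoothing effect can be seen even without a plot. The sketch below builds a synthetic noisy series (a linear trend plus Gaussian noise, both assumptions made for this illustration) and compares how far the raw and the smoothed values scatter around the known trend.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: a linear trend plus random noise (fixed seed for repeatability).
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=200)
trend = np.linspace(0.0, 10.0, 200)
noisy = pd.Series(trend + rng.normal(0.0, 2.0, 200), index=idx)

# A centered 20-day rolling mean; the edges are NaN because the window is incomplete there.
smoothed = noisy.rolling(window=20, center=True).mean()

# Spread of each series around the underlying trend (NaN values are skipped).
raw_spread = (noisy - trend).std()
smooth_spread = (smoothed - trend).std()

print(f"raw spread:      {raw_spread:.3f}")
print(f"smoothed spread: {smooth_spread:.3f}")

# If matplotlib is installed, the two curves can be compared visually:
# ax = noisy.plot(alpha=0.4, label="raw")
# smoothed.plot(ax=ax, label="20-day rolling mean")
```

A wider window smooths more strongly but also reacts more slowly to genuine changes in the trend, so the window size is a trade-off to tune for the data.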