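For working with numerical data, Pandas provides several kinds of window statistics: rolling windows, expanding windows, and exponentially weighted moving windows. The statistics available on them include sum, mean, median, variance, covariance, correlation, and others.
The examples below show how to apply each of these methods to a DataFrame object.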
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2020', periods=10),
columns = ['A', 'B', 'C', 'D'])
df
| | A | B | C | D |
---|---|---|---|---|
2020-01-01 | -2.220810 | 1.396098 | 0.446487 | -0.592039 |
2020-01-02 | 0.468810 | -1.144222 | 0.444249 | -0.418163 |
2020-01-03 | 0.830163 | -0.163198 | 0.189963 | 0.429585 |
2020-01-04 | 1.012692 | -0.038890 | 0.040714 | -1.454456 |
2020-01-05 | 0.763061 | -0.943831 | 1.883027 | -0.198797 |
2020-01-06 | 0.292221 | -1.185507 | -1.327939 | -0.680232 |
2020-01-07 | -0.654039 | -0.362139 | 3.001202 | 0.965824 |
2020-01-08 | -0.082240 | 0.525137 | -0.201329 | 0.490616 |
2020-01-09 | 1.062515 | 0.909854 | 0.126035 | -1.224909 |
2020-01-10 | -0.750193 | -0.034899 | 0.393972 | 0.008651 |
df.rolling(window=3).mean()
| | A | B | C | D |
---|---|---|---|---|
2020-01-01 | NaN | NaN | NaN | NaN |
2020-01-02 | NaN | NaN | NaN | NaN |
2020-01-03 | -0.307279 | 0.029559 | 0.360233 | -0.193539 |
2020-01-04 | 0.770555 | -0.448770 | 0.224975 | -0.481011 |
2020-01-05 | 0.868639 | -0.381973 | 0.704568 | -0.407889 |
2020-01-06 | 0.689325 | -0.722743 | 0.198601 | -0.777828 |
2020-01-07 | 0.133747 | -0.830492 | 1.185430 | 0.028932 |
2020-01-08 | -0.148020 | -0.340837 | 0.490645 | 0.258736 |
2020-01-09 | 0.108745 | 0.357617 | 0.975303 | 0.077177 |
2020-01-10 | 0.076694 | 0.466697 | 0.106226 | -0.241881 |
Note: because the window size is 3, the first two rows are NaN. From the third row onward, each value is the mean of elements `n`, `n-1`, and `n-2`.
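The note above can be checked by hand on a small series. The following is a minimal sketch, assuming made-up values (the series `s` is illustrative and is not the random data used in the tutorial); it also applies `sum` and `median` to the same window to show that other statistics follow the same pattern.

```python
import pandas as pd

# Small, hand-checkable series (illustrative values only).
s = pd.Series([1.0, 2.0, 6.0, 4.0, 8.0])

rolling_mean = s.rolling(window=3).mean()

# With window=3, the first two positions have too few observations.
assert rolling_mean.iloc[:2].isna().all()

# Position 2 is the mean of positions 0, 1 and 2: (1 + 2 + 6) / 3
manual = (s.iloc[0] + s.iloc[1] + s.iloc[2]) / 3
assert abs(rolling_mean.iloc[2] - manual) < 1e-12

# Other statistics are computed over the same three-row window.
rolling_sum = s.rolling(window=3).sum()
rolling_median = s.rolling(window=3).median()

print(pd.DataFrame({"mean": rolling_mean,
                    "sum": rolling_sum,
                    "median": rolling_median}))
```

By default `rolling` requires a full window before producing a value; passing `min_periods=1` would instead return results for the partial windows at the start.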
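The other statistics mentioned above can be applied in the same way. The same pattern also works with an expanding window, which covers every row from the start of the data up to the current one; `min_periods=3` means that at least three observations are needed before a result is produced.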
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2018', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df.expanding(min_periods=3).mean())
| | A | B | C | D |
---|---|---|---|---|
2018-01-01 | NaN | NaN | NaN | NaN |
2018-01-02 | NaN | NaN | NaN | NaN |
2018-01-03 | -0.510928 | -0.166144 | -0.216935 | 0.081073 |
2018-01-04 | -0.608648 | 0.000093 | -0.583888 | 0.071027 |
2018-01-05 | -0.570144 | 0.104203 | -0.390370 | 0.298451 |
2018-01-06 | -0.512529 | 0.025339 | -0.396747 | 0.337669 |
2018-01-07 | -0.565039 | -0.241643 | -0.215155 | 0.272673 |
2018-01-08 | -0.491737 | -0.230343 | -0.058418 | 0.238158 |
2018-01-09 | -0.343275 | -0.199525 | 0.038289 | 0.334641 |
2018-01-10 | -0.254516 | -0.208652 | 0.067068 | 0.497680 |
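To make the expanding result easier to read, here is a minimal sketch that reproduces an expanding mean by hand as a running total divided by the number of rows seen so far. The series `s` and the helper name `manual` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values, chosen so the running mean is easy to verify.
s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

expanding_mean = s.expanding(min_periods=3).mean()

# By hand: cumulative sum divided by the count of rows so far.
manual = s.cumsum() / np.arange(1, len(s) + 1)
# min_periods=3 leaves the first two rows empty.
manual.iloc[:2] = np.nan

assert np.allclose(expanding_mean, manual, equal_nan=True)
print(expanding_mean)
```

Unlike a rolling window, the expanding window never drops old rows, so each new value changes the result less and less as the data grows.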
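The `ewm` method applies exponentially weighted windows, which give recent observations more weight than older ones. Here the decay is specified through the center of mass, `com`.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2019', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df.ewm(com=0.5).mean())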
| | A | B | C | D |
---|---|---|---|---|
2019-01-01 | -0.186724 | -1.152170 | -1.027572 | 0.915452 |
2019-01-02 | 0.080397 | -0.974813 | -0.366436 | 2.204113 |
2019-01-03 | -0.164281 | -0.184914 | -0.870986 | 0.326826 |
2019-01-04 | 0.470578 | -0.185053 | -0.937103 | -1.222960 |
2019-01-05 | 0.227921 | 0.090869 | -1.446375 | -1.192274 |
2019-01-06 | 0.953694 | 0.146926 | -0.197161 | -0.836044 |
2019-01-07 | -0.437743 | -0.500187 | 0.323679 | 0.182713 |
2019-01-08 | -0.028205 | -0.449859 | 0.224972 | 0.412663 |
2019-01-09 | -1.180212 | -0.031032 | -0.228291 | 0.014042 |
2019-01-10 | -1.006268 | -0.025079 | 0.045223 | 0.998446 |
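The weights behind `ewm(com=0.5)` can be spelled out explicitly. The sketch below assumes the pandas defaults (in particular `adjust=True`), under which `com` maps to a smoothing factor `alpha = 1 / (1 + com)` and each result is a weighted average with weights `(1 - alpha) ** i` for the value `i` steps back. The series `s` and the function `manual_ewm` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative values
com = 0.5
alpha = 1.0 / (1.0 + com)  # com=0.5 gives alpha = 2/3

ewm_mean = s.ewm(com=com).mean()

def manual_ewm(values, alpha):
    """Weighted average with geometrically decaying weights (adjust=True form)."""
    out = []
    for t in range(len(values)):
        # Exponents t, t-1, ..., 0: the oldest value gets the smallest weight.
        weights = (1 - alpha) ** np.arange(t, -1, -1)
        out.append(np.dot(weights, values[: t + 1]) / weights.sum())
    return np.array(out)

manual = manual_ewm(s.to_numpy(), alpha)
assert np.allclose(ewm_mean.to_numpy(), manual)
print(ewm_mean)
```

A smaller `com` means a larger `alpha`, so the result tracks the most recent values more closely. Note that no leading NaN rows appear, which matches the `ewm` output shown above.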
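Window functions are mainly used to find trends in data graphically by smoothing the curve. When daily data fluctuates a lot and many data points are available, one approach is to resample the data and plot it; another is to apply a window computation and plot the result. Either approach smooths the curve and makes the trend easier to see.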
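The two approaches can be compared on synthetic data. This is a rough sketch under stated assumptions: the daily series is generated here as a linear trend plus Gaussian noise, and the weekly resampling frequency and 30-day window are arbitrary choices for illustration, not values from the text.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: a linear trend plus noise (assumed for illustration).
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=365, freq="D")
trend = np.linspace(0.0, 10.0, len(idx))
noisy = pd.Series(trend + rng.normal(scale=2.0, size=len(idx)), index=idx)

# Approach 1: resample to weekly means (fewer, smoother points).
weekly = noisy.resample("W").mean()

# Approach 2: 30-day rolling mean (same daily index, smoother values).
smooth = noisy.rolling(window=30).mean()

# Spread around the underlying trend before and after smoothing.
noise_raw = (noisy - trend).std()
noise_smooth = (smooth - trend).dropna().std()

print(f"points: daily={len(noisy)}, weekly={len(weekly)}")
print(f"spread around trend: raw={noise_raw:.3f}, rolling={noise_smooth:.3f}")
```

Calling `.plot()` on `noisy`, `weekly`, and `smooth` (which requires matplotlib) would show the same effect visually: the smoothed lines follow the trend while the raw series jumps around it.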