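Aggregation is a core pandas feature for data analysis: it offers a concise, efficient interface for computing statistics and extracting features.
Once a rolling, expanding, or ewm object has been created, several methods are available for aggregating the data.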
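The three time-window aggregators are:
- Rolling window (rolling): computations over a moving window of fixed size
- Expanding window (expanding): cumulative computations from the first point up to the current point
- Exponentially weighted (ewm): computations that give more weight to recent data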
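To make the difference between the three window types concrete, here is a minimal sketch that applies each of them to the same small series. The series values, the window size of 3, and `span=3` are illustrative assumptions, not values taken from the examples in this tutorial.

```python
import pandas as pd

# Small deterministic series so the three window types are easy to compare.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling: mean of at most the last 3 observations.
rolling_mean = s.rolling(window=3, min_periods=1).mean()

# Expanding: mean of every observation seen so far.
expanding_mean = s.expanding(min_periods=1).mean()

# Exponentially weighted: recent observations count more (span=3 is an assumed setting).
ewm_mean = s.ewm(span=3).mean()

print(pd.DataFrame({"rolling": rolling_mean,
                    "expanding": expanding_mean,
                    "ewm": ewm_mean}))
```

On an increasing series like this one, the exponentially weighted mean ends up above the expanding mean because it leans toward the most recent values.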
The example below builds a DataFrame of random values and creates a rolling object on it:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2019', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)
                   A         B         C         D
2019-01-01 -0.076407 -0.604624  0.689858 -1.673182
2019-01-02 -0.201371  0.267649 -0.529736 -1.855319
2019-01-03 -1.123104  0.757734 -0.638233 -0.543501
2019-01-04  0.352778 -0.900610 -0.286233 -0.864941
2019-01-05  0.282081 -0.502212  0.752460  0.450970
2019-01-06 -0.648663 -1.045604  1.026154 -0.843149
2019-01-07  0.766279 -1.049115  2.022508 -1.055975
2019-01-08 -1.275031 -0.005891  0.061255  1.583593
2019-01-09 -0.121790  0.727877  0.956368 -0.232723
2019-01-10 -1.045568  0.159348 -0.089502  0.499241
print("=======================================")
=======================================
r = df.rolling(window=3, min_periods=1)
print(r)
Rolling [window=3,min_periods=1,center=False,axis=0,method=single]
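The rolling object printed above only records the window settings; nothing is computed until an aggregation is called. The following sketch, using a small made-up series, illustrates what `min_periods=1` changes compared with the default for an integer window.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# Default: for an integer window, min_periods equals the window size,
# so the first two results are NaN.
strict = s.rolling(window=3).sum()

# min_periods=1: a result is produced as soon as one observation is available.
lenient = s.rolling(window=3, min_periods=1).sum()

print(pd.DataFrame({"default": strict, "min_periods=1": lenient}))
```

This is why the rolling sums in this tutorial have no missing values in their first rows.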
Apply an aggregation to the whole DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)
                   A         B         C         D
2000-01-01  1.146495 -0.762095 -1.578408 -0.959064
2000-01-02 -0.331831 -0.966427 -1.192286  0.708527
2000-01-03 -1.284448 -0.761996  0.134338  0.778285
2000-01-04  0.856504 -0.459035  1.503857 -0.331440
2000-01-05 -0.678272 -0.336172  1.041499 -0.418825
2000-01-06 -0.816490  0.996931 -1.340490 -0.417563
2000-01-07  0.099249 -0.237366  0.134811 -0.419044
2000-01-08 -0.248770 -1.290146 -0.297769 -0.481473
2000-01-09  0.235095 -0.184626 -2.239425 -0.231049
2000-01-10  1.496287  0.230087  1.586259 -0.627747
r = df.rolling(window=3, min_periods=1)
# Pass the string 'sum' rather than np.sum, which triggers a FutureWarning in recent pandas.
print(r.aggregate('sum'))
                   A         B         C         D
2000-01-01  1.146495 -0.762095 -1.578408 -0.959064
2000-01-02  0.814664 -1.728522 -2.770694 -0.250536
2000-01-03 -0.469784 -2.490517 -2.636356  0.527749
2000-01-04 -0.759775 -2.187457  0.445909  1.155372
2000-01-05 -1.106216 -1.557202  2.679695  0.028020
2000-01-06 -0.638258  0.201725  1.204866 -1.167828
2000-01-07 -1.395513  0.423394 -0.164180 -1.255431
2000-01-08 -0.966012 -0.530580 -1.503449 -1.318080
2000-01-09  0.085574 -1.712138 -2.402384 -1.131566
2000-01-10  1.482612 -1.244685 -0.950935 -1.340269
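Recent pandas versions emit a FutureWarning when a NumPy callable such as `np.sum` is passed to `aggregate` on a rolling object, and recommend the string name instead. As a small sketch on made-up data, the string alias and the direct method call should give the same result:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                   "B": [10.0, 20.0, 30.0, 40.0]})
r = df.rolling(window=3, min_periods=1)

by_name = r.aggregate("sum")   # string alias, resolved to the rolling sum
by_method = r.sum()            # direct method call

print(by_name)
print(by_name.equals(by_method))
```

Either form avoids the warning; the string form is convenient when the function name comes from a list or a dict.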
Apply an aggregation to a single column:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)
                   A         B         C         D
2000-01-01 -1.057240 -0.141806 -0.833138  0.725320
2000-01-02  0.506317  1.554842  0.849750  0.391421
2000-01-03 -0.258519 -0.263724 -0.109513  0.044725
2000-01-04 -1.375700  0.351015 -0.549730 -0.701180
2000-01-05  2.028659 -0.126678 -0.554116 -0.044020
2000-01-06  1.377150 -0.117405  0.271415 -2.345985
2000-01-07  3.365037  0.196476  1.267574  0.524226
2000-01-08 -1.975397 -0.093593 -0.500846  0.185895
2000-01-09  1.289750  0.988913 -0.035509 -0.265597
2000-01-10 -0.396307 -1.828000  0.149544 -1.366281
print("====================================")
====================================
r = df.rolling(window=3, min_periods=1)
# Pass the string 'sum' rather than np.sum, which triggers a FutureWarning in recent pandas.
print(r['A'].aggregate('sum'))
2000-01-01   -1.057240
2000-01-02   -0.550923
2000-01-03   -0.809442
2000-01-04   -1.127902
2000-01-05    0.394439
2000-01-06    2.030108
2000-01-07    6.770845
2000-01-08    2.766789
2000-01-09    2.679390
2000-01-10   -1.081954
Freq: D, Name: A, dtype: float64
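Selecting a column from the rolling object, as done above, should be interchangeable with selecting the column first and then rolling over it. A minimal sketch on made-up data, with an assumed window of 2:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                   "B": [5.0, 6.0, 7.0, 8.0]})

# Select the column from the rolling object ...
from_rolling = df.rolling(window=2, min_periods=1)["A"].sum()

# ... or roll over the already selected column.
from_column = df["A"].rolling(window=2, min_periods=1).sum()

print(from_rolling)
print(from_rolling.equals(from_column))
```

Reusing one rolling object is handy when several columns share the same window settings.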
Apply an aggregation to multiple columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2018', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)
                   A         B         C         D
2018-01-01  0.679857 -1.145790 -0.793332  0.842598
2018-01-02  0.213620  0.922840 -0.461999  1.692071
2018-01-03  0.358067  0.247205  1.791173  0.659440
2018-01-04  0.209564 -1.426703  0.662062 -3.467233
2018-01-05 -0.677598  1.036215 -0.700577  0.151027
2018-01-06  0.695804  0.290036  0.291509 -1.447757
2018-01-07  0.806528  0.224071  0.132865  0.859766
2018-01-08 -1.539445  1.077223 -0.109202  0.155983
2018-01-09  0.334784 -1.784773 -2.312255  0.968719
2018-01-10  0.311545  0.068022 -1.052203 -0.309330
print ("==========================================")
==========================================
r = df.rolling(window=3, min_periods=1)
# Pass the string 'sum' rather than np.sum, which triggers a FutureWarning in recent pandas.
print(r[['A', 'B']].aggregate('sum'))
                   A         B
2018-01-01  0.679857 -1.145790
2018-01-02  0.893477 -0.222950
2018-01-03  1.251545  0.024255
2018-01-04  0.781252 -0.256657
2018-01-05 -0.109966 -0.143283
2018-01-06  0.227770 -0.100452
2018-01-07  0.824734  1.550322
2018-01-08 -0.037113  1.591330
2018-01-09 -0.398133 -0.483479
2018-01-10 -0.893115 -0.639528
Apply multiple functions to a single column:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('2019/01/01', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)
                   A         B         C         D
2019-01-01  1.469902 -0.147290  1.193971 -1.379634
2019-01-02  0.308708 -1.725796  1.121596 -0.787365
2019-01-03  0.912042  2.734434  0.593499  1.336479
2019-01-04  0.067794 -1.204641 -0.112293 -0.896542
2019-01-05 -0.526009 -0.379445 -0.872301  0.699929
2019-01-06 -0.915561 -0.970859  0.373168 -1.444811
2019-01-07  1.146323 -0.190591  0.083050  1.327847
2019-01-08 -0.918293  0.121730 -0.026383  1.292424
2019-01-09  0.354194 -0.607547 -0.509952 -0.587810
2019-01-10 -1.102412 -0.863641  0.493987  2.001870
print("==========================================")
==========================================
r = df.rolling(window=3, min_periods=1)
# Pass string names rather than np.sum and np.mean, which trigger a FutureWarning in recent pandas.
print(r['A'].aggregate(['sum', 'mean']))
                 sum      mean
2019-01-01  1.469902  1.469902
2019-01-02  1.778610  0.889305
2019-01-03  2.690653  0.896884
2019-01-04  1.288544  0.429515
2019-01-05  0.453827  0.151276
2019-01-06 -1.373777 -0.457926
2019-01-07 -0.295247 -0.098416
2019-01-08 -0.687531 -0.229177
2019-01-09  0.582224  0.194075
2019-01-10 -1.666511 -0.555504
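With random input it is hard to check such a result by eye. The sketch below repeats the list-of-functions call on a small made-up column, with an assumed window of 2, so each value can be verified by hand.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
r = df.rolling(window=2, min_periods=1)

# One output column per function name, in the order given.
result = r["A"].aggregate(["sum", "mean"])
print(result)

# Row 2 covers the values 2.0 and 3.0: sum 5.0, mean 2.5.
print(result.loc[2, "sum"], result.loc[2, "mean"])
```

The result is a DataFrame whose column labels are the function names passed in the list.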
Apply multiple functions to multiple columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('2020/01/01', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)
                   A         B         C         D
2020-01-01 -0.411810  2.601599  0.200239 -1.263365
2020-01-02 -1.315644  1.235365  1.141597  0.560113
2020-01-03  1.446070  1.073482  0.769908 -0.951471
2020-01-04 -0.132483 -0.013865  0.741292 -1.040893
2020-01-05  1.108717 -0.820891  0.936647  0.524640
2020-01-06  1.426754 -1.372367 -1.081417  0.598782
2020-01-07  0.558599 -0.380742 -0.393100 -1.663461
2020-01-08 -0.123466 -1.645598 -1.097948  0.710898
2020-01-09 -1.128187 -0.294246 -0.660669  0.755610
2020-01-10 -0.091495  1.209540 -0.860721  1.093040
print("==========================================")
==========================================
r = df.rolling(window=3, min_periods=1)
# Pass string names rather than np.sum and np.mean, which trigger a FutureWarning in recent pandas.
print(r[['A', 'B']].aggregate(['sum', 'mean']))
                   A                   B
                 sum      mean       sum      mean
2020-01-01 -0.411810 -0.411810  2.601599  2.601599
2020-01-02 -1.727454 -0.863727  3.836964  1.918482
2020-01-03 -0.281384 -0.093795  4.910446  1.636815
2020-01-04 -0.002057 -0.000686  2.294981  0.764994
2020-01-05  2.422304  0.807435  0.238726  0.079575
2020-01-06  2.402988  0.800996 -2.207124 -0.735708
2020-01-07  3.094070  1.031357 -2.574000 -0.858000
2020-01-08  1.861887  0.620629 -3.398707 -1.132902
2020-01-09 -0.693053 -0.231018 -2.320586 -0.773529
2020-01-10 -1.343148 -0.447716 -0.730305 -0.243435
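Applying several functions to several columns yields two-level column labels: the original column on the first level and the function name on the second. The following sketch, on made-up data with an assumed window of 2, shows a few ways to read such a result; the flattened names like `A_sum` are just one possible convention.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                   "B": [4.0, 5.0, 6.0]})
result = df.rolling(window=2, min_periods=1)[["A", "B"]].aggregate(["sum", "mean"])

a_mean = result[("A", "mean")]                   # one (column, function) pair
all_means = result.xs("mean", axis=1, level=1)   # the "mean" result for every column

# Optionally flatten the two levels into single names such as "A_sum".
flat = result.copy()
flat.columns = ["_".join(pair) for pair in result.columns]
print(flat)
```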
Apply different functions to different columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 4),
                  index=pd.date_range('2020/01/01', periods=3),
                  columns=['A', 'B', 'C', 'D'])
print(df)
                   A         B         C         D
2020-01-01 -0.348250  0.660252 -0.991954  0.007428
2020-01-02 -0.026760  1.292410 -1.311106  0.966178
2020-01-03  0.278632 -1.729345 -0.321379  0.864611
print("==========================================")
==========================================
r = df.rolling(window=3, min_periods=1)
# Pass string names rather than np.sum and np.mean, which trigger a FutureWarning in recent pandas.
print(r.aggregate({'A': 'sum', 'B': 'mean'}))
                   A         B
2020-01-01 -0.348250  0.660252
2020-01-02 -0.375010  0.976331
2020-01-03 -0.096378  0.074439
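All the examples so far use a rolling window, but the dict form of `aggregate` should work the same way on an expanding object. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                   "B": [2.0, 4.0, 6.0]})

# Expanding window: each row aggregates everything from the start up to that row.
e = df.expanding(min_periods=1)

# Cumulative sum of A and running mean of B in a single call.
result = e.aggregate({"A": "sum", "B": "mean"})
print(result)
```

Only the columns named in the dict appear in the result, each aggregated with its own function.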