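There are many methods for computing descriptive statistics and other related operations on a DataFrame. Most of them are aggregations such as sum() and mean(), but some, such as cumsum(), produce an object of the same size. Generally, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or by integer:
DataFrame - "index" (axis=0, the default), "columns" (axis=1)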
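As a minimal sketch of the axis argument, the following uses a small hypothetical frame named toy (not the frame used in the rest of this chapter) to show that the axis name and the axis integer select the same direction.

```python
import pandas as pd

# Small hypothetical frame, used only to illustrate the axis argument
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 or axis="index": reduce down the rows, one result per column
col_totals = toy.sum(axis="index")

# axis=1 or axis="columns": reduce across the columns, one result per row
row_totals = toy.sum(axis="columns")

print(col_totals.tolist())  # [6, 60]
print(row_totals.tolist())  # [11, 22, 33]
```

Spelling the axis out by name tends to read more clearly than a bare 0 or 1, although both forms are accepted.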
The DataFrame created below is used to demonstrate every operation in this chapter.
import pandas as pd
import numpy as np
# Create a dictionary of Series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
# Create a DataFrame
df = pd.DataFrame(d)
df
 | Name | Age | Rating |
---|---|---|---|
0 | Tom | 25 | 4.23 |
1 | James | 26 | 3.24 |
2 | Ricky | 25 | 3.98 |
3 | Vin | 23 | 2.56 |
4 | Steve | 30 | 3.20 |
5 | Minsu | 29 | 4.60 |
6 | Jack | 23 | 3.80 |
7 | Lee | 34 | 3.78 |
8 | David | 40 | 2.98 |
9 | Gasper | 30 | 4.80 |
10 | Betina | 51 | 4.10 |
11 | Andres | 46 | 3.65 |
df.sum()
Name      TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object
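Each column is summed separately; string values are concatenated.
axis=1 example
Summing along axis=1 adds the values across each row. Because the frame contains the string column Name, the call below raises a TypeError on recent pandas versions: a str cannot be added to a number.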
df.sum(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 df.sum(1)
[intermediate pandas and numpy frames omitted]
TypeError: can only concatenate str (not "int") to str
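The TypeError above comes from adding the Name strings to the numeric values in the same row. One way around it, sketched here on a hypothetical two-row frame named df_small that reuses the first two rows of the chapter's data, is to pass numeric_only=True so that only numeric columns take part in the row-wise sum.

```python
import pandas as pd

# First two rows of the chapter's data; df_small is a name chosen for this sketch
df_small = pd.DataFrame({
    "Name": ["Tom", "James"],
    "Age": [25, 26],
    "Rating": [4.23, 3.24],
})

# numeric_only=True leaves the Name column out before adding across each row
row_sums = df_small.sum(axis=1, numeric_only=True)

# Equivalent approach: select the numeric columns explicitly, then sum
row_sums_alt = df_small.select_dtypes(include="number").sum(axis=1)

print(row_sums)
```

Both variants add Age and Rating for each row; which one to use is mostly a matter of taste.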
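The mean and the standard deviation are computed the same way. On recent pandas versions, pass numeric_only=True so that the string column is left out:
df.mean(numeric_only=True)
print(df.std(numeric_only=True))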
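No. | Function | Description |
---|---|---|
1 | count() | Number of non-null observations |
2 | sum() | Sum of all values |
3 | mean() | Mean of all values |
4 | median() | Median of all values |
5 | mode() | Mode (most frequent value) of the values |
6 | std() | Standard deviation of the values |
7 | min() | Minimum of all values |
8 | max() | Maximum of all values |
9 | abs() | Absolute value |
10 | prod() | Product of the values |
11 | cumsum() | Cumulative sum |
12 | cumprod() | Cumulative product |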
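To make the table concrete, here is a short sketch that applies several of these functions to a single numeric Series. The Series, named ages for this example, holds the first five Age values from the chapter's data.

```python
import pandas as pd

# First five Age values from the chapter's data
ages = pd.Series([25, 26, 25, 23, 30])

print(ages.count())            # 5
print(ages.sum())              # 129
print(ages.mean())             # 25.8
print(ages.median())           # 25.0
print(ages.mode().tolist())    # [25]
print(ages.min(), ages.max())  # 23 30

# std() uses the sample formula (ddof=1) by default
print(ages.std())

# cumsum() is not a reduction: the result has the same length as the input
print(ages.cumsum().tolist())  # [25, 51, 76, 99, 129]
```

Note that mode() returns a Series rather than a scalar, since several values can tie for most frequent.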
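Note - Because a DataFrame is a heterogeneous data structure, generic operations do not work with all functions.
Functions such as sum() and cumsum() work with both numeric and character (or string) data elements without raising an error when applied column-wise (the default axis).
Character aggregations are rarely used in practice, even though these functions raise no exception.
Functions such as abs() and cumprod() throw an exception when the DataFrame contains character or string data, because such operations cannot be performed on strings.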
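The behaviour described in this note can be checked with a small sketch. The frame below, named mixed, is hypothetical: its Rating values are given signs purely so that abs() has something to do, and the Name column is forced to object dtype so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical frame with one string column and one signed numeric column
mixed = pd.DataFrame({
    "Name": pd.Series(["Tom", "James"], dtype=object),
    "Rating": [-4.23, 3.24],
})

try:
    mixed.abs()
    raised = False
except TypeError:
    # abs() is undefined for str values, so the whole call fails
    raised = True

# Restricting the frame to its numeric columns first avoids the error
numeric_abs = mixed.select_dtypes(include="number").abs()

print(raised)
print(numeric_abs["Rating"].tolist())  # [4.23, 3.24]
```

Selecting the numeric columns with select_dtypes is one option; indexing the wanted columns by name works just as well.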
df.describe()
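This function reports the count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%) of each column. By default it excludes character columns and summarizes only the numeric ones. The include argument specifies which columns to consider in the summary; it takes a list of values, and when it is omitted only numeric columns are summarized.
object - summarizes string columns
number - summarizes numeric columns
all - summarizes all columns together (should not be passed as a list value)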
Now use the following statement in the program and check the output -
df.describe(include=['object'])
Now use the following statement and look at the output -
df.describe(include='all')
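The three forms of describe() can be compared side by side. The sketch below uses a hypothetical three-row frame named sample, with the Name column forced to object dtype so that include=['object'] matches it regardless of the pandas version's string inference.

```python
import pandas as pd

# Hypothetical frame: one string column, one numeric column
sample = pd.DataFrame({
    "Name": pd.Series(["Tom", "James", "Tom"], dtype=object),
    "Age": [25, 26, 25],
})

numeric_summary = sample.describe()                   # numeric columns only
object_summary = sample.describe(include=["object"])  # string columns only
full_summary = sample.describe(include="all")         # every column

print(list(numeric_summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(list(object_summary.index))
# ['count', 'unique', 'top', 'freq']
print(list(full_summary.columns))
# ['Name', 'Age']
```

With include='all', statistics that do not apply to a column (for example the mean of Name) show up as NaN in the combined summary.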