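A sparse object is "compressed" by omitting every value that matches a chosen fill value (NaN/missing by default, although any value can be chosen). A special SparseIndex object tracks where the data has been "sparsified". An example makes this clearer.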
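In older versions of pandas, all of the standard data structures provided a to_sparse method. That method was deprecated and then removed in pandas 1.0; the current approach is to build the sparse data with pandas.arrays.SparseArray: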
import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(10))
# Blank out everything except the first two and the last two values.
ts.iloc[2:-2] = np.nan
# Build a sparse array; NaN is the default fill value for float data.
sts = pd.arrays.SparseArray(ts)
print(sts)
[0.41409625411779355, -1.5475189937381009, nan, nan, nan, nan, nan, nan, 0.2557387139917992, -2.7711120403907983]
Fill: nan
IntIndex
Indices: array([0, 1, 8, 9], dtype=int32)
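As a small illustration of the compression described above, the sketch below uses 0 rather than NaN as the omitted value and inspects what the array actually stores. The sample data and the variable names `values` and `arr` are made up for this example; `fill_value`, `sp_values`, `sp_index` and `density` are attributes of `pandas.arrays.SparseArray`.

```python
import numpy as np
import pandas as pd

# Hypothetical, deterministic sample data: mostly zeros.
values = np.array([0, 0, 5, 0, 0, 7, 0, 0])

# Treat 0 (instead of NaN) as the value that is left out of storage.
arr = pd.arrays.SparseArray(values, fill_value=0)

print(arr.fill_value)  # the value that is omitted
print(arr.sp_values)   # only the values that are physically stored
print(arr.sp_index)    # the index object recording where they sit
print(arr.density)     # fraction of positions that are stored
```

Only two of the eight positions are stored here; every other position is reconstructed from the fill value on demand.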
Sparse objects exist for the sake of memory efficiency.
Now suppose you have a large DataFrame that is almost entirely NA, and run the following code:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10000, 4))
# Rows 0 through 9998 become NaN (.loc slicing is inclusive); only the last row keeps data.
df.loc[:9998] = np.nan
# pd.SparseDtype describes a dtype, so pd.SparseDtype(df) raises a TypeError;
# pass it to astype instead (df.to_sparse() no longer exists).
sdf = df.astype(pd.SparseDtype("float", np.nan))
# density is reached through the .sparse accessor, not directly on the DataFrame.
print(sdf.sparse.density)
0.0001
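To make the memory argument concrete, the following sketch compares the bytes used by a mostly-NaN DataFrame before and after conversion to a sparse dtype. The names `dense`, `sparse`, `dense_bytes` and `sparse_bytes` are chosen for this example only, and the exact byte counts may vary with the pandas version and platform, so no figures are quoted.

```python
import numpy as np
import pandas as pd

dense = pd.DataFrame(np.random.randn(10000, 4))
dense.loc[:9998] = np.nan  # keep real data only in the last row

# Convert every column to a sparse float dtype with NaN as the fill value.
sparse = dense.astype(pd.SparseDtype("float", np.nan))

# Compare the memory used by the column data (index excluded).
dense_bytes = dense.memory_usage(index=False).sum()
sparse_bytes = sparse.memory_usage(index=False).sum()
print("dense :", dense_bytes, "bytes")
print("sparse:", sparse_bytes, "bytes")
```

The sparse frame only has to store the handful of non-NaN values plus their positions, which is why its footprint is far smaller.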
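Any sparse object can be converted back to the standard dense form by calling to_dense, which current pandas exposes through the .sparse accessor:
import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(10))
ts.iloc[2:-2] = np.nan
# ts.to_sparse() no longer exists; convert with a sparse dtype instead.
sts = ts.astype(pd.SparseDtype("float", np.nan))
print(sts.sparse.to_dense())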
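A quick way to convince yourself that the conversion is lossless is to round-trip a small Series and compare it with the original. This is a sketch with made-up data; the names `original`, `as_sparse` and `restored` are hypothetical.

```python
import numpy as np
import pandas as pd

original = pd.Series([1.5, np.nan, np.nan, -2.0, np.nan])

# Dense -> sparse -> dense again.
as_sparse = original.astype(pd.SparseDtype("float64", np.nan))
restored = as_sparse.sparse.to_dense()

print(restored.dtype)             # back to a regular NumPy dtype
print(restored.equals(original))  # equals() treats NaNs in the same position as equal
```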
The default fill value depends on the dtype of the data:
float64: np.nan
int64: 0
bool: False
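The defaults listed above can be checked by constructing a `pd.SparseDtype` for each dtype without passing a fill value and reading its `fill_value` attribute. The dictionary name `defaults` is just for this example.

```python
import pandas as pd

# Build a SparseDtype for each supported dtype and record its default fill value.
defaults = {name: pd.SparseDtype(name).fill_value
            for name in ("float64", "int64", "bool")}

for name, fill in defaults.items():
    print(name, "->", fill)
```

If a different fill value suits the data better, it can be passed as the second argument, for example `pd.SparseDtype("float64", 0.0)`.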
Run the following code to see this in practice:
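import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, np.nan])
print(s)
print("=============================")
# astype returns a new Series, so assign the result back (s.to_sparse() no longer exists).
s = s.astype(pd.SparseDtype("float64"))
print(s)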