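Missing data is a constant problem in real-world datasets. In fields such as machine learning and data mining, poor data quality caused by missing values can seriously degrade the accuracy of model predictions, so handling missing values is a key step toward more accurate and effective models.
Consider an online survey about a product. Respondents often do not share everything about themselves: some describe their experience but not how long they have used the product; others share how long they have used it and their experience, but not their contact information. One way or another, some part of the data always ends up missing, and this is very common.
Let's now look at how to handle missing values (such as NA or NaN) with Pandas.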
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# Reindexing adds rows 'b', 'd' and 'g'; they have no data, so they are filled with NaN.
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
| | one | two | three |
---|---|---|---|
a | -0.046133 | 0.542130 | 0.244085 |
b | NaN | NaN | NaN |
c | -0.544291 | 0.908537 | -1.915221 |
d | NaN | NaN | NaN |
e | 0.219617 | -0.144276 | 0.362802 |
f | 1.218952 | 0.371396 | 1.206544 |
g | NaN | NaN | NaN |
h | 0.043141 | -1.546063 | 0.762845 |
Example 1: detecting missing values with isnull()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True marks the rows where column 'one' is missing.
print(df['one'].isnull())
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
Example 2: detecting present values with notnull()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True marks the rows where column 'one' holds a value.
print(df['one'].notnull())
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
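The boolean masks returned by isnull() and notnull() are usually combined with other operations rather than printed directly. The following is a minimal sketch of two common uses, counting missing values per column and filtering rows; the fixed sample values and the variable names are illustrative assumptions, chosen so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Fixed values (instead of random ones) keep the result reproducible.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan],
     "two": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# isnull() returns a boolean mask; summing it counts the True entries per column.
missing_per_column = df.isnull().sum()

# Use the notnull() mask to keep only the rows where column 'one' has a value.
rows_with_one = df[df["one"].notnull()]

print(missing_per_column)
print(rows_with_one)
```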
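Example 1: summing a column that contains NaN (missing values are skipped)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# sum() skips NaN by default, so the missing rows contribute nothing.
print(df['one'].sum())
0.6235585460015739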
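Example 2: summing a column in which every value is missing
import pandas as pd
import numpy as np
# No data is supplied, so every cell is NaN.
df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
print(df['one'].sum())
0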
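The two sums above rely on the default behaviour of sum(), which skips NaN. As a small sketch of how that default can be changed, the snippet below uses the `skipna` and `min_count` parameters of `Series.sum`; the sample Series and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.5])
all_missing = pd.Series([np.nan, np.nan])

# Default: NaN is skipped, so only 1.0 and 2.5 are added.
default_sum = s.sum()

# skipna=False propagates NaN: one missing value makes the whole sum NaN.
strict_sum = s.sum(skipna=False)

# With every value missing, the default sum is 0.
empty_sum = all_missing.sum()

# min_count=1 requires at least one valid value; otherwise the result is NaN.
guarded_sum = all_missing.sum(min_count=1)

print(default_sum, strict_sum, empty_sum, guarded_sum)
```

Using `min_count=1` is one way to tell "the values add up to zero" apart from "there was nothing to add".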
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3),
index=['a', 'c', 'e'],
columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
        one       two     three
a -0.540132  0.929588 -0.521647
b       NaN       NaN       NaN
c -1.377040 -1.621413  0.494371
NaN replaced with '0':
        one       two     three
a -0.540132  0.929588 -0.521647
b  0.000000  0.000000  0.000000
c -1.377040 -1.621413  0.494371
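Filling with a single scalar such as 0 is not always appropriate, because it can distort column statistics. As a hedged sketch of two alternatives, fillna() also accepts a Series or a dict keyed by column name; the sample data and the chosen fill values below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0],
     "two": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# df.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NaN is skipped when computing it).
filled_mean = df.fillna(df.mean())

# A dict gives an explicit fill value per column.
filled_custom = df.fillna({"one": 0.0, "two": -1.0})

print(filled_mean)
print(filled_custom)
```

Both calls return a new DataFrame and leave `df` unchanged.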
Method | Action |
---|---|
pad / ffill | Fill values forward |
bfill / backfill | Fill values backward |
Example 1: forward fill with ffill()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# fillna(method='pad') is deprecated; ffill() is the equivalent call.
print(df.ffill())
        one       two     three
a -0.963093 -1.214628  0.104516
b -0.963093 -1.214628  0.104516
c  0.308843 -1.109464  0.394322
d  0.308843 -1.109464  0.394322
e  1.871751 -0.628859  0.510016
f  0.881671  1.114474 -0.341950
g  0.881671  1.114474 -0.341950
h -0.866913 -0.393091 -0.218434
Example 2: backward fill with bfill()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# fillna(method='backfill') is deprecated; bfill() is the equivalent call.
print(df.bfill())
        one       two     three
a -1.130716 -0.851352 -0.471409
b  0.470916 -0.353104  0.471979
c  0.470916 -0.353104  0.471979
d  0.968244 -0.220899  1.246868
e  0.968244 -0.220899  1.246868
f  1.557570  2.105571 -0.657172
g -0.139376 -0.955506  0.197256
h -0.139376 -0.955506  0.197256
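Forward and backward filling will happily copy a value across a long run of missing rows, which may not be what you want. The sketch below, using an assumed sample Series, shows the `limit` parameter of ffill() and bfill(), which caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0], index=list("abcde"))

# Without a limit, the whole gap is filled with the last valid value.
forward_all = s.ffill()

# limit=1 fills only the first NaN after each valid value.
forward_one = s.ffill(limit=1)

# The same cap works in the other direction.
backward_one = s.bfill(limit=1)

print(forward_all)
print(forward_one)
print(backward_one)
```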
Example 1: dropping rows that contain missing values with dropna()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# By default (axis=0), any row with at least one NaN is dropped.
print(df.dropna())
        one       two     three
a  0.102280 -0.134644 -0.065381
c  1.849913 -0.258381 -0.449780
e -1.427968  0.994814  0.067866
f  0.052859  0.782432  0.660821
h  1.404502  0.357647  0.441647
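Example 2: dropping columns that contain missing values with dropna(axis=1)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Every column contains NaN (rows 'b', 'd', 'g'), so all columns are dropped.
print(df.dropna(axis=1))
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]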
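As the empty result above suggests, the default dropna() rule ("drop if anything is missing") can be too aggressive. The following sketch shows three documented parameters that relax it: `how`, `thresh`, and `subset`. The sample data and variable names are assumptions chosen so that each option gives a different result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan],
     "two": [1.0, np.nan, np.nan, 4.0],
     "three": [1.0, np.nan, np.nan, 4.0]},
    index=["a", "b", "c", "d"],
)

# Default: drop a row if any value is missing.
drop_any = df.dropna()

# how='all': drop a row only if every value is missing.
drop_all = df.dropna(how="all")

# thresh=2: keep rows that have at least two non-missing values.
at_least_two = df.dropna(thresh=2)

# subset: only look at column 'one' when deciding what to drop.
by_column = df.dropna(subset=["one"])

for name, result in [("any", drop_any), ("all", drop_all),
                     ("thresh=2", at_least_two), ("subset", by_column)]:
    print(name, list(result.index))
```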
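Example 1: replacing generic values with replace()
import pandas as pd
import numpy as np
df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                   'two': [1000, 0, 30, 40, 50, 60]})
# The dict maps each old value to its replacement, in every column.
print(df.replace({1000: 10, 2000: 60}))
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60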
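replace() also ties back to missing-value handling: datasets often encode "missing" with a sentinel number, and converting that sentinel to NaN lets the methods shown earlier work on it. The sketch below assumes -999 and -1 are such sentinels; both the sentinel values and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

# -999 and -1 are assumed placeholder codes for "no data" in this sample.
df = pd.DataFrame({"one": [10, -999, 30],
                   "two": [-999, 20, -1]})

# Replace the sentinel everywhere; the integer columns become float to hold NaN.
cleaned = df.replace({-999: np.nan})

# A nested dict limits the replacement to one column: {column: {old: new}}.
per_column = df.replace({"two": {-1: np.nan}})

print(cleaned)
print(per_column)
```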