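Real-world data often includes text columns whose values repeat. Features such as gender, country, and codes take the same values over and over. These are examples of categorical data.
Categorical variables can take only a limited, and usually fixed, number of possible values. Besides having a fixed set of values, categorical data may have an order, but numerical operations cannot be performed on it. Categorical is a pandas data type.
The categorical data type is useful in the following cases:
- A string variable that contains only a few distinct values. Converting such a string variable to a categorical variable saves memory.
- A variable whose lexical order differs from its logical order ("one", "two", "three"). After converting to a categorical and specifying an order on the categories, sorting and min/max use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (for example, to choose suitable statistical methods or plot types).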
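The first two cases can be checked directly. The sketch below is illustrative only: the Series contents and the variable names (`raw`, `as_cat`, `levels`) are made up for this example, and the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# A string column with only three distinct values, repeated many times.
raw = pd.Series(["one", "two", "three"] * 10_000)
as_cat = raw.astype("category")

# deep=True includes the memory used by the string data itself.
raw_bytes = raw.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(raw_bytes, cat_bytes)

# Declaring an explicit order makes sorting follow the logical order
# rather than the lexical one ("one", "three", "two").
levels = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
ordered = pd.Series(["three", "one", "two"]).astype(levels)
print(ordered.sort_values().tolist())   # ['one', 'two', 'three']
```

The categorical version stores each distinct string once plus a small integer code per row, which is why it is smaller when values repeat.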
In [2]:
import pandas as pd
In [3]:
s = pd.Series(["a","b","c","a"], dtype="category")
In [6]:
print (s)
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
A categorical object can also be built with the constructor pandas.Categorical(values, categories, ordered). For example:
In [7]:
import pandas as pd
In [8]:
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
In [9]:
print (cat)
['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
Another example:
In [10]:
import pandas as pd
In [11]:
cat = pd.Categorical(['a','b','c','a','b','c','d'], ['c', 'b', 'a'])
In [12]:
print (cat)
['a', 'b', 'c', 'a', 'b', 'c', NaN]
Categories (3, object): ['c', 'b', 'a']
Here, the second argument specifies the categories. Any value that is not present in the categories is therefore treated as NaN.
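One way to see why an unlisted value becomes NaN is to look at the integer codes behind the categorical. This is a minimal sketch using a shortened version of the input above; `codes` is the pandas attribute that holds each value's position in `categories`.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "d"], categories=["c", "b", "a"])

# Each value is stored as its position in `categories`.
# A value outside the categories gets the code -1 and reads back as NaN.
print(cat.codes.tolist())      # [2, 1, 0, -1]
print(pd.isna(cat).tolist())   # [False, False, False, True]
```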
Now look at the following example:
In [13]:
import pandas as pd
In [14]:
cat = pd.Categorical(['a','b','c','a','b','c','d'], ['c', 'b', 'a'], ordered=True)
In [15]:
print (cat)
['a', 'b', 'c', 'a', 'b', 'c', NaN]
Categories (3, object): ['c' < 'b' < 'a']
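Passing ordered=True makes the category order meaningful for min/max and comparisons. The sketch below uses a shortened input without the unlisted value 'd', so that missing values do not get in the way; note that the order is the one given in the categories list (c < b < a), not alphabetical order.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a"],
                     categories=["c", "b", "a"], ordered=True)

# min/max follow the declared order c < b < a.
print(cat.min(), cat.max())     # c a

# Comparison against a scalar that is one of the categories.
print((cat > "b").tolist())     # [True, False, False, True]
```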
In [16]:
import pandas as pd
import numpy as np
In [17]:
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
In [18]:
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})
In [19]:
print (df.describe())
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
In [20]:
print ("=============================")
=============================
In [23]:
print (df["cat"].describe())
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
In [24]:
import pandas as pd
import numpy as np
In [25]:
s = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
In [26]:
print (s.categories)
Index(['b', 'a', 'c'], dtype='object')
The obj.ordered attribute reports whether the object's categories are ordered.
In [27]:
import pandas as pd
import numpy as np
In [28]:
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
In [29]:
print (cat.ordered)
False
The result is False because no order was specified.
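An unordered categorical can be turned into an ordered one afterwards. A minimal sketch using `as_ordered()`, which keeps the existing category order (b < a < c here); the variable name `ordered_cat` is chosen only for this example.

```python
import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])

# as_ordered() returns a new Categorical; the original stays unordered.
ordered_cat = cat.as_ordered()
print(cat.ordered, ordered_cat.ordered)   # False True

# Once ordered, min() is defined; missing values are skipped by default.
print(ordered_cat.min())                  # a
```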
In [30]:
import pandas as pd
In [31]:
s = pd.Series(["a","b","c","a"], dtype="category")
In [32]:
s = s.cat.rename_categories({"a": "Group a",
"b": "Group b",
"c": "Group c"})
In [33]:
print(s.cat.categories)
Index(['Group a', 'Group b', 'Group c'], dtype='object')
In [34]:
import pandas as pd
In [35]:
s = pd.Series(["a","b","c","a"], dtype="category")
In [36]:
s = s.cat.add_categories([4])
In [37]:
print (s.cat.categories)
Index(['a', 'b', 'c', 4], dtype='object')
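A typical reason to append a category is to be able to assign that value later. The sketch below assumes a recent pandas version; the exception type for the rejected assignment has differed between versions, so both TypeError and ValueError are caught. The value "d" and the flag `rejected` are made up for this example.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Assigning a value that is not among the categories is rejected.
try:
    s.iloc[0] = "d"
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)      # True

# After the category has been added, the same assignment succeeds.
s = s.cat.add_categories(["d"])
s.iloc[0] = "d"
print(s.tolist())    # ['d', 'b', 'c', 'a']
```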
In [38]:
import pandas as pd
In [39]:
s = pd.Series(["a","b","c","a"], dtype="category")
In [40]:
print ("Original object:")
Original object:
In [41]:
s
Out[41]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [42]:
print ("After removal:")
After removal:
In [43]:
s.cat.remove_categories("a")
Out[43]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (2, object): ['b', 'c']
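Note that remove_categories returns a new object and leaves the original Series untouched, which is easy to miss when the result is only displayed. A small sketch that compares the two; the name `trimmed` is chosen only for this example.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
trimmed = s.cat.remove_categories("a")

# The original Series still has all three categories and no missing values.
print(list(s.cat.categories))         # ['a', 'b', 'c']
print(int(s.isna().sum()))            # 0

# The returned Series has lost "a"; the entries that held it are now NaN.
print(list(trimmed.cat.categories))   # ['b', 'c']
print(int(trimmed.isna().sum()))      # 2
```

To keep the change, assign the result back, as the earlier cells do with rename_categories and add_categories.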
In [44]:
import pandas as pd
Create ordered categorical data and compare it element-wise:
In [45]:
cat = pd.Series(pd.Categorical([1,2,3], categories=[1,2,3], ordered=True))
cat1 = pd.Series(pd.Categorical([2,2,2], categories=[1,2,3], ordered=True))
In [46]:
print(cat > cat1)
0    False
1    False
2     True
dtype: bool
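The comparison above works because both Series share the same ordered categories. As a hedged illustration, the sketch below shows what happens when the categories differ (pandas raises a TypeError) and that comparing against a scalar taken from the categories is allowed. The Series `other` and the flag `comparable` are made up for this example.

```python
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 3], categories=[1, 2, 3], ordered=True))
other = pd.Series(pd.Categorical([2, 2, 2], categories=[1, 2, 3, 4], ordered=True))

# Categoricals with different categories cannot be compared.
try:
    cat > other
    comparable = True
except TypeError:
    comparable = False
print(comparable)            # False

# Comparing against a scalar that is one of the categories works.
print((cat > 2).tolist())    # [False, False, True]
```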