Advanced Pandas Tutorial: The Categorical Data Type

Date: 2023-03-26 17:25:50 Python

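Introduction

Pandas has a special data type called category. It represents a variable that takes one of a limited set of values and is typically used for statistical classification, such as gender, blood type, class, or grade. It is somewhat like an enum in Java. This article explains in detail how to use it.

The examples below assume that pandas and NumPy have been imported as pd and np.

Creating categories

Creating from a Series

Pass dtype="category" when creating a Series. A categorical has two parts: whether it is ordered, and its literal values (the categories):

    In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [2]: s
    Out[2]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

An existing Series in a DataFrame can be converted to category:

    In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

    In [4]: df["B"] = df["A"].astype("category")

    In [5]: df["B"]
    Out[5]:
    0    a
    1    b
    2    c
    3    a
    Name: B, dtype: category
    Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to the Series constructor. Values that are not in the given categories become NaN:

    In [10]: raw_cat = pd.Categorical(
       ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
       ....: )
       ....:

    In [11]: s = pd.Series(raw_cat)

    In [12]: s
    Out[12]:
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): ['b', 'c', 'd']

Creating from a DataFrame

dtype="category" can also be passed when creating a DataFrame:

    In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

    In [18]: df.dtypes
    Out[18]:
    A    category
    B    category
    dtype: object

Both A and B are now categorical, each with its own categories:

    In [19]: df["A"]
    Out[19]:
    0    a
    1    b
    2    c
    3    a
    Name: A, dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [20]: df["B"]
    Out[20]:
    0    b
    1    c
    2    c
    3    d
    Name: B, dtype: category
    Categories (3, object): ['b', 'c', 'd']

Alternatively, use df.astype("category") to convert every column of an existing DataFrame:

    In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    In [22]: df_cat = df.astype("category")

    In [23]: df_cat.dtypes
    Out[23]:
    A    category
    B    category
    dtype: object

Controlling behavior

When dtype="category" is passed, the categorical is created with two defaults:

1. Categories are inferred from the data.
2. Categories are unordered.

Both defaults can be changed by creating a CategoricalDtype explicitly:

    In [26]: from pandas.api.types import CategoricalDtype

    In [27]: s = pd.Series(["a", "b", "c", "a"])

    In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

    In [29]: s_cat = s.astype(cat_type)

    In [30]: s_cat
    Out[30]:
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): ['b' < 'c' < 'd']

The same CategoricalDtype can be applied to a DataFrame, in which case all columns share one set of categories:

    In [31]: from pandas.api.types import CategoricalDtype

    In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

    In [34]: df_cat = df.astype(cat_type)

    In [35]: df_cat["A"]
    Out[35]:
    0    a
    1    b
    2    c
    3    a
    Name: A, dtype: category
    Categories (4, object): ['a' < 'b' < 'c' < 'd']

    In [36]: df_cat["B"]
    Out[36]:
    0    b
    1    c
    2    c
    3    d
    Name: B, dtype: category
    Categories (4, object): ['a' < 'b' < 'c' < 'd']

Converting back to the original type

Use Series.astype(original_dtype) or np.asarray(categorical) to convert a categorical back to its original type:

    In [39]: s = pd.Series(["a", "b", "c", "a"])

    In [40]: s
    Out[40]:
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [41]: s2 = s.astype("category")

    In [42]: s2
    Out[42]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [43]: s2.astype(str)
    Out[43]:
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [44]: np.asarray(s2)
    Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)

Working with categories

Getting the category attributes

Categorical data has two attributes, categories and ordered, available through s.cat.categories and s.cat.ordered:

    In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [58]: s.cat.categories
    Out[58]: Index(['a', 'b', 'c'], dtype='object')

    In [59]: s.cat.ordered
    Out[59]: False

When the categories are passed explicitly, they keep the order in which they were given:

    In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

    In [61]: s.cat.categories
    Out[61]: Index(['c', 'b', 'a'], dtype='object')

    In [62]: s.cat.ordered
    Out[62]: False

Renaming categories

In older pandas versions, categories could be renamed by assigning a new list to s.cat.categories. That setter was deprecated and then removed in pandas 2.0, so the example below uses rename_categories, which gives the same result:

    In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [68]: s
    Out[68]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [69]: s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])

    In [70]: s
    Out[70]:
    0    Group a
    1    Group b
    2    Group c
    3    Group a
    dtype: category
    Categories (3, object): ['Group a', 'Group b', 'Group c']

rename_categories accepts any list of new values of the same length:

    In [71]: s = s.cat.rename_categories([1, 2, 3])

    In [72]: s
    Out[72]:
    0    1
    1    2
    2    3
    3    1
    dtype: category
    Categories (3, int64): [1, 2, 3]

Or a dict-like object that maps old names to new ones:

    # You can also pass a dict-like object to map the renaming
    In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

    In [74]: s
    Out[74]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (3, object): ['x', 'y', 'z']

Adding categories with add_categories

Use add_categories to append new categories:

    In [77]: s = s.cat.add_categories([4])

    In [78]: s.cat.categories
    Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

    In [79]: s
    Out[79]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (4, object): ['x', 'y', 'z', 4]

Removing categories with remove_categories

    In [80]: s = s.cat.remove_categories([4])

    In [81]: s
    Out[81]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (3, object): ['x', 'y', 'z']

Removing unused categories

    In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

    In [83]: s
    Out[83]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (4, object): ['a', 'b', 'c', 'd']

    In [84]: s.cat.remove_unused_categories()
    Out[84]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (2, object): ['a', 'b']

Resetting categories

Use set_categories() to add and remove categories in one step. Values whose category is dropped become NaN:

    In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

    In [86]: s
    Out[86]:
    0     one
    1     two
    2    four
    3       -
    dtype: category
    Categories (4, object): ['-', 'four', 'one', 'two']

    In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

    In [88]: s
    Out[88]:
    0     one
    1     two
    2    four
    3     NaN
    dtype: category
    Categories (4, object): ['one', 'two', 'three', 'four']

Sorting categories

If a categorical was created with ordered=True, its values have a defined order, so sorting and min/max work on that order:

    In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

    In [92]: s.sort_values(inplace=True)

    In [93]: s
    Out[93]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']

    In [94]: s.min(), s.max()
    Out[94]: ('a', 'c')

Use as_ordered() or as_unordered() to switch a categorical between ordered and unordered:

    In [95]: s.cat.as_ordered()
    Out[95]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']

    In [96]: s.cat.as_unordered()
    Out[96]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

Reordering

Use Categorical.reorder_categories() to change the order of the existing categories. The new list must contain exactly the same categories as the old one:

    In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

    In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

    In [105]: s
    Out[105]:
    0    1
    1    2
    2    3
    3    1
    dtype: category
    Categories (3, int64): [2 < 3 < 1]

Sorting by multiple columns

sort_values supports sorting by several columns. A categorical column is sorted by its category order, here e < a < b:

    In [109]: dfs = pd.DataFrame(
       .....:     {
       .....:         "A": pd.Categorical(
       .....:             list("bbeebbaa"),
       .....:             categories=["e", "a", "b"],
       .....:             ordered=True,
       .....:         ),
       .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
       .....:     }
       .....: )
       .....:

    In [110]: dfs.sort_values(by=["A", "B"])
    Out[110]:
       A  B
    2  e  1
    3  e  2
    7  a  1
    6  a  2
    0  b  1
    5  b  1
    1  b  2
    4  b  2

Comparison operations

If ordered=True was set at creation, categoricals support the comparison operators ==, !=, >, >=, < and <=. (== and != also work on unordered categoricals.)

    In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

    In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

    In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

    In [119]: cat > cat_base
    Out[119]:
    0     True
    1    False
    2    False
    dtype: bool

    In [120]: cat > 2
    Out[120]:
    0     True
    1    False
    2    False
    dtype: bool

Because the order here is 3 < 2 < 1, only the value 1 is greater than 2. Two categoricals can only be compared with > or < if they have the same categories in the same order, so cat > cat_base2 raises a TypeError: the categories of cat_base2 are inferred from its data and differ from those of cat.

Other operations

A categorical column is still a Series, so most Series operations can be used, for example Series.min(), Series.max() and Series.mode().

value_counts reports every category, including those with no values:

    In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

    In [132]: s.value_counts()
    Out[132]:
    c    2
    a    1
    b    1
    d    0
    dtype: int64

DataFrame.sum(). Note that the level argument used here was deprecated in pandas 1.3 and removed in pandas 2.0, so this call only runs on older versions; on newer versions, group by the column level instead.

    In [133]: columns = pd.Categorical(
       .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
       .....: )
       .....:

    In [134]: df = pd.DataFrame(
       .....:     data=[[1, 2, 3], [4, 5, 6]],
       .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
       .....: )
       .....:

    In [135]: df.sum(axis=1, level=1)
    Out[135]:
       One  Two  Three
    0    3    3      0
    1    9    6      0

Groupby:

    In [136]: cats = pd.Categorical(
       .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
       .....: )
       .....:

    In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

    In [138]: df.groupby("cats").mean()
    Out[138]:
          values
    cats
    a        1.0
    b        2.0
    c        4.0
    d        NaN

    In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

    In [140]: df2 = pd.DataFrame(
       .....:     {
       .....:         "cats": cats2,
       .....:         "B": ["c", "d", "c", "d"],
       .....:         "values": [1, 2, 3, 4],
       .....:     }
       .....: )
       .....:

    In [141]: df2.groupby(["cats", "B"]).mean()
    Out[141]:
            values
    cats B
    a    c     1.0
         d     2.0
    b    c     3.0
         d     4.0
    c    c     NaN
         d     NaN

Pivot tables:

    In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

    In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

    In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
    Out[144]:
         values
    A B
    a c       1
      d       2
    b c       3
      d       4

This article is archived at http://www.flydean.com/08-python-pandas-category/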
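As a compact recap of the main operations covered in this article, here is a minimal sketch that creates an ordered categorical, sorts and compares by category order, and renames categories. The T-shirt size data and the variable names are hypothetical and chosen only for illustration; the calls used (CategoricalDtype, astype, sort_values, rename_categories) are the same ones shown above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data: sizes whose logical order differs from alphabetical order.
sizes = pd.Series(["M", "S", "XL", "M", "L"])

# Declare the categories and their order explicitly instead of inferring them.
size_type = CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True)
sizes_cat = sizes.astype(size_type)

# Sorting follows the declared category order, not string order.
sorted_sizes = sizes_cat.sort_values().tolist()

# Comparing against a scalar category uses the same order.
larger_than_m = (sizes_cat > "M").tolist()

# rename_categories returns a new Series with the categories relabelled.
renamed = sizes_cat.cat.rename_categories(
    {"S": "small", "M": "medium", "L": "large", "XL": "x-large"}
)

# Values outside the declared categories become NaN on conversion.
with_unknown = pd.Series(["S", "XXL"]).astype(size_type)

print(sorted_sizes)                    # ['S', 'M', 'M', 'L', 'XL']
print(larger_than_m)                   # [False, False, True, False, True]
print(list(renamed.cat.categories))    # ['small', 'medium', 'large', 'x-large']
print(with_unknown.isna().tolist())    # [False, True]
```

Declaring the dtype once and reusing it keeps the category order identical across Series, which is what makes comparisons between them valid.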
