Introduction

Pandas has a special data type called category. It represents a categorical variable and is typically used for statistical classifications such as gender, blood type, class, or grade. It is somewhat like an enum in Java. This article explains in detail how to use category.

The sessions below assume that pandas and NumPy have been imported as pd and np.

Creating a category

Creating from a Series

Pass dtype="category" when creating the Series. A category has two parts: whether it is ordered, and the literal values (the categories):

    In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [2]: s
    Out[2]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

A Series inside a DataFrame can be converted to category:

    In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

    In [4]: df["B"] = df["A"].astype("category")

    In [5]: df["B"]
    Out[5]:
    0    a
    1    b
    2    c
    3    a
    Name: B, dtype: category
    Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to Series. Values that are not listed in categories become NaN:

    In [10]: raw_cat = pd.Categorical(
       ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
       ....: )
       ....:

    In [11]: s = pd.Series(raw_cat)

    In [12]: s
    Out[12]:
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): ['b', 'c', 'd']

Creating from a DataFrame

dtype="category" can also be passed when creating a DataFrame:

    In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

    In [18]: df.dtypes
    Out[18]:
    A    category
    B    category
    dtype: object

Both A and B in the DataFrame are categories:

    In [19]: df["A"]
    Out[19]:
    0    a
    1    b
    2    c
    3    a
    Name: A, dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [20]: df["B"]
    Out[20]:
    0    b
    1    c
    2    c
    3    d
    Name: B, dtype: category
    Categories (3, object): ['b', 'c', 'd']

Alternatively, use df.astype("category") to convert every Series in the DataFrame to category:

    In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    In [22]: df_cat = df.astype("category")

    In [23]: df_cat.dtypes
    Out[23]:
    A    category
    B    category
    dtype: object

Controlling creation

By default, passing dtype='category' creates a category with two defaults:

1. The categories are inferred from the data.
2. The categories are unordered.

To change both defaults, create a CategoricalDtype explicitly:

    In [26]: from pandas.api.types import CategoricalDtype

    In [27]: s = pd.Series(["a", "b", "c", "a"])

    In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

    In [29]: s_cat = s.astype(cat_type)

    In [30]: s_cat
    Out[30]:
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): ['b' < 'c' < 'd']

The same CategoricalDtype can be applied to a DataFrame:

    In [31]: from pandas.api.types import CategoricalDtype

    In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

    In [34]: df_cat = df.astype(cat_type)

    In [35]: df_cat["A"]
    Out[35]:
    0    a
    1    b
    2    c
    3    a
    Name: A, dtype: category
    Categories (4, object): ['a' < 'b' < 'c' < 'd']

    In [36]: df_cat["B"]
    Out[36]:
    0    b
    1    c
    2    c
    3    d
    Name: B, dtype: category
    Categories (4, object): ['a' < 'b' < 'c' < 'd']

Converting back to the original type

Use Series.astype(original_dtype) or np.asarray(categorical) to convert a category back to its original type:

    In [39]: s = pd.Series(["a", "b", "c", "a"])

    In [40]: s
    Out[40]:
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [41]: s2 = s.astype("category")

    In [42]: s2
    Out[42]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [43]: s2.astype(str)
    Out[43]:
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [44]: np.asarray(s2)
    Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)

Working with categories

Getting the category properties

Categorical data has two properties, categories and ordered. They are available through s.cat.categories and s.cat.ordered:

    In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [58]: s.cat.categories
    Out[58]: Index(['a', 'b', 'c'], dtype='object')

    In [59]: s.cat.ordered
    Out[59]: False

Passing the categories in a different order changes the category order:

    In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

    In [61]: s.cat.categories
    Out[61]: Index(['c', 'b', 'a'], dtype='object')

    In [62]: s.cat.ordered
    Out[62]: False

Renaming categories

Categories can be renamed by assigning to s.cat.categories. Note that newer pandas releases (2.0 and later) no longer allow this direct assignment; rename_categories, shown afterwards, is the portable way:

    In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [68]: s
    Out[68]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

    In [70]: s
    Out[70]:
    0    Group a
    1    Group b
    2    Group c
    3    Group a
    dtype: category
    Categories (3, object): ['Group a', 'Group b', 'Group c']

rename_categories achieves the same effect:

    In [71]: s = s.cat.rename_categories([1, 2, 3])

    In [72]: s
    Out[72]:
    0    1
    1    2
    2    3
    3    1
    dtype: category
    Categories (3, int64): [1, 2, 3]

Or pass a dict-like object:

    # You can also pass a dict-like object to map the renaming
    In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

    In [74]: s
    Out[74]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (3, object): ['x', 'y', 'z']

Adding categories with add_categories

Use add_categories to add categories:

    In [77]: s = s.cat.add_categories([4])

    In [78]: s.cat.categories
    Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

    In [79]: s
    Out[79]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (4, object): ['x', 'y', 'z', 4]

Removing categories with remove_categories

    In [80]: s = s.cat.remove_categories([4])

    In [81]: s
    Out[81]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (3, object): ['x', 'y', 'z']

Removing unused categories

    In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

    In [83]: s
    Out[83]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (4, object): ['a', 'b', 'c', 'd']

    In [84]: s.cat.remove_unused_categories()
    Out[84]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (2, object): ['a', 'b']

Resetting categories

Use set_categories() to add and remove categories in one step. Values whose category is removed become NaN:

    In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

    In [86]: s
    Out[86]:
    0     one
    1     two
    2    four
    3       -
    dtype: category
    Categories (4, object): ['-', 'four', 'one', 'two']

    In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

    In [88]: s
    Out[88]:
    0     one
    1     two
    2    four
    3     NaN
    dtype: category
    Categories (4, object): ['one', 'two', 'three', 'four']

Sorting categories

If a category was created with ordered=True, the order of its categories is meaningful. sort_values then sorts by that order, and order-dependent operations such as min() and max() become available:

    In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

    In [92]: s.sort_values(inplace=True)

    In [93]: s
    Out[93]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']

    In [94]: s.min(), s.max()
    Out[94]: ('a', 'c')

Use as_ordered() or as_unordered() to mark a category as ordered or unordered:

    In [95]: s.cat.as_ordered()
    Out[95]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']

    In [96]: s.cat.as_unordered()
    Out[96]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

Reordering

Use Categorical.reorder_categories() to reorder the existing categories:

    In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

    In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

    In [105]: s
    Out[105]:
    0    1
    1    2
    2    3
    3    1
    dtype: category
    Categories (3, int64): [2 < 3 < 1]

Sorting by multiple columns

sort_values supports sorting by multiple columns. The categorical column A is sorted by its category order (e, a, b), not alphabetically:

    In [109]: dfs = pd.DataFrame(
       .....:     {
       .....:         "A": pd.Categorical(
       .....:             list("bbeebbaa"),
       .....:             categories=["e", "a", "b"],
       .....:             ordered=True,
       .....:         ),
       .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
       .....:     }
       .....: )
       .....:

    In [110]: dfs.sort_values(by=["A", "B"])
    Out[110]:
       A  B
    2  e  1
    3  e  2
    7  a  1
    6  a  2
    0  b  1
    5  b  1
    1  b  2
    4  b  2

Comparison operations

Equality comparisons (== and !=) work on any categorical data. If ordered=True was set at creation, the operators >, >=, <, and <= are supported as well:

    In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

    In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

    In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

    In [119]: cat > cat_base
    Out[119]:
    0     True
    1    False
    2    False
    dtype: bool

    In [120]: cat > 2
    Out[120]:
    0     True
    1    False
    2    False
    dtype: bool

Other operations

Categorical data is essentially still a Series, so most Series operations work on it, for example Series.min(), Series.max(), and Series.mode().

value_counts:

    In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

    In [132]: s.value_counts()
    Out[132]:
    c    2
    a    1
    b    1
    d    0
    dtype: int64

DataFrame.sum() (the level argument used here was deprecated in pandas 1.3 and removed in 2.0, so this call needs an older pandas):

    In [133]: columns = pd.Categorical(
       .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
       .....: )
       .....:

    In [134]: df = pd.DataFrame(
       .....:     data=[[1, 2, 3], [4, 5, 6]],
       .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
       .....: )
       .....:

    In [135]: df.sum(axis=1, level=1)
    Out[135]:
       One  Two  Three
    0    3    3      0
    1    9    6      0

Groupby:

    In [136]: cats = pd.Categorical(
       .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
       .....: )
       .....:

    In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

    In [138]: df.groupby("cats").mean()
    Out[138]:
          values
    cats
    a        1.0
    b        2.0
    c        4.0
    d        NaN

    In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

    In [140]: df2 = pd.DataFrame(
       .....:     {
       .....:         "cats": cats2,
       .....:         "B": ["c", "d", "c", "d"],
       .....:         "values": [1, 2, 3, 4],
       .....:     }
       .....: )
       .....:

    In [141]: df2.groupby(["cats", "B"]).mean()
    Out[141]:
            values
    cats B
    a    c     1.0
         d     2.0
    b    c     3.0
         d     4.0
    c    c     NaN
         d     NaN

Pivot table:

    In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

    In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

    In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
    Out[144]:
         values
    A B
    a c       1
      d       2
    b c       3
      d       4

This article is also available at http://www.flydean.com/08-python-pandas-category/
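As a recap of the creation and conversion steps described in this article, here is a minimal standalone sketch. It compares the default dtype="category" behaviour with an explicit CategoricalDtype and then converts back with np.asarray. The .cat.codes accessor is not covered in the article; it is used here only to make the NaN entries visible as -1. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

raw = pd.Series(["a", "b", "c", "a"])

# Default behaviour: categories are inferred from the data and are unordered.
inferred = raw.astype("category")

# Explicit dtype: fixed categories plus an order.
# "a" is not among the listed categories, so those rows become NaN.
explicit = raw.astype(CategoricalDtype(categories=["b", "c", "d"], ordered=True))

# Integer code of each row within the category list; -1 marks a missing value.
# (.cat.codes is an addition, not something the article discusses.)
codes = explicit.cat.codes.tolist()

# Convert the categorical back to plain values.
restored = np.asarray(inferred).tolist()

print(list(inferred.cat.categories), inferred.cat.ordered)
print(explicit.isna().tolist())
print(codes)
print(restored)
```

The explicit dtype is the better choice when the set of allowed values is known up front, because values outside that set surface as NaN instead of silently becoming new categories.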
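The sorting, comparison, renaming, and resetting operations from this article can be combined in one small sketch. The size labels are hypothetical example data, not taken from the article, and rename_categories is called with a function, which is one of the input forms it accepts alongside lists and dict-like objects.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical example data: sizes with a meaningful order.
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large", "small"]).astype(size_type)

# Sorting follows the category order, not alphabetical order.
sorted_values = sizes.sort_values().tolist()

# min/max and ordering comparisons rely on ordered=True.
smallest, largest = sizes.min(), sizes.max()
bigger_than_small = (sizes > "small").tolist()

# Rename every category through a function instead of assigning to .cat.categories.
renamed = sizes.cat.rename_categories(lambda name: name.upper())

# Reset the categories: "large" is dropped (its rows become NaN), "xlarge" is added.
reset = sizes.cat.set_categories(["small", "medium", "xlarge"])

print(sorted_values)
print(smallest, largest, bigger_than_small)
print(list(renamed.cat.categories))
print(reset.isna().tolist())
```

Using rename_categories rather than direct assignment keeps the sketch independent of whether the installed pandas still accepts assignment to s.cat.categories.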
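The value_counts and groupby results in this article include the unused category d. The sketch below reproduces that with the same data and makes the behaviour explicit through the observed parameter of groupby, which the article does not mention. Passing it explicitly is an assumption on my part about what is safest, since the default should not be taken for granted across pandas versions.

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# value_counts on categorical data also reports categories that never occur.
counts = df["cats"].value_counts()

# observed=False keeps every declared category as a group; "d" has no rows, so its mean is NaN.
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True keeps only the categories that actually appear in the data.
seen_groups = df.groupby("cats", observed=True)["values"].mean()

print(counts)
print(all_groups)
print(seen_groups)
```

Keeping unobserved categories is useful for reports that must always show the same rows; dropping them avoids large, mostly empty results when grouping by several categorical columns.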
