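The dataset used in this article is a CSV table of soccer players' skill ratings and market values, with more than 60 fields. Dataset download link: dataset

1. DataFrame.info()

This function prints summary information about the table that was read: the index range, each column's non-null count and dtype, and the memory usage. This is very helpful for speeding up data preprocessing. Note that info() prints its report directly and returns None, which is why None appears at the end of the output.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    print(data.info())

Output:

    RangeIndex: 10441 entries, 0 to 10440
    Data columns (total 65 columns):
    id             10441 non-null int64
    club           10441 non-null int64
    league         10441 non-null int64
    birth_date     10441 non-null object
    height_cm      10441 non-null int64
    weight_kg      10441 non-null int64
    nationality    10441 non-null int64
    ...
    dtypes: float64(12), int64(50), object(3)
    memory usage: 5.2+ MB
    None

2. DataFrame.query()

This function filters rows with a boolean expression written as a string. It is equivalent to indexing with a boolean mask.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    print(data.query('lw>cf'))        # these two approaches are equivalent
    print(data[data.lw > data.cf])    # these two approaches are equivalent

3. DataFrame.value_counts()

This function counts how often each distinct value occurs in a column.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    print(data.work_rate_att.value_counts())

Output:

    Medium    7155
    High      2762
    Low        524
    Name: work_rate_att, dtype: int64

4. DataFrame.sort_values()

Sorts the rows by the values of a column and prints the result.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    print(data.sort_values(['sho']).head(5))

5. DataFrame.groupby()

Group the rows by the nationality column, then compute the mean potential for each nationality.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    potential_mean = data['potential'].groupby(data['nationality']).mean().head(5)
    print(potential_mean)

Output:

    nationality
    1    74.945338
    2    72.914286
    3    67.892857
    4    69.000000
    5    70.024242
    Name: potential, dtype: float64

Group by the two columns nationality and club, then compute the mean player potential for each group.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    potential_mean = data['potential'].head(20).groupby([data['nationality'], data['club']]).mean()
    print(potential_mean)

Output:

    nationality  club
    1            148     76
                 461     72
                 583     64
    29           593     68
    43           213     67
    51           258     62
    52           112     68
    54           604     81
    63           415     70
    64           359     74
    78           293     73
    90           221     70
    96           80      72
    101          458     67
    111          365     64
                 379     83
                 584     65
    138          97      72
    155          543     72
    163          188     71
    Name: potential, dtype: int64

It is worth noting that calling size() after groupby returns the number of rows in each group.

    potential_mean = data['potential'].head(200).groupby([data['nationality'], data['club']]).size()

Output:

    nationality  club
    1            148     1
    43           213     1
    51           258     1
    52           112     1
    54           604     1
    78           293     1
    96           80      1
    101          458     1
    155          543     1
    163          188     1
    Name: potential, dtype: int64

6. DataFrame.agg()

This function is generally used after groupby.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    potential_mean = data['potential'].head(10).groupby(data['nationality']).agg(['max', 'min'])
    print(potential_mean)

Output:

                 max  min
    nationality
    1             76   76
    43            67   67
    51            62   62
    52            68   68
    54            81   81
    78            73   73
    96            72   72
    101           67   67
    155           72   72
    163           71   71

7. DataFrame.apply()

Applying a function to a column or a row can greatly speed up processing.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Return the year from a player's birth date
    def birth_date_deal(birth_date):
        year = birth_date.split('/')[2]
        return year

    data = pd.read_csv('dataset/soccer/train.csv')
    result = data['birth_date'].apply(birth_date_deal).head()
    print(result)

Output:

    0    96
    1    84
    2    99
    3    88
    4    80
    Name: birth_date, dtype: object

Of course, the code is more concise with a lambda function:

    data = pd.read_csv('dataset/soccer/train.csv')
    result = data['birth_date'].apply(lambda x: x.split('/')[2]).head()
    print(result)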
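The functions covered in this article can also be tried without downloading the CSV. The following is a minimal sketch on a tiny hand-made table: the column names mirror the article's dataset, but every value is made up for illustration, and the month/day/two-digit-year layout of birth_date is an assumption inferred from the split('/')[2] call.

```python
import pandas as pd

# Hypothetical toy data: column names follow the article's dataset,
# all values are invented for illustration only.
data = pd.DataFrame({
    "nationality": [1, 1, 2, 2, 3],
    "potential": [76, 72, 64, 68, 81],
    "lw": [60, 55, 70, 40, 66],
    "cf": [58, 59, 65, 45, 66],
    "work_rate_att": ["Medium", "High", "Medium", "Low", "Medium"],
    "birth_date": ["2/18/96", "7/5/84", "1/30/99", "3/3/88", "12/1/80"],
})

# query() and boolean-mask indexing select the same rows.
by_query = data.query("lw > cf")
by_mask = data[data.lw > data.cf]

# value_counts() on a column: frequency of each distinct value.
rate_counts = data["work_rate_att"].value_counts()

# groupby() followed by agg(): several statistics per group in one call.
potential_stats = data.groupby("nationality")["potential"].agg(["max", "min", "mean"])

# apply() with a lambda: keep the third '/'-separated field (the year).
birth_year = data["birth_date"].apply(lambda x: x.split("/")[2])

print(by_query.equals(by_mask))
print(rate_counts)
print(potential_stats)
print(birth_year.tolist())
```

Passing a list of names to agg() produces one output column per statistic, which is often easier to read than running groupby several times.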