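Data analysis relies heavily on statistical methods. This article covers the statistical methods that pandas provides.

Percent change

Both Series and DataFrame have a pct_change() method that computes the percent change between an element and an earlier one. Handle NaN values (for example with ffill()) before calling it, because the way pct_change() treats missing values has changed across pandas versions.

import numpy as np
import pandas as pd

ser = pd.Series(np.random.randn(8))

ser.pct_change()
Out[45]:
0         NaN
1   -1.264716
2    4.125006
3   -1.159092
4   -0.091292
5    4.837752
6   -1.182146
7   -8.721482
dtype: float64

ser
Out[46]:
0   -0.950515
1    0.251617
2    1.289537
3   -0.205155
4   -0.186426
5   -1.088310
6    0.198231
7   -1.530635
dtype: float64

pct_change() also takes a periods argument that sets how many elements back the comparison value is taken from, so the first periods rows of the result are NaN:

In [3]: df = pd.DataFrame(np.random.randn(10, 4))

In [4]: df.pct_change(periods=3)
Out[4]:
          0         1         2         3
0       NaN       NaN       NaN       NaN
1       NaN       NaN       NaN       NaN
2       NaN       NaN       NaN       NaN
3 -0.218320 -1.054001  1.987147 -0.510183
4     [...] -1.816454  0.649715 -4.822809
5 -0.127833 -3.042065 -5.866604 -1.776977
6 -2.596833 -1.959538 -2.111697 -3.798900
7 -0.117826 -2.169058  0.036094 -0.067696
8  2.492606 -1.357320 -1.205802 -1.558697
9 -1.012977  2.324558 -1.003744 -0.371806

Covariance

Series.cov() computes the covariance between two Series. Positions where either value is NaN are excluded.

In [5]: s1 = pd.Series(np.random.randn(1000))

In [6]: s2 = pd.Series(np.random.randn(1000))

In [7]: s1.cov(s2)
Out[7]: 0.0006801088174310875

Similarly, DataFrame.cov() computes the pairwise covariances between the columns of a DataFrame, again excluding NaN data.

In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [9]: frame.cov()
Out[9]:
          a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.000857
c -0.002698  0.000191  0.950735 -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.000857 -0.005087 -0.047952  1.042487

DataFrame.cov() has a min_periods parameter that sets the minimum number of observations each pair of columns must share. A pair with fewer observations yields NaN instead of an estimate based on too little data.

In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [11]: frame.loc[frame.index[:5], "a"] = np.nan

In [12]: frame.loc[frame.index[5:10], "b"] = np.nan

In [13]: frame.cov()
Out[13]:
          a         b         c
a  1.123670 -0.412851  0.018169
b -0.412851  1.154141  0.305260
c  0.018169  0.305260  1.301149

In [14]: frame.cov(min_periods=12)
Out[14]:
          a         b         c
a  1.123670       NaN  0.018169
b       NaN  1.154141  0.305260
c  0.018169  0.305260  1.301149

Column a is missing in the first five rows and column b in the next five, so the two columns share only 10 complete rows. That is below min_periods=12, which is why the a/b entries become NaN.

Correlation

The corr() method computes correlation coefficients. Three methods are available:

Method name          Description
pearson (default)    Standard correlation coefficient
kendall              Kendall Tau correlation coefficient
spearman             Spearman rank correlation coefficient

In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [16]: frame.iloc[::2] = np.nan

# Series with Series
In [17]: frame["a"].corr(frame["b"])
Out[17]: 0.013479040400098775

In [18]: frame["a"].corr(frame["b"], method="spearman")
Out[18]: -0.007289885159540637

# Pairwise correlation of DataFrame columns
In [19]: frame.corr()
Out[19]:
          a         b         c         d         e
a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
b  0.013479  1.000000 -0.020433 -0.011139  0.005654
c -0.049269 -0.020433  1.000000  0.018587 -0.054269
d -0.042239 -0.011139  0.018587  1.000000 -0.017060
e -0.028525  0.005654 -0.054269 -0.017060  1.000000

corr() also supports min_periods:

In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [21]: frame.loc[frame.index[:5], "a"] = np.nan

In [22]: frame.loc[frame.index[5:10], "b"] = np.nan

In [23]: frame.corr()
Out[23]:
          a         b         c
a  1.000000 -0.121111  0.069544
b -0.121111  1.000000  0.051742
c  0.069544  0.051742  1.000000

In [24]: frame.corr(min_periods=12)
Out[24]:
          a         b         c
a  1.000000       NaN  0.069544
b       NaN  1.000000  0.051742
c  0.069544  0.051742  1.000000

corrwith() computes correlations between two different DataFrames, matching rows or columns that carry the same label. Labels present in only one of the two objects produce NaN.

In [27]: index = ["a", "b", "c", "d", "e"]

In [28]: columns = ["one", "two", "three", "four"]

In [29]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)

In [30]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)

In [31]: df1.corrwith(df2)
Out[31]:
one     -0.125501
two     -0.493244
three    0.344056
four     0.004183
dtype: float64

In [32]: df2.corrwith(df1, axis=1)
Out[32]:
a   -0.675817
b    0.458296
c    0.190809
d   -0.186475
e         NaN
dtype: float64

rank

The rank() method ranks the data in a Series. What is a rank? Here is an example:

s = pd.Series(np.random.randn(5), index=list("abcde"))

s
Out[51]:
a    0.336259
b    1.073116
c   -0.402291
d    0.624186
e   -0.422478
dtype: float64

s["d"] = s["b"]  # so there's a tie

s
Out[53]:
a    0.336259
b    1.073116
c   -0.402291
d    1.073116
e   -0.422478
dtype: float64

s.rank()
Out[54]:
a    3.0
b    4.5
c    2.0
d    4.5
e    1.0
dtype: float64

Sorted from smallest to largest, the values in this Series are -0.422478 < -0.402291 < 0.336259 < 1.073116 = 1.073116, which correspond to ranks 1, 2, 3, 4, 5. Because the last two values are equal, they receive the average of their ranks by default: (4 + 5) / 2 = 4.5.

The next example stores four variants of the ranking in columns named default_rank, max_rank, NA_bottom and pct_rank. default_rank uses the default averaging of ties. max_rank uses method='max', so tied values all take the highest rank in their group (5 in the Series above). NA_bottom uses na_option='bottom', so NaN values are ranked too and placed last with the largest rank. pct_rank uses pct=True, which expresses the ranks as percentiles.

>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN

>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN

rank() can also work along either axis: axis=0 ranks the values within each column, and axis=1 ranks the values within each row.

In [36]: df = pd.DataFrame(np.random.randn(10, 6))

In [37]: df[4] = df[2][:5]  # some ties

In [38]: df
Out[38]:
          0         1         2         3         4         5
0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
3  0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347
6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
9 -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064

In [39]: df.rank(axis=1)
Out[39]:
     0    1    2    3    4    5
0  4.0  3.0  1.5  5.0  1.5  6.0
1  2.0  6.0  4.5  1.0  4.5  3.0
2  1.0  6.0  3.5  5.0  3.5  2.0
3  4.0  5.0  1.5  3.0  1.5  6.0
4  5.0  3.0  1.5  4.0  1.5  6.0
5  1.0  2.0  5.0  3.0  NaN  4.0
6  4.0  5.0  3.0  1.0  NaN  2.0
7  2.0  5.0  3.0  4.0  NaN  1.0
8  2.0  5.0  3.0  4.0  NaN  1.0
9  2.0  3.0  1.0  4.0  NaN  5.0

This article has been included in http://wwww.flydewdews/10-python-pandas-statistical/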
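The rank options described in the rank section are easier to follow on values small enough to check by hand. The following is a minimal sketch, not part of the original article: the Series `scores` and the DataFrame `grid` are hypothetical illustration data, and only `rank()` with its `method`, `na_option`, `pct` and `axis` parameters comes from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not from the article), chosen so every rank can be checked by hand.
scores = pd.Series([10, 30, 20, 30, np.nan], index=list("abcde"))

rank_avg = scores.rank()                    # ties b and d share (3 + 4) / 2 = 3.5
rank_max = scores.rank(method="max")        # ties b and d both take rank 4
rank_na = scores.rank(na_option="bottom")   # the NaN at e gets the last rank, 5
rank_pct = scores.rank(pct=True)            # average rank divided by the 4 valid values

summary = pd.DataFrame(
    {
        "score": scores,
        "average": rank_avg,
        "max": rank_max,
        "na_bottom": rank_na,
        "pct": rank_pct,
    }
)
print(summary)

# A second hypothetical frame to contrast the two axes.
grid = pd.DataFrame({"x": [3, 1], "y": [1, 2], "z": [2, 2]})
by_column = grid.rank(axis=0)   # rank the values within each column
by_row = grid.rank(axis=1)      # rank the values within each row
print(by_column)
print(by_row)
```

With axis=1, the second row of `grid` ranks to 1.0, 2.5, 2.5 because y and z tie. With axis=0, the two equal values in column z both get 1.5.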
