Analyzing data with pandas

Importing dependencies

This article mainly uses pandas and matplotlib, so we first apply the following common settings:

    from numpy.random import randn
    import numpy as np
    np.random.seed(123)
    import os
    import matplotlib.pyplot as plt
    import pandas as pd
    plt.rc('figure', figsize=(10, 6))
    np.set_printoptions(precision=4)
    pd.options.display.max_rows = 20

Reading and analyzing the data

pandas provides the read_csv method, which reads a CSV file and converts it into a DataFrame:

    path = '../data/titanic.csv'
    df = pd.read_csv(path)
    df

Let's look at the data that was read:

         PassengerId  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
    0    892          3       Kelly, Mr. James                                 male    34.5  0      0      330911              7.8292    NaN    Q
    1    893          3       Wilkes, Mrs. James (Ellen Needs)                 female  47.0  1      0      363272              7.0000    NaN    S
    2    894          2       Myles, Mr. Thomas Francis                        male    62.0  0      0      240276              9.6875    NaN    Q
    3    895          3       Wirz, Mr. Albert                                 male    27.0  0      0      315154              8.6625    NaN    S
    4    896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)     female  22.0  1      1      3101298             12.2875   NaN    S
    5    897          3       Svensson, Mr. Johan Cervin                       male    14.0  0      0      7538                9.2250    NaN    S
    6    898          3       Connolly, Miss. Kate                             female  30.0  0      0      330972              7.6292    NaN    Q
    7    899          2       Caldwell, Mr. Albert Francis                     male    26.0  1      1      248738              29.0000   NaN    S
    8    900          3       Abrahim, Mrs. Joseph (Sophie Halaut Easu)        female  18.0  0      0      2657                7.2292    NaN    C
    9    901          3       Davies, Mr. John Samuel                          male    21.0  2      0      A/4 48871           24.1500   NaN    S
    ...
    408  1300         3       Riordan, Miss. Johanna "Hannah"                  female  NaN   0      0      334915              7.7208    NaN    Q
    409  1301         3       Peacock, Miss. Treateall                         female  3.0   1      1      SOTON/O.Q. 3101315  13.7750   NaN    S
    410  1302         3       Naughton, Miss. Hannah                           female  NaN   0      0      365237              7.7500    NaN    Q
    411  1303         1       Minahan, Mrs. William Edward (Lillian E Thorpe)  female  37.0  1      0      19928               90.0000   C78    Q
    412  1304         3       Henriksson, Miss. Jenny Lovisa                   female  28.0  0      0      347086              7.7750    NaN    S
    413  1305         3       Spector, Mr. Woolf                               male    NaN   0      0      A.5. 3236           8.0500    NaN    S
    414  1306         1       Oliva y Ocana, Dona. Fermina                     female  39.0  0      0      PC 17758            108.9000  C105   C
    415  1307         3       Saether, Mr. Simon Sivertsen                     male    38.5  0      0      SOTON/O.Q. 3101262  7.2500    NaN    S
    416  1308         3       Ware, Mr. Frederick                              male    NaN   0      0      359309              8.0500    NaN    S
    417  1309         3       Peter, Master. Michael J                         male    NaN   1      1      2668                22.3583   NaN    C

    418 rows × 11 columns

Calling the describe method of df shows basic statistics for the numeric columns:

           PassengerId      Pclass         Age       SibSp       Parch        Fare
    count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
    mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
    std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
    min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
    25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
    50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
    75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
    max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200

Note that Age has only 332 non-null values and Fare has 417, so both columns contain missing data.

To look at the port where each passenger embarked, select the Embarked column:

    df['Embarked'][:10]

    0    Q
    1    S
    2    Q
    3    S
    4    S
    5    S
    6    Q
    7    S
    8    C
    9    S
    Name: Embarked, dtype: object

Use value_counts to count the occurrences of each value:

    embark_counts = df['Embarked'].value_counts()
    embark_counts[:10]

    S    270
    C    102
    Q     46
    Name: Embarked, dtype: int64

The result shows that 270 passengers embarked at port S, 102 at port C, and 46 at port Q.

In the same way, we can count the ages:

    age_counts = df['Age'].value_counts()
    age_counts.head(10)

The ten most common ages are:

    24.0    17
    21.0    17
    22.0    16
    30.0    15
    18.0    13
    27.0    12
    26.0    12
    25.0    11
    23.0    11
    29.0    10
    Name: Age, dtype: int64

Compute the mean age:

    df['Age'].mean()

    30.272590361445783

Some rows have no age at all. One option is to fill the missing values with the mean:

    clean_age1 = df['Age'].fillna(df['Age'].mean())
    clean_age1.value_counts()

The mean, 30.27, now appears 86 times, which is exactly the number of rows that had no age (418 - 332):

    30.27259    86
    24.00000    17
    21.00000    17
    22.00000    16
    30.00000    15
    18.00000    13
    26.00000    12
    27.00000    12
    25.00000    11
    23.00000    11
                ..
    36.50000     1
    40.50000     1
    11.50000     1
    34.00000     1
    15.00000     1
    7.00000      1
    60.50000     1
    26.50000     1
    76.00000     1
    34.50000     1
    Name: Age, Length: 80, dtype: int64

Using the mean as a stand-in for the age may not be a good idea, because it distorts the distribution. Another option is to discard the missing values:

    clean_age2 = df['Age'].dropna()
    clean_age2
    age_counts = clean_age2.value_counts()
    ageset = age_counts.head(10)
    ageset

    24.0    17
    21.0    17
    22.0    16
    30.0    15
    18.0    13
    27.0    12
    26.0    12
    25.0    11
    23.0    11
    29.0    10
    Name: Age, dtype: int64

Graphical representation and matrix conversion

Charts are very helpful for data analysis. We use a bar chart to show the ten most common ages obtained above:

    import seaborn as sns
    sns.barplot(x=ageset.index, y=ageset.values)

Next, let's do a more involved reshaping of the data. First, keep only the rows in which both age and sex are present:

    cframe = df[df.Age.notnull() & df.Sex.notnull()]
    cframe

         PassengerId  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
    0    892          3       Kelly, Mr. James                                 male    34.5  0      0      330911              7.8292    NaN    Q
    1    893          3       Wilkes, Mrs. James (Ellen Needs)                 female  47.0  1      0      363272              7.0000    NaN    S
    2    894          2       Myles, Mr. Thomas Francis                        male    62.0  0      0      240276              9.6875    NaN    Q
    3    895          3       Wirz, Mr. Albert                                 male    27.0  0      0      315154              8.6625    NaN    S
    4    896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)     female  22.0  1      1      3101298             12.2875   NaN    S
    5    897          3       Svensson, Mr. Johan Cervin                       male    14.0  0      0      7538                9.2250    NaN    S
    6    898          3       Connolly, Miss. Kate                             female  30.0  0      0      330972              7.6292    NaN    Q
    7    899          2       Caldwell, Mr. Albert Francis                     male    26.0  1      1      248738              29.0000   NaN    S
    8    900          3       Abrahim, Mrs. Joseph (Sophie Halaut Easu)        female  18.0  0      0      2657                7.2292    NaN    C
    9    901          3       Davies, Mr. John Samuel                          male    21.0  2      0      A/4 48871           24.1500   NaN    S
    ...
    403  1295         1       Carrau, Mr. Jose Pedro                           male    17.0  0      0      113059              47.1000   NaN    S
    404  1296         1       Frauenthal, Mr. Isaac Gerald                     male    43.0  1      0      17765               27.7208   D40    C
    405  1297         2       Nourney, Mr. Alfred ("Baron von Drachstedt")     male    20.0  0      0      SC/PARIS 2166       13.8625   D38    C
    406  1298         2       Ware, Mr. William Jeffery                        male    23.0  1      0      28666               10.5000   NaN    S
    407  1299         1       Widener, Mr. George Dunton                       male    50.0  1      1      113503              211.5000  C80    C
    409  1301         3       Peacock, Miss. Treateall                         female  3.0   1      1      SOTON/O.Q. 3101315  13.7750   NaN    S
    411  1303         1       Minahan, Mrs. William Edward (Lillian E Thorpe)  female  37.0  1      0      19928               90.0000   C78    Q
    412  1304         3       Henriksson, Miss. Jenny Lovisa                   female  28.0  0      0      347086              7.7750    NaN    S
    414  1306         1       Oliva y Ocana, Dona. Fermina                     female  39.0  0      0      PC 17758            108.9000  C105   C
    415  1307         3       Saether, Mr. Simon Sivertsen                     male    38.5  0      0      SOTON/O.Q. 3101262  7.2500    NaN    S

    332 rows × 11 columns

Next, use groupby to group the rows by age and sex, and count the size of each group:

    by_sex_age = cframe.groupby(['Age', 'Sex'])
    by_sex_age.size()

    Age    Sex
    0.17   female    1
    0.33   male      1
    0.75   male      1
    0.83   male      1
    0.92   female    1
    1.00   female    3
    2.00   female    1
           male      1
    3.00   female    1
                    ..
    60.00  female    3
    60.50  male      1
    61.00  male      2
    62.00  male      1
    63.00  female    1
           male      1
    64.00  female    2
           male      1
    67.00  male      1
    76.00  female    1
    Length: 115, dtype: int64

Use unstack to pivot the Sex level of the index into columns. Age and sex combinations that do not occur show up as 0.0:

    Sex   female  male
    Age
    0.17     1.0   0.0
    0.33     0.0   1.0
    0.75     0.0   1.0
    0.83     0.0   1.0
    0.92     1.0   0.0
    1.00     3.0   0.0
    2.00     1.0   1.0
    3.00     1.0   0.0
    5.00     0.0   1.0
    6.00     0.0   3.0
    ...

Selecting the ten most common ages and stacking the result back gives one count per age and sex:

    Age   Sex
    29.0  female     5.0
          male       5.0
    25.0  female     1.0
          male      10.0
    23.0  female     5.0
          male       6.0
    26.0  female     4.0
          male       8.0
    27.0  female     4.0
          male       8.0
    18.0  female     7.0
          male       6.0
    30.0  female     6.0
          male       9.0
    22.0  female    10.0
          male       6.0
    21.0  female     3.0
          male      14.0
    24.0  female     5.0
          male      12.0

Calling reset_index turns the index levels into ordinary columns:

    stack_subset = stack_subset.reset_index()
    stack_subset

         Age   Sex     total
    0    29.0  female    5.0
    1    29.0  male      5.0
    2    25.0  female    1.0
    3    25.0  male     10.0
    4    23.0  female    5.0
    5    23.0  male      6.0
    6    26.0  female    4.0
    7    26.0  male      8.0
    8    27.0  female    4.0
    9    27.0  male      8.0
    10   18.0  female    7.0
    11   18.0  male      6.0
    12   30.0  female    6.0
    13   30.0  male      9.0
    14   22.0  female   10.0
    15   22.0  male      6.0
    16   21.0  female    3.0
    17   21.0  male     14.0
    18   24.0  female    5.0
    19   24.0  male     12.0

Plot it as follows:

    sns.barplot(x='total', y='Age', hue='Sex', data=stack_subset)

For the example code, see: https://github.com/ddean2009/...

This article is also available at http://www.flydean.com/01-pandas-titanic/
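The steps between by_sex_age.size() and stack_subset.reset_index() are not spelled out above. The following is a minimal sketch of one way to connect them, assuming a small made-up DataFrame named toy in place of titanic.csv. The names toy, filled, dropped, counts, top_n, top_ages and count_subset are illustrative and do not come from the article, and using nlargest to pick the most common ages is an assumption rather than the author's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titanic.csv (not the real dataset).
toy = pd.DataFrame({
    "Age": [22.0, 22.0, 22.0, 30.0, 30.0, np.nan, 41.0, np.nan],
    "Sex": ["female", "male", "male", "female", "female", "male", "male", "female"],
})

# Option 1: fill missing ages with the mean (the mean then shows up in the counts).
filled = toy["Age"].fillna(toy["Age"].mean())

# Option 2: drop the missing ages instead.
dropped = toy["Age"].dropna()

# Keep rows where both Age and Sex are present, then count each (Age, Sex) pair.
cframe = toy[toy["Age"].notnull() & toy["Sex"].notnull()]
by_sex_age = cframe.groupby(["Age", "Sex"])

# Pivot Sex into columns; combinations that never occur become 0.
counts = by_sex_age.size().unstack().fillna(0)

# Pick the top_n most common ages by total count across both sexes.
top_n = 2
top_ages = counts.sum(axis=1).nlargest(top_n).index
count_subset = counts.loc[top_ages]

# Back to long format: one row per (Age, Sex) pair with a "total" column.
stack_subset = count_subset.stack()
stack_subset.name = "total"
stack_subset = stack_subset.reset_index()

print(stack_subset)
```

The resulting stack_subset has the columns Age, Sex and total, which is the shape that sns.barplot(x='total', y='Age', hue='Sex', data=stack_subset) expects. On the real data, top_n would be 10.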
