To get a better grip on how pandas is used in real data analysis, this article walks through analyzing rating data for US restaurants with pandas.

Introduction to the restaurant rating data

The data comes from the UCI ML Repository. It contains a little over a thousand records with 5 attributes:

userID: user ID
placeID: restaurant ID
rating: overall rating
food_rating: food rating
service_rating: service rating

We read it with pandas:

    import numpy as np
    import pandas as pd

    path = '../data/restaurant_final.csv'
    df = pd.read_csv(path)
    df

         userID  placeID  rating  food_rating  service_rating
    ...     ...      ...     ...          ...             ...
    1156  U1043   132630       1            1               1
    1157  U1011   132715       1            1               0
    1158  U1068   132733       1            1               0
    1159  U1068   132594       1            1               1
    1160  U1068   132660       0            0               0

    1161 rows × 5 columns

Analyzing the rating data

If we care about the overall rating and the food rating of each restaurant, we can start with the per-restaurant averages. Here we use the pivot_table method:

    # Mean overall and food rating per restaurant
    mean_ratings = df.pivot_table(values=['rating', 'food_rating'],
                                  index='placeID', aggfunc='mean')
    mean_ratings[:5]

             food_rating  rating
    placeID
    132560          1.00    0.50
    132561          1.00    0.75
    132564          1.25    1.25
    132572          1.00    1.00
    132583          1.00    1.00

Next, count how many users rated each placeID:

    ratings_by_place = df.groupby('placeID').size()
    ratings_by_place[:10]

    placeID
    132560     4
    132561     4
    132564     4
    132572    15
    132583     4
    132584     6
    132594     5
    132608     6
    132609     5
    132613     6
    dtype: int64

If a restaurant has too few voters, its average is not very objective. Let's keep only the restaurants with at least 4 voters:

    active_place = ratings_by_place.index[ratings_by_place >= 4]
    active_place

    Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
                132609, 132613,
                ...
                135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
                135108, 135109],
               dtype='int64', name='placeID', length=124)

Select the mean ratings for these restaurants:

    mean_ratings = mean_ratings.loc[active_place]
    mean_ratings

             food_rating    rating
    placeID
    132560      1.000000  0.500000
    132561      1.000000  0.750000
    132564      1.250000  1.250000
    132572      1.000000  1.000000
    132583      1.000000  1.000000
    ...              ...       ...
    135088      1.166667  1.000000
    135104      1.428571  0.857143
    135106      1.200000  1.200000
    135108      1.181818  1.181818
    135109      1.250000  1.000000

    124 rows × 2 columns

Sort by rating and take the 10 highest-rated restaurants:

    top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
    top_ratings[:10]

             food_rating    rating
    placeID
    132955      1.800000  2.000000
    135034      2.000000  2.000000
    134986      2.000000  2.000000
    132922      1.500000  1.833333
    132755      2.000000  1.800000
    135074      1.750000  1.750000
    135013      2.000000  1.750000
    134976      1.750000  1.750000
    135055      1.714286  1.714286
    135075      1.692308  1.692308

We can also compute the difference between the mean overall rating and the mean food rating, and store it in a column named diff. Sorting by diff in ascending order puts the restaurants whose overall rating falls furthest below their food rating first:

    mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']
    sorted_by_diff = mean_ratings.sort_values(by='diff')
    sorted_by_diff[:10]

             food_rating    rating      diff
    placeID
    132667      2.000000  1.250000 -0.750000
    132594      1.200000  0.600000 -0.600000
    132858      1.400000  0.800000 -0.600000
    135104      1.428571  0.857143 -0.571429
    132560      1.000000  0.500000 -0.500000
    135027      1.375000  0.875000 -0.500000
    132740      1.250000  0.750000 -0.500000
    134992      1.500000  1.000000 -0.500000
    132706      1.250000  0.750000 -0.500000
    132870      1.000000  0.600000 -0.400000

Reversing the order gives the 10 restaurants whose overall rating exceeds their food rating by the most:

    sorted_by_diff[::-1][:10]

             food_rating    rating      diff
    placeID
    134987      0.500000  1.000000  0.500000
    132937      1.000000  1.500000  0.500000
    135066      1.000000  1.500000  0.500000
    132851      1.000000  1.428571  0.428571
    135049      0.600000  1.000000  0.400000
    132922      1.500000  1.833333  0.333333
    135030      1.333333  1.583333  0.250000
    135063      1.000000  1.250000  0.250000
    132626      1.000000  1.250000  0.250000
    135000      1.000000  1.250000  0.250000

Finally, compute the standard deviation of rating for each restaurant and take the 10 largest, i.e. the restaurants whose voters disagree the most:

    # Standard deviation of rating grouped by placeID
    rating_std_by_place = df.groupby('placeID')['rating'].std()
    # Filter down to active_place
    rating_std_by_place = rating_std_by_place.loc[active_place]
    # Order Series by value in descending order
    rating_std_by_place.sort_values(ascending=False)[:10]

    placeID
    134987    1.154701
    135049    1.000000
    134983    1.000000
    135053    0.991031
    135027    0.991031
    132847    0.983192
    132767    0.983192
    132884    0.983192
    135082    0.971825
    132706    0.957427
    Name: rating, dtype: float64

This article is also available at http://www.flydean.com/02-pandas-restaurant/
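The steps in this walkthrough (per-restaurant means, vote counts, filtering out rarely rated restaurants, the rating gap, and the rating spread) can be combined into one small helper. The following is only a minimal sketch under stated assumptions: the `toy` DataFrame is made-up data, not the UCI file, and the names `summarize_places` and `min_votes` are hypothetical, introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "placeID":     [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "rating":      [2, 2, 1, 1, 0, 1, 2, 1, 2, 2],
    "food_rating": [2, 1, 1, 2, 1, 1, 2, 2, 0, 1],
})


def summarize_places(df, min_votes=4):
    """Per-restaurant summary, limited to places with at least min_votes ratings."""
    # Number of ratings each restaurant received
    votes = df.groupby("placeID").size()
    # Index of restaurants that meet the vote threshold
    active = votes.index[votes >= min_votes]

    # Mean overall and food rating per restaurant, restricted to active places;
    # .copy() so that adding columns below does not touch a view
    summary = df.pivot_table(values=["rating", "food_rating"],
                             index="placeID", aggfunc="mean").loc[active].copy()

    # Gap between overall rating and food rating
    summary["diff"] = summary["rating"] - summary["food_rating"]
    # Sample standard deviation of the overall rating (pandas default, ddof=1)
    summary["rating_std"] = df.groupby("placeID")["rating"].std().loc[active]
    summary["votes"] = votes.loc[active]

    # Highest-rated restaurants first
    return summary.sort_values(by="rating", ascending=False)


summary = summarize_places(toy)
print(summary)
```

In this toy data, placeID 3 has only two ratings, so it is dropped by the default threshold of 4. Passing a different `min_votes` changes how strict the filter is; which threshold is appropriate for the real data is a judgment call.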
