
Writing a Regression Tree from Scratch in Python

Date: 2023-03-26 16:39:38 Python

鏈枃灏嗕粙缁嶅洖褰掓爲鍙婂叾鍩烘湰鏁板鍘熺悊锛屽苟浣跨敤Python浠庨浂寮€濮嬪疄鐜颁竴涓畬鏁寸殑鍥炲綊鏍戞ā鍨嬨€備负浜嗙畝鍗曡捣瑙侊紝灏嗕娇鐢ㄩ€掑綊鏉ュ垱寤烘爲鑺傜偣銆傞€掑綊铏界劧涓嶆槸涓€涓畬缇庣殑瀹炵幇锛屼絾鏄渶鑳界洿瑙傜殑璇存槑鍘熺悊銆傞鍏堝鍏ュ簱importpandasaspdimportnumpyasnpiimportmatplotlib.pyplotasplt棣栧厛浣犻渶瑕佸垱寤鸿缁冩暟鎹紝鎴戜滑鐨勬暟鎹皢鏈夌嫭绔嬪彉閲忥紙x锛夊拰涓€涓浉鍏冲彉閲忥紙y锛夛紝骞朵娇鐢╪umpy鏉ユ坊鍔犻珮鏂櫔澹板埌鐩稿叧鍊硷紝鍙互鍦ㄦ暟瀛︿笂琛ㄧず涓哄叾涓潨栨槸鍣0銆備唬鐮佸涓嬫墍绀恒€俤eff(x):mu,sigma=0,1.5杩斿洖-x**2+x+5+np.闅忔満鐨勩€傛甯革紙浜╋紝瑗挎牸鐜涳紝1锛塶um_points=300np銆傞殢鏈虹殑銆傜瀛愶紙1锛墄=np銆俽andom.uniform(-2,5,num_points)y=np.array([f(i)foriinx])plt.scatter(x,y,s=5)閫氳繃鍒涘缓A鏍戝垱寤哄洖褰掓爲澶氫釜鑺傜偣鏉ラ娴嬫暟鍊兼暟鎹€備笅鍥惧睍绀轰簡鍥炲綊鏍戠殑鏍戝舰缁撴瀯绀轰緥锛屽叾涓瘡涓妭鐐归兘鏈夊叾鍒掑垎鏁版嵁鐨勯槇鍊笺€傜粰瀹氫竴缁勬暟鎹紝杈撳叆鍊间細閫氳繃鐩稿簲鐨勮鑼冨埌杈惧彾鑺傜偣銆傚埌杈捐妭鐐筂鐨勬墍鏈夎緭鍏ュ€奸兘鍙互鐢╔鐨勫瓙闆嗚〃绀恒€備粠鏁板涓婅锛岃鎴戜滑鐢ㄤ竴涓嚱鏁版潵琛ㄧず杩欑鎯呭喌锛屽鏋滅粰瀹氱殑杈撳叆鍊煎埌杈捐妭鐐筂锛屽垯鍙互缁欏嚭1锛屽惁鍒欎负0銆傛壘鍒板垎鍓叉暟鎹殑闃堝€硷細閫氳繃鍦ㄦ瘡涓€姝ラ€夋嫨2涓繛缁偣骞惰绠楀畠浠殑骞冲潎鍊兼潵杩唬璁粌鏁版嵁銆傝绠楀嚭鐨勫钩鍧囧€煎皢鏁版嵁鍒嗕负涓や釜闃堝€笺€傝鎴戜滑棣栧厛鑰冭檻闅忔満闃堝€兼潵婕旂ず浠讳綍缁欏畾鐨勬儏鍐点€傞槇鍊?1.5low=np.take(y,np.where(xthreshold))plt.scatter(x,y,s=5,label='Data')plt.plot([threshold]*2,[-16,10],'b--',label='Thresholdline')plt.plot([-2,threshold],[low.mean()]*2,'r--',label='宸﹀瓙棰勬祴绾?)plt.plot([threshold,5],[high.mean()]*2,'r--',label='鍙冲瓙棰勬祴绾?)plt.plot([-2,5],[y.mean()]*2,'g--',label='鑺傜偣棰勬祴绾?)plt.legend()钃濊壊绔栫嚎琛ㄧず涓€涓槇鍊硷紝鎴戜滑鍋囪瀹冩槸浠绘剰涓ょ偣鐨勫钩鍧囧€硷紝绋嶅悗鐢ㄤ簬鍒掑垎鏁版嵁銆傛垜浠杩欎釜闂鐨勭涓€涓娴嬫槸鎵€鏈夎缁冩暟鎹紙y杞达級鐨勫钩鍧囧€硷紙缁胯壊姘村钩绾匡級銆傝€屼袱鏉$孩绾挎槸瀵硅鍒涘缓鐨勫瓙鑺傜偣鐨勯娴嬨€傚緢鏄庢樉锛岃繖浜涘钩鍧囧€奸兘涓嶈兘寰堝ソ鍦颁唬琛ㄦ垜浠殑鏁版嵁锛屼絾瀹冧滑鐨勫尯鍒篃寰堟槑鏄撅細涓昏妭鐐归娴嬶紙缁跨嚎锛夊緱鍒版墍鏈夎缁冩暟鎹殑骞冲潎鍊硷紝鎴戜滑灏嗗叾鍒嗕负2涓瓙鑺傜偣锛屽嵆2瀛愯妭鐐硅妭鐐规湁鑷繁鐨勯娴嬶紙绾㈢嚎锛夈€傝繖2涓瓙鑺傜偣姣旂豢绾挎洿濂藉湴浠h〃浜嗗畠浠搴旂殑璁粌鏁版嵁銆傚洖褰掓爲浼氬皢鏁版嵁鍒嗘垚涓ら儴鍒嗏€斺€斾粠姣忎釜鑺傜偣鍒涘缓2涓瓙鑺傜偣锛岀洿鍒拌揪鍒扮粰瀹氱殑鍋滄鍊硷紙杩欐槸涓€涓妭鐐瑰彲浠ユ嫢鏈夌殑鏈€灏忔暟鎹噺锛夈€傚畠鎻愬墠鍋滄浜嗘爲鐨勬瀯寤鸿繃绋嬶紝鎴戜滑绉颁箣涓洪淇壀鏍戙€備负浠€涔堜細鏈夋彁鍓嶅仠姝㈡満鍒讹紵濡傛灉鎴戜滑瑕佺户缁繘琛屽垎閰嶇洿鍒拌妭鐐瑰彧鏈変竴涓€硷紝杩欎細鍒涘缓涓€涓繃搴︽嫙鍚堟柟妗堬紝鍏朵腑姣忎釜璁粌鏁版嵁鍙兘棰勬祴鑷繁銆傝В閲婏細褰撴ā鍨嬪畬鎴愭椂锛屽畠涓嶄細浣跨敤鏍硅妭鐐规垨浠讳綍涓棿鑺傜偣鏉ラ娴嬩换浣曞€硷紱瀹冨皢浣跨敤鍥炲綊鏍戠殑鍙跺瓙锛堣繖灏嗘槸鏍戠殑鏈€鍚庝竴涓妭鐐癸級杩涜棰勬祴銆備负浜嗚幏寰楁渶鑳戒唬琛ㄧ粰瀹氶槇鍊兼暟鎹殑闃堝€硷紝鎴戜滑浣跨敤娈嬪樊骞虫柟鍜屻€傚畠鍙互鍦ㄦ暟瀛︿笂瀹氫箟锛岃鎴戜滑鐪嬬湅杩欎竴姝ユ槸濡備綍宸ヤ綔鐨勩€傜幇鍦ㄥ凡缁忚绠楀嚭闃堝€肩殑SSR鍊硷紝鍙互浣跨敤SSR鍊兼渶灏忕殑闃堝€笺€備娇鐢ㄦ闃堝€煎皢璁粌鏁版嵁鍒嗘垚涓ら儴鍒嗭紙浣庨儴鍒嗗拰楂橀儴鍒嗭級锛屽叾涓綆閮ㄥ垎灏嗙敤浜庡垱寤哄乏瀛╁瓙锛岄珮閮ㄥ垎灏嗙敤浜庡垱寤哄彸瀛╁瓙銆俤efSSR(r,y):returnnp.sum((r-y)**2)SSRs,thr
esholds=[],[]foriinrange(len(x)-1):threshold=x[i:i+2].mean()浣?np.take(y,np.where(xthreshold))guess_low=low.mean()guess_high=high.mean()SSRs.append(SSR(low,guess_low)+SSR(high,guess_high))thresholds.append(threshold)print('鏈€灏忔畫宸负锛歿:.2f}'.format(min(SSRs)))print('瀵瑰簲鐨勯槇鍊兼槸锛歿:.4f}'.format(thresholds[SSRs.index(min(SSRs))]))鍦ㄨ繘琛屼笅涓€姝ヤ箣鍓嶏紝鎴戝皢浣跨敤pandas鍒涘缓涓€涓猟f锛屽苟涓斿垱寤轰竴涓敤浜庡鎵炬渶浣抽槇鍊肩殑鏂规硶銆傛墍鏈夎繖浜涙楠ら兘鍙互鍦ㄦ病鏈塸andas鐨勬儏鍐典笅瀹屾垚锛岃繖閲屼娇鐢ㄥ畠鏄洜涓哄畠鏇存柟渚裤€俤f=pd.DataFrame(zip(x,y.squeeze()),columns=['x','y'])deffind_threshold(df,plot=False):SSRs,thresholds=[],[]fori鍦ㄨ寖鍥村唴锛坙en锛坉f锛?1锛夛細threshold=df.x[i:i+2].mean()low=df[(df.x<=threshold)]high=df[(df.x>threshold)]guess_low=low.y.mean()guess_high=high.y.mean()SSRs.append(SSR(low.y.to_numpy(),guess_low)+SSR(high.y.to_numpy(),guess_high))thresholds.append(threshold)ifplot:plt.scatter(thresholds,SSRs,s=3)plt.show()returnthresholds[SSRs.index(min(SSRs))]灏嗘暟鎹垎鎴愪袱閮ㄥ垎鍚庡垱寤哄瓙鑺傜偣浣犲彲浠ヤ负浣庡€煎拰楂樺€兼壘鍒板崟鐙殑闃堝€笺€傞渶瑕佹敞鎰忕殑鏄繖閲屽姞浜嗕竴涓仠姝㈡潯浠讹紱鍥犱负瀵逛簬姣忎釜鑺傜偣锛屾暟鎹泦涓睘浜庤鑺傜偣鐨勭偣浼氭洿灏戯紝鎵€浠ユ垜浠畾涔夋瘡涓妭鐐圭殑鏈€灏忔暟鎹偣鏁般€傚鏋滀笉杩欐牱鍋氾紝姣忎釜鑺傜偣灏嗕粎浣跨敤涓€涓缁冨€艰繘琛岄娴嬶紝浠庤€屽鑷磋繃鎷熷悎銆傝妭鐐瑰彲浠ラ€掑綊鍒涘缓锛屾垜浠畾涔変簡涓€涓悕涓篢reeNode鐨勭被锛屽畠灏嗗瓨鍌ㄨ妭鐐瑰簲璇ュ瓨鍌ㄧ殑姣忎釜鍊笺€備娇鐢ㄦ绫伙紝鎴戜滑棣栧厛鍒涘缓鏍癸紝鍚屾椂璁$畻鍏堕槇鍊煎拰棰勬祴鍙橀噺銆傜劧鍚庡畠閫掑綊鍦板垱寤哄畠鐨勫瀛愶紝鍏朵腑姣忎釜瀛╁瓙鐨勭被瀛樺偍鍦ㄧ埗绾х殑left鎴杛ight灞炴€т腑銆傚湪涓嬮潰鐨刢reate_nodes鏂规硶涓紝缁欏畾鐨刣f棣栧厛琚垎鎴愪袱閮ㄥ垎銆傜劧鍚庢鏌ユ槸鍚︽湁瓒冲鐨勬暟鎹潵鍒嗗埆鍒涘缓宸﹀彸鑺傜偣銆傚鏋滄湁瓒冲鐨勬暟鎹偣锛堝浜庡畠浠腑鐨勪换浣曚竴涓級锛屾垜浠绠楅槇鍊煎苟浣跨敤瀹冩潵鍒涘缓涓€涓瓙鑺傜偣锛屽啀娆¤皟鐢╟reate_nodes鏂规硶骞跺皢杩欎釜鏂拌妭鐐逛綔涓烘爲銆傜被TreeNode():def__init__(self,threshold,pred):self.threshold=thresholdself.pred=predself.left=Noneself.right=Nonedefcreate_nodes(tree,df,stop):low=df[df.x<=tree.threshold]high=df[df.x>tree.threshold]濡傛灉len(low)>stop:threshold=find_threshold(low)tree.left=TreeNode(threshold,low.y.mean())create_nodes(tree.left,low,stop)iflen(high)>stop:threshold=find_threshold(high)tree.right=TreeNode(threshold,high.y.mean())create_nodes(tree.right,high,stop)闃堝€?find_threshold(df)tree=TreeNode(threshold,df.y.mean())create_nodes(tree,df,5)杩欎釜鏂规硶鏄湪绗竴妫垫爲涓婁慨鏀圭殑锛屽洜涓哄畠涓嶉渶瑕佽繑鍥炰换浣曚笢瑗裤€傝櫧鐒堕€掑綊鍑芥暟閫氬父涓嶄細杩欐牱鍐欙紙
donotreturn锛夛紝浣嗘槸鐢变簬涓嶉渶瑕佽繑鍥炲€硷紝褰搃f璇彞娌℃湁琚縺娲绘椂锛屽畠浠€涔堥兘涓嶅仛銆傚畬鎴愬悗锛屾偍鍙互妫€鏌ユ鏍戠粨鏋勶紝鐪嬬湅瀹冩槸鍚﹀垱寤轰簡涓€浜涢€傚悎鏁版嵁鐨勮妭鐐广€傝繖閲屽皢鎵嬪姩閫夋嫨绗竴涓妭鐐瑰強鍏跺鏍归槇鍊肩殑棰勬祴銆俻lt.scatter(x,y,s=0.5,label='Data')plt.plot([tree.threshold]*2,[-16,10],'r--',label='鏍归槇鍊?)plt.plot([tree.right.threshold]*2,[-16,10],'g--',label='鍙宠妭鐐归槇鍊?)plt.plot([tree.threshold,tree.right.threshold],[tree.right.left.pred]*2,'g',label='鍙宠妭鐐归娴?)plt.plot([tree.left.threshold]*2,[-16,10],'m--',label='宸﹁妭鐐归槇鍊?)plt.plot([tree.left.threshold,tree.threshold],[tree.left.right.pred]*2,'m',label='宸﹁妭鐐归娴?)plt.plot([tree.left.left.threshold]*2,[-16,10],'k--',label='SecondLeftnodethreshold')plt.legend()鍦ㄨ繖閲岀湅鍒颁袱涓娴嬶細绗竴涓乏鑺傜偣瀵归珮鍊肩殑棰勬祴锛堥珮浜庡叾闃堝€硷級绗竴涓彸鑺傜偣瀵逛綆鍊肩殑棰勬祴锛堜綆浜庡叾闃堝€硷級杩欓噷鎴戞墜鍔ㄨ鍓簡棰勬祴绾跨殑瀹藉害锛屽洜涓哄鏋滅粰瀹氱殑x鍊煎埌杈捐繖浜涜妭鐐逛腑鐨勪换浣曚竴涓兘浼氳〃绀轰负灞炰簬璇ヨ妭鐐圭殑鎵€鏈墄鍊肩殑骞冲潎鍊硷紝杩欎篃鎰忓懗鐫€娌℃湁o鍏朵粬x鍊煎弬涓庤妭鐐圭殑棰勬祴锛堝笇鏈涙湁鎰忎箟锛夈€傝繖涓爲缁撴瀯涓嶄粎浠呮槸涓や釜鑺傜偣锛屾墍浠ユ垜浠彲浠ラ€氳繃璋冪敤鍏跺瓙鑺傜偣鏉ユ鏌ョ壒瀹氱殑鍙惰妭鐐癸紝濡備笅鎵€绀恒€倀ree.left.right.left.left杩欏綋鐒舵剰鍛崇潃鏈変竴涓垎鏀悜涓嬫湁4涓瀛愰暱锛屼絾瀹冨彲鑳藉湪鏍戠殑鍙︿竴涓垎鏀笂鏇存繁銆傞娴嬫垜浠彲浠ュ垱寤轰竴涓娴嬫柟娉曟潵棰勬祴浠讳綍缁欏畾鐨勫€笺€俤efpredict(x):curr_node=treeresult=NonewhileTrue:ifx<=curr_node.threshold:ifcurr_node.left:curr_node=curr_node.leftelse:breakelifx>curr_node.threshold:ifcurr_node.right:curr_node=curr_node.rightelse:breakreturncurr_node.pred棰勬祴鏂规硶鎵€鍋氱殑鏄€氳繃灏嗘垜浠殑杈撳叆涓庢瘡涓彾瀛愮殑闃堝€艰繘琛屾瘮杈冩潵娌跨潃鏍戝悜涓嬭蛋銆傚鏋滆緭鍏ュ€煎ぇ浜庨槇鍊硷紝鍒欒浆鍒板彸鍙讹紝濡傛灉灏忎簬闃堝€硷紝鍒欒浆鍒板乏鍙讹紝渚濇绫绘帹锛岀洿鍒板埌杈句换浣曞簳閮ㄥ彾鑺傜偣銆傜劧鍚庝娇鐢ㄨ妭鐐硅嚜宸辩殑棰勬祴鍊艰繘琛岄娴嬶紝骞朵笌瀹冪殑闃堝€艰繘琛屾渶缁堟瘮杈冦€俆estwithx=3(褰撳垱寤烘暟鎹椂锛屽彲浠ヤ娇鐢ㄤ笂闈㈠啓鐨勫嚱鏁版潵璁$畻瀹為檯鍊笺€?3**2+3+5=-1锛岃繖鏄湡鏈涘€硷級锛屾垜浠緱鍒帮細predict(3)#-1.23741璁$畻璇樊杩欓噷鐢ㄧ浉瀵瑰钩鏂硅宸潵楠岃瘉鏁版嵁defRSE(y,g):杩斿洖sum(np.square(y-g))/sum(np.square(y-1/len(y)*sum(y)))x_val=np.random.uniform(-2,5,50)y_val=np.array([f(i)foriinx_val]).squeeze()tr_preds=np.array([predict(i)foriindf.x])val_preds=np.array([predict(i)foriinx_val])print('璁粌閿欒锛歿:.4f}'.format(RSE(df.y,tr_preds)))print('楠岃瘉閿欒锛歿:.4f}'.format(RSE(y_val,val_preds)))鍙互鐪嬪嚭璇樊涓嶅ぇ锛岀粨鏋滃涓嬨€備笅闈㈡杩扮殑姝ラ鏄洿娣卞叆鐨勬ā鍨嬨€備竴涓洿閫傚悎鍥炲綊鏍戞ā鍨嬬殑鏁版嵁锛氬洜涓烘垜浠殑鏁版嵁鏄椤瑰紡鐢熸垚鐨勬暟鎹紝鎵€浠ヤ娇鐢ㄥ椤瑰紡鍥炲綊妯″瀷鍙互鏇村ソ鐨勬嫙鍚堛€傝鎴戜滑鏇存敼璁粌鏁版嵁
骞跺皢鏂板嚱鏁拌缃负deff(x):mu,sigma=0,0.5ifx<3:return1+np.random.normal(mu,sigma,1)elifx>=3andx<6:杩斿洖9+np.random.normal(mu,sigma,1)elifx>=6:杩斿洖5+np.random.normal(mu,sigma,1)np.random.seed(1)x=np.random.uniform(0,10,num_points)y=np.array([f(i)foriinx])plt.scatter(x,y,s=5)鍦ㄦ鏁版嵁闆嗕笂杩愯涓婅堪鎵€鏈夌浉鍚岀▼搴忥紝涓嬮潰鐨勭粨鏋滄瘮鎴戜滑浠庡椤瑰紡鏁版嵁涓緱鍒扮殑缁撴灉璇樊鏇翠綆銆傛渶鍚庡叡浜竴涓嬩笂闈㈠姩鐢荤殑浠g爜锛歩mportpandasaspdimportnumpyasnpimportmatplotlib.pyplotaspltfrommatplotlib.animationimportFuncAnimation#===================================================鍒涘缓Datadeff(x):mu,sigma=0,1.5return-x**2+x+5+np.random.normal(mu,sigma,1)np.random.seed(1)x=np.random.uniform(-2,5,300)y=np.array([f(i)瀵逛簬鎴戝湪x])p=x.argsort()x=x[p]y=y[p]#===================================================璁$畻闃堝€糳efSSR(r,y):#sendnumpyarrayreturnnp.sum((r-y)**2)SSRs,thresholds=[],[]foriinrange(len(x)-1):threshold=x[i:i+2].mean()浣?np.take(y,np.where(xthreshold))guess_low=low.mean()guess_high=high.mean()SSRs.append(SSR(low,guess_low)+SSR(楂橈紝guess_high))thresholds.append(threshold)#===================================================鍔ㄧ敾缁樺浘锛岋紙ax1锛宎x2锛?plt.sub鍥撅紙2,1锛宻harex=True锛墄_data锛寉_data=[]锛孾]x_data2锛寉_data2=[]锛孾]ln锛?ax1.plot锛圼]锛孾]锛?r--'锛塴n2锛?ax2.plot(thresholds,SSRs,'ro',markersize=2)line=[ln,ln2]definit():ax1.scatter(x,y,s=3)ax1.title.set_text('灏濊瘯涓嶅悓鐨勯槇鍊?)ax2.title.set_text('ThresholdvsSSR')ax1.set_ylabel('yvalues')ax2.set_xlabel('Threshold')ax2.set_ylabel('SSR')杩斿洖linedefupdate(frame):x_data=[x[frame:frame+2].mean()]*2y_data=[min(y),max(y)]line[0].set_data(x_data,y_data)x_data2.append(闃堝€糩frame])y_data2.append(SSRs[frame])line[1].set_data(x_data2,y_data2)returnlineani=FuncAnimation(fig,update,frames=298,init_func=init,blit=True)plt.show()https://avoid.overfit.cn/post/68d76a2540894366bb7033ff120a30d6浣滆€咃細BeratYildirim
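
The walkthrough above notes that every step can also be done without pandas. As a complement, here is a minimal NumPy-only sketch of the same idea. It is not the article's code: the names `best_split`, `build` and `predict_one`, the dictionary node layout and the `min_points` default are assumptions made for illustration. Two deliberate differences from the walkthrough: x is sorted before the search, so each candidate threshold is the midpoint of true neighbours, and a node holding `min_points` points or fewer simply stays a leaf.

```python
import numpy as np


def best_split(x, y):
    """Return the midpoint threshold with the smallest summed SSR, or None."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_thr, best_ssr = None, np.inf
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue  # identical neighbours cannot be separated
        thr = (xs[i] + xs[i + 1]) / 2.0
        low, high = ys[: i + 1], ys[i + 1:]
        ssr = np.sum((low - low.mean()) ** 2) + np.sum((high - high.mean()) ** 2)
        if ssr < best_ssr:
            best_thr, best_ssr = thr, ssr
    return best_thr


def build(x, y, min_points=5):
    """Recursively build dict nodes; a node with too few points stays a leaf."""
    node = {"pred": float(y.mean()), "thr": None, "left": None, "right": None}
    if len(x) <= min_points:
        return node
    thr = best_split(x, y)
    if thr is None:  # all x values identical, nothing to split on
        return node
    node["thr"] = thr
    mask = x <= thr
    node["left"] = build(x[mask], y[mask], min_points)
    node["right"] = build(x[~mask], y[~mask], min_points)
    return node


def predict_one(node, value):
    """Walk down from the given node until a leaf (no threshold) is reached."""
    while node["thr"] is not None:
        node = node["left"] if value <= node["thr"] else node["right"]
    return node["pred"]


# Noise-free version of the step-shaped data from the second dataset above.
x_demo = np.linspace(0, 10, 200)
y_demo = np.where(x_demo < 3, 1.0, np.where(x_demo < 6, 9.0, 5.0))
tree_demo = build(x_demo, y_demo)
print(predict_one(tree_demo, 1.0), predict_one(tree_demo, 4.5), predict_one(tree_demo, 8.0))
```

On this noise-free step data each query should land on the plateau it belongs to, which makes the sketch easy to check by eye. Because every internal node here always has both children, prediction always ends on a leaf, unlike the `predict` function above, which can stop at an intermediate node when one side was too small to get a child.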