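Step 1: Collect and clean the data

Data link: https://grouplens.org/dataset...
Download file: ml-latest-small

    import pandas as pd
    import numpy as np
    import tensorflow as tf

Import the ratings.csv file:

    ratings_df = pd.read_csv('./ml-latest-small/ratings.csv')
    ratings_df.tail()  # tail() shows the end of the table, the last 5 rows by default

Import the movies.csv file:

    movies_df = pd.read_csv('./ml-latest-small/movies.csv')
    movies_df.tail()

Replace the movieId in movies_df with the row number:

    movies_df['movieRow'] = movies_df.index  # add a 'movieRow' column equal to the index
    movies_df.tail()

Filter the features in movies_df:

    movies_df = movies_df[['movieRow', 'movieId', 'title']]  # keep three columns
    movies_df.to_csv('./ml-latest-small/moviesProcessed.csv', index=False, header=True, encoding='utf-8')  # write the new file moviesProcessed.csv
    movies_df.tail()

Merge ratings_df and movies_df on movieId:

    ratings_df = pd.merge(ratings_df, movies_df, on='movieId')
    ratings_df.head()

Filter the features in ratings_df:

    ratings_df = ratings_df[['userId', 'movieRow', 'rating']]  # keep three columns
    ratings_df.to_csv('./ml-latest-small/ratingsProcessed.csv', index=False, header=True, encoding='utf-8')  # write the new file ratingsProcessed.csv
    ratings_df.head()

Step 2: Create the movie rating matrix rating and the rating record matrix record

    userNo = ratings_df['userId'].max() + 1  # number of user slots: largest userId plus 1
    movieNo = ratings_df['movieRow'].max() + 1  # number of movie rows: largest movieRow plus 1
    rating = np.zeros((movieNo, userNo))  # matrix filled with zeros
    flag = 0
    ratings_df_length = np.shape(ratings_df)[0]  # size of the first dimension of ratings_df (number of rows)
    for index, row in ratings_df.iterrows():  # iterrows() iterates over the rows of ratings_df
        rating[int(row['movieRow']), int(row['userId'])] = row['rating']  # fill the cell at (movieRow, userId) with that row's rating
        flag += 1

    record = rating > 0
    record
    record = np.array(record, dtype=int)  # change the dtype: 0 means the user has not rated the movie, 1 means the user has
    record

Result:

    array([[0, 0, 0, ..., 0, 1, 1],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           ...,
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0]])

Step 3: Build the model

    def normalizeRatings(rating, record):
        m, n = rating.shape  # m is the number of movies, n is the number of users
        rating_mean = np.zeros((m, 1))  # mean rating of each movie
        rating_norm = np.zeros((m, n))  # processed (normalized) ratings
        for i in range(m):
            idx = record[i, :] != 0  # users who rated movie i; [i, :] selects every column of row i
            rating_mean[i] = np.mean(rating[i, idx])  # mean over the users in idx for row i; np.mean() averages all elements
            rating_norm[i, idx] = rating[i, idx] - rating_mean[i]  # rating_norm = original score - average score
        return rating_norm, rating_mean

    rating_norm, rating_mean = normalizeRatings(rating, record)

Result:

    /root/anaconda2/envs/python3/lib/python3.6/site-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
      out=out, **kwargs)
    /root/anaconda2/envs/python3/lib/python3.6/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
      ret = ret.dtype.type(ret / rcount)

Note: the warnings come from movies that nobody rated, whose mean is NaN. If the data contains many NaN values, they seriously affect later operations, so they are replaced with 0.

    rating_norm

Result (this output was produced with the line written as rating_norm[i, idx] -= rating_mean[i], which subtracts the mean from zero, so the rated cells show the negated mean; the assignment in the function above follows the stated formula, original score minus average score):

    array([[0., 0., 0., ..., 0., -3.87246964, -3.87246964],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           ...,
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.]])

    rating_mean = np.nan_to_num(rating_mean)  # replace NaN values with 0
    rating_mean

Result:

    array([[3.87246964],
           [3.40186916],
           [3.16101695],
           ...,
           [3.],
           [0.],
           [5.]])

Build the model:

    num_features = 10
    X_parameters = tf.Variable(tf.random_normal([movieNo, num_features], stddev=0.35))
    Theta_parameters = tf.Variable(tf.random_normal([userNo, num_features], stddev=0.35))
    # tf.Variable() initializes a variable
    # tf.random_normal() draws the requested number of values from a normal distribution
    #   mean: mean of the normal distribution
    #   stddev: standard deviation of the normal distribution
    #   dtype: type of the output
    loss = 1/2 * tf.reduce_sum(((tf.matmul(X_parameters, Theta_parameters, transpose_b=True) - rating_norm) * record) ** 2) + 1/2 * (tf.reduce_sum(X_parameters ** 2) + tf.reduce_sum(Theta_parameters ** 2))
    # cost function of the recommendation model: squared error on the rated entries plus an L2 penalty on both parameter matrices

Function notes:

reduce_sum() computes a sum: reduce_sum(input_tensor, axis=None, keep_dims=False, name=None, reduction_indices=None)

reduce_sum() parameters:
1) input_tensor: the input tensor.
2) axis: the dimension to sum along. For a 2-D input_tensor, 0 sums down the columns, 1 sums across the rows, and [0, 1] sums over the columns and then the rows.
3) keep_dims: defaults to False, meaning the summed dimension is dropped. If set to True, the dimension is kept.
4) name: the name of the operation.
5) reduction_indices: defaults to None, which reduces input_tensor to 0 dimensions, a single number. For a 2-D input_tensor, reduction_indices=0 sums by column and reduction_indices=1 sums by row.
6) Note that reduction_indices and axis cannot both be set.

tf.matmul(a, b) multiplies matrix a by matrix b and produces a * b.

tf.matmul(a, b) parameters:
1) a: a tensor of type float16, float32, float64, int32, complex64, or complex128 with rank > 1.
2) b: a tensor with the same type and rank as a.
3) transpose_a: if True, a is transposed before the multiplication.
4) transpose_b: if True, b is transposed before the multiplication.
5) adjoint_a: if True, a is conjugated and transposed before the multiplication.
6) adjoint_b: if True, b is conjugated and transposed before the multiplication.
7) a_is_sparse: if True, a is treated as a sparse matrix.
8) b_is_sparse: if True, b is treated as a sparse matrix.
9) name: the name of the operation (optional).

Optimization algorithm:

    optimizer = tf.train.AdamOptimizer(1e-4)  # https://blog.csdn.net/lenbow/article/details/52218551
    train = optimizer.minimize(loss)
    # Optimizer.minimize does two things with the loss:
    # it computes the gradients of the loss with respect to the model parameters,
    # then applies the computed gradients to update the variables.

Step 4: Train the model

    # usage of tf.summary: https://www.cnblogs.com/lyc-seu/p/8647792.html
    tf.summary.scalar('loss', loss)  # records a scalar value for display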
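
Before training on the full dataset, it can help to check the normalization and the cost function from the model-building step by hand on a matrix small enough to verify. The sketch below uses NumPy only and is not the tutorial's TensorFlow code: the 3x3 toy ratings, the helper names normalize_ratings and masked_loss, and the lam parameter are all assumptions made for illustration (lam=1.0 mirrors the unweighted penalty in the loss above). It also skips movies with no ratings, so no NaN appears in the first place.

```python
import numpy as np

# Toy rating matrix (rows = movies, columns = users); 0 means "not rated".
# These numbers are made up for illustration only.
rating = np.array([
    [5.0, 3.0, 0.0],
    [0.0, 0.0, 0.0],   # a movie nobody rated
    [4.0, 0.0, 2.0],
])
record = (rating > 0).astype(int)  # 1 where a rating exists, 0 elsewhere


def normalize_ratings(rating, record):
    """Subtract each movie's mean rating from its rated entries only."""
    m, n = rating.shape
    rating_mean = np.zeros((m, 1))
    rating_norm = np.zeros((m, n))
    for i in range(m):
        idx = record[i, :] != 0
        if idx.any():  # skip unrated movies, so the mean is never NaN
            rating_mean[i] = rating[i, idx].mean()
            rating_norm[i, idx] = rating[i, idx] - rating_mean[i]
    return rating_norm, rating_mean


def masked_loss(X, Theta, rating_norm, record, lam=1.0):
    """Squared error on rated entries plus an L2 penalty (lam is hypothetical)."""
    err = (X @ Theta.T - rating_norm) * record  # the mask zeroes unrated cells
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))


rating_norm, rating_mean = normalize_ratings(rating, record)

# Random initial parameters, same shapes as X_parameters and Theta_parameters.
rng = np.random.default_rng(0)
num_features = 2
X = rng.normal(scale=0.35, size=(rating.shape[0], num_features))
Theta = rng.normal(scale=0.35, size=(rating.shape[1], num_features))
loss = masked_loss(X, Theta, rating_norm, record)
```

With all-zero parameters the penalty vanishes and the loss reduces to half the sum of squared normalized ratings, which gives a quick sanity check before any optimizer is involved.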
