Anyone who follows the data science market abroad knows that the three technologies most widely used in overseas data science in 2017 were Spark, Python, and MongoDB. Speaking of Python, no one working in big data will be unfamiliar with Scikit-learn and Pandas. Scikit-learn is the most widely used Python machine learning framework: algorithm engineers at the major internet companies all rely on it to some extent when implementing single-machine algorithms. TensorFlow is even better known; it is hard to do deep learning without it.

Let's start with an example: implementing logistic regression, a traditional machine learning algorithm, in Scikit-learn. The main functionality takes only 3 lines of code, along the lines of the sketch below.
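A minimal sketch of such a snippet (illustrative rather than the article's original code; it uses scikit-learn's LogisticRegression, with the bundled iris dataset standing in for the training data):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Any feature matrix X and label vector y would do; iris is just a stand-in.
X, y = load_iris(return_X_y=True)

clf = LogisticRegression()  # 1. build the model
clf.fit(X, y)               # 2. train it
print(clf.predict(X[:5]))   # 3. predict labels for some samples

Building, fitting, and predicting take one line each; everything else is data loading.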
Implementing the same algorithm in TensorFlow is a different story. The following code comes from GitHub:

'''
A logistic regression learning algorithm example using TensorFlow library.
This example is using the MNIST database of handwritten digits
(http://yann.lecun.com/exdb/mnist/)

Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''

from __future__ import print_function

import tensorflow as tf

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# Parameters
learning_rate = 0.01
training_epochs = 25
batch_size = 100
display_step = 1

# tf Graph Input
x = tf.placeholder(tf.float32, [None, 784])  # mnist data image of shape 28*28=784
y = tf.placeholder(tf.float32, [None, 10])   # 0-9 digits recognition => 10 classes

# Set model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Construct model
pred = tf.nn.softmax(tf.matmul(x, W) + b)  # Softmax

# Minimize error using cross entropy
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1))
# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Start training
with tf.Session() as sess:
    # Run the initializer
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples / batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_xs, y: batch_ys})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if (epoch + 1) % display_step == 0:
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(avg_cost))

    print("Optimization Finished!")

    # Test model
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print("Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))

Logistic regression is a fairly simple machine learning algorithm, yet implementing it with TensorFlow takes considerable space. Scikit-learn, for its part, lacks TensorFlow's rich deep learning functionality. Is there a way to let Scikit-learn support deep learning the way TensorFlow does, while keeping Scikit-learn's ease of use? There is: the open-source Scikit-Flow project, which was later merged into the TensorFlow project and became today's TFLearn module. Let's look at a TFLearn example implementing linear regression:

""" Linear Regression Example """

from __future__ import absolute_import, division, print_function

import tflearn

# Regression data
X = [3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182, 7.59, 2.167,
     7.042, 10.791, 5.313, 7.997, 5.654, 9.27, 3.1]
Y = [1.7, 2.76, 2.09, 3.19, 1.694, 1.573, 3.366, 2.596, 2.53, 1.221,
     2.827, 3.465, 1.65, 2.904, 2.42, 2.94, 1.3]

# Linear Regression graph
input_ = tflearn.input_data(shape=[None])
linear = tflearn.single_unit(input_)
regression = tflearn.regression(linear, optimizer='sgd', loss='mean_square',
                                metric='R2', learning_rate=0.01)
m = tflearn.DNN(regression)
m.fit(X, Y, n_epoch=1000, show_metric=True, snapshot_epoch=False)

print("\nRegression result:")
print("Y = " + str(m.get_weights(linear.W)) + "*X + " + str(m.get_weights(linear.b)))

print("\nTest prediction for x = 3.2, 3.3, 3.4:")
print(m.predict([3.2, 3.3, 3.4]))

As we can see, TFLearn inherits Scikit-learn's concise programming style and is very convenient for traditional machine learning methods. Next, let's look at a TFLearn example that implements a convolutional neural network (CNN) for the MNIST dataset:

""" Convolutional Neural Network for MNIST dataset classification task.

References:
    Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based
    learning applied to document recognition." Proceedings of the IEEE,
    86(11):2278-2324, November 1998.

Links:
    [MNIST Dataset] http://yann.lecun.com/exdb/mnist/
"""

from __future__ import division, print_function, absolute_import

import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression

# Data loading and preprocessing
import tflearn.datasets.mnist as mnist
X, Y, testX, testY = mnist.load_data(one_hot=True)
X = X.reshape([-1, 28, 28, 1])
testX = testX.reshape([-1, 28, 28, 1])

# Building convolutional network
network = input_data(shape=[None, 28, 28, 1], name='input')
network = conv_2d(network, 32, 3, activation='relu', regularizer="L2")
network = max_pool_2d(network, 2)
network = local_response_normalization(network)
network = conv_2d(network, 64, 3, activation='relu', regularizer="L2")
network = max_pool_2d(network, 2)
network = local_response_normalization(network)
network = fully_connected(network, 128, activation='tanh')
network = dropout(network, 0.8)
network = fully_connected(network, 256, activation='tanh')
network = dropout(network, 0.8)
network = fully_connected(network, 10, activation='softmax')
network = regression(network, optimizer='adam', learning_rate=0.01,
                     loss='categorical_crossentropy', name='target')

# Training
model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit({'input': X}, {'target': Y}, n_epoch=20,
          validation_set=({'input': testX}, {'target': testY}),
          snapshot_step=100, show_metric=True, run_id='convnet_mnist')

The TFLearn-based deep learning code is also very concise. TFLearn is a high-level, Scikit-learn-flavored package for TensorFlow, offering a further choice alongside native TensorFlow and Scikit-learn. For users who know Scikit-learn but are weary of TensorFlow's verbose code it is a godsend, and it is well worth studying and mastering for machine learning and data mining practitioners. Once trained, a TFLearn model can also be saved and reused for prediction, as sketched below.
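A brief usage sketch (not part of the original article; it assumes the model and testX objects from the CNN example above are still in scope, and the file name convnet_mnist.tfl is an arbitrary choice):

# Persisting and reusing the trained CNN.
model.save('convnet_mnist.tfl')  # write the learned weights to disk
model.load('convnet_mnist.tfl')  # restore them into a model built from the same graph
print(model.predict(testX[:5]))  # per-class probabilities for the first five test images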
Wang Hao is the head of the big data department and a senior architect at Hengchang Litong. He holds bachelor's and master's degrees from the University of Utah and is a part-time MBA student at the University of International Business and Economics. He has years of R&D and technical management experience at Baidu, Sina, NetEase, and Douban, and specializes in machine learning, big data, recommender systems, and social network analysis. He has published 8 papers at international conferences and in journals including TVCG and ASONAM; his undergraduate thesis won the Best Paper Award at the IEEE SMI 2008 international conference.