
Working with Hadoop from Python: Python-MapReduce

Date: 2023-04-06 21:53:06 Linux

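Environment: Hadoop 3.1, Python 3.6, Ubuntu 18.04

Hadoop is written in Java, and Java is the recommended way to work with HDFS. Sometimes, however, we also need to operate on HDFS from Python. This article covers how to use Python to work with HDFS (uploading files, downloading files, listing directories) and how to write MapReduce programs in Python.

Using Python to operate HDFS

First install the hdfs library and then import it:

    pip install hdfs

1. Connect and list the data under a given path

    from hdfs import Client

    # Port 50070 for Hadoop 2.x, port 9870 for Hadoop 3.x
    client = Client('http://ip:port')
    client.list('/')  # list the contents of the HDFS root directory

2. Create a directory

    client.makedirs('/test')
    client.makedirs('/test', permission=777)  # permission sets the directory permissions

3. Rename and delete

    client.rename('/test', '123')  # rename the /test directory to 123
    client.delete('/test', True)   # the second argument enables recursive deletion

4. Download

Download the file /test/log.txt into the local /home directory.

    client.download('/test/log.txt', '/home')

5. Read

    with client.read("/test/[PPT]GoogleProtocolBuffers.pdf") as reader:
        print(reader.read())

Other parameters of read(*args, **kwds):

    hdfs_path: the HDFS path.
    offset: the starting byte position.
    length: the number of bytes to read.
    buffer_size: size in bytes of the buffer used to transfer the data. The default is the value set in the HDFS configuration.
    encoding: the encoding used to decode the data.
    chunk_size: if set to a positive number, the context manager returns a generator that yields chunk_size bytes at a time instead of a file-like object.
    delimiter: if set, the context manager returns a generator that yields each time the delimiter is encountered. This parameter requires encoding to be specified.
    progress: callback function to track progress, called every chunk_size bytes (not available if no chunk size is specified). It is passed two arguments: the path of the file and the number of bytes transferred so far. On completion it is called once with -1 as the second argument.

6. Upload data

Upload a local file to /test on HDFS.

    client.upload('/test', '/home/test/a.log')

Python-MapReduce

Write the mapper code, map.py:

    import sys

    # Emit "<word> 1" for every word read from standard input
    for line in sys.stdin:
        fields = line.strip().split()
        for item in fields:
            print(item + ' ' + '1')

Write the reducer code, reduce.py:

    import sys

    # Sum the counts emitted by the mapper for each word
    result = {}
    for line in sys.stdin:
        kvs = line.strip().split(' ')
        k = kvs[0]
        v = int(kvs[1])
        if k in result:
            result[k] += v
        else:
            result[k] = v

    for k, v in result.items():
        print("%s\t%s" % (k, v))

Add a test text file, test1.txt:

    tale as old as time true as it can be beauty and the beast

Test the map code locally:

    cat test1.txt | python map.py

Result:

    tale 1
    as 1
    old 1
    as 1
    time 1
    true 1
    as 1
    it 1
    can 1
    be 1
    beauty 1
    and 1
    the 1
    beast 1

Test the reduce code locally:

    cat test1.txt | python map.py | sort -k1,1 | python reduce.py

Result (the order of the lines may differ depending on the Python version):

    and 1
    be 1
    old 1
    beauty 1
    true 1
    it 1
    beast 1
    as 3
    can 1
    time 1
    the 1
    tale 1

Run the map-reduce program on the Hadoop platform

After the local test passes, write a script that runs the program against HDFS. Script run.sh (adjust it to your local environment):

    HADOOP_CMD="/app/hadoop-3.1.2/bin/hadoop"
    STREAM_JAR_PATH="/app/hadoop-3.1.2/share/hadoop/tools/lib/hadoop-streaming-3.1.2.jar"
    INPUT_FILE_PATH_1="/py/input/"
    OUTPUT_PATH="/output"

    # Remove the output directory left over from a previous run
    $HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH

    # Step 1.
    $HADOOP_CMD jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH_1 \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -reducer "python reduce.py" \
        -file ./map.py \
        -file ./reduce.py

Add execute permission with chmod a+x run.sh, run the test with bash run.sh, and check the result.

Exercise 1. File merge and deduplication

A sample of input file file1:

    20150101 x
    20150102 y
    20150103 x
    20150104 y
    20150105 z
    20150106 x

A sample of input file file2:

    20150101 y
    20150102 y
    20150103 x
    20150104 z
    20150105 y

A sample of the output file file3 obtained by merging file1 and file2:

    20150101 x
    20150101 y
    20150102 y
    20150103 x
    20150104 y
    20150104 z
    20150105 y
    20150105 z
    20150106 x

Given the two input files file1 and file2, write a MapReduce program that merges them and removes the duplicate content to produce a new output file, file3. To complete the task, the program must be able to merge different files that contain duplicate content into one combined file without duplicates. The rules are as follows: rows are ordered by the first column (the student number); rows with the same student number are ordered by x, y, z.

Exercise 2. Mining parent-child relationships

The input file contains the following:

    child parent
    Steven Lucy
    Steven Jack
    Jone Lucy
    Jone Jack
    Lucy Mary
    Lucy Frank
    Jack Alice
    Jack Jesse
    David Alice
    David Jesse
    Philip David
    Philip Alma
    Mark David
    Mark Alma

The output file contains the following:

    grandchild grandparent
    Steven Alice
    Steven Jesse
    Jone Alice
    Jone Jesse
    Steven Mary
    Steven Frank
    Jone Mary
    Jone Frank
    Philip Alice
    Philip Jesse
    Mark Alice
    Mark Jesse

The program must mine the parent-child relationships and produce a table of grandchild-grandparent relationships. The rules are as follows: the grandchild comes first; for the same grandchild, the grandparents' names are arranged from A to Z.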
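For the file merge and deduplication exercise above, one possible approach is sketched below. It is not the article's own solution. It assumes each input row has exactly two whitespace-separated fields, and it emits the whole row as the key, so that sorting orders rows by student number first and by x, y, z second. The names map_lines, reduce_lines and merge_dedup are hypothetical, and merge_dedup only simulates the shuffle-and-sort step locally with sorted().

```python
def map_lines(lines):
    """Mapper: emit each well-formed row as a single key, "<number> <letter>"."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # assumption: rows have exactly two fields
            yield "%s %s" % (fields[0], fields[1])


def reduce_lines(sorted_lines):
    """Reducer: the input is sorted, so duplicates are adjacent; keep one copy."""
    previous = None
    for line in sorted_lines:
        record = line.rstrip("\n")
        if record != previous:
            yield record
            previous = record


def merge_dedup(*files):
    """Local stand-in for a streaming job: map every file, sort, then reduce."""
    mapped = []
    for lines in files:
        mapped.extend(map_lines(lines))
    return list(reduce_lines(sorted(mapped)))
```

Calling merge_dedup with the rows of file1 and file2 returns the merged rows without duplicates. To run this on Hadoop, the two functions would need to be wrapped in scripts that read sys.stdin and print each result, in the same way as map.py and reduce.py.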
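For the grandparent exercise above, a common technique is a reduce-side self-join, sketched here as a local simulation rather than as finished streaming scripts. For every (child, parent) row the mapper emits two tagged records, one keyed by the parent and one keyed by the child. The reducer then pairs each child of a person with each parent of that person. The function names and the "C"/"P" tags are hypothetical, and the final ordering (sorted by grandchild, then grandparent) is one reading of the exercise's ordering rule.

```python
from itertools import groupby


def map_pairs(lines):
    """Mapper: emit two tagged records per (child, parent) row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or fields == ["child", "parent"]:
            continue  # skip blank or malformed rows and the header row
        child, parent = fields
        yield (parent, "C", child)   # "child" is a child of the key person
        yield (child, "P", parent)   # "parent" is a parent of the key person


def reduce_pairs(sorted_records):
    """Reducer: for each person, pair every child with every parent."""
    for _person, group in groupby(sorted_records, key=lambda rec: rec[0]):
        children, parents = [], []
        for _key, tag, name in group:
            (children if tag == "C" else parents).append(name)
        for grandchild in children:
            for grandparent in parents:
                yield (grandchild, grandparent)


def grandparents(lines):
    """Local stand-in for a streaming job: map, sort by key, reduce, then sort."""
    return sorted(set(reduce_pairs(sorted(map_pairs(lines)))))
```

Only people who appear both as a parent and as a child (Lucy, Jack and David in the sample) produce output pairs, which is why the join is keyed on the person in the middle generation.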
