
From MySQL to Hive: Data Migration Made Simple


Sqoop can greatly simplify migrating MySQL data into Hive and lowers the barrier to running processing and analysis jobs on Hadoop. Prerequisites: a Hadoop environment with Sqoop and Hive installed and running. To speed things up, we will use the Cloudera Quickstart VM (at least 4 GB of RAM required), though you could also use the Hortonworks Data Platform (at least 8 GB required). Since my laptop only has 8 GB of RAM, I use the Cloudera VM image here. If you run the Cloudera/HDP VM under VirtualBox, you get a wide range of Hadoop ecosystem components preinstalled (MySQL, Oozie, Hadoop, Hive, ZooKeeper, Storm, Kafka, Spark, and so on).

Creating a table in MySQL

In the Cloudera VM, open a terminal and make sure MySQL is installed:

shell> mysql --version
mysql Ver 14.14 Distrib 5.1.66, for redhat-linux-gnu (x86_64) using readline 5.1

The example needs its own database, so create one in MySQL:

mysql> create database sqoop;

Next:

mysql> use sqoop;
mysql> create table customer(id varchar(3), name varchar(20), age varchar(3), salary integer(10));
Query OK, 0 rows affected (0.09 sec)

mysql> desc customer;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| id     | varchar(3)  | YES  |     | NULL    |       |
| name   | varchar(20) | YES  |     | NULL    |       |
| age    | varchar(3)  | YES  |     | NULL    |       |
| salary | int(10)     | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+

mysql> select * from customer;
+------+--------+------+--------+
| id   | name   | age  | salary |
+------+--------+------+--------+
| 1    | John   | 30   |  80000 |
| 2    | Kevin  | 33   |  84000 |
| 3    | Mark   | 28   |  90000 |
| 4    | Jenna  | 34   |  93000 |
| 5    | Robert | 32   | 100000 |
| 6    | Zoya   | 40   |  60000 |
| 7    | Sam    | 37   |  75000 |
| 8    | George | 31   |  67000 |
| 9    | Peter  | 23   |  70000 |
| 19   | Alex   | 26   |  74000 |
+------+--------+------+--------+

Starting the Sqoop job

As you can see, the customer table has no primary key, and I have not added many records to it. By default, Sqoop identifies the primary key column of the table (if there is one) and uses it as the split column. The low and high values of the split column are retrieved from the database, and each map task works on an evenly sized slice of that range. If the primary key is not uniformly distributed over its range, the tasks will be unbalanced. In that case you should explicitly choose a different column with the --split-by argument, for example --split-by id.

Since we want to import this table directly into Hive, we add --hive-import to the Sqoop command:

sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers

Here is what each option in the Sqoop command does:

connect - supplies the JDBC connection string.
username - the database user name.
-P - prompts for the password on the console. You could also use --password, but that is not recommended, because the password then shows up in the job execution logs and can cause problems. One workaround is to store the database password in a file on HDFS and pass it to the job at runtime.
table - tells Sqoop which MySQL table to import. Here the table name is customer.
split-by - specifies the split column. Here we specify the id column.
target-dir - the HDFS target directory.
fields-terminated-by - I specified a comma as the field delimiter (by default, data imported into HDFS is comma-delimited anyway).
hive-import - imports the table into Hive (Hive's default delimiters are used if none are set).
create-hive-table - if set, the job fails when the target Hive table already exists.
hive-table - specifies the target Hive table. In this example it is sqoop_workspace.customers, where sqoop_workspace is the database name and customers is the table name.

Note that I use -P here, so the password is requested interactively. Alternatively, you can parameterize the job and have Sqoop read the password from a file.
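As a rough sketch of that file-based approach (the password value and the HDFS path /user/cloudera/mysql.password below are made-up examples, not values from the original run), it could look like this:

# Store the MySQL password in an HDFS file readable only by its owner
# (echo -n avoids a trailing newline in the file).
echo -n "MySecretPassword" | hdfs dfs -put - /user/cloudera/mysql.password
hdfs dfs -chmod 400 /user/cloudera/mysql.password

# Reference the file with --password-file instead of prompting with -P.
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root \
  --password-file /user/cloudera/mysql.password \
  --split-by id --columns id,name --table customer \
  --target-dir /user/cloudera/ingest/raw/customers \
  --fields-terminated-by "," --hive-import --create-hive-table \
  --hive-table sqoop_workspace.customers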
As shown below, Sqoop runs the import as a map-reduce job:

sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/03/01 12:59:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.0
Enter password:
16/03/01 12:59:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/03/01 12:59:54 INFO tool.CodeGenTool: Beginning code generation
16/03/01 12:59:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 12:59:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 12:59:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/03/01 13:00:01 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.jar
16/03/01 13:00:01 WARN manager.MySQLManager: It looks like you are importing from mysql.
16/03/01 13:00:01 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
16/03/01 13:00:01 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
16/03/01 13:00:01 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
16/03/01 13:00:01 INFO mapreduce.ImportJobBase: Beginning import of customer
16/03/01 13:00:01 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
16/03/01 13:00:02 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/03/01 13:00:04 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/03/01 13:00:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/03/01 13:00:11 INFO db.DBInputFormat: Using read commited transaction isolation
16/03/01 13:00:11 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `customer`
16/03/01 13:00:11 WARN db.TextSplitter: Generating splits for a textual index column.
16/03/01 13:00:11 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/03/01 13:00:11 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/03/01 13:00:11 INFO mapreduce.JobSubmitter: number of splits:4
16/03/01 13:00:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456782715090_0004
16/03/01 13:00:13 INFO impl.YarnClientImpl: Submitted application application_1456782715090_0004
16/03/01 13:00:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1456782715090_0004/
16/03/01 13:00:13 INFO mapreduce.Job: Running job: job_1456782715090_0004
16/03/01 13:00:47 INFO mapreduce.Job: Job job_1456782715090_0004 running in uber mode : false
16/03/01 13:00:48 INFO mapreduce.Job: map 0% reduce 0%
16/03/01 13:01:43 INFO mapreduce.Job: map 25% reduce 0%
16/03/01 13:01:46 INFO mapreduce.Job: map 50% reduce 0%
16/03/01 13:01:48 INFO mapreduce.Job: map 100% reduce 0%
16/03/01 13:01:48 INFO mapreduce.Job: Job job_1456782715090_0004 completed successfully
16/03/01 13:01:48 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=548096
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=409
        HDFS: Number of bytes written=77
        HDFS: Number of read operations=16
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Launched map tasks=4
        Other local map tasks=5
        Total time spent by all maps in occupied slots (ms)=216810
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=216810
        Total vcore-seconds taken by all map tasks=216810
        Total megabyte-seconds taken by all map tasks=222013440
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Input split bytes=409
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=2400
        CPU time spent (ms)=5200
        Physical memory (bytes) snapshot=418557952
        Virtual memory (bytes) snapshot=6027804672
        Total committed heap usage (bytes)=243007488
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=77
16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Transferred 77 bytes in 104.1093 seconds (0.7396 bytes/sec)
16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Retrieved 10 records.
16/03/01 13:01:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 13:01:49 INFO hive.HiveImport: Loading uploaded data into Hive
Logging initialized using configuration in jar:file:/usr/jars/hive-common-1.1.0-cdh5.5.0.jar!/hive-log4j.properties
OK
Time taken: 2.163 seconds
Loading data to table sqoop_workspace.customers
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00001': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00002': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00003': User does not belong to supergroup
Table sqoop_workspace.customers stats: [numFiles=4, totalSize=77]
OK
Time taken: 1.399 seconds
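As an optional sanity check (not part of the original walkthrough), you can list and print the files Sqoop loaded into the warehouse directory reported in the chgrp messages above:

hdfs dfs -ls /user/hive/warehouse/sqoop_workspace.db/customers
hdfs dfs -cat /user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000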
Finally, let's verify the output in Hive:

hive> show databases;
OK
default
sqoop_workspace
Time taken: 0.034 seconds, Fetched: 2 row(s)
hive> use sqoop_workspace;
OK
Time taken: 0.063 seconds
hive> show tables;
OK
customers
Time taken: 0.036 seconds, Fetched: 1 row(s)
hive> show create table customers;
OK
CREATE TABLE `customers`(
  `id` string,
  `name` string)
COMMENT 'Imported by sqoop on 2016/03/01 13:01:49'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='4',
  'totalSize'='77',
  'transient_lastDdlTime'='1456866115')
Time taken: 0.26 seconds, Fetched: 18 row(s)

Note that the Hive table contains only the id and name columns, because the import was restricted with --columns id,name.

hive> select * from customers;
OK
1   John
2   Kevin
19  Alex
3   Mark
4   Jenna
5   Robert
6   Zoya
7   Sam
8   George
9   Peter
Time taken: 1.123 seconds, Fetched: 10 row(s)

And that's it! Migrating data from MySQL to Hive really is this simple.
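If you also wanted the age and salary columns in Hive, one possible follow-up (a sketch under the assumption that you are willing to rebuild the table, not a step from the original article) is to drop the Hive table, clear the old target directory, and re-run the import without --columns:

# Drop the Hive table created above; --create-hive-table fails if it already exists.
hive -e "drop table sqoop_workspace.customers;"

# Remove the previous target directory if it is still around, since the import
# also fails when the target directory exists (-f ignores a missing path).
hdfs dfs -rm -r -f /user/cloudera/ingest/raw/customers

# Re-run the import with no --columns restriction, so all four columns are pulled in.
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P \
  --split-by id --table customer \
  --target-dir /user/cloudera/ingest/raw/customers \
  --fields-terminated-by "," --hive-import --create-hive-table \
  --hive-table sqoop_workspace.customers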