
Reading Excel with Spark (Java)

Published: 2023-04-01 17:37:46 | Java

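com.crealytics.spark.excel

Maven dependency:

    <dependency>
        <groupId>com.crealytics</groupId>
        <artifactId>spark-excel_2.12</artifactId>
        <version>3.2.1_0.17.1</version>
    </dependency>

Newer versions of spark-excel use POI 5, while easyexcel uses POI 4.1.2, so pin the POI version with dependencyManagement. According to the CHANGELOG (https://github.com/crealytics/spark-excel/blob/main/CHANGELOG.md), version 0.12.1 upgraded to POI 4.1.

    spark.read().format("com.crealytics.spark.excel")
        // treat the first row as the header
        .option("header", "true")
        .option("treatEmptyValuesAsNulls", "true")
        // infer the schema automatically
        .option("inferSchema", "true")
        //.option("addColorColumns", "true")
        // timestamp format
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
        .load(url);

Update (7.27.2022): add the POI dependencies explicitly.

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>5.2.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.2</version>
    </dependency>

After upgrading from 3.0.5 to 3.1.1 the dependency structure changed, so the POI dependencies have to be imported manually.

org.zuinnote.spark.office.excel

Maven dependency:

    <dependency>
        <groupId>com.github.zuinnote</groupId>
        <artifactId>spark-hadoopoffice-ds_2.12</artifactId>
        <version>1.6.4</version>
    </dependency>

The Spark version in this library's pom is 2.4.8. The method it uses to parse timestamps has been removed in 3.2.1, the version actually in use here, so a NoSuchMethod error is reported.

    spark.read().format("org.zuinnote.spark.office.excel")
        .option("read.spark.simpleMode", true)
        .option("hadoopoffice.read.header.read", true)
        .load(url);

Reading with POI

Read the sheet into a list of maps first, then convert it to a Dataset.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Assume the data is [{"age":"1","sex":"0"}]
    List<Map<String, String>> list = new ArrayList<>();

    // List<Map<String, String>> to List<Row>
    // (every map must iterate in the same key order, e.g. LinkedHashMap)
    List<Row> rows = list.stream()
        .map(map -> RowFactory.create(map.values().toArray()))
        .collect(Collectors.toList());

    // Create the schema, using the keys of the first record as field names
    // (keySet(), not values(), so that the names are "age" and "sex")
    StructType schema = DataTypes.createStructType(
        list.get(0).keySet().stream()
            .map(fieldName -> new StructField(fieldName, DataTypes.StringType, true, Metadata.empty()))
            .collect(Collectors.toList()));

When reading with easyExcel, the header is obtained in the ReadListener, and the schema is created from headMap instead:

    StructType schema = DataTypes.createStructType(
        headMap.values().stream()
            .map(fieldName -> new StructField(fieldName, DataTypes.StringType, true, Metadata.empty()))
            .collect(Collectors.toList()));

Finally, convert the List<Row> to a Dataset<Row>:

    spark.createDataFrame(rows, schema);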
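The list-of-maps conversion above only yields correct columns if every map iterates its keys in the same order as the first record, since the field names come from the first record while each row takes its values positionally. The following is a minimal, Spark-free sketch of one way to guard against that, assuming the records are `Map<String, String>`. The class name `RowAligner` and its methods `header` and `alignedRows` are hypothetical names introduced for illustration only. Each resulting `Object[]` could then be passed to `RowFactory.create(...)`, and the header names to the `StructField` constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowAligner {

    /** Field names: the keys of the first record, in its iteration order. */
    public static List<String> header(List<Map<String, String>> records) {
        return new ArrayList<>(records.get(0).keySet());
    }

    /**
     * Builds one Object[] per record, looking each value up by field name
     * so that every row follows the header's column order.
     * A key missing from a record produces a null cell.
     */
    public static List<Object[]> alignedRows(List<Map<String, String>> records,
                                             List<String> header) {
        List<Object[]> rows = new ArrayList<>();
        for (Map<String, String> record : records) {
            Object[] cells = new Object[header.size()];
            for (int i = 0; i < header.size(); i++) {
                cells[i] = record.get(header.get(i));
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("age", "1");
        first.put("sex", "0");

        // Same keys, but inserted in a different order.
        Map<String, String> second = new LinkedHashMap<>();
        second.put("sex", "1");
        second.put("age", "2");

        List<Map<String, String>> records = Arrays.asList(first, second);
        List<String> header = header(records);
        List<Object[]> rows = alignedRows(records, header);

        System.out.println(header);                       // [age, sex]
        System.out.println(Arrays.toString(rows.get(1))); // [2, 1]
    }
}
```

Looking values up by name costs one map lookup per cell, but it avoids silently swapped columns when the maps are plain `HashMap`s or were filled in different orders.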