大数据Hadoop、MapReduce、Hive项目实践

大数据Hadoop、MapReduce、Hive项目实践一．大数据概述1.1.大数据概念而所谓的大数据，就是指大量（Volume），多样（Variety），快速（Velocity），价值密度低（Value）的数据，这四个特性也被称为大数据的4V特性，传统数据库面对此类数据遇到全面挑战，才使得大数据技术飞速发展。1.2.大数据的意义1.2.1.企业之所以要使用大数据，归根结底还是因为需求，业务需求；

zhhlk

1055人浏览 · 2023-11-01 14:50:31

zhhlk · 2023-11-01 14:50:31 发布

大数据Hadoop、MapReduce、Hive项目实践这里写自定义目录标题

大数据Hadoop、MapReduce、Hive项目实践
启动YARN
停止
历史进程要单独关闭
切换到 Sqoop 目录
编辑文件
hive

大数据Hadoop、MapReduce、Hive项目实践

大数据Hadoop、MapReduce、Hive项目实践
一．大数据概述
1.1.大数据概念
而所谓的大数据，就是指大量（Volume），多样（Variety），快速（Velocity），价值密度低（Value）的数据，这四个特性也被称为大数据的4V特性，传统数据库面对此类数据遇到全面挑战，才使得大数据技术飞速发展。

1.2.大数据的意义
1.2.1.企业之所以要使用大数据，归根结底还是因为需求，业务需求；
企业切换至大数据平台，一种可能是为了解决现有的RDBMS（关系型数据库管理系统）的瓶颈，无论是存储量瓶颈还是效率瓶颈；另一种可能是为了支持新的业务需求，很多新需求无论从数据量级、数据种类还是处理方式上都不是旧有数据环境能够满足的，所以才需要新的数据环境。
e.g.
销售部门希望能够得到需要维护的客户名单和活动率低的酒店会场名单；
决策层希望能够知道最近一个月订单不活跃的客户群以便调整优惠价格；
电商企业系统实时显示活动售卖量及售卖金额；
1.3.大数据平台介绍
1.3.1.Hadoop平台
属于Apache（开源基金组织）的开源项目。免费开源，并广泛应用于有大数据需求的各行各业。

国内的大数据平台:
国内也涌现出了一批优秀的提供大数据服务的公司；
其中提供云端大数据服务的华为、阿里巴巴，提供数据存储硬件的浪潮、提供数据库&数据平台类服务的星环和数据应用方面的帆软、海云数据等；
国内近几年也在加快自主研发数据平台的脚步，国产的数据库和大数据环境已经有很多成功落地并应用在国内各行业中，相信未来国产数据产品一定会越来越强大。
1.4.大数据技术架构构成
1.4.1.传统的服务器架构多数采取单机、主备、主从的架构，在Scale-out扩展性上存在先天缺陷；
1.4.2.而现有大数据技术的架构基本上都采用分片式架构，将数据分散在多个节点上，以满足多个节点可以并行处理大量数据的需求；

1.4.3.Hadoop 集群采用主-从结构，其中主节点主要负责元数据的存储及从节点管理等工作，而从节点（也称为工作节点）主要负责真实数据存储和计算等工作。

1.4.4.hadoop的生态系统
Hadoop生态环境所包含的组件相当复杂，其中较为常见的就有很多；

1.5.hadoop的核心组件
HDFS—Hadoop分布式的文件系统组件
MapReduce—用于进行数据统计分析计算的组件
Hbase—大数据的存储系统（Hive）

最新的核心组件构成
由于早期架构主节点资源成为集群瓶颈，不利于集群大规模扩展，所以Hadoop自2版本开始，核心组件除HBase外变为3个：
HDFS 分布式文件系统 => 数据存储层
Yarn 资源管理系统 => 资源管理层
MapReduce 分布式计算框架 => 数据处理层
Hadoop 2 版本开始加入了主节点的HA（高可用性）功能，并将资源管理完全交于Yarn负责，使得主节点资源消耗得以缓解，可实现更大规模的集群；
且Hadoop将资源管理层和数据处理层分离后，可实现多种其余计算框架的支持，如Spark 内存级分布式计算框架及Flink 流式计算框架；

1.6.大数据模型框架

1.7.大数据的集群规划
1.7.1.集群节点的分类
节点可以分为 ’worker节点’ 和 ’master节点’ 两类
Worker节点运行DataNode，NodeManager等守护进程
Master节点运行NameNode，SecondaryNameNode，或者ResourceManager等守护进程
在小型集群中，NameNode 和ResourceManager 经常运行在同一台机器中；
甚至有些情况SecondaryNameNode 也运行在同一台机器上
至少将NameNode 的元数据拷贝到一台独立的机器上是很重要的
1.7.2.基本集群配置
Hadoop集群包含多个组件，在规划集群时首先应该考虑如何划分各个组件所在节点位置；

1.7.3.集群规划
1.Hadoop可以单机运行；
便于测试和开发
适用于非大量实际数据
许多公司一开始只部署一个小型集群，随着之后的需求再增长；
一开始可能只有4或6个节点
2.随着数据和业务的增长，更多的节点逐步添加进集群
当集群需要增长时可以使用的决策有；
按需求增长计算能力
按数据增加存储能力
按任务需求增长内存

3.基于存储能力增长集群
基于存储容量的集群增长通常是一种很好且易于计算的策略；
例子：
数据每周大约增长3TB
HDFS设置复制因子为3
那么每周需要9TB的存储空间
考虑一些冗余度 - 磁盘空间的25%（不能所有资源全部给Hadoop集群使用）
相当于每周需要增长大约12TB
加入机器是16*3TB的硬件，那么相当于每四周需要一台新机器
或者说：两年数据量为1.2PB，则需要大约35台机器

4.Worker节点-推荐配置
Worker 节点典型配置
中配 - 高存储，1Gb网卡
163TB SATA II硬盘，非RAID
26核 2.9GHz CPU
256GB内存
21G网卡
高配 - 高内存，细密度，10Gb网卡
241TB Nearline/MDL SAS硬盘，非RAID
26核 2.9GHz CPU
512GB内存（或更多）
110G网卡

5.Master节点-推荐配置
双电源；
使用UPS
双网卡；
磁盘做RAID；
元数据信息是最需要保护的
内存考虑；
集群小于20个节点：64G
集群小于300个节点：96G
大于300个节点：128+G
6.系统推荐
推荐使用Linux操作系统；
RedHat : 支持企业版本包含服务协议，收费，稳定；
Centos : 社区版，比较稳定，免费的；
Ubuntu : 基于Debian 也是Linux 系统的一个分支，图形界面比较好，尽量选择支持周期比较长的版本；
等

可以选用云主机进行集群部署，特别是对集群资源要求波动较大、集群规模较小或没有绝对要求数据不能上云的企业而言；

二．在虚拟机（VMware Workstation Pro）中安装CentOS系统
2.1.在磁盘上创建文件夹vmct7,准备将虚拟机文件安装到此目录。
2.2.打开安装好的VM虚拟机，如图所示

2.3.新建虚拟机

点击完成，在点击编辑虚拟机设置，配置操作系统的镜像文件安装文件CentOS-7-x86_64-DVD-2009.iso

点击确定完成
启动虚拟机

2.4.安装centos7操作系统

网络连接设置为连接状态

安装过程中给root用户设置密码为java

安装完成后点击重启

重启后选择接收许可协议

进入linux后界面如下：

查看当前centeos的ip地址:
在中端窗口中使用
ifconfig查看

在win10下打开命令窗口,使用ping命令，是否能够访问到centeos虚拟系统

在centeos中端ping www.baidu.com 查看是否能够访问外网

在centeos ping win10主机的ip地址:

2.5.使用linux客户端管理工具访问虚拟机中的centos系统MobaXterm_Personal_22.1.exe
解压缩zhongwenban.rar得到文件夹MobaXterm_22.1_Green，在此目录下双击文件MobaXterm_Personal_22.1.exe即可打开客户端管理工具

查看centeos中的ssh服务是否启动:
systemctl status sshd

关闭centeos防火墙,并设置为开机自动关闭
1.查看防火墙是否启动

2.关闭防火墙

3.设置防火墙开机不自启

4.新建会话

点击ok，输入登录密码，出现以下提示，表示登陆成功

2.6.linux常用的指令
2.6.1.目录列表ls
ls -ll 详细的列表信息

drwxr-xr-x-
d–目录
rwx：表示当前用户对此文件的权限，r–读，w–可写,x–执行
-r-：表示当前用户所在的组成员的权限，r表示当前组用户对该文件为只读权限
-x-:表示其他用户对该文件是执行的权限

2.6.7.创建文件夹
mkdir 文件夹名称

删除目录,目录中不能包含其他目录，目录必须是空目录
rm -r 文件夹名称

2.6.8.建立新文件，并编辑内容
vi或者vim编辑器,用于对文件进行编辑.
vi 文件名

按i键或者insert键进入编辑

编辑完成以后使用按esc退出编辑，使用:wq保存退出
不保存退出，按:q!退出
查看文件内容:
cat 文件名

2.6.9.查找文件
find 要查找的路径 -name 文件名
在指定的路径下按照文件名查找指定的文件
示例：从根目录下查找所有的java文件信息

2.6.10.linux下的环境变量配置文件为profile文件,位于etc目录下
cd /etc

2.6.11.使用systemctl系统命令对命令执行的状态进行查看
systemctl status 命令名称或者服务名称
查看ssh服务的状态

2.6.12.创建用户并修改用户的密码
Useradd 用户名
passwd 用户名

su 用户名：切换用户

2.6.13.授予用户权限
一、文件权限
在 Linux 中第一个字符代表这个文件是目录、文件或链接文件等等。

当为 d 则是目录（directory）
当为 - 则是文件；
若是 l 则表示为链接文档(link file)；
若是 b 则表示为装置文件里面的可供储存的接口设备(可随机存取装置)；
若是 c 则表示为装置文件里面的串行端口设备，例如键盘、鼠标(一次性读取装置)。
接下来的字符中，以三个为一组，且均为 rwx 的三个参数的组合。其中：

r 代表可读(read)、
w 代表可写(write)、
x 代表可执行(execute)。
如果没有权限，就会出现减号 - 而已。
要注意的是，这三个权限的位置不会改变

每个文件的属性由左边第一部分的 10 个字符来确定（如下图）。

从左至右用 0-9 这些数字来表示。

第 0 位确定文件类型，
第 1-3 位确定属主（该文件的所有者）拥有该文件的权限。
第4-6位确定属组（所有者的同组用户）拥有该文件的权限
第7-9位确定其他用户拥有该文件的权限。
其中：

第 1、4、7 位表示读权限，如果用 r 字符表示，则有读权限，如果用 - 字符表示，则没有读权限；
第 2、5、8 位表示写权限，如果用 w 字符表示，则有写权限，如果用 - 字符表示没有写权限；
第 3、6、9 位表示可执行权限，如果用 x 字符表示，则有执行权限，如果用 - 字符表示，则没有执行权限。
只有文件所有者和超级用户可以修改文件或目录的权限。
可以使用绝对模式（八进制数字模式），符号模式指定文件的权限。

使用权限 : 所有使用者
将文件 file1.txt 设为所有人皆可读取 :

chmod ugo+r file1.txt
1
将文件 file1.txt 设为所有人皆可读取 :

chmod a+r file1.txt
1
将文件 file1.txt 与 file2.txt 设为该文件拥有者，与其所属同一个群体者可写入，但其他以外的人则不可写入 :

chmod ug+w,o-w file1.txt file2.txt
1
为 ex1.py 文件拥有者增加可执行权限:

chmod u+x ex1.py
1
将目前目录下的所有文件与子目录皆设为任何人可读取 :

chmod -R a+r *
chmod 权限文件或者文件夹
权限描述:rwx r-4, w-2 ,x-1

2.7.使用centeos7下自带的mariadb数据库(mysql8.2版本)
1.查看mariadb数据库的状态并启动mariadb数据库

启动mariadb数据库

2.设置maridb的远程访问
需要使用win10下的sqlyog工具访问虚拟机中centsos系统的mariadb数据库

(1)进入mariadb数据库服务器
输入下面命令进入mariadb数据库（默认无密码）
mysql -uroot -p

(2)授予root用户访问所有数据库对象的权限
grant all on . to root@localhost identified by ‘’;

(3)授予外网远程登录权限
grant all privileges on . to root@‘%’ identified by ‘’;

(4)授予用户权限并授权处理其他选项的权限
grant all privileges on . to root@localhost identified by ‘’ with grant option;

(5)在win10启动sqlyog并新建链接登录远程数据库系统

测试成功后，点击链接

在centeos下查看创建的mariadb数据库

3.设置mariadb数据库中文字符的问题
更改前字符集

(1)修改/etc/my.cnf文件
vi /etc/my.cnf
在[mysqld]标签下添加
init_connect=‘SET collation_connection = utf8_unicode_ci’
init_connect=‘SET NAMES utf8’
character-set-server=utf8
(2)编辑/etc/my.cnf.d/client.cnf
vi /etc/my.cnf.d/client.cnf
在[client]中添加
default-character-set=utf8

(3)修改文件/etc/my.cnf.d/mysql-clients.cnf
vi /etc/my.cnf.d/mysql-clients.cnf
在[mysql]中添加
default-character-set=utf8
全部配置完成，重启mariadb
systemctl restart mariadb
之后进入MariaDB查看字符集
mysql> show variables like “%character%”;show variables like “%collation%”;

重启服务后，需要删除以前创建的数据库,重建数据库和表
#创建数据库
CREATE DATABASE mydb;
USE mydb;

#创建班级表
CREATE TABLE clazz(
cid INT PRIMARY KEY AUTO_INCREMENT,
cname VARCHAR(50) NOT NULL
);

#创建学生表
CREATE TABLE student(
sid INT PRIMARY KEY AUTO_INCREMENT,
sname VARCHAR(50) NOT NULL,
gender VARCHAR(4),
address VARCHAR(100),
tel VARCHAR(50),
birthday DATE,
classid INT NOT NULL,
CONSTRAINT fk_stu_classid FOREIGN KEY (classid) REFERENCES clazz(cid)
);

#插入数据
INSERT INTO clazz(cname) VALUES(‘信管1班’),(‘信管2班’);
SELECT * FROM clazz;

2.8.配置安装java的web服务器tomcat9
2.8.1.上传tomcat9的压缩文件到opt目录
2.8.2.使用unzip命令解压缩apache-tomcat-9.0.24.zip

2.8.4.进入apache-tomcat-9.0.24\bin目录并设置该目录下的所有文件权限为777
chmod 777 *
2.8.5.启动tomcat服务器
./startup.sh

在浏览器输入：
http://你的服务器的ip地址:8080/
例如:
http://192.168.43.89:8080/

三．hadoop环境配置安装部署
3.1.本地hadoop环境的配置安装
3.1.1.上传hadoop安装包并解压hadoop-3.3.1-aarch64.tar.gz
将hadoop-3.3.1-aarch64.tar.gz上传到/opt目录下，进行解压缩

解压缩 hadoop-3.3.1-aarch64.tar.gz
tar -xvf hadoop-3.3.1-aarch64.tar.gz

3.1.2.配置jdk环境到profile文件
1.确定jdk的位置
确定当前centeos中java的版本号
java -version

查找系统已安装的jdk的路径
find / -name java

进入 cd /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
进入后使用pwd查看路径

需要配置的jdk路径为:
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64

2.修改 /etc/profile 文件，配置java环境
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
export PATH= $P A T H :$ JAVA_HOME/bin: $JAVA_HOME/jre/bin export CLASSPATH=.:$ JAVA_HOME/lib:$JAVA_HOME/jre/lib

vi /etc/profile

3.使配置文件生效
source /etc/profile
3.1.3.修改centeos的主机名称
修改/etc/hosts文件
vi /etc/hosts
将ip对应的主机名称改为hdp01
192.168.43.89 hdp01

重启服务器
查看主机名称:

3.1.3.配置hadoop的环境变量
hadoop在本机中的路径为:/opt/hadoop-3.3.1
修改/etc/profile
export HADOOP_HOME=/opt/hadoop-3.3.1
export PATH= $P A T H :$ HADOOP_HOME/bin
export PATH= $P A T H :$ HADOOP_HOME/sbin

vi /etc/profile

配置文件生效
source /etc/profile
3.1.4.测试本地运行模式
1.准备数据文件，上传到/opt/hadoop-3.3.1/share/hadoop/mapreduce/input目录下

2.运行下面指令，对
hadoop jar hadoop-mapreduce-examples-3.3.1.jar wordcount input output

查看统计的结果

cat part-r-00000

根据指定的正则进行统计，统计警察出现的次数
hadoop jar hadoop-mapreduce-examples-3.3.1.jar grep input output ‘警察’

3.2.Hadoop伪分布式环境配置
3.2.1.ssh免密登陆
设置无密码登录
#~ 代表的是用户的主文件夹，即 “/home/用户名” 这个目录，如你的用户名为 hadoop，则 ~ 就代表 “/home/hadoop/”
cd ~/.ssh/ # 若没有该目录，请先执行一次ssh localhost
ssh-keygen -t rsa # 会有提示，都按回车就可以

cat id_rsa.pub >> authorized_keys # 加入授权
chmod 600 ./authorized_keys # 修改文件权限

3.2.2.修改配置文件
伪分布式需要修改2个配置文件 core-site.xml 和 hdfs-site.xml
1.修改hadoop的环境配置文件hadoop-env.sh（位于/opt/hadoop-3.3.1/etc/hadoop）
进入/opt/hadoop-3.3.1/etc/hadoop
cd /opt/hadoop-3.3.1/etc/hadoop
vi hadoop-env.sh

2.修改配置文件 core-site.xml(路径/opt/hadoop-3.3.1/etc/hadoop)
vi core-site.xml
#修改配置文件 core-site.xml

hadoop.tmp.dir
/opt/data/hadoop/tmp

fs.defaultFS
hdfs://hadoop2:9000

3.修改配置文件 hdfs-site.xml（路径/opt/hadoop-3.3.1/etc/hadoop）
#修改配置文件 hdfs-site.xml,
#搭建集群后，hadoop本身自带了一个webUI访问页面

dfs.replication
1

4.格式化namenode节点的hdfs文件系统
hdfs namenode -format

5.开启 NameNode 和 DataNode 进程
进入/opt/hadoop-3.3.1/sbin目录
输入
./start-dfs.sh

如果出现
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
vim /hadoop-env.sh
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
vim /hadoop-env.sh

进入/opt/hadoop-3.3.1/sbin然后重新启动
cd /opt/hadoop-3.3.1/sbin
./start-dfs.sh

启动完成后，输入jps查看hadoop运行的进程
jps

启动完成后，在浏览器输入hadoop的网址
http://192.168.43.89:9870/

6.操作集群
(1)在hdfs分布式文件系统上创建目录input
hdfs dfs -mkdir /input

(2)上传要统计的文件到input目录下
上传/opt/input/test.txt文件到hdfs的input目录下（云端的input）
hdfs dfs -put /opt/input/test.txt /input

(3)查看是否上传到hdfs上的input目录
hdfs dfs -ls /input

(4)运行mapReduce程序，对上传的test.txt文件进行词频统计
进入/opt/hadoop-3.3.1/share/hadoop/mapreduce
cd /opt/hadoop-3.3.1/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-3.3.1.jar wordcount /input/test.txt output

test.txt文件内容如下：
CN china
Us u.s
EN English
hollo daxiong
hello yase
good hello
hello yes
yes ok

进入output目录查看统计结果
output目录在hdfs文件系统的/user/root/output
hdfs dfs -ls /user/root/output
查看hdfs系统下的文件内容：
hdfs dfs -cat /user/root/output/part-r-00000

3.3.启动yarn(伪分布式也可以不启动)
3.3.1.YARN 来负责资源管理与任务调度
3.3.2.yarn资源的配置
(1)修改配置文件 mapred-site.xml：
#修改内容如下

mapreduce.framework.name
yarn

mapreduce.application.classpath
$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$ HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*

(2)修改配置文件 yarn-site.xml：

yarn.nodemanager.aux-services
mapreduce_shuffle

yarn.nodemanager.env-whitelist
JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME

PS：这里不需要配置resourceManagerName.hostname。
(3)启动yarn
启动前必须保证NameNode和DataNode已经启动：

启动YARN

./sbin/start-yarn.sh

停止

./sbin/stop-yarn.sh

这里可以访问 ResourceManager - http://192.168.43.89:8088/

关闭:
[root@hdp01 sbin]# ./stop-dfs.sh
Stopping namenodes on [hdp01]
上一次登录：四 8月 31 11:28:53 CST 2023pts/0 上
Stopping datanodes
上一次登录：四 8月 31 11:34:22 CST 2023pts/0 上
Stopping secondary namenodes [hdp01]
上一次登录：四 8月 31 11:34:24 CST 2023pts/0 上
2023-08-31 11:34:31,315 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
[root@hdp01 sbin]# ./stop-yarn.sh
Stopping nodemanagers
上一次登录：四 8月 31 11:34:27 CST 2023pts/0 上
Stopping resourcemanager
上一次登录：四 8月 31 11:34:47 CST 2023pts/0 上
[root@hdp01 sbin]# jps
6779 Jps
[root@hdp01 sbin]#
使用start-all.sh启用所有的服务
./start-all.sh

关闭所有的服务
./stop-all.sh

(4)开启历史服务器和关闭
mapred --daemon start historyserver

在Web中查看任务运行情况 - http://192.168.43.89:19888/

./sbin/stop-all.sh

历史进程要单独关闭

mapred --daemon stop historyserver

四.分布式文件系统HDFS
4.1.HDFS介绍
HDFS - 分布式文件系统，适于运行在行业标准硬件上的分布式文件系统；
基于HDFS的存储中间组件为Hbase和Hive

4.2.HDFS的特点
HDFS是一个高度容错性的系统
能提供高吞吐量的数据访问
非常适合大规模数据集上的应用
优化分布式处理
可扩展性强
4.3.HDFS的优缺点
4.3.1.优点：
高容错性，数据自动保存多个副本，副本丢失后，自动恢复
适合批处理，移动计算而非数据，数据位置暴露给计算框架
适合大数据处理，GB、TB、甚至PB级数据，百万规模以上的文件数量
流式文件访问，支持高持续吞吐量而不是低延迟
可构建在廉价机器上，通过多副本提高可靠性，提供了容错和恢复机制
4.3.2.缺点：
低延迟与高吞吐率的数据访问，比如毫秒级
大量小文件存取
文件随机修改

4.4.HDFS 文件拆分
当一个文件加入到HDFS时，它被拆分成一个或多个块
与本地文件系统类似，但块的尺寸要大得多
默认块大小（128MB）

4.4.1.HDFS 复制因子
块被复制到集群的不同节点上
基于复制因子（默认为3）
复制提高了可靠性和性能
可靠性：数据可以容忍除最后一个副本以外的所有损失
性能：更多的数据本地化机会

4.4.2.非高可用HDFS
非高可用HDFS，有三个守护进程
NameNode(master)
SecondaryNamenode(master)
DataNode(worker)

4.4.3.NameNode
NameNode 在内存中管理元数据(metadata)
HDFS中文件位置信息
文件所有权和权限信息
每个数据块的名字
数据块位置
当NameNode 启动时从磁盘上读取存储的元数据
文件名字是fsimage
注意：块位置信息不存储在fsimage 上（在NameNode 内存中）
对元数据的修改存储在内存中
修改会被写入到一个edits 日志中
4.4.4.DataNode
文件的实际数据按块存储在工作节点上
每个工作节点运行一个DataNode 守护进程
控制数据块存取
与NameNode 通信
块以简单的文件形式存储在工作节点的底层文件系统内
名字为blk_xxxxxxx
工作节点不提供任何关于块所属的底层文件的信息
这些信息仅存储在NameNode节点的元数据中
为了冗余每个数据块存储在多个不同的工作节点中
默认三个副本
4.4.5.SecondaryNameNode
SecondaryNamenode 不是NameNode的备份
它为NameNode 提供了内存级元数据管理功能
NameNode 将元数据保存在内存中
NameNode 将元数据的改变写入edit日志中
SecondaryNameNode 周期性的将前快照与当前文件系统元数据和edits日志文件合并为一个新的快照
新快照传输回NameNode
在大型集群中SecondaryNameNode 应该运行在一台独立的机器上
它需要与NameNode 相同的内存
SecondaryNameNode 仅存在于非高可用集群中

4.5.6.文件系统元数据快照和Edit 日志
fsimage 文件包含一个文件系统元数据的快照
并非每次写入操作都会更新
HDFS 写操作被记录在NameNode 的edits 日志中
NameNode 内存中的元数据也会更新
当NameNode 重启时从edits 文件中获取所有的改变会消耗大量时间
edits 文件可以变得十分巨大
4.5.7.元数据检查点checkpoint
SecondaryNameNode 使用以下方法定期创建检查点
压缩edits日志中的信息
将它与最新的fsimage文件合并
清空edits日志
好处：更快的重启NameNode
NameNode可以使用最新的检查点并应用较小的edits日志的内容

创建检查点的基本流程如下图：

4.4.Hdfs高可用性
NameNode 在非高可用HDFS 有单点故障问题
NameNode 节点宕机，HDFS 不可用
NameNode 导致HDFS 下线的2种情况
NameNode意外崩溃（极少）
计划NameNode的维护（通常）
高可用HDFS 集群里的新进程
NameNode (active)
NameNode (standby)
Failover Controllers
Journal Nodes
在HDFS的 HA 配置里不再使用SecondaryNameNode

4.5.HDFS 高可用架构
HDFS 高可用使用2个NameNode
一个Active 和一个Standby
客户端只与Active NameNode 联系
DataNode 往2个NameNode 发送心跳
Active NameNode 往JournalNode 里写edits
Standby NameNode 读取JournalNode 以保持和Active Namenode 的同步

4.6.HDFS 高可用元数据同步
Active NameNode 往JournalNode 里写edits
通过Quorum Journal Manger（QJM）软件实现
NameNode 内置的服务
等待大部分的JournalNode 返回成功应答
多数提交意味着一个JornalNode 挂掉或延迟,不会影响NameNode延迟
使用Paxos 算法来保证可靠性，即便有个别写edits 的JournalNode 失效
注意：HDFS 的高可用集群是没有Secondary NameNode
Standby NameNode 会周期性地执行checkpointing

4.7.HDFS 高可用故障切换
任何时候只有一个NameNode 是active
另外一个是standby
Standby的NameNode 维护着active的NameNode 的一个备份
所以当active的NameNode 宕掉，standby NameNode 可以直接接管
2种类型的故障切换
手动（由用户来检测和启动）
自动（由HDFS 自己来检测和启动）
使用ZooKeeper 做自动故障转移
配置HA，有个叫ZooKeeper 故障控制(ZKFC)运行在每个NameNode 节点上

4.8.Observer NameNode
观察者的NameNode

Hadoop 3版本的HA模式中支持不止一个的Standby NameNode，但这种HA方案仍然存在一个缺点，活动NameNode可能是一个瓶颈，并且会因客户端请求而过载，特别是在繁忙的集群中；
在3版本中，通过引入Observer NameNode来解决此问题，Observer NameNode可以保持最新的命名空间及块信息，同时能提供一致性读的能力，有助于负载均衡NameNode的流量以提高总体吞吐量。
4.9.HDFS操作指令
4.9.1.通用语法
hdfs dfs -指令 hdfs文件路径

4.9.2.hdfs文件操作常用的指令

当使用HDFS 的时候，hdfs dfs指令实际上是hadoop fs指令的同义词；
HDFS命令格式：hdfs dfs等同hadoop fs
HDFS大部分命令与Linux系统指令习惯十分相似；
需要注意的是HDFS 指令参数不能连写，比如-d -R 不能写成-dR ；
与Linux系统指令类似的指令有：
cat、cp、mv、rm、ls、chmod、chown、tail、du、mkdir

1.上传文件
hdfs dfs -put 本地文件路径 hdfs的文件路径

例如：
将news.txt文件上传到hdfs的input目录下:
news.txt文件存放至/opt/input目录下
hdfs dfs -put news.txt /input/

2.查看hdfs下的文件列表
hdfs dfs -ls hdfs文件路径
示例：
查看hdfs的input目录下的所有文件
hdfs dfs -ls /input

3.在hdfs上创建文件夹（包含子目录）
指令:hdfs dfs -mkdir -p /路径名称/子目录…
示例:
在hdfs系统创建文件夹/jianda/input
指令: hdfs dfs -mkdir -p /jianda/input

4.删除目录
指令:hdfs dfs -rmdir [–ignore-fail-on-non-empty] hdfs目录
示例:
hdfs dfs -rmdir /jianda 只能逐级删除，并且目录不能有其他文件或者文件夹存在
hdfs dfs -rm -f -r /input

查看

5.查看上传到hdfs下的文件内容
指令:hdfs dfs -cat hdfs文件路径
示例:
hdfs dfs -cat /user/zker/chapter02/practice/news.txt

6.从hdfs下载文件到本地文件夹
指令:hdfs dfs -get hdfs的源文件路径本地目标文件路径

示例:
将/user/zker/chapter02/practice/news.txt复制到本地的tmp目录下
hdfs dfs -get /user/zker/chapter02/practice/news.txt /tmp

7.向hdfs文件末尾添加新的内容
hdfs dfs -appendToFile 本地源文件 hdfs上的目标文件
示例:
stu.txt
1,张飞,男,西安,信管1班,90
2,马超,男,西安,信管1班,89
3,关羽,男,西安,信管1班,78
4,赵云,男,西安,信管1班,87
5,黄忠,男,西安,信管1班,91
hdfs dfs -put stu.txt /jiada

stu1.txt
6,曹操,男,洛阳,信管2班,89
7,孙权,男,南京,信管1班,91
8,刘备,男,成都,信管2班,85
9,诸葛亮,男,南阳,信管2班,100

hdfs dfs -appendToFile stu1.txt /jianda/stu.txt

8.使用count指令统计指定目录下的文件数量
指令:hdfs dfs -count hdfs的目录
示例
hdfs dfs -count /jianda

hdfs dfs -count -h -v /jianda

HDFS文件读写的解析

课堂练习
使用hdfs指令实现目录创建、删除、文件删除等操作。

4.10.使用javaAPI访问HDFS系统
4.10.1.在windows中配置hadoop3.3.1环境变量
首先解压缩hadoop-3.3.1-aarch64.tar.gz，拷贝hadoop3.3.1目录到d:\hadoop目录(没有，先创建)下

4.10.2.在windows10的环境变量中配置hadoop的环境变量
HADOOP_HOME=D:\hadoop\hadoop-3.3.1
PATH=原来的路径…;D:\hadoop\hadoop-3.3.1\bin

4.10.3.修改windows\system32\drivers\etc\hosts.ics文件，加入要访问的linux主机的ip地址和主机名称（设置hosts.ics文件管理员的所有权限）

4.10.4.打开idea工具，新建maven项目，并引入hdfs需要的组件依赖库

点击finishi创建完成。

4.10.5.在项目的pom.xml文件，引入依赖组件,打开pom.xml文件，加入需要的组件

<?xml version="1.0" encoding="UTF-8"?>

4.0.0

<groupId>com.jianda</groupId>
<artifactId>prjhdfsapidemo</artifactId>
<version>ver1.0</version>
<dependencies>
    <!--导入日志依赖-->
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <!-- 导入hadoop的组件库支持 -->
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.3.1</version>
    </dependency>
</dependencies>

注意：hadoop的版本必须匹配

4.10.6.配置日志文件log4j.properties日志
log4j.properties文件内容如下：
#All level less than INFO will be logged
log4j.rootLogger=info, stdout

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %m%n

4.10.7.javaAPI访问hdfs的基本步骤
1.创建HDFS配置对象Configuration，hadoop会自动加载配置文件信息（core-site.xml、hdfs-site.xml）
Configuration cfg=new Configuration();

2.使用Configuration 构建hadoop的文件系统对象FileSystem对象
FileSytem fileSystem=FileSystem.get(cfg);

3.创建要操作的路径对象Path对象
Path path=new Path(“要操作的hdfs路径hdfs://hdp01:9000/input”);

4.10.8.示例操作
1.使用javaAPI在hdfs上创建文件夹
package com.jianda;
import java.io.;
import java.util.;
import java.net.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class HdfsDemo1 {
public static void main(String[] args) {
try {
Configuration conf=new Configuration();
FileSystem fileSystem=FileSystem.get(new URI(“hdfs://hdp01:9000”),conf,“root”);
//创建要操作的路径
Path path=new Path(“/input”);//在hdfs下创建input目录
//使用fileSystem对象创建目录
fileSystem.mkdirs(path);
System.out.println(“目录创建成功！”);
fileSystem.close();
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (URISyntaxException e) {
e.printStackTrace();
}
}
}

2.上传本地文件到hdfs指定的目录
package com.jianda;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class UploadLocalFileToHdfs {
public static void main(String[] args) {
try {
//创建配置对象，加载hdfs-site.xml和core-site.xml
Configuration conf=new Configuration();
//创建文件系统对象FileSystem对象，建立与HDFS系统的链接
FileSystem fileSystem=FileSystem.get(new URI(“hdfs://hdp01:9000”),conf,“root”);

        //获取本地文件路径,原文件路径
        Path srcPath=new Path("e:\\test.txt");
        //创建要上传的目标文件路径
        Path destPath=new Path("/input/");
        System.out.println("开始上传文件到hdfs的input目录下……");
        //使用fileSystem对象开始上传
        fileSystem.copyFromLocalFile(srcPath,destPath);
        System.out.println("上传成功......");
        System.out.println("显示上传的文件列表:");
        FileStatus[] fileStatuses=fileSystem.listStatus(destPath);
        for (FileStatus fileStatus : fileStatuses) {
            System.out.println(fileStatus.getPath().getName());
        }


        fileSystem.close();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        e.printStackTrace();
    } catch (URISyntaxException e) {
        e.printStackTrace();
    }
}

}

3.下载文件到本地操作系统指定的目录
package com.jianda;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class DownloadHDFSFileToLocal {
public static void main(String[] args) {
try {
//创建配置对象
Configuration cfg=new Configuration();
//创建文件系统对象
URI uri=new URI(“hdfs://hdp01:9000”);
FileSystem fileSystem=FileSystem.get(uri,cfg,“root”);

        //创建源Path路径，要下载的文件路径
        Path srcPath=new Path("/jianda/stu.txt");
        //创建目标文件对象，指定下载文件的存放位置
        Path destPath=new Path("d:\\stu.txt");
        System.out.println("准备下载文件到本地.......");
        fileSystem.copyToLocalFile(srcPath,destPath);
        System.out.println("下载成功，请在d:\\查看!");
        fileSystem.close();
    } catch (URISyntaxException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

}

}

4.删除HDFS上的文件夹
package com.jianda;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class RmHdfsDirDemo {
public static void main(String[] args) {
try {
//创建配置对象
Configuration cfg=new Configuration();
//创建文件系统对象
URI uri=new URI(“hdfs://hdp01:9000”);
FileSystem fileSystem=FileSystem.get(uri,cfg,“root”);

        //创建Path路径，指定要删除的HDFS目录
        Path path=new Path("/input");
        //删除
        System.out.println("正在删除目录......");
        fileSystem.delete(path,true);
        System.out.println("删除成功......");
        fileSystem.close();
    } catch (URISyntaxException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

}

}

4.11.使用springboot访问HDFS文件系统
4.11.1.创建springboot项目，注意版本springboot2.x版本

点击下一步，点击finish完成
重新修改springboot版本为2.3.1.RELEASE

项目的pom.xml文件如下：

<?xml version="1.0" encoding="UTF-8"?>

4.0.0

org.springframework.boot
spring-boot-starter-parent
2.3.1.RELEASE

com.jianda
prjspringboothdfs
ver1.2
prjspringboothdfs
Demo project for Spring Boot

<java.version>1.8</java.version>

org.springframework.boot
spring-boot-starter-web

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <!-- 引入hdfs的组件库 -->
    <!--导入日志依赖-->
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <!-- 导入hadoop的组件库支持 -->
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.3.1</version>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <configuration>
                <excludes>
                    <exclude>
                        <groupId>org.projectlombok</groupId>
                        <artifactId>lombok</artifactId>
                    </exclude>
                </excludes>
            </configuration>
        </plugin>
    </plugins>
</build>

4.11.2.创建业务类，实现管理hdfs文件系统的文件
HdfsService
package com.jianda.service;

import java.util.List;

public interface HdfsService {
/**
* 向hdfs上传文件
* 参数为上传的文件路径以及文件名称
* /
public String add(String path);
/*
* 对hdfs上传的文件进行列表
* 只显示input下上传的文件
* /
public List findHdfsFileAll();
/*
* 删除hdfs下input目录中的文件
* @param filename 要删除的文件名称
* */
public String delHdfsFile(String filename);
}

业务实现HdfsServiceImp
package com.jianda.service;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
@Service
public class HdfsServiceImp implements HdfsService {
/**
* 获取FileSystem的方法
* */
public FileSystem getFileSysTem(){
//创建配置对象
Configuration conf=new Configuration();
try {
//创建文件系统对象
URI uri=new URI(“hdfs://hdp01:9000”);
FileSystem fileSystem=FileSystem.get(uri,conf,“root”);
return fileSystem;
} catch (URISyntaxException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}

    return null;
}
@Override
public String add(String path) {
    FileSystem fileSystem=getFileSysTem();
    //创建源Path对象
    Path srcPath=new Path(path);
    //创建目标路径HDFS
    Path destPath=new Path("/input/");
    try {
        //开始上传
        fileSystem.copyFromLocalFile(srcPath,destPath);
        fileSystem.close();
        return "1";
    } catch (IOException e) {
        e.printStackTrace();
    }

    return "0";
}

@Override
public List<String> findHdfsFileAll() {
    List<String> lsfname=new ArrayList<>();

    FileSystem fileSystem=getFileSysTem();

    try {
        FileStatus[] fileStatuses=fileSystem.listStatus(new Path("/input"));
        for (FileStatus fileStatus : fileStatuses) {
            lsfname.add(fileStatus.getPath().getName());

        }
        return lsfname;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

@Override
public String delHdfsFile(String filename) {
    //获取文件系统对象
    FileSystem fileSystem=getFileSysTem();
    //获取要删除的文件路径
    Path delPath=new Path("/input/"+filename);
    try {
        //开始删除
        fileSystem.delete(delPath,true);
        return "1";
    } catch (IOException e) {
        e.printStackTrace();
    }
    return "0";
}

}

4.11.3.编写控制器组件
HdfsFileContraller
package com.jianda.contraller;

import com.jianda.entity.FileEntity;

import java.util.List;

public interface HdfsFileContraller {
public String save(FileEntity fileEntity);
public List findAll();
public String delByFielName(String fielName);
}

实现HdfsFileContrallerImp
package com.jianda.contraller;

import com.jianda.entity.FileEntity;
import com.jianda.service.HdfsService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
@RestController//注解当前类为控制器类
public class HdfsFileContrallerImp implements HdfsFileContraller {
@Autowired
private HdfsService hdfsService;

public HdfsService getHdfsService() {
    return hdfsService;
}

public void setHdfsService(HdfsService hdfsService) {
    this.hdfsService = hdfsService;
}

@Override
@RequestMapping(value = "save_FileEntity.do")
public String save(FileEntity fileEntity) {
    //判断是否存在上传文件
    if(fileEntity!=null){
        //获取上传的文件对象
        MultipartFile uploadfile=fileEntity.getFile();
        if(uploadfile!=null&&!uploadfile.isEmpty()){
            //获取文件名称
            String filename=uploadfile.getOriginalFilename();
            System.out.println("上传的文件名称为:"+filename);
            //将上传的文件存放到E:\tmp\input
            File destfile=new File("E:\\tmp\\input\\"+filename);
            try {
                uploadfile.transferTo(destfile);//存储上传的文件到E:\tmp\input,上传到web服务器
                System.out.println("文件已保存到E:\\tmp\\input!");
                String rstcode=hdfsService.add("E:\\tmp\\input\\"+filename);//将上传的文件继续传递给HDFS文件系统
                return rstcode;
            } catch (IOException e) {
                e.printStackTrace();
            }

        }
    }
    return "0";
}

@Override
@RequestMapping(value = "findAll_FileEntity.do")
public List<String> findAll() {
    List<String> lsstr=hdfsService.findHdfsFileAll();
    return lsstr;
}

@Override
@RequestMapping(value = "delByFielName_FileEntity.do")
public String delByFielName(String fielName) {
    System.out.println("要删除的文件名为:"+fielName);
    String rstcode=hdfsService.delHdfsFile(fielName);
    return rstcode;
}

}

在static目录下引入VUE组件和jQuery组件，准备上传文件列表index.html

hdfs文件管理

hdfs文件列表

文件名	操作
{{fname}}

index.js
var mainvie=new Vue({
el:‘#maindiv’,//要操作的模板元素
data:{
lsstr:[],
file:‘’,
},
created:function(){
//初始化方法
this.findAll();

},
methods:{
    getFile(event) {
        this.file = event.target.files[0];
        console.log(this.file);
        //==========================================
    },
    save(){
        let formData = new FormData();
        let config = {
            headers: {
                'Content-Type': 'multipart/form-data'
            }
        }
        if(this.file!=''&&this.file!=undefined){
            formData.append('file', this.file);
            //开始上传文件
            this.$http.post('save_FileEntity.do',formData,config).then(res=>{
                console.log(res.data)
                if(res.data=='1'){
                    alert('上传成功！');
                    this.findAll();
                }else{
                    alert('上传失败！')
                }
            });
        }else{
            alert("没有选择任务和文件，不能上传！");
        }

    },
    findAll(){
        this.$http.get('findAll_FileEntity.do?t='+new Date().getTime()).then(res=>{
            console.log(res.body);//输出服务器控制器返回的文件列表名称
            this.lsstr=res.body;
        });
    },
    dodel(fname){
        //alert(fname);
        if(confirm("是否真的删除文件"+fname)){
            this.$http.get('delByFielName_FileEntity.do?fielName='+fname).then(res=>{
                console.log(res.data);
                if(res.data=='1'){
                    alert('删除成功！');
                    this.findAll();
                }else{
                    alert('删除失败！');
                }
            });
        }
    },
}

});

五．分布式计算框架MapReduce
5.1.MapReduce概述
5.1.1.Hadoop的分布式的计算框架
即不是平台也不是语言
面向键值对（key 和 value）的数据处理
便于跨多个节点分布任务
每个节点尽可能处理存储在自身节点上的数据（数据本地化）；
MapReduce 会将整个将job（作业）拆分成多个task（任务），并将多个task（任务）发送给多个节点去各自计算；

5.1.2.MapReduce 模型
MapReduce 将计算主要分为两个阶段，Map阶段和Reduce阶段，Map阶段将输入内容映射为中间结果，Reduce阶段对中间结果进行归约计算，得到最终结果。

5.1.3.Mapper阶段的任务
Maps将输入记录转换为中间记录，且中间记录不需要与输入记录类型相同，一个输入key/value对可能被映射为0或多个输出key/value对；
MapReduce框架为每一个InputSplit 生成一个map task；
key相同的中间结果都会被框架分组，然后发送给Reducer来生成最终结果；
Mapper的输出会被排序然后进行分区；
可以指定一个combiner来执行中间输出的局部聚合，这个可以帮助减少从Mapper传输到Reducer的数据量；

5.1.4.Reduce 阶段
Reducer
Reducer将相同key的一组中间值归纳为一组更小的值
Reducer包含3个主要阶段：shuffle，sort和reduce
Reducer的输入是排序好的mapper的输出，在Shuffle阶段框架会提取所有mappers输出的对应分区内数据
在Sort阶段（其实也是合并阶段），框架会按keys将Reducer的输入数据进行分组排序，Shuffle和Sort阶段是同时发生的，当map的输出被提取时，他们已经合并了
在Reduce阶段，reduce方法会被输入（已分组完成）中每个键值对（<key, (list of values)>）所调用。reduce task的输出通常被写入FileSystem。Reducer的输出是没有进行排序的

5.2.MapreDuce运行流程

5.3.MapReduce工作原理

5.3.1.Shuffle 介绍
Shuffle是mapreduce 的核心，map 端主要负责过滤映射，reduce 端主要负责归约整理，从map 输出到reduce 的输入的这个过程称为shuffle 过程。
shuffle 过程分为map 端的shuffle 过程和reduce 端的shuffle 过程

5.3.2.Map 端的shuffle
map的处理结果首先存放在一个缓冲区，缓冲区的内存是100M，是map存放结果的地方。如果数据量较大，超过了一定的量（默认80M），将会发生溢写过程。
溢写过程包括：分区partitioner，排序sort，溢写spill to disk，合并merge。
分区分的是80%的内存中的数据，将map的输出结果放入的对应的reduce分区中。
默认是按照key排序。当分区完成之后，对每一个分区的数据进行排序。
排序之后，将内存的数据写入硬盘。留出内存方便map的新的输出结果。
将各个分区合并之后，对每一个分区的数据再进行一次排序，形成一个文件。

5.3.3.Reduce 端的shuffle
当reduce开启任务后，不断的在各个节点复制需要的数据。
复制数据的时候，先将数据放进内存中，当达到一定的时候，启动merge，将数据写进硬盘。如果map数据大于内存限制，直接写入硬盘，当达到一定的数量后将其合并为一个文件。
将相同的key的数据进行分组，放在一起。
5.4.Reducer调度和输入获取
mapreduce.job.reduce.slowstart.completedmaps
完成的Map 任务数占比，达到后，ResourceManger才调度集群上的Reducer
默认：0.05（5%）
推荐值：0.8（80%)

mapreduce.reduce.shuffle.parallelcopies
一个Reducer可以并行去Mapper输出端取数据的线程数
默认：5
推荐值：集群节点数*4，最低10

5.5.推测执行
如果一个MapReduce的任务远慢于该作业任务的平均完成速度，推测执行可能会发生
在不同的节点上会启动相同的任务,采用第一个完成任务的结果，较慢的任务被干掉
mapreduce.map.speculative
是否允许Map任务推测执行
默认：true, 推荐：false
mapreduce.reduce.speculative
是否允许Reduce任务推测执行
默认：true, 推荐：false
不启用推测执行机制情况
任务间存在严重的负载倾斜；
特殊任务，比如任务向数据库中写数据。

5.6.MapReduce应用开发
5.6.1.词频统计
大量的数据的单词计数是极具挑战的；
使用一个计算节点是很耗时的
统计数据往往仅需简单的聚合操作；
MapReduce将复杂的任务分解为更小的元素来支持并行执行；
很多常见的任务与单词计数非常类似；
比如日志分析，流量统计等

Maper阶段的代码
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);//准备为Reduce计算提供数据
}
}
}

5.6.2.Map阶段的详解
map 阶段负责接收输入信息，默认文本文件读取格式为每行对应一个键值对，key为偏移量，value为本行内容，这里我们称之为k1/v1，对应map(Object key, Text value, Context context)中的Object key，Text value
StringTokenizer itr = new StringTokenizer(value.toString());
读取要统计的文件内容，并转换器为字符串迭代进行处理。
在map阶段，使用StringTokenizer将v1 按分隔符进行分割，默认分隔符为“空格”、“制表符(‘\t’)”、“换行符(‘\n’)”、“回车符(‘\r’)”

map阶段解读（2）

while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}

while循环读取拆分内容，itr.hasMoreTokens()判断是否还有分割符，itr.nextToken()返回从当前位置到下一个分隔符的字符串
context为map阶段输出的内容，需要对应reduce函数中输入内容格式，这里将拆分的单词作为输出的key，将one（1）作为输出的value，这里我们称之为k2/v2

5.6.3.suffle阶段解读
suffle阶段，期间对数据进行分组和排序，将单词作为key，将单词后的1迭代器作为value，这里我们称之为k3/v3

5.6.4.reduce阶段
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();//计算单词出现的次数
}
result.set(sum);
context.write(key, result);//写出合并的结果
}
}

reduce阶段解读
reduce(Text key, Iterable values,Context context)，其中Text key, Iterable values对应k3/v3
输出结果中key不变仍然是拆分开的单词，对值values进行sum+=的求和运算，以每个单词的结果作为value，这里我们称之为k4/v4

5.6.5.main()方法中代码主要用于创建MapReduce作业；
// 创建配置对象，获取job实例
Configuration conf = new Configuration();
Job job = new Job(conf, “word count”);
// 指定job包及使用的mapper/Reducer类
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);//读文件拆分
job.setReducerClass(IntSumReducer.class);//计算

// 指定输出数据的kv类型及输入输出路径
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));//文件输入的路径，要分析的文件
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));//文件输出的路径,结果文件

System.exit(job.waitForCompletion(true) ? 0 : 1);

5.7.使用Mapreduce实现词频统计
5.7.1.准备文件内容
wd.txt内容如下
hello world
hello hadoop
hello hadoop mapreduce
hadoop good
Ok hadoop
hadoop ok
mareduce hadoop
5.7.2.创建maven项目，引入以下的组件库支持

<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-core</artifactId>
  <version>2.8.2</version>
</dependency>
<!-- 导入hadoop的组件库支持 -->
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.47</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>3.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>3.3.1</version>
</dependency>

5.7.3.编写Mapper类，准备读取资源文件的内容，并进行分发

package com.mapper;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.StringTokenizer;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);//one[1,1,1,1,1],key--word的字符串信息
    }

}

}

5.7.4.编写Reduce统计类，对单词数量进行统计
package com.reduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class IntSumReducer extends Reducer<Text, IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
@Override
protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException {
//从Mapper接收过来的s数据集合为:{“hello”,[1,1,1,1,1,1]}
int sum = 0;
for (IntWritable val : values) {
sum += val.get();//将[1,1,1,1…]数组中的数字累加
}
result.set(sum);
context.write(key, result);
}
}

5.7.5.编写Job主类，设置配置信息并统计文件
package com.job;

import com.mapper.TokenizerMapper;
import com.reduce.IntSumReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordContJob {
public static void main(String[] args) {
System.out.println( “开始进行统计分析，请稍后…” );
//判断命令行是否有足够的参数
if(args!=null&&args.length<2){
System.out.println(“参数不足，【第一个为数据集文件，第2个为输出路径】”);
System.exit(0);
}
try {
// 创建配置对象，获取job实例
Configuration conf = new Configuration();
Job job = new Job(conf, “word count”);
// 指定job包及使用的mapper/Reducer类
job.setJarByClass(WordContJob.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);

        // 指定输出数据的kv类型及输入输出路径
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));//输入数据集文件
        FileOutputFormat.setOutputPath(job, new Path(args[1]));//输出文件的路径

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }

}

}

5.7.6.对项目打包，打包为jar文件，准备上传到hadoop平台运行

5.7.7.上传jar文件到centeos服务器,并启动hadoop
在/opt下创建myjob目录
mkdir myjob
cd myjob
上传prjmapreducedemo-ver1.0.jar到myjob目录下

vi word.txt内容如下

hello world
hello hadoop
hello hadoop mapreduce
hadoop good
Ok hadoop
hadoop ok
mareduce hadoop

5.7.8.上传文件到hdfs系统
hdfs dfs -put word.txt /input
hdfs dfs -ls /input

运行的一般语法为:
hadoop jar xxx.jar 包名.主类名 hdfs://hdp01:9000/数据文件所在路径 hdfs://hdp01:9000/输出路径

示例如下：
hadoop jar prjmapreducedemo-ver1.0.jar com.job.WordContJob hdfs://hdp01:9000/input/word.txt hdfs://hdp01:9000/wd_output
运行结果如下：
开始进行统计分析，请稍后…
2023-09-04 16:59:24,780 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2023-09-04 16:59:25,667 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2023-09-04 16:59:26,249 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2023-09-04 16:59:26,278 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1693814764510_0001
2023-09-04 16:59:26,655 INFO input.FileInputFormat: Total input files to process : 1
2023-09-04 16:59:26,797 INFO mapreduce.JobSubmitter: number of splits:1
2023-09-04 16:59:27,818 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1693814764510_0001
2023-09-04 16:59:27,819 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-04 16:59:28,104 INFO conf.Configuration: resource-types.xml not found
2023-09-04 16:59:28,104 INFO resource.ResourceUtils: Unable to find ‘resource-types.xml’.
2023-09-04 16:59:28,743 INFO impl.YarnClientImpl: Submitted application application_1693814764510_0001
2023-09-04 16:59:28,802 INFO mapreduce.Job: The url to track the job: http://hdp01:8088/proxy/application_1693814764510_0001/
2023-09-04 16:59:28,803 INFO mapreduce.Job: Running job: job_1693814764510_0001
2023-09-04 16:59:38,020 INFO mapreduce.Job: Job job_1693814764510_0001 running in uber mode : false
2023-09-04 16:59:38,021 INFO mapreduce.Job: map 0% reduce 0%
2023-09-04 16:59:43,122 INFO mapreduce.Job: map 100% reduce 0%
2023-09-04 16:59:52,188 INFO mapreduce.Job: map 100% reduce 100%
2023-09-04 16:59:52,203 INFO mapreduce.Job: Job job_1693814764510_0001 completed successfully
2023-09-04 16:59:52,339 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=193
FILE: Number of bytes written=544509
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=195
HDFS: Number of bytes written=66
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2933
Total time spent by all reduces in occupied slots (ms)=5279
Total time spent by all map tasks (ms)=2933
Total time spent by all reduce tasks (ms)=5279
Total vcore-milliseconds taken by all map tasks=2933
Total vcore-milliseconds taken by all reduce tasks=5279
Total megabyte-milliseconds taken by all map tasks=3003392
Total megabyte-milliseconds taken by all reduce tasks=5405696
Map-Reduce Framework
Map input records=7
Map output records=15
Map output bytes=157
Map output materialized bytes=193
Input split bytes=97
Combine input records=0
Combine output records=0
Reduce input groups=8
Reduce shuffle bytes=193
Reduce input records=15
Reduce output records=8
Spilled Records=30
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=94
CPU time spent (ms)=1380
Physical memory (bytes) snapshot=479989760
Virtual memory (bytes) snapshot=5170192384
Total committed heap usage (bytes)=398458880
Peak Map Physical memory (bytes)=279764992
Peak Map Virtual memory (bytes)=2581463040
Peak Reduce Physical memory (bytes)=200224768
Peak Reduce Virtual memory (bytes)=2588729344
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=98
File Output Format Counters
Bytes Written=66

查看统计的结果

5.8.MapReduce内置数据类型

5.8.1.MapReduce对于输入数据和输出数据，不允许使用Java基本数据类型，为此MapReduce提供了多种内置数据类型。
下面是MapReduce提供的主要数据类型类：
BooleanWritable — 布尔型数值
ByteWritable — 字节型数值
DoubleWritable — 双精度型数值
FloatWritable — 浮点型数值
IntWritable — 整型数值
LongWritable — 长整型数值
NullWritable — 空值，当Key或Value为空时使用
Text — UTF8文本字符串

5.8.2.数据类型类的继承关系

5.9.自定义数据类型
5.9.1.如果现有内置数据类型不能满足应用程序需要，还可以自定义数据类型类。
自定义数据类型类应符合以下要求：
实现WritableComparable接口，同时实现接口中以下方法：
readFields方法
write方法
compareTo方法
可根据需要声明字段，并提供对应的get/set方法
可根据需要覆盖toString、equals等对象管理方法
字段、构造器、方法的声明应符合JavaBean规范

5.9.2.自定义类的示例
public class ScoreWritable implements WritableComparable {
private long sum;
private int num;

public ScoreWritable() { }

public ScoreWritable(long sum, int num) {
    this.sum = sum;
    this.num = num;
}

public long getSum() {
    return sum;
}

public void setSum(long sum) {
    this.sum = sum;
}

public int getNum() {
return num;
}

public void setNum(int num) {
    this.num = num;
}

@Override
public void readFields(DataInput di) throws IOException {
    sum = di.readLong();
    num = di.readInt();
}

@Override
public void write(DataOutput dto) throws IOException {
    dto.writeLong(sum);
    dto.writeInt(num);
}@Override
public int compareTo(ScoreWritable o) {
    return (sum == o.sum) ? 0 : ((sum < o.sum) ? -1 : 1);
}

@Override
public String toString() {
    return sum + "\t" + num + "\t" + ((double)sum) / num;
}

}

5.9.使用MapReduce自定义类型实现学生成绩统计分析
5.9.1.学生成绩数据的构成
数据中每行为一个学生的考试信息，数据列间隔为,号
数据格式如下：
zr21063001,0103,01,86,72,113
数据含义如下:

分析要求
每个学员有三个成绩，分析出每个考生三个成绩中的最高成绩和总成绩，并输出结果
5.9.2.使用IDEA开发MapReduce的步骤
创建MAVEN项目，导入需要的组件库依赖pom.xml

<?xml version="1.0" encoding="UTF-8"?>

4.0.0

<groupId>com.jianda</groupId>
<artifactId>prjmareducestuscoreanisys</artifactId>
<version>ver1.0</version>

<dependencies>
    <!--导入日志依赖-->
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <!-- 导入hadoop的组件库支持 -->
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>3.3.1</version>
    </dependency>
</dependencies>

创建需要的包

5.9.3.构建自定义的MapReduce实体类型ScoreResultEntity
ScoreResultEntity
package com.entity;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class ScoreResultEntity implements Writable {
private float maxscore;//最高成绩
private float totalscore;//当前学生的总成绩

public ScoreResultEntity() {
}

public ScoreResultEntity(float maxscore, float totalscore) {
    this.maxscore = maxscore;
    this.totalscore = totalscore;
}

public float getMaxscore() {
    return maxscore;
}

public void setMaxscore(float maxscore) {
    this.maxscore = maxscore;
}

public float getTotalscore() {
    return totalscore;
}

public void setTotalscore(float totalscore) {
    this.totalscore = totalscore;
}
/**
 * 序列化写的方法
 * */
public void write(DataOutput dataOutput) throws IOException {
    dataOutput.writeFloat(maxscore);
    dataOutput.writeFloat(totalscore);
}
/**
 * 序列化读的方法
 * */
public void readFields(DataInput dataInput) throws IOException {
    this.maxscore=dataInput.readFloat();
    this.totalscore=dataInput.readFloat();
}

@Override
public String toString() {
    return "ScoreResultEntity{" +
            "maxscore=" + maxscore +
            ", totalscore=" + totalscore +
            '}';
}

}

5.9.4.构建Map类ScoreAnalysisMapper，从文件中获取学生的考试信息
package com.maps;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.*;
public class ScoreAnalysisMapper extends Mapper<LongWritable, Text,Text, FloatWritable> {
private Text stuid=new Text();//考生号
private FloatWritable score=new FloatWritable();//考生成绩
@Override
/**
* key 主键
* value 文本类型值，需要拆分的文本信息
* context 需要设置的数据值
* */
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//super.map(key, value, context);
//使用,号拆分各项数据
String[] datas=value.toString().split(“,”);
//获取考生号
//stuid.set(datas[0]);
if(datas!=null&&datas.length>0&&datas.length==6){
//获取考生号
stuid.set(datas[0]);
//将三个成绩设置到一个考生号中
score.set(Float.parseFloat(datas[3]));//3表示成绩1所在的索引位置
context.write(stuid,score);//将成绩写到context的{zr100000,[87]}
score.set(Float.parseFloat(datas[4]));
context.write(stuid,score);//将成绩写到context的{zr100000,[87,98]}
score.set(Float.parseFloat(datas[5]));
context.write(stuid,score);//一个学号对应三个成绩，类似于(“zr1001”,[89,97,67])这样的效果
}
}
}

5.9.5.ScoreReduce计算分析成绩
package com.reduces;

import com.entity.ScoreResultEntity;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.*;
public class ScoreReduce extends Reducer<Text, FloatWritable,Text, ScoreResultEntity> {
@Override
protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException {
//super.reduce(key, values, context);
float maxscore=0;
float totalscore=0;
//循环统计出最高和总成绩
for(FloatWritable value:values){
float score=value.get();//依次获取三门课的成绩，
totalscore+=score;//统计成绩
if(maxscore<score){//获取总成绩
maxscore=score;
}
}
//封装数据
ScoreResultEntity entity=new ScoreResultEntity(maxscore,totalscore);
//写出结果
context.write(key,entity);
}
}

5.9.6.构建工作类主类，ScoreAnalysisJob
package com.job;
import com.entity.*;

import com.maps.ScoreAnalysisMapper;

import com.reduces.ScoreReduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ScoreAnalysisJob {
public static void main(String[] args) {
//判断命令行是否有足够的参数
if(args!=null&&args.length<2){
System.out.println(“参数不足，【第一个为数据集文件，第2个为输出路径】”);
System.exit(0);
}
try {
//创建配置对象
Configuration conf=new Configuration();
//获取job 对象
Job job=Job.getInstance(conf);

        //设置工作的主类名
        job.setJarByClass(ScoreAnalysisJob.class);
        //设置mapper类
        job.setMapperClass(ScoreAnalysisMapper.class);
        //设置分析类Reduce
        job.setReducerClass(ScoreReduce.class);

        //设置mapper输出的主键类型
        job.setMapOutputKeyClass(Text.class);
        //设置map输出的值类型
        job.setMapOutputValueClass(FloatWritable.class);

        //设置最终显示的主键类型
        job.setOutputKeyClass(Text.class);
        //设置最终输出的值类型ScoreResultEntity
        job.setOutputValueClass(ScoreResultEntity.class);

        //JobConf jobConf=new JobConf(conf);

        //设置输入和输出路径
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        System.exit(job.waitForCompletion(true)?0:1);
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }
}

}

5.9.7.项目打包并部署到hadoop平台运行

将jar文件上传到centeos中

5.9.10.准备数据文件score.csv文件

将学生成绩文件上传到opt/input下

上传/opt/input/score.csv文件到hdfs的input目录下
hdfs dfs -put score.csv /input/

运行
[root@hdp01 myjob]#hadoop jar prjmareducestuscoreanisys-ver1.0.jar com.job.ScoreAnalysisJob /input/score.csv /output-score
2023-09-06 11:53:49,079 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2023-09-06 11:53:51,770 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2023-09-06 11:53:55,462 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2023-09-06 11:53:55,643 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1693971333444_0001
2023-09-06 11:53:56,227 INFO input.FileInputFormat: Total input files to process : 1
2023-09-06 11:53:56,862 INFO mapreduce.JobSubmitter: number of splits:1
2023-09-06 11:54:00,748 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1693971333444_0001
2023-09-06 11:54:00,748 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-06 11:54:01,897 INFO conf.Configuration: resource-types.xml not found
2023-09-06 11:54:01,899 INFO resource.ResourceUtils: Unable to find ‘resource-types.xml’.
2023-09-06 11:54:02,345 INFO impl.YarnClientImpl: Submitted application application_1693971333444_0001
2023-09-06 11:54:02,540 INFO mapreduce.Job: The url to track the job: http://hdp01:8088/proxy/application_1693971333444_0001/
2023-09-06 11:54:02,541 INFO mapreduce.Job: Running job: job_1693971333444_0001
2023-09-06 11:54:20,742 INFO mapreduce.Job: Job job_1693971333444_0001 running in uber mode : false
2023-09-06 11:54:20,749 INFO mapreduce.Job: map 0% reduce 0%
2023-09-06 11:54:27,929 INFO mapreduce.Job: map 100% reduce 0%
2023-09-06 11:54:34,015 INFO mapreduce.Job: map 100% reduce 100%
2023-09-06 11:54:35,056 INFO mapreduce.Job: Job job_1693971333444_0001 completed successfully
2023-09-06 11:54:35,242 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=1638
FILE: Number of bytes written=548207
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=978
HDFS: Number of bytes written=2000
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4690
Total time spent by all reduces in occupied slots (ms)=4357
Total time spent by all map tasks (ms)=4690
Total time spent by all reduce tasks (ms)=4357
Total vcore-milliseconds taken by all map tasks=4690
Total vcore-milliseconds taken by all reduce tasks=4357
Total megabyte-milliseconds taken by all map tasks=4802560
Total megabyte-milliseconds taken by all reduce tasks=4461568
Map-Reduce Framework
Map input records=32
Map output records=96
Map output bytes=1440
Map output materialized bytes=1638
Input split bytes=98
Combine input records=0
Combine output records=0
Reduce input groups=32
Reduce shuffle bytes=1638
Reduce input records=96
Reduce output records=32
Spilled Records=192
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=260
CPU time spent (ms)=1900
Physical memory (bytes) snapshot=481038336
Virtual memory (bytes) snapshot=5168017408
Total committed heap usage (bytes)=372768768
Peak Map Physical memory (bytes)=281804800
Peak Map Virtual memory (bytes)=2580533248
Peak Reduce Physical memory (bytes)=199233536
Peak Reduce Virtual memory (bytes)=2587484160
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=880
File Output Format Counters
Bytes Written=2000

查看
hdfs dfs -ls /output-score
-rw-r–r-- 1 root supergroup 0 2023-09-06 11:54 /output-score/_SUCCESS
-rw-r–r-- 1 root supergroup 2000 2023-09-06 11:54 /output-score/part-r-00000
[root@hdp01 myjob]# hdfs dfs -cat /output-score/part-r-00000
2023-09-06 11:56:24,766 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
zr21063001 ScoreResultEntity{maxscore=113.0, totalscore=271.0}
zr21063002 ScoreResultEntity{maxscore=114.0, totalscore=274.0}
zr21063003 ScoreResultEntity{maxscore=115.0, totalscore=277.0}
zr21063004 ScoreResultEntity{maxscore=116.0, totalscore=280.0}
zr21063005 ScoreResultEntity{maxscore=117.0, totalscore=283.0}
zr21063006 ScoreResultEntity{maxscore=118.0, totalscore=286.0}
zr21063007 ScoreResultEntity{maxscore=119.0, totalscore=289.0}
zr21063008 ScoreResultEntity{maxscore=120.0, totalscore=292.0}
zr21063009 ScoreResultEntity{maxscore=94.0, totalscore=249.0}
zr21063010 ScoreResultEntity{maxscore=95.0, totalscore=252.0}
zr21063011 ScoreResultEntity{maxscore=96.0, totalscore=255.0}
zr21063012 ScoreResultEntity{maxscore=97.0, totalscore=258.0}
zr21063013 ScoreResultEntity{maxscore=98.0, totalscore=261.0}
zr21063014 ScoreResultEntity{maxscore=99.0, totalscore=264.0}
zr21063015 ScoreResultEntity{maxscore=116.0, totalscore=288.0}
zr21063016 ScoreResultEntity{maxscore=117.0, totalscore=283.0}
zr21063017 ScoreResultEntity{maxscore=118.0, totalscore=286.0}
zr21063018 ScoreResultEntity{maxscore=119.0, totalscore=289.0}
zr21063019 ScoreResultEntity{maxscore=90.0, totalscore=260.0}
zr21063020 ScoreResultEntity{maxscore=91.0, totalscore=263.0}
zr21063021 ScoreResultEntity{maxscore=92.0, totalscore=266.0}
zr21063022 ScoreResultEntity{maxscore=93.0, totalscore=269.0}
zr21063023 ScoreResultEntity{maxscore=94.0, totalscore=272.0}
zr21063024 ScoreResultEntity{maxscore=95.0, totalscore=262.0}
zr21063025 ScoreResultEntity{maxscore=96.0, totalscore=265.0}
zr21063026 ScoreResultEntity{maxscore=115.0, totalscore=288.0}
zr21063027 ScoreResultEntity{maxscore=116.0, totalscore=291.0}
zr21063028 ScoreResultEntity{maxscore=117.0, totalscore=294.0}
zr21063029 ScoreResultEntity{maxscore=118.0, totalscore=287.0}
zr21063030 ScoreResultEntity{maxscore=91.0, totalscore=262.0}
zr21063031 ScoreResultEntity{maxscore=92.0, totalscore=260.0}
zr21063032 ScoreResultEntity{maxscore=93.0, totalscore=263.0}

六．Sqoop数据迁移工具
6.1.Sqoop的介绍
Apache Sqoop 是一个用于在Apache Hadoop 和关系数据库等结构化数据存储之间高效传输大容量数据的工具；
可以使用Sqoop 将关系数据库管理系统（RDBMS，如MySQL或Oracle）中的数据导入Hadoop分布式文件系统（HDFS），同样也可以将HDFS 中数据导出到RDBMS 中；
Sqoop 自动化了这个过程的大部分，依赖于数据库来描述要导入的数据的模式，Sqoop 使用MapReduce 来导入和导出数据，这提供了并行操作和容错能力。

6.2.Sqoop 工作原理
Sqoop 检测每个表然后自动生成Java 类来向HDFS 导入数据，然后创建并运行一个仅有Map 的MapReduce 作业来导入数据；
默认情况下，四个Mapper（并行度4）连接到RDBMS
每个导入四分之一的数据

6.3.Sqoop 特性
导入一个数据库中的单个或所有表
可以使用where指定导入哪些行
可以指定导入哪些列
可以指定任意的select 语句
Sqoop 可以自动基于导入的数据创建一个Hive表
提供增量导入数据
可以从HDFS 导出数据到数据库表

6.4.Sqoop环境搭建
6.4.1.上传sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz到centeos的opt目录下，在opt目录下创建sqoop文件夹，并将sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz解压缩到sqoop文件夹
mkdir sqoop
tar -xvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C sqoop/

[root@hdp01 opt]# ll
总用量 623644
drwxr-xr-x. 9 root root 220 8月 14 2019 apache-tomcat-9.0.24
-rw-r–r–. 1 root root 12856789 8月 29 17:20 apache-tomcat-9.0.24.zip
drwxr-xr-x. 3 root root 20 8月 30 16:28 data
drwxr-xr-x. 12 zhang zhang 240 9月 6 11:36 hadoop-3.3.1
-rw-r–r–. 1 root root 607792249 8月 30 10:44 hadoop-3.3.1-aarch64.tar.gz
drwxr-xr-x. 2 root root 86 9月 6 11:42 input
drwxr-xr-x. 2 root root 66 9月 6 11:53 myjob
-rw-r–r–. 1 root root 90 8月 31 17:14 part-r-00000
drwxr-xr-x. 2 root root 6 10月 31 2018 rh
drwxr-xr-x. 3 root root 43 9月 6 16:17 sqoop
-rw-r–r–. 1 root root 17953604 9月 6 16:15 sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@hdp01 opt]# cd sqoop
[root@hdp01 sqoop]# ll
总用量 4
drwxr-xr-x. 9 zhang zhang 4096 12月 19 2017 sqoop-1.4.7.bin__hadoop-2.6.0
[root@hdp01 sqoop]# cd sqoop-1.4.7.bin__hadoop-2.6.0/
[root@hdp01 sqoop-1.4.7.bin__hadoop-2.6.0]# ll
总用量 2020
drwxr-xr-x. 2 zhang zhang 4096 12月 19 2017 bin
-rw-rw-r–. 1 zhang zhang 55089 12月 19 2017 build.xml
-rw-rw-r–. 1 zhang zhang 47426 12月 19 2017 CHANGELOG.txt
-rw-rw-r–. 1 zhang zhang 9880 12月 19 2017 COMPILING.txt
drwxr-xr-x. 2 zhang zhang 150 12月 19 2017 conf
drwxr-xr-x. 5 zhang zhang 169 12月 19 2017 docs
drwxr-xr-x. 2 zhang zhang 96 12月 19 2017 ivy
-rw-rw-r–. 1 zhang zhang 11163 12月 19 2017 ivy.xml
drwxr-xr-x. 2 zhang zhang 4096 12月 19 2017 lib
-rw-rw-r–. 1 zhang zhang 15419 12月 19 2017 LICENSE.txt
-rw-rw-r–. 1 zhang zhang 505 12月 19 2017 NOTICE.txt
-rw-rw-r–. 1 zhang zhang 18772 12月 19 2017 pom-old.xml
-rw-rw-r–. 1 zhang zhang 1096 12月 19 2017 README.txt
-rw-rw-r–. 1 zhang zhang 1108073 12月 19 2017 sqoop-1.4.7.jar
-rw-rw-r–. 1 zhang zhang 6554 12月 19 2017 sqoop-patch-review.py
-rw-rw-r–. 1 zhang zhang 765184 12月 19 2017 sqoop-test-1.4.7.jar
drwxr-xr-x. 7 zhang zhang 73 12月 19 2017 src
drwxr-xr-x. 4 zhang zhang 114 12月 19 2017 testdata

6.4.2.将sqoop添加到环境变量/etc/profile中
vi /etc/profile
在最后加入：
export SQOOP_HOME=/opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0
export PATH= $P A T H :$ SQOOP_HOME/bin

6.4.3.配置 Sqoop 环境变量文件
1.切换到 Sqoop 配置文件目录

cd $SQOOP_HOME/conf

2.复制 Sqoop 环境变量模板文件

cp sqoop-env-template.sh sqoop-env.sh

3.编辑文件，指定相关路径

vi sqoop-env.sh

配置如下，根据自己的集群情况来看。

我没有装 HBase 和 Zookeper，所以只指定了 HADOOP 和 Hive 的路径，注意删掉前面的符号 # ，否则会被认为是一行注释。

6.4.4.安装MySql驱动
拷贝 MySQL 驱动到 Sqoop 中的 lib 目录中。

将mysql-connector-java-8.0.30.jar上传到 $SQOOP_HOME/lib

6.4.5.验证
输入 sqoop version，出现如下版本信息表示安装成功

启动mariadb数据库
[root@hdp01 lib]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[root@hdp01 lib]# systemctl start mariadb
[root@hdp01 lib]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
Active: active (running) since 三 2023-09-06 16:36:39 CST; 2s ago
Process: 4249 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 4212 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 4247 (mysqld_safe)
Tasks: 20
CGroup: /system.slice/mariadb.service
├─4247 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─4509 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log…

查看数据库信息
[root@hdp01 lib]# mysql -uroot -p
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Type ‘help;’ or ‘\h’ for help. Type ‘\c’ to clear the current input statement.

MariaDB [(none)]> use studb;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [studb]> show tables;
±----------------+
| Tables_in_studb |
±----------------+
| admin |
±----------------+
1 row in set (0.00 sec)

MariaDB [studb]> select * from admin;
±----±------±-------+
| aid | aname | passwd |
±----±------±-------+
| 1 | admin | 112233 |
| 2 | test | 112211 |
| 3 | test2 | 112212 |
±----±------±-------+
3 rows in set (0.00 sec)

MariaDB [studb]> quit;
Bye

使用sqoop查看mariadb数据库stub中的表信息
[root@hdp01 lib]# sqoop list-tables --connect jdbc:mysql://localhost:3306/studb --username root --password ‘’
Warning: /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/…/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/…/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/…/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/…/zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation. 2023-09-06 16:40:13,833 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7 2023-09-06 16:40:13,890 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 2023-09-06 16:40:14,282 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils at org.apache.sqoop.manager.MySQLManager.initOptionDefaults(MySQLManager.java:73) at org.apache.sqoop.manager.SqlManager.<init>(SqlManager.java:89) at com.cloudera.sqoop.manager.SqlManager.<init>(SqlManager.java:33) at org.apache.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:51) at com.cloudera.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:30) at org.apache.sqoop.manager.CatalogQueryManager.<init>(CatalogQueryManager.java:46) at com.cloudera.sqoop.manager.CatalogQueryManager.<init>(CatalogQueryManager.java:31) at org.apache.sqoop.manager.InformationSchemaManager.<init>(InformationSchemaManager.java:38) at com.cloudera.sqoop.manager.InformationSchemaManager.<init>(InformationSchemaManager.java:31) at org.apache.sqoop.manager.MySQLManager.<init>(MySQLManager.java:65) at org.apache.sqoop.manager.DefaultManagerFactory.accept(DefaultManagerFactory.java:67) at org.apache.sqoop.ConnFactory.getManager(ConnFactory.java:184) at org.apache.sqoop.tool.BaseSqoopTool.init(BaseSqoopTool.java:272) at org.apache.sqoop.tool.ListTablesTool.run(ListTablesTool.java:44) at org.apache.sqoop.Sqoop.run(Sqoop.java:147) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243) at org.apache.sqoop.Sqoop.main(Sqoop.java:252) Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$ AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
… 20 more

出现上面错误的原因是commons-lang3-3.4.jar的版本太高，需要使用commons-lang-2.6.jar替代此版本

再次查看
[root@hdp01 lib]# sqoop list-tables --connect jdbc:mysql://localhost:3306/studb --username root --password ‘’
Warning: /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/…/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/…/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/…/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/…/zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2023-09-06 16:45:59,570 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2023-09-06 16:45:59,628 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2023-09-06 16:46:00,059 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Loading class com.mysql.jdbc.Driver'. This is deprecated. The new driver class is com.mysql.cj.jdbc.Driver’. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
admin

至此，我们的 Sqoop 就已经安装完成啦。
6.4.6.去除警告信息
使用 Sqoop 时的一些警告信息真令人头大，我们可以手动去关闭掉这些警告信息。

切换到 Sqoop 目录

cd $SQOOP_HOME/bin

编辑文件

vi configure-sqoop
注释如下内容：

再次输入 sqoop version 查看：

6.5.sqoop的使用方法
6.5.1.Sqoop基本用法
查看指定数据库表：
sqoop list-tables --connect jdbc:mysql://localhost:3306/studb --username root --password ‘’
使用Sqoop可以从关系型数据库中导入数据到HDFS上，在这个过程中import操作的输入是一个数据库表，Sqoop会逐行读取记录到HDFS中。
Sqoop的基本操作就是实现数据的导入导出工作，及import和export；
import操作的输入是一个数据库表，Sqoop会逐行读取记录到HDFS中
export是将数据从HDFS或者Hive导出到关系型数据库中。export过程并行的读取HDFS上的文件，将每一条内容转化成一条记录，然后作为一个新行insert到关系型数据库表中。
除了import和export，Sqoop还包含一些其他的操作。比如sqoop list-databases查询数据库结构，sqoop list-tables查询表信息等。
6.5.2.sqoop工具使用语法
Sqoop提供了一系列的操作工具，使用Sqoop需要指定你想要使用的具体工具，以及提供对应的一些参数，使用方式如下。

sqoop tool-name [tool-arguments]
可以使用sqoop help命令查看帮助信息
sqoop help

1.可以使用Sqoop对MySQL中的内容进行查看；
查看mariadb中所有的数据库
sqoop list-databases --username root --connect jdbc:mysql://localhost

– 列出mysql库中所有表
sqoop list-tables --username root --password ‘’ --connect jdbc:mysql://localhost/mysql
[root@hdp01 bin]# sqoop list-tables --username root --password ‘’ --connect jdbc:mysql://localhost/mysql
2023-09-06 17:10:21,610 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2023-09-06 17:10:21,668 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2023-09-06 17:10:21,850 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Loading class com.mysql.jdbc.Driver'. This is deprecated. The new driver class is com.mysql.cj.jdbc.Driver’. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
columns_priv
db
event
func
general_log
help_category
help_keyword
help_relation
help_topic
host
ndb_binlog_index
plugin
proc
procs_priv
proxies_priv
servers
slow_log
tables_priv
time_zone
time_zone_leap_second
time_zone_name
time_zone_transition
time_zone_transition_type
user

6.5.3.Sqoop导入导出
列出studb中所有表
sqoop list-tables --connect jdbc:mysql://localhost:3306/studb --username root --password ‘’
将studb数据库中admin表的记录导入到hdfs文件系统
导入数据到hdfs的/admin目录：
sqoop import --connect jdbc:mysql://localhost/studb --username root --password ‘’ --table admin --fields-terminated-by ‘\t’ --target-dir ‘/admin’
此处admin目录不能存在，会自动创建
[root@hdp01 sbin]# sqoop import --connect jdbc:mysql://localhost/studb --username root --password ‘’ --table admin --fields-terminated-by ‘\t’ --target-dir ‘/admin’
2023-09-06 17:17:12,723 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2023-09-06 17:17:12,867 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2023-09-06 17:17:13,039 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2023-09-06 17:17:13,039 INFO tool.CodeGenTool: Beginning code generation
Loading class com.mysql.jdbc.Driver'. This is deprecated. The new driver class is com.mysql.cj.jdbc.Driver’. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
2023-09-06 17:17:14,867 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM admin AS t LIMIT 1
2023-09-06 17:17:14,941 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM admin AS t LIMIT 1
2023-09-06 17:17:14,955 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/hadoop-3.3.1
注: /tmp/sqoop-root/compile/70b8d12e5901d65973137975e3b59709/admin.java使用或覆盖了已过时的 API。
注: 有关详细信息, 请使用 -Xlint:deprecation 重新编译。
2023-09-06 17:17:29,620 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/70b8d12e5901d65973137975e3b59709/admin.jar
2023-09-06 17:17:29,651 WARN manager.MySQLManager: It looks like you are importing from mysql.
2023-09-06 17:17:29,651 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2023-09-06 17:17:29,651 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2023-09-06 17:17:29,651 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2023-09-06 17:17:29,660 INFO mapreduce.ImportJobBase: Beginning import of admin
2023-09-06 17:17:29,660 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2023-09-06 17:17:29,985 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2023-09-06 17:17:30,050 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2023-09-06 17:17:32,800 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2023-09-06 17:17:33,097 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2023-09-06 17:17:34,641 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1693991756586_0001
2023-09-06 17:17:47,874 INFO db.DBInputFormat: Using read commited transaction isolation
2023-09-06 17:17:47,875 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(aid), MAX(aid) FROM admin
2023-09-06 17:17:47,913 INFO db.IntegerSplitter: Split size: 0; Num splits: 4 from: 1 to: 3
2023-09-06 17:17:48,453 INFO mapreduce.JobSubmitter: number of splits:3
2023-09-06 17:17:49,505 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1693991756586_0001
2023-09-06 17:17:49,505 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-06 17:17:57,817 INFO conf.Configuration: resource-types.xml not found
2023-09-06 17:17:57,818 INFO resource.ResourceUtils: Unable to find ‘resource-types.xml’.
2023-09-06 17:17:58,882 INFO impl.YarnClientImpl: Submitted application application_1693991756586_0001
2023-09-06 17:17:58,971 INFO mapreduce.Job: The url to track the job: http://hdp01:8088/proxy/application_1693991756586_0001/
2023-09-06 17:17:58,972 INFO mapreduce.Job: Running job: job_1693991756586_0001
2023-09-06 17:18:22,893 INFO mapreduce.Job: Job job_1693991756586_0001 running in uber mode : false
2023-09-06 17:18:22,897 INFO mapreduce.Job: map 0% reduce 0%
2023-09-06 17:18:41,398 INFO mapreduce.Job: map 67% reduce 0%
2023-09-06 17:18:42,404 INFO mapreduce.Job: map 100% reduce 0%
2023-09-06 17:18:42,415 INFO mapreduce.Job: Job job_1693991756586_0001 completed successfully
2023-09-06 17:18:43,270 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=844458
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=301
HDFS: Number of bytes written=44
HDFS: Number of read operations=18
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=3
Other local map tasks=3
Total time spent by all maps in occupied slots (ms)=48003
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=48003
Total vcore-milliseconds taken by all map tasks=48003
Total megabyte-milliseconds taken by all map tasks=49155072
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=301
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=3745
CPU time spent (ms)=10800
Physical memory (bytes) snapshot=635547648
Virtual memory (bytes) snapshot=7823417344
Total committed heap usage (bytes)=413663232
Peak Map Physical memory (bytes)=233840640
Peak Map Virtual memory (bytes)=2616766464
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=44
2023-09-06 17:18:43,282 INFO mapreduce.ImportJobBase: Transferred 44 bytes in 70.4479 seconds (0.6246 bytes/sec)
2023-09-06 17:18:43,287 INFO mapreduce.ImportJobBase: Retrieved 3 records.

查看hdfs的admin目录
[root@hdp01 sbin]# hdfs dfs -ls /admin
2023-09-06 17:19:59,439 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rw-r–r-- 1 root supergroup 0 2023-09-06 17:18 /admin/_SUCCESS
-rw-r–r-- 1 root supergroup 15 2023-09-06 17:18 /admin/part-m-00000
-rw-r–r-- 1 root supergroup 14 2023-09-06 17:18 /admin/part-m-00001
-rw-r–r-- 1 root supergroup 15 2023-09-06 17:18 /admin/part-m-00002
[root@hdp01 sbin]# hdfs dfs -cat /admin/part-m-00000
2023-09-06 17:20:32,902 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
1 admin 112233
[root@hdp01 sbin]# hdfs dfs -cat /admin/part-m-00001
2023-09-06 17:20:39,125 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2 test 112211
[root@hdp01 sbin]# hdfs dfs -cat /admin/part-m-00002
2023-09-06 17:20:43,823 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
3 test2 112212

6.5.4.使用sql查询语句从数据库表中选择需要的数据导入到HDFS系统
sqoop import --connect jdbc:mysql://localhost/studb --username root --password ‘’ --query ‘select * from admin where aid >=1 and aid<=3 and $CONDITIONS’ --split-by ‘aid’ --fields-terminated-by ‘,’ --target-dir ‘/studb’

[root@hdp01 sbin]# sqoop import --connect jdbc:mysql://localhost/studb --username root --password ‘’ --query ‘select * from admin where aid >=1 and aid<=3 and $CONDITIONS’ --split-by ‘aid’ --fields-terminated-by ‘,’ --target-dir ‘/studb’
2023-09-07 10:47:54,294 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2023-09-07 10:47:54,373 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2023-09-07 10:47:54,548 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2023-09-07 10:47:54,548 INFO tool.CodeGenTool: Beginning code generation
Loading class com.mysql.jdbc.Driver'. This is deprecated. The new driver class is com.mysql.cj.jdbc.Driver’. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
2023-09-07 10:47:56,261 INFO manager.SqlManager: Executing SQL statement: select * from admin where aid >=1 and aid<=3 and (1 = 0)
2023-09-07 10:47:56,292 INFO manager.SqlManager: Executing SQL statement: select * from admin where aid >=1 and aid<=3 and (1 = 0)
2023-09-07 10:47:56,346 INFO manager.SqlManager: Executing SQL statement: select * from admin where aid >=1 and aid<=3 and (1 = 0)
2023-09-07 10:47:56,367 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/hadoop-3.3.1
注: /tmp/sqoop-root/compile/081c03f634562b55fb23566dd851f93d/QueryResult.java使用或覆盖了已过时的 API。
注: 有关详细信息, 请使用 -Xlint:deprecation 重新编译。
2023-09-07 10:48:11,335 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/081c03f634562b55fb23566dd851f93d/QueryResult.jar
2023-09-07 10:48:11,356 INFO mapreduce.ImportJobBase: Beginning query import.
2023-09-07 10:48:11,357 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2023-09-07 10:48:11,598 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2023-09-07 10:48:11,648 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2023-09-07 10:48:13,050 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2023-09-07 10:48:13,218 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2023-09-07 10:48:14,186 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1694054422942_0001
2023-09-07 10:48:30,706 INFO db.DBInputFormat: Using read commited transaction isolation
2023-09-07 10:48:30,709 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(aid), MAX(aid) FROM (select * from admin where aid >=1 and aid<=3 and (1 = 1) ) AS t1
2023-09-07 10:48:30,713 INFO db.IntegerSplitter: Split size: 0; Num splits: 4 from: 1 to: 3
2023-09-07 10:48:30,817 INFO mapreduce.JobSubmitter: number of splits:3
2023-09-07 10:48:31,881 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1694054422942_0001
2023-09-07 10:48:31,881 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-07 10:48:32,565 INFO conf.Configuration: resource-types.xml not found
2023-09-07 10:48:32,566 INFO resource.ResourceUtils: Unable to find ‘resource-types.xml’.
2023-09-07 10:48:33,159 INFO impl.YarnClientImpl: Submitted application application_1694054422942_0001
2023-09-07 10:48:33,323 INFO mapreduce.Job: The url to track the job: http://hdp01:8088/proxy/application_1694054422942_0001/
2023-09-07 10:48:33,324 INFO mapreduce.Job: Running job: job_1694054422942_0001
2023-09-07 10:48:46,863 INFO mapreduce.Job: Job job_1694054422942_0001 running in uber mode : false
2023-09-07 10:48:46,865 INFO mapreduce.Job: map 0% reduce 0%
2023-09-07 10:48:59,666 INFO mapreduce.Job: map 67% reduce 0%
2023-09-07 10:49:01,685 INFO mapreduce.Job: map 100% reduce 0%
2023-09-07 10:49:02,709 INFO mapreduce.Job: Job job_1694054422942_0001 completed successfully
2023-09-07 10:49:02,865 INFO mapreduce.Job: Counters: 34
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=844788
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=289
HDFS: Number of bytes written=44
HDFS: Number of read operations=18
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
HDFS: Number of bytes read erasure-coded=0
Job Counters
Killed map tasks=1
Launched map tasks=3
Other local map tasks=3
Total time spent by all maps in occupied slots (ms)=30890
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=30890
Total vcore-milliseconds taken by all map tasks=30890
Total megabyte-milliseconds taken by all map tasks=31631360
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=289
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1302
CPU time spent (ms)=8260
Physical memory (bytes) snapshot=640663552
Virtual memory (bytes) snapshot=7796912128
Total committed heap usage (bytes)=451936256
Peak Map Physical memory (bytes)=226467840
Peak Map Virtual memory (bytes)=2616942592
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=44
2023-09-07 10:49:02,879 INFO mapreduce.ImportJobBase: Transferred 44 bytes in 49.8043 seconds (0.8835 bytes/sec)
2023-09-07 10:49:02,887 INFO mapreduce.ImportJobBase: Retrieved 3 records.

查看
[root@hdp01 sbin]# hdfs dfs -ls /studb
2023-09-07 10:50:58,705 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rw-r–r-- 1 root supergroup 0 2023-09-07 10:49 /studb/_SUCCESS
-rw-r–r-- 1 root supergroup 15 2023-09-07 10:48 /studb/part-m-00000
-rw-r–r-- 1 root supergroup 14 2023-09-07 10:48 /studb/part-m-00001
-rw-r–r-- 1 root supergroup 15 2023-09-07 10:49 /studb/part-m-00002
[root@hdp01 sbin]# hdfs dfs -cat /studb/p*
2023-09-07 10:51:12,882 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
1,admin,112233
2,test,112211
3,test2,112212

6.5.5.从hdfs导出数据到数据库
将/admin路径的数据导出到mydb数据库的users表
首先在mydb数据库中创建users表，表的结构和admin表的结构一致
CREATE TABLE users (
uid INT(11) NOT NULL,
uname VARCHAR(50) DEFAULT NULL,
upasswd VARCHAR(50) DEFAULT NULL,
PRIMARY KEY (uid)
) ENGINE=INNODB DEFAULT CHARSET=latin1;

导出数据语句
sqoop export --connect jdbc:mysql://localhost/mydb --username root --password ‘’ --table users --input-fields-terminated-by ‘\t’ --input-null-non-string ‘\N’ --input-null-string ‘\N’ --export-dir ‘/admin/’

导出的结果
[root@hdp01 sbin]# sqoop export --connect jdbc:mysql://localhost/mydb --username root --password ‘’ --table users --input-fields-terminated-by ‘\t’ --input-null-non-string ‘\N’ --input-null-string ‘\N’ --export-dir ‘/admin/’
2023-09-07 10:59:30,426 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2023-09-07 10:59:30,480 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2023-09-07 10:59:30,668 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2023-09-07 10:59:30,682 INFO tool.CodeGenTool: Beginning code generation
Loading class com.mysql.jdbc.Driver'. This is deprecated. The new driver class is com.mysql.cj.jdbc.Driver’. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
2023-09-07 10:59:31,179 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM users AS t LIMIT 1
2023-09-07 10:59:31,232 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM users AS t LIMIT 1
2023-09-07 10:59:31,253 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/hadoop-3.3.1
注: /tmp/sqoop-root/compile/82f954cd2a1ba4e0d80fe0feb37be25a/users.java使用或覆盖了已过时的 API。
注: 有关详细信息, 请使用 -Xlint:deprecation 重新编译。
2023-09-07 10:59:33,177 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/82f954cd2a1ba4e0d80fe0feb37be25a/users.jar
2023-09-07 10:59:33,193 INFO mapreduce.ExportJobBase: Beginning export of users
2023-09-07 10:59:33,193 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2023-09-07 10:59:33,593 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2023-09-07 10:59:33,677 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2023-09-07 10:59:35,044 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
2023-09-07 10:59:35,049 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2023-09-07 10:59:35,050 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2023-09-07 10:59:35,315 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2023-09-07 10:59:35,909 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1694054422942_0002
2023-09-07 10:59:39,383 INFO input.FileInputFormat: Total input files to process : 3
2023-09-07 10:59:39,388 INFO input.FileInputFormat: Total input files to process : 3
2023-09-07 10:59:39,921 INFO mapreduce.JobSubmitter: number of splits:3
2023-09-07 10:59:40,007 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2023-09-07 10:59:40,574 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1694054422942_0002
2023-09-07 10:59:40,575 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-07 10:59:40,991 INFO conf.Configuration: resource-types.xml not found
2023-09-07 10:59:40,992 INFO resource.ResourceUtils: Unable to find ‘resource-types.xml’.
2023-09-07 10:59:41,381 INFO impl.YarnClientImpl: Submitted application application_1694054422942_0002
2023-09-07 10:59:41,461 INFO mapreduce.Job: The url to track the job: http://hdp01:8088/proxy/application_1694054422942_0002/
2023-09-07 10:59:41,465 INFO mapreduce.Job: Running job: job_1694054422942_0002
2023-09-07 10:59:50,639 INFO mapreduce.Job: Job job_1694054422942_0002 running in uber mode : false
2023-09-07 10:59:50,641 INFO mapreduce.Job: map 0% reduce 0%
2023-09-07 11:00:02,105 INFO mapreduce.Job: map 100% reduce 0%
2023-09-07 11:00:03,167 INFO mapreduce.Job: Job job_1694054422942_0002 completed successfully
2023-09-07 11:00:03,341 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=843627
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=595
HDFS: Number of bytes written=0
HDFS: Number of read operations=21
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=3
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=27161
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=27161
Total vcore-milliseconds taken by all map tasks=27161
Total megabyte-milliseconds taken by all map tasks=27812864
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=510
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=558
CPU time spent (ms)=6630
Physical memory (bytes) snapshot=632635392
Virtual memory (bytes) snapshot=7759220736
Total committed heap usage (bytes)=504889344
Peak Map Physical memory (bytes)=229208064
Peak Map Virtual memory (bytes)=2587893760
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2023-09-07 11:00:03,353 INFO mapreduce.ExportJobBase: Transferred 595 bytes in 28.2782 seconds (21.041 bytes/sec)
2023-09-07 11:00:03,365 INFO mapreduce.ExportJobBase: Exported 3 records.

查看数据

注意：导出的数据字段的个数与表的字段个数一致，类型一致。

6.5.6.Sqoop 数据导入需要考虑的问题
导入效率，参数-m 可以指定导入作业的并行度，参数值默认为4，值越高，导入效率越高，但需要占用的资源越多，会存在影响其他在线业务的风险；
导入数据选择，已有数据并非所有内容都需要导入到大数据环境，或者不需要一次性导入，若一次性导入数据过多，可能导致导入时间过长，从而影响已有业务；
导入时间，不论何种导入方式，终归需要在一段时间内占用一定的资源，所以数据的导入最好是选择在业务低峰，这样可以尽量减少对业务的影响；
预导入，但评估好导入时间和导入数据后，一定要进行预导入，从实际数据中抽样中一定比例（无需过大）的数据，进行预导入，从而计算大体上实际数据导入所需时间，这种预计时间可以作为实际导入所需要消耗时长的依据；

课堂练习：
完成从数据库中导入数据到hdfs文件系统，并查看结果。
完成从hdfs文件系统导出数据到数据库的操作，并查看导出的结果。

七.大数据仓库组件hive
7.1.hive原理的介绍
7.1.1.基于hadoop平台的MapReduce进行的构建。
7.1.2.hive使用自己的查询与HiveQL进行数据的查询操作，是一种类sql语句
7.1.3.在hadoop集群上运行作业
7.2.HiveQl概述
7.2.1.HiveQL 兼容SQL-92 的子集
增加一些来自MySQL 和Oracle SQL 方言的少量扩展
SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
WHERE zipcode LIKE ‘63%’
GROUP BY zipcode
ORDER BY total DESC;

7.2.2.Hive 特性
将HiveQL查询语句转换成数据处理作业；
把作业提交给集群上的
数据处理引擎(MapReduce)执行

7.3.使用hive的主要目的
比直接使用MapReduce 代码效率更高
5行HiveQL 代码相当于200行以上的Java 代码
有更广范的大规模数据分析需求的用户基础
无软件开发经验要求
可以使用积累的SQL 知识
提供了与其它系统之间的可相互操作性
通过Java 及其外部脚本获得可扩展性
许多商业智能(BI)工具都支持Hive

7.4.Hive是如何导入和存储数据的
7.4.1.对表进行查询操作,与在RDBMS 中进行查询操作相类似
表是包含一个或多个文件的HDFS 目录
表的默认路径:“/user/hive/warehouse/<table_name>”
表支持数据存储和检索的多种格式

7.4.2.表的结构和存储位置
表生成的时候需指定表的结构和存储位置
表的元数据存储在RDBMS 中
RDBMS：例如MySQL

7.4.3.Hive 表元数据管理
Hive 使用Metastore 决定数据格式和存储位置
用于查询操作的数据存储在文件系统上(往往存储在HDFS 上)

7.4.4.Hive 不是数据库
数据库管理系统有许多优点
响应速度快
支持事务处理
允许修改已有的记录
可以服务上千个并发的客户端
Hadoop 集群不是RDBMS
Hive通过HiveSQL生成处理引擎作业(MapReduce)
HDFS和MapReduce的限制依然存在
不是为了满足联机事务处理(OLTP)的要求而设计的。

7.5.Hive 与关系型数据库对比

7.6.hive的基本操作
7.6.1.hive表
1.Hive 表的数据保存在文件系统上(通常是HDFS)；
每张表对应一个目录（/user/hive/warehouse/<table_name>）
2.每张表对应的一个目录里可包含多个文件；
通常是带有分隔符的文本文件,可支持多种文件格式
不允许带有子目录，分区表除外

3.由元存储(metastore)提供数据的上下文；
Hive在数据库表中保存文件的存放路径，执行查询时，可以通过查询元数据获取文件存放的路径，便于提取数据进行分析统计。
帮助保存在HDFS 上的原始数据映射到指定类型的列上

7.6.2.hive数据库
每张表属于某个特定的数据库；
早期版本的Hive只支持单个数据库
所有的表都放在同一个数据库里(default)
这一特性始终是默认设置
Hive 0.7.0以后的版本支持多数据库
有利于组织和授权

7.6.3.Hive 基础命令
– 查看所有数据库
show databases;

– 使用某数据库
use databasename;

– 查看当前数据库内所有表
show tables;

– 查看表结构信息
desc tablename;

7.7.Hive的安装步骤
7.7.1.修改mariadb数据库密码
1.登录mariadb数据库
启动mariadb
[root@hdp01 sbin]# systemctl start mariadb
[root@hdp01 sbin]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
Active: active (running) since 四 2023-09-07 16:06:47 CST; 5s ago
Process: 4635 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 4597 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 4634 (mysqld_safe)
Tasks: 20
CGroup: /system.slice/mariadb.service
├─4634 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─4895 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log…

9月 07 16:06:43 hdp01 systemd[1]: Starting MariaDB database server…
9月 07 16:06:43 hdp01 mariadb-prepare-db-dir[4597]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
9月 07 16:06:43 hdp01 mariadb-prepare-db-dir[4597]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
9月 07 16:06:44 hdp01 mysqld_safe[4634]: 230907 16:06:44 mysqld_safe Logging to ‘/var/log/mariadb/mariadb.log’.
9月 07 16:06:44 hdp01 mysqld_safe[4634]: 230907 16:06:44 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
9月 07 16:06:47 hdp01 systemd[1]: Started MariaDB database server.

登录数据库
[root@hdp01 sbin]# mysql -uroot -p
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Type ‘help;’ or ‘\h’ for help. Type ‘\c’ to clear the current input statement.

MariaDB [(none)]> use mysql
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

将mariadb数据库的root登录密码改为java
update user set password=password(‘java’) where user=‘root’ and host=‘localhost’;

更新完成后刷新数据库
flush privileges;

退出，重新登录
登录时需要输入密码

7.7.2.hive安装
1.将apache-hive-3.1.2-bin.tar.gz文件上传到centeos的/opt目录下，并解压缩
在/opt下创建hive目录，将apache-hive-3.1.2-bin.tar.gz解压缩到hive目录
mkdir hive
tar -xvf apache-hive-3.1.2-bin.tar.gz -C hive/

2.修改配置文件
在hive安装目录的conf目录下,将hive-default.xml.template文件复制为hive-default.xml，并新建hive-site.xml。
备注：hive-default.xml用于保留默认配置,hive-site.xml用于个性化配置,可覆盖默认配置。

vi hive-site.xml,配置中使用mysql8.x版本的数据库驱动包

<?xml version="1.0" encoding="UTF-8" standalone="no"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> javax.jdo.option.ConnectionURL jdbc:mysql://localhost:3306/hive?serverTimezone=UTC javax.jdo.option.ConnectionDriverName com.mysql.cj.jdbc.Driver javax.jdo.option.ConnectionUserName root javax.jdo.option.ConnectionPassword java

3.上传数据库驱动包到hive的lib目录下
上传mysql-connector-java-8.0.30.jar到hive安装目录的lib文件夹
（/opt/hive/apache-hive-3.1.2-bin/lib）

4.配置hive的环境变量
修改~/.bashrc
vi ~/.bashrc
PATH= $P A T H :$ HOME/.local/bin:$HOME/bin
export PATH

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
export PATH= $P A T H :$ JAVA_HOME/bin: $JAVA_HOME/jre/bin export CLASSPATH=.:$ JAVA_HOME/lib:$JAVA_HOME/jre/lib

export HADOOP_HOME=/opt/hadoop-3.3.1
export PATH= $P A T H :$ HADOOP_HOME/bin
export PATH= $P A T H :$ HADOOP_HOME/sbin

hive

export HIVE_HOME=/opt/hive/apache-hive-3.1.2-bin
export PATH= $P A T H :$ HIVE_HOME/bin
export CLASSPATH= $C L A SSP A H T :$ HIVE_HOME/lib

保存退出

资源文件生效
source ~/.bashrc
5.创建临时文件目录
在hive安装目录下创建临时文件存放目录
mkdir tmp

6.初始化元数据库
在mariadb下常见hive数据库
create database hive;

初始化hive数据库
schematool -dbType mysql -initSchema
[root@hdp01 tmp]# schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://localhost:3306/hive?serverTimezone=UTC
Metastore Connection Driver : com.mysql.cj.jdbc.Driver
Metastore connection User: root
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.mysql.sql

看到 schemaTool completed说明初始化成功

7.启动hive
要先启动hadoop，否则下一步启动hive时会报 Call From Master/192.168.10.5 to master:9000 failed on connection exception:java.net.ConnectException: Connection refused;
接下来启动 hive 直接使用hive命令就可以启动。
shell>$ hive
显示 hive>则表示启动成功
[root@hdp01 lib]# hive
which: no hbase in (/usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre/bin:/opt/hadoop-3.3.1/bin:/opt/hadoop-3.3.1/sbin:/opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/bin:/root/bin:/root/.local/bin:/root/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre/bin:/opt/hadoop-3.3.1/bin:/opt/hadoop-3.3.1/sbin:/opt/hive/apache-hive-3.1.2-bin/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 3a8d9ca6-dc58-4d27-8a09-bc4dbed785f6

Logging initialized using configuration in jar:file:/opt/hive/apache-hive-3.1.2-bin/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = 6878b5c4-1789-47ae-aaea-0ba36159e777
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
Time taken: 1.033 seconds, Fetched: 1 row(s)
hive>

在sqoop配置文件中，配置hive的安装路径

7.8.使用客户端工具dbeaver.exe访问hive服务器
7.8.1.解压缩dbeaver工具包，直接双击运行dbeaver.exe文件
7.8.2.下载hive的jdbc驱动包hive-jdbc-3.1.2-standalone.jar到指定磁盘文件夹
hive-jdbc-3.1.2-standalone.jar位于hive安装包的jdbc目录下

7.8.3.启动hadoop服务

7.8.4.启动 hive的远程服务
1.启动mariadb数据库服务
systemctl start mariadb

2.修改配置文件hive-site.xml

hive.server2.enable.doAs
false

3.启动远程服务
注意：单独打开一个标签页启动

hive --service hiveserver2
[root@hdp01 apache-hive-3.1.2-bin]# hive --service hiveserver2
which: no hbase in (/usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre/bin:/opt/hadoop-3.3.1/bin:/opt/hadoop-3.3.1/sbin:/opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/bin:/root/.local/bin:/root/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre/bin:/opt/hadoop-3.3.1/bin:/opt/hadoop-3.3.1/sbin:/opt/hive/apache-hive-3.1.2-bin/bin:/root/bin)
2023-09-08 10:51:12: Starting HiveServer2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 868474c6-f1f1-4e73-88eb-17efd5935e97
Hive Session ID = d8720550-d757-439f-bd56-a644e1ad3cf6

7.8.5.打开dbeaver.exe开始配置hive的远程访问
1.防火墙必须关闭

2.新建链接

3.双击Apache Hive图标，进入hive远程连接配置

4.点击”编辑驱动设置”,设置hive的jdbc驱动

点击“库”，选择驱动文件

5.打开dbeaver的sql编辑器

选择要执行的sql语句，点右键执行

6.使用dbeaver工具创建数据库，并在当前数据库中导入数据
准备要导入的文件f:\hivezf.csv

create database mydb;
执行该语句，在左侧菜单刷新并选中创建的数据库，点击右键，选择“导入数据文件”

点击继续

查看HIVE后端执行

7.9.hive常用的HiveQL（Hive的查询语言）语句
7.9.1.语法规范
关键词不区分大小写
虽然按照惯例通常关键词仍旧大写
语句以分号结束
一条语句可以写成多行
注释为双连字符–

7.9.2.常用查询
–根据指定的小区名称查询恒大城的所有小区租房信息

select ABS(zid),address ,huanjing ,abs(jiage)*10,louceng ,mianji ,shi ,ting ,title,xqname from zufang where xqname =‘恒大城’ order by ABS(zid) ;

– 使用LIMIT子句设置查询结果返回的最大条数
select ABS(zid),address ,huanjing ,abs(jiage)*10,louceng ,mianji ,shi ,ting ,title,xqname from zufang order by ABS(zid) DESC limit 5;

–获取绝对价格从高到低的前5条记录
– 使用LIMIT子句设置查询结果返回的最大条数
select ABS(zid),address ,huanjing ,abs(jiage)*10,louceng ,mianji ,shi ,ting ,title,xqname from zufang order by abs(jiage)*10 DESC limit 5;

–分页查询(查询第三页的数据)
select ABS(zid),address ,huanjing ,abs(jiage)*10,louceng ,mianji ,shi ,ting ,title,xqname from zufang order by ABS(zid) limit 10,5;

使用in的范围查询
–查询地址在‘高新区-大寨路大寨路’，‘高新区-大寨路团结南路’‘高新区-科技六路双水磨路’地址的租房信息
SELECT ABS(zid),address ,huanjing ,abs(jiage)*10,louceng ,mianji ,shi ,ting ,title,xqname from zufang
WHERE address in (‘高新区-大寨路大寨路’,‘高新区-大寨路团结南路’,‘高新区-科技六路双水磨路’) order by address;

使用and逻辑运算符链接多个查询条件

–查找绝对价格在500–800之间的所有租房
SELECT ABS(zid),address ,huanjing ,abs(jiage)*10,louceng ,mianji ,shi ,ting ,title,xqname from zufang
WHERE abs(jiage)*10>=500 and abs(jiage)*10<=800 order by abs(jiage)*10 DESC ;

–使用like模糊查询，查询一小区名称以枫叶开头的所有房源信息
SELECT ABS(zid),address ,huanjing ,abs(jiage)*10,louceng ,mianji ,shi ,ting ,title,xqname from zufang where xqname like ‘枫叶%’;

–查询地址中包含科技路的所有房源信息
SELECT ABS(zid),address ,huanjing ,abs(jiage)*10,louceng ,mianji ,shi ,ting ,title,xqname from zufang where address like ‘%科技路%’;

HiveQL 支持有限的子查询功能
–查询面积在20-30之间的所有房源信息，并显示在20-30平米的房源中查询价格在300-500之间的所有房源信息
SELECT fw.* FROM
(select ABS(zid),address ,huanjing ,abs(jiage)*10 price,louceng ,mianji ,shi ,ting ,title,xqname from zufang where mianji BETWEEN 20 and 30) fw
where fw.price between 300 and 500;

子查询
–查询房租价格和爱西华庭的房租价格相同的所有房源信息
–查询房租价格大于爱西华庭的房租价格的所有房源信息
SELECT ABS(z1.zid),z1.address ,z1.huanjing ,abs(z1.jiage)*10,z1.louceng ,z1.mianji ,z1.shi ,z1.ting ,z1.title,z1.xqname from zufang z1
where abs(z1.jiage)*10 in (
SELECT abs(z2.jiage)*10 from zufang z2 WHERE z2.xqname =‘爱西华庭’
);

7.10.hive常用的数据类型
7.10.1.每列都关联着一种数据类型，Hive 支持多种数据类型
其中大多数的数据类型可以从关系型数据库中找到
Hive 还支持3种复杂数据类型
整数型数据类型是指数值是整数的数据类型
TINYINT（-128~127）
SMALLINT（-32,768~32,767）
INT（-2,147,483,648~2,147,483,647）
BIGINT（约-9,2 quintillion~约9.2 quintillion）
7.10.2.浮点型数据类型是指使用了小数点的数据类型
FLOAT
DOUBLE
DECIMAL(p,s)
7.10.3.字符型数据类型
STRING
CHAR(n)
VARCHAR(n)
7.10.4.其他简单数据类型
BOOLEAN
TIMESTAMP

7.11.创建数据库和表
CREATE database studb;

use studb;

create table clazz(
cid int,
cname string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’;

INSERT into clazz(cid,cname) values(1,‘信管1班’),(1,‘信管1班’),(2,‘信管2班’),(3,‘信管3班’),(4,‘信管4班’),(5,‘信管5班’);

select * from clazz ;

7.12.使用Hive进行数据分析
7.12.1.使用Hive分析关系型数据
连接不同的数据集是常用的操作；
Hive 支持的连接类型
Cross joins-交叉连接
Inner joins-内连接
Outer joins(left,right,andfull)-外连接（左连接，右连接和全连接）
Left semi joins-左半连接
只有在对等的条件下才允许连接（hive2.2版本之后支持）
有效的连接:customers.cust_id = orders.cust_id
无效的连接:customers.cust_id <> orders.cust_id
7.12.2.链接的语法

准备客户表和订单表以及对应的数据
–创建客户表
create table customers(
cust_id string,
name string,
country string
);
–drop table customers;
–插入数据
insert into customers values(‘a’,‘Alice’,‘us’),(‘b’,‘Bob’,‘ca’),(‘c’,‘Carlos’,‘mxus’),(‘d’,‘Dieter’,‘de’);
select * from customers ;

–创建订单表orders
create table orders(
oid int ,
cid string,
total int
);
–插入订单数据
insert into orders

values(1,‘a’,1539),(2,‘c’,1871),(3,‘a’,6352),(4,‘b’,1456),(5,‘z’,2137);
select * from orders;

–查看订单表和客户表的信息，显示客户的编号、名称和订单总量(内连接)
SELECT c.cust_id,c.name,o.total from orders o inner join customers c on o.cid=c.cust_id;

–使用左外连接查看数据
SELECT c.cust_id,c.name,o.total from orders o left outer join customers c on (o.cid=c.cust_id);

–使用右外连接查看数据
SELECT c.cust_id,c.name,o.total from orders o right outer join customers c on (o.cid=c.cust_id);

–全外连接(Full Outer Join)
SELECT c.cust_id,c.name,o.total from orders o full outer join customers c on (o.cid=c.cust_id);

–全外连接(Full Outer Join)显示客户或者和订单都是空值的数据
SELECT c.cust_id,c.name,o.total from orders o full outer join customers c on (o.cid=c.cust_id) WHERE c.cust_id IS NULL OR o.total IS NULL;

7.12.3.Hive 提供许多内置函数
以下这些函数都是以数值方式操作的

时间戳函数(Timestamp Function)

字符串函数(String Functions)

字符串连接
CONCAT合并一个或多个字符串
CONCAT_WS用于连接可变分隔符分隔的字符串

7.13.聚集函数
聚集函数则可以组合多行数据的值；
基于列表达式的分组
GROUP BY column
基于行的组合
GROUP BY是按一列或多列进行分组的；
当SELECT 语句包含GROUP BY 子句时，查询列只能为聚集函数和分组列
Hive 提供的聚合函数有：
count max min sum avg

use mydb;

–统计总共有多少条房源信息
select COUNT(*) from zufang ;

–根据小区的名称统计每个小区的房源数据
select xqname,COUNT(*) from zufang group by xqname;

—获取按小区名称获取小区的最高和最低租价
select xqname,MAX(abs(jiage)*10) maxprice,MIN(abs(jiage)*10) minprice from zufang group by xqname;

–统计每个小区的平均租金价格,保留2位小数
select xqname,round(avg(abs(jiage)*10),2) avgmaxprice from zufang group by xqname;

–统计每个小区的平均租金价格,保留2位小数，并从低到高排序
select xqname,round(avg(abs(jiage)*10),2) avgmaxprice from zufang group by xqname order by round(avg(abs(jiage)*10),2);

7.14.复杂类型
7.14.1.Hive支持复杂数据类型；

更加有效率 – 使用一张表取代多张表的连接；
底层的数据往往是使用复杂数据构造的；

7.14.2.使用数组类型

数组在同一列中的所有元素都是同一种数据类型

–创建客户电话表
CREATE TABLE customers_phones
(cust_id STRING,
name STRING,
phones ARRAY)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
COLLECTION ITEMS TERMINATED BY ‘|’;

数据文件
– 数据文件
a,Alice,555-1111|555-2222|555-3333
b,Bob,555-4444
c,Carlos,555-5555|555-6666

向数组类型的字段插入数据：
/*
*

a,Alice,555-1111|555-2222|555-3333
b,Bob,555-4444
c,Carlos,555-5555|555-6666
*/
INSERT into customers_phones values(‘a’,‘Alice’,array(‘555-1111’,‘555-2222’,‘555-3333’));
INSERT into customers_phones values(‘b’,‘Bob’,array(‘555-444’,‘555-3322’,‘555-2133’)),
(‘c’,‘Tomcat’,array(‘555-567’,‘555-7896’,‘555-4576’)),
(‘d’,‘Bazid’,array(‘555-4433’,‘555-3344’,‘555-1314’));
查询
select * from customers_phones;

根据电话的索引查询
select name,phones[0] from customers_phones;

7.14.3.映射是另一种复杂的数据类型 - 键值对

7.14.4.使用映射类型
映射列中的每个键码的类型都是相同的，每个键值的类型也是相同的
MAP<KEY-TYPE,VALUE-TYPE>

创建表
–创建带有Map类型字段的表
CREATE TABLE emp_phones
(et_id STRING,
ename STRING,
phones MAP<STRING,STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
COLLECTION ITEMS TERMINATED BY ‘|’
MAP KEYS TERMINATED BY ‘:’;

select * from emp_phones ;
创建带有Map类型的表字段
–向map集合类型的字段插入数据
insert into emp_phones values(‘no1’,‘张飞’,Map(‘HOME’,‘555-111’,‘WORK’,‘555-222’,‘MOBILE’,‘555-333’));

insert into emp_phones values(‘no1’,‘马超’,Map(‘HOME’,‘555-111’,‘WORK’,‘555-222’,‘MOBILE’,‘555-333’)),
(‘no1’,‘关羽’,Map(‘HOME’,‘555-111’,‘WORK’,‘555-222’,‘MOBILE’,‘555-333’)),
(‘no1’,‘赵云’,Map(‘HOME’,‘555-111’,‘WORK’,‘555-222’,‘MOBILE’,‘555-333’)),
(‘no1’,‘黄忠’,Map(‘HOME’,‘555-111’,‘WORK’,‘555-222’,‘MOBILE’,‘555-333’));

select * from emp_phones ;

7.14.5.结构struct
结构存储数据的结构属性，每个数据项可以是不同类型的数据

7.14.6.使用结构类型
结构项需要同时具备数据名称和数据类型

使用演示
–创建结构体类型字段的表
CREATE TABLE customers_addr
(
cust_id STRING,
name STRING,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zipcode:STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
COLLECTION ITEMS TERMINATED BY ‘|’;

/**

数据文件的格式
– 数据文件
a,Alice,742 Evergreen Terrace|Springfield|OR|97477
b,Bob,1600 Pennsylvania Ave NW|Washington|DC|20500
c,Carlos,342 Gravelpit Terrace|Bedrock
*/
----插入数据
insert into customers_addr
values(‘no1’,‘庄周’,named_struct(‘street’,‘高新一路’,‘city’,‘西安’,‘state’,‘陕西省’,‘zipcode’,‘710071’));

SELECT * from customers_addr ;

使用EXPLODE函数把表中数组里的每个元素转换成一条记录
–将表中数组类型的字段值转为记录显示
SELECT EXPLODE(phones) as phone
FROM customers_phones;

7.15.Hive索引–提升查询效率
Hive支持索引，但是Hive的索引与关系型数据库中的索引并不相同，比如，Hive不支持主键或者外键。
Hive索引可以建立在表中的某些列上，以提升一些操作的效率，例如减少MapReduce任务中需要读取的数据块的数量。
在可以预见到分区数据非常庞大的情况下，索引常常是优于分区的。
虽然Hive并不像事物数据库那样针对个别的行来执行查询、更新、删除等操作。它更多的用在多任务节点的场景下，快速地全表扫描大规模数据。但是在某些场景下，建立索引还是可以提高Hive表指定列的查询速度。

八.在IDEA中使用HIVE实现大数据的开发治理
8.1.在IDEA中配置hive环境
8.1.1.修改hdfs-site.xml

dfs.webhdfs.enabled
true

8.1.2.配置core-site.xml文件

hadoop.proxyuser.hdfs.hosts

hadoop.proxyuser.hdfs.groups

第一个属性表示允许任意主机访问hdfs文件系统
第二个属性表示允许任意组用户访问hdfs文件系统
在启动hadoop服务
./start-all.sh
8.1.3.启动hive的服务
在重新开启一个xshell客户端，开始hiveserver2服务，用于网络连接
hive --service hiveserver2

8.1.4.打开IDEA创建springboot+mybatis项目,配置数据源

创建完成后，修改springboot项目的pom.xml文件，降版本降低为一下配置

<?xml version="1.0" encoding="UTF-8"?>

4.0.0

org.springframework.boot
spring-boot-starter-parent
2.3.1.RELEASE

com.xinguan
prjspringboothive
ver1.0
prjspringboothive
Demo project for Spring Boot

<java.version>1.8</java.version>

org.springframework.boot
spring-boot-starter-web

org.mybatis.spring.boot
mybatis-spring-boot-starter
2.3.1

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.mybatis.spring.boot</groupId>
        <artifactId>mybatis-spring-boot-starter-test</artifactId>
        <version>2.3.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <configuration>
                <excludes>
                    <exclude>
                        <groupId>org.projectlombok</groupId>
                        <artifactId>lombok</artifactId>
                    </exclude>
                </excludes>
            </configuration>
        </plugin>
    </plugins>
</build>

项目结构如下所示：

8.1.5.hive数据源的配置

连接完成后，可以使用sql语句测试hive是否连接成功:

至此，环境配置完成。
8.2.使用springboot+mybatis开发hive应用（查询）
8.2.1.在IDEA中安装插件

8.2.2.使用easycode插件生成实体类、dao、dao.xml、service、以及控制器类

修改pom.xml文件，加入hive-jdbc的依赖包和hadoop的依赖包

<?xml version="1.0" encoding="UTF-8"?>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <!-- 引入hiveJDBC的依赖包 -->
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>3.1.2</version>
        <exclusions>
            <exclusion>
                <groupId>org.eclipse.jetty</groupId>
                <artifactId>jetty-runner</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <!--导入日志依赖-->
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <!-- 导入hadoop的组件库支持 -->
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>3.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-jdbc</artifactId>
    </dependency>

      <!--jsp-->
    <dependency>
        <groupId>org.apache.tomcat.embed</groupId>
        <artifactId>tomcat-embed-jasper</artifactId>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>druid</artifactId>
        <version>1.0.19</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.mybatis.spring.boot</groupId>
        <artifactId>mybatis-spring-boot-starter-test</artifactId>
        <version>2.3.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <configuration>
                <excludes>
                    <exclude>
                        <groupId>org.projectlombok</groupId>
                        <artifactId>lombok</artifactId>
                    </exclude>
                </excludes>
            </configuration>
        </plugin>
    </plugins>
</build>

8.2.3.修改ZufangDao.java和ZufangDao.xml
ZuFangDao.java
package com.xinguan.dao;

import com.xinguan.entity.Zufang;
import org.apache.ibatis.annotations.Mapper;
import org.apache.ibatis.annotations.Param;
import org.mybatis.spring.annotation.MapperScan;
import org.springframework.data.domain.Pageable;
import org.springframework.stereotype.Service;

import java.util.List;

/**

(Zufang)表数据库访问层
@author makejava
@since 2023-09-11 11:39:04
*/

public interface ZufangDao {

/**
 * 通过ID查询单条数据
 *
 * @param  zid
 * @return 实例对象
 */
Zufang queryById(String zid);

/**
 * 查询指定行数据
 *
 * @param zufang 查询条件
 * @param pageable         分页对象
 * @return 对象列表
 */
List<Zufang> queryAllByLimit(@Param("zufang") Zufang zufang, @Param("pageable") Pageable pageable);

/**
 * 统计总行数
 *
 * @param zufang 查询条件
 * @return 总行数
 */

long count(@Param(“zufang”) Zufang zufang);

List<Zufang> queryAll();
/**
 * 新增数据
 *
 * @param zufang 实例对象
 * @return 影响行数
 */
int insert(Zufang zufang);

/**
 * 批量新增数据（MyBatis原生foreach方法）
 *
 * @param entities List<Zufang> 实例对象列表
 * @return 影响行数
 */
int insertBatch(@Param("entities") List<Zufang> entities);

/**
 * 批量新增或按主键更新数据（MyBatis原生foreach方法）
 *
 * @param entities List<Zufang> 实例对象列表
 * @return 影响行数
 * @throws org.springframework.jdbc.BadSqlGrammarException 入参是空List的时候会抛SQL语句错误的异常，请自行校验入参
 */
int insertOrUpdateBatch(@Param("entities") List<Zufang> entities);

/**
 * 修改数据
 *
 * @param zufang 实例对象
 * @return 影响行数
 */
int update(Zufang zufang);

/**
 * 通过主键删除数据
 *
 * @param  zid
 * @return 影响行数
 */
int deleteById(String zid );

}

ZuFangDao.xml

<?xml version="1.0" encoding="UTF-8"?>

<resultMap id="BaseResultMap" type="com.xinguan.entity.Zufang">
    <!--@Table zufang-->
    <result property="zid" column="zid" jdbcType="INTEGER"/>
    <result property="address" column="address" jdbcType="VARCHAR"/>
    <result property="huanjing" column="huanjing" jdbcType="VARCHAR"/>
    <result property="jiage" column="jiage" jdbcType="INTEGER"/>
    <result property="louceng" column="louceng" jdbcType="VARCHAR"/>
    <result property="mianji" column="mianji" jdbcType="INTEGER"/>
    <result property="shi" column="shi" jdbcType="INTEGER"/>
    <result property="ting" column="ting" jdbcType="INTEGER"/>
    <result property="title" column="title" jdbcType="VARCHAR"/>
    <result property="xqname" column="xqname" jdbcType="VARCHAR"/>
</resultMap>

<!--查询单个-->
<select id="queryById" parameterType="java.lang.String" resultMap="BaseResultMap">
    select
      zid, address, huanjing, jiage, louceng, mianji, shi, ting, title, xqname
    from zufang
    where  zid= #{zid}
</select>

<!--查询指定行数据-->
<select id="queryAllByLimit" resultMap="BaseResultMap">
    select
    abs(zid) as zid, address, huanjing, abs(jiage)*10 as jiage, louceng, mianji, shi, ting, title, xqname
    from zufang
    <where>
        <if test="zufang!=null">
            <if test="zufang.address != null and zufang.address != ''">
                and address = #{zufang.address}
            </if>
            <if test="zufang.huanjing != null and zufang.huanjing != ''">
                and huanjing = #{zufang.huanjing}
            </if>
            <if test="zufang.jiage != null">
                and jiage = #{zufang.jiage}
            </if>
            <if test="zufang.louceng != null and zufang.louceng != ''">
                and louceng = #{zufang.louceng}
            </if>
            <if test="zufang.mianji != null">
                and mianji = #{zufang.mianji}
            </if>
            <if test="zufang.title != null and zufang.title != ''">
                and title = #{zufang.title}
            </if>
            <if test="zufang.xqname != null and zufang.xqname != ''">
                and xqname = #{zufang.xqname}
            </if>
        </if>
    </where>
    limit #{pageable.offset}, #{pageable.pageSize}
</select>

<!--通过实体作为筛选条件查询-->
<select id="queryAll" resultMap="BaseResultMap">
    select
      abs(zid) as zid, address, huanjing, abs(jiage)*10 as jiage, louceng, mianji, shi, ting, title, xqname
    from mydb.zufang
</select>

<!--统计总行数-->
<select id="count" resultType="java.lang.Long">
    select count(1)
    from zufang
    <where>
        <if test="zufang!=null">
            <if test="zufang.address != null and zufang.address != ''">
                and address = #{zufang.address}
            </if>
            <if test="zufang.huanjing != null and zufang.huanjing != ''">
                and huanjing =#{zufang.huanjing}
            </if>
            <if test="zufang.jiage != null">
                and jiage = #{zufang.jiage}
            </if>
            <if test="zufang.louceng != null and zufang.louceng != ''">
                and louceng = #{zufang.louceng}
            </if>
            <if test="zufang.mianji != null">
                and mianji = #{zufang.mianji}
            </if>
            <if test="zufang.title != null and zufang.title != ''">
                and title = #{zufang.title}
            </if>
            <if test="zufang.xqname != null and zufang.xqname != ''">
                and xqname = #{zufang.xqname}
            </if>
        </if>
    </where>
</select>

<!--新增所有列-->
<insert id="insert" keyProperty="" useGeneratedKeys="true">
    insert into zufang(zid, address, huanjing, jiage, louceng, mianji, shi, ting, title, xqname)
    values (#{zid}, #{address}, #{huanjing}, #{jiage}, #{louceng}, #{mianji}, #{shi}, #{ting}, #{title}, #{xqname})
</insert>

<insert id="insertBatch" keyProperty="" useGeneratedKeys="true">
    insert into zufang(zid, address, huanjing, jiage, louceng, mianji, shi, ting, title, xqname)
    values
    <foreach collection="entities" item="entity" separator=",">
    (#{entity.zid}, #{entity.address}, #{entity.huanjing}, #{entity.jiage}, #{entity.louceng}, #{entity.mianji}, #{entity.shi}, #{entity.ting}, #{entity.title}, #{entity.xqname})
    </foreach>
</insert>

<insert id="insertOrUpdateBatch" keyProperty="" useGeneratedKeys="true">
    insert into zufang(zid, address, huanjing, jiage, louceng, mianji, shi, ting, title, xqname)
    values
    <foreach collection="entities" item="entity" separator=",">
        (#{entity.zid}, #{entity.address}, #{entity.huanjing}, #{entity.jiage}, #{entity.louceng}, #{entity.mianji}, #{entity.shi}, #{entity.ting}, #{entity.title}, #{entity.xqname})
    </foreach>
    on duplicate key update
    zid = values(zid),
    address = values(address),
    huanjing = values(huanjing),
    jiage = values(jiage),
    louceng = values(louceng),
    mianji = values(mianji),
    shi = values(shi),
    ting = values(ting),
    title = values(title),
    xqname = values(xqname)
</insert>

<!--通过主键修改数据-->
<update id="update">
    update zufang
    <set>
        <if test="zid != null">
            zid = #{zid},
        </if>
        <if test="address != null and address != ''">
            address = #{address},
        </if>
        <if test="huanjing != null and huanjing != ''">
            huanjing = #{huanjing},
        </if>
        <if test="jiage != null">
            jiage = #{jiage},
        </if>
        <if test="louceng != null and louceng != ''">
            louceng = #{louceng},
        </if>
        <if test="mianji != null">
            mianji = #{mianji},
        </if>
        <if test="shi != null">
            shi = #{shi},
        </if>
        <if test="ting != null">
            ting = #{ting},
        </if>
        <if test="title != null and title != ''">
            title = #{title},
        </if>
        <if test="xqname != null and xqname != ''">
            xqname = #{xqname},
        </if>
    </set>
    where  zid= #{zid}
</update>

<!--通过主键删除-->
<delete id="deleteById" parameterType="java.lang.String">
    delete from zufang where  zid= #{zid}
</delete>

修改控制器类ZufangController
package com.xinguan.controller;

import com.xinguan.entity.Zufang;
import com.xinguan.service.ZufangService;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.;
import java.util.;
import javax.annotation.Resource;

/**

(Zufang)表控制层
@author makejava
@since 2023-09-11 11:39:02
/
@RestController
public class ZufangController {
/*
- 服务对象
  */
  @Resource
  private ZufangService zufangService;
/**
- 分页查询
- @param zufang 筛选条件
- @param page pagesize 分页对象
- @return 查询结果
  */
  @RequestMapping(value = “queryByPage_ZuFang.do”)
  public Map<String,Object> queryByPage(Zufang zufang, Integer page,Integer pagesize) {
  Map<String,Object> map=new HashMap<>();
  page=pagenull||page<1?1:page;
  pagesize=pagesizenull||pagesize<1?5:(pagesize>20?20:pagesize);
  PageRequest pageRequest=PageRequest.of((page-1),pagesize);
  
  Page pagezufang=zufangService.queryByPage(zufang,pageRequest);
  //获取当前页的房源列表
  List lszf=pagezufang.getContent();
  //获取总页数
  int maxpage=pagezufang.getTotalPages();
  map.put(“page”,page);
  map.put(“pagesize”,pagesize);
  map.put(“lszf”,lszf);
  map.put(“maxpage”,maxpage);
  
  return map;
  }

}

8.2.4.修改applicationContext.xml，配置hive的数据源
server.port=9900
spring.datasource.driver-class-name=org.apache.hive.jdbc.HiveDriver
spring.datasource.url=jdbc:hive2://192.168.43.89:10000/mydb
spring.datasource.username=root
spring.datasource.password=java

mybatis.configuration.log-impl=org.apache.ibatis.logging.stdout.StdOutImpl
mybatis.mapper-locations=mapper/*Dao.xml
mybatis.type-aliases-package=com.xinguan.entity

8.2.4.在springboot主文件中配置注解扫描器，扫描注解类所在的包
PrjspringboothiveApplication.java
package com.xinguan;

import org.mybatis.spring.annotation.MapperScan;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.ComponentScan;

@SpringBootApplication
@MapperScan(basePackages = {“com.xinguan.dao”})
@ComponentScan(basePackages = {“com.xinguan.service”,“com.xinguan.controller”})
public class PrjspringboothiveApplication {

public static void main(String[] args) {
    SpringApplication.run(PrjspringboothiveApplication.class, args);
}

}

springboot启动
Logging initialized using ‘class org.apache.ibatis.logging.stdout.StdOutImpl’ adapter.
Parsed mapper file: ‘file [E:\中软国际\西安建筑科技大学\信管2班大数据分析项目实训\day11\prjspringboothive\target\classes\mapper\ZufangDao.xml]’
2023-09-11 17:02:01.608 INFO 13828 — [ restartedMain] o.s.s.concurrent.ThreadPoolTaskExecutor : Initializing ExecutorService ‘applicationTaskExecutor’
2023-09-11 17:02:01.993 INFO 13828 — [ restartedMain] o.s.b.d.a.OptionalLiveReloadServer : LiveReload server is running on port 35729
2023-09-11 17:02:02.057 INFO 13828 — [ restartedMain] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 9900 (http) with context path ‘’
2023-09-11 17:02:02.067 INFO 13828 — [ restartedMain] c.xinguan.PrjspringboothiveApplication : Started PrjspringboothiveApplication in 5.325 seconds (JVM running for 7.172)
2023-09-11 17:02:07.902 INFO 13828 — [nio-9900-exec-1] o.a.c.c.C.[Tomcat].[localhost].[/] : Initializing Spring DispatcherServlet ‘dispatcherServlet’
2023-09-11 17:02:07.902 INFO 13828 — [nio-9900-exec-1] o.s.web.servlet.DispatcherServlet : Initializing Servlet ‘dispatcherServlet’
2023-09-11 17:02:07.910 INFO 13828 — [nio-9900-exec-1] o.s.web.servlet.DispatcherServlet : Completed initialization in 8 ms
Creating a new SqlSession
SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@1947f11b] was not registered for synchronization because synchronization is not active
2023-09-11 17:02:08.017 INFO 13828 — [nio-9900-exec-1] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Starting…
2023-09-11 17:02:08.302 INFO 13828 — [nio-9900-exec-1] com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Driver does not support get/set network timeout for connections. (Method not supported)
2023-09-11 17:02:08.322 INFO 13828 — [nio-9900-exec-1] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Start completed.
JDBC Connection [HikariProxyConnection@1091138009 wrapping org.apache.hive.jdbc.HiveConnection@61c12f28] will not be managed by Spring
> Preparing: select count(1) from zufang
> Parameters:
< Columns: _c0
< Row: 201
<== Total: 1
Closing non transactional SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@1947f11b]
Creating a new SqlSession
SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@56f57c2c] was not registered for synchronization because synchronization is not active
JDBC Connection [HikariProxyConnection@576916630 wrapping org.apache.hive.jdbc.HiveConnection@61c12f28] will not be managed by Spring
> Preparing: select abs(zid) as zid, address, huanjing, abs(jiage)*10 as jiage, louceng, mianji, shi, ting, title, xqname from zufang limit ?, ?
> Parameters: 0(Long), 5(Integer)
< Columns: zid, address, huanjing, jiage, louceng, mianji, shi, ting, title, xqname
< Row: 1, 高新区-大寨路大寨路,近丈八北路, 合租,朝南,有电梯,3/5号线, 880, 高层(共30层) , 18, 3, 2, 无管理费鱼化寨铭城国际外事学院金辉国际广场拎包入住, 铭城(西区)
<== Row: 2, 高新区-科技路西口高新六路,近科技路, 合租,朝南,有电梯,3号线, 690, 中层(共33层) , 20, 3, 1, 首月免租金丈八北路地铁口米罗蓝山高新医院直租无中介, 米罗蓝山
<== Row: 11, 高新区-昆明路沣惠南路,近昆明路, 合租,朝南,有电梯,3/5号线, 690, 中层(共30层) , 25, 5, 0, 月付金光门地铁沣惠南路延平门太和时代广场牡丹庄园, 牡丹庄园
<== Row: 101, 高新区-科技六路科技六路23号, 合租,朝南,有电梯,6号线, 620, 低层(共31层) , 20, 3, 1, 南窑头社区西区数字生活保利天悦一期旺座现代城橡树街区, 数字空间
<== Row: 102, 高新区-科技四路西段高新六路43号, 合租,朝南,有电梯, 1120, 低层(共28层) , 28, 3, 2, 月付直租！（超大空间）科技四路绿港花园南窑头可做饭, 绿港花园
<== Total: 5
Closing non transactional SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@56f57c2c]

测试

8.2.5.编写前端页面，引入VUE组件展示hive返回的数据
1.拷贝vue和jquery文件目录到static目录下

2.创建index.html和index.js

房源信息展示

编号	地址	环境	价格	楼层	面积	室	厅	标题	小区名称
{{zf.zid}}	{{zf.address}}	{{zf.huanjing}}	{{zf.jiage}}	{{zf.louceng}}	{{zf.mianji}}	{{zf.shi}}	{{zf.ting}}	{{zf.title}}	{{zf.xqname}}

index.js
var mainvue=new Vue({
el:‘#maindiv’,
data:{
lszf:[],//房源集合
page:1,
pagesize:5,
maxpage:1
},
created:function(){
this.dofindPageAll();//初始化时显示房源信息列表
},
methods:{
dofindPageAll(){
this.$http.get(‘queryByPage_ZuFang.do?page=’+this.page+‘&pagesize=’+this.pagesize+‘&t=’+new Date().getTime()).then(res=>{
console.log(res.data);//控制台打印服务器返回的结果
var map=res.data;
this.page=map.page;
this.pagesize=map.pagesize;
this.lszf=map.lszf;
this.maxpage=map.maxpage;
});
},
dofirstpage(){
this.page=1;
this.dofindPageAll();
},
doprevpage(){
this.page=this.page-1;
if(this.page<1){
this.page=1;
}
this.dofindPageAll();
},
donextpage(){
this.page=this.page+1;
if(this.page>this.maxpage){
this.page=this.maxpage;
}
this.dofindPageAll();
},
dolastpage(){
this.page=this.maxpage;
this.dofindPageAll();
},

},

});
3.页面显示的效果：

九.使用Hadoop的hive组件构建数据分析看板
9.1.需要使用到的技术
9.1.1.后端技术
Hadoop、mariadb、hive（底层MapReduce）、springboot、MyBatis
9.1.2.前端技术
css、html、js、vue、jquery、echarts

9.1.3.需要依赖的组件
Mybatis、mybatis-spring、hive-jdbc、hadoop、mapreduce、hdfs

9.2.业务需求
需要使用hive分析租房系统的数据，在数据看板展示以下内容:

1.显示最新房源信息前5条记录
2.使用饼形图显示每个小区（按小区名称统计）的房源数量
3.使用柱状图按照统计每个小区的楼层的租金价格
4.使用地图显示房源小区的分布位置
5.放置3-4个环形图，显示房源总数量、每个小区的租金平均价格、每个小区的房源数量、不同楼层的租金平均价格
6.放置雷达图，根据位置，统计每个小区的租金的平均价格和租金信息
7.放置折线图，根据每个小区的室厅统计租金的价格
8.旭日图展示房屋的租金信息和地理构成

示例源代码：

Xxx租房信息分析图