Hadoop HA Experiment

Permanent link to this article: https://www.askmac.cn/archives/hadoop-ha-test.html

 

 

1. Experimental Environment

 

VM environment: VirtualBox 5.0, Ubuntu 15

 

Java 1.8.0, Hadoop 2.7.1

ZooKeeper 3.5.1

 

The nodes are distributed as follows:

 

10.0.0.21          dbdaostdby   #NameNode 2
10.0.0.22          dbdao        #NameNode
10.0.0.23          dbdao2       #ResourceManager
10.0.0.24          dbdao3       #web app proxy and MR Jobhistory server
10.0.0.25          dbdao4       #DataNode
10.0.0.26          dbdao5       #DataNode
10.0.0.27          dbdao6       #DataNode

 

 

 

2. Prerequisites:

  • Install Java
  • Download a stable Hadoop release from an Apache mirror

Please refer to the Hadoop cluster installation guide.

 

Note: if this is not a fresh installation, shut down the Hadoop cluster first.

 

3. Installation:

 

This experiment starts from an existing six-node Hadoop cluster:

The existing Hadoop cluster is converted into an HA cluster, and the HA behavior is tested.

Then the automatic failover components are installed, automatic HA failover is put in place, and it is tested as well.

 

3.1 Configuring Manual HA

 

Add the new host to the hosts configuration:

 

echo "10.0.0.21 dbdaostdby" >>  /etc/hosts
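The entry needs to exist on every node. A small loop can push it out, assuming passwordless SSH with sufficient privileges to each host is already in place (a convenience sketch, not part of the original steps):

for h in dbdao dbdao2 dbdao3 dbdao4 dbdao5 dbdao6; do
  # append the new standby host to each node's /etc/hosts
  ssh $h 'echo "10.0.0.21 dbdaostdby" >> /etc/hosts'
done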

 

Then edit the configuration files in the Hadoop configuration directory (cd /usr/local/hadoop/conf here):

 

vi hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/dbdao/namespace</value>
</property>

<property>
<name>dfs.nameservices</name>
<value>dbdaocluster</value>
</property>

<property>
<name>dfs.ha.namenodes.dbdaocluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.dbdaocluster.nn1</name>
<value>dbdao:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.dbdaocluster.nn2</name>
<value>dbdaostdby:8020</value>
</property>

<property>
<name>dfs.namenode.http-address.dbdaocluster.nn1</name>
<value>dbdao:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.dbdaocluster.nn2</name>
<value>dbdaostdby:50070</value>
</property>

<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://dbdao:8485;dbdaostdby:8485;dbdao2:8485/dbdaocluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.dbdaocluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>

<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
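A quick way to confirm the new settings are picked up is hdfs getconf, which ships with the stock Hadoop CLI:

hdfs getconf -confKey dfs.nameservices                # should print: dbdaocluster
hdfs getconf -confKey dfs.ha.namenodes.dbdaocluster   # should print: nn1,nn2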

 

vi core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://dbdaocluster</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>

<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/hadoop/data</value>
</property>

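These files must be identical on every node. A minimal sketch that pushes them out with scp, assuming the same install path everywhere:

for h in dbdaostdby dbdao2 dbdao3 dbdao4 dbdao5 dbdao6; do
  scp /usr/local/hadoop/conf/hdfs-site.xml /usr/local/hadoop/conf/core-site.xml $h:/usr/local/hadoop/conf/
done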

 

 

3.2 Verifying the Configuration and Starting the Cluster

 

Check the configuration on each node and verify the SSH trust:

Make sure the SSH channels between all the nodes work.

 

First, start the JournalNodes according to the configuration; here they run on the two NN hosts and the YARN host:

hadoop-daemon.sh start journalnode
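On each of the three hosts, jps should now list a JournalNode process, and the RPC port from the qjournal URI above can be probed (assuming nc is installed):

jps | grep JournalNode                      # one JournalNode per host
nc -z dbdao 8485 && echo "journalnode up"   # 8485 comes from dfs.namenode.shared.edits.dir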

 

Format the second NN:

First copy all the files under the first NN's namespace directory to the corresponding location on the second node, then run:

hdfs namenode -bootstrapStandby

Then, on the first NN, initialize the shared edits directory from the local edits:

hdfs namenode -initializeSharedEdits

Finally, start all the HA NameNodes:

Here the cluster-wide start script is used:

$HADOOP_PREFIX/sbin/start-dfs.sh

 

 

Check the relevant logs and processes.

 

The following errors appeared during this experiment:

Operation category JOURNAL is not supported in state standby

Unable to fetch namespace information from active NN

Troubleshooting showed that neither node obtained the active state during the automatic startup, so both were in standby; a single manual failover is enough to fix it:

 

hdfs haadmin -failover nn2 nn1

 

 

4. Managing the HA Cluster Manually

4.1 Checking and Monitoring

 

hdfs haadmin -getServiceState nn2
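A small loop shows both NameNodes at once (a convenience sketch, not a stock Hadoop command):

for nn in nn1 nn2; do
  printf '%s: ' $nn
  hdfs haadmin -getServiceState $nn
done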

 

 

4.2 Simulating a Failure and Failing Over Manually

 

Here, kill the NameNode process on the active NN by hand, then observe the state again and complete the switchover with one of the following commands.

hdfs haadmin -failover --forceactive nn1 nn2 (this invokes the configured fencing method)

or use

hdfs haadmin -transitionToActive --forceactive nn2 (not recommended)

 

Then manually start the NameNode on nn1; at this point nn1 is in standby state. Fail over from nn2 back to nn1:

hadoop-daemon.sh --script hdfs start namenode

hdfs haadmin -failover nn2 nn1

 

5. Configuring Automatic Failover

Everything above requires a human to monitor the cluster and run commands. Hadoop also provides an automatic HA component, which must be installed and configured separately.

First shut down the cluster; configuring automatic failover while the cluster is running is not currently possible.

 

On the NameNode:

$HADOOP_PREFIX/sbin/stop-dfs.sh

On the YARN node:

$HADOOP_PREFIX/sbin/stop-yarn.sh

 

5.1 Installing and Configuring ZooKeeper

 

First download the release tarball from the official site.

 

Official link: http://zookeeper.apache.org/releases.html

 

The latest version at the time, 3.5.1 (an alpha release), is used here.

 

Unpack it into the Hadoop installation path:

 

tar -zxvf zookeeper-3.5.1-alpha.tar.gz  -C /usr/local/hadoop/

 

Enter its configuration directory:

 

cd /usr/local/hadoop/zookeeper-3.5.1-alpha/conf

vi zoo.cfg

 

tickTime=2000
dataDir=/usr/local/hadoop/zookeeper/data
clientPort=2181
initLimit=5
syncLimit=2
server.1=dbdao:2888:3888
server.2=dbdaostdby:2888:3888
server.3=dbdao2:2888:3888

 

Create the data directory:

mkdir -p /usr/local/hadoop/zookeeper/data

 

In this directory, give each server its unique myid (an integer from 1-255):

echo "1" > /usr/local/hadoop/zookeeper/data/myid

 

Send the entire zookeeper-3.5.1-alpha directory to the corresponding directory on the other two machines, the nn2 and YARN hosts (dbdaostdby and dbdao2):

 

cd ..

scp -r zookeeper-3.5.1-alpha/   10.0.0.21:/usr/local/hadoop

 

-- Note that these two machines must use different myid values.
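For example, assuming the same paths on the other two machines, the ids can be set remotely to match server.2 and server.3 in zoo.cfg (a sketch):

ssh dbdaostdby 'mkdir -p /usr/local/hadoop/zookeeper/data && echo "2" > /usr/local/hadoop/zookeeper/data/myid'
ssh dbdao2 'mkdir -p /usr/local/hadoop/zookeeper/data && echo "3" > /usr/local/hadoop/zookeeper/data/myid'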

 

Start ZooKeeper on each of the three nodes:

/usr/local/hadoop/zookeeper-3.5.1-alpha/bin/zkServer.sh start

 

Use jps to check for the process (QuorumPeerMain); no errors in the log means it started successfully.
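Once a quorum has formed, zkServer.sh status also reports each node's role:

/usr/local/hadoop/zookeeper-3.5.1-alpha/bin/zkServer.sh status   # prints Mode: leader or Mode: follower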

 

5.2 Adding the Cluster Configuration

 

Add the following to hdfs-site.xml:

 

<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

 

Add the following to core-site.xml:

<property>
<name>ha.zookeeper.quorum</name>
<value>dbdao:2181,dbdaostdby:2181,dbdao2:2181</value>
</property>

 

5.3 Testing Automatic Failover

 

First initialize the HA state in ZooKeeper by running the following on the dbdao NameNode host:

[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK

-- This clears the related data in ZooKeeper; running it repeatedly will prompt you to shut down the cluster.

 

Start the ZKFC process:

It can be started manually on the NameNode hosts:

$HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start zkfc

 

-- The start-dfs.sh script automatically starts a ZKFC on every host that runs a NameNode, and one NameNode is automatically elected as active.

 

 

Here the cluster start scripts are used:

On the NameNode node:

$HADOOP_PREFIX/sbin/start-dfs.sh

 

On the YARN node:

 

$HADOOP_PREFIX/sbin/start-yarn.sh

 

Checking the state shows that nn1 is currently active; which node wins the election is effectively random.

Simulate a failure by killing the NameNode process on nn1:
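One way to do it from a shell on nn1, using the PID that jps reports (a sketch; noting the PID by hand works just as well):

kill -9 $(jps | awk '/ NameNode$/ {print $1}')   # kill the NameNode JVM on nn1
hdfs haadmin -getServiceState nn2                # should report active shortly after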

 

 

6. Problems Encountered

 

Problem: some warnings appeared:

 

2015-09-29 13:40:24,909 WARN org.apache.hadoop.hdfs.server.common.Util: Path /home/dbdao/namespace should be specified as a URI in configuration files. Please update hdfs configuration.

2015-09-29 13:40:24,910 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!



2015-09-29 13:41:45,232 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Get corrupt file blocks returned error: Operation category READ is not supported in state standby
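The third message is expected on the standby NameNode, which rejects read operations by design. The first two can be silenced by giving dfs.namenode.name.dir a URI value and, optionally, more than one directory; a sketch (the second path is a hypothetical example of a redundant storage directory):

<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/dbdao/namespace,file:///data/namespace-mirror</value>
</property>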

 

 

sshfence failure:

 

2015-09-29 16:33:07,532 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======

2015-09-29 16:33:07,532 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)

2015-09-29 16:33:07,532 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to dbdao...

2015-09-29 16:33:07,532 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to dbdao port 22

2015-09-29 16:33:07,533 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connection established

2015-09-29 16:33:07,536 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Remote version string: SSH-2.0-OpenSSH_6.7p1 Ubuntu-5ubuntu1.3

2015-09-29 16:33:07,536 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local version string: SSH-2.0-JSCH-0.1.42

2015-09-29 16:33:07,536 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: CheckCiphers: aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256

2015-09-29 16:33:07,537 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-ctr is not available.

2015-09-29 16:33:07,540 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-ctr is not available.

2015-09-29 16:33:07,540 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-cbc is not available.

2015-09-29 16:33:07,540 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-cbc is not available.

2015-09-29 16:33:07,540 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: arcfour256 is not available.

2015-09-29 16:33:07,540 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT sent

2015-09-29 16:33:07,540 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT received

2015-09-29 16:33:07,540 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from dbdao port 22

2015-09-29 16:33:07,541 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to dbdao as user dbdao

com.jcraft.jsch.JSchException: Algorithm negotiation fail

at com.jcraft.jsch.Session.receive_kexinit(Session.java:520)

at com.jcraft.jsch.Session.connect(Session.java:286)

at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)

at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)

at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)

at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)

at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)

at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)

at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902)

at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801)

at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)

at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)

at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

2015-09-29 16:33:07,541 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.

2015-09-29 16:33:07,541 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.

2015-09-29 16:33:07,541 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election

java.lang.RuntimeException: Unable to fence NameNode at dbdao/10.0.0.22:8020

at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)

at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)

at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)

at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)

at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902)

at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801)

at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)

at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)

at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

2015-09-29 16:33:07,541 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session

2015-09-29 16:33:07,544 INFO org.apache.zookeeper.ZooKeeper: Session: 0x10004106e9901ff closed

 

This problem appeared during automatic failover: the sshfence call itself errored out ("Algorithm negotiation fail"), so fencing was reported as unsuccessful and the failover failed -- (no fix has been found yet; the suspicion is an incompatibility between the OpenSSH version and the JSch library bundled with Hadoop).

Fencing can be forced to succeed with a custom shell command that returns true, but that opens the door to split-brain; the problem is recorded here to revisit later.
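A sketch of that workaround, using the shell fencing method from the Hadoop HA documentation; sshfence is still tried first, and shell(/bin/true) only makes the fencing step report success, so use it with care:

<property>
<name>dfs.ha.fencing.methods</name>
<value>
sshfence
shell(/bin/true)
</value>
</property>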
