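- 1: Checked the YARN instance list and found that both ResourceManager nodes were in standby state. Restarting YARN and the whole cluster did not produce an active node
- 2: Attempt 1: set yarn.resourcemanager.recovery.enabled to true to enable YARN's automatic recovery
- Conclusion: no effect; both nodes were still standby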
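- 3: Attempt 2: restart the entire big data cluster
- Restarted the whole cluster and found HDFS hung on network communication with hadoopm02
- After a long wait, the HDFS NameNode returned to normal. Checked the metrics and found no problems
- Tested passwordless SSH between hosts and found that passwordless SSH between hadoopm02 and the other machines had stopped working
- Detail: when restarting the cluster, Sentry must be started first; otherwise the other services cannot pass security authentication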
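- Narrowed it down: after YARN HA was enabled, both ResourceManagers were in standby state. A possible cause is that ZooKeeper holds an incorrect record of the ResourceManager state
- Tried formatting the ResourceManager state store
- yarn resourcemanager -format-state-store (run on the node that hosts the ResourceManager)
- The command failed with an error: the ResourceManager service could not be started directly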
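- Approach 1: edit the records in ZooKeeper (fallback plan)
- Download PrettyZoo
- 2. If the state is stored in ZK
- Check the yarn.resourcemanager.zk-state-store.parent-path setting, or run hdfs getconf -confKey yarn.resourcemanager.zk-state-store.parent-path
- The configured value is: /rmstore
- Then log in to the ZK client and delete: rmr /rmstore/ZKRMStateRoot/RMAppRoot/
- Restart the ResourceManager
- Tried to open the ZooKeeper path and run the delete, but hit a permission problem and did not know the account or password. Too much hassle, so switched approach
- Tried connecting to ZooKeeper with PrettyZoo, but PrettyZoo would not open on this machine. Suspected a version or 32/64-bit issue (the actual cause was Chinese characters in the install path; reinstalling to an all-English path fixed it)
- After connecting, the relevant path appeared empty, probably again because of insufficient read permission
- kinit zookeeper/hadoopm01@HADOOP.COM -kt /opt/kerberos_keytab/dolphinscheduler.keytab to authenticate to Kerberos as the ZooKeeper principal
- Still got a no-permission error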
- https://blog.csdn.net/zhouzba/article/details/106804181
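- Online posts say the cause is that YARN was not configured to store its HA state in ZooKeeper; set the relevant parameters and restarted YARN to test
- After the restart, still two standby nodes
- Approach 2: manually switch the YARN HA active/standby state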
- kinit yarn/hadoopm01@HADOOP.COM -kt /opt/kerberos_keytab/dolphinscheduler.keytab
- 1. Refresh the cluster queue configuration
- yarn rmadmin -refreshQueues
- 2. Check the state of a given RM
- yarn rmadmin -getServiceState rm79
- 3. Switch the given RM to standby
- yarn rmadmin -transitionToStandby rm79 --forcemanual
- 4. Switch the given RM to active
- yarn rmadmin -transitionToActive rm79 --forcemanual
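- 5. Dynamically reload the node include/exclude lists (for YARN these are yarn.resourcemanager.nodes.include-path and yarn.resourcemanager.nodes.exclude-path; dfs.hosts and dfs.hosts.exclude are the HDFS equivalents) without restarting the RM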
- yarn rmadmin -refreshNodes
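- ResourceManager service management shell commands
- yarn-daemon.sh start resourcemanager # start the ResourceManager on its own
- yarn-daemon.sh stop resourcemanager # stop the ResourceManager on its own
- Error: the commands require passing YARN's Kerberos authentication first
- Principal: kinit hive/hive@HADOOP.COM
- Password: Fuda@2023
- klist shows the Kerberos principals currently authenticated on this server
- Authenticate with a keytab
- Open the Kerberos admin command-line client (on hadoopm01): sudo kadmin.local
- List all Kerberos principals from inside the client: listprincs
- Export the YARN principals and their keys to a local file, dolphinscheduler.keytab (run in the Linux shell, not inside the Kerberos admin client)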
- sudo ktadd -norandkey -kt /opt/kerberos_keytab/dolphinscheduler.keytab yarn/hadoopm01@HADOOP.COM
- sudo ktadd -norandkey -kt /opt/kerberos_keytab/dolphinscheduler.keytab yarn/hadoopm02@HADOOP.COM
- sudo ktadd -norandkey -kt /opt/kerberos_keytab/dolphinscheduler.keytab yarn/hadoops01@HADOOP.COM
- sudo ktadd -norandkey -kt /opt/kerberos_keytab/dolphinscheduler.keytab yarn/hadoops02@HADOOP.COM
- sudo ktadd -norandkey -kt /opt/kerberos_keytab/dolphinscheduler.keytab yarn/hadoops03@HADOOP.COM
- sudo ktadd -norandkey -kt /opt/kerberos_keytab/dolphinscheduler.keytab yarn/hadoops04@HADOOP.COM
- sudo ktadd -norandkey -kt /opt/kerberos_keytab/dolphinscheduler.keytab yarn/hadoops05@HADOOP.COM
- yarn resourcemanager -format-state-store (run on the node that hosts the ResourceManager)
- Authenticate as YARN on hadoopm01
- sudo kinit yarn/hadoopm01@HADOOP.COM -kt /opt/kerberos_keytab/dolphinscheduler.keytab
- Authenticate as YARN on hadoopm02
- sudo kinit yarn/hadoopm02@HADOOP.COM -kt /opt/kerberos_keytab/dolphinscheduler.keytab
- Verify the authentication result: sudo klist
- Manually force a change to the YARN HA active state
- sudo yarn resourcemanager -format-state-store
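- It fails with the following error
- org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /rmstore/ZKRMStateRoot
- Still need access to /rmstore/ZKRMStateRoot in ZooKeeper, with read and delete permission
- In summary, both approaches are blocked on the same thing: getting read and delete permission on /rmstore/ZKRMStateRoot in ZooKeeper
- So the next step was to work out how to get that permission
- Option 1: set up a ZooKeeper superuser (failed)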
- https://blog.csdn.net/zhouzba/article/details/106804181
- Steps to create a ZooKeeper superuser:
- export ZK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/zookeeper/conf:/opt/cloudera/parcels/CDH/lib/zookeeper/lib/*:/opt/cloudera/parcels/CDH/lib/zookeeper/*
- java -cp $ZK_CLASSPATH org.apache.zookeeper.server.auth.DigestAuthenticationProvider super:super123
- super:super123->super:UdxDQl4f9v5oITwcAsO9bmWgHSI=
- SERVER_JVMFLAGS=-Dzookeeper.DigestAuthenticationProvider.superDigest=super:UdxDQl4f9v5oITwcAsO9bmWgHSI=
- [zk: sandbox.hortonworks.com:2181(CONNECTED) 1] addauth digest super:super123
- [zk: sandbox.hortonworks.com:2181(CONNECTED) 1] setAcl /rmstore/ZKRMStateRoot world:anyone:cdrwa
- Option 2: in the ZooKeeper configuration, search for "java" and add the following parameter
- https://blog.csdn.net/qq_40341628/article/details/88390168?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-2-88390168-blog-102516684.235^v43^pc_blog_bottom_relevance_base3&spm=1001.2101.3001.4242.2&utm_relevant_index=5
- -Dzookeeper.skipACL=yes
- Note: this skips ZooKeeper's ACL permission checks
- Restart ZooKeeper
- Access now works
- rmr /rmstore/ZKRMStateRoot/RMAppRoot/
- cd /opt/cloudera/parcels/CDH/lib/zookeeper/bin
- sh zkCli.sh -server hadoopm01:2181
- ls /rmstore/ZKRMStateRoot/RMAppRoot
- Clear the YARN scheduling records
- deleteall /rmstore/ZKRMStateRoot/RMAppRoot
- ls /rmstore/ZKRMStateRoot/RMAppRoot
- Done
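
To confirm the outcome after the ResourceManagers are restarted, the `yarn rmadmin -getServiceState` check used in these notes can be wrapped in a small helper. This is a minimal sketch, not part of the original procedure: the function names `ha_verdict` and `check_rm_ha` are hypothetical, and it assumes the command prints `active` or `standby` on a single line and that a valid Kerberos ticket is already in place.

```shell
#!/usr/bin/env bash
# Sketch only. The RM ids passed in are assumptions; take the real ones
# from yarn.resourcemanager.ha.rm-ids in yarn-site.xml.

# Read one HA state per line on stdin and print a verdict.
ha_verdict() {
  local active=0 state
  while read -r state; do
    if [ "$state" = "active" ]; then
      active=$((active + 1))
    fi
  done
  case "$active" in
    0) echo "no-active" ;;        # the situation described in these notes
    1) echo "ok" ;;               # exactly one active RM
    *) echo "multiple-active" ;;  # more than one RM claims to be active
  esac
}

# Query every RM id given as an argument and feed the states to ha_verdict.
# A failed query is recorded as "unknown" so it never counts as active.
check_rm_ha() {
  local id
  for id in "$@"; do
    yarn rmadmin -getServiceState "$id" 2>/dev/null || echo "unknown"
  done | ha_verdict
}
```

For example, `check_rm_ha rm79 rm80` (where `rm80` is an assumed id for the second ResourceManager) prints `no-active` while both nodes report standby, and `ok` once exactly one of them reports active.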