Oracle RAC OCR磁盘故障快速恢复方法
在Oracle RAC的测试环境以及实际生产环境中,有时候由于误操作或者磁盘损坏都会引起OCR磁盘故障从而导致CRS无法启动,数据库不能提供服务。本文针对可能出现的几种情况分别给出具体的解决方案,以下操作都经过实践。
一、故障定位:
(1)检查日志
tail -f /var/log/messages:
Oracle Cluster Registry initialization failed with invalid format: PROC-22: The OCR backend has an invalid format //出现这个的话就怀疑是OCR磁盘故障了
(2)检查集群状态
#crs_stat -t
CRS-0184: 无法与 CRS 守护程序通信。
(3)检查进程
# ps -ef|grep d.bin
root 26170 22963 0 11:51 pts/0 00:00:00 grep d.bin
发现crs等进程均不存在
(4)crsctl check boot
# crsctl check boot
OCR initialization failed accessing OCR device: PROC-26: 访问物理存储时出错 操作系统错误 [No such file or directory] [2]
(5)crsctl check crs
# crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
(6)检查OCR盘
# ocrcheck
PROT-602: 无法从集群注册表中检索数据
(7)检查VOTE盘
# crsctl query css votedisk
OCR initialization failed accessing OCR device: PROC-26: 访问物理存储时出错 操作系统错误 [No such file or directory] [2]
到这里基本可以确定是OCR磁盘故障了,不过也可能是磁阵没有挂载上或者裸设备没绑定的原因,可以用如下方法确定:fdisk –l 检查磁盘分区确认磁阵挂载上;如果OCR以及VOTE使用的是裸设备,则使用raw –qa检查裸设备是否绑定磁盘分区。
二、现象及解决方案:
现象一:磁阵重启导致CRS进程退出,无法启动
解决方案:
(1)重新扫描HBA卡
#rescan-scsi-bus.sh
(2)绑定裸设备
#/etc/init.d/raw start
(3)扫描ASM磁盘组
#oracleasm scandisks
一般到这里CRS就会自动启动了,如果还是启动不了再执行下面的操作:
(4)关闭CRS自启动(每个节点root用户执行)
#/etc/init.d/init.crs disable
(5)停止CRS(每个节点root用户执行)
#/etc/init.d/init.crs stop
(6)检查OCR备份目录
#ocrconfig –showbackup
(7)进入OCR备份所在节点的相应目录,使用备份文件恢复OCR
#cd /home/oracle/crs/cdata/crs
# ocrconfig -restore backup00.ocr(一般这个为最新的备份文件)
(8)开启自启动(每个节点root用户执行)
#/etc/init.d/init.crs enable
(9)启动CRS(每个节点root用户执行)
#/etc/init.d/init.crs start
现象二:OCR磁盘损坏但有备份
经过一系列排查已确定CRS无法启动的原因为OCR磁盘损坏,但在rac节点的本地磁盘上有OCR的备份,则可采用如下方式恢复OCR,启动CRS:
(1)检查OCR备份目录
#ocrconfig –showbackup
rac2 2011/12/05 11:07:21 /home/oracle/crs/cdata/crs
(2)进入OCR备份所在节点的相应目录,使用备份文件恢复OCR
#cd /home/oracle/crs/cdata/crs
# ocrconfig -restore backup00.ocr(一般这个为最新的备份文件)
(3)重新启动CRS(每个节点root用户执行)
# /etc/init.d/init.crs start
(4)检查集群状态
# crs_stat –t
一切正常后表示OCR恢复成功。
现象三:OCR磁盘损坏且无备份
由于某些原因导致OCR磁盘损坏,且没有备份文件。这就需要重建OCR,并向OCR中重新注册各项资源。操作步骤如下:
(1)我们使用dd命令模拟OCR磁盘的破坏
# dd if=/dev/zero of=/dev/raw/raw1 bs=8192 count=12800
(2)停止CRS进程(每个节点root用户执行)
# crsctl stop crs或者# /etc/init.d/init.crs stop
(3)分别在每个节点上执行$ORA_CRS_HOME/install/rootdelete.sh (必须root执行)
rac1:/home/oracle/crs/install # ./rootdelete.sh
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed with invalid format: PROC-22: OCR 后端有无效格式
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down...
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'
Cleaning up Network socket directories
(4)在某个RAC节点执行$ORA_CRS_HOME/install/rootdeinstall.sh
rac1:/home/oracle/crs/install # ./rootdeinstall.sh
Removing contents from OCR device
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 0.52535 s, 20.0 MB/s
(5)重建OCR,在各个RAC节点执行$ORA_CRS_HOME /root.sh
rac1:/home/oracle/crs # ./root.sh
WARNING: directory '/home/oracle' is not owned by root
No value set for the CRS parameter CRS_OCR_LOCATIONS. Using Values in paramfile.crs
Checking to see if Oracle CRS stack is already configured
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
WARNING: directory '/home/oracle' is not owned by root
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node <nodenumber>: <nodename> <private interconnect name> <hostname>
node 1: rac1 rac1-priv rac1
node 2: rac2 rac2-priv rac2
node 3: rac3 rac3-priv rac3
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Now formatting voting device: /u02/oracrs/vote.crs
Format of 1 voting devices complete.
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
rac1
CSS is inactive on these nodes.
rac2
rac3
Local node checking complete.
Run root.sh on remaining nodes to start CRS daemons.
注意最后一个节点执行后会有报错,原因为VIPCA以silent方式运行失败。所以需要手动运行VIPCA工具(图形界面)。
(6)运行VIPCA工具,在图形界面上进行VIP配置,与以前一致即可,具体过程这里不再详细的描述。
(7)检查CRS服务状态
rac1:/home/oracle/crs # crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
rac1:/home/oracle/crs # crs_stat -t
名称 类型 目标 状态 主机
------------------------------------------------------------
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2
ora.rac3.gsd application ONLINE ONLINE rac3
ora.rac3.ons application ONLINE ONLINE rac3
ora.rac3.vip application ONLINE ONLINE rac3
这里说明ONS、GSD、VIP都已注册到OCR中,还需要把监听、数据库、ASM注册到OCR,继续下面操作。
(8)使用netca工具重新配置监听器,注册listener
配置完后重新查看crs状态:
rac1:/home/oracle/crs # crs_stat -t
名称 类型 目标 状态 主机
------------------------------------------------------------
ora....C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora....C2.lsnr application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2
ora....C3.lsnr application ONLINE ONLINE rac3
ora.rac3.gsd application ONLINE ONLINE rac3
ora.rac3.ons application ONLINE ONLINE rac3
ora.rac3.vip application ONLINE ONLINE rac3
(9)向OCR中注册ASM
#srvctl add asm -n rac1 -i +ASM1 -o $ORACLE_HOME
# srvctl add asm -n rac2 -i +ASM2 -o $ORACLE_HOME
# srvctl add asm -n rac3 -i +ASM3 -o $ORACLE_HOME
(10)启动ASM
crs_start ora.rac1.ASM1.asm
a)、启动报错:
PRKS-1009 : Failed to start ASM instance "+ASM1" on node "rac1", [PRKS-1009 : Failed to start ASM instance "+ASM1" on node "rac1", [CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.rac1.ASM1.asm' has placement error.]]
[PRKS-1009 : Failed to start ASM instance "+ASM1" on node "rac1", [CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.rac1.ASM1.asm' has placement error.]]
b)、查看asm资源状态为unknow:
[url=mailtoracle@rac1]oracle@rac1[/url]:~> crs_stat
NAME=ora.rac1.ASM1.asm
TYPE=application
TARGET=ONLINE
STATE=UNKNOWN on rac
c)、停止也报错:
[url=mailtoracle@rac1]oracle@rac1[/url]:~> crs_stop ora.rac1.ASM1.asm
Attempting to stop `ora.rac1.ASM1.asm` on member `rac1`
`ora.rac1.ASM1.asm` on member `rac1` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
CRS-0216: 无法停止资源 'ora.rac1.ASM1.asm'。
d)、检查ASM的日志tail -f $ORACLE_HOME/log/rac1/racg/ora.rac1.ASM1.asm.log,发现如下错误:
Enter user-name: ERROR:
ORA-01031: insufficient privileges
Enter user-name: SP2-0306: Invalid option.
2011-12-06 19:32:14.009: [ RACG][4286338816] [16318][4286338816][ora.rac1.ASM1.asm]: Usage: CONN[ECT] [logon] [AS {SYSDBA|SYSOPER}]
where <logon> ::= <username>[/<password>][@<connect_identifier>] | /
Enter user-name: Enter password:
ERROR:
ORA-01005: null password given; logon denied
但是用如下方法是可以打开ASM库的:
export ORACLE_SID=+ASM1
sqlplus / as sysdba
startup
e)、最后发现是权限问题,执行如下命令后问题解决:
rac1:/home/oracle # crs_setperm ora.rac1.ASM1.asm -o oracle
rac1:/home/oracle # crs_setperm ora.rac1.ASM1.asm -g oinstall
rac1:/home/oracle # crs_getperm ora.rac1.ASM1.asm
名称: ora.rac1.ASM1.asm
ownerracle:rwx,pgrpinstall:r-x,other::r--,
[url=mailtoracle@rac1]oracle@rac1[/url]:~> crs_stop ora.rac1.ASM1.asm
Attempting to stop `ora.rac1.ASM1.asm` on member `rac1`
Stop of `ora.rac1.ASM1.asm` on member `rac1` succeeded.
[url=mailtoracle@rac1]oracle@rac1[/url]:~> crs_start ora.rac1.ASM1.asm
Attempting to start `ora.rac1.ASM1.asm` on member `rac1`
Start of `ora.rac1.ASM1.asm` on member `rac1` succeeded.
查看状态:
rac1:/home/oracle # crs_stat -t
名称 类型 目标 状态 主机
------------------------------------------------------------
ora....SM1.asm application ONLINE ONLINE rac1
ora....C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora....SM2.asm application ONLINE ONLINE rac2
ora....C2.lsnr application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2
ora....SM3.asm application ONLINE ONLINE rac3
ora....C3.lsnr application ONLINE ONLINE rac3
ora.rac3.gsd application ONLINE ONLINE rac3
ora.rac3.ons application ONLINE ONLINE rac3
ora.rac3.vip application ONLINE ONLINE rac3
(11)注册数据库
srvctl add database -d zxin -o $ORACLE_HOME
(12)注册实例
# srvctl add instance -d zxin -i zxin1 -n rac1
# srvctl add instance -d zxin -i zxin2 -n rac2
# srvctl add instance -d zxin -i zxin3 -n rac3
(13)修改实例和ASM实例的依赖关系
# srvctl modify instance -d zxin -i zxin1 -s +ASM1
# srvctl modify instance -d zxin -i zxin2 -s +ASM2
# srvctl modify instance -d zxin -i zxin3 -s +ASM3
(14)启动数据库
# srvctl start database -d zxin
报错:
PRKP-1001 : Error starting instance zxin1 on node rac1
CRS-0215: ???????????? 'ora.zxin.zxin1.inst'??
PRKP-1001 : Error starting instance zxin2 on node rac2
CRS-0215: ???????????? 'ora.zxin.zxin2.inst'??
PRKP-1001 : Error starting instance zxin3 on node rac3
CRS-0215: ???????????? 'ora.zxin.zxin3.inst'??
与上面一样的权限问题,执行如下命令后解决:
# crs_setperm ora.zxin.zxin2.inst -o oracle
# crs_setperm ora.zxin.zxin2.inst -g oinstall
# crs_setperm ora.zxin.zxin3.inst -o oracle
# crs_setperm ora.zxin.zxin3.inst -g oinstall
# crs_setperm ora.zxin.db -o oracle
# crs_setperm ora.zxin.db -g oinstall
# srvctl stop database -d zxin
# srvctl start database -d zxin
检查状态全部正常,如下:
rac1:/home/oracle # crs_stat -t
名称 类型 目标 状态 主机
------------------------------------------------------------
ora....SM1.asm application ONLINE ONLINE rac1
ora....C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora....SM2.asm application ONLINE ONLINE rac2
ora....C2.lsnr application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2
ora....SM3.asm application ONLINE ONLINE rac3
ora....C3.lsnr application ONLINE ONLINE rac3
ora.rac3.gsd application ONLINE ONLINE rac3
ora.rac3.ons application ONLINE ONLINE rac3
ora.rac3.vip application ONLINE ONLINE rac3
ora.zxin.db application ONLINE ONLINE rac1
ora....n1.inst application ONLINE ONLINE rac1
ora....n2.inst application ONLINE ONLINE rac2
ora....n3.inst application ONLINE ONLINE rac3
至此OCR成功恢复,所有服务正常运行。可使用crs_stop –all以及crs_start –all命令测试CRS重新启动后各服务是否也能正常。
由于OCR磁盘损坏的故障还是非常常见的,如果不能恢复,只能重装CRS,将直接影响测试的进度以及现网环境的正常运行。通过以上的方法能够快速定位故障并进行恢复,操作也较简单。