Wednesday, February 1, 2017

Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." and SAS HBA logs report correctable errors on SW images prior to 12.1.2.3.2 (Doc ID 2176276.1)

In this Document
Symptoms
Cause
Solution
References




APPLIES TO:

Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
Exadata X6-8 Hardware - Version All Versions and later
Exadata X5-2 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Information in this document applies to any platform.

SYMPTOMS

Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." and SAS HBA logs 
report correctable errors on SW images prior to 12.1.2.3.2
Note: Exadata X5-2L/X6-2L Extreme Flash Storage Servers do not have a SAS HBA and are not affected.
Note: Exadata systems older than X5-2 use a different SAS HBA and are not affected.
Note: For Exadata X4-8 systems with X5-2L Storage Servers, this issue applies only to the Storage Servers; the X4-8 DB nodes 
use a different SAS HBA and are not affected.

Issue: The server logs a reset event due to an HBA controller fatal error, with a correctable error repeated 
thousands of times in the SAS HBA firmware terminal log (fwtermlog), for example:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -dsply -a0 | more
Firmware Term Log Information on controller 0:
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
... (~8100 entries are typical) ...
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01a
MonSetAllowChipReset: MonAllowResetChip 1
05/16/16 22:47:51: C0:In MonTask; Seconds from powerup = 0x00fd125b
05/16/16 22:47:51: C0:Max Temperature = 80 on Channel 4
Firmware crash dump feature enabled
Crash dump collection will start immediately
copied 75 MB in 71957 Microseconds
[0]: fp=c03ffe00, lr=c13243c8 - _MonTask+200
... <reset output>
IMPORTANT NOTE: Due to another image issue in 12.1.2.2.2 and earlier, the SAS HBA firmware logs may not 
contain the correctable-error entries shown above. Most likely the logs were recycled on the power cycle and 
the errors are no longer in the log. If this is the case, also complete the procedure in Note 2135119.1 to 
ensure persistent logging is enabled, in addition to this solution. The additional steps are included below.
The server will raise an alert like the following (via cellcli or dbmcli, depending on the server type that saw the event):

CellCLI> list alerthistory
...
2 2016-05-16T10:07:39-04:00 critical "Disk controller was hung. Cell
was power cycled to stop the hang."

Workaround: None. The power cycle, performed as designed by the Exadata server monitoring service (MS), recovers 
the HBA and it is functional again. The underlying fault is a single correctable error, but the firmware enters 
an infinite loop and hangs after roughly 8100 attempts to complete the error correction.
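
The following is a hedged sketch, not part of the Note, for confirming this signature by counting the correctable-error entries currently held in the HBA firmware terminal log; it assumes the MegaCli64 path shown elsewhere in this document.

# Run as 'root'. Count the "Correctable err" entries in the SAS HBA fwtermlog;
# a count in the thousands (roughly 8100) matches the signature described above.
/opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -dsply -a0 2>/dev/null \
  | grep -c "Correctable err, continuing"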

CAUSE

Unpublished HBA Firmware bug 21669752 causes the controller to hang while correcting an error.


SOLUTION

The solution is to update the SAS HBA firmware to version 24.3.0-0083 (or later).
If a controller hang or reset event has occurred with other messages logged, or the event occurs on a system 
that already has SAS HBA firmware 24.3.0-0083, then an SR should be opened and a sundiag output and the diagpack 
for the event uploaded for analysis and an action plan.

1. Check the image version of the server and firmware version of the SAS HBA:

# imageinfo -ver
12.1.2.1.1.150316.2
# /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0081
If the server is running SW image 12.1.2.3.2 or later (or 12.1.2.2.3 or later on the 12.1.2.2 branch), then the 
problem does not apply; these images include the firmware fix ("FW Package Build: 24.3.0-0083").
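
As a convenience, this check can be scripted on each server. The following is a hedged sketch, not part of the Note; it assumes the MegaCli64 path used above and a GNU sort for the version comparison.

#!/bin/bash
# Report whether this server's SAS HBA already carries the fixed firmware package.
FIXED="24.3.0-0083"
CURRENT=$(/opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | awk -F': ' '/FW Package Build/ {print $2}')
echo "Image version : $(imageinfo -ver)"
echo "HBA FW package: ${CURRENT:-unknown}"
# The fix is present in 24.3.0-0083 and later packages (version-order comparison).
if [ -n "${CURRENT}" ] && printf '%s\n%s\n' "${FIXED}" "${CURRENT}" | sort -V -C; then
    echo "Firmware already contains the fix; this issue does not apply."
else
    echo "Firmware predates ${FIXED}; plan the update described below."
fi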

2. Update the SW image to 12.1.2.2.3, or to 12.1.2.3.2 or later, which contains the firmware fix 24.3.0-0083 per the 
example above. For how to update the image, refer to MOS Note 888828.1.
If the server cannot be updated to a later image at this time, then the SAS HBA firmware alone may be updated to 
address this issue, using the firmware package "MR_6.3.8.3_24.3.0-0083.rom" attached to this Note, as follows:
a) Download the firmware package "MR_6.3.8.3_24.3.0-0083.rom" attached to this Note, and copy it to the /tmp 
directory on the server to be updated.

b) Prepare the server for maintenance as follows:
Exadata Storage Servers (based on X5-2L and X6-2L)

NOTE: If updating firmware on multiple storage servers in a rolling manner, do not reboot and apply the
firmware update to multiple storage servers at the same time - only do them one at a time and ensure all disks
are re-synchronized with ASM before proceeding to the next storage server.

i. ASM drops a disk shortly after it is taken offline. The default DISK_REPAIR_TIME
attribute value of 3.6 hours should be adequate for replacing components, but may have been
changed by the Customer. To check this parameter, have the Customer log into ASM and
run the following query:
SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg
where a.name = 'disk_repair_time' and a.group_number = dg.group_number;
As long as the value is large enough to comfortably perform the upgrade in a
storage cell, there is no need to change it.
ii. Check if ASM will be OK if the grid disks go OFFLINE.
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...snip...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
etc....
If one or more disks return asmdeactivationoutcome='No', wait for some time and repeat the query
until all disks return asmdeactivationoutcome='Yes' (a polling sketch for this check appears after step v below).
NOTE: Taking the storage server offline while one or more disks return a status of
asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk group,
causing the databases to shut down abruptly.
iii. Run the following cellcli command to inactivate all grid disks on the cell you wish to power down/reboot
(this could take 10 minutes or longer):
# cellcli -e alter griddisk all inactive
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
...etc...
iv. Execute the command below; once the disks are offline and inactive in ASM, the output should show
asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.
# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
...etc...
v. Disable Exadata Storage Server services with the following command as the 'root' user:
# cellcli -e alter cell shutdown services all
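
The polling check referenced in step ii can be scripted. A hedged sketch follows, not part of the Note; it assumes the cellcli attributes shown above and that all grid disks on the cell are managed by ASM.

# Run as 'root' on the storage server being prepared.
# Poll until no grid disk reports anything other than asmdeactivationoutcome='Yes'.
while cellcli -e "list griddisk attributes name,asmmodestatus,asmdeactivationoutcome" \
      | awk '{print $NF}' | grep -qv '^Yes$'; do
    echo "$(date): some grid disks are not yet safe to take offline; waiting ..."
    sleep 60
done
echo "All grid disks report asmdeactivationoutcome='Yes'; proceed with step iii."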

Exadata DB Nodes (based on X5-2, X6-2 and X5-8)
Linux DB Nodes:
i) Disable CRS auto-start and shut down CRS services:
# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl disable crs
# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs
where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but depends on the
customer's environment.
In the above output, the "1" in "+ASM1" refers to the DB node number; for DB node 3, for example, the
value would be +ASM3.
ii)  Validate CRS is down cleanly. There should be no processes running.
# ps -ef | grep css
iii)  Disable Exadata DB Node management services with the following command as 'root' user:
# dbmcli -e alter dbserver shutdown services all
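
A hedged sketch, not part of the Note, that wraps steps i-iii for a Linux DB node; the GI_HOME value is an assumption and must be adjusted to the customer's Grid Infrastructure home.

#!/bin/bash
# Run as 'root' on the DB node being prepared. GI_HOME is an assumed path; adjust as needed.
GI_HOME=/u01/app/11.2.0/grid

${GI_HOME}/bin/crsctl disable crs    # keep CRS from auto-starting on the firmware reboot
${GI_HOME}/bin/crsctl stop crs       # stop the clusterware stack

# Confirm CRS is down cleanly: no ocssd/crsd/ohasd processes should remain.
if ps -ef | grep -E 'ocssd|crsd\.bin|ohasd\.bin' | grep -v grep; then
    echo "Clusterware processes are still running; do not proceed yet." >&2
    exit 1
fi

dbmcli -e alter dbserver shutdown services all   # stop the Exadata management services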

OVM DB Nodes:
i) See what user domains are running (record the result for later comparison).
Connect to the management domain (domain zero, or dom0). The following is an example with just two
user domains plus the management domain Domain-0:
# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8
ii) Connect to each domain using the command:
# xm console domainname
where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.
iii) Shut down any instances of CRS on that domain:
# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs
where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but depends on
the customer's environment.
In the above output, the "1" in "+ASM1" refers to the DB node number; for DB node 3, for example, the
value would be +ASM3.
iv) Validate CRS is down cleanly. There should be no processes running.
# ps -ef | grep css
v) Press CTRL+] to disconnect from the console.
vi) Repeat steps ii - v on each running domain.
vii) Shut down all user domains from dom0 (a hedged sketch follows this list).
viii) Repeat step i to see what user domains are running; only Domain-0 should remain.
ix) Disable user domains from auto-starting during dom0 boot while the firmware is updated (auto-start is re-enabled later):
# chkconfig xendomains off
 x) Disable Exadata DB Node management services with the following command as 'root' user:
# dbmcli -e alter dbserver shutdown services all
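
The Note does not list a command for step vii. The following is a hedged sketch that shuts down each running user domain from dom0 with 'xm shutdown' and waits until only Domain-0 remains.

# Run as 'root' in dom0. Shut down every user domain, then wait for them to stop.
for dom in $(xm list | awk 'NR>1 && $1!="Domain-0" {print $1}'); do
    echo "Shutting down user domain ${dom}"
    xm shutdown "${dom}"
done

# Wait until 'xm list' shows only the header line and Domain-0.
while [ "$(xm list | wc -l)" -gt 2 ]; do
    echo "$(date): waiting for user domains to shut down ..."
    sleep 30
done
echo "Only Domain-0 is running."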

c) Upgrade the server's SAS HBA firmware with the following command as 'root' user:
# /opt/oracle.cellos/CheckHWnFWProfile -action updatefw -mode diagnostic \
    -component DiskController -attribute DiskControllerFirmwareRevision \
    -diagnostic_version 24.3.0-0083 -fwpath /tmp/MR_6.3.8.3_24.3.0-0083.rom
Upon completion of the firmware upgrade, the server will automatically reboot. There may be periods of time 
during the update where the output to the screen stops, which is expected - please be patient. This takes ~10 
minutes to get to the reboot, and ~15 minutes to complete the entire process including rebooting the cell, 
excluding disk re-synchronization or CRS start time. The time may be longer on X5-8 DB nodes.  There may be 
2 reboots during the process.
d) Verify the server's SAS HBA firmware is updated:

# /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0083
 The firmware package with the correctable error bug fix is 24.3.0-0083.
3. Verify the server's disks after firmware update and bring back online its services as follows:
Exadata Storage Servers (based on X5-2L and X6-2L):
i. Verify the 12 disks are visible. The following command should show 12 disks:
# lsscsi | grep -i LSI
[0:2:0:0]    disk    LSI      MR9361-8i        4.23  /dev/sda
[0:2:1:0]    disk    LSI      MR9361-8i        4.23  /dev/sdb
[0:2:2:0]    disk    LSI      MR9361-8i        4.23  /dev/sdc
[0:2:3:0]    disk    LSI      MR9361-8i        4.23  /dev/sdd
[0:2:4:0]    disk    LSI      MR9361-8i        4.23  /dev/sde
[0:2:5:0]    disk    LSI      MR9361-8i        4.23  /dev/sdf
[0:2:6:0]    disk    LSI      MR9361-8i        4.23  /dev/sdg
[0:2:7:0]    disk    LSI      MR9361-8i        4.23  /dev/sdh
[0:2:8:0]    disk    LSI      MR9361-8i        4.23  /dev/sdi
[0:2:9:0]    disk    LSI      MR9361-8i        4.23  /dev/sdj
[0:2:10:0]   disk    LSI      MR9361-8i        4.23  /dev/sdk
[0:2:11:0]   disk    LSI      MR9361-8i        4.23  /dev/sdl
ii. Activate the grid disks.
# cellcli -e alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
...etc...
iii. Verify all grid disks show 'active':
# cellcli -e list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
...etc...
iv. Verify all grid disks have been successfully put online using the following command. Wait until
asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in the
activation process.
# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
DATA_CD_01_dmorlx8cel01 active ONLINE Yes
DATA_CD_02_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
...etc...
Notice in the above example that RECO_CD_00_dmorlx8cel01 is still 'SYNCING'. Oracle ASM synchronization is
complete only when ALL grid disks show asmmodestatus=ONLINE. This can take some time, depending on how busy
the machine is and has been while this individual server was down for repair (a polling sketch follows).
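
A hedged sketch of that wait, not part of the Note; it assumes all grid disks on the cell are managed by ASM.

# Run as 'root' on the storage server that was just updated.
# Wait until every grid disk reports asmmodestatus=ONLINE (no SYNCING or OFFLINE left).
while cellcli -e "list griddisk attributes name,asmmodestatus" \
      | awk '{print $NF}' | grep -qv '^ONLINE$'; do
    echo "$(date): ASM resynchronization still in progress; waiting ..."
    sleep 60
done
echo "All grid disks are ONLINE in ASM; safe to proceed to the next server."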
Exadata DB Nodes (based on X5-2, X6-2 and X5-8)
Linux DB Nodes:
i) Verify all the disks are visible to the system and in 'normal' status.
# dbmcli -e "list physicaldisk"
252:0 F1HHYP normal
252:1 F1K76P normal
252:2 F1GZ1P normal
252:3 F1K7GP normal
252:4 F1LHUP normal
252:5 F1A2JP normal
252:6 F1LH6P normal
252:7 F1LDSP normal
 There should be 4 or 8 disks depending on the DB node model.
ii) Start up CRS and re-enable its auto-start. After the OS is up, the Customer DBA should validate
that CRS is running. As 'root', execute:
# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl start crs
# $ORACLE_HOME/bin/crsctl check crs
Now re-enable auto-start:
# $ORACLE_HOME/bin/crsctl enable crs
or
# <GI_HOME>/bin/crsctl check crs
# <GI_HOME>/bin/crsctl enable crs
where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but depends on the
customer's environment.
In the above output, the "1" in "+ASM1" refers to the DB node number; for DB node 3, for example,
the value would be +ASM3.
Example output when all is online is: 
# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
iii) Validate that instances are running:
# ps -ef |grep pmon
It should return a record for the ASM instance and a record for each database.

OVM DB Nodes:
i) Verify all the disks are visible to the system and in 'normal' status.
# dbmcli -e "list physicaldisk"
252:0 F1HHYP normal
252:1 F1K76P normal
252:2 F1GZ1P normal
252:3 F1K7GP normal
252:4 F1LHUP normal
252:5 F1A2JP normal
252:6 F1LH6P normal
252:7 F1LDSP normal
There should be 4 or 8 disks depending on the DB node model. 
ii) Re-enable user domains to autostart during Domain-0 boot:
# chkconfig xendomains on
iii) Startup all user domains that are marked for auto start:
# service xendomains start
iv) See what user domains are running and compare against the result recorded before maintenance (a comparison sketch follows this list):
# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8
v) If any did not auto-start, start a single user domain with:
# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg
vi)  Check that CRS has started in user domains:
a) Connect to each domain using the command:
# xm console domainname
where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.
b) Verify that any instances of CRS on that domain have started automatically:
# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
c) Validate that instances are running:
# ps -ef |grep pmon
It should return a record for the ASM instance and a record for each database.
d) Press CTRL+] to disconnect from the console.
vii) Repeat step (vi) on each running domain.
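
The comparison referenced in step iv can be scripted. A hedged sketch follows, not part of the Note; it assumes the 'xm list' output recorded before maintenance was saved to /root/xm_list.before (a hypothetical path).

# Run as 'root' in dom0. Compare running user domains now against the pre-maintenance list.
xm list | awk 'NR>1 && $1!="Domain-0" {print $1}' | sort > /root/xm_list.after
awk 'NR>1 && $1!="Domain-0" {print $1}' /root/xm_list.before | sort \
  | diff - /root/xm_list.after \
  && echo "All previously running user domains are back up." \
  || echo "Domain lists differ; start any missing domain with 'xm create' as in step v."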

4. Repeat the above steps to update the firmware on each storage server and DB node, as needed.
NOTE: If updating firmware on multiple storage servers in a rolling manner, do not reboot and apply the
firmware update to multiple storage servers at the same time; do them one at a time and ensure all disks
are completely re-synchronized with ASM before proceeding to the next storage server.

If the image is version 12.1.2.3.0 or later, then the procedure is complete.  The following additional steps are 
required for images 12.1.2.2.2 or below (as taken from Note 2135119.1):
5. To verify the current battery setting for the fwtermlog, log in as 'root' on each server and execute:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuget -a0
  Battery is OFF for TTY history on Adapter 0
  Exit Code: 0x00
This shows that battery-backed retention (battery mode) is off for the fwtermlog and needs to be enabled in the next step.

6. Turn on use of the battery for maintaining the fwtermlog across server reboots and power cycles:
# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuon -a0
Battery is set to ON for TTY history on Adapter 0
Running the above command has no impact on running services.
The change persists across server reboots and power cycles and is only unset by an explicit command.
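
On a full rack, steps 5 and 6 can be applied to every server at once with dcli. A hedged sketch follows; the /root/cell_group and /root/dbs_group files listing the storage servers and DB nodes are conventional on Exadata but are assumptions here, as is root ssh equivalence to those servers.

# Run from a DB node with ssh equivalence to the servers listed in the group files.
# Check the current fwtermlog battery setting on all storage servers, then enable it everywhere.
dcli -g /root/cell_group -l root \
    "/opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuget -a0 | grep -i battery"
dcli -g /root/cell_group -l root \
    "/opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuon -a0 | grep -i battery"
# Repeat the two commands with /root/dbs_group to cover the DB nodes as well.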



REFERENCES

NOTE:2135119.1 - SAS HBA does not maintain logs over a reboot on Exadata X5-2L High Capacity Storage Servers on SW Image versions below 12.1.2.3.0
NOTE:888828.1 - Exadata Database Machine and Exadata Storage Server Supported Versions