Wednesday, February 1, 2017

Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." and SAS HBA logs report correctable errors on SW images prior to 12.1.2.3.2 (Doc ID 2176276.1)

In this Document
Symptoms
Cause
Solution
References




APPLIES TO:

Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
Exadata X6-8 Hardware - Version All Versions and later
Exadata X5-2 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Information in this document applies to any platform.

SYMPTOMS

Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." and SAS HBA logs
report correctable errors on SW images prior to 12.1.2.3.2.
Note: Exadata X5-2L/X6-2L Extreme Flash Storage Servers do not have a SAS HBA and are not affected.
Note: Exadata systems older than X5-2 use a different SAS HBA and are not affected.
Note: On Exadata X4-8 systems with X5-2L Storage Servers, this issue applies only to the Storage Servers;
the X4-8 DB nodes use a different SAS HBA and are not affected.

Issue: The server logs a reset event due to an HBA controller fatal error, with a correctable error repeated
thousands of times in the SAS HBA firmware terminal log (fwtermlog), such as:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -dsply -a0 | more
Firmware Term Log Information on controller 0:
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
... <~8100 entries is typical>..........
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01a
MonSetAllowChipReset: MonAllowResetChip 1
05/16/16 22:47:51: C0:In MonTask; Seconds from powerup = 0x00fd125b
05/16/16 22:47:51: C0:Max Temperature = 80 on Channel 4
Firmware crash dump feature enabled
Crash dump collection will start immediately
copied 75 MB in 71957 Microseconds
[0]: fp=c03ffe00, lr=c13243c8 - _MonTask+200
... <reset output>
IMPORTANT NOTE: Due to a separate image issue in 12.1.2.2.2 and earlier, the SAS HBA firmware logs may not
contain the correctable error entries shown above; most likely the logs were recycled by the power cycle and the
errors are no longer in the log. If this is the case, also complete the procedure in Note 2135119.1 to ensure
persistent logging is enabled, in addition to this solution. The additional steps are included below (steps 5 and 6).
The server that saw the event will raise an alert like the following (via cellcli on storage servers or dbmcli on DB nodes):

CellCLI> list alerthistory
...
2 2016-05-16T10:07:39-04:00 critical "Disk controller was hung. Cell was power cycled to stop the hang."

Workaround: None. The power cycle recovers the HBA and it is functional again, as designed by the Exadata server
monitoring service (MS). The underlying fault is a single correctable error, but the firmware gets into an infinite
loop and hangs after roughly 8,100 attempts to complete the error correction.

CAUSE

Unpublished HBA Firmware bug 21669752 causes the controller to hang while correcting an error.


SOLUTION

The solution is to update the SAS HBA firmware to version 24.3.0-0083 (or later).
If a controller hang or reset event has occurred with other messages logged, or the event occurs on a system
that already has SAS HBA firmware 24.3.0-0083, then open an SR and upload a sundiag and the diagpack for the
event for analysis and an action plan.
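For reference, the sundiag collection is typically gathered as the 'root' user with the standard Exadata support
tool; the path below is its usual location, so verify it on the affected server:
# /opt/oracle.SupportTools/sundiag.sh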

1. Check the image version of the server and firmware version of the SAS HBA:

# imageinfo -ver
12.1.2.1.1.150316.2
# /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0081
If the server is running SW Image 12.1.2.3.2 or later, or 12.1.2.2.3 or later, then the problem does not apply.
These images have the firmware fix in "FW Package Build: 24.3.0-0083".

2. Update the SW image to 12.1.2.2.3 or later, or 12.1.2.3.2 or later, which contains the firmware fix 24.3.0-0083
per the example above. For how to update the image, refer to MOS Note 888828.1.
If the server cannot be updated to a later image at this time, then the SAS HBA firmware alone may be updated to
address this issue, using the firmware package "MR_6.3.8.3_24.3.0-0083.rom" attached to this Note, as follows:
a) Download the firmware package "MR_6.3.8.3_24.3.0-0083.rom" attached to this Note, and copy it to the /tmp
directory on the server to be updated.
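For example, if the package was downloaded to an administrative workstation, it could be copied over with scp
(the hostname "cel01" below is only a placeholder for the server being updated):
$ scp MR_6.3.8.3_24.3.0-0083.rom root@cel01:/tmp/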

b) Prepare the server for maintenance as follows:
Exadata Storage Servers (based on X5-2L and X6-2L)

NOTE: If updating firmware on multiple storage servers in a rolling manner, do not reboot and apply the
firmware update to multiple storage servers at the same time - only do them one at a time and ensure all disks
are re-synchronized with ASM before proceeding to the next storage server.

i. ASM drops a disk that stays offline longer than the DISK_REPAIR_TIME attribute allows. The default
DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for this procedure, but it may have
been changed by the Customer. To check this attribute, have the Customer log into ASM and
run the following query:
SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg
where a.name = 'disk_repair_time' and a.group_number = dg.group_number;
As long as the value is large enough to comfortably perform the upgrade on a
storage cell, there is no need to change it.
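If the value does need to be raised temporarily, it is set per disk group; a minimal sketch (the disk group name
DATA and the value shown are only examples):
SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';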
ii. Check if ASM will be OK if the grid disks go OFFLINE.
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...snip...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
etc....
If one or more disks return asmdeactivationoutcome='No', wait for some time and repeat the query until all
disks return asmdeactivationoutcome='Yes' (a simple polling sketch follows the note below).
NOTE: Taking the storage server offline while one or more disks return a status of
asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk group,
causing the databases to shut down abruptly.
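One minimal way to poll for this, assuming the two-column name/outcome output shown above (the 60-second
interval is arbitrary); the loop prints the disks that are not yet 'Yes' and exits once none remain:
# while cellcli -e list griddisk attributes name,asmdeactivationoutcome | grep -v Yes; do sleep 60; done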
iii. Run the following cellcli command to inactivate all grid disks on the cell you wish to power down/reboot
(this could take 10 minutes or longer):
# cellcli -e alter griddisk all inactive
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
...etc...
iv. Execute the command below; once the disks are offline and inactive in ASM, the output should show
asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.
# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
...etc...
v. Disable Exadata Storage Server services with the following command as 'root' user:
# cellcli -e alter cell shutdown services all

Exadata DB Nodes (based on X5-2, X6-2 and X5-8)
Linux DB Nodes:
i) Shutdown and disable auto-start of CRS services:
# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl disable crs
# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs
where the GI_HOME environment variable is typically set to “/u01/app/11.2.0/grid” but will depend on the
customer's environment.
In the above output the “1” of “+ASM1” refers to the DB node number. For example, for DB node #3 the
value would be +ASM3.
ii)  Validate CRS is down cleanly. There should be no processes running.
# ps -ef | grep css
iii)  Disable Exadata DB Node management services with the following command as 'root' user:
# dbmcli -e alter dbserver shutdown services all

OVM DB Nodes:
i) See what user domains are running (record the result).
Connect to the management domain (domain zero, or dom0). The following example shows two user
domains plus the management domain Domain-0:
# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8
ii) Connect to each domain using the command:
# xm console domainname
where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.
iii) Shut down any instances of CRS on that domain:
# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs
where the GI_HOME environment variable is typically set to “/u01/app/11.2.0/grid” but will depend on
the customer's environment.
In the above output the “1” of “+ASM1” refers to the DB node number. For example, for DB node #3 the
value would be +ASM3.
iv) Validate CRS is down cleanly. There should be no processes running.
# ps -ef | grep css
v) Press CTRL+] to disconnect from the console.
vi) Repeat steps ii - v on each running domain.
vii) Shut down all user domains from dom0.
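One way to do this with the xm toolstack shown above (the domain names are from the earlier example output;
xm shutdown requests a clean shutdown of each domain, and xm list can be repeated until only Domain-0 remains):
# xm shutdown dm01db01vm01
# xm shutdown dm01db01vm02
# xm list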
viii) Repeat step i to see what user domains are running. Only Domain-0 should remain.
ix) Disable user domains from auto starting during dom0 boot after firmware has been updated.
# chkconfig xendomains off
 x) Disable Exadata DB Node management services with the following command as 'root' user:
# dbmcli -e alter dbserver shutdown services all

c) Upgrade the server's SAS HBA firmware with the following command as 'root' user:
# /opt/oracle.cellos/CheckHWnFWProfile -action updatefw  -mode diagnostic -component
DiskController -attribute DiskControllerFirmwareRevision -diagnostic_version 24.3.0-0083
-fwpath /tmp/MR_6.3.8.3_24.3.0-0083.rom
Upon completion of the firmware upgrade, the server will automatically reboot. There may be periods during the
update when the output to the screen stops; this is expected - please be patient. It takes about 10 minutes to reach
the reboot, and about 15 minutes to complete the entire process including rebooting the cell, excluding disk
re-synchronization or CRS start time. The time may be longer on X5-8 DB nodes. There may be two reboots during
the process.
d) Verify the server's SAS HBA firmware is updated:

# /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0083
 The firmware package with the correctable error bug fix is 24.3.0-0083.
3. Verify the server's disks after firmware update and bring back online its services as follows:
Exadata Storage Servers (based on X5-2L and X6-2L):
i. Verify the 12 disks are visible. The following command should show 12 disks:
# lsscsi | grep -i LSI
[0:2:0:0]    disk    LSI      MR9361-8i        4.23  /dev/sda
[0:2:1:0]    disk    LSI      MR9361-8i        4.23  /dev/sdb
[0:2:2:0]    disk    LSI      MR9361-8i        4.23  /dev/sdc
[0:2:3:0]    disk    LSI      MR9361-8i        4.23  /dev/sdd
[0:2:4:0]    disk    LSI      MR9361-8i        4.23  /dev/sde
[0:2:5:0]    disk    LSI      MR9361-8i        4.23  /dev/sdf
[0:2:6:0]    disk    LSI      MR9361-8i        4.23  /dev/sdg
[0:2:7:0]    disk    LSI      MR9361-8i        4.23  /dev/sdh
[0:2:8:0]    disk    LSI      MR9361-8i        4.23  /dev/sdi
[0:2:9:0]    disk    LSI      MR9361-8i        4.23  /dev/sdj
[0:2:10:0]   disk    LSI      MR9361-8i        4.23  /dev/sdk
[0:2:11:0]   disk    LSI      MR9361-8i        4.23  /dev/sdl
ii. Activate the grid disks.
# cellcli -e alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
...etc...
iii. Verify all grid disks show 'active':
# cellcli -e list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
...etc...
iv. Verify all grid disks have been successfully put online using the following command. Wait until
asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in the
activation process.
# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
DATA_CD_01_dmorlx8cel01 active ONLINE Yes
DATA_CD_02_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
...etc...
Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' state. Oracle ASM
synchronization is complete only when ALL grid disks show asmmodestatus='ONLINE'. This process can take
some time, depending on how busy the machine is and how much activity occurred while this individual server
was down for repair.
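One minimal way to watch for this, assuming the name/asmmodestatus attribute output shown earlier (the
60-second interval is arbitrary); the loop prints the grid disks that are not yet ONLINE and exits once all are:
# while cellcli -e list griddisk attributes name,asmmodestatus | grep -v ONLINE; do sleep 60; done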
Exadata DB Nodes (based on X5-2, X6-2 and X5-8)
Linux DB Nodes:
i) Verify all the disks are visible to the system and in 'normal' status.
# dbmcli -e "list physicaldisk"
252:0 F1HHYP normal
252:1 F1K76P normal
252:2 F1GZ1P normal
252:3 F1K7GP normal
252:4 F1LHUP normal
252:5 F1A2JP normal
252:6 F1LH6P normal
252:7 F1LDSP normal
 There should be 4 or 8 disks depending on the DB node model.
ii) Start up CRS and re-enable autostart of CRS. After the OS is up, the Customer DBA should validate
that CRS is running. As root execute:
# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl start crs
# $ORACLE_HOME/bin/crsctl check crs
Now re-enable autostart
# $ORACLE_HOME/bin/crsctl enable crs
or
# <GI_HOME>/bin/crsctl check crs
# <GI_HOME>/bin/crsctl enable crs
where the GI_HOME environment variable is typically set to “/u01/app/11.2.0/grid” but will depend on the
customer's environment.
In the above output the “1” of “+ASM1” refers to the DB node number. For example, for DB node #3
the value would be +ASM3.
Example output when all is online is: 
# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
iii) Validate that instances are running:
# ps -ef |grep pmon
It should return a record for the ASM instance and a record for each database.
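For illustration only, the pmon process names follow the standard Oracle naming patterns; the database SID
shown here is hypothetical:
asm_pmon_+ASM1
ora_pmon_orcl1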

OVM DB Nodes:
i) Verify all the disks are visible to the system and in 'normal' status.
# dbmcli -e "list physicaldisk"
252:0 F1HHYP normal
252:1 F1K76P normal
252:2 F1GZ1P normal
252:3 F1K7GP normal
252:4 F1LHUP normal
252:5 F1A2JP normal
252:6 F1LH6P normal
252:7 F1LDSP normal
There should be 4 or 8 disks depending on the DB node model. 
ii) Re-enable user domains to autostart during Domain-0 boot:
# chkconfig xendomains on
iii) Startup all user domains that are marked for auto start:
# service xendomains start
iv)  See what user domains are running (compare against result from previously collected data):
# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8
v) If any domain did not auto-start, then start that single user domain:
# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg
vi)  Check that CRS has started in user domains:
a) Connect to each domain using the command:
# xm console domainname
where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.
b) Any instances of CRS on that domain should have automatically started:
# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
c) Validate that instances are running:
# ps -ef |grep pmon
It should return a record for the ASM instance and a record for each database.
d) Press CTRL+] to disconnect from the console.
vii) Repeat step (vi) on each running domain.

4. Repeat the above steps to update the firmware on each storage server and DB node, as needed.
NOTE: If updating firmware on multiple storage servers in a rolling manner, do not reboot and apply the
firmware update to multiple storage servers at the same time - only do them one at a time, and ensure all disks
are completely re-synchronized with ASM before proceeding to the next storage server.

If the image is version 12.1.2.3.0 or later, then the procedure is complete.  The following additional steps are 
required for images 12.1.2.2.2 or below (as taken from Note 2135119.1):
5. To verify the current battery status for the fwtermlog setting, log in as 'root' on each server and execute:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuget -a0
  Battery is OFF for TTY history on Adapter 0
  Exit Code: 0x00
We should see that the battery mode is off for the fwtermlog.

6. Turn on use of the battery for maintaining the fwtermlog across server reboots and power cycles:
# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuon -a0
Battery is set to ON for TTY history on Adapter 0
Running the above command on the server will not have any impact on running services.
This change is persistent across server reboots or power cycles and is only unset by command.
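If this setting ever needs to be reverted, the matching MegaCli off switch would be used; a sketch (this command
is an assumption based on the bbuon/bbuget options above, not taken from this Note):
# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuoff -a0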



REFERENCES

NOTE:2135119.1 - SAS HBA does not maintain logs over a reboot on Exadata X5-2L High Capacity Storage Servers on SW Image versions below 12.1.2.3.0
NOTE:888828.1 - Exadata Database Machine and Exadata Storage Server Supported Versions