It's just a little tip from the field: A customer suddenly had really bad performance on one of his servers. He found a lot of error messages like this in the output of dmesg:
Apr 21 05:22:11 server1 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Apr 21 05:22:11 server1 /scsi_vhci/ssd@g6001438005de946defa2000000020010 (ssd38): Command failed to complete (3) on path fp9/ssd@w50001fe15023ef59,a
The customer concluded: "I have problems with my SAN, let's look at the switch for errors." But none were to be seen. So he thought "It's not the SAN". At this moment the customer called me.
Important as in "tattoo it on your arm when you can't remember it otherwise": Check both sides! Checking the error counters just on the switch (or just on the server) is a necessary, but not a sufficient condition for "It's not the SAN".
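The server-side half of that check is what the rest of this entry walks through. The switch-side half depends on the vendor of your switch; just as a sketch, on a Brocade switch you would look at the per-port error counters with something like this (the port number is made up for the example):

porterrshow
portstatsshow 9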
At first I checked the error counters for the disks. You could use iostat -e for this task, however I think kstat -p is easier to parse and it gives you the same kind of information.
# kstat -p | grep "ssd28,err"
[…]
ssderr:28:ssd28,err:Device Not Ready 0
ssderr:28:ssd28,err:Hard Errors 400
ssderr:28:ssd28,err:Illegal Request 1
ssderr:28:ssd28,err:Media Error 0
ssderr:28:ssd28,err:No Device 0
ssderr:28:ssd28,err:Predictive Failure Analysis 0
ssderr:28:ssd28,err:Product (some storage) Revision
ssderr:28:ssd28,err:Recoverable 0
ssderr:28:ssd28,err:Revision 1100
ssderr:28:ssd28,err:Serial No
ssderr:28:ssd28,err:Size 16106127360
ssderr:28:ssd28,err:Soft Errors 0
ssderr:28:ssd28,err:Transport Errors 403
ssderr:28:ssd28,err:Vendor (some storage)
ssderr:28:ssd28,err:class device_error
ssderr:28:ssd28,err:crtime 19077969.9568033
ssderr:28:ssd28,err:snaptime 20623873.2833807
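A side note in case you wonder how to get from the device path in the dmesg message to the instance number kstat uses: the mapping between physical device paths and instance numbers lives in /etc/path_to_inst, so a grep for the GUID from the dmesg line shows you the instance (the output format here is from memory, treat it as a sketch):

# grep "g6001438005de946defa2000000020010" /etc/path_to_inst
"/scsi_vhci/ssd@g6001438005de946defa2000000020010" 38 "ssd"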
So I added up the errors for all disks in the kstat output:
# kstat -p | grep -i ",err" | grep "sd" | grep "Hard" | cut -f 2 | awk '{sum+=$1} END {print sum}'
3285
# kstat -p | grep -i ",err" | grep "sd" | grep "Transport" | cut -f 2 | awk '{sum+=$1} END {print sum}'
3405
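The sums tell you that something is wrong, but not where. To see whether the errors pile up on a few LUNs or on all of them, a small variation of the same pipeline (just a sketch built on the statistic names visible above) prints the counters per disk, worst ones last:

# kstat -p | egrep ",err:(Hard|Transport) Errors" | sort -n -k3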
Dang … doesn't look good … let's have a closer look at the HBA ports:
# fcinfo hba-port -l
HBA Port WWN: (a wwn)
OS Device Name: /dev/cfg/c10
Manufacturer: Emulex
Model: LPem12002E-S
Firmware Version: 2.00a4 (U3D2.00A4)
FCode/BIOS Version: Boot:5.03a4 Fcode:3.10a3
Serial Number: ABCDEFG-HIJKLMNOPQ
Driver Name: emlxs
Driver Version: 2.60k (2011.03.24.16.45)
Type: N-port
State: online
Supported Speeds: 2Gb 4Gb 8Gb
Current Speed: 4Gb
Node WWN: (a wwn)
Link Error Statistics:
Link Failure Count: 0
Loss of Sync Count: 145
Loss of Signal Count: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 2500000
Invalid CRC Count: 2100
The interesting part is the last two counters, Invalid Tx Word Count and Invalid CRC Count. When you see a massively increased counter here, your HBA has just received rubbish from the storage. I have never seen a reason for this other than a problem between the optics on the HBA and the optics on the switch, ranging from not properly seated transceivers to blatant cases of ignoring the minimum bend radius of the fibre optic cables.
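Keep in mind that these counters are cumulative, so a big number alone just means the link produced garbage at some point. To check whether it is still doing so right now, you can take two snapshots a minute apart and compare them, roughly like this:

# fcinfo hba-port -l | grep "Invalid" > /tmp/before
# sleep 60
# fcinfo hba-port -l | grep "Invalid" > /tmp/after
# diff /tmp/before /tmp/after

If diff reports changed lines, the link is corrupting frames while you watch.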
Suggestions to the customer, based on the rule "check the cheapest solution first":
reseat cables
reseat transceivers
use a new cable
use new transceivers
In this case a new cable did the trick: the problem disappeared.