
Troubleshooting XenServer deployments
Tomasz Czajka, Sr. Support Engineer
8 October 2010
Agenda

• Case Study: “Production down”
• Learn: “XenServer crash”
• Case study: “Singlepathing”
• Q&A
“Production down”
VMs don't start - why?
Basic troubleshooting in XenCenter

• Cannot start a VM → "The SR is not available" error
• Storage Repository (SR) in "broken" state
• "Repair" in XenCenter does not work → use the CLI to troubleshoot
Broken storage
What is "broken"?

• PBD = Physical Block Device; each host (XenServer_1, XenServer_2) plugs its own PBD into the SR
• The SR has a UUID (unique ID)
• The PBD records the SCSI ID of the underlying device
• Volume Group name: <prefix> + SR UUID

# xe pbd-list currently-attached=false
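The naming rule above can be sketched in shell: the VG name is a fixed prefix plus the SR UUID, and (as later slides show) each LV name is the "VHD-" prefix plus a VDI UUID. The sample UUIDs are taken from the slides that follow.

```shell
# UUIDs as reported by "xe sr-list" / "xe vdi-list"
# (sample values from the slides in this deck)
SR_UUID=19856cba-830c-e298-79fa-84a79eb658f4
VDI_UUID=352d31ec-aeb6-4601-8ea9-990575dab395

# Expected LVM object names: fixed prefix + xapi UUID
VG_NAME="VG_XenStorage-${SR_UUID}"
LV_NAME="VHD-${VDI_UUID}"

echo "$VG_NAME"
echo "$LV_NAME"
```

Knowing this mapping lets you go back and forth between xapi objects and LVM objects when troubleshooting.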
Storage troubleshooting
Goal: Reproduce and analyse the logs

/var/log/xensource.log* ; SMlog* ; messages*

# tail -f /var/log/messages > /tmp/ShortLog
# date
# echo "Unplugging cable" >> messages

Note: messages is stamped in UTC, while xensource.log uses local time.
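A small helper for the timezone note above (assuming GNU date, as shipped in dom0): print both clocks when you start a reproduction, so the UTC-stamped messages and the locally-stamped xensource.log can be lined up afterwards.

```shell
# Print both clocks at the start of a repro so entries in
# messages (UTC) and xensource.log (local time) can be correlated
date -u '+%b %e %H:%M:%S'   # UTC, matching timestamps in messages
date    '+%b %e %H:%M:%S'   # local time, matching xensource.log
```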
PBD unplugged
Plugging PBD manually
# grep "PBD.plug" xensource.log
# xe pbd-list host-uuid=... sr-uuid=...
# xe pbd-plug uuid=...

SR_BACKEND_FAILURE_47: The SR is not available
  no such volume group:
  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4
Volume Group
What is a VG? (Logical Volume Manager, LVM)

[Diagram: each HDD/LUN becomes a Physical Volume (PV); the PVs are grouped into a Volume Group (VG); the VG is carved into Logical Volumes (LV). In XenServer terms the VG is the Storage Repository (SR) and each LV backs one VDI: here, 3 VMs with 1 virtual disk each.]
Volume Group
Matching the UUID

# vgs
VG                                                 #PV #LV #SN Attr   VSize   VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9   1  18   0 wz--n-  89.99G  19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853   1   2   0 wz--n- 129.07G 129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88   1  11   0 wz--n-  49.99G   2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3   1   1   0 wz--n-   1.99G   1.98G

# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'
  Volume group "VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4" not found
Examining HDD/LUN
Checking SCSI ID


• Check the SCSI ID (unique for each SCSI device) stored in the PBD

# xe pbd-list params=device-config sr-uuid=...

device-config: SCSIid: 360a9800050334f49633459
Examining HDD/LUN
Can the Linux kernel see this block device (SCSI device)?

# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...


 Timing buffered disk reads: 138 MB in 3.02 seconds = 45.68 MB/sec

                                        (LUN readable!)
Addressing SCSI disks
# ls -lR /dev/disk | grep 360a9800050334f4963345767656c546

• /dev/disk/by-id
  • scsi-360a9800050334f4963345767656c546a -> /dev/sde

• /dev/disk/by-scsibus
  • 360a9800050334f4963345767656c546a-1:0:0:5 -> /dev/sdc
  • 360a9800050334f4963345767656c546a-2:0:0:5 -> /dev/sde

• /dev/mapper/360a9800050334f4963345767656c546

Also check /dev/disk/by-path
Examining HDD/LUN
Is the LUN empty?

# udevinfo -q all -n /dev/disk/by-id/scsi-360a9800050334f496334576765...

...
ID_FS_TYPE=LVM2_member
...

"If this is an LVM member, why is there no VG on it?"
Examining HDD/LUN
Is there a VG created on the PV?

# pvs
# pvs | grep 360a9800050334f496334595a32306431

PV                                            VG                                             Fmt  Attr  PSize   PFree
/dev/mapper/360a9800050334f4963               VG_XenStorage-090d4717-9f91-92de-83c3-         lvm2 a-    89.99G  19.48G
/dev/mapper/360a9800050334f4963               VG_XenStorage-70a029cf-7f35-c035-4af7-         lvm2 a-    49.99G   2.84G
/dev/mapper/360a9800050334f4963               VG_XenStorage-19856cba-830c-e298-79fa-         lvm2 a-    14.99G   6.45G
/dev/mapper/360a9800050334f4965               VG_XenStorage-9be18df5-3fd2-4835-b864-         lvm2 a-     1.99G   1.98G
/dev/sda3                                     VG_XenStorage-5239de43-6a74-0365-f825-         lvm2 a-   129.07G 129.05G
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-332432-430d-3423-4332434-5485974 lvm2 a-    14.99G  14.99G

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

VG_XenStorage-<UUID> on the LUN differs from the SR UUID!
No original VG on the LUN
Potential reasons:

• (Re)installation of a host in the same pool
  • Unplugged FC / zoning
• (Re)installation of a host in another pool
  • Zoning
• Adding an SR with "xe sr-create" in the CLI


                                 ...BE VERY CAREFUL!
Volume Group
...has been recreated!

• Lost LVM metadata
• Lost 100 MB of the VDI data

Action steps:
• Don't shut down running VMs
• Online backup of running VMs (now)
• Block-level clone of the whole LUN (now)
• Assess professional data recovery

Make a copy first:
# cp /etc/lvm/backup/* /root/backup/
Volume Group
Looking for the LVM metadata backup

/etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

• Check the backup timestamp (within the file)
• Compare the LVs in the backup file with the VDIs in the xapi database:

# cat /etc/lvm/backup/VG... | grep VHD
# xe vdi-list sr-uuid=<uuid> params=uuid

Each LV in the backup should correspond to one VDI.
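The comparison can be scripted. A sketch using a hypothetical two-line sample in place of the real backup file; on a live host you would grep /etc/lvm/backup/VG_XenStorage-<SR UUID> and compare the result against the output of "xe vdi-list".

```shell
# Hypothetical sample standing in for the real LVM metadata backup
cat <<'EOF' > /tmp/vg_backup_sample
VHD-352d31ec-aeb6-4601-8ea9-990575dab395 {
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98 {
EOF

# Strip the "VHD-" prefix to recover the VDI UUIDs, one per line,
# ready to be compared with "xe vdi-list ... params=uuid"
grep -o 'VHD-[0-9a-f-]*' /tmp/vg_backup_sample | sed 's/^VHD-//' | sort > /tmp/lvs_uuids
cat /tmp/lvs_uuids
```

Any UUID present in the backup but missing from xapi (or vice versa) points at the damage to repair.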
Volume Group
Removing new VG and PV

# vgremove "VG_XenStorage-<new SR uuid>"

# pvremove /dev/mapper/<SCSI ID>
Volume Group
Recreating PV and VG from backup

# pvcreate \
    --uuid <PV uuid from backup file> \
    --restorefile /etc/lvm/backup/VG_XenStorage-<SR UUID> \
    /dev/mapper/<SCSI ID>

# vgcfgrestore VG_XenStorage-<SR UUID> \
    -f /etc/lvm/backup/VG_XenStorage-<SR UUID>
Examining HDD/LUN
Confirm that the VG name contains the SR UUID...

# pvs | grep 360a9800050334f496334595a32306431

PV                                            VG                                                 Fmt  Attr  PSize  PFree
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 lvm2 a-   14.99G 14.99G

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

VG_XenStorage-<UUID> matches the SR UUID 
Volume Group
Checking Logical Volumes

# lvs

LV                                       VG                                                 Attr   LSize
MGT                                      VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.00M
VHD-352d31ec-aeb6-4601-8ea9-990575dab395 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-fbce18dd-397e-444e-9470-b6fa240243d9 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M
Storage Repository
Plugging PBD again...

# xe pbd-plug uuid=...        Success! But no VDIs shown...

# xe sr-scan uuid=...
Error code: SR_BACKEND_FAILURE_46
Error parameters: , The VDI is not available
 [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]

# xe vdi-list uuid=<UUID from the error>    (no matching VDI: the LV is left over and can be removed)
# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32
# xe sr-scan uuid=...

Success! All VDIs shown... Well done! 
What we’ve learned
...by troubleshooting “Production Down” issue

• What a PBD needs in order to plug...
• LUN/HDD → PV → VG (SR) → LV (VDI)
• VG name generated from the SR UUID (+ prefix);
  LV name generated from the VDI UUID (+ prefix)
• Displaying VGs (vgs), PVs (pvs), LVs (lvs)
• Addressing block devices (/dev/disk)
• Examining an HDD/LUN with "hdparm -t"
• Restoring the PV & VG from backup
“The XenServer Crash”
The XenServer Crash?
Unresponsive or rebooting host

• Kernel panic / crash dump
  • Error on the console, host locked
  • Memory addressing, bug in the OS, hardware failure

• No kernel panic and no crash dump
  • Host rebooting / frozen / no errors on the console
  • Hardware failure, OS busy (I/O), user action
Symptom: Host is unresponsive

• Serial console available:
  • Generate a crashdump (CTX120540) and reboot
  • Review the crashdump
  • Analyse /var/log/messages, xensource.log
• No serial console:
  • Connect a local console; boot the host to the console
  • Any errors on the console? Take photos and reboot
  • Analyse /var/log/messages, xensource.log

Symptom: Host rebooted itself

• /var/crash/<date> exists: review the crashdump
• No crashdump:
  • HA disabled:
    • Analyse /var/log/messages, xensource.log
    • Add the "noreboot" option in extlinux.conf
    • Still rebooting? Examine the hardware
  • HA enabled:
    • Disable HA
    • Host fenced? Check /var/log/xha.log
    • Analyse /var/log/messages, xensource.log for HA reasons

Still stuck? Contact Citrix Tech Support
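The "noreboot" edit can be sketched as a one-liner. The example below runs against a copy with an assumed, typical-looking append line; on a real host, back up /boot/extlinux.conf before editing, as the exact layout varies by XenServer version.

```shell
# Demonstrate the edit on a copy (the sample line is an assumption
# about a typical extlinux.conf; real files differ)
cat <<'EOF' > /tmp/extlinux.conf
append /boot/xen.gz dom0_mem=752M console=vga
EOF

# "noreboot" makes Xen halt at the console after a panic instead of
# rebooting, so the error can be read or photographed
sed -i '/xen\.gz/ s/$/ noreboot/' /tmp/extlinux.conf
grep noreboot /tmp/extlinux.conf
```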
Getting into details...
As easy as grep

Startup strings:
# cd /var/log
# grep "klogd" messages -B100
# grep "SERVER START" xensource.log -B100
Inside the crash log directory

/var/crash/<stamp>

• crash.log: hypervisor console ring
• Domain0.log: Domain0 console ring (HA activity, page faults, driver and storage issues)
• Domain1,2,3...log
• Debug.log
• xen-memory-dump: CPU stack, to be analysed by Citrix Tech Support
                            Citrix Confidential - Do Not Distribute
Investigating crash.log
Xen console ring

• Located at the bottom of the file:
  (XEN) Watchdog timer fired for domain 0
  (XEN) Domain 0 shutdown: watchdog rebooting machine.

• Why did the watchdog trigger?
    /var/log/xha.log (network or storage heartbeat failed)
• Why did the heartbeat fail?
    /var/log/messages (DMP, kernel, drivers, I/O errors)
Investigating crash.log
Page fault

Other examples:

(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN) Reboot in five seconds...
What we’ve learned
Learn: XenServer crash

• Did the host really crash?
• Kernel Panic
• Crashdump
• Triggering Crashdump manually
• Locating host reboot in the logs
• Reviewing crashdump logs
“Single-Pathing”
Storage Performance issue


• DMP has been enabled to improve performance
• Virtual Machines are running on different iSCSI SRs

LinuxGuestVM:~# hdparm -t /dev/xvdb

/dev/xvdb:
 Timing buffered disk reads: 96 MB in 3.07 seconds = 30.41 MB/sec
Storage Performance
Checking multipath status
# mpathutil status
360a9800050334f496334596c71665246 dm-13 NETAPP,LUN
[size=2.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=4][enabled]
 \_ 3:0:0:2 sdk 8:160 [active][ready]
 \_ 4:0:0:2 sdj 8:144 [active][ready]

The multipath device appears under /dev/mapper/<SCSI ID>; the individual paths under /dev/sd*.
Storage Performance
Determining current performance on domain0

• Testing the multipath device
# hdparm -t /dev/mapper/<SCSI ID>
• Testing single-path devices
# hdparm -t /dev/sdj
# hdparm -t /dev/sdm

In all cases: ~30 MB/sec
Storage Performance
Determining usage of paths

# iostat -x <device>
# iostat -x /dev/sdk /dev/sdj 5

Device    Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdk          803.50        33.0       4122        160
sdj          784.00        32.8       3922        155



                                Both paths are used equally
Storage Performance
Checking if there are really 2 iSCSI sessions

# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"


ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk
ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj
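Since each by-path name embeds the portal IP, a little parameter expansion shows at a glance whether the two paths really reach different targets. A sketch using names based on the output above:

```shell
# Extract the portal IP from each by-path symlink name
# (sample names modelled on the slide's output)
for p in \
  'ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2' \
  'ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2'
do
  ip=${p#ip-}        # drop the "ip-" prefix
  echo "${ip%%:*}"   # keep everything before the first colon
done | tee /tmp/iscsi_portals
```

Two different portal IPs confirm that two distinct iSCSI sessions exist; whether the traffic actually leaves on two NICs is checked next.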
Storage Performance
Checking if different paths are really used

# tcpdump -i any port 3260
# watch "ifconfig | egrep '(eth0 |eth1 )' -A5 | egrep '(eth|bytes)'"

eth0   Link encap:Ethernet   HWaddr 00:1D:09:70:88:2C
       RX bytes:1490076463 (1.3 GiB)   TX bytes:170615419 (162.7 MiB)
eth1   Link encap:Ethernet   HWaddr 00:1D:09:70:88:2E
       RX bytes:1801238 (1.7 MiB)      TX bytes:46695876 (44.5 MiB)

Almost all iSCSI traffic flows over eth0; eth1 is nearly idle.
Storage Performance
Checking source IP addresses for iSCSI sessions

# netstat -at | grep iscsi
10.1.200.138:53049      10.1.200.40:iscsi-target   ESTABLISHED
10.1.200.178:46684      10.1.201.40:iscsi-target   ESTABLISHED

Both sessions originate from 10.1.200.x source addresses: the session to 10.1.201.40 is not leaving via the second storage network.
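Parsing the addresses makes the mismatch easy to see: both sessions leave from 10.1.200.x sources, even the one going to the 10.1.201.40 target. A sketch using the sample lines from the slide:

```shell
# Pull source and destination addresses out of the (trimmed) netstat
# lines shown on the slide
printf '%s\n' \
  '10.1.200.138:53049      10.1.200.40:iscsi-target   ESTABLISHED' \
  '10.1.200.178:46684      10.1.201.40:iscsi-target   ESTABLISHED' |
awk '{ split($1, src, ":"); split($2, dst, ":");
       print "to " dst[1] " from " src[1] }' | tee /tmp/iscsi_sessions
```

A session "to 10.1.201.40 from 10.1.200.x" means the kernel routed it through the wrong interface, which the routing table on the next slide explains.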
Storage Performance
Checking kernel routing table

# route

Destination   Gateway      Genmask         Iface
10.1.200.0    *            255.255.255.0   xenbr0
10.1.200.0    *            255.255.255.0   xenbr1
default       10.1.200.1   0.0.0.0         xenbr0

Both storage interfaces sit on the same 10.1.200.0/24 subnet, so traffic to 10.1.201.40 falls through to the default route via xenbr0.
Storage Performance
Configuration of management interfaces in XenCenter




Change the IP address of ISCSI_2 to 10.1.201.78
Storage Performance
Checking the kernel routing table again

# route

Destination   Gateway      Genmask         Iface
10.1.200.0    *            255.255.255.0   xenbr0
10.1.201.0    *            255.255.255.0   xenbr1
default       10.1.200.1   0.0.0.0         xenbr0
Storage Performance
Configuring kernel routing table

...or (not recommended):

• Add to /etc/rc.local:

# route add -host 10.1.200.40 dev xenbr0
# route add -host 10.1.201.40 dev xenbr1


• What about Pool Upgrade and Pool Join?
Storage Performance
Determining current performance on VM

LinuxVM:~# hdparm -t /dev/xvdb

/dev/xvdb:
 Timing buffered disk reads: 45 MB/sec

Well Done! 
What we’ve learned
Case study: Single-pathing
• /dev/ locations for single-path and multipath devices
• # mpathutil status
• # hdparm -t
• # iostat
• # ifconfig, # tcpdump, # netstat, # route
• # watch
• Best practices for iSCSI storage
Questions
Resources
First aid kit

• http://docs.xensource.com - XenServer documentation
• http://support.citrix.com/product/xens/ - Knowledge Center
• http://forums.citrix.com/support - Support forums
• http://community.citrix.com/citrixready/xenserver - XenServer Central (one-stop information center)
Before you leave…
• Session surveys are available online at www.citrixsynergy.com
  starting Thursday, 7 October
 • Provide your feedback and pick up a complimentary gift card at the registration desk

• Download presentations starting Friday, 15 October, from your My
  Organiser Tool located in your My Synergy Microsite event account

								