
VIOS Shared Storage Pools 2.2.5 Enhancements

By Rob Gjertsen posted Thu June 25, 2020 06:35 PM

  

Enhancements for Shared Storage Pools

PowerVM continues to enhance Shared Storage Pools (SSP), PowerVM's cloud storage solution. SSP simplifies cloud management and improves storage efficiency. PowerVM 2.2.5 includes the following SSP enhancements:

  • Improved scalability

    • Increased client LPAR density per VIO server from 250 to 400 clients

      • Requires additional VIOS resources: 4 CPUs and 8 GB of memory.

      • Up to 250 LPARs per server are allowed with minimal resources: 1 CPU and 4 GB of memory.

  • Improved resiliency features

    • Cluster-wide automatic snapshot triggers automatic collection of debug snapshots across the cluster when a significant problem or error occurs.

    • Improved network outage handling for asymmetric loss of network connections.

    • Network health FFDC facilitates additional data capture when a network issue occurs.

    • Network lease by clock tick.

    • Automated log analysis provides a summary of cluster state and condition based on a cluster wide snapshot.

    • VIOSBR automatic backup, where configuration changes trigger an automatic backup of VIOS and SSP configuration information.

    • VIOSBR disaster recovery configuration restoration.

Background on Shared Storage Pools

One aspect of PowerVM is known as VIOS SSP, which stands for VIOS Shared Storage Pools.

VIOS SSP allows a group of VIOS nodes to form a cluster and provision virtual storage to client LPARs.  The VIOS nodes in the cluster all have access to the same underlying physical disks, which are grouped into a single pool of storage.  A virtual disk or LU can be carved out of that storage pool and mapped to a client LPAR as a virtual SCSI device.  An LU may be thinly or thickly provisioned: thinly provisioned LUs do not reserve blocks until they are written to, while thickly provisioned LUs reserve all of their blocks when the LU is created.
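As a concrete illustration, the following sketch shows how a cluster might be formed and a thin-provisioned LU carved out and mapped from the VIOS command line (the cluster name, pool name, disk names, LU name, and vhost adapter here are hypothetical placeholders):

$ cluster -create -clustername mycluster -repopvs hdisk2 -spname mypool -sppvs hdisk3 hdisk4 -hostname vios1

$ mkbdsp -clustername mycluster -sp mypool 20G -bd datalu1 -vadapter vhost0

Adding the -thick flag to the mkbdsp command would instead reserve all of the LU's blocks at creation time.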

Once an LU has been created in the pool, snapshots or clones of that LU can be created.  The number of snapshots and clones created is limited only by the amount of available storage in the pool, and creating these objects happens nearly instantly.  Snapshots are used for rolling back to previous points in time.  Clones are used for provisioning new space efficient copies of an LU.  These clones are managed by PowerVC capture and deploy image management operations.
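For instance, a snapshot of an LU can be created and later rolled back with the snapshot command on the VIOS (a minimal sketch; the snapshot, cluster, pool, and LU names are hypothetical):

$ snapshot -create snap1 -clustername mycluster -spname mypool -lu datalu1

$ snapshot -rollback snap1 -clustername mycluster -spname mypool -lu datalu1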

These features allow rapid deployment of new client LPARs in a Cloud Computing environment.  The storage pooling model of VIOS SSP simplifies administration of large amounts of storage.  The clustering aspect of VIOS SSP provides fault tolerance between VIOS multipathing pairs, and simplifies verification that other nodes can see the storage and are eligible for LPAR mobility operations.  

Additional background information on VIOS SSP can be obtained from the IBM Knowledge Center or from IBM Redbooks.

Cluster-wide Automatic Snapshot

Debug data collection across the cluster is often error prone and inconvenient. The following issues may appear during data collection:

  • Admins may collect snaps from some nodes but not others. The node where the problem manifests is not necessarily the key node that debug information should be collected from.
  • Admins may not know they need to collect a snap until it is too late and logs have wrapped.
  • Admins recovering from an outage will likely prioritize recovery over collecting data from the failure, increasing the likelihood of log wrap or complete loss of debug information (for example, rebooting before taking a snap).

The solution to these issues is to automate the debug collection process with a cluster-wide snapshot that is triggered when a "major" problem or unexpected issue occurs on a cluster node. A snapshot is collected for each node in the cluster and then aggregated in a convenient manner. An example flow of this cluster operation is captured in the following diagram:

[Diagram: example flow of a cluster-wide automatic snapshot]

Some specifics on the cluster-wide snapshot include:

  • Each of the SSP components (CAA, RSCT, VIOS, pool) can trigger the cluster snapshot when a major problem occurs at that component level.

  • The current types of problems triggering a cluster snapshot are:

    • Network outages

    • Pool full condition

    • Other pool outages (for example, inability to write meta-data for a period of time)

    • Pool start failure

    • Cluster operation failures (cluster create, add/remove node)

    • LU operation failures (LU remove, LU move)

    • Tier create failure

    • Backup/Restore failure

    • Long running command failures (for example, remove PV or replace PV)

    • Election failure for DBN node

    • Inability to appoint an MFS manager in the cluster, or resignation of the MFS manager

  • Spam filtering is incorporated to avoid redundant snap requests. Only one cluster snapshot occurs at a time, so the same event registered on several nodes only generates a single snapshot.

  • Various SSP components also guard against snap spamming from an ongoing problem condition.

  • For any nodes that are unreachable due to network and disk isolation, the snap will automatically be delayed until network and disk access is restored.

  • Once snaps have been taken on each node, they will be transferred to the initiator node and stored in a single compressed tar file.

  • The cluster snapshot file, csnap, is in tar.gz format and stored in /home/ios/logs/ssp_ffdc on the initiator node.

  • A cleanup policy is enforced to automatically delete older csnap files. The policy is based on the age and count of the csnap files. A maximum of 10 files are retained.

User initiated snaps can utilize the same cluster-wide framework via the clffdc command.

clffdc Command

Administrators can use the clffdc command to manually trigger snap collection for the various components. The syntax for this command is:

clffdc -c component [-l localCorrelator] [-p priority] [-v verbosity] [-f file]  

  [-n lineNumber] [-g correlator] [-s]

In regard to the various options:

  • The specified component can be VIOS, CAA, RSCT, pool, or FULL.  The "FULL" component option produces "snap -a" on each node instead of a reduced snapshot providing only SSP-specific information.

  • The priority indicates the severity of the failure: priority can be either 1 (high), 2 (medium), or 3 (low).

  • A unique correlator ID or value is utilized to associate node snapshots for a common cluster snapshot.

The csnap file has the format: csnap_date_time_by_component_priority_correlator.tar.gz.

For example, a cluster snapshot generated by CAA with medium priority and correlator value 4 is: csnap_20161023_103735_by_caa_Med_c4.tar.gz.
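As an illustration, an administrator could manually trigger a full cluster-wide collection at high priority with the following invocation (a sketch based on the syntax above):

clffdc -c FULL -p 1

The resulting csnap file is then available under /home/ios/logs/ssp_ffdc on the initiator node.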

Improved Network Outage Handling

Background

Network outage handling has primarily focused on the symmetric type of network isolation, where disjoint islands of nodes cannot communicate with each other. An island of nodes can communicate with every other node in its island over the network, but not with nodes from a different island.

If a node is isolated from the rest of the cluster, then the leader will expel this node to allow forward progress in the cluster.

  • Expel is a coordinated operation via Reliable Scalable Cluster Technology (RSCT) group services over the disk network that forces a node to take the pool offline after its network lease with the leader expires (the network lease interval is 60 seconds).

  • Majority rule is followed in this case to minimize the number of nodes taken out of commission by expel (for example, leader isolation should result in only the leader being expelled).

An asymmetric network outage is where network links are lost, but a node is not fully isolated. Handling for this situation was not previously optimized to minimize expels.

Improved Handling for Asymmetric Network Outages

Question: What is a network asymmetry?

  • The VIOS nodes in the cluster play various roles.
  • Every node is a client of the storage pool.

  • There is also one leader, one MFS manager or server, and one DBN node per cluster at any given time.

  • Network asymmetry occurs when a client can communicate with the leader but not the server.

 

The following diagram shows a network asymmetry in regard to the MFS manager or server and various nodes in the cluster:

[Diagram: network asymmetry between the MFS server and other nodes in the cluster]

Question: How can we improve on the current handling and minimize the number of expels?

The existing algorithm for handling network asymmetry makes a local decision on the client node where the client forces itself to be expelled if it can't maintain connectivity with the server. The server is given priority over the client, which is not the best decision if several clients can communicate fine with each other and the problem lies with the server.

  • A better algorithm for handling this condition is making a global decision from the perspective of the leader node.
  • If multiple clients are complaining about an unhealthy server, the leader can expel the server to minimize cluster impact.

    • The various nodes in the cluster communicate with the leader node on any experienced network issues with a server.

    • The leader can then make an informed decision on whether the server should be expelled.

  • Symmetric handling is given priority by ensuring this more common handling kicks in first. Asymmetric handling starts after a longer time period (2 additional lease intervals after lease expiry with the server), if required, so that the two types of handling do not conflict with each other.

Network Health FFDC

When a node is expelled at the pool level due to an apparent loss of network, there is always the question of whether this event was indeed due to a network outage. Generally this question cannot be answered post mortem because the system is no longer in the same state as when the problem occurred. The potential network issue can fall into several categories:

  • A network problem that is specific to an individual SSP connection or set of connections (symmetric or asymmetric network outage).
  • A network problem that is specific to all connections on a particular node (total network isolation).
  • The network itself may be healthy, but the SSP threads may be unresponsive due to CPU starvation, or similarly starvation may even occur in the lower level network layer handling.
  • A software bug with network handling among various layers.

The solution to this problem is to capture more network health state when an expel occurs:

  • Capture more internal statistics on the SSP connections during runtime (client and server information).
  • Capture ping results between nodes.
  • Capture lparstat to check for thread starvation.
  • Allow capture of additional network stats easily in the future, if desired (for example, tcpdump).
  • Invoke a configurable script at /opt/pool/dump.netstat.
  • Log event output in /var/adm/pool/netstat.log.

The network health capture is performed automatically at the time of expel on the leader node and the expelled node, but it can also be explicitly invoked via pooladm (after performing oem_setup_env on the VIOS):

pooladm dump netstat [-reset]
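For example, a manual capture and inspection of the resulting log might look like the following (a sketch; pooladm requires the root shell reached via oem_setup_env, as noted above):

$ oem_setup_env

# pooladm dump netstat

# cat /var/adm/pool/netstat.log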

Example of Data Capture in netstat.log

The following is an example of the data collection logfile, netstat.log, for network health on the leader node of the cluster after an expel event:

 


########

# DATE #

########

Fri Oct 21 10:28:15 CDT 2016



#########

# NODES #

#########

Expelled:

vss7-c58.aus.stglabs.ibm.com



#######

# MSG #

#######

Client stats with server: #0 vss7-c57.aus.stglabs.ibm.com

numMsgSent:  196

numMsgRcvd:  373

avgRespTime: 0 sec 0 nsec

maxRespTime: 0 sec 0 nsec

Client stats with server: #1 vss7-c58.aus.stglabs.ibm.com

numMsgSent:  10

numMsgRcvd:  19

avgRespTime: 0 sec 0 nsec

maxRespTime: 0 sec 0 nsec

Client stats with server: #2 vss7-c59.aus.stglabs.ibm.com

numMsgSent:  8

numMsgRcvd:  183

avgRespTime: 0 sec 0 nsec

maxRespTime: 0 sec 0 nsec

Client aggregate stats:

numMsgSent:  214

numMsgRcvd:  575

avgRespTime: 0 sec 0 nsec

maxRespTime: 0 sec 0 nsec

Server stats:

numMsgSent:  1339

numMsgRcvd:  1611

avgRespTime: 0 sec 798475 nsec

maxRespTime: 0 sec 287679650 nsec



########

# PING #

########

PING vss7-c58.aus.stglabs.ibm.com: (9.3.148.120): 56 data bytes

--- vss7-c58.aus.stglabs.ibm.com ping statistics ---

10 packets transmitted, 0 packets received, 100% packet loss



########

# LPAR #

########

System configuration: type=Shared mode=Capped smt=4 lcpu=4 mem=3072MB psize=64 ent=1.00

%user  %sys  %wait  %idle physc %entc  lbusy  vcsw phint

----- ----- ------ ------ ----- ----- ------ ----- -----

  0.0   0.0    0.2   99.8  0.00   0.0    0.2 15823321  1608


 

Network Lease by Clock Tick

A recurring issue has been that a cluster node may unexpectedly lose its network lease when the system administrator changes the system time:

  • The network lease with the leader was based on local time of day.
  • The administrator had to stop SSP on that node prior to updating the system time.
  • Otherwise, if the time was moved forward far enough, the network lease with the leader expired and the node was expelled; this is more problematic when performed on several nodes at once.
  • Starting up NTP (Network Time Protocol) with clocks out of sync could trigger this.

The solution to this problem is basing the network lease on clock ticks since boot time.

The use of NTP for synchronizing cluster node clocks is still recommended to assist with easier cluster log analysis.

Auto Log Analysis

Auto log analysis is currently a feature at the storage pool level, helping address the difficulties in diagnosing pool problems in the cluster. It is motivated by several factors:

  • Analysis of debug data from all nodes in the cluster can be very time consuming.
  • Even a high level analysis becomes unwieldy with larger clusters (16-24 nodes) and eventually will be impractical.

The solution to the increasing complexity in analyzing the storage pool is to provide a command line utility for pool analysis that provides a summary of important cluster state and changes with respect to the storage pool.

The auto log analysis:

  • Utilizes the cluster-wide auto snap framework that provides a consistent directory hierarchy.
  • Detects and reports common problem signatures.
  • Assists in quickly determining the problem node(s) that should be focused on.

The current summary information provided by this tool includes:

  • A list of nodes in the cluster and current status.
  • History of MFS managers.
  • Expel history.
  • Tiers in pool and details; disks in pool and usage.
  • Recent command failures.
  • In progress commands.

Command Options

Auto log analysis for the storage pool is invoked with the pooladm command. If analysis is not performed on the cluster itself, then the pooladm command must be copied over to the system for the analysis. There are several options that include: analysis of the cluster-wide snapshot, analysis of a single node (based on logs from a live system), and the ability to unpack the cluster-wide snapshot for analysis.

Analysis of cluster-wide snapshot

# pooladm analyze snap

snap     <csnapPath> { [ -all ] | [ -nodeList | -mfsHistory [<maxEntries>] |

    -expelHistory [<maxEntries>] | -tierList | -diskList [-v] |

    -cmdFailures [<maxEntries>] | -cmdInProgress ] }

where:

         <csnapPath>   The absolute path to the unpacked cluster wide snap

         <maxEntries>  The max number of entries to display

Unpack of cluster-wide snapshot

# pooladm analyze unpack

unpack   <csnapPath>

where:

         <csnapPath>   The absolute path to the packed cluster wide snap

Analysis of live system from single node logs

# pooladm analyze live

live     [ -d <snapPath> ]

         { [ -all ] | [ -nodeList | -mfsHistory [<maxEntries>] |

           -expelHistory [<maxEntries>] | -tierList | -diskList [-v] |

           -cmdFailures [<maxEntries>] | -cmdInProgress ] }

where:

         <snapPath>    The absolute path to the snap for a single node

                       If not specified /var/adm/pool/pool.snap.system

                       is used.

         <maxEntries>  The max number of entries to display

Example Command Use

# pooladm analyze unpack /tmp/csnap_20161023_103735_by_caa_Med_c4.tar.gz

 

# ls /tmp/csnap

vss7-c57  vss7-c58  vss7-c59  vss7-c60

 

# pooladm analyze snap /tmp/csnap -all

=== Begin Node List ===

Node Name                     IP Address                Status   Leader MFS  

vss7-c59.aus.stglabs.ibm.com  9.3.148.121               online   Yes    Yes  

vss7-c60.aus.stglabs.ibm.com  9.3.148.127               online   No     No   

vss7-c58.aus.stglabs.ibm.com  9.3.148.120               online   No     No   

vss7-c57.aus.stglabs.ibm.com  9.3.148.119               online   No     No   

=== End Node List ===

=== Begin MFS History ===

Node Name                     Timestamp                 Event

vss7-c59                      Sun Oct 23 09:56:39 2016  Elected

vss7-c57                      Sun Oct 23 09:56:33 2016  Expelled

vss7-c57                      Sun Oct 23 09:47:42 2016  Elected

=== End MFS History ===

=== Begin Expel History ===

Node Name                     Timestamp                 Reason

vss7-c58.aus.stglabs.ibm.com  Sun Oct 23 09:56:14 2016  leader majority, so nodes on network watch list are thrown out

 

# pooladm analyze snap /tmp/csnap -tierList -diskList

=== Begin Tier List ===

TierName            Capacity    Freespace    EC         NStale    NECCR    

================================================================================

tier1               9984 MB     9983 MB      NONE       0         0        

SYSTEM              4992 MB     4889 MB      MIRROR2    0         0        



=== End Tier List ===

=== Begin Disk List ===

Pool: /pool1

Node: vss7-c59

Disk            Tier           FG        

=========================================

/dev/hdisk4     tier1          fg1       

/dev/hdisk5     tier1          fg2       

/dev/hdisk2     SYSTEM         fg1       

/dev/hdisk3     SYSTEM         fg2      


=== End Disk List ===

VIOSBR Automatic Backup

The viosbr command now has the ability to automatically take a backup of the VIOS and SSP configuration whenever there are any configuration changes.

  • This is performed via a cron job that is triggered every hour and enabled on the system by default.
  • Administrators can stop or start the feature and check the status.

The command and options are:


viosbr -autobackup {start | stop | status} [ -type {cluster |  node} ]

In regard to the options:

  • start option starts the autobackup feature.
  • stop option stops the autobackup feature.
  • status option checks if the autobackup file is up to date.
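For example, to check whether the automatic backups are current, and to temporarily disable and then re-enable the feature on a single node (a sketch following the syntax above):

$ viosbr -autobackup status

$ viosbr -autobackup stop -type node

$ viosbr -autobackup start -type node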

If SSP is configured, the cluster level backup file is present only in the default path of the database node.

The 'save' option saves the backup file to the default path on the other nodes of the cluster:


viosbr -autobackup  save

Only the latest copy of the backup file is stored in the default path /home/padmin/cfgbackups.

Here is an example of the backup files:

$ ls -l /home/padmin/cfgbackups

 -rw-r--r--    1 root     system         5464 Oct 10 03:00 autoviosbr_jaguar9.tar.gz

 -rw-r--r--    1 root     system         5464 Oct 10 03:00 autoviosbr_SSP.mycluster.tar.gz

  • autoviosbr_jaguar9.tar.gz file contains the VIOS backup data.
  • autoviosbr_SSP.mycluster.tar.gz file contains the SSP cluster level backup data.

VIOSBR for Disaster Recovery

Overview

The viosbr command is enhanced with a new disaster recovery option, 'viosbr -dr', that restores the SSP cluster on a secondary setup with mirrored storage and a different set of hosts. The secondary setup can be a local site or a remote disaster site, but the prerequisite is mirrored storage across sites. Initially, a backup of the cluster configuration at the primary site is taken with the viosbr command; upon a primary site failure, the viosbr command is invoked to restore the cluster configuration at the secondary site with a new set of VIO servers and the mirrored storage. Note that this is a manual disaster recovery process controlled by the administrator.

An overview of the disaster recovery process is at:

PowerVM Disaster Recovery (DR) with VIOS and SSP

Primary Site

The following steps are performed on the primary site for disaster recovery handling with the viosbr command:

  • Enable storage level mirroring of all the disks (Storage1 and Storage2).
  • Take a backup of the primary site cluster configuration with the viosbr command (see the example below).
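A primary site backup might look like the following sketch, where the cluster name and output file name are hypothetical (this invocation would produce the systemA.mycluster.tar.gz file used in the restore example later in this post):

$ viosbr -backup -clustername mycluster -file systemA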

The following diagram shows an example configuration of the primary site with storage mirroring to the secondary site.

[Diagram: primary site configuration with storage mirroring to the secondary site]

Secondary Site

As one step in cluster restoration, the cluster is created on the secondary site by providing the following input for the viosbr command.

  • Primary site backup file.
  • New host name list for the SSP cluster definition.
  • Disk list for the mirrored disks from Storage2.

Additional steps are required for restoring client LPARs and mappings from the primary site.

The following diagram shows the secondary site SSP configuration.

[Diagram: secondary site SSP configuration]

Command Usage

Here is an example invocation of the viosbr command for disaster recovery restoration on the secondary site with sample input files:

viosbr -dr -clustername mycluster -file systemA.mycluster.tar.gz -type cluster -typeInputs hostnames_file:/home/padmin/nodelist,pooldisks_file:/home/padmin/disklist -repopvs hdisk#

$ cat /home/padmin/nodelist

DRVIOS1

$ cat /home/padmin/disklist

hdisk1

hdisk3

hdisk2

Concluding Remarks

The enhancements of PowerVM 2.2.5 with Shared Storage Pools have focused primarily on improved resiliency of the product based on issues encountered in the field and from customer feedback. PowerVM will continue to enhance the resiliency and feature set provided by SSP in future releases.

Contacting the PowerVM Team
Have questions for the PowerVM team or want to learn more? Follow our discussion group on LinkedIn IBM PowerVM or IBM Community Discussions
