File compression

You can compress or decompress files either with the mmchattr command or with the mmapplypolicy command with a MIGRATE rule. With the MIGRATE rule, administrators can create policies that select a compression library based on the access characteristics of the file to be compressed, with file-level granularity. You can do the compression or decompression synchronously or defer it until a later call to mmrestripefile or mmrestripefs.

The supported compression libraries are z, lz4, zfast, alphae, and alphah. They are intended primarily for compressing the following types of data:
z
Cold data. Favors compression efficiency over access speed.
lz4
Active, non-specific data. Favors access speed over compression efficiency.
zfast
Active genomic data in FASTA, SAM, or VCF format.
alphae
Active genomic data in FASTQ format. Slightly favors compression efficiency over access speed.
alphah
Active genomic data in FASTQ format. Slightly favors access speed over compression efficiency.
The following table shows the IBM Spectrum Scale file system format level that is required for each compression library.
Table 1. Compression libraries and their required file system format level and format number
Compression library       Required file system format level and format number
z                         4.2.0 (15.01) or later
lz4                       5.0.0 (18.00) or later
zfast, alphae, alphah     5.0.3 (21.00) or later
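The lookup in Table 1 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes that you already know the format number of the file system (for example, 18.00) and pass it in as a string.

```python
# Minimum file system format number for each compression library (Table 1).
REQUIRED_FORMAT = {
    "z": (15, 1),       # 4.2.0 (15.01) or later
    "lz4": (18, 0),     # 5.0.0 (18.00) or later
    "zfast": (21, 0),   # 5.0.3 (21.00) or later
    "alphae": (21, 0),
    "alphah": (21, 0),
}


def parse_format_number(text):
    """Turn a format number such as '18.00' into a comparable tuple (18, 0)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def library_supported(library, fs_format_number):
    """Return True if a file system at fs_format_number can use the library."""
    return parse_format_number(fs_format_number) >= REQUIRED_FORMAT[library]


if __name__ == "__main__":
    for name in REQUIRED_FORMAT:
        print(name, library_supported(name, "18.00"))
```

Comparing tuples of integers avoids the pitfalls of comparing format numbers as strings or floating-point values.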

Comparison with object compression

File compression and object compression use the same compression technology but are available in different environments and are configured in different ways. Object compression is available in the Cluster Export Services (CES) environment and is configured with the mmobj policy command. With object compression, you can create an object storage policy that periodically compresses new objects and files in a GPFS fileset.

File compression is available in non-CES environments and is configured with the mmapplypolicy command or directly with the mmchattr command.

Setting up file compression and decompression

The sample script /usr/lpp/mmfs/samples/ilm/mmcompress.sample, installed with IBM Spectrum Scale, provides examples of how to compress or decompress a fileset or a directory tree.

You can do file compression or decompression with either the mmchattr command or the mmapplypolicy command.

With the mmchattr command, you specify the --compression option and the names of the files or filesets that you want to compress or decompress. See the following examples:
  • The following command compresses a file with the lz4 compression library:
    mmchattr --compression lz4 trcrpt.150913.13.30.13.3518.txt
  • The following command decompresses the same file:
    mmchattr --compression no trcrpt.150913.13.30.13.3518.txt
For more information, see mmchattr command.
With the mmapplypolicy command, you create a MIGRATE rule that specifies the COMPRESS option and run mmapplypolicy to apply the rule.
Note: File compression and decompression with the mmapplypolicy command is not supported on Windows.
See the following examples:
  • The following rule selects files in the datapool storage pool whose names begin with the string green and compresses them with the z library:
    RULE 'COMPR1' MIGRATE FROM POOL 'datapool' COMPRESS('z') WHERE NAME LIKE 'green%'
  • The following rule decompresses the same set of files:
    RULE 'COMPR1' MIGRATE FROM POOL 'datapool' COMPRESS('no') WHERE NAME LIKE 'green%'
  • The following example shows three rules:
    • The first rule excludes from compression any file that ends with .mpg or .jpg.
    • The second rule automatically compresses any file that was not accessed in the last 30 days with z (libz.so).
    • The third rule automatically compresses with lz4 (liblz4.so) any file that was not modified in the last 2 days but was accessed in the last 30 days.
    RULE 'NEVER_COMPRESS' EXCLUDE WHERE lower(NAME) LIKE '%.mpg' OR lower(NAME) LIKE '%.jpg'
    RULE 'COMPRESS_COLD' MIGRATE COMPRESS('z') WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) >
    (INTERVAL '30' DAYS)
    RULE 'COMPRESS_ACTIVE' MIGRATE COMPRESS('lz4') WHERE (CURRENT_TIMESTAMP - MODIFICATION_TIME) > 
    (INTERVAL '2' DAYS) AND (CURRENT_TIMESTAMP - ACCESS_TIME) <= (INTERVAL '30' DAYS)
    
  • The following rule compresses genomic data in files with the extensions .fastq and .fq:
    RULE 'COMPRESS_GENOMIC' MIGRATE COMPRESS('alphae') WHERE lower(NAME) LIKE '%.fastq' OR lower(NAME) LIKE '%.fq'
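To make the selection logic of the three-rule example easier to follow, the sketch below restates it in Python. It is not how the policy engine is implemented. The function name choose_library and its parameters are hypothetical, and the sketch assumes that the rules are evaluated in order and that the first matching rule decides the outcome for a file.

```python
from datetime import datetime, timedelta


def choose_library(name, access_time, modification_time, now):
    """Mirror the NEVER_COMPRESS, COMPRESS_COLD, and COMPRESS_ACTIVE rules.

    Returns 'z', 'lz4', or None when no rule compresses the file.
    """
    lower = name.lower()
    # NEVER_COMPRESS: EXCLUDE names that end with .mpg or .jpg.
    if lower.endswith(".mpg") or lower.endswith(".jpg"):
        return None
    # COMPRESS_COLD: not accessed in the last 30 days.
    if now - access_time > timedelta(days=30):
        return "z"
    # COMPRESS_ACTIVE: not modified in the last 2 days. The access time is
    # within 30 days here because the previous test did not match.
    if now - modification_time > timedelta(days=2):
        return "lz4"
    return None


if __name__ == "__main__":
    now = datetime(2024, 3, 1)
    print(choose_library("report.txt", now - timedelta(days=45),
                         now - timedelta(days=60), now))
```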

When you do file compression, you can defer the compression operation to a later time. For more information, see the subtopic Deferred file compression.

Warning

Doing any of the following operations while the mmrestorefs command is running can corrupt file data:
  • Doing file compression or decompression. This includes compression or decompression with the mmchattr command or with a policy and the mmapplypolicy command.
  • Running the mmrestripefile command or the mmrestripefs command. Do not run either of these commands for any reason. Do not run these commands to complete a deferred file compression or decompression.

Reported size of compressed files

After a file is compressed, operating system commands, such as ls -l, display the uncompressed size. Use du or the GPFS command mmdf to display the actual, compressed size. You can also make the stat() system call to find how many blocks the file occupies.
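The stat() approach can be sketched in Python as follows. The sketch assumes a POSIX system, where st_blocks counts 512-byte units; the function name size_report is hypothetical. On a compressed file, the allocated size would be expected to be smaller than the apparent size.

```python
import os
import tempfile


def size_report(path):
    """Return the apparent size and the allocated size of a file, in bytes."""
    st = os.stat(path)
    return {
        "apparent_bytes": st.st_size,            # what ls -l reports
        "allocated_bytes": st.st_blocks * 512,   # blocks actually occupied
    }


if __name__ == "__main__":
    # Demonstrate on a temporary file; substitute the path of a real file.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 4096)
        tmp.flush()
        print(size_report(tmp.name))
```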

Deferred file compression

By default, the command that launches a file compression or decompression does not return until after the compression or decompression operation is completed. However, with both the mmchattr command and the mmapplypolicy command, you can defer the compression or decompression operation and have the command return as soon as it completes any other operations. By deferring compression or decompression, you can complete the operation later when the system is not heavily loaded with processes or I/O.

To defer the compression, with either command, specify the -I defer option. For example, the following command marks the specified file as needing compression but defers the compression operation:
mmchattr -I defer --compression yes trcrpt.150913.13.30.13.3518.txt
With the mmapplypolicy command, the -I defer option defers compression or decompression and data movement or deletion. For example, the following command applies the rules in the file policyfile but defers the file operations that are specified in the rules, including compression or decompression:
mmapplypolicy fs1 -P policyfile -I defer
To complete a deferred compression or decompression, run the mmrestripefile command or the mmrestripefs command with the -z option. (Do not run either of these commands if an mmrestorefs command is running. See the warnings in the preceding subtopic Warning.) The following command completes the deferred compression or decompression of the specified file:
mmrestripefile -z trcrpt.150913.13.30.13.3518.txt
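If you drive the two-step deferred workflow from a script, it can help to build the command lines in one place. The following Python sketch only assembles the argument lists from the options shown in this subtopic; it does not run anything. The function names are hypothetical, and the sketch assumes that a library name such as lz4 can be given in place of yes.

```python
import shlex


def defer_compression_cmd(path, library="yes"):
    """Command that marks a file for compression but defers the work."""
    return ["mmchattr", "-I", "defer", "--compression", library, path]


def complete_deferred_cmd(path):
    """Command that completes a deferred compression or decompression."""
    return ["mmrestripefile", "-z", path]


if __name__ == "__main__":
    target = "trcrpt.150913.13.30.13.3518.txt"
    # Print the commands for review; pass the lists to subprocess.run()
    # only on a node where the commands are installed and only when
    # mmrestorefs is not running.
    print(shlex.join(defer_compression_cmd(target)))
    print(shlex.join(complete_deferred_cmd(target)))
```

Keeping the commands as argument lists rather than strings avoids quoting problems with unusual file names.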

Indicators of file compression or decompression

The mmlsattr command displays two indicators that together describe the state of compression or decompression of the specified file:
COMPRESSION
The mmlsattr command displays the COMPRESSION flag on the Misc attributes line of its output. The flag is followed in parentheses by the name of the compression library that was used to compress the file. See the example of mmlsattr output in Figure 1. If present, the COMPRESSION flag indicates that the file is compressed or is marked for deferred compression. If absent, the absence indicates that the file is uncompressed or is marked for deferred decompression.
Note: This flag reflects the state of the GPFS_IWINFLAG_COMPRESSED flag in the gpfs_iattr64_t structure of the inode of the file. For more information about this structure, see the topic gpfs_iattr64_t structure.
illCompressed
The mmlsattr command displays the illCompressed flag on the flags line of its output. See Figure 1. If present, illCompressed indicates that the file is marked for compression or decompression but that compression or decompression is not completed. If absent, the absence indicates that compression or decompression is completed.
Note:
  • This flag reflects the state of the GPFS_IAFLAG_ILLCOMPRESSED flag in the gpfs_iattr64_t structure of the inode of the file. For more information about this structure, see the topic gpfs_iattr64_t structure.

  • Some file system events can cause the illCompressed flag to be set. Consider the following examples:
    • When data is written into an already compressed file, the existing data remains compressed but the new data is uncompressed. The illCompressed flag is set for this file.
    • When a compressed file is memory-mapped, the memory-mapped area of the file is decompressed before it is read into memory. The illCompressed flag is set for this file.
    For more information, see the subtopic Updates to compressed files.
In the following example, the output from the mmlsattr command includes both the COMPRESSION flag and the illCompressed flag. This combination indicates that the file is marked for compression but that compression is not completed:
Figure 1. Compression and decompression flags

mmlsattr -L green02.51422500687
file name:            green02.51422500687
metadata replication: 1 max 2
data replication:     2 max 2
immutable:            no 
appendOnly:           no
flags:                illCompressed
storage pool name:    datapool
fileset name:         root
snapshot name:
creation time:        Wed Jan 28 19:05:45 2015
Misc attributes:      ARCHIVE COMPRESSION (library lz4)
Encrypted:            no
       

Together the COMPRESSION and illCompressed flags indicate the compressed or uncompressed state of the file. See the following table:

Table 2. COMPRESSION and illCompressed flags
State of the file                 COMPRESSION is displayed?   illCompressed is displayed?
Uncompressed.                     No                          No
Decompression is not complete.    No                          Yes
Compressed.                       Yes                         No
Compression is not complete.      Yes                         Yes
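The mapping in Table 2 can be applied to mmlsattr -L output with a small parser. The Python sketch below is an illustration that assumes the output layout shown in Figure 1 (a flags line and a Misc attributes line, each with a colon after the label); the function name compression_state is hypothetical.

```python
STATES = {
    (False, False): "Uncompressed",
    (False, True): "Decompression is not complete",
    (True, False): "Compressed",
    (True, True): "Compression is not complete",
}


def compression_state(mmlsattr_output):
    """Classify a file from the text that mmlsattr -L prints for it."""
    flags = ""
    misc = ""
    for line in mmlsattr_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'label: value' line
        key = key.strip()
        if key == "flags":
            flags = value
        elif key == "Misc attributes":
            misc = value
    has_compression = "COMPRESSION" in misc.split()
    has_ill = "illCompressed" in flags.split()
    return STATES[(has_compression, has_ill)]


if __name__ == "__main__":
    sample = (
        "file name:            green02.51422500687\n"
        "flags:                illCompressed\n"
        "Misc attributes:      ARCHIVE COMPRESSION (library lz4)\n"
    )
    print(compression_state(sample))
```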

Partially compressed files

The COMPRESSION flag of a file is set when the user selects the file to be compressed by the mmchattr --compression yes command or by a policy run. The flag indicates that the user wants the file to be compressed.

If the user specifies the -I defer command option with the mmchattr command or with a policy run, the illCompressed flag of the file is set during the command execution or policy run. The illCompressed flag indicates that the request to compress the file has not been fulfilled. The illCompressed flag is reset when the file is actually compressed, that is, after the mmrestripefs -z or mmrestripefile -z command finishes compressing the file. The illCompressed flag can be set again when updates to the contents of the file cause update-driven decompression.

The compressibility of a file can change over time if its contents are changed. Different parts of a file may have different compressibility. Based on the 10% space-saving criterion (see the subtopic Limitations), some compression groups (in granularity of 10 data blocks) of a file might be compressed while others are not.

In sum, the state of the COMPRESSION flag, on or off, indicates the intention of the user to compress the file or not. The illCompressed flag indicates the compression execution status. The actual compression status of the data blocks depends on the illCompressed and COMPRESSION flags and the compressibility of the current data.
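The arithmetic behind compression groups and the 10% criterion can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the sizes are taken to be in the same unit before and after compression, because this topic does not say in which unit the file system measures the saving.

```python
GROUP_BLOCKS = 10          # a compression group spans up to 10 data blocks
MIN_SAVING_PERCENT = 10    # groups that save less than this stay uncompressed


def compression_group_count(data_blocks):
    """Number of compression groups in a file with the given block count."""
    return (data_blocks + GROUP_BLOCKS - 1) // GROUP_BLOCKS


def group_is_compressed(original_size, compressed_size):
    """Apply the 10% space-saving criterion to one compression group."""
    saved = original_size - compressed_size
    # Integer arithmetic avoids rounding surprises at exactly 10%.
    return saved * 100 >= MIN_SAVING_PERCENT * original_size


if __name__ == "__main__":
    print(compression_group_count(25))       # 25 blocks form 3 groups
    print(group_is_compressed(1000, 950))    # a 5% saving is below the bar
```

Because each group is judged on its own, a single file can contain both compressed and uncompressed groups.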

Updates to compressed files

When a compressed file is updated by a write operation, the file system automatically decompresses the region of the file that contains the affected data and sets the illCompressed flag. The file system then makes the update. To recompress the file, run the mmrestripefile command with the -z option, as in the following example:
mmrestripefile -z trcrpt.150913.13.30.13.3518.txt

The mmrestorefs command can cause a compressed file in the active file system to become decompressed if it is overwritten by the restore process. To recompress the file, run the mmrestripefile command with the -z option.

For more information, see the preceding subtopic Deferred file compression.

File compression and memory mapping

On Linux® and AIX® you can memory-map a file that is already compressed. The file system automatically decompresses the paged-in region and sets the illCompressed flag. To recompress the file, run the mmrestripefile command with the -z option.

As a convenience, the file system does not compress an uncompressed file or partially decompressed file if the file is memory-mapped. Compressing the file would not be effective because memory mapping decompresses any compressed data in the regions that are paged in.

File compression and direct I/O

You can open a compressed file for direct I/O, but internally the direct I/O reads and writes are replaced by buffered decompressed I/O reads and writes.

As a convenience, the file system does not compress a file that is opened for direct I/O. Compressing the file would not be effective because direct I/O would be replaced by buffered decompressed I/O.

Backing up and restoring compressed files

Files are decompressed when they are moved out of storage that is directly managed by IBM Spectrum Scale. This fact affects file backups by products such as IBM Spectrum Protect, IBM Spectrum Protect for Space Management (HSM), IBM Spectrum Archive, Transparent cloud tiering (TCT), and others. When you back up a file with these products, the file system decompresses the file data inline when it is read by the backup agent. The file system also sets the illCompressed flag in the file properties. The backed-up file data is not compressed.

When you restore a file to the IBM Spectrum Scale file system, the file data remains uncompressed but the illCompressed flag is still set. You can recompress the file by running mmrestripefs or mmrestripefile with the -z option.

FPO environment

File compression supports a File Placement Optimizer (FPO) environment or horizontal storage pools.

FPO block group factor: Before you compress files in a File Placement Optimizer (FPO) environment, you must set the block group factor to a multiple of 10. If you do not, then data block locality is not preserved and performance is slower.
For compatibility reasons, before you do file compression with FPO files, you must upgrade the whole cluster to version 4.2.1 or later. To verify that the cluster is upgraded, follow these steps:
  1. At the command line, enter the mmlsconfig command with no parameters.
  2. In the output, verify that minReleaseLevel is 4.2.1 or later.
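The two FPO prerequisites can be checked in a script along the following lines. This Python sketch is an assumption-laden illustration: it assumes that mmlsconfig prints a line of the form "minReleaseLevel 4.2.1.0", and the function names are hypothetical. Capture the real command output yourself and pass it in as text.

```python
def parse_min_release_level(mmlsconfig_output):
    """Return minReleaseLevel as a tuple of integers, or None if not found."""
    for line in mmlsconfig_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "minReleaseLevel":
            return tuple(int(part) for part in fields[1].split("."))
    return None


def meets_level(level, required=(4, 2, 1)):
    """True if the parsed level is at or above the required release."""
    return level is not None and level[:len(required)] >= required


def block_group_factor_ok(factor):
    """True if the FPO block group factor is a positive multiple of 10."""
    return factor > 0 and factor % 10 == 0


if __name__ == "__main__":
    sample = "clusterName example.cluster\nminReleaseLevel 4.2.1.0\n"
    level = parse_min_release_level(sample)
    print(level, meets_level(level), block_group_factor_ok(128))
```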

AFM environment

Files that belong to AFM and AFM DR filesets can also be compressed and decompressed. Compressed file contents are decompressed before being transferred from home to cache or from primary to secondary.

Before you do file compression with AFM and AFM DR, you must upgrade the whole cluster to version 5.0.0.

Limitations

File compression also has the following limitations:
  • File compression processes each compression group within a file independently. A compression group consists of one to 10 consecutive data blocks within a file. If the file contains fewer than 10 data blocks, the whole file is one compression group. If the saving of space for a compression group is less than 10%, file compression does not compress it but skips to the next compression group.
  • For file compression in an FPO-enabled file system, the block group factor must be a multiple of 10 so that the compressed data maintains data locality. If the block group factor is not a multiple of 10, the data locality is broken.
  • Direct I/O is not supported for compressed files.
  • File compression is not supported on a file system where HAWC is enabled.
  • The following operations are not supported:
    • Compressing files in snapshots
    • Compressing a clone
    • Compressing small files. Files that occupy fewer than two subblocks are not compressed, and small files are not compressed into an inode.
    • Compressing files other than regular files, such as directories.
    • Cloning a compressed file
    • Compressing an open file that is memory-mapped. See the subtopic File compression and memory mapping.
  • Additional limitations on Windows:
    • Compression or decompression with the mmapplypolicy command is not supported.
    • Compression of files in Windows hyper allocation mode is not supported.
    • Memory mapping a file that is already compressed is not supported.
    • The following Windows APIs are not supported:
      • FSCTL_SET_COMPRESSION to enable/disable compression on a file
      • FSCTL_GET_COMPRESSION to retrieve compression status of a file.
    • In Windows Explorer, in the Advanced Attributes window, the compression feature is not supported.