Packing and compression tools
When large data sets are stored at CSC or transported over the net it is usually reasonable to archive (i.e., pack multiple files into a single file) and compress (i.e., reduce its size without losing any data) the data. Archiving makes file transfer easier while compressed files require less storage space and are thus faster to move from one system to another. In this chapter we will provide introduction to [tar], [gzip], [bzip2], [zip], [7zip] and [Zstandard] tools that are frequently used for archiving and compression.
Type | Extension |
---|---|
A zip archive | .zip, .ZIP, or .Z |
A gzip compressed file | .gz |
A bzip2 compressed file | .bz2, or .bz |
A tar archive | .tar |
A gzip compressed tar archive | .tar.gz, or .tgz |
A bzip2 compressed tar archive | .tar.bz2, .tar.bz, .tbz, or .tbz2 |
A 7zip compressed file | .7z |
A Zstandard compressed file | .zst |
Tar: packing several files into one file
tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from tape archive, as it was originally developed to write data to sequential I/O devices with no file system of their own. However, nowadays tar – and its GNU version, gtar – are mostly used for data archiving within a normal disk environment. The archive files created by tar contain various file system parameters, such as name, time stamps, ownership, file access permissions, and directory structures, which makes moving and storing large file sets easier that trying to manage separate files.
By default, tar does not compress the data. This means that the size of the tar archive file is the same as the sum of the sizes of packed files, plus some overhead metadata. If data compression is needed, you can use compression tools like [gzip] or [bzip2] with tar.
tar and gtar are compatible with each other; if you archive your data with tar you can unarchive it with gtar and vice versa. Normally tar and/or gtar can be found from any Unix, Linux or Mac OS X system. In Windows systems you can use for example 7zip program to manage tar archive files.
The syntax of a tar command is:
tar options tar_archive file …
For example, we may have a directory called project_3, which contains nine files called sample1.txt, sample2.txt, …, sample9.txt. To list the contents of the directory we could use the command ls -lh:
$ ls -lh project_3
total 53M
-rw-r--r--+ 1 testuser csc 16M Nov 9 10:41 sample1.txt
-rw-r--r--+ 1 testuser csc 16M Nov 9 10:41 sample2.txt
-rw-r--r--+ 1 testuser csc 1.3M Nov 9 10:41 sample3.txt
-rw-r--r--+ 1 testuser csc 1.9M Nov 9 10:41 sample4.txt
-rw-r--r--+ 1 testuser csc 1.9M Nov 9 10:41 sample5.txt
-rw-r--r--+ 1 testuser csc 3.7M Nov 9 10:42 sample6.txt
-rw-r--r--+ 1 testuser csc 4.0M Nov 9 10:42 sample7.txt
-rw-r--r--+ 1 testuser csc 3.9M Nov 9 10:42 sample8.txt
-rw-r--r--+ 1 testuser csc 3.9M Nov 9 10:42 sample9.txt
We can archive all the files in the project_3 directory to a tar archive called project_3.tar with tar's create (c) command:
$ tar cvf project_3.tar project_3
The command above creates a new tar archive file, project_3.tar, which contains the directory project_3 and all the files in it. Note that the command does not modify or remove the original files in the source directory:
$ ls -lh
drwxr-xr-x+ 2 testuser csc 11 Nov 9 10:44 project_3
-rw-r--r--+ 1 testuser csc 52M Nov 9 10:46 project_3.tar
The tar command does not require that you assign a certain extension, like .tar, for your archive files. However, applying commonly used file name extensions will help you and other users to select right commands for processing the file later on. The newly created tar archive file can now be easily moved to another directory or system and then unarchived with the extract (x) command:
$ tar xvf project_3.tar
This command will create a directory called project_3, which contains all the same files as the original directory did. Note that if the extract command encounters a file that already exists in the file system, the existing file will be overwritten by the extracted file. This causes a potential danger: possibly a newer version of a file will be lost if an older version of the same file (or, just a file with a same name) exists in the tar archive that is being unarchived!
Command | Operation |
---|---|
A | Append tar files to an archive (i.e. concatenate archive files) |
c | Create a new tar archive |
d | Find differences between an archive and the file system |
r | Append files at the end of an archive |
t | List the contents of an archive |
u | Only append files that are newer than the existing ones in the archive |
x | Extract files from an archive |
--delete | Delete files from the archive |
In addition to the commands above, the option f (file) is almost always used as it defines the name of the tar file to read from or to write to. The file name must follow right after the option, thus it is commonly the last option given. Below are some frequently used tar options.
Option | Function |
---|---|
f | Use the given file name as the source or target archive |
v | Be verbose, i.e. list processed files to the screen while processing |
z | Use gzip compression/decompression while creating or extracting an archive |
j | Use bzip2 compression/decompression while creating or extracting an archive |
With options zip or j (no meaning, it was chosen because no meaningful letter were available) you can filter the archive through [gzip] or [bzip2] respectively to (de)compress the archive on the fly:
$ tar cvzf project_3.tar.gz project_3
Here, in this sample case, the size of the uncompressed archive file is 52 MB but the compressed file is only 15 MB.
$ ls -lh
drwxr-xr-x+ 2 testuser csc 11 Nov 9 10:44 project_3
-rw-r--r--+ 1 testuser csc 52M Nov 9 10:46 project_3.tar
-rw-r--r--+ 1 testuser csc 15M Nov 9 10:56 project_3.tar.gz
Listing the contents of a tar archive file can be done with command list:
$ tar tf project_3.tar
project_3/
project_3/sample3.txt
project_3/sample4.txt
project_3/sample5.txt
project_3/sample2.txt
project_3/sample1.txt
project_3/sample6.txt
project_3/sample8.txt
project_3/sample9.txt
project_3/sample7.txt
You can also retrieve just one file from the archive by specifying the full file name to the data extraction command. For example, to extract just file sample2.txt from the compressed archive project_3.tar.gz we could use command:
$ tar xvzf project_3.tar.gz project_3/sample2.txt
Compressing files
Compressing files saves storage space and makes data trasport faster, but it may take a lot of time. Data compression is CPU intensive and compressing a data set of several terabytes can easily require overnight computing.
There are numerous algorithms and software tools available for data compression. Here we will briefly show five tools that are used in Unix/Linux systems: gzip, bzip2, zip, 7z and Zstandard. What is common to all these tools is that they do the compression without data loss, i.e. when the files are uncompressed the data will be 100% identical to the original data. Typically, a text file compressed with one of these tools is about 20-40% of the size of the original file. However, the compressibility of a file depends heavily on the content of the file.
The compression time depends on the algorithm used and the type of data to be compressed. In most cases the newer Zstandard method is significantly faster than the older, but very widely used methods like gzip or zip. The table below shows results for one sample case, were 10GB text file (fastq formatted sequence data) was compressed with these five methods in the Puhti supercluster.
Command | Size of the compressed file |
Compression time | Decompression time |
---|---|---|---|
zstdmt | 2,6 GB | 2 min | 1,7 min |
zstdmt -T 4 | 2,6 GB | 0,8 min | 1,7 min |
7z | 1,8 GB | 56 min | 2,7 min |
gzip | 2,5 GB | 29 min | 2,5 min |
bzip2 | 2,0 GB | 17 min | 9 min |
zip | 2,5 GB | 29 min | 2,5 min |
gzip and gunzip
gzip is probably the most commonly used packing tool in Unix and Linux systems. It uses the Lempel-Ziv coding (LZ77) for compressing the data. gzip was already briefly mentioned in the [tar] chapter above, but gzip can also be used as a totally separate tool. The normal usage of gzip is straight forward. To compress a file you give the command:
$ gzip file_name
Running this command creates a compressed file and names it using the original file name with an extension .gz. When the compression is ready the original file is removed. If you want to preserve the original file, you need to use redirection, like:
$ gzip < file_name > file_name.gz
Decompressing a gzipped file is done with command gunzip. In addition to gzip compressed files, gunzip can also decompress files compressed with [zip] command. The basic syntax of gunzip is analogous to the compressing command:
$ gunzip file_name.gz
The command above removes the compressed file when decompression is ready. If you wish the keep the compressed file, you'll need redirection again:
$ gunzip < file_name.gz > file_name
gzip has several command line options that are not discussed here. Use man gzip command to see the full list of available options. Also, note that gunzip command is actually just a wrapper for the gzip -d command (option -d instructs gzip to decompress, rather than compress) so you will not find a separate manual page for that.
gzip example
Lets assume we are in the $WRKDIR directory of Puhti-shell and there we have just one file called my_data.dat. Lets first check the size of that file with command ls -lh:
$ ls -lh
total 1.5G
-rw-r--r--+ 1 testuser csc 1.5G Nov 4 13:07 my_data.dat
The listing tells us that the size of the file is approximately 1,5 GB. Next, we compress the file with gzip and then check the file size again.
$ gzip my_data.dat
$ ls -lh
total 834M
-rw-r--r--+ 1 testuser csc 833M Nov 4 13:07 my_data.dat.gz
The original file has now been removed and been replaced by the compressed file. However, as a result we now have a compressed file that requires only 833MB of disk space (55% of the original size). Next we decompress the file:
$ gunzip my_data.dat.gz
$ ls -lh
total 1.5G
-rw-r--r--+ 1 testuser csc 1.5G Nov 4 13:07 my_data.dat
The file listing now shows that the compressed file has disappeared and the original file is available again.
bzip2 and bunzip2
bzip2 is a compression program that is used very similarly compared to [gzip][bzip2]. The main difference between the two programs is that bzip2 uses Burrows-Wheeler block sorting text compression algorithm combined with Huffman coding instead of the LZ77 algorithm used in gzip. The compression algorithm of bzip2 produces more effective compression than gzip's. However, computing the bzip2 compression usually is more complex and takes longer (i.e., uses more CPU cycles) than gzip compression. The usage of bzip2 is very similar to that of gzip, but not all the command line options are identical. The basic compression syntax is:
$ bzip2 file_name
$ bzip2 < file_name > file_name.bz2
Similarly, the decompression can be done with command:
$ bunzip2 file_name.bz2
$ bunzip2 < file_name.bz2 > file_name
Note that a file compressed with bzip2 can not be uncompressed with gunzip and vice versa.
In addition to the standard bzip2 and bunzip2 programs, you can also use the parallel versions of bzip2 command: [][6]pbzip2 and [][7]pbunzip2. When these commands are used, the user must use option -p to define the number of processor cores to be used. For example, compressing the file my_data.dat using four cores can be done with command:
$ pbzip2 -p4 my_data.dat
Similarly, to decompress the file with two cores you can use command:
$ punbzip2 -p2 my_data.dat.bz2
The pbzip2 and pbunzip2 commands scale well for small core numbers. Already with two cores the pbzip2 is about as fast as gzip. The number of processors used does not affect to the actual result file. Thus, a file that has been compressed with parallel pbzip2 can be uncompressed with normal bunzip2 command and vice versa.
zip and unzip: the combined compression and file archiving tool
The zip program can be used for both archiving and compressing files. Given a list of files or a directories the zip command archives and compresses all the files in to a single zip archive file. So, in principle zip is analogous to the combination of [tar] and [gzip] commands. Later on the whole archive, or just certain files, can be extracted from the archive. The basic syntax of zip command is:
$ zip -options archive_file source_name
The source_name can be a list of files, directories, or a combination of both, that will be packed in to the archive_file. If the archive_file already exists, zip will replace the existing files in the archive with new ones from the source_name list, or add files if they do not exist in the archive_file yet. Note that unlike the tar command, zip does not add any files from subdirectories in to the archive by default. The option -r is needed to recursively add all files and subfolders from a given directory to the zip archive. Below is listed some commonly used zip command options.
Option | Function |
---|---|
-d | Remove (delete) entries from a zip archive |
-e | Encrypt the contents of the zip archive using a password which is entered on the terminal in response to a prompt |
-f | Replace (freshen) an existing file in the zip archive only if it has been modified more recently than the version already in the zip archive |
-l | Translate the Unix end-of-line character LF into the MSDOS convention CR LF |
-ll | Translate the MSDOS end-of-line CR LF into Unix LF |
-r | Travel the directory structure recursively |
-u | Update existing entries if newer on the file system and add new files |
-@ | Take the list of input files from standard input. Only one filename per line. |
Creating a zip archive does not affect the original files. Note that the zip command will add the .zip extension to the archive file name by default, if it is not already included.
Zip archives can be extracted and studied with the unzip command. To extract files from a zip archive, use the command:
$ unzip archive_file_name
To just see the files included in the zip archive use command:
$ unzip -l archive_file_name
You can also extract just one file from an archive with command:
$ unzip archive_file_name file_name
Option | Function |
---|---|
-f | Freshen existing files, i.e., extract only those files that already exist on disk and are newer than the ones on the disk |
-l | List the content of an archive file |
-u | Update existing files and create new ones if needed |
-o | Overwrite existing files without prompting |
-p password | Use password to decrypt an encrypted zip file |
zip example
To archive and compress the sample directory project_3, which contains the files sample1.txt, sample2.txt, …, sample9.txt (the same example that was used in the tar chapter) use the command:
$ zip -r project_3.zip project_3
adding: project_3/ (stored 0%)
adding: project_3/sample5.txt (deflated 71%)
adding: project_3/sample2.txt (deflated 72%)
adding: project_3/sample3.txt (deflated 70%)
adding: project_3/sample4.txt (deflated 71%)
adding: project_3/sample9.txt (deflated 71%)
adding: project_3/sample7.txt (deflated 71%)
adding: project_3/sample1.txt (deflated 73%)
adding: project_3/sample6.txt (deflated 72%)
adding: project_3/sample8.txt (deflated 71%)
Note that if you use the same zip command without the -r option, the archive file will not include the sample files in the directory but just the directory. However, in this case the project_3 directory does not include any subdirectories, so you could also do archiving with command:
$ zip project_3.zip project_3/*
adding: project_3/sample1.txt (deflated 73%)
adding: project_3/sample2.txt (deflated 72%)
adding: project_3/sample3.txt (deflated 70%)
adding: project_3/sample4.txt (deflated 71%)
adding: project_3/sample5.txt (deflated 71%)
adding: project_3/sample6.txt (deflated 72%)
adding: project_3/sample7.txt (deflated 71%)
adding: project_3/sample8.txt (deflated 71%)
adding: project_3/sample9.txt (deflated 71%)
In the same way, you can later add a new file to an existing zip archive, e.g.:
$ cp sample10.txt project_3/
$ zip project_3.zip project_3/sample10.txt
adding: project_3/sample10.txt (deflated 69%)
You can check the contents of the zip archive with unzip and option -l (small letter L):
$ unzip -l project_3.zip
Archive: project_3.zip
Length Date Time Name
-------- ---- ---- ----
16662202 11-09-09 10:41 project_3/sample1.txt
16397702 11-09-09 10:41 project_3/sample2.txt
1303352 11-09-09 10:41 project_3/sample3.txt
1925824 11-09-09 10:41 project_3/sample4.txt
1989706 11-09-09 10:41 project_3/sample5.txt
3813333 11-09-09 10:42 project_3/sample6.txt
4176523 11-09-09 10:42 project_3/sample7.txt
4056375 11-09-09 10:42 project_3/sample8.txt
4085713 11-09-09 10:42 project_3/sample9.txt
6541306 11-10-09 13:07 project_3/sample10.txt
-------- -------
60952036 10 files
To extract just the file sample3.txt from the zip archive, use the command:
$ unzip project_3.zip project_3/sample3.txt
Archive: project_3.zip
replace project_3/sample3.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
inflating: project_3/sample3.txt
To extract all the files from the archive, use the command:
$ unzip project_3.zip
Archive: project_3.zip
replace project_3/sample1.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename:A
inflating: project_3/sample1.txt
inflating: project_3/sample2.txt
inflating: project_3/sample3.txt
inflating: project_3/sample4.txt
inflating: project_3/sample5.txt
inflating: project_3/sample6.txt
inflating: project_3/sample7.txt
inflating: project_3/sample8.txt
inflating: project_3/sample9.txt
inflating: project_3/sample10.txt
7zip packing and compression tool
7zip is a packing and compression tools that is frequently used especially in Windows platforms. It can be however used in Max OSA and Linux systems too. By default the command uses its own 7z compression file format but it can utilize other compression file formats too.
In Puhti 7zip is launced with command 7z. The syntax of the command is
7z command -options archive_file file_names
The most important 7zip commands are:
- a Add files to an archive
- e extract files from an archive file
- l list files in the archive file
To archive and compress the sample directory project_3, which contains the files sample1.txt, sample2.txt, …, sample9.txt (the same example that was used in the tar chapter) use the command:
$ 7z a project_3_backup project_3/
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (206D7),ASM,AES-NI)
Scanning the drive:
1 folder, 9 files, 211442 bytes (207 KiB)
Creating archive: project_3_backup.7z
Items to compress: 10
Files read from disk: 9
Archive size: 42728 bytes (42 KiB)
Everything is Ok
This created an new 7z-compressed file: project_3_backup.7z. The 7z a command can also be used to add a new file existing archive. Foe example if we would get a new sample file called sample10.txt to the project_3 directory, we could add it to the previously created project_3_backup.7z file with command:
7z a project_3_backup project_3/sample10.txt
You can list the content of this file with 7z l command. For example
$ 7z l project_3_backup.7z
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (206D7),ASM,AES-NI)
Scanning the drive for archives:
1 file, 565592 bytes (553 KiB)
Listing archive: project_3_backup.7z
--
Path = project_3_backup.7z
Type = 7z
Physical Size = 565592
Headers Size = 341
Method = LZMA2:6m
Solid = +
Blocks = 2
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2018-11-08 11:02:30 D.... 0 0 project_3
2018-11-08 10:57:24 ....A 446 42419 project_3/sample1.txt
2018-11-08 10:57:33 ....A 54807 project_3/sample2.txt
2018-11-08 10:57:58 ....A 3152 project_3/sample3.txt
2018-11-08 10:58:06 ....A 1633 project_3/sample4.txt
2018-11-08 10:58:21 ....A 3151 project_3/sample5.txt
2018-11-08 10:58:39 ....A 161 project_3/sample6.txt
2018-11-08 10:58:48 ....A 1556 project_3/sample7.txt
2018-11-08 10:58:59 ....A 146070 project_3/sample8.txt
2018-11-08 10:59:12 ....A 466 project_3/sample9.txt
2018-11-08 11:13:25 ....A 5679358 522832 project_3/sample10.txt
------------------- ----- ------------ ------------ ------------------------
2018-11-08 11:13:25 5890800 565251 10 files, 1 folders
Compressed data is extracted with command: 7z e. You can extract just the defined files, fore example:
7z e project_3_backup.7z project_3/sample3.txt
Note that if you choose to download just individual files: 1. You must use the full file path that is given in the file listing and 2. the specified compressed file is downloaded to the current directory and not to a directory path defined in the compressed file name. Thus the above command would return a file sample3.txt to the directory where the command was executed.
The output directory for the extracted files can be defined with option -o. For example to retrieve all the compressed files to directory project_3, you should give command:
7z e -oproject_3 project_3_backup.7z
Zstandard compression tool
Zstandard is a fairly new and very fast compression tool. In Puhti, Zstandard compression can be done with command zstdmt. For example, to compress file data.txt give command:
zstdmt data.txt
The above command produces a compressed file named as data.txt.zst. For larger datafiles you can speed up the compression by using multiple computing cores (threads). The number of threads is defined with option -T. In the login nodes of Puhti it is recommended that you use just one thread that is the defult setting, but for example in puhti-shell you can use four threads:
zstdmt -T 4 data.txt
Decompression is defined by adding option *-d * to the command:
zstdmt -d data.txt.zst