The S3 client
This chapter describes how to use the Allas object storage service with the s3cmd command line client. This client uses the S3 protocol, which differs from the Swift protocol used in the Rclone, swift and a-commands examples. Normally, data uploaded with S3 can be used with the Swift protocol too. However, files larger than 5 GB that have been uploaded to Allas with Swift cannot be downloaded with the S3 protocol.
From the user perspective, one of the main differences between the S3 and Swift protocols is that Swift-based connections remain valid for eight hours at a time, whereas an S3 connection remains permanently open. The permanent connection is practical in many ways, but it carries a security risk: if your CSC account is compromised, so is the object storage space.
The syntax of the s3cmd command:
s3cmd -options command parameters
The most commonly used s3cmd commands:
s3cmd command | Function |
---|---|
mb | Create a bucket |
put | Upload an object |
ls | List objects and buckets |
get | Download objects and buckets |
cp | Copy an object |
del | Remove objects |
rb | Remove a bucket |
md5sum | Get the checksum |
info | View metadata |
signurl | Create a temporary URL |
put -P | Make an object public |
setacl --acl-grant | Manage access rights |
The table above lists only the most essential s3cmd commands. For a more complete list, visit the s3cmd manual page or type:
s3cmd -h
Getting started with s3cmd
If you use Allas on Puhti or Mahti, all required packages and software are already installed. In this case, you can skip this section and proceed to the section Configuring S3 connection in supercomputers.
To configure an s3cmd connection, you need to have OpenStack and s3cmd installed in your environment.
OpenStack and s3cmd installation:
Fedora/RHEL derivatives:
sudo yum update
sudo yum install python3
sudo pip3 install python-openstackclient
sudo yum install s3cmd
Debian/Ubuntu derivatives:
sudo apt install python3-pip
sudo pip3 install python-openstackclient
sudo apt install restic
curl https://rclone.org/install.sh | sudo bash
sudo pip3 install s3cmd
s3cmd can also be installed with pip3 inside a python3 virtualenv:
pip3 install s3cmd
Please refer to http://s3tools.org/download and http://s3tools.org/usage for upstream documentation.
Configuring S3 connection in local computer
Once you have OpenStack and s3cmd installed in your environment, you can download the allas_conf script to set up the S3 connection to your Allas project.
wget https://raw.githubusercontent.com/CSCfi/allas-cli-utils/master/allas_conf
source allas_conf --mode S3 --user your-csc-username
Use the --user option to define your CSC username. The configuration command first asks for your CSC password and then asks you to choose an Allas project. After that, the tool creates a key file for the S3 connection and stores it in the default location (.s3cfg in your home directory).
Configuring S3 connection in supercomputers
To use s3cmd in Puhti and Mahti, you must first configure the connection:
module load allas
allas-conf --mode S3
You can use the S3 credentials, stored in the .s3cfg file, in other services too. You can check the currently used access_key and secret_key with the command:
grep key $HOME/.s3cfg
If you use these keys in other services, you should make sure that the keys always remain private. Anyone who has access to these two keys can access and modify all the data that the project has in Allas.
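One way to reduce the risk of leaking the keys is to check that the key file is readable only by you. The following is a minimal sketch, assuming a Unix-like system where ls prints the usual ten-character mode string. The function name keyfile_is_private is hypothetical and is not part of s3cmd or allas-conf, and the demonstration uses a scratch file rather than the real key file.

```shell
# keyfile_is_private FILE
# Succeeds only when FILE is a regular file that grants no permissions
# to the group or to other users (a mode string such as -rw-------).
# Hypothetical helper for illustration; not part of s3cmd or allas-conf.
keyfile_is_private() {
    perms=$(ls -ld "$1" 2>/dev/null | cut -c 1-10)
    case "$perms" in
        -???------) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstration on a scratch file instead of the real $HOME/.s3cfg.
scratch=$(mktemp)
chmod 644 "$scratch"
keyfile_is_private "$scratch" || echo "scratch file is readable by others"
chmod 600 "$scratch"
keyfile_is_private "$scratch" && echo "scratch file is private"
```

With the real key file, the same check could be written as keyfile_is_private "$HOME/.s3cfg" || chmod 600 "$HOME/.s3cfg".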
If needed, you can deactivate an S3 key pair with the command:
allas-conf --s3remove
Create buckets and upload objects
Create a new bucket:
s3cmd mb s3://my_bucket
Upload a file to a bucket:
s3cmd put my_file s3://my_bucket
List objects and buckets
List all buckets in a project:
s3cmd ls
List all objects in a bucket:
s3cmd ls s3://my_bucket
Display information about a bucket:
s3cmd info s3://my_bucket
Display information about an object:
s3cmd info s3://my_bucket/my_file
Download objects and buckets
Download an object:
s3cmd get s3://my_bucket/my_file new_file_name
Using the command md5sum, you can check that the file has not been changed or corrupted:
$ md5sum my_file new_file_name
39bcb6992e461b269b95b3bda303addf  my_file
39bcb6992e461b269b95b3bda303addf  new_file_name
In the above example, the checksums match between the original and downloaded file.
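The comparison can also be scripted instead of being checked by eye. The following is a minimal sketch, assuming that the md5sum command from GNU coreutils is available. The function name same_md5 is hypothetical, and the demonstration uses small scratch files in place of my_file and new_file_name.

```shell
# same_md5 FILE1 FILE2
# Succeeds (exit status 0) only when both files have the same MD5 checksum.
# Returns 2 when either file cannot be read.
# Hypothetical helper for illustration; not an s3cmd feature.
same_md5() {
    sum1=$(md5sum < "$1") || return 2
    sum2=$(md5sum < "$2") || return 2
    # When reading standard input, md5sum prints "<checksum>  -";
    # strip everything after the first space to keep only the checksum.
    [ "${sum1%% *}" = "${sum2%% *}" ]
}

# Demonstration with scratch files standing in for the original
# and the downloaded copy.
demo_dir=$(mktemp -d)
printf 'example content\n' > "$demo_dir/my_file"
cp "$demo_dir/my_file" "$demo_dir/new_file_name"
if same_md5 "$demo_dir/my_file" "$demo_dir/new_file_name"; then
    echo "checksums match"
else
    echo "checksums differ"
fi
```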
Download an entire bucket:
s3cmd get -r s3://my_bucket/
Copy objects
Copy an object to another bucket. Note that you should use these commands only for objects that were uploaded to Allas with the S3 protocol:
s3cmd cp s3://sourcebucket/objectname s3://destinationbucket
For example:
$ s3cmd cp s3://bigbucket/bigfish s3://my-new-bucket
remote copy: 's3://bigbucket/bigfish' -> 's3://my-new-bucket/bigfish'
Rename the file while copying it:
$ s3cmd cp s3://bigbucket/bigfish s3://my-new-bucket/newname
remote copy: 's3://bigbucket/bigfish' -> 's3://my-new-bucket/newname'
Delete objects and buckets
Delete an object:
s3cmd del s3://my_bucket/my_file
Delete a bucket:
s3cmd rb s3://my_bucket
s3cmd and public objects
In this example, the object salmon.jpg in the pseudo folder fishes is made public:
$ s3cmd put fishes/salmon.jpg s3://my_fishbucket/fishes/salmon.jpg -P
Public URL of the object is: https://a3s.fi/my_fishbucket/fishes/salmon.jpg
Giving another project read access to a bucket
You can control access rights using the command s3cmd setacl. This command requires the UUID (universally unique identifier) of the project you want to grant access to. Project members can check their project ID in https://pouta.csc.fi/dashboard/identity/ or using the command openstack project show. For example, on Puhti and Mahti:
module load allas
allas-conf -k --mode s3cmd
openstack project show $OS_PROJECT_NAME
With s3cmd, read and write access can be controlled for both buckets and objects.
The following command gives the project with UUID 3d5b0ae8e724b439a4cd16d1290 read access to my_fishbucket but not to the objects inside it:
s3cmd setacl --acl-grant=read:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket
The following command gives the same project write access to the object bigfish:
s3cmd setacl --acl-grant=write:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket/bigfish
To apply a grant to the bucket and all the objects in it, add the option --recursive to the command:
s3cmd setacl --recursive --acl-grant=read:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket
You can check the access permissions with s3cmd info:
$ s3cmd info s3://my_fishbucket|grep -i acl
ACL: other_project_uuid: READ
ACL: my_project_uuid: FULL_CONTROL
The option --acl-revoke can be used to remove read or write access:
s3cmd setacl --recursive --acl-revoke=read:$other_project_uuid s3://my_fishbucket
The shared objects and buckets can be used with both S3 and Swift based tools. Note however, that listing commands show only buckets owned by your project. In the case of shared buckets and objects you must know the names of the buckets in order to use them.
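In the case of the example above, a user from project 3d5b0ae8e724b439a4cd16d1290 will not see my_fishbucket, even though it is shared, with the command:
s3cmd ls
However, the same user can list the contents of the bucket by giving its name:
s3cmd ls s3://my_fishbucket
In the Pouta web interface, the shared bucket can be opened by including the bucket name in the URL:
https://pouta.csc.fi/dashboard/project/containers/container/my_fishbucket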
Use example
In this example, we store a simple dataset in Allas using s3cmd.
First, create a new bucket. The command s3cmd ls reveals that the object storage is empty at first. Then, use the command s3cmd mb to create a new bucket called fish-bucket.
$ s3cmd ls
ls
$ s3cmd mb s3://fish-bucket
mb s3://fish-bucket/
Bucket 's3://fish-bucket/' created
$ s3cmd ls
ls
2018-03-12 13:01  s3://fish-bucket
It is recommended to collect the data to be stored as larger units and compress it before uploading it to the system.
In this example, we store the Bowtie2 indices and the genome of the zebrafish (Danio rerio) in the fish bucket. Running ls -lh shows that the index files are available in the current directory:
$ ls -lh
total 3.2G
-rw------- 1 kkayttaj csc 440M Mar 12 13:41 Danio_rerio.1.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 13:41 Danio_rerio.2.bt2
-rw------- 1 kkayttaj csc 217K Mar 12 13:20 Danio_rerio.3.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 13:20 Danio_rerio.4.bt2
-rw------- 1 kkayttaj csc 1.3G Mar 12 13:13 Danio_rerio.GRCz10.dna.toplevel.fa
-rw------- 1 kkayttaj csc 440M Mar 12 14:03 Danio_rerio.rev.1.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 14:03 Danio_rerio.rev.2.bt2
-rw------- 1 kkayttaj csc 599K Mar 12 13:13 log
The data is collected and compressed as a single file using the tar command:
tar zcf zebrafish.tgz Danio_rerio*
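Before uploading, it can be useful to confirm that the archive contains the expected files. The following is a minimal sketch, assuming a tar that supports gzip through the z option. It works in a scratch directory with two small placeholder files, which are hypothetical stand-ins for the real index files.

```shell
# Work in a scratch directory so that no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Placeholder files standing in for the real Bowtie2 indices.
printf 'placeholder\n' > Danio_rerio.1.bt2
printf 'placeholder\n' > Danio_rerio.2.bt2

# Pack and compress the files, as in the command above.
tar zcf zebrafish.tgz Danio_rerio*

# List the archive members without extracting them (t = list).
archive_members=$(tar ztf zebrafish.tgz)
echo "$archive_members"
```

If a file is missing from the listing, the archive can be recreated before any time is spent on the upload.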
The size of the resulting file is about 2 GB. The compressed file can be uploaded to the fish bucket using the command s3cmd put:
$ ls -lh zebrafish.tgz
-rw------- 1 kkayttaj csc 9.3G Mar 12 15:23 zebrafish.tgz
$ s3cmd put zebrafish.tgz s3://fish-bucket
put zebrafish.tgz s3://fish-bucket
upload: 'zebrafish.tgz' -> 's3://fish-bucket/zebrafish.tgz' [1 of 1]
2081306836 of 2081306836 100% in 39s 50.16 MB/s done
$ s3cmd ls s3://fish-bucket
ls s3://fish-bucket
2019-10-01 12:11 9982519261 s3://fish-bucket/zebrafish.tgz
Uploading 2 GB of data takes time. Retrieve the uploaded file:
s3cmd get s3://fish-bucket/zebrafish.tgz
By default, this bucket can only be accessed by the project members. However, using the command s3cmd setacl, you can make the file publicly available.
First make the fish bucket public:
s3cmd setacl --acl-public s3://fish-bucket
Then make the zebrafish genome file public:
s3cmd setacl --acl-public s3://fish-bucket/zebrafish.tgz
The syntax of the URL of the file:
https://a3s.fi/bucket_name/object_name
In this case, the file would be accessible using the link https://a3s.fi/fish-bucket/zebrafish.tgz
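The URL pattern above can be wrapped in a small helper when several links are needed. The following is a minimal sketch. The function name public_url is hypothetical, and the function only formats a string: it does not check that the object exists or is public, and it does not URL-encode special characters in names.

```shell
# public_url BUCKET OBJECT
# Prints the public URL of an object, following the pattern
# https://a3s.fi/bucket_name/object_name.
# Hypothetical helper for illustration; no request is sent to Allas.
public_url() {
    printf 'https://a3s.fi/%s/%s\n' "$1" "$2"
}

public_url fish-bucket zebrafish.tgz
```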
Publishing objects temporarily with signed URLs
With the command s3cmd signurl, an object in Allas can be published temporarily with a URL that includes an access token for added security.
In the previous example, the object s3://fish-bucket/zebrafish.tgz was made permanently accessible through a simple static URL. With signurl, the same object could be shared more securely and only for a limited time. For example, the following command prints a URL that remains valid for 3600 seconds (one hour):
s3cmd signurl s3://fish-bucket/zebrafish.tgz +3600
https://fish-bucket.a3s.fi/zebrafish.tgz?AWSAccessKeyId=78e6021a086d52f092b3b2b23bfd7a67&Expires=1599835116&Signature=OLyyCY14s%2F0HxKOOd108mldINyE%3D
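The Expires parameter in the signed URL is a Unix timestamp (seconds since 1970-01-01 UTC), so the expiry time of a link can be read back from the URL itself. The following is a minimal sketch. The function name url_expiry is hypothetical, and the conversion to a readable date assumes GNU date, so that line is allowed to fail on other systems.

```shell
# url_expiry URL
# Prints the value of the Expires query parameter of a signed URL.
# Hypothetical helper for illustration; it only parses the string.
url_expiry() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]Expires=\([0-9][0-9]*\).*/\1/p'
}

signed_url='https://fish-bucket.a3s.fi/zebrafish.tgz?AWSAccessKeyId=78e6021a086d52f092b3b2b23bfd7a67&Expires=1599835116&Signature=OLyyCY14s%2F0HxKOOd108mldINyE%3D'
expires=$(url_expiry "$signed_url")
echo "Expires (Unix time): $expires"

# Convert the timestamp to a readable UTC date (GNU date syntax, an assumption).
date -u -d "@$expires" 2>/dev/null || echo "date conversion is not supported by this date command"
```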
Setting up an object lifecycle
In order to delete or expire objects automatically, a lifecycle policy can be set up for an Allas bucket. Objects in the bucket are handled according to the lifecycle policy if they match its conditions. Matching conditions can be defined with a prefix and/or tag(s) of the object. A lifecycle policy is especially well suited for cases where data needs to be removed as a "maintenance" measure after certain intervals.
Warning
Before setting up the lifecycle policy, please check with your department or team that it correctly represents the retention policy for the data in the project (legal or regulatory constraints).
The following lifecycle policy has two rules. Let's name the file mypolicy.xml.
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>1-days-expiration</ID>
    <Status>Enabled</Status>
    <Expiration>
      <Days>1</Days>
    </Expiration>
    <Filter>
      <Tag>
        <Key>days</Key>
        <Value>1</Value>
      </Tag>
    </Filter>
  </Rule>
  <Rule>
    <ID>30-days-expiration</ID>
    <Status>Enabled</Status>
    <Expiration>
      <Days>30</Days>
    </Expiration>
    <Filter>
      <Tag>
        <Key>days</Key>
        <Value>30</Value>
      </Tag>
    </Filter>
  </Rule>
</LifecycleConfiguration>
Alternatively, one can set the policies using a prefix, which can be thought of as an equivalent to a folder. Both methods can also be combined using the <And> tag.
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>Daily</ID>
    <Status>Enabled</Status>
    <Prefix>daily/</Prefix>
    <Expiration>
      <Days>30</Days>
    </Expiration>
  </Rule>
  <Rule>
    <ID>Weekly</ID>
    <Status>Enabled</Status>
    <Prefix>weekly/</Prefix>
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
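A prefix-based policy file like the one above can also be generated from a script, which helps when the prefix or the retention time changes between projects. The following is a minimal sketch that writes a policy with a single rule. The function name write_prefix_policy and the format of the rule ID are hypothetical choices made for this example, and the script overwrites any existing mypolicy.xml in the current directory.

```shell
# write_prefix_policy FILE PREFIX DAYS
# Writes a one-rule lifecycle policy to FILE, using the same layout as
# the prefix-based example. Hypothetical helper for illustration.
write_prefix_policy() {
    cat > "$1" <<EOF
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>expire-after-$3-days</ID>
    <Status>Enabled</Status>
    <Prefix>$2</Prefix>
    <Expiration>
      <Days>$3</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
EOF
}

# Objects under daily/ expire 30 days after creation.
# Note: this overwrites mypolicy.xml in the current directory.
write_prefix_policy mypolicy.xml daily/ 30
cat mypolicy.xml
```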
To apply this lifecycle policy to our bucket, we use the setlifecycle sub-command:
s3cmd setlifecycle mypolicy.xml s3://MY_BUCKET
We can verify the current policy with the getlifecycle sub-command:
s3cmd getlifecycle s3://MY_BUCKET
The bucket (or an object) can be reviewed with the info sub-command:
s3cmd info s3://MY_BUCKET
s3://MY_BUCKET/ (bucket):
Location: cpouta-production
Payer: BucketOwner
Expiration Rule: objects with key prefix 'weekly/' will expire in '365' day(s) after creation
Policy: none
CORS: none
ACL: project_xxxxxxx: FULL_CONTROL
In order to put your object(s) under the lifecycle policy, you may utilize tags and/or prefixes.
- Tagging is done by adding a header with the format x-amz-tagging:KEY=VALUE.
- Prefix can be considered as a "folder".
Let's see the following cases:
# Should be removed in 24 hours per rule ID: 1-days-expiration
s3cmd --add-header=x-amz-tagging:days=1 put MY_FILE_01.tar.gz s3://MY_BUCKET/
s3cmd --add-header=x-amz-tagging:days=1 put MY_FILE_02.tar.gz s3://MY_BUCKET/gone-in-one-day/
# Should be removed in 30 days per rule ID: 30-days-expiration
s3cmd --add-header=x-amz-tagging:days=30 put MY_FILE_03.tar.gz s3://MY_BUCKET/
# Should be removed in 30 days per rule ID: Daily
s3cmd put MY_FILE_04.tar.gz s3://MY_BUCKET/daily/
# Should be removed in 365 days per rule ID: Weekly
s3cmd put MY_FILE_05.tar.gz s3://MY_BUCKET/weekly/
Other references to setting up a lifecycle:
- RedHat developer guide for Ceph storage.
- Creating an intelligent object storage system with Ceph’s Object Lifecycle Management
- Multiple lifecycles - s3cmd
- Surprise entry for the above found at cloud.blog.csc.fi
Limit bucket access to specific IP addresses
You can limit access to a bucket to specific IP addresses by defining a policy.
Warning
Remember not to block your own access to the bucket. If you do, you can no longer access the bucket or fix the policy.
In the following IP policy example, we allow access to the bucket POLICY-EXAMPLE-BUCKET only from the IP subnet 86.50.164.0/24. Let's name the policy file myippolicy.json.
{
  "Version": "2012-10-17",
  "Id": "S3PolicyExample",
  "Statement": [
    {
      "Sid": "IPAllow",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::POLICY-EXAMPLE-BUCKET",
        "arn:aws:s3:::POLICY-EXAMPLE-BUCKET/*"
      ],
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": "86.50.164.0/24"
        }
      }
    }
  ]
}
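Because the policy above denies every address outside the listed subnet, it may be worth checking that the address you connect from falls inside the subnet before the policy is applied. The following is a minimal sketch for IPv4 addresses only. The function names ip_to_int and ip_in_cidr are hypothetical, the address 86.50.164.17 is a made-up example, and finding out your own public address is not covered here.

```shell
# ip_to_int ADDRESS
# Prints a dotted IPv4 address as a single integer.
# Hypothetical helper; octets must not have leading zeros.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    # Word splitting on "." turns the address into four positional parameters.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# ip_in_cidr ADDRESS NETWORK/PREFIXLEN
# Succeeds only when ADDRESS belongs to the given IPv4 subnet.
ip_in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    # Build a mask with the top "bits" bits set, limited to 32 bits.
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# 86.50.164.17 is a made-up client address used only for this example.
if ip_in_cidr 86.50.164.17 86.50.164.0/24; then
    echo "address is inside the allowed subnet"
else
    echo "address would be blocked by the policy"
fi
```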
To apply this IP policy to our bucket, we use the setpolicy sub-command:
s3cmd setpolicy myippolicy.json s3://POLICY-EXAMPLE-BUCKET
The current policy can be viewed with the info sub-command.
We can delete the current policy with the delpolicy sub-command:
s3cmd delpolicy s3://POLICY-EXAMPLE-BUCKET
s3://POLICY-EXAMPLE-BUCKET/: Policy deleted