# Using Allas with Python over the S3 protocol
You can use the AWS SDK for Python (`boto3`) to access Allas over the S3 protocol. `boto3` is a Python library developed for working with Amazon S3 storage and other AWS services.
## General data analysis workflow
- Upload the input data to Allas using `boto3` or another client.
- Download the data from Allas to a local device (e.g. a personal workstation or a CSC supercomputer) using `boto3`.
- Analyze your local copy of the data.
- Write the analysis results to your local storage.
- Upload the results to Allas using `boto3`.
Some Python libraries support direct reading and writing over S3, such as AWS SDK for pandas and GDAL-based libraries (for working with geospatial data).
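For illustration, a minimal sketch of reading a file directly from Allas with the AWS SDK for pandas (the `awswrangler` package); the bucket and object names here are made up for the example:

```python
import awswrangler as wr

# Point the library at the Allas S3 endpoint instead of AWS
wr.config.s3_endpoint_url = "https://a3s.fi"

# Read a CSV object from a bucket straight into a pandas DataFrame
# (bucket and object names are illustrative)
df = wr.s3.read_csv("s3://examplebucket/data.csv")
```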
Remember to avoid handling the same objects with both S3 and SWIFT, as they work differently with large objects.
## Installation

### Installation on a personal workstation

`boto3` is available for Python 3.8 and higher. It can be installed on a personal device using `pip` or `conda`.
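For example:

```bash
# Install boto3 with pip
pip install boto3

# ...or with conda (the conda-forge channel is one option that provides boto3)
conda install -c conda-forge boto3
```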
### Installation on a CSC supercomputer

The pre-existing `geoconda` and `biopythontools` modules already have `boto3` installed. If you wish to use the library in another Python environment, you can use `pip` to add it on top of an existing module.
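A sketch of this, assuming a module named `python-data` is available (substitute whichever Python module your environment is based on):

```bash
# Load an existing Python module (the module name here is an assumption)
module load python-data

# Install boto3 on top of it, into your user directory
pip install --user boto3
```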
## Configuring S3

To use Allas with S3 and `boto3`, the following configuration needs to be set up:

- Credentials: access key and secret key
- S3 endpoint
- S3 region
### Configurations for accessing a single CSC project

The easiest way to set up the S3 configuration for `boto3` is by configuring an S3 connection on a CSC supercomputer. This saves the credentials and the S3 region in the `credentials` file, and the S3 endpoint in the `config` file, in the default `~/.aws/` folder.
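On a CSC supercomputer, this can be done for example with the `allas-conf` tool discussed in the next section:

```bash
# Load the allas module and generate S3 credentials under ~/.aws/
module load allas
allas-conf --mode s3cmd
```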
If you wish to access Allas from a personal laptop or some other server, copy the `~/.aws` folder to your home directory on that computer: `C:\Users\username\.aws` on Windows or `~/.aws/` on Mac and Linux. Use any file transfer tool, for example `scp`:

```bash
# Copy the aws configuration files to your home directory
scp -r <username>@<hostname>.csc.fi:~/.aws $HOME
```
### Credentials for accessing multiple CSC projects

Using `allas-conf --mode s3cmd` is straightforward, but it overwrites the existing credentials file when run, making it somewhat tedious to work with multiple projects. Therefore, it is recommended to use the Cloud storage configuration app in the Puhti or Mahti web interface to configure S3 connections, since these configurations are stored under individual S3 profiles.
1. Use Cloud storage configuration to configure S3 connections, or remotes, for the projects whose Allas storage you wish to access. The configurations are stored in `~/.config/rclone/rclone.conf` on the supercomputer whose web interface you used for generating them.
2. The S3 configuration entries for the access key ID and secret access key need to be prefixed with `aws_` for `boto3` to recognize them as S3 credentials, but we do not want to make changes directly to `~/.config/rclone/rclone.conf`, as it is used by other programs. Instead, use the `sed` utility to read the contents of the configuration file, make the necessary changes, and write the modified contents to a new file, for example `~/.boto3_credentials`. This can be done with a command along the lines shown below.
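A sketch of such a command, assuming the rclone S3 entries use the key names `access_key_id` and `secret_access_key`:

```bash
# Prefix the rclone key names with aws_ and write the result to a new file;
# the key names are assumptions based on rclone's S3 remote format
sed 's/^access_key_id/aws_access_key_id/;s/^secret_access_key/aws_secret_access_key/' \
    ~/.config/rclone/rclone.conf > ~/.boto3_credentials
```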
After completing these steps, your S3 credentials for using `boto3` are stored under project-specific S3 profiles in the file you created in step 2. The profile names have the format `s3allas-<project>`, e.g. `s3allas-project_2001234`. You can now use these credentials to create a `boto3` resource.
## `boto3` usage

### Create a `boto3` resource
S3 credentials configured for a single project only:

```python
# Create a resource using credentials from the default location (~/.aws/).
# With newer versions of the AWS libraries:
# - defining the endpoint here is no longer mandatory if it is given in the config file
# - two checksum settings must be set so that moving objects to/from Allas works
import os

import boto3

os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"
os.environ["AWS_RESPONSE_CHECKSUM_VALIDATION"] = "when_required"

s3_resource = boto3.resource('s3', endpoint_url='https://a3s.fi')
```
S3 credentials configured for multiple projects under individual profiles:

```python
# Create a resource using credentials from a named profile
import os

import boto3

s3_credentials = '<credentials-file>'  # e.g. '~/.boto3_credentials'
s3_profile = 's3allas-<project>'       # e.g. 's3allas-project_2001234'

os.environ['AWS_SHARED_CREDENTIALS_FILE'] = s3_credentials
os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"
os.environ["AWS_RESPONSE_CHECKSUM_VALIDATION"] = "when_required"

s3_session = boto3.Session(profile_name=s3_profile)
s3_resource = s3_session.resource('s3', endpoint_url='https://a3s.fi')
```
Each of the following examples assumes that a `boto3` resource has been created.
### Create a bucket

Create a new bucket using the following script:
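```python
# A minimal sketch: create a new bucket with the resource API
# (the bucket name is an example)
s3_resource.create_bucket(Bucket='examplebucket')
```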
### List buckets and objects
List all buckets belonging to a project:
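```python
# Print the name of every bucket in the project
for bucket in s3_resource.buckets.all():
    print(bucket.name)
```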
List all objects belonging to a bucket:

```python
my_bucket = s3_resource.Bucket('examplebucket')
for my_bucket_object in my_bucket.objects.all():
    print(my_bucket_object.key)
```
### Download an object
Download an object:
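```python
# A minimal sketch: download the object my_snake.txt from the bucket
# snakebucket to a local file of the same name (names are illustrative)
s3_resource.Object('snakebucket', 'my_snake.txt').download_file('my_snake.txt')
```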
### Upload an object

Upload a small file called `my_snake.txt` to the bucket `snakebucket`:
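```python
# Upload the local file my_snake.txt to the bucket snakebucket
s3_resource.Object('snakebucket', 'my_snake.txt').upload_file('my_snake.txt')
```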
### Remove buckets and objects
Delete all objects from a bucket:
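```python
# Delete every object in the bucket (the bucket name is an example)
bucket = s3_resource.Bucket('examplebucket')
bucket.objects.all().delete()
```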
Delete a bucket (the bucket must be empty):
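```python
# Delete the (empty) bucket itself
s3_resource.Bucket('examplebucket').delete()
```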