# Using Allas with Python over the S3 protocol
You can use the AWS SDK for Python (`boto3`) to access Allas over the S3 protocol. `boto3` is a Python library developed for working with Amazon S3 storage and other AWS services.
## General data analysis workflow
- Upload the input data to Allas using `boto3` or another client.
- Download the data from Allas to a local device (e.g. a personal workstation or a CSC supercomputer) using `boto3`.
- Analyze your local copy of the data.
- Write the analysis results to your local storage.
- Upload the results to Allas using `boto3`.
Some Python libraries support direct reading and writing over S3, such as AWS SDK for pandas and GDAL-based libraries (for working with geospatial data).
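For illustration, a minimal sketch of reading a file directly from Allas with the AWS SDK for pandas (the `awswrangler` package); the bucket and object names here are made up for the example:

```python
import awswrangler as wr

# Point the library at the Allas S3 endpoint instead of AWS
wr.config.s3_endpoint_url = "https://a3s.fi"

# Read a CSV object from a bucket straight into a pandas DataFrame
# (bucket and object names are illustrative)
df = wr.s3.read_csv("s3://examplebucket/data.csv")
```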
Remember to avoid handling the same objects with both S3 and SWIFT, as they work differently with large objects.
## Installation

### Installation on a personal workstation

`boto3` is available for Python 3.8 and higher. It can be installed on a personal device using `pip` or `conda`.
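For example:

```bash
# Install boto3 with pip
pip install boto3

# ...or with conda (the conda-forge channel is one option that provides boto3)
conda install -c conda-forge boto3
```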
### Installation on a CSC supercomputer

The pre-existing `geoconda` and `biopythontools` modules already have `boto3` installed. If you wish to use the library in another Python environment, you can use `pip` to add it on top of an existing module.
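A sketch of this, assuming a module named `python-data` is available (substitute whichever Python module your environment is based on):

```bash
# Load an existing Python module (the module name here is an assumption)
module load python-data

# Install boto3 on top of it, into your user directory
pip install --user boto3
```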
## Configuring S3

To use Allas with S3 and `boto3`, the following configuration needs to be set up:

- Credentials: access key and secret key
- S3 endpoint
- S3 region
### Configurations for accessing a single CSC project

The easiest way to set up the S3 configuration for `boto3` is by configuring an S3 connection on a CSC supercomputer. This saves the credentials and the S3 region in the `credentials` file, and the S3 endpoint in the `config` file, in the default `~/.aws/` folder.
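On a CSC supercomputer, this can be done for example with the `allas-conf` tool discussed in the next section:

```bash
# Load the allas module and generate S3 credentials under ~/.aws/
module load allas
allas-conf --mode s3cmd
```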
If you wish to access Allas from a personal laptop or some other server, copy the `~/.aws` folder to your home directory on that computer: `C:\Users\username\.aws` on Windows or `~/.aws/` on Mac and Linux. Use any file transfer tool, for example `scp`:

```bash
# Copy the aws configuration files to your home directory
scp -r <username>@<hostname>.csc.fi:~/.aws $HOME
```
### Credentials for accessing multiple CSC projects

Using `allas-conf --mode s3cmd` is straightforward, but it overwrites the existing credentials file when run, making it somewhat tedious to work with multiple projects. Therefore, it is recommended to use the Cloud storage configuration app in the Puhti or Mahti web interface to configure S3 connections, since these configurations are stored under individual S3 profiles.
1. Use Cloud storage configuration to configure S3 connections, or remotes, for the projects whose Allas storage you wish to access. The configurations are stored in `~/.config/rclone/rclone.conf` on the supercomputer whose web interface you used for generating them.
2. The S3 configuration entries for the access key ID and secret access key need to be prefixed with `aws_` for `boto3` to recognize them as S3 credentials, but we do not want to make changes directly to `~/.config/rclone/rclone.conf`, as it is used by other programs. Instead, use the `sed` utility to read the contents of the configuration file, make the necessary changes, and write the modified contents to a new file, for example `~/.boto3_credentials`. This can be done with a command along the lines shown below.
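A sketch of such a command, assuming the rclone S3 entries use the key names `access_key_id` and `secret_access_key`:

```bash
# Prefix the rclone key names with aws_ and write the result to a new file;
# the key names are assumptions based on rclone's S3 remote format
sed 's/^access_key_id/aws_access_key_id/;s/^secret_access_key/aws_secret_access_key/' \
    ~/.config/rclone/rclone.conf > ~/.boto3_credentials
```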
After completing these steps, your S3 credentials for using `boto3` are stored under project-specific S3 profiles in the file you created in step 2. The profile names have the format `s3allas-<project>`, e.g. `s3allas-project_2001234`. You can now use these credentials to create a `boto3` resource.
## `boto3` usage

### Create a `boto3` resource
S3 credentials configured for a single project only:

```python
# Create a resource using credentials from the default location (~/.aws/).
# With newer versions of the AWS libraries:
# - defining the endpoint here is no longer mandatory if it is given in the config file
# - two checksum settings must be set so that moving objects to/from Allas works
import os

import boto3

os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"
os.environ["AWS_RESPONSE_CHECKSUM_VALIDATION"] = "when_required"

s3_resource = boto3.resource('s3', endpoint_url='https://a3s.fi')
```
S3 credentials configured for multiple projects under individual profiles:

```python
# Create a resource using credentials from a named profile
import os

import boto3

s3_credentials = '<credentials-file>'  # e.g. '~/.boto3_credentials'
s3_profile = 's3allas-<project>'       # e.g. 's3allas-project_2001234'

os.environ['AWS_SHARED_CREDENTIALS_FILE'] = s3_credentials
os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"
os.environ["AWS_RESPONSE_CHECKSUM_VALIDATION"] = "when_required"

s3_session = boto3.Session(profile_name=s3_profile)
s3_resource = s3_session.resource('s3', endpoint_url='https://a3s.fi')
```
Each of the following examples assumes that a `boto3` resource has been created.
### Create a bucket

Create a new bucket using the following script:
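```python
# A minimal sketch: create a new bucket with the resource API
# (the bucket name is an example)
s3_resource.create_bucket(Bucket='examplebucket')
```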
### List buckets and objects
List all buckets belonging to a project:
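```python
# Print the name of every bucket in the project
for bucket in s3_resource.buckets.all():
    print(bucket.name)
```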
List all objects belonging to a bucket:

```python
my_bucket = s3_resource.Bucket('examplebucket')
for my_bucket_object in my_bucket.objects.all():
    print(my_bucket_object.key)
```
### Download an object
Download an object:
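```python
# A minimal sketch: download the object my_snake.txt from the bucket
# snakebucket to a local file of the same name (names are illustrative)
s3_resource.Object('snakebucket', 'my_snake.txt').download_file('my_snake.txt')
```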
### Upload an object

Upload a small file called `my_snake.txt` to the bucket `snakebucket`:
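```python
# Upload the local file my_snake.txt to the bucket snakebucket
s3_resource.Object('snakebucket', 'my_snake.txt').upload_file('my_snake.txt')
```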
### Remove buckets and objects
Delete all objects from a bucket:
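```python
# Delete every object in the bucket (the bucket name is an example)
bucket = s3_resource.Bucket('examplebucket')
bucket.objects.all().delete()
```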
Delete a bucket (the bucket must be empty):
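```python
# Delete the (empty) bucket itself
s3_resource.Bucket('examplebucket').delete()
```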