Which directory should I use to analyse a large number of small files?
An interactive batch job on Puhti gives you a session that can have up to:
- 4 cores
- 64 GB of memory
- 7 days of run time
- 640 GB of fast local scratch disk
To launch an interactive session on Puhti, execute the command:
```bash
sinteractive -i
```
Generally you should avoid workflows that require creating thousands of small files. If you nevertheless need to work with a huge number of files, consider using the NVMe-based local temporary scratch directories, either through normal or interactive batch jobs. The local scratch area is visible only to the specific batch job and is erased when the job ends. Because of that, you always need to import your data set to the local scratch first, and when you finish, copy the data you want to preserve back to a more permanent storage place such as scratch or Allas; a minimal batch script following this pattern is sketched below.
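For reference, here is a minimal sketch of the same stage-in/analyse/stage-out pattern as a normal (non-interactive) batch script. The job name, partition and time limit are illustrative placeholders; the project number and file names follow the example below. On Puhti the local NVMe disk is requested with the --gres=nvme:<GB> option, and Slurm then sets the $LOCAL_SCRATCH environment variable for the job.

```bash
#!/bin/bash
#SBATCH --job-name=local_scratch_demo   # placeholder name
#SBATCH --account=project_2001234       # example project used on this page
#SBATCH --partition=small               # placeholder partition
#SBATCH --time=02:00:00                 # placeholder time limit
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --gres=nvme:250                 # request 250 GB of local NVMe scratch

# Stage in: extract the input data to the fast local disk
tar xf /scratch/project_2001234/big_data.tar -C "$LOCAL_SCRATCH"
cd "$LOCAL_SCRATCH"

# ... run the analysis on the local copies of the files ...

# Stage out: pack the results and copy them back to permanent storage,
# since the local scratch is erased when the job ends
tar cf /scratch/project_2001234/big_data.pep.tar .
```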
To demonstrate the effectiveness of the local scratch area, let's study a sample directory called big_data. The directory contains about 100 GiB of data in 120 000 files. Initially the data is packed in a single tar archive in the scratch directory of project 2001234 (/scratch/project_2001234/big_data.tar).
First we launch an interactive batch job with 2 cores (-c), 4 GB of memory (-m) and 250 GB of fast temporary scratch disk (-d):

```bash
sinteractive -c 2 -m 4G -d 250
```
Step 1. Move to the local scratch area using the environment variable $LOCAL_SCRATCH and extract the tar archive to the fast local disk.
```bash
cd $LOCAL_SCRATCH
tar xvf /scratch/project_2001234/big_data.tar ./
```
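If you want to verify that the archive opened completely, a quick file count is enough; based on the figures above it should report about 120 000 files:

```bash
# Count the files extracted to the local scratch area
find . -type f | wc -l
```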
Step 2. Run the analysis. This time we run a for loop that uses the EMBOSS command transeq to translate all the FASTA files found in the big_data directory into new protein sequence files:
```bash
for ffile in $(find . -name '*.fasta')
do
    transeq "$ffile" "${ffile}.pep"
done
```
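The interactive session reserved two cores, but the serial loop above uses only one of them. As a possible variant (not part of the timed comparison below), the same translations can be run in parallel with xargs:

```bash
# Run transeq on both reserved cores; -print0/-0 keeps odd file names intact
find . -name '*.fasta' -print0 |
    xargs -0 -P 2 -I{} transeq {} {}.pep
```

Here -P 2 matches the number of cores requested with sinteractive.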
The actual translation is a light task, so a relatively large share of the time is spent just opening and closing files.
Step 3. When the processing is finished, we store the results back to the scratch directory as a new tar file.
```bash
tar cvf /scratch/project_2001234/big_data.pep.tar ./
```
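As an optional sanity check (not part of the original procedure), the new archive can be inspected without extracting it, for example by counting the protein sequence files it contains:

```bash
# List the archive contents and count the .pep result files
tar tf /scratch/project_2001234/big_data.pep.tar | grep -c '\.pep$'
```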
We could run the same analysis procedure in the scratch directory too. Below is an execution time comparison for running the three steps above in LOCAL_SCRATCH and in normal scratch. The response times of LOCAL_SCRATCH are fairly stable, whereas in the scratch directory the execution times vary considerably depending on the total load of the Lustre file system.
| | LOCAL_SCRATCH | scratch |
|---|---|---|
| Step 1. Opening tar file | 2m 8s | 4m 12s |
| Step 2. Analysis | 9m 42s | 21m 58s |
| Step 3. Creating new tar file | 2m 25s | 42m 21s |
| Total | 14m 15s | 1h 8m 31s |
More detailed information about batch-job-specific local storage can be found here.