Getting data from AWS S3 via Python scripts
Pumping Station
Data produced on EC2 instances or AWS Lambda servers often end up in Amazon S3 storage. If the data is in many small files, of which the customer only needs a selection, downloading from the browser can bring on finicky behavior. Luckily the Amazon toolshed offers Python libraries as pipes for programmatic data draining in the form of awscli and boto3.
At the command line, the Python tool aws copies S3 files from the cloud onto the local computer. Install it using
pip3 install --user awscli
and then run aws configure, answering the prompts for the access key ID, the secret access key, and the applicable AWS region as you go. The aws tool stores the keys in ~/.aws/credentials and, from then on, no longer prompts you for them [1].
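A script that should fail early when no credentials are configured can inspect this file directly. The following is a minimal sketch, assuming the usual INI layout of ~/.aws/credentials with one section per profile and an aws_access_key_id entry in each; the function name configured_profiles is made up for illustration.

```python
import configparser
import os


def configured_profiles(path="~/.aws/credentials"):
    """Return the names of profiles in the credentials file that carry an access key ID."""
    parser = configparser.ConfigParser()
    # read() silently skips a missing file, so the result is then an empty list
    parser.read(os.path.expanduser(path))
    return [name for name in parser.sections()
            if parser.has_option(name, "aws_access_key_id")]
```

On a freshly configured machine, configured_profiles() would typically return a list containing the default profile.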
Data exists in S3 as objects indexed by string keys. If the prosnapshot bucket contains a video file under the hello.mp4 key, you can use the
aws s3 cp s3://prosnapshot/hello.mp4 video.mp4
command to retrieve it from the cloud and store it as video.mp4 on the local hard disk, just as in the browser (Figure 1).
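The s3:// address in the command simply combines the bucket name and the key. As a small sketch of that mapping, the following helper takes such an address apart; split_s3_uri is a hypothetical name and not part of awscli or boto3.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 address: {uri!r}")
    # Everything up to the first slash is the bucket, the rest is the key
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, key


print(split_s3_uri("s3://prosnapshot/hello.mp4"))
# prints ('prosnapshot', 'hello.mp4')
```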
The aws tool relies on the botocore Python library, on which another SDK, boto3, is based; boto3 lets you write scripts that automate the file retrieval process [2]. The command
pip3 install --user boto3
installs the SDK on your system.
Order
Listing 1 uses boto3 to download a single S3 file from the cloud. In its raw form, S3 doesn't support folder structures but stores data under user-defined keys. However, the browser interface provides the option to create a new folder with subfolders to any depth in a bucket and fill the structure with files (Figure 2).
Listing 1
hello-read.py
#!/usr/bin/python3
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('prosnapshot')
bucket.download_file('hello.txt', 'hello-down.txt')
Under the hood, S3 represents these folders as part of the key, which then looks like a Unix-style file path.
For example, the hello.txt file in myfolder can be downloaded from S3 using the following command:
$ aws s3 cp s3://prosnapshot/myfolder/hello.txt hello.txt
download: s3://prosnapshot/myfolder/hello.txt to ./hello.txt
This grabs the file from myfolder in the prosnapshot bucket.
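Because folders are only a naming convention, a script can reconstruct the top-level folder view from a flat list of keys. The sketch below assumes that keys use / as the separator, as the browser interface does; the key list and the name top_level_entries are invented for illustration.

```python
def top_level_entries(keys, delimiter="/"):
    """Return (folders, files) as they would appear at the top level of a bucket."""
    folders, files = set(), set()
    for key in keys:
        head, sep, _rest = key.partition(delimiter)
        if sep:
            # Anything before the first delimiter acts as a folder name
            folders.add(head + delimiter)
        else:
            files.add(key)
    return sorted(folders), sorted(files)


# Hypothetical bucket content
keys = ["hello.txt", "myfolder/hello.txt", "myfolder/sub/data.csv"]
print(top_level_entries(keys))
# prints (['myfolder/'], ['hello.txt'])
```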
Full Speed Ahead
However, if you want to grab all the files in an S3 bucket in one go (Figure 3), the obvious idea is to list and process the files with objects.all(), as shown in Listing 2. The underlying REST interface only provides a maximum of 1,000 results per request. The objects.all() collection hides this limit by requesting further pages as the loop advances, but a single list_objects call on the low-level client stops after the 1,000th object.
Listing 2
s3-all.py
#!/usr/bin/python3
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('prosnapshot')

for obj in bucket.objects.all():
    print(obj)
The boto3 library provides paginators as a solution to the dilemma; they fetch a maximum of 1,000 objects per request, remember the position, and keep retrieving data until the bucket is processed. Listing 3 fetches all the files in a bucket from the cloud. It interprets the keys as Unix paths and stores the contents of the returned objects in local files of the same name.
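What a paginator does behind the scenes can be sketched by hand. The following loop is an illustration, not the boto3 implementation: it assumes a client object whose list_objects_v2 method returns a dictionary with Contents, IsTruncated, and NextContinuationToken entries, as the boto3 S3 client does, and the function name iter_keys is hypothetical.

```python
def iter_keys(client, bucket_name):
    """Yield every key in a bucket, following continuation tokens page by page."""
    kwargs = {"Bucket": bucket_name}
    while True:
        page = client.list_objects_v2(**kwargs)
        # An empty bucket returns a page without a 'Contents' entry
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            break
        # Hand the token back so the next request resumes where this one ended
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

With boto3 installed and credentials configured, the generator would be called as iter_keys(boto3.client('s3'), 'prosnapshot').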
Listing 3
s3-export.py
#!/usr/bin/python3
import boto3
import os

def s3pump(path, bucket):
    # Create the local directory that mirrors the folder part of the key
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # Keys ending in "/" are folder placeholders without file content
    if os.path.basename(path):
        bucket.download_file(path, path)

bname = 'prosnapshot'
client = boto3.client('s3')
bucket = boto3.resource('s3').Bucket(bname)

# The paginator requests page after page until the listing is complete
pgnr = client.get_paginator('list_objects')
page_it = pgnr.paginate(Bucket=bname)

for page in page_it:
    # An empty bucket yields a page without a 'Contents' entry
    for file in page.get('Contents', []):
        s3pump(file['Key'], bucket)
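Because Listing 3 writes each object to a local path taken directly from its key, a key such as ../../etc/passwd or one starting with / would land outside the working directory. One possible safeguard, shown here as a sketch with a hypothetical local_target helper and a hypothetical export directory, resolves the path and rejects anything that escapes the target directory.

```python
import os


def local_target(key, root="export"):
    """Map an S3 key to a path below root, or return None if the key should be skipped."""
    if not key or key.endswith("/"):
        # Folder placeholder keys carry no file content
        return None
    root_abs = os.path.abspath(root)
    # Stripping leading slashes keeps absolute-looking keys inside root
    target = os.path.abspath(os.path.join(root_abs, key.lstrip("/")))
    inside = os.path.commonpath([root_abs, target]) == root_abs
    if not inside or target == root_abs:
        raise ValueError(f"key escapes target directory: {key!r}")
    return target
```

A download loop could then create the parent directory of the returned path and pass that path as the second argument to bucket.download_file().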
Data Highway?
For large S3 buckets with data in the multiterabyte range, retrieving the data can take a while, depending on your Internet connection, or the time overhead can be completely prohibitive. In these cases, Amazon offers a sneakernet service to export your data: Customers send their hard disk or storage appliance to Amazon, which fills it up and sends it back [3].
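Whether a download is still practical comes down to simple arithmetic. The following back-of-the-envelope sketch assumes a hypothetical 10TB bucket and a sustained 100Mbps link and ignores protocol overhead and request latency; both figures are illustrative and do not come from Amazon.

```python
def transfer_days(size_bytes, link_bits_per_second):
    """Estimate the transfer time in days for a given volume and sustained line rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86400  # seconds per day


# Hypothetical numbers: 10 TB (decimal) over a sustained 100 Mbit/s link
print(f"{transfer_days(10e12, 100e6):.1f} days")
# prints 9.3 days
```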
Infos
[1] "Set Up Amazon Web Services" by Mike Schilli, Linux Magazine, issue 196, March 2017: http://www.linux-magazine.com/Issues/2017/196/Programming-Snapshot-Amazon-Web-Services
[2] AWS SDK Python (boto3) documentation: http://boto3.readthedocs.io/en/latest/
[3] Create Your First Amazon S3 Export Job: http://docs.aws.amazon.com/AWSImportExport/latest/DG/GSCreateSampleS3ExportRequest.html