Lead Image © erikdegraaf, 123RF.com

Getting data from AWS S3 via Python scripts

Pumping Station

Article from ADMIN 41/2017
By
Data on AWS S3 is not necessarily stuck there. If you want your data back, you can siphon it out all at once with a little Python pump.

Data produced on EC2 instances or by AWS Lambda functions often ends up in Amazon S3 storage. If the data is spread across many small files, of which the customer only needs a selection, downloading them in the browser quickly becomes fiddly. Luckily, the Amazon toolshed offers two Python tools, awscli and boto3, as pipes for draining the data programmatically.

At the command line, the Python tool aws copies S3 files from the cloud onto the local computer. Install it using

pip3 install --user awscli

and then run aws configure, answering the prompts for the access key ID, the secret access key, and the applicable AWS region as you go. The aws tool stores the keys in ~/.aws/credentials and, from then on, no longer prompts you for them [1].
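To see which profiles are stored in the credentials file, you can read it with Python's standard configparser module, because the file uses INI syntax. The following is a minimal sketch under that assumption; the helper name list_profiles is hypothetical, and the default path is simply the location mentioned above.

```python
import configparser
import os


def list_profiles(path=os.path.expanduser("~/.aws/credentials")):
    """Return the profile names found in an AWS credentials file.

    list_profiles is a hypothetical helper for illustration; it reads
    only the section names and never prints any secret keys.
    """
    parser = configparser.ConfigParser()
    # read() silently skips a missing file, so an empty list means
    # that no credentials are stored at this path.
    parser.read(path)
    return parser.sections()


if __name__ == "__main__":
    print(list_profiles())
```

If the list is empty, the aws tool has not stored any credentials at that path yet.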

Data exists in S3 as objects indexed by string keys. If the prosnapshot bucket contains a video file under the hello.mp4 key, you can use the

aws s3 cp s3://prosnapshot/hello.mp4 video.mp4

command to retrieve it from the cloud and store it as video.mp4 on the local hard disk, just as in the browser (Figure 1).

Figure 1: The browser moves selected S3 files from the cloud to the hard disk.
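The s3:// address passed to aws s3 cp combines the bucket name and the object key. As a rough sketch of how the two parts separate, the hypothetical helper split_s3_url below takes such an address apart with plain string operations; it only illustrates the address layout and is not how the aws tool itself parses its arguments.

```python
def split_s3_url(url):
    """Split an s3://bucket/key address into (bucket, key).

    Hypothetical helper for illustration only.
    """
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError("not an s3:// address: %r" % url)
    # Everything up to the first slash is the bucket, the rest is the key.
    bucket, _, key = url[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name: %r" % url)
    return bucket, key


if __name__ == "__main__":
    print(split_s3_url("s3://prosnapshot/hello.mp4"))
```

The key may itself contain further slashes, which are left untouched.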

The aws tool relies on the botocore Python library, which also forms the basis of boto3, the AWS SDK for Python; with boto3, you can write scripts that automate the file retrieval process [2]. The command:

pip3 install --user boto3

installs the SDK on your system.

Order

Listing 1 uses boto3 to download a single S3 file from the cloud. In its raw form, S3 doesn't support folder structures but stores data under user-defined keys. However, the browser interface provides the option to create a new folder with subfolders to any depth in a bucket and fill the structure with files (Figure 2).

Listing 1

hello-read.py

#!/usr/bin/python3
import boto3

# Credentials come from ~/.aws/credentials
s3 = boto3.resource('s3')
bucket = s3.Bucket('prosnapshot')
# Arguments: key in the bucket, name of the local file
bucket.download_file('hello.txt', 'hello-down.txt')

Figure 2: Subdirectories appear in the browser, which S3 later simply integrates into the key for a storage object.

Under the hood, S3 does not store these folders as such; it simply makes the folder path, in typical Unix style, part of each object's key.

For example, the hello.txt file can be downloaded from S3 using the following command:

$ aws s3 cp s3://prosnapshot/myfolder/hello.txt hello.txt
download: s3://prosnapshot/myfolder/hello.txt to ./hello.txt

This grabs the file from myfolder in the prosnapshot bucket.
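Because folders are only a convention within the key names, a folder view can be derived from a flat list of keys by splitting at the slash. The sketch below illustrates the idea; the function name top_level_entries and the sample keys are made up for the example and do not reflect how the S3 browser interface is implemented.

```python
def top_level_entries(keys, prefix=""):
    """List the pseudo-folders and files directly below prefix.

    keys is a flat list of S3-style keys; prefix is either empty or
    ends with a slash. Hypothetical helper for illustration.
    """
    folders, files = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, slash, _ = rest.partition("/")
        if slash:
            folders.add(head + "/")  # the key continues further down
        elif head:
            files.add(head)          # a plain object at this level
    return sorted(folders), sorted(files)


if __name__ == "__main__":
    sample = ["hello.txt", "myfolder/hello.txt", "myfolder/sub/a.txt"]
    print(top_level_entries(sample))
    print(top_level_entries(sample, "myfolder/"))
```

With the sample keys, the top level shows one folder and one file, and the view below myfolder/ shows the sub/ folder next to hello.txt.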

Full Speed Ahead

However, if you want to grab all the files in an S3 bucket in one go (Figure 3), the obvious idea is to list and process them with objects.all(), as shown in Listing 2. The underlying REST interface returns a maximum of 1,000 results per request, so a single listing request never sees the 1,001st object. Current boto3 releases hide this limit for resource collections such as objects.all() by fetching further pages behind the scenes, but code that calls the low-level client directly has to deal with it explicitly.

Listing 2

s3-all.py

#!/usr/bin/python3
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('prosnapshot')

# Each obj is an ObjectSummary; obj.key holds the key
for obj in bucket.objects.all():
    print(obj)

Figure 3: All files are in an S3 bucket.

The boto3 library provides paginators as a solution to the dilemma; they fetch a maximum of 1,000 objects per request, remember where the previous request stopped, and keep retrieving data until the whole bucket has been processed. Listing 3 fetches all the files in a bucket from the cloud. It interprets the keys as Unix paths and stores the contents of the returned objects in local files of the same name.

Listing 3

s3-export.py

#!/usr/bin/python3
import boto3
import os

def s3pump(path, bucket):
    # Create the local directory for keys that contain slashes
    dirname = os.path.dirname(path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    # Keys ending in "/" are folder placeholders, not files
    if os.path.basename(path):
        bucket.download_file(path, path)

bname = 'prosnapshot'
client = boto3.client('s3')
bucket = boto3.resource('s3').Bucket(bname)

# The paginator requests page after page of up to 1,000 keys
pgnr = client.get_paginator('list_objects')
page_it = pgnr.paginate(Bucket=bname)

for page in page_it:
    # An empty bucket yields a page without 'Contents'
    for obj in page.get('Contents', []):
        s3pump(obj['Key'], bucket)
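
What a paginator does behind the scenes can be imitated without AWS access. In the sketch below, fake_list_objects is an invented stand-in for a listing request that returns at most 1,000 keys together with a truncation flag; the function names and the dictionary layout are assumptions for illustration, not the boto3 API.

```python
PAGE_SIZE = 1000  # upper limit of results per request, as in the text


def fake_list_objects(all_keys, marker=None):
    """Stand-in for one listing request (not a real boto3 call).

    Returns at most PAGE_SIZE keys that sort after marker.
    """
    keys = sorted(all_keys)
    if marker is not None:
        keys = [k for k in keys if k > marker]
    page = keys[:PAGE_SIZE]
    return {"Keys": page, "IsTruncated": len(keys) > PAGE_SIZE}


def iterate_keys(all_keys):
    """Yield every key by requesting page after page."""
    marker = None
    while True:
        page = fake_list_objects(all_keys, marker)
        yield from page["Keys"]
        if not page["IsTruncated"]:
            break
        marker = page["Keys"][-1]  # resume after the last key seen


if __name__ == "__main__":
    bucket = ["file%05d.txt" % i for i in range(2500)]
    print(len(fake_list_objects(bucket)["Keys"]))  # a single request
    print(len(list(iterate_keys(bucket))))         # all pages combined
```

A single request covers only the first 1,000 of the 2,500 simulated keys, whereas the loop keeps asking until the truncation flag is no longer set.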

Data Highway?

For large S3 buckets with data in the multiterabyte range, retrieving the data can take a long time, depending on your Internet connection, or the time overhead can be completely prohibitive. In these cases, Amazon offers a sneakernet service to export your data: Customers send their hard disk or storage appliance to Amazon, which fills it up and sends it back [3].
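
A back-of-the-envelope calculation shows how quickly download times grow. The sketch below estimates the transfer time from a data volume and a line speed; the values of 10TB and 100Mbps are arbitrary examples rather than figures from AWS, and the helper name transfer_days is hypothetical.

```python
def transfer_days(terabytes, megabits_per_second):
    """Estimate the transfer time in days, ignoring protocol overhead.

    Uses decimal units: 1TB = 10**12 bytes, 1Mbps = 10**6 bits/s.
    """
    bits = terabytes * 10**12 * 8
    seconds = bits / (megabits_per_second * 10**6)
    return seconds / 86400  # seconds per day


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print("%.1f days" % transfer_days(10, 100))
```

With these example values, the download would keep the line saturated for a little more than nine days.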

Infos

  1. "Set Up Amazon Web Services" by Mike Schilli, Linux Magazine, issue 196, March 2017, http://www.linux-magazine.com/Issues/2017/196/Programming-Snapshot-Amazon-Web-Services
  2. AWS SDK Python (boto3) documentation: http://boto3.readthedocs.io/en/latest/
  3. Create Your First Amazon S3 Export Job: http://docs.aws.amazon.com/AWSImportExport/latest/DG/GSCreateSampleS3ExportRequest.html
