How to untar a file from an S3 bucket using AWS Lambda (Python)
In this article, we'll discuss how to automatically untar a file into a target bucket when you upload a tar file to an S3 bucket. We'll use the tarfile Python package to extract the files, write them to the /tmp directory, and then copy all the files to the target bucket. I've written a similar article on how to unzip files here.
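The core of the approach is just a few lines of Python's standard tarfile module. Here's a minimal standalone sketch before we wire it up to Lambda (the archive name is just a placeholder):

import tarfile

# Open the archive and extract everything into a local directory
with tarfile.open('example.tar') as archive:
    archive.extractall('/tmp/extracted/')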
Infrastructure
We're going to use AWS CDK to create the necessary infrastructure. AWS CDK is an open-source software development framework that lets you define cloud infrastructure in code, and it supports many languages including TypeScript, Python, C#, Java, and others. You can learn more about AWS CDK from a beginner's guide here. We're going to use Python in this article.
At a high level, we just need three resources:
- Source S3 bucket
- Lambda function to untar the file
- Target S3 bucket
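All of the CDK snippets in this article live inside a stack class. Here's a minimal sketch of the stack and the imports they assume; the stack and app names are hypothetical, and I'm using CDK v1 style imports to match the code below:

from aws_cdk import core
from aws_cdk import aws_s3 as s3
from aws_cdk import aws_lambda as lambda_
from aws_cdk import aws_s3_notifications


class UntarStack(core.Stack):  # hypothetical stack name
    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # The bucket, Lambda, and notification definitions from the
        # following sections go here.


app = core.App()
UntarStack(app, "untar-stack")
app.synth()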
Creation of buckets
I'm using the code below to create a source bucket (tar_bucket) and a target bucket (untar_bucket); random numbers are appended to the bucket names just to make them unique.
zip_bucket = s3.Bucket(
    self, "tar_bucket",
    bucket_name="tar-bucket-12042023",
)

unzip_bucket = s3.Bucket(
    self, "untar_bucket",
    bucket_name="untar-bucket-12042023",
)
Creation of Lambda Function
We'll use the Python runtime environment and grant the necessary permissions to this Lambda function. The function needs read access to the source S3 bucket and write access to the target S3 bucket so that it can download the tar file, untar it, and write the extracted files to the target bucket.
We also need to trigger the Lambda whenever a tar file is uploaded to the source S3 bucket.
zip_bucket = s3.Bucket(
    self, "tar_bucket",
    bucket_name="tar-bucket-12042023",
)

unzip_bucket = s3.Bucket(
    self, "untar_bucket",
    bucket_name="untar-bucket-12042023",
)

unzip_fn = lambda_.Function(
    self, "un_zip_lambda",
    runtime=lambda_.Runtime.PYTHON_3_8,
    function_name="unzip-lambda-fn",
    # Path to your Lambda function code
    code=lambda_.Code.from_asset("./lambdas"),
    # File name of the handler module and its entry-point function
    handler="lambda-handler.lambda_handler",
    # Maximum execution time of the function
    timeout=core.Duration.seconds(300),
    environment={
        "TARGET_BUCKET": unzip_bucket.bucket_name
    }
)

# Allow the Lambda to write to the target bucket and read from the source bucket
unzip_bucket.grant_write(unzip_fn)
zip_bucket.grant_read(unzip_fn)

notification = aws_s3_notifications.LambdaDestination(unzip_fn)
# assign notification for the s3 event type (ex: OBJECT_CREATED)
zip_bucket.add_event_notification(
    s3.EventType.OBJECT_CREATED, notification)
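Optionally, instead of the unfiltered notification above, you can add a key filter so the Lambda fires only for tar files and not for every object uploaded to the source bucket. This is a sketch and not part of the original setup:

# Optional: trigger the Lambda only for keys ending in ".tar"
zip_bucket.add_event_notification(
    s3.EventType.OBJECT_CREATED,
    notification,
    s3.NotificationKeyFilter(suffix=".tar"),
)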
Lambda function source code
The actual untarring happens in this Lambda function. We're going to use the tarfile package to extract the files. tarfile requires a file system path to write to, so we'll give it the path /tmp/extracted for storing the extracted files. Then, finally, we'll upload the files from the local file system to the target S3 bucket.
import os
import tarfile

import boto3


def list_files(dir):
    # Walk the directory tree and collect the full path of every file
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r


s3 = boto3.client('s3')


def lambda_handler(event, context):
    target_bucket_name = os.environ.get('TARGET_BUCKET', '')
    print(event)
    print("target_bucket_name: {}".format(target_bucket_name))
    for record in event["Records"]:
        # Get the bucket name and key
        bucket_name = record['s3']['bucket']['name']
        object_key = record['s3']['object']['key']
        tar_file_path = '/tmp/{}'.format(object_key)
        s3.download_file(bucket_name, object_key, tar_file_path)
        print('downloaded the file from s3 to {}'.format(tar_file_path))
        # extracting the archive into the local folder
        file = tarfile.open(tar_file_path)
        local_folder_path = '/tmp/extracted/'
        file.extractall(local_folder_path)
        file.close()
        # upload every extracted file to the target bucket, using the path
        # relative to the extraction folder as the object key
        files = list_files(local_folder_path)
        print(files)
        for file in files:
            with open(file, 'rb') as f:
                new_file = file.replace(local_folder_path, '')
                print("new_file: {}".format(new_file))
                s3.put_object(
                    Body=f, Bucket=target_bucket_name, Key=new_file)
    message = "successfully processed {} records.".format(
        len(event['Records']))
    return message
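To sanity-check the handler without deploying it, you can invoke it with a hand-built S3 event. The bucket names and key below are placeholders, and the call still hits S3, so it needs valid AWS credentials and the object to actually exist in the source bucket:

import os

# Point the handler at a test target bucket (placeholder name)
os.environ['TARGET_BUCKET'] = 'untar-bucket-12042023'

# Minimal S3 "object created" event with only the fields the handler reads
test_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "tar-bucket-12042023"},
                "object": {"key": "example.tar"},
            }
        }
    ]
}

print(lambda_handler(test_event, None))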
Please let me know your thoughts in the comments.