How to untar a file from an S3 bucket using AWS Lambda (Python)

In this article, we'll discuss how to automatically untar a file into a target bucket whenever a tar file is uploaded to a source S3 bucket.

💡 The idea is to use the tarfile Python package to untar the file, write the extracted files to the /tmp directory, and then copy them all to the target bucket.

I've written a similar article on unzipping files here.

Infrastructure

We're going to use AWS CDK to create the necessary infrastructure. It's an open-source software development framework that lets you define cloud infrastructure in code. AWS CDK supports many languages, including TypeScript, Python, C#, and Java. You can learn more about AWS CDK from a beginner's guide here. We'll use Python in this article.

At a high level, we need just three resources:

  • Source S3 bucket
  • Lambda function to untar the file
  • Target S3 bucket

Creation of buckets

I'm using the code below to create the source bucket ( tar_bucket ) and the target bucket ( untar_bucket ). A numeric suffix is appended to each bucket name just to make it globally unique.

zip_bucket = s3.Bucket(
            self, "tar_bucket",
            bucket_name="tar-bucket-12042023",
        )

unzip_bucket = s3.Bucket(
            self, "untar_bucket",
            bucket_name="untar-bucket-12042023",
        )

Creation of Lambda Function

We'll use the Python runtime and grant the necessary permissions to this Lambda function: read access to the source S3 bucket and write access to the target S3 bucket, so that the function can download the tar file and write the extracted files to the target bucket.

We also need to trigger the Lambda whenever a tar file is uploaded to the source S3 bucket.

zip_bucket = s3.Bucket(
            self, "tar_bucket",
            bucket_name="tar-bucket-12042023",
        )

unzip_bucket = s3.Bucket(
            self, "untar_bucket",
            bucket_name="untar-bucket-12042023",
        )

unzip_fn = lambda_.Function(
            self, "un_zip_lambda",
            runtime=lambda_.Runtime.PYTHON_3_8,
            function_name="unzip-lambda-fn",
            # Path to the Lambda function code
            code=lambda_.Code.from_asset("./lambdas"),
            # Entry point: <file name>.<handler function>
            handler="lambda-handler.lambda_handler",
            # Maximum execution time of the function
            timeout=core.Duration.seconds(300),
            # Pass the target bucket name to the function
            environment={
                "TARGET_BUCKET": unzip_bucket.bucket_name
            }
        )

unzip_bucket.grant_write(unzip_fn)
zip_bucket.grant_read(unzip_fn)

notification = aws_s3_notifications.LambdaDestination(unzip_fn)

# assign notification for the s3 event type (ex: OBJECT_CREATED)
zip_bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED, notification)
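When the notification fires, Lambda receives an event whose Records list carries the bucket name and the (URL-encoded) object key. As a sketch of how such a record is parsed, here's a trimmed, hypothetical example of the S3 payload; keys with spaces arrive encoded, so unquote_plus is used to decode them:

```python
from urllib.parse import unquote_plus

# Trimmed, hypothetical example of an S3 "ObjectCreated" event payload
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "tar-bucket-12042023"},
                "object": {"key": "my+archive.tar"},  # keys arrive URL-encoded
            }
        }
    ]
}

for record in sample_event["Records"]:
    bucket_name = record["s3"]["bucket"]["name"]
    # unquote_plus turns "my+archive.tar" back into "my archive.tar"
    object_key = unquote_plus(record["s3"]["object"]["key"])
    print(bucket_name, object_key)
```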

Lambda function source code

The actual untar happens in this Lambda function. We use the tarfile package, which needs a file system path to extract to, so we extract to /tmp/extracted and then upload the extracted files from the local file system to the target S3 bucket.
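As a standalone illustration of the tarfile flow (independent of S3; all paths here are arbitrary), this sketch creates a small tar archive, extracts it to a separate folder the way the Lambda extracts to /tmp/extracted, and then walks the result:

```python
import os
import tarfile
import tempfile

# Work in a throwaway directory instead of Lambda's /tmp
workdir = tempfile.mkdtemp()

# Create a file and pack it into a tar archive
src_path = os.path.join(workdir, "hello.txt")
with open(src_path, "w") as f:
    f.write("hello from the archive")

tar_path = os.path.join(workdir, "demo.tar")
with tarfile.open(tar_path, "w") as tar:
    tar.add(src_path, arcname="hello.txt")

# Extract the archive to a separate folder, mirroring /tmp/extracted
extract_dir = os.path.join(workdir, "extracted")
with tarfile.open(tar_path) as tar:
    tar.extractall(extract_dir)

# Walk the extracted tree, like the list_files helper below
extracted = []
for root, dirs, files in os.walk(extract_dir):
    for name in files:
        extracted.append(os.path.join(root, name))

print(extracted)
```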

import os
import tarfile

import boto3


def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r


s3 = boto3.client('s3')


def lambda_handler(event, context):

    target_bucket_name = os.environ.get('TARGET_BUCKET', '')
    print(event)
    print("target_bucket_name: {}".format(target_bucket_name))

    for record in event["Records"]:
        # Get the bucket name and key
        bucket_name = record['s3']['bucket']['name']
        object_key = record['s3']['object']['key']

        # Use only the base name: keys with prefixes (folders) would
        # otherwise point to directories that don't exist under /tmp
        tar_file_path = '/tmp/{}'.format(os.path.basename(object_key))
        s3.download_file(bucket_name, object_key, tar_file_path)

        print('downloaded the file from s3 to {}'.format(tar_file_path))

        file = tarfile.open(tar_file_path)

        local_folder_path = '/tmp/extracted/'
        # extracting file
        file.extractall(local_folder_path)

        file.close()

        files = list_files(local_folder_path)

        print(files)

        for file in files:
            with open(file, 'rb') as f:
                new_file = file.replace(local_folder_path, '')
                print("new_file: {}".format(new_file))
                s3.put_object(
                    Body=f, Bucket=target_bucket_name, Key=new_file)

    message = "successfully processed {} records.".format(
        len(event['Records']))
    return message

Please let me know your thoughts in the comments.