S3 Select pushes the filtering work down to the S3 storage layer, which drastically reduces the network traffic needed for each request.
You can perform SQL queries using the AWS SDKs, the SelectObjectContent REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. Note, however, that the console limits the amount of data returned to 40 MB.
Let me start with an example use case for S3 Select. Suppose you need to analyze data stored in an S3 bucket in CSV or JSON format, the data is updated frequently, and a new GZIP- or BZIP2-compressed CSV/JSON file is uploaded every day. Without S3 Select, you would need to download, decompress, and process the entire object to get the data you need. With S3 Select, you can run simple SQL expressions to return only the columns and rows you’re interested in, instead of retrieving the entire object. This means you’re often dealing with an order of magnitude less data, which in turn can dramatically improve the performance and reduce the cost of applications that need to access data in S3.
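As a rough sketch of that use case, the request for a GZIP-compressed CSV could be assembled along these lines. The helper name `build_select_request`, the bucket, the key, and the column names are all made up for illustration; only the parameter names are taken from the `select_object_content` call.

```python
def build_select_request(bucket, key, expression, compression="GZIP"):
    """Assemble keyword arguments for s3.select_object_content (hypothetical helper)."""
    if compression not in ("NONE", "GZIP", "BZIP2"):
        raise ValueError(f"unsupported compression: {compression}")
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # CompressionType tells S3 how the stored object is compressed.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": compression,
        },
        "OutputSerialization": {"CSV": {}},
    }


# Assumed bucket, key, and column names, for illustration only.
params = build_select_request(
    bucket="example-bucket",
    key="daily/report.csv.gz",
    expression="SELECT s.city, s.total FROM s3object s WHERE s.city = 'Lagos'",
)
# With credentials configured, the call would be:
#   boto3.client("s3").select_object_content(**params)
```

Keeping the request as a plain dictionary makes it easy to inspect or log before anything is sent to S3.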
According to the official AWS documentation, “S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to the Amazon S3 service. By reducing the volume of data that has to be loaded and processed by your applications, S3 Select can improve the performance of most applications that frequently access data from S3 by up to 400%.”
Here is a Python example that shows how to query an object containing data in CSV format. The expression used here, select * from s3object, returns every column; to return a subset, you would name the columns instead.
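
    import boto3

    # 'region' is a placeholder for your AWS region.
    s3 = boto3.client('s3', region_name='region')

    response = s3.select_object_content(
        Bucket='',  # fill in your bucket name
        Key='',  # fill in the key of your .csv object
        ExpressionType='SQL',
        Expression='select * from s3object',
        InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
        OutputSerialization={'CSV': {}},
    )

    # The response payload is a stream of events; Records events carry the data.
    for event in response['Payload']:
        if 'Records' in event:
            records = event['Records']['Payload'].decode('utf-8')
            print(records)
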
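The loop above prints each chunk as it arrives. Since a Records event can carry a partial record, one option is to join the raw bytes first and decode once at the end. Here is a minimal sketch of that idea; `collect_select_output` is a hypothetical helper, and the event list is a hand-written stand-in for `response['Payload']`, not real S3 output.

```python
def collect_select_output(event_stream):
    """Join all Records chunks and return (text, stats) from a select response payload."""
    chunks = []
    stats = {}
    for event in event_stream:
        if "Records" in event:
            # A chunk may end in the middle of a record, so keep raw bytes for now.
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            # Details holds BytesScanned, BytesProcessed, and BytesReturned.
            stats = event["Stats"]["Details"]
    return b"".join(chunks).decode("utf-8"), stats


# Hand-written stand-in for response['Payload'], for illustration only.
fake_events = [
    {"Records": {"Payload": b"alice,3"}},
    {"Records": {"Payload": b"0\nbob,25\n"}},
    {"Stats": {"Details": {"BytesScanned": 120, "BytesProcessed": 120, "BytesReturned": 16}}},
    {"End": {}},
]
text, stats = collect_select_output(fake_events)
print(text.splitlines())
print(stats["BytesReturned"])
```

The Stats event is also a handy way to see how much data was scanned versus returned for a given query.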
Here is what I find very interesting about S3 Select: you can specify the format in which you want your output using the OutputSerialization parameter. In the example above I specified CSV, but if I wanted the output in JSON format, I could simply have passed:

    OutputSerialization={'JSON': {}},

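With JSON output, each record comes back as one JSON object followed by a record delimiter, so the joined payload can be parsed line by line. The sketch below assumes a newline delimiter; `parse_json_records` and the sample payload are hypothetical and only illustrate the shape of the data.

```python
import json


def parse_json_records(payload_text):
    """Turn newline-delimited JSON output into a list of dicts."""
    return [json.loads(line) for line in payload_text.splitlines() if line.strip()]


# Setting the delimiter explicitly makes the one-object-per-line assumption visible.
output_serialization = {"JSON": {"RecordDelimiter": "\n"}}

# Hand-written sample of what the joined Records payload could look like.
sample = '{"name":"alice","age":"30"}\n{"name":"bob","age":"25"}\n'
rows = parse_json_records(sample)
print(rows[0]["name"])
```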
The ability to retrieve only part of an object comes in very handy when building serverless applications with AWS Lambda.
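To make that concrete, here is a minimal sketch of a Lambda handler that returns only the rows matched by a query. The event fields `bucket`, `key`, and `expression` are assumptions for this example, not a standard event shape, and the optional `s3_client` argument exists only so the handler can be exercised without AWS access.

```python
def lambda_handler(event, context, s3_client=None):
    """Return only the matching rows of a CSV object (sketch; event fields are assumed)."""
    if s3_client is None:
        import boto3  # imported lazily so the handler can be tested with a stub client

        s3_client = boto3.client("s3")

    response = s3_client.select_object_content(
        Bucket=event["bucket"],
        Key=event["key"],
        ExpressionType="SQL",
        Expression=event["expression"],
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    # Join the raw chunks before decoding, since a chunk may split a record.
    chunks = [e["Records"]["Payload"] for e in response["Payload"] if "Records" in e]
    return {"rows": b"".join(chunks).decode("utf-8").splitlines()}
```

Because the function only ever holds the filtered rows, its memory use depends on the size of the result rather than the size of the object.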
Another feature that I think is pretty cool is CloudWatch Metrics for S3 Select, which lets you monitor S3 Select usage for your applications.
Well, that’s pretty much it. You can dig further into the docs for more use cases.
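As a hedged sketch of how those metrics might be read back, the snippet below builds the arguments for CloudWatch's `get_metric_statistics`. It assumes request metrics are enabled on the bucket, that the metrics filter is named `EntireBucket`, and that the metric is called `SelectBytesScanned`; check these against your own setup. The helper `select_metric_query` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def select_metric_query(bucket, metric_name, filter_id, hours=24):
    """Build arguments for cloudwatch.get_metric_statistics (hypothetical helper)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/S3",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "FilterId", "Value": filter_id},  # assumed metrics filter ID
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 3600,  # one datapoint per hour
        "Statistics": ["Sum"],
    }


# Assumed bucket name, metric name, and filter ID, for illustration only.
query = select_metric_query("example-bucket", "SelectBytesScanned", "EntireBucket")
# With credentials configured, the call would be:
#   boto3.client("cloudwatch").get_metric_statistics(**query)
```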
[SelectObjectContent: This operation filters the contents of an Amazon S3 object based on a simple structured query language (SQL) statement…](https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html)