Storage is one of the most important components of modern applications: it is where data is kept for future use or processing. Today, our focus is on Amazon Simple Storage Service (S3), one of the core building blocks AWS offers. In this article, we explore the basics of Amazon S3 in detail. Forgive me if this article ends up being long; there is just so much to talk about when it comes to this service. Without taking any more time, let us begin!
What is Amazon S3?
Amazon S3 is a cloud-based object storage service offered by AWS that provides highly scalable, secure, and durable storage. You can use it to store and retrieve files, documents, images, videos, and other types of data. Think of it as an online hard drive where you can upload and download your files from anywhere, at any time, over an internet connection. Amazon S3 provides a cost-effective solution for data storage and backup that scales automatically as the volume of data grows, and AWS advertises it as an "infinite scaling" storage solution. You can store a virtually unlimited amount of data, and the service is designed to provide 99.999999999% (11 9's) durability of objects over a given year. This means that if you store 10,000,000 objects on S3, you can expect to lose, on average, a single object once every 10,000 years. Note that durability is about not losing data; availability (being able to reach the data when you need it) is a separate property that varies by storage class. Amazon S3 is widely used for backup and disaster recovery, big data analytics, content delivery, data lakes, media hosting, and static website hosting. It is easy to use and integrates with other AWS services, making it a powerful tool for businesses of all sizes.
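To see where the 10,000-year figure comes from, here is a small back-of-the-envelope calculation. It assumes that 11 9's of durability translates into an average annual loss rate of 1 in 10^11 per object, which is a simplification for illustration and not a formula published by AWS.

```python
from fractions import Fraction

# Assumption: 99.999999999% annual durability means each object has,
# on average, a 1-in-10^11 chance of being lost in a given year.
annual_loss_rate = Fraction(1, 10**11)

stored_objects = 10_000_000

# Expected number of objects lost per year across the whole data set.
expected_losses_per_year = stored_objects * annual_loss_rate

# Average number of years between single-object losses.
years_per_lost_object = 1 / expected_losses_per_year

print(f"Expected losses per year: {float(expected_losses_per_year)}")
print(f"Years per lost object: {years_per_lost_object}")
```

Exact fractions are used instead of floats so that the tiny loss rate is not distorted by rounding.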
In Amazon S3, data is stored in containers called buckets, and you cannot store any data on S3 without first creating a bucket. Think of a bucket as a top-level folder in which you store and manage your files. Each bucket has a globally unique name (unique across all AWS regions and accounts in the same partition), and you can have multiple buckets to organize your data in different ways. When creating a bucket, you must specify two details: the bucket name and the region in which the bucket will be created, because buckets are defined at the regional level.
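As a rough sketch of how those two details (name and region) map onto the boto3 create_bucket call, consider the helper below. The function name and the bucket name are made up for illustration, and the helper only prepares the arguments; it does not contact AWS.

```python
def build_create_bucket_args(bucket_name: str, region: str) -> dict:
    """Build keyword arguments for boto3's S3 create_bucket call.

    Hypothetical helper for illustration only.
    """
    args = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a
    # LocationConstraint; every other region is passed explicitly.
    if region != "us-east-1":
        args["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return args


args = build_create_bucket_args("my-example-bucket-2024", "eu-west-1")
print(args)

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   boto3.client("s3", region_name="eu-west-1").create_bucket(**args)
```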
There are naming rules for S3 buckets. The rules are:
Bucket names must be between 3 and 63 characters long.
Bucket names can only be made up of lowercase letters, numbers, dots, and hyphens. Uppercase letters and underscores are not allowed.
They must begin and end with a letter or number.
Bucket names cannot be formatted as IP addresses (for example, 192.168.5.4).
They must not start with the prefix xn-- or end with the suffix -s3alias.
And as we have already seen, a bucket name must be globally unique.
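The rules above can be sketched as a small local check. This is a minimal illustration that covers only the rules listed in this article; the function name is my own, AWS documents a few additional restrictions, and global uniqueness can only be verified by AWS itself when the bucket is created.

```python
import ipaddress
import re

# 3-63 characters, lowercase letters, digits, dots and hyphens only,
# beginning and ending with a letter or digit.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Check a candidate bucket name against the rules listed above."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if name.startswith("xn--") or name.endswith("-s3alias"):
        return False
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return True  # not an IP address, so the name is acceptable
    return False     # the name parses as an IP address


for candidate in ["examplebucket", "Example_Bucket", "192.168.5.4", "xn--bucket"]:
    print(candidate, is_valid_bucket_name(candidate))
```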
I hinted in the introductory paragraph that S3 is a cloud-based object storage service. What this means is that in S3, data is stored as objects. An object is the fundamental unit of storage and it represents a file, which can be any kind of data: a text file, an image, a video, a database backup, or any other type of file. Each object within an S3 bucket is uniquely identified by its key, a string of up to 1,024 bytes when UTF-8 encoded. The key serves as the object's name and also provides a way to organize and retrieve objects within a bucket. Individual objects stored in S3 can have a size of up to 5 TB. It is important to note that in S3, objects are stored in a flat structure, meaning that there are no actual folders or directories like in a traditional file system. However, S3 lets you build a logical hierarchy using prefixes in object keys. For example, if you upload an object with the key "folder/subfolder/object.txt", nothing called "folder" is actually created; the prefix "folder/subfolder/" is simply part of the key, but tools display it as if it were a folder. You can use the AWS Management Console, the AWS CLI, or an SDK to browse and manage objects under these prefixes as if they were directories.
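To make the "flat structure with prefixes" idea concrete, here is a small in-memory sketch of how a folder-like view can be derived from nothing but a list of keys. It loosely mirrors the prefix and delimiter idea used when listing objects, but the function and the sample keys are illustrative assumptions, not the S3 API.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Return (objects, sub_prefixes) directly under the given prefix."""
    objects, sub_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The key continues past a delimiter: show it as a "folder".
            sub_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(sub_prefixes)


# The bucket only ever holds a flat set of keys.
keys = ["folder/subfolder/object.txt", "folder/readme.txt", "root.txt"]

print(list_level(keys))             # top level of the bucket
print(list_level(keys, "folder/"))  # inside the "folder/" prefix
```

The "folders" exist only in the listing logic: delete the last key under a prefix and the prefix disappears with it.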
S3 Security
The security posture of any IT resource or service is very important and must be taken seriously at all times. Amazon S3 enforces security in several ways. We can enforce user-based security with IAM policies, or resource-based security with bucket policies, bucket Access Control Lists (ACLs), and object ACLs. We are only going to explore S3 bucket policies because they are the most widely used.
S3 Bucket Policies
S3 bucket policies are JSON-based documents that specify permissions for S3 buckets and their contents (i.e. the objects within them). They let you define rules that allow or deny access to your bucket and its contents for specific users, roles, or AWS accounts, or even for the public. They can be used to control the actions that can be performed on an S3 bucket, such as read, write, or delete actions. A bucket policy is attached to the bucket itself, but its Resource element can target the whole bucket, a key prefix, or individual objects, giving you granular control over access to your data. Below is a sample S3 bucket policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::examplebucket/*"
    }
  ]
}
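The policy above allows anyone to retrieve objects (files) from the "examplebucket" S3 bucket. The Principal field is set to "*" to indicate that anyone, including anonymous users without AWS credentials, can perform the s3:GetObject action. The Resource field specifies the ARN (Amazon Resource Name) of the objects in the bucket, and the Sid field provides a name for the policy statement to help identify it. Keep in mind that a public policy like this one only takes effect if the bucket's Block Public Access settings, which are enabled by default on new buckets, permit it.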
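If you would rather generate such a policy from code than write the JSON by hand, a minimal sketch could look like this. The helper name is hypothetical; it only builds the JSON document and does not apply it to any bucket.

```python
import json


def public_read_policy(bucket_name: str) -> str:
    """Return a bucket policy (as a JSON string) allowing public GetObject."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                # "/*" targets every object in the bucket, not the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


policy_json = public_read_policy("examplebucket")
print(policy_json)

# With boto3 installed and credentials configured, the policy could be applied with:
#   import boto3
#   boto3.client("s3").put_bucket_policy(Bucket="examplebucket", Policy=policy_json)
```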
S3 Versioning
Versioning is a feature of Amazon S3 that allows you to keep multiple versions of an object in the same bucket. It is a setting that is enabled at the bucket level. When versioning is enabled for a bucket, every object uploaded to that bucket gets a unique version ID. You can upload a new version of an object by simply uploading a file with the same key (name) as the existing object. Each version has its own version ID, and you can access and manage all versions of an object using the S3 API or the AWS Management Console. If versioning is enabled after an object was uploaded, that version of the object (i.e. the one uploaded before versioning was enabled) will have a version ID of "null". It is a best practice to enable versioning for your S3 buckets, as it protects against unintended deletes by letting you restore deleted objects and also allows you to roll back to a previous version of an object. An S3 bucket can be in one of three states at any given point: unversioned (the default), versioning-enabled, or versioning-suspended. Once you enable versioning on a bucket, it can never return to the unversioned state; you can, however, suspend versioning.
Note: Enabling versioning may increase storage costs, as each version of an object is stored as a separate object in the bucket. So keep this in mind when enabling versioning.
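The behaviour described above can be illustrated with a toy in-memory model. This is only a sketch of the idea, not how S3 is implemented: the class is made up, it models only the unversioned and versioning-enabled states, and real version IDs are opaque strings generated by S3 rather than simple counters.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a bucket with optional versioning."""

    def __init__(self):
        self.versioning_enabled = False
        self._versions = {}  # key -> list of (version_id, body), newest last
        self._counter = itertools.count(1)

    def put(self, key, body):
        if self.versioning_enabled:
            # Each upload to the same key adds a new version.
            version_id = f"v{next(self._counter)}"
            self._versions.setdefault(key, []).append((version_id, body))
        else:
            # Unversioned bucket: a single "null" version that is overwritten.
            version_id = "null"
            self._versions[key] = [(version_id, body)]
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions[key]
        if version_id is None:
            return versions[-1][1]  # latest version
        for vid, body in versions:
            if vid == version_id:
                return body
        raise KeyError(f"{key} has no version {version_id}")


bucket = VersionedBucket()
first_id = bucket.put("report.txt", "draft")   # uploaded before versioning
bucket.versioning_enabled = True
second_id = bucket.put("report.txt", "final")  # new version, same key

print(first_id, second_id)
print(bucket.get("report.txt"))                     # latest version
print(bucket.get("report.txt", version_id="null"))  # the original upload
```

Notice that both bodies are kept, which is exactly why versioning can increase storage costs.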
S3 Replication
Replication is an S3 feature that allows you to automatically replicate objects from one S3 bucket to another bucket in the same or a different AWS region. Replication helps improve the availability of data and also helps you comply with regulatory requirements for data replication and disaster recovery. To use S3 replication, versioning must be enabled on both the source and destination buckets, because replication relies on version IDs to track which object versions need to be replicated. S3 supports both same-region replication (SRR) and cross-region replication (CRR).
Cross-Region Replication is used to replicate objects automatically and asynchronously across different AWS regions. With cross-region replication, you can create a replica of your S3 objects in a different region than the source bucket for data redundancy, disaster recovery, and lower latency access to data.
Same-Region Replication allows you to automatically and asynchronously replicate objects between buckets that reside in the same AWS region. SRR creates an exact copy of objects in a different bucket within the same region. It can be useful for a variety of use cases, such as aggregating logs from several buckets into one, keeping production and test accounts in sync, or complying with regulations that require an additional copy of the data to stay within a particular region.
Here are some key points to take note of about replication:
After replication is enabled, only new objects uploaded to the bucket are replicated. If you would like to replicate existing objects, use S3 Batch Replication.
There is no chaining of replication. This means that if bucket A replicates to bucket B, and bucket B replicates to bucket C, objects created in bucket A are replicated to bucket B but are not passed on to bucket C. A source bucket can, however, be configured to replicate directly to more than one destination bucket.
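The two key points above (only new objects are replicated, and there is no chaining) can be sketched with a toy model. This is purely illustrative: the class is made up, and the copy happens synchronously here, whereas real S3 replication is asynchronous and is configured through replication rules.

```python
class Bucket:
    """Toy bucket that copies new objects to its replication destinations."""

    def __init__(self, name):
        self.name = name
        self.objects = {}       # key -> (body, is_replica)
        self.destinations = []  # buckets this bucket replicates to

    def put(self, key, body, _is_replica=False):
        self.objects[key] = (body, _is_replica)
        if _is_replica:
            return  # replicas are not replicated again (no chaining)
        for destination in self.destinations:
            destination.put(key, body, _is_replica=True)


a, b, c = Bucket("a"), Bucket("b"), Bucket("c")

a.put("old.txt", "uploaded before replication was set up")

a.destinations.append(b)  # rule: A -> B
b.destinations.append(c)  # rule: B -> C

a.put("new.txt", "uploaded after replication was set up")
b.put("direct.txt", "uploaded straight into B")

print(sorted(b.objects))  # objects that reached B
print(sorted(c.objects))  # objects that reached C
```

In this model "new.txt" reaches B but stops there, "direct.txt" reaches C because it was created in B itself, and "old.txt" is never copied because it existed before the rule.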
S3 Storage Classes
An S3 storage class determines how an object is stored and billed, based on its access frequency and its durability and availability requirements. Amazon S3 offers a range of storage classes to help customers optimize cost and performance for the access patterns of their data. The storage classes include S3 Standard, S3 Standard-Infrequent Access (IA), S3 One Zone-Infrequent Access, S3 Intelligent-Tiering, S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive. We are now going to explore each of these storage classes in detail and look at some use cases for each one of them.
S3 Standard: This is the default storage class in S3 and it is designed for frequently accessed data. It offers high durability, availability, and performance, and suits scenarios where low latency and high throughput are important. In this storage class, data is stored across multiple devices in at least three Availability Zones within an AWS Region. This provides high durability, meaning that data is protected against hardware failure, and high availability, meaning that data can be accessed when you need it. Like all S3 storage classes, S3 Standard offers strong read-after-write consistency: after a successful write of a new object, or an overwrite or delete of an existing one, any subsequent read immediately reflects the change. Some common use cases for S3 Standard are general-purpose storage for mobile and gaming applications, dynamic websites, content distribution, and big data analytics.
S3 Standard-Infrequent Access (IA): It is a cost-effective storage option for data that is accessed infrequently but still requires low-latency access when needed. It offers the high durability, high throughput, and low latency of S3 Standard, with a lower per-GB storage price and a per-GB retrieval fee. Retrieval fees are charged based on the amount of data retrieved, and objects are subject to a minimum storage duration of 30 days. The S3 Standard-IA storage class provides a good balance between performance and cost, and it is ideal for storing long-term backups, disaster recovery files, or data that needs to be retained for compliance purposes.
S3 One Zone-Infrequent Access: Unlike the other storage classes, which store data in a minimum of three Availability Zones (AZs), One Zone-IA stores data in a single AZ within a region, which makes it cheaper than S3 Standard-IA. It offers the same low-latency access as S3 Standard-IA, but the data will be lost if that AZ is destroyed. It is therefore intended for infrequently accessed data that is not business-critical or that can be easily recreated, such as secondary backup copies.
S3 Intelligent-Tiering: It is a storage class that automatically optimizes your storage costs for data with unknown or changing access patterns by moving objects between three access tiers: the frequent access tier (the default tier), the infrequent access tier, and the archive instant access tier. When objects are uploaded to this storage class, they are placed in the frequent access tier. S3 Intelligent-Tiering then monitors access patterns: objects that have not been accessed for 30 consecutive days move to the infrequent access tier, objects that have not been accessed for 90 consecutive days move to the archive instant access tier, and an object that is accessed again moves back to the frequent access tier. It is important to note that there is no retrieval fee for this storage class, although there is a small monthly monitoring and automation charge per object.
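As a rough sketch of the tiering logic, the function below maps the number of days since an object was last accessed to the tier it would sit in. It assumes the 30-day and 90-day thresholds AWS documents for the automatic tiers and ignores the optional archive tiers; the function name is made up for illustration.

```python
def intelligent_tiering_tier(days_since_last_access: int) -> str:
    """Return the automatic access tier for an object, given its idle time."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access cannot be negative")
    if days_since_last_access >= 90:
        return "Archive Instant Access"
    if days_since_last_access >= 30:
        return "Infrequent Access"
    # Newly uploaded or recently accessed objects stay in the default tier.
    return "Frequent Access"


for days in (0, 29, 30, 89, 90, 400):
    print(days, intelligent_tiering_tier(days))
```

Accessing an object resets its idle time to zero, which is what moves it back to the frequent access tier.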
S3 Glacier Instant Retrieval: This storage class is used for archiving data that is rarely accessed but requires millisecond retrieval when needed. According to AWS, it can save up to 68% on storage costs compared to S3 Standard-IA when data is accessed about once per quarter, with the same latency and throughput performance. When using S3 Glacier Instant Retrieval, you are charged for storage, for each GB of data retrieved, and for any data transfers, and objects are subject to a minimum storage duration of 90 days. Because objects are available immediately, there is no restore step and no retrieval mode to choose. It is a good storage option for applications that require low-latency access to data archives.
S3 Glacier Flexible Retrieval: It offers low-cost storage for archived data that is accessed 1-2 times per year and does not need to be available immediately. Unlike S3 Glacier Instant Retrieval, objects in this storage class must be restored before they can be read, and the cost and speed of a restore depend on the retrieval mode. There are three retrieval modes available: expedited, standard, and bulk. Expedited retrieval typically provides data access within 1-5 minutes, while standard retrieval typically takes 3-5 hours. Bulk retrieval typically takes 5-12 hours and is the most cost-effective option. A restore creates a temporary copy of the object that stays available for the number of days you request. S3 Glacier Flexible Retrieval is ideal for storing backups, disaster recovery, and offsite data storage needs.
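To show how a retrieval mode is chosen in practice, here is a minimal sketch that builds the restore request passed to boto3's restore_object call. The helper name is hypothetical, and it only prepares the request dictionary; it does not contact AWS.

```python
RETRIEVAL_TIERS = {"Expedited", "Standard", "Bulk"}


def build_restore_request(days: int, tier: str = "Standard") -> dict:
    """Build the RestoreRequest argument for an archived object.

    days: how long the restored temporary copy should stay available.
    tier: one of the three retrieval modes.
    """
    if tier not in RETRIEVAL_TIERS:
        raise ValueError(f"tier must be one of {sorted(RETRIEVAL_TIERS)}")
    if days < 1:
        raise ValueError("days must be at least 1")
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}


request = build_restore_request(days=7, tier="Bulk")
print(request)

# With boto3 installed and credentials configured, the restore could be started with:
#   import boto3
#   boto3.client("s3").restore_object(
#       Bucket="examplebucket", Key="backups/2023.tar", RestoreRequest=request
#   )
```

The bucket name and key in the comment are placeholders.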
S3 Glacier Deep Archive: This is the lowest-cost S3 storage class. It supports long-term retention (usually ranging from several years to decades) and digital preservation of data that might be accessed once or twice a year. Data stored in this storage class can be restored within 12 hours using standard retrieval, or within 48 hours using the cheaper bulk retrieval, so it is recommended only for data that is unlikely to be needed at short notice. Objects are subject to a minimum storage duration of 180 days. It is designed for customers who have regulatory or business requirements to retain large amounts of data for extended periods.
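To tie the storage classes together, here is a rough rule-of-thumb selector based on the descriptions above. The decision logic and parameter names are my own simplification for illustration, not an official AWS recommendation; the returned strings are the storage class identifiers used by the S3 API.

```python
def suggest_storage_class(access_pattern: str,
                          max_retrieval_wait: str = "milliseconds",
                          easily_recreated: bool = False) -> str:
    """Suggest a storage class.

    access_pattern: "frequent", "infrequent", "archive" or "unknown".
    max_retrieval_wait: "milliseconds", "hours" or "days" (archive data only).
    easily_recreated: True if losing the data would not be a problem.
    """
    if access_pattern == "unknown":
        return "INTELLIGENT_TIERING"
    if access_pattern == "frequent":
        return "STANDARD"
    if access_pattern == "infrequent":
        return "ONEZONE_IA" if easily_recreated else "STANDARD_IA"
    if access_pattern == "archive":
        archive_classes = {
            "milliseconds": "GLACIER_IR",  # S3 Glacier Instant Retrieval
            "hours": "GLACIER",            # S3 Glacier Flexible Retrieval
            "days": "DEEP_ARCHIVE",        # S3 Glacier Deep Archive
        }
        if max_retrieval_wait not in archive_classes:
            raise ValueError(f"unknown max_retrieval_wait: {max_retrieval_wait}")
        return archive_classes[max_retrieval_wait]
    raise ValueError(f"unknown access_pattern: {access_pattern}")


print(suggest_storage_class("frequent"))
print(suggest_storage_class("infrequent", easily_recreated=True))
print(suggest_storage_class("archive", max_retrieval_wait="days"))
```

A real decision would also weigh pricing, minimum storage durations, and object sizes, which this sketch deliberately leaves out.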
Final Thoughts
We have explored some of the fundamental details of Amazon S3. I believe what we have covered in this article is enough to get you started using the service and also to help you make well-informed business decisions when it comes to choosing an S3 storage option based on cost, performance or use case. Go on and keep exploring this service so that you can enjoy the benefits it offers. Be sure to leave any questions you have in the comment section.