From Logs to Compliance: A Guide to AWS Monitoring and Auditing with CloudWatch, CloudTrail, and AWS Config
Suppose you are the lead cloud engineer at your company, responsible for building and deploying solutions in the AWS cloud. You deployed an application without configuring monitoring, and it ran smoothly for a while. Then, one fateful night, your operations manager calls to say the application is no longer running. To make matters worse, you don't even know where to start troubleshooting, all because you didn't add monitoring to your architecture.
As a cloud engineer or software developer, one of the worst things you can do when building and deploying solutions in the cloud is to skip monitoring. Monitoring is the act of collecting, analysing and using data to make decisions or answer questions about your IT resources and systems. It primarily gives you visibility into your resources. Some of its benefits include:
It allows you to respond to operational issues proactively even before they are picked up by your end users.
Monitoring can help you improve the performance and reliability of your IT resources.
It helps you to easily identify security threats and vulnerabilities.
Monitoring helps you make data-driven decisions for your business and build more cost-optimized solutions.
Having seen some of the benefits of monitoring, we will spend the remainder of the article exploring the monitoring and auditing services offered by AWS: Amazon CloudWatch, AWS CloudTrail and AWS Config. We are going to look at each of these services in detail, so let's begin!
CloudWatch
CloudWatch is a monitoring and observability service that collects data about your resources and turns it into actionable insights into your AWS resources and applications. With CloudWatch, you can respond to system-wide performance changes, optimize resource usage and get a unified view of operational health. You can use CloudWatch to do the following:
Detect anomalous behaviour in your AWS environment.
Visualize logs and metrics using the AWS management console.
Set alarms to notify you when something goes wrong.
Troubleshoot issues.
Discover insights to keep your applications healthy.
Take automated actions such as scaling.
Now let's look at the various services within CloudWatch that enable us to carry out the tasks listed above.
CloudWatch Metrics: CloudWatch Metrics are used to monitor resources such as EC2 instances, RDS databases and Lambda functions in the AWS cloud. A metric is a time-ordered set of data points, where each data point represents the value of the metric at a specific point in time. Simply put, a metric is a variable you want to monitor. For EC2 instances, metrics could be CPU utilization or network traffic (memory usage is not reported by default and requires the CloudWatch agent); for RDS databases, it could be the number of database connections. Most AWS services publish metrics to CloudWatch automatically. Metrics are collected at regular intervals, and the data points form a time series that can be graphed and used in alarms to monitor the health and performance of AWS resources. In CloudWatch, every metric belongs to a namespace, which is a container for related metrics.
A namespace is a way to organize and categorize metrics in CloudWatch. For example, the CPU utilization metric that EC2 publishes for your instances lives in the "AWS/EC2" namespace. Namespaces that begin with "AWS/" are reserved for AWS services, so metrics you publish yourself go into a namespace of your own choosing. Namespaces prevent naming conflicts, since metrics with the same name in different namespaces are kept separate, and they make it easy to view and manage related metrics together.
So far, we have only talked about AWS-provided metrics. CloudWatch also lets you publish custom metrics instead of relying only on those provided by AWS. An example of a custom metric could be the number of user logins to your application.
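As a rough illustration of the user login example, here is a minimal Python sketch that builds the arguments for CloudWatch's PutMetricData call and, optionally, sends them with boto3. The namespace "MyApp/Authentication", the metric name "UserLogins" and the "Environment" dimension are assumptions made up for this example, not names that AWS defines.

```python
from datetime import datetime, timezone


def build_login_metric(login_count, environment, timestamp=None):
    """Build the keyword arguments for CloudWatch's PutMetricData call."""
    return {
        # Hypothetical custom namespace; it must not start with "AWS/",
        # which is reserved for metrics published by AWS services.
        "Namespace": "MyApp/Authentication",
        "MetricData": [
            {
                "MetricName": "UserLogins",  # hypothetical metric name
                "Dimensions": [{"Name": "Environment", "Value": environment}],
                "Timestamp": timestamp or datetime.now(timezone.utc),
                "Value": float(login_count),
                "Unit": "Count",
            }
        ],
    }


def publish_login_metric(login_count, environment):
    """Send one data point to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(**build_login_metric(login_count, environment))


if __name__ == "__main__":
    params = build_login_metric(42, "production")
    print(params["Namespace"], params["MetricData"][0]["MetricName"])
```

Keeping the payload builder separate from the network call makes it easy to inspect what would be sent before anything reaches your AWS account.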
With CloudWatch metric streams, you can also continually stream metrics to a destination of your choice with near-real-time delivery. Possible destinations include Kinesis Data Firehose as well as third-party service providers such as Datadog, Splunk and New Relic.
CloudWatch Logs: CloudWatch Logs allows you to monitor, store, and access log files from various sources. These include EC2 instances, CloudTrail, VPC flow logs, AWS Lambda, Elastic Beanstalk, and custom logs generated by your applications running on AWS or on on-premises servers. Logs are stored in log groups. A log group is a collection of log streams that share the same retention, monitoring, and access control settings, and a log stream is a sequence of log events that share the same source. For example, you might have a log group for a particular application or service, and each instance of that application or service might generate its own log stream within that group. Once you have created a log group, you can configure its retention period, which determines how long log data is kept before it is automatically deleted. You can also set up metric filters and alarms for the log group to monitor and alert on specific log events or patterns. Log data can be sent on to other AWS services as well: logs can be exported to Amazon S3, and subscription filters can stream them to AWS Lambda, Kinesis Data Firehose, Kinesis Data Streams and Amazon OpenSearch Service.
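To make the retention period and metric filter settings concrete, here is a small Python sketch that prepares the arguments for the CloudWatch Logs PutRetentionPolicy and PutMetricFilter calls. The log group name, the filter name "ErrorCount" and the namespace "MyApp/Logs" are hypothetical, and the filter pattern simply matches log events that contain the term ERROR.

```python
def build_retention_policy(log_group, days):
    """Arguments for PutRetentionPolicy: delete events older than `days`."""
    return {"logGroupName": log_group, "retentionInDays": days}


def build_error_metric_filter(log_group):
    """Arguments for PutMetricFilter: count log events that contain ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "ErrorCount",  # hypothetical filter name
        "filterPattern": "ERROR",
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": "MyApp/Logs",  # hypothetical namespace
                "metricValue": "1",  # each matching event adds 1
            }
        ],
    }


def apply_log_settings(log_group, days=30):
    """Apply both settings with boto3 (needs AWS credentials)."""
    import boto3  # imported lazily so the builders work without it

    logs = boto3.client("logs")
    logs.put_retention_policy(**build_retention_policy(log_group, days))
    logs.put_metric_filter(**build_error_metric_filter(log_group))


if __name__ == "__main__":
    print(build_error_metric_filter("/myapp/production"))
```

The metric produced by the filter behaves like any other CloudWatch metric, so an alarm can be attached to it.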
CloudWatch Logs also supports log aggregation, which lets you consolidate log data from multiple sources. Using subscription filters, for example, you can stream logs from multiple AWS accounts and regions to a central destination, and on-premises servers can send their logs through the CloudWatch agent.
By default, logs are not sent from an EC2 instance to CloudWatch. You have to install a CloudWatch agent on the instance to send them. There are two agents you can use for this purpose: the CloudWatch Logs Agent (the older one) and the CloudWatch Unified Agent (the newer one). The Logs Agent can only send logs to CloudWatch Logs, whereas the Unified Agent can send logs and also collect additional system-level metrics such as memory and disk usage.
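The Unified Agent is driven by a JSON configuration file. The Python sketch below generates a minimal one that ships a single log file and collects memory and disk metrics. The file path and log group name are hypothetical, and the keys follow the agent's configuration schema as I understand it, so check them against the agent documentation before relying on them.

```python
import json


def build_agent_config(log_file, log_group):
    """Return a minimal CloudWatch Unified Agent configuration as a dict."""
    return {
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            # The agent substitutes the EC2 instance ID here,
                            # giving each instance its own log stream.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
        "metrics": {
            "metrics_collected": {
                # System-level metrics that EC2 does not report by itself.
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical path and log group, for illustration only.
    config = build_agent_config("/var/log/myapp/app.log", "myapp-logs")
    print(json.dumps(config, indent=2))
```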
CloudWatch Alarms: Alarms watch a metric and let you respond when it changes, including metrics derived from the logs you collect. An alarm can notify you, for example by email or SMS through Amazon SNS, when a metric crosses a predefined threshold. A CloudWatch alarm is in one of three states at any given time: OK, INSUFFICIENT_DATA or ALARM. The OK state indicates that the metric being monitored is within the acceptable range and the alarm is not triggered. The INSUFFICIENT_DATA state indicates that CloudWatch has not received enough data to determine the state of the alarm. This can occur when the metric is new or the data feed has been interrupted. The ALARM state indicates that the metric being monitored has breached its predefined threshold and the alarm is triggered. Depending on how the alarm is configured to treat missing data, an interrupted data feed can also move it into this state.
CloudWatch Alarms are evaluated over periods. The period is the length of time over which the metric is aggregated to produce a single data point, and the number of evaluation periods is how many of the most recent data points CloudWatch checks against the threshold when it decides the alarm's state. For standard metrics, the period can range from 1 minute to 24 hours, and it should line up with how often the metric is published. For example, if the metric is sent to CloudWatch every 10 minutes, the period should be a multiple of 10 minutes. Consider these settings carefully when creating a CloudWatch alarm because they directly affect its sensitivity and responsiveness. A shorter evaluation window results in faster state changes, while a longer one provides a broader view of the metric's behaviour and smooths out brief spikes.
Note: The number of evaluation periods should not be confused with the period itself. The period sets the width of each data point, while the evaluation periods set how many data points are considered.
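The interplay between the period and the evaluation periods is easier to see in a small simulation. The Python sketch below is a simplified model I wrote for illustration, not CloudWatch's actual algorithm: it averages raw samples into one data point per period, then checks how many of the most recent data points breach the threshold. The function names and the "datapoints to alarm" parameter are assumptions of this sketch, and missing-data handling is left out.

```python
def aggregate(samples, samples_per_period):
    """Average raw samples into one data point per period."""
    points = []
    last_start = len(samples) - samples_per_period
    for start in range(0, last_start + 1, samples_per_period):
        window = samples[start:start + samples_per_period]
        points.append(sum(window) / len(window))
    return points


def alarm_state(data_points, threshold, evaluation_periods, datapoints_to_alarm):
    """Decide the alarm state from the most recent data points."""
    if len(data_points) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = data_points[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    # One CPU sample per minute, aggregated into 2-minute periods.
    cpu_samples = [50, 60, 90, 95, 85, 92]
    points = aggregate(cpu_samples, samples_per_period=2)
    print(points)  # [55.0, 92.5, 88.5]
    # Two of the last three data points exceed 80, so this prints ALARM.
    print(alarm_state(points, threshold=80, evaluation_periods=3,
                      datapoints_to_alarm=2))
```

Requiring all three data points to breach (datapoints_to_alarm=3) would leave the same alarm in the OK state, which shows how these settings trade responsiveness for fewer false alarms.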
CloudWatch Alarms have actions, and the targets of these actions are what respond when an alarm is triggered. The three main alarm targets are: EC2 instances (where an instance is stopped, terminated, rebooted or recovered when the alarm is triggered), EC2 Auto Scaling (where the alarm triggers a scaling action), and Amazon SNS (where a notification is published to a topic).
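As a sketch of how an alarm and its targets fit together, the Python snippet below builds the arguments for CloudWatch's PutMetricAlarm call for the built-in EC2 CPUUtilization metric. The alarm name, the threshold of 80 percent and the choice of actions are illustrative assumptions; the SNS topic ARN is supplied by the caller.

```python
def build_cpu_alarm(instance_id, sns_topic_arn, region="us-east-1"):
    """Arguments for PutMetricAlarm on an instance's CPU utilization."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # each data point covers 5 minutes
        "EvaluationPeriods": 3,   # look at the 3 most recent data points
        "DatapointsToAlarm": 2,   # alarm if 2 of those 3 breach
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [
            sns_topic_arn,                          # notify through SNS
            f"arn:aws:automate:{region}:ec2:stop",  # EC2 action: stop
        ],
    }


def create_cpu_alarm(instance_id, sns_topic_arn):
    """Create the alarm with boto3 (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_cpu_alarm(instance_id, sns_topic_arn))


if __name__ == "__main__":
    # Hypothetical instance ID and topic ARN, for illustration only.
    print(build_cpu_alarm("i-0123456789abcdef0",
                          "arn:aws:sns:us-east-1:123456789012:ops-alerts"))
```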
In CloudWatch, you can also use composite alarms, which are alarms based on the state of other alarms. Composite alarms are created by combining two or more alarms with logical operators (AND, OR, NOT), and the composite alarm goes off when the resulting condition is met. For example, you can create a composite alarm that goes off only when a CPU utilization alarm AND a memory usage alarm are both in the ALARM state, or one that goes off when EITHER a latency alarm OR an error rate alarm is in the ALARM state. This allows you to create complex rules to assess your system's health and respond appropriately. Composite alarms also reduce alarm noise and unnecessary notifications. Suppose you have several alarms that go off when CPU utilization on your EC2 instances exceeds a certain threshold. If notifications come only from a composite alarm that requires a certain number of those CPU alarms to be in the ALARM state at the same time, you avoid being notified for every individual CPU alarm. This helps reduce alarm fatigue and makes it easier to focus on the most important issues.
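A composite alarm's condition is written as a rule expression over the names of other alarms. The Python sketch below assembles such expressions and the arguments for the PutCompositeAlarm call. The helper functions and the alarm names ("cpu-high", "memory-high" and so on) are hypothetical and only serve the example.

```python
def alarm(name):
    """Rule term that is true while the named alarm is in the ALARM state."""
    return f'ALARM("{name}")'


def all_of(*terms):
    """Combine terms with AND."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """Combine terms with OR."""
    return "(" + " OR ".join(terms) + ")"


def build_composite_alarm(name, rule, sns_topic_arn):
    """Arguments for CloudWatch's PutCompositeAlarm call."""
    return {"AlarmName": name, "AlarmRule": rule, "AlarmActions": [sns_topic_arn]}


def create_composite_alarm(name, rule, sns_topic_arn):
    """Create the composite alarm with boto3 (needs AWS credentials)."""
    import boto3  # imported lazily so the builders work without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_composite_alarm(**build_composite_alarm(name, rule, sns_topic_arn))


if __name__ == "__main__":
    both = all_of(alarm("cpu-high"), alarm("memory-high"))
    either = any_of(alarm("latency-high"), alarm("error-rate-high"))
    print(both)    # (ALARM("cpu-high") AND ALARM("memory-high"))
    print(either)  # (ALARM("latency-high") OR ALARM("error-rate-high"))
```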
AWS CloudTrail
CloudTrail is an AWS service that enables governance, compliance, operational auditing, and risk auditing of AWS accounts. It records AWS API calls as events and provides detailed information about who made each call, when it was made, and what it was for. When you create a trail, the recorded events are delivered as log files to an S3 bucket, where they can be analyzed and monitored using Amazon CloudWatch or third-party tools. CloudTrail is enabled by default on every AWS account, and trails can be configured to capture events for specific AWS regions, accounts, or resources.
CloudTrail logs contain detailed information about AWS resource changes, including any API calls made to create, modify, or delete resources. This information can be used for auditing and compliance purposes, as well as for troubleshooting and security analysis. CloudTrail logs can also be used to identify trends and patterns in API usage, which can help organizations optimize their AWS resource utilization and cost. CloudTrail supports integration with other AWS services such as CloudWatch, Amazon SNS, and AWS Lambda, enabling users to automate analysis and response to events captured by CloudTrail.
CloudTrail captures information about activity in an AWS account in the form of CloudTrail events. These events provide insight into changes made to resources in your account, who made the changes, and when they were made. Each event has a timestamp and carries the identity of the user who performed the action, the source IP address of the request, and other details. Examples of events that CloudTrail captures include EC2 instance launches and terminations, security group changes, S3 bucket access events, and IAM user policy changes. CloudTrail records these events in JSON format, and they can be sent to Amazon S3 buckets or Amazon CloudWatch Logs. Let's look at the different types of events that CloudTrail can capture and log:
Management events are operations performed on resources in your AWS account, in other words, API calls that create, modify, or delete AWS resources. Examples include creating an EC2 instance, modifying an S3 bucket policy, or deleting an IAM user.
Data events are API calls that access or modify data within an AWS resource. Examples include reading or writing an S3 object, invoking a Lambda function, or reading an item from a DynamoDB table. By default, data events are not logged because they are high-volume operations.
CloudTrail Insights events help detect unusual activity in your AWS account, such as inaccurate resource provisioning, hitting service limits, and bursts of AWS IAM actions. CloudTrail generates them by analyzing your management events and flagging deviations from the account's normal patterns.
Note: By default, CloudTrail keeps events in its event history for 90 days. To keep them beyond this period, create a trail that delivers them to an S3 bucket, and then use Amazon Athena when you want to analyze them.
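To show what a CloudTrail event looks like in practice, here is a short Python sketch that pulls the who, when and what out of a single record. The sample record is made up for illustration (the user, account ID and IP address are placeholders), although the field names are ones CloudTrail records use.

```python
import json

# A trimmed, made-up CloudTrail record; real records carry more fields.
SAMPLE_RECORD = json.loads("""
{
  "eventTime": "2023-04-01T02:13:45Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "TerminateInstances",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "alice",
    "arn": "arn:aws:iam::123456789012:user/alice"
  },
  "readOnly": false,
  "eventCategory": "Management"
}
""")


def summarize(record):
    """Extract who did what, when and from where out of one record."""
    identity = record.get("userIdentity", {})
    # Prefer the ARN; fall back to the identity type if no ARN is present.
    who = identity.get("arn") or identity.get("type", "unknown")
    return {
        "who": who,
        "when": record["eventTime"],
        "what": f'{record["eventSource"]}:{record["eventName"]}',
        "from": record.get("sourceIPAddress"),
        "category": record.get("eventCategory", "unknown"),
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE_RECORD).items():
        print(f"{key}: {value}")
```

The "eventCategory" field tells you whether a record is a management, data or Insights event, which is handy when filtering log files delivered to S3.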
AWS Config
AWS Config is a service that enables you to track changes and monitor the compliance of your AWS resources. It automatically collects and manages configuration data for your resources over time, giving you a detailed inventory of the resources and their current configuration state along with a comprehensive view of configuration changes and compliance status. It can also be used to track resource relationships and dependencies and to view resource configuration history.
AWS Config allows you to define Config rules that evaluate your resources against specific compliance policies. Config continuously evaluates the configuration settings of your resources against the rules you have defined. If a resource is found to be non-compliant with a rule, Config generates an evaluation result that flags the issue. It can also send notifications and, if you configure them, run automated remediation actions, helping you maintain a secure and compliant AWS environment. Config rules are either AWS-managed or custom. AWS-managed rules are pre-configured, so you can simply activate them to monitor the configuration settings of your resources. Custom rules, on the other hand, are rules you write yourself, typically as AWS Lambda functions.
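A custom rule backed by Lambda boils down to a function that receives a configuration item and reports a compliance verdict back to Config. The Python sketch below checks EC2 instances against an allowed list of instance types. The allowed list is a hypothetical policy invented for this example, and the handler assumes the rule is triggered by configuration changes.

```python
import json

# Hypothetical policy: only these instance types count as compliant.
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small"}


def evaluate_compliance(configuration_item):
    """Return the compliance verdict for one configuration item."""
    if configuration_item["resourceType"] != "AWS::EC2::Instance":
        return "NOT_APPLICABLE"
    instance_type = configuration_item["configuration"]["instanceType"]
    if instance_type in ALLOWED_INSTANCE_TYPES:
        return "COMPLIANT"
    return "NON_COMPLIANT"


def lambda_handler(event, context):
    """Entry point that AWS Config invokes when a resource changes."""
    import boto3  # available in the Lambda runtime; imported lazily here

    # Config passes the changed resource as a JSON string.
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]

    boto3.client("config").put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": item["resourceType"],
                "ComplianceResourceId": item["resourceId"],
                "ComplianceType": evaluate_compliance(item),
                "OrderingTimestamp": item["configurationItemCaptureTime"],
            }
        ],
        ResultToken=event["resultToken"],
    )


if __name__ == "__main__":
    sample = {
        "resourceType": "AWS::EC2::Instance",
        "configuration": {"instanceType": "m5.large"},
    }
    print(evaluate_compliance(sample))  # NON_COMPLIANT
```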
Final Thoughts
Always remember to incorporate some form of monitoring and auditing when you are designing cloud solutions for your business or customers, because you don't want to be left without a clue about what is happening when something goes wrong with your applications or resources. Enabling monitoring and auditing in your AWS environment will let you sleep better at night, so to speak. I have written a lot of articles on various AWS services and concepts. Below are links to some of them if you'd like to read them.