piątek, 27 grudnia 2024

CloudWatch - integration with Spring application

This document is a quick summary of the all steps that you need to do to have a logs in CloudWatch in a Spring Application. I won't go into detail about each step here because I want to gather all of that different areas needed to accomplish wider goal and I want this manual to be concise and short.

📌 #1 IAM role

IAM policy CloudWatchLogsFullAccess needs to be added to IAM Role attached to your ec2.

📌 #2 Spring and Logback

At the very beginning you need as Maven Dependency the Logback library:

  
  <dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.5.6</version>
</dependency>

Then you have to add Logback configuration file in this directory src/main/resources/logback.xml. Below there is an example file content:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <include resource="org/springframework/boot/logging/logback/base.xml"/>

    <property name="LOGS" value="./logs" />

    <appender name="Console"
              class="ch.qos.logback.core.ConsoleAppender">
        <layout class="ch.qos.logback.classic.PatternLayout">
            <Pattern>
                %black(%d{ISO8601}) %highlight(%-5level) [%blue(%t)] %yellow(%C{1}): %msg%n%throwable
            </Pattern>
        </layout>
    </appender>

    <appender name="RollingFile"
              class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOGS}/your-application-standard-logger.log</file>
        <encoder
                class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <Pattern>%d %p %C{1} [%t] %m%n</Pattern>
        </encoder>

        <rollingPolicy
                class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOGS}/archived/your-application-standard-logger-%d{yyyy-MM-dd}.%i.log
            </fileNamePattern>
            <timeBasedFileNamingAndTriggeringPolicy
                    class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>10MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>
    </appender>

    <root level="info">
        <appender-ref ref="RollingFile" />
        <appender-ref ref="Console" />
    </root>

    <logger name="com.your.application" level="trace" additivity="false">
        <appender-ref ref="RollingFile" />
        <appender-ref ref="Console" />
    </logger>

</configuration>

📌 #3 Running Application Container

Container with your Spring Application needs to be started with special configuration - you need to map logs inside the container to outside scope which is some ec2 directory:

docker run -d -v /path-in-the-ec2/logs:/logs <<IMAGE>>

To see the logs in the Docker Container, go into the running container and open the log file:

docker exec -it <container_id> bash

cat logs/your-application-logger.log

On the other hand you can see the logs in the ec2 like that:

tail -50 /path-in-the-ec2/logs/your-application-standard-logger.log

📌 #4 CloudWatch Agent

You need to install CloudWatch Agent in your ec2 (following command for the Amazon Linux):

sudo yum install amazon-cloudwatch-agent

The set up Agent configuration file using command sudo nano /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json and an example content of that file that I've proposed:

{
    "agent": {
        "region": "your-region"
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/path-in-the-ec2/logs/your-application-standard-logger.log",
                        "log_group_name": "your-application-group",
                        "log_stream_name": "your-application-stream",
                        "timestamp_format": "%Y-%m-%d %H:%M:%S.%f"
                    }
                ]
            }
        }
    },
    "force_flush_interval": 15
}

Then you can start CloudWatch Agent with appended configuration file:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a append-config -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json

To check the status of the Agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status

And you can stop the Agent manually:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a stop

📌 #5 Summary

After that you should go into AWS -> CloudWatch -> Log Groups and see new Log Group. All described steps as diagram below:

niedziela, 8 grudnia 2024

CloudWatch Insights - monitoring async workers traffic distribution

Symfony Framework Messenger allows to run parts of your system's processes asynchronously. To gain a better understanding of how your application really works, you can use the following queries.

These queries filter all logs that comes from Symfony Messenger Workers and aggregate them to provide broader perspective. These also prepare a summary of all asynchronous execution units in your system, giving you greater awareness of the overall complexity and the building blocks that make up the whole.

So, at the very beginning let me explain the data from the logs in my application. This way you can apply these queries to your system, keeping in mind that my log structures might differ partially or entirely.

message - with value MESSENGER_WORKER_MESSAGE_FINISH indicates Symfony Messenger Message that was successfully handled,
extra.messengerWorker.class - event/command class that was handled by Worker
extra.messengerWorker.executionId - represents each message that was handled by Messenger Worker
extra.requestDuration - whole Worker duration,
context.messengerWorker.processingDuration - execution of a single Message duration
context.messengerWorker.memoryUsageMb - whole Worker memory usage

📊 List all unique Workers

Thanks to that query you can easily check how many different async-workers do you have in your system. dedup function helps a lot with removing duplicated message class names.

fields @timestamp, @message, @logStream, @log
| filter message = "MESSENGER_WORKER_MESSAGE_FINISH"
| display extra.messengerWorker.class 
| dedup extra.messengerWorker.class

All unique Messenger events/commands that are handel asynchronously.
This summary gathers all async execution units (that have worked in a given period of time)
of you application in one place.

📊 Reveal traffic distribution per Worker

That query shows the busiest workers in your system.

fields @timestamp, @message, @logStream, @log
| filter message = "MESSENGER_WORKER_MESSAGE_FINISH"
| stats count(*) as executionCount by extra.messengerWorker.class  
| sort by executionCount desc

It's a simple list which execution count for each individual worker that was handled in a given period of time.

The wider time period you'll set for the query the more (probably) unique worker type you might get as a results.

Changing the time period of the query to only one day would reveal less worker types
but returned metrics might be more useful.

📊 More detailed Workers data

Each row symbolizes a single worker per instances, how many messages they handled, when the first and last messages were handled, and, of course, the message type they handled. Take look at interesting usage of latest(fieldName) function which allows to present not aggregated data in this summary.

fields @timestamp, @message, @logStream, @log
| parse @logStream "/php/*" as log_stream_id
| filter message = "MESSENGER_WORKER_MESSAGE_FINISH"
| stats latest(substr(log_stream_id, 0, 6)) as instance,
        count(extra.messengerWorker.executionId) as executions, 
        min(@timestamp) as start,
        max(@timestamp) as end,
        max(extra.requestDuration / 60) as duration_in_minutes,
        latest(extra.messengerWorker.class) as message_class
  by extra.executionId as worker_exectuion_id

You can use that query when you need to check a specific worker performance.

📊 The longest Workers executions

To identify the longest workers execution in you application just use the query bellow. These spots could be the first candidates for optimization or potential bottlenecks that slow down your application.

fields @timestamp, @message, @logStream, @log
| filter message like /(?i)(MESSENGER_WORKER_MESSAGE_FINISH)/
| display datetime, 
          context.messengerWorker.processingDuration / 60,
          context.messengerWorker.memoryUsageMb,
          extra.messengerWorker.class
| sort context.messengerWorker.processingDuration desc

niedziela, 1 grudnia 2024

CloudWatch Insights - revealing endpoint's clients

I'll show you how to list all the clients of your's application HTTP endpoints, how large is the traffic they generate and how it's exactly distributed in a given period of time. That few queries below could easily reveal the knowledge about application dependencies (with a minimal effort on your end) that you might be not aware of.

At the very beginning I'm gonna show you how the log structure looks like in my case. It may differ how it would look like in your application but what's matter is to highlighting a values that you need to achieve our goal to find out who is calling your application endpoints.

JSON structure

User Agent of the client - in my case it's passed in request headers as key user-agent which needs to be parsed first because even that key is auto-discovered by CloudWatch it contains a dash (-) in a key name so cannot be used in a query,
value that distinguish Http Request Log - message,
Http Request Method - context.method,
Http Request Name - context.route - we cannot use exact URI that was called by client because the same URIs may differ when endpoint has param inside endpoint path e.g. {{id}} of the resource it refers to - which makes logs aggregation not possible,
Caller IP - context.clientIp,

📊 List all the Clients per endpoint

fields @timestamp, @message, @logStream, @log
| parse @message '"user-agent":["*"]' as userAgent
| filter message = "HTTP_REQUEST" 
| display context.method, 
          context.route, 
          userAgent, 
          context.clientIp
| sort context.method, 
       context.route, 
       userAgent asc
| dedup context.route, 
        context.clientIp

📊 Count all requests done by each Client

fields @timestamp, @message, @logStream, @log
| parse @message '"user-agent":["*"]' as userAgent
| filter message = "HTTP_REQUEST" and
         ispresent(userAgent)
| stats count(userAgent) as endpointCallByExternalService by userAgent
| sort endpointCallByExternalService desc

📊 Traffic distribution in a time per demand Client

Based on results above you need to manually parse each Client that you want to see in this chart and add it in the stats section to finally display it.

fields @timestamp, @message, @logStream, @log
| parse @message '"user-agent":["*"]' as userAgent
| parse userAgent /(?<isGo>Go-http-client?)/
| parse userAgent /(?<isSymfony>Symfony?)/
| parse userAgent /(?<isPostman>PostmanRuntime?)/
| parse userAgent /(?<isJava>Java?)/
| parse userAgent /(?<isPython>python-requests?)/
| parse userAgent /(?<isRuby>Ruby?)/
| filter message = "HTTP_REQUEST" and
         ispresent(userAgent)
| stats count(isGo),
        count(isSymfony),
        count(isPostman),
        count(isJava),
        count(isPython),
        count(isRuby)
  by bin(1h)

Conclusion

Since you got the new perspective and knowledge of the system that you're maintain, you can start asking questions about the things that are still unrevealed:

does one User Agent is represented by two or more systems (e.g. two application implemented in Java)?
which of the User Agents are internal system and which one are external?
why these are needing date from your application?
why traffic distribution looks exactly like this?

sobota, 30 listopada 2024

CloudWatch Insights - most used HTTP endpoints

Having a standard PHP Symfony Framework log structure like one below you can measure whole application HTTP traffic using CloudWatch Insights.

{
   "message":"Matched route \"some_endpoint_name\",
   "context":{
      "route":"some_endpoint_name",
      "request_uri":"http://some.domain.com/some/endpoint/91b5602d-f098-471a-aa05-92937fea3636/name",
      "method":"POST"
   },
   "level":200,
   "level_name":"INFO",
   "channel":"request",
   "datetime":"2024-11-30T10:26:08.594527+00:00",
}

As you can see context.request_uri key it's hard to aggregate since each request for the same endpoint will differ due to passed in URI params like UUID of the resource. To avoid mentioned problem we should use rather a context.route value which is declared in Symfony Controller class as one of the value Action Attribute.

Here is example of the CloudWatch Insights query which show in descending order all of the most used POST/PATCH/PUT/DELETE (these are changing state of the system) endpoints in your application:

fields @timestamp, @message, @logStream, @log
| filter channel = "request" and context.method != "GET" 
| stats count(*) as endpointsCallsCount by context.route, context.method
| sort endpointsCallsCount desc

The standard output of the query, it's descending order easily shows the most used endpoints in your application

You can also see results as chart so you can compare all endpoints share in total HTTP traffic volume

For monitor all GET endpoints worth to filter non functional endpoints call that only checks an availability of the system:

fields @timestamp, @message, @logStream, @log
| filter channel = "request" and context.method = "GET" and context.route not in ["health_check","ping"]
| stats count(*) as endpointsCallsCount by context.route, context.method
| sort endpointsCallsCount desc

piątek, 8 listopada 2024

Observability

Tradycyjne metody monitoringu przestały być wystarczające w dobie złożonych systemów - wymagane jest bardziej wnikliwe zrozumienie systemów oraz przyśpieszanie rozwiązywania incydentów.

Sama implementacja systemów obserwowalnych i ich utrzymanie rodzi nowe problemy. Zewnętrzne systemy służą do obserwowania aplikacji; kwestionowanie w celu poznania wewnętrznej pracy systemu i stanu systemu.

Jeżeli mamy możliwość pozyskania informacji na temat stanu aplikacji w każdym jej aspekcie nawet z którymi byliśmy nie zaznajomieni jeszcze jakiś czas temu, których nie przewidzieliśmy - a przyszły nam one dopiero teraz i mamy możliwość zweryfikowania tychże danych. Oznacza to, że wskaźnik Observability systemu jest wysoki.

Stale musimy usprawniać proces zwiększania wskaźnika Observability; dzięki wysokiemu Observability możemy wypatrywać niecodziennych/podejrzanych wzorców i zachowań; pozwala na analizę interakcji użytkownika z systemem; dzięki Observability mamy wgląd w dynamikę komunikacji między mikro serwisami/kontenerami; taka analiza powinna być standardowym elementem pracy programistów (development life-cycle). Observability daje możliwość wglądu w zachowanie systemu, dzięki tym informacją deweloperzy mogą poprawiać niezawodność/wydajność systemu; analiza logów, metryk, trace’ów pozwala na zidentyfikowanie bottleneck’ów wydajnościowych;

Kluczowe koncepty:

Root cause analysis
Highly observable system (has intricate details/critical insights)
Realtime monitoring/alerting
resource utilization
error rates
Synthetic Journeys
performance metrics
deviations from normal patterns
APM (Application Performance Monitoring)
application dependencies,
Distributed tracing - złożone systemy gdzie pojedynczy request oddziałuje na wiele mikro serwisów w różnych data centers - taki trace ma swój ID
Telemetry Instrumentations (Open Telemetry Standard) event wysyłany do central location (tracking user journey; troubleshooting errors)
Site Reliability Engineering (SRE)
feature flagging
incident analysis
blue-green deployment
chaos engineering; “pytania które zostaną postawione bez wcześniejszej wiedzy”
alert ➡️ wartość progowa predefiniowanej metryki została przekroczona; remediations (środki zaradcze)
podejście reaktywne: zidentyfikowanie i rozwiązanie problemu po tym jak wystąpi
podejście proaktywne

Observability pomaga zrozumieć wewnętrzne zachowanie systemu co może wyłonić potencjalne problemy które będą mieć miejsce w przyszłości.

Alert musi mieć dane dot. powodu jego wystąpienia (moje doświadczenie: miejsce wystąpienia, dane kontekstowe - zasobu którego dotyczy).

Tradycyjny monitoring i dashboardy polegają na wiedzy Seniora (dependency on human expertise) - czyli metodologia (tradycyjny monitoring) polegająca na objawach aniżeli na actual Root Cause - to nie może być dłużej stosowane gdy złożoność i skala jest duża; “Information to debug issues in details”, “ask open questions”, “trace the system to find real cause of problems (deeply hidden)”; organizacja nie polega na wiedzy eksperta i na subiektywnym zgadywaniu, a prowadzi do bardziej obiektywnej analizy; Złożone interakcje między systemami rozproszonymi; metrics, events, logs, traces, telemetry data - unforeseen issues. Szybkie rozwiązanie problemu to ograniczenie down-time; identify & solve potential issue before they affect users - w przeciwieństwie do reagowania na problem; Problemy z Observability: przechowywanie tych danych, przesyłanie tych danych po sieci; zmiana sposobu myślenia z re na pro; observability kosztuje; security & privacy;

Źródło

Developer Notes