AWS Exam Notes

Notes from the Stephane Maarek Udemy course

IAM Notes

Concepts

.                    
.             Groups          Identity-Policy
.
.      IAM-User    Principal  
.
.             Role            Session-Policy    STS               
.
.            Resources        Resource-ACLs     
.
.            Actions          Resource-Policies
.
.                             Permission-Boundaries    Trust-Policy 
.
.                             SCP        
.
.          SSO SAML OIDC      IAM-Identity-Center
.
.          Federated User
.

Important Concepts

IAM User

  • Users have long-term credentials: a password (for the console) and access keys (for programmatic access).
  • A user is associated with:
    • Identity-based policy (one or more)
    • Permissions boundary (optional)

IAM Role

  • An IAM role is intended to be assumed (temporarily) by any trusted principal that needs it.
  • A Role is associated with:
      1. Identity based policy (one or more)
      1. Trust Policy and
      1. Permissions boundary. (optional)
  • A role has no permanent password or access key (unlike an IAM user). You get temporary credentials by assuming the role via STS (Security Token Service).
  • To assume a role, you need permission granted to you by the role's owner. For example:
    • An admin allows an IAM user to assume role xyz by attaching a policy to the user with action sts:AssumeRole on the role resource.
    • To allow the EC2 service to assume your role, you attach "Amazon EC2 Service" as a "Trusted Entity" for that role, so EC2 instances can assume this role.
    • To allow an IAM user in another account to assume your role, the admin of the trusting account must specify the trusted account number as the Principal in the role's trust policy. The admin of the trusted account must then give specific groups or users permission to assume that role.
  • Note: EC2 uses the instance metadata service to detect the associated instance role and lets the instance assume it.
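
The first example above (admin lets a user assume role xyz) can be sketched as an identity-based policy attached to the user (account ID and role name are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::123456789012:role/xyz"
    }
  ]
}
```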

IAM Group

  • IAM users can be assigned to one or more IAM groups.
  • Identity-based policies can be attached to IAM groups.
  • This makes policy management easier.
  • Belonging to multiple groups means being covered by multiple policies, which usually means more permissions. But if any policy contains an explicit Deny, that takes precedence. E.g. if user john belongs to both dev and prod, he may gain the union of both policies' permissions if neither contains a Deny; otherwise a Deny in either policy overrides the corresponding Allow.
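
For example, a single statement like the following, attached via any of the user's groups, denies the action regardless of Allows picked up elsewhere (bucket name is illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::prod-bucket/*"
    }
  ]
}
```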

IAM Policy

Permissions start from an implicit deny; only explicit Allow statements grant access, and an explicit Deny always takes precedence.

  • Policies could be:
    • AWS Managed
    • Customer Managed
    • Inline Policies

IAM Policy Types

  1. Identity-based policies:
  • Attach managed and inline policies to IAM identities (users, groups or roles).
  • Grants permissions to an identity.
  • The keywords in the policy are: Effect: Allow | Deny, Action, Resource.
  • The user to which this policy is applied is the implicit Principal.
  1. Resource-based policies:
  • Attach inline policies to specific resources (no managed policies), e.g. S3 buckets, SQS queues, VPC endpoints.
  • Not all services support resource level policies for their resources.
  • IAM role trust policy is a kind of Resource based policy (resource being Role)
  • Grants permissions to the principal that is specified in the policy.
  • Principals can be in the same account as the resource or in other accounts.
  • Principal can be User, Role, Group, Service or AnonymousUser.
  • The keywords in the policy are: Effect, Action, Resource, "Principal", Condition.
  • The resource being attached is the implicit Resource for the policy.
  • If the "Resource" keyword is present, it must not conflict with the attached resource; i.e. it should be * or the resource itself.
  1. Permissions boundaries:
  • Defines the maximum permissions but does not grant permissions.
  • Use a managed policy as permissions boundary for an IAM entity (user or role).
  • Permissions boundaries do not limit permissions granted by a resource-based policy.
  • Note: e.g. a resource-based S3 policy can allow access even though the user's permissions policy does not. Of course, there must be no explicit Deny in the user policy.
  1. Organizations SCP - Service Control Policy:
  • Use an AWS Organizations service control policy (SCP) to define the max permissions.

  • You can attach an SCP to any of the following:

      1. Organization root - does not restrict the management account, but applies to all child OUs and accounts.
      1. Organizational unit (OU) - applies to the OU and its member accounts.
      1. Member account.
  • SCPs limit permissions that identity-based policies or resource-based policies grant to entities (users or roles) within the account, but do not grant permissions.

  • Note: an SCP caps an entire account, so it is stricter in scope than a permissions boundary applied to a single IAM identity.

  1. Access control lists (ACLs):
  • Use ACLs to control which principals in other accounts can access the attached resource. ACLs are similar to resource-based policies but use a non-JSON format and apply to cross-account access only; an ACL cannot grant permissions to entities in the same account.
  • Note: S3 ACL can be attached to both bucket and objects.
  1. Session policies:
  • An inline policy passed at session creation that caps the maximum permissions of a role session or federated user session.
  • Session policies limit permissions for the created session, but do not grant permissions.
  • Applies only to dynamically created sessions: either (1) a role session or (2) a federated user session.
  • A role session is created by the AssumeRole* APIs, optionally with an inline session policy JSON document and up to 10 managed session policy ARNs. Can be invoked by an IAM user or from an existing role session. Note that the session policy is optional.
  • A federated user session is created using the GetFederationToken API. This can only be invoked by an IAM user, not from a role session. At least one session policy should be passed; without one the session has no effective permissions.
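
As a sketch of a resource-based policy (type 2 above): an S3 bucket policy granting a principal in another account read access (account ID, user, and bucket names are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::987654321098:user/some-user" },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-example-bucket/*"
    }
  ]
}
```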

Trust Policy

  • Trust Policy specifies the trusted entities or services that are allowed to assume the given role.
  • Trust Policy is a resource-based Policy for IAM role.
  • Usually a role acts as a principal, but in this case the role is the resource and the principal is an IAM user, another account, or a service.
  • Note: A Role is associated with:
      1. Identity based policy
      1. Trust Policy and
      1. Permissions boundary.

The following Trust Policy allows 2 services to assume the associated role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [ "elasticmapreduce.amazonaws.com", "datapipeline.amazonaws.com" ]
        /* "AWS": "arn:aws:iam::987654321098:root" to allow external account */
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Principal

Principal mainly refers to an "Actor" as opposed to "resource" being acted upon:

.                                  Action
.            Principal   ---------------------->   Resource
.
.      { User                      }
.      { Role or Service           }
.      { Role Sessions             }
.      { Federated User Sessions   }
.      { AnonymousUser             }

It is explicitly referenced from a "Resource Policy".

A group only aggregates permissions and does not represent an identity, so it is not allowed as a Principal.

There is a subtle difference between the Principal being a Role or a Role-Session:

/* Role: */
"Principal": { "AWS": "arn:aws:iam::AWS-account-ID:role/role-name" }    /* Role */

/* Assumed-role session */
"Principal": { "AWS": "arn:aws:sts::AWS-account-ID:assumed-role/role-name/role-session-name" }

When you specify just the role, all role sessions based on that role qualify. When you specify a role session, only that particular session based on that role qualifies.

Other example Principals:

/* Federated OIDC Provider Principal. Works with OAuth. IdP is OIDC  */
"Principal": { "Federated": "accounts.google.com" }
"Principal": { "Federated": "cognito-identity.amazonaws.com" }

/* Assumed Role session principal for AssumeRoleWithWebIdentity is similar to AssumeRole. */
"Principal": { "AWS": "arn:aws:sts::AWS-account-ID:assumed-role/role-name/role-session-name" }

/* Federated SAML Provider Principal IdP is SAML. e.g. Active Directory */
"Principal": { "Federated": "arn:aws:iam::AWS-account-ID:saml-provider/provider-name" }

/* Assumed role session Principal for SAML is similar but with no session name */
"Principal": { "AWS": "arn:aws:sts::AWS-account-ID:assumed-role/role-name" }  /* SAML Session */

"Principal": { "AWS": "arn:aws:iam::AWS-account-ID:user/user-name" }  /* Regular IAM User */

/* STS Federated User session. GetFederationToken API. 
 * IAM Center multi-account permissions.
*/
"Principal": { "AWS": "arn:aws:sts::AWS-account-ID:federated-user/user-name" }

/* Service Principal. */
"Principal": { "Service": "s3.amazonaws.com" }
/* Region name may be required if cross region access is involved */
"Principal": { "Service": "s3.us-east-1.amazonaws.com" }

IAM Policy Permissions Evaluation

Consider:

Resource Policy Permissions =>  Resource
User Policy Permissions =>  User
Role Policy Permissions =>  Role
(Identity based Permission = User or Role)
Session Policy Max Permission = Inline Json Policy / Managed Session Policies = Session

Scenarios:

IAM User Access to any Resource. :=  (Resource + User) Permissions
AssumeRole User Access to any Resource. :=  (Resource + Role) And Max Session Perms
Federated User Access to any Resource. :=  (Resource + User) And Max Session Perms

If Permission Boundary Exists for User or Role := Above Permission And Max Perm Boundary

If SCP (Service Control Policy) Exists := Any scenario is limited by max SCP Perms.
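
The scenario formulas above can be sketched as set operations. This is a simplified illustration (function and variable names are mine, not AWS APIs): identity- and resource-based allows add up, while session policies, boundaries, and SCPs only cap, and explicit Deny always wins.

```python
def effective_permissions(identity=frozenset(), resource=frozenset(),
                          explicit_deny=frozenset(), session=None,
                          boundary=None, scp=None):
    """Simplified model: union the identity- and resource-based allows,
    cap with each guardrail (session policy, permissions boundary, SCP)
    if present, then remove anything explicitly denied."""
    allowed = set(identity) | set(resource)
    for cap in (session, boundary, scp):
        if cap is not None:
            allowed &= set(cap)          # caps limit but never grant
    return allowed - set(explicit_deny)  # explicit Deny always wins

# IAM user access: (Resource + User) permissions.
print(sorted(effective_permissions(identity={"s3:GetObject"},
                                   resource={"s3:PutObject"})))
# -> ['s3:GetObject', 's3:PutObject']

# AssumeRole user access: (Resource + Role) AND max session perms.
print(sorted(effective_permissions(identity={"s3:GetObject", "s3:PutObject"},
                                   session={"s3:GetObject"})))
# -> ['s3:GetObject']
```

Note this sketch intersects every cap with everything; as noted earlier, a real permissions boundary does not limit grants from resource-based policies.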

IAM Policy Tips

  • Here is an admin identity-based policy:

    {
       "Version" : "2012-10-17",
       "Statement" : [
         {
           "Effect" : "Allow",
           "Action" : "*",
           "Resource" : "*"
         }
       ]
    }
    
  • Policies can contain variables, e.g. ${aws:username}; AWS-specific keys like aws:CurrentTime; service-specific keys like s3:prefix; and tag-based keys such as iam:ResourceTag/key-name, aws:PrincipalTag/key-name, ...
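
A common sketch of variable use: each user gets access only to their own prefix in a shared bucket (bucket name and prefix layout are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::shared-bucket/home/${aws:username}/*"
    }
  ]
}
```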

Service Linked Roles

.
.     A Service-Linked Role is not the same as a Service Role.
.
  • A service role is an ordinary role whose trusted entity is the specified service.
  • A service-linked role is a special role predefined by and linked directly to a specific AWS service.

# To create a service-linked role:
aws iam create-service-linked-role --aws-service-name SERVICE-NAME.amazonaws.com ...

# To create a service role with a trust policy:
aws iam create-role --role-name Test-Role --assume-role-policy-document file://Test-Role-Trust-Policy.json

  • You only need the "iam:CreateRole" action permission to create a service role.

  • To create a service-linked role, you need the "iam:CreateServiceLinkedRole" permission:

    arn:aws:iam::*:role/SERVICE-ROLE-NAME     # This is Service Role. Below is Linked Role!
    arn:aws:iam::*:role/aws-service-role/SERVICE-NAME.amazonaws.com/LINKED-ROLE-NAME-PREFIX*
    
  • Some services support multiple service roles.

  • The linked service also defines how you create, modify, and delete a service-linked role.

  • A service might automatically create or delete the role.

  • It might allow you to create, modify, or delete the role as part of a wizard or process in the service.

  • Or it might require that you use IAM to create or delete the role.

  • Regardless of the method, service-linked roles simplify the process of setting up a service.
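
The create permission above can be sketched as follows (service name and role-name prefix are illustrative; iam:AWSServiceName is the standard condition key):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling*",
      "Condition": {
        "StringLike": { "iam:AWSServiceName": "autoscaling.amazonaws.com" }
      }
    }
  ]
}
```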

STS - Security Token Service

Concepts:

STS  AssumeRole 
Federation 
Role-Session 
Session-Policy              SAML2       OIDC

STS is useful in following scenarios:

  • when you want an IAM user in one AWS account to access another account, both owned by you.
  • AWS cross account access.
  • Provide access for certain AWS services to your resources. Examples ?
  • Grant access for externally authenticated users (may not be AWS user). Identity Federation.
  • Provide access to outside application (with no AWS credentials).

Federation with SAML 2.0 and OIDC and access control:

  • IAM allows you to use separate SAML 2.0 and OpenID Connect (OIDC) IdPs (outside AWS).
  • IAM allows you to use federated user attributes for access control.
  • With IAM, you can pass user attributes, such as cost center, title etc from your IdPs to AWS, and implement fine-grained access permissions based on these attributes. (Tag based permission policies)
--------------------------------------------------------------------------------------------
API                 Who can call and Comments
--------------------------------------------------------------------------------------------

AssumeRole          Caller: IAM user or IAM Role. (Role chaining allowed. Expires in 1 hr)
                    Session tags are transitive and persist after role chaining.
                    Optional Session Policy.

--------------------------------------------------------------------------------------------
AssumeRoleWithSAML  Caller: Any user; 

                    SAML setup with mutual trust must already be done with the SAML IdP.
                    The SAML IdP (like AD FS) issues the claims required by AWS via assertions.

                    App must pass a SAML assertion to STS to assume the preferred role.

                    Use Case:
                       Map (SAML) external users to AWS Role. No link to local IAM users.
                       Inline session policies can apply user specific restrictions.

--------------------------------------------------------------------------------------------
AssumeRoleWithWebIdentity 
                    Caller: Any user;

                    Must pass an OIDC or OAuth 2.0 compliant JWT token from a known IdP
                    You can make the IdP trust relationship with AWS. 
                        e.g. GitHub, accounts.google.com already trust AWS.
                    Role's trust policy should point to the external IdP.
                    Use aud condition in role trust policies to verify that the tokens used 
                    to assume roles are intended for that purpose e.g. AppName.

                    Use Case: 
                       Github actions to access AWS resources.
                       Map (OIDC) external users to AWS Role. No link to local IAM users.
                       Inline session policies can apply user specific restrictions.

--------------------------------------------------------------------------------------------
GetFederationToken  Caller: IAM user or AWS account root user. Not by role-session.

                    The resulting session can not call AssumeRole.
                    Supports session policy to restrict permissions.

                    Use Case: Grant proxy application limited temp credentials.
                              Application internal or federated users need credentials.
                              Can create this session on behalf of internal or federated user.
                              Different session policies per user.
--------------------------------------------------------------------------------------------
GetSessionToken     Caller: IAM user or AWS account root user. Not by role-session.
                    Session Policy not supported.
                    Use Case: Protect long term credentials and use temp credentials as
                              proxy to IAM user. 
--------------------------------------------------------------------------------------------

AssumeRole API

.
.                  AssumeRole                      AssumeRole
.  IAM User     ------------------> Role-Session ---------------> Role-Session
.     or                            [Policy]                      [Expires 1 hr]
.  Role Session                     [Tags]
.                                   [Session-Name]
.                                   [external-Id]
.
  • Session tags are transitive so they persist when roles are chained.

  • Optional session policies can further restrict permissions.

  • A role session with temporary credentials is valid for between 15 minutes and 12 hours.

  • With Role chaining max valid time is 1 hour only.

  • The role's trust policy can restrict who may assume the role based on tags, session name, or ExternalId.

  • The ExternalId is optional and is typically used to specify the application name.

  • For example, restricting by ExternalId (or a principal tag) in that policy:

    {
        ...
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        ...
        "Condition":   { "StringEquals": { "sts:ExternalId": "my-app-name" } }
                    // or { "StringEquals": { "aws:PrincipalTag/Dept": "HR" } }
    }
    
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/myRole \
  --role-session-name my-session \
  --tags Key=dept,Value=dev \
  --external-id my-app-name \
  --policy-arns <arn1>,<arn2>

Output:

{
    "AssumedRoleUser": {
        "AssumedRoleId": "AROA3XFRBF535PLBIFPI4:my-session",
        "Arn": "arn:aws:sts::123456789012:assumed-role/myRole/my-session"
    },
    "Credentials": {
        "SecretAccessKey": "9drTJvcXLB89EXAMPLELB8923FB892xMFI",
        "SessionToken": "...",
        "Expiration": "2016-03-15T00:05:07Z",
        "AccessKeyId": "ASIAJEXAMPLEXEG2JICEA"
    }
}

# Note: Principal Of Session: "arn:aws:sts::123456789012:assumed-role/myRole/my-session"
# 
# Session Policy is optional and can further restrict permissions.

# Role Policy can restrict permissions based on Tags and external-Id 
# (also by session name but difficult and not recommended)

AssumeRoleWithSAML API

.
.
.
.                  AssumeRoleWithSAML                    
.   IAM User     -------------------------> Role-Session 
.     or           [Session-Policy]            
.  Role Session    [Saml-Assertion]
.                  [Saml-Provider-ARN]
.
.   Note: Tags are passed through SAML assertion by IdP using PrincipalTag attribute.
.
.
.   Note: Role should Trust Saml-Provider Principal.
.

aws sts assume-role-with-saml \
 --role-arn arn:aws:iam::123456789012:role/my-role \
 --principal-arn arn:aws:iam::123456789012:saml-provider/my-onpremise-saml-Idp \
 --policy-arns arn1,arn2 \
 --saml-assertion "..."

Output:

{
    "Issuer": "https://my-onpremise.example.com/idp/shibboleth", # SAML Idp. e.g. On-premise AD
    "AssumedRoleUser": {
        "Arn": "arn:aws:sts::123456789012:assumed-role/my-role",
        "AssumedRoleId": "ARO456EXAMPLE789:my-role"              # Internal RoleId:rolename
    },
    "Credentials": {
        "AccessKeyId": "ASIAV3ZUEFP6EXAMPLE",
        "SecretAccessKey": "8P+SQvWIuLnKhh8d++jpw0nNmQRBZvNEXAMPLEKEY",
        "SessionToken": "...",
        "Expiration": "2019-11-01T20:26:47Z"
    },
    "Audience": "https://signin.aws.amazon.com/saml",  // Service Provider: i.e. AWS
    "SubjectType": "transient",
    "PackedPolicySize": "6",
    "NameQualifier": "SbdGOnUkh1i4+EXAMPLExL/jEvs=",
    "Subject": "my-onpremise-user-john"
}

AssumeRoleWithWebIdentity API

.
.
.
.     IAM User     AssumeRoleWithWebIdentity                    
.     Or Role   ------------------------------> Role-Session 
.                  [Session-Policy]            
.                  [Session-Name]            
.                  [Identity-Token]            
.                  [Provider-Id-like-google-cognito]
.
.   Note: Principal_Tags and Transitive_Tags are passed through claims in Token only.
.
.   Note: Cognito could be Web IdP but not a SAML IdP.

# --provider-id could also be google.com, facebook.com, cognito, etc.
aws sts assume-role-with-web-identity \
    --duration-seconds 3600 \
    --role-session-name "my-app-session" \
    --provider-id "www.amazon.com" \
    --policy-arns arn1,arn2 \
    --role-arn arn:aws:iam::123456789012:role/my-role-for-web \
    --web-identity-token "..."

Output:

{

    "SubjectFromWebIdentityToken": "amzn1.account.AF6RHO7KZU5XRVQJGXK6HB56KR2A",
    // Subject is Unique Id in provider. e.g. your-email in google.com

    "Audience": "client.5498841531868486423.1548@apps.example.com",
    // Audience is either service provider or client application that must be registered
    // with OIDC Idp for requesting login and return claims.

    "AssumedRoleUser": {
        "Arn": "arn:aws:sts::123456789012:assumed-role/my-role-for-web/my-app-session",
        "AssumedRoleId": "AROACLKWSDQRAOEXAMPLE:my-app-session"
    },
    "Credentials": {
        "AccessKeyId": "AKIAIOSFODNN7EXAMPLE",
        "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYzEXAMPLEKEY",
        "SessionToken": "...",
        "Expiration": "2020-05-19T18:06:10+00:00"
    },
    "Provider": "www.amazon.com"
}

# Note Assumed Role Principal Format:
#       "arn:aws:sts::123456789012:assumed-role/my-role-for-web/my-app-session
#

GetFederationToken API

  • You get Federated User Session. Nothing related to Roles.
  • You can pass arbitrary Federated User name!
.
.
.
.                  GetFederationToken                    
.     IAM User  ------------------------------> Federated User Session (Not a Role-Session)
.     only         [Federated-User-Name]            
.                  [Session-Policy]            
.                  [Tags]            
.                
.
.

aws sts get-federation-token \
    --name Bob \
    --policy file://myfile.json \
    --policy-arns arn=arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
    --duration-seconds 900 \
    --tags Key=dept,Value=dev

Output:

{
    "Credentials": {
        "AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
        "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "SessionToken": "...",
        "Expiration": "2023-12-20T02:06:07+00:00"
    },
    "FederatedUser": {
        "FederatedUserId": "111122223333:Bob",
        "Arn": "arn:aws:sts::111122223333:federated-user/Bob"
    },
    "PackedPolicySize": 36
}

# Note Principal of the User session (does not include Caller IAM user info):
#       "Arn": "arn:aws:sts::111122223333:federated-user/Bob"

GetSessionToken API

  • Get temporary user session for calling user.
  • Used to protect permanent credentials.
  • Nothing related to Roles.
.
.
.
.                  GetSessionToken                    
.     IAM User  ------------------------------> Temporary User Session
.     only         [Duration-12hrs-Default]
.                
.

aws sts get-session-token --duration-seconds 900 \
                          --serial-number "YourMFADeviceSerialNumber" \
                          --token-code 123456
Output:

{
    "Credentials": {
        "AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
        "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYzEXAMPLEKEY",
        "SessionToken": "...",
        "Expiration": "2020-05-19T18:06:10+00:00"
    }
}

# Note principal ARN of session is same as the calling user's IAM user ARN.
#       "AWS": "arn:aws:iam::AWS-account-ID:user/user-name"

Federation Notes

SAML: Security Assertion Markup Language

  • SAML 2.0 (version 2.0 of the Security Assertion Markup Language) is an open standard that allows organizations to set up single sign-on (SSO).
  • Allows a user to authenticate once and gain access to multiple systems, by providing proof of prior authentication.
  • Enables the transfer of information about identity, attributes, and authorization data. SAML Assertions contains these data.
  • Authentication can be initiated by the IdP or the SP, but is ultimately performed by the IdP only.
  • AWS's SSO implementation avoids repeated sign-ins to AWS and the external IdP (like AD FS), but at least one direct login with the IdP is required for SSO.
  • The user/app requests authentication from the IdP (like on-premise AD) and gets a SAML response. The app presents the SAML response (assertion) to the SP (AWS) and gets access. The assertion is signed with the IdP's certificate, asserting that it was issued by the IdP.
.
.                                        Assertions Protocol  Bindings
.
.                    Trusts
.     IdP       <----------------->       FederationServer   ServiceProvider(AWS)
.      ^                                                              ^
.      |  Auth                  Access Service                        |
.      +---------  User/App ------------------------------------------+
.

OIDC - Open ID Connect Authentication

OpenID Connect is an interoperable authentication protocol based on OAuth 2.0 framework.

Use Cases:

  • Just for user authentication.
  • Social login: many web and mobile applications use OpenID Connect to allow users to log in with their existing accounts from providers like Google, Microsoft, and Facebook.
  • Single sign-on: log in once and gain access to multiple applications.
  • OAuth 2.0 defines only an authorization protocol focused on issuing access tokens, and treats authentication as a separate concern.
  • OIDC is an extension of OAuth 2.0 that adds authentication (ID tokens) on top of authorization.

Concepts :

.   OP      - OpenID Provider or IdP
.   Client  - Client Software (Must be registered with OP)
.   User    - User uses the Client to initiate authentication with OP.
.   RP      - Relying Party (aka SP - Service Provider e.g. AWS or Web Application)
.   

Flow:

.
.   OP(IdP)    ----->  Client (WebApp)  -----> RP/SP (AWS)
.                         User
.

Technology :

.
.   Protocols:  Discovery  Dynamic-Client-Registration
.               Session Management 
.      --------------------------------------------
.
.  OAuth2.0: Core  Bearer  Assertions JWT-Profile 
.
  • The identity token issued by the IdP contains the following information:
    • the user (identified by the sub, aka subject, claim) and attributes like name and email.
    • when and how the user was authenticated.
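
A decoded ID token payload might look like this (all values are illustrative; the claim names are standard OIDC):

```json
{
  "iss": "https://accounts.google.com",
  "sub": "110169484474386276334",
  "aud": "my-client-id.apps.example.com",
  "iat": 1700000000,
  "exp": 1700003600,
  "auth_time": 1699999990,
  "email": "user@example.com",
  "name": "Jane Doe"
}
```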

Various Active Directory based Services

  • Microsoft AD - provided by Microsoft to manage a single domain on-premises. The AWS alternative is AWS Managed Microsoft AD. AWS also provides a Simple AD service which offers a subset of these features.
  • Microsoft AD FS - used for federating multiple AD domains on-premises. AWS has no direct alternative to this.
  • Microsoft Azure AD FS - cloud-based Microsoft AD FS solution. AWS can integrate with this as it does with on-premises Microsoft AD.
  • AWS Directory Service is an AWS service that provides these 3 solutions: Managed Microsoft AD, AD Connector, and Simple AD.

IAM Identity Center (aka AWS SSO)

  • IAM Identity Center is a service which is enabled at management account.

  • The target of integration for IAM Identity Center is multiple AWS accounts (SSO) or Applications (e.g. back-end for a mobile application).

  • It can be used for SSO or just federated identity broker for your mobile or web application.

  • Identity Center is the recommended solution for integrating on-premise AD and SAML applications, but it is not required if you do not need SSO; you can also integrate with plain IAM.

  • One login (single sign-on) for all your:

    • AWS accounts in AWS orgs.
    • Business Cloud Applications (SalesForce, Microsoft 365)
    • SAML2.0 enabled applications (Active Directory, etc)
  • Identity Providers:

    • Built-in identity store in IAM Identity Center
    • 3rd Party: Active Directory, OneLogin, Okta, etc
  • AWS Directory Services:

    • AWS Managed Microsoft AD :

      |    auth                   Trust Mutual     AWS Managed     auth
      |   <----->  On-Prem-AD  <--------------->      MS AD      <------->
      |                                              
      |    Note: MFA Supported.
      
    • AD Connector: Proxy for on-premise AD in AWS :

      |                           Proxy                     auth
      |            On-Prem-AD  <------------ AD Connector <------->
      |                                              
      |    Note: MFA Supported.
      
    • Simple AD: AD Compatible managed directory on AWS :

      |                    auth
      |         Simple AD <------->
      |                 
      |    Note: MFA Not Supported.
      
  • IAM Identity Center can be configured with a TTI (Trusted Token Issuer) for integrating with OIDC IdPs, so that a token from the OIDC IdP can be exchanged for an AWS token.

  • Keycloak and SuperTokens are good open-source alternatives to Cognito.

  • IAM Identity Center does not support OIDC, only SAML, so it is not meant for mobile application integration; it targets business and browser-based integrations.

  • For integrating mobile applications alone, it is better to use Cognito identity pools with an external IdP instead of Identity Center; or use plain IAM with an external IdP if you do not need SSO.

Cognito

.
.  Identity-Pools  User-Pools  Hosted-UI  SignIn  Refresh-Tokens
.
.                                                                                      Renew Using 
.           SignIn/SignUp                       Idp Issues       Cognito Issues        Refresh Token -
.    User  --------------> Authenticate OR --->  IdToken   --->   AccessToken   ---->  New IdToken
.            HostedUI       Redirect IdP           JWT            (Short Lived)        New AccessToken
.            UserPool       (OIDC)                                RefreshToken              |
.                                                                 (Long Lived)              |
.                                        (AccessKey,                                        |
.                                         SecretKey,       <-----  Identity Pool <----------+
.                                         Session Token)              STS
.
.
.           Login to                  AssumeRoleWithWebIdentity           Temp AWS Credential
.    User -----------> IdentityToken ---------------------------->  (AccessKey, SecretKey, SessionToken) 
.           IdP                            STS
.
.
  • Cognito's support for the Hosted UI is poor and the documentation is confusing.
  • It does handle auto-refreshing of tokens and obtaining AWS credentials.
  • A user pool is needed if you want the Hosted UI, user management, and login management. It provides the IdP functionality.
  • An identity pool is needed if you want to convert an identity token into AWS access credentials, with proper automatic refreshing.
  • Useful for mobile or web integration.
  • Keycloak (by Redhat) is opensource alternative to Cognito. Others are Gluu, FusionAuth etc
  • To integrate Keycloak with AWS STS, you just use it as IdP and call AssumeRoleWithWebIdentity yourself. You also need to refresh the IdToken periodically using KeyCloak. KeyCloak also provides IdToken, AccessToken, RefreshToken just like Cognito.

Other IAM Notes

  • Use Access Analyzer

  • When you assume a role (as a user, app, or service), you give up your original permissions and take on the permissions of the role.

  • With a resource-based policy, the principal does not need to give up its original permissions.

  • IAM permissions boundaries: set the maximum permissions for an IAM entity. Can be used together with an AWS Organizations SCP (Service Control Policy) and identity-based policies (e.g. the original user permissions); the effective permission is the intersection.

  • AssumeRole is typically called by an IAM user, or by an externally authenticated user (SAML or OIDC) who is already using a role.

  • AssumeRole can also be chained. i.e. A role can assume another role.

  • Root user can not call AssumeRole.

  • Suppose you want user A to be unable to terminate EC2 instances by default, but allow the user to explicitly assume role R, which can terminate EC2 instances. You can protect that role with MFA if need be. It is like sudo in Unix: you must actively perform the "AssumeRole" operation, so you can't accidentally delete anything. Also, when assuming a role you lose your original privileges. All of this is audited via CloudTrail logs. :

    {User A}  ----AssumeRole-->{Role A}-->{Can Terminate EC2}
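
The MFA protection mentioned above can be sketched in the role's trust policy (account ID and user name are illustrative; aws:MultiFactorAuthPresent is a standard global condition key):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/A" },
      "Action": "sts:AssumeRole",
      "Condition": { "Bool": { "aws:MultiFactorAuthPresent": "true" } }
    }
  ]
}
```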
    
  • You can let services assume a role on your behalf. You create a role and make that service a trusted entity for the role. When you initiate an action like EC2 RunInstances, the iam:PassRole permission is used to pass the role to the service. Note that the service must also have AssumeRole permission:

    {User A} ---PassRole-->{EC2-Service}---AssumeRole-->
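
The user side of this flow can be sketched as an identity-based policy (role name is illustrative): the user needs iam:PassRole on the role in addition to the action that launches the resource:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:RunInstances",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/my-ec2-role"
    }
  ]
}
```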
    
  • For cross-account access, say from an origin account (Dev) to a destination account (Production): you create a role in the destination account and attach a "trust policy" that identifies the origin (Dev) account as a trusted entity. The Dev account admin must also allow selected IAM users to assume that remote role in the Production account; not all users in the trusted (Dev) account get access to the role in the trusting (Production) account:

    {   Production Account             }                            { Dev Account    }
    {   Role: UpdateProdS3             }  <---Req Access to Role--- { Group: Dev     }
    {   Admin allows remote Dev Group  }  -----STS Credentials----> {                }
    {   to assume UpdateProdS3 Role    }
    {     S3Bucket: ProductionS3       }
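A sketch of the trust policy on the UpdateProdS3 role in the Production account; 111111111111 is a placeholder Dev account ID. Remember that the Dev admin must separately grant sts:AssumeRole on this role's ARN to the selected Dev users or groups:

```shell
# Trust policy for UpdateProdS3: trusts the Dev account (placeholder ID).
cat > cross-account-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111111111111:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
python3 -m json.tool cross-account-trust.json > /dev/null && echo "policy OK"
# Live AWS only:
# aws iam create-role --role-name UpdateProdS3 \
#     --assume-role-policy-document file://cross-account-trust.json
```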
    
  • Providing 3rd Party Access:

    • Zone of Trust = Accounts, Organizations that you own.

    • For granting access to 3rd Party you need:

      • Third Party AWS Account Id
      • An External ID (it is not a secret key; it mitigates the confused-deputy problem and may be an application name or purpose)
  • SSO and SAML: Security Assertion Markup Language (SAML) is an authentication standard for federated identity management and can support single sign-on (SSO). SSO allows a user to log in with one ID/password to other federated software systems. SAML is an XML-based open standard for transferring identity data between two parties: an identity provider (IdP) and a service provider (SP), like a web application or SSO provider.

  • SAML supports both authentication and authorization, whereas OAuth is primarily for authorization.

  • STS Important APIs:

    • AssumeRole: access role within your account or cross-account.

    • AssumeRoleWithSAML: return credentials for users logged in with SAML:

      • Example SAML providers are Microsoft Active Directory Federation Services ADFS, etc.
    • AssumeRoleWithWebIdentity: return creds for users logged in with a web IdP:

      • Example IdP providers include Amazon Cognito, Facebook, Google or any OpenID Connect compatible identity provider.
    • GetSessionToken: for MFA, from a user or AWS account root user

    • GetFederationToken: obtain temp creds for a federated user. The calling IAM user's credentials
      are used as the basis. Unlike AssumeRole, there is no role involved here.
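Hedged CLI sketches of the main STS calls above. All ARNs, session names and token files are placeholders, and every command requires live AWS credentials:

```shell
# AssumeRole: within your account or cross-account.
aws sts assume-role --role-arn arn:aws:iam::123456789012:role/MyRole \
    --role-session-name demo-session

# AssumeRoleWithWebIdentity: exchange an OIDC token (e.g. from Cognito) for creds.
# file:// makes the CLI read the token value from a local file.
aws sts assume-role-with-web-identity --role-arn arn:aws:iam::123456789012:role/WebRole \
    --role-session-name web-session --web-identity-token file://id_token.jwt

# GetSessionToken: temporary creds for an MFA-authenticated IAM user.
aws sts get-session-token --serial-number <mfa-device-arn> --token-code 123456
```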

  • Notes on LDAP vs Active Directory:

    • LDAP (Lightweight Directory Access Protocol) is a protocol for interacting with directory servers.
    • Active Directory (AD) is one such proprietary directory server product.
    • AD provides a database and services for identity and access management (IAM).
    • A directory server stores information about users, organizations and such.
    • Active Directory is a directory service made by Microsoft. Others are Red Hat Directory Service, OpenLDAP, Apache Directory Server, etc.
  • Example SAML based SSO service providers are:

    • Microsoft Azure Active Directory (aka Entra ID), Okta, OneLogin, Auth0, IBM Security IAM, etc.
    • WSO2 is an On-premise and OpenSource SAML solution.
  • AWS IAM Identity Center (successor to AWS SSO) federation is the newer recommended method compared to the older SAML 2.0 federation.

  • Tagging of resources:

    • AWS tags allow you to attach metadata to most resources in the form of key-value pairs.
    • You tag the resources. You can use an SCP (Service Control Policy) to require that certain tags exist. A tag is a key=value pair (the value may be any string, e.g. env=prod).
    • The most frequently used tags in Amazon's EC2 service include: owner, team, environment, and stack and can be used in cost analysis, development automation, and application scaling.
    • Most services support tagging for the resources created by that service.
  • Root User Considerations:

    • You can not attach an identity policy to the root user.
    • You can not attach a permissions boundary to root.
    • The root user can not assume any other role; only IAM users (and roles) can.
    • Only the root user can change the account name/email address, close the account, or perform some limited billing tasks.
    • Create at least one IAM user with admin privileges.
    • Avoid using the root user as much as possible. Create an IAM user with the AdministratorAccess managed policy as an alternative to root.
    • Root is still limited by SCPs (Service Control Policies) if the account is part of AWS Organizations.

SAML vs OIDC

  • Both are authentication protocols.
  • Both are used with an Identity Provider (IdP, like Google) and a Service Provider (SP, like your custom application or AWS SSO). SAML calls it the Service Provider (SP) and OIDC calls it the Relying Party (RP).
  • Both redirect the user's browser login attempt at the SP to the IdP, and back again.
  • Both support HTTPS. SAML is the older standard, with XML-encoded information, and can use basic SOAP as well.
  • OIDC is simpler and more modern, using JSON Web Tokens (JWT) instead of processing XML. However, SAML is more widely supported, even by government entities.
  • OIDC supports user consent by default, i.e. it supports "scopes", where the user chooses which information to share with the SP (e.g. a limited set of user attributes like email only, or additional info). SAML can also be used to achieve user consent, but it is complex.
  • OIDC assigns a client ID (like sts.amazonaws.com) and a client secret to the RP (such as your custom application or IAM Identity Center). The client secret is shared only between the IdP and RP.
  • OIDC is based on the OAuth 2.0 standard; its simpler implementation helps security.
  • Both can be used to implement SSO.

IAM Roles Anywhere Service

There are various ways an on-premise server connects to AWS:

  • Using an access key (long-term IAM user credentials).
  • Getting an instance role by installing the SSM Manager agent (only basic SSM-related permissions).
  • Using CodeDeploy to deploy your initial STS temp credentials along with the application (more later).
  • IAM Roles Anywhere -- using certificates.

.                             Trust
.              Private CA <------------->   On-premise
.                          ------------>    Workstation
.                        Temp Credential
.
.                              Certificate
.            IAM Role      <---------------   On-premise
.  Cert Condition CN=onprem                    Workstation
.
.

Commands:

# Create trust anchor first using AWS Certificate manager private CA.
aws iam create-role --role-name ExampleS3WriteRole \
         --assume-role-policy-document file://<path>/rolesanywhere-trust-policy.json

# You can optionally use condition statements based on the attributes of X.509 certificate
# to further restrict the trust policy.
aws iam put-role-policy ...  --policy-document file://<path>

aws_signing_helper credential-process \
                   --certificate /path/to/certificate.pem \
                   --private-key /path/to/private-key.pem \
                   --trust-anchor-arn <TA_ARN> \
                   --profile-arn <PROFILE_ARN> \
                   --role-arn <ExampleS3WriteRole_ARN>
  • Use AWS IAM Roles Anywhere to obtain temporary credentials for workloads outside AWS.
  • Your outside workloads must use X.509 certificates issued by your certificate authority (CA). i.e. Use certificates instead of secret access keys.
  • Public CA certificates are not allowed. It must be AWS private CA (or your own external CA).
  • You establish trust between IAM Roles Anywhere and your CA by creating a trust anchor.

CodeDeploy and On-Premise Server

.
.
.        On-Premise Server
.        Code Deploy Agent  (/etc/codedeploy-agent/conf)
.        IAM User/Role Credential
.        SSM Agent     (Role for SSM to Assume)
.
.
  • Install and configure CodeDeploy agent on on-premise server.
  • Create IAM role with SSM Manager as trusted Service. (SSM Agent will deploy credentials)
  • You can directly install an IAM user access/secret key on the machine (not recommended) or install temporary STS credentials after AssumeRole of the IAM role.
  • Configure cron to periodically refresh STS credential (If it is not IAM user but role)
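A hypothetical sketch of such a refresh. The script path, role ARN and interval are assumptions, not values from these notes; everything requires live AWS access:

```shell
# Hypothetical /usr/local/bin/refresh-creds.sh: assume the role and capture
# the temporary credentials the CodeDeploy agent should use.
aws sts assume-role --role-arn arn:aws:iam::123456789012:role/OnPremDeployRole \
    --role-session-name onprem-refresh --output json > /tmp/creds.json
# ...then parse AccessKeyId / SecretAccessKey / SessionToken out of /tmp/creds.json
# and write them into the agent's credentials configuration.

# Cron entry (sessions typically last 1 hour, so refresh more often):
# */30 * * * * /usr/local/bin/refresh-creds.sh
```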

Attribute based access control - ABAC - Using Tags

  • Attach tags to IAM resources, (IAM users or IAM roles) and to AWS resources.
  • Create a single ABAC policy, or a small set of policies, for your IAM principals that allow operations based on tag matches.
  • Use Case: You don't have to change your policies when new resources are added. You just need to tag on creation. You can have global policies to enforce tagging.
  • Tag Policy (AWS organization feature) can be used to enforce specific values for specific keys.
  • You can use SCPs to enforce tagging at resource creation by using the aws:RequestTag IAM condition key. e.g. Deny the ec2:RunInstances action if "aws:RequestTag/mytag" is empty.
  • See https://aws.amazon.com/blogs/mt/implement-aws-resource-tagging-strategy-using-aws-tag-policies-and-service-control-policies-scps/
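The tag-match idea above can be sketched as a policy: allow an action only when the resource's tag matches the calling principal's tag. The tag key `team` and the EC2 actions are illustrative assumptions:

```shell
# ABAC sketch: EC2 start/stop allowed only when the instance's "team" tag
# equals the caller's "team" principal tag (policy variable syntax).
cat > abac-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:StartInstances", "ec2:StopInstances"],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/team": "${aws:PrincipalTag/team}" }
      }
    }
  ]
}
EOF
python3 -m json.tool abac-policy.json > /dev/null && echo "policy OK"
```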

AWS IAM Identity Center

  • Successor to AWS SSO.
  • AWS IAM Identity Center service is used in managing human user access to AWS resources.
  • Used with AWS managed applications such as Amazon Redshift, and also with customer managed SAML 2.0 applications such as Microsoft 365.
  • Can it be used with OIDC applications as well (like Google, Facebook, etc.)?
  • AWS IAM Identity Center OpenID Connect (OIDC) is a web service that enables a client (such as the AWS CLI or a native application) to register with IAM Identity Center. What does that mean? A new application (say myapp) gets registered in IAM Identity Center and can then be invoked from it. Does it mean you can invoke Google using a user credential in SSO? No. The client can initiate the OAuth2 protocol and remember the credential.
  • The service also enables the client to fetch the user’s access token upon successful authentication and authorization with IAM Identity Center.

OpenID Connect (OIDC) Use cases

  • A GitLab CI/CD job uses a JSON Web Token (JWT) to retrieve temp credentials from AWS without storing secrets.
  • A GitHub Actions script should deploy changes into AWS. Don't use AWS IAM user credentials in an external service, since they can be compromised. GitHub should only be able to assume a certain IAM role temporarily. What to do?:
    • Configure GitHub as an OIDC Identity Provider in IAM. GitHub then becomes a trusted identity source. The identity provider URL is https://token.actions.githubusercontent.com and the Audience (aka client ID) is sts.amazonaws.com. AWS validates the JWT presented by the GitHub job against the provider's published public keys, so no long-lived secret needs to be stored on either side.
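Under that setup, the trust policy of the role GitHub Actions assumes might look like this sketch. The account ID and the `repo:my-org/my-repo` subject are placeholder assumptions:

```shell
# Trust policy sketch for a role assumable by GitHub Actions via OIDC.
cat > github-oidc-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:*"
        }
      }
    }
  ]
}
EOF
python3 -m json.tool github-oidc-trust.json > /dev/null && echo "policy OK"
```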

IAM Access Analyzer

Main Use Case: Generate policy for IAM user/role based on past activity across accounts.

It is an IAM feature that works across your AWS Organization.

All features:

  • Generate IAM policies based on access activity in your AWS CloudTrail logs.
  • Identify resources in your organization shared with any external entity outside organization.
  • Identify unused access in your organization and accounts.
  • Validates IAM policies against policy grammar and AWS best practices.
  • Custom policy checks compare modified IAM policies against the original ones, to quickly identify added or removed access.

IAM get credentials report

# View the list of users, their last used access keys, age, last used service etc.
aws iam generate-credential-report
# The Content field is base64-encoded CSV:
aws iam get-credential-report --query Content --output text | base64 --decode

Condition Keys

Condition Keys classified as:

  • Global condition keys: Such as aws:RequestTag (specified Tag during resource creation).
  • Service Specific Condition Keys: Such as s3:ExistingObjectTag (Filter by existing Tag on Objects)
  • Service Specific Request Specific Condition keys: Such as s3:x-amz-storage-class (Requested storage class on PutObject)
  • Cross-Service condition keys: an ECS service may make some ec2: condition keys available for use.

Note: The keys won't be present when it is not applicable.


Condition keys can be used in policies such as the following:

{
    "Version": "2012-10-17",
    "Id": "ExamplePolicy",
    "Statement": [
        {
            "Sid": "AllowGetRequestsReferer",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::your-bucket-name/*",
            "Condition": {
                "StringLike": {
                    "aws:Referer": "https://www.example.com/*"
                }
            }
        }
    ]
}

Global Condition Keys

  • Properties of Principal:

    aws:PrincipalArn                aws:PrincipalAccount        aws:PrincipalOrgPaths       aws:PrincipalOrgID
    
    aws:PrincipalTag/tag-key        aws:PrincipalIsAWSService   aws:PrincipalServiceName   
    
    aws:PrincipalServiceNamesList   aws:userid                  aws:username
    
    aws:PrincipalType (Account | User | FederatedUser | AssumedRole | Anonymous)
    
  • Properties of network:

    aws:SourceIp    aws:SourceVpc   aws:SourceVpce      aws:VpcSourceIp
    
  • Properties of resource:

    aws:ResourceAccount    aws:ResourceOrgID   aws:ResourceOrgPaths    aws:ResourceTag/tag-key
    
  • Properties of Request:

    aws:RequestTag/tag-key              aws:TagKeys
    
    aws:CalledVia   aws:CalledViaFirst  aws:CalledViaLast   aws:ViaAWSService
    
    aws:CurrentTime aws:EpochTime       aws:referer         aws:RequestedRegion
    
    aws:SecureTransport                 aws:SourceArn       aws:SourceAccount
    
    aws:SourceOrgPaths                  aws:SourceOrgID     aws:UserAgent
    
  • Properties of Role Session:

    aws:FederatedProvider               - (e.g. cognito-identity.amazonaws.com) 
    aws:TokenIssueTime
    aws:MultiFactorAuthAge              - (Time elapsed since MFA)
    aws:MultiFactorAuthPresent
    aws:ChatbotSourceArn                - 
    aws:Ec2InstanceSourceVpc
    aws:Ec2InstanceSourcePrivateIPv4
    ec2:SourceInstanceArn
    lambda:SourceFunctionArn
    ssm:SourceInstanceArn
    identitystore:UserId                -  IAM Identity Center workforce identity user id.
    

AWS Backup

.
.   Backup-Plan  Tags-Based Backup-Policy  Backup-Service-Role
.
.   Incremental-Backups Cold-Warm-Backups  Backup-Gateway-For-VMWare
.
.   Vault-Uses-Internal-S3-buckets.
.                                                 Auto Run      (Cross-Region Backup OK)
.                 Frequency                      Backup Job
.   Backup-Plan -------------------> Assign     -------------->    Vault
.                 Retention Period   Resources    Incr/full      (Encrypted)
.                                                                (Recovery Point)
.                                                                
.                  Attach to                                      
.   Backup-Policy  -------------> AWS Organization or Account.
.                  Backup Plan
.                                                                
.   Backup Audit Manager ---> Generate/View Compliance Reports and Resources
.
  • AWS Backup is a fully managed service for backup.
  • Targets include EC2, RDS, DynamoDB, EFS, FSx, S3 and more.
  • You must create and use Vault for backup destination. Vault uses AWS internal S3 storage but not visible to customers.
  • To use incremental backups, the retention period must be greater than 1 day.
  • Backup Plan == Frequency of Backup + Retention Period + Target Vault name. You assign resources into Backup Plan. Backup Jobs are auto scheduled.
  • If no resources are assigned to a backup plan, no jobs get triggered.
  • Tags based backup plans are better since it can be used in a backup policy that can be attached to all accounts using AWS Organizations.
  • A backup policy requires all member accounts to have a backup vault created with the same name. The default backup service role is used for backups. If you need to back up EC2 with a specific IAM role, you need additional PassRole privileges; in that case, you may want to create a custom backup role with the same name in all member accounts.
  • Backup Gateway is meant only for backing up VMware VMs, on-premise or on VMware Cloud on AWS. Backup Gateway is software that integrates with AWS Backup; it is used to back up VMware resources (like machines) and to implement the backup plan.
  • Cross-Account backups and delegated admin for backup (separate account for backup) possible.
aws backup create-backup-plan --cli-input-json file://path/to/backup-plan.json

# Assigning resources to backup plan.
aws backup create-backup-selection --backup-plan-id <backup-plan-id> \
  --backup-selection '{"SelectionName":"MyTagBasedAssignment", "IamRoleArn":"arn:*:AWSBackupRole",
        "ListOfTags":[{"ConditionKey":"Environment", "ConditionValue":"Production", 
                       "ConditionType":"STRINGEQUALS"}]}'

Amazon Data Lifecycle Manager (DLM)

DLM is another simpler (less featured) alternative to AWS Backup -- It automates the creation, retention, and deletion of Amazon EBS snapshots and EBS-backed AMIs. (The destination is AWS internal EBS snapshot storage. However you can move to S3 or Glacier.)

AWS Backup does all that DLM does and can also lock your backup using backup vault.

# Create DLM Policy (if not created)
aws dlm create-lifecycle-policy   ...  (Daily, etc.)

# Create On-Demand Snapshot
aws ec2 create-snapshot --volume-id vol-XXXXXXXX --description "On-demand snapshot"


aws ec2 create-volume --volume-type io1 --iops 1000 --snapshot-id snap-066877671789bd71b \
                      --availability-zone us-east-1a

# Copy to your S3 bucket.
aws s3 cp ...

AWS Organizations

AWS Organizations helps you centrally manage and govern your environment.

.
.  Root(OU) +---> Management (Ac) == First Top Level Account ---> root IAM user + Admin Users
.           |
.           +---> Security (OU)       +--->  Audit aka Security Tooling (Ac) Read-Only
.           |                         +--->  Log Archive (Ac) CloudTrail Logs etc.
.           |
.           +---> Infrastructure OU   (Empty by default. Can use Network Mgmt etc Accounts here)
.           |
.           +---> Sandbox OU          (Dev/Test Accounts)
.           |
.           +---> Production OU       (Production Accounts)
.           |
.           +---> Exceptions OU
.
.
.
  • Use AWS Control tower to auto create standard OUs for your organization to start with.

Use Cases:

- Create/Group accounts into OU (Organizational Units) to easily Govern.
  (Using Console, CLI or Cloudformation Stacks)

- Apply policies (SCP - Service Control Policy) to all or selectively.

- Many Features easily apply to Organizations such as GuardDuty, CloudTrail, Backup,
  AWS Config, MS Active Directory, etc.

- Easily share resources within Organization using RAM.

- Manage Consolidated Billing and Costs

Management Account

  • Same as the "First Top Level Account".
  • Once you create AWS Organization, the top level account becomes Management Account.
  • There is a root user (the first user in the first account).
  • This account becomes a child of invisible Root OU.
  • IAM Identity Center Service is configured and setup by Control Tower in this account.
  • Billing, Cost Explorer, AWS Config all are configured from here but many of them can be delegated to other accounts in other OUs.
  • e.g. IAM Identity center can be delegated to account in Infrastructure OU.

Security OU

Contains Audit aka Security Tooling Account:

  • Account which contains delegated Admins for Guard Duty, Security Hub, AWS Config, Amazon Macie, AWS AppConfig, AWS Firewall Manager, Amazon Detective, Amazon Inspector and IAM Access Analyzer.
  • Typically has only read permissions.
  • Responsible for Patch management, Security Scanning, etc.

Contains Log Archive Account:

  • Cloud Trails, VPC Flow Logs, AWS Config, Cloudwatch Logs are all archived in this account.

Joining Master Account

.
.   Master Account --> Send Invite --> Member Account -> Accept Invite -> Grant Access to Master.
.
.   For Control Tower created new accounts, master account has auto access.
.
.   AWS Control Tower > Account Factory > Enroll Account >  Choose OU
.

If an existing member account joins the master account by accepting an invitation, the master does not automatically get full access. The member account should create the OrganizationAccountAccessRole and grant access to the master account.

Moving between OU

You can move account between OUs. :

aws organizations move-account \
 --account-id 111122223333 \
 --source-parent-id r-a1b2 \
 --destination-parent-id ou-a1b2-f6g7h111

AWS Control Tower

.
.  Orchestration-Tool  Organization  GuardRail LandingZone AccountFactory
.
.  Compliance-Check
.
  • AWS Management and Governance feature.
  • AWS Control Tower: Automates account provisioning and deployments of multi-account envs.
  • AWS Control Tower is used to create AWS Organization and standard OUs and sets it up.
  • Uses standard controls (aka Guardrails) to enforce policies across organization.
  • Uses AWS Service Catalog service to provision AWS Accounts.
  • At the core it lands on IAM identity center and controls others.
  • No additional charge for using AWS Control Tower similar to AWS Organizations.

Landing Zone

  • AWS Control Tower creates Landing Zone with Foundational Security OU + Audit (Tools) Account + Log Archive Account;
  • If you enabled AWS Organization but not yet setup Control Tower, then you don't have Landing Zone.
  • Additional Recommended OUs are optional.
  • Landing Zone best managed by Control Tower but you can also manage it by other custom tools.

AWS Guardrail aka Controls

  • AWS Control Tower detects policy violations using Controls aka Guardrail.
  • Uses SCPs and AWS Config to enforce policy compliance.

Summary :

|    Guardrail
|    Control Type
|
|    Preventive    Deny using SCP
|    Detective     Record using AWS Config event
|    Proactive     Deny during creation using Cloudformation Hook.
|

Guardrail vs AWS Config Managed Rules

|   Feature         Guardrails (Control Tower)              AWS Config Managed Rules
|
|   Scope           Multi-account, organizational           Account and resource-level compliance
|   Application     Organizational Units (OUs)              Individual AWS resources
|   Enforcement     SCPs (preventive); AWS Config           Detective, with optional remediation
|                   (detective)
|   Visibility      Compliance dashboard in Control Tower   Resource-level reporting in Config
|   Best Use Case   Organizational compliance               Granular config compliance and auditing

A control is a high-level rule for governance. It's expressed in plain language. Example controls are:

  • Disallow attaching unencrypted EBS volume to EC2. (Preventive Control)
  • Disallow creation of EC2 instances without dept=dept_name tag. (Proactive - i.e. Creation Preventive)
  • Detect object deletion from a specific S3 bucket containing sensitive data. (Detective Control)

Some more:

  • AWS-GR_IAM_USER_MFA_ENABLED
  • AWS-GR_RDS_INSTANCE_PUBLIC_ACCESS_CHECK (Detective - implemented by AWS Config)
  • AWS-GR_RESTRICTED_SSH

Three kinds of controls exist:

  • preventive, (Implemented using SCP)
  • detective (AWS Config)
  • proactive. (CloudFormation hooks)

Three categories of guidance apply to controls:

  • mandatory,
  • strongly recommended
  • elective.

AWS Resource Access Manager (RAM)

  • Resource sharing using RAM is a modern alternative to VPC peering for many use cases.
  • Not all resources support it. When a resource does, RAM is the preferred way of sharing.
.                                                        
.                                                               
.                                                              Admin
.                resource-share-invitation                     Accept
.    Owner VPC ------------------------------> Another AC    ----------> Visible  
.                Subnet,                       IAM User/Role             Resource
.                PHZ, TGW, ...                 Service
.
.
.   Implicit-Read-Permission      No-explicit-Resource-policy-needed
.
.   enable-sharing-with-aws-organization (Bypass invitation)
.
.
  • You can share resources with another account, IAM User/Role or Service.

  • Some resources can only be shared at account level

  • some can be shared between both at account level and user level.

  • Some resources can use customer managed permissions.

  • Using RAM share eliminates the need for creating/managing explicit resource policies in the owner account. But you still need explicit IAM policies on the external account!

  • You can share resources like the following (but not limited to):

    • VPC EC2:Subnet (Can not be shared outside Org)
    • Customer Managed EC2:PrefixList (IP blocks. eg. CloudfrontIPs. AWS Managed vs Customer)
    • Route 53 Resolver Rules
    • Private Hosted Zones (Route 53)
    • EC2 Capacity Reservations
    • License Manager Configurations
    • Transit Gateway
    • FSx for OpenZFS Volume
    • Network Firewall Policy
    • OutPosts (Shared only within Organization accounts)
    • SSM Parameter store (Supports Customer Managed Permission)
    • Amazon VPC IP Address Manager (IPAM) (Manage Global CIDR IP pools for organization)
    • ec2:CoipPool - Customer Owned IP pool in Outposts.
    • Aurora Global Databases. RDS:Cluster Useful to clone central Database
  • VPC subnets can be shared with other accounts in the same AWS Organization:

    # By pass invitation-accept requirement across accounts within org only.
    aws ram enable-sharing-with-aws-organization
    
    aws ram create-resource-share --name MyNewResourceShare \
             --no-allow-external-principals --principals "arn:<orgn_ou_arn>" \
             --resource-arns "resource-arn1 arn2 arn3"
    
    # Same organization sharing auto-accepted. External accounts generate invitation.
    aws ram accept-resource-share-invitation --resource-share-invitation-arn <arn>
    
    # Suppose SSM parameters are shared with OU. The OU admin can associate this with all accounts in OU.
    aws ram associate-resource-share \
            --resource-share-arn arn:*:resource-share/xyz --principals arn:...:ou/... 
    
    # Sometimes resource share is associated with account, sometimes with VPCs, sometimes with Users.
    
    aws ram disassociate-resource-share --resource-arns "arn1 arn2"
    
  • Sharing usually gives only implicit read access; you can not write or delete. The default permissions depend on the resource; there is no fixed rule.

  • Some more Examples:

    # Share an SSM Parameter with another account, in the same org or outside it.
    # Sharing with the account root principal shares with the whole account.
    aws ram create-resource-share --name "ShareSSMParameter" --resource-arns arn:..*::parameter/my-db/my-password \
            --principals arn:aws:iam::other-account-id:root \
            --allow-external-principals     # This resource supports customer managed permissions.
    
    aws ram list-resources  --resource-owner SELF            # List shared resources owned by me.
    aws ram list-resources  --resource-owner OTHER-ACCOUNTS  # List shared resources owned by others
    
    aws ssm describe-parameters --shared    # SSM supported command to list shared parameters.
    
  • You can create and use a customer managed permission while sharing. The terminology distinguishes it from the implicit AWS managed default permissions for resource sharing:

    aws ram create-permission --name "ReadOnlySSMParameterPermission" --resource-type "ssm:Parameter" \
            --policy "{... }" --client-token "unique-client-token" --description "Allow custom access to SSM Parameter"
    
    aws ram create-resource-share --name "MySSMParameterShare" --resource-arns "arn:*:parameter/my-parameter" \
             --principals "arn:aws:iam::external-account-id:root" \
             --permission-arn "arn:aws:ram:region:account-id:permission/ReadOnlySSMParameterPermission" \
             --allow-external-principals
    
  • You can share many things like:

    • Can share managed Prefix List (i.e. set of CIDR blocks) used in security group rules. e.g. ssh allowed on Prefix-List-A !
    • Route 53 Outbound Resolver. Share DNS resolver rules.
  • Other things that can be shared include:

    • Microsoft AD Directory,
    • Transit Gateway, so that multiple accounts can attach their VPCs to one shared transit gateway.
  • Sharing does not mean full access.

  • You can share all your resources with all the accounts in same organization:

    aws ram enable-sharing-with-aws-organization
    #  AWS RAM creates a service-linked role called AWSServiceRoleForResourceAccessManager.
    #  and makes all accounts in same organization as trusted entities for this role.
    
  • You can view all resources available for you, shared by other accounts:

    aws ram get-resource-shares  --region us-east-1 --resource-owner OTHER-ACCOUNTS
    
  • You can also share VPC subnets with other accounts (in same org) from VPC console.

AWS CloudTrail

.
.
.      CloudTrail ----->              S3 -----> Athena
.                        CloudWatch Logs -----> Alarms
.      90 Days                           -----> Data Firehose
.                           EventBridge  -----> Lambda
.
.  Note: AWS config uses it to record history of resources.
.
.  Organization Trail - Single trail for all Organization member accounts.
.
.
.       Org Member Account1 ---->  Organization Trail      ----> S3 Bucket (Enable trust policy)
.                  Account2        (Multi-Region-Optional)                 (For Organization principal)
.                  Account3
.
.
.  
  • API calls and events from the Console, SDK, CLI and AWS services are all logged in CloudTrail.
  • Only management events are logged by default (e.g. Role creation, Resource deletion etc) but data read/write events (S3 bucket read/write) and Lambda InvokeAPI are not logged by default.
  • Events are retained for only 90 days. For a longer archive, configure the trail to send events to S3 and/or CloudWatch Logs (by default CloudWatch Logs don't expire; choose a retention period).
  • Trail logs are almost free. You pay for advanced features: CloudTrail Lake and Insights.
  • CloudTrail Lake is a paid service, you pay for data ingestion, supports queries using SQL.
  • CloudTrail Insights is a paid option that when enabled will analyze management events and will generate "Insight Events" -- these are sent to CloudTrail console, S3 Bucket and Eventbridge Event as configured.
  • You can use Athena to analyze the CloudTrail events. Athena is serverless analytics service that can read S3 (and other sources) using SQL and python.
  • S3 object-level (data event) logging can be enabled in CloudTrail. You can choose to log read (GetObject), write/create (PutObject) and delete operations, and monitor the creation of public objects using CloudWatch if you like.

CloudTrail Log Integrity Check

  • AWS CloudTrail provides a log file integrity validation feature, which uses hashing and signing mechanisms to append a signature with log files stored in S3 to prevent tampering.
# Validate the integrity of log files within the specified date range.
aws cloudtrail validate-logs --trail-name <YourTrailName> --start-time <StartTime> --end-time <EndTime>

AWS EventBridge

.
. Concepts
.
.    Most Services                          Routing Rules
.   -----------------> Default-Event-Bus  ------------------>  Lambda | Step Function  | ECS Task
.                      Custom-Event-Bus                        SNS    | SQS | FireHose | DataStreams
.                      Partner-Event-Bus                       HTTP   | Remote Event Bus!
.                                                              API GW | Batch Job | ...
.
.   Schema Registry, Infer Schema       Optional EventArchive and Replay
.   Schedule Jobs
.
.   Note: For Lambda target, You can better use Event Source Mapping.
.

Event Format:

{
    "version": "0",
    "id": "...",
    "source": "aws.s3",
    ...
    "resources": [ "arn:aws:s3:::my-s3-bucket" ],
    "detail-type": "Object Created",
    "detail": {
        /* Your JSON or service-specific JSON here ... */
    }
}
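A rule can match such events with an event pattern. A minimal sketch, assuming the bucket name `my-s3-bucket` from the format above and a hypothetical rule name:

```shell
# Event pattern: match S3 "Object Created" events for one bucket.
cat > s3-object-created-pattern.json <<'EOF'
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": { "bucket": { "name": ["my-s3-bucket"] } }
}
EOF
python3 -m json.tool s3-object-created-pattern.json > /dev/null && echo "pattern OK"
# Live AWS only:
# aws events put-rule --name s3-object-created \
#     --event-pattern file://s3-object-created-pattern.json
```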

EventBridge pipe:

SQS ----->     EventBridge Pipe       ---> StepFunction
             Filter and Enrich Data

# SQS can not invoke StepFunction directly. Pipe supports multiple sources and single target.
# This way, most AWS services can invoke other services if native integration not available.
  • AWS Event Bus used for integration between AWS Services and your applications.
  • Formerly called Amazon Cloudwatch Events, but more powerful now.
  • Event Bus events can invoke Lambda functions, HTTP invocations, or event buses in other AWS accounts!
  • The Schema Registry holds well-defined schemas for events. Rules filter events and send them to targets (such as Lambda, SNS, etc.).
  • Can schedule cron jobs.
  • Event rules to react to a service doing something e.g. IAM root sign-in Event
  • Source Events examples:
    • EC2 startInstance, S3 upload Object, Cron job every 4 hours, failedCodeBuild, etc
  • Example Destinations:
    • Lambda
    • AWS Batch Job
    • ECS Task
    • SQS, SNS, DataStreams
    • StepFunctions, CodePipeline, EC2 Actions
  • 3rd party partners like DataDog can send events to Partner Event Bus. Custom Apps can place custom events on custom Event bus.
  • Event buses can be accessed by other AWS accounts using Resource-based policies.
  • You can archive events (all or filtered) from an event bus and retain the archive indefinitely or for a set period.
  • You can replay archived events
  • EventBridge can analyze events in your bus and infer the schema.
  • Schema Registry allows you to download generated code bindings for events, so your application can work with them as typed objects.
  • Schema can be versioned.
  • Has a resource-based policy. The event bus resource is named as you like (my-event-bus) and referred to accordingly from the policy.

EventBridge Vs SNS

  • Both can be used to decouple publishers and subscribers.
  • EventBridge has broader AWS services integration.
  • EventBridge supports custom events with custom schema. SNS supports Topic based events that can be routed to multiple destinations.
  • EventBridge supports API destinations and SaaS app integrations; SNS does not (you need Lambda).
  • Only SNS supports A2P subscribers (App to Person) like Email, SMS and Mobile.
  • Note that SNS is not a message queue like SQS.
  • Async communication methods are: Queue (e.g. SQS), Topic (e.g. SNS), Event Bus (EventBridge)
  • SQS and SNS FIFO throughput is around 300 messages/sec. The EventBridge soft limit is around 1,400 requests/sec. For standard SNS topics where ordering is not important, throughput is effectively unlimited; SNS is better for large-scale fan-out.
  • SQS has the lowest latency. SNS is around 100 ms; EventBridge is typically > 200 ms.

AWS X-Ray

  • Visual analysis of our applications
  • Tracing requests across your microservices (distributed systems)
  • Integrations with:
    • EC2 - Install X-Ray agent
    • ECS - Install X-ray agent in docker container
    • Lambda
    • Beanstalk - agent is auto installed
    • API Gateway
  • X-Ray agent or services need IAM permissions to X-Ray

AWS Personal Health Dashboard

  • Global Service
  • Aggregations across AWS Organization
  • https://phd.aws.amazon.com
  • Shows how AWS outages directly impact you.
  • You can send events from the Personal Health Dashboard to EventBridge and take actions.

AWS Code Deploy

  • Alternative to Ansible, Terraform, Chef, Puppet, etc.
  • Managed deployment service in AWS.
  • Can deploy to EC2, ASG, or Lambda.
  • Code Deploy to EC2:
    • Use appspec.yml + deployment strategy
    • Will do in-place update to your fleet of EC2 instances.
    • Can deploy half instances from v1 to v2; Then deploy the other half.
  • Code Deploy to ASG:
    • In place updates similar to EC2.
    • Blue/Green deployment: New ASG is created; choose how long to keep the old instances.
    • Package your application with AppSpec.yml file
    • Create new Application in CodeDeploy.
    • Create Deployment group and link it to ASG
    • Ensure CodeDeploy agents installed in all instances in ASG.
    • Trigger Deployment -- in place or blue/green.
  • Code Deploy to Lambda:
    • Traffic Shifting feature. Supports Blue/Green deployment.
    • Lifecycle event hooks (BeforeInstall, AfterInstall, BeforeAllowTraffic, and AfterAllowTraffic).
    • Pre and post traffic hooks: Features to validate deployment
    • Easy and automated rollback using cloudwatch alarms.
    • SAM (Serverless Application Model) framework natively uses code-deploy.
  • CodeDeploy can be triggered from a CI/CD pipeline.
  • Code Deploy integrates well with ECS deployment as well (See below).
  • CodeDeploy does not directly integrate with Elastic Beanstalk deployments since EB has its own deployment mechanism. But you can trigger it using plain shell scripts as part of CodePipeline.
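For reference, a minimal appspec.yml for an EC2/on-premises deployment might look like this (file paths and script names are hypothetical; the hook names are CodeDeploy's standard lifecycle events):

```yaml
version: 0.0
os: linux
files:
  - source: /app            # from the revision bundle
    destination: /var/www/app
hooks:
  ApplicationStop:
    - location: scripts/stop_server.sh
      timeout: 60
  AfterInstall:
    - location: scripts/install_deps.sh
  ApplicationStart:
    - location: scripts/start_server.sh
  ValidateService:
    - location: scripts/health_check.sh
```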

ECS Deployment Choices

ECS service specifies the deployment type associated. The Deployment types supported are:

  • ECS (aka Rolling Update) - Depending on allowed healthy tasks min/max percentage, rolling batches of update will happen.
  • Code_Deploy: The blue/green deployment type. Traffic-shifting strategies can be Canary, Linear, or All-at-once after the blue/green testing succeeds.

Code deploy to ECS (Blue/Green Deployment support)

  • Tight integration with ECS deployment.
  • Supports Blue/Green deployment using the traffic-shifting feature. The ECS service's deployment controller must be set to CODE_DEPLOY, and an ALB is used to shift traffic between the blue and green target groups.
  • Supported for both ECS and AWS Fargate.
  • Setup is done within ECS Service definition. (can not be done from CodeDeploy Console)
  • New Task set is created and traffic is re-routed to new task set.
  • If everything is stable for X minutes, the old task set is terminated.
  • Also supports Canary deployments, e.g. Canary10Percent5Minutes.
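The Canary10Percent5Minutes strategy can be expressed as a tiny function (an illustrative sketch of the traffic split over time, not an AWS API):

```python
def canary_10_percent_5_minutes(minutes_elapsed: float) -> int:
    """Canary10Percent5Minutes: route 10% of traffic to the new version for
    the first 5 minutes, then shift the remaining 90% all at once."""
    return 10 if minutes_elapsed < 5 else 100

print(canary_10_percent_5_minutes(2))  # 10
print(canary_10_percent_5_minutes(7))  # 100
```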

AWS Service Catalog

.
.   AWS Service Catalog === Restricted Product Catalog
.
.   Product - Predefined product by CloudFormation
.
.     Product  Product-Portfolios  LaunchConstraints Share-With-OU
.
.     Import-Portfolio Role-For-Launch
.
  • Provide restricted set of Products (CloudFormation Templates) for some accounts.
  • Each Template represents a Product. A portfolio is a collection of products. A user in some account can launch any product using one of those provided templates.
  • Prevents users from creating arbitrary resources; they can select only the products the Admin has granted them permission for.

AWS SAM - Serverless Application Model

  • SAM - Serverless Application Model
  • All config is YAML code.
  • SAM can help you to run Lambda, API Gateway, DynamoDB locally!
  • SAM can use CodeDeploy to deploy Lambda functions (traffic shifting)
  • Leverages CloudFormation in the backend.
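A minimal SAM template gives the flavor (function name, handler, and path here are hypothetical; the Transform line is what turns the template into CloudFormation):

```yaml
Transform: AWS::Serverless-2016-10-31
Resources:
  HelloFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler      # hypothetical module.function
      Runtime: python3.12
      CodeUri: src/
      Events:
        Api:                    # implicitly creates an API Gateway endpoint
          Type: Api
          Properties:
            Path: /hello
            Method: get
```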

AWS Cloud Development Kit (CDK)

  • Define your cloud infrastructure using: Javascript, Python, Java or .Net

  • The code is "compiled" into CloudFormation template (JSON/YAML)

  • You can deploy infrastructure and application runtime code together:

    • Great for Lambda functions
    • Great for Docker containers in ECS / EKS

TLS vs SSL

  • Both are cryptographic protocols for secure connections, sitting between the transport and application layers.
  • SSL went through versions 1.0, 2.0, 3.0 and is now deprecated.
  • TLS 1.2 and 1.3 are actively used now.
  • All current certificates are TLS certificates; "SSL certificate" is just a naming convention.
  • SNI is an extension to TLS where the client indicates the hostname at the start of the handshake. It is the equivalent of HTTP/1.1 name-based virtual hosting, but for HTTPS.

Security

AWS Security Features Summary

AWS Security Hub  # Integrated Dashboard for Compliance.    Costs little.
                  # Prepackaged Security Std checks (like Payment Card Industry PCI) available.
                  # Receives and consolidates findings from GuardDuty, Inspector, Config, and more.

AWS Inspector     # Continuous scanning of Lambda, EC2, ECS, and ECR images on push.
                  # Report to Security Hub and Events Bridge

AWS Config        # Use Managed Config rules (over 75) to check compliance. Cheap.
                  # View compliance (Green/Red) for resource in timeline.
                  # View CloudTrail API calls! (Auditing)
                  # Can even remediate using SSM Documents!
                  # AWS Config is used by AWS Control Tower to implement Guardrails.
                  # Implements specific events recording for auditing purposes.

AWS Firewall Manager  # Manage rules in all accounts of Organization. WAF, Shield, SG,
                      # AWS network firewall (VPC Level), Route 53 DNS firewall.
                      # Costs $100 per protection policy!

AWS Network Firewall  # VPC Level network firewall. Stateful inbound/outbound inspection.

AWS WAF           # Protects CloudFront, ALB or API Gateway. Meant for HTTP.
                  # Can also be used for white/black listing, custom header checking

AWS Shield        # Mainly DDoS protection. Advanced tier is $3k per month per org!
                  # CloudFront, Route 53 protected by default by shield.
                  # Protects ElasticIP, ELB, Global Accelerator, etc.

AWS GuardDuty     # Auto threat discovery using Logs and ML, Anomaly Detection, Real time!
                  # By default, Analyzes: CloudTrail event Logs, VPC Flow logs, S3 Data events,
                  # DNS Query Logs, EKS control plane logs.
                  # Send to Security Hub, EventBridge, Amazon Detective.
                  # Also useful in PCI DSS Compliance (in addition to Security Hub)

Amazon Detective  # Analyze and visualize security data to investigate root of security issues.
                  # Uses Logs, Guard Duty, Security Hub, Inspector, CloudTrail, etc.
                  # A bit expensive: $2 per GB ingested.

Amazon Macie      # Continuously scans S3 using ML to detect personal data; periodic full scans.
                  # Send finding to Security Hub.

Note:

1. Security Hub and Amazon Detective are primarily aggregation services.

2. AWS Control Tower is mainly for Governance, Accounts Provisioning and Policy enforcements.
   It enforces policy compliance using GuardRail (which uses set of AWS Config Rules).

3. AWS Audit manager helps you with auditing your compliance wrt prebuilt frameworks.
   Some automatic evidence collection built-in but it does not do strict compliance check.
   For example, you can upload your evidence from on-premises resources, no check is done.

4. AWS Artifacts help you with access Reports (demonstrating AWS compliance of standards)
   and Agreements (between you and AWS) management to comply with GDPR, PCI, etc.

CloudHSM

.
.
.                HSM Cluster                       WebServer LoadBalancers (SSL)
.                                                  Digital Signature
.
.         HSM-AZ1  HSM-AZ2   HSM-AZ3               Bulk Encryption  PKI
.
  • CloudHSM - Cloud Hardware Security Module - is hardware module for encryption.

  • CloudHSM can be used to store encryption keys and also perform encryption operations. The key storage is protected by Hardware and even AWS can not access it.

  • In a dev environment, you can create/delete/recreate a CloudHSM cluster across availability zones on an as-needed basis. No need to keep it running all the time unless you use it for storing keys.

  • By using CloudHSM, you manage your own encryption keys (not AWS).

  • AWS KMS is FIPS 140-2 Level 2 compliant. But CloudHSM provides Level 3 compliance. FIPS - Federal Information Processing Standards

  • If you want to encrypt using your own key before writing into S3, you can keep the key in CloudHSM and use SSE-C (server-side encryption with customer-provided keys).

  • CloudHSM can be deployed and managed in VPC.

  • In FIPS mode, only selected FIPS-approved algorithms are allowed.

  • AWS CloudHSM M-of-N access control requires a minimum quorum of COs (Crypto Officers) to authorize sensitive operations (e.g. minimum 3 COs authorizing out of a total of 5).

  • CloudHSM can be integrated into Webservers and Load Balancers for SSL termination. The key can be stored inside CloudHSM itself.

  • Uses Public Key Infrastructure (PKI) for managing certificates and keys in a secure environment.

  • You can use CloudHSM to generate a key that can be imported into KMS. CloudHSM is the cloud alternative to an on-premises HSM. BYOK (Bring Your Own Key) with your own HSM (Hardware Security Module) is a common pattern in multi-cloud organizations:

    .
    .                                Import
    .    CloudHSM/On-premise HSM ------------> KMS -----> Use with S3-KMS
    .                                CMK
    .
    .    Note: Create empty KMS key with external type and import key by generating a token.
    .
    .
    

AWS Shield

  • Static-threshold DDoS protection for underlying AWS services. AWS Shield Standard is the free option; you cannot customize the protection thresholds. It provides only standard layer 3/4 (network and transport layer) protection against DDoS attacks.
  • CloudFront and Route 53 are mainly for faster edge access but also provide better availability and protection against DDoS. Both CloudFront and Route 53 are protected by AWS Shield by default.
  • AWS Shield Advanced provides better 24/7 DDoS protection ($3k per month per org!):
    • Gives you 24/7 access to the AWS Shield Response Team (SRT)
    • Near real-time visibility into DDOS attacks.
    • Insurance against higher bills due to DDOS attack.
    • In addition to layer 3/4, also provides application layer protection against HTTP floods, DNS query floods, etc.
    • Provides customized detection based on traffic patterns to your protected Elastic IP address, ELB, CloudFront, Global Accelerator, and Route 53 resources.

AWS WAF

Web Application Firewall (WAF) filters specific requests based on rules:

  • Filters malicious requests at the application layer; not specific to DDoS.
  • Typically you apply AWS WAF on CloudFront, ALB (Application Load Balancer), or API Gateway.
  • Specifically meant for HTTP(S); packets are filtered after TLS termination.
  • WAF makes use of Web ACL rules.
  • There is a library of over 190 managed WAF rule groups from AWS and Marketplace sellers. Some well-known ones are the IP Reputation rule groups (by AWS, to block bad IPs) and AWSManagedRulesBotControlRuleSet to block bots.
  • WAF logging can be directed to a CloudWatch Logs group (5 MB per second), an S3 bucket (at 5-minute intervals), or Kinesis (limited by Firehose quotas).
  • While attached to an ALB, it can check a custom header to allow only CloudFront as the origin-facing client.
  • Can blacklist/whitelist IPs.

AWS Firewall Manager

AWS Firewall Manager manages rules across all accounts of an AWS Organization:

  • This includes WAF Rules, AWS Shield Advanced, Security Groups, AWS Network Firewall (VPC Level), Route 53 DNS firewall.
  • Important advantage: rules are applied automatically to new resources as they are created, including in all future accounts in your Organization!
  • The pricing includes a monthly fee for firewall manager protection policy ($100 each) and the fees for underlying services like WAF, Network Firewall, AWS Config Rules (Firewall Manager Rules), AWS Shield Advanced, Route 53 Resolver DNS Firewall, Third-party firewall charges (from Market Place).

Network firewall

  • Firewall at VPC level.
  • Flow log entries are written one per line, per traffic direction.
  • can block outbound Server Message Block (SMB) requests to prevent the spread of malicious activity.
  • TLS Inspection
  • Can use Partner developed rules.
  • Can block certain websites based on SNI (the SNI hostname remains unencrypted even in a TLS session).
  • Can block all HTTP traffic (forcing HTTPS).
  • Rules based on IP, port, protocol.
  • Stateful rules supported. (e.g. Outbound connection allowed only as response to the earlier inbound connection)
  • Handles thousands of firewall rules without problems.

VPC Flow Logs

.
.   Source
.
.   VPC | Subnets | Instances | ENI | TGW | EndPoints | ELB-ENI
.
  • VPC Level - Subnets, Instances, ENI
  • Subnet Level
  • Network Interface Level (ENI) - (Attached to EC2, NATGW, ELB)
  • AWS Transit Gateway
  • AWS PrivateLink - VPC Interface Endpoints.
  • VPN Connections and Direct Connect
  • Elastic Load Balancers (ELB) - Only through relevant ENIs you can enable.
  • AWS App Mesh and Other Networking Services

Best Practices:

  • Filter for Specific Traffic: Use filters in VPC Flow Logs (accepted, rejected, all) to collect the most relevant traffic data.

Blocking single IP address

  • The best place is a NACL (network ACL) at the subnet level; it supports Deny rules.
  • Security Groups are of no use here since they have only Allow rules.
  • You can also use a firewall inside EC2, but that doesn't help if an ALB or CloudFront sits in front (the source IP seen by the instance is the load balancer's).
  • If using CloudFront, just use WAF to block IP.
  • If using ALB, use WAF installed at ALB to block IP.
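NACL semantics make the first option work: rules are evaluated in ascending rule-number order and the first match wins, so a low-numbered DENY for one IP sits in front of the broad ALLOW. A simplified sketch (IP matching reduced to exact strings plus a catch-all, instead of real CIDR blocks/ports/protocols):

```python
# Simplified NACL evaluation: rules checked in ascending rule-number order,
# first match wins; anything unmatched hits the implicit deny.
rules = [
    (90,  "203.0.113.25", "DENY"),   # block the single bad IP first
    (100, "0.0.0.0/0",    "ALLOW"),  # then allow everyone else
]

def evaluate(src_ip: str) -> str:
    for _number, match, action in sorted(rules):
        if match == src_ip or match == "0.0.0.0/0":
            return action
    return "DENY"  # implicit '*' deny rule

print(evaluate("203.0.113.25"))  # DENY
print(evaluate("198.51.100.7"))  # ALLOW
```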

Amazon Inspector

  • Amazon Inspector is an automated and continual vulnerability scanning service.
  • It scans and assesses:
    • Amazon Elastic Compute Cloud (EC2) instances
    • AWS Lambda functions. (Standard Scan check and advanced Code Scan)
    • Container images in Amazon ECR and within continuous integration / continuous delivery (CI/CD) tools
  • Improves the security and compliance of infrastructure workloads. Your monthly costs are determined by the different workloads scanned.
  • For EC2 instances, it uses the AWS Systems Manager (SSM) agent to analyze network accessibility and the running OS against known vulnerabilities. Approx cost is $1.2 per EC2 instance per month.
  • Analyzes container images as they are pushed to ECR; 9 cents per image scan per month.
  • Analyzes Lambda Function code and package dependencies as they are deployed. Lambda standard scans costs $0.30 per month per lambda. Lambda code scans costs $0.60 per month per lambda!
  • Reports into AWS Security Hub and also send findings to Event Bridge
  • Checks against a database of CVEs (package vulnerabilities) and re-runs every time the CVE database is updated.
  • A risk score is associated with vulnerabilities for prioritization
  • Seems worth the money.

AWS Config

.   Configuration Management, Conformance, History Tracking
.
.   Managed-Config-Rules   SSM-Documents-Remediate
.
.   Conformance-Pack-Apply-Across-Organization
.
  • Helps with Auditing and recording compliance. Can also Remediate!
  • Records configuration changes history of selected Resources:
    • Examples include S3 Buckets, EC2 instance, EC2 Subnet, EC2 VPC, IAM Roles
  • There are AWS managed config rules (over 75)
  • Config Rules raise alerts and also allows you to remediate and fix noncompliant resources:
    • Example Config Rules: check-if-cloudtrail-enabled, acm-certificate-expire-check, etc.
    • check-unrestricted-ssh : i.e. Check if ssh inbound is restricted to specific IPs and CIDRs only.
    • Use SSM Documents to remediate and fix noncompliant resources.
  • AWS Config is Per region service but can be aggregated.
  • You can view the compliance of a resource as green/red indicators in a timeline.
  • You can view CloudTrail API calls (who called what) in Config.
  • Can make custom config rules (using AWS Lambda)
  • Make the rule to be evaluated for each config change or periodically.
  • Can use SSM Automations for auto remediation (using Lambda).
  • Using AWS organization, can apply conformance pack across all or most accounts. AWS Organization Service should list AWS-Config as a trusted Service to do this.
  • Pricing seems cheap and affordable.
  • See https://www.youtube.com/watch?v=qHdFoYSrUvk AWS Config Tutorial Demo

Amazon GuardDuty

  • Intelligent automatic Threat Discovery to protect your Account using Logs.
  • One click to enable, no need to install software.
  • Uses ML algorithms and third party data, anomaly detection.
  • It analyzes CloudTrail Event Logs, VPC flow Logs and DNS query Logs, also EKS control plane logs, S3 data plane events.
  • Within an AWS Organization, you can make one member account the "Delegated Admin" for GuardDuty. That account has full permissions to enable and manage GuardDuty in all accounts.
  • The findings can be integrated/sent to Amazon Detective, Security Hub, EventBridge.
  • Use case: an SSH brute-force attack is detected; an event is sent to EventBridge which can trigger Step Functions to block traffic from that IP address by updating a firewall or NACL.
  • Pricing seems affordable; watch out for VPC Flow Logs as they are billed at $1 per GB/month. Other sources seem cheap: CloudTrail management event analysis is $4 per 1 million events.

AWS Security Hub

  • Central security tool to manage security across many AWS accounts.
  • Integrated dashboards showing current security status
  • The Security Hub Console provides following:
    • Summary : Brief summary of the following items.
    • Security Standards: Displays the percentage compliance against standards.
    • Insights:
    • Findings: Finding is a security issue or a failed security check. Major issues.
    • Integrations: List of services (like GuardDuty) we have accepted to receive findings from.
    • Settings: Member Accounts, Custom Actions.
  • Integrates well with other AWS services & partner tools and aggregates and prioritizes the findings : Config, GuardDuty, Inspector, Access Analyzer, Systems Manager, Firewall Manager, AWS Health, Amazon Macie and partner solutions.
  • Prepackaged security standards are available for Security Hub such as CIS AWS Foundations Benchmark, Payment Card Industry Data Security Std (PCI DSS), NIST SP 800-53, etc. Runs security best practices checks against industry standards.
  • You can configure customised actions and send findings to ticketing, chat, email or automated remediation systems using integration with CloudWatch Events.
  • Pricing is cheap.
  • See demo: https://www.youtube.com/watch?v=f17l2v5v9g4

Amazon Detective

Analyze and visualize security data to investigate potential security issues and their root cause. Uses logs and events from GuardDuty, Security Hub, Inspector, Access Analyzer, CloudTrail, VPC Flow Logs.

A bit expensive: $2 per GB of logs ingested.

AWS Secrets Manager

Secrets are encrypted using the AWS managed KMS key aws/secretsmanager. If you want to use your own KMS key for encryption, you can do so using the update-secret operation.

Commands:

aws secretsmanager create-secret \
 --name MyTestSecret \
 --description "My test secret created with the CLI." \
 --secret-string "{\"user\":\"diegor\",\"password\":\"EXAMPLE-PASSWORD\"}"

aws secretsmanager put-secret-value \
   --secret-id MyTestSecret \
   --secret-string "{\"user\":\"diegor\",\"password\":\"EXAMPLE-PASSWORD\"}"

aws secretsmanager get-random-password --require-each-included-type --password-length 20

# Re-encrypt the secret using my own KMS key
# instead of the standard AWS managed key.
aws secretsmanager update-secret --secret-id MyTestSecret \
          --kms-key-id arn:aws:kms:.*:key/*

aws secretsmanager list-secrets

# By default secrets deletion does not happen for 7 days.
aws secretsmanager delete-secret --secret-id MyTestSecret --recovery-window-in-days 7
aws secretsmanager delete-secret --secret-id MyTestSecret --force-delete-without-recovery

# Restore secret that was previously scheduled for deletion.
aws secretsmanager restore-secret --secret-id MyTestSecret

# RDS offers managed rotation where it updates DB password as well.
aws secretsmanager rotate-secret \
 --secret-id MySecret \
 --rotation-rules "<cron-expression>"

# If DB credentials also need to be updated, you can specify lambda along with rotation.
aws secretsmanager rotate-secret \
 --secret-id MyTestDatabaseSecret \
 --rotation-lambda-arn arn:.*
 --rotation-rules "<cron-expression>"

Type of DDOS attacks

DDOS - Distributed Denial of Service -- types:

  • Too many application level requests (API calls). You can overload backend DB, etc.
  • SYN Flood - Too many TCP connection requests. Layer 4.
  • UDP Reflection - Too many incoming big UDP packets.
  • DNS Flood attack - Too many DNS requests to DNS server, so people can not resolve.
  • Too many active connections (Slowloris attack): keep many HTTP (or similar) connections open and unfinished.

AWS KMS - Key Management Service

.
.  KMS-Key  Key-Policy  Symmetric Asymmetric  
. 
.  Encrypt Decrypt Sign-Verify  Generate-Data-Key   HMAC
.
.  Multi-Region-Key
.

KMS lets you create, manage, and control cryptographic keys across your applications and AWS services.

KMS cannot encrypt/decrypt more than 4 KB of data per call. It is intended to encrypt your data key itself, not your bulk data (envelope encryption).

The encrypted data is stored along with the encrypted data key. When decrypting, the data key is first decrypted via KMS; then openssl or a standard library is used to decrypt the data with the plaintext key.

master_key_id = aws kms create-key of type symmetric  # This is CMK -  Customer Master Key.
(my_plain_key, my_encrypted_key) = aws kms generate-data-key and encrypt using master_key_id
my_encrypted_data =  openssl encrypt input_data + Append master_key_arn + my_encrypted_key

For decryption:
my_plain_key =   kms.decrypt(my_encrypted_key, master_key_id)
my_decrypt_data = openssl decrypt (my_encrypted_data, my_plain_key)

# KMS encrypted Data Block (e.g. SSE-KMS )
.
.     {--Encrypted-Data--}   {KMS-key-id} {encrypted-data-key}
.
.     Data-Key  = Decrypt(kms-key-id, encrypted-data-key)
.

For symmetric key usage, the KMS key maintains internal unique key material (previously called the Customer Master Key) which is never revealed to you. You generate a symmetric data key, encrypt that key itself using the KMS key, and store the encrypted key in your application along with the data.
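The envelope-encryption flow above can be sketched locally. This is a toy example: a stand-in XOR "cipher" replaces both KMS and AES purely to show how the data key moves around; it is not real cryptography.

```python
import secrets

def xor(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real cipher; xor(xor(x, k), k) == x.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

master_key = secrets.token_bytes(32)             # lives inside "KMS", never leaves
data_key = secrets.token_bytes(32)               # GenerateDataKey: plaintext copy...
encrypted_data_key = xor(data_key, master_key)   # ...plus the CiphertextBlob

ciphertext = xor(b"my secret data", data_key)    # encrypt locally with the plain key
del data_key                                     # discard the plaintext key
stored = (ciphertext, encrypted_data_key)        # store ciphertext + encrypted key

# Decryption: ask "KMS" to decrypt the data key, then decrypt locally.
plain_key = xor(stored[1], master_key)
print(xor(stored[0], plain_key))  # b'my secret data'
```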

Use Cases:

  • Encrypt and Decrypt Data
  • Sign and Verify
  • Generate Data Keys
  • Generate and verify MACs
  • Auto integrated with S3, EBS, EC2, etc.
  • Multi-region keys are supported.

KMS Key Types

Type         KeyUsage and (KeySpec)
Symmetric  - SYMMETRIC_DEFAULT
Asymmetric - ENCRYPT_DECRYPT (RSA_4096,etc) 
             SIGN_VERIFY (ECC_NIST_P521)  ECDSA (elliptic curve key pair) for sign and verify.
                                          This cannot be used for encryption/decryption.
             KEY_AGREEMENT (ECC_NIST_P521) Pair of ECDH keys used to derive shared key.
                                           Also used for sign/verify. Not for encrypt/decrypt.

KMS Signing Algorithms

"SigningAlgorithms": [            # For SIGN_VERIFY with keyspec of RSA_2048
  "RSASSA_PKCS1_V1_5_SHA_256",
  "RSASSA_PKCS1_V1_5_SHA_384",
  "RSASSA_PKCS1_V1_5_SHA_512",
  "RSASSA_PSS_SHA_256",
  "RSASSA_PSS_SHA_384",
  "RSASSA_PSS_SHA_512"
]

"KeyAgreementAlgorithms": [       # For SIGN_VERIFY with keyspec of ECC_NIST_P521
      "ECDH"                      # Elliptic key pair to derive shared secret
],

"EncryptionAlgorithms": [         # For symmetric usage, keyspec also SYMMTERIC_DEFAULT
  "SYMMETRIC_DEFAULT"
]

Symmetric/Asymmetric KMS Keys

A single symmetric key used for encrypt/decrypt operations is the default and most common type.

Hash-Based Message Authentication Code (HMAC) KMS keys

secret_key_id = aws kms create-key --key-spec HMAC_512 --key-usage GENERATE_VERIFY_MAC
# You get ARN and keyId and algorithm etc info in output. GenerateMac operation.

Hash_value = hmac(data, secret_key_id, algorithm)    # data: max 4k; Hash fixed length.
# The HMAC KMS key keeps the secret_key secret. Same key for sign and verify.
# algorithm = HMAC_SHA_224 or 256 or 384 or 512; key spec is HMAC_224, etc.

After receiving data, you recalculate Hash_value using same secret_key_id and verify the hash.

Command operations are: GenerateMac and VerifyMac
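GenerateMac/VerifyMac behave like standard HMAC. A local stdlib analogue (the real KMS service never reveals the secret key material; here it sits in memory just to illustrate the sign-then-verify flow):

```python
import hashlib
import hmac

# Local analogue of KMS GenerateMac / VerifyMac with HMAC_SHA_256.
secret = b"example-secret-key-material"  # in KMS this never leaves the service

def generate_mac(data: bytes) -> bytes:
    return hmac.new(secret, data, hashlib.sha256).digest()

def verify_mac(data: bytes, tag: bytes) -> bool:
    # Recompute with the same key and compare in constant time.
    return hmac.compare_digest(generate_mac(data), tag)

tag = generate_mac(b"hello")
print(verify_mac(b"hello", tag))     # True
print(verify_mac(b"tampered", tag))  # False
```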

Commands :

aws kms create-key
       [--policy <value>]                          # Resource policy for the key
       [--description <value>]
       [--key-usage <value>]                       # Omit for default symmetric key.
                     SIGN_VERIFY
                     ENCRYPT_DECRYPT
                     GENERATE_VERIFY_MAC
                     KEY_AGREEMENT

       [--customer-master-key-spec <value>]
       [--key-spec <value>]
       [--origin <value>]                      # Use "EXTERNAL" to import your own key later.
       [--custom-key-store-id <value>]
       [--tags <value>]
       [--multi-region | --no-multi-region]
       [--endpoint-url <value>]
       [--output <value>]
       [--query <value>]
       [--profile <value>]
       [--region <value>]
       [--no-sign-request]
       [--ca-bundle <value>]


aws kms generate-data-key --key-id alias/ExampleAlias --key-spec AES_256

        {
             "Plaintext": "VdzKNHGzUAzJeRBVY+uUmofUGGiDzyB3+i9fVkh3piw=",
             "KeyId": <arn>
             "CiphertextBlob": "AQEDA..."
        }


aws kms get-public-key --key-id alias/my_RSA_3072  # Get public key portion of asymmetric key

aws kms describe-key --key-id alias/my_RSA_3072  # Note KeySpec, KeyUsage in output

# To prepare for import, you need to get the wrapping public key and an import
# token, then use the wrapping key to encrypt your key material.
# Note: the wrapping key is a temporary key used to encrypt your key during import.
aws kms get-parameters-for-import ....

# Encrypt key material using openssl 
openssl pkeyutl  -encrypt -in PlainKeyMaterial.bin  ...

# Import your own key material. Save your copy. You can never export it from KMS.
aws kms import-key-material --key-id <key_id>
                            --encrypted-key-material fileb://EncryptedKeyMaterial.bin \
                            --import-token fileb://ImportToken.bin  ...

# To keep your keys in cloudHSM hardware module ...
aws kms create-custom-key-store
     --custom-key-store-name ExampleCloudHSMKeyStore \
     --cloud-hsm-cluster-id cluster-1a23b4cdefg \
     --key-store-password kmsPswd \
     --trust-anchor-certificate <certificate-goes-here>

aws kms describe-custom-key-stores 

# External keystore is also supported.
# Both CloudHSM and external key stores are supported only for symmetric keys.

Allow KMS access to external account

  • You can add key policy to KMS Key.

  • You can allow specific role (preferred) or external account.

  • Allowing external account means specify the root principal or specific role:

    Principal: { "AWS": "arn:aws:iam::444455556666:root" }  /* External Account Allow */
    Principal: { "AWS": "arn:aws:iam::444455556666:role/ExampleRole"}  /* External Role Allow */
    
  • In addition, the external account IAM policy must allow that as well.

  • External account can further restrict the original access given by owner account but can not give more permission:

    { 
      Effect: "Allow", 
      Action:  [ ... ], 
      "Resource": "arn:aws:kms:us-west-2:111122223333:key/xxx"   # Source Key
    }
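On the owner side, a minimal key-policy statement granting the external account use of the key could look like this (the account ID and action list are illustrative; in a key policy, "Resource": "*" means the key the policy is attached to):

```json
{
  "Sid": "AllowExternalAccountUse",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
  "Action": [ "kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey*" ],
  "Resource": "*"
}
```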
    

Cloudfront and KMS

Attach a policy to the KMS key to allow the cloudfront.amazonaws.com service to read that specific KMS key when the distribution origin is an S3 bucket that uses server-side KMS encryption.

Multi-Region KMS Key

You need a multi-region key if you want to manage encrypted backups across regions:

.                      Replicate
.   Primary-KMS-Key  -----------> Replica-KMS-Key
.      Region-1                      Region-2
.          Same KeyId (mrk-...) in both regions

Create the primary multi-Region KMS key in the source region:

aws kms create-key  --description "Primary multi-Region key for my application" --region us-east-1
                    --multi-region         # This key is multi-region capable!
# Note the KeyId from output!

Replicate the key in the target region:

aws kms replicate-key --key-id "<PrimaryKeyId>" --replica-region us-west-2

aws kms describe-key --key-id "<ReplicatedKeyId>" --region us-west-2

# Examine the details of the replicated key, its MultiRegion status

Notes:

  • The primary and replica keys share the same key ID (prefixed mrk-); only the region in the ARN differs, so ciphertexts are interoperable across regions.
  • Automatic key replication: once set up, AWS KMS synchronizes key material and updates across regions.

AWS Certificates

  • ACM - AWS Certificate Manager certificates are regional and do not work cross-region!
  • Only for CloudFront (a global service) can you use a us-east-1 certificate. For ALB and API Gateway, you have to obtain the certificate from ACM in the same region.
  • If you have your own certificates, upload and keep them in IAM (don't keep them in S3 buckets):
aws iam upload-server-certificate --server-certificate-name ExampleCertificate
                                  --certificate-body file://Certificate.pem
                                  --certificate-chain file://CertificateChain.pem
                                  --private-key file://PrivateKey.pem
                                  [--path  /cloudfront/test For access with cloudfront ]

FIPS 140-2 certified encryption

FIPS 140-2 defines a cryptographic module as “the set of hardware, software, and/or firmware that implements approved security functions and is contained within the cryptographic boundary.” It is an approved standard from NIST.

KMS CRR

aws kms replicate-key \
  --key-id <primary-key-id> \
  --description "Replica KMS key in eu-west-1" \
  --region eu-west-1

EC2 Traffic Mirroring

Traffic Mirroring copies inbound/outbound/both traffic from an ENI so it can be analyzed asynchronously by other appliances or network monitoring tools.

.
.
.      EC2     Traffic Mirroring
.     ENI ------------------------->  NLB | Another ENI | Monitor-Appliance-IP
.

S3 Notes

.
.    S3TA       SSE-S3        SSE-KMS         SSE-C       CRR
.
.    Bucket-Policy  ACL  Access-Point  S3-Partitioning  Parquet Datalake
.
.    UploadFree (Minimal for PUT Request)    Download:9c/GB
.
.    Transfer-Acceleration  S3-LifeCycle-Rules-per-bucket
.
.    Storage-Class-Per-Object  Default-Storage-Class-For-Bucket
.
.    Bucket-Owner-Enforced   Cache-Control
.
  • Object Storage, serverless, unlimited storage, pay-as-you-go

  • Flat object storage service.

  • Looks like /bucket/myfolder/mysubfolder/myfile

  • Note that /bucket/myfolder/ is a zero-length object with that folder name. S3 does not recognize folders.

  • Good for static content. Image/video, etc.

  • Access objects by key, no indexing.

  • Anti patterns:

    • Lots of small files
    • Search features and rapidly changing data
  • Supports multi-part upload; recommended for files >100MB. A lifecycle rule (AbortIncompleteMultipartUpload) can clean up incomplete multi-part uploads.

  • S3 Transfer Acceleration:

    • Transfers first to the nearest edge location and then over the AWS backbone to the target region.
    • Enabled at the bucket level using advanced settings. Costs a few cents per GB extra.
    • S3 URL looks like bucketname.s3-accelerate.amazonaws.com
  • S3 Pre-signed URLs are used to download/upload S3 objects; valid for 1 hour (3600s) by default.
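For example (bucket/key are placeholders; `--expires-in` is in seconds):

```shell
aws s3 presign s3://my-bucket/myfile.txt --expires-in 3600
```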

  • S3 Storage Classes and Access Tiers:

    - S3 Standard - General Purpose 
    
    - S3 Standard - Infrequent Access (IA)   - Access time instant. Storage fee less. Access fee more.
    
    - S3 One Zone- Infrequent Access         - 99.5% Availability vs 99.9% (Std-IA) / 99.99% (Std).
    
    - S3 Intelligent Tiering                 - Access Tiers:
                                               Frequent,                   (Default, automatic)
                                               Infrequent (30+ days),      (Automatic)
                                               Archive Instant (90+ days), (Automatic)
                                               Archive,        (Optional, days configurable, 3-5 hours)
                                               Deep Archive.   (Optional, days configurable, 9-12 hours)
                                               (Note: Small monthly per-object monitoring fee.)
    
    - S3 Glacier Instant Retrieval           - Instant Access. More storage cost vs Glacier. Less access fee.
    
    - S3 Glacier Flexible Retrieval          -  Retrieval Options and Access Times:
      (aka Glacier)                               Expedited: Typically 1-5 minutes
                                                  Standard: Takes about 3-5 hours
                                                  Bulk: Takes 5-12 hours
                                                  (Less Storage cost)
    
    - S3 Glacier Deep Archive                -  Retrieval Options and Access Times:
                                                  Standard: 12 hours
                                                  Bulk:  up to 48 hours
                                                  (Least Storage cost)
    
  • S3 Storage cost:

    S3 Standard         :    .023  $/GB-Month   (4x cheaper than GP2)         1TB - 23 USD
                                                (GP2:  0.10 $/GB-Month)
    S3 IA               :    .0125 $/GB-Month   (2x cheaper than S3 Std)      1TB - 12 USD
    S3 Glacier Instant  :    .004  $/GB-Month   (3x cheaper than S3 IA)       1TB -  4 USD
    S3 Deep Archive     :  .00099  $/GB-Month   (4x cheaper than Glacier)     1TB -  1 USD
    
    vs
    
    GP2 EBS : 1 TB  - 100 USD  (10c / GB-Month)
    GP3 EBS : 1 TB  -  80 USD  ( 8c / GB-Month; Per 1000 IOPS, 5 USD more above 3000; )
    EFS     : 1 TB  - 300 USD  (30c / GB-Month)
    io1     : 1 TB  - 125 USD  (12.5c / GB-Month)
    io2     : 1 TB  - 125 USD (Provisioned IOPS costly - Every 1000 IOPS 50 USD more.)
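A quick sanity check of the per-TB figures above (a sketch treating 1 TB as 1000 GB):

```shell
# Monthly cost for 1 TB at each per-GB-month rate from the table above.
tb_cost() { awk -v p="$1" 'BEGIN{printf "%g", p * 1000}'; }

echo "S3 Standard    : $(tb_cost 0.023) USD"     # 23
echo "S3 Standard-IA : $(tb_cost 0.0125) USD"    # 12.5
echo "Glacier Instant: $(tb_cost 0.004) USD"     # 4
echo "Deep Archive   : $(tb_cost 0.00099) USD"   # 0.99
echo "GP2 EBS        : $(tb_cost 0.10) USD"      # 100
```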
    
  • Minimum storage duration charges apply for IA and Glacier classes: 30 days for IA, 90 days for Glacier (Instant and Flexible), 180 days for Glacier Deep Archive.

  • Restore Fees:

    • (Backup) Retrieval/Restore fee applies for IA and Glacier classes. For Glacier restores, a temporary copy in S3 is created and you are billed for the storage of both copies.
    • S3 Glacier flexible retrieval charge depends on retrieval type: standard (3-5 hours), expedited (1-5 mins), or bulk (5-12 hours). For example, expedited retrieval costs 3 cents per GB plus 1 cent per request.
  • For S3 Deep Archive, std retrieval time is 12 hrs and bulk is 48 hrs.

  • There's no data transfer fee to upload data to Glacier. But uploading an object is a PUT request, billed at $.03 per 1,000 requests -- not a huge charge, but 6x the PUT request price of S3 Standard.

  • Data transfer out from Glacier to the internet costs approx 9 cents per GB. In addition there is per-object GET request pricing that is very cheap (about 1 cent per 1,000 objects).

  • The special S3 Intelligent-Tiering storage class has built-in lifecycle management and moves objects between "access tiers" while remaining in the same storage class.

  • The storage class is for Object, not bucket. There is "Default Storage Class" for bucket.

  • S3 PutObject can override the default storage class of the bucket.

  • Storage class of object and Default storage class of bucket can be changed anytime.

  • S3 Std-IA and S3 Glacier are distinct storage classes.

  • S3 Glacier Flexible Retrieval is same as S3 Glacier storage class.

  • S3 life cycle policies can be used to move objects between tiers. The policy is based on rules with filters of objects for which the policy applies. The prefix of object keyname is a supported filter.

  • S3 Event notifications possible:

    • S3:ObjectCreated, removed, restore, replication done etc.
    • Object name filtering possible *.jpg
    • Use case: generate thumbnails of images after S3 upload.
    • Event notification delivered in seconds but can take minutes too!
    • To simplify, you can configure to send "All S3 events" to EventBridge.
  • S3 Cost Saving Tips:

    • S3 Select & Glacier Select: retrieve a filtered subset of an object using SQL-like expressions on its contents! Only CSV, JSON or Apache Parquet files are supported. This saves cost and is faster. Use the SelectObjectContent API.
    • You can also compress objects to save space and money.
    • use S3 Lifecycle rules and auto transition objects between tiers.
    • Enable S3 Requester Pays option for your bucket. The reader pays for data download. Owner just pays for the cost of storing data.
  • S3 Analytics (aka Storage Class Analysis): helps you transition objects to the right storage class.

    • Recommendations for Standard and Standard IA (does not work for One-Zone IA or Glacier)
    • Report is updated daily
    • Visualize data in Amazon Quicksight
    • Good first step to create Lifecycle Rules (or improve them)
  • S3 Storage Lens: Analyzes storage usage across your organization and generates report! There are around 28 free usage metrics. Advanced metrics cost extra.

  • Tip: You can index objects in S3 in DynamoDB and use that index to search and filter!

  • Durability is very high: 99.999999999% (eleven 9's) for objects, stored across multiple AZs. S3 Standard availability is 99.99% = not available for about 53 minutes a year.

S3 Lifecycle Policy

.
.              After 1 year     Move to Glacier
.     Bucket -----------------> Delete
.              Tag filter       Move to S3 IA
.              Name Filter      etc.
.
.    Note: Rules apply per bucket (with object filters), not per individual object.
.    Note: There is no Storage class for bucket.
.
  • You can configure per-bucket S3 Lifecycle Rules to move the objects between storage classes.
  • There is No Storage Class associated with S3 Bucket.
  • You can use bucket policy to prevent PutObject operation without matching storage class header. This is one way of enforcing specific storage class for the objects in the bucket.
  • You can also use Life Cycle policy of bucket to somewhat enforce storage class.
  • Note that you cannot specify a transition criteria of 0 idle days -- the minimum is 30 days for Standard_IA, for example! (Minimum days differ per storage class.)

There is no standard managed policy for S3 lifecycle; define your own. For example:

# Note: JSON does not allow comments -- keep annotations outside the document.
# Minimum 30 days before STANDARD_IA; a direct Standard -> Glacier transition alone can be done after 1 day.
aws s3api put-bucket-lifecycle-configuration --bucket your-bucket-name --lifecycle-configuration '{
"Rules": [
  {
    "ID": "TransitionToGlacier",
    "Prefix": "",
    "Status": "Enabled",
    "Transitions": [
      { "Days": 30,  "StorageClass": "STANDARD_IA" },
      { "Days": 90,  "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": {
      "Days": 730
    }
  }
] }'
  • Alternatively, you may want to specify the storage class explicitly in PutObject.

S3 Object Ownership

  • When you upload an object into a bucket, who is the owner? The bucket creator or the uploader?
  • Setting Amazon Object Ownership for the bucket to "Bucket owner enforced" (default for new buckets) ensures that new objects default ownership is set to the bucket owner.

S3 Encryption

  • New buckets are now encrypted with SSE-S3 by default; older buckets may have no encryption.
  • To enforce https for S3 access, use a bucket policy with the aws:SecureTransport condition (encryption in transit).
  • You must use one of the S3 encryption strategies for encryption at rest.
  • Note: All Glacier data is AES-256 encrypted under AWS control (same as SSE-S3).
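A minimal bucket policy sketch for the aws:SecureTransport condition (bucket name is a placeholder):

```shell
aws s3api put-bucket-policy --bucket my-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyInsecureTransport",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": { "Bool": { "aws:SecureTransport": "false" } }
  }]
}'
```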

S3 Encryption Options

  • SSE-S3
  • SSE-KMS
  • SSE-C
  • CSE (Using your own key or KMS key)

SSE-S3

.
.     PutObject ---->  x-amz-server-side-encryption: AES256
.     GetObject ---->  No headers needed.
.
  • Each object is encrypted using unique Key by Server.
  • That key itself is encrypted using an S3 master key. The process is transparent to the user.
  • Bucket level setting possible to default to this encryption.
  • PutObject header must specify: x-amz-server-side-encryption: AES256
  • GetObject need not supply any header.

SSE-KMS

.
.  PutObject ----> {sse-encryption: KMS, kms-id, optional-encryption-context}
.
.  GetObject ----> No headers needed (S3 supplies the stored context to KMS).
.
  • Use KMS to manage encryption keys.
  • Even if object is made public, it can't be read without access to KMS.
  • Usually the reader has KMS access, S3 automatically decrypts it and gives it to you.
  • Reader does not have to remember the associated KMS key.
  • KMS Encryption Context:
    • Encryption context is not confidential data. e.g. Env=Dev, App=MyApp
    • Used as additional protection and for auditing (it appears in CloudTrail logs).
    • S3 stores the context with the object and supplies it to KMS on decryption -- you do not pass it in GetObject.
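For example, a PutObject sketch with an explicit KMS key and context (bucket, key and ARN are placeholders; the CLI expects the context as base64-encoded JSON):

```shell
aws s3api put-object --bucket my-bucket --key report.csv --body report.csv \
    --server-side-encryption aws:kms \
    --ssekms-key-id "arn:aws:kms:us-east-1:123456789012:key/<key-id>" \
    --ssekms-encryption-context "$(echo -n '{"App":"MyApp"}' | base64)"
```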

SSE-C

.
.  PutObject ---> { sse-customer-algorithm: AES256, customer-key, customer-key-md5 }
.
.  GetObject ---> { sse-customer-algorithm: AES256, customer-key, customer-key-md5 }
.
  • SSE-C: Server Side Encryption with customer provided keys.
  • Both put and get object requires specifying key and algorithm.
  • Server does not remember key, but helps you with encrypt/decrypt operation.

At the time of object creation with the REST API, you must specify the following headers:

Name                                                    Description
x-amz-server-side-encryption-customer-algorithm     Specify `AES256`

x-amz-server-side-encryption-customer-key           provide the 256-bit, base64-encoded encryption key 
                                                    for S3 to use to encrypt or decrypt your data.

x-amz-server-side-encryption-customer-key-MD5       Specify base64-encoded MD5 digest of the encryption key.
  • When using the presigned URL to upload/download object, you must provide all the encryption headers above.
You must remember the key and specify it while reading:

  aws s3api get-object  --bucket <bucket-name> --key <object-key> <local-file> \
                        --sse-customer-algorithm AES256 \
                        --sse-customer-key <Base64-encoded-encryption-key> \
                        --sse-customer-key-md5 <Base64-encoded-MD5-of-encryption-key>
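The three SSE-C header values are mechanically related: the key is 32 random bytes base64-encoded, and the MD5 header is the base64 of the MD5 digest of the raw key bytes. A sketch (assumes openssl is available; ssec.key is a scratch file):

```shell
# 256-bit random key -> base64 header value; MD5 of the raw key bytes -> base64 MD5 header value.
head -c 32 /dev/urandom > ssec.key

KEY_B64=$(base64 < ssec.key | tr -d '\n')
KEY_MD5_B64=$(openssl dgst -md5 -binary ssec.key | base64 | tr -d '\n')

echo "x-amz-server-side-encryption-customer-algorithm: AES256"
echo "x-amz-server-side-encryption-customer-key: $KEY_B64"
echo "x-amz-server-side-encryption-customer-key-MD5: $KEY_MD5_B64"
```

Pass KEY_B64 / KEY_MD5_B64 as --sse-customer-key / --sse-customer-key-md5 in the put-object and get-object calls shown above.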

CSE - Client side encryption

.
.
.    PutObject -->              Data  Layout
.                    {--Encrypted-Data--}   {encrypted-data-key}
.
.    GetObject -->  Retrieve Content --> Use SDK (or CloudHSM) to Decrypt.
. 
.    CMK: Customer Master Key is Secret.
.
.    CSE-KMS : Client Side Encryption using KMS (Optional)
.
  • With CSE, the server does not know whether the data is encrypted. You do encryption/decryption yourself.

  • You need to remember the Customer Master Key for later decryption.

  • For each object you can use different encryption data key encrypted by CMK.

  • data_plain_key = Decrypt(encrypted_data_key, CMK)

  • There are two ways to use CMK:

    • Your own key (using CloudHSM or On-Premise HSM or Save in your env )
    • Or Use KMS key
  • Using On-premise HSM is a common requirement in some environments.

  • Can be used with key stored in CloudHSM which is deployed into your VPC. :

    # sudo yum install aws-cloudhsm-client aws-cloudhsm-pkcs11
    ....
    # Encrypt the file using the AES key in CloudHSM
    pkcs11-tool  --module /opt/cloudhsm/lib/libcloudhsm_pkcs11.so \
                 --login --pin <crypto-user-password> --key <key-handle> \
                 --encrypt --input-file <file-to-encrypt> --output-file <encrypted-file>
    
  • AWS SDKs support client-side encryption library.

  • By using KMS, you can do:

    • Create KMS key and use it as CMK (e.g. for all objects in a bucket)
    • Ask KMS to generate data key.
    • Store encrypted-data-key along with data
    • You need to remember CMK for decrypting object later.
    • Optionally you can also store KMS-key-id also along with data.
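The KMS-backed flow above can be sketched with the CLI (alias/my-cmk and the file names are placeholders):

```shell
# 1. Ask KMS for a data key: the response contains Plaintext (use it to encrypt
#    the object locally, then discard it) and CiphertextBlob (the encrypted
#    data key -- store it alongside the object).
aws kms generate-data-key --key-id alias/my-cmk --key-spec AES_256

# 2. Later, recover the plaintext data key to decrypt the object:
aws kms decrypt --ciphertext-blob fileb://datakey.enc
```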

Default Encryption For Bucket

  • New buckets are now encrypted with SSE-S3 by default (historically there was no default encryption).
  • You can set default encryption for the bucket to SSE-S3 or SSE-KMS only.
  • The encryption context applies only to SSE-KMS and is supplied per request, not in the bucket default configuration.
  • For a GetObject operation on an SSE-KMS object, you don't need to specify the key.
  • If you don't specify an encryption context, S3 internally creates and uses a default one based on the object (or bucket) ARN.
aws s3api put-bucket-encryption --bucket my-bucket --server-side-encryption-configuration '{
  "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:region:account-id:key/key-id"
      }
  }]
}'
# For SSE-S3 use "SSEAlgorithm": "AES256" and omit KMSMasterKeyID.
# An EncryptionContext can NOT be set here -- it is passed per PutObject request.
  • In the KMS key policy you can use the condition key: "kms:EncryptionContext:AppName": "ExampleApp"
  • For actions such as Encrypt, Decrypt, GenerateDataKey and CreateGrant you can conditionally grant permissions depending on the context.

S3 Encryption Summary

.
.  Method       Description                 Relevant Headers
..............................................................................................
.  SSE-S3   Dynamic Server Side.            x-amz-server-side-encryption: AES256 
.
.  SSE-KMS  Specify KMS Key-id.             x-amz-server-side-encryption: aws:kms
.           Get: No need to specify key.    x-amz-server-side-encryption-aws-kms-key-id: arn:*
.           Context is optional.            x-amz-server-side-encryption-context: base64-json
.
.  SSE-C    Both Put and Get                x-amz-server-side-encryption-customer-algorithm: `AES256`
.           Requires 3 headers              x-amz-server-side-encryption-customer-key: base64-encoded
.           Server forgets keys.            x-amz-server-side-encryption-customer-key-MD5: base64 
.           But aware it is SSE-C.          
.
.  CSE      Client Side Encryption          No Headers.
.           CMK is secret.                  Store encrypted datakey along with data.
.

S3 Access Control Strategies

.
.   Bucket-Policy  AccessPoint-Policy   IAM-Policy  Endpoint-Policy   ACLs (Deprecated)
.
.
.      VPC                                                          (Bi-Directional S3-CRR )
.           S3-Gateway-Endpoint ------>      Access-Point   ------->  S3-Bucket-Region-1
.                  |                     (Optional Multi-Region)            |
.               Policy                            |                   S3-Bucket-Region-2
.                                               Policy                      |
.                                                                        Policy
.

Also See: https://aws.amazon.com/blogs/security/iam-policies-and-bucket-policies-and-acls-oh-my-controlling-access-to-s3-resources/

There are multiple ways you can control access:

  • S3 bucket policies: Attach one or more to bucket. Most modern and recommended.
  • S3 ACLs: (Not recommended) You can disable ACLs on bucket. But ACLs support easier control over per object permission.
  • Access Point Policy: Can have multiple access point policies (one per application) for single access point.
  • IAM Policies : You can attach one or more IAM policies to principals.

Example Access Point Policy:

# This policy allows Jane to use this Access Point to access the bucket.
{
  "Version":"2012-10-17",
  "Statement": [
  {
      "Effect": "Allow",
      "Principal": {
          "AWS": "arn:aws:iam::123456789012:user/Jane"
      },
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:us-west-2:123456789012:accesspoint/my-access-point/object/Jane/*"
  }]
}

# For the above to be effective, the bucket policy must also permit Jane to access the bucket!

# This is equivalent to pressing "Block all public access" in the console, account-wide:
aws s3control put-public-access-block \
  --account-id 123456789012 \
  --public-access-block-configuration \
      'BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true'
# (Use "aws s3api put-public-access-block --bucket <name>" for a single bucket.)

# Blocking public access settings Flags:
#
#  BlockPublicAcls:   Put bucket ACL or Put object ACL blocked if public.
#  IgnorePublicAcls:  You can have public access ACL, but all ACL will be ignored.
#                     This is same as "Turn off public Access" button in S3 console for bucket.
#  BlockPublicPolicy: Put bucket policy will fail if public.
#  RestrictPublicBuckets:  Even if the policy is public, only principals in the bucket owner's
#                          account and AWS services are allowed. Cross-account access is blocked.

S3 Bucket policies

  • you can grant public access, force objects to be encrypted at upload, grant access to another account.
  • Optional condition can involve SourceIp (using public or Elastic IP) or VpcSourceIp (private IP through VPC Endpoint)

S3 ACL

.
.  Bucket-Owner-Enforced   Bucket-Owner-Preferred  Object-Writer
.
.  Bucket-ACL   Object-ACL
.
.  S3-Condition-Keys  x-amz-acl:read
.
  • ACLs are mainly useful for granting access to other AWS accounts (canonical user IDs).
  • ACLs exist at both the bucket level and the object level.
  • By default object-level ACL support is disabled on new buckets, but it can be enabled. (Object-level ACLs are deprecated.)
  • If you disable and later re-enable ACLs, the old objects and their associated ACLs are restored.
  • ACL is applicable to bucket as well as Objects:
    • Read permission on bucket means you can list objects.
    • Read on object means you can read object contents.

Object ownership property Of S3 Bucket

This property has influence on how ACLs are interpreted.

  1. Bucket Owner Enforced (New Default). ACLs are disabled. Objects are all owned by bucket owner only.
  2. Bucket Owner Preferred - All new objects are owned by bucket owner. For existing objects it is backward compatible and honors the object level ACLs. (ACLs enabled)
  3. ObjectWriter - Only object Writer owns the object and can alter ACL to grant access to others. (Previous Default, Now deprecated) (ACLs enabled)
aws s3api put-bucket-ownership-controls --bucket your-bucket-name 
          --ownership-controls '{ "Rules": [ { "ObjectOwnership": "BucketOwnerEnforced" } ]}'

Note: It is recommended to set bucket ACL to private before disabling ACL.

ACL Grants and Condition Keys

During creation of S3 object (PutObject), you can specify ACL grants (using headers) which can also be used in S3 conditions to allow/deny operation.

If the bucket ownership property is set to Bucket owner enforced, using these headers results in an error during PutObject (only bucket-owner-full-control is accepted).

  • s3:x-amz-grant-read ‐ Read access to specified account id. e.g. "id=1234-account-id"

  • s3:x-amz-grant-write ‐ Grant Write.

  • s3:x-amz-grant-read-acp ‐ Read Access Control Policy (acl==acp). Example value: "id=1234-account-id"

  • s3:x-amz-grant-write-acp ‐ Grant Write ACL to account.

  • s3:x-amz-grant-full-control ‐ e.g. "s3:x-amz-grant-full-control": "id=AccountA-CanonicalUserID"

  • s3:x-amz-acl ‐ Valid values:

    - private             - Owner gets full control. Recommended.                 (Object/Bucket)
    - public-read         - Owner gets full control. Public Read.                 (Object/Bucket)
    - public-read-write   - Owner gets full control. Public Read-Write.           (Object/Bucket)
    - authenticated-read  - Owner gets full control. Authenticated reads only.    (Object/Bucket)
    - bucket-owner-read   - Object owner FULL_CONTROL. Bucket owner READ.         (For Object Only)
    - bucket-owner-full-control - Both Object/Bucket Owner gets FULL_CONTROL      (For Object Only)
    

S3 Access Points

.                              1:N
.      One S3 Bucket         -------  Multiple Access Points
.
.      DataLake-Application  ------> S3-Accesspoint-Per-Application
.
.      Client                ------> S3-Accesspoint-Per-VPC  (Restricts within that VPC)
.
.      Every accesspoint has user friendly name and policy.
.
.      Simplifies Permission management without messing with Global bucket policy and IAM policies.
.
  • S3 access point is bound to a bucket.
  • S3 Access points like /finance, /sales, /analytics can be used for simpler security.
  • Each access point can be used to attach different access point policies.
  • Each access point has its own DNS name (Internet Origin or VPC Origin)
  • Access point policy is similar to bucket policy. Access point can give user permission to read even if the principal has no direct read permission.
  • This avoids a very complex monolithic global bucket policy.
  • S3 Access points allow you to have, e.g., “test” access point in every account and region. https://AccessPointName-AccountId.s3-accesspoint.region.amazonaws.com
  • Multi-region access points offer a global S3 hostname that provides access to multiple S3 buckets across AWS regions with automatic routing and failover. (Active-Active or Active-Passive replication for same S3 bucket in selected regions). Creating Multi-region access point option is available under S3 service.
  • S3 Object lambda access point is used for custom lambda functions that will modify the S3 object data before returning it to the caller.
aws s3control create-access-point --account-id 123456789012 --bucket business-records
              --name finance-ap
              [ --vpc-configuration '{ "VpcId": "<vpc-id>" }' ]

# The following commands operate at the object level. Use aws s3api (vs aws s3) for API-level control.
aws s3api get-object --key my-image.jpg 
          --bucket arn:aws:s3:us-west-2:123456789012:accesspoint/prod download.jpg

aws s3api put-object --bucket my-access-point-xyz-s3alias --key my-image.jpg --body my-image.jpg

# acl values private, public-read, etc. are canned ACLs.
aws s3api put-object-acl --key my-image.jpg 
          --bucket arn:aws:s3:us-west-2:123456:accesspoint/prod  --acl private

S3 Partitioning backed by AWS Glue Data Catalog

S3 partitioning is usually internal to AWS. As the total number of objects in a bucket grows, S3 internally partitions the data by key prefix.

However, you can explicitly partition S3 for better performance when you use S3 + Glue + Athena. You can have abstraction like "Table" with S3 contents.

When you use Athena and create table using Athena IDE, it creates a table with S3 underneath with proper partitioning.

S3 Directory Buckets

  • Directory buckets for single-digit millisecond latency on single AZ.
  • No redundancy. Meant for only high performance load.
  • Supports directory prefixes (unlike s3 which is flat storage)
  • S3 Express One Zone Storage class only.

S3 Transfer Acceleration

You can just enable this option for your bucket. For a small price, the uploads are accelerated using edge locations.

Uploading to S3 through CloudFront is technically possible, but S3 Transfer Acceleration is the preferred method.

It is compatible with SSE-S3 and with some limitations with SSE-KMS (due to KMS being regional and permissions). It is not compatible with SSE-C.

S3 Glacier And Vaults

.   
.   Service Name        S3           S3-Glacier
.   -----------------------------------------------------
.   Container          Bucket         Vault
.   Files              Object         Archive
.   API                S3-API         Glacier-API
.   Access Policy      Bucket Policy  Vault Access Policy
.   Lock Mechanism     Object Lock    Vault Lock Policy
.

aws glacier create-vault                     # Create a vault, like a bucket in S3.
aws glacier upload-archive                   # Upload an archive
aws glacier initiate-job                     # Prepare an archive for reading
aws glacier get-job-output                   # Read the archive

aws glacier initiate-vault-lock              # You lock the vault container, not individual archives.
                                             # Specify a vault lock policy, e.g. Deny DeleteArchive for 10 years.
aws glacier complete-vault-lock              # Finalize the lock. You can abort until you complete it.

aws glacier set-data-retrieval-policy        # Specify upper limits to control cost.
aws glacier set-vault-access-policy

aws   s3api   create-bucket
aws   s3api   put-bucket-policy
aws   s3api   put-bucket-lifecycle-configuration
aws   s3api   put-bucket-versioning
aws   s3api   put-bucket-cors

aws   s3api   put-bucket-accelerate-configuration
aws   s3api   put-bucket-encryption
aws   s3api   put-bucket-replication

aws   s3api   put-object
aws   s3api   put-object-acl                  # object level acls are not recommended
aws   s3api   put-bucket-acl                  # bucket level acls also not recommended
aws   s3api   put-public-access-block

aws   s3api   put-object-legal-hold
aws   s3api   put-object-lock-configuration   # Specify Retention mode (governance vs compliance) and period.
aws   s3api   put-object-retention            # Change object lock

aws   s3api   restore-object
  • S3 Glacier (formerly called Glacier) is alternate service for S3 primarily meant for archival.
  • Glacier does not support Buckets -- It uses vaults. The thing stored in a vault is called Archive.
  • Glacier uses its own set of APIs for data uploading and retrieving.
  • Now Glacier APIs are deprecated, so Glacier services are being integrated with S3 using S3 API.
  • Now Glacier Storage classes are available under S3. Use S3 APIs but internally storage is done by Glacier.
  • Glacier service continues to be available as independent service.
  • Glacier permissions are at vaults level. e.g. glacier:UploadArchive, glacier:InitiateJob, glacier:GetJobOutput (retrieving archives), and glacier:DeleteArchive.
  • You have to initiate a job to prepare the archive (glacier:InitiateJob) to be ready before you can get the archive.
  • Glacier Vault Lock is implemented using a vault lock policy which mainly specify Deny action for delete operation for certain period (like 10 Years).
  • Once you lock the vault (initiate-vault-lock operation), the vault becomes immutable (can not be deleted).
  • Vault access policy and Vault lock policy can co-exist together -- However Lock Policy Deny prevails over Access Policy.
  • Glacier Data retrieval policy enforces max limits to control cost. e.g. FreeTier or MaxRetrievalRate 1GB/s, etc.
  • You can specify controls such as “write once read many” (WORM) in a vault lock policy and lock the policy from future edits.

S3 Object Locks and WORM

.
.  WORM  Retention-Modes
.
.              Lock = Retention Mode (Governance|Compliance) + Retention Period
.
.
  • WORM - Write once read many

  • Object lock involves 2 things: Lock Retention mode and Lock retention period.

  • Retention Modes:

    • Governance Mode: Users with s3:BypassGovernanceRetention permission only can modify objects.
    • Compliance Mode: No one, including root users, can modify. Even if account deleted, objects remain.
  • Retention Period: Specify on object creation.

  • With s3:PutObjectLegalHold permission you can apply/remove legal hold on object to prevent delete.

  • Use Cases for S3 Object Lock: Regulatory Compliance, Data Protection

  • Bucket-level configuration: When creating a new S3 bucket, enable Object Lock for entire bucket. You can also enable it on existing bucket.

  • Object-level configuration: You can apply Object Lock on individual objects after enabling Object Lock on the bucket. Each object can have its own retention mode and period.

  • Once applied, a lock can be removed only in Governance mode (not Compliance), by a user with s3:BypassGovernanceRetention overwriting it with an empty retention configuration:

    aws s3api put-object-retention --bucket <bucket_name> --key <object_key> 
                 --bypass-governance-retention --retention '{}'
    
  • Monitor and Manage: You can view the retention and legal hold settings using the S3 console

  • Bucket Versioning: S3 Object Lock requires bucket versioning to be enabled.

  • The object lock is supported on all storage classes except S3 Intelligent Tiering and Glacier Deep Archive.
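Applying a lock to an individual object is a put-object-retention call (a sketch; bucket, key and date are placeholders):

```shell
aws s3api put-object-retention --bucket my-bucket --key report.csv \
    --retention '{"Mode":"GOVERNANCE","RetainUntilDate":"2030-01-01T00:00:00Z"}'

# Legal hold is set/removed independently of retention:
aws s3api put-object-legal-hold --bucket my-bucket --key report.csv \
    --legal-hold Status=ON
```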

S3 Replication

S3 supports Replication of buckets within or across regions.

|
|       Region1                                           Region2
|                        CRR
|       Bucket   -------------------------------------->  Bucket
|       |                                                 (Re-encrypt as per Destination)
|       | SRR        Live or Batch (aka On-Demand)
|       |
|       Bucket                       
|
|       (Enable Versioning)             Meta-Data  Permissions  
|
|       (One Time S3 Batch Operations to copy is Required For Replication of Existing Objects)
| 
|    aws s3api put-bucket-replication --bucket <source> --replication-configuration file://config.json
|                   # config.json: Specify Dest, Role, Filters, desired Storage class.
|
  • Enabling CRR or SRR does not auto-copy existing objects. You should (batch) copy the existing objects one time first. S3 Batch Operations requires an input manifest file that explicitly lists the objects you want to operate on:

    aws s3api list-objects --bucket my-bucket --query 'Contents[].Key' --output text \
        | tr '\t' '\n' | awk '{print "my-bucket," $0}' > manifest.csv
    
                        my-bucket,object1.txt              (manifest.csv file)
                        my-bucket,object2.jpg
                        my-bucket,folder/object3.pdf
    
    # Alternatively you can use S3 inventory report to generate this manifest.
    
    aws s3control create-job  ... (with manifest file and copy operation)
    
  • Cross Region Replication (CRR)

  • Same Region Replication (SRR)

  • The tiers and life cycle rules apply independently on source and destination buckets.

  • There are two types: Live Replication and On-Demand Replication (aka Batch Replication).

  • SRR does not cost data transfer fee but costs PUT requests.

  • CRR costs a data transfer fee of around 2c/GB, which is roughly 4x cheaper than internet download.

  • Having replication enabled does not itself incur charges; only the PUT requests and data transfer are billed.

  • If encryption settings differ between source and destination buckets, the destination bucket's encryption policy prevails.

  • Encryption metadata (such as encryption method SSE or KMS ) will be updated as needed.

  • Ensure proper KMS key permissions if using SSE-KMS on either side to prevent replication failures.

  • Objects encrypted with SSE-C cannot be auto-replicated.

  • S3 Replication Time Control (S3 RTC) can be enabled to replicate most objects within seconds and guarantee 99.99% within 15 minutes. It adds a per-GB replication transfer charge (around 1.5 cents/GB ??)
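A minimal config.json for the put-bucket-replication call shown above might look like this (role and bucket ARNs are placeholders):

```shell
# Writes an illustrative V2 replication configuration; all ARNs are placeholders.
cat > config.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "ID": "replicate-all",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {"Prefix": ""},
    "DeleteMarkerReplication": {"Status": "Disabled"},
    "Destination": {
      "Bucket": "arn:aws:s3:::my-destination-bucket",
      "StorageClass": "STANDARD_IA"
    }
  }]
}
EOF
python3 -m json.tool < config.json > /dev/null && echo "config.json OK"
```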

S3 Inventory

  • You can use Amazon S3 Inventory to generate CSV, ORC, or Parquet reports on your bucket for auditing.
  • You can generate inventory report on the replication and encryption status of your objects.
  • You can use S3 Inventory to generate a list of unencrypted objects, which can then be used to encrypt them using S3 Batch operations as explained below.
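An inventory configuration can be attached from the CLI. A sketch with placeholder bucket names; the put-bucket-inventory-configuration call is commented out since it needs real buckets:

```shell
# Illustrative weekly CSV inventory including encryption and replication status.
cat > inventory.json <<'EOF'
{
  "Id": "weekly-audit",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Schedule": {"Frequency": "Weekly"},
  "Destination": {"S3BucketDestination": {
    "Bucket": "arn:aws:s3:::my-inventory-bucket",
    "Format": "CSV"
  }},
  "OptionalFields": ["Size", "EncryptionStatus", "ReplicationStatus"]
}
EOF
# aws s3api put-bucket-inventory-configuration --bucket my-bucket \
#     --id weekly-audit --inventory-configuration file://inventory.json
```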

S3 Batch Operations

  • To perform work in S3 Batch Operations, you create a job.
  • The job consists of the list of objects, the action to perform and the parameters you specify.
  • You can specify your own lambda for action if needed.
  • You can create and run multiple jobs with priorities.
  • Manages retries, tracks progress, sends completion notifications, generates reports, and delivers events to AWS CloudTrail for all changes made and tasks executed.

S3 Cache Control

The cache-control behavior of an S3 object is set via object metadata, e.g. "max-age=xxxx" (seconds).

aws s3 cp myfile.txt s3://my-bucket/myfile.txt --cache-control "max-age=3600"

curl -I https://your-cloudfront-url/myfile.txt

    HTTP/1.1 200 OK
    Cache-Control: max-age=86400  # max-age=1 (expire early) max-age=0 no-cache (cache but recheck using ETag)
                                  # no-store (Don't cache at all)

S3 Condition Keys

  • Condition keys are used in bucket policies and IAM policies.
  • There are global condition keys such as aws:RequestTag/${TagKey}, aws:ResourceTag/${TagKey}, aws:TagKeys
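As an example of a global condition key in a bucket policy, here is the common pattern of denying non-TLS access with aws:SecureTransport (bucket name is a placeholder; the put-bucket-policy call is commented out):

```shell
# Deny all non-TLS access to the bucket via the aws:SecureTransport global key.
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyInsecureTransport",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}
  }]
}
EOF
# aws s3api put-bucket-policy --bucket my-bucket --policy file://policy.json
```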

Storage Notes

Instance Store

  • Typically used to mount /tmp
  • It is not EBS. It is temporary. Typically uses NVMe SSD.
  • Available capacity varies by instance type.
  • Does not persist on reboot. Even though NVMe SSD persists, not supported on AWS.
  • Typically not used for root volume. If used for root volume, then on boot you need to specify the source image.
  • Similar to attached disk with the instance, so it has high throughput.
  • Instance Attached Storage: EBS scales up to 256K IOPS with io2 Block Express; Instance Store scales to millions of IOPS, tied to the EC2 instance, with low latency.
  • Network Storage: S3, EFS, FSx (FS for Linux HPC)

EBS - Elastic Block Store

  • EBS Volume types gp2/gp3 (general purpose), io1/io2 (High performance Provisioned SSD) st1 (HDD) Throughput Optimized HDD, sc1 (Cold HDD).

  • IOPS - IO Operations Per Second. Typical blocksize is 256 KB. 1000 IOPS means 256 MB/s throughput.

  • Boot volumes must be SSD, not HDD.

  • Bigger volumes (like 16TB) offer bigger max IOPS.

  • EBS gp2 3000 to 16K IOPS: (gp2 is not provisioned IOPS; depends on disk size)

    • Storage Cost: $10 for 100 GB-Month. 1 TB disk $100; Max 16 TB disk.
    • IOPS and disks depend on Disk Size.
    • GP2 IOPS:
      • Max 3 IOPS/GB; a 100 GB disk is capable of only 300 IOPS.
      • 1 TB disk is capable of 3000 IOPS.
      • 5.5 TB disk gets you the max 16K IOPS at $550 cost.
    • GP2 Throughput:
      • Base throughput 125 MB/sec; Max is 250 MB/sec
      • Above 335 GB disk, Base throughput 250 MB/sec
    • Burst Credits:
      • Idle disk accumulates I/O credits at 3 IOPS/GB and can be used for burst performance.
      • 10 hours of Idle time will get you 1 hour of 10x IOPS. Burst credits are awesome.
      • Above 170 GB disk, you get up to the max 250 MB/s with burst credits.
    • Note: Max throughput 250 MB/sec for GP2 vs 1 GB/sec for GP3.
  • gp3 3000 to 16K IOPS; Supports Provisioned IOPS. (Bigger disk does not mean bigger IOPS)

    • Storage cost: $8 for 100 GB-Month. $80 for 1TB ; Max 16 TB disk;
    • Baseline 3000 IOPS on all disks.
    • IOPS and Throughput price independent.
    • GP3 IOPS:
      • $0.005/provisioned IOPS-month above 3,000
      • For 1000 IOPS, $5; 10K IOPS, $50
      • Max IOPS is 16K, i.e. (16K-3K) = 13K billable IOPS, costing $65 (max).
    • GP3 Throughput:
      • Base line throughput: 128 MB/sec
      • Max 1GB/sec; (May need Nitro instance for above 500MB/sec)
      • 4 cents per MB/s-month of extra throughput; i.e. 100 MB/s extra is $4; for the max $35 you get 1 GB/sec!
    • Max IOPS + Max Throughput costs $65 + $35 = $100
    • For 1TB storage cost you save $20 (vs GP2), 5TB storage, you save $100.
  • io1 and io2 IO Optimized: Faster than general purpose gp2/gp3.

  • io1 Max 32K IOPS; Max 16 TB disk (Provisioned IOPS)

    • Storage Cost: $12.5 for 100 GB-Month. (only 25-50% costlier than gp2/gp3)
    • Max 50 IOPS/GB; even a 1 TB disk supports max 50K IOPS, but you need not provision that much.
    • For just 1000 IOPS, you pay $65/month. For 3K IOPS, $195/month.
    • Note: For gp3 3000 IOPS is free. Unless gp3 max 16K IOPS not good enough, No need for io1.
    • For 20K IOPS, you are looking at 20*$65 = $1300 per month!
    • Throughput depends on IOPS, capped at 1,000 MB/sec.
    • Supports multi-attach up to 16 EC2 instances!
    • suitable for OLTP load - higher number of reads with lower block sizes (4 or 8KB).
    • 256 times more reliable (in terms of hardware failures)
  • io2 Block Express Max 256K IOPS;

    • io2 highest speed.
    • Max 1000 IOPS/GB: Even 256GB disk supports Max 256K IOPS
    • IOPS charges are tiered. Approx. Per 1000 IOPS, you pay $50/month.
    • For 20K IOPS, it costs 20 * $50 = $1000 per month.
    • Storage Cost: $12.5 for 100 GB-Month. 1 TB disk $125; 5 TB disk $600;
    • Throughput depends on IOPS; io2 Block Express supports up to 4 GB/sec.
    • Supports multi-attach up to 16 EC2 instances!
  • Compare io1, io2, io2 block express price:

    .                gp2       gp3     io1       io2      io2-express
    -----------------------------------------------------------------------------------
    Max IOPS         16K       16K     32K       64K      256K
    -----------------------------------------------------------------------------------
    Max Throughput   250MB/s   1GB/s   1GB/s     1GB/s    4GB/s
    -----------------------------------------------------------------------------------
    Storage Price
    per 100 GB       $10       $8      $12.5     $12.5    $12.5
    -----------------------------------------------------------------------------------
    Max IOPS
    per 100 GB       300                5K       50K      50K
    per     GB       3                  50       500      500
    -----------------------------------------------------------------------------------
    Min IOPS                    3K      
    -----------------------------------------------------------------------------------
    1K IOPS price              $5      $65       $48      $20
    -----------------------------------------------------------------------------------
    Durability                         99.9      99.999   99.999
    -----------------------------------------------------------------------------------
    
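The gp2 vs gp3 arithmetic above can be sanity-checked with a quick script. Prices are the approximate figures quoted in these notes, not authoritative pricing:

```shell
# 1 TB volume; gp3 pushed to max IOPS and throughput. Prices per the notes above.
size_gb=1000
gp2=$(( size_gb * 10 / 100 ))                 # $0.10 per GB-month (IOPS included)
gp3_storage=$(( size_gb * 8 / 100 ))          # $0.08 per GB-month
gp3_iops=$(( (16000 - 3000) * 5 / 1000 ))     # $0.005 per IOPS-month above 3000
gp3_tput=$(( (1000 - 125) * 4 / 100 ))        # $0.04 per MB/s-month above 125
echo "gp2 1TB: \$$gp2/month"
echo "gp3 1TB maxed out: \$$(( gp3_storage + gp3_iops + gp3_tput ))/month"
```

This reproduces the $100 vs $65+$35 figures above: a 1 TB gp2 volume is $100/month, while a fully maxed-out gp3 volume is $180/month ($80 storage + $65 IOPS + $35 throughput).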
  • st1 HDD (throughput optimized):

    • $4.5 for 100 GB-Month. $45 for 1 TB; (50% cheaper than gp3)
    • Max 500 IOPS per volume.
    • Max 500 MB/sec throughput.
  • Local instance store IOPS can go from 100K to millions.

  • By default EBS volumes are not encrypted.

  • Account level setting to encrypt new EBS volumes by default.

  • EBS Snapshots are taken using incremental backup. You can create AMI from snapshot.

  • FSR Feature - Fast Snapshot Restore - pre-initializes a volume restored from a snapshot so it delivers full performance immediately (no first-read penalty).

  • EBS Multi-attach is supported (one volume attached to multiple EC2 instances) with io1/io2 disks. But the file system must be cluster-aware (plain ext4 is not).

  • Data Lifecycle Manager (DLM) - for when you want to automate the creation, retention and deletion of EBS snapshots. It is free.

  • Attached to one AZ.

  • The root EBS volume is deleted on instance termination by default.

  • To migrate an EBS volume across AZ, take a snapshot, restore on another AZ.

  • EBS volume is marked optionally as "Encrypted volume" on creation. If so, encryption happens transparently and all snapshots are also encrypted.

  • AWS Backup- to manage & monitor backups across all AWS services (including EBS volumes), from a single place. It is more recent and advanced service. It can backup environment (like subnet) etc whereas DLM does not.
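The snapshot-based AZ migration mentioned above looks like this in the CLI (all IDs are placeholders; in practice the snapshot ID comes from the create-snapshot output):

```shell
# Snapshot the volume, wait for completion, then restore it in another AZ.
aws ec2 create-snapshot --volume-id vol-0abc123 \
    --description "migrate to us-east-1b"
aws ec2 wait snapshot-completed --snapshot-ids snap-0def456
aws ec2 create-volume --snapshot-id snap-0def456 \
    --availability-zone us-east-1b --volume-type gp3
```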

EFS - Elastic File System

.
.         MaxSize: PBs  (Unlimited)
.
.     EFS                                         On-Premise
.                         Remote NFS Mount
.     Mount-Target-ENI  ---------------------->   Server (Linux)
.     SG                  DirectConnect
.     1 Mount Per AZ
.
.   
.  Also Note: EFS Access Point, EFS CRR
.   
.  Pay per use.   Performance Mode; Throughput Mode
  • EFS Elastic File System is managed NFS. Max size is unlimited. Single file max size 48 TB.
  • Expensive (3x gp2), pay per GB Used!
  • It has security-group attached to it. EC2 instances with necessary privileges can mount.
  • Windows does not work with EFS.
  • EFS attaches to one VPC, with one mount-target ENI per AZ.
  • Throughput is like 10GB+/second; 1000s of concurrent NFS clients;
  • EFS volumes can grow to petabyte scale. Compare: EBS volumes max out at 16-64 TB depending on type; EFS single max file size is about 47.9 TB.
  • Performance mode could be set at EFS creation time:
    • General Purpose (default, recommended; lower latency per operation)
    • Max IO (optimized for bulk sequential data throughput like bigdata and Video Processing.)
  • Throughput mode can be set to:
    • Elastic - Automatically scale up and down. You may get 1GB/s to 3GB/s. (Recommended) When you don't know your workload, set to this.
    • Provisioned - e.g. 1 GB/s for 1 TB Storage. Set throughput regardless of storage.
    • Bursting - 1 TB gives a 50 MiB/s baseline and bursts of up to 100 MB/s ??? Bigger disk is better. If not good enough, you can change to Provisioned or Elastic.
  • EFS Storage tiers are:
    • EFS Standard tier: ($.30 per GB per month. This is 3x compared to EBS GP2 $0.10)
    • EFS Infrequent Access: ($.016 per GB per month. ~1/20th of Standard)
    • EFS Archive: ($.008 per GB Month; 50% of IA)
  • There are incremental charges for read/writes per GB transferred in addition to above:
    • All Storage classes - Reads 3 cents/GB; Writes 6 cents/GB
    • EFS IA - Incremental cost of 1 cents/GB (In addition to std above)
    • Archive - Incremental 3 cents/GB
  • You can set a lifecycle policy to move files from EFS Standard to EFS IA, e.g. after 60 days with no access.
  • By default EFS is multi-AZ availability, great for prod. You can also set it to One Zone availability with backup enabled to reduce costs.
  • EFS supports VPC Peering, integration with On-premises Direct Connect/VPN. Supports Multiple mounts in different AZs.
  • EFS file system policy, a resource policy, can grant access to all users by default or only specific users.
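The IA lifecycle policy mentioned above can be set from the CLI (file system ID is a placeholder):

```shell
# Move files to IA after 60 days idle; move back to Standard on first access.
aws efs put-lifecycle-configuration \
    --file-system-id fs-0123456789abcdef0 \
    --lifecycle-policies TransitionToIA=AFTER_60_DAYS \
                         TransitionToPrimaryStorageClass=AFTER_1_ACCESS
```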

Installing EFS

Note: EFS does not work on Windows. Use FSx for Windows instead.

sudo yum -y install amazon-efs-utils    # Installs nfs-utils, stunnel (for TLS), etc

# Mount with the EFS mount helper; tls enables encryption in transit via stunnel.
sudo mount -t efs -o tls <FILE_SYSTEM_ID>:/ /efs

# Or mount plain NFSv4.1 using the DNS name -- resolves to the mount target in the same AZ.
sudo mount -t nfs4 -o nfsvers=4.1 <DNS_NAME>:/ /efs

EFS Access Point

  • EFS Access Points support directory-specific permission policies so that only certain POSIX users (e.g. UID 1001 for /user/guru) get access to certain directories. IAM users can be granted access to specific EFS access points and can mount only those.
mount -t efs -o tls,iam,accesspoint=fsap-abcdef0123456789a fs-abc0123def456789a: /localmountpoint
# EFS file system driver understands the accesspoint. 
# Local user has POSIX user privileges (UID, mount permission, etc)

EFS CRR

  • EFS cross-region replication is possible to meet compliance and business goals.

Amazon FSx

.
.   High-Performance  SSD  Max-Size-Varies
.
.   Specify-Volume-Size-(unlike EFS)
.
.   FSx File Gateway;
  • FSx - High performance file system on AWS. Classified as:

    • FSx For Lustre
    • FSx For Windows File Server
    • FSx For NetApp ONTAP (To migrate from NetAPP NAS)
    • FSx for OpenZFS
  • FSx For Windows File Server:

    • Supports SMB protocol and NTFS; Can be mounted across AZs!
    • Active Directory integration, ACLs, user quotas
    • Can be mounted on Linux!!!
    • Supports Microsoft DFS Namespaces.
    • Can configure to be backed-up daily to S3.
  • FSx For Lustre:

    • Lustre stands for "Linux and Cluster"
    • Parallel distributed file system. Used by people like NetFlix.
    • Suitable for ML, HPC, Video Processing, etc.
    • Can read/write S3 as a file system! (through FSx)
    • Can be deployed as Scratch FS without replication or Persistent File System with replication within same AZ.
  • FSx For NetApp ONTAP:

    • Managed NetAPP ONTAP on AWS
    • Move from ONTAP or NAS to AWS
    • Compatible with NFS, SMB, iSCSI protocols. (Linux, Windows, Mac, etc supported)
    • Snapshots, replication, low-cost, compression etc possible.
    • Point-in-time instantaneous cloning supported!
  • FSx for OpenZFS:

    • Managed OpenZFS
    • Compatible with NFS (v3, 4, 4.1, 4.2)
    • Works with Linux, Windows, Mac, etc.
    • Up to 1 million IOPS with < 0.5 ms latency
    • Snapshots, compression, and low cost.
    • Point-in-time instantaneous cloning supported!
  • Uses SSD. Max capacity of File system varies per type:

    FSx for Windows File Server: Up to 64 TB.
    FSx for Lustre: Up to 1 PB (persistent file systems).
    FSx for NetApp ONTAP: Up to 192 TB.
    FSx for OpenZFS: Up to 64 TiB.
    
  • Solution Architecture Tips:

    • To decrease an FSx volume's size, set up AWS DataSync between the source and a new, smaller destination FSx file server.
    • Amazon FSx for Lustre can be used as a lazy caching layer for S3. You import S3 data into FSx (by specifying a bucket object prefix) and decide when to export (on-demand or in batches). Since the number of S3 requests is minimized, it is more cost-effective.

AWS DataSync

  • Move large amount of data to and from:
    • On-premises or other cloud to AWS - Needs agent.
    • Or AWS to AWS. (needs no agent)
  • Can synchronize to S3, EFS, FSX
  • Replication can be scheduled periodically.
  • File permissions/metadata are preserved (NFS POSIX, SMB, etc)
  • One Agent task can use 10 Gbps, can setup bandwidth limit.
  • AWS Snowcone device (8TB capacity) comes pre-installed with the DataSync Agent. You can use it to back up all on-premises data and restore it at an AWS data center.
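A scheduled DataSync transfer, sketched as CLI calls. All ARNs are placeholders; the source and destination locations would be created first with create-location-s3, create-location-nfs, etc.:

```shell
# Create a task between two existing locations, with verification and a ~1 Gbps cap.
aws datasync create-task \
    --source-location-arn arn:aws:datasync:us-east-1:123456789012:location/loc-src \
    --destination-location-arn arn:aws:datasync:us-east-1:123456789012:location/loc-dst \
    --name nightly-sync \
    --options VerifyMode=ONLY_FILES_TRANSFERRED,BytesPerSecond=125000000

aws datasync start-task-execution \
    --task-arn arn:aws:datasync:us-east-1:123456789012:task/task-0abc
```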

AWS Data Exchange

  • Find, subscribe and use huge third-party data in the cloud.
  • E.g. Load 2.2 million stories from Reuters to S3 and run analytics on them.
  • After subscribing, use AWS Data Exchange API to load data to S3.
  • Around 4000 data products available.
  • AWS Data Exchange for Redshift: You subscribe and say you want the data to be loaded in Redshift. Also easily license your data in Redshift.
  • AWS Data Exchange for APIs: Find and subscribe to 3rd party APIs.

AWS Transfer Family

  • Fully managed FTP service in/out of S3 OR EFS.
  • Protocols: FTP or FTP over SSL (FTPS), secure FTP (SFTP)
  • You can expose your S3 or EFS contents over the FTP protocol to client users, with authentication support via Active Directory, etc.
  • Public endpoint (for users to access FTP) use dns names (no static IP).
  • Rarely you may want to deploy this within VPC. In that case, the service has static private IP that can be used by clients within your VPC.
  • You can also attach an Elastic IP to the VPC endpoint to allow both internal and internet access.
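Creating a managed SFTP endpoint and a user in front of S3, sketched (role ARN, bucket, and names are placeholders):

```shell
# Managed SFTP server with service-managed users and a public endpoint.
aws transfer create-server --protocols SFTP \
    --identity-provider-type SERVICE_MANAGED --endpoint-type PUBLIC

# User "alice" lands in a specific S3 prefix via the given IAM role.
aws transfer create-user --server-id s-0123456789abcdef0 \
    --user-name alice \
    --role arn:aws:iam::123456789012:role/transfer-s3-access \
    --home-directory /my-bucket/alice
```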

AWS Storage Gateways

|
|       On-Premise                         AWS Cloud
|
|              S3 File Gateway         +------------  S3
|                                      |
|   Storage    FSx File Gateway  ------+------------  Fsx For Windows| FSx For Lustre| FSx For OpenZFS
|   Gateway                            |                                              
|              Tape Gateway            +------------  Virtual Tape
|
|              Volume Gateway    ------------------->  S3
|                                 Send Volume Backup
|
  • Storage Gateway is either software installed on-premises on a virtual server (any VM - VMWare, Hyper-V, KVM) or you can use a Hardware appliance.
  • Use cases: On-premises cache and low-latency file access; Backup and restore
  • Part of "hybrid cloud" strategy.
  • Bridge between on-premises data and cloud data.
  • Types of storage gateways:
    • S3 file Gateway (NFS interface for remote S3 objects!)
    • FSx File Gateway (Faster access to in-cloud FSx for Windows File Server file)
    • Volume Gateway (Local volumes regularly backed up to remote S3)
    • Tape Gateway (emulates a tape device on-premises, backed by virtual tapes in the cloud)
  • File systems like ext4, Amazon EFS, and Amazon FSx for Lustre support POSIX.

S3 File Storage Gateway

.
.                 NFS / SMB
.    On-Premise <--------------> S3 File Gateway <----> S3 Bucket <--- Cloud Clients
.                   AD            Cache / Sync
.                                 Auto Refresh
.
.    Useful for cloud-Native applications as well for NFS/SMB.
.
  • Backed by S3.
  • Supports NFS and SMB for on-premise clients. (Not at AWS cloud) (Partial POSIX compliance)
  • S3 Files are visible at AWS S3 buckets.
  • Caching and Syncing
  • Hybrid development and access from both on-premise and cloud.
  • SMB protocol has integration with AD for user authentication.
  • File Storage Gateway is useful for cloud native applications (without On-premise use) just for it's NFS/SMB capabilities and exposing S3 as NFS file system. Useful for easier interface for reading/writing by applications.
  • Objects uploaded directly to S3 are reflected in the gateway after about 60 seconds (configurable). You can perform a manual refresh-cache operation if needed. Auto Refresh is turned on by default.

Volume Gateway

.
.                                                        Snapshot Schedule/Retention
.                 iSCSI Block                                Incremental
.    On-Premise --------------> Volume Gateway ----> S3 (Volume EBS Snapshots) 
.                   Any         Stored / Cached            |
.                File System                               +---> Restore to EBS disk.
.                                                                          (Not EFS)
.
.   Note: Max Volume Size: Stored: 512TB; Cached: 1024TB
.
  • Volume Gateway provides block storage over the iSCSI protocol, backed by S3.
  • iSCSI protocol is block storage on top of IP layer.
  • Only one machine can mount but can export it as NFS.
  • Can support any file system like ext4 (POSIX compliant), NTFS, etc.
  • No support for FSx or EFS etc.
  • The volume and the files created inside it are not visible in the AWS S3 console.
  • Stored volume Mode: All stored locally and also remotely. Requires large storage.
  • Cached Volume Mode: Most frequently accessed cached locally. Primary Storage is in cloud.

FSx Gateway

.
.                 SMB
.  On-Premise  ----------->   FSx For Windows Gateway
.              ----------->   FSx For Lustre  Gateway
.                 NFS
.
.
  • Virtual Appliance on-premise providing SMB/NFS storage.
  • FSx for Lustre provides POSIX compliance and supports NFS.
  • FSx for Windows supports NTFS and SMB.
  • Primary storage in cloud. On-premise Cache only.
  • On-Premise gateway caches cloud files and exports it as NFS.
  • Hybrid access from both on-premise and cloud.
# You can mount Lustre File system using DNS name in EC2.
sudo mount -t lustre <fsx-file-system-dns>:<mount-name> /mnt/fsx-lustre

Tape Gateway

  • Tape Gateway helps to take tape backups on-premise using legacy backup software.
  • It interfaces to remote virtual tapes stored in Amazon S3.

Snow Family

.
.   Snowcone       :   8  TB (Upto  14 TB SSD)        (Appliance Machines)
.   Snowball Edge  :   80 TB (Upto 210 TB SSD)
.   Snow Mobile    :  100 PB (45 feet truck)          (== 1200 Snowball Edge)
.                            (Recommended for >10 PB i.e. 125 Edge and more)
.
.   1Gbps    - 1 Week -  75 TB  (Snowball Edge)
.   100Mbps  - 1 Week -   7 TB; 1 Month 28 TB; 2.5 Months 75 TB
.   50Mbps   - 1 Week - 3.5 TB; 1 Month 14 TB; 2.5 Months 35 TB; 5 Months 75 TB
.
.
.   Snowball -------> S3 ----> Glacier  (No direct upload to Glacier)
.
  • For large data transfers, we copy data onto Snow devices and ship them to AWS.
  • AWS Snowcone (appliance, 8 TB - 14 TB) and Snowball Edge (80 TB - 210 TB) contain embedded storage and some CPU (2 CPUs, 4 GB memory) for edge processing (filtering, transformation, and such). There are storage-optimized (default) and compute-optimized versions of these devices; compute-optimized has less storage. These are mainly used to transfer data from on-premises to AWS by shipping the device.
  • You can run EC2 instances and Lambda functions at the edge in snow device.
  • You can not copy data from snow devices into Glacier directly. It will get restored in S3 only. You can transition that into Glacier later (using lifecycle policy or whatever)
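The link-speed figures in the diagram above can be reproduced with a little arithmetic, assuming an ideal sustained link and a 75 TB payload:

```shell
# days = TB * 10^6 (MB) * 8 (bits) / Mbps / 86400 (seconds per day)
for mbps in 1000 100 50; do
  awk -v mbps="$mbps" 'BEGIN {
    days = 75 * 1e6 * 8 / mbps / 86400
    printf "%4d Mbps: %6.1f days\n", mbps, days
  }'
done
```

This gives roughly 7 days at 1 Gbps, 69 days at 100 Mbps, and 139 days at 50 Mbps, in line with the table (which rounds up for real-world overhead).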

DMS - DB Migration Service

|
|  "Lift-and-Shift Rehost"    "Replication Template"    "Replication Agent"
|  "Schema Conversion Tool"
|
|                                        DMS
|   Oracle, MsSQL, MySQL, PostGres     ========>  RDS, Aurora, S3, OpenSearch, Kinesis Datastreams
|   MongoDB, SAP, DB2, Azure SQL, S3
|
|   Migration Task          SCT-Schema Conversion Tool
|   Validation Task
|

Use Cases:

  • Quickly migrate on-premises and EC2-hosted databases to AWS. (VM migration is handled by the Application Migration Service, covered below.)
  • sources include on-premises and EC2 instances DBs: Oracle, Ms SQL, MySQL, Postgres, MongoDB, SAP, DB2, Azure SQL, S3, DocumentDB
  • Targets include Amazon RDS, Aurora, S3, OpenSearch Service, Kinesis Datastreams, DocumentDB, etc.
  • AWS Schema Conversion Tool (SCT) helps with schema conversion:
    • SQL Server or Oracle to MySQL, PostgreSQL
    • Teradata or Oracle to Redshift
  • Works over VPC peering, VPN, Direct Connect
  • Supports Full Load, Full Load + CDC (change data capture) or CDC only.
  • Possible to migrate from a relational DB to OpenSearch! OpenSearch is not a supported source, so you cannot replicate out of OpenSearch using DMS.
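A full-load + CDC migration task, sketched as a CLI call. All ARNs are placeholders; the endpoints and replication instance must be created first:

```shell
# Full load of existing data, then ongoing change data capture (CDC).
aws dms create-replication-task \
    --replication-task-identifier mysql-to-aurora \
    --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:SRC \
    --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:TGT \
    --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:INST \
    --migration-type full-load-and-cdc \
    --table-mappings file://table-mappings.json
```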

Migration Validation Task

The database tables and every row can be validated after migration using a validation task.

You can also use the Table Statistics window in the Migration Service console to view migrated table and row counts and verify them manually against the source database.

Replication of RDS MySQL to an on-premises server is possible using either a DMS replication task or MySQL native replication: (mysqldump or RDS snapshot) + starting binlog replication will work. A VPN connection is recommended for this replication (for security) but not mandatory.

AWS Application Migration Service (MGN)

Migration service for VMs and software applications.

Use Cases:

  • Migrate on-premise VMs by installing AWS Replication Agent on them. It copies over all source volumes to AWS EBS snapshots and then converts into single AMI.
  • Typically you replicate the root volume and all attached volumes together. The replication agent keeps the replicated data consistent and, when converting it into a VM, creates the proper EBS volumes and restores them.
  • Migrate applications such as SAP, Oracle, and SQL Server running on physical servers, VMware vSphere, Microsoft Hyper-V, and other on-premises infrastructure.
  • Other cloud applications (Google Cloud, Azure, etc)
  • EC2s across Regions
  • Windows BYOL, Cross Region DR
|
|
|   On-Premise                   AWS Cloud
|   Other Clouds       =====>    EC2           DR
|   VMs
|   Region1                      Region2
|

To migrate a VM, create a job in Application Migration Service. The VM replication happens in the background from on-premises to AWS; the target instance is not launched until you choose to. After that, test the VM and do the cut-over. :

# sudo python3 aws-replication-installer-init.py
Choose volumes: /dev/svda  ....
Enter Access Key, Secret Key: ....
Your source server Id is .... 
Replication Started. Manage using Application Migration Console.
  • Note: You can use the Active Directory Migration Toolkit (ADMT) along with the Password Export Service (PES) to migrate users from your on-premise Active Directory to your AWS Managed Microsoft AD directory.

    Note: This is not part of Application Migration Service.

Migration: Application Discovery Service

  • Using Discovery Agent installed on each on-premise servers collect performance and usage data.
  • Also can deploy a virtual appliance as agentless collector.
  • Send to Migration Hub.
  • Collect OS, Database version details.

Migration Hub

AWS Migration Hub delivers a guided end-to-end migration and modernization journey through discovery, assessment, planning, and execution.

Docker

Docker Concepts:

|                                                (Container Apps)
|
|     Container Layer Read/Write                 C1        C2         C3  
|     Layer 3         Read Only                ------------------------------
|     Layer 2         Read Only                 Bin/Lib   Bin/Lib    Bin/Lib
|     Layer 1         Read Only                ------------------------------
|                                                     Docker Engine 
|                                                       Host OS     
|                                                               
|
|     Docker uses Containerd container runtime (Kubernetes uses Containerd, not docker)
|

.
.                 Build                Run                      Commit
.    DockerFile --------> DockerImage ------> DockerContainer ---------> Docker Image
.
.    Docker Container can also be inactive.
.
  • DockerFile defines how the docker image is built. It starts with Layer-1 Parent image. And contains commands which creates upper layers using copy-on-write strategy.

    For example:

    FROM  php:7.4-apache                          # This is Layer 1 - Parent Image
    RUN   apt-get update && apt-get upgrade -y    # This is Layer 2
    
    COPY  code  /var/www/html                     # This is Layer 3
    
    EXPOSE 80
    
    CMD  ./my_script.sh                           # Default command to run if not specified in docker run.
    

Commands:

docker build -t my-username/my-image .   # -t for tagging.
docker image ls
docker run --name my-app -p 80:80 -d my-username/my-image
docker push my-username/my-image

# To push my image to private registry ...
docker login  repo.company.com:3456  --username my_username
docker tag 518a41981a6a repo.company.com:3456/myappImage 
docker push repo.company.com:3456/myappImage

The SSM Agent is open-sourced by AWS. See https://github.com/aws/amazon-ssm-agent This is a classic example of using docker to run the make command to build the agent. The docker image contains the GoLang compiler and all required libraries. Use this to build locally! :

docker build -t ssm-agent-build-image .  # Build using ./Dockerfile and tag image.

docker run -it --rm --name ssm-agent-build-container   # name of running instance
   -v `pwd`:/amazon-ssm-agent                          # Mount ./ to docker container
   ssm-agent-build-image                               # Docker image name
   make build-release 

docker cp <containerId>:/file/path/within/container /host/path/target # works even if inactive.
docker start <container_name>       # restart inactive container
docker exec -i container_name bash  # Attach to running container

Multi-Container Application and Docker compose

If a single application needs multiple docker images to run, then it is a multi-container app.

Suppose each docker container needs to be invoked with certain port mappings, local volume mounts, etc. This can be achieved with docker compose (for example).

version: '3.9'
services:
  my-nginx-service:
    container_name: my-website
    image: my-nginx-image:latest
    cpus: 1.5
    mem_limit: 2048m
    ports:
      - "8080:80"
    volumes:
      -  /host/dir/log:/log

  my-db-service:
    ....

# docker compose  up -d
.....
All containers will be created and running.

Amazon Elastic Container Registry - ECR

Docker images can be stored in Docker Hub and Amazon Elastic Container Registry.

Amazon ECR - Elastic Container Registry supports both Private and Public repositories:
https://gallery.ecr.aws

ECR private registry supports cross-region and cross-account replication.

ECR images are scanned for CVEs (common vulnerabilities), or scanned more extensively using Amazon Inspector.
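Pushing an image to a private ECR repository follows the same docker push pattern shown earlier; the account ID, region, and repository name are placeholders:

```shell
# Authenticate docker against the private ECR registry, then tag and push.
aws ecr get-login-password --region us-east-1 \
    | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker tag my-username/my-image 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest
```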

Inspect and update Docker Image

  • You cannot directly extract the Dockerfile used to create an image from the image itself. However, you can inspect the image and recover much of the information used to build it.
docker history --no-trunc <image-name>   # Find out base image
docker inspect <image-name>              # Env variables, entry points
brew install dive
dive <image-name>                        # Visual info about Layers, files, commands. 

To update a docker image with new files, you can create a new Dockerfile using the existing image as the base and add the files. Or you can start the container, copy new files in, and docker commit a new image.

Typical WebApp 3-Tier Architecture

|
|      Route-53                                                          
|        |                                                                  
|        |                       Web App       |                   |          ElastiCache
|     Client  --->  ELB  ------> ASG + EC2s    |--> Application    |   ----> Redis Multi-AZ
|                   Public       Private Subnet|    Private Subnet |   ----> RDS/Database
|                  Multi-AZ      Multi-AZ      |    Multi-AZ       |
|
|
|

Elastic Beanstalk (EB)

.
.   1 Beanstalk == 1 Web Application == N EC2 Instances 
.
.   Java  NodeJS PHP Python Docker
.
.   Environment  Application
.
.

Web Server Tier vs Worker Tier :

|                                            |
|        Web Environment                     |           Worker Environment
|                                            |
|    myapp.us-east-1.elasticbeanstalk.com    |
|                                            |
|                 (ASG)                      |                   (ASG)         
|                                            |                               
|           +--> EC1 in AZ1                  |                 EC1 in AZ1     
|      ELB  |                                |                             <--- SQS Queue
|           +--> EC2 in AZ2                  |                 EC2 in AZ2     
|                                            |                               
  • EB Deployment Mode:

    • Single Instance : All components on single EC2. Great For Dev.
    • HA with LB: For production.
  • Runtimes support for:

    • Go, JavaSE, Java with Tomcat
    • .Net, Node.js, PHP, Python, Ruby
    • Docker, Multicontainer Docker, Preconfigured Docker
  • Great to replatform on-premise to the cloud.

  • Instance config/OS is handled by Beanstalk

  • Deployment for lazy people; auto-magically creates ELB, ASG, EC2 instances, etc.

  • Deployment strategy configurable but performed by EB.

  • Only the application code (.war for Java, .zip for PHP) is the developer's responsibility.

  • Deployment models:

    • Single Instance: Good for Dev
    • LB + ASG : For production web app.
    • ASG only : For non-web apps in production (e.g. workers reading SQS, etc)
  • Decoupling the application into web + worker tiers is a common pattern, i.e. one environment with LB + ASG and another with ASG only.

  • Beanstalk itself is free; you pay only for the underlying resources.

  • Environment: Collection of AWS resources.

  • It uses CloudFormation to provision the infrastructure underneath. The CloudFormation template primarily defines the Environment and the Application.

  • Application defines the version and the source .zip file in S3. Environment defines the ASG, LB, EC2 instance type, Docker version, and such.

  • HostManager is the agent running in each EC2 machine to help with EB admin tasks such as deploying application and monitoring.

  • The application is uploaded as a zip, war, etc. file depending on the platform. For a PHP application it is a zip of the top-level folder.

  • You may have to deploy different tiers as separate EB environments. Once you create a web tier, you get an application DNS name. A worker environment does not get one.

  • It is not a serverless solution; it is a PaaS solution.

  • Elastic Beanstalk allows you to choose instance types, configure load balancers, and set scaling parameters.
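The developer workflow described above can be sketched with the CLI. This is a minimal sketch; the application/environment names, bucket, and key below are placeholders:

```shell
# Upload a new application version from a source bundle in S3
aws elasticbeanstalk create-application-version \
    --application-name my-app --version-label v2 \
    --source-bundle S3Bucket=my-bucket,S3Key=my-app-v2.zip

# Point an environment at that version; EB performs the deployment
aws elasticbeanstalk update-environment \
    --environment-name my-app-prod --version-label v2
```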

Deployment Strategy with Elastic Beanstalk

The following deployment strategies are supported:

  • Rolling: Instances are upgraded in batches, one batch at a time, so capacity is temporarily reduced.
  • Rolling with additional batch: New EC2 instances are created for the first batch, so capacity is never reduced.
  • Immutable: A fresh set of instances is created and swapped in. (Equivalent to ECS Blue/Green Deployment)
  • Traffic Splitting: A canary testing deployment method. Suitable if you want to test the health of your new application version using a portion of incoming traffic. If health checks fail, the deployment is abandoned.
  • All at once: All instances are upgraded at the same time; the application is out of service during the upgrade.
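The strategy is configured per environment through the `aws:elasticbeanstalk:command` option namespace. A minimal sketch (the environment name is a placeholder):

```shell
# DeploymentPolicy: AllAtOnce | Rolling | RollingWithAdditionalBatch | Immutable | TrafficSplitting
aws elasticbeanstalk update-environment --environment-name my-app-prod \
    --option-settings Namespace=aws:elasticbeanstalk:command,OptionName=DeploymentPolicy,Value=Immutable
```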

Docker with ElasticBeanstalk

  • Supports both single- and multi-container applications using Docker.
  • Multi-container apps require a docker-compose.yml config file.
  • In general, multi-container applications are better deployed on ECS.
  • Simplifies autoscaling using an ASG of EC2 instances. Autoscaling parameters like cooldown period, CPUUtilization, etc. can be configured.
  • It does not use container/task-level CPU utilization CloudWatch metrics like Fargate does -- it just uses the EC2-level metrics for simplicity.

ElasticBeanstalk Entry Point

  • For Docker environments: Use Dockerfile (CMD or ENTRYPOINT) or Dockerrun.aws.json (Command).
  • For multicontainer Docker: Use docker-compose.yml and define command for each service.
  • For other platforms (Node.js, Python, etc.): Use .ebextensions to run shell scripts or configure environment settings.
  • For Python, Ruby, Node.js: Use a Procfile to define the entry point.
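A Procfile is a plain-text file with one `<process_name>: <command>` line per process; `web` is the main process EB starts. A minimal sketch (the `node server.js` command is a placeholder for your app's start command):

```shell
# Create a Procfile at the root of the source bundle
printf 'web: node server.js\n' > Procfile
cat Procfile
```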

ElasticBeanstalk vs ECS

  • Deploying with Beanstalk is much easier since a lot of things are auto-created. However, that also means you have less control.
  • For Docker deployments, ECS is preferred as it offers better control.
  • If EB is PaaS, then ECS is closer to IaaS.
  • ECS does not auto-create a load balancer or other resources.
  • ECS also offers better control over autoscaling strategies.

ECS - Elastic Container Service

|
|   (ECS Service)                (Service Scheduler)    (Task Placement Strategy)
|
|      VPC       (ECS Cluster)       (Tasks Launched in Fargate)
|                                    (Uses Shared Pool of ENIs from App Subnet)
|         ALB      ASG               (ALB Target - Use these Target Groups of IP Addresses)
|
|      (App Subnet)  (Data Subnet)         (Task Definition)
|                                          (Task)
|                                          (Service = N Tasks)
|                                          (Task = N Containers)
|
|             EC2 (EC2 is an ECS Container instance running ECS container Agent)      
|
|  (ECS+Fargate Service AutoScaling)
|

Commands:

aws ecs create-cluster --cluster-name MyCluster

# Note: A default ECS cluster with name "default" already exists in your account.

aws ecs create-capacity-provider --name "MyCapacityProvider" \
     --auto-scaling-group-provider "autoScalingGroupArn=arn:.*"

# create-service defines and starts running the service as well.
# Specify launch-type or capacity provider not both.
aws ecs create-service --cluster MyCluster --service-name MyService \
         --task-definition sample-fargate:1 --desired-count 2 --launch-type FARGATE \
         --network-configuration \
           "awsvpcConfiguration={subnets=[xxx],securityGroups=[sg-x],assignPublicIp=ENABLED}"

aws ecs create-service --cluster MyCluster --service-name ecs-simple-service \
       --task-definition sleep360:2  --desired-count 1   # Simple single instance task.

# deploy action updates the service definition with a new task definition and initiates CodeDeploy

aws ecs deploy --service <value> --task-definition <value> --codedeploy-appspec <value>

# Stop task. Equivalent to docker stop. Agent sends SIGTERM on task process.
aws ecs stop-task --task 666fdccc2e2d4b6894dd422f4eeee8f8

# Run new task directly. Auto placement
aws ecs run-task --cluster default --task-definition sleep360:1

# Start task with better control on placement. override execution roles, networking etc.
aws ecs start-task --cluster default --task-definition sleep360:1 \
        --container-instances "<ec2-instance-ids>" 

aws ecs list-clusters
aws ecs list-services --cluster <cluster_arn>
aws ecs list-tasks --cluster <cluster_arn>
aws ecs list-container-instances --cluster <cluster_arn>

ECS Task Definition

A task definition describes the group of Docker containers that make up a task:

|                        1:N
|  Task Definition  ------------->  Docker Containers

It also includes the task execution IAM role, Docker image details, port mappings, env variables, etc.

ECS Task

ECS Task is an instance of a Task Definition.

All docker container instances from single ECS Task run together on single machine:

|
|    Task  =   1 Task runs N docker container instances together 
|              on any single machine in cluster.
|

Task can be directly started or indirectly started through service.

Task Launch Type

Task definition optionally contains Task Launch Type. This is one of:

|               Task Launch Type
|
|   Fargate     -  Launch Task in Fargate
|   EC2         -  Launch in one of the available EC2 container Instance.
|   External    -  Launch in one of the available external container Instance.
|
|   Fallback mechanism is Cluster's capacity provider strategy.

ECS Service

Defines the min and max number of tasks to run, e.g. a web application service may run 4 to 10 task instances in the cluster. It also defines a scaling policy such as Step Scaling. :

|
|   Running ECS Service  =  N Tasks   1<=N<=max running across many EC2 instances.
|

Also includes Task Definition, desired/min/max no of tasks, VPC, subnet, security group, Load Balancer type (ALB), Container port mapping, EC2 target group, Listener Rules, Scaling Policy (Target Tracking or Step Scaling).

The Target Tracking scaling policy adjusts the number of running tasks to keep a metric (e.g. average CPU) at a target value.

ECS Service Target Group

.
.    Target Group == Place Holder(HTTP, port 80, VPC-Id, HealthCheck-path, TargetGroup-Type)  
.
.                    1:1
.    TargetGroup ----------- Service (Multiple Tasks of Single Service)
.
.    TargetGroup Type == Instance (EC2) | IP (For Fargate) | Lambda (Not supported for ECS)
.
.                +-------- TargetGroup1
.       ALB1  ---+-------- TargetGroup2
.       ALB2     +-------- TargetGroup3
.
.
  • The Service Target Group is a feature of the associated load balancer (ALB).
  • Typically a single ALB uses a Target Group, but multiple ALBs can share it (e.g. a microservice consumed by different applications, each with its own ALB).
  • A single Target Group serves a single Service only (multiple task instances of that Service).
  • For Fargate integration with ECS, create an empty target group of type "IP addresses" and associate it and a load balancer with the ECS Service.

The Service's tasks are automatically registered with the ALB, and inbound requests are forwarded to that target group. ECS makes sure the target group is populated with the Fargate IP addresses automatically.

ECS Workflow

  • Task is concerned with the associated container definitions and a bit about the destination env (networking, launch type, etc.).
  • Service ties the task with TargetGroup to bind it with specific destination runtime env.
  • ALB only knows about TargetGroup, it knows nothing about Tasks or Service.
  • ECS can host multiple microservices and web applications and back-end APIs in single cluster.
  • Single ALB can serve multiple web applications and microservices.
.
.                                              (Tasks - TargetGroups)
.                                   Serving    WebApplications
.        ALB           ------ ECS -----------> MicroServices, 
.  (Host+Path based)        Cluster            Back-End APIs
.  (   Routing     )
.
.  (Optional ALB2+ )
.
.
.     ALB ---> Listener ---> Rule --> TargetGroup -->  Register Task --> Create Service
.                                                                        (Register Targets (e.g. EC2 OR IP) with Target Group)
.
.     Note: Tasks get registered with TargetGroup.
.           Multiple ALBs can share TargetGroups i.e. Tasks.
.           Tasks may run on EC2 or FARGATE.
.

aws elbv2 create-target-group --name my-target-group --protocol HTTP --port 80 --vpc-id vpc-abc12345 \
                              ...  --health-check-path /health

aws elbv2 create-load-balancer  --name my-load-balancer  ...

aws elbv2 create-listener --load-balancer-arn arn:* --protocol HTTP  --port 80 \
                          --default-actions Type=forward,TargetGroupArn=*

aws elbv2 create-rule --listener-arn arn:* \
                      --conditions Field=path-pattern,Values='/app*' \
                      --actions Type=forward,TargetGroupArn=* ...

aws ecs register-task-definition  ... # ECS knows container name and container port mappings etc.

aws ecs create-service ...  --task-definition my-task-def ...
                            --load-balancers "targetGroupArn=*:,containerName=my-container,containerPort=80" ...

# Targets like EC2 gets implicitly registered with Target Group by ecs create-service:
# aws elbv2 register-targets --target-group-arn arn:.* --targets Id=i-0598c7d356eba48d7,Port=80 ...

ECS Container Instance

An ECS Container Instance is an EC2 instance with the Docker daemon and the ECS Container Agent running on it. Any single task is typically placed on any available ECS Container Instance.

Even though a task can also be launched on Fargate, Fargate capacity is not an ECS Container Instance by definition. :

|
|   ECS Container Instance is EC2 instance running docker and ECS agent.
|
|                            1:N
|                EC2        ------->  May run total 8 tasks coming
|           (Docker Agent)            from 4 different services.
|           (ECS Agent)
|

ECS Cluster

|
|    ECS Cluster contains group of ECS Container Instances.
|
|                  1:N
|    ECS Cluster  --------  EC2
|
|

A default cluster is created in your account (which is empty). You can create additional named clusters.

Cluster Launch Type could be: EC2 | Fargate | External

You should register EC2 instances before launching tasks and services.

For external instances, you should prepare it using ecs-anywhere-install.sh script.

Capacity Provider

Each cluster can use multiple capacity providers to spread tasks across different ASGs or Fargate using capacity provider strategies.

Capacity providers are the newer, preferred abstraction for cluster capacity; an EC2 capacity provider essentially wraps an ASG.

|
|                               Capacity Provider
|
|                       ----->  Fargate       (Predefined Capacity Provider)
|                       ----->  Fargate Spot  (Predefined Capacity Provider)
|    Capacity Provider  ----->  EC2 Capacity Provider ---> ASG EC2 | ASG Spot
|                                      (This Capacity Provider is like alias for an ASG)
|                       ----->  External Instances
|
|                        N:1
|    Capacity Provider  ------  Cluster  (Auto Scaling turned on)
|
|                                   upto  1:6 
|    Capacity Provider Strategy   ---------------  Capacity Provider 
|                                     Contains
|     (Fargate, ASG1, ASG2)
|     (50%,      25%,  25%)
|
|

A Service uses either a Launch Type or a Capacity Provider strategy, not both. If you specify the service launch type as "FARGATE", you do not have to explicitly specify the "FARGATE" capacity provider.

ECS dynamically allocates ENIs with necessary subnet Private IPs before instantiating Fargate launch for the Task.

ECS clusters can contain a mix of tasks hosted on AWS Fargate, Amazon EC2 instances, or external instances. They can also contain a mix of Auto Scaling group capacity providers and Fargate capacity providers

In Amazon Elastic Container Service (ECS), a capacity provider strategy determines how tasks are spread across the cluster's capacity providers. The strategy is made up of one or more capacity providers, along with a base and weight for each.
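A hedged sketch of a strategy mixing on-demand and Spot Fargate capacity; the cluster/service names and network IDs are placeholders:

```shell
# Place the first task on FARGATE (base=1), then split 3:1 between FARGATE and FARGATE_SPOT by weight
aws ecs create-service --cluster MyCluster --service-name MyService \
    --task-definition sample-fargate:1 --desired-count 4 \
    --capacity-provider-strategy capacityProvider=FARGATE,weight=3,base=1 \
                                 capacityProvider=FARGATE_SPOT,weight=1 \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-x],securityGroups=[sg-x]}"
```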

Container Instance Draining

ECS may transition a container instance to the "DRAINING" state to prepare to remove it. For example, a Spot instance may have to be released or replaced: the Spot instance receives an interruption notice and ECS marks that instance as "DRAINING".

Suppose desiredCount = 4 and minimumHealthyPercent = 50%; then temporarily only 2 tasks may be running during the transition. A draining instance does not accept new tasks. If maximumPercent = 200%, then temporarily up to 8 tasks may be running as part of the transition.
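The capacity bounds follow directly from the percentages:

```shell
desired=4
min_pct=50     # minimumHealthyPercent
max_pct=200    # maximumPercent
echo "min running: $(( desired * min_pct / 100 ))"   # min running: 2
echo "max running: $(( desired * max_pct / 100 ))"   # max running: 8
```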

Spot instance draining should be enabled in the ECS agent config (ECS_ENABLE_SPOT_INSTANCE_DRAINING=true).

ECS + CloudMap + Service Discovery

  • ECS service can optionally be configured to use Amazon ECS service discovery.
  • Service discovery uses AWS Cloud Map API actions to manage HTTP and DNS namespaces.
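A minimal sketch of wiring service discovery, assuming a Cloud Map private DNS namespace; all names, IDs, and ARNs below are placeholders:

```shell
# Create a private DNS namespace in the VPC (e.g. so tasks resolve myservice.local)
aws servicediscovery create-private-dns-namespace --name local --vpc vpc-abc12345

# Create the Cloud Map service with an A record per task
aws servicediscovery create-service --name myservice \
    --dns-config "NamespaceId=ns-xxxx,DnsRecords=[{Type=A,TTL=60}]"

# Attach the registry to the ECS service; ECS keeps the records in sync
aws ecs create-service --cluster MyCluster --service-name MyService \
    --task-definition sample-fargate:1 --desired-count 2 \
    --service-registries registryArn=arn:aws:servicediscovery:*:service/srv-xxxx
```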

ECS Agent source code

/usr/bin/docker run --name ecs-agent --init 
     --volume=/var/run:/var/run  ...
     --net=host                         # awsvpc in EC2
     --env-file=/etc/ecs/ecs.config 
     --env ECS_DATADIR=/data  ...  --detach 
     amazon/amazon-ecs-agent:latest

ECS Anywhere

|      On-Premise
|                                  Node
|                         Runs SSM Agent, ECS Agent

Use Case: Create ECS using on-premises machines.

Commands:

aws ssm create-activation --iam-role ecsAnywhereRole | tee ssm-activation.json
bash /tmp/ecs-anywhere-install.sh

Amazon ECS Anywhere lets you add your own on-premises machines to your ECS cluster. Just install the ECS Container Agent and SSM Agent on the on-premises server, register it with SSM, and then with your cluster.

ECS IAM Roles

  • EC2 Instance Profile: The ECS agent needs this role's permissions to perform ECS admin tasks on that EC2 instance.
  • ECS Task IAM Role - allow each task to have a specific role.

ALB Integration support with ECS

  • ALB integration supports dynamic port mapping, i.e. you can run multiple instances of the same application on a single EC2 machine on different ports. The tasks are registered with the ALB (through the Target Group) along with their port mappings, so the ALB can route to the right port on your EC2 instances.
  • You can inject secrets and configs as Env Variables into running docker containers in ECS.

ECS Networking

  • ECS Task networking can be one of:
      1. none
      2. bridge
      3. host (bypass Docker networking, use the host's) OR
      4. awsvpc : Every task gets its own ENI and private IP address.
  • The awsvpc mode is the default option for Fargate tasks. Every task, with its associated docker containers, runs with a separate IP address.
  • The awsvpc simplifies networking, Security Groups, monitoring, VPC flow Logs, etc.

ECR with ECS

  • ECR is fully integrated with ECS i.e. container task can pull private images from ECR based on container IAM permissions.
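When pulling or pushing images yourself (e.g. from a build host), ECR authentication is a password piped into docker login. A sketch; the account ID, region, and repo name are placeholders:

```shell
# Authenticate the local Docker client against your private ECR registry
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Pull an image from a private repository
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/myrepo:latest
```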

ECS with Fargate And Service AutoScaling

.                     (Service)
.       ECS -------> FarGate Task1  --- Target Tracking: CPU Usage = 70%
.                            Task2      Enable Service AutoScaling.
.                            Task3
.
.
  • With AWS Fargate, specify a minimum number of tasks for the on-demand baseline workload. You can add tasks with FARGATE_SPOT for cost saving. Fargate scales well based on load.
  • Enable Service Autoscaling, Define Scaling Policy as target tracking of CPU Usage: 70%
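Service autoscaling is configured through the Application Auto Scaling APIs. A sketch assuming the cluster MyCluster and service MyService:

```shell
# Register the service's DesiredCount as a scalable target (1..10 tasks)
aws application-autoscaling register-scalable-target \
    --service-namespace ecs --scalable-dimension ecs:service:DesiredCount \
    --resource-id service/MyCluster/MyService --min-capacity 1 --max-capacity 10

# Target-tracking policy: keep average service CPU near 70%
aws application-autoscaling put-scaling-policy \
    --service-namespace ecs --scalable-dimension ecs:service:DesiredCount \
    --resource-id service/MyCluster/MyService \
    --policy-name cpu70 --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration \
    '{"TargetValue":70.0,"PredefinedMetricSpecification":{"PredefinedMetricType":"ECSServiceAverageCPUUtilization"}}'
```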

ECS Deployment Choices

The task definition specifies launch type: Fargate | EC2 | External;

In addition ECS service (i.e. Tasks) specifies the deployment type associated.

The Deployment types supported are:

  • ECS (aka Rolling Update): Depending on the allowed healthy tasks min/max percentages, rolling batches of updates happen.
  • CodeDeploy: The blue/green deployment type. Traffic shifting strategies can be Canary, Linear, or All-at-once after the blue/green testing succeeds.
  • External: Third-party deployment controller.

The deployment itself may be initiated from the Console, CLI, or CloudFormation. The CloudFormation template specifies one of the above deployment types and drives it.

ECS with One-Off Task

  • You can run short-lived task on demand. Like Lambda, but with your own custom container!
  • You create task definition but no service definition (Task won't run by default)
  • Invoke the task anytime using aws ecs run-task or using SDK from Lambda. You can pass env variables.
  • You can use launch_type as FARGATE (No need to worry about TargetGroup!)
  • EventBridge (formerly CloudWatch Events) provides direct integration capability to invoke ECS tasks.
  • Event-driven tasks can be triggered by S3, SNS, SQS, and more (via EventBridge).
aws ecs run-task --cluster my-cluster --launch-type FARGATE \
                 --network-configuration "awsvpcConfiguration=*" --task-definition my-fargate-task --count 1
                 # --overrides ... (Pass env Variables)

# Configure EventBridge Rule to Invoke task every 1 hour
aws events put-rule --name "TriggerFargateTask" --schedule-expression "rate(1 hour)"

# The target Arn is the ECS cluster; the task definition goes in EcsParameters.
aws events put-targets --rule "TriggerFargateTask" \
    --targets "Id"="1","Arn"="arn:aws:ecs:*:cluster/my-cluster","RoleArn"="arn:aws:iam::*:role/ecsEventsRole","EcsParameters"="{TaskDefinitionArn=arn:aws:ecs:*:task-definition/my-fargate-task,LaunchType=FARGATE}"

ECS vs EB

  • ECS is cheaper since it shares the ALB and EC2 machines, even more so with the FARGATE launch type.
  • Elastic Beanstalk auto-creates the LB and can run multi-container applications (using Docker Compose), but it is basically designed to run a single application only.
  • Containers share the same host IP address in EB and communicate with each other using exposed ports. With ECS awsvpc networking, each task gets a separate IP address.

Containers Virtualization History

Key Historical Milestones For Containers:

1990s:  Early virtualization technologies emerge (VMware, FreeBSD Jails).

2008:   Linux Containers (LXC) introduced.

2013:   Docker launched, revolutionizing container usage.
        [ Tools, Images, Dockerfile.  Revolutionised Adoption ]

2014:   Google open-sourced Kubernetes.

2015:   OCI formed to standardize containers, and Kubernetes became dominant.

2017:   Cloud providers introduce managed Kubernetes services (EKS, GKE, AKS).

2020s:  Kubernetes leads the container orchestration space, with an expanding cloud-native ecosystem.

Key Milestones For Serverless Computing:

1990s–2000s:    Virtualization, VMs (VMware, FreeBSD Jails)

2006–2014:      IaaS (AWS EC2), followed by PaaS solutions (Google App Engine, Heroku).

2014:           AWS Lambda (FaaS), serverless Revolutionized

2015–2016:      Competitors like Azure Functions and Google Cloud Functions enter the serverless market.

2017–2020:      Ecosystem matures with tools. Expands beyond FaaS to include databases, storage, and containers.

2020–Present:   Serverless moves into edge computing, AI/ML, and containers.

EKS - Elastic Kubernetes Service

.
.   Control-Plane 
.
.                  AWS Load Balancer Controllers
.      Ingress Resources (ALB)     Service Resources (NLB)
.
.      Node            Pods
.      EC2
.

Control Plane aka Master Nodes

  • etcd database: a highly available key-value store holding cluster metadata.
  • Kubernetes API server. All Control Plane operations performed through this.
  • EKS provides a secure, managed endpoint for the Kubernetes API server.
  • Kube Controller Manager which includes:
    • Replication Controller which manages the replication of Pods. Uses Scheduler to select node.
    • Node Controller: Monitors the health of nodes.
  • Kube Scheduler: Responsible for scheduling Pods. Selects which node the next pod should run on.
  • kubectl: You can use the kubectl CLI to manage the EKS cluster.
  • AWS Management Console to view cluster status and manage node groups.
  • Master Nodes run control plane components.

Worker Nodes and Components

  • The worker nodes are EC2 instances where your Kubernetes Pods run.
  • Node Groups: Collections of EC2 instances; You can use ASG to manage worker nodes scaling.
  • `Kubelet`: An agent that runs on each worker node. Communicates with the API server.
  • `Container Runtime`: Kubernetes support includes: Docker, containerd, and CRI-O.
  • `Kube Proxy`: A network component that maintains network rules on nodes. Implements service discovery and load balancing.

Kubernetes Add-ons

  • Core Add-ons: Amazon VPC CNI, CoreDNS, kube-proxy; plus Kubernetes Dashboard, Cluster Autoscaler, and AWS App Mesh.
  • Custom Add-ons: Metrics server, logging solutions, and monitoring tools

Network Policies

  • You can define Kubernetes Network Policies to control the traffic flow between Pods within the cluster.

Logging and Monitoring

  • EKS integrates with CloudWatch.
  • You can also use third-party tools like Prometheus and Grafana for advanced monitoring and visualization.

EKS Integrations

  • EKS integrates with AWS Cloud Map and Route 53 for service discovery.
  • EKS supports AWS Fargate
  • CodePipeline and CodeDeploy

Kubernetes Operators

  • Advanced method to extend Kubernetes API to manage stateful applications.
  • Leverages the Kubernetes custom resource definitions (CRDs)
  • Uses a controller that watches the custom resources
  • Declarative management using desired state of application.
  • Application Lifecycle Management including installation, scaling, health check, upgrades, and failure recovery.
  • Use Cases:
    • Databases: Handle the complexity of scaling, failover, and backups.
    • Message Queues: Managing message broker applications like RabbitMQ or Kafka.
    • Prometheus Operator for monitoring instances and alerting rules as custom resources.
    • Custom Applications
  • Operators can be installed via standard methods, such as Helm charts or Kubernetes manifests.

EKS vs ECS Components

.-----------------------------------------------------------------------------------------------------
.              EKS                                           ECS
.-----------------------------------------------------------------------------------------------------
.
.  Control Plane                                          ECS Control Plane (internal)
.
.  Nodes with kubelet                                     Container Instance EC2 with ECS agent.
.
.  Pod (multi containers, Shares Storage)                 Task (multi Container)
.
.  Deployment - Pod Replicas (Pod Instances)              Service (Group of Tasks)
.   (Also Scaling mechanisms, Rolling updates, etc)
.   (Manifest yaml with placement constraints)
.   (Includes ReplicaSets - Stateless Set of Pods)
.
.  StatefulSet - Persistent Volume, DNS Name, etc.        Task is stateless only. No alternatives.
.   (For Databases, Kafka, etc) 
.
.  DaemonSets (Force pod on every/selected Nodes)         Fixed ECS Agent. No Alternatives.
.  (For Log, monitor etc. Fluentd, Prometheus, etc)
.
.  Services (Pods network endpoints)                      Load Balancer, TargetGroup, Task Definition
.  (ClusterIP (internal), NodePort and LB)
.
.  Ingress, Ingress Controller (Traffic Routing)          Load Balancer               
.
.  Namespaces (Cluster Partition - Isolation)             Use different ECS Clusters.
.
.  ConfigMaps                                             Env variables, SSM Parameters
.  Secrets (Inject into Pod as env var/files)             Secrets Manager, SSM Parameters
.  
.  Persistent Volumes (PV) and Claims (PVC)               EBS, EFS
.
.  Helm (Uses Helm charts- Defines Pods, Services.        Task Definition. Defines container, CPU, etc.
.        Pkg Manager. Defines all resources)
.
.  Horizontal Pod Autoscaler (HPA)                        ECS Autoscaling
.
.

EKS Alternatives

ECS 
Docker Swarm
Docker Compose   - Very simple, Run multiple containers on single host. For dev.
AKS              - Azure Kubernetes Services
GKE              - Google Kubernetes Engine
Nomad            - HashiCorp Nomad (Terraform company)
OpenShift        - Redhat's Offering

Fargate

Run containerized tasks in serverless environment.

.
.   Serverless  Containerized-Task-Only     ECS OR EKS Tasks
.
.   Autoscaling (with ECS only)
.
.   Load Balancing (with ECS only)
.
  • Fargate is a serverless compute service to run containers.
  • It is an alternative to EC2 for running containers.
  • Fargate can be used with both ECS and EKS.
  • Autoscaling and load balancing are the two critical elements to be handled.

Fargate Autoscaling

  • Can be configured through ECS Service only.
  • Application autoscaling is done through cloudwatch metrics using ECS Console.
  • This is the trickiest part of configuration.
  • Scaling parameters available:
    • min, max, and desired number of tasks (use Target Tracking, else Step Scaling)
    • CloudWatch metrics like ECSServiceAverageCPUUtilization (and Memory)
    • Cluster-level, service-level, and task-level metrics
    • Scale-out/scale-in cooldown period
    • Disable scale-in

Fargate Load Balancing

  • Can be configured through ECS Service only.
  • Create ALB with target group of Type IP (for Fargate Tasks) and associate with Fargate tasks.

API Gateway Integration

  • If you run Fargate Task as a service with ECS with Load balancer, then use HTTP Proxy integration with API Gateway :

    API Gateway -----> Http Integration ----> ALB -> Fargate Service
    
  • You can directly invoke Fargate task from Lambda:

    API Gateway ---> Lambda ---> Invoke Fargate Task.
    
  • You specify following while invoking ECS task from lambda:

    • ECS cluster name
    • Launch Type = 'FARGATE' (it is already part of task definition)
    • taskDefinition = 'my-task:1'
    • count
    • networkConfiguration
    • overrides (container name, environment variables, etc)
  • The overrides (like environment variables) are used as input while invoking the specific ECS task from Lambda.

Use Cases

  • Microservices
  • Web Applications. Stateless. e.g. Node.js; Use Elasticache for State.
  • API Backend integrated to API Gateway (similar to Microservices)
  • Event Driven Application. S3, SNS, SQS can invoke Fargate Tasks.
  • CI/CD Pipeline: Run build script after git checkin.
  • Batch Processing: Process large datasets.

Load Balancers

Classic Load Balancer -      v1 - 2009 - CLB - HTTP/HTTPS, TCP, SSL  - Layer 7 or 4 (TCP)
Application Load Balancer -  v2 - 2016 - ALB - HTTP/HTTPS, WebSocket
Network Load Balancer -      v2 - 2017 - NLB - Layer 4 i.e. at TCP/UDP Level.
Gateway Load Balancer -           2020 - GWLB - Layer 3 Network Layer - IP protocol.
  • You should enable multiple Availability Zones for all load balancers.
  • You can temporarily enable/disable some availability zones. Targets registered in disabled AZ do not receive traffic until it is enabled again.

`Cross Zone Load Balancing`:

  • It tries to be fair to all targets e.g. EC2 instances -- not to Zones.
  • i.e. It does not target equal divisions among Zones - but equal load among targets.
.
.    Cross-Zone Load Balancing === Round-Robin of All Targets.  All Targets Gets equal load.
.                                  May Generate (unnecessary) Cross Zone Traffic (Rare)
.                                  
  • Cross-zone load balancing, when enabled, makes every load balancer node distribute requests evenly across all registered targets in all AZs (so if one AZ has a bigger target group of many EC2 instances, that AZ receives more requests). All targets receive an even load regardless of the location of the LB node.
  • Cross-zone load balancing is enabled by default for ALB and can't be disabled at the load balancer level.
  • For NLB and Gateway LB, it is disabled by default and you pay for cross-AZ data transfer when it is enabled.
  • For CLB, you can enable it without extra charges.
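Cross-zone behavior is a load balancer attribute. A sketch for an NLB (the ARN is elided in the document's usual style):

```shell
# ALB: cross-zone is on by default and cannot be turned off at the LB level.
# NLB: off by default; enabling it may incur cross-AZ data transfer charges.
aws elbv2 modify-load-balancer-attributes --load-balancer-arn arn:.* \
    --attributes Key=load_balancing.cross_zone.enabled,Value=true
```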

Load balancers support sticky sessions using cookies, but using them can cause load imbalance.
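Stickiness is a target group attribute. A hedged sketch enabling a duration-based, LB-generated cookie:

```shell
# Enable LB-cookie stickiness with a 1-day duration on a target group
aws elbv2 modify-target-group-attributes --target-group-arn arn:.* \
    --attributes Key=stickiness.enabled,Value=true \
                 Key=stickiness.type,Value=lb_cookie \
                 Key=stickiness.lb_cookie.duration_seconds,Value=86400
```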

Load balancers are not tied to a single subnet; they span the subnets/AZs you enable within a VPC. An ALB gets a DNS name that is dynamically resolved via Route 53; it does not have a static IP.

Load Balancer Dispatch Algorithms

Dispatch algorithms include:

  • Least Pending requests (ALB)
  • Round Robin (ALB)
  • Hash the source/destination IP/port (NLB) same source connects to same destination.

Classic Load Balancer

  • Supports only one SSL Certificate but certificate can have many SAN (Subject Alternate Name)
  • Need multiple CLBs for multiple certificates.
  • ALB is better as it supports SNI (Server name indication).
  • Layer 4 TCP => TCP forwarding to EC2 can happen and may be required for generic protocols requiring an SSL handshake (other than HTTPS). (How?)

Application Load Balancer

.
.                                             Target Group
.                      /user              
.                  +------------------->  ASG  (TargetGroup) | HealthCheck    Task1
.                  |
.      ALB  -------|   /action
.                  +------------------->  Lambda             | HealthCheck    Task2 
.   App Listener   |
.                  |   ?process=reports
.   https Listener +------------------->  IP (ECS Tasks)     | HealthCheck
.   http listener  |   (Port mapping)        (awsvpc net)
.                  |
.   domain, path,  |   /batchjob
.   port based     +------------------->  EC2                | HealthCheck
.   routing        |                   
.                  |   /chat           
.   Dynamic        +------------------->  IP (Fargate)       | HealthCheck
.   Public IP      |                         (awsvpc net)
.                  |   /offers 
.   Security Grp   +------------------->  S3                 | HealthCheck
.                  |
.   No Elastic-IP  |   Host=abc.com
.                  +------------------->  EC2                | HealthCheck    Multi-Domain; Host+Path Routing OK.
.   Many ENIs      |
.                  |   /app1/action
.   SSL Certs      +------------------->  EC2                | HealthCheck    Multi MicroServices/apps;
.

Commands:

# Create ALB enable multi AZ by specifying subnets
aws elbv2 create-load-balancer --name my-alb --subnets subnet-b7d581c0 subnet-8360a9e7

# Note: --port 80 here is the target group (source) port.
aws elbv2 create-target-group --name my-targets --protocol HTTP --port 80 \
                              --target-type instance --vpc-id vpc-3ac0fb5f

aws elbv2 register-targets \
          --target-group-arn arn:.* \
          --targets Id=i-0598c7d356eba48d7,Port=80 Id=i-0598c7d356eba48d7,Port=766
          # target Id could be instance-id, ip or arn of lambda or another alb.
          # Note: target group port is source and instance port is destination.

# Add http listener
aws elbv2 create-listener --load-balancer-arn arn:.* --protocol HTTP --port 80 \
                          --default-actions Type=forward,TargetGroupArn=arn:.*

# default actions could be in JSON for complex specification.
#    actionType: forward | authenticate-oidc|authenticate-cognito|redirect|fixed-response
#    RedirectConfig: { protocol: "HTTP,HTTPS", ... }
#    AuthenticateOidcConfig: {
#         "Issuer": "string",
#         "AuthorizationEndpoint": "string",
#         "ClientId": "string",
#         "SessionCookieName": "string", ...
#         ....
#    }

aws elbv2 create-listener --load-balancer-arn arn:.* --protocol HTTPS --port 443 \
          --certificates CertificateArn=arn:.* --ssl-policy ELBSecurityPolicy-2016-08 \
          --default-actions Type=forward,TargetGroupArn=arn:.*

# Network load balancers support TCP and TLS as protocols and can do SSL termination as well.
aws elbv2 create-listener ...  --protocol TLS --port 443 --certificates ...

aws elbv2 create-rule --listener-arn arn:* --priority 5
          --conditions file://conditions-pattern.json  # Specifies path e.g. /action 
          --actions Type=forward,TargetGroupArn=arn:*
  • URL based routing to different Target Groups.
  • Every target group has a health check.
  • HTTP request is translated to JSON event in case of Lambda as target.
  • Great fit with ECS, supports dynamic port mapping.
  • ALB gets public dynamic IP by default only if it is created in a public subnet. To get static IP or equivalent you can do:
    • Route 53 DNS resolution that resolves ALB name to IP address.
    • AWS Global accelerator
    • Register ALB behind a NLB. NLB supports Elastic IP.
    • Note: You can not attach Elastic IP to ALB!
  • Health check settings are configurable per target group; the check hits a target endpoint, e.g. http://<target>/healthcheck
  • Also See: https://docs.aws.amazon.com/solutions/latest/constructs/aws-alb-fargate.html
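
The health check settings mentioned above can be tuned per target group; a hedged sketch (the ARN and values are placeholders):

```shell
# Tune target group health check settings (illustrative values)
aws elbv2 modify-target-group --target-group-arn arn:.* \
    --health-check-protocol HTTP \
    --health-check-path /healthcheck \
    --health-check-interval-seconds 30 \
    --healthy-threshold-count 3 \
    --unhealthy-threshold-count 2 \
    --matcher HttpCode=200
```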

ALB and Certificates

.
.
.    CloudFront             ------------->  ALB
.    (us-east-1 SSL Cert)                   (Same SSL custom domain Cert OK only if ALB in us-east-1)
.    Custom Domain                          (Otherwise use new ACM certificate in same Region)
.
  • Supports SNI (Server Name Indication), so hostname-based routing is supported. CLB does not support that.
  • SSL Termination is supported!
  • Many applications can be deployed behind a single ALB. Using SNI, domain-based routing is possible; routing can also be based on the incoming port, URL path, and query string.
  • Typically an ALB is associated with at least one HTTPS listener. There can be another HTTP listener that redirects to HTTPS.
  • If ALB is behind cloudfront, then it can share same Cloudfront custom domain certificate. The SSL certificate host and the Host: header should match!

Sticky Sessions in ALB

  • `Application cookie`: Can be generated by ALB with name AWSALBAPP
  • `Custom Cookie`: Generated by target and can be specified for each target group.
  • `Duration based Cookie`: Generated by LB to track duration. Cookie name: AWSALB (for ALB) and AWSELB (for CLB).
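
These cookie modes map to target-group stickiness attributes; a sketch (the ARN and cookie name are placeholders):

```shell
# Duration-based stickiness (LB-generated AWSALB cookie)
aws elbv2 modify-target-group-attributes --target-group-arn arn:.* \
    --attributes Key=stickiness.enabled,Value=true \
                 Key=stickiness.type,Value=lb_cookie \
                 Key=stickiness.lb_cookie.duration_seconds,Value=86400

# Application-based stickiness (cookie generated by the target)
aws elbv2 modify-target-group-attributes --target-group-arn arn:.* \
    --attributes Key=stickiness.enabled,Value=true \
                 Key=stickiness.type,Value=app_cookie \
                 Key=stickiness.app_cookie.cookie_name,Value=MYAPPCOOKIE
```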

ALB and Public IPs

  • Multiple Dynamic public IP addresses (one per subnet per AZ) for one ALB.
  • Multiple ENIs also.
  • Can not attach Elastic IP.
  • The DNS name of the ALB resolves to multiple IP addresses (one per AZ)

Network Load Balancer

.    
.                                     TCP Listener Rules
.    
.        EIP OK.                          TCP/3306
.        Many ENIs     Forwards      +------------->    TargetGroup MySQL <--> HealthCheck
.                    TCP and UDP     |
.    --->  NLB       --------------->|
.        (One Per AZ)                |    TCP/80 
.        (Dynamic IPs)               +------------->    Web Applications <--> HealthCheck
.                                    |                     TargetGroup
.                                    |    TCP/8080
.        No Sec. Group               +------------->    ALB  ----> ASG
.        NACL Subnet OK.
.    
  • Forwards TCP and UDP packets.

  • NLB preserves client IP by default!

  • Rewrites the destination IP from NLB to the target EC2 address; on the way back, it rewrites the source IP address.

  • This can pose a problem if the NLB is used by the target itself: source and destination would then have the same IP, and the packet may be dropped! :

    .
    .                                  Rewrites Dest IP
    .    Source -------------->  NLB -------------------> EC2
    .          <--------------
    .          Rewrite Source IP
    .
    .    Note: If Source === Destination, then Problem!
    .
    
  • The workaround is to disable client-IP preservation, or use Proxy Protocol v2, which disables client-IP preservation and prepends the TCP stream with client IP information (equivalent to X-Forwarded-For). Note: the human-readable header below is the Proxy Protocol v1 text form; v2 is a binary encoding:

    PROXY TCP4 192.168.0.1 192.168.0.11 56324 443\r\n    # Sends same header even for SSH, FTP 
    GET / HTTP/1.1\r\n                                   # If App does not expect, it will fail!
    Host: 192.168.0.11\r\n
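
A toy parser for the text header shown above (field order: protocol, source IP, destination IP, source port, destination port):

```python
def parse_proxy_header(line: str) -> dict:
    """Parse a Proxy Protocol v1 text header like:
    'PROXY TCP4 192.168.0.1 192.168.0.11 56324 443'
    Returns the original client (source) and server (destination) addresses."""
    parts = line.strip().split(" ")
    if parts[0] != "PROXY" or len(parts) != 6:
        raise ValueError("not a v1 PROXY header")
    proto, src_ip, dst_ip, src_port, dst_port = parts[1:]
    return {
        "proto": proto,
        "client": (src_ip, int(src_port)),   # real client IP, preserved by the proxy
        "server": (dst_ip, int(dst_port)),
    }
```

An app that does not expect this header (e.g. plain SSH or FTP) would choke on it, which is why the note above warns about enabling it blindly.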
    
  • Less Latency ~100ms vs ~400ms for ALB.

  • Handles millions of requests per second.

  • One public IP per AZ.

  • Note: NLB Supports Elastic IP unlike ALB!

  • If you enable multi-AZ for the NLB (recommended), you get one NLB node per AZ. The NLB then gets one static IP per enabled availability zone only -- there is no additional global IP allocated.

  • If you enable cross-zone load balancing (off by default), the NLB may route traffic across AZ (may not be desired in many cases)

  • The NLB private IP in each AZ is auto-assigned on creation from the subnet's CIDR block.

  • Target Groups could be:

    • EC2 Instances
    • IP Addresses
    • Application Load Balancers (ALB)
  • NLB can optionally do SSL termination, if you enable it. You need to install SSL certificates at NLB for that.

  • An NLB typically has a DNS name like my-nlb-xxx.elb.us-east-1.amazonaws.com which resolves to multiple IP addresses, one for each AZ. To get a zonal DNS name, add the AZ as a prefix, e.g. us-east-1a.my-nlb-xxx.elb.us-east-1.amazonaws.com

Filtering incoming traffic for NLB

  • Security groups were historically not supported for NLB (unlike ALB); newer NLBs can have security groups attached at creation.
  • Otherwise, use the NACL of the deployed subnet. Affects the whole subnet.
  • May want to block all UDP traffic and such.

NLB and Public IPs

  • Multiple Dynamic public IP addresses (one per subnet per AZ) for one NLB.
  • Multiple ENIs also.
  • The DNS name of the NLB resolves to multiple IP addresses (one per AZ) for load balancing.
  • You can attach Elastic IPs - One per AZ.
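
Attaching one Elastic IP per AZ is done at creation time via subnet mappings; a sketch (all IDs are placeholders):

```shell
# Create an NLB with one Elastic IP per AZ (illustrative IDs)
aws elbv2 create-load-balancer --name my-nlb --type network \
    --subnet-mappings SubnetId=subnet-aaaa,AllocationId=eipalloc-1111 \
                      SubnetId=subnet-bbbb,AllocationId=eipalloc-2222
```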

Gateway Load Balancer

  • Use case: Centralized scalable Firewall router supporting 3rd party security appliances.
  • Analyze and filter IP packets coming from IGW and going to IGW. (or Transit GW)
  • One dedicated VPC with GLB to serve all VPCs in your organization.
  • Enable Multi-AZ support for high availability. Otherwise traffic may go across AZs.
  • You can configure single large EC2 instance and make it as a router and install all firewall 3rd party software on it. But it will not scale!
.       GLBE - Gateway Load Balancer Endpoint
.    
.                                Internet                 +---------------------------------+
.                                  |                      |                                 |
.                                  |                      |   VPC2 (GLB Service Provider)   |
.     +-------------------------- IGW------+              |                                 |
.     |                            |       |              |                                 |
.     |    AZ                      |       |              |                                 |
.     |                            |       |              |                   AZ            |
.     |  (Apps)                    V       |              |                                 |
.     | Subnet-1 <--> Subnet-2    GLB      |  <========>  |  GLB <------>  EC2-Instances    |
.     |                           EndPoint |              |               (Appliances)      |
.     |   (Consumer)VPC                    |              |               (Firewalls)       |
.     +------------------------------------+              +---------------------------------+
.    
  • Used for Firewalls and inspection and payload manipulation, etc.
  • Operates at Layer 3 Network - IP
  • Deploy third-party network virtual appliances (software) such as firewalls and inspection systems. Transparent routing: packets go through all virtual appliances and are then passed on to the application.
  • Uses GENEVE protocol on port 6081
  • Target Group could be EC2 instances (i.e. Virtual appliance runs on this EC2), or IP addresses (i.e. Virtual appliance runs at this IP address)

Note: Elastic Load Balancer (ELB) is not just a load balancer but also a health checker. ELB provides application-level health checks by monitoring an endpoint (a webpage, or a health page in a web application) of an application. It can mark an instance unhealthy so that the ASG terminates that instance.

Running Containers without Load Balancers

  • Kubernetes requires running a minimum of 3 worker nodes. For simple configs, ECS does not have this overhead.
  • ECS requires a load balancer to expose services to the internet, which costs money.
  • Kubernetes can use an Ingress to share a single load balancer across many applications ???

Multi-Region Architecture

  • You can use DynamoDB table with global Table Replication and deploy applications in multiple regions accessing the same DynamoDB table. This way you can have distributed web application server across regions using the same data store.
  • You can have DNS level load balancing (using Route 53) to redirect requests to different application endpoints in different regions (e.g. App Runner Endpoint).

API Gateway

.
.                                      +----> Lambda-Auth
.              REST/HTTP               |
.              Req/Response      (Caching)             Lambda     | HTTPS | Step Func | S3  
.    Client <---------------->  API Gateway    ----->  DataStream | SQS   | SNS | AppRunner 
.              Max: 10MB 29s     (IAM - SIGV4)         DynamoDB   | VPC-Link
.            API Key/Usage Plan  (WebSockets) 
.                                (Private/Pub)
.                                   
  • API Gateway has a 29-second timeout and a 10MB payload limit.
  • Can be integrated with Lambda, EC2 web applications, AWS services such as S3 GetObject, etc.
  • API Gateway cache can be sized from 0.5GB to 237GB. Clients can override the cache with max-age=0 style request headers.
  • Pass IAM credentials in headers through Sig V4.
  • A Lambda Authorizer checks your headers for 3rd-party or custom auth, e.g. tokens obtained from Cognito.
  • You can also generate API Keys and use them for tracking/Usage Plans for premium customers.
  • Using WebSockets you can build a chat application. Operations can still be async: every user message invokes a Lambda that persists messages in DynamoDB. 3 Lambda functions are involved: on-connect, send-message, on-disconnect.
  • A private API Gateway allows access only from selected VPCs and interface endpoints. The Gateway resource policy can restrict aws:SourceVpc and aws:SourceVpce (endpoint). The endpoint policy can restrict the private APIs ??
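
A minimal Lambda Authorizer sketch, to illustrate the mechanism mentioned above (the token check itself is a hypothetical stand-in; the returned IAM policy shape is what API Gateway expects back):

```python
def lambda_authorizer(event, context=None):
    """Token-based Lambda Authorizer: inspect the auth token from the request
    and return an IAM policy allowing or denying execute-api:Invoke."""
    token = event.get("authorizationToken", "")
    effect = "Allow" if token == "valid-token" else "Deny"  # hypothetical validation
    return {
        "principalId": "user-123",  # hypothetical caller identity
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event.get("methodArn", "*"),
            }],
        },
    }
```

In practice the token would be validated against Cognito or a 3rd-party IdP rather than compared to a constant.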

API Gateway Integration Targets

  • Lambda
  • HTTPS Endpoints
  • Step Functions
  • S3
  • Kinesis Data Stream
  • DynamoDB
  • SQS and SNS
  • AWS App Runner
  • VPC Link:
    • Feature provided by API Gateway. Uses NLB underneath attached to any VPC subnet.
    • Target: Microservices, EC2, ECS, Fargate using Private IPs
    • Better security, performance, scalability
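
The VPC Link setup can be sketched as follows (names and IDs are placeholders; REST APIs link to an NLB, HTTP APIs attach directly to subnets):

```shell
# REST API VPC Link: targets an NLB (ARN placeholder)
aws apigateway create-vpc-link --name my-vpc-link --target-arns arn:.*

# HTTP API VPC Link: attaches to subnets / security groups directly
aws apigatewayv2 create-vpc-link --name my-vpc-link \
    --subnet-ids subnet-aaaa subnet-bbbb --security-group-ids sg-1111
```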

Regional vs Edge Optimized API Gateway

Edge-optimized is the default. An edge-optimized API Gateway uses CloudFront internally. If you call the API from within AWS, or put your own CloudFront distribution in front of it, use a Regional deployment instead.

REST API vs HTTP API

  • REST API provides the full advanced feature set: AWS service integrations, API Keys & usage plans, caching, sync/async integration, etc.
  • HTTP API is the newer, lightweight and simpler service. Ideal for integrating with any custom HTTP backend service.

AWS AppSync

.
.    Fully Managed GraphQL API Service.
.
.    
.    Mobile      -----------> AppSync Endpoint         -----> HTTP REST | Lambda | DynamoDB
.    Web App     <----------  [Schema Introspection]              Local Pub/Sub
.                 Websocket   [Resolvers ]
.
.
.    Lambda         Publish                                                   Pub/Sub
.    EventBridge  ----------->  AppSync Pub/Sub API [Serverless WebSockets]  <--------> Mobile/Web
.
.    AppSync === Sync Mobile and Enterprise Apps using GraphQL - 2 way Websockets.
.
  • Managed service using a GraphQL API to fetch data from multiple sources, returning only the specific fields requested.
  • The GraphQL schema specifies the API and the associated Cognito groups for auth.
  • Typically the backend DB is DynamoDB but it could also be RDS.
  • Auto-creates APIs for backend databases like RDS!
# Example GraphQL Schema
type User {
  id: ID!
  name: String!
  email: String!
}

type Query {
    getUser(id: ID!): User
}

# Example Resolver Request mapping template
{
      "version": "2017-02-28",
      "operation": "GetItem",
      "key": {
          "id": $util.dynamodb.toDynamoDBJson($ctx.args.id)
      }
 }

# Example Response mapping template
$util.toJson($ctx.result)

DNS

.                                            
.                                            Root DNS Server (ICANN)
.
.
.    Client ---> Local DNS   ------>         TLD DNS Server (ICANN)
.                   Server                      (.com)
.                         
.                         
.                                            SLD DNS Server
.                                               (mydomain.com)
.
.     Records:
.
.     mydomain.com      A          2.3.4.5        # Maps domain name to IP
.     mydomain.com      A          5.6.7.8        # Multiple A records are Okay!
.     www.mydomain.com  CNAME      mydomain.com   # Subdomain alias. Must be unique.
.                                                 # Top domain can't be CNAME'ed.
.                                                 # Can't co-exist with A record for same name.
.
.     mydomain.com      ALIAS           xyz.com   # Top domain can be aliased.
.                                                 # Non-std extn DNS record type.
.
.     app.xyz.com       ALIAS  myalb.amazonaws.com  # AWS may do recursive lookup and 
.                                                   # return the result as A record.
.                                                   # ALIAS lookups are free for AWS!
.                                                   # No TTL. Because it is not propagated!
.
.     ALIAS Record Targets: 
.           Load Balancers, API Gateway, S3 Websites, VPC Interface EndPoints,
.           Global Accelerator accelerator, CloudFront Distributions, Elastic Beanstalk env
.
.
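
The CNAME vs ALIAS behavior above can be sketched with a toy resolver (the zone data mirrors the records in the diagram; the A record for myalb.amazonaws.com is a made-up value for illustration):

```python
# Toy zone: (name, type) -> record values, mirroring the diagram above.
ZONE = {
    ("mydomain.com", "A"): ["2.3.4.5", "5.6.7.8"],       # multiple A records OK
    ("www.mydomain.com", "CNAME"): ["mydomain.com"],     # subdomain alias
    ("app.xyz.com", "ALIAS"): ["myalb.amazonaws.com"],   # apex-capable alias
    ("myalb.amazonaws.com", "A"): ["9.9.9.9"],           # hypothetical ALB IP
}

def resolve(name, depth=0):
    """Follow CNAME/ALIAS chains until A records are found.
    ALIAS is resolved recursively by the provider and returned as A records."""
    if depth > 8:  # guard against alias loops
        raise RuntimeError("alias chain too long")
    if (name, "A") in ZONE:
        return ZONE[(name, "A")]
    for rtype in ("CNAME", "ALIAS"):
        if (name, rtype) in ZONE:
            return resolve(ZONE[(name, rtype)][0], depth + 1)
    return []
```

The recursive lookup inside `resolve` is the "AWS may do recursive lookup and return the result as A record" behavior: the client only ever sees A records.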

Route 53

.      Public-Hosted-Zones   Private-Hosted-Zones   DNS-Server  DNSSEC
.
.      Routing-Policy  Health-Check   Resolver
.
.
.                          1:N                                           Can Associate
.      Route 53 Resolver  ----- Resolver Rules ----- PrivateHostedZones ---------------- Other VPCs
.                                                |
.                                                +-- External Domains
.
.      Implicit Route 53 Resolver ==  VPC+2 == Amazon Provided DNS == VPC DNS Resolver
.
.      RAM-Share-PHZ       Rule === DNS Records (PHZ) + Forward Pointer (External Domains)
.
.      Target Resources: Cloudfront, ELB, S3, API Gateway, EC2, Global Accelerator   
.
.

Inbound Resolver Endpoint vs Outbound Resolver Endpoint

.
.
.                                                            +--- On-Premise to Resolve VPC domains
.                                                            |
.                                                            V            (Local IP, SG)
.   DNS Clients in VPC ---->   VPC+2  --------- Inbound Resolver Endpoint (On-premise connects to this)
.                                        |      (Optional) (Single Endpoint Resolves All PHZ)
.                                        |
.                                        +----  Outbound Resolver Endpoint (Local IP, SG) (Route 53 uses this)
.                                               (One Per external domain)
.   
.   One Outbound Resolver ---- One external Domain only. (Create multiple Resolver Rules for multiple Domains)
.
.   VPC --- (DHCP Option Set)
.
  • Provides Public Hosted Zones and Private Hosted Zones (within one or more of your VPCs).
  • A private hosted zone is useful when you want to resolve your microservices to private IPs, or to resolve s3.amazonaws.com to a private interface endpoint IP.
  • There is one DNS Resolver per VPC which contains a set of rules. There is one rule per private hosted zone that connects the VPC to the zone.
  • Route 53 is also a domain registrar.
  • Except for ALIAS records, each DNS record type (A, NS, CNAME, etc.) requires a TTL.
  • Route 53 can return multiple values for an A record (i.e. multiple IP addresses).
  • Route 53 needs both a local ENI IP and the remote DNS server IP address to resolve an on-premise domain. The local IP endpoint is called the Outbound Resolver Endpoint. You have to create it before you can create an outbound resolver rule.
  • Creating a private hosted zone does not create an implicit Inbound Resolver Endpoint; you have to create one explicitly. A single inbound resolver endpoint resolves all private hosted zones associated with the VPC. Used by on-premise servers.
  • To prevent man-in-the-middle attacks, DNS responses should be protected from being forged. Enable DNSSEC for your domain; your domain's name server must support DNSSEC. Route 53 supports DNSSEC.
.
.    1 Resolver Rule  <-----> 1 Domain only.
.    1 Resolver Rule  <-----> Atmost 1 Resolver endpoint Only (For outbound resolver rule only)
.
.
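
The outbound endpoint + forwarding rule pair described above might be created like this (all IDs, IPs, and request IDs are placeholders):

```shell
# 1. Create the outbound resolver endpoint (local ENI IPs + security group)
aws route53resolver create-resolver-endpoint \
    --creator-request-id req-001 --direction OUTBOUND \
    --security-group-ids sg-1111 \
    --ip-addresses SubnetId=subnet-aaaa,Ip=10.0.1.10 SubnetId=subnet-bbbb,Ip=10.0.2.10

# 2. Create a forwarding rule for one external domain, referencing that endpoint
aws route53resolver create-resolver-rule \
    --creator-request-id req-002 --rule-type FORWARD \
    --domain-name example.com \
    --target-ips Ip=10.21.1.5,Port=53 \
    --resolver-endpoint-id rslvr-out-xxxx

# 3. Associate the rule with a VPC
aws route53resolver associate-resolver-rule \
    --resolver-rule-id rslvr-rr-xxxx --vpc-id vpc-1234
```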

aws route53 list-hosted-zones
aws route53 list-health-checks

# All resolver rules are meant to be active. You don't disable/enable resolver rules.
aws route53resolver list-resolver-rules   # List resolver rules in current account.

   # Domain name: .            ==> Internet Resolver
   #
   # Domain name: example.com  ==> TargetIps:  10.21.1.5:53;   (Target DNS Server. External Domain Only)
   #                               RuleType: FORWARD                
   #                               ResolverEndpointId: rslvr-out-xxxx  (Local IP endpoint.)
   #                               STATUS: COMPLETE | FAILED | ACTION_NEEDED
   #
   # Domain name: anothervpc.com   OwnerId: <Owner-account-id> (For external VPC only)
   #                               ShareStatus: 'SHARED_WITH_ME'
   #                               RuleType: SYSTEM          (No resolver endpoint for this)
   #

# List VPC-level resolver endpoints. Not all resolver endpoints need be associated with resolver rules.
# An endpoint does not include the domains it is responsible for.
# You should name the endpoint properly if you plan to associate rules later.
aws route53resolver list-resolver-endpoints
{
"ResolverEndpoints": [
    {
        "Id": "rslvr-out-1234567890abcdef0",
         ....
        "Name": "OutboundResolverEndpoint",
        "Direction": "OUTBOUND",
        "IpAddresses": [ {
            ...
            "SubnetId": "subnet-0abc1234def567890",
            "Ip": "10.0.1.10",                        # Local of VPC local IP addresses where DNS server runs.
                                                      # This can be used by on-premise servers also depending on SG.
                                                      # Usually one Outbound Endpoint for One domain.
        }, ... ]
    },
    {
        "Id": "rslvr-in-abcdef1234567890",
        ...
        "Name": "InboundResolverEndpoint",
        "Direction": "INBOUND",
    }
]
}

Route-53 Routing Policies

.
.
.   Active-Active And Active-Passive (Failover)       Application-Routing 
.
.   Target:   NLB, ALB, API Gateway                   Cross-Region
.
.                                                [All Healthcheck + Fallback Support]
.
.                              +----------> Simple (Single A Record)
.                              +----------> Fail Over (Health Check) (Forced Passive)
.                              +----------> Latency Based (Health Check)
.    DNS Name ----> Route 53   |----------> Weighted Records e.g. 70% / 30%
.                              +----------> Geo Location  (Check client IP)
.                              +----------> Geo Proximity (Check client IP)
.                              +----------> Multi Valued (upto 8) A Records (Client LB)
.                              +----------> IP Based (Client IP CIDR <--> A Record Mapping)
.
.

Different Routing Policies exist for returning values for resolution:

  • Weighted records policy allows returning val1 for 70% and val2 for 30%
  • Latency based policy returns A record based on least Latency.
  • Failover Routing policy which uses "Healthcheck" and returns the healthy A record.
  • Geolocation routing policy is based on user location. Also specify a "Default" value in case there is no match on the location. Can use health checks too.
  • GeoProximity routing policy is similar to geolocation. You can also add a bias value to direct more traffic to one resource.
  • Multi-value routing policy returns multiple (up to 8) values for a single A record; client-side load balancing can be used to connect to different servers.
  • IP-based routing policy maps client CIDR blocks to the values to be returned.
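
The weighted policy (e.g. the 70% / 30% split above) can be sketched as a proportional pick, where `r` stands for a uniform random draw in [0, 1):

```python
def pick_weighted(records, r):
    """Pick one record with probability proportional to its weight.
    records: list of (value, weight); r: uniform draw in [0, 1)."""
    total = sum(w for _, w in records)
    threshold = r * total
    acc = 0
    for value, weight in records:
        acc += weight
        if threshold < acc:
            return value
    return records[-1][0]  # guard against float edge cases at r ~ 1.0
```

With records `[("v1", 70), ("v2", 30)]`, draws below 0.7 return v1 and the rest return v2, matching the 70/30 split.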

Route-53 Public vs Private Hosted Zones

|
|        Private-Hosted-Zone ------ VPC1, VPC2, VPC3 (Same Account)
|                              |
|                              +--- External VPC (Using CLI only)
  • There are public and private hosted zones. A private hosted zone is for using Route 53 from within a VPC. You must enable enableDnsHostnames and enableDnsSupport in the VPC settings.

  • A public hosted zone is for use by the world for your public DNS name.

  • You can associate more VPCs in the same account with a private hosted zone:

    associate-vpc-with-hosted-zone --hosted-zone-id <value> --vpc <value>
    
  • To associate another account (B) VPC to your (A) private hosted zone:

    aws route53 list-hosted-zones
    
    aws route53 list-vpc-association-authorizations --hosted-zone-id <hosted-zone-id>
    
    aws route53 create-vpc-association-authorization --hosted-zone-id <hosted-zone-id> 
          --vpc VPCRegion=<region>,VPCId=<vpc-id> --region us-east-1
    
    # From Account B
    aws route53 associate-vpc-with-hosted-zone --hosted-zone-id <hosted-zone-id> 
          --vpc VPCRegion=<region>,VPCId=<vpc-id> --region us-east-1
    
    # From Account A
    aws route53 delete-vpc-association-authorization --hosted-zone-id <hosted-zone-id>  
          --vpc VPCRegion=<region>,VPCId=<vpc-id> --region us-east-1
    

Route-53 Health Check Monitors

  • Here is an example to monitor an ALB where IP address is not known:

    aws route53 create-health-check --caller-reference unique-alb-check-98765 \
                --health-check-config '{
    
                  # Use IPAddress to monitor EIP or known endpoint.
                  "FullyQualifiedDomainName": "my-alb-1234567890.us-east-1.elb.amazonaws.com",
                  "Port": 80,
                  "Type": "HTTP",              # "TCP" to monitor NLB.
                  "ResourcePath": "/health",   # Or just "/"; It should just return 200 
                  "RequestInterval": 30,
                  "FailureThreshold": 3
                }'
    
  • A health check monitors endpoints such as an application (ALB), a server, or another AWS resource. A health check can also monitor other health checks (Calculated Health Checks).

  • A health check can also monitor CloudWatch alarms.

  • There are about 15 global health checkers available.

  • Global health checkers cannot reach private hosted zones (VPC-internal resources), so you must rely on a custom CloudWatch metric with an associated CloudWatch alarm, and have the health check monitor that alarm.
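
The FailureThreshold behavior from the config above can be sketched as a tiny state machine (assumed semantics: N consecutive failures mark the target unhealthy, N consecutive successes mark it healthy again):

```python
class HealthCheck:
    """Mark a target unhealthy after `threshold` consecutive failures,
    and healthy again after `threshold` consecutive successes."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.fail_streak = 0
        self.ok_streak = 0
        self.healthy = True

    def observe(self, ok: bool) -> bool:
        if ok:
            self.ok_streak += 1
            self.fail_streak = 0           # any success resets the failure streak
            if self.ok_streak >= self.threshold:
                self.healthy = True
        else:
            self.fail_streak += 1
            self.ok_streak = 0             # any failure resets the success streak
            if self.fail_streak >= self.threshold:
                self.healthy = False
        return self.healthy
```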

Route 53 Resolver Inbound Endpoint

  • The default VPC+2 DNS server won't entertain DNS queries from outside VPC.
  • Resolver Inbound Endpoint is created at VPC only for use by on-premise servers to resolve PHZ names like *.aws.private
  • Resolver-Inbound-Endpoint provides you with 2 private ENI IPs associated with 2 private subnets.

Route 53 Resolver Outbound Endpoint

  • A Resolver Outbound Endpoint must exist before you create Resolver forwarding rules; each forwarding rule references an outbound endpoint.
  • Mainly used to resolve on-premise domains from VPC like *.onpremise.private
  • These resolver rules can be shared across accounts using AWS RAM.

Route 53 Resolver aka Amazon DNS Server aka VPC+2

.
.                                                             Resolver Forwarding Rules
.                              VPC                                |
.                                                                 V
.                                Route 53     ----->  Outbound EndPoint ---> On-Premise Server
.       Resolve domain.com --->  Resolver     <-----  Inbound EndPoint  <--- On-Premise Server
.                                (VPC+2)
.
  • By default Route 53 Resolver exists at VPC+2 and resolves VPC internal domains and public domains.
  • Inbound Endpoint is optional. Only useful for external/on-premise server. VPC+2 does not accept incoming DNS requests from external networks or other VPCs.
  • Outbound Endpoint is also optional. Only useful if you need to resolve on-premise/other VPC domains. Outbound Endpoint has forwarding rules. (not inbound endpoints).
  • Amazon Route 53 private hosted zones is part of Route 53 Resolver core.
  • Forwarding rule could be a conditional forwarding rule which means only for certain subdomain like internal.mydomain.com to forward it and let mydomain.com resolution be handled by AmazonDns.

Route 53 ARC - Application Recovery Controller

  • ARC helps you prepare for faster recovery for applications
  • ARC provides the following capabilities:
    • Multi-Availability Zone (AZ) recovery -- including zonal shift and zonal autoshift
    • Multi-Region recovery, which includes routing control for failover and readiness check for application monitoring.

Multi-Availability Zone recovery:

  • Zonal shift : You can use ARC zonal shift to quickly isolate and recover from single AZ impairments.
  • Zonal shifts are manual and temporary. When you start a zonal shift, you must specify an (extendable) expiration of up to three days.
  • Zonal autoshift: ARC zonal autoshift authorizes AWS to shift traffic away from an impaired AZ on your behalf to other healthy AZs in the same AWS Region.
  • The internal telemetry incorporates metrics from multiple sources to detect health.
  • Zonal autoshifts are temporary. AWS ends a zonal autoshift when the internal telemetry indicators are fine.
  • Readiness check: ARC readiness checks continually monitor AWS resource quotas, capacity, and network routing policies, and can notify you about changes that may affect your ability to fail over during a regional failure. It is mainly for alerting; it takes no other action.

Commands:

# Cells in ARC define the logical groups of resources 

# Readiness Checks validate the health of your resources in each cell.

# Note: there is no "aws arc" CLI; ARC commands live under the
# route53-recovery-readiness, route53-recovery-control-config and
# route53-recovery-cluster namespaces. Flags below are illustrative.

aws route53-recovery-readiness create-cell --cell-name PrimaryAppCell
aws route53-recovery-readiness create-cell --cell-name SecondaryAppCell

aws route53-recovery-readiness create-readiness-check --readiness-check-name PrimaryAppReadinessCheck --resource-set-name PrimaryResources
aws route53-recovery-readiness create-readiness-check --readiness-check-name SecondaryAppReadinessCheck --resource-set-name SecondaryResources

# Create Routing Controls to Manage Traffic

aws route53-recovery-control-config create-routing-control --routing-control-name PrimaryRegionControl --cluster-arn arn:.*
aws route53-recovery-control-config create-routing-control --routing-control-name SecondaryRegionControl --cluster-arn arn:.*

# Set Up Route 53 Health Checks and Failover Policies

aws route53 create-health-check  --caller-reference primary-health-check \
                                 --health-check-config IPAddress=xx.xx.xx.xx,Port=80,Type=HTTP,ResourcePath="/"

# Perform Manual Failover (Test) Using ARC
aws route53-recovery-cluster update-routing-control-state \
  --routing-control-arn arn:.* \
  --routing-control-state Off

aws route53-recovery-cluster update-routing-control-state \
  --routing-control-arn arn:.* \
  --routing-control-state On

AWS Global Accelerator

.
.  Global-Application-Router   Global-AWS-Network   TCP+UDP
.
.  Endpoint-Group-per-Region   Health-Check  Listeners-with-Port
.
.  Intelligent-Routing    Static-IPs+DNS-Name
.
.                          Upto 10 Endpoints
.     Global-Accelerator ---------------------> ALB, NLB, EC2, IP 
.                          Listeners
.                            :80 :443
.
.     1 Endpoint Group Per Region.
.
.     Routing and Healthcheck independent of Route 53.
.
  • Use AWS internal network to route to your application using static IP for you!
  • Targets: ALB, NLB, EC2, EIP
  • Especially useful for ALB which does not have static IP support.
  • 2 Anycast IPs are created for your application and associated with edge locations (2 IPs for high availability and client-side failover).
  • You also get a DNS name: e.g., abcdef1234567890.awsglobalaccelerator.com.
  • The connection is directly routed from edge location to your ALB (using internal AWS Global network than public internet!).
  • Also supports healthcheck for your applications. Great for disaster recovery.
  • Can adjust routing more quickly than Route 53 because it doesn't depend on DNS TTL expiration.
  • Fast regional failover and load balancing. Can be like Global Application Load Balancer!

Anycast IP Address

Anycast lets multiple servers advertise the same IP address; the network routes each client to the nearest one. It is typically used for stateless services like DNS servers.

Solution Architecture Examples

Elastic IP For Failover

  • For quick failover of One EC2 to another, use Elastic IP that can be detached from one EC2 instance and attached to another if primary goes down. Simple but does not scale.
  • Stateless web app scaling horizontally: Use single Route-53 DNS name which resolves to 3 different private IPs of EC2 instances. Due to DNS TTL the failover may be too slow (like 1 hour)

ALB For Loadbalancing

  • ALB + ASG + EC2 : ALB itself hosted in multi AZ. ASG also exists in multi AZ. Classic Architecture. Scales well. Can't handle sudden peak load. May need pre-warm.
  • ALB + ASG + ECS on EC2: Similar to above. Tough to orchestrate ECS Service auto-scaling + ASG auto-scaling ! Two sets of autoscaling rules needed!
  • ALB + ECS on Fargate : Scales well and easy to manage.

ALB + Lambda

  • Simpler alternative to API Gateway to expose lambda as HTTP API.
  • You can also use ECS for some requests and use Lambda for others for your microservice.

API Gateway + Lambda

Better auth support; a standard, well-documented approach.

API Gateway + AWS Service

You can directly integrate with AWS Services using first class integration using HTTP API. Note: There is a payload limit of 10 MB going through API Gateway.

Following services are supported:

  • SQS - Directly push message into SQS.
  • Eventbridge - From here pretty much all services can be integrated without coding.
  • Kinesis - Putrecord
  • Stepfunctions

API Gateway + VPC Private Resource (e.g. ALB)

You can create VPC Link by specifying the subnets and then integrate using that VPC Link.

API Gateway + HTTP backend (ALB)

Expose third party http service with auth and other support.

CloudFront

|
|  CDN    Regions(34) AZs(108)   Regional-Edge-Caches(13)  Edge-Locations(215)  POPs(600)
|
|  WebSockets   Origins OrginGroups  OAI OAC  Lambda@Edge  RestrictViewerAccess
|
|  CloudFront-Signed-URL  Signed-Cookies   Cache-Behaviour   Geo-Restrictions-Blocking
|
|
|                                           new HTTP session     Supported Origins
|            https                             https/http                         
|   Viewer --------->      Cloudfront       ---------------> ALB  | EC2 | API GW | HTTP
|                       us-east-1-SSL-Cert           
|                          Lambda@Edge                   
|
|                             +---  Origin Group ------+
|                             |   Fail-Over            |
|      CloudFront ----------->|   Primary Origin       |
|                             |   Secondary Origin     |
|                             |   Health Check         |
|                             +------------------------+
|
  • Content Delivery Network (CDN)

  • Content cached at edge

  • 225+ Points of Presence (215 Edge Locations and 13 Regional Edge Caches)

  • Protection against DDoS attacks and integration with AWS Shield, WAF and Route 53

  • can talk to internal HTTPS backends

  • Supports Websockets

  • Supported Origins:

    - S3 Bucket for distributing files. 
    - Above Works with Origin Access Control (OAC) replacing Origin Access Identity(OAI)
    - Can be used as ingress to upload files to S3. (using S3 transfer acceleration)
    - S3 Bucket configured as website.
    - Mediastore Container to deliver Video on Demand (VOD) using AWS Media Services
    - Custom origin HTTP:
    
      + API Gateway
      + EC2 instance
      + ALB or CLB
      + Any HTTP backend
    
  • Custom Origins:

    Custom Origins (like EC2 and ALB) need not whitelist client IPs 
    but should whitelist edge location IPs. 
    
    The EC2 (or ALB) must be reachable via a public IP, not a private one.
    
    To prevent others from directly accessing the EC2, you can configure CloudFront
    to add a custom HTTP header (name=value) as a shared secret, and filter
    requests at the backend by checking that header.
    You can also use the security group of the EC2 to allow only edge location IPs.
    
  • CloudFront `Origin Groups`:

    Origin Groups help to increase HA and failover. 
    
          CloudFront ---------> Origin Group (Two EC2s in different regions)
    
    You can specify 2 Origins in a group, e.g. EC2 in 2 different regions.
    If the request returns error code, it will be retried in second Origin.
    
  • Cloudfront + API Gateway -- Multi-Region Architecture:

    .                                                                      DynamoDB
    .                          Lambda@Edge   +--> API GW Region1 -- Lambda --> Global DB
    . Client ---> CloudFront ----------------|                                    |
    .                                        +--> API GW Region2 -- Lambda -------+
    .
    
def lambda_handler(event, context):
  # Origin Request trigger: route the viewer to a regional origin by country.
  request = event['Records'][0]['cf']['request']
  headers = request['headers']

  country_code = headers.get('cloudfront-viewer-country', [{}])[0].get('value')

  if country_code in ['US', 'CA']:
      request['origin'] = {
          "custom": {
              "domainName": "us-origin.example.com",  # must already be configured
              "port": 443,
              "protocol": "https",
              "path": "",
              "sslProtocols": ["TLSv1.2"],
              "readTimeout": 5,
              "keepaliveTimeout": 5,
              "customHeaders": {}
          }
      }
  # elif ...: route other countries to other origins

  return request
  • CloudFront Geo Restriction is possible.
  • Note: the CloudFront-Viewer-Country geo header is available to Lambda@Edge.
  • The cost of data-out varies per edge location. Cheapest is the US; costliest is India. Price classes: All regions, Class 200 (most regions, excluding the expensive ones), Class 100 (only the cheapest regions).
  • It costs around $0.85 per each download of 10 GB data! (8.5 c/GB) Quite expensive.
  • CloudFront can be compared to S3 Cross Region Replication (CRR). CRR replicates faster, in real time, but must be set up per region. CloudFront is simpler to set up.
  • HTTPS-only config requires the Viewer Protocol Policy setting on the CloudFront cache behaviour (Redirect HTTP to HTTPS, or HTTPS only).
  • Requiring HTTPS to an S3 origin just needs the Origin Protocol Policy setting; no additional SSL certificate is needed on the S3 side.
  • Custom Error pages for HTTP 4xx or 5xx errors can be configured as cached for some TTLs.
  • CloudFront also supports uploading files to S3 (POST requests). This can speed up uploads, but S3 Transfer Acceleration does the same thing at lower cost.
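
The shared-secret custom header trick mentioned above (for locking down a custom origin) can be checked at the backend with a few lines. This is a minimal sketch; the header name and secret value are hypothetical placeholders.

```python
import hmac

# Hypothetical shared secret, configured both on the CloudFront distribution
# (as a custom origin header) and here on the origin.
SECRET_HEADER = "x-origin-verify"
SECRET_VALUE = "replace-with-a-long-random-string"

def is_from_cloudfront(headers: dict) -> bool:
    """Return True only if the request carries the secret header CloudFront adds."""
    received = headers.get(SECRET_HEADER, "")
    # compare_digest avoids leaking the secret through timing differences
    return hmac.compare_digest(received, SECRET_VALUE)
```

Wire this into the backend's request middleware and reject requests where it returns False; rotate the secret periodically.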

CloudFront Signed URL and Signed Cookies

Signed URLs and signed cookies achieve the same purpose. The following query parameters are reserved for signing; do not use them in your application:

Expires
Policy          # Canned policy or Custom Policy
Signature
Key-Pair-Id  Trusted-signers

.
.    https://abc.com
.
.                     OAC
.                     Restrict Viewer Access (/private or /public URL based)
.
.    Application Signed URL:     Sign using Certificate trusted by Cloudfront.
.          /private/file?Signature=xxxxx
.
.    OAI - Virtual user for Cloudfront. Used to give permission to read S3 from cloudfront.
.
.    Viewer-Protocol-Policy Origin-Protocol-Policy 
.
.    Enable Restrict-Viewer-Access == Require SignedURL 
.
  • CloudFront distribution can be configured with Restricted Viewer access with signed URL only.

  • In that case, you also need to configure trusted Key Group (public and private RSA Key) used to generate URL signing.

  • You can use the Cache Behaviour path to enable Restrict-Viewer-Access only for some paths.

  • CloudFront Signed URL is generated by API call into CloudFront as a trusted signer.

  • CloudFront Signed URLs apply to any origin path (S3 or not) and leverage caching, whereas S3 pre-signed URLs apply only to S3 buckets.

  • Note that an S3 pre-signed URL uses your access-key/secret-key with an HMAC signature, whereas CloudFront uses a special key pair attached to CloudFront for signing.

  • A CloudFront signed URL looks like this (uses a key pair):

    https://d111111abcdef8.cloudfront.net/path/to/file.jpg?
    Expires=1669999200&
    Signature=EXAMPLESIGNATURE&       <-- Algorithm: RSA-SHA256 (RSA-SHA1 is legacy)
    Key-Pair-Id=APKAIXXXXXXXXXXXX
    
  • S3 Signed URL looks like this (Uses your IAM access Keys):

    https://my-bucket.s3.amazonaws.com/my-object?
    X-Amz-Algorithm=AWS4-HMAC-SHA256&
    X-Amz-Credential=YOUR-ACCESS-KEY-ID/20240329/us-east-1/s3/aws4_request&
    X-Amz-Date=20240329T120000Z&
    X-Amz-Expires=3600&
    X-Amz-SignedHeaders=host&    <-- Host: my-bucket.s3.amazonaws.com is also signed.
    X-Amz-Signature=EXAMPLESIGNATURE    <-- HMAC Hash based Msg Auth Code signature.
    
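To make the signing idea concrete, here is a deliberately simplified sketch of HMAC-based URL signing. It is NOT the real SigV4 algorithm (which signs a canonical request with a date-derived key); it only illustrates why the URL carries the credential, expiry and signature, and why anyone holding the secret key (i.e. S3) can recompute and verify the signature.

```python
import hashlib
import hmac
from urllib.parse import urlencode

def toy_presign(path: str, access_key: str, secret_key: str, expires: int) -> str:
    """Simplified illustration of HMAC URL signing (NOT real SigV4)."""
    query = {
        "X-Amz-Credential": access_key,
        "X-Amz-Expires": str(expires),
    }
    # Sign the path plus the sorted query string, like a canonical request
    string_to_sign = path + "?" + urlencode(sorted(query.items()))
    signature = hmac.new(secret_key.encode(), string_to_sign.encode(),
                         hashlib.sha256).hexdigest()
    query["X-Amz-Signature"] = signature
    return path + "?" + urlencode(sorted(query.items()))
```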

CloudFront, Lambda@Edge and CloudFront Functions

Use Cases: Authentication, Geo customizations.

|
|    Lambda@Edge, CloudFront Functions can intercept Requests.
|              
|                                  +----> External-Auth                  
|                                  |                  
|                              Lambda@Edge              
|                                  |
|            Viewer Request        |          Origin Request
|          ----------------->      |        ---------------->
|  Viewer  <----------------    CloudFront  <----------------   Origin(S3)
|            Viewer Response                 Origin Response
|                                         
|          (CloudFront Func)                 (Lambda@Edge 4 Hooks)
|          (2 Hooks only)                 
|                                         
  • CloudFront provides: CloudFront Functions and Lambda@Edge
  • Use Cases: geo customizations, security, A/B testing, header manipulation (strip insecure headers), URL rewrites or redirects, etc.
  • Lambda@Edge functions execute at Regional Edge Cache (one per region) but CloudFront Functions execute at Edge locations (many).
  • CloudFront Functions:
    • Lightweight functions in JavaScript
    • Supports only 2 hooks: Viewer Request and Viewer Response Hooks only.
    • Sub-ms startup times, millions of requests/sec. Run at edge locations.
    • Process based isolation. Native feature of CloudFront.
    • Cannot call external services; logic must be simple. Max execution time is < 1 ms! Max memory is 2 MB! Total package size 10 KB! No network access and no access to the request body.
    • Can do header manipulation, validate JWT tokens, etc.
  • Lambda@Edge :
    • Written in NodeJS or Python. Predates the more modern CloudFront Functions.
    • A heavier solution than CloudFront Functions; scales to 1000s of requests/sec.
    • Lambda@Edge concurrency limits: 1000 per region (across all Lambda); Max 10K RPS.
    • Runs at Regional Cache. VM-based isolation.
    • Provides more hooks than CloudFront Functions: viewer request/response and origin request/response, 4 hooks in total vs 2 for CloudFront Functions.
    • Max execution time is 5 seconds for viewers triggers and 30 secs for origin triggers.
    • Max Mem is 128 MB (viewer trigger), 10 GB (origin triggers)
    • Total package size 1 MB for viewer trigger and 50 MB for origin triggers.
    • Network access available and access to request body available.
    • You can manipulate request URL, headers. Can load different images based on User agent.
  • You can use both CloudFront functions and Lambda@Edge together.
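
A minimal Viewer Request Lambda@Edge sketch combining two of the use cases above (header stripping and URL rewrites); the header name and path prefix are made-up examples.

```python
def lambda_handler(event, context):
    """Viewer Request trigger: strip an unwanted header and rewrite legacy paths."""
    request = event['Records'][0]['cf']['request']

    # Never forward this (hypothetical) header to the origin
    request['headers'].pop('x-internal-debug', None)

    # Rewrite a legacy URL prefix
    if request['uri'].startswith('/old/'):
        request['uri'] = '/new/' + request['uri'][len('/old/'):]

    return request
```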

Cloudfront and Field Level Encryption

.
.                               Encrypt-Using      Decrypt Using
.                               Public Key         Private Key
.    Request ----> Cloudfront ------------------>  Origin  (S3 or Custom)
.
  • You can additionally encrypt certain parts of your requests, e.g. POST data containing a credit card number.
  • Create a field-level encryption configuration in CloudFront and associate it via a Cache Behaviour.
  • Involves Generating Key Pair
  • If you have S3 origin, there is no additional S3 configuration to make it aware of Private key.
  • The custom Origin should have access to Private Key to decrypt fields.

Amazon ElastiCache

|          Managed Redis or Memcached (Key value Store)
|
|             Read/Write                     CacheMiss
|     App  <---------------->  ElastiCache <---------------->   RDS
|             SessionData
|
|                            Backup
|           REDIS         <------------> Disk    Note: HA but no horizontal scaling.
|        AZ1      AZ2        Restore
|
|
|                       Read/Write     Sharded     Note: Partition scaling. No HA.
|           Memcached -------------->  Partitions        No persistence.
|
|
|     Serverless Option
|
  • By default ElastiCache is a managed Redis on AWS. ElastiCache for Memcached is a managed Memcached on AWS.
  • It is key=value store in-memory database.
  • Application has to be refactored to lookup ElastiCache
  • Common pattern to implement it as session store. Write session data to cache and only maintain the session key at application. Application server retrieves session data from ElastiCache.
  • REDIS:
    • Multi-AZ with Failover
    • Persistent, backup and restore possible.
    • IAM Auth
    • Redis password/token support (Usually single password for all access. But ACL also supported)
    • Fine grained ACL access support. (multiple users with passwords and some users can only read etc)
    • In-flight SSL encryption support.
    • Use Case: Gaming Leaderboards; Redis Sorted Sets provide element ordering and uniqueness.
  • Memcached:
    • Multi-node for data partitioning (Sharding)
    • Non-persistent
    • Multi-threaded architecture.
    • optional SASL (Simple Auth and Security Layer) based authentication.
    • No TLS, ACL support
    • You can use Security Group for the memcached cluster restricting the network inbound IP addresses. Otherwise it is pretty much open.
    • The application connection URL usually contains the cluster endpoint, and the client auto-discovers the node endpoints. Some clients do not support auto-discovery; for those, after adding new nodes you must update the application with the new endpoints.
  • Can be used in front of DB for records or store some aggregation results also.
  • From 2023, Serverless Elasticache option is now available and preferred!
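
The session-store / cache-aside pattern above can be sketched as follows. `FakeCache` is a dict-backed stand-in for a redis-py-like client (get/setex) so the flow can be shown without a real cluster.

```python
import json

def get_session(cache, db_lookup, session_id, ttl_seconds=3600):
    """Cache-aside: try ElastiCache first, fall back to the source of truth."""
    cached = cache.get(session_id)
    if cached is not None:
        return json.loads(cached)          # cache hit
    data = db_lookup(session_id)           # cache miss: load from DB
    cache.setex(session_id, ttl_seconds, json.dumps(data))
    return data

class FakeCache:
    """Minimal in-memory stand-in for a Redis client (ignores TTL)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def setex(self, key, ttl, value):
        self.store[key] = value
```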

MemoryDB For Redis

.
.     Cluster Nodes Shards Primary-node Secondary-nodes
.
  • Redis OSS (Open Source Software) compatible.
  • Elasticache for Redis is only a caching layer and not durable.
  • For durable Redis cluster, you should use Amazon MemoryDB.
  • In-memory and durable with low latency.
  • Multi-AZ with auto failover.
  • Can be used for persistent session storage.
  • Shards is the data partitioning across nodes. Nodes run in different machines. You add shards for horizontal scaling.
  • Each shard has 1 primary node (read/write) and upto 5 replica nodes (only reads).
  • Max 500 nodes per cluster: e.g. 500 shards + 0 replicas (500 nodes), or 100 shards + 4 replicas each (500 nodes).
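
The shard/replica arithmetic can be sketched as a small helper, using the 500-node cap above:

```python
def memorydb_total_nodes(shards: int, replicas_per_shard: int) -> int:
    """Each shard has 1 primary plus its replicas; MemoryDB caps a cluster at 500 nodes."""
    total = shards * (1 + replicas_per_shard)
    if total > 500:
        raise ValueError("MemoryDB clusters are limited to 500 nodes")
    return total
```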

DynamoDB

.  Sources
.                                      Single Region
.      Lambda       -------------->    DynamoDB
.  API Gateway                         Unlimited Storage
.  Step Functions                      WCU/RCU Provisioned/OnDemand or
.  Glue                                WCU/RCU Optional Autoscaling
.  IOT Core Rules                      Strong or Eventual Consistent.
.
.                             DynamoDB Global Tables
.
.       client  ----------->   Region-1 (Read/Write) (Master-Master)
.       (Application           Region-2 (Read/Write)
.        Auto-failover)
.                              (Eventual Consistency Only)
.
.
.  Last-Writer-Wins-By-Request-Timestamp   WCU RCU
.
.  PartitionKey + SortKey     LSI  GSI  Stream DAX  TTL
.
.  Global Tables   Items Attributes  Storage-Always-Unlimited
.
.  Active-Active   AutoScaling-AutoScales-WCU-RCU
.
.  AutoScaling  AdaptiveCapacity
.
.
.
.  Last Writer Wins based on request timestamp; 
.  DAX  PartitionKey or (PartitionKey+SortKey) Composite Key
.  Storage Autoscaled
.  Capacity Provisioned or OnDemand
.  Item (Rows), Attributes (Columns), Different rows different attributes OK.
.  Multiple Indexes Okay. (Global Secondary Index).
.  Global Tables - Cross Region - Active 
  • NoSQL DB. Fully managed. Scales to millions of requests per second.
  • Read: eventual or strong consistency; ACID transaction support.
  • Resolves concurrent writes with an LWW (last writer wins) strategy, based on the request timestamps of the conflicting writes.
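
A toy sketch of last-writer-wins merging, assuming each replica stores key -> (timestamp, value); DynamoDB global tables resolve conflicts per item internally in a similar spirit.

```python
def last_writer_wins(replica_a: dict, replica_b: dict) -> dict:
    """Merge two replicas: for each key, keep the value with the newest timestamp."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged
```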

Capacity Planning

  • No disk space to provision. Max object size is 400 KB! (but total capacity is unlimited!)
  • Capacity:
    • provisioned (Provision max WCU/RCU)
    • on-demand (pay per request; fully automatic, no capacity planning)
  • In provisioned mode, you can specify RCU and WCU (read/write capacity units).
  • 1 WCU = 1 write/sec of an item up to 1 KB.
  • 1 RCU = 1 strongly consistent read/sec of up to 4 KB (8 KB/sec for eventually consistent reads, 2 KB/sec for transactional reads).
  • Provisioned capacity cannot be limited to certain intervals (like weekends); you pay 24/7. So on-demand is often useful as pay-per-use.
  • With Autoscaling, you use Target Utilization (specified as 70% or 80%, etc) where RCU/WCU will be adjusted accordingly (and will scale down later when it is idle).
  • Adaptive Capacity is complementary to AutoScaling: it quickly and dynamically adjusts to partition-specific access patterns. Fully automatic; no parameters to specify.
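
The WCU/RCU arithmetic above can be sketched as a small calculator (item sizes are rounded up to 1 KB write units / 4 KB read units):

```python
import math

def wcu_needed(writes_per_sec: int, item_size_kb: float) -> int:
    """1 WCU = one 1 KB write/sec; larger items cost ceil(size_kb / 1) WCUs each."""
    return writes_per_sec * math.ceil(item_size_kb / 1)

def rcu_needed(reads_per_sec: int, item_size_kb: float, consistency: str = "strong") -> int:
    """1 RCU = one 4 KB strongly consistent read/sec, two eventually
    consistent reads/sec, or half a transactional read/sec."""
    units_per_read = math.ceil(item_size_kb / 4)
    factor = {"eventual": 0.5, "strong": 1, "transactional": 2}[consistency]
    return math.ceil(reads_per_sec * units_per_read * factor)
```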

Keys and Indexes

  • Supports primary key, items (rows), attributes (columns).
  • DynamoDB - Primary Keys can be one of:
    • Unique Partition Key. e.g. userId (Uses hashing)
    • Unique Partition Key + Sort Key - a composite primary key, e.g. a user-games table keyed by user_id + game_id. The sort key is also called the range key.
    • Another good sort key is a timestamp.
  • Indexes:
    • LSI - Local Secondary Index: Keep PK. Select another sort key. Must be defined at creation
    • GSI - Global Secondary Index: Different Primary key + optional sort key. Can be defined after table creation.

Global Tables (Cross region replication)

  • Master-Master Active replication. Read/Write all regions.
  • Must enable streams.
  • Useful for low latency and DR purposes.

DynamoDB PITR

  • It is enabled at table level.
  • Once Point-in-time recovery is enabled, continuous backup is active for the last 35 days.
  • For RDS it is enabled through the backup configuration: enable automated backups and set the retention period (up to 35 days). Otherwise it works pretty much the same for both DynamoDB and RDS.

DynamoDB TTL

  • TTL : automatically expire row after specified epoch date.
  • For this assign an attribute to hold expiration time.
  • The system deletes the item after expiry, typically within a few days; no exact deletion time is guaranteed.
  • It should be enabled at Table level and the attribute must be specified.
  • The deletion is sent to the stream (marked as a system delete, distinguishable from user deletes), and relevant indexes are updated by the system.
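
Setting the TTL attribute is just writing an epoch-seconds number on the item; a minimal sketch (the attribute name 'expires_at' is a placeholder for whatever attribute you enabled TTL with on the table):

```python
import time

def item_with_ttl(item: dict, lifetime_seconds: int,
                  ttl_attribute: str = "expires_at") -> dict:
    """Attach a TTL attribute holding an epoch-seconds expiry, as DynamoDB TTL expects."""
    expiry = int(time.time()) + lifetime_seconds
    return {**item, ttl_attribute: expiry}
```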

Notes

  • Table Classes: Standard and Infrequent Access (IA)
  • Data types supported are:
    • Scalar types: String, Number, Binary, Boolean, Null
    • Document Types: List, Map
    • Set Types: String Set, Number Set, Binary Set
  • DynamoDB Streams:
    • React to changes to Table in real time.
    • Can be read by Lambda
    • 24 hours retention of data.
    • Lambda can send output to S3 for change backup.
  • DynamoDB - DAX = DynamoDB Accelerator:
    • Seamless cache for DynamoDB.
    • 5 minutes TTL for cache by default.
    • up to 10 nodes in cluster
    • Micro second latency for cached reads
    • Multi AZ support
    • Secure - Encryption at Rest with KMS, VPC integration, IAM, etc.
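
A minimal sketch of a Lambda reading DynamoDB Stream records and grouping changed keys by event type; the event shape below follows the standard stream record format:

```python
def handle_stream(event: dict) -> dict:
    """Group the keys of changed items by stream event type (INSERT/MODIFY/REMOVE)."""
    changes = {"INSERT": [], "MODIFY": [], "REMOVE": []}
    for record in event.get("Records", []):
        keys = record["dynamodb"]["Keys"]
        changes[record["eventName"]].append(keys)
    return changes
```

A real handler would act on the changes (e.g. archive them to S3, as noted above) instead of just returning them.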

OpenSearch (aka ElasticSearch)

|
|     ELK (ElasticSearch Logstash Kibana)
|     Realtime Indexing   FullText
|
|     Logs              --------> OpenSearch  -----------> Kibana  (Realtime)
|     Clickstream
|     Cloudwatch Logs
|
  • Managed OpenSearch (fork of ElasticSearch which is open)
  • Kibana is known as OpenSearch Dashboards
  • Use Cases:
    • Log Analytics
    • Realtime Application Monitoring
    • Security Analytics
    • Full Text Search
    • Clickstream Analytics
    • Indexing
  • OpenSearch + Kibana + Logstash (log ingestion via the Logstash agent) is the standard stack.
  • A DynamoDB Stream can be configured to be sent to OpenSearch (via Lambda or Kinesis Data Streams):
    • In that case, we can easily search items using OpenSearch.
    • Using the PK available in the search result, we can lookup the main record.
  • Cloudwatch Logs (with some subscription filter) can be sent to OpenSearch as well.
  • You can build an OpenSearch cluster that processes terabytes of data every month. How do you estimate node requirements? Each node should have at most 32 GB of Java heap for good performance, and less data per node means better performance (e.g. 6 TB disk, 64 GB RAM and a 20-core CPU per node).

RDS

.                      Auto FailOver
.                   <-------------------->           Manual Failover
.                         60s       35s
.
.                              Multi-AZ              [Complimentary]
.
.               Primary   StandBy  Readable-Standby  Read-Replica        AutoFailover             Storage
.                 AZ1      AZ2        AZ3
. RDS             yes      Sync       Sync            Upto 15            By standby (60s/35s)  Independent
. Aurora          yes      ....    ...............    Async (max:15)     By Replica (35s)      Cluster
. Global A.       yes      ....    ...............    1+5 Regionsx15     By Replica (60s)      Cluster
. Srvless A.                          Automatic                                                Cluster
.
.   Aurora implies Cluster and Logical Shared Storage Layer.
.
.   RDS-PITR  AutoScaling (Read Replicas and Storage)
.

.
.    Multi-AZ Deployment and Read Replicas are complementary and independent features.
.    Multi-AZ + 1 Standby Instance  = Multi AZ RDS Instance Deployment.
.    Multi-AZ + 2 Standby Instances = Multi AZ RDS Cluster Deployment. (Standby Is Readable)
.
.    PITR == Just enable Backup Retention Time: 1 to 35 days.
.
  • Engines: PostgreSQL, MySQL, MariaDB, DB2, Oracle, SQL Server
  • Aurora: MySQL and Postgres Only. (Aurora implies DB Cluster)
  • Managed DB: provisioning, patching, monitoring, backups
  • Storage by EBS, can auto-scale by increase in volume size.
  • Launched within a VPC, usually in private subnet, control network access using sg.
  • Automated backups possible with point-in-time recovery
  • Snapshots are manual and created on demand. Backups are automated. Both are maintained by RDS.
  • You can copy snapshot over to another region using RDS console (or cli) before restoring there.
  • `Multi-AZ`: Standby instance for failover in case of outage.
  • `Read Replicas`: Increase Read throughput. Eventual consistency. Can be cross-region. They are like Read-only backup instances.
  • Read replicas are an independent and complementary feature; they can co-exist with Multi-AZ. The standby/readable-standby instances are distinct from regular read replicas.
  • You can use AWS driver (jdbc:mysql:// to jdbc:aws-wrapper:mysql://) to use Multi-AZ cluster features.
  • You can use Amazon Route 53 + health check to distribute reads to read replica.
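
Distributing reads across replicas with Route 53 weighted records can be mimicked locally for illustration; the endpoints and weights here are hypothetical, and in practice Route 53 returns one endpoint per DNS lookup according to the weights (skipping endpoints whose health check fails).

```python
import random

def pick_reader(endpoints: list, weights: list) -> str:
    """Toy weighted selection among read-replica endpoints,
    mimicking what Route 53 weighted routing does for you."""
    return random.choices(endpoints, weights=weights, k=1)[0]
```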

RDS Auto Scaling

  • Storage autoscaling can be enabled, e.g. allocate 100 GB and specify a max storage of 1 TB; it grows based on demand.
  • Seamless autoscaling of read replicas is supported only in Aurora: specify minCapacity and maxCapacity (the total replica count) and it autoscales with traffic. Autoscaling read replicas in basic RDS is also possible but involves CloudWatch alarms and Lambda.
  • The instance type does not auto-scale (e.g. Moving from db.m5.large to db.m5.xlarge) It has to be done manually and it involves downtime.

RDS Security

  • KMS encryption at rest for underlying EBS volumes/snapshots.
  • Transparent Data Encryption (TDE) for Oracle and SQL server.
  • SSL encryption to RDS is possible for all DB (in-flight)
  • IAM auth for MySQL, PostgreSQL and MariaDB; Auth still happens within RDS.
  • RDS Events get notified via SNS for events such as operations, outages, etc.

RDS For Oracle

  • `RDS For Oracle and backups`: Use RDS backup for Oracle. If you use Oracle RMAN (Recovery manager) for backup, you can only restore to another Oracle instance in EC2, you can not use it to restore RDS.
  • RDS for Oracle does not support RAC (Real Application Clusters). RAC will work on EC2 as you have full control.

AWS Database Migration

  • AWS DMS (DB Migration Service) can be used to migrate On-premises Oracle RDS to cloud RDS.

RDS Proxy

  • RDS Proxy for Lambda available. This is for connection pooling and to be usable from Lambda and other applications.
  • RDS proxy must be in the same VPC as the database instance. The proxy cannot be publicly accessible even if the database instance is.
  • Lots of free IP addresses must be available in the subnet.

RDS Failover

  • Cross-region RDS failover can be achieved with a Route 53 health check (against a /health-db REST API, or via CloudWatch alarms) that triggers a Lambda to update DNS and promote a read replica to primary. (With Aurora this is built-in: a read replica can be configured to take over automatically.)
  • RDS Multi-AZ fail-over is automatic by switching DNS CNAME entry to new passive node IP address. Note that new IP address is in another zone, so IP address must change.

RDS Custom

  • Managed Oracle and Microsoft SQL Server but with customization control.
  • Provides full admin access to OS and the database.
  • Better control to configure settings, install patches and enable native features.
  • You can SSH into the underlying EC2 instance.

RDS PITR - Point in Time Recovery

  • Just enable a backup retention time of 1-35 days.
  • The database is continuously backed up.
  • Transaction logs are continuously backed up to S3 (at 5-minute intervals).
  • You can restore to a point in time using the RDS console: go to Restore.
  • AWS Backup supports continuous backup with PITR in addition to snapshot backups.

RDS Upgrades

  • Multi-AZ deployments, for both RDS and Aurora, upgrade the primary and standby nodes together at the same time (downtime unavoidable).
  • There are no rolling updates.
  • When you upgrade using RDS using console, it gives you option to upgrade the associated read-replicas as well. It is optional.
  • For Aurora, since it is a cluster, the upgrades always happen together along with read replicas.
  • For cross-region Read replicas, you need to upgrade separately. It is not clear if upgrading the primary or read replica is better. For minor versions, it does not matter much. For major version, ideally you want to delete and recreate read replica.
  • You may want to keep a read replica online to serve reads during the upgrade downtime (maintenance window).

Aurora

Aurora is single region only. Only Aurora Global supports multi-region.

.
.           Primary(Writer)       Replica (Upto 15)   Max: 3 AZs
.
.
.
.             AZ1     AZ2    AZ3    Total
.   Storage    2       2      2      6 copies using Storage Based Replication
.
.   Quorum based writes - 4 out of 6 writes should be complete.
.
.   Multi-master Active-Active possible for Aurora MySQL only (not postgres) but deprecated.
.
.   Auto-Failover to ReadReplica.
.

.
.                RDS                                     Aurora
.
.    Independent DB Instances                            DB Cluster. Always.
.    (Multi-AZ+2RR is called cluster but independent.)
.    Separate Storage                                    Logical Shared Storage Volume
.    Fail-Over with Multi-AZ Standby                     Auto Fail-Over by Read-Replica
.    Multi-AZ + Passive or 2 RR                          Passive Standby NA. Only Read-Replicas.
.    Can increase storage size, type later online        Aurora storage is always auto scalable.
.                                                        (Just specify min and max storage limits)
.    Support Oracle, SQL Server also.                    Only MySQL and Postgres
.    Multi-AZ replication synchronous.                   Read-Replicas Asynchronous.
.
.    Disk: gp2, gp3, io1, io2 (Explicit) 16 TB Max.      Disk: Auto managed. Auto IOPS scaling. 128 TB Max.
.
.    Example Writer/reader endpoints: 
.
.    mydbinst.abcdxxx.us-west-2.rds.amazonaws.com        my-aurora-cluster.cluster-abcdefghij.us-west-2.rds.amazonaws.com
.    mydbinst-ro.abcdxxx.us-west-2.rds.amazonaws.com     my-aurora-cluster-ro.cluster-abcdefghij.us-west-2.rds.amazonaws.com
.
  • Managed PostgreSQL or MySQL. Expensive option.
  • Storage auto grows up to 128 TB. 6 copies of data across 3 AZs.
  • 4 of 6 copies needed for write quorum; 3 of 6 needed for read quorum.
  • `Read replicas`: up to 15 RR, reader endpoint to access them all. Cross-region.
  • Can load/offload data directly from/to S3.
  • Storage is striped across 100s of volumes.
  • Only the master instance can write; up to 15 Aurora read replicas can serve reads.
  • Automated failover happens in less than 30 seconds. Failover Read Replica instance is chosen dynamically based on different factors such as lowest replication lag.
  • To protect against "Entire Regional failure", you must create cross-region READ replica and manually promote it in case of disaster.

RDS Aurora Scaling

  • For Aurora, storage autoscaling is always enabled, scaling from 10 GB up to 128 TB. (For RDS it is optional.)
  • Auto scaling of Read Replicas (1-15 aka min max Capacity) can be configured depending on metrics. This is done by enabling autoscaling of Read Replicas.
  • Metrics could be CPU, database connections or any cloudwatch metrics. For Scaleout and Scalein.
  • If there is too much replication lag, autoscaling of replicas may be throttled.

Aurora Endpoints (HostAddress + Port)

  • Cluster Endpoint (aka Writer Endpoint): Connects to primary DB instance
  • `Reader Endpoint`: connects to one of the read replicas (connection-level load balancing across all of them).
  • `Custom Endpoint`: use a subset of DB instances for a specific purpose, e.g. some instances sized xlarge and others only large.
  • `Instance Endpoint`: Specific instance endpoint to troubleshoot/fine tune that instance.
  • Note: RDS Proxy for Aurora is also available for read-only endpoints.

Troubleshooting RDS & Aurora Performance

  • `Performance Insights`: find issues by waits, SQL statements, hosts and users.
  • `CloudWatch Metrics`: CPU, Memory, Swap Usage
  • `Enhanced monitoring metrics`: At host level.
  • Slow Query logs

Aurora Serverless

  • Automated DB instantiation and auto scaling.
  • Proxy fleet
  • Data API (no JDBC connection needed). Secure HTTPS endpoint to run SQL statements. Users must be granted permissions to Data API and Secrets manager.
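
A sketch of the request shape for the Data API's ExecuteStatement call (the same keyword arguments boto3's rds-data client takes). The ARNs below are placeholders and no AWS call is made here; auth is handled by IAM plus Secrets Manager, so no JDBC connection or password is needed.

```python
def data_api_request(cluster_arn: str, secret_arn: str, sql: str,
                     parameters: dict = None) -> dict:
    """Build the keyword arguments for rds-data ExecuteStatement."""
    request = {
        "resourceArn": cluster_arn,   # the Aurora cluster
        "secretArn": secret_arn,      # DB credentials in Secrets Manager
        "sql": sql,
    }
    if parameters:
        request["parameters"] = [
            {"name": name, "value": value} for name, value in parameters.items()
        ]
    return request
```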

Global Aurora

|      Aurora Global
|
|             Primary-Region                   Secondary-Region (Up to 5)
|               Cluster                             Cluster
|                                               (Write Forwarding)
|
|        1 + 5 Regions
|        16 Read Replicas (Instances)
|
|
|                    RPO               RTO < 1min
|              <--------------><------------------>
|   --------CheckPoint------Disaster----------Recovered----------
|
|
|   Fail-Over within Region is automatic. 
|   Fail-Over cross-region requires manual selection of read-replica to promote.
|
|
  • 1 primary region and up to 5 secondary (read-only) regions, with replication lag < 1 sec.

  • Up to 16 read replicas per secondary region.

  • Available for both MySQL and PostgreSQL

  • Need to specify on creation of Aurora instance itself. (e.g. Engine MySQL; Edition: Global, etc)

  • Recommended only for truly distributed application.

  • Promoting another region for DR has an RTO (Recovery Time Objective) < 1 minute.

  • You can manage RPO - Recovery Point Objective - Tolerance for data loss.

  • Aurora Global Database provides a managed RPO for Aurora PostgreSQL-based global databases: you can set the RPO as low as 20 seconds (typical replication lag is under 1 second).

  • Provides Write Forwarding from secondary DB clusters to Primary cluster. It reduces the number of endpoints to manage. Behaves like Active/Active though it is Active/Passive.

  • Switch over (aka Managed Planned failover) can be used to trigger switch on healthy instance.

  • You can also trigger a "Failover" to recover from unplanned outages, even with possibly some data loss (bounded by the RPO setting for Aurora PostgreSQL).

RDS and Aurora Backups

.        Backup                               Expires
.   RDS -------->  Backup-Vault -------> After Retention Period
.
.         Manual  (RDS Internal)   +-----> Delete to remove
.   RDS -------->    Snapshot  ----+                             Import
.        Snapshot                  +-----------> S3-Bucket      ---------------> RDS/DB Instance
.                                    S3-Export   (Parquet Files) Glue ETL
.                                                                Custom Scripts
.
.
.   RDS (Automated) Backup  ===> (RDS-Instance, Backup-Time) (No separate backup Identifier)
.
.   RDS (Manual) Snapshot   ===> Snapshot-Identifier
.
  • Automated Backups:
    • Daily Full backup (during backup window)
    • Transaction logs are backed up every 5 minutes. (Aurora may back up even more often, for a better RPO.)
    • 1 to 35 days of retention. (For Aurora, you can not disable retention)
  • Manual DB Snapshots:
    • Anytime manually triggered by the user.
    • Managed by RDS.
    • Can export to S3 but import needs custom scripts.

RDS Backup and Encryption

  • RDS (automated) backups and (manual) snapshots are internally managed by RDS (in vaults).
  • You can create a copy of a snapshot, but you cannot make a copy of a backup.
  • RDS Encryption is possible only through KMS keys.

Following notes are specific to (manual) `snapshots`:

  • You can easily encrypt a snapshot on the fly - by snapshot copying and enabling encryption.
  • You cannot decrypt a snapshot on the fly. (Requires a restore plus native export/import)
  • If you copy encrypted snapshots across regions, you need KMS copy grant permission to access the source original KMS key.

  • Cannot change the encryption status when taking a backup/snapshot:

    .
    .                        Backup/Snapshot
    .      RDS Instance    ------------------> Encrypted Only
    .      (Encrypted)
    .                        Backup/Snapshot
    .      RDS Instance    ------------------> UnEncrypted Only
    .      (UnEncrypted)
    .
    
    # For Automated Backups.
    aws rds modify-db-instance --db-instance-identifier mydbinstance  --backup-retention-period 3 
    
    # For manual DB Snapshot
    aws rds create-db-snapshot  --db-instance-identifier database-mysql --db-snapshot-identifier mydbsnapshot
    
  • On-the-fly encryption during a restore is supported:

.
.         RDS                 Restore          Encrypted
.    Backup/Snapshot   -------------------->  RDS Instance  
.     (Unencrypted)       Specify KMS Key
.


# From a manual snapshot, use restore-db-instance-from-db-snapshot or restore-db-cluster-from-snapshot
aws rds restore-db-cluster-from-snapshot --db-cluster-identifier newdbcluster \
                                         --snapshot-identifier my-db-snapshot \
                                         --engine aurora-mysql \
                                         --kms-key-id xxx  # Encrypt/Re-encrypt using this key.

# From automated backup
aws rds restore-db-cluster-to-point-in-time    --source-db-cluster-identifier database-4 \
                                               --db-cluster-identifier sample-cluster-clone \
                                               --restore-type copy-on-write \
                                               --use-latest-restorable-time

aws rds restore-db-instance-to-point-in-time   --source-db-instance-automated-backups-arn "arn:*" \
                                               --target-db-instance-identifier my-new-db-instance \
                                               --restore-time 2020-12-08T18:45:00.000Z [--use-latest-restorable-time]
  • Decryption of a backup/snapshot is possible only through native export/import:

    .
    .     Decryption Only through mysqldump or native export: 
    .
    .        RDS         Restore     Encrypted  Export            Restore   UnEncrypted
    .   Backup/Snapshot  --------->  RDS       -------> mysqldump ------->  RDS Instance
    .     (Encrypted)                Instance                               
    .
    .
    
  • Restoring from an encrypted backup/snapshot across regions involves a KMS copy-grant operation:

    .
    .      RDS Backup    Copy Grant Key
    .        Region-1  ------------------> Restore From Region-2
    .                                      (local KMS Key to Re-encrypt)
    .
    
    aws kms create-grant --key-id xxx  --grantee-principal arn:*:role/keyUserRole --operations Decrypt
    aws kms list-grants  [--key-id xxx ]
    
    # Execute following from destination region (us-east-1)
    aws rds copy-db-snapshot \
       --source-db-snapshot-identifier arn:aws:rds:us-west-2:123456789012:snapshot:mysql-instance1-snapshot-20161115 \
       --target-db-snapshot-identifier mydbsnapshotcopy \
       --kms-key-id my-us-east-1-key    # Re-encrypts using new key in destination region.
    
    aws rds restore-db-instance-from-db-snapshot ...   # Restore from the new copied snapshot
    
  • You can create an encrypted snapshot from an unencrypted snapshot on the fly, by copying:

    .   
    .                           Copy Snapshot
    .   Unencrypted-Snapshot  ------------------>   Encrypted Snapshot
    .                           Encrypt KMS
    .
    
    # Possible for snapshots only. Note: You can not copy a backup.
    aws rds copy-db-snapshot ...  --kms-key-id arn:*:key/my-kms-key
    
  • You cannot create an encrypted Read Replica from an unencrypted RDS instance, or vice versa:

    .
    .     Source-RDS     ---------> New Read-Replica    (Encryption status should match)
    .      Encrypted     --------->    Encrypted
    .      UnEncrypted   --------->   UnEncrypted
    .
    

RDS and Aurora Restore options

|                     Restore
|        Backup    ------------>  New RDS or Aurora
|     (RDS/Aurora)
|
|  MySQL On-Premises  Upload                    Restore     
|     Backup         --------> S3-Backup-File ----------> RDS/Aurora
|
  • Restoring a snapshot or backup creates a new DB instance.
  • You can back up an on-premises database, upload the backup file to S3, and restore it into a new RDS instance / Aurora cluster.

Convert RDS to Aurora

|                            Restore
|      RDS -----> Snapshot -----------> New Aurora DB
|
|            Create New              New     Promote
|      RDS -----------------------> Aurora ----------> New Aurora DB
|            Aurora Read Replica    Replica
  • You can take a snapshot of the RDS instance and restore it into an Aurora DB cluster.
  • Or create an Aurora Read Replica of the running RDS instance, then promote the replica to a standalone Aurora cluster.

Aurora DB Cloning

|                     Clone
|     Aurora-DB1  ------------------->  cloned-DB
|                   Copy-on-write

Fast cloning creates a new Aurora cluster using copy-on-write: the original volume data is shared until a write happens, so only modified pages are copied.
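
The copy-on-write behaviour can be illustrated with a toy model (hypothetical classes, not the Aurora storage engine): the clone reads shared pages from the source until it writes, and only then keeps a private copy.

```python
# Toy copy-on-write "clone": both databases share pages until one writes.
class Volume:
    def __init__(self, pages):
        self.pages = dict(pages)          # page_id -> data

class CowClone:
    def __init__(self, source):
        self.source = source
        self.local = {}                   # only pages written after cloning

    def read(self, page_id):
        return self.local.get(page_id, self.source.pages[page_id])

    def write(self, page_id, data):
        self.local[page_id] = data        # copy on write: source untouched

src = Volume({1: "a", 2: "b"})
clone = CowClone(src)
assert clone.read(1) == "a"               # shared page, no copy made yet
clone.write(1, "a'")
assert clone.read(1) == "a'" and src.pages[1] == "a"
assert len(clone.local) == 1              # only modified pages are duplicated
```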

RDS/Aurora Security

  • Optional At-rest encryption of all data. Specify option on creation of DB.
  • In-flight encryption. TLS-ready by default.
  • IAM database authentication instead of user/password. (MySQL, MariaDB and PostgreSQL engines)
  • Security Groups can be attached to RDS/Aurora to control access.
  • No SSH except for RDS Custom.
  • Audit logs can be enabled and sent to CloudWatch Logs.

RDS Logging

.
.              Enable Logging using
.  RDS MySQL ------------------------>  File | Table | CloudWatch
.              DB Parameter Group
.
  • MySQL: Slow Query Log, Error Log, general_log (logs all queries).
  • Enable via Parameter Group: RDS Console > Parameter groups, associate with the DB, and reboot RDS.
  • PostgreSQL: slow queries are controlled by log_min_duration_statement; the Error Log is enabled by default.
  • View logs in the AWS Management Console, via the CLI, or in CloudWatch.
  • To stream to CloudWatch: RDS instance > Modify > Enable CloudWatch Logs Export.
aws rds describe-db-log-files --db-instance-identifier <your-instance-id>
aws rds download-db-log-file-portion ...  --log-file-name <name> --output text > logfile.txt

SELECT * FROM mysql.slow_log ORDER BY start_time DESC LIMIT 10;

Aurora ML

Use Cases: Product recommendation, fraud detection, ads targeting, sentiment analysis.

Amazon Aurora machine learning (ML) enables you to add ML-based predictions to your applications via the familiar SQL programming language.

It provides secure integration between Aurora DB and AWS ML services without having to build custom integrations or move data around.

.
.
.
.                          SQL  Query
.    Application ------------------------------------>            Aurora ML
.                        Recommended Products?
.
.                                                      SageMaker           AWS Comprehend
.                                                      (ML Modeling)

DocumentDB (Managed MongoDB)

.
.      DocumentDB Cluster == 1 Primary (Writes) + up to 15 Read Replicas.
.

Database Sharding

  • A technique where data is partitioned by the application itself across multiple database instances.
  • Allows horizontal scaling, e.g. Facebook placing all Indian users in a separate database.
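
A minimal sketch of application-level shard routing (the `SHARDS` list and the hashing scheme are illustrative assumptions):

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2", "db-3"]   # hypothetical database instances

def shard_for(user_id: str) -> str:
    # Application-level sharding: hash the key and pick a database.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# The same key always routes to the same database instance.
assert shard_for("alice") == shard_for("alice")
assert all(shard_for(u) in SHARDS for u in ("alice", "bob", "carol"))
```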

AWS Step Functions

With AWS Step Functions, you can create and run complex workflows based on state machines. It is a serverless solution.

Max Duration: 1 Year (std workflow); 5 mins (express workflow)

Alternatives: Run Batch Job or Simple Lambda

.    
.                               Invoke
.     EventBridge Or           ------>    StepFunction --->Task1 -> Lambda--> Task3 --> ...
.     CloudWatch Alarm+Lambda                              Workflow-in-Workflow (Parallel)
.    
.               Task = HTTP Call | Glue:StartJobRun | AWS SDK | AWS Batch Job | Athena
.    
.
.     Batch-Input OK.
.
  • Build serverless workflow to orchestrate your lambda and other batch jobs and other tasks.
  • Features: sequence, parallel, conditions, timeouts, error handling ...
  • Max exec time of 1 year
  • Possible to implement a human-approval step.
  • Native Integration (without writing code):
    • Invoke Lambda
    • Run AWS Batch Job
    • Run ECS Task
    • Insert an item to DynamoDB
    • Publish message to SNS, SQS
    • Launch EMR (elastic MR job), Glue or SageMaker jobs
    • Launch another step function workflow
    • Call AWS SDK API calls with 200+ Services from your state machine.
  • You can invoke step function from Console or AWS SDK StartExecution API call or CLI, Lambda, API Gateway, EventBridge, CodePipeline or another StepFunction.
  • There are Express workflows (cheaper) and Standard workflows. An Express workflow's max duration is 5 mins (vs 1 year), execution is at-least-once (vs exactly-once), and it has no visual debugging through the console.
  • Express Workflow could be Async or Synchronous. From Api Gateway you can invoke and wait or just return with "workflow started confirmation".
  • Error Handling can be implemented by raising event to EventBridge to SNS to email.
  • The state machine is defined in JSON (Amazon States Language).
  • To handle batch input, the state machine definition should use a Map state over the batch. The advantage of batching is that concurrency stays under control; we don't want 1000 active state machines running in parallel for 1000 events.
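
The JSON definition mentioned above uses the Amazon States Language. A minimal sketch with a Task, a Choice, and terminal states (the Lambda ARN and state names are placeholders):

```python
import json

# Minimal state machine: invoke a Lambda task, then branch on its result.
definition = {
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:ProcessOrder",
            "TimeoutSeconds": 30,
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
            "Next": "Done?"
        },
        "Done?": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.status", "StringEquals": "OK",
                         "Next": "Success"}],
            "Default": "Failure"
        },
        "Success": {"Type": "Succeed"},
        "Failure": {"Type": "Fail"}
    }
}

asl_json = json.dumps(definition)       # this JSON is what you upload
assert json.loads(asl_json)["StartAt"] == "ProcessOrder"
```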

SQS

|
|  Max Unlimited (std) and 300 Msgs/sec; 3000 Msgs/sec with batching of 10 msgs for FIFO;
|  Std or FIFO       Max-Msg-Size:256KB
|
|       ---->         SQS                 ----->     Trigger Lambda or
|               120K Max Inflight Msgs (std)         Long Poll
|                20K Max Inflight Msgs (FIFO)
|
|     DeadLetterQueue   VisibilityTimer  DelayQueue   RetentionPeriod 
|
|     Delivery-Delay  MaxReceive (For DLQ)
|

.   Parameter           Values               Comments
.---------------------------------------------------------------------------------------------------------------------
.   Queue Type          Std/FIFO             Standard Queue (unlimited Rate) or FIFO (limited Rate, exactly once)
.---------------------------------------------------------------------------------------------------------------------
.   Visibility Timeout  0 to 12 hours        Max Processing Time. Otherwise go back to queue. Consumer must delete.
.---------------------------------------------------------------------------------------------------------------------
.   Message Retention 
.   Period              1 min - 14 days
.---------------------------------------------------------------------------------------------------------------------
.   Max Msg Size        256KB                
.---------------------------------------------------------------------------------------------------------------------
.   Delivery Delay      0 to 15 mins         Delay before becoming visible. DelaySeconds Attribute of Message.
.---------------------------------------------------------------------------------------------------------------------
.   Receive Message
.   Wait Time           0 to 20 secs         For Long polling max time ReceiveMessage() will wait.
.                                            Default can be configured at Queue level or at API call level.
.---------------------------------------------------------------------------------------------------------------------
.   MaxReceive          10 or any            Redrive Policy: Maximum times received before going to Dead-Letter-Queue
.---------------------------------------------------------------------------------------------------------------------
.  Content-Based Deduplication              For FIFO queues: deduplicates by a SHA-256 hash of the message body.
.---------------------------------------------------------------------------------------------------------------------
  • Managed queue, integrated with IAM
  • Queue can be Standard (best-effort ordering, at-least-once delivery) or FIFO (ordering preserved, exactly-once delivery)
  • Can handle extreme scale
  • Max msg size 256KB (use a pointer to S3 if needed)
  • Can be read from EC2 (optional ASG), Lambda
  • Can be a write buffer for DynamoDB
  • For std queue max rate is unlimited!
  • For FIFO, 300 msgs/sec without batching 3000 msgs/sec with batching
  • Visibility timeout is the max processing time. The consumer must delete the message, otherwise it will reappear. Use a smaller timeout if duplicate processing is okay, a higher timeout to prevent duplicates.
.
.          +-------------------------------> Dead letter Queue
.          |          Too many tries
.          |
.    SQS Queue ---> Process -----------+---> Delete From Queue
.          ^                           |
.          |  Visibility Timeout       |
.          +---------------------------+
.              Return to Queue
.
  • supports optional Dead Letter Queue. If consumer fails to process a message within Timeout, it goes back to queue MaxReceives threshold times, then goes to DLQ. You have to create dead letter queue (std or fifo) and associate with main queue during creation.

  • Typically set the retention to 14 days in DLQ.

  • It is possible to redrive msgs from DLQ to source queue. (using policies)

  • Messages should be idempotent. (could be consumed twice by consumer)

  • The "Lambda Event Source Mapping" feature polls the SQS queue and invokes the Lambda with batches of up to N messages. This is useful for batch processing.

  • Lambda also supports Destinations: when event processing succeeds or fails, Lambda can route the invocation record to another Lambda, SQS, SNS, or EventBridge (for failures, an alternative to inserting into a DLQ).

  • Example pattern architecture for better decoupling and load balancing:

    .                SQS Request Queue
    .  Client <-->                         <--> Work Processor
    .                SQS Response Queue  
    
  • Message timers can set a delay for individual messages, up to 15 mins.

  • Delay queues can delay message delivery by up to 15 mins. The delay parameter is set on the queue:

    .                Wait till message timer
    .     Message ----------------------------> Deliver (Individual message level wait)
    .
    .                Wait On Delay Queue
    .     Message ----------------------------> Deliver after Delay queue wait
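
The visibility-timeout and redrive mechanics above can be modelled with a toy in-memory queue (illustrative only; real SQS tracks a per-message receive count and the redrive policy moves a message to the DLQ once MaxReceive is exceeded):

```python
# Toy model of visibility timeout + redrive-to-DLQ (MaxReceive).
MAX_RECEIVE = 3

class Queue:
    def __init__(self):
        self.msgs, self.dlq = [], []
        self.receive_count = {}

    def send(self, body):
        self.msgs.append(body)

    def receive(self):
        body = self.msgs.pop(0)
        n = self.receive_count.get(body, 0) + 1
        self.receive_count[body] = n
        return body, n

    def visibility_expired(self, body):
        # Consumer failed to delete in time: redeliver, or move to DLQ.
        if self.receive_count[body] >= MAX_RECEIVE:
            self.dlq.append(body)
        else:
            self.msgs.append(body)

q = Queue()
q.send("order-1")
for _ in range(MAX_RECEIVE):            # consumer keeps failing to process
    body, _ = q.receive()
    q.visibility_expired(body)
assert q.dlq == ["order-1"] and q.msgs == []
```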
    

Amazon MQ

  • Message Queue using open protocols: MQTT, AMQP, STOMP, OpenWire, WSS
  • Managed message broker service for RabbitMQ and Apache ActiveMQ
  • IBM MQ, TIBCO EMS, RabbitMQ, ActiveMQ workloads can all be migrated to it.

Amazon SNS

SNS allows you to create up to 100K topics. Standard topics do not guarantee FIFO order; FIFO topics preserve ordering, but the throughput limits are much lower.

.
.    Topic    Subscription    Notification    Email       SMS
.
.    EventBridge
.    SDK          ---> SNS Topic ---->  Lambda | SMS | Email | HTTPS | FireHose | SQS
.                                       GCM | APNS  (Mobile Platform Endpoints)
.
.
.  Max Publish Rate: 30K messages/second;  FIFO: 3K messages or 20MB /second/topic
.  Max-Topics: 100K  Max-FIFO: 1K per account  Subscriptions: 12.5M
.  Max FIFO-Subscriptions: 100 per topic
.  Max SMS: 20 /second for Promotional
.  Max Msg Size: 256KB
.
.  SNS is a lambda trigger
.

Subscription Workflow:

.    Create Subscription
.
.                        using Email  |        | Confirm Email click
.    -------------------------------->|        |---------------------> Subscription Pending -> Confirmed
.                Using https Endpoint |        | Receive { SubscribeURL: "..." }
.                                     |        | at the https endpoint. Visit URL
.                                     |        |
.                Using Firehose       |        | (Same account firehose -- No confirmation required)
.       (Provide ARN, service role)   | Topic  |
.                                     |        |
.                Using SQS            |        |  Msg Format: { ..., TopicArn: <arn>, Subject: "...",
.                                     |        |                Message: ".txt.or.json.." }
.                                     |        |
.                Using Lambda         |        |  From Lambda console, SNS is called a trigger.
.    (SNS console or Lambda console)  |        |  From SNS console, Lambda is called a subscription.
.                                     |        |  Auto confirmed.
.                                     |        |
.           Resource in other Account |        |  Confirm
.
  • Simple Notification Service. One message to many receivers. Human and Services.
  • Email Notification, SQS Queue, Shipping Service, etc
  • Works on Pub/Sub pattern. Publishers publish msgs using topics. subscribers subscribe.
  • There can be up to 12.5 million subscriptions (subscribers) per topic.
  • Max number of topics is 100K.
  • SNS has integration with lots of AWS services as destination. Also SNS has integration to receive msgs from many AWS services.
  • Topic Publish using SDK is common. You can also do Direct Publish for mobile applications using mobile apps SDK. e.g. create platform application and endpoint. publish to platform endpoint. Works with:
    • Google GCM - Google Cloud Messaging to forward messages to Android Device
    • Apple APNS - Apple Push Notification Service - Notification to Apple device.
  • Supports Encryption: in-flight encryption using HTTPS API, At-rest using KMS keys; client-side encryption if the client wants to do that.
  • Access control : IAM policies to regulate access to SNS API.
  • SNS Access policies is similar to S3 bucket policies. Useful for cross-account access to SNS topics and allowing other services like S3 to write into SNS topic.
  • SNS + SQS: Fan Out pattern: Push once in SNS, receive in all SQS queues as subscribers.
  • Also possible to send SNS notifications to Amazon S3 through Kinesis Data Firehose. Note: Firehose itself is a good fan-out pattern.
  • SNS FIFO topics preserve ordering within a message group.
  • Subscription can have Filter Policy to limit messages on the topic based on specific attribute name=value e.g. Want to receive only cancelled orders notification where topic is all orders.
  • SNS supports DLQ (which are SQS queues). It is attached to subscription, not topic.
  • Typically EventBridge triggers SNS notifications, but there is no direct integration to create events from SNS into EventBridge (though a Lambda subscribed to the SNS topic can put events onto the bus).
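
The subscription filter-policy behaviour can be sketched as exact-match attribute filtering (real SNS policies also support prefix, anything-but, and numeric-range operators; this toy matcher is a simplification):

```python
# A message is delivered to a subscription only if every attribute in the
# filter policy matches one of its listed values.
def matches(filter_policy, message_attributes):
    return all(message_attributes.get(attr) in allowed
               for attr, allowed in filter_policy.items())

policy = {"order_status": ["cancelled"]}        # want cancellations only
assert matches(policy, {"order_status": "cancelled", "region": "eu"})
assert not matches(policy, {"order_status": "shipped"})
assert not matches(policy, {})                  # missing attribute -> no match
```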

AWS Kinesis Data stream

.  
.          Many                     WCU Auto
.        Producers      --------->  Kinesis DataStream    -----------> Consumers
.    (PartitionKey, data)      Shards-based-on-PartitionKey            Firehose
.    (One Writer Endpoint)            Can Replay                       Apps (KCL, SDK)
.    (VPC Endpoint)                                                    Data Analytics
.
.         Max 1MB/s in or           Max 2MB/s out
.         1000 records/s            2000 records/s
.         -----------------> Shard -------------->         Default: 4 Shards / Stream
.                                                                   Dynamic Auto Scaled.
.  
.    Shards Auto Scaling supported.                        Provision: Max 500 Shards/account
.  
.    2 Shards may handle 10 different PartitionKeys!
.    Used as source of truth for input stream events and persisted.
.    Best use for Homogenous input records.
.    Imagine Kinesis datastream is for a single topic Kafka.
.  
.         Shard  ---Contains--- Partitions
.      Record    ---Includes--- Partition-Key and Sequence Number (with in the partition)
.  
.                           Enhanced Fanout    +--------> Consumer 1  (Dedicated)
.    Kinesis DataStream   ------------------>  +--------> Consumer 2  (2MB/s Read Rate)
.                                              +--------> Consumer 3  (Data Fanout)
.
.    Note: Supports Multiple writers but single input stream endpoint only.
.  
  • Managed data streaming service
  • Default retention is 1 day; extendable up to 7 days (extended retention), and up to 365 days (long-term retention).
  • Can replay data.
  • Great for application logs, metrics, IoT, clickstreams, real-time big data.
  • Great for streaming processing frameworks Spark, NiFi, etc
  • Auto replicated synchronously to 3 AZ.
  • Kinesis Streams: Low latency streaming ingest at scale
  • Kinesis Firehose direct output integrations are limited to S3, Redshift, OpenSearch/ElasticSearch, Splunk, custom HTTP endpoints and the like. You need a Lambda if you want to fan out.
  • Related concepts: Stream, PartitionKey, Shard, Sequence Number, Read/Write throughput
  • A Stream is a set of Shards. The more shards, the more throughput. Records with the same partition key always go to the same shard. The data capacity of the stream is the sum of the capacity of its shards.
  • The client writes each record with client.put_record(partitionKey, blob). The stream hashes the partitionKey to assign the record to a shard and gives it an increasing sequence number, unique within the partition key.
  • The partition key helps preserve per-key ordering, but the sequence number only reflects arrival order, which may differ if the client resent a record. To strictly enforce order, include an application-specific sequence number in the payload.
  • Per Shard Limits:
    • Max write rate 1 MB/second, 1000 records per second
    • Max read rate 2 MB/second across all consumers per shard
    • Max 5 read API calls per second across all consumers per shard
    • Consumer Enhanced Fan-Out: Push model - 2 MB/s read per shard per consumer (Better read throughput with push model with additional cost)
  • All data stored for 24 hours by default and up to 365 days.
  • For standard (shared-throughput) consumers, the Lambda reader polls once per second. With enhanced fan-out, Lambda uses an HTTP/2 push connection and invokes your function as soon as records arrive. Standard consumers are limited to 5 read API calls per second per shard, i.e. one call every 200ms.
  • There is no direct control over how much buffering happens; if you want that control, put Firehose downstream. Increasing shards decreases buffering. You can configure the Lambda event source to process, say, up to 1000 records with a batch window of 10 seconds (even a single record is then processed only after the window elapses).
  • Capacity can be configured as On-demand (Supports AutoScaling) or Provisioned (specify total no of shards).
  • Producers can use AWS SDK (simple producer), Kinesis Producer Library (KPL) for advanced usage, or Kinesis Agent to send log files to kinesis directly.
  • Consumers can use the AWS SDK (simple use), Lambda (through the Event Source Mapping feature), or the Kinesis Client Library (KCL) for advanced use (checkpointing, etc).
  • Even though multiple applications can publish to a single Kinesis data stream, it is not really intended for multiplexing non-homogeneous record types in one stream. Lambda's Event Source Mapping feature allows "filtering" of consumer messages, but the shard iterator still reads past the filtered records.
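
The partition-key-to-shard mapping hashes the key with MD5 into a 128-bit space split across shard hash-key ranges. A sketch with the range math simplified to evenly sized shards:

```python
import hashlib

NUM_SHARDS = 4
SPACE = 2 ** 128                            # 128-bit hash-key space

def shard_for(partition_key: str) -> int:
    # MD5 of the partition key, interpreted as a 128-bit integer, is
    # matched against the shard's hash-key range.
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h // (SPACE // NUM_SHARDS)       # evenly split ranges (simplified)

# Records with the same partition key land on the same shard,
# which is what preserves per-key ordering.
assert shard_for("user-42") == shard_for("user-42")
assert 0 <= shard_for("user-42") < NUM_SHARDS
```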

Kinesis Data Analytics

  • Input from Kinesis Datastream and Firehose

Kinesis Firehose

.
.                                            +------------> Lambda -->  Multi Destinations | SNS
.                                            |              (Transform Lambda)
.                                       +--------------------+
.                                       | [Parquet Convert]  |
.   Kinesis DataStream    Records       | [Lambda Transform] |        Max 1 destination.
.   Kinesis Agent     ----------------> |    FireHose        |----> S3 | ElasticSearch |
.   Cloudwatch Logs        JSON         |                    |         | Custom HTTP   | Redshift
.   MSK               Max One Stream    +--------------------+         | Kinesis Data Analytics
.   SDK, PutRecord                                               ( No support for Reader Lambda )
.
.                    Max Delivery Streams: 5000 per account
.                    Max dynamic Partitions: 500
.                    Max Rate of Put: 2000 per second per stream
.                    Max Rate of Data 5 MB/sec per Stream
.                    Max Rate of Records 500K records/sec per Stream
.
.                   
.       Multiple Sources   ------------->    FireHose     -----------> Single Target
.    (Subscribed/PutRecord)       (32 MB Buffer, 10s Buffer Time) 
.
.
.                       Records                  Dynamic Partitioning        
.     MSK (Kafka)     ------------->  FireHose ------------------------>   S3 
.     (Partitions)                               (Using S3 Prefix)          
.                                        
.
.     CloudWatch Log  ---------------> FireHose Delivery Stream
.     (Local/Remote)    Subscription
.                       
.
.     Note: There is Writer Endpoint to Write. But No Reader Endpoint to read.
.           Target pre-configured and can not dynamically read output.
.
  • Mainly for routing data, does not persist.

  • Source could be applications, Kinesis DataStreams, SDK, Kinesis Agent, Client, Cloudwatch Logs and Events.

  • In order to get input, Firehose Delivery Stream "subscribes" into cloudwatch logs.

  • There can be only one active source stream bound to Firehose:

    .                           1:N                          1:1
    .    Kinesis DataStream  ------------ Kinesis Firehose ----------> Output Target
    .    
    .    1 Datastream to multiple Firehose stream is shared (unless enhanced fanout enabled)
    .
    .    Firehose supports single target only. Use Lambda or KCL lib for multi-target fanout.
    .
    
  • Can configure to write to destination, without writing code!

  • `Supported Destinations`:

    • S3 (Most Common)
    • Redshift
    • OpenSearch (formerly ElasticSearch; for real-time visualization)
    • Custom HTTP endpoint
    • Third Party: MongoDB, NewRelic, Datadog, etc.
    • Note: DynamoDB, RDS etc. are not supported as targets.

  • Data manipulation using Lambda possible.

  • Batchwrites support

  • Firehose accumulates records in a buffer and flushes it on reaching max size or timeout:

    • Buffer Max Size: e.g. 32MB. Flushout after 32 MB.
    • Buffer Time: e.g. 10 seconds: Flushout after 10 seconds.
  • Firehose latency is high because some destination integrations require a minimum buffering time, e.g. the S3 integration requires a minimum buffer interval of 1 minute.

  • Note that there is no Lambda integration invoked per record, as that would be too costly. But a Lambda can transform the buffered data before delivery to the destination; the transformation Lambda must finish within 5 mins.
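
The size-or-time flush rule can be modelled with a toy buffer (illustrative only; `max_bytes`/`max_seconds` mirror the 32 MB / 10 s hints in the diagram above):

```python
import time

# Toy Firehose-style buffer: flush when either the size hint or the
# buffering interval is reached, whichever comes first.
class Buffer:
    def __init__(self, max_bytes=32 * 2**20, max_seconds=10):
        self.max_bytes, self.max_seconds = max_bytes, max_seconds
        self.data, self.size, self.started = [], 0, time.monotonic()
        self.flushes = []                   # batches delivered downstream

    def put(self, record: bytes):
        self.data.append(record)
        self.size += len(record)
        now = time.monotonic()
        if self.size >= self.max_bytes or now - self.started >= self.max_seconds:
            self.flushes.append(self.data)  # deliver the whole batch
            self.data, self.size, self.started = [], 0, now

b = Buffer(max_bytes=10, max_seconds=10)
b.put(b"aaaa"); b.put(b"bbbb")
assert b.flushes == []            # 8 bytes buffered, below both limits
b.put(b"cccc")                    # 12 bytes >= 10 -> size-triggered flush
assert len(b.flushes) == 1 and len(b.flushes[0]) == 3
```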

The following diagram and notes cover Kinesis Data Analytics (KDA):

.
.       Source                       SQL   Lambda-Preprocess
.                                     |    |                       Flink Studio
.    Kinesis DataStream               V    V
.                      ---------> KDA Application   ------> Sinks (Firehose|Kafka|S3|Lambda, etc)
.    Kinesis Firehose                 ^
.                                     |
.                               S3 Reference Data
.
  • Use Cases:

    • Mainly for analytics on streaming data.
    • Continuous metric generation: e.g. a live leaderboard for a mobile game.
    • Streaming ETL: transform and enrich data, and send it downstream.
    • Identify anomalies.

  • Kinesis Analytics (SQL) is the legacy service, now part of Kinesis Data Analytics; Apache Flink SQL support replaces it.

  • Input Stream could be Kinesis Datastream or Firehose.

  • Processes input streams plus an optional reference table (e.g. in S3); runs SQL-like queries such as:

    SELECT STREAM ItemID, COUNT(*) FROM SourceStream GROUP BY ItemID;
    
  • The output Stream destination could be FireHose (to S3 and such), or another DataStream.

  • Implementation is serverless; scales automatically.

  • Pay for resources consumed -- but it is not cheap.

  • Use SQL or Flink to write the computation.

Use Case: Real-Time Clickstream Analytics :

# First, create a Kinesis Data Stream to ingest clickstream data.

aws kinesis create-stream --stream-name ClickStreamData --shard-count 1

# Create a Sample Producer to Send Click Data. Python Script.

import boto3
....

kinesis_client = boto3.client('kinesis')

while True:
  ...

  # Send data to Kinesis
  kinesis_client.put_record(StreamName='ClickStreamData', 
      Data=json.dumps(click_data),
      PartitionKey='partitionkey'
  )
  ....
  sleep(1)

$ python send_click_data.py

# Create a Kinesis Data Analytics Application

# Note: the CLI command is `aws kinesisanalytics` (legacy SQL API), and the SQL
# lives in the application code, not in the output configuration.

aws kinesisanalytics create-application \
              --application-name ClickStreamAnalytics \
              --application-code "SELECT STREAM userId, COUNT(action) AS actionCount FROM clickstream_001 GROUP BY userId;" \
              --inputs '[{
                  "NamePrefix": "clickstream",
                  "KinesisStreamsInput": {
                      "ResourceARN": "arn:aws:kinesis:REGION:ACCOUNT_ID:stream/ClickStreamData",
                      "RoleARN": "arn:..."
                  },
                  "InputSchema": { .... }
              }]'   # InputSchema maps the JSON fields, e.g. userId, action

# Route the analytics output to a Firehose delivery stream.

aws kinesisanalytics add-application-output \
    --application-name ClickStreamAnalytics \
    --current-application-version-id 1 \
    --output '{
        "Name": "clickstreamOutput",
        "KinesisFirehoseOutput": {
            "ResourceARN": "arn:.../YourFirehoseDeliveryStream",
            "RoleARN": "arn:.../KinesisAnalyticsRole"
        },
        "DestinationSchema": {"RecordFormatType": "JSON"}
    }'

# Start the Application (the input Id comes from describe-application)
aws kinesisanalytics start-application --application-name ClickStreamAnalytics \
    --input-configurations '[{"Id": "1.1", "InputStartingPositionConfiguration": {"InputStartingPosition": "NOW"}}]'

Amazon Managed Streaming for Kafka (Amazon MSK)

  • Fully managed Kafka on AWS
  • Also available as MSK Serverless, without capacity planning (partitions, brokers, etc.)
.
.  MSK                                                          Apache Flink
.                                 broker1                       Glue ETL
.  Producer ------------------>   broker2   ------------------> Lambda    
.             Write to Topic      broker3     Poll From Topic   Applications 
.                                                               [Partition Aware]
.                                                               [Can Reset Seek Point]
.                                ZooKeeper
.                                Partitions   Replication-Factor: 3 (2-4)
.
.  Total Partitions < 2x to 3x total Brokers since concurrency is limited to brokers.
.  Total Partitions ~= Max(total_producers, total_consumers)
.  Partitions represents IO parallelism. Even single topic is spread over partitions.
.  Total concurrent consumers is the primary factor.
.
.
.    Kafka Broker = Kafka Server = Kafka Node.
.    Kafka Cluster = Set of Kafka Nodes + One ZooKeeper
.
.    Kafka (Leader) Broker  ----1:N------ Partitions
.
.    Kafka Topic  ----1:N------ Partitions (Total partitions vary by topic. You choose per Topic)
.
.    2 <=  Topic Partition Replication Factor <= 4; 
.    Every broker owns some Partitions and replicate some.
.
  • Clients connect using a bootstrap URL, which is a list of (bootstrap) broker addresses (a subset of all brokers). A broker returns metadata about topics, all other brokers, and partitions; the client can then connect to any specific broker.
  • Creates and manages Kafka broker nodes & ZooKeeper nodes for you!
  • Deploy the MSK cluster in your VPC, multi-AZ (up to 3 for HA)
  • Auto recovery from common Kafka failures
  • Data is stored on EBS volumes.
  • You can also have serverless MSK!

Kinesis Data Streams vs MSK

|     DataStreams                        |         MSK
|----------------------------------------|----------------------------------------
| 1 MB msg size limit                    |  1 MB default, but up to 10 MB.
| Scales with Shards                     |  Scales with Topics with Partitions
| Can do shard splitting & Merging       |  Add partitions to a topic         
| TLS in-flight encryption               |  In-flight Encryption is optional
| Storage up to 1 year                   |  Retention as long as you configure (on EBS).

MSK consumers

- Kinesis Data Analytics (Managed Apache Flink - Streaming and Analytics)
- AWS Glue. Streaming ETL Jobs. 
- Glue Streaming is Managed Apache Spark Streaming (Micro Batching and Spark RDDs).
- Lambda
- Any application on EC2 or ECS or EKS

Commands:

# Create the MSK Cluster

aws kafka create-cluster \
    --cluster-name MyKafkaCluster \
    --broker-node-group-info '{
        "instanceType": "kafka.m5.large",
        "clientSubnets": ["YOUR_SUBNET_ID_1", "YOUR_SUBNET_ID_2"],
        "securityGroups": ["YOUR_SECURITY_GROUP_ID"],
        "storageInfo": {
            "ebsStorageInfo": {
                "volumeSize": 100
            }
        }
    }' \
    --kafka-version "2.8.1" \
    --number-of-broker-nodes 2

aws kafka describe-cluster --cluster-arn YOUR_CLUSTER_ARN

# Topics are created with the Kafka client tools (not via the AWS CLI):
kafka-topics --create --bootstrap-server YOUR_BROKER_ENDPOINT \
             --topic MyTopic --partitions 1 --replication-factor 2

# You can use the kafka-console-producer command to send messages to the Kafka topic. 
# First, install Kafka tools on your machine (or use a Docker container).

aws kafka get-bootstrap-brokers --cluster-arn YOUR_CLUSTER_ARN

kafka-console-producer --broker-list YOUR_BROKER_ENDPOINT --topic MyTopic \
                       --property "parse.key=true" --property "key.separator=:"

# You can now type messages, and they will be sent to the MyTopic topic. For example:

  key1: Hello, Kafka!
  key2: Another message.

# Create an AWS Lambda function to read from the topic. With an MSK event
# source mapping (below), Lambda polls Kafka for you and passes batches of
# records in the event -- do not open your own KafkaConsumer in the handler.

import base64
import json

def lambda_handler(event, context):
    # Records are grouped by "topic-partition"; values are base64-encoded.
    for topic_partition, records in event['records'].items():
        for record in records:
            payload = base64.b64decode(record['value']).decode('utf-8')
            print(f"Received message: {payload}")

    return {
        'statusCode': 200,
        'body': json.dumps('Messages processed successfully!')
    }

# Set Up Event Source Mapping

aws lambda create-event-source-mapping \
    --function-name MyKafkaLambda \
    --event-source-arn YOUR_CLUSTER_ARN \
    --topics MyTopic \
    --starting-position LATEST \
    --batch-size 100
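The event that Lambda receives from an MSK event source mapping groups records by "topic-partition", with base64-encoded values. This local sketch (the sample payload is invented) shows the shape and how to decode it:

```python
import base64
import json

# Hypothetical sample of an MSK event source payload.
sample_event = {
    "records": {
        "MyTopic-0": [
            {"topic": "MyTopic", "partition": 0, "offset": 15,
             "value": base64.b64encode(b'{"userId": "u1"}').decode()}
        ]
    }
}

def decode_msk_records(event):
    # Flatten all topic-partitions and decode each record's value.
    out = []
    for records in event["records"].values():
        for r in records:
            out.append(json.loads(base64.b64decode(r["value"])))
    return out

print(decode_msk_records(sample_event))  # [{'userId': 'u1'}]
```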

AWS Batch

.
.   Run Batch Job using Docker Container.
.   Cheaper since resources are released as soon as job is done.
.
.     Batch Job ------> Fargate (Serverless) OR
.                       ECS  OR
.                       EC2
.
.   Fully Managed (Almost Serverless)
.
.   Prioritized Job Queues ; Job Dependencies; 
.
.   Computing Environment -- Min and Max CPUs. Managed or Unmanaged.
.
  • Run batch jobs as Docker Images

  • You can use Fargate (managed) or ECS or EC2 or your own computing environment (unmanaged).

  • Options:

    - Run on AWS Fargate (serverless)
    - Dynamic provisioning of EC2 & spot instances in your VPC
    - Run on your own EC2s
    
  • Computing Environment abstracts limited resources that you already have (EC2 instances) or that can be created on demand. Say I can create at most 10 EC2 instances on demand but have 100 jobs to run. How do I run them? AWS Batch handles this with job queues and scheduling.

  • You can schedule using Amazon EventBridge

  • Orchestrate batch jobs using AWS Step functions.

  • If you have to invoke a batch job in response to S3 upload, you have two options:

    - Trigger Lambda on S3 upload, and lambda invokes AWS Batch job. (Bit messy)
    - Send S3 upload event to EventBridge, configure this to invoke AWS Batch Job (easier)
    
  • If you launch within a VPC private subnet, make sure it has access to the ECS service, i.e. use a NAT gateway or a VPC endpoint for ECS.

  • You can also invoke job on your Own preconfigured running EC2.

  • An SDK application can enqueue your job into an "AWS Batch Job Queue".

  • In multi-node mode, a job can span multiple EC2/ECS instances at the same time:

    • Does not work with Spot instances.
    • 1 main node, many child nodes.
    • Better to specify the EC2 launch-mode placement group "cluster", which packs instances on the same rack in the same AZ.
  • Use Cases: HPC, Machine Learning, ETL, Media Processing, etc.

  • Each job queue has a priority number attached. Higher number, higher priority.

  • Array jobs mechanism can be used to start identical jobs in parallel -- Each job inherits AWS_BATCH_JOB_ARRAY_INDEX environment variable that it can use to consume different inputs.
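A sketch of how an array-job worker might use AWS_BATCH_JOB_ARRAY_INDEX to pick its slice of the input (the input list here is illustrative):

```python
import os

# Each child job of an array job receives a distinct AWS_BATCH_JOB_ARRAY_INDEX.
inputs = ["part-000", "part-001", "part-002", "part-003"]
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))

my_input = inputs[index]
print(f"worker {index} processing {my_input}")
```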

  • There are mechanisms like Job Dependency, Compute Environment Max vCPUs, Fair Share Scheduling that can be used to limit concurrency of batch jobs.

Amazon EMR (Elastic Map Reduce)

.
.  EMR - Elastic Map Reduce
.
.  Hadoop-Clusters  Apache-Spark HBase Presto  Flink
.
.  Apache Hive (For SQL and meta-data)
.
.  Master-Node ---- Core-Node (Run Tasks, Store Data)
.                   Task-Node (Run Temp Tasks using Spot instance)
.
.
.                        S3 (EMRFS) (Data lake)
.                        |
.                    EMR-Cluster ---- Hive (HiveQL) / Presto SQL (ANSI)
.                        |
.                    Spark-Jobs
.                       ML
.
.  EMR on EKS:
.                    EKS  ------- (Embedded Spark-Job No-EMR-Cluster)
.
  • Creates Hadoop Clusters (Big Data)
  • Can have hundreds of EC2 instances.
  • Also supports Apache Spark, HBase, Presto, Flink (Stream & Analytics), ...
  • Presto is an open-source distributed SQL query engine for running interactive analytic queries against datasets ranging from gigabytes to petabytes, similar in spirit to Spark.
  • Both Hive (HiveQL) and Presto (ANSI SQL) provides SQL interface and Query Execution Engine.
  • Note: Athena is a managed Presto.
  • Auto-scaling with CloudWatch
  • Use cases include big data, data processing, ML, web indexing, etc.
  • Integrations:
    • Launched in single VPC, single AZ
    • EBS Volume with HDFS (Use for temporary storage)
    • S3 integration with EMRFS (EMR File system on S3) (Use only for permanent storage)
    • Use Apache Hive (for SQL interface over HDFS or even connect to external DynamoDB)
  • EMR Node Types:
    • Master Node: Manages the cluster. Long running.
    • Core Node: Run tasks and stores data. Long running.
    • Task Node (optional): Just to run tasks. Usually spot instance.
  • EMR Node Purchasing options: On-demand, reserved (min 1 year), Spot instances. Use reserved for master and core nodes.
  • Can have long-running cluster or transient one.
  • EMR Instance Configuration:
    • Uniform instance groups: For each node type, all nodes have same config. supports auto scaling.
    • Instance Fleet: For each node type, you can mix On-demand and Spot instances. E.g. "3 large On-demand plus 5 xlarge Spot instances for Task nodes, and 5 On-demand instances for Core nodes".
  • EMR Serverless offering is also available and introduced in Dec 2021. (EMR was introduced in 2009)

Running Jobs on AWS

- Provision EC2 instance and run CRON jobs.

- Amazon Event Bridge + Lambda (cron)

- Reactive Workflow: On EventBridge, S3, API Gateway, SQS, SNS, run lambda

- AWS Batch Job using Docker Image or scripts. (Triggered by EventBridge Schedule)

- AppRunner to Run docker Image in fargate. (Triggered by EventBridge Schedule)

- Use EMR for SPARK jobs.

AWS Glue

.
.    Glue --- Serverless Data Integration Service.
.
.    Crawler           - Crawl And Create Data Catalog
.    Data Catalog      - Hive Catalog Compatible
.    ETL               - Run Glue (ETL) Job in Spark environment.
.    Streaming         - Managed Spark Streaming
.    DataBrew          - Visual Data Preparation Tool to clean data.
.    Studio            - Create and Run Glue (ETL) jobs using Notebook
.
.
.                                    +--------------+    Used By      EMR
.    S3/RDS/JDBC  ---->  Glue    --->|   Glue       |------------>   Athena
.    DynamoDB           Crawler      | Data Catalog |                Spectrum
.                                    +--------------+                Glue ETL
.
.
.               Extract                    Load
.     S3 / RDS --------->   Glue ETL   ---------->  RedShift
.                         (Transform)
.                      (Batch Oriented)
.
  • Managed Data Integration Service.
  • Serverless Service -- Leverages different open source technologies and standards.
  • You don't need EMR cluster or Spark environment to run Glue Job. The serverless service creates PySpark environment and executes your Python Script.
  • AWS Glue Streaming Service is Powered by Apache Spark Streaming.
  • Prepare and transform data for analytics.
  • You can Load it to destination like Redshift datawarehouse.
  • Glue Crawler jobs can be used to populate the AWS Glue Data Catalog. A crawler scans S3, RDS, DynamoDB etc. and creates metadata tables used for data discovery by Athena (analytics), Redshift Spectrum and EMR.

Glue Datacatalog

  • The AWS Glue Data Catalog is a centralized metadata manager for your datasets.
  • Foundation for most Analytics services.
  • The source includes: S3, RDS, MongoDB, DynamoDB, etc.

Glue Studio

  • Interactive ETL job creation using notebook, script editor, SparkSQL, etc
  • On the fly transforms like: join, union, Dropfields, FillMissingValues, RemoveNullRows, etc
  • Detect PII data and transform.
  • Monitor jobs.

Glue Crawler

You can use an AWS Glue crawler to populate the AWS Glue Data Catalog with databases and tables.

Glue DataBrew

A visual data preparation tool to clean and normalize data without writing any code.

Glue Streaming

  • Managed serverless Apache Spark Streaming compatible service.
  • Click Stream Analytics, Fraud Detection, Data ingestion, etc
  • Though Glue/Spark Streaming could be used for analytics, Glue is mainly used for ETL ingestion workloads.
  • Provides instant-on notebooks for streaming jobs.

Glue Commands Examples

# Create glue job. Same commands for Glue ETL Job or Glue Streaming Job.
aws glue create-job --name my-glue-job --role my-glue-role \
    --command '{  ...  "ScriptLocation": "s3://my-bucket/scripts/my_script.py",  ...  }' \
    --max-capacity 2.0

# Start a Glue Job
aws glue start-job-run --job-name my-glue-job \
    --arguments '{"--input_path": "s3://...", "--output_path": "s3://..."}'

aws glue list-jobs

# To list all job runs (past executions) for a Glue job, use the get-job-runs command:
aws glue get-job-runs --job-name my-glue-job

aws glue delete-job --job-name my-glue-job

# Create a Glue Crawler
aws glue create-crawler --name my-glue-crawler --role my-glue-role \
    --database-name my-glue-database \
    --targets '{"S3Targets": [{"Path": "s3://my-bucket/data/"}]}' \
    --table-prefix my_table_prefix_

# Start crawler.
aws glue start-crawler --name my-glue-crawler

aws glue list-crawlers
aws glue get-crawler --name my-glue-crawler --query 'Crawler.State'

# Listing All Tables Created by a Crawler
aws glue get-tables --database-name my-glue-database

# To update crawler to scan new S3 bucket ...

aws glue update-crawler  --name my-glue-crawler \
    --targets '{"S3Targets": [{"Path": "s3://new-bucket/data/"}]}'   ...

Example Python Glue ETL Job Script:

#
#  SparkContext: Automatically created when the job runs.
#  GlueContext: Provides additional Glue-specific ETL functions.
#  DynamicFrame: Glue’s custom data structure allowing for schema inference and flexibility.
#
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

# Initialize Spark and Glue contexts
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Example data source and target
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table"
)

# Transformation example
transformed_df = datasource.apply_mapping(
    [("column1", "string", "new_column1", "string")]
)

# Save the transformed data back to an S3 bucket as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=transformed_df,
    connection_type="s3",
    connection_options={"path": "s3://my-target-bucket/transformed-data/"},
    format="parquet"
)

AWS Redshift

  • Based on PostgreSQL but not for OLTP.
  • OLAP - Online analytical processing.
  • 10x better performance than other data warehouses, scale to PBs of data.
  • Columnar storage.
  • Massively Parallel Query Execution (MPP)
  • Integration with BI tools such as AWS Quicksight and Tableau
  • Data loaded from S3, Firehose, DynamoDB, DMS (DB migration Service), etc.
  • 100+ nodes, up to 16 TB of space per node!
  • Leader Node: For query planning, results aggregation
  • Compute Node: Perform queries and send to leader
  • Backup & Restore, Security VPC/IAM/KMS, Monitoring
  • Redshift Enhanced VPC Routing: COPY/UNLOAD goes through VPC. Better performance, lower cost
  • Redshift provisioned, only worth it if you have sustained usage. Use Athena for sporadic rare queries.

Redshift Snapshots and DR

  • Snapshots are point-in-time backups of a cluster, stored in S3.
  • Snapshots are incremental. Restore into a new cluster.
  • Automated, every 8 hours, every 5 GB or on a schedule. Set Retention.
  • Can configure to copy snapshots of a cluster to another region. You need to enable "snapshot copy grant" for destination region to use appropriate destination KMS key while copying snapshot.
  • It is common to enable automatic copy to another destination region. You can also make a copy from automatic snapshot to create manual snapshot.
  • To copy snapshots for AWS KMS–encrypted clusters to another AWS Region,
    1. create a grant for Amazon Redshift to use a customer managed key in the destination AWS Region.
    2. Then choose that grant when you enable copying of snapshots in the source AWS Region

Commands:

aws redshift enable-snapshot-copy \
             --region us-east-1 \
             --cluster-identifier cc-web-data-cluster \
             --destination-region us-west-1 \
             --retention-period 7 \
             --manual-snapshot-retention-period 14

aws redshift create-cluster-snapshot --cluster-identifier mycluster --snapshot-identifier my-snapshot-id



# Prepare auto snapshot copy across regions when encryption enabled ...
# Execute the following in destination region:
aws redshift create-snapshot-copy-grant --snapshot-copy-grant-name my_copy_grant

# Enable auto snapshot copy from source to another region. Execute the following in source region:
aws redshift enable-snapshot-copy  --cluster-identifier mycluster --destination-region us-west-1 \
                                   --snapshot-copy-grant-name my_copy_grant

Redshift Spectrum

.
.            SQL
.   Client ------->  Redshift-Cluster ----> Redshift-Spectrum  <------ S3
.

- Query S3 data along with Relational tables.
- `Redshift Cluster must be running` to use Spectrum.
- Serverless with auto-allocated resources. The bigger your Redshift cluster, the bigger the allocation.
- Pricing is based only on the data scanned (e.g. $5 per terabyte).
- Query looks like: Select * from S3.ext_table ... 
- Existing redshift processes the query using Redshift Spectrum nodes.
- Store S3 objects in Apache Parquet format for better columnar performance.
- You need to associate IAM role with Redshift cluster to access S3 files.
- You need to create an external schema to define external tables over an S3 location.
  In addition you need a data catalog, which can be the Hive catalog in EMR or the
  Athena/Glue catalog (simpler, and can be managed by Redshift).
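Illustrative Spectrum DDL (the schema, database, table, and role ARN are all made up): create an external schema backed by the data catalog, then query S3 data like any other table. The statements are kept as strings here; you would run them from a SQL client:

```python
# Hypothetical Redshift Spectrum DDL and query, shown as plain strings.
create_schema = """
CREATE EXTERNAL SCHEMA s3ext
FROM DATA CATALOG DATABASE 'mydb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

query = "SELECT userid, count(*) FROM s3ext.clickstream GROUP BY userid;"
print("uses data catalog:", "DATA CATALOG" in create_schema)
```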

Redshift Workload Management (WLM)

  • To prevent short-running queries from getting stuck behind long-running queries
  • Define multiple query queues, route queries to appropriate queues.
  • Internally, there are superuser queue, short-running queue, long-running queue
  • Automatic WLM queues and resources managed by Redshift.
  • Manual WLM - queues managed by user.
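A toy illustration of the queue-routing idea (the routing logic here is invented, not Redshift's actual algorithm):

```python
def route_query(estimated_seconds: float, is_superuser: bool = False) -> str:
    # Hypothetical manual-WLM style routing: short queries get their own
    # queue so they are not stuck behind long-running ones.
    if is_superuser:
        return "superuser"
    return "short-running" if estimated_seconds < 20 else "long-running"

print(route_query(5))        # short-running
print(route_query(3600))     # long-running
print(route_query(1, True))  # superuser
```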

Redshift Concurrency Scaling

  • When enabled, adds automatic additional cluster capacity (i.e. Concurrency-scaling cluster)
  • uses WLM (workload management feature) to decide which queries sent to additional cluster
  • Charged per second.

Document DB (Managed MongoDB)

  • Managed MongoDB in AWS.
  • MongoDB cluster means one of the following:
    • Simple Replica Set: a primary node that receives read/write requests, plus multiple secondary nodes (typically 2). Writes happen only at the primary. AWS calls this an "Instance Based Cluster".
    • Sharded Cluster: each shard is a replica set. Clients connect to mongos (router), which uses the config server to route writes to a specific shard (i.e. replica set). AWS calls this an "Elastic Cluster".
  • HA with replication across 3 AZ
  • Storage auto grows in increments of 10GB
  • Auto scales workload with millions of requests per seconds
  • Much like Aurora backup and other features available.
  • DB Storage and Backup Storage (in S3) are charged per GB/month

Amazon Neptune

  • Use cases: customer personalization, fraud detection; Neptune ML uses graph neural networks.

Amazon Neptune Database

Fully managed Graph, Analytics, Serverless Database.

  • Neptune Streams is a feature of Neptune DB, similar to logical logging:
    every change is logged and made available for reading via a REST API.
.
.  HA; 3 AZs; up to 15 read replicas. Scales to billions of relationships.
.
.                 R1        R2
.             A  ----> B <------ C
.

Amazon Neptune Analytics

  • Get insights and trends for large data of relationships.
  • Run queries on data with tens of billions of relationships in seconds.
  • Analytics database engine for graph data.
  • Data sources: S3 buckets or Neptune database.
  • Uses built-in algorithms, vector search, and in-memory computing.
  • Popular open graph APIs are supported such as OpenCypher, W3C RDF/SPARQL, Apache TinkerPop Gremlin Property Graph.

Amazon Keyspaces (for Apache Cassandra)

. 
.  Keyspaces === Managed Apache Cassandra
.
.
.   Serverless  HA         1000RPS     3x Replication 
.
.   Peer-to-Peer cluster    Write-Intensive
.
.  Netflix (For Logging), Instagram
.
  • Use cases: Store IoT devices info; Time-series Data; High write workloads. Fraud Detection;
  • Not ACID compliant.
  • Managed Serverless, Scalable, HA, Cassandra Compatible NoSQL Database.
  • Uses CQL - Cassandra Query Language.
  • Single digit millisecond latency -- 1000 RPS!
  • Netflix uses it for audit logging! Instagram uses Cassandra too.
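CQL looks much like SQL. A sketch of a write-heavy time-series table (keyspace, table, and column names are made up); the statements are kept as strings since you would run them with cqlsh or a Cassandra driver against your Keyspaces endpoint:

```python
# Hypothetical CQL for an IoT time-series workload.
create_table = """
CREATE TABLE iot.sensor_readings (
    device_id text,
    ts timestamp,
    value double,
    PRIMARY KEY ((device_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
"""

insert_row = "INSERT INTO iot.sensor_readings (device_id, ts, value) VALUES (?, ?, ?);"
print("partition key: device_id; clustering key: ts")
```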

Amazon QLDB

  • Use Case: Record Financial transactions.
  • Amazon Quantum Ledger Database (Amazon QLDB) is a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log.
  • Serverless, HA, Replication across 3 AZs.
  • Provides features such as immutability, cryptographically verifiable, ledger history function.

Amazon Timestream

  • Fully managed fast scalable serverless time series database
  • Can receive data from AWS IoT, Data Streams, Prometheus, Kinesis Data Analytics Apache Flink (Analytics), Lambda, MSK (kafka), etc.
  • Destination could be JDBC, Quicksight, Sagemaker, Grafana

Amazon Athena

.
.
.  S3 with Glue Catalog                            ODBC
.  S3 with Hive Metastore  ---->   Athena        --------->  QuickSight
.  CloudWatch Logs                 [Glue Crawler]  JDBC      SQL Editor Results
.
.
  • Serverless query service to analyze S3 data.
  • Uses SQL (built on Presto)
  • Supports CSV, JSON, Avro, ORC (columnar), Parquet (columnar) formats.
  • AWS Glue (meant for ETL) can convert CSV to Parquet format, which is better for performance.
  • $5 per TB of data scanned
  • Used with Quicksight for dashboards
  • BI, analytics, Analyze and query VPC flow logs, ELB logs, etc.
  • Use columnar data for cost-savings.
  • Federated Query:
    • Also allows you to fetch data in relational, non-relational, custom data sources on AWS or on-premises.
    • Supports Lambda as Data Source Connector. So can connect and fetch any data!
  • Results can be stored in S3.
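At $5 per TB scanned, per-query cost is easy to estimate. This helper is illustrative (real Athena billing also applies a small per-query minimum, rounded up to the nearest MB):

```python
def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    # Columnar formats (Parquet/ORC) plus partitioning reduce bytes scanned,
    # and therefore cost, often by an order of magnitude.
    tb = bytes_scanned / (1024 ** 4)
    return round(tb * price_per_tb, 4)

print(athena_query_cost(512 * 1024 ** 3))  # 0.5 TB scanned -> 2.5 (dollars)
```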

AWS Lake Formation

AWS Lake Formation is an authorization layer that provides fine-grained access control to resources in the AWS Glue Data Catalog.

.                              AWS Lake Formation
.
.                                Source Crawlers
.                               Data Catalog (Glue)       RedShift        ----> QuickSight 
.     Source       ingest        Security Settings   ---> Athena
.     S3 RDS      -------->      ETL Data Prepare         EMR (Hadoop and Spark)
.     On-Premise                (Import by Blueprint)     Apache Spark
.                                      |
.                                      |
.                                      V
.                           DataLake (Stored in S3)
.
.
.     Row-Level-Security
.
  • Standard Blueprints are available for importing into S3 Datalake. (e.g. RDS to S3)

Amazon Quicksight

.
.   DataSources: S3 (With Manifest file), Athena (coupled with Glue Catalog), RDS, JDBC, ...
.
.   SPICE Engine
.
.   Column-Level-Security
  • Serverless
  • ML powered BI service to create interactive dashboards. Adhoc Analysis.
  • Embeddable with per-session pricing!
  • Integrated with RDS, Aurora, Athena, Redshift, S3, JDBC, ... Third party datasources like Jira, salesforce also supported.
  • In-memory computation using SPICE engine if Data is imported into QuickSight. SPICE (Super-fast, Parallel, In-memory Calculation Engine) can reuse data.
  • Can import csv, xlsx, json, tsv, ELF & CLF log formats.
  • Enterprise edition supports Column level security so that some users don't see some statistics or columns.
  • Supports users (std versions) and groups (enterprise only): These are not IAM users, only for Quicksight.
  • A dashboard is read-only. To share it, you must publish it and share it with users.

Bigdata Architecture

Bigdata Analytics Layer

.
.
.                  EMR (Hadoop/Presto/Spark/Hive ... )
.
.  S3  ----->      Redshift/Spectrum                    -->    Quicksight
.
.                  Amazon Athena (serverless)
.

Bigdata Ingestion Pipeline

.
.  IOT Devices  ------> Kinesis     -->   Firehose  -->   S3
.                       DataStreams       ETL Lambda      
.
.     Athena     -----> S3 Reporting Bucket --> Quicksight
. (Periodic or                              --> Redshift
.  on ingestion event)
.

Warehouse technologies

  • EMR: Hadoop, Apache Spark, HBase, Presto, Flink, ...; Bigdata, Heavy Duty, ML, etc.
  • Presto: Opensource Distributed SQL query engine supporting sources like S3, RDS, HDFS, etc.
  • Athena: Serverless, Adhoc queries and Aggregation on S3 data. Athena uses Presto.
  • Redshift: Advanced SQL with Spectrum option.

CloudWatch

Realtime monitoring and centralized log management.

.
.      Log-Groups  Subscriptions (Get Log Events)
.      Alarms (Send notification on Metrics threshold breach)
.      Metrics  (CPUUtilization from AWS/EC2, etc standard metrics available + Custom metrics)
.
.      CloudWatch Events = Events Bridge:  Events Rules are used for Event Driven Automation.
.
.      Insights Dashboard
.
.                Notification
.      Alarm  ---------------------->  SNS            --> Lambda
.                                      EventBridge
.
.      EventsBridge Events --------->  lambda | Almost all Services
.
  • Cloudwatch is metrics Repository.
  • Namespaces. e.g. AWS/EC2

Cloudwatch Metrics

.
.  Metric            Dimension               Statistics
.
.                    Alarms  Custom-Metrics 
.
.  Subscriptions     Streaming       Retention-Time      Detailed-Monitoring (1min)
.
.  Unified-Cloudwatch-Agent: (For Memory, etc additional-Metrics)
.
.  Synthetic-Canary   KMS-Encryption-at-rest
.
  • Metrics provided by AWS services free of cost. Regional Service.
  • Enabling Detailed monitoring and custom metrics get charged extra.
  • Each metric is associated with (namespace, name, Optional Dimension, Timestamp, Unit-Of-Measure)
  • You can assign up to 30 dimensions to a metric!
  • Metrics retention depends on resolution: sub-minute: 3 hours; 1 minute: 15 days; 5 minutes: ~2 months; 1 hour: 15 months.
  • Statistics: Aggregation of metrics over a period of time.
  • Metrics can be read over standard periods only: 1 second, 5 seconds, 10 seconds, 30 seconds, or any multiple of 60 seconds.
  • Units: Example units include Bytes, Seconds, Count, and Percent.
  • Some CloudWatch metrics support percentiles, trimmed mean and other performance statistics.
  • Example: EC2 standard: 5 minutes; Detailed monitoring: 1 minute
  • Can create custom metrics: standard resolution 1 minute, high resolution 1 sec.
  • EC2 RAM usage is not a built-in Metric
  • CloudWatch alarms:
    • Can trigger actions:
      • EC2 action (reboot, etc)
      • Auto Scaling (more cpu usage, increase instances)
      • SNS, etc
    • Intercepted by EventBridge
    • Cannot trigger Lambda directly; only via SNS or EventBridge.
  • CloudWatch dashboards:
    • Display metrics and alarms
    • can show metrics across regions
  • CloudWatch Synthetics monitoring (aka Canary):
    • Configurable scripts that monitor your APIs, URLs, and websites
    • Reproduce customer problems in advance programmatically
    • Store load time data and screenshots of the UI
    • Integrates and creates Cloudwatch Alarm as needed.
    • Written in Node.js or Python
    • Programmatic access to a headless Google Chrome Browser
    • Blueprints include:
      • Heartbeat monitor: load URL store screenshot and http archive file
      • API canary: test basic read/write REST APIs
      • Canary Recorder - Used with Cloudwatch Synthetics Recorder (Record your actions on website and generate script)
      • GUI workflow builder - Verifies that actions can be taken on your webpage (login form)
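The allowed-period rule above ("1, 5, 10, 30, or any multiple of 60 seconds") is easy to encode as a small validator (hypothetical helper, not an AWS API):

```python
def is_valid_period(seconds: int) -> bool:
    # CloudWatch statistics periods: 1, 5, 10, 30, or any multiple of 60 s.
    return seconds in (1, 5, 10, 30) or (seconds > 0 and seconds % 60 == 0)

print([p for p in (1, 5, 10, 30, 45, 60, 300) if is_valid_period(p)])  # [1, 5, 10, 30, 60, 300]
```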

CloudWatch Metric Examples

| Service     | Dimensions                                | Metrics (Description)
|-------------|-------------------------------------------|------------------------------------------------------------
| EC2         | InstanceId, AutoScalingGroupName          | CPUUtilization (avg CPU), NetworkIn/Out (bytes of traffic), DiskReadOps/DiskWriteOps (disk operations)
| ECS         | ClusterName, ServiceName                  | CPUUtilization, MemoryUtilization (per task, per service)
| S3          | BucketName, StorageType                   | BucketSizeBytes (total bytes stored), NumberOfObjects
| DynamoDB    | TableName, GlobalSecondaryIndexName       | ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits (capacity units used)
| API Gateway | ApiName, Stage, Method                    | Count (total API requests), 4XXError/5XXError (error counts), Latency (avg response time)
| Lambda      | FunctionName, Resource (Version/Alias)    | Invocations (total), Duration (avg execution time), Errors (invocation errors)
| RDS         | DBInstanceIdentifier, DBClusterIdentifier | DatabaseConnections (active connections), FreeStorageSpace, ReadIOPS/WriteIOPS
| ELB         | LoadBalancerName, AvailabilityZone        | RequestCount (total requests), HealthyHostCount, UnHealthyHostCount, Latency (avg response time)
| Redshift    | ClusterIdentifier, NodeID                 | CPUUtilization, DatabaseConnections, ReadIOPS/WriteIOPS
| Kinesis     | StreamName, ShardId                       | IncomingBytes (data ingested), IncomingRecords (total records), ReadProvisionedThroughputExceeded
| CloudFront  | DistributionId, Region                    | Requests (total), BytesDownloaded/BytesUploaded, 4xxErrorRate/5xxErrorRate
| EBS         | VolumeId                                  | VolumeReadBytes/VolumeWriteBytes, VolumeReadOps/VolumeWriteOps, BurstBalance (remaining burst credits)

Cloudwatch Logs

.
.  Max Rate Of Logging: 5000 TPS per account per region. (Quota can be increased)
.  Max Log event Size: 256 KB. (fixed)
.
.  Log-Group  Metric-Filter  Custom-Metrics Rules Alarm
.
  • Sources:

    • SDK, Cloudwatch Logs Agent, Unified Agent
    • Elastic Beanstalk, ECS, Lambda, VPC Flow Logs, API Gateway
    • Route 53 DNS queries Log
    • Cloudwatch Logs Agents on EC2 machines
  • Format: a Log Group usually represents an application; Log Streams represent instances within the app

  • Log expiration policies: never expire, 30 days, etc

  • Optional KMS encryption

  • Can send logs to S3 (export), Kinesis Data Streams, Firehose, Lambda, Elastic Search. An export to S3 can take up to 12 hours, so it is not realtime!

  • For realtime analysis of logs, use Logs Subscriptions

  • Logs can use filter expressions to generate Alarm, for example. e.g. ERROR keyword in logs can generate alarm.

  • Cloudwatch Logs Insights can be used to query logs and add queries to dashboard

  • Cloudwatch Logs Subscriptions: deliver matching log events in realtime via a subscription filter, e.g. to a (custom) Lambda function, which can forward output to Elastic Search etc.:

    .
    .                                        Lambda (custom) Realtime
    .   Logs -->  Subscription Filter --->   Firehose (Near Realtime)   --> Elastic Search
    .                                                                       Write to S3
    .
    

    Note: Lambda is for realtime; Firehose is near realtime (1 minute or more).

  • For logs aggregation of multi-account, multi-region, you can define single Kinesis data streams and use subscription filters in all accounts to send logs to that single Data Stream.

Unified Cloudwatch Agent

  • By default, you don't get memory usage metrics from EC2. If you install (Unified) Cloudwatch Agent on EC2 and On-premises machines, then you get those metrics as well.

  • You can install cloudwatch agent using SSM (AWS Systems Manager) run command. Note: SSM agent is available by default on EC2 instance.

  • See cloudwatch agent source code:

    https://github.com/aws/amazon-cloudwatch-agent/
    https://github.com/docker/buildx/                #building-multi-platform-images
    

Cloudwatch Custom Metric

.
.         On-Premise      Send               Metric-Filter                  Alarm
.     Cloudwatch Agent  ------->  Log-Group --------------> Custom Metrics -------> SNS
.         Logs                     Logs
.

aws logs put-metric-filter --log-group-name my-log-group --filter-name ErrorCountMetricFilter \
         --filter-pattern "ERROR" \
         --metric-transformations metricName=ErrorCount,metricNamespace=MyApplication/Metrics,metricValue=1

aws logs describe-metric-filters --log-group-name my-log-group

aws cloudwatch put-metric-alarm --alarm-name "HighErrorCountAlarm" --metric-name "ErrorCount" \
               --namespace "MyApplication/Metrics" --statistic "Sum" --period 300 --threshold 10 \
               --comparison-operator "GreaterThanOrEqualToThreshold" --evaluation-periods 1 \
               --alarm-actions arn:...::MySNSTopic --actions-enabled

Application Signals

  • Use CloudWatch Application Signals to automatically instrument your applications on AWS so that you can monitor their health and performance.
  • Leverages application observability services: CloudWatch metrics, CloudTrail and X-Ray.
  • Automatic source instrumentation is made possible by providing some library options to Java, Python application commands.
  • Tightly integrated into EKS when enabled.
  • Enabling this with ECS involves running a sidecar agent along with tasks (preferable) or run as a daemon (without Fargate support).

Prometheus - Open-source Monitoring Tool

  • Prometheus is an open-source monitoring and alerting toolkit for cloud-native environments.

  • Prometheus is known for its powerful querying language (PromQL)

  • Time-Series Database: Prometheus stores all data as time-series.

  • Service Discovery: Prometheus can automatically discover targets based on labels, has tight integration with Kubernetes.

  • Amazon Managed Service for Prometheus (AMP) is available. AMP integrates with AWS services like Amazon CloudWatch for alerts, IAM for access control, and Grafana for dashboards.

  • CloudWatch Agent can collect (scrape) Prometheus metrics from your workloads.

  • Use specialized metric exporters like RDS exporter, S3 exporter, etc if needed.

  • For Kubernetes integration:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install prometheus prometheus-community/prometheus
    # Configure Prometheus with Kubernetes service discovery;
    # it scrapes metrics from pods annotated for Prometheus scraping.
    
  • Integrate with Grafana (optionally, Amazon Managed Grafana) for visualizations

AWS Service Discovery Solutions

.
.    CloudMap  Route-53  ECS-Built-in  Kubernetes-Built-In 
.
.    AppMesh (Uses Envoy Proxy for Service-Service communication)
.
  • Cloud Map: Offers both DNS-based (private hosted zone entry for myservice.app.local) and API-based service discovery (registerInstance, discoverInstances SDK API), along with health checks.
  • Route 53: Provides DNS-based discovery. Used for microservices that need a highly available DNS service.
  • ECS Service Discovery: DNS-based discovery specifically for ECS tasks and services.
  • Kubernetes Service Discovery: Uses built-in kube-dns or CoreDNS for internal DNS names.
  • App Mesh: Provides service discovery within a service mesh, with advanced routing and observability. Uses OpenSource Envoy Proxy.

AWS CloudMap

Fully managed resource (such as microservices) discovery service.

.
.   AWS CloudMap -- Microservices and others lookup by name. - Resource Discovery Service.
.
.   Health Checks:  Look up service and also locate healthy one.
.
.   ECS and Fargate etc - Enable Service Discovery ===> Uses AWS CloudMap
.
.   EKS can publish external IPs to CloudMap.
.
.   Use Custom names for your application resources and endpoints.
.
.   Resources Examples: DynamoDB, SQS, RDS, etc
.
  • Makes it easier to manage an application with many microservices that are upgraded and versioned independently.
  • CloudMap also does integrated health checking for you and can stop sending traffic to unhealthy endpoints.
  • You could also do this with a simple Redis lookup table, but you don't get health checks.
  • Can query CloudMap using SDK, API or DNS.
  • There is opensource project ExternalDNS for Kubernetes. ExternalDNS synchronizes exposed Kubernetes Services and Ingresses with DNS providers:
    • Inspired by Kubernetes DNS, Kubernetes' cluster-internal DNS server, ExternalDNS makes Kubernetes resources discoverable via public DNS servers, e.g. Route 53, Google CloudDNS.
    • It uses TXT records that have the my-cluster-id value embedded.
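API-based lookup against Cloud Map can be sketched like this (namespace and service names are hypothetical):

```shell
# Discover healthy instances via the servicediscovery API instead of DNS.
aws servicediscovery discover-instances \
    --namespace-name app.local \
    --service-name myservice \
    --health-status HEALTHY      # only return instances passing health checks
```

The same lookup works over DNS by resolving myservice.app.local inside the VPC's private hosted zone.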

AWS Systems Manager (SSM)

.
.  SSM
.  
.   Patch Manager
.   State Manager
.   Session Manager
.   etc.
.

AWS Systems Manager is the operations hub for your AWS applications and resources. Secure end-to-end management solution for hybrid and multicloud environments.

It provides the following broad capabilities:

------------------------------------------------------------------------------------------------------------
Application management - Group some resources together and name it as an application.
                         Dynamic property lookup. If version = AppConfigLookup('version') etc.
                         Deployment validators and UI as well. A/B Testing.
------------------------------------------------------------------------------------------------------------
Change management        Includes Change Manager (Request/Approve changes in Application), Automation,
                         Change Calendar, Maintenance Windows.
------------------------------------------------------------------------------------------------------------
Node management          Includes Patch Manager, Fleet Manager, Session Manager, State Manager, Compliance,
                         Run Command, Distributor, Inventory.
------------------------------------------------------------------------------------------------------------
Operations management    Incident Manager, Explorer, OpsCenter
------------------------------------------------------------------------------------------------------------
Quick Setup              Manage service configurations and deploy it in single or multiple accounts.

                         e.g. Setup default host management configuration (Create necessary EC2 roles, etc)
                              Enable periodic updates of SSM and cloudwatch agents, etc.
                         e.g. Create an AWS Config configuration recorder.
                         e.g. Patch Manager configuration in quick setup is called "Patch Policy".
------------------------------------------------------------------------------------------------------------
Shared resources         Create SSM Document to share across organization. 100 Predefined.
                         SSM Document could be classified as:

                           Automation Runbook, CloudFormation Template, Command Document, App Configuration,
                           AWS Config Conformance Pack, Change Calendar Document, Package Doc(Distributor) 
------------------------------------------------------------------------------------------------------------

It provides different features such as:

------------------------------------------------------------------------------------------------------
Node Management:
------------------------------------------------------------------------------------------------------
   Patch Manager       # Bulk Patch Management. Patch baseline. 
   Fleet Manager       # View and Manage group of EC2 (and on-premise) nodes in single UI.
   Session Manager     # Run ssh shell
   Run Command         # Run SSM Command Document on selected nodes.  e.g. "AWS-RunShellScript"
   State Manager       # Associate SSM documents with selected nodes to run once or periodically.
                       # Associations e.g. Patch Commands or Collect Inventory (Meta data) etc.
   Distributor         # Create zip as installable package. A package is a kind of SSM Document.
                       # Use State manager to run on schedule or use SSM Run command to install once.
   Inventory           # Collect Inventory meta data into S3 file (periodically).
                       # Integrates with Compliance reports, AWS Config, State Manager etc.
                       # Uses AWS:GatherSoftwareInventory SSM (Policy) document.
   Compliance          # View mainly Patching compliance (as per Patch Manager)
                       # View also State Manager association compliance.

------------------------------------------------------------------------------------------------------
Operations Management
------------------------------------------------------------------------------------------------------

   OpsCenter           # Track Ops Issues called OpsItems. Can use Automation Runbooks to solve issues. 
                       # OpsItems can be auto-created from cloud watch alarms.
                       # Event bridge Rule can create OpsItem on Security Hub alert issued.

   Explorer            # Configure source for OpsData. e.g. Security Hub, Trusted Advisor, regions, etc.
                       # View consolidated reports of OpsItems from different sources.
------------------------------------------------------------------------------------------------------
Applications Management
------------------------------------------------------------------------------------------------------

   Application Manager # Your application as logical group of resources. UI View.
                       # Provides AppConfig and Parameter Store.
                       # A/B Testing. Dynamic Props Lookup. Deployment and Validators.

   App Config          # Application related dynamic properties. e.g. AppConfig Lookup('enable_debug')
                       # A/B Testing. Consolidated UI View.

   Parameter Store     # Application parameters and secure strings.
                       # Note: Secrets Manager (vs Parameter Store) adds automatic secret rotation.

------------------------------------------------------------------------------------------------------
Change Management
------------------------------------------------------------------------------------------------------
   Change Manager      # Advanced Change Request Management with Approvals.

   SSM Automation      # Run Automation Runbook using Automation Service. 
                       # Runbook includes Tasks aka Actions. 
                       # Also supports: aws:executeScript action for Python or shell scripts.
                       # Common simple or complicated bulk IT Tasks across accounts.

   Change Calendar     # Restrict actions that can be performed during specific time interval.
                       # e.g. Do not run some automation runbooks on business hours, etc.

   Maintenance Windows #  Maintenance window has a schedule, registered targets and registered tasks.

------------------------------------------------------------------------------------------------------
Shared Resources
------------------------------------------------------------------------------------------------------
   SSM Document        #  SSM Document represents actions or configurations or template or such.
                       #  More than 100 predefined documents to share across organization.
                       #  SSM Document could be classified as:
                       #     Automation Runbook, CloudFormation Template, Command Document, 
                       #     App Configuration, AWS Config Conformance Pack, 
                       #     Change Calendar Document, Package Doc(Distributor) 
------------------------------------------------------------------------------------------------------
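As an example of the Parameter Store feature listed above, parameters can be created and read like this (parameter names and values are hypothetical):

```shell
# Store a plain parameter and a KMS-encrypted SecureString.
aws ssm put-parameter --name /myapp/db/host --type String --value db.internal
aws ssm put-parameter --name /myapp/db/password --type SecureString --value 'S3cret!'

# Read it back; --with-decryption decrypts SecureString values via KMS.
aws ssm get-parameter --name /myapp/db/password --with-decryption \
    --query Parameter.Value --output text
```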

SSM Agent

  • SSM Agent required to be running on EC2 instances (and on-premise also) to be managed by SSM.
  • Installed by default on Amazon Linux AMIs and some Ubuntu AMIs
  • The agent communicates via outbound HTTPS requests to:
    • ssmmessages.* (Amazon Message Gateway Service) or
    • ec2messages.* (Amazon Message Delivery Service).
  • Make sure EC2 instances have proper IAM permissions for SSM actions. SSM agent has to register with System Manager Service.

SSM Document

SSM Document is used to specify actions or configurations or template or such. These are the categories of SSM Document:

------------------------------------------------------------------------------------------------------
Category              Examples
------------------------------------------------------------------------------------------------------
Command Document      AWS-RunPatchBaseline, AWS-ConfigureAWSPackage, AWS-RunShellScript
                      AWS-InstallApplication, AWSSSO-CreateSSOUser, AWSFleetManager-CreateUser

Automation Runbook    AWS-CreateImage, AWS-CreateSnapShot, AWS-ECSRunTask,  
                      AWSConfigRemediation-DeleteIAMUser,  AWSDocs-Configure-SSL-TLS-AL2, 
                      AWSSupport-ExecuteEC2Rescue, AWSSupport-CollectECSInstanceLogs,
                      AWSSupport-ResetAccess, AWSSupport-TroubleshootEC2InstanceConnect

Change Calendar       Define a schedule to restrict actions. No predefined AWS document.

Application Config    Application related dynamic properties. No predefined AWS documents.

Cloudformation        Cloudformation template that creates or updates resources.
                      "AWSQuickStarts-AWS-VPC,
                      "AWSQuickSetupType-SSMChangeMgr-CFN-DA, 
                      "AWSQuickSetupType-SSMHostMgmt-CFN-TA,  etc.

AWS Config            AWS Config managed rules and remediations.
Conformance Pack      AWSConformancePacks-OperationalBestPracticesForNIST800181
                      AWSConformancePacks-OperationalBestPracticesforAIandML
                      AWSConformancePacks-OperationalBestPracticesforAPIGateway

Package Document      Document used to define package. Used by SSM distributor.
                      AWSCodeDeployAgent, AWSEC2Launch-Agent, AWSNVMe, 
                      AWSSupport-EC2Rescue, (Note: EC2Rescue is a package. ExecuteEC2Rescue is a runbook)
                      AmazonCloudWatchAgent

Policy Document       AWS-GatherSoftwareInventory Policy document used by Inventory and State Manager
                      to track resources and associations (for desired state of patch compliance).


Session Document      For use with SSH session from Session Manager.
                      AWS-PasswordReset, AWS-StartSSHSession, etc. 

SSM Run Command

Systems Manager Run Command can run a command across multiple instances. No need for SSH. Results appear in the console. For example:

aws ssm send-command \
--document-name document-name \
--targets Key=tag:tag-name,Values=tag-value \
[...]    # Run command on all EC2 instances with specific tag.

SSM Automation

  • Automates deployment and configuration tasks. Free Service.
  • Uses Automation Runbook to execute steps.
  • Run automations across multiple AWS accounts and regions.
  • Automate common IT Tasks.
  • EventBridge can trigger the Systems Manager service to run commands.
# Reboot EC2
aws ssm start-automation-execution --document-name "AWS-RestartEC2Instance" --parameters "InstanceId=i-xxx"

# Dedicated account for SSM Admin. (Similarly for security, you can use an Account, etc.)
# The management account for organization always has super power.
aws organizations register-delegated-administrator --account-id <delegated-admin-account-ID> \
                                                   --service-principal ssm.amazonaws.com

Systems Manager Patch Manager

Patch Baselines

Set of rules for auto-approving which patches should be applied. Example rules:

All "CriticalUpdates" and "SecurityUpdates" released until 1 week back.
Specific Patch to be applied or skipped. (whitelist or blacklist)
For Windows rule1 and For Mac Rule2

Default patch baseline is the default set of rules to apply.
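A baseline like the example rules above might be created as follows (baseline name is hypothetical; --approval-rules uses the CLI shorthand syntax):

```shell
# Auto-approve Windows critical and security updates 7 days after release.
aws ssm create-patch-baseline --name "Prod-Windows-Baseline" \
    --operating-system "WINDOWS" \
    --approval-rules 'PatchRules=[{PatchFilterGroup={PatchFilters=[{Key=CLASSIFICATION,Values=[CriticalUpdates,SecurityUpdates]}]},ApproveAfterDays=7}]'
```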

Patch Operation Methods

The patch operations are: Scan, and Scan and Install. The following 4 methods are available:

  • Patch Policy (configured in Quick Setup):
    • Integrated with AWS organizations to support all accounts or specific OUs.
    • Single patch policy defines multiple patch schedules and patch baselines to Scan and Install.
  • Host Management (configured in Quick Setup):
    • Similar to Patch Policy but limited to only Scan Operation and default patch baseline.
    • Produce Compliance Reports.
  • Maintenance Window to run Scan or Install Task:
    • Set of Selected Nodes in single account.
    • Specific Schedule
    • To run Scan or Scan and install Task.
  • On-demand "Patch Now" operation:
    • Set of Selected Nodes in single account.
    • To run Scan or Scan and install Task.

Patch Group

  • PatchGroup is a set of Nodes appropriately marked with specific PatchGroup.
  • A node can be tagged like: OS=Windows, dept=dev, PatchGroup=WindowsDev. PatchGroup is a special tag recognized by Patch Manager.
  • Every PatchGroup can be attached to at most one patch baseline (set of rules).
  • The node is patched as per the associated PatchGroup -> Baseline, or the default baseline.
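The tag-then-bind flow can be sketched like this (instance and baseline IDs are hypothetical; the tag key "Patch Group" with a space is the classic form, PatchGroup is also recognized):

```shell
# Tag the instance with the special patch-group tag...
aws ec2 create-tags --resources i-0123456789abcdef0 \
    --tags 'Key=Patch Group,Value=WindowsDev'

# ...then bind that patch group to exactly one baseline.
aws ssm register-patch-baseline-for-patch-group \
    --baseline-id pb-0123456789abcdef0 --patch-group "WindowsDev"
```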

SSM Command document

An SSM document can be a runbook, a CloudFormation template, a command document, etc.

The SSM command document can be used for patching OS and applications. It uses the default patch baseline if no patch group is specified.

e.g. You can use the document AWS-RunPatchBaseline to apply patches for both OS and Applications. On Windows only Microsoft Applications are supported for patching.

There are only 5 recommended SSM command documents for patch management:

AWS-RunPatchBaseline              # For all OS and Apps 
AWS-ConfigureWindowsUpdate        
AWS-InstallWindowsUpdates
AWS-RunPatchBaselineAssociation   # Often used to Scan only. Can select Baseline on Tags.
AWS-RunPatchBaselineWithHooks     # Supports pre-Install, post-Install, post-Reboot hooks!

You can create custom SSM command documents (JSON File) for your own operations which looks like:

{
   ...
   "action": "aws:runDocument",    // or aws:runShellScript, etc.
   ...
   "documentType": "LocalPath",    // Or SSMDocument for composite document!
   "documentPath": "bootstrap"
}

SSM Session Manager

  • Allows you to start secure shell on your EC2 or on-premises servers.
  • Access through AWS Console, AWS CLI, or Session Manager SDK
  • Does not need SSH access already setup.
  • Every command is logged. Better for security and tracing.
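Starting sessions from the CLI can look like this (instance ID and port numbers are hypothetical; the Session Manager plugin must be installed locally and the instance role needs SSM permissions):

```shell
# Open an interactive shell on a managed instance; no SSH keys or open port 22 needed.
aws ssm start-session --target i-0123456789abcdef0

# Or forward a remote port to a local one, e.g. to reach a private web server.
aws ssm start-session --target i-0123456789abcdef0 \
    --document-name AWS-StartPortForwardingSession \
    --parameters '{"portNumber":["80"],"localPortNumber":["8080"]}'
```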

SSM OpsCenter

  • Manage OpsItems - issues, events and alerts
  • Provides Automation Runbooks that you can use to resolve issues.
  • EventBridge or CloudWatch Alarms can create OpsItems
  • Aggregates information about AWS Config changes, Cloudwatch Alarms, etc.
  • Reduces mean time to resolution.
  • Can be integrated with JIRA, ServiceNow

Manage On-premise Node From AWS

Follow these steps:

  • Create a new service role which uses the AmazonSSMManagedInstanceCore policy. Optionally attach other policies if you need additional permissions. Select Systems Manager as the trusted entity. :

    aws iam create-role --role-name SSMServiceRole \
             --assume-role-policy-document file://SSMService-Trust.json
    aws iam attach-role-policy --role-name SSMServiceRole \
             --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    
  • Create and apply a hybrid activation. The following command gives you an activation code and ID. You need these later to install and activate the SSM agent on the on-premise node.

aws ssm create-activation \
       --default-instance-name MyWebServers \
       --description "Activation for Finance department webservers" \
       --iam-role service-role/SSMServiceRole  --region us-east-2 \
       --tags "Key=Department,Value=Finance"
  • Run the following on the on-premise node:

    mkdir /tmp/ssm
    curl https://amazon-ssm-region.s3.region.amazonaws.com/latest/debian_amd64/ssm-setup-cli \
                                       -o /tmp/ssm/ssm-setup-cli
    sudo chmod +x /tmp/ssm/ssm-setup-cli
    sudo /tmp/ssm/ssm-setup-cli -register -activation-code "activation-code" \
                             -activation-id "activation-id" -region "region"
    

AWS Cost Allocation Tags

  • For classifying your expenses you have 2 ways:

    • Cost Category: Define custom rules that assign a category based on service, resource, account or tags. e.g. Development == a set of specific accounts. OR
    • Tags: Define and use proper tags on your resources.
  • Resource Groups could be created using Tags. You can also create static Resource Groups without using tags (by specifying certain EC2 resources, etc).

  • Management account owners can activate the AWS-generated tags in the Billing and Cost Management console. When a management account owner activates the tag, it's also activated for all member accounts. This tag is visible only in the Billing and Cost Management console and reports. e.g. aws:createdBy

  • User tags can be defined by the user and start with the prefix "user:"

  • Example Tags:

    aws:createdBy = Root:123456789
    user:Cost Center =  56789
    user:Stack =  Test | Production
    user:Owner = DbAdmin
    user:app = myPortal1
    
  • Total cost report can group by desired Tags. e.g. Per owner and CostCenter, etc.

  • You can see "Your Cost Explorer trends", monthly costs, Chart style

  • You can choose filters such as Service, Region, Tag, Instance type, etc.

  • You can get various granular data:

    Resource-level data at daily granularity;
    Cost and usage data for all AWS services at hourly granularity;
    EC2-Instances resource-level data at hourly granularity.
    

Enforcing Tags - Advanced

You can have an IAM or SCP policy to enforce Tags on creation of resources.

  • The ForAllValues:StringEquals condition ensures that unknown tags are not present.
  • The Null condition ensures that the required keys are not absent.

An example is given below:

{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Effect": "Allow",
     "Action": "ec2:RunInstances",
     "Resource": "*",

     "Condition": {

       /* Multi context condition. Usually there is only one condition */

       "ForAllValues:StringEquals": {
         "aws:TagKeys": ["Department", "Project"]
       },

       /* There is implied AND here. If you need OR, then add another statement element! */

       "Null": {
         "aws:RequestTag/Department": "false",    /* Key exists */
       },
       "Null": {
         "aws:RequestTag/Project": "false"        /* Key exists */
       }
     }
   }
 ]
}
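The effect of the two conditions can be mimicked locally (illustration only: hypothetical request tags, checked with jq rather than by IAM itself):

```shell
# Hypothetical tags sent with an ec2:RunInstances request.
REQUEST_TAGS='{"Department":"Finance","Project":"Portal"}'
ALLOWED='["Department","Project"]'

# ForAllValues:StringEquals -- count request tag keys NOT in the allowed set.
UNKNOWN=$(echo "$REQUEST_TAGS" | jq --argjson allowed "$ALLOWED" \
  '[keys[] | select(. as $k | $allowed | index($k) | not)] | length')

# Null:"false" -- both required keys must be present.
PRESENT=$(echo "$REQUEST_TAGS" | jq '(has("Department")) and (has("Project"))')

[ "$UNKNOWN" -eq 0 ] && [ "$PRESENT" = "true" ] && echo "request allowed"
```

Dropping Project from REQUEST_TAGS flips PRESENT to false, just as the Null condition would deny the real request.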

Some common policy elements and condition operators are:

  • Action, NotAction, Principal, NotPrincipal, Resource, NotResource
  • StringEquals, StringNotEquals, StringLike, StringNotLike
  • NumericEquals, NumericLessThan, etc
  • StringEqualsIfExists, etc.
  • "Condition":{"Null":{"aws:TokenIssueTime":"true"}} means IsNull('TokenIssueTime') = True

Finding Non-Compliant Resources

  • You can use AWS Resource Groups (All resources with specific tag or just all EC2s, etc) in combination with AWS Tag Editor or AWS Config to identify non-compliant tags across your resources.
  • Tag Editor lets you find all resources having or not having the tags that you specify.

Trusted Advisor

  • Use cases: Optimize costs, improve performance, and address security gaps
  • AWS Support Premium feature (mostly).
  • AWS Trusted Advisor provides real-time guidance to help you provision your resources following AWS best practices.
  • Analyze your AWS accounts and provides recommendation:
    • Cost Optimization
    • Performance
    • Security
    • Fault Tolerance
    • Service Limits
    • Operational Excellence.
  • Can enable weekly email notification from the console.
  • Core Checks (around 7) and recommendations (around 56) - available for all customers.
  • Full Trusted Advisor - (around 426!) Only for Business & Enterprise support plans.
  • AWS Support Plans:
    • Basic Support: Included for all customers and free; 7 core Trusted Advisor checks; No AWS Support API programmatic Access.
    • Developer: Above plus Case support with general guidance < 24 hours; System Down: <12 hrs
    • Business: Above Plus Support API + Production system impaired < 4hrs; system down<1hr
    • Enterprise: Above Plus Business critical system down: < 15 mins
  • Service Limits can only be monitored; to raise limits, use AWS Service Quotas or create a case using AWS Support Center.
  • e.g. You can use Lambda Function TA Refresher that can trigger Trusted Advisor checks and Generate events in EventBridge, if needed.
  • e.g. vCPU Checker Lambda function can check Service Quotas and raise event in EventBridge. Note: EventBridge event in one account can be forwarded to one in another account.

AWS Service Quotas

  • Raise alarm events when you are close to exceeding service quota limits.
  • Create CloudWatch Alarms using the Service Quotas Console.
  • You can request a quota increase or shut down resources before the limit is reached.

EC2

Instance Types

  • General Purpose: M* types; T2; T3; T4g; Xeon;
  • Compute Optimized: C4, C5, C* types; AWS Graviton3 processors
  • Memory Optimized: R*, X*; AWS Graviton4 processors
  • Accelerated computing: P*,G*, etc.; Graphics, FloatingPoint, etc. . (AMD EPYC 7R13), Up to 8 NVIDIA H100 Tensor Core GPUs
  • Storage Optimized: I*, D3; AWS Graviton2 processors
  • HPC Optimized: Hpc*; Up to 64 cores of Graviton3E processors with 128 GiB of memory

Launch Types

While creating an EC2 instance, you will have the option to select a launch type. The billing will be based on that.

.  Launch Type           :  Comments
........................................................................................
.          On-Demand     :  Short Workload
.     Spot-Instances     :  Cheap, not reliable. Up to 90% Savings
.     Reserved Instances :  Reserve for 1-3 Years. Up to 72% savings.
.     Dedicated Instances:  May share hardware only with same accounts resources.
.                           On reboot hardware may change.
.     Dedicated Hosts    :  Dedicated physical server. On host affinity, reboots to same.
.                           Available only large config. On-demand price.
.    

AWS Savings Plans

  • Commit to certain type of usage: e.g. $10 per hour for 1 to 3 years.
  • Any usage beyond the savings plan is billed at on-demand price.
  • Savings plans could be one of the following 3 types.
EC2 Instance Savings Plan
  • Up to 72% discount -- same as Standard Reserved Instances.
  • Select instance family (e.g. M5, C5, ...) and lock to a specific region.
  • Flexible across size (m5.large to m5.xlarge), OS (Windows or Linux), tenancy (dedicated or default)
Compute Savings Plan
  • Up to 66% discount; same discount as Convertible Reserved Instances
  • You can move between instance family (e.g. from C5 to M5), region, compute type (EC2, Fargate, Lambda), OS & tenancy (dedicated or default)
SageMaker Savings Plan
  • Up to 64% off; AWS ML Platform

Capacity Reservation

  • Reserve On-Demand instances capacity on specific AZ for any duration.
  • No discounts. Just on-demand rate. But combined with Savings Plan you gain some discount.
  • Use Case: Short-term uninterrupted workloads that need to be in specific AZ.

EC2 Fleet or Spot Fleet

  • Designed for bulk launch EC2 instances.
  • EC2 Fleet is CLI/API only. Spot Fleet is legacy and also allows console access.
  • It is set of Spot Instances Plus optional On-Demand Instances.
  • Configured by a launch template or a set of launch parameters.
  • It will try to meet the target capacity with price constraints.
  • Uses a Fleet Method:
    • Use standard EC2 Auto Scaling OR
    • EC2 Fleet: No need for auto-scaling; mostly for fixed capacity.
    • Spot Fleet: Legacy, do not use it.
  • Strategies to allocate and maintain Spot Instances:
    • lowestPrice
    • diversified: better for availability
    • capacityOptimized
    • priceCapacityOptimized (recommended)

Placement Group

  • You can create a placement group and immediately launch EC2 instances into it.
  • You can associate the placement group with an ASG.
  • It will be used when the ASG is activated (through a load balancer, by ECS, or by anyone else)
.
.         Create       +-----------> Run Instances (Using the PG)
.    Placement Group --+
.                      +---> Tie with ASG ---> Tie with TargetGroup/ELB --> Tie with ECS 
.                       
.

Use Case: Create multiple EC2 instances in same AZ with lowest latency for cluster workload. :

aws ec2 create-placement-group --group-name HDFS-GROUP-A --strategy partition \
        --partition-count 3

aws ec2 run-instances --placement "GroupName = HDFS-GROUP-A, PartitionNumber = 1" --count 100

Partition Strategy:

  • "cluster": Packs instances close together with lowest latency in same AZ for HPC apps.
  • "partition": Means partitions of racks (not a partition within a rack). A partition number within a placement group refers to a collection of racks. When one rack fails, only a few nodes in the same cluster partition fail (e.g. in a Hadoop cluster), limiting the damage. i.e. you want low latency but don't want correlated cluster-partition failure when a single rack fails.
  • "spread": Spreads a small workload across distinct hardware, mostly in the same AZ but possibly across AZs. "Rack level spread" placement groups can have at most 7 running instances per AZ.

ENI - Elastic Network Interface

.                N:1
.        ENI  -----------  AZ    ENI is bound to AZ, but can be reassigned to any subnet in same AZ.
.
.                N:1
.        ENI  -----------  EC2   EC2 has one Primary ENI and optional multiple secondary ENIs.
.                                Secondary ENI can even be in another VPC, but must be in same AZ!
.                N:1
.        ENI  -----------  subnet  ENI is bound to single subnet.
.
.                N:M
.        EC2  -----------  subnet  EC2 belongs to multiple subnets if multiple ENI's attached.
.                                  However, primary ENI determines primary subnet of EC2.
.
.  Max ENI per c5.large is 3.
.
.  Use floating ENI and disable delete on terminate option to persist ENI.
.
.  Elastic Fabric Adapter   == EFA == 100 Gbps For HPC computing.
.  Elastic Network Adapter  == ENA == Enhanced Networking ==  upto 100 Gbps
.  Intel 82599 VF (Virtual Func)   == Supports up to 10 Gbps.
.
.  Max speed is capped by EC2 instance type. More ENIs does not increase throughput.
.
  • Virtual Network card in VPC.
  • ENI Can have:
    • Primary private IPv4, optional many secondary IPv4.
    • One Elastic IP per private IP. (Elastic IPs are attached to Private IPs!)
    • One Public IPv4
    • One or more security groups!
    • A MAC address.
  • You can create ENI independently and attach them on fly on EC2 instances for failover!
  • Bound to a specific AZ.
  • Traffic Mirroring can send the duplicate traffic of specific ENI to network/gateway load balancer and network appliances for threat monitoring.
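The floating-ENI failover pattern above can be sketched like this (subnet, security group, ENI and instance IDs are hypothetical):

```shell
# Create a floating ENI in a subnet, with a security group attached.
aws ec2 create-network-interface --subnet-id subnet-0123456789abcdef0 \
    --groups sg-0123456789abcdef0 --description "floating-eni"

# On failover, attach it to the standby instance as a secondary interface.
# (Detach it from the failed instance first; device-index 0 is the primary ENI.)
aws ec2 attach-network-interface --network-interface-id eni-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device-index 1
```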

Security Group and Network ACLs - NACL

.         Internet
.     -------------->
.     
.                                   
.      NACL Implicit inbound/outbound - All Allow;   Stateless; All inbound/outbound rules are independent.
.
.          NACL Outbound Rules   (Match on first Rule and exit)
.          Rule                                             Destination
.           1     All high outbound ports  TCP   1024-64K   0.0.0.0/0     Allow    (Most important Rule)
.           2     Allow SSH                TCP         22   0.0.0.0/0     Allow
.           3     Allow All to EC2         TCP         22   0.0.0.0/0     Allow
.         100     Deny if no match         ALL        ALL   0.0.0.0/0     Deny    (Least important Rule)
.
.      (Security Group)                         SG allows Only Allow Rules; No Deny Rules;
.                                               TCP   22  0.0.0.0 
.          EC2 - ENI-1  <-- SG1                 TCP   80  0.0.0.0
.                ENI-2  <-- SG2                 (Implicit Deny; Default SG has outbound rule to allow all)
.
.      SG   Implicit inbound/outbound - All Deny;    Stateful; Outbound auto allowed if inbound allowed.
.                                                    New SG group has all allow outbound rule added.
.      (Network ACL)
.
.      Note: Outbound Rules refer to Port at destination. Inbound Rules refer to Port at local.
.            This is true for SG and NACL.
.

Security group is attached to EC2 (specifically ENIs) and Network ACLs are attached to subnets. Using SG, you can control which ENI should serve your inbound and outbound requests. SG can be attached to other resources whenever ENI is involved, for example, interface VPC endpoints, RDS, etc.

By default, a security group implicitly denies all inbound traffic. By default, there is also an "allow all outbound" rule attached to the security group, which can be deleted if desired.
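Adding inbound rules to a security group can look like this (group ID and admin CIDR are hypothetical):

```shell
# Allow inbound HTTP from anywhere and SSH only from an admin range.
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 203.0.113.0/24
# Because the SG is stateful, response traffic is allowed automatically;
# no matching outbound rule is needed.
```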

EC2 Instance Profiles

.
.    Instance Profile: Defines Instance IAM Role. 
.    Supplies temp credentials to Applications running on that instance.
.
  • An instance profile is a container for an IAM role that you can pass to EC2 instance.
  • The instance profile also can be tagged which helps in proper resource management.
  • By default, when you create an IAM role using the console, an instance profile with the same name is created for the role.
  • Using cli you can create instance profile and attach any role to it.
aws iam create-instance-profile --instance-profile-name Webserver
aws iam add-role-to-instance-profile --role-name S3Access --instance-profile-name Webserver
aws iam tag-instance-profile --instance-profile-name Webserver \
                         --tags '[{"Key": "Department", "Value": "Engineering"}]'

Instance Identity Role

  • Each Amazon EC2 instance that you launch has an instance identity role that represents its identity.
  • An instance identity role is a type of IAM role.
  • AWS services use it to identify the instance to the service.

Supported services using the instance identity role:

  • Amazon EC2 – EC2 Instance Connect uses it to update the host keys for a Linux instance.
  • Amazon GuardDuty – Runtime agent to send security telemetry to the GuardDuty VPC endpoint.
  • AWS Security Token Service (AWS STS) – Use it to call AWS STS GetCallerIdentity action.
  • AWS Systems Manager – AWS Systems Manager uses it to register EC2 instances.
  • Instance identity roles can’t be used with other AWS services or features, because those services do not have an integration with instance identity roles.

The instance identity role looks like, for example:

arn:aws:iam::123456789012:assumed-role/aws:ec2-instance/i-0123456789example

Instance Metadata Service

  • IMDS is an HTTP service present on every EC2 instance, reachable at a link-local address.

  • It offers insights into the instance's network settings, attached role credentials, and other metadata.

  • This lets applications retrieve instance information without extra configuration.

  • The instance identity role credentials are accessible from the Instance Metadata Service (IMDS) at:

    /identity-credentials/ec2/security-credentials/ec2-instance.
    

The credentials consist of an AWS temporary access key pair and a session token. They are used to sign AWS Sigv4 requests to the AWS services that use the instance identity role.

Instance identity roles are automatically created when an instance is launched, have no role-trust policy document, and are not subject to any identity or resource policy.

# Access temporary credentials using the metadata service ...
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>
# Output:
{
   "AccessKeyId": ..., "SecretAccessKey": ..., "Token": ..., "Expiration": ...
}

EC2 Hibernate

  • Stores RAM on the boot volume so that the next start is fast!
  • You can configure EC2 to hibernate by default on stop action.
  • Not all EC2 instance types support hibernation.
  • Root volume must be EBS, encrypted.
  • Cannot stay hibernated for more than 60 days!
  • Enable hibernation when launching the instance, make sure the root volume is encrypted, then create the AMI.
aws ec2 stop-instances --instance-ids i-1234567890abcdef0 [--hibernate]

ASG - Auto Scaling Group

.  Auto Scaling Group
.
.                   instance1       min <= N <= max instances.
.                   instance2
.                   instance3        Auto Register
.                   instance4       --------------->   Load-Balancer
.
.
.
.    ASG Health Check  --->  EC2-Built-In-Health-Checks | ELB-TargetGroup-HealthCheck
.
.    Launch Template (Preferred) vs Launch Config
.
.    CloudWatch-Based-TargetTracking
.
  • Auto add new/remove EC2 instances depending on load

  • Dynamic Scaling:

    • Target Tracking Scaling: e.g. Based on CPU usage to stay at fixed value (say 40%)
    • Simple/Step Scaling: Cloudwatch alarm CPU > 70% then add 2 units. CPU < 30%, remove 1 unit.
  • Scheduled Scaling: e.g. Increase the min capacity to 10 at 5 PM on Fridays.

  • Predictive Scaling: Scale based on time series analysis => Schedule based on that.

  • LoadBalancer can do healthcheck on ASG instances!

  • The scaling cooldown period (300 seconds by default) is the period after a scaling activity during which the ASG will not add or remove instances.

  • You can temporarily suspend auto-scaling activities (adding or deleting instances), so that you can ssh into machines and do some investigations:

    aws autoscaling suspend-processes --auto-scaling-group-name MyGroup
    aws autoscaling resume-processes --auto-scaling-group-name MyGroup
    
  • Configure Healthcheck:

    # Use --health-check-type EC2 for the built-in EC2 health check instead.
    aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg \
                    --health-check-type ELB \
                    --health-check-grace-period 300
    
  • Use launch template (instead of Launch Config) which defines things including:

    • AMI, EBS, SG, Key Pair, Instance Profile, User Data
    • Termination Protection
    • Placement Group
    • Capacity Reservation (Reserve Capacity on specific Zone. Charged on-demand rate even if not used)
    • Tenancy (default shared, Dedicated Instance, Dedicated Host). A Dedicated Host is a whole physical server, ideal for cluster configs. Dedicated Instances run on hardware dedicated to your account, but may share that hardware with other instances from the same account in a way you do not control.
    • Purchasing Option (e.g. Spot)
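The arithmetic behind target tracking can be sketched as follows (a simplification; the real ASG algorithm also involves CloudWatch alarms, cooldowns, and instance warmup):

```python
import math

# Rough sketch of target-tracking scaling arithmetic (simplified).
# To hold the metric at the target, capacity scales proportionally to the
# ratio of the current metric to the target metric, clamped to min/max.
def desired_capacity(current_capacity, current_cpu, target_cpu, min_size, max_size):
    desired = math.ceil(current_capacity * current_cpu / target_cpu)
    return max(min_size, min(max_size, desired))

# 4 instances at 80% CPU with a 40% target -> scale out to 8.
print(desired_capacity(4, 80, 40, min_size=2, max_size=10))  # 8
# 4 instances at 10% CPU -> scale in, but never below min_size.
print(desired_capacity(4, 10, 40, min_size=2, max_size=10))  # 2
```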

Rescue EC2

If your EC2 instance is stuck because of network misconfiguration or kernel corruption / blue screen, then you can use the Systems Manager AWSSupport-ExecuteEC2Rescue Automation document, or run EC2Rescue manually.

It works for both Linux and Windows.

First you install EC2Rescue on a working (good) machine and then run the Systems Manager document.

Cloud Init and User Data

  • User data can be used for bootstrapping. If it starts with #!, it is recognized as shell script.
  • cloud-init is the tool (from Canonical) that interprets and executes the user data script. It is executed only once, on first boot.
  • Content available from link local address: http://169.254.169.254/latest/user-data
  • EC2Config is installed on Windows Server AMIs as an alternative to cloud-init. User data is executed on first boot (EC2Config parses the instructions) if it begins with <script> or <powershell>

Instance Metadata

Metadata is divided into two categories: instance metadata (static configuration such as AMI id, hostname, network settings) and dynamic data.

Dynamic data:

  • It is generated when the instance is launched.
  • It includes the instance identity document (i.e. account id, private IP, AZ, region, ImageId, etc.) at /dynamic/instance-identity/document
  • Also includes instance-identity/signature, a signature that other parties can use to verify the document.
  • The endpoint fws/instance-monitoring indicates whether CloudWatch monitoring is enabled/disabled.
  • Can be accessed from http://169.254.169.254/latest/dynamic/
  • Can be used for managing and configuring running instances.
  • Both metadata and user data are open and not protected, so no sensitive information should be stored there.

Security Credentials in Instance Metadata

Security credentials are automatically available to the AWS SDK and other applications running on an EC2 instance (sourced from the instance role). The SDK retrieves them from the metadata service, much as it would read .aws/credentials.

If you need to directly get the keys, you can also do this:

http://169.254.169.254/latest/meta-data/iam/security-credentials/s3access

You may have to get a token and pass it in http header. See docs for more details
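With IMDSv2, the token is obtained with an HTTP PUT and then passed as a header on every metadata read. A sketch of the two-step flow (it only builds the requests; they resolve only from inside an EC2 instance, and the role name s3access is the example from above):

```python
# IMDSv2 flow sketch: fetch a session token, then use it for metadata reads.
# The requests only resolve from inside an EC2 instance; here we just build them.
import urllib.request

IMDS = "http://169.254.169.254"

def token_request(ttl_seconds=21600):
    # Step 1: PUT to the token endpoint with the desired session TTL.
    req = urllib.request.Request(f"{IMDS}/latest/api/token", data=b"", method="PUT")
    req.add_header("X-aws-ec2-metadata-token-ttl-seconds", str(ttl_seconds))
    return req

def credentials_request(token, role="s3access"):
    # Step 2: GET the role credentials, presenting the session token.
    req = urllib.request.Request(
        f"{IMDS}/latest/meta-data/iam/security-credentials/{role}")
    req.add_header("X-aws-ec2-metadata-token", token)
    return req

# On an instance you would then do:
#   token = urllib.request.urlopen(token_request()).read().decode()
#   creds = urllib.request.urlopen(credentials_request(token)).read()
```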

VM Import/Export

  • EC2 service to import an on-premises server image (uploaded to S3) into AWS as an AMI, and to export an AMI to a format importable on premises.
  • Alternatively you can use AWS MGN (Application Migration Service) to replicate a live server to AWS. It replicates all volumes and then auto converts them into an AMI for you. That is the preferred method.
  • It is also available from the Migration Hub Orchestrator console.
  • Supported VM formats: OVA (Open Virtualization Archive), RAW, VHD (Hyper-V), VMDK (VMware)
  • Target export environments: VMware, Citrix, and Microsoft (Hyper-V).
# If you want to convert from your own Ubuntu machine without any hypervisor ...
sudo dd if=/dev/sda of=/path/to/disk-image.img bs=4M status=progress
# Convert disk image to .vmdk format.
qemu-img convert -O vmdk /path/to/disk-image.img /path/to/output-image.vmdk

# Run from on-premises. You can directly use .VMDK file or create one first. Copy over to S3.
aws s3 cp <vmdk> s3://my-bucket/myfile.vmdk

# Run from EC2. Converts .vmdk to .ami
aws ec2 import-image --description "My server image" --disk-containers \
             Format=VMDK,UserBucket={S3Bucket=your-bucket-name,S3Key=your-image-file.vmdk}

# Run from EC2. Can export running or stopped instance. This creates .vmdk file.
aws ec2 create-instance-export-task --instance-id i-instanceid --target-environment vmware \
                 --export-to-s3-task S3Bucket=mybucket,S3Prefix=exports/

Elastic EBS volume changing

  • You can change the volume type among gp2, gp3, io1, io2 without downtime.
  • You can increase the size but cannot decrease it.
  • You can change IOPS for gp3, io1, io2 volumes -- increase or decrease.
  • For a Windows machine you may have to tell the OS to extend the partition after increasing the disk size.

ASG vs Fleet

.
.    Spot-Fleet                       EC2-Fleet                ASG
.    (Spot+On-Demand)                 (Spot+On-Demand+RI)      (Dynamic Metrics)
.
.    Target-Capacity                  Batch-Job                Web-Servers
.
  • You can directly use EC2 Fleet or Spot Fleet to manage Spot instances without involving an ASG.
  • Spot Fleet is a standalone service to manage a fleet of Spot instances (with optional On-Demand instances). Fleet involves Target Capacity and allocation strategy. Useful for batch processing.
  • EC2 Fleet is similar to Spot Fleet but allows for a combination of Spot, On-Demand, and Reserved instances.
  • Auto Scaling Group (ASG) with Spot Instances is useful for dynamic scaling (based on CPU Utilization, etc).
  • With a Mixed Instances Policy, an ASG can also launch a combination of Spot and On-Demand instances. A web application is a good use case.
  • ASG has min, max, target capacity as well but scales dynamically.

EC2 Reserved Instance and ASG

  • Reserved Instances (RIs) are not tied to a specific EC2 instance or Auto Scaling Group (ASG).
  • When you launch instances in your ASG that match the attributes of the Reserved Instances, the RI discount will apply. (e.g. m5.large, zonal/regional, Linux/Windows)

Lambda

.
.  1st Million calls free. Next $0.20 for 1 million. Very Cheap.
.  Max Memory: 128 MB - 10GB !
.  Max Execution Time: 900 Seconds (15 mins) (API Gateway limit is 29 seconds)
.  Max Deployment size: 50 MB (compressed zip)
.  Concurrency executions: 1000 (can be increased)
.
.             Attach to VPC
.    Lambda  --------------->  Use ENI in your subnet
.
.    CanaryRelease (Functional Alias and route-config) (AWS SAM + Code Deploy)
.
.  Reserved Concurrency     == Max Concurrency Per Function.
.  Provisioned Concurrency  == Pre-warmed Available Lambda Instances. (For Application AutoScaling)
.
  • Lambda SnapStart provides up to 10x faster startup for Java 11 and above. A pre-initialized snapshot of memory is saved to disk and reused.

See Also: Lambda@Edge in CloudFront section.

Lambda Event Source Mapping And Triggers

  • Lambda Event Source Mapping is managed by the Lambda framework, and Lambda is invoked by the framework.
  • Event Source Mapping is especially useful for batching and intelligent error handling.
  • Lambda Triggers are another mechanism, managed and invoked by the source service.
.
.    Event Source Mapping Supported Sources
.
.    SQS 
.    Kinesis DataStream
.    DynamoDB 
.    MSK
.    MQ
.
.    Note: Kinesis Firehose supports only Lambda Triggers (For Data Transformation).
.


...........................................................................................................
.
.  Feature             Event Source Mapping               Event Source Trigger
...........................................................................................................
.
.  Polling Mechanism   Lambda polls source                Source service invokes Lambda
...........................................................................................................
.
.  Batching Support    Yes (configurable batch size)      No (invocation per event)
...........................................................................................................
.
.  Supported Services  Queues and streams                 Most services (S3 SNS EventBridge API GW, etc)
.                      SQS Kinesis DynamoDB MQ MSK
...........................................................................................................
.
.  Invocation Timing   Polling interval and batching      Immediate upon event occurrence
...........................................................................................................
.
.  Error Handling      Retry policies, DLQ                Limited to retries within source service
...........................................................................................................
.
.  Use Case            High-throughput stream process     Real-time response to specific events
...........................................................................................................
  • With Event Source Mapping, Lambda framework does long polling and invokes your function. So you don't pay for polling and Lambda does not need to do the polling itself.
  • It is synchronous for the Lambda framework but async for the other services. Retries done by Lambda Framework.
  • May process event more than once.

Event Source Mapping Lambda invocation modes:

Synchronous Invocation:  e.g. CLI, API Gateway, DataFirehose: Caller may timeout quicker than lambda timeout.
Asynchronous Invocation: e.g. S3, SNS, CodePipeline, etc. Caller does not wait. Lambda may retry.
Polling Invocation: e.g. SQS, Kinesis Streams, MSK, DynamoDB. Invoked by Lambda Framework.
                         Retries until input stream event is visible.

Triggers are best suited for discrete independent events and invoked by the calling framework. Examples:

SNS Triggers    (Async)
S3 Triggers     (Async)
API Gateway Triggers (Sync)

.
.                   Long Poll Get Batch
.    Event ----------------------------->     Lambda                     --> Success ---> Remove messages
.               Max wait    (20 secs)      visibility Timeout 30 secs    --> Fail    ---> Reinvoke
.               Max Batch Count (10)       Messages not visible in Queue
.               Max Payload  (6MB)
.

Example SQS message event:

{
   Records: [
    {
       body: "....",
       eventSourceARN: "<arn-queue>",
       ...
    }
   ]
}

The invocation of lambda tries to batch events.
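A sketch of a handler consuming such a batched event. Assuming the event source mapping has the ReportBatchItemFailures option enabled, returning batchItemFailures makes Lambda retry only the failed messages instead of the whole batch:

```python
import json

# Sketch of a handler consuming a batched SQS event. With the
# "ReportBatchItemFailures" option enabled on the event source mapping,
# returning batchItemFailures retries only the failed messages.
def handler(event, context=None):
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(record["body"])
            process(payload)                      # your business logic
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

def process(payload):
    # Stand-in business logic for the sketch.
    if payload.get("fail"):
        raise ValueError("simulated failure")

event = {"Records": [
    {"messageId": "m1", "body": '{"ok": true}'},
    {"messageId": "m2", "body": '{"fail": true}'},
]}
print(handler(event))  # {'batchItemFailures': [{'itemIdentifier': 'm2'}]}
```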

Lambda Alias

  • Useful for versioning and traffic shifting (90% to version 1, 10% to version 2).
  • An alias is a mutable pointer to a specific (immutable) published version -- it has its own ARN.
  • Can control provisioned concurrency, permissions, and metrics separately.
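The weighted routing an alias performs can be sketched like this (hypothetical 90/10 weights; in the real service this is the alias routing configuration with additional version weights):

```python
import random

# Sketch of alias weighted routing (hypothetical weights, not the AWS API):
# 90% of invocations go to version "1", 10% to version "2".
def pick_version(weights, rng=random.random):
    """weights: mapping of version -> fraction of traffic (sums to 1.0)."""
    r, cumulative = rng(), 0.0
    for version, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # guard against floating-point rounding

counts = {"1": 0, "2": 0}
random.seed(42)
for _ in range(10_000):
    counts[pick_version({"1": 0.9, "2": 0.1})] += 1
print(counts)  # roughly 9000 invocations to "1", 1000 to "2"
```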

Designing API for 1 million/sec

.    
.                           Default Limits Per Account Per Region
.                       Max RPS         Max Integration Timeout   Max Concurrency
.     API GW REST API   10K            29 Seconds                29*10K; 290K
.     API GW Websocket  500            2 hours                   3600*2*500=3.6Million
.    
.     Lambda            10K            15 mins                   1K (can increase to 10K+)
.    
.     ALB               100K-million   4000s(1hr+)               No limit.
.                                      (Default:60s)
.    
.     ALB-Lambda        (Throughput limited by Lambda concurrency)
.    
.    
  • API Gateway payload limit is 10 MB but Lambda payload limit is 6MB.
  • API Gateway Timeout is 30 seconds but Lambda timeout is 15 minutes.
  • Lambda can achieve max limit of 10K RPS only if average execution time is 100ms since only 1000 concurrent executions are allowed.
  • For ALB, it is recommended to pre-warm the load balancer with AWS support.
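The Lambda bound in the table follows from Little's law: sustained RPS is limited to concurrency divided by average execution time. A quick check of the numbers above:

```python
# Little's law for Lambda throughput: sustained requests per second is
# bounded by (concurrent executions) / (average execution time in seconds).
def max_rps(concurrency, avg_duration_seconds):
    return concurrency / avg_duration_seconds

print(max_rps(1000, 0.1))   # 10000.0 -> 100ms functions reach 10K RPS at default limits
print(max_rps(1000, 1.0))   # 1000.0  -> 1-second functions cap out at 1K RPS
```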

ALB Lambda Integration

  • The ALB integration timeout is called the idle timeout.
  • By default it is just 60 seconds.
  • The ALB idle timeout cannot be changed at the target group level.
  • The idle timeout can be raised for all targets up to a maximum of 4000 seconds.
  • Supported since 2019

API Gateway Throttling Per Client

.
.                Authenticate
.     Client ---------------------->               API Gateway
.               Rate Limit using        (Lookup API Key and Throttle and Bill it)
.                 API Key
.

You can throttle clients depending on usage plans using API Keys to identify your clients.

  • An API Key is not meant for authentication.
  • An application must authenticate itself using other means (like IAM credentials).
  • A client may purchase an API key tied to a usage plan; API Gateway enforces the usage limits associated with that key.
  • API Keys can be generated by API Gateway or imported from your own file.
  • Uses the x-api-key header. Both the key name and value can be specified by you.

AWS Budgets

.
.                        Apply Budget
.    Management Account --------------> Member Accounts (Optional)
.
.    SNS Topic ---> Lambda --> Fix Policies
.
.    Budget Action ----> Lambda | Stop EC2, RDS Instances
.

  • Create budgets and send alarms when budgeted costs are exceeded.

  • 4 types of budgets: Usage, Cost, Reservation, Savings Plans

  • For Reserved Instances (RI):

    • Track utilization:

      • Utilization Percentage: the percentage of purchased RI hours actually being used.
      • Coverage Percentage: EC2 hours covered by RIs vs total EC2 hours of all instances.
    • Supports EC2, ElastiCache, RDS, Redshift

  • Can filter by Service, Tag, etc.

  • Same options as AWS Cost Explorer!

  • 2 budgets are free, then 2 cents/day/budget

  • Up to 5 SNS notifications per budget.

  • Budget Actions could be created to:

    - Apply IAM policy to user/group/role
    - Apply SCP to an OU
    - stop EC2 or RDS instances
    - Action can be auto or require workflow approval.
    - Can configure Lambda to trigger
    
  • Centralized budget management can be done from the management account in an organization. Each budget can apply a filter for one member account, so that each budget refers to a different member account.

  • For decentralized management, you can apply a CloudFormation template from the management account to each member account. This auto creates budgets in all member accounts.

AWS Cost Explorer

. 
.
.  Forecast-Usage  Visualize-Cost-History  Custom-Reports
.
.  Savings-Plans-Selection   Cost-Breakup-By-OU-Accts-Tags-Service
.
.  RI-Utilization  RI-Coverage  Custom-Usage-Report
.
  • This is different from the basic Billing and Cost Management feature, which is free. Cost Explorer is an advanced, paid feature (API requests are charged per request).
  • Visualize and manage costs over time.
  • Create custom reports.
  • Analyze cost across all accounts (at organization level)
  • Choose an optimal Savings Plan.
  • Forecast usage up to 12 months based on previous usage.
  • Gives spending based on different dimensions:
    • By Service
    • By linked accounts
    • etc.
  • Reserved Instance Utilization is the percentage of purchased RI hours that you are actually using.
  • Reserved Instance Coverage is the percentage of total EC2 instance hours that are covered by RIs.
  • It does not generate Alarms. AWS Budget uses RI metrics and supports to generate Alarms.
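The two RI metrics reduce to simple ratios; a sketch with hypothetical hour counts:

```python
# The two RI metrics as ratios (hypothetical hour counts):
# utilization = used RI hours / purchased RI hours
# coverage    = RI-covered instance hours / total instance hours
def ri_utilization(used_ri_hours, purchased_ri_hours):
    return 100.0 * used_ri_hours / purchased_ri_hours

def ri_coverage(ri_covered_hours, total_instance_hours):
    return 100.0 * ri_covered_hours / total_instance_hours

print(ri_utilization(90, 100))  # 90.0 -> 10% of what you bought sits idle
print(ri_coverage(90, 300))     # 30.0 -> 70% of usage is billed on-demand
```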

AWS Compute Optimizer

  • Get recommendations to optimize your use of AWS resources
  • Analyzes resources configuration and utilization CloudWatch metrics.
  • AI and ML based analytics. Can save costs up to 25%
  • Supported Resources:
    • EC2 instances
    • ASG
    • EBS Volumes
    • Lambda Functions
  • Install the CloudWatch agent on EC2 (to get memory utilization, etc.). Not needed for CPU and networking metrics.

Cloud Migration Strategies

  • The 7 Rs:
    • Retire: Turn off things you don't need.
    • Retain: If no business value to migrate e.g. mainframe, just do nothing for now.
    • Relocate: Move infrastructure to the cloud without changing applications, e.g. move VMware VMs to VMware Cloud on AWS at the hypervisor level. Move EC2 instances as needed.
    • Rehost: "Lift and Shift"
      • Migrate physical machines to AWS Cloud (e.g. EC2). No change in applications.
      • You can use AWS Application Migration Service for this.
    • Replatform: "Lift and reshape":
      • No change to core architecture. e.g. Move to fully managed service or serverless. Migrate your DB to RDS or application to ECS.
    • Repurchase: "drop and shop":
      • Move to different product on cloud.
      • Often move to a SaaS platform.
      • Examples: CRM to salesforce.com; HR to workday; etc.
    • Refactor/Re-architect:
      • Monolithic to microservices
      • Move application to serverless and cloud native.

Disaster Recovery

  • RPO : Recovery Point Objective: the maximum acceptable data loss, measured back in time from the disaster. e.g. The last backup was taken 1 hour before, so up to 1 hour of data is lost.

  • RTO : Recovery Time Objective: From disaster to RTO is the downtime:

    .            {Data Lost }         {Down Time}
    .     ----RPO-------------Disaster-----------RTO--------
    .
    
  • Disaster Recovery strategies:

    - Backup and Restore
    - Pilot Light (Continuous backup of data; core services run at minimal scale. Small downtime okay.)
    - Warm Standby (not hot standby; may operate minimally)
    - Hot Site / Multi Site Approach
    
  • Backup and Restore is the cheapest strategy but has the highest RPO and RTO.
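The timeline above in numbers (hypothetical timestamps):

```python
from datetime import datetime

# RPO/RTO arithmetic on the timeline above (hypothetical timestamps):
# data lost = disaster - last recovery point   (must stay within RPO)
# downtime  = recovery - disaster              (must stay within RTO)
last_backup = datetime(2024, 1, 1, 11, 0)
disaster    = datetime(2024, 1, 1, 12, 0)
recovered   = datetime(2024, 1, 1, 12, 30)

data_lost = disaster - last_backup
downtime  = recovered - disaster
print(data_lost)  # 1:00:00 -> need an RPO of at least 1 hour
print(downtime)   # 0:30:00 -> need an RTO of at least 30 minutes
```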

AWS Health

  • AWS Health provides ongoing visibility into your resource performance and the availability of your AWS services and accounts across your entire Organization.
  • e.g. There is some ongoing outage, so your EC2 instance may be affected.
  • AWS Health events are notifications of 2 types:
    • Account specific Event
    • Public Event
  • You can setup EventBridge rules for Health Events.
  • AWS Health API can be used programmatically as well.

Amazon AppFlow

Perform analytics using data from supported SaaS applications:

  • Salesforce,
  • Google Analytics,
  • Facebook Ads and
  • AWS Services (S3, Redshift, etc).
.   AppFlow:
.
.                                 Glue Catalog          S3
.   Salesforce                    Transform             RedShift
.   Google Analytics   ====>      Mask Fields    ===>   Salesforce
.   Facebook Ads                                        Snowflake (DataCloud AWS Partner)
.

Serverless Application Repository

  • Managed repository for serverless applications.
  • Like ECR for docker images, this is for Serverless Applications.
  • It enables teams, organizations, and individuals to store and share reusable applications.
  • Integration with AWS IAM provides resource-level control of each application.
  • Publicly share applications with everyone or privately share with specific AWS accounts.
  • To share an application you've built, publish it to the AWS Serverless Application Repository.
  • Each application is packaged with an AWS Serverless Application Model (SAM) template.
  • Publicly shared applications also include a link to the application’s source code.
  • Usage is Free. You only pay for the AWS resources used in the applications you deploy.

Cloudformation

  • Cloudformation provides easy automation for provisioning resources using templates. Infrastructure as code.
  • Backbone for Elastic Beanstalk, Service Catalog, SAM (Serverless Application Model), etc.
  • Make changes to your cloudformation template
  • Generate a change set
  • Review changes -- i.e. cloudformation will let you know the actions that it will take. Then approve it.
.
.   Templates: Json or YAML file template.
.   Stacks:      Create/update/delete Stacks using Template
.   Change Sets: Review changes then update Stack
.   Stack Sets:  Multiple stacks across regions using single template.
.
.   Template-Config-File: Defines Parameters, Tags, Stack Policy
.
.   StackPolicy: Permissions policy. Defines Who can update.
.
.
.               StackSets
.           Region1    Region2
.                                          Delegated-Admin-Account
.             Target Accounts
.
.    Resource Type:  e.g. EC2::Instance, EC2::EIP, ECS::Cluster, etc.
.    Resource Properties: ImageId, SubnetId, etc.
.    Resource Attributes: Special Attributes to control behaviour and relationships.
.                         e.g. CreationPolicy, Metadata, DependsOn (controls creation order)
.

Templates

{
  "AWSTemplateFormatVersion" : "version date",
  "Description" : "JSON string",

  "Metadata" : { template metadata },

  "Parameters" : { set of parameters }, # Define parameters. If you specify AllowedValues, it becomes menu.
                                        # Specify Default/max/min constraints.

  "Rules" : { set of rules },

  "Mappings" : { set of mappings },    # Reuse parameters e.g. 
                                       # !FindInMap [ MapName, TopLevelKey, SecondLevelKey ]

  "Conditions" : { set of conditions },

  "Transform" : { set of transforms },

  "Resources" : {
      set of resources
  },

  "Outputs" : { set of outputs }
}

Note: YAML format of cloudformation templates support: !Ref (Reference) and !Sub (Substitute) and similar functions. This is specific to cloudformation. To convert this yaml to json, you need to use cfn-flip command.

CloudFormation Example Commands

# --stack-policy-url: permission policy for updates.
# --tags: applied to the CloudFormation stack and all its resources.
aws cloudformation create-stack --stack-name myteststack --template-body file://sampletemplate.json \
    --parameters ParameterKey=KeyPairName,ParameterValue=TestKey \
    --stack-policy-url file://my-policy.json \
    --tags Key=dept,Value=dev

aws cloudformation update-stack --stack-name <stack-name>  ...

Template Configuration File

A template configuration file is a JSON file that defines:

  • Template parameter values.
  • Tags
  • stack policy

Useful when you deploy a single template into different environments with different configuration files. The stack policy can also live in a separate file and be used as such.

Stacks

Change Sets

Stack Sets

  • Create/update/delete stacks across multiple accounts and regions with single operation.
Mappings:
  RegionMap:
    us-east-1:
      AMIID: ami-0abcdef1234567890
    us-west-1:
      AMIID: ami-0abcdef1234567891

Resources:
  MyInstance:
    Type: "AWS::EC2::Instance"
    Properties:
      ImageId: !FindInMap [RegionMap, !Ref "AWS::Region", AMIID]
  • Admin account to create stacksets.
  • Trusted accounts to create/update/delete stack instances from stacksets.
  • Enable auto deployment feature to auto deploy to accounts in Organization.
  • There could be some minimal resources that should be deployed in all current/new accounts in the organization.
  • You can create stacks across regions using single template. Use self managed permissions or service managed permissions (using AWS Organizations).

Typically there is a Delegated Administrator member Account for a service (say, CloudFormation or EC2 etc).

Users in that member account can administer other accounts in AWS Organization.

aws organizations register-delegated-administrator \
      --service-principal=member.org.stacksets.cloudformation.amazonaws.com \
      --account-id="memberAccountId"

Stack Policy

  • It is a permissions policy to prevent accidental update of the stack.

  • Defines permissions who can update and which resources.

  • You can protect all resources or only some resources from being updated.

  • At most one stack policy can be attached to a stack.

  • If you attach empty policy, all updates are denied by default.

  • Example policy:

    {
       "Statement" : [
          { "Effect" : "Allow", "Action" : "Update:*", "Principal": "*", "Resource" : "*" },
          { "Effect" : "Deny",  "Action" : "Update:*", "Principal": "*", "Resource" : "my/db" }
       ]
    }
    

Integration with Secrets Manager

You can first create a secret using Secrets Manager. Later you can reference it:

Resources:
  mySecret:
     Type: AWS::SecretsManager::Secret
     Properties:
       .....

  myRDSDBInstance:
     Type: AWS::RDS::DBInstance
     Properties:
        ....
        MasterUserPassword: !Sub '{{resolve:secretsmanager:${mySecret}::password}}'

Integration With SSM Parameters

It is similar to using Secrets Manager keys. Another example of using it dynamically :

UserData:
  Fn::Base64: !Sub |           <=====  Note the !Sub (Substitute Function)
    #!/bin/bash
    echo "API_KEY={{resolve:ssm:/myApp/apiKey}}" >> /etc/environment
    # echo "API_KEY=$(aws ssm get-parameter --name /myApp/apiKey ... )" >> /etc/environment

Misc Notes on CloudFormation

The DeletionPolicy to delete resources on deleting the stack could be one of following:

  • DeletionPolicy=Delete (Default for Most objects)
  • DeletionPolicy=Snapshot (Use it for RDS, EBS, etc)
  • DeletionPolicy=Retain (Do not delete it)

Note: For RDS cluster, default is to snapshot. Note: For S3, you need to empty bucket first.

Custom Resources can be defined using Lambda:

- Resource not yet supported. New service for example.
- Empty S3 bucket before deletion.
- Fetch an AMI Id.
  • CloudFormation Drift:

    • Single resource may have manually changed later.
    • Detect drift of an entire stack or stack set using the CloudFormation drift detection feature from the console!
  • Resource Import allows importing existing resources in the template:

    • No need to delete and re-create all resources. You can keep some and import it.

Web Proxy Server

.
.     Access Control  - URL based restriction possible (unlike NACL or SG)
.     Load Balancing
.     Caching         - Similar to browser caching, local common caching.
.     SSL Termination
.     Rate limiting
  • Can be implemented using EC2 and install nginx, Squid or HAProxy.
  • Can be implemented using ELB to offload SSL termination to different servers, if needed.

VMWare Cloud

.
.  VmWare-Cloud   
.
.  vSphere    ESXi-Hypervisor  vCenter
.
.  Backup-Gateway
.
  • Backup Gateway is meant only for VMware VMs on-premises.
  • vSphere is VMware's cloud computing virtualization platform.
  • vCenter Server: A centralized management tool (part of vSphere)
  • VMs run on top of ESXi hosts (Hypervisor).
  • The AWS Application Discovery Connector integrates with VMware vCenter to collect VM details without the need to install agents on individual VMs.

VMware vSphere

  • It is a product suite from VMware.
  • With vSphere, you can consolidate servers into virtual machines on single or fewer physical servers managed centrally via the vCenter Server.
  • The suite includes several components:
    • ESXi hypervisor
    • vCenter Server
    • vSphere Client
    • vSphere Virtual Machine File System (VMFS), vSphere vMotion,
    • vSphere Distributed Resource Scheduler (DRS), and vSphere High Availability (HA).
  • A single software license starts from 1200 pounds per year. Expensive.

VmWare Cloud

It lets you reserve physical hosts on AWS and run the familiar vSphere suite of products that you use on-premises in the cloud as well.

VPC and IPV6

.    IPv6 - Dual Stack - CIDR - Subnet - NACL - SG 
.
.    All public. No NATGW. DNS AAAA. 
.
  • IPv4 and IPv6 both can coexist in VPC. It is called dual stack networking.
  • You can assign a separate additional CIDR block for IPv6 (from the console, choose auto-assign IPv6), and also for every subnet (enable IPv6 on the VPC first, then subnets can auto-assign).
  • IPv6 addresses in a VPC are always public (globally unique). To implement a private IPv6 subnet, use an EIGW (Egress-only Internet Gateway) for the outbound route.
  • IPv6 does not support NAT gateway.
  • ALB can be configured to support IPv6 as well.
  • Route 53 DNS, you have AAAA record (like A Record) for IPv6.
  • Default route is ::/0 -- can point to IGW or Peering connection, etc.
  • NACL and Security groups need to be updated as well.

AWS OpsWorks

  • A configuration management service.
  • OpsWorks using Chef
  • OpsWorks using Puppet
  • Puppet agent on EC2 machines.
  • Maintenance jobs scheduled (like Patch Manager, backups)

AWS Trusted Advisor

Trusted Advisor is a service with a programmatic API (a Business or Enterprise support plan is required for the API). You can get recommendations in the following categories:

Cost optimization 
Performance
Security
Fault tolerance
Service limits – Checks if usage approaches or exceeds the limit (also known as quotas)
Operational Excellence – Recommendations as per standards.

aws support describe-trusted-advisor-checks --language en

  Check              Check-Id
  .....
  Service Limits     eW7HH0l7J9
  ....

aws support describe-trusted-advisor-check-result --check-id eW7HH0l7J9 --language en
  .... (Json output regarding service limits)
  {

      "checkId": "eW7HH0l7J9",
      "result": {
          ...
          "flaggedResources": [
              {
                  "region": "us-west-2",
                  "service": "Amazon EC2",
                  "limit": "On-Demand Instances",
                  "currentUsage": "110",
                  "maxLimit": "120",
                  "status": "WARN"
              }, ...
            ]
       }
  }

Caching Strategies

  • Write-Through: writes go to both cache and DB simultaneously. Data stays consistent, but a write-heavy workload floods the cache.
  • Write-Back: writes go to the cache first and to the DB later, asynchronously. Temporary inconsistency; suits high write rates.
  • Write-Around: writes go directly to the DB; the cache is updated only on reads. Stale data is possible (lack of invalidation), but write-heavy workloads don't flood the cache.
  • Read-Through: on a cache miss, the caching system itself reads from the DB instead of the application.
  • Cache-Aside: the application manually loads data into the cache on a cache miss. Selective updates. (Write behaviour is left to the application.)
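A minimal in-memory sketch of three of these strategies (the dict-backed `db` and `cache` are hypothetical stand-ins for a real database and a cache like ElastiCache):

```python
db = {"user:1": "alice"}      # stands in for the database
cache = {}                    # stands in for the cache layer

def read_cache_aside(key):
    """Cache-Aside: the application checks the cache first; on a miss it
    loads from the DB and populates the cache itself (selective update)."""
    if key in cache:
        return cache[key]               # cache hit
    value = db.get(key)                 # cache miss: go to the DB
    if value is not None:
        cache[key] = value              # populate cache for next time
    return value

def write_through(key, value):
    """Write-Through: write to both the cache and the DB, keeping them
    consistent (at the cost of every write touching the cache)."""
    db[key] = value
    cache[key] = value

def write_around(key, value):
    """Write-Around: write only to the DB and invalidate the cached copy;
    the cache is repopulated on the next read."""
    db[key] = value
    cache.pop(key, None)                # invalidate instead of update
```

Note how write-around leaves the cache cold until the next cache-aside read repopulates it -- that read is where the "stale data possible" risk lives if you skip the invalidation step.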

AWS Auto Scaling

.
.                                    CloudWatch
.                     Spot Fleet
.                        |
.        ECS ------  Scaling-Plan   ---- ASG
.                        |
.                  Aurora, DynamoDB
.
.
.   Free-Service   CloudWatch-Based 
.
.   Discover-Scalable-Resources 
.
.   Built-In-Recommendations Predictive-Scaling-Support
.
  • Manage scaling for multiple scalable AWS resources through a scaling plan.

  • Discover Scalable Resources.

  • Choose scaling strategies - Built-in Scaling Recommendations.

  • Free Service. CloudWatch Based. Basic CloudWatch Charges Applicable.

  • Supported Resources:

    • Amazon EC2 (ASG) Auto Scaling
    • Amazon ECS Auto Scaling
    • Amazon Spot Fleet Auto Scaling - Dynamically change target capacity
    • Amazon DynamoDB Auto Scaling -- Provisioned RCU/WCU and GSI
    • Amazon Aurora Auto Scaling - Increase/Decrease Read Replicas
  • Scaling strategies available:

    • Optimize for availability — Target resource utilization at 40 percent.
    • Balance availability and cost — Target 50 percent utilization.
    • Optimize for cost — Target 70 percent utilization.
  • In addition to Scaling strategies, Enable/Disable Following:

    • Predictive Scaling: Enabled by Default for ASG.
    • Dynamic Scaling: Enabled by Default for all resources. You can disable. Target Tracking is usually dynamic.
  • Other tuning parameters:

    • Scaling Metric. e.g. CPU-Usage, Custom, etc
    • Min Capacity, Max Capacity
    • Replace any existing external scaling policies or not.
    • Disable Scale-in
    • Cooldown - scale-out and scale-in cooldown after a previous scaling activity.
    • Instance Warmup - instances won't contribute to the CloudWatch metric during warmup (after launch/init).
  • Synopsis :

    aws autoscaling-plans create-scaling-plan \
          --scaling-plan-name MyScalingPlan \
          --application-source "ResourceType=AutoScalingGroup,TagFilters=[{Key=Environment,Values=[Production]}]" \
          --scaling-instructions '[
            {
              "ServiceNamespace": "autoscaling",
              "ResourceId": "autoScalingGroup/my-asg",
              "ScalableDimension": "autoscaling:autoScalingGroup:DesiredCapacity",
              "MinCapacity": 1,
              "MaxCapacity": 10,
              "TargetTrackingConfiguration": {
                "PredefinedScalingMetricSpecification": {
                  "PredefinedScalingMetricType": "ASGAverageCPUUtilization"
                },
                "TargetValue": 70.0,
                "ScaleOutCooldown": 300,
                "ScaleInCooldown": 300
              }
            }
          ]'
    
    aws autoscaling-plans describe-scaling-plans --scaling-plan-names MyScalingPlan
    

Application Auto Scaling

  • Application Auto Scaling provides resource specific control for auto-scaling.
  • It does not have separate console or UI (unlike AWS scaling plan based Autoscaling)
  • Supports custom resources through custom metrics.
  • Supports most resources except EC2 ASG. Even EMR clusters and such are supported.
  • Provides the aws application-autoscaling CLI. (Individual services use this same API to control their auto scaling, except EC2 ASG.)
  • Application Auto Scaling parameters are integrated into CloudFormation templates.

Application Auto Scaling vs Scaling Plan based Auto Scaling:

  Feature/Aspect        AWS Auto Scaling (Scaling Plans)             Application Auto Scaling
  Primary Purpose       Centralized scaling for ASG and core         Supports more resources
  Supported Resources   EC2 ASG, ECS, DynamoDB, Aurora, Spot Fleet   Plus Lambda, EMR, etc.
  Scaling Plan          Yes, defines a central scaling strategy      No single plan
  Scaling Policies      Target tracking, step scaling, scheduled     Similar
  Predictive Scaling    Yes, forecasts patterns and scales           No predictive scaling
  Ease of Management    CLI, UI console                              CLI only; service UIs mostly
  Built-in Strategy     Built-in balancing targets (40/50/70%)       No such built-in strategy
  Custom Resources      Not supported                                Yes, with custom metrics

  • Relies heavily on Amazon CloudWatch for metrics.
  • You can create alarms to trigger scaling actions.

Application Auto Scaling Supported AWS Services:

  Service                 What it scales
  Amazon ECS              ECS service tasks (desired count).
  EC2 Spot Fleet          Spot Fleet target capacity, i.e. total instances (within min/max).
  DynamoDB                Provisioned WCU/RCU for tables and indexes (GSIs).
  RDS Aurora              Read replica count.
  Kinesis Data Streams    The number of shards.
  Amazon EMR              Instance count in EMR instance groups.
  AppStream 2.0           Backing fleet capacity.
  Amazon Comprehend       Document classifier/entity recognizer inference endpoints.
  ElastiCache for Redis   Replication node groups and replicas.
  Amazon Keyspaces        WCU and RCU.
  Lambda                  Per-function provisioned concurrency.
  Amazon MSK              Storage volume size.
  Amazon Neptune          Read replica count.
  SageMaker               Endpoint instance count (within min/max range).
  WorkSpaces              Desired user sessions (within min/max range).
  Custom resources        Anything, via custom metrics.

ECS Example:

aws application-autoscaling register-scalable-target \
                                --service-namespace ecs \
                                --scalable-dimension ecs:service:DesiredCount \
                                --resource-id service/cluster-name/service-name \
                                --min-capacity 1 --max-capacity 10

aws application-autoscaling put-scaling-policy \
                                --policy-name TargetTrackingCPUUtilizationPolicy \
                                --service-namespace ecs \
                                --resource-id service/cluster-name/service-name \
                                --scalable-dimension ecs:service:DesiredCount \
                                --policy-type TargetTrackingScaling \
                                --target-tracking-scaling-policy-configuration file://config.json

Example config.json for maintaining CPU utilization at 70%:

{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleInCooldown": 300,
  "ScaleOutCooldown": 300
}

Websockets vs Https

  • A WebSocket connection is bi-directional, unlike HTTPS.
  • WS: non-secure WebSocket protocol over HTTP (port 80)
  • WSS: secure WebSocket protocol over HTTPS (port 443)
  • A WebSocket connection is established over HTTP(S) and then upgraded to the WebSocket protocol.
  • If you want to talk both HTTPS and WebSocket, make two separate connections.

Client Request:

GET /socket HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

Server Response:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
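The Sec-WebSocket-Accept value is derived deterministically from the client's Sec-WebSocket-Key (RFC 6455): append a fixed GUID, SHA-1 hash it, and base64-encode the digest. A small Python sketch:

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the WebSocket handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value a server must return."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))
# → s3pPLMBiTxaQ9kYGzzhZRbK+xOo=   (the RFC 6455 sample handshake value)
```

This lets the client verify that the server actually understood the WebSocket upgrade rather than blindly echoing headers.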

Misc Services

  • AWS Tag Editor allows you to manage tags (add/update/delete) of multiple resources at once.
  • Search tagged/untagged resources in all AWS Regions

AWS Cloud Adoption Readiness Tool (CART)

  • Helps Organizations develop effective plans for migration.
  • Answer set of questions across 6 perspectives: business, people, process, platform, operations, security.

AWS Fault Injection Simulator (FIS)

  • A fully managed service for running fault injection experiments on AWS workloads.
  • Based on Chaos Engineering - stressing application and observe systems.
  • Supports: EC2, ECS, EKS, RDS, ...
  • Use pre-built templates that generate the desired disruptions.

Amazon CloudSearch

"Custom Full text search"  "Your Pages" "Your documents" "Auto Complete"
"Term Boosting"  "Faceting - Classifying search results" 

Amazon CloudSearch is a fully managed search service in the cloud for your website, to search your collection of web pages, document files, forum posts, or product information.

AWS IOT Greengrass

.
.                           Factory                    |                AWS
.                                                      |  Deploy
.                               Nucleus MQTT-Broker    |<------->  GreenGrass Cloud Service
.     X.509 Cert     MQTT          Components          |
.      IOT Device   ---------- GreenGrass Core Device  |
.       FreeRTOS                  Linux Machine        |<------->  IOT-Core  Analytics S3
.     Device SDK                  Auth   CLI           |
.     (Sensors)                      JVM               |
.     (Thing)                                          |           Sitewise
.                                                      |

.                  MQTT
.     IOT Device ------->  Data-ATS  --> IOT Core --> Rule --> Timestream DB | DynamoDB
.                          Endpoint
.                  (ATS - Amazon Trust Services for IOT Core)
  • AWS IoT Greengrass enables local processing, messaging, data management, ML inference.
  • Offers prebuilt components to accelerate application development.
  • IOT SiteWise - Collect, organize, analyze data and detect anomalies.
  • IOT Analytics - Pure Analytics.
  • IOT TwinMaker - Digital Representations (twins) of Physical devices.
  • Every account can have limited number of Data-ATS endpoints.
  • Data-ATS is like IOT-Data-SSL-Endpoint supported by ATS - Amazon Trust Services.
  • The endpoint sends/receives data to/from IOT message broker. i.e. Talks MQTT.

AWS CodeArtifact

AWS CodeCommit: Primarily used for version control of source code and other assets. AWS CodeArtifact: Designed for managing and sharing software packages and dependencies.

Amazon Workdocs

Amazon WorkDocs is a fully managed platform for creating, sharing, and enriching digital content. Supports client-managed encryption, versioning, and sharing of documents within and across teams in a secure manner.

Canary vs Blue/Green vs A/B testing

Canary Testing

  • Small subset of users used for Incremental Rollout.
  • Route 53 is used to implement Canary testing using Weighted routing gradually.
  • API Gateway natively supports it (canary release deployments on a stage).
  • Lambda supports it natively via CodeDeploy configurations like CodeDeployDefault.LambdaCanary10Percent10Minutes (and linear variants like LambdaLinear10PercentEvery10Minutes).
  • CodeDeploy supports it for ECS and Lambda (independent of Route 53), e.g. shift only 10% of traffic to the new task set at first.
  • SAM (Serverless Application Model) comes with built-in support, via CodeDeploy, for Lambda canary deployment.
  • CloudFront Lambda@Edge can be used for canary testing: it can intercept and rewrite the request URL to /canary/<original-Url> and select among multiple origins based on the request URL.
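As a sketch, a SAM template can request a Lambda canary deployment like this (the function and alarm names are hypothetical; SAM wires up CodeDeploy behind the scenes):

```yaml
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.12
      AutoPublishAlias: live                 # required for traffic shifting
      DeploymentPreference:
        Type: Canary10Percent10Minutes       # 10% for 10 min, then the rest
        Alarms:
          - !Ref MyErrorAlarm                # roll back if this alarm fires
```

If the alarm fires during the canary window, CodeDeploy shifts traffic back to the previous version automatically.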

Blue/Green Testing

  • Entire Userbase used before seamless version switch.

  • Green: Running new version; Blue: Running existing version (e.g. production)

  • Elastic Beanstalk supports it using the "Swap Environment URLs" feature (CNAME swap):

    .
    .    Route-53 Dynamic Mapping to Blue/Green Load Balancer of Beanstalk
    .
    .    DNS-Web-URL example.com  -------------->  blue.example.com | green.example.com
    .                                              Blue BeanStalk     Green Beanstalk
    .
    .    Note: ELB is optional in Beanstalk.
    .
    
  • Code Deploy supports it in ECS deployment and Lambda functions.

  • Route 53 supports (using weighted route policy) switching CNAME to new env.

  • API Gateway Stages can be used as Blue/Green strategy.

ECS Blue/Green Code Deployment

.
.                       Listener Rule  +----  Blue-Target-Group
.       Load-Balancer ---------------->|
.                       Host SNI Rule  +----  Green-Target-Group
.                       20% - 80% Rule
.                       (Traffic Splitting)
.                       Canary Deployment
.
.   Deployment Type: Blue/Green
.   Traffic Shifting Strategy: Linear | Canary | All-at-once-after-testing
.
  • ECS Blue/Green Code Deployment shares Load Balancer but creates different Target Groups.

Commands:

#
# You can have host-based routing even while traffic shifting is in progress.
# A higher-priority rule (lower number) enables manual testing of the
# Blue/Green environment.
#
aws elbv2 create-rule \
             --listener-arn <your-listener-arn> \
             --priority 10 \
             --conditions Field=host-header,Values=blue.example.com \
             --actions Type=forward,TargetGroupArn=<blue-target-group-arn>
#
# For traffic splitting ... (note: a different priority, since two rules on
# the same listener cannot share one)
aws elbv2 create-rule \
             --listener-arn <your-listener-arn> \
             --priority 20 \
             --conditions Field=path-pattern,Values='/app/*' \
             --actions Type=forward,ForwardConfig="{TargetGroups=[{TargetGroupArn=<blue-target-group-arn>,Weight=80},{TargetGroupArn=<green-target-group-arn>,Weight=20}]}"

A/B Testing

Userbase divided to test different versions.

AWS ML Services

AWS Rekognition

  • Use ML to analyze images or videos.

  • For Images:

    - Detect objects, scenes, and concepts in images. Augment with inline labels.
    - Recognize celebrities
    - Detect text in a variety of languages
    - Detect explicit, inappropriate, or violent content or images - Content Moderation.
    - Detect, analyze, and compare faces and facial attributes like age and emotions
    - Detect the presence of PPE
    
  • For Videos do above and in addition:

    - Track people and objects across video frames
    - Search video for persons of interest
    - Analyze faces for attributes like age and emotions
    - Aggregate and sort analysis results by timestamps and segments
    
  • Used in social media, broadcast, advertising and e-commerce to create safer user experience.

  • Set a min confidence threshold for item to be flagged.

  • Flag sensitive content for manual review in Amazon Augmented AI (A2I) (with augmented labeling)

  • If your goal is only to extract text, use Textract instead, since Rekognition is a heavyweight solution.

Amazon Transcribe

  • Speech to text conversion.
  • Uses a deep learning process: automatic speech recognition (ASR).
  • Multi-language detection.
  • Can automatically remove PII (personally identifiable information) using redaction.
  • You can generate metadata for media assets -- fully searchable.
  • Realtime Transcription is possible.

Amazon Polly

  • Convert text to Speech.
  • Customize pronunciation using Pronunciation Lexicons:
    • Stylized words: st3ph4ne => Stephane
    • Acronyms: AWS => Amazon Web Services
  • Use SSML (Speech Synthesis Markup Language) for customizations such as:
    • emphasizing specific words
    • using phonetic pronunciation
    • including breathing sounds and whispering
    • using Newscaster speaking style
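A short sketch of what such SSML looks like (the sentence itself is made up; the newscaster style additionally requires a neural voice, while whispering works with standard voices):

```python
# Assemble an SSML document exercising emphasis, phonetic pronunciation,
# and the whispered effect listed above.
ssml = (
    '<speak>'
    '<emphasis level="strong">Breaking</emphasis>: '
    '<phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme> prices are up. '
    '<amazon:effect name="whispered">Do not tell anyone.</amazon:effect>'
    '</speak>'
)

# With boto3, this would be passed as (not executed here):
#   polly.synthesize_speech(Text=ssml, TextType="ssml",
#                           VoiceId="Joanna", OutputFormat="mp3")
print(ssml.startswith("<speak>"))
```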

Amazon Translate

  • Can translate large volumes of text efficiently!

Amazon Lex & Connect

.
.          Voice Call
.   User -------------> Lex (Convert to Text)    ------> Connect
.                           (Understand Intent)          (Connect to Backend Workflow)
.                                                        (Uses Lambda)
.
  • Amazon Lex: (same tech that powers Amazon Alexa) :
    • Automatic Speech Recognition to convert speech into text
    • Natural language understanding (NLU) to understand intent.
    • Helps build chatbots and call center bots.
    • Better than Transcribe for this use case, since Transcribe only converts speech to text (no NLU capability).
  • High-Level API Flow Summary for Lex:
    • User input (via voice or text)
    • Client application sends input to Lex Runtime API
    • Lex recognizes the intent and fills slots (missing information, e.g. for BookFlight it asks "which date?"). Maps input to one of the preconfigured intents.
    • Lex invokes AWS Lambda or backend service to fulfill the request
    • Lex sends a response back to the user (via voice or text).
  • Amazon Connect:
    • Receives calls, creates contact flows; a cloud-based virtual contact center.
    • Can integrate with other CRM systems or AWS using Lambda.

Amazon Comprehend

  • For NLP.
  • Fully managed and serverless service.
  • Uses ML to find insights in text:
    • Language of the text; extract key phrases.
    • Perform sentiment analysis.
  • Use cases:
    • Analyze emails for sentiments
    • Group articles by topics

Amazon Comprehend Medical

Amazon Comprehend Medical Service does NLP analysis on unstructured clinical text:

- Physician's notes, Discharge summaries, Test results, Case notes.
- Use NLP to detect Protected Health Information and process accordingly.
- Store documents in S3 and analyze text, media accordingly. 

Amazon SageMaker

  • Fully managed service to build ML models.

Amazon Forecast

  • Use ML to deliver accurate forecasts.
  • Uses time series data
  • Use cases: Product demand planning, Financial Planning, etc.

Amazon Kendra

  • Fully managed document search service powered by ML.
  • Primarily intended for customer service or domain specific search from company network.
  • Extract answers from large collections of documents and sources. (text, pdf, html, powerpoint etc)
  • NL search capabilities.
  • Incremental learning from user interactions/feedback.
  • Fine tune search results ( Relevance, freshness, custom )
  • You can use Amazon Comprehend to create metadata for a document and attach it, then use Kendra to intelligently search those documents.

Amazon Personalize

  • Fully managed ML-Service to build apps with realtime personalized recommendations.
  • Same tech used by amazon.com
  • Use cases: retail stores, media and entertainment.

Amazon Textract

  • Automatically extracts text, handwriting, and data from scanned documents using AI and ML.
|   Image  ----Amazon-Textract--> { DocumentId: "12345", "Name": "", 
|                                   Sex: "F", DOB: "23/05/1980" }

Media Services

.
.   Elemental MediaConvert   -   Video Transcoding - One format to another.
.   Elemental MediaLive      -   Live Video Processing.
.   Elemental MediaPackage   -   Media prepare and packaging. (Scalable streaming)
.   Elemental MediaStore     -   Low Latency Media Storage
.   Elemental MediaTailor    -   Personalized Ad insertion.
.   Amazon Interactive 
.          Video Service IVS -   Live interactive video streaming (like Twitch).
.

Other Services

CodeCommit

  • Discontinued on 25 July 2024
  • Recommended to use Github

Continuous Integration

  • Developers push the code. Build/testing server checks the code as soon as pushed (CodeBuild, Jenkins CI, etc)
  • Developers know the tests passed/failed.

Continuous Delivery

  • Deploy every passing build.
  • Usual tools used:
    • Code Deploy
    • Jenkins CD
    • Spinnaker, etc.
  • Can be done using scripts or CloudFormation or Elastic Beanstalk or ECS etc

CI/CD

  • AWS CI/CD stack: CodeCommit/CodeBuild/CodeDeploy == CodePipeline (Orchestrator)
  • Different environments like Dev/Prod require separate instances of CodePipeline.
  • A Git branch (Dev branch or Prod branch) maps to a single CodePipeline / environment.
  • GitHub supports HTTP webhooks on code updates, and now the CodeStar Source Connection (GitHub App) provides even better integration with CodePipeline.

Code Pipeline

.
.  CodeCommit / CodeBuild / CodeDeploy
.
.  Caching-Artifacts: S3,  EFS, CodeArtifact
.
.

CodeBuild with EFS Caching across stages, yaml config file:

environment:
  computeType: BUILD_GENERAL1_LARGE
  image: aws/codebuild/standard:4.0
  type: LINUX_CONTAINER
fileSystemLocations:
  - location: fs-12345:/build-cache
    mountPoint: /mnt/efs
    type: EFS

Amazon CodeGuru

  • ML Powered service for automated code reviews
  • Supports:
    • CodeGuru Reviewer: static code analysis (development)
    • CodeGuru Profiler: Recommendations about app performance runtime. (production)
  • Supports popular languages including Javascript, Java, Python, etc.

Alexa for Business, Lex & Connect

  • Alexa for Business:
    • Use Alexa to help employees be more productive in meeting rooms
    • e.g. Book a meeting room in their workplace.
  • Amazon Lex: handles customer calls and streams audio, understands intent and invokes Lambda.
  • Amazon Connect: Receives calls and create contact flows. Can integrate with other CRMs.

Kinesis Video Streams

  • One video stream per streaming device (producers):
    • Security Cameras, CCTV, smartphone
    • Can use a Kinesis Video Streams Producer Library
  • Underlying data is stored in S3 (we don't have direct access).
  • Consumers:
    • EC2 instances for realtime analysis or in batch.
    • Can use Kinesis Video Stream Parser Library
    • Integration with AWS Rekognition for facial detection
  • Integration with Rekognition is much easier solution than EC2 parsing the packets. Just feed video stream to Rekognition, output the Metadata stream into Kinesis Data Stream which can be easily processed by Kinesis Data Analytics, etc.

AWS Workspaces

.
.     AWS Workspace === Windows (and Linux) Remote Desktop
.                            Persistent Disk
.                       Microsoft AD FS
.
  • Remote virtual desktop for Windows. Also supports Ubuntu, Amazon Linux, and Red Hat Enterprise Linux.
  • Integrated with Microsoft AD
  • Windows desktop service.
  • Workspaces Application Manager (WAM) supports deploying your own applications as virtualized application containers.
  • Maintenance and updates can be automated.
  • WorkSpaces Directory in primary region can use Microsoft AD (AWS managed) and the secondary (failover) region can use AD Connector.
  • You should create a (connector) service account in on-premise AD (instead of using Domain Admin), if you are connecting to On-premise directory using AD Connector.
  • A route 53 TXT record can indicate failover type or secondary region IP address to auto failover. e.g. desktop.example.com can resolve to primary region but TXT Record may contain hints about how to failover.
  • Amazon WorkDocs can be used to save user data in case of failover.
  • IP Access Control Groups are like security groups: whitelists of IPs or CIDR ranges allowed as source connections.
  • Pricing is on-demand (pay per hour) or monthly.
  • Approximately costs $30 per month per user.

Amazon AppStream 2.0

.
.       AppStream  ===  Access Windows Remote Applications using Browser or Client.
.                                   Not Persistent.
.
.       Hint: Remember as "AppStream is a Stream of Pixels in Windows".

Access any applications and/or non-persistent desktops in your HTML5 browser or Windows Client!

AWS compute and data is co-located and only encrypted pixels are streamed to client! It is like a remote desktop.

AWS Device Farm

  • Application testing service for your mobile and web applications.
  • Test across real browsers and real mobile devices.
  • Fully automated using framework
  • Can remotely log-in to devices for debugging
  • Based on the Appium framework (which builds on Selenium).
  • Competitors include Browserstack, Saucelabs, etc.

AWS Macie

.
.   Macie: Analyze S3 using ML for security and Personal Data.
.          Notify Security Hub.
.
.   Hint: Imagine Macie Cleaning (S3) Bucket full of documents/letters.

Amazon Macie is a data security service that uses ML to protect your sensitive data.

Continuously assesses your S3 buckets for security and access control.

Looks for personally identifiable information (PII).

Runs full discovery scans from an interactive data map.

You can run scan job daily/weekly or just one-time from console. You can also specify export findings frequency.

Generate findings and send to AWS Security Hub or EventBridge.

e.g. A PDF or xlsx file in S3 bucket contains names and credit card numbers.

Amazon SES (simple email service)

  • Fully managed service to send emails at scale.
  • Also could be used to setup inbound emails.
  • Supports DKIM and Sender Policy Framework (SPF)
  • Flexible IP: shared, dedicated and customer-owned IPs.
  • Use cases: Transactional, marketing and bulk email communications.
  • Configuration Sets feature allows you to send events:
    • Event destinations: Kinesis Data Firehose: send metrics (clicks, deliveries, etc)
    • SNS: Immediate feedback on bounce and complaint information.
  • IP Pool management: Use IP pools to send particular types of emails.
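As a sketch of what the SPF/DKIM setup looks like in DNS (the domain and DKIM token are illustrative placeholders; SES generates the real values when you verify the domain):

```
; Easy DKIM: SES provides three CNAME records of this shape
token1._domainkey.example.com.  CNAME  token1.dkim.amazonses.com.

; SPF for a custom MAIL FROM domain (mail.example.com)
mail.example.com.  TXT  "v=spf1 include:amazonses.com ~all"
mail.example.com.  MX   10 feedback-smtp.us-east-1.amazonses.com.
```

The include:amazonses.com mechanism authorizes SES's sending IPs for your domain, and the MX record routes bounce/complaint feedback back through SES.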

Amazon Pinpoint

.                                 Push
.     Mobile <--------> Pinpoint ----------> Email | SMS | Notification | In-App Messaging
.
  • Campaigns at scale using multiple communication channels.
  • Scalable 2-way marketing communication service.
  • Supports email, SMS, push, voice and in-app messaging.
  • Ability to segment and personalize messages with the right content for customers.
  • Possibility to receive replies.
  • Captures message delivery and engagement data.
  • Scales to billions of messages per day.
  • Run campaigns by sending marketing, bulk, or transactional SMS messages.

EC2 Image Builder

  • Used to automate the creation of Virtual Machines or Container Images
  • Free Service.
  • Can publish AMI to multiple regions and multiple accounts.
|              create           create
|  EC2 Image  ------> Builder --------> New-AMI ---> Test --> Distribute
|   Builder            EC2                          EC2        AMI
|
|
  • CloudFormation templates are used to create builder and test ec2 resources.
  • Can secure image with AWS provided or custom templates. e.g. enable software firewall, turn on full disk encryption, etc.

AWS IoT Core

  • Internet Of Things
  • Allows you to easily connect IoT devices to AWS cloud.
  • Serverless, secure, scalable to billions of devices and trillions of messages.
  • Build IoT applications that gather, process, analyze and act on data.
  • Publish & subscribe messages
|
|                                       Actions
| MQTT --> IoT Topic ------> IoT Rules ------->  SQS | SNS | S3 | FireHose etc
|
|
  • MQTT (Message Queuing Telemetry Transport) is a lightweight and widely adopted messaging protocol that is designed for constrained devices.
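To illustrate how lightweight MQTT's framing is, here is a sketch (plain Python, no AWS SDK; the client id is made up) that builds a minimal MQTT 3.1.1 CONNECT packet:

```python
import struct

def mqtt_connect_packet(client_id: str, keepalive: int = 60) -> bytes:
    """Build a minimal MQTT 3.1.1 CONNECT packet (clean session, no auth)."""
    proto = b"\x00\x04MQTT"                    # length-prefixed protocol name
    var_header = proto + bytes([4])            # protocol level 4 (MQTT 3.1.1)
    var_header += bytes([0x02])                # connect flags: clean session
    var_header += struct.pack("!H", keepalive) # keepalive in seconds
    cid = client_id.encode()
    payload = struct.pack("!H", len(cid)) + cid
    body = var_header + payload
    # Fixed header: packet type CONNECT (1) << 4, then remaining length.
    # Single-byte length encoding suffices for bodies under 128 bytes.
    assert len(body) < 128
    return bytes([0x10, len(body)]) + body

pkt = mqtt_connect_packet("sensor-42")
print(len(pkt))  # → 23 : the whole connect request fits in a couple dozen bytes
```

A two-byte fixed header plus a tiny variable header is why MQTT suits constrained sensors far better than HTTP's verbose text headers.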

AWS CDK - Cloud Development Kit

npm install -g aws-cdk

AWS Solutions Reference Architectures

The following patterns were taken from AWS Solutions Constructs.

AWS Solutions Constructs is an open-source extension of the CDK (Cloud Development Kit).

You can generate CloudFormation templates from these constructs.

AWS App Runner

.
.    Run Docker Image with Ease - Serverless Solution.
.
.    AppRunner is independent of ECS, Fargate, Elastic Beanstalk
.
.    Github checkin ==> Can Auto Trigger Build & Running Job
.
.    Built-in Scaling, Built-in Load Balancer, Built-in Health Checks and replace.
.
  • More automatic features, less configurable (compared to EB, ECS, ECS+Fargate).
  • Auto scaling based on traffic -- request count and response times.
  • The only scaling knobs are min instances, max instances, and max concurrency per instance.
  • It may be powered by Fargate, but there are no traces of it in the interfaces.
  • You can point it at a GitHub repository; it can auto-dockerize it for you and deploy the application.
  • In practice, the automatic build-and-deploy path can be buggy.
  • Better to build the image yourself, push it to ECR, and let App Runner deploy the application.

HPC Notes

  • EC2 Enhanced Networking options include:
    • Elastic Network Adapter (ENA): up to 100 Gbps.
    • Intel 82599 VF: up to 10 Gbps (legacy).
    • Elastic Fabric Adapter (EFA): an improved ENA for HPC, Linux only, up to 100 Gbps.
      • Uses the Message Passing Interface (MPI) standard, bypassing the Linux OS network stack.
  • The ENI speed mainly depends on the instance type:
    • General purpose instances (e.g., T3, T3a, M5, M6i): around 0.5-5 Gbps. (MemTip: T-Typical; M-Medium)
    • Compute-optimized instances (e.g., C5, C5n, C6i): 3 to 25 Gbps (burst).
    • Memory-optimized instances (e.g., R5, R5n, R6i): 3 to 25 Gbps (burst). (MemTip: R-Remember-Memory)
    • High-performance networking instances (e.g., C6gn, R6in, M6idn): 50 to 100 Gbps, using the Elastic Network Adapter (ENA) or Elastic Fabric Adapter (EFA) for HPC.
  • Burstable EC2 instances: families like T3, T3a, T4g have burstable network and CPU performance -- temporarily higher speeds depending on accumulated credits and available capacity.
  • AWS ParallelCluster: open-source cluster management tool to deploy HPC on AWS. Automates creation of the VPC, subnets, cluster type, and instance types. Configured with text files.
  • Intel Hyper-Threading makes a single physical core appear as multiple logical processors. Most HPC applications benefit from disabling hyper-threading.

Outposts

.
.    On-Premise                 OutPosts
.    Network     <---------->    Subnet       ------>  AWS
.                              Local Gateway 
.
  • Outpost subnet route tables route to your on-premises network using Local Gateway.
  • The default mode of connection is direct VPC routing, which uses the private IP addresses of the instances.
  • The other option is to use addresses from a customer-owned IP address pool (CoIP) that you provide.
  • Direct VPC routing and CoIP are mutually exclusive options that control how routing works.

Misc Notes

  • An EC2 boot-time (user data) script can install certificates on the instance by retrieving them from the SSM Parameter Store. The EC2 instance's IAM role must have permission (e.g. ssm:GetParameter) to read from the Parameter Store.

  • There are AWS managed Logs --- Load Balancer Access Logs, CloudTrail Logs, VPC Flow Logs, Route 53 Access Logs, S3 Access Logs, etc.

  • aws:PrincipalOrgID is the AWS Organization ID. You can use this condition key in IAM/resource policies to restrict actions to principals belonging to any account in the same Org.
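    For example, a sketch of an S3 bucket policy using this condition key (the bucket name and Org ID are placeholders):

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "AllowOrgPrincipalsOnly",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        "Condition": {
          "StringEquals": { "aws:PrincipalOrgID": "o-xxxxxxxxxx" }
        }
      }]
    }

    The "Principal": "*" is safe here only because the condition narrows access to principals inside the organization.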

  • You can use the EC2 Instance Connect feature to use IAM permissions to SSH into EC2. The AMI should have EC2 Instance Connect pre-installed. A temporary public key is pushed to the instance metadata, and sshd on the instance honors it via the Instance Connect component; the client then connects using the matching private key.

  • AWS Batch supports multi-node Parallel jobs. Schedule jobs that auto launches EC2s.

  • Auto Scaling Groups are usually used with an ALB. Set up rules like "if CPU > 40%, scale out". You can also use Network In/Out, RequestCountPerTarget (from ALB to EC2), or custom metrics. Predictive scaling analyzes historical load and scales in advance.

    Spot Fleet support can mix Spot and On-Demand EC2 instances to keep cost to a minimum.

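As an illustration of the CPU rule above, a target-tracking scaling policy keeping average CPU near 40% can be expressed roughly like this (the JSON shape follows the Auto Scaling target-tracking configuration; values are examples):

```json
{
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ASGAverageCPUUtilization"
  },
  "TargetValue": 40.0
}
```
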
  • AWS App Runner lets you deploy a Docker-image-based application. It provides auto scaling, high availability, and load balancing for rapid production deployment, and supports custom domains for the application. It uses Fargate underneath but provides a simpler workflow.

  • Amazon EKS Anywhere makes EKS available on-premises. Optionally you can use the EKS Connector to connect the cluster to AWS; otherwise it can run fully disconnected from AWS.

  • AWS Lambda coupled with EventBridge can run serverless cron jobs.

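A minimal sketch: an EventBridge rule with a schedule expression such as `cron(0 12 * * ? *)` or `rate(1 hour)` can target a handler like the one below. The event fields shown (`source`, `time`) are part of the standard scheduled-event envelope; the handler body is a placeholder.

```python
def handler(event, context):
    # EventBridge scheduled events arrive with source "aws.events" and an
    # ISO-8601 "time" field; real work would replace this return value.
    return {"source": event.get("source"), "fired_at": event.get("time")}
```
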
  • Lambda supports Node.js, Python, Java, Ruby, etc. Lambda can also run a Docker image, but the image must implement the Lambda Runtime API.

  • Lambda limits:

    • RAM: 128 MB to 10GB memory
    • CPU: Linked with RAM. 2 to 6 vCPUs
    • Timeout - Up to 15 mins.
    • /tmp storage - 10GB
    • Deployment Package - 50 MB (zipped), 250MB (unzipped) including layers.
    • Concurrent executions - 1,000 (soft limit that can be increased)
    • Container Image Size - 10 GB
    • Invocation Payload (Request/response) - 6 MB (sync), 256 KB (async)
    • Async invocations can be configured with dead-letter queues or destinations (SNS, SQS, etc.) for success/failure handling.
  • Lambda can be invoked from ALB, API Gateway, or directly via SDK/CLI. Throttle concurrency at API Gateway to prevent too many concurrent Lambdas; the account-level Lambda concurrency limit also applies.

  • AWS Wavelength embeds AWS compute and storage in telecom providers' datacenters at the edge of 5G networks, bringing AWS services to 5G networks so that traffic does not leave the 5G network. A high-bandwidth connection to the parent AWS Region is optional and possible. Use cases include real-time gaming, smart cities, AR/VR, connected vehicles, etc.

    AWS makes Wavelength Zones available alongside Availability Zones.

  • AWS Local Zones are extensions of a Region that place compute closer to large cities. They are available in addition to the Region's standard Availability Zones and support the most popular services, though some services may not be available. To launch an EC2 instance in a specific Local Zone (say, Boston), first create a subnet in that Local Zone, then launch the instance into that subnet.

  • To check whether a condition key exists in the request context, use the Null condition operator:

    |  "Condition":{"Null":{"aws:TokenIssueTime":"true"}}     # "true": matches when the key does
    |                                                         # not exist in the request context.
    | 
    |  "Condition":{"Null":{"aws:TokenIssueTime":"false"}}    # "false": matches when the key
    |                                                         # exists; its value is immaterial.
    
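To make those semantics concrete, here is a small sketch (not AWS code) of how the Null operator evaluates a key against a request context:

```python
def null_condition_matches(request_context, key, expected):
    # "true"  -> matches when the key is absent from the request context.
    # "false" -> matches when the key is present (its value is immaterial).
    key_absent = key not in request_context
    return key_absent if expected == "true" else not key_absent
```
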
  • AWS Data Pipeline is retired and deprecated in favor of AWS Glue for ETL workflows.

Best Solutions References

See following references.

AWS Solutions Constructs

Open-source extension of the AWS Cloud Development Kit (AWS CDK).