I created this page to give a fuller picture of my skills and experience. Unlike my CV, it contains details such as DevOps principles, architectural diagrams, and sample code.
I will explain each workstream using the DevOps Matrix (next section) and provide code examples where applicable.
I have built a sample microservice (Sample-App) and implemented DevOps practices around it; I will use it often as an example. Please note that the technical implementations are simplified versions of my real corporate experience.
Target Audience: readers with basic knowledge of Agile, CI/CD, and Cloud.
What is DevOps?
DevOps means different things depending on who you're talking to. For a cloud engineer, DevOps is about infrastructure automation; for a tester, it's about test automation; for a developer, it's about CI/CD; and so on. All of them are correct. However, because of these differences, the term becomes confusing, and no single description covers all DevOps areas. That's why I created the DevOps Matrix.
The DevOps Matrix is a representation of DevOps as a whole. It comprises nine DevOps workstreams that distinguish the practices.
To achieve the full potential of DevOps, an organization must consider all workstreams and not just CI/CD or Cloud.
About the Sample-Apps and other tools used
Source code: https://github.com/mikaelvg/devops-demo/tree/master/crud-with-api-testing
Static Code Analysis: https://sonarcloud.io/dashboard?id=mikaelvg_devops-demo // Fix this
For documentation simplicity, I will use specific tools like JIRA as an example for a generic ticketing system.
Agile Adoption
Agile Adoption covers the processes and technical implementations that integrate Agile practices with DevOps automation.
JIRA Ticketing System
CI/CD Integration
Label each story/feature with the release version in which it is included.
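As a sketch, this labeling can be automated from the CI/CD pipeline through JIRA's REST API (`PUT /rest/api/2/issue/{key}` with an `update` payload). The site URL, issue key, and helper names below are illustrative, not a prescribed implementation:

```python
# Sketch: tag a JIRA issue with its release version from the CI/CD pipeline.
# Uses JIRA's REST API v2 update-payload shape; the site URL, issue key,
# and function names are illustrative assumptions.
import json

def build_label_update(release_version):
    """Build the JIRA 'update' payload that adds a release label."""
    return {"update": {"labels": [{"add": f"release-{release_version}"}]}}

def label_issue(issue_key, release_version, send=None):
    """Prepare (and optionally send) the label update; 'send' is injectable for testing."""
    url = f"https://example.atlassian.net/rest/api/2/issue/{issue_key}"
    payload = build_label_update(release_version)
    if send is not None:          # real code would PUT this with authentication
        send(url, json.dumps(payload))
    return payload
```

Usage: `label_issue("APP-123", "1.4.0")` returns the payload that a real pipeline step would send with an authenticated HTTP PUT.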
Developers Experience (DevEx)
You might already know this term, but it has never been formalized as a standard practice, so let me define it by stating the purpose of this workstream.
For example: Deploy the app to a production-like environment and conduct a Systems Integration Test on a local desktop.
1 - How to migrate from SVN to Git
2 - How to migrate from Ant to Maven
3 - How to break a monolith down into smaller applications, e.g., JAR files or microservices
4 - How to externalize configuration variables and store them in Kubernetes ConfigMaps and in Kubernetes or SOPS Secrets
5 - How to fix technical debt reported by SonarQube or by security scans such as AWS Security Hub and Detectify
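Item 4 above can be sketched in Kubernetes manifests; everything here is illustrative (names and values are made up), and in practice the Secret would be encrypted with SOPS before being committed to Git:

```yaml
# Illustrative only: application settings moved out of the code base into
# a ConfigMap (non-sensitive) and a Secret (sensitive).
apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-app-config
data:
  SPRING_PROFILES_ACTIVE: "dev"
  DB_HOST: "postgres.dev.svc.cluster.local"
---
apiVersion: v1
kind: Secret
metadata:
  name: sample-app-secret
type: Opaque
stringData:                 # manage via SOPS when stored in Git
  DB_PASSWORD: "changeme"
```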
//TODO: Deploy this in an aws free tier
Home: http://localhost:9080/
Get all students: http://localhost:9080/api/student
Search by first name: http://localhost:9080/api/fetchstudent?fieldName=firstName&value=Mikael
Continuous Integration
I assume you have some knowledge of CI/CD concepts, so I'll go straight to how I usually design CI/CD pipelines, drawing on my previous projects.
I have implemented this design since 2012 at several companies such as ANZ, DBS, and SC. It works well with development teams and compliance, and it increases delivery speed by moving DevOps-automation responsibilities away from application developers.
Diagram-1 below represents the application lifecycle in relation to CI/CD, from source code down to server deployment. It uses a custom branching strategy (Diagram-2): a combination of a standard branching strategy (Appendix A) and a trunk-based strategy.
Diagram-1
Diagram-2
The Jenkins jobs can be categorized into seven types, each with a different function:
1. Build-Test
• Runs on every code check-in to a FEATURE branch.
• Compiles the code and executes the automated tests.
• Runs security scans and sends alerts via chat for immediate action.
2. Pull-Request
• Runs when a developer requests to merge into the MASTER branch.
• Similar to the Build-Test job, but with additional checks and functions, such as:
o Squashing the commits into one (git squash)
o Enforcing the minimum test coverage
o Enforcing the minimum security checks, etc.
• There are two types of pull-request reviews:
o Automated checks, such as minimum test coverage.
o Human checks, such as technical architecture design.
3. Master (Integration and Snapshot Release)
• Runs on every approved pull request.
• Uploads the latest stable version to the Docker registry, or to Artifactory for libraries such as JAR files.
• Deploys a stable snapshot release to the development environment.
4. Release-Package
• Triggered manually by the release manager.
• Creates the "release candidate" artifacts, including versioning and tagging.
• Uploads the release candidate to the Docker registry or Artifactory.
• Applies Git tagging, Docker tagging, and Artifactory versioning.
5. Release-Deploy
• Not tied to any branch.
• Takes a binary or a Docker image and deploys it.
• Lets you select which version to deploy and to which environment.
• Supports manual or automatic deployments.
6. Relfix
• For UAT and PROD fixes.
• Allows UAT or PROD fixes independently of recent MASTER-branch changes.
• Changes go through a proper testing cycle.
7. Hotfix
• For urgent PROD fixes only.
• Changes are applied directly to the current PROD release branch.
• Note: a new PROD branch is created on every regular production release (not on hotfixes or release fixes).
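To make the Release-Package job's "versioning and tagging" concrete, here is a minimal sketch of the version-bumping and tagging logic. The semantic-versioning scheme, function names, and registry/repository paths are my own illustrative assumptions, not a prescribed implementation:

```python
# Sketch of release-candidate versioning for the Release-Package job.
# Assumes semantic versioning (MAJOR.MINOR.PATCH); a real pipeline would
# read the current tag from Git and push the new tag back.

def bump_version(current, part):
    """Return the next version, e.g. bump_version('1.2.3', 'minor') -> '1.3.0'."""
    major, minor, patch = (int(x) for x in current.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")

def release_tags(app, version):
    """Matching Git tag, Docker tag, and Artifactory path for one release."""
    return {
        "git": f"release/{version}",
        "docker": f"registry.example.com/{app}:{version}",
        "artifactory": f"libs-release/{app}/{version}/{app}-{version}.jar",
    }
```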
3 types of release artifacts produced within CI/CD
Release artifacts are binaries, source code archives, or Docker images that are directly deployable to servers.
1 - Snapshot container images / JAR files
2 - Release-Package
3 - (optional) Release-Archive
Central CI/CD Library
Jenkinsfile Traditional Approach
As part of the code deliverables, CI/CD functionality is typically defined in this file. Each application has its own Jenkinsfile maintained by a developer, so code duplication is very high and the CI/CD code is not properly controlled.
Jenkinsfile Library Approach
With the library approach, the Jenkinsfile contains only five to seven lines of code whose sole purpose is to call the shared library. All build, automated-testing, reporting, and deployment processes are defined in the library, which the DevOps team develops and maintains. As far as application developers are concerned, they only need to add a few lines of code. See the sample below.
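A minimal sketch of such a thin Jenkinsfile, assuming a shared library named devops-lib that exposes a standardPipeline step (both names, and the parameters, are illustrative):

```groovy
// Illustrative only: the application repo's entire Jenkinsfile.
// 'devops-lib' and 'standardPipeline' are assumed names for the
// shared library maintained by the DevOps team.
@Library('devops-lib') _

standardPipeline {
    appName      = 'sample-app'
    buildTool    = 'maven'
    minCoverage  = 80        // enforced by the Pull-Request job
    deployTarget = 'dev'
}
```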
5 types of servers used in CI/CD
1 - Jenkins Agent - Short-lived Agent
• For compilation, automated test execution, and package/image creation.
• Created and destroyed automatically by the Jenkins master.
• Life span = build duration.
• A good tool for this is the Kubernetes plugin, which creates dynamic agents.
2 - Dev Server
• Contains the latest snapshot stable version.
• Developers can deploy anytime.
• Starting from the Dev server, all microservices are deployed together so they can be tested against one another.
• Life span = ad hoc or persistent
3- QA Servers
• Same as the Dev server, but dedicated to QA testing.
• Life span = ad hoc or persistent
4- UAT Server
• Same as the QA server, but the test data are closer to production data.
• Life span = ad hoc or persistent
5. PROD Server
• Highly resilient and fault-tolerant servers.
• EKS nodes are spread across different availability zones (or regions).
Additional Info
• The Jenkins application follows a 100% infrastructure-as-code approach: job creation and admin configuration changes are done via GitOps, not via the web console.
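One way to realize this, as a sketch, is the Jenkins Configuration-as-Code (JCasC) plugin combined with the Job DSL plugin; the values below are illustrative, with the file itself version-controlled and applied on Jenkins startup:

```yaml
# jenkins.yaml, loaded by the Configuration-as-Code plugin and kept in Git.
# Values are illustrative; changes arrive via pull request, not the UI.
jenkins:
  systemMessage: "Managed by GitOps - do not edit via the web console"
  numExecutors: 0            # all builds run on dynamic agents
jobs:
  - script: >
      pipelineJob('sample-app-build-test') {
        definition {
          cpsScm {
            scm { git('https://github.com/mikaelvg/devops-demo.git') }
            scriptPath('Jenkinsfile')
          }
        }
      }
```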
Appendix A
You may want to check out how to implement Centralized DevOps in a nutshell: https://www.devops.ph/centralized-devops
Continuous Test
Continuous Test is not automated testing.
Continuous Test includes automated testing.
Continuous Test is highly integrated with the other DevOps workstreams, from JIRA tickets to compliance reports.
Characteristics of Continuous Testing
1 - Spin-up Tomcat Server
2 - Deploy application
3 - Create database and tables
4 - Upload Test data
5 - Execute API-level automated tests
6 - Refresh test data
High test coverage, low test LOC: this single class provides more than 80% automated-test coverage.
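The six-step cycle above can be sketched end to end in miniature. In this illustrative, self-contained example, Python's built-in HTTP server stands in for Tomcat and the deployed app, and an in-memory SQLite database stands in for the real one; all names are made up:

```python
# Minimal sketch of the continuous-test cycle: spin up a server, "deploy"
# an app, create the database, load test data, run an API-level test, and
# refresh the data. Stdlib stand-ins replace Tomcat and a real database.
import json
import sqlite3
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DB = sqlite3.connect(":memory:", check_same_thread=False)

def create_schema():          # step 3: create database and tables
    DB.execute("CREATE TABLE IF NOT EXISTS student (id INTEGER, first_name TEXT)")

def load_test_data():         # steps 4 and 6: upload / refresh test data
    DB.execute("DELETE FROM student")
    DB.executemany("INSERT INTO student VALUES (?, ?)", [(1, "Mikael"), (2, "Ana")])
    DB.commit()

class StudentHandler(BaseHTTPRequestHandler):   # steps 1-2: the "deployed" app
    def do_GET(self):
        rows = DB.execute("SELECT id, first_name FROM student").fetchall()
        body = json.dumps([{"id": r[0], "firstName": r[1]} for r in rows]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):               # keep test output quiet
        pass

def run_cycle():
    create_schema()
    load_test_data()
    server = HTTPServer(("127.0.0.1", 0), StudentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    # step 5: API-level automated test against the running server
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/api/student") as resp:
        students = json.load(resp)
    assert {s["firstName"] for s in students} == {"Mikael", "Ana"}
    load_test_data()          # step 6: refresh test data for the next run
    server.shutdown()
    return len(students)
```

In a real pipeline the same sequence runs inside the CI job, with the Tomcat/database spin-up typically handled by containers rather than stdlib stand-ins.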
Continuous Deployment
I have a specific definition of Continuous Deployment, and it differs from automated deployment.
As illustrated in this mind map, automated deployment (red font) is just one portion of the overall Continuous Deployment workstream.
Regression Test Pack (RTP)
Deployment Rules
I have Jenkins running on my personal Kubernetes cluster at home. Should you need access, simply message me.
http://jenkins.devops.ph:8888/
Elastic Infra
Elastic Infrastructure is the term I use for cloud infrastructure management frameworks such as AWS EC2, AWS ECS/EKS, Google Cloud Platform, Kubernetes, Helm, and Azure.
I make sure at least 90% of cloud infrastructure resources are written as code (IaC), including configurations and deployment scripts. I use a single codebase to set up production and non-production environments, except for the environment-specific variables. I use the following technologies depending on the purpose.
Kubernetes CI/CD and the Servers of each environment type
About the diagram
• Two Kubernetes clusters are involved: one for CI/CD and DevOps-related tools (Jenkins, SonarQube, etc.), and the other for the server environments (DEV, UAT, PROD).
• EKS comprises multiple nodes spread across different availability zones (or regions).
• My common practice is to have one Kubernetes cluster per environment.
• The application deployment procedure comprises two areas:
o Idempotent deployments that set up the infrastructure, e.g., database creation, folders, etc.
o Once the server/environment is set up, the application itself is deployed.
• Lastly, the creation and updates of the environments are done via CI/CD, e.g., creation of VPCs, K8s clusters, IP whitelisting, etc.
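The idempotent-infrastructure idea can be shown with a toy sketch: running the setup step twice leaves the environment in the same state, so it is safe to run on every deployment. Local folders and an SQLite file stand in for real cloud resources, and all names are illustrative:

```python
# Toy illustration of idempotent environment setup followed by an app
# deployment. Re-running setup_environment changes nothing, which is the
# property that makes it safe inside CI/CD.
import os
import sqlite3

def setup_environment(root):
    """Idempotent infra step: create folders and database schema if absent."""
    os.makedirs(os.path.join(root, "uploads"), exist_ok=True)  # no error if present
    con = sqlite3.connect(os.path.join(root, "app.db"))
    con.execute("CREATE TABLE IF NOT EXISTS student (id INTEGER, first_name TEXT)")
    con.commit()
    con.close()

def deploy_application(root, version):
    """App step: runs only after the environment exists; overwriting is fine."""
    with open(os.path.join(root, "RELEASE"), "w") as f:
        f.write(version)
```

Real equivalents of `setup_environment` are tools like `kubectl apply` or Terraform, which converge to a desired state rather than repeating create actions.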
Basic AWS architecture design to ensure security and high availability.
The main components are described below.
Unlike the previous AWS diagram, this one strips out most of the components and focuses on explaining the high-availability design. It illustrates how the replica pods are distributed across availability zones (or regions).
Aside from GCP Cloud Load Balancing, there's another layer of load balancing within the Kubernetes cluster. The diagram also illustrates Blue/Green deployment. There are several ways to implement Blue/Green deployment; in this example, I use an Ingress resource as the switch to change from blue to green.
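As a sketch of the Ingress-as-switch approach (the host and service names are illustrative), flipping from blue to green is a one-line change to the backend service, re-applied via CI/CD:

```yaml
# Illustrative only: the Ingress routes all traffic to the "blue" Service;
# switching to green means changing the backend service name and re-applying.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app-blue   # change to sample-app-green to switch
                port:
                  number: 80
```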
//TODO - in progress
About the diagram
Accounts
There are 3 required accounts:
1. Main - owned by DevOps, hosting the DevOps tools.
2. Non-Prod - for Dev, UAT, QA, and other non-production environments.
3. Live - for production use only.
Continuous Monitoring
"Continuous monitoring is the process and technology used to detect compliance and risk issues associated with an organization's financial and operational environment. The financial and operational environment consists of people, processes, and systems working together to support efficient and effective operations. Controls are put in place to address risks within these components. Through continuous monitoring of the operations and controls, weak or poorly designed or implemented controls can be corrected or replaced – thus enhancing the organization's operational risk profile. Investors, governments, the public and other stakeholders continue to increase their demands for more effective corporate governance and business transparency."
-- Wikipedia
The ultimate goal of implementing Continuous Monitoring is to improve the company's risk profile and gain the confidence of investors. However, as a technical person, I don't intend to bore you to death with risk management or the details of how it contributes to the company's portfolio; I am not an expert in that field either. This section describes how to monitor items related to application development and support.
The target beneficiaries are developers and technical support. Finance also benefits through reduced operational costs, such as the AWS/GCP cloud bills. To achieve the "company portfolio" level of benefits, we must first address continuous monitoring in our own backyard: the delivery and operations teams.
There is no complete Continuous Monitoring framework in the DevOps world, so I invented one! This framework builds on existing DevOps practices such as the RED and USE methods.
I invented the "Stratum Diagram" to clearly define the different types and layers of monitoring. The name is adopted from stratum, the term geologists use for a bed or layer of sedimentary rock that is visually distinguishable from adjacent layers; each layer provides telltales about the earth's past events.
For DevOps monitoring purposes, each layer corresponds to a network layer. The elongated blocks (e.g., blue) represent monitoring scopes. Each monitoring scope has specific metrics and a purpose; combining multiple monitoring scopes allows us to pinpoint where issues occur.
Blue monitoring-scope
- A sanity test that retrieves data from the database and checks its value. This data should be public, and no authentication should be required.
- A sanity automated test that reads page content whose source is an S3 bucket.
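A minimal sketch of such a sanity check (the URL, field, expected value, and function names are illustrative); the fetch step is injectable so the check itself stays testable without a live environment:

```python
# Sketch of a monitoring-scope sanity check: fetch a known public record
# end to end and compare it against an expected value. 'fetch' is injected
# so the check can be exercised without a live environment.
def sanity_check(fetch, url, field, expected):
    """Return 'OK' when the end-to-end read matches, else 'ALERT'."""
    try:
        record = fetch(url)           # real use: urllib.request + json decoding
    except Exception:
        return "ALERT"                # any failure along the path is a miss
    return "OK" if record.get(field) == expected else "ALERT"

# Usage with a fake fetch standing in for the real HTTP call:
fake = lambda url: {"firstName": "Mikael"}
status = sanity_check(fake, "https://example.com/api/student/1", "firstName", "Mikael")
```

Scheduled from the monitoring system, an "ALERT" result pinpoints the layer where the end-to-end path broke.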
Orange monitoring-scope
- A sanity test that retrieves data from the database and checks its value.
- Sanity automated tests that retrieve data from other microservices, which are in turn connected to their corresponding databases.
Green monitoring-scope
Technical Implementations: Traceability
Kiali is an observability console for Istio with configuration capabilities such as canary traffic adjustment. It helps you understand the structure of your microservices architecture and also shows the health of your services. Kiali provides detailed metrics, and a basic Grafana integration is available for advanced queries.
I took this GIF from my demo app. It shows the transaction flows between microservices. Through Kiali, it is much easier to track the flow of transactions compared with the traditional approach of reading console logs.
Kiali is about more than traceability: you can control the flow of traffic between two versions of the same application, for example to deploy a canary version. See the staff-service and staff-service-risky-version modules (rightmost). I configured Kiali to direct 90% of the traffic to staff-service, the stable version, and 10% to staff-service-risky-version, the latest version. For some use cases, it is beneficial to try the application with a small portion of users first before making a full release.
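Under the hood, this kind of split is an Istio weighted route that Kiali edits on your behalf; a sketch of the resulting resource (hosts and metadata are illustrative, with the service names taken from the demo above):

```yaml
# Illustrative only: Istio VirtualService sending 90% of traffic to the
# stable version and 10% to the risky (canary) version.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: staff-service
spec:
  hosts:
    - staff-service
  http:
    - route:
        - destination:
            host: staff-service
          weight: 90
        - destination:
            host: staff-service-risky-version
          weight: 10
```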
Server Resource Utilization
Suppose you have Metrics Server, or Prometheus and Grafana, installed and configured to monitor your servers, and you have gathered enough data. What data will you gather, and what will you do with it?
Data from CPU and memory utilization reports must be a regular input into how we develop our applications and how we manage CPU/RAM allocations and autoscaling configurations.
For example, in Dockerfiles, Kubernetes pod specs, and ECS task definitions, we initially guess the CPU and RAM allocations. Once monitoring data becomes available, those CPU/RAM values must be tweaked for optimal operation.
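In a Kubernetes pod spec, those guessed values live in the container's resources block; once monitoring shows the real usage profile, the requests and limits are tightened to match it. The numbers below are illustrative starting guesses, not recommendations:

```yaml
# Illustrative only: initial guesses for the sample-app container, to be
# revisited once Prometheus/Grafana show the real usage profile.
resources:
  requests:          # what the scheduler reserves for the container
    cpu: "250m"
    memory: "256Mi"
  limits:            # hard ceiling before throttling / OOM-kill
    cpu: "500m"
    memory: "512Mi"
```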
// TODO: Add Grafana / Prometheus Graphics example of CPU and Memory usage.
//TODO Jaeger
Continuous Compliance
Efforts in this workstream must not be separated from the other DevOps workstreams. Continuous Compliance is a mindset, a guideline for how we should structure the overall DevOps workflow. For example: after the Regression Test Pack runs, it generates automated test reports that document the development team's progress.
The example below is the tool I use to implement the traceability matrix report. In simple terms, the traceability matrix report links requirements/functionalities, test executions and results, test data, application code commits, automated-testing (glue) code, coders' names, reviewers' names, etc.
As a DevOps engineer, this kind of report must not be created manually; rather, it should be a by-product of the processes and technical implementations of the other DevOps workstreams. Through the tool, the report is generated automatically.
For example: if the traceability matrix report is implemented correctly, then in my experience, 80% of audit findings related to application development can be resolved.
Security
My approach to security can be categorized into two disciplines: detection and prevention.
Detection covers the technical debt surfaced by security scans. There are plenty of tools you can use from OWASP.org, and security scans must be part of the CI/CD pipeline and the regular peer-review process.
I am not a hardcore security person, but I love following security best practices and making sure they are part of the IaC code. Here is the AWS security checklist I usually follow:
IAM
S3
Firewall
Security Group
EC2, VPC & EBS
CloudTrail
RDS