In today’s world of digital transformation, HCL is helping its customers with container platform setup and migration. To set up any container platform in the customer environment, the platform architect works closely with the foundation architect. Once the foundation is created, the platform architect should follow certain best practices to create a robust system. If these best practices are not followed diligently, architectural challenges may surface in the future.
The best practices can be put under a few major categories:
After the required networking resources are created by the foundation architect, certain areas must be taken into consideration from the start.
VNet and Subnet
When defining a subnet under a VNet, most customers start with a small range, say “/25”, which does not provide enough IP addresses: a /25 contains only 128 addresses, and a few of those are reserved by the cluster itself. Building a cluster that can scale requires a well-thought-out sizing decision. This should involve a workshop with the customer to understand application details such as the current number of pods, scalability requirements, usage patterns, and load.
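The sizing question can be made concrete with a little arithmetic. The sketch below is only an illustration, not a sizing tool: it assumes the Azure CNI model, in which every node and every potential pod takes an address from the subnet, and that Azure reserves five addresses per subnet. The function names, the default of one surge node for upgrades, and the 30-pods-per-node figure used in the usage note are assumptions introduced for this example.

```python
import ipaddress

# Assumption: Azure reserves 5 addresses in every subnet.
AZURE_RESERVED_IPS = 5


def usable_ips(cidr: str) -> int:
    """Addresses left in the subnet after the platform reservation."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AZURE_RESERVED_IPS, 0)


def required_ips(node_count: int, max_pods_per_node: int, surge_nodes: int = 1) -> int:
    """Estimated demand: one IP per node plus one per potential pod.

    surge_nodes reserves room for the extra nodes created during upgrades
    (hypothetical default of 1).
    """
    total_nodes = node_count + surge_nodes
    return total_nodes * (1 + max_pods_per_node)


def smallest_prefix(needed: int) -> int:
    """Longest IPv4 prefix length whose subnet still fits `needed` usable IPs."""
    total = needed + AZURE_RESERVED_IPS
    host_bits = (total - 1).bit_length()  # ceil(log2(total)) without floats
    return 32 - host_bits
```

Under these assumptions, a /25 leaves 123 usable addresses, while three nodes at 30 pods each plus one surge node already need 124, so `smallest_prefix(124)` suggests at least a /24. Real numbers should come from the customer workshop.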
Temporary Virtual Machine (VM)
For most customers, the infrastructure team creates the foundation. A common problem is that, once the infrastructure team hands over the network foundation to the platform team, the cluster installation fails. This usually happens because the ports and protocols that the Kubernetes cluster requires for egress and ingress traffic are not open. It is therefore recommended to create a temporary VM in the subnet intended for the Kubernetes cluster before installation. The VM must be able to reach the internet and must be reachable from the required subnet or source. This helps verify the traffic flow in advance.
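A quick probe run from the temporary VM can confirm whether outbound connections succeed before the cluster installation is attempted. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the list of endpoints to probe has to come from the cluster's actual requirements.

```python
import socket
from typing import Iterable, List, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_endpoints(
    endpoints: Iterable[Tuple[str, int]], timeout: float = 3.0
) -> List[Tuple[str, int, bool]]:
    """Probe each (host, port) pair and return the result alongside it."""
    return [(host, port, can_connect(host, port, timeout)) for host, port in endpoints]


def format_report(results: Iterable[Tuple[str, int, bool]]) -> List[str]:
    """Turn probe results into lines that can be pasted into a ticket."""
    return [f"{host}:{port} {'open' if ok else 'BLOCKED'}" for host, port, ok in results]
```

For example, `format_report(check_endpoints([("mcr.microsoft.com", 443)]))` run on the VM would show whether HTTPS egress to that registry is allowed. The probe covers TCP only; UDP rules need a different check.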
User-Defined Routes (UDR)
Egress traffic from a Kubernetes cluster must have a defined path: it either leaves through a public load balancer, or User-Defined Routes (UDR) are attached to direct it through a network virtual appliance or firewall. If the second option is chosen and the UDR is not attached, Kubernetes cluster traffic will not work as expected.
Egress and Ingress Traffic
All ingress traffic must follow the same path as egress traffic. A firewall expects symmetric routing, so any packets that return by a different path than the one they arrived on will be dropped.
Multi-Region Traffic Management
In a production environment, deploying the cluster in two regions is a best practice. The deployment can be active-active or active-passive. For this setup to work, a load balancer that sits above the regions must be configured to direct traffic to either of them. Architects can use either Traffic Manager or a service like Azure Front Door, depending on their needs.
Azure Private Link Limitations
Azure Container Registry is regularly scanned by security tools like Qualys to find vulnerabilities in the images. If access to Azure Container Registry is restricted to private endpoints or selected subnets, these security tools cannot scan it for image vulnerabilities. With the same restrictions in place, Azure Container Registry also cannot be accessed by Azure DevOps public agents.
In addition, network policies like network security groups are not yet supported for private endpoints. Enabling private endpoints also prevents engineers from working from outside the network. In that scenario, an engineer has to work through virtual desktop infrastructure (VDI), and the connection to the server is noticeably slower.
For the Kubernetes cluster to work, all the required ports and protocols must be opened in the firewall for ingress and egress traffic. In most customer scenarios, all ports and protocols are closed by default, so egress traffic to the API server does not work. To resolve this, several firewall ports and protocols need to be opened for the API server and other functions to work. It is a recommended practice to create a list of all such rules and share it with the firewall/cybersecurity team early to avoid challenges.
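One way to keep such a rule list shareable is to hold it as structured data and export it as CSV for the firewall team. The sketch below is only an illustration. The `FirewallRule` type and `rules_to_csv` helper are hypothetical names, and the four sample entries are a small, non-exhaustive sample of common AKS egress destinations that must be verified against the current Microsoft documentation for the cluster's configuration.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields
from typing import Iterable


@dataclass(frozen=True)
class FirewallRule:
    direction: str    # "egress" or "ingress"
    destination: str  # FQDN or address range
    protocol: str     # "TCP" or "UDP"
    port: int
    purpose: str


# Illustrative sample only; the full list depends on the cluster setup
# and must be confirmed against the official requirements.
ILLUSTRATIVE_RULES = [
    FirewallRule("egress", "mcr.microsoft.com", "TCP", 443, "Pull system container images"),
    FirewallRule("egress", "management.azure.com", "TCP", 443, "Azure Resource Manager API"),
    FirewallRule("egress", "login.microsoftonline.com", "TCP", 443, "Azure AD sign-in"),
    FirewallRule("egress", "ntp.ubuntu.com", "UDP", 123, "Node time synchronization"),
]


def rules_to_csv(rules: Iterable[FirewallRule]) -> str:
    """Render the rules as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([f.name for f in fields(FirewallRule)])
    for rule in rules:
        writer.writerow(astuple(rule))
    return buffer.getvalue()
```

Calling `rules_to_csv(ILLUSTRATIVE_RULES)` yields text that can be saved to a file and attached to the firewall change request.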
A typical security requirement is to encrypt all traffic end-to-end. Most customers want HTTPS encryption to continue up to the ingress resource installed in the Kubernetes cluster. At that point the certificate is dropped (TLS is terminated) and traffic continues as HTTP. This requires a host name and the correct certificate for every environment. The certificate must be correctly installed on each resource in the ingress path, e.g., Front Door, WAF, and Ingress, and it must be in the correct format and order to be installed. Once the traffic reaches a given resource, the certificate must either be dropped there or carried to the next level. For most customers, the certificate is dropped at the Kubernetes ingress resource, but for a few customers, it is carried through to the applications.
To carry a certificate forward to the application, the following tag needs to be added in ingress deployment file:
Identity and Access Management (IAM)
A Kubernetes cluster is accessed through many channels: the DevOps pipeline, users, admins, applications, and the security and monitoring teams. If user roles and their segregation are not given enough emphasis at the start, access control becomes a big challenge later. As a recommended practice, a separate group should be created for each set of users, and roles should be assigned to each group based on its requirements. The principle of least privilege should be applied to identity and access management, so that each group receives only the specific permissions it needs.
In specific situations, companies apply policies that stop automatic OS updates of the nodes for security reasons. There are also a few policies that disallow traffic from other regions. Such policies should be identified and listed in the initial workshops with the customer.
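The group-per-audience idea can be checked mechanically. The sketch below is a hypothetical illustration of a least-privilege review, not a real access-control integration. It ranks the default user-facing Kubernetes ClusterRoles (`view`, `edit`, `admin`, `cluster-admin`) and flags any group whose assigned role exceeds the ceiling agreed for it. The group names and ceilings are invented for the example.

```python
from typing import Dict, List

# Default user-facing Kubernetes ClusterRoles, ordered from least to most privileged.
ROLE_RANK = {"view": 0, "edit": 1, "admin": 2, "cluster-admin": 3}

# Hypothetical ceilings agreed with the customer for each group.
ALLOWED_MAX = {
    "monitoring-team": "view",
    "security-team": "view",
    "app-developers": "edit",
    "devops-pipeline": "edit",
    "platform-admins": "cluster-admin",
}


def find_violations(assignments: Dict[str, str], allowed_max: Dict[str, str]) -> List[str]:
    """Return the groups whose assigned role exceeds their agreed ceiling.

    Groups with no agreed ceiling are flagged as well (default deny).
    """
    violations = []
    for group, role in assignments.items():
        ceiling = allowed_max.get(group)
        if ceiling is None or ROLE_RANK[role] > ROLE_RANK[ceiling]:
            violations.append(group)
    return sorted(violations)
```

For instance, `find_violations({"monitoring-team": "admin"}, ALLOWED_MAX)` returns `["monitoring-team"]`, because that group was only meant to have read access.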
Other Best Practices
Limitations of Windows Node Pool
Microsoft recommends using a managed identity instead of a service principal. It was expected that a Windows node pool could be easily integrated with Key Vault using managed identity, but pod-based identity is not supported as of now on Windows nodes/pods. Therefore, integrating Key Vault with a Windows node pool in this way is not possible.
Windows Server nodes do not receive daily updates. Instead, one must manually perform an AKS upgrade, which deploys new nodes with the latest base Windows Server image and patches.
Node Pools: System and Users
Microsoft recommends isolating critical system pods from application pods to prevent misconfigured or rogue application pods from accidentally killing system pods. Therefore, keeping separate node pools for system pods and application pods is considered a best practice.
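In Kubernetes, this separation is typically enforced with taints on the system nodes and tolerations on the pods allowed to run there. The sketch below models that matching rule in plain Python to show why an application pod without a toleration cannot land on a tainted system node. It is a simplified illustration, not the scheduler's implementation. The `Taint` and `Toleration` classes are hypothetical stand-ins, and the `CriticalAddonsOnly=true:NoSchedule` taint is assumed to be the one applied to the system node pool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: Optional[str] = None
    operator: str = "Equal"  # "Equal" or "Exists"
    value: Optional[str] = None
    effect: Optional[str] = None  # None matches any effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rule."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return toleration.key is None or toleration.key == taint.key
    return toleration.key == taint.key and (toleration.value or "") == taint.value


def can_schedule(tolerations: Iterable[Toleration], taints: Iterable[Taint]) -> bool:
    """A pod fits a node only if every blocking taint is tolerated."""
    tolerations = list(tolerations)
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)
```

With `Taint("CriticalAddonsOnly", "true", "NoSchedule")` on the system pool, a pod with no tolerations is rejected, while a system pod carrying `Toleration(key="CriticalAddonsOnly", operator="Exists")` is accepted.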
Surging Ahead with Best Practices
Containerization and Kubernetes are the obvious solutions in today’s world of digital transformation. The above best practices can serve as a guide for new and budding architects. These real-world scenarios and best practices are not collectively mentioned in any document. These best practices are written with Azure cloud implementation as the base but can be referred to for any cloud implementation.