Platform Capacity Planning | HCLTech

Platform Capacity Planning
April 27, 2022

Most customers are pursuing digital transformation to deliver on their business roadmaps and adapt to market changes. The COVID-19 pandemic has pushed more people online, greatly increasing traffic to digital platforms. This surge has challenged the performance, scalability, availability, and resiliency of customer platforms and their components. Without the right IT capacity planning approach, addressing such spikes in demand is difficult. Most of the time, a capacity planning exercise is carried out by the infrastructure team, which focuses primarily on infrastructure and pays little attention to application behavior and its parameters. For complex enterprise platforms with many applications, components, and layers, this approach may not yield accurate predictions because it ignores application-level factors.

An application capacity planning exercise can be carried out at different life cycle phases of an application and platform:

  1. In the architecture and design phase: Through performance modeling, capacity and performance can be predicted for a platform that has just completed the architecture and design phase

  2. After completion of the development phase: The capacity of the platform can be assessed by carrying out comprehensive performance testing in an environment that resembles the production environment

  3. In the production phase:

    • Conduct detailed performance testing in a production-like environment and collect the inputs for capacity planning

    • Use the production metrics to do the capacity planning

Carrying out platform capacity planning with the application perspective in mind helps customers pin down the bottleneck layers and take corrective measures to meet future business growth.

In this blog post, we will see how capacity planning can be carried out by collecting production metrics and predicting the platform performance to support future business growth.

Approach

The approach depicted in the diagram below can be followed to conduct capacity planning with the application perspective in consideration:

Figure 1: Capacity planning with the application perspective

Layer Identification

Identifying the layers/components (referred to as layers in this blog post) that could become bottlenecks under future business growth is key to successful capacity planning. Layers that have a direct impact on the end user should be given the highest priority. Medium priority can be given to layers that respond to the user with a delay or include a workflow with manual approvals. Back-end layers such as batch processing and report generation can be given lower priority. Whatever priorities are assigned, ensure that all the critical layers are included in the capacity planning exercise.

Metrics Identification

The infrastructure team mostly captures infrastructure parameters such as CPU, memory, IO, network, and storage for capacity planning. Application-related parameters such as requests/transactions per second, thread count, response time, error types and counts, GC time and frequency, and heap memory usage should also be included to better predict the capacity and performance of the platform. For the database, consider parameters such as query frequency and latency, data volume and growth, file size, cache hit ratio, and record counts for frequently used tables. For scheduled jobs, consider how frequently they are triggered and their execution time. Application parameters will vary with the technologies, IT capacity planning tools, programming languages, dependent systems, products, and frameworks used. Collecting these metrics can take a long time because it involves extensive manual effort. Parameters that have no direct impact on resource utilization can be ignored.

Tools

Analyze and decide on the tools to be used for collecting the finalized performance metrics. Most of the metrics can be captured if an APM tool (like Dynatrace, AppDynamics, New Relic, etc.) is deployed for production monitoring. Still, some metrics require other tools:

  • Splunk or ELK for metrics derived from logs, like request counts, error types and counts, etc.
  • Adobe Analytics/Google Analytics or any other analytics tool to capture the user-related metrics
  • Database monitoring tools or database reports for database metrics
  • Cloud monitoring tools like AWS CloudWatch, Azure Monitor, etc.
  • Prometheus for monitoring Kubernetes/Docker containers
  • JMX monitoring for applications that expose their metrics via JMX, like Ehcache, TIBCO, etc.
  • Native monitoring consoles/tools available in products like RabbitMQ, TIBCO, Solr, etc.

Data Collection

The finalized performance metrics to be captured for different application scenarios depend on how traffic reaches the platform and which pages/use cases it exercises. Fine-grained data should be collected from the existing production environment on an hourly basis for at least a month, covering peak, normal, and non-peak hours. Coarse-grained data should be collected for the last six months to a year, depending on data availability. Include any past peak scenarios and their related performance metrics, if available. Include major production incidents from the last six months to a year, as they have a bearing on application performance during heavy traffic. Gather month-by-month business growth projections for the next year or beyond from the business team. In addition, include any other business trends/factors that might influence traffic growth. A safety margin can be added on top of the business projections to cover situations similar to the COVID-19 pandemic.
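As a rough illustration of how the business projections and the safety margin combine, the sketch below scales a measured baseline peak by each month's projected growth and then by a safety factor. The function name, the input format (cumulative growth over the baseline, in percent), and all numbers are assumptions made for this example, not values from any real platform.

```python
def project_peak_load(baseline_peak_rps, monthly_growth_pct, safety_margin_pct=20.0):
    """Return the projected peak requests/second for each future month.

    baseline_peak_rps:  peak load measured in production today.
    monthly_growth_pct: cumulative growth over the baseline for each month,
                        in percent (hypothetical input format).
    safety_margin_pct:  extra headroom added on top of the business projection
                        (the 20% default is an arbitrary example value).
    """
    safety_factor = 1 + safety_margin_pct / 100
    return [
        baseline_peak_rps * (1 + growth / 100) * safety_factor
        for growth in monthly_growth_pct
    ]


# Hypothetical inputs: 1000 req/s today, growth of 0%, 10%, and 50% over
# the next three months, with a 20% safety margin.
projections = project_peak_load(
    baseline_peak_rps=1000.0,
    monthly_growth_pct=[0.0, 10.0, 50.0],
    safety_margin_pct=20.0,
)
for month, rps in enumerate(projections, start=1):
    print(f"month {month}: {rps:.0f} req/s")
```

The size of the safety margin is a business decision; the point of the sketch is only that it is applied on top of, not instead of, the projected growth.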

Predicting Performance

Based on the performance metrics captured, the following correlations should be made:

  1. Correlation between the user visits/transactions per second and the number of requests going to different layers 

  2. Correlation between the requests and the resource utilization on the servers
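One simple way to approximate the second correlation is a straight-line (least-squares) fit between request rate and resource utilization. The sketch below assumes a linear relationship, which is a simplification: utilization often grows faster than linearly as a server nears saturation, so extrapolations far beyond the observed range should be treated with caution. The sample data and the names `fit_line`, `requests_per_sec`, and `cpu_percent` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys = slope * xs + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical hourly samples: request rate and CPU utilization (%) of one layer.
requests_per_sec = [100, 200, 300, 400, 500]
cpu_percent = [15, 25, 35, 45, 55]

slope, intercept = fit_line(requests_per_sec, cpu_percent)

# Extrapolate to a projected load of 800 req/s (an assumed future peak).
predicted_cpu = slope * 800 + intercept
print(f"predicted CPU at 800 req/s: {predicted_cpu:.1f}%")
```

The same fit can be repeated per layer and per resource (memory, threads, IO) to see which one reaches its limit first.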

While arriving at the correlation, various factors as given below need to be considered:

  • Static and dynamic requests

  • Requests with heavy payloads

  • Resource-intensive requests

  • Batch jobs and their frequency

  • Jobs based on real-time entries

  • DB volume and its growth

  • Number of containers/nodes needed for microservices/application servers

Using the above correlations, predict the traffic that each component will receive under future growth and check whether the component can handle it. Give each component a color code based on its ability to withstand the predicted load. Previous incidents in each component should also be considered before assigning the color code.

  • Green color can be given to the components which can handle the expected load

  • Amber color can be given to the components which may be strained by the future traffic

  • Red color can be given to the components which may not be able to handle the projected load
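The color coding above can be sketched as a small classification function. The utilization thresholds (60% and 80%) and the rule that any past incident raises a green component to amber are assumptions chosen for illustration; each team would set its own cut-offs. The component names are hypothetical.

```python
def color_code(predicted_utilization_pct, past_incidents=0,
               amber_threshold=60.0, red_threshold=80.0):
    """Classify a component as "green", "amber", or "red".

    Thresholds are example values, not recommendations. A component with
    past incidents is never rated better than amber (an assumed policy).
    """
    if predicted_utilization_pct >= red_threshold:
        return "red"
    if predicted_utilization_pct >= amber_threshold or past_incidents > 0:
        return "amber"
    return "green"


# Hypothetical components: (predicted peak utilization %, past incident count).
components = {
    "web-tier": (45.0, 0),
    "search-service": (50.0, 2),
    "order-database": (88.0, 1),
}
ratings = {
    name: color_code(utilization, incidents)
    for name, (utilization, incidents) in components.items()
}
for name, rating in ratings.items():
    print(f"{name}: {rating}")
```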

For the red and amber components, identify the areas which could become bottlenecks under the expected traffic. For cloud-native/microservices-based platforms on the cloud, it is often simpler to resolve a bottleneck by adding more containers/nodes to the affected microservices/layers than it is on an on-premises platform, where hardware procurement is not an easy task. However, adding capacity may only postpone a performance issue if the layer has not been optimized for performance. Hence, before adding containers/nodes, investigate whether the layer itself has scope for improvement and aim for a permanent solution. This will save the additional infrastructure cost.
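Where adding containers/nodes is still being considered after optimization, a back-of-the-envelope count can be derived from the projected peak load and the throughput one node sustains. The sketch below is an assumption-laden estimate: `rps_per_node` would come from performance testing or production metrics, the 70% target utilization is an arbitrary example of headroom, and the model assumes load spreads evenly across nodes.

```python
import math


def nodes_needed(projected_peak_rps, rps_per_node, target_utilization=0.7):
    """Estimate how many nodes keep each node at or below the target utilization.

    projected_peak_rps: projected peak load for the layer.
    rps_per_node:       throughput one node sustains at 100% (measured, assumed known).
    target_utilization: fraction of a node's capacity to plan for (example value).
    """
    if rps_per_node <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("rps_per_node must be > 0 and target_utilization in (0, 1]")
    usable_rps_per_node = rps_per_node * target_utilization
    return max(1, math.ceil(projected_peak_rps / usable_rps_per_node))


# Hypothetical layer: 1800 req/s projected peak, 250 req/s per node.
print(nodes_needed(1800, 250))
```

Comparing this count before and after an optimization that raises `rps_per_node` gives a simple way to express the infrastructure cost an optimization avoids.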

Considering this, the improvement areas should be highlighted accordingly so that they can aid better planning. Categorize the improvement areas for short-term, medium-term, and long-term implementations based on their criticality. Devise a plan to implement the recommendations along with the owners for each of them. Publish the plan to the respective stakeholders and get a sign-off. Start analyzing the critical items based on their priority, provide solutions, implement them, and measure the solution impact. Track all the critical items across all the teams and bring them to closure.

Teams Required

A capacity planning exercise succeeds only when the right teams are involved at the right time. The following teams should be involved in this activity:

  • Performance engineering team
  • Infrastructure team
  • Production support team
  • Application/development team
  • Business team

Case Study

A capacity planning exercise was carried out for one of the largest digital platforms of a leading financial services company, to prepare for the upcoming peak season as well as the increase in traffic due to the COVID-19 pandemic. The platform had been unable to handle a sudden spike in traffic in one of the previous peak seasons. The approach detailed in this blog post, which includes the application perspective, was followed. The team came up with the top critical items to be addressed to support the predicted business growth:

Phase: Layer Identification
Details: Identified more than 30 components/layers in the customer’s digital platform for the capacity planning exercise

Phase: Finalizing Metrics
Details: Finalized the following metrics to capture data for different application scenarios and measure the capacity of the application:

  • CPU, memory, and disk usage
  • Request/transaction/message count, response time
  • GC time, GC frequency
  • Old gen size, new gen size
  • Thread count
  • Data and storage growth
  • Job frequency and duration
  • DB file size, session count

Phase: Tools
Details: Used the following tools for metrics collection:

  • Dynatrace
  • Splunk

Phase: Data Collection
Details: Collected the finalized metrics for more than 150 machines; obtained the previous year's historical data, including production incidents, from the support team

Phase: Business Projection
Details: The business team provided the projection data for a year, and other inputs like business campaigns/initiatives, which would add more traffic to the platform

Phase: Predicting Performance
Details: Based on the predicted load, predicted resource utilization, and the previous history, the team identified the potential bottleneck layers to support the projected business numbers

Phase: Improvement Areas
Details: The team identified the top critical items which would block the projected business growth, and also gave recommendations to resolve the bottlenecks

Table 1: Approach for the capacity planning exercise for the client

 

All the critical items were analyzed further to identify their root causes, and the recommendations provided were implemented. After that, the platform was able to handle 30% more load on the peak season day without any major incidents. The platform could also handle 25-46% more load per month arising from the increase in traffic caused by COVID-19. The changes identified and implemented were mostly application optimizations, with minimal infrastructure changes.

Conclusion

Carrying out platform capacity planning with the application perspective in consideration helps the customers nail down the bottleneck layers and take corrective measures to meet future business growth. This exercise should be conducted at regular intervals for better performance of the platform with reduced production incidents.

About Us

At the PESL delivery unit within the Engineering and R&D Services (ERS) of HCLTech, we offer performance engineering (performance testing, performance optimization, performance benchmarking, site reliability engineering), service virtualization, and test data management services to our customers. As part of our performance engineering services, we also offer capacity planning as a service to our customers. Please reach out at harishkm@hcl.com for any requirements related to capacity planning.
