Systems Management : Changing Paradigm

January 17, 2018

The traditional systems management market is largely centered on owned infrastructure. However, modern applications have massive infrastructure needs and run primarily on cloud-based architectures. Timely problem diagnosis is critical, whether the issue lies in the application or in the infrastructure: if a problem is not diagnosed in time, the cost of failure can rise sharply. Consider two examples:

  • In 2017, a four-hour outage hit AWS's S3 system, which is used by 148,213 sites. S&P 500 companies reported losses of USD 150 million to USD 160 million from this disruption. 54 of the top 100 internet retailers saw performance fall by 20% or more, and three websites went down completely: Express, Lululemon, and One Kings Lane.
  • To illustrate the direct revenue impact of such an outage, consider eBay, with annual revenue of USD 8,979 million in FY 2016. Spread evenly over the 8,760 hours in a year, each hour of service downtime puts roughly USD 1.025 million of revenue at risk.
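The eBay figure above is a simple back-of-the-envelope estimate, and it can be reproduced in a few lines. The sketch below assumes revenue is spread evenly across every hour of a 365-day year, which ignores seasonality and peak trading periods; the function name is illustrative only.

```python
# Rough estimate of revenue at risk per hour of downtime.
# Assumption: revenue is spread evenly over a 365-day year (8,760 hours).
HOURS_PER_YEAR = 365 * 24


def hourly_revenue_at_risk(annual_revenue_musd: float) -> float:
    """Return revenue per hour, in the same unit as the input (USD million)."""
    return annual_revenue_musd / HOURS_PER_YEAR


# eBay FY 2016 annual revenue in USD million, as quoted in the article.
ebay_fy2016_musd = 8979.0
per_hour_musd = hourly_revenue_at_risk(ebay_fy2016_musd)
print(f"Revenue at risk per hour: USD {per_hour_musd:.3f} million")
```

The result is on the order of USD 1 million per hour. A real estimate would weight hours by actual traffic, since an outage during peak hours costs far more than one at night.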

Given the financial impact of such outages, together with the scale demands of current development platforms, traditional systems management products are in dire need of a paradigm shift in how systems are managed.

Services Not Servers

Users are more interested in services than in servers. They care about the uptime of a service and the SLAs for the service delivered. As long as the service SLAs are being met, they see little urgency in an alert that an individual server has gone down. Such alerts still matter, but their criticality has clearly dropped in priority.
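One way to picture this shift is to rank alerts by their effect on the service SLA rather than by the state of a single server. The following is a minimal sketch under assumed names (`ServiceWindow`, `alert_priority`) and an assumed 99.9% availability target; it is not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Availability of one service over a reporting window (hypothetical model)."""
    name: str
    total_minutes: int
    downtime_minutes: int
    sla_target_pct: float

    def availability_pct(self) -> float:
        up = self.total_minutes - self.downtime_minutes
        return 100.0 * up / self.total_minutes

    def sla_met(self) -> bool:
        return self.availability_pct() >= self.sla_target_pct


def alert_priority(server_down: bool, window: ServiceWindow) -> str:
    """Prioritize by service impact first, server state second."""
    if not window.sla_met():
        return "critical"   # the service itself is out of SLA
    if server_down:
        return "low"        # a server is down, but the service still meets its SLA
    return "none"


# A 30-day month has 43,200 minutes; the 99.9% target is an assumed example.
healthy = ServiceWindow("checkout", 43_200, 20, 99.9)
breached = ServiceWindow("checkout", 43_200, 60, 99.9)
print(alert_priority(server_down=True, window=healthy))
print(alert_priority(server_down=True, window=breached))
```

The same server-down event is rated differently depending on whether the service it supports is still within its SLA.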

Single Pane

Application performance monitoring, infrastructure monitoring, network flows: users want the right diagnosis for any degraded service, whether the cause lies in application code, infrastructure, or network latency. A single pane lets the user monitor the performance of business services and drill down quickly to the root cause.

Logs are really critical

Log-based monitoring has arrived in a big way. Tools such as Splunk and the ELK stack are fast gaining popularity for the variety of ways they ingest log data and for the analytical capabilities they integrate for quick, accurate diagnosis. Log data can come from applications, services, or any infrastructure element, including servers, application servers, storage devices, network devices, databases, and enterprise applications such as SAP and SharePoint.
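As a small illustration of what log-based diagnosis involves, the sketch below parses log lines from several sources and counts errors per source. The line format (timestamp, source, level, message) and the sample lines are assumptions made for this example; it does not reflect how Splunk or the ELK stack work internally.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line format: "<timestamp> <source> <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<source>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def error_counts_by_source(lines: Iterable[str]) -> Counter:
    """Count ERROR and FATAL lines per source, skipping lines that do not parse."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in {"ERROR", "FATAL"}:
            counts[match.group("source")] += 1
    return counts


# Made-up sample lines for illustration only.
sample_lines = [
    "2018-01-17T10:00:01Z app-server-1 INFO request served in 120 ms",
    "2018-01-17T10:00:02Z router-1 ERROR interface eth0 flapping",
    "2018-01-17T10:00:03Z app-server-1 ERROR upstream timeout",
    "2018-01-17T10:00:04Z router-1 ERROR packet loss above threshold",
    "not a structured line",
]
counts = error_counts_by_source(sample_lines)
for source, n in counts.most_common():
    print(source, n)
```

Even this simple count points toward the element producing the most errors, which is the kind of signal a dedicated log platform surfaces at much larger scale.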

Analytics and Machine Learning

Machine learning algorithms have a critical role to play in quick, data-driven problem diagnosis. Data reaches the single pane from a variety of sources, and a problem reported at the service level may be caused by a single element at the lowest level of the infrastructure. For example, a problem in a router may affect service performance, application performance, and every device (software or hardware) connected to that router. Collecting all alerts, correlating them, and producing the right diagnosis in real time is critical.
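The router example can be sketched as a correlation over a dependency graph: among all components currently raising alerts, the likely root causes are those with no alerting component underneath them. This is a deliberately simple rule-based sketch, not a machine learning method, and the topology and component names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout-service": ["app-server-1"],
    "app-server-1": ["db-1", "router-1"],
    "db-1": ["router-1"],
    "router-1": [],
}


def upstream(component: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every component that `component` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def root_cause_candidates(alerting: Set[str], graph: Dict[str, List[str]]) -> Set[str]:
    """Alerting components with no alerting dependency beneath them."""
    return {c for c in alerting if not (upstream(c, graph) & alerting)}


alerts = {"checkout-service", "app-server-1", "db-1", "router-1"}
print(root_cause_candidates(alerts, DEPENDS_ON))
```

With all four components alerting, only the router is reported, so four alerts collapse into one diagnosis. A production system would also need to weigh alert timing and incomplete topology data.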

Microservices and API usage patterns

Modern applications are API-centric, driven by growing cloud adoption and by architectures based on microservices. Microservices are generally managed through API management tools, which expose the relevant metrics. Developers therefore need to collaborate to understand those metrics and reach the right diagnosis for any problem. Traditionally, developers have relied heavily on logs.

Efficient capacity planning, based on API and service usage patterns and an understanding of the corresponding infrastructure dependencies, is extremely important.
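A minimal sketch of usage-driven capacity planning: take the observed request rates of the APIs that share an infrastructure tier, find the combined peak, and size the tier with some headroom. The API names, traffic figures, per-instance throughput, and 30% headroom are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def combined_peak(usage_rps: Dict[str, List[float]]) -> float:
    """Peak of the summed request rate across all APIs, per time slot."""
    return max(sum(slot) for slot in zip(*usage_rps.values()))


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve the peak with the given fractional headroom."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(peak_rps * (1.0 + headroom) / rps_per_instance)


# Hypothetical requests per second for two APIs over three time slots.
usage = {
    "orders-api": [200.0, 450.0, 300.0],
    "search-api": [900.0, 1350.0, 1100.0],
}
peak = combined_peak(usage)
print(peak, instances_needed(peak, rps_per_instance=250.0))
```

Sizing on the combined peak rather than on each API's own peak matters when the APIs share servers, because their busy periods may or may not coincide.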

Automation

CIOs expect systems management to be self-diagnosing and self-healing, with machine learning built in. This requires continuously tracking business SLAs, diagnosing any SLA breach correctly, raising the appropriate Service Desk ticket to record the defect, fixing the defect automatically through Runbook Automation, and closing the ticket. This in turn necessitates integrating the various monitoring tools with Service Desk and Runbook Automation tools.
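The loop described above (breach detected, ticket raised, runbook executed, ticket closed) can be sketched as follows. Everything here is a stand-in: `InMemoryServiceDesk`, the `RUNBOOKS` table, and the `disk_full` diagnosis are hypothetical placeholders for a real Service Desk and Runbook Automation integration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ticket:
    ticket_id: int
    summary: str
    status: str = "open"
    notes: List[str] = field(default_factory=list)


class InMemoryServiceDesk:
    """Stand-in for a real Service Desk tool (hypothetical interface)."""

    def __init__(self) -> None:
        self.tickets: Dict[int, Ticket] = {}

    def open(self, summary: str) -> Ticket:
        ticket = Ticket(len(self.tickets) + 1, summary)
        self.tickets[ticket.ticket_id] = ticket
        return ticket

    def close(self, ticket: Ticket, note: str) -> None:
        ticket.notes.append(note)
        ticket.status = "closed"


# Hypothetical runbooks keyed by diagnosis; each returns True when the fix succeeded.
RUNBOOKS: Dict[str, Callable[[], bool]] = {
    "disk_full": lambda: True,  # e.g. rotate logs and free space
}


def handle_breach(service: str, diagnosis: str, desk: InMemoryServiceDesk) -> Ticket:
    """Record the breach, try the matching runbook, close the ticket on success."""
    ticket = desk.open(f"SLA breach on {service}: {diagnosis}")
    runbook = RUNBOOKS.get(diagnosis)
    if runbook is not None and runbook():
        desk.close(ticket, f"auto-remediated via runbook '{diagnosis}'")
    else:
        ticket.notes.append("no runbook succeeded; escalate to an operator")
    return ticket


desk = InMemoryServiceDesk()
fixed = handle_breach("checkout", "disk_full", desk)
escalated = handle_breach("checkout", "unknown_cause", desk)
print(fixed.status, escalated.status)
```

Keeping the ticket open when no runbook applies preserves the audit trail and hands the problem to a person, which is the fallback any self-healing setup still needs.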

A paradigm shift doesn’t happen in a day. Most companies that build systems management tools have already started investing in the initiatives mentioned above. Those that come up with innovative solutions that are easily customizable from one deployment to the next will lead the way. In such a competitive environment, we will surely see customers switching vendors more often. So continuous innovation will be the key!

References

http://www.businessinsider.in/The-massive-AWS-outage-hurt-54-out-of-the-top-100-internet-retailers-but-not-Amazon/articleshow/57419637.cms

https://www.usatoday.com/story/tech/news/2017/02/28/amazons-cloud-service-goes-down-sites-scramble/98530914/

https://finance.google.com/finance?q=NASDAQ%3AEBAY&fstype=ii&ei=tngWWvmiEcLvuASvqoTAAQ