From alert saturation to operational clarity: Our 90-day CSPM journey

Sauveer Ketan Kumar
Solutions architect, HCLTech

Problem statement:

Cloud security tools often promise greater visibility but instead create alert fatigue. In our initial CSPM rollout, 6,000 findings were generated; 95% of them turned out to be risk-accepted or contextually mitigated issues rather than exploitable risks. The greater cost was not the volume of alerts, but the resulting erosion of trust between security and engineering teams.

Business impact

  • Operational efficiency: Time spent by the security team on false positives was reduced from approximately 40 hours per week to less than 2 hours, significantly reducing alert fatigue across security and engineering teams
  • Cost optimization: Approximately $1,000 per month in avoidable spending was eliminated by removing redundant rules and duplicate tooling. An additional $8,000 per month was saved by shutting down unused cloud resources, including idle NAT Gateways and dormant EC2 and RDS instances
  • Compliance posture: Audit preparation effort was materially reduced, with corresponding improvements in compliance outcomes

Strategic insight: Cloud security maturity isn't measured by alert volume, but by the quality of signals, the strength of collaboration with engineering teams and the extent to which security enables business outcomes. Security tools should accelerate development velocity, not impede it.

Why does this matter to organizations?

For technology leaders, this is not only a security challenge but also a direct constraint on business velocity.

  • Developer productivity tax: Engineers spend 15-20% of their time on security false positives
  • Cloud migration delays: Security findings blocking production deployments
  • Audit risk: Poorly tuned tools miss real threats while alerting on non-issues or low-impact findings
  • Talent retention: Over time, this erosion of collaboration might also negatively impact talent retention and organizational effectiveness

Day zero

Enabling a Cloud Security Posture Management (CSPM) solution appears straightforward in principle but presents significant complexity in practice. When tool X was enabled across an environment spanning more than 90 accounts, the initial deployment generated approximately 6,000 alerts despite prior safeguards and precautions.

Why we needed this:

We had:

  • 90+ AWS accounts spanning dev to production environments
  • Multi-region architecture (US primary, Singapore and Frankfurt for global customers)
  • Main services: EC2, RDS, Lambda, S3, ECS, EKS
  • AWS Config and a third-party CSPM tool, partially integrated but not properly monitored
  • Primary cloud: AWS, with a small GCP presence

Security signals were fragmented across multiple tools; as a result, misconfigurations were often identified only during audits, weeks or months after deployment, an unacceptable risk when handling protected health information. We needed a unified approach.

After evaluating several platforms, we chose X (a pseudonym; these learnings apply to any tool). It was chosen for its agentless scanning and, critically, its ability to understand AWS-native controls better than other tools, along with its integrated multicloud capability.

While this article references a third-party CSPM solution, AWS security services have matured significantly and now represent a strong native alternative. AWS Security Hub is evolving into a unified security operations platform, integrating services such as Amazon GuardDuty, Amazon Inspector, Security Hub CSPM and Amazon Macie to analyze threats, vulnerabilities, misconfigurations and sensitive data continuously. The platform is also expanding to support multicloud environments.

Month One:

About 70% of alerts were IAM-related. X was flagging roles as "overly permissive," "can escalate privileges," or "unused for x number of days," among others. These alerts were technically correct but practically wrong.

We worked with X's Customer Success team to enable "Effective Permissions Analysis," a feature that let X understand our layered security model. IAM alerts soon dropped from 4,200 to 750, most of which were unused-role findings. We provided standard roles and IAM users to the application teams, with safeguards using permission boundaries and SCPs. The remaining findings were reduced in severity and eliminated gradually after thorough analysis.
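
IAM's service-last-accessed data is what makes that "thorough analysis" practical before a role is touched. Below is a minimal boto3 sketch of the verification step; the role ARN is illustrative, and pagination and error handling are omitted.

    # Sketch: confirm a role is genuinely unused before removal (ARN is illustrative).
    import time
    import boto3

    iam = boto3.client("iam")

    def last_accessed_services(role_arn):
        """Return per-service last-accessed details for a role."""
        job = iam.generate_service_last_accessed_details(Arn=role_arn)
        while True:
            resp = iam.get_service_last_accessed_details(JobId=job["JobId"])
            if resp["JobStatus"] != "IN_PROGRESS":
                return resp.get("ServicesLastAccessed", [])
            time.sleep(2)

    for svc in last_accessed_services("arn:aws:iam::111111111111:role/app-standard-role"):
        if svc.get("LastAuthenticated"):
            print(svc["ServiceName"], "last used", svc["LastAuthenticated"])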

The S3 Buckets:

We had ~300 alerts for "publicly accessible S3 buckets."

These findings fell into four different categories:

  • Sandbox account buckets tagged DataClassification: Public used for demo datasets
  • Static website hosting buckets: Many were cleaned up afterward, and a cleanup process was put in place
  • CloudFront origin buckets, where public access was not required. For CloudFront distributions using Origin Access Control (or legacy Origin Access Identity), S3 origin buckets were expected to remain private, with access restricted to CloudFront via bucket policies. Any public access detected on such buckets was therefore a legitimate misconfiguration rather than a false positive, and was remediated
  • Production buckets with unintended public exposure, which were treated as critical findings and addressed immediately

We customized the policy to add context:

IF bucket is a CloudFront origin AND has OAI/OAC → Medium severity

IF bucket is a static website, OR is in a sandbox account AND tagged Public → Info only

IF bucket is in production AND public → Critical

This policy refinement significantly reduced noise while preserving strong protections where it mattered most. Most of the remaining alerts pertained to resources in a few accounts that deliberately permitted public access: DMZ accounts with safeguards such as Shield Advanced in place. Most other AWS accounts were strictly private.
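
Expressed as code, the decision logic looks roughly like the sketch below. The input fields (tags, origin status, account type) are assumptions about what a CSPM asset inventory exposes, not tool X's actual rule syntax.

    # Sketch of the contextual severity logic above; input fields are assumed
    # to come from the CSPM tool's asset inventory, not X's actual rule syntax.
    def classify_public_bucket(bucket):
        if bucket["is_cloudfront_origin"] and bucket["has_oac_or_oai"]:
            return "MEDIUM"    # origin should be private; flag for review
        if bucket["is_static_website"] or (
            bucket["account_type"] == "sandbox"
            and bucket["tags"].get("DataClassification") == "Public"
        ):
            return "INFO"      # intentionally public, documented
        if bucket["account_type"] == "production":
            return "CRITICAL"  # unintended exposure; act immediately
        return "HIGH"          # default for any other public bucket (sketch assumption)

    # Example finding as the inventory might present it (illustrative):
    finding = {
        "is_cloudfront_origin": False,
        "has_oac_or_oai": False,
        "is_static_website": False,
        "account_type": "production",
        "tags": {"DataClassification": "Internal"},
    }
    print(classify_public_bucket(finding))  # CRITICAL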

The Encryption Fight:

X wanted all EBS volumes encrypted with Customer Managed Keys (CMKs). However, the organizational standard is for AWS-managed encryption to be enabled by default at the account level for all private accounts; it is secure and cost-effective.

Given the data classification and compensating controls, AWS-managed encryption provided adequate protection; the incremental benefits of CMKs were evaluated, and the gap was accepted as residual risk by the organization.

We turned off policies that contradicted our documented security standards. A few other similar policies about other resources were also disabled.
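
For reference, account-level default EBS encryption with the AWS-managed key can be checked and enabled per region via boto3; the sketch below simplifies the per-account rollout.

    # Sketch: ensure EBS encryption-by-default (AWS-managed key) in one
    # account/region. In practice this ran per account via automation;
    # error handling is omitted for brevity.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    if not ec2.get_ebs_encryption_by_default()["EbsEncryptionByDefault"]:
        ec2.enable_ebs_encryption_by_default()
        # With no CMK configured, new volumes use the AWS-managed aws/ebs key.
    print(ec2.get_ebs_default_kms_key_id()["KmsKeyId"])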

Month Two:

Clearing the backlog was one thing. Preventing it from coming back was another.

  • The Terraform Fix:

    We noticed a pattern: 40+ new alerts every week for "S3 buckets without versioning enabled."

    The root cause was that the Terraform S3 module made versioning optional. Developers would spin up buckets, forget to enable versioning, and trigger an alert. To fix it, the module was changed to make versioning the default.

    Buckets are secure by default. Developers must explicitly opt out (which requires security approval). Alert stream: stopped.

    Versioning adds cost, so we also added a default lifecycle policy that deletes older versions, taking the environment tag into account and aligning with the organization's backup policy.
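
    The fix itself lived in the shared Terraform module, but the end state it enforces looks roughly like the boto3 calls below; the bucket name and retention window are examples, not our actual policy.

      # Illustration of the defaults the Terraform module enforces (the real
      # fix was in the module). Bucket name and retention days are examples.
      import boto3

      s3 = boto3.client("s3")
      bucket = "example-team-bucket"

      s3.put_bucket_versioning(
          Bucket=bucket,
          VersioningConfiguration={"Status": "Enabled"},
      )
      s3.put_bucket_lifecycle_configuration(
          Bucket=bucket,
          LifecycleConfiguration={
              "Rules": [{
                  "ID": "expire-noncurrent-versions",
                  "Status": "Enabled",
                  "Filter": {},  # apply to all objects
                  "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
              }]
          },
      )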

    This became the norm. Fix the factory, not the products.

  • Bi-Weekly sync:

    We started bi-weekly standups with platform engineering and DevOps to identify patterns, fix the modules and stop the alerts. This evolved into a security operating model and became part of the governance framework.

  • Threat detection:

    We enabled X's Threat Detection and immediately got bombarded with "unusual login location" alerts. We weren’t compromised. It was Zscaler.

    Our corporate laptops route through Zscaler's global secure web gateway. CloudTrail logs showed our engineers "logging in" from whichever Zscaler exit node they happened to use, which was actually normal.

    The fix was to whitelist Zscaler's IP ranges, which eliminated 100+ weekly false positives.
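
    The allowlist was configured inside X, but the triage logic is simple enough to sketch; the CIDRs below are placeholders standing in for Zscaler's published ranges.

      # Triage sketch: suppress "unusual location" alerts for logins from
      # known egress ranges. CIDRs are placeholders; use Zscaler's published
      # ranges (the real allowlist was configured in tool X).
      import ipaddress

      KNOWN_EGRESS = [ipaddress.ip_network(c) for c in ("203.0.113.0/24", "198.51.100.0/24")]

      def is_known_egress(source_ip):
          addr = ipaddress.ip_address(source_ip)
          return any(addr in net for net in KNOWN_EGRESS)

      print(is_known_egress("203.0.113.42"))  # True -> suppress the alert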

The mistakes:

  • Automated role deletion

    We deleted:

    • A quarterly financial reporting role (last run: 58 days ago)
    • A monthly release pipeline role

    Even though deletion followed multiple emails to a distribution list that included the application teams, nobody paid attention to them. Getting people to pay heed is its own challenge, one we later addressed through the bi-weekly meetings.

    Lesson: 90-day minimum aging period. Exclusion tags for critical roles. 14-day "soft delete" (disable, don't delete) with notifications.
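
    A boto3 sketch of those guardrails follows; the exclusion tag name and the deny-based "disable" step reflect our conventions and are simplified here.

      # Sketch of the soft-delete guardrails: respect a minimum aging period
      # and an exclusion tag, and "disable" by denying sts:AssumeRole instead
      # of deleting. Tag name and thresholds are our conventions, simplified.
      import json
      from datetime import datetime, timezone
      import boto3

      iam = boto3.client("iam")
      DENY_ALL_TRUST = json.dumps({
          "Version": "2012-10-17",
          "Statement": [{"Effect": "Deny", "Principal": {"AWS": "*"},
                         "Action": "sts:AssumeRole"}],
      })

      def soft_delete_if_stale(role_name, min_age_days=90):
          role = iam.get_role(RoleName=role_name)["Role"]
          tags = {t["Key"]: t["Value"]
                  for t in iam.list_role_tags(RoleName=role_name)["Tags"]}
          if tags.get("LifecycleExclusion") == "true":
              return "excluded"
          last_used = role.get("RoleLastUsed", {}).get("LastUsedDate")
          if last_used and (datetime.now(timezone.utc) - last_used).days < min_age_days:
              return "recently used"
          # Disable, don't delete: nothing can assume the role, but it is recoverable.
          iam.update_assume_role_policy(RoleName=role_name, PolicyDocument=DENY_ALL_TRUST)
          return "disabled (pending 14-day review)"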

  • The unencrypted database we missed

    All our RDS instances are encrypted because we have an SCP enforcing it. We downgraded all encryption findings from High to Low.
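
    As a minimal sketch, such an SCP can hinge on the rds:StorageEncrypted condition key; the real policy carried more statements and exceptions, and the target ID below is a placeholder.

      # Minimal sketch of an SCP that blocks unencrypted RDS creation. The
      # real policy had more statements; the TargetId is a placeholder.
      import json
      import boto3

      SCP = {
          "Version": "2012-10-17",
          "Statement": [{
              "Sid": "DenyUnencryptedRDS",
              "Effect": "Deny",
              "Action": ["rds:CreateDBInstance"],
              "Resource": "*",
              "Condition": {"Bool": {"rds:StorageEncrypted": "false"}},
          }],
      }

      org = boto3.client("organizations")
      policy = org.create_policy(
          Name="deny-unencrypted-rds",
          Description="Require storage encryption on new RDS instances",
          Type="SERVICE_CONTROL_POLICY",
          Content=json.dumps(SCP),
      )
      org.attach_policy(PolicyId=policy["Policy"]["PolicySummary"]["Id"],
                        TargetId="ou-placeholder")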

    We had an RDS instance in a sandbox account that predated the SCP. It held a copy of unencrypted production data. An engineer had investigated encryption coverage using AWS Config but missed this instance; as it turned out, the sandbox account was not properly integrated with Control Tower, which we discovered during an audit.

    This was also a failure of tagging and data governance; this RDS did not have the correct DataClassification tag.
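
    A sweep like the following would have caught it; the sketch covers a single account and region, whereas the real check must iterate the whole organization.

      # Sketch of a sweep that flags RDS instances that are unencrypted or
      # missing a DataClassification tag. Single account/region shown.
      import boto3

      rds = boto3.client("rds", region_name="us-east-1")

      for page in rds.get_paginator("describe_db_instances").paginate():
          for db in page["DBInstances"]:
              tags = {t["Key"]: t["Value"]
                      for t in rds.list_tags_for_resource(
                          ResourceName=db["DBInstanceArn"])["TagList"]}
              if not db["StorageEncrypted"]:
                  print(db["DBInstanceIdentifier"], "is UNENCRYPTED")
              if "DataClassification" not in tags:
                  print(db["DBInstanceIdentifier"], "missing DataClassification tag")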

Streamline phase:

After three months, things were streamlined.

  • ~10 new findings per week (down from 200+)
  • ~1 critical alert per week on average, investigated and actioned immediately
  • ~1-2 high-severity alerts actioned quickly
  • Medium/low alerts batched into bi-weekly remediation sprints
  • Tickets were automatically routed to various teams through the ticketing tool. Everyone was made aware of security policies through bi-weekly meetings

Prerequisites:

  • AWS Organizations with a clear account structure
  • Basic tagging strategy (Environment, Application, Owner, Data Classification at minimum)
  • Change management process

Highly Recommended:

  • Service Control Policies
  • Infrastructure as Code (Terraform/CloudFormation)
  • Centralized logging

Nice to Have:

  • Permission boundaries
  • Existing security metrics (for before/after comparison)

Things to note

  • Context beats coverage. Don't enable every policy on Day 1. Start with production and critical controls, and expand gradually
    • Control context (SCPs, permission boundaries)
    • Data context (classification tags)
    • Environment context (prod vs sandbox)
    • Network context (DMZ vs private)
  • High alert volume does not necessarily indicate better risk management. Fewer alerts with higher risk signal density is a sign of maturity
  • Build a relationship with Customer Success. We had monthly sync-ups with X to preview new features and share feedback
  • Fix the factories, not the products. Updating Terraform modules prevents thousands of future alerts. Sending individual remediation tickets doesn't
  • Automate carefully. We auto-remediate low-risk tasks (unused security groups after 90 days); a sketch follows this list. Humans still approve IAM changes and production modifications
  • Culture beats technology. The best security tool in the world fails if engineering doesn't trust you. Having focused conversations is a must
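
As an example of that low-risk automation, here is a minimal sketch of the unused-security-group cleanup; "unused" is defined as no attached network interfaces, and the 90-day aging gate is left out for brevity.

    # Sketch of the unused-security-group cleanup: "unused" means no attached
    # network interfaces. Aging gate omitted; deletion left commented out.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def unused_security_groups():
        for sg in ec2.describe_security_groups()["SecurityGroups"]:
            if sg["GroupName"] == "default":
                continue  # default SGs cannot be deleted
            enis = ec2.describe_network_interfaces(
                Filters=[{"Name": "group-id", "Values": [sg["GroupId"]]}]
            )["NetworkInterfaces"]
            if not enis:
                yield sg["GroupId"]

    for sg_id in unused_security_groups():
        print("candidate for deletion:", sg_id)
        # ec2.delete_security_group(GroupId=sg_id)  # enable after review/aging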

Security is most effective when it is trusted, unobtrusive and consistent. The objective is not the elimination of alerts, but the delivery of accurate signals at appropriate severity levels, supported by teams accountable for remediation.

Co-authors

Virendra Singh Saini, Sr. Solutions Architect, HCLTech
Manimaran B, Service Line Head, AWS, HCLTech