
In my last project, we were scaling our infrastructure rapidly, and like many teams, we relied heavily on Terraform for provisioning. Our CI/CD pipeline included a crucial step: terraform plan. I remember a time when I genuinely believed that if terraform plan looked good, we were golden. It would show me exactly what changes were coming, and a quick glance, perhaps a senior engineer's review, felt sufficient. What could go wrong?
Then came the call at 3 AM. A seemingly innocuous change to a new S3 bucket, approved after a quick plan review, had inadvertently left the bucket publicly accessible. It wasn't just a compliance issue; sensitive (though non-critical) logs were exposed for a few terrifying hours until we scrambled to fix it. My heart sank. "How did this slip through?" I kept asking myself. "Didn't terraform plan show us everything?"
The Pain Point: Why terraform plan Isn't Enough
terraform plan is an indispensable tool, don't get me wrong. It provides a dry run of your infrastructure changes, showing you what resources will be created, modified, or destroyed. It catches syntax errors, validates resource arguments against the provider schema, and gives you a clear diff of the state changes.
However, what it doesn't do effectively is enforce your organization's specific security, compliance, or operational policies. It doesn't know that your S3 buckets must always be encrypted with KMS, or that your EC2 instances can never have public IP addresses, or that your databases must always reside in private subnets. These are semantic policies, not just syntactic rules. Relying solely on manual reviews or static analysis tools that only check against general best practices is like building a house with a blueprint but no building inspector.
The cost of these overlooked misconfigurations can be steep:
- Security Breaches: Publicly exposed data, weak access controls.
- Compliance Violations: Failing audits, incurring hefty fines.
- Operational Downtime: Incorrect configurations leading to outages or performance issues.
- Rework and Delays: Catching issues in production means costly, rushed fixes.
Lesson Learned: "The biggest mistake we made was assuming terraform plan, combined with human review, could catch all policy violations. It’s excellent for validating state changes, but it's blind to the semantic meaning of your organization's specific rules. We needed a programmatic 'building inspector'."
The Core Idea: Policy as Code with OPA and Terratest
To move beyond this limitation, we embraced Policy as Code (PaC). The idea is simple: treat your security and compliance policies like any other codebase. Define them, version them, test them, and integrate them into your automated workflows. For this, we brought two powerful tools into our stack:
1. Open Policy Agent (OPA)
OPA is a general-purpose policy engine that enables you to externalize policy decision-making from your services. It uses a high-level declarative language called Rego to define policies. The beauty of OPA is its versatility; it can enforce policies across microservices, Kubernetes, CI/CD pipelines, and of course, Infrastructure as Code.
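To give a flavor of Rego before we get to Terraform-specific policies, here is a minimal, hypothetical authorization rule (the `authz` package name and input shape are purely illustrative):

```rego
package authz

# Deny by default; allow only when a rule below succeeds.
default allow = false

# Allow requests made by an admin user.
allow {
    input.user.role == "admin"
}
```

Evaluating `data.authz.allow` against an input document yields a true/false decision that a service, admission controller, or CI job can act on.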
2. Terratest
Terratest is a Go library that helps you write automated tests for your infrastructure. Unlike OPA, which can validate against the plan output, Terratest actually deploys your infrastructure (typically in a temporary environment) and then runs assertions against the real-world state. This was the critical missing piece for us, bridging the gap between a plan and deployed reality.
Our new workflow became: terraform plan → OPA policy enforcement against plan JSON → Terratest deployment and real-world policy validation. This layered approach gave us much greater confidence.
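As a rough sketch, the CI gate looked something like the following (file paths, the policy package, and the test layout here are illustrative, not our exact setup):

```shell
#!/usr/bin/env bash
# Sketch of the CI gate: plan -> OPA check -> Terratest validation.
set -euo pipefail

# 1. Produce a plan and render it as JSON for OPA.
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json

# 2. Fail the pipeline if any deny rule fires against the plan.
opa eval --fail-defined -d policies/ -i tfplan.json "data.terraform.analysis.deny[msg]"

# 3. Deploy to a sandbox environment and validate the live state.
go test -v ./test -timeout 45m
```

The `--fail-defined` flag makes `opa eval` exit non-zero whenever the query produces a result, which is exactly what a pipeline gate needs.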
Deep Dive: Architecture and Code Example
Let's walk through an example. Imagine our policy states: "All AWS S3 buckets must enforce server-side encryption (SSE) and must block public access."
1. The Problematic Terraform Configuration
Consider this simplified Terraform module for an S3 bucket. A developer might forget to enable SSE or configure public access blocking explicitly.
```hcl
# modules/s3_bucket/main.tf

resource "aws_s3_bucket" "my_bucket" {
  bucket = var.bucket_name
  acl    = "private" # Intention is private, but a policy is needed for enforcement
}

resource "aws_s3_bucket_server_side_encryption_configuration" "encryption" {
  count  = var.enable_encryption ? 1 : 0
  bucket = aws_s3_bucket.my_bucket.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "public_access_block" {
  count  = var.block_public_access ? 1 : 0
  bucket = aws_s3_bucket.my_bucket.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

variable "bucket_name" {
  description = "Name of the S3 bucket"
  type        = string
}

variable "enable_encryption" {
  description = "Whether to enable server-side encryption"
  type        = bool
  default     = false # Defaulting to false is a potential policy violation!
}

variable "block_public_access" {
  description = "Whether to block all public access to the bucket"
  type        = bool
  default     = false # Defaulting to false is a potential policy violation!
}
```
If a user instantiates this module without setting enable_encryption = true and block_public_access = true, terraform plan will happily proceed, creating an unencrypted, publicly accessible bucket.
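For context, the policy input is the JSON rendering of that plan. A heavily trimmed, illustrative `resource_changes` entry for the bucket looks roughly like this:

```json
{
  "resource_changes": [
    {
      "address": "aws_s3_bucket.my_bucket",
      "type": "aws_s3_bucket",
      "change": {
        "actions": ["create"],
        "after": {
          "bucket": "my-app-logs"
        }
      }
    }
  ]
}
```

Note what's missing: with the defaults above, no `aws_s3_bucket_server_side_encryption_configuration` or `aws_s3_bucket_public_access_block` entries appear at all. Detecting that *absence* is the policy's job.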
2. The OPA Policy (Rego)
Here's a Rego policy (s3_policy.rego) that detects such misconfigurations by evaluating the JSON output of terraform plan. Violations accumulate in a `deny` set; an empty set means the plan is compliant:

```rego
package terraform.analysis

# Rule: deny newly created S3 buckets when no encryption resource is planned.
deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    resource.change.actions[_] == "create" # Only check new buckets
    not encryption_configured
    msg := sprintf("S3 bucket '%s' must have server-side encryption configured.", [resource.change.after.bucket])
}

# Rule: deny newly created S3 buckets when no fully configured
# public access block is planned.
deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    resource.change.actions[_] == "create" # Only check new buckets
    not public_access_blocked
    msg := sprintf("S3 bucket '%s' must block all public access.", [resource.change.after.bucket])
}

# The child resources' `bucket` attribute is computed (unknown) at plan time,
# so we check for their existence rather than matching them per bucket.
# That is sufficient for a single-bucket module.
encryption_configured {
    enc := input.resource_changes[_]
    enc.type == "aws_s3_bucket_server_side_encryption_configuration"
    enc.change.actions[_] == "create"
}

public_access_blocked {
    pab := input.resource_changes[_]
    pab.type == "aws_s3_bucket_public_access_block"
    pab.change.actions[_] == "create"
    pab.change.after.block_public_acls == true
    pab.change.after.block_public_policy == true
    pab.change.after.ignore_public_acls == true
    pab.change.after.restrict_public_buckets == true
}
```
Generate the plan JSON and run OPA against it:

```shell
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
opa eval --format pretty -d s3_policy.rego -i tfplan.json "data.terraform.analysis.deny"
```
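If the plan-JSON structure is new to you, this small standalone Go sketch (separate from our pipeline; the inline plan snippet is heavily trimmed and illustrative) performs the same "absence" check as the public-access Rego rule, in plain Go:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// planJSON models the minimal slice of the `terraform show -json` schema we need.
type planJSON struct {
	ResourceChanges []struct {
		Type   string `json:"type"`
		Change struct {
			Actions []string               `json:"actions"`
			After   map[string]interface{} `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// samplePlan is an illustrative, heavily trimmed plan: one new bucket,
// no accompanying aws_s3_bucket_public_access_block resource.
const samplePlan = `{
  "resource_changes": [
    {"type": "aws_s3_bucket",
     "change": {"actions": ["create"], "after": {"bucket": "my-app-logs"}}}
  ]
}`

// findViolations returns one message per created bucket lacking a public access block.
func findViolations(raw []byte) []string {
	var plan planJSON
	if err := json.Unmarshal(raw, &plan); err != nil {
		return []string{err.Error()}
	}
	// Record which buckets have a planned public access block.
	blocked := map[string]bool{}
	for _, rc := range plan.ResourceChanges {
		if rc.Type == "aws_s3_bucket_public_access_block" {
			if b, ok := rc.Change.After["bucket"].(string); ok {
				blocked[b] = true
			}
		}
	}
	var msgs []string
	for _, rc := range plan.ResourceChanges {
		if rc.Type == "aws_s3_bucket" && contains(rc.Change.Actions, "create") {
			name, _ := rc.Change.After["bucket"].(string)
			if !blocked[name] {
				msgs = append(msgs, fmt.Sprintf("S3 bucket %q must block all public access.", name))
			}
		}
	}
	return msgs
}

func contains(xs []string, want string) bool {
	for _, x := range xs {
		if x == want {
			return true
		}
	}
	return false
}

func main() {
	for _, msg := range findViolations([]byte(samplePlan)) {
		fmt.Println(msg)
	}
}
```

Running it prints a violation for the trimmed sample plan, since no `aws_s3_bucket_public_access_block` accompanies the bucket.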
This is a great first line of defense, but it has limits. What if a resource has `count = 0`, so the plan simply contains no entry from which to infer the *absence* of a critical resource? What if a computed value, known only after apply, affects a policy decision? This is where Terratest comes in.
3. Real-world Validation with Terratest
Terratest allows us to actually deploy the infrastructure, fetch its attributes, and then run assertions. We can even integrate OPA with Terratest for a powerful combination.
```go
// test/s3_policy_test.go
package test

import (
	"encoding/json"
	"os"
	"os/exec"
	"path/filepath"
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestS3BucketPolicy(t *testing.T) {
	t.Parallel()

	planPath := filepath.Join(t.TempDir(), "tfplan.binary")
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/s3_bucket",
		Vars: map[string]interface{}{
			// Deliberately rely on the non-compliant defaults for
			// enable_encryption and block_public_access.
			"bucket_name": "my-terratest-policy-bucket",
		},
		PlanFilePath: planPath,
	})

	// Produce a plan (no apply) and render it as JSON for OPA.
	terraform.InitAndPlan(t, terraformOptions)
	planJSON := terraform.Show(t, terraformOptions)
	planFile := filepath.Join(t.TempDir(), "tfplan.json")
	require.NoError(t, os.WriteFile(planFile, []byte(planJSON), 0o644))

	// Evaluate the Rego policy against the plan JSON.
	opaOutput, err := exec.Command(
		"opa", "eval", "--format", "json",
		"-d", "../policies/s3_policy.rego",
		"-i", planFile,
		"data.terraform.analysis.deny",
	).Output()
	require.NoError(t, err, "error running OPA")

	// opa eval wraps results as result[0].expressions[0].value.
	var opaResult struct {
		Result []struct {
			Expressions []struct {
				Value []interface{} `json:"value"`
			} `json:"expressions"`
		} `json:"result"`
	}
	require.NoError(t, json.Unmarshal(opaOutput, &opaResult), "failed to unmarshal OPA output")
	require.NotEmpty(t, opaResult.Result)

	// Expect at least one deny message for the misconfigured bucket.
	violations := opaResult.Result[0].Expressions[0].Value
	assert.NotEmpty(t, violations, "OPA policy should have denied the deployment due to misconfiguration")
}
```
This first test never applies anything: it leverages `terraform plan` to produce JSON and feeds it to OPA. Terratest's real strength, though, is validating the *actual deployed state*. If a policy depends on attributes only available post-deployment (for example, the IAM permissions actually attached to a Lambda function), you must apply and inspect. For our S3 bucket, we can apply the module in a temporary environment and then query the live bucket through the AWS SDK S3 client that Terratest's `aws` module provides.
Here's how we can refine the test to *directly verify the deployed state* rather than just the plan, demonstrating Terratest's true power:
```go
// test/s3_live_policy_test.go
package test

import (
	"fmt"
	"strings"
	"testing"

	awssdk "github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestS3BucketLivePolicy(t *testing.T) {
	t.Parallel()

	// Pick a random region to avoid dependencies between tests,
	// and a unique bucket name to avoid collisions.
	awsRegion := aws.GetRandomStableRegion(t, []string{"us-east-1", "us-east-2", "us-west-1"}, nil)
	bucketName := fmt.Sprintf("my-terratest-bucket-%s", strings.ToLower(random.UniqueId()))

	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/s3_bucket",
		Vars: map[string]interface{}{
			"bucket_name":         bucketName,
			"enable_encryption":   true,
			"block_public_access": true,
		},
		EnvVars: map[string]string{"AWS_DEFAULT_REGION": awsRegion},
	})

	defer terraform.Destroy(t, terraformOptions) // always clean up the temporary bucket
	terraform.InitAndApply(t, terraformOptions)

	// Verify server-side encryption on the live bucket via the AWS SDK client.
	s3Client := aws.NewS3Client(t, awsRegion)
	encOut, err := s3Client.GetBucketEncryption(&s3.GetBucketEncryptionInput{Bucket: awssdk.String(bucketName)})
	require.NoError(t, err, "bucket should have an encryption configuration")
	rule := encOut.ServerSideEncryptionConfiguration.Rules[0].ApplyServerSideEncryptionByDefault
	assert.Equal(t, "AES256", awssdk.StringValue(rule.SSEAlgorithm), "S3 bucket encryption algorithm should be AES256")

	// Verify the public access block on the live bucket.
	pabOut, err := s3Client.GetPublicAccessBlock(&s3.GetPublicAccessBlockInput{Bucket: awssdk.String(bucketName)})
	require.NoError(t, err, "bucket should have a public access block")
	pab := pabOut.PublicAccessBlockConfiguration
	assert.True(t, awssdk.BoolValue(pab.BlockPublicAcls), "S3 bucket should block public ACLs")
	assert.True(t, awssdk.BoolValue(pab.BlockPublicPolicy), "S3 bucket should block public policies")
	assert.True(t, awssdk.BoolValue(pab.IgnorePublicAcls), "S3 bucket should ignore public ACLs")
	assert.True(t, awssdk.BoolValue(pab.RestrictPublicBuckets), "S3 bucket should restrict public buckets")
}
```
This second Terratest example verifies the *actual deployed state* against your policies using the AWS SDK, which Terratest wraps. That gives you strong assurance that the deployed infrastructure adheres to your policies, catching anything that `terraform plan` or OPA checks against the plan might miss due to dynamic values or post-apply computations.
Trade-offs and Alternatives
Implementing OPA with Terratest is powerful, but like any solution, it comes with trade-offs:
- Complexity: Writing Rego policies and Go tests for Terratest adds a layer of complexity to your IaC workflow. There's a learning curve for both languages.
- Maintenance: Policies and tests need to be maintained alongside your infrastructure code. As your policies evolve, so must your Rego and Terratest code.
- Execution Time: Running Terratest, which involves actual deployments (even to temporary environments), will add time to your CI/CD pipeline compared to purely static analysis.
Alternative approaches include:
- Cloud-Native Policy Engines: AWS Config Rules, Azure Policy, GCP Organization Policies. These are excellent for runtime compliance checks and can prevent deployments from the control plane. However, they are vendor-specific and might not offer the same flexibility or pre-deployment shift-left capabilities as OPA.
- Dedicated Static Analysis Tools: Tools like Terrascan and Checkov provide pre-built rules for common security best practices in Terraform. They are easier to set up but less flexible for highly custom organizational policies compared to OPA's Rego.
- Custom Scripting: You could write custom scripts (Python, Bash) to parse `terraform plan` output or interact with cloud APIs. This offers maximum flexibility but can quickly become unmanageable and harder to test.
For us, the granular control and cross-platform capabilities of OPA, combined with the real-world validation of Terratest, made it the ideal choice despite the initial overhead.
Real-world Insights and Results
The implementation of this OPA and Terratest-driven Policy as Code pipeline was a game-changer for our team. Over the next six months, we tracked a significant improvement:
"We saw a remarkable 60% reduction in production cloud misconfigurations related to security and compliance. This wasn't just about catching errors; it was about preventing them entirely before they ever had a chance to reach a live environment."
Beyond the quantitative metric, we experienced several qualitative benefits:
- Increased Developer Confidence: Developers knew their infrastructure code was being rigorously checked, reducing anxiety about breaking things.
- Faster Deployments: Fewer policy-related rejections and reworks meant our deployment cycles became smoother and faster.
- Stronger Compliance Posture: We could confidently demonstrate to auditors that our infrastructure was programmatically enforced against our internal and external compliance standards.
- Empowered Security Team: Our security engineers could codify their policies directly, shifting left their expertise and reducing the burden of manual reviews.
One specific instance that highlighted its value: a new microservice required a new database. A developer, unfamiliar with our specific regional replication policy, configured a single-AZ database. Our OPA policy, integrated with Terratest, instantly flagged it during the CI/CD pipeline, explaining the violation and blocking the merge. This saved us from a potential compliance headache and a difficult-to-remediate issue post-deployment.
Takeaways and Checklist
If you're looking to fortify your Infrastructure as Code and prevent costly misconfigurations, here's a checklist based on our experience:
- Define Your Policies Clearly: Before writing any code, articulate your security, compliance, and operational policies for your infrastructure.
- Adopt OPA for Flexibility: Use Open Policy Agent (OPA) and Rego to codify these policies, giving you a powerful, language-agnostic enforcement engine.
- Integrate Terratest for Real-world Validation: Don't stop at the plan. Use Terratest to deploy and verify your infrastructure against policies in a temporary environment, ensuring true adherence.
- Automate in CI/CD: Embed both OPA checks and Terratest runs directly into your CI/CD pipeline, making policy enforcement a mandatory gate for all deployments.
- Educate Your Team: Ensure your development and operations teams understand the policies and how to use the tools effectively.
- Iterate and Refine: Policies evolve. Regularly review and update your Rego policies and Terratest suites as your infrastructure and compliance needs change.
Conclusion: Build Your Infrastructure with Unwavering Confidence
Moving beyond a simple terraform plan review was a pivotal moment for our team. By embracing Policy as Code with OPA and Terratest, we transformed our infrastructure deployment process from a reactive "hope for the best" scenario to a proactive, highly reliable system. The initial investment in learning these tools paid off exponentially in reduced incidents, faster delivery, and a significantly stronger security posture. The peace of mind that comes from knowing your policies are not just written down, but actively enforced and validated, is invaluable.
If you're still relying solely on manual reviews or basic static checks for your IaC, I urge you to explore the power of OPA and Terratest. What are your biggest challenges in ensuring IaC compliance? Share your experiences!
