Full-Scale Compliance with Policy as Code
How a leading global investment banking, securities, and investment management firm is leveraging policy-as-code techniques to enable application teams to adapt to the cloud faster without sacrificing security or compliance.
About the Customer
The customer is one of the largest investment banking enterprises in the world. They maintain business segments in investment banking, global markets, asset management, consumer banking, and wealth management. The customer’s internal tools and product offerings are increasingly being launched as AWS workloads.
Key Challenge/Problem Statement
In this engagement, the customer wants to automate the current lengthy, complicated, and error-prone manual security review processes by creating a suite of executable security policies that encodes the firm’s security and compliance requirements. The security policies can be run innumerable times during the development and deployment of an infrastructure stack. This engagement is part of the customer’s larger, innovative approach to cloud adoption that they hope will help them move more quickly to the cloud than they have in the past.
The customer needs a way for application teams to create their own infrastructure without requiring development cycles or productized resources from a centralized DevOps team. In addition, application teams need to know early and often whether their infrastructure-as-code (IaC) will pass a security review without back-and-forth cycles with security teams. The customer envisions a future in which developers can create a new infrastructure stack one day, and if it passes the CI pipeline, deploy it to production that same day with justified confidence in its compliance and security.
One of the first steps in this engagement is to create executable security policies that run in CI/CD pipelines, which lets the customer break the strong temporal dependencies between application developers and security review personnel.
State of Customer’s Business Prior to Engagement
Until now, the customer has used a variation on the industry-prominent Cloud Center of Excellence (CCOE) approach by maintaining a centralized team of cloud experts who consult with application teams to provide them secure, compliant infrastructure solutions. However, the customer has found that the CCOE approach has bogged down cloud adoption in practice due to its high levels of team interdependence, the inflexibility of the pre-built solutions provided by the CCOE team, and lengthy manual security review cycles.
Currently, the fastest-moving cloud adopters within the organization rely on pre-built, vended Terraform templates with baked-in security mitigations, but these templates usually need to be customized since the centralized CCOE team cannot support every use case. Application teams then need the assistance of the CCOE team to modify or create compliant templates, and the infrastructure solution must be manually reviewed by the organization’s security evaluation team before the solution can be deployed. If changes are made during this stage, the process resets, and further manual reviews are eventually needed.
Proposed Solution & Architecture
The solution is best understood in contrast to the CCOE paradigm well established in the enterprise world. In the CCOE model, an organization’s cloud experts are centralized in a team that acts as a consultancy for the rest of the organization. Their tasks include producing consumable chunks of cloud infrastructure in the form of either Infrastructure as Code (IaC) or vended cloud resources, as well as disseminating knowledge of cloud technologies and patterns to improve the organization’s overall cloud skillset.
Application teams consume products and assistance from the CCOE team to create the infrastructure stacks for their applications. In highly regulated industries such as financial services where security is paramount, the application infrastructure must then be reviewed by the organization’s security teams. While the CCOE team attempts to bake compliance and security into the infrastructure products they create, application teams in many cases still have some flexibility to customize these solutions, and these customized infrastructure stacks by definition require more security and compliance review than pre-built, pre-ordained infrastructure components.
The diagram below illustrates the dependencies between the three major parties in the described CCOE model: the CCOE team, the Security team, and the Application teams.
Figure – 01
Among different organizations in the financial services industry, the level of dependence on the centralized CCOE team varies widely. In our customer’s case, the organization has a long-established classical DevOps culture where application developers are highly cross-functional (capable of performing traditional infrastructure, operations, and monitoring tasks) and usually create their own infrastructure from scratch, giving them a high range of flexibility in the infrastructure they create. But, until now, this flexibility has increased application teams’ dependence on security teams by requiring the security teams to more deeply and frequently manually review teams’ infrastructure solutions in order to ensure these bespoke solutions fit the firm’s security goals. Typically there is a tradeoff between different approaches to this problem: if the CCOE team produces larger pieces of pre-made cloud infrastructure, the stacks are more likely to be secure and compliant out-of-the-box and thus need less manual review, but these pre-built stacks are much less flexible than stacks created by the application teams as-needed. Our customer wants the best of both worlds: high flexibility for application teams in what infrastructure they can create without increasing the teams’ dependence on manual security and compliance review processes.
Our solution focuses on automating these manual security review processes so the customer can free human security personnel from performing manual reviews and give developers immediate, full security review feedback each time they push code to their version control repository.
The following diagram, which describes VR’s and the customer’s approach, is admittedly more complex than the previous one but produces a much looser coupling of cloud delivery teams:
Figure – 02
Automating Security Reviews
Working closely with our AWS consulting partners, VR created a suite of automated security controls to replace the manual security review process. The tool of choice for this work was Open Policy Agent (OPA), a general-purpose policy engine that can be used to enforce any type of policy against hierarchical data (such as JSON or YAML, common in infrastructure configuration code).
While OPA is an established open-source project with many available example use cases in areas such as Kubernetes, Terraform, application authorization, Linux iptables, and more, the open-source community’s support for AWS-native services and tooling is still in its infancy.
The design of the automated security tools VR is building was dictated by a strategic decision the customer had made in its new approach to cloud adoption: application teams would write IaC using the AWS Cloud Development Kit (CDK), a set of modules for creating IaC alongside application code in the same languages teams already use to write their applications. In CDK, teams can use (to date) TypeScript, Python, C#, Java, and Go to specify AWS resources using both higher- and lower-level constructs without having to learn relatively low-level domain-specific declarative languages like CloudFormation.
Ultimately, CDK “synthesizes” (generates) CloudFormation JSON templates from the imperative language code developers write, so the output artifacts of CDK are the same regardless of the language used to write CDK code.
VR targeted the CDK-generated CloudFormation code for the OPA policies they wrote. In this way, any CloudFormation stack a team attempts to deploy, regardless of how it was created, will be automatically reviewed for security and compliance, so teams could even use the same suite of OPA policies with raw CloudFormation to create their IaC. Each CDK application pushed by an application team is synthesized into raw CloudFormation JSON and fed into VR’s OPA policy suite by the customer’s CI/CD pipeline.
For example, assume a developer writes CDK code like the following to create an S3 bucket that uses the AWS managed key instead of the KMS-managed key the security team requires:
When an application team pushes their CDK IaC through the customer’s CI/CD pipeline, the pipeline synthesizes the CDK code into CloudFormation JSON like the following:
The pipeline then runs VR’s automated security policies against this CloudFormation template, and in less than one second, the developer receives targeted feedback on what exactly they need to change within their IaC to make it compliant.
Below is an example error a user might receive if they tried to create the above S3 bucket:
S3 bucket server side encryption (SSE) is required. Enabling SSE on buckets at the object level protects data-at-rest. Objects can be encrypted only with KMS-Managed Keys (SSE-KMS).
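To make the shape of such a check concrete, the following is a minimal TypeScript sketch of the same idea, not the Rego policy VR wrote. It assumes the bucket was synthesized with the `AES256` (S3-managed) default encryption algorithm; the type names, the `checkS3BucketEncryption` function, and the `ReportsBucket` logical ID are hypothetical and exist only for illustration.

```typescript
// Minimal CloudFormation shapes; only the fields this sketch reads.
interface CfnResource {
  Type: string;
  Properties?: Record<string, unknown>;
}
interface CfnTemplate {
  Resources: Record<string, CfnResource>;
}

// Shape of the AWS::S3::Bucket BucketEncryption property.
interface BucketEncryption {
  ServerSideEncryptionConfiguration?: Array<{
    ServerSideEncryptionByDefault?: { SSEAlgorithm?: string };
  }>;
}

// Hypothetical check (an assumption, not VR's policy):
// every bucket must declare SSE-KMS as its default encryption.
function checkS3BucketEncryption(template: CfnTemplate): string[] {
  const violations: string[] = [];
  for (const logicalId of Object.keys(template.Resources)) {
    const resource = template.Resources[logicalId];
    if (resource.Type !== "AWS::S3::Bucket") continue;
    const encryption = resource.Properties?.BucketEncryption as BucketEncryption | undefined;
    const rules = encryption?.ServerSideEncryptionConfiguration ?? [];
    // A bucket with no encryption rules at all is also a violation.
    const usesKms =
      rules.length > 0 &&
      rules.every((rule) => rule.ServerSideEncryptionByDefault?.SSEAlgorithm === "aws:kms");
    if (!usesKms) {
      violations.push(
        `${logicalId}: S3 bucket server side encryption (SSE) is required. ` +
          "Objects can be encrypted only with KMS-Managed Keys (SSE-KMS)."
      );
    }
  }
  return violations;
}

// Assumed example of a synthesized, non-compliant bucket.
const sampleTemplate: CfnTemplate = {
  Resources: {
    ReportsBucket: {
      Type: "AWS::S3::Bucket",
      Properties: {
        BucketEncryption: {
          ServerSideEncryptionConfiguration: [
            { ServerSideEncryptionByDefault: { SSEAlgorithm: "AES256" } },
          ],
        },
      },
    },
  },
};

const s3Violations = checkS3BucketEncryption(sampleTemplate);
```

Here `s3Violations` holds one message for `ReportsBucket`. Changing `SSEAlgorithm` to `aws:kms` would leave the list empty.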
Compliance Evaluation with OPA and Rego
Compliance was evaluated using Rego, OPA’s declarative, human-readable policy language. It has an expressive syntax and an opinionated runtime that allow for powerful policy creation while keeping it relatively clear what each policy accomplishes.
For example, the below snippet is the business rules portion of a simple Rego policy to enforce that each created Secrets Manager Secret uses a Customer Managed KMS Key rather than the default AWS managed key:
The above code should be readable even to users unfamiliar with Rego. Any AWS::SecretsManager::Secret resource in the input CloudFormation template whose KmsKeyId property is not set is denied. If a user violates this requirement, they receive a message like the following on CI/CD pipeline failure:
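The rule described above (deny any `AWS::SecretsManager::Secret` whose `KmsKeyId` property is not set) can also be sketched outside Rego. The TypeScript below is a hypothetical analogue for readers who want to see the logic spelled out; the function name, type names, and message wording are assumptions and do not come from the customer's policy suite.

```typescript
// Minimal template shapes; only the fields this sketch reads.
interface SecretResource {
  Type: string;
  Properties?: Record<string, unknown>;
}
interface SecretTemplate {
  Resources: Record<string, SecretResource>;
}

// Hypothetical deny rule: a Secret without KmsKeyId falls back to the
// default AWS managed key, which the requirement forbids.
function denySecretsWithoutKmsKey(template: SecretTemplate): string[] {
  const denials: string[] = [];
  for (const logicalId of Object.keys(template.Resources)) {
    const resource = template.Resources[logicalId];
    if (resource.Type !== "AWS::SecretsManager::Secret") continue;
    const kmsKeyId = resource.Properties?.KmsKeyId;
    if (kmsKeyId === undefined || kmsKeyId === null || kmsKeyId === "") {
      denials.push(`${logicalId}: Secrets Manager Secret must set KmsKeyId to a Customer Managed KMS Key.`);
    }
  }
  return denials;
}

// Assumed sample input: one secret with a key reference, one without.
const secretSample: SecretTemplate = {
  Resources: {
    DbPassword: {
      Type: "AWS::SecretsManager::Secret",
      Properties: { KmsKeyId: { "Fn::GetAtt": ["AppKey", "Arn"] } },
    },
    ApiToken: {
      Type: "AWS::SecretsManager::Secret",
      Properties: { Description: "token for a downstream API" },
    },
  },
};

const secretDenials = denySecretsWithoutKmsKey(secretSample);
```

In this sample only `ApiToken` is denied. A production rule would likely also need to account for intrinsic functions that can leave the property effectively unset, which this sketch does not attempt.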
OPA is a tool for implementing policies in code, but naturally, the intent of a policy must be defined before OPA developers can add that new policy to the suite; otherwise, they cannot know what policy to implement. In an automated security control package, the security controls are analogous to the feature requests in other software product development workflows. In short, the package’s “features” are its security mitigations.
Policies were defined in a three-tiered approach:
- Identify broad security threats based on the customer’s threat model (e.g. “Actor can perform unauthorized actions due to shared IAM roles”).
- Identify mitigation(s) that address these threats (e.g. “Ensure that no resource shares an IAM role with another resource”).
- Create policy implementation task(s) that implement each mitigation as an OPA Rego policy (e.g. “Permit each AWS::IAM::Role resource in a CloudFormation template to be referenced exactly once from within that same template” as well as “Ensure no hardcoded ARNs are used to refer to Roles within a CloudFormation template”).
In this process, each implemented policy is then documented with the threats it addresses, tying it back to the organization’s threat model, allowing security personnel to review these policies for completeness.
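As an illustration of how the third tier turns a mitigation into something checkable, here is a rough TypeScript sketch of the two example tasks: counting references to each `AWS::IAM::Role` and flagging hardcoded role ARNs. It is an assumption-laden simplification, not the Rego implementation: it counts only `Ref` and array-form `Fn::GetAtt` references, ignores `Fn::Sub` and `Fn::Join`, and counts references from every resource type alike. All names and the sample ARN are hypothetical.

```typescript
interface RoleTemplate {
  Resources: Record<string, { Type: string; Properties?: unknown }>;
}

// Assumed pattern for a literal IAM role ARN.
const HARDCODED_ROLE_ARN = /^arn:aws[a-z-]*:iam::\d{12}:role\//;

// Visit every nested value in a JSON-like structure.
function walkValues(node: unknown, visit: (value: unknown) => void): void {
  visit(node);
  if (Array.isArray(node)) {
    for (const item of node) walkValues(item, visit);
  } else if (typeof node === "object" && node !== null) {
    const obj = node as Record<string, unknown>;
    for (const key of Object.keys(obj)) walkValues(obj[key], visit);
  }
}

function checkIamRoleSharing(template: RoleTemplate): string[] {
  const violations: string[] = [];
  const referenceCounts: Record<string, number> = {};
  for (const id of Object.keys(template.Resources)) {
    if (template.Resources[id].Type === "AWS::IAM::Role") referenceCounts[id] = 0;
  }
  for (const id of Object.keys(template.Resources)) {
    walkValues(template.Resources[id].Properties, (value) => {
      if (typeof value === "string") {
        if (HARDCODED_ROLE_ARN.test(value)) violations.push(`${id}: hardcoded role ARN ${value}`);
        return;
      }
      if (typeof value !== "object" || value === null || Array.isArray(value)) return;
      const obj = value as Record<string, unknown>;
      const ref = obj["Ref"];
      const getAtt = obj["Fn::GetAtt"];
      let target: string | undefined;
      if (typeof ref === "string") target = ref;
      else if (Array.isArray(getAtt) && typeof getAtt[0] === "string") target = getAtt[0];
      // Only count references that point at a role declared in this template.
      if (target !== undefined && Object.prototype.hasOwnProperty.call(referenceCounts, target)) {
        referenceCounts[target] += 1;
      }
    });
  }
  for (const roleId of Object.keys(referenceCounts)) {
    if (referenceCounts[roleId] !== 1) {
      violations.push(`${roleId}: referenced ${referenceCounts[roleId]} times, expected exactly once`);
    }
  }
  return violations;
}

// Assumed sample: one role shared by two functions, plus one hardcoded ARN.
const roleSample: RoleTemplate = {
  Resources: {
    SharedRole: { Type: "AWS::IAM::Role", Properties: {} },
    IngestFn: { Type: "AWS::Lambda::Function", Properties: { Role: { "Fn::GetAtt": ["SharedRole", "Arn"] } } },
    ExportFn: { Type: "AWS::Lambda::Function", Properties: { Role: { "Fn::GetAtt": ["SharedRole", "Arn"] } } },
    AuditFn: { Type: "AWS::Lambda::Function", Properties: { Role: "arn:aws:iam::123456789012:role/audit" } },
  },
};

const roleViolations = checkIamRoleSharing(roleSample);
```

The sample produces two findings: the hardcoded ARN on `AuditFn` and the doubly referenced `SharedRole`. A real policy would probably need to treat references from resources such as `AWS::IAM::Policy` differently, since those attach permissions to a role rather than share it.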
Once a policy is implemented, it must be approved by security review personnel. The status of each automated policy with respect to this manual review process is stored in a flag in each OPA Rego policy, and an additional meta-policy enforces that AWS resources can only be used if all associated policies have been approved by the customer’s security personnel; the resource itself is then deemed safe if it passes all included automated security controls. To relate this back to the earlier sections of this case study, the new security review process is for security personnel to review automated security control code rather than directly reviewing application teams’ infrastructure code. If the automated policy is deemed secure and compliant, the role of security review is delegated to the automated policy rather than a human security engineer, magnifying the power of that review and drastically reducing the amount of time security personnel spend looking at application code.
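The approval flag and meta-policy can be pictured with a small TypeScript sketch. This is a hypothetical model rather than the customer's Rego: the `PolicyRecord` shape, the policy IDs, and the message wording are invented for illustration, and treating a resource type with no registered policies as unusable is an assumption about how strict the meta-policy is.

```typescript
// Hypothetical registry entry: one automated policy and its review status.
interface PolicyRecord {
  id: string;
  resourceType: string;
  approved: boolean; // set once security personnel sign off on the policy code
}

// Meta-policy sketch: a resource type may be used only if it has
// associated policies and every one of them has been approved.
function metaPolicyFindings(usedTypes: string[], registry: PolicyRecord[]): string[] {
  const findings: string[] = [];
  for (const resourceType of usedTypes) {
    const associated = registry.filter((policy) => policy.resourceType === resourceType);
    if (associated.length === 0) {
      // Assumption: uncovered resource types are rejected outright.
      findings.push(`${resourceType}: no automated policies cover this resource type`);
      continue;
    }
    for (const policy of associated) {
      if (!policy.approved) {
        findings.push(`${resourceType}: policy ${policy.id} has not been approved by security review`);
      }
    }
  }
  return findings;
}

// Assumed sample registry and the resource types found in a template.
const policyRegistry: PolicyRecord[] = [
  { id: "s3-sse-kms", resourceType: "AWS::S3::Bucket", approved: true },
  { id: "secret-cmk", resourceType: "AWS::SecretsManager::Secret", approved: true },
  { id: "dynamodb-sse", resourceType: "AWS::DynamoDB::Table", approved: false },
];
const typesInTemplate = ["AWS::S3::Bucket", "AWS::DynamoDB::Table", "AWS::SQS::Queue"];

const approvalFindings = metaPolicyFindings(typesInTemplate, policyRegistry);
```

With this sample, the S3 bucket is allowed, the DynamoDB table is rejected because its policy is still pending review, and the SQS queue is rejected because nothing covers it.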
Below is an example of an automated check failure where an application team has tried to use a resource that has an associated policy that has not been approved by the customer’s security personnel (i.e. the policy itself is not deemed safe for use in production):
Development Process and Testing
Writing OPA Rego policies for the task at hand requires a different skill set than many other security control implementation or DevOps roles require. Besides having critical security threat and mitigation analysis skills and IaC competencies, OPA Rego developers in this context need to follow modern application development practices to ensure speed of quality delivery. At the end of the day, even though it is replacing manual security review checklists, the OPA Rego policy suite VR created is a software package, meant to be run during CI/CD and invoked as a command-line utility. The codebase is the literal codification of the organization’s security policies, and like those policies, it needs to be easily readable, usable, and well tested.
VR took a test-driven development (TDD) approach to policy development, implementing two major levels of test cases: integration tests and unit tests. TDD ensures that tests are written upfront, codifying the acceptance criteria for each control before the development of the actual OPA Rego policy begins, allowing VR and the customer to align on expectations and iron out miscommunications early. Integration tests take the form of full example applications that are synthesized into CloudFormation JSON and evaluated by the Jest TypeScript automated test runner to ensure they pass all the OPA Rego policies written. Unit tests come in two types: small snippets of TypeScript CDK code with accompanying test assertions, and lower-level tests written in Rego to exercise edge cases.
The integration tests ensure that real-life applications that are expected to pass security review actually do pass the OPA Rego policies VR has created, effectively implementing automated acceptance testing for the project.
The CDK unit tests serve two purposes: to ensure that the Rego policies as written function correctly with CDK-generated CloudFormation output, and to serve as example code snippets for the customer’s developers who are looking for documentation on how to create compliant resources. The lower-level Rego unit tests provide another layer of specificity, allowing OPA developers to exhaustively test unlikely but important combinations of CloudFormation syntax that a developer might accidentally create or a malicious user might craft in order to get around the automated security review.
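The kind of edge case the lower-level tests target can be shown with a table-driven sketch. It is written in plain TypeScript rather than Rego or Jest so that it stays dependency-free; in a Jest suite each row would typically become its own test case. The `isEffectivelySet` helper and every test value are hypothetical. The sketch assumes a conservative rule: a property whose value is a `Ref` to the `AWS::NoValue` pseudo parameter, or an `Fn::If` with such a branch, is treated as unset.

```typescript
// Hypothetical helper: is a property value set in every possible deployment?
function isEffectivelySet(value: unknown): boolean {
  if (value === undefined || value === null || value === "") return false;
  if (typeof value !== "object" || Array.isArray(value)) return true;
  const obj = value as Record<string, unknown>;
  // AWS::NoValue removes the property, so it counts as unset here.
  if (obj["Ref"] === "AWS::NoValue") return false;
  const branches = obj["Fn::If"];
  if (Array.isArray(branches) && branches.length === 3) {
    // Conservative assumption: both the true and false branches must be set.
    return isEffectivelySet(branches[1]) && isEffectivelySet(branches[2]);
  }
  return true;
}

interface EdgeCase {
  label: string;
  value: unknown;
  expectedSet: boolean;
}

const keyRef = { "Fn::GetAtt": ["AppKey", "Arn"] };

// Each row pairs an input with the verdict the policy is expected to reach.
const edgeCases: EdgeCase[] = [
  { label: "key alias string", value: "alias/app-key", expectedSet: true },
  { label: "reference to a key resource", value: keyRef, expectedSet: true },
  { label: "property omitted", value: undefined, expectedSet: false },
  { label: "empty string", value: "", expectedSet: false },
  { label: "Ref to AWS::NoValue", value: { Ref: "AWS::NoValue" }, expectedSet: false },
  {
    label: "Fn::If with a NoValue branch",
    value: { "Fn::If": ["UseKey", keyRef, { Ref: "AWS::NoValue" }] },
    expectedSet: false,
  },
  {
    label: "Fn::If with keys in both branches",
    value: { "Fn::If": ["UseKey", keyRef, "alias/fallback"] },
    expectedSet: true,
  },
];

// Labels of any rows where the helper disagrees with the expectation.
const failedCases = edgeCases
  .filter((testCase) => isEffectivelySet(testCase.value) !== testCase.expectedSet)
  .map((testCase) => testCase.label);
```

An empty `failedCases` list means every row behaved as expected. Writing the rows before the helper mirrors the TDD flow described above, where the table acts as the acceptance criteria for the control.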
AWS Services Used
- AWS Infrastructure Scripting – CloudFormation, AWS CDK
- AWS Storage Services – S3
- AWS Database Services – RDS, DynamoDB
- AWS Compute Services – EC2, ECS, Lambda
- AWS Management and Governance Services – CloudWatch Logs, Synthetics
- AWS Security, Identity, Compliance Services – IAM, Key Management Service, Certificate Manager, Secrets Manager
- AWS Developer Tools – CodePipeline, CodeBuild
- AWS Networking – Elastic Load Balancing
Third-party applications or solutions used
- Open Policy Agent (OPA)
- GitHub & GitHub Actions
Outcome Metrics
- 73 Different Security Controls Implemented
- 28 CloudFormation Resource Types Supported
- 15 Different AWS Services Supported
- Approximately 600 automated test cases executed on every pull request in less than 5 minutes
- Fully automated security review executable in less than 2 seconds on a 3000-line CloudFormation ECS Fargate infrastructure stack
Codifying cloud security and compliance policies as code provides major new capabilities to customers: it can decouple their development cycles from security review cycles, it allows them to version control and easily audit their security and compliance rules, and it shortens the security review itself from a lengthy manual cycle to an automated check that runs in seconds. Besides these capabilities, other doors begin opening that were previously completely closed to organizations wishing to improve their cloud adoption speed. For instance, the automated security policy suite that VR developed runs just as well locally on developer laptops as it does in CI/CD pipelines. Our customer has begun exploiting this strength by creating a Visual Studio Code plugin that automatically highlights lines of CDK code with security review failures in the IDE before the developer ever pushes their code. This is as far left as one can possibly go with preventative security measures, in the parlance of “shifting left.”
So far, the development methodology has focused on targeting known-good sample customer applications one at a time, with the early-stage goal of supporting a handful of commonly appearing customer application patterns. As AWS service and resource coverage improves, the team will begin tackling less frequently used AWS services and resources, eventually covering as much of the AWS surface as possible. In a perfect-world scenario, the automated security review will verify full compliance with NIST SP 800-series publications and any additional customer security/compliance requirements across all enterprise-approved AWS services and resources. This would allow the customer’s developers to use the AWS cloud to the maximum organization-approved extent, moving away from the CCOE model of distributing prescriptive infrastructure solutions and toward one where an organization’s developers can do whatever they want in the cloud, using whichever tools they like, in a safe environment.