What to look for when reviewing a company's infrastructure

Version 1.0

This micro-website lists the questions to ask when reviewing the security architecture of a multi-cloud SaaS company and identifying its most critical components.

For an explanation of the content of the table below, please refer to the companion blog post on marcolancini.it:
"What to look for when reviewing a company's infrastructure".

# Phase Stage Questions Useful Resources
0 Phase 1: Cloud Providers Stage 1: Identify the primary CSP • Which Cloud Service Provider (CSP) is the primary one?
1 Phase 1: Cloud Providers Stage 2: Understand the high-level hierarchy • How many Organizations does the company have? • How is each Organization designed? If AWS, what do the Organizational Units (OUs) look like? If GCP, what about the Folder hierarchy? • Is there a clear split between environment types? (e.g., Production, Staging, Testing, etc.) • Which Accounts are critical? (i.e., which ones contain critical data or workloads?) If you are lucky, someone has already compiled a risk rating of the Accounts. • How are new Accounts created? Are they created manually or automatically? Are they automatically onboarded onto security tools available to the company? Mapping Moving Clouds: How to stay on top of your ephemeral environments with Cartography; Tracking Moving Clouds: How to continuously track cloud assets with Cartography; How to inventory AWS accounts
2 Phase 1: Cloud Providers Stage 3: Understand what is running in the Accounts • What kind of technologies are involved? • Is the company container-heavy? (e.g., Kubernetes, ECS, etc.) • Is it predominantly serverless? (e.g., Lambda, Cloud Functions, etc.) • Is it relying on "legacy" VM-based workloads? (e.g., vanilla EC2) • What kind of data is identified by the business as the most sensitive and critical to secure? • What type of data (e.g., secrets, customer data, audit logs, etc.) is stored in which Account? • How is data rated according to the company's data classification standard (if one exists)? Best practices on Cost Optimization
3 Phase 1: Cloud Providers Stage 4: Understand the network architecture • What are the main entry points into the infrastructure? What services and components are Internet-facing and can receive unsolicited (a.k.a. untrusted and potentially malicious) traffic? • How do customers get access to the system? Do they have any network access? If so, do they have direct network access, maybe via VPC peering? • How do engineers get access to the system? Do they have direct network access? Identify how engineering teams can access the Cloud Providers' console and how they can programmatically interact with their APIs (e.g., via command-line utilities like the AWS CLI or gcloud CLI). • How are Accounts connected to each other? Is there any Account separation in place? • Is there any VPC peering or shared VPCs between different Accounts? • How is firewalling implemented? How are Security Groups and Firewall Rules defined? • How is the edge protected? Is anything like Cloudflare used to protect against DDoS and common attacks (via a WAF)? • How is DNS managed? Is it centralized? What are the principal domains associated with the company? • Is there any hybrid connectivity with any on-prem data centres? If so, how is it set up and secured? Summary of AWS services; Summary of GCP services; Hybrid Connectivity
4 Phase 1: Cloud Providers Stage 5: Understand the current IAM setup • How are authentication and authorization to cloud providers currently set up? • For human access - Where are identities defined? Is an Identity Provider (like G Suite, Okta, or AD) being used? - Are the identities being federated in the Cloud Provider from the Identity Provider? Are the identities being synced automatically from the Identity Provider? - Is SSO being used? - Are named users being used as a common practice, or are roles with short-lived tokens preferred? Here the point is not to do a full IAM audit but to understand the standard practice. - For high-privileged accounts, are good standards enforced? (e.g., password policy, MFA - preferably hardware) - How is authorization enforced? Is the principle of least privilege generally followed, or are overly permissive (not fine-tuned) policies usually used? - How is Role-Based Access Control (RBAC) used? How is it set up, enforced, and audited? - Is there a documented process describing how access requests are managed and granted? - Is there a documented process describing how access deprovisioning is performed during offboarding? • For automated access - How do Accounts interact with each other? Are there any cross-Account permissions? - Are long-lived (static) keys and service accounts generally used, or are short-lived tokens (e.g., via STS) usually preferred? - How is authorization enforced? Is the principle of least privilege generally followed, or are overly permissive (not fine-tuned) policies usually used? Best practices for IAM in AWS; Best practices for IAM in GCP
5 Phase 1: Cloud Providers Stage 6: Understand the current monitoring setup • How are security logs collected, aggregated, and analyzed across the entire estate? - Are security-related logs collected at all? - If so, which services offered by the Cloud Providers are already being leveraged? For AWS, are at least CloudTrail, CloudWatch, and GuardDuty enabled? For GCP, what about Cloud Monitoring and Cloud Logging? - What kind of logs are already being ingested? - Where are the logs collected? Are logs from different Accounts all ingested in the same place? - What's the retention policy for security-related logs? - How are logs analyzed? Is a SIEM being used? - Who has access to the SIEM and the raw storage of logs? • How is the response process set up? - Are any Intrusion Detection systems deployed? - Are any Data Loss Prevention systems deployed? - Are there any processes and playbooks to follow in case of an incident? - Are there any processes to detect credential compromise situations? - Are there any playbooks or automated tooling to contain tainted resources? - Are there any playbooks or automated tooling to aid in forensic evidence collection for suspected breaches? Security Logging in Cloud Environments - AWS; Security Logging in Cloud Environments - GCP
6 Phase 1: Cloud Providers Stage 7: Understand the current secrets management setup • How are new secrets generated when needed? Manually or automatically? • Where are secrets stored? • Is a secrets management solution (like HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager) currently used? • Are processes around secrets management defined? What about rotation and revocation of secrets? Secrets Management solutions
7 Phase 1: Cloud Providers Stage 8: Identify existing security controls • Which controls have already been implemented? • What security boundaries are defined? For example, are Service Control Policies (AWS) or Organization Policies (GCP) used? • What off-the-shelf services offered by Cloud Providers are being used? • What other (custom or third party) solutions have been deployed? Summary of AWS services; Summary of GCP services
8 Phase 1: Cloud Providers Stage 9: Get the low-hanging fruit • Which high exposure and high impact vulnerabilities or misconfigurations are already in Production? • Testing - AWS - GCP - Azure • Auditing - AWS - GCP - Azure
9 Phase 2: Workloads Stage 1: Understand the high-level business offerings • What are the key functionalities the company offers to its customers? - How many key functionalities does the company have? For example, for a banking company, these functionalities could be payments, transactions, etc. - How are the main functionalities designed? Are they built as microservices or as a monolith? - Is there a clear split between environment types? (e.g., Production, Staging, Testing, etc.) - Which functionalities are critical? (i.e., both in terms of data and customer needs) • Map business functionalities to technical workloads, understanding their purpose for the business: - Which ones are Internet-facing? - Which ones are customer-facing? - Which ones are time-critical? - Which ones are stateful? Which ones are stateless? - Which ones are batch processing? - Which ones are back-office support?
10 Phase 2: Workloads Stage 2: Identify the primary tech stack • Which tech stack is the primary one?
11 Phase 2: Workloads Stage 3: Understand the network architecture • Kubernetes: - Which (and how many) clusters do we have? Are they regional or zonal? - Are they managed (EKS, GKE, AKS) or self-hosted? - How do the clusters communicate with each other? What are the network boundaries? - Are clusters single or multi-tenant? - Is either the control plane or the nodes exposed over the Internet? - How do engineers connect? How can they run `kubectl`? Do they use a bastion or something like Teleport? - What are the Ingresses? - Are there any stateful workloads running in these clusters? • Serverless: - Which type of data stores are being used? For example, SQL-based (e.g., RDS), NoSQL (e.g., DynamoDB), or Document-based (e.g., DocumentDB)? - Which type of application workers are being used? For example, Lambda or Cloud Functions? - Is an API Gateway (incredibly named in the same way by both AWS and GCP! 🤯) being used? - What is used to de-couple the components? For example, SQS or Pub/Sub? • VMs: - What Virtual Machines are directly exposed to the Internet? - Which Operating Systems (and versions) are being used? - How are hosts hardened? - How do engineers connect? Do they SSH directly into the hosts, or is a remote session manager (e.g., SSM or OS Login) used? - What's a pet, and what's cattle?
12 Phase 2: Workloads Stage 4: Understand the current IAM setup • How are engineers interacting with workloads? How do they troubleshoot them? • How is authorization enforced? Is the principle of least privilege generally followed, or are overly permissive (non fine-tuned) policies usually used? • How is Role-Based Access Control (RBAC) used? How is it set up, enforced, and audited? • Are workloads accessing any other cloud-native services (e.g., buckets, queues, databases)? If yes, how are authentication and authorization to Cloud Providers set up and enforced? Are they federated, maybe via OpenID Connect (OIDC)? • Are workloads accessing any third party services? If yes, how are authentication and authorization set up and enforced? How Authentication in Kubernetes work; How Authorization in Kubernetes work
13 Phase 2: Workloads Stage 5: Understand the current monitoring setup • Are security-related logs collected at all? • What kind of logs are already being ingested? • How are logs collected? • Where are the logs forwarded? • Kubernetes: - Are audit logs collected? - Are System Calls and Kubernetes Audit Events collected via Falco? - Is a data collector like fluentd used to collect logs? - Is the data collector deployed as a Sidecar or DaemonSet? • Serverless: - How are applications instrumented? - Are metrics and logs collected via a data collector like X-Ray or Datadog? • VMs: - For AWS, is the CloudWatch Logs agent used to send log data to CloudWatch Logs from EC2 instances automatically? - For GCP, is the Ops Agent used to collect telemetry from Compute Engine instances? - Is an agent like OSQuery used to provide endpoint visibility? What is Falco and how to use it
14 Phase 2: Workloads Stage 6: Understand the current secrets management setup • Where are workloads fetching secrets from? • How are secrets made available to workloads? Via environment variables, filesystem, etc. • Do workloads also generate secrets, or are they limited to consuming them? • Is there a practice of hardcoding secrets? • Assuming a secrets management solution (like HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager) is being used, how are workloads authenticating? What about authorization (RBAC)? • Are secrets bound to a specific workload, or is any workload able to potentially fetch any other secret? Is there any separation or boundary? • Are processes around secret management defined? What about rotation and revocation of secrets?
15 Phase 2: Workloads Stage 7: Identify existing security controls • Which controls have already been implemented? • Any admission controllers (e.g., OPA Gatekeeper) or network policies in Kubernetes? • Any third party agent for VMs? • Any custom or third party solution? Kubernetes focus areas; Kubernetes security checklists
16 Phase 2: Workloads Stage 8: Get the low-hanging fruit • Which high exposure and high impact vulnerabilities or misconfigurations are already in Production? Testing Kubernetes; Auditing Kubernetes
17 Phase 3: Code Stage 1: Understand the code's structure • How is the code structured? • Which philosophy for code organization is being used? Monorepo or multi-repo? • Which security controls are already added to the repositories? - Are `CODEOWNERS` being utilized? - Are there any protected branches? - Are code reviews via Pull Requests being enforced? - Are linters automatically run on the developers' machines before raising a Pull Request? - Are static analysis tools (for the relevant technologies used) automatically run on the developers' machines before raising a Pull Request? - Are secrets detection tools (e.g., git-secrets) automatically run on the developers' machines before raising a Pull Request?
18 Phase 3: Code Stage 2: Understand the adoption of Infrastructure as Code • Which resources are defined as code? • Which Infrastructure as Code (IaC) frameworks are being used? • Are Cloud environments managed via IaC? If not, what is being excluded? • Are Workloads managed via IaC? If not, what is being excluded? • How are third party modules sourced and vetted?
19 Phase 3: Code Stage 3: Understand how CI/CD is set up • How is code built and deployed to Production? • What CI/CD platform (e.g., GitHub, GitLab, Jenkins, etc.) is being used? • Is IaC, for both Cloud environments and Workloads, automatically deployed via CI/CD? - How is Terraform applied? - How are container images built? • Is IaC automatically tested and validated in the pipeline? - Is static analysis of Terraform code performed? - Is static validation of deployments (e.g., Kubernetes manifests, Dockerfiles, etc.) performed? - Are containers automatically scanned for known vulnerabilities? - Are dependencies automatically audited and scanned for vulnerabilities? • How is code provenance guaranteed? - Are all container images generated from a set of chosen and hardened base images? - Are all the workloads fetching container images from a secure and hardened container registry? - For Kubernetes, is Binary Authorization enforced to ensure only trusted container images are deployed to the infrastructure? - Is a framework (like TUF, in-toto, providence) utilized to protect the integrity of the Supply Chain? • Have other security controls been embedded in the pipeline? • Are there any documented processes for infrastructure and configuration changes? • Is there a documented Secure Software Development Life Cycle (SSDLC) process? Compliance as Code - Terraform; Compliance as Code - OPA; Container Scanning Strategies; Secure Dockerfiles; Image Pipeline; Binary Authorization; Pipeline Supply Chain
20 Phase 3: Code Stage 4: Understand how the CI/CD platform is secured • How is access control defined? - Who has access to the CI/CD platform? - How does the CI/CD platform authenticate to code repositories? - How does the CI/CD platform authenticate to Cloud Providers and their environments? - How are credentials to those systems managed? Is the blast radius of a potential compromise limited? - Is the principle of least privilege followed? - Are long-lived (static) keys generally used, or are short-lived tokens (e.g., via OIDC) usually preferred? • What do the security monitoring and auditing of the CI/CD platform look like? • Are CI runners hardened? • Is there any isolation between CI and CD? • How are third party workflows sourced and vetted? Securing CI/CD Providers
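
Several of the IAM questions above ask whether overly permissive policies are in common use. As a rough illustration of how that question could be answered for AWS-style policy documents, the sketch below flags `Allow` statements that combine a wildcard action with a wildcard resource. It is a minimal example under stated assumptions, not a full IAM audit: the function names and the sample policy are hypothetical, and the check ignores conditions, `NotAction`, and resource-level wildcards inside ARNs.

```python
import json


def _as_list(value):
    """Policy elements may be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_permissive_statements(policy_json):
    """Return Allow statements that pair a wildcard action with Resource "*".

    Hypothetical helper for illustration only; a real review would also
    consider Condition blocks, NotAction/NotResource, and attached boundaries.
    """
    policy = json.loads(policy_json)
    flagged = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        resources = _as_list(statement.get("Resource"))
        # "*" grants every action; "service:*" grants every action of a service.
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = "*" in resources
        if wildcard_action and wildcard_resource:
            flagged.append(statement)
    return flagged


# Sample policy invented for this example (bucket name and Sids are made up).
EXAMPLE_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOneBucket",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Sid": "TooBroad",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
"""

if __name__ == "__main__":
    for statement in find_permissive_statements(EXAMPLE_POLICY):
        print("Overly permissive statement:", statement.get("Sid", "<no Sid>"))
```

Running a check like this over exported policies gives a quick first impression of the standard practice; it does not replace reviewing how the policies are actually attached and used.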