High Leverage Security Decisions

I have often joked that my job as a security engineer is to serve as a sort of technical debt collector. Technical debt describes the accumulation of technical decisions that, in hindsight, are suboptimal. In some cases, you may know that you’re making a decision that will accumulate technical debt, but in most cases you won’t know that a decision has debt attached until much later.

These design decisions tend to be where security issues crop up, because they are often decisions that are made with the short term in mind. So as a security engineer, I often seek out this technical debt and try to improve those systems in order to mitigate existing security issues, or prevent future security issues.

In the spirit of preventing future security issues, I thought it might be useful to reflect on what decisions I would make if I were worried about long-term security problems at an early-stage startup. The goal is to give decision makers at early-stage startups a set of guidelines for minimizing security risk in a cost-effective way, while also picking up a bunch of benefits when it comes time to pass your ISO or SOC 2 audits. This is not guidance on passing your audits; it’s guidance on good security decisions that has the happy side effect of making your audits easier.

Note that these are not listed in any particular order, and they can technically be adopted at any time. The intent here is that the earlier you adopt them, the easier they are to roll out, the easier they are to get accustomed to, and the higher impact they will have in the long term.

Pick an identity provider early

I’m starting with arguably the most important, and also most frustrating, thing you should do. Identity providers are central services that serve as your source of truth for identity across applications. Your identity provider will probably cost you between $6 and $15 per user per month, depending on the number of users, which identity provider, and what additional features you might want.
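To put that range in budget terms, here is a rough back-of-the-envelope sketch. It only multiplies out the $6 to $15 per user per month figures above; the function name and the 20-person headcount are made up for illustration, and real quotes will vary by vendor and plan.

```python
def annual_idp_cost(users, low_per_user_month=6, high_per_user_month=15):
    """Return (low, high) yearly spend in dollars for a given headcount.

    The default rates are the rough per-user monthly range quoted in the
    text, not any vendor's actual price list.
    """
    months = 12
    return (
        users * low_per_user_month * months,
        users * high_per_user_month * months,
    )


if __name__ == "__main__":
    low, high = annual_idp_cost(20)
    print(f"20 users: ${low:,} to ${high:,} per year")
```

For a 20-person company, that works out to somewhere between $1,440 and $3,600 a year.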

Some common identity providers are Google Workspace, Okta, and Microsoft Entra ID. There are definitely many other identity providers you could choose, including many free and open source ones. I would absolutely avoid the free and open source ones if you’re looking for high leverage security decisions, though. I’m sure the various open source solutions are plenty functional, but you’ll need someone to manage the installation and keep it secure. That often requires a specialist, and suddenly you’re paying $150k a year to keep your identity provider online and safe.

Google Workspace and Microsoft Entra ID have the added benefit of being directly tied in to other things you’re likely going to need as a business: email, an office suite, and shared drives. Okta offers none of these things, but may have better integrations with your HRIS. If you pick Okta, you may also end up paying the same or similar prices to Google or Microsoft just to get your email service. That’s not to say you shouldn’t pick Okta; it’s even what I have the most experience with. It’s just a note that it focuses primarily on the identity provider side of things, and doesn’t offer all the extra features of Google or Microsoft, who are actively trying to be a one-stop shop.

Having an authoritative identity provider early will:

Help you centrally manage who has access to your applications
Provide you with audit logs of changes to users and application access
Control access to resources with strong MFA options
Allow you to integrate automated user provisioning and deprovisioning where available (e.g. via SCIM), easing the strain from employee lifecycle management
Provide a single sign-on (SSO) solution to integrate with other applications

This last bullet point is also perhaps the most frustrating part of this decision. The identity provider itself will typically cost you some fixed rate per employee you onboard. But most third-party applications you may want to use will charge you an unpredictable amount of extra money in order to let you set up SSO. This is known as the SSO tax, which is a frustrating and depressing reality of the industry.

It’s still worth having the solution in place early, as it’ll allow you to enroll applications in SSO as they become available, and will help you have a central location to manage your employee access. But the amount of leverage this decision provides is variable based on how much your vendors charge you for the privilege of letting them not store your passwords.
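To make the automated deprovisioning point from the list above a little more concrete, here is a minimal sketch of what it boils down to: compare the current employee roster with an application’s accounts, then deactivate the leftovers. In practice your identity provider does this for you. The function names and example addresses are hypothetical; the request body follows the SCIM 2.0 PatchOp message shape, and nothing here actually talks to a server.

```python
import json


def accounts_to_deactivate(current_employees, app_accounts):
    """Return app accounts whose owner is no longer on the employee roster."""
    roster = {email.lower() for email in current_employees}
    return sorted(a for a in app_accounts if a.lower() not in roster)


def scim_deactivate_body():
    """Build a SCIM 2.0 PatchOp body that sets a user's `active` flag to false."""
    return {
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    }


if __name__ == "__main__":
    # Hypothetical data: one person has left but still has an app account.
    employees = ["ana@example.com", "bo@example.com"]
    accounts = ["ana@example.com", "bo@example.com", "former@example.com"]
    for email in accounts_to_deactivate(employees, accounts):
        # A real integration would send this body as a PATCH to the
        # application's SCIM Users endpoint for that account.
        print(email, json.dumps(scim_deactivate_body()))
```

The earlier you have a single roster to compare against, the less often a former employee’s account lingers in some forgotten application.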

Adopt (and enforce) security keys

Hardware security keys are widely regarded as the most secure option for phishing-resistant MFA. For the most popular name in the space, YubiKey, you’re looking at spending between $50 and $75 per user to issue them a new key when they start, depending on the type of key you issue them. I’d recommend doubling this budget and issuing each employee two security keys: one for everyday use, and a backup that they can register and keep in a safe place in case they lose the first.

Once you’ve provided all your employees with security keys, it’s important to make sure they enroll the security key with your identity provider, and with any other high-value systems where the identity provider is not yet available. After users are enrolled, you want to make security keys the primary MFA option for accessing sensitive systems. You may need to allow other MFA options as a backup in some special scenarios, but you should default to requiring your users to use the security key.

Security keys have some very interesting properties when it comes to MFA:

Their phishing resistance comes from the cryptographic relationship between the security key and the site it was registered with. Each credential is bound to the domain where it was enrolled, so a phishing site on a different domain (typically) cannot spoof this relationship. Even if you log in and touch your security key, the login will fail because the key has no credential for the domain that is asking.
Security keys use private keys stored on the device. The private key is never sent across the internet and never seen or stored by the server, which only holds the matching public key. This makes interception of the credential very difficult (allegedly impossible).
Depending on the security key type, they are often more convenient for users than the alternatives. For example, a YubiKey 5 Nano or 5C Nano can just live in the laptop permanently, which is way more convenient than having to find your phone, unlock it, forget why you opened it, surf social media, then eventually remember that you need to open your 2FA app and get your one-time password.

Security keys use the same underlying technology as macOS Touch ID or Windows Hello, except Touch ID and Windows Hello use hardware devices embedded in the computer. These options could be used in addition to security keys to improve convenience, but I would advise against relying solely on them. A laptop has a lot more parts that can break than a dedicated security key, and if a user needs a new laptop, they can just unplug their security key from their old one and plug it into the new one.
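If the phishing-resistance property feels abstract, the toy model below may help. It is emphatically not real WebAuthn: real keys use public-key signatures so the server never holds a secret, whereas this sketch uses an HMAC shared secret as a stand-in because that is all the Python standard library offers. The class, function, and domain names are all made up. The only thing it tries to show is that a response is tied to the domain the browser reports, so a phishing domain gets nothing useful.

```python
import hashlib
import hmac
import secrets


class ToySecurityKey:
    """Toy stand-in for a hardware key: one secret per registered domain."""

    def __init__(self):
        self._secrets = {}

    def register(self, domain):
        secret = secrets.token_bytes(32)
        self._secrets[domain] = secret
        # ASSUMPTION/simplification: a real key returns a PUBLIC key here.
        return secret

    def respond(self, domain, challenge):
        secret = self._secrets.get(domain)
        if secret is None:
            return None  # no credential was ever enrolled for this domain
        message = domain.encode() + b"|" + challenge
        return hmac.new(secret, message, hashlib.sha256).digest()


def server_accepts(domain, secret, challenge, response):
    """Check a response against the challenge the real server issued."""
    if response is None:
        return False
    message = domain.encode() + b"|" + challenge
    expected = hmac.new(secret, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)


def demo():
    key = ToySecurityKey()
    real_site = "login.example.com"      # hypothetical legitimate domain
    phishing_site = "login.examp1e.com"  # hypothetical lookalike domain
    server_secret = key.register(real_site)
    challenge = secrets.token_bytes(16)

    # Direct visit: the browser reports the real domain to the key.
    legit = server_accepts(
        real_site, server_secret, challenge, key.respond(real_site, challenge)
    )
    # Phishing: the attacker relays the real challenge, but the browser
    # reports the phishing domain, which the key has no credential for.
    phished = server_accepts(
        real_site, server_secret, challenge, key.respond(phishing_site, challenge)
    )
    return legit, phished


if __name__ == "__main__":
    print(demo())
```

Compare that with a one-time password, which the user can be tricked into typing into any site that asks for it.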

Use Infrastructure as Code

When deploying your infrastructure to whatever environment you’re deploying to, except maybe if you’re self-hosting, you should be trying to use infrastructure as code (IaC) as much as possible. There are various IaC options out there that don’t cost anything to get started with, such as OpenTofu, the open source fork of Terraform. For the purposes of this post, I’m going to stick to talking about Terraform/OpenTofu when I say IaC, but the same general premise applies to other tools.

The cost of embracing IaC, however, is that it takes a bit more overhead to get things set up than just pointing and clicking on things in the cloud console. It requires a bit more forethought about what the dependencies of your infrastructure are, and it can take some time to get used to.

IaC provides a number of super useful benefits:

You can store a representation of your infrastructure in a version control system such as Git, which allows reversion to previous states (some limitations apply)
By storing your IaC in Git, you effectively get a free change control system. All actions to change your environment can be logged via Git commits, with timestamps, exact details of the changes, and even cryptographic signatures if you’re so inclined.
By utilizing a Git “forge” such as GitHub or GitLab, you can also use the Pull Request / Merge Request system to require peer-reviewed changes to infrastructure.
Having your infrastructure entirely (or almost entirely) represented in code means you can also take advantage of static analysis tools like Semgrep, tfsec, Snyk, etc. These can help you identify and prevent vulnerable configurations from being deployed, rather than relying on analysis tools that check against your running environment.
When infrastructure state diverges from your code, such as someone manually changing an IAM policy, you have an easy way to simply revert the changes to the “approved” state.
Bonus point: You can define variables and reuse them across your infrastructure
Bonus point: You can literally grep through your infrastructure. Way nicer than trying to find anything in the AWS dashboard.
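As a taste of what the static analysis point above looks like in practice, here is a small sketch that scans a plan for security groups allowing SSH from anywhere. It assumes the plan has been exported as JSON (for example with `terraform show -json`) and loaded into a dict with a `resource_changes` list; treat the exact field layout as an assumption to check against your tool’s output. The function name and sample resources are hypothetical, and dedicated scanners cover far more cases than this.

```python
def find_open_ssh(plan):
    """Return addresses of security groups that would allow SSH from anywhere.

    `plan` is assumed to be a dict loaded from a JSON plan export.
    """
    findings = []
    for resource in plan.get("resource_changes", []):
        if resource.get("type") != "aws_security_group":
            continue
        after = (resource.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            from_port = rule.get("from_port") or 0
            to_port = rule.get("to_port") or 0
            # Protocol "-1" means all traffic, which includes port 22.
            covers_ssh = rule.get("protocol") == "-1" or from_port <= 22 <= to_port
            open_to_world = "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
            if covers_ssh and open_to_world:
                findings.append(resource.get("address", "<unknown>"))
                break  # one finding per resource is enough
    return findings


if __name__ == "__main__":
    # Hypothetical plan fragment with one risky and one acceptable rule.
    sample_plan = {
        "resource_changes": [
            {
                "address": "aws_security_group.bastion",
                "type": "aws_security_group",
                "change": {"after": {"ingress": [
                    {"from_port": 22, "to_port": 22, "protocol": "tcp",
                     "cidr_blocks": ["0.0.0.0/0"]},
                ]}},
            },
            {
                "address": "aws_security_group.web",
                "type": "aws_security_group",
                "change": {"after": {"ingress": [
                    {"from_port": 443, "to_port": 443, "protocol": "tcp",
                     "cidr_blocks": ["0.0.0.0/0"]},
                ]}},
            },
        ]
    }
    print(find_open_ssh(sample_plan))
```

Run as a check on every pull request, something like this catches the mistake before it is deployed rather than after.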

It also has some drawbacks, though:

By setting up a good peer review system, velocity for infra changes can be slowed down while waiting for approval. This is a risk reduction mechanism, though, and a culture of expedient review can reduce this impact.
Documentation can sometimes be lacking when it comes to setting up your infrastructure
You’ll need somewhere to store the “state” for your IaC setup, which can create a bit of a chicken-and-egg problem: you need IaC to deploy an S3 bucket, and you need an S3 bucket to store the state of the IaC.

There are also managed IaC offerings which will manage your state for you, such as Spacelift, Gruntwork, and HashiCorp Cloud. The price for these offerings will vary, but they can come with built-in peer review / plan review systems, state management, and integrations with your identity systems to ensure only authorized personnel can make infrastructure changes.
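Wherever your state ends up living, the drift benefit mentioned earlier comes down to a diff between what your code declares and what is actually running. The sketch below is a deliberately simplified illustration of that idea with made-up resource names and attributes; real tools do this comparison against the provider’s APIs when you run a plan.

```python
def detect_drift(declared, actual):
    """Return {resource: {attribute: (declared_value, actual_value)}}.

    Only attributes declared in code are compared; resources that exist
    only in the running environment are ignored by this simple sketch.
    """
    drift = {}
    for name, want in declared.items():
        have = actual.get(name, {})
        changed = {
            attr: (value, have.get(attr))
            for attr, value in want.items()
            if have.get(attr) != value
        }
        if changed:
            drift[name] = changed
    return drift


if __name__ == "__main__":
    # Hypothetical example: someone made a bucket public by hand.
    declared = {"bucket.logs": {"public": False, "versioning": True}}
    actual = {"bucket.logs": {"public": True, "versioning": True}}
    print(detect_drift(declared, actual))
```

Because the declared side lives in version control, the “approved” value is never in doubt, and reverting is just a matter of applying the code again.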

Use managed infrastructure

Let me tell you all about the benefits of using managed infrastructure. You get to offload huge amounts of work to the cloud provider. I mean sure, you’re going to pay more for that privilege compared to just running a database on a VM. But you’re going to get so much more from it, too. It is, in my opinion, a huge mistake to deploy to the cloud the same way that you would deploy to an on-prem environment. It loses so many benefits of the cloud. Once you’ve hit unicorn status, you can have a team of infra engineers migrate you to your own managed clusters. Until then, managed services will serve you just fine.

For the purposes of this section, I’m mostly going to be referring to AWS, because it’s what I have the most experience with. But I’m sure many other platforms have similar offerings.

For most startups, you probably don’t need hyper-optimized sub-millisecond improvements in your HTTP processing. You probably don’t need kernel-level modifications to make your application work right. Ultimately, you probably just need a place to deploy your code, and a place to store your data.

Embrace using containers in development and deployment, and suddenly you can deploy your code basically anywhere. My personal preference is to deploy containers to AWS Fargate, because AWS manages the underlying compute instances, which is one less thing for you to have to worry about. It is not without its limitations, but for most applications, it is probably more than sufficient.

When you need a database, reach for a managed database solution like RDS or Aurora. You get multi-zone availability with the click of a button. You get managed backups with the click of a button. You get clustering and replication with the click of a button. There are even options for blue/green upgrades now, to minimize the downtime needed during system updates. It will absolutely be “cheaper” to run your own database on a VM. But then you’re suddenly responsible for availability, disaster recovery, etc.

The underlying advice here is to embrace the DIE triad as much as you can, as early as you can. This focuses on a Distributed, Immutable, Ephemeral architecture. You can deploy containers whose lifecycle is completely decoupled from the state of your application. You can easily deploy to many availability zones, regions, or even separate clouds if you’re feeling up for the challenge. By leveraging managed services, you can spend less time playing sysadmin and more time focused on your actual application. No one ever got a bonus for upgrading their Postgres VM, that’s all I’m saying.

Use an MDM solution

Mobile Device Management, or MDM, plays a crucial role in the long-term security of your organization. It’s how you’re going to be able to secure your endpoints and deploy new software to employees.

Even if you don’t end up doing a lot of customization of the MDM to begin with, simply having all your employee devices enrolled will make the future much easier for you. Especially if your employees are remote and can’t simply bring their laptop in for IT to enroll.

If you’re building a Windows environment, you’re probably using Microsoft Entra ID and Office 365. You can further lean into this ecosystem and deploy Microsoft Intune as your management software.

If you’re building a Mac environment, make sure you’ve got Apple Business Manager set up and are buying your new devices through it. Then you can link that to your MDM to force all of your owned devices to go through MDM on first boot. I’ve had good experiences with Kandji. Jamf Pro is definitely the biggest player in the space, but it is also fairly expensive.

Again, the most important thing to do early is simply set up the MDM. Get your devices enrolled in it. By itself, it’s not going to inherently make your environment more secure. But it’s laying the foundation for future improvements, and the earlier you do it, the less difficult it is to lay those foundations.

Conclusion

The keen observers among you will notice that, with the exception of security keys, much of this advice could be construed as generic IT or Infra advice.

I’m a strong believer that the largest security gains are mostly not achieved through bolt-on processes, buying more tools, or spending the most money on security analysts. Those gains come from an obsession with operational excellence. With picking smart tools and then maximizing the value you get from them. With making small, iterative, strategic improvements in the systems you already have.