A more mature take on stateless Terraform (bejarano.io)
77 points by rthnbgrredf on Oct 8, 2023 | 89 comments


State is required for a three-way comparison: the desired state, the last known state and the current state. And since not all information about the last known state will be expressed in the source configuration files, you need a secondary source of information: the terraform state.

Everything else is indeed optional, but the fact that a useful three-way comparison can be done at every run is why terraform works so well and why other IaC tooling is a failure when it comes to change detection and cleaning up.


Why would not all information about the last known state be expressed in the source?


Because the source might no longer exist. Same as the current state might no longer exist. You can't diff something you don't know about and thus cannot create a transaction to deal with it.

Three examples, but first context:

You create some .tf files that create a bucket on s3 on AWS, and then create an object inside the bucket. This creates a graph data structure that describes the AWS provider, the S3 bucket and the S3 Object. You then apply this desired state, which causes terraform to create some resources in AWS and then describe the state as it was last known in the state file.
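A minimal sketch of the kind of configuration being described (names here are illustrative, not from the thread):

    provider "aws" {
      region = "us-east-1"
    }

    # The bucket and the object form a two-node dependency graph:
    # the object references the bucket, so the bucket is created first.
    resource "aws_s3_bucket" "example" {
      bucket = "my-example-bucket"
    }

    resource "aws_s3_object" "example" {
      bucket  = aws_s3_bucket.example.id
      key     = "hello.txt"
      content = "hello world"
    }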

Scenario 1: You delete the object from AWS but don't tell Terraform about it. You re-run apply, terraform notices that it is gone and re-creates it, so far so good. Even Ansible and SaltStack could do this.

Scenario 2: You delete the object from the desired state in your .tf files. You re-run apply and terraform notices that you used to have an S3 Object in your desired state (because your state file says so), but since it is no longer there, it talks to the AWS API to see if the object still exists; if it does, it deletes it (or plans to delete it, and then it is up to you to apply or discard that plan). This is not possible if you don't record your previous intent. One-way tools fail here.

Scenario 3: You make a mess where you remove the object from your .tf files and also manually edit the state file and remove the object there as well. Terraform now knows nothing about what the world used to look like. But you then decide that the S3 bucket can be removed, and since it's all managed resources anyway (no dirty ClickOps tricks in the console) you tell Terraform to destroy the bucket. Terraform doesn't see any dependencies, and asks AWS to nuke the bucket. But since the bucket isn't empty, you get an error. Terraform doesn't know anything about any objects in the bucket, and thus it couldn't clean them up.

Now, the scenarios here are oversimplified, and you would probably not set up an IaC tool to manage one bucket and one object if that is all you'll ever configure. But when you start creating multiple environments that each need their own configuration drift control and controlled resource destruction, you really can't function without it.


Is there a functional difference between that approach and using tags on existing resources? I know Terraform doesn't use tags for this because not everything supports them, but in K8s, scenario 2 is handled by just saying "make everything using this tag look exactly like this". If there's extra resources not in the YAML with that tag, they'll get cleaned up.

In general, I like that approach better, because it just uses the state that already exists. It would be kind of nice if Terraform had a mode where it avoided creating extra state where possible.


Yes, because tags do not express where in a graph a node might be. It also gives you a new problem, because to find out what the previously known state was, you now have to read all tags on all resources.

In Kubernetes, the state reconciliation happens via controllers based on information in etcd; in a way, etcd is what the statefile is in Terraform. The apiserver is the 'current' state, including metadata and event fields, and any on-disk manifests you might have would be the desired state. The only big difference between the Kubernetes reconciler and Terraform is the event-driven nature: Terraform tends to be a series of one-shots, whereas Kubernetes (the controller manager and controllers) is a constant stream of reconciliation events.

The overarching theme in both is 'desired state reconciliation via declarative configuration'.

As for client-side apply and server-side apply in K8s, you essentially have the same deal as Terraform: you feed it a manifest of what you want, and it figures out whether each resource is new, old, needs to be updated or needs to be destroyed.


The concept I struggle with is

"the desired state, the last known state and the current state"

Why do you need to know the last known state? You have the desired state and the current state; you run your code and you reconcile the current state to the desired state.

I get why it is impossible/expensive to fetch the current state from AWS, which doesn't expose a simple "show state" API, but managing a state store is extra work and thus extra fragility.

However, ideally I'd like to list credentials and providers to manage, then say "manage all resources with some form of tag, e.g. managedByTerraform=12345" (this could even be in the resource name -- for my FortiGate management, my scripts manage firewall rules, addresses, etc. which start with AUTO_, and ignore the other objects).

It would then run, generate a lock resource, find the current state, compare it with the desired state, create/delete elements to reconcile actuality with desire, apply the changes, and free the lock.

Then all I need to do is keep that human-readable file and I can get back to where I am.

If I lose the state, or it gets corrupt, I'm in for a world of hurt with terraform.


What you propose is computationally rather expensive, and on an API level really hard to implement uniformly. And thus the state was born.

Trying to do it without state has been tried plenty of times, and it's bad. (Chef, SaltStack, Ansible, CFengine, Puppet to name a few)

It only works somewhat okay if you're simply writing to a single managed unit, i.e. a file. You know what all the contents of the file have to be, so you can replace it wholesale. This does not work for a graph of resources that interact with each other.

Separately, this also doesn't work with a shared responsibility model and doesn't work with deleted code.

Terraform has a state because that is required to solve the problem in a reasonable manner. It doesn't have state just because it was thought to be a fun exercise ;-) If the problem being solved is not _your_ problem, it is highly likely that using terraform at all is also not going to help you.

> If I lose the state, or it gets corrupt, I'm in for a world of hurt with terraform.

And that's why you secure and version your state, e.g. using S3 with versioning and a restrictive policy.
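For instance, a minimal sketch of such a backend (bucket and table names are illustrative):

    terraform {
      backend "s3" {
        bucket         = "my-terraform-state"     # versioned, access-restricted bucket
        key            = "prod/terraform.tfstate"
        region         = "us-east-1"
        encrypt        = true                     # server-side encryption at rest
        dynamodb_table = "terraform-locks"        # optional state locking
      }
    }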


Because you can't guarantee that everything (or even anything) in the source was actually applied, whether because there was some error in doing so or because there just was no run.

Additional potential issues: how are you generating your primary keys before resource creation? Probably by some hashing of attributes known at creation time? And how are you avoiding collisions? How are you dealing with modules that may have gone through multiple commits/versions since the last time you ran an apply?

There are just too many places where you really want a mapping of at least logical IDs -> physical IDs to ensure consistency. The one big advantage CloudFormation has here is that it handles the state for you, but there's definitely still state.


Because it's incredibly difficult to do. Look at something as simple as launching an instance. How do you know the state of the instance ahead of time? The instance ID doesn't exist, and you can't use search filters for it because it's not there yet. Yes, you could use some half-baked conditional logic to try to search and load values only if your instance exists and create it if empty, but Terraform really isn't good at that. And that's just the very simple case. You could use something like the CDK or Pulumi and avoid state, but it's still work and it's useful work.


I think a simpler example is: how do you know to delete something? It won't be in the config any more, so it won't be queried, unless you have some tag on the resources managed by this module, but then you'd need to persist that somewhere.


This is the biggest problem that I think the author left out.

For those that don't know, the way you delete a resource in Terraform is to remove the reference to it in your code.

So for example, to create an EC2 instance, I make an EC2 resource block (or use a module that itself references one), then run Terraform. Terraform looks and sees that this EC2 instance isn't in the state but is in the code, so that must mean it is new and needs to be created. So it creates it and then adds the metadata about that resource to the statefile.
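For illustration, a hypothetical block like this is all the "code" side amounts to; removing it (and re-applying) is what triggers the destroy described next:

    resource "aws_instance" "example" {
      ami           = "ami-0123456789abcdef0" # hypothetical AMI ID
      instance_type = "t3.micro"
    }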

Now to delete it, I simply delete the resource block from my code and run Terraform again. Terraform looks at the state file and sees that there is supposed to be an EC2 resource, but doesn't see it in the code. So it deletes it from the cloud provider and then removes it from the statefile. That's how you delete.

So if you remove the statefile, how does Terraform know that you deleted something? It isn't in the code, so it doesn't even "think" to check on it. You need the statefile to remind Terraform that it used to exist. Again, creating an object is easier because you have it defined in the code, so Terraform can try to reconcile it with an existing resource in the cloud provider, and if it can't reconcile a pre-existing resource then it can assume it should create one. But if you delete it without a statefile, Terraform can't know it exists. The author seems to know this, which is why they suggest using Git to check the last time the code was run, see that there used to be a resource block and that it has been deleted. Then Terraform can conclude that it should delete that resource from the cloud provider.

My problem with this is that in order for Terraform to work you must preserve your git history. That does generally happen, but in enterprise environments we have had to do some nasty voodoo with git on occasion, and I fear that this could significantly mess up my cloud infrastructure as a result.

Trust me, I have spent a lot of time reconciling and repairing statefiles in my years as an SRE. I know tf state very intimately. It is a hell of a lot easier to repair a statefile (which is just JSON) than it is to repair a git commit history. And I live in a world where you have to assume that the day will come when you need to perform this task. When that day comes, I will be happier repairing a JSON state file than a git commit history.


This is correct.

If it helps, you can think of your Terraform files as HEAD and the state file as HEAD~1.


How are multiple branches handled? How would locking and conflict resolution work?

Imagine two branches trying to update staging using GitOps.


They aren't handled; the reality of the situation is that the Subversion model is better suited to infrastructure than Git.


Only if the infrastructure you target has no concepts of versions, and if it only has monolithic environments (which still holds true for much of the older infrastructure).

But if we just take the GitOps branches problem, they are indeed not handled, but that's because you generally configure your GitOps pipeline to only allow interaction from a specific ref, not any branch you might have lying around ;-) The SVN version would either be trunk-based deployments or revision-based, where you have to tell your reconciler which revision you want.


It’s because infrastructure largely does not have the concept of versioning, and branching infrastructure just makes no sense.


State is only required because Terraform is not truly declarative. If you have a VM instance running that is not defined in your code, it just gets left alone because it exists outside of TF's view.

If Terraform were truly declarative, all of your infrastructure would be represented in code, and you wouldn't need state because the only thing it needs to compare against is what's actually there.

However, cloud APIs generally suck, and getting rid of state universally would be almost impossible for other reasons.


You're underselling what Terraform gets used for here. What happens when you're using Terraform to manage multiple cloud accounts or even multiple cloud providers? Should it be scanning the entire host it runs on for credentials from every possible cloud, checking that your configuration only includes resources for AWS account #XXXX, and deleting everything it finds for any other account and any other provider?

Think of the footguns you're introducing if they designed it the way you suggest. What happens when you pass the provider configuration an assume role that you typo or forget to edit correctly, but it's still a valid role your user credentials actually have permission to assume? It then deletes every resource that role has the power to delete, because you have nothing defined for it in that particular root module. Safety is more important than purity or ergonomics or devex here. Heck, that may be the core difference between ops and dev.


>It then deletes every resource that role has the power to delete, because you have nothing defined for it in that particular root module

You would see the deletions in the plan and bail out.


> Have Terraform plug into your Git history, take the Terraform code from the previous commit, and use that to calculate the old tree

This is much, much more difficult than it sounds if you want to be able to handle any amount of refactoring or repo reorganization without being cut off from handling resources from "before" a certain time.

How terraform or provider upgrades would work is another minefield. It's not just a matter of having access to the old terraform source, it would need to evaluate it with the same terraform version and providers - so those would have to be acquired and run within the state backend.

And then this is still assuming the previous run that you're trying to emulate was applied 100% successfully...


Author here, you're right, provider versions are something I did not account for.

Good catch!


What about sensitive values? I don't want those in git history.


At my workplace, managing terraform state has caused outages. It is so very easy to refactor the terraform code and then find yourself deleting a network load balancer and recreating one.

Recent versions of Terraform have introduced a `moved`[1] block, but it is really unintelligent. It is a step in the right direction, but I can't use variables with it, or string interpolation, or anything.
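For readers who haven't seen one, a sketch of a moved block (the addresses are illustrative). Both addresses must be literal, which is the "no variables or interpolation" limitation mentioned above:

    moved {
      from = aws_lb.main                      # old address in state
      to   = module.load_balancer.aws_lb.main # new address after the refactor
    }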

Trying to get the state to be happy has been very difficult.

I prefer ansible for this because it doesn't use state[2], but as my company uses terraform I am left with the unfortunate need to manage state.

The best thing that I've been able to find to help in this fight is a little script I created to automatically generate moved blocks[3]. In a refactor you typically want to take resource blocks from the main.tf file and move them into their own module. You first move the blocks into the module, and then, before you edit anything else, you run that script and it matches up the blocks that were moved using Levenshtein distance. This approach may seem hacky but it works very well. Then I just take the output of the script and append it to my tf file.

I found the article refreshing and poignant. I will definitely be hoping for innovations like this to finally improve the state of the art, because now that OpenTofu[4] is on the scene, features like this that might chew into HashiCorp's bottom line will finally start being looked at seriously.

1: https://developer.hashicorp.com/terraform/language/modules/d...

2: https://blog.djha.skin/blog/that-does-it-ansible-wins/

3: https://gist.github.com/djhaskin987/b856c157bcfce6d2e4ee19f9...

4: https://opentf.org/


I've had similar difficulties and it seems like Terraform is basically a content-addressable system, where the directory tree is part of the addressable content.

The addition of `moved` is welcome but I feel like it was a huge mistake to use the FS as the module system such that moving a directory or renaming it is a breaking change, despite the fact that the name of your resource never changed.

That design decision has shot even the most experienced engineers in the foot: a renamed directory kills a database and spins up a new one.


>At my workplace, managing terraform state has caused outages. It is so very easy to refactor the terraform code and then find yourself deleting a network load balancer and recreating one.

This is why we read the `plan` output before approving an `apply`. Terraform is not causing this outage, humans are.


I remember seeing the original take posted here at some point. I'm glad the author eventually realized that if a (successful?) piece of software is the way it is, it's because it made some trade-offs in its architectural decisions.


Author here, thank you!


> Have Terraform plug into your Git history, take the Terraform code from the previous commit, and use that to calculate the old tree.

So instead of generating state by doing API requests, this would generate state from code in git, which does not necessarily represent the real world state of resources.

As much as I'd want Terraform without state, I believe using Terraform code in git as state creates more problems than it solves. This approach can't recognize changes made to resources outside of Terraform. It also assumes the previous version of the Terraform code in git is always the one that got applied last, which isn't necessarily the case.


> So instead of generating state by doing API requests, this would generate state from code in git, which does not necessarily represent the real world state of resources.

What you're talking about here is unrelated to whether the state comes from regular Terraform state or if it's generated using the source code, as the refresh phase should happen during both.


Is anyone else distracted by all the emphasis, italics, underlines, and highlighting?

When everything stands out, nothing does.

Sorry for the off-topic comment, but I do think this is valuable feedback for the author.


Yes, I stopped reading because of this.


I was as well, especially the yellow highlight/underline; I interpreted those as links at first.


Is that not what they were? I didn't click any of them, but I assumed they were links the entire time.


The pieces of highlighted text (such as "Terraform should have remained stateless" in the first sentence) aren't links, but I thought they were. The bits with just an underline ("this piece") are links, though.

Sidenote: TIL about the <mark> HTML tag.


Author here, thank you! I'll tone it down a notch, this is most definitely helpful feedback.


I think it's to distract you from the vapid stupidity of the article


In my (very limited) experience with Terraform, the pain isn't so much associated with state, but rather with the declarative nature of the code.

I feel like imperative Terraform would be easier to reason about, since it matches up with the non-terraform workflow (i.e. go here add this, then go here and configure that, yadda yadda).

When I read HCL, I really struggle to understand the order of operations. I know that surely some things must be built first, and supposedly it's all handled under the hood. But I'd rather be in control, tbh.

Then again, as I said, I don't have much experience with Terraform so maybe I'm way off?


Apart from very specific scenarios where you really know you need explicit order, it's not something I ever think about. You've got two options really:

1. One resource depends on another through its attributes, like an autoscaling group depending on a subnet. This is handled automatically by Terraform: updates and ordering will happen in the right way.

2. Dependencies that exist outside of configuration. For example, some domain name should be created before some service starts. You can add those explicitly through `depends_on` for the resource that needs it (see the sketch below).
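A sketch of both cases (resource names are illustrative, not a complete configuration):

    # Case 1: implicit ordering, inferred from the attribute reference.
    resource "aws_subnet" "app" {
      vpc_id     = aws_vpc.main.id
      cidr_block = "10.0.1.0/24"
    }

    resource "aws_instance" "app" {
      ami           = "ami-0123456789abcdef0" # hypothetical AMI
      instance_type = "t3.micro"
      subnet_id     = aws_subnet.app.id       # creates the dependency edge

      # Case 2: ordering Terraform can't infer from attributes.
      depends_on = [aws_route53_record.service]
    }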

Going imperative would lose most of what's interesting about Terraform. Imagine you create a service then change some attributes. Now you want another environment that looks the same - how would you do that imperatively? Replay all changes from scratch? Flatten the history and create a new script? With declarative Terraform what you're running is already in your configuration, no changes needed.


In the best case this is true. In the worst case the behavior of a resource in state X differs depending on it having been in state Y, so you have the same problem.


But that's not related to the tool anymore. You'll have the same issue regardless of how you created that resource. Also, FWIW, I've been dealing with automation of services through Terraform and others for two decades or so and never ran into an issue like that. The one that comes closest is an AWS certificate, which requires either a parallel deployment or the first attempt to fail (if you're using CloudFormation stacks, anyway). Still, not a serious issue in practice.


Glad to know you’ve been dealing with Terraform for longer than the 8th highest contributor. Lots of AWS resources behave this way. Perhaps you’re not doing anything serious enough to run into these conditions?


I wrote "through Terraform and others", not Terraform only.

Instead of trying to gotcha me, do you want to provide some examples where creating a resource and changing an attribute results in something different than creating one with the final attribute values?


EKS clusters are a good one. A cluster starting at Kubernetes 1.28 will be in a wildly different state to one that started at 1.17 and was upgraded through every version. More has come under management scope since 1.17, but even today the add-on model doesn't cover lots of important aspects.

S3 buckets are another: it's no longer even possible to create some types of buckets that exist in the wild today, thanks to changes in how canned ACLs may be provisioned. Older buckets have their access rules unchanged, however.

These are two super common resources in the biggest cloud. I’m sure if I went digging in Azure or GCP I could find dozens of non-trivial cases of this that actually make a difference - database-as-a-service offers are likely a target-rich environment for this.


That's a different issue than what I thought you're talking about, and I'm not sure why it's relevant here. None of the declarative/imperative systems can deal with future compatibility. The service you're using may change the meaning of some API call/values - there's nothing we can do about it apart from updating the software, whether you're using Terraform or Ansible.


> Then again, as I said, I don't have much experience with Terraform so maybe I'm way off?

Yes. HCL is the easiest part of terraform.

State management, i.e. knowing what a particular tf apply is going to do and whether it will put your stuff in an irreversible state where you have to go in and manually fix it, is the hardest part. For us, the forever struggle is with IAM policies and KMS key aliases, especially if there is also a developer boundary coming into play as well.


If you wanted to store the state in Git, you would probably want to at least encrypt secrets. This is not something that Terraform supports now (theory is that this is to push their commercial offering) but it’s one of the main feature requests for OpenTofu.

With this in place, and ideally with allowing folks to define custom backends using the plugin interface (decoupled from core, versioned) this story could make a ton of sense.

Disclosure: I’m on the steering committee of OpenTofu.


Why would you want to store the state in Git?

This means that if you have a CD pipeline (itself triggered by a commit) that runs `terraform apply`, you'd need to have your CD pipeline commit _again_ to save the changes to the TF state files.

On top of that, you lose the ability to re-run a previous CD pipeline to rollback to a previous version of your infrastructure.

The goal of Git is just to store the lifecycle of a piece of software, imo. It's not the responsibility of the repo to know the state of the deployment.


You’d just use Git as fancy storage with some extra features. You could use a separate repo for that.

But I agree that the gain from that is not crazy.


Isn't that a misuse of Git? I would consider Terraform state files to be artifacts, which means they have no place being tracked in a source control system. The natural way to use git to manage incremental deployments would be to use git history: given the tag/commit of the previous deployment, find the diff between the resources, and only deploy that diff. Or, put differently: the "previous state" is already recorded in git, under the earlier tag/commit, as the article notes.


My argument is basically this. It is a misuse of git.

The author wants to eliminate state, so they suggest using git because as it turns out... you do need state.

But the author didn't eliminate state at all, they just eliminated a separate statefile. If that's what bothered them, then sure, the problem is solved. But now you are using version control as state, and those don't always line up. If you only work linearly then it is perfect. But what happens when you need to start tracking a pre-existing resource? You have to import it into git history somehow.

Back to your point, you really should not be modifying git history at all. There are tools to do so, but any time you modify git history it is technically a misuse of git. There are side effects of doing so, especially when working on a large distributed team.

Basically the author trades one minor problem for many major problems.


Not sure if it’s misuse. Versioning your state is not a terrible idea in general. Using Git for that is kinda natural for reasonably sized states. I personally would store the state in a different repo than the infra definitions, though.


This isn't versioning your state though. This is using git commit history to essentially determine state at "apply-time".

Versioning your state is a good idea and is already easy to do. For example, if you use the S3 backend you can enable object versioning on your bucket, and then every time the state is updated it is versioned and you can roll back state that way. It is a good practice. Similarly, you could use the local state backend, which just keeps state in a JSON file in your project, and then commit that JSON file to version control. Now git is versioning your state.
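For example, enabling versioning on the state bucket itself might look like this (a sketch; the bucket would live in a separate config from the state it stores, and the names are illustrative):

    resource "aws_s3_bucket" "tf_state" {
      bucket = "my-terraform-state"
    }

    resource "aws_s3_bucket_versioning" "tf_state" {
      bucket = aws_s3_bucket.tf_state.id
      versioning_configuration {
        status = "Enabled" # every state write becomes a recoverable object version
      }
    }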

But the author is not suggesting this; they are suggesting not having a state file at all. You look at previous commits to determine state. So, for example, if you create a new resource for an EC2 instance in your code and run terraform apply, how does Terraform know it is new and needs to create this EC2 instance, as opposed to updating an existing one? Normally it would look at the statefile: if the resource isn't in there, tf knows to create it; if it is already in state, it knows to update it and reconcile the metadata in state with the declarative state in the code. Without a statefile this can't be done. So the author looks in the git history at the previous commit. If you can find the resource block in the commit history, you know what it was last set to. If it doesn't exist in the history but exists in the current file, then the resource is presumed new. If the resource was in the previous history but is different in the current commit, then those changes are presumed to need applying.

I'm sure this works in simple use-cases. But at any moderate scale this could break down very quickly.


How would you handle branching and merging of state files? Is there a valid reason to have different state files on different branches (our use case: different branches represent different DTAP phases, and hence deploy to different environments)? What would merge conflict resolution look like for these state files? What would happen if I deploy an environment from one branch that had previously been deployed from a different branch?

If the answer is "you should never branch and merge state files", then the corollary is "you should never put state files in source control".


It's the biggest gap in TF, in my opinion. The fact that the Vault provider happily slurps all your secrets into state instead of using hashes or similar blows my mind. I am convinced HC is just out of touch at this point, because "treat the state as sensitive" is not enough.

There are many cases where I would accept simple hashes instead of client-side encryption, too. Everyone is going to have different needs, but it makes no sense that HC ignored making this better for so long.


I agree that the lack of native state and/or secret encryption is a serious limitation.

In Terraform’s defense, it’s often the fault of providers and/or APIs.


An easy solution would be to just encrypt the whole statefile.

This would work the same as state locking works now. You can apply an extra provider for state encryption/decryption just like you do for state locking/unlocking.

It's already been requested (with pull requests) for Terraform for a while now and Hashicorp keeps rejecting it, presumably because it would undermine features of Terraform enterprise.

OpenTofu is already considering implementing this feature as a result.


> This makes Terraform significantly worse for the majority of its users (users of public cloud providers with APIs whose resources all have such attributes) so the remaining 1% (or less) can use it.

I think the author misunderstood the problem (maybe because it's not crystal clear on HashiCorp's website): it's not that not all cloud providers provide tags; the problem is that only a certain percentage of resources are taggable.

So imagine you are using your stateless TF code based on tags only: you could only deal with those resources that support tags. Whereas now, after configuring the backend on S3 with DynamoDB, it's basically "fire and forget": you can use practically any resource. Paradoxically, TF supports even more actions on AWS resources than CloudFormation does.

So maybe for some simple setups like EC2 with S3 this could work somehow, but for anything more complex the current implementation is far superior.


Missing the word "to" here FYI:

> For any given already-existing resource, Terraform needs to know how, in the absence of its code, determine its dependents so they can be destroyed in the right order.

should be

> how, in the absence of its code, to* determine its dependents so they can be destroyed in the right order.


I want to create 3 servers with randomly chosen names from a pool of 10.

How can you express this without state? Or more generally, any “random_password”, “random_choice” etc, or even anything that is a “write only” property

You can’t. So this is dead in the water.


Interesting example. For context, I have only experience with Azure Bicep so "no state" is my default assumption on IAC languages.

Do you in practice really use random names? In my experience, I'd just use a loop vm01...vm10 for the names and the passwords aren't needed to identify an instance after the deployment so here randomness isn't an issue.


Random choice of subnet is a better example, random names is definitely not common.

Random passwords, write-only attributes (like database master passwords) are the most common.
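A sketch of the common case (names are illustrative): the generated password exists only in state, so without state it would be regenerated, and the database replaced, on every run:

    resource "random_password" "db" {
      length  = 24
      special = true
    }

    resource "aws_db_instance" "main" {
      identifier          = "main-db"
      engine              = "postgres"
      instance_class      = "db.t3.micro"
      allocated_storage   = 20
      username            = "app"
      password            = random_password.db.result # write-only: not readable back from AWS
      skip_final_snapshot = true
    }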

How do you express "create a DB with this strong password, then put it in an S3 object", then later "actually, put it in SSM rather than S3"?


We cannot; okay, I see the point. Up to now I considered the inability to express modifications on existing resources a limitation of the declarative model, but I can see how adding state helps here.

With Bicep, we mostly deploy only the initial state and then we either re-deploy the whole thing or, if this isn't possible due to the interruptions this causes, add migration scripts in an imperative language (az cli/ pwsh). Which is admittedly the much less elegant approach.


Every single person who has this "you don't need state" take on IaC seems to be missing one key point.

They haven't built it, because it's really really hard to do.


You can’t CRUD without state. If you just want the “C”, invoke the cloud APIs and skip the hassle.

And the author’s mental stretch of claiming git is the state is still a bit of a leap.


Even if Terraform worked without state, we’d still have a state file because it’d be unbearably slow without, and constantly run into API rate limits.


How would that result in more API calls than Terraform with state? When creating a plan, Terraform does already compare stored resource state and actual resource state, which does require doing API requests for each resource.


Terraform also needs to detect when a resource is removed from a template. Having a state file makes this easy (if something is in the state file and not the template, delete it); otherwise, you have to scan your whole infrastructure (and still risk deleting something that was never managed by terraform in the first place).


In your inner dev loop, set -refresh=false, it’s a much nicer experience.


Terraform already does a three way reconciliation at plan/apply time. It checks declared state (your code), known state (your statefile), and current state (api calls to your providers).

If you want to see this, just run a terraform apply. After it completes, go change something directly in your cloud provider and run another plan. The top section of the plan will highlight the changes Terraform found remotely that don't match existing state.

So all this would do is remove known state, so you could do a 2-way reconciliation instead of a 3-way reconciliation. You would compare current-state with declared state. This is how Kubernetes works. But to be fair, Kubernetes actually works in a slightly more controlled environment than Terraform.


Thanks for letting me know.

What an absolutely bone-headed idea. That gives you the worst of both worlds (I guess they’d argue it’s the best instead).

I guess that’s my mistake for assuming Terraform wouldn’t have an extra reason for me to hate it without making something up.


Having used Ansible to easily manage AWS, I have been horrified by how my new org uses Terraform for AWS, precisely because TF maintains state. It is both hard to develop with in a team and a security nightmare.

They diligently recorded all secrets in AWS Secrets Manager. They used IAM roles and policies to control who can see which environment’s secrets. Great!

But then I discovered all the secrets in PLAIN TEXT dumped into the state file?! And this is known behavior that Hashicorp defends as reasonable. They suggest you just encrypt the state file. So the next time there’s a big oil spill, let’s just throw a big blanket over it and call it a day.

Terraform’s solution swiftly obliterates all the auditing, key rotation, and separation of duties built into Secrets Manager. Indeed, what is the point of using Secrets Manager at all if you use Terraform?

Next, they recorded state in a git repo. So all those secrets are committed to the repo. Now how am I supposed to encrypt the state file? What a disaster.

But wait. There’s more. With the team growing, how are we supposed to manage shared resources? Do I have to run tf apply, wait for completion, and then immediately commit and push, and hope I don’t have to manually manage a state file merge conflict? Or should I use some bizarre, self-built mutex?

Ugh. Terraform is the worst. I was always unimpressed by its feature set and documentation. Now I hate it. I don’t understand why it is so popular when there are such better alternatives.


Why on earth is your state in git? The tool has built-in functionality to handle just these kinds of workflows. This reads a lot like hitting your thumb with the hammer and blaming hammers.


Yes, this. Just declare a backend "gcs" {} block and supply a few flags at init time to store the state in remote storage and encrypt it:

terraform init -backend-config="bucket=xxx" -backend-config="prefix=my-deployment-name" -backend-config="encryption_key=my-random-bits"


Again, that just puts a bandaid over the problem. You can’t individually audit access to, or rotate, secrets stored in state files.


Presumably one would want to store the state in Git to get contextual diffing “for free” and possibly to avoid a dependency on another system.


If that's the reason, then create a separate, locked-down Git repo just for this. Protecting your state file was a big deal when I was first reading about Terraform. It was really drilled in.


And that's why many people don't like the idea of a state file. Sure there are benefits, but there are also drawbacks. You now need another system to manage your state. You don't with ansible.


Ansible is a different system, with a subtly different use case. It generally manages a preexisting list of targets. In that sense, there is some initial "state" in Ansible, this being your inventory.

Terraform (or CloudFormation, or Pulumi, or Crossplane, for that matter) shine when you need to create resources. Think of the state as the inventory of what you've created (or imported).


If you think of the resource you are managing with ansible being your AWS account (or your VMWare system, or whatever), then I guess it makes more sense. That state (the account you manage) doesn't really change. (I don't use ansible but that is my understanding)

Having three different sources of truth (what is, in AWS; what should be, in the .tf files; and something else, in the statefile) can mean nasty three-way merges.

But I don't manage thousands of different resources, I manage 50. It feels to me that the overhead needed to manage thousands struggles to scale down without bringing all the required baggage. It feels like Kubernetes vs docker-compose.

That said, the concept of using an S3 bucket for storing state that I saw elsewhere in these comments is an interesting idea, so I may revisit Terraform.


Not always. I’ve used Ansible to stand up all the inventory that I managed with it.


I didn’t choose it, but my guess is that they didn’t know what they were doing.

Even so, git only exacerbated the problem of secrets being in state files.


> But wait. There’s more. With the team growing, how are we supposed to manage shared resources? Do I have to run tf apply, wait for completion, and then immediately commit and push, and hope I don’t have to manually manage a state file merge conflict? Or should I use some bizarre, self-built mutex?

In one project, we had Terraform run from GitLab CI. The CI was creating a plan. The reviewer had to approve applying that plan.

I'm curious about your usage of Ansible for AWS resources. How do you delete them? Do you have a policy of always having 'state: absent' for at least one commit?


> there are such better alternatives

Would you please elaborate about the alternatives (beside Ansible which you’ve already mentioned)?


Maybe they meant provider-specific services, like Deployment Manager or Managed Terraform, or Infrastructure Manager.

PS: Infrastructure Manager was released just 1 month ago, so probably not.


Meh.

The usual practice is to keep the terraform state in an encrypted S3 bucket tucked away in a separate account (CI/CD, management or similar), with IAM policies controlling who can actually access the terraform state file in a cross-account setup. Limited access to the bucket with the terraform state can be controlled via the S3 bucket access IAM policy. Typically, there is an overarching, cross-account organisational IAM role that controls such access.

Each terraform project typically gets its own dedicated state bucket. Sharing the same S3 bucket for multiple solutions is unusual.

The encrypted S3 bucket persisting the terraform state file has to have the S3 versioning enabled. If one stores the terraform state in git, they are cooking it wrong.

Static and third party provided secrets are stored manually in the Parameter Store and are sourced via «data» blocks in terraform programmatically. Access to the secrets is controlled via the appropriate IAM roles and policies. Access to the secrets by humans is by exemption that is attached to a (typically) SSO role associated with an access group in the organisation's own IDP. This is no different from the non-AWS secret management solutions and tools.
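A sketch of that pattern (the parameter name is illustrative):

    data "aws_ssm_parameter" "db_password" {
      name            = "/prod/db/password"
      with_decryption = true # decrypt the SecureString at read time
    }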

Ephemeral or «low value»[0] secrets in the Secrets Manager can be rotated daily (e.g. every 24 hours) – to discourage the manual access to, storage of and reliance on them as well as to encourage to retrieve the secrets programmatically.

terraform is a vehicle to get from point A to point B, and is not an AI nor a substitute for the knowledge of the platform.

The terraform documentation is excellent, by the way. It is no replacement for the knowledge of AWS though.

[0] Generated via the «random» provider.


For secrets or credentials, I've had a reasonable experience putting the name of the secret into a terraform resource, but then setting the secret's actual value outside of terraform (i.e. via web or cli).

This avoids storing the secret's value in the state file (it's stored as an empty string) and also keeps the terraform plan clean.
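A sketch of that approach (names are illustrative): Terraform creates only the secret's container, and the value is set out-of-band so it never lands in the state file:

    resource "aws_secretsmanager_secret" "db" {
      name = "prod/db/password"
      # No aws_secretsmanager_secret_version here: the value is set via
      # the console or CLI, outside of Terraform.
    }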


As soon as I read that Ansible was easier than Terraform, I knew you were probably doing something wrong.

Ansible is fine, to be clear, for managing things like configuration within compute instances and databases (such as user accounts, for example). It is good for this, but for building and maintaining the creation and lifecycle of the infrastructure itself, it is the wrong tool for the job.

First of all, you are using local state in Terraform, which is a terrible, terrible idea. You are already seeing why, based on your problems with it. I've never seen an organization work effectively on a team with local state backends. You need to use a remote state backend. The most popular is to just use a provider like S3 (or the equivalent with other clouds). The state file is pulled from there at plan/apply-time and then pushed back up when complete. Everyone always gets the latest statefile, and it is shared automatically without needing to commit it.

This solves other problems too. What if two people try to apply at the same time? That is why state locking exists. Before the remote state file is pulled down, it is "checked out" and locked. Now only that person can use the state file until it is checked back in and unlocked. This happens seamlessly in Terraform once you set up a state lock backend. You still just run `terraform apply` and all this happens behind the scenes. If someone else tries to apply while you're already updating things, their CLI will wait until you are complete. Then they will pull down all of the changes you just made. If they were running the same code as you, it would say there is nothing to update. Easy.

Encryption of your state file also happens this way, by encrypting the remote backend. It doesn't solve local encryption, and this is admittedly a problem with Terraform which HashiCorp has refused to address, but it is already being addressed in OpenTofu. As a result, secrets should simply not be in your state file. You should use a provider like Secrets Manager to do this for you, so that you only have references to your secrets in state instead of the secrets themselves. This is simply a known rule: just like you don't commit anything secret to a git repo, you don't put anything secret in your Terraform.

Lastly, state versioning. This is accomplished through your remote backend state provider too. We use S3 at work, so we get native versioning there, and we can roll back (and have rolled back) state in emergencies (although this itself is not a great practice, just like how you should never edit previous commits in git), but it can be done and preserved this way as a safety mechanism.

As for your auditing and key rotation of secrets, it once again sounds like you are doing this wrong. You shouldn't be updating the secret values themselves in Terraform. You should only be creating the secrets and passing around references to them in Terraform. This requires a better understanding of how Secrets Manager works. Creating a secret in AWS, for example, only creates an empty "repo" for a secret. It has an ID and metadata, but no inherent value. This is what you do in Terraform. Adding the value to the secret is a separate (hopefully automated) process. The secret value itself should be hidden from honestly all of your users (even yourself). Rotation should happen automatically with Lambdas or automated jobs of some kind. You could even do something with Ansible for this. You rotate the secret value, but the secret reference doesn't change, and everything continues to work with secret versioning and some proper architecture on your side. This should NOT be done in Terraform. Auditing your secrets, again, is not a job for Terraform. This is a separate process from IaC. Terraform is IaC. Use the right tool for the job.

State merge conflicts should never happen as you fear, because the state is wholly managed by the terraform binary. Yes, you might have to resolve state conflicts where someone added too much drift outside of Terraform and you need to fix it, but the CLI provides tools for making these changes and you shouldn't edit the state yourself. Using these tools (like `terraform state mv`, `terraform state rm` or `terraform import`) should resolve state conflicts without causing any sort of merge conflict. The state file (with proper state locking) will only ever be edited by the tf binary, and only by one person at a time.

So I wouldn't hate on Terraform just yet. It sounds like you and your whole team are using it wrong.


This is a very informative reply and I appreciate it. With the improvements you suggest, many of which I had already started to make, we could mitigate many of the issues.

However, these issues are just a few of the most recent reasons I dislike TF. I’ve never been impressed by TF documentation. It always comes up lacking for me. Subjective, I know.

Similarly subjectively, when I was looking around about 6 years ago, TF had way less coverage of AWS resources than Ansible.

Basically, my experience is that the level of thought and quality of engineering that went into Terraform is way less than Ansible. And I am annoyed because somehow Terraform won in spite of that. My hope is Terraform's licensing change will fracture the market and the next thing will emerge.


I read this article and the original one.

I don't see any reasoning besides "I think it's not really needed".

The reasoning in the original article is shallow at best, and the presented alternatives are not discussed in the depth I was expecting for an article questioning the whole architectural basis of a very popular IaC stack.

On the second article, I don't see a self-critique of the actual points raised in the original article.

The two articles sound much like a collection of statements based on personal opinion.

I see the value in an article like this for starting a discussion, but not for drawing any sort of conclusion.



