stackskipton 16 hours ago

Ops type (DevOps/SRE/Sysadmin/whatever you want to call me) here, so I was really interested, but this blog left me with more questions than answers.

What is SI? Homegrown GUI Terraform? That part is not clear in the article. It looks like homegrown GUI Terraform with modules, so that's what I'm going with. Cool, glad you got that working; sounds like a big project and you were able to pull it off.

However, this part confused me, "Our engineers were investing a lot of time in what felt like “IaC limbo,” making a change in a Terraform file, waiting for review, waiting for CI/CD to run, and only then finding out if it worked. A simple tweak to a networking rule could take hours to validate."

What in tarnation are you doing? Do you have a massive Terraform repo, so the apply takes forever because the plan runs forever? Talk to me, Goose: what is going on that Terraform changes take hours to run? Our worst folder takes about 10 minutes to plan, and it's a massive "everything for this specific project" folder. We also let people run tofu plan/apply from their laptops in Dev, so feedback is instant.

We do have folders that depend on other folders. For example, we can't set up Azure Kubernetes without the network being in place, but we just keep a depends_on YAML file that our CI/CD pipelines work off when doing a full rollout, which is not their normal mode of operation (it's for DR only). We also assume that people have not been doing ClickOps, or if they have, that they take responsibility for letting IaC resolve it.
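A sketch of what that kind of dependency manifest can look like (folder names and schema are illustrative only, not from any specific pipeline or CI system):

```yaml
# Hypothetical depends_on manifest read by the CI/CD pipeline during a
# full (DR-only) rollout. Folder names are made up for illustration.
folders:
  - name: network
    depends_on: []
  - name: aks-cluster        # Azure Kubernetes needs the network first
    depends_on: [network]
  - name: workloads
    depends_on: [aks-cluster]
```

The pipeline just topologically sorts the folders by `depends_on` before running plan/apply in each one.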

Writing your own API calls to a cloud provider is not something I would wish upon anyone. I did it for a Prometheus HTTP Service Discovery system, and just getting data was difficult; I can't imagine Create/Update/Delete.

  • SteveNuts 16 hours ago

    >this blog left me with more questions than answers

    Probably because it's a thinly veiled ad, I agree the post is severely lacking details.

  • ryanryke 15 hours ago

    Thanks for the feedback. I'm new to the platform, and certainly appreciate the interaction.

    I think I described SI a bit better in another reply, and you can certainly check their website for a better description than I can give here.

    I'll try to high level our particular issues to give you a sense of why this is important to us.

    Traditionally, we've managed our customers via TF. I made a big push years back to try to standardize how we delivered infrastructure to our customers. We started pushing module libraries, abstracted variables via YAML, and leveraged Terragrunt to try to be as DRY as possible. We followed best practices to try to minimize state files for reduced blast radius, etc.

    What became apparent was that despite how much we tried to standardize, there was always something that didn't fit between customers. So each customer quickly became a snowflake. It would have its own special version of some module, or some specialized logic to match their workflow. Then over time, as the modules evolved, the questions started to come up:

    - Do we go back and update every customer with the new version of the module?
    - Does the new module have different provider/submodule/TF version requirements?
    - Did the customer make some other changes to infra that aren't captured?

    Making minor changes could end up taking way longer than necessary. Making large changes could be a nightmare.

    In working with SI, the mindset has shifted. Rather than manage the hypothetical (i.e., what's written in TF), let's manage the actual. Rather than trying to reconcile in code why a container has 2 CPUs instead of 4, find the issue and fix it. If we want to upgrade something, find it and upgrade it.

    I can go into greater depth if you care or have questions, but this at a high level explains this post a bit more.

AOE9 17 hours ago

This blog feels like a poor ad. I was hoping for technical details, but it seems like this tool just swooped in and 'saved the day'. I have no idea how

* Provisioning time dropped from hours to minutes.
* Debugging speed improved because we could fix it in real time.

happened.

Seems like the problem of a long feedback loop would have been solved by pull request preview environments, and/or enabling developers to have their own deployed instances for testing, etc.

  • holoway 16 hours ago

    Ryan can give you more details about his own experience (I'm the CEO of System Initiative). But a lot of it comes from switching to a model where you work with an AI agent alongside digital twins of the infrastructure.

    In particular, debugging speed improves because you can ask the agent questions like:

    `I have a website running on ec2 that is not working. Make a plan to discover all the infrastructure components that could have an impact on why I can't reach it from a web browser, then troubleshoot the issue.`

    And it will discover infrastructure, evaluate the configuration, and see if it can find the issue. Then it can make the fix in a simulation, humans can review it, and you're done. It handles all the audit trails, review, state, etc for you under the hood - so the actual closing of the troubleshooting loop happens much faster as well.

    • AOE9 16 hours ago

      When you say 'digital twins of the infrastructure', do you mean another deployed instance? If so, had they just created a preview environment on each pull request, they'd have gotten the same speedup.

      > It handles all the audit trails, review, state, etc for you under the hood.

      So there is no more IaC? SI now manages everything?

      • holoway 16 hours ago

        Nope - I mean we make a 1:1 model of the real resource, and then let you propose changes to that data model. Rather than thinking of it like code in a file, think of it like having a live database that does bi-directional sync. The speedup in validating the change happens because we can run it on the data model, rather than on 'real' infrastructure.

        Then we track the changes you make to that hypothetical model, and when you like it, apply the specific actions needed to make the real infrastructure conform. All the policy checking, pipeline processing, state file management, etc. is all streamlined.
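        A minimal sketch of that idea (the names and structures here are made up for illustration, not SI's actual data model): the component holds the proposed properties, the resource holds the observed state, and a diff between the two is what gets reviewed before anything touches real infrastructure.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """Proposed ('digital twin') side of a resource."""
    props: dict = field(default_factory=dict)

@dataclass
class Resource:
    """Observed state, refreshed from the cloud provider."""
    state: dict = field(default_factory=dict)

def diff(component: Component, resource: Resource) -> dict:
    """Properties where the proposal and reality disagree."""
    keys = set(component.props) | set(resource.state)
    return {k: (component.props.get(k), resource.state.get(k))
            for k in keys
            if component.props.get(k) != resource.state.get(k)}

# Propose a change on the model, not on real infrastructure.
twin = Component(props={"InstanceType": "t3.large", "ImageId": "ami-123"})
real = Resource(state={"InstanceType": "t3.micro", "ImageId": "ami-123"})

changes = diff(twin, real)
# Only the disagreeing property shows up; applying it to the cloud is a
# separate, reviewable step.
print(changes)  # {'InstanceType': ('t3.large', 't3.micro')}
```

        The same diff works in both directions: if the cloud side changes out from under you, it surfaces as a delta to review rather than silent drift.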

        • AOE9 16 hours ago

          Ah okay thank you for clarifying!

          Personally not my thing, I'd rather be testing on real infrastructure rather than a simulation.

          • holoway 16 hours ago

            For what it's worth, that means hitting the 'apply' button in System Initiative. It's a totally viable workflow - it's not either or, it's 'and'.

            • AOE9 15 hours ago

              Yes, hitting apply would update the production infrastructure, but what if I want to run automated tests to check a change/new feature? I can't do that on a simulation.

              • holoway 15 hours ago

                Right - obviously, if you need the actual code deployed to run your test, there is not much anyone can do about that. But let me tell you how you would set that up, from scratch, in System Initiative (assuming you have a working deployment at all).

                I assume the use case here is 'I want to deploy the application on every pull request to net-new infrastructure, then run my test suite, and destroy the test infrastructure once the PR is merged or the code is updated'.

                You would fire up the AI Agent and ask it to discover an existing deployment of the application. Probably give it a hint or the boundaries you care about (stop at the network layer, for example - you probably don't want to deploy a net new VPC, subnets, or internet gateways). Once that's done, you'll have a model of the infrastructure for your application in System Initiative.

                Then you'll turn that into a repeatable template component, either by asking the AI to do it for you, or by selecting the related infrastructure in our Web UI and hitting 'T'. You'll add some attributes like 'version' to the template, and plumb them through to the right spot in the code we generate for you.

                Then you're going to call that code from our GitHub action on every PR, setting the name and the version number from the branch and the artifact version, naming the change set after the PR as well. You'll let the action apply the change set itself, which will then create the infrastructure.

                The next step will be to run your tests against the infrastructure.

                On merge you'll have another GitHub action that opens a change set and deletes the infrastructure you just created, so you don't waste any cash.

                Notice what I didn't tell you to do: figure out how to create new state files, build new CI/CD pipelines, or anything else. You just started from the actual truth of what you already have, used our digital twins to make a repeatable template out of it, then told the platform to do it over and over again with an external API.
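                As a rough illustration only - the action name and inputs below are placeholders I'm making up to show the shape of that workflow, not System Initiative's real GitHub action:

```yaml
# Hypothetical per-PR workflow; check the real SI docs for actual
# action names and inputs. Everything here is a placeholder sketch.
name: pr-environment
on:
  pull_request:
jobs:
  deploy-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create infra from the template
        uses: example/si-apply-action@v1      # placeholder action
        with:
          change-set: pr-${{ github.event.number }}
          template-version: ${{ github.sha }}
      - name: Run the test suite against the new infra
        run: ./run-integration-tests.sh       # your tests here
```

                A matching workflow on merge would open a change set that deletes the same infrastructure.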

                Hope that helps it make sense.

        • stackskipton 16 hours ago

          So you recreated Terraform/OpenTofu state?

          • holoway 15 hours ago

            Nope. Terraform/OpenTofu state has several big differences.

            The first is that Terraform/Tofu can drift. This is why people suffer when a change gets made outside of IaC and the state file no longer tracks reality: IaC tools are by design unidirectional, so change should only ever flow from the IaC to the infrastructure. In SI, this is fine. The resource state can update, and then you can decide if the change was beneficial (at which point we just update the component side of the equation, and you're done) or not (at which point you decide what action to take to revert it).

            The second is how it gets generated. In Terraform/Tofu, it's a side effect of the 'apply' phase - basically a compile-time artifact. In System Initiative it's the heart of the system: the code you write operates on that model rather than generating it. This makes programming it much simpler. You can change the model through our Web UI, you can change it through an API, you can change it with an AI agent, the resource can change because the underlying cloud provider changes it, and it all just works.

            • stackskipton 14 hours ago

              State can drift in SI as well, unless you are subscribing to events from AWS that alert your system as soon as a resource is changed, so you can update your side.

              >the code you write is operating on that model, not generating that model.

              What are you talking about? That model is not reality, because reality is whatever the state of the resource is in AWS. If your model says my S3 bucket is not public but someone changes it in AWS to make it public, who cares what the model says; it's public, and that's what's important. Sure, your system may update itself more frequently than only when I run "tofu plan/apply", but at the end of the day, it doesn't matter.

              All I'm saying is that, as an SRE, you have done a poor job selling this to me. I'm telling you what I would tell my boss if he came to me with this product.

              "This is some custom IaC system with AI agents sprinkled on top. I guess if you want to get rid of the SRE team and replace us with their consultants, whatever, I won't be here to care. If you want us as the SRE team to use it, nope, it's a waste of money since OpenTofu has much better support. Can you approve my Spacelift purchase instead?"

              • ryanryke 10 hours ago

                > Sure, your system may update itself more frequently than only when I run "tofu plan/apply" but at end of the day, it doesn't matter.

                Correct me if I'm wrong here. In my experience you have to "apply" before state is updated. This would mean we weren't quite operating on the source of truth (AWS in this case).

                100% it's a solvable problem with a TF-centric toolchain. But it's still a problem that needs solving.

                In my experience with SI, it fades into the background. Now, I'm sure there is an edge case where someone edits something outside of SI while I'm simultaneously trying to update it in SI, where things might break. I haven't run into it yet.

                > All I'm saying as SRE, you have done poor job selling this to me

                Can't argue with that, but I would say that, like any other new tool, it's worth checking out. :)

                • stackskipton 8 hours ago

                  Yes, at the apply stage, the state is updated. But all the state is really useful for is finding the resource in the big 3 clouds. In fact, I'd argue TF could do away with the state file beyond the mapping resource "s3_bucket" "thebucket" -> arn:aws:s3:us-east-2:000:0123455, since it pulls down the current state of the system as-is and then shows you "this is what you want, and with the current state, this is what will change."
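                  To make that concrete (a conceptual sketch, not Terraform's actual state schema): the one thing a plan can't re-derive from the cloud API is which logical address maps to which real resource ID.

```python
# Conceptual sketch only, not Terraform's real state format. The
# irreducible part of state is this logical-address -> real-ID map;
# the ARN below is an illustrative placeholder.
state = {
    "aws_s3_bucket.thebucket": "arn:aws:s3:::thebucket",
}

def resolve(address: str) -> str:
    """Find the real resource behind a logical address. Everything
    else (the current attributes) can be refreshed live from the
    provider at plan time."""
    return state[address]

print(resolve("aws_s3_bucket.thebucket"))  # arn:aws:s3:::thebucket
```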

                  > I would say like any other new tool, it's worth checking out. :)

                  I don't see the need for a couple of reasons:

                  1) How? If you want me to try something, it needs a big "TRY ME" button; if it involves becoming a client, then in that case I see you as replacing me, so my motivation is zero. :D

                  2) I'm on Azure for the most part, so it's useless anyway.

                  3) You have not shown me how SI is that much better than Terraform. If I'm going to invest time in this over yelling at Kubernetes, I need to know my time is worth it.

                  At the end of the day, we all want the same thing: here is defined infrastructure, be it YAML, JSON, HCL, some GUI, an API call to a system, or whatever the AI is smoking, plus the ability to see what changes and to make those changes. HCL/Tofu is what most of us have picked because it's pretty open and widely supported across all the providers. You have to overcome all that. This blog post reads like "we have this great new Windows Server thing that will blow your Linux Server stuff away completely, with a GUI and Siri."

                  Maybe that's what your customer base needs. However, at the technology companies I work at, we don't need that. Editing outside IaC is done very slowly, deliberately, and almost always backported. If not, you will get called out. It would be like a dev writing code with no tests.

      • holoway 16 hours ago

        And yes, there is no more IaC under the hood.

        However! Folks with big IaC deployments can still use all the discovery and troubleshooting goodness, and then make the change however they want. System Initiative is fine either way.

        • AOE9 16 hours ago

          Personally, moving away from IaC is a big yikes. For something so critical to my company, no way would I let myself be locked into your product. I have already been bitten before when a developer-productivity startup fails/pivots (as they often seem to do).

          • holoway 16 hours ago

            That's cool. For what it's worth, the software is all open source, precisely because it's critical in this way. I realize that's like telling you that you can take care of this puppy yourself if you want. :)

            Even if you don't move away from IaC, you can still get benefits from the approach by having SI discover the results, and then do analysis.

            • lawnchair 16 minutes ago

              I noticed on your open source page it says:

              > You can make a build that includes our trademarks to develop System Initiative software itself. You may not publish or share the build, and you may not use that build to run System Initiative software for any other purpose.

              That feels a bit different from what many developers expect when they hear "open source." Nothing wrong with that, just pointing it out.

              https://www.systeminit.com/open-source

            • AOE9 15 hours ago

              Sorry, maybe my last reply was a little harsh, now that I understand there simply isn't IaC under the hood anymore.

              I still have major reservations about dropping IaC and just working on a simulation of what is deployed. I don't see how this can work for more complex deployments such as multi-region/AZ deployments, blue/green deployments, cell-based deployments, etc. It seems like dropping IaC would only work for very simple environments.

              • holoway 13 hours ago

                It works great. If you think of it as 'dropping all the reasons we chose IaC', then yes, that's obviously dumb. If you think of it as 'getting all those benefits, plus faster feedback loops, AI agents, and an easier programming model', then... not so much.

      • esseph 16 hours ago

        No, not another deployed instance.

  • ryanryke 16 hours ago

    Thanks for the feedback. My plan is to spend a little more time to dive into the details on a follow up post.

    I'll try to explain our experience here in a little better detail though.

    In a traditional IaC tool (TF, for example), the flow would go something like this (YMMV):

    Update TF -> Plan -> PR -> Review (auto or peer) -> Merge -> TF Reviews State File -> TF Makes changes -> Updates State.

    Some issues we could run into:

    - We support multiple customers, each with their own teams that may or may not have updated infra, so drift is always present.

    - We support customers over time, so modules and versions age, and we aren't always given the time to go back and make sure past TF is updated. So version pins need to be updated, among other dependencies.

    Each of those could take a bit of time to resolve so that the TF plan comes back clean and our updates are applied. Of course, there are tools such as HCP Cloud, Spacelift, Terrateam, etc., but in my experience they shift a lot of the same problems to different parts of the workflow.

    The workflow with SI is closer to the following: Ask AI for a change -> AI builds a change set (PR) -> Review -> Apply

    The secret sauce is SI's "digital twin". We aren't just using AI to update code; we're actually using it to initiate changes to AWS via SI. While I would never want a team to make changes directly to AWS without a peer review or something similar, it sits much closer to what the actual infrastructure is, even with the changes that happen to the infrastructure naturally.

    This has allowed us to move quite a bit faster in updating and maintaining our customers' infrastructure, while still sticking as close as possible to best practices.

    • stackskipton 16 hours ago

      So basically the product is "custom IaC with an AI agent". Sounds like a great business model if you can convince companies to go for it.

      However, as an SRE, pass. I'd rather keep IaC in one of our pre-existing tools with much wider support and less lock-in. Also, since I'm on Azure/GCP, this tool won't work for me anyway, since it's AWS-focused, and when you go multi-cloud, the difficulty ramps up pretty quickly.

      • holoway 16 hours ago

        It's absolutely AWS-focused today, but one upside of the approach is that building the models is straightforward, because we can build a pipeline that starts from the upstream's specification and augments it with documentation, validation, etc. We'll certainly be expanding coverage.

      • ryanryke 15 hours ago

        Essentially. I'm not sure you could call it IaC specifically, but the same ideas apply.

        Regarding lock in: I don't necessarily think there is anything here that is stopping you from writing TF and importing objects. Conversely, SI is great for importing resources into their model.

        The objects are essentially modeled in TypeScript on the back end, so support for other vendors is possible; it's just a question of whether they've been created yet. I'll let the SI folks dive into the details there.

    • AOE9 16 hours ago

      I think, as a professional services company, that imposes a certain workflow on you. For regular software engineering, you'd just make the IaC/code deployable from the developer's machine, and/or, on a pull request, take the branch's code, deploy it, and post a link back to the PR.

ryanryke 17 hours ago

We're really excited about what the future holds with SI. Feel free to ask any questions.

  • tietjens 16 hours ago

    I have been on a small journey trying to understand what SI is. I've read your blog posts, listened to the Changelog show with the CEO, watched some demos, and joined the Discord. But I still don't understand what a 1:1 digital twin means. Are you mirroring AWS's API? Can you help me grok what 1:1 means concretely?

    • holoway 16 hours ago

      You should check out the site again today; I think it will at least give you a high-level sense of what it's like to use System Initiative today.

      We didn't recreate the AWS API. Rather than thinking about it as API calls, imagine it this way. You have a real resource, say an EC2 instance. It has tons of properties, like 'ImageId', 'InstanceType', or 'InstanceId'. Over the lifetime of that EC2 instance, some of those properties might change, usually because someone takes action on that instance - say, to start, stop, or restart it. That gets reflected in the 'state' of the resource. If the resource changes, you can look at its state and update the resource (which is a very straightforward operation most of the time).

      The 'digital twin' (what we call a component) is taking that exact same representation that AWS has, and making a mirror of it. Imagine it like a linked copy. Now, on that copy, you can set properties, propose actions, validate your input, apply your policy, etc. You can compare it to the (constantly evolving, perhaps) state of the real resource.

      So we track the changes you make to the component, make sure they make sense, and then let you review everything you (or an AI agent) are proposing. Then when it comes time to actually apply those changes to the world, we do that for you directly.

      A few other upsides of this approach: one is that we don't care how a change happens. If you change something outside of System Initiative, that's fine; the resource can update, and then you can look at the delta and decide if it's beneficial or not. Because we track changes over time, we can do things like replay those changes into open change sets, basically making sure any proposed changes you are making are always up to date with the real world.
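      That 'replay into open change sets' behavior can be sketched like this (made-up structures, not SI's implementation): a change set stores the proposed edits themselves, so when the observed resource state moves underneath it, the same edits can be re-applied onto the fresh snapshot.

```python
def rebase(base_state: dict, proposed_edits: dict) -> dict:
    """Re-apply proposed edits on top of a fresh snapshot of reality,
    so an open change set stays current with out-of-band changes."""
    merged = dict(base_state)   # start from what the cloud says *now*
    merged.update(proposed_edits)
    return merged

# Change set opened when the instance was a t3.micro.
edits = {"InstanceType": "t3.large"}

# Meanwhile, someone resized the volume outside the tool.
fresh = {"InstanceType": "t3.micro", "VolumeSize": 20}

print(rebase(fresh, edits))
# {'InstanceType': 't3.large', 'VolumeSize': 20}
```

      The out-of-band volume change survives, and the still-open proposal applies cleanly on top of it.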

    • ryanryke 16 hours ago

      Feel free to reach out and I can show you.

      The way I think about it is like this:

      We want a representation that is as close as possible to what actually is in AWS. That way any proposed changes have a high probability of success when they are applied. SI's approach keeps an extremely up to date representation of what's in AWS.

      Why do we need a representation instead of just going directly to the AWS API? Among other things, going direct removes the ability to review changes before they are applied. The representation gives us a safety net, if you will.

      • tietjens 15 hours ago

        Is this representation made available to SI users? Do I have a clear overview of it? I've accepted that it isn't API calls.

        • holoway 15 hours ago

          Yeah, in all sorts of ways. You can look at it in a Grid of components. You can look at it in a Map, seeing all the relationships. You can look at it via an API. You can have an AI Agent summarize it for you. It's super transparent.