We are shaping a cloud strategy and would like to understand what typical examples of cloud vendor lock-in look like. We are considering OpenShift as a way to reduce potential vendor dependency and would like to hear the community's opinion.
Having worked at AWS and then a litany of other companies in Seattle that mostly use AWS or Google Cloud, here's my perspective on some forms of lock-in that you might not have top of mind:
* Larger companies generally have contracts with cloud providers to pay lower rates. Sometimes these contracts include obligations to use a technology for a certain period of time to get the reduced rate.
* Any technology that isn't completely lift-and-shift from one cloud provider to another. It used to be that a JAR deployed to a 'real' host (say EC2) that accesses config through environment variables and communicates over HTTP was the gold standard here. Now Docker broadens the possibilities a bit (see the sketch after this list).
* All the cloud providers have annoyingly different queueing/streaming primitives (SQS, Kinesis, Kafka wrappers...), so if you are using those you might find it annoying to switch.
* Even for tried-and-true technologies like compute, MySQL, and K/V stores, cloud providers offer lots of "Embrace and Extend" features.
* If you are wise then you will have back-ups of your data in cold storage. Getting these out can be expensive. Generally getting your data out of the cloud and into another cloud is expensive, depending on your scale.
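To make the "boring, lift-and-shift" shape concrete, here is a minimal sketch in Go: all configuration comes from environment variables and the outside world is reached over plain HTTP, so the same artifact can run on EC2, a container, or bare metal. The variable names (LISTEN_ADDR, DB_URL) are made up for illustration, not part of any cloud's API.

```go
// Minimal sketch of a "boring" portable service: config via environment
// variables, interaction via HTTP. Nothing here knows which cloud it's on.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	addr := os.Getenv("LISTEN_ADDR") // e.g. ":8080", injected by whatever runs us
	if addr == "" {
		addr = ":8080"
	}
	dbURL := os.Getenv("DB_URL") // connection string lives outside the artifact

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "ok (db configured: %t)\n", dbURL != "")
	})

	if err := http.ListenAndServe(addr, nil); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```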
IMO the only way to truly avoid lock-in is to use bog-standard boring technologies deployed to compute instances, with very few interaction patterns other than TCP/HTTP communication, file storage, and DB access. For all but the largest companies and perverse scaling patterns, this will get you where you are going, and is probably cheaper than using all the fancy bells and whistles offered by whichever cloud provider you are using.
One thing that I've seen work is that if you absolutely require the ability to deploy on-prem then using something like OpenShift/Kubernetes as a primitive can work per the parent.
Even if you rely on streaming like Pub/Sub or Kinesis, one thing teams I've worked on have done is to write interfaces in the application tier that allow for using an on-prem primitive like Kafka without depending too much on the implementation behind that abstraction.
I've been on a platform team that built these primitives into the application layer, e.g. a blob storage interface to access any blob store, whether it's on-prem NFS, Azure, etc. However, I'm looking at newer projects like dapr [1] and have taken them for a spin in small projects. Such a project seems like a favorable way to add "platform services" to a non-trivial app while still maintaining a pubsub abstraction that allows for swapping out the physical backend.
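For what it's worth, the kind of application-tier abstraction being described could look something like the sketch below. These are not Dapr's or Go Micro's actual APIs; the interface names, methods, and backends are invented for illustration.

```go
// Hypothetical platform-service interfaces of the kind described above.
// Application code depends only on these; main() wires in the concrete
// implementation (Kafka vs. Kinesis, S3 vs. Azure Blob vs. NFS, ...).
package platform

import (
	"context"
	"io"
)

// PubSub hides which messaging backend actually carries the messages.
type PubSub interface {
	Publish(ctx context.Context, topic string, payload []byte) error
	Subscribe(ctx context.Context, topic string, handle func(payload []byte) error) error
}

// BlobStore hides whether blobs live in S3, Azure Blob Storage, or an NFS mount.
type BlobStore interface {
	Put(ctx context.Context, key string, r io.Reader) error
	Get(ctx context.Context, key string) (io.ReadCloser, error)
	Delete(ctx context.Context, key string) error
}
```

The lock-in then concentrates in the one package that implements these interfaces, which is the part you rewrite if you ever switch providers.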
So, agreeing with you, but with the caveat that you can rely on platform service interfaces and then the service behind that interface could be a cloud vendor product or a boring technology, provided you don't let that abstraction leak and call a very specific Kinesis feature, for example.
Had similar goals. Started by writing Go interfaces for it with Go Micro - https://go-micro.dev then opted for the platform service model as you mentioned with Micro - https://micro.dev
I think whether it's Dapr, Micro or something else, the platform service model with well defined interfaces is the way to go. I don't think a lot of people get this yet so it's still going to be a few years before it takes off.
That's an exciting project! One thing that would be cool is to build a workflows engine on top of the pubsub + key-value primitives. This is something that I think too many teams build internally but we need it to be a platform service instead of hand-rolled over and over.
For instance, there is a project that I think is pretty interesting, but it's built on top of Kafka, so you pretty much need to be running Kafka to use it. If I could swap in Redis as the pubsub + KV store, I'd use that project in a heartbeat.
Yea had similar aspirations. Someone even implemented the "Flow" concept as a Go interface. I felt like the first implementation was too complex and we never got back to it but definitely agree, it's a core primitive that needs to be built. Flow, triggers and actions that can be sequenced into steps with rollbacks.
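To show the shape such a primitive could take, here is a hypothetical sketch of "Flow": steps sequenced with compensating rollbacks. This is not the actual Micro implementation or the Go interface mentioned above, just an illustration of the idea.

```go
// Hypothetical "Flow" primitive: steps executed in order, with best-effort
// rollback of completed steps when one fails.
package flow

import "context"

// Step does one unit of work and knows how to undo it.
type Step struct {
	Name     string
	Run      func(ctx context.Context) error
	Rollback func(ctx context.Context) error
}

// Execute runs steps in order; on failure it rolls back the completed steps
// in reverse order and returns the original error.
func Execute(ctx context.Context, steps []Step) error {
	for i, s := range steps {
		if err := s.Run(ctx); err != nil {
			for j := i - 1; j >= 0; j-- {
				if steps[j].Rollback != nil {
					_ = steps[j].Rollback(ctx) // best-effort compensation
				}
			}
			return err
		}
	}
	return nil
}
```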
I prefer to exploit "cloud native" differences, but plan your off-ramps in advance. Plan, architect, and write it up, then ensure the engineering of the native approach maintains that plan to exit.
You do not have to write to the lowest common denominator; just have an exit strategy that fits inside your negotiated contract windows or regulatory grace periods for migration.
It's also worth noting that it seems infrequent for companies to move mature products to a different cloud.
My preference is for agnostic tech, but there's a fair chance of YAGNI for the flexibility. If some proprietary tech simplifies your operations, consider whether you'll actually get hurt investing in it.
I find the biggest source of lock-in is that it's really hard to move a bunch of data. The longer you stay, the harder it is to leave, because the impact (downtime, slowness, resources allocated to moving data instead of building your business) on your product/service is likely to be bigger.
As you research this, don't neglect the cost of attempting to remain vendor agnostic. Every level of abstraction adds new costs (for some people it's worth it, no doubt!). Sometimes it's more efficient to just go all-in with a vendor.
Not just hard, but expensive. The Acquired podcast episode on AWS[1] highlighted a particularly ironic angle of this:
> David: I can do you one better. Another example. Amazon.com used Oracle databases when it was started. Amazon.com did not finish their migration off of Oracle databases and onto AWS products until 2019.
> Ben: Oh, my God.
> David: Thirteen years after AWS launched.
> Ben: That is insane.
> David: It took that long for Amazon itself to migrate off of Oracle.
Anecdotally, from when I worked there in ~2015, there were a few good reasons:
(a) a lot of Amazon's software was so old that it connected directly to the databases. We had to finish all of the SOA migrations that were long overdue -- all the misc random tools no one had gotten around to migrating -- before we could even contemplate moving off Oracle.
(b) it wasn't done as a simple port; instead, the systems were re-architected in the process to no longer share backing databases at all, which basically meant it needed a complete rewrite, which took a long time to get funded and then a long time to deliver.
No doubt. This was an amusing aside, not a critique, intended to illustrate the long-term impact of storage choices. That's part of the bull-case for AWS long-term, and why we see projects like AWS Snowball.
True. I felt like they were good reasons at the time, or rather that they were basically the best you could do given that the can had been kicked down the road for far too long. As usual putting off solving tech debt ends up costing >10x in total.
This, of course, isn't new. I once worked at a company that was tasked with migrating data out of a CM system built by a company where I had also worked about 10 years earlier.
I ended up pinging one of the founders at the old company and asked him how we'd go about extracting the data using the platform's API (we were wondering because we kept running into stability issues doing this).
He basically told me what I suspected: there was no good way to do that. Because of the way the cache worked in the product, it would never stay up if every single data record were fetched via the API.
We ended up pulling it all out of the database directly, which was a bit of a pain.
tl;dr no vendor really wants to make it easy to pull ALL of your data out of their system.
Other commenters have covered the workload lock-in angle pretty well. Using Kubernetes as a target platform for your application already gives you a decent shot at workload portability. Keep in mind, though, that some K8s APIs are leaky abstractions, and you pay with lock-in to K8s itself, of course. At the end of the day, lock-in is a matter of trade-offs.
An often overlooked angle is the "organizational lock-in" to the cloud. Adopting the cloud in any serious capacity with more than a handful of teams/applications means that you will eventually have to build up some basic organizational capabilities like setting up resource hierarchy (e.g. an AWS Organization with multiple accounts), an account provisioning process, federated authentication, chargeback... See https://cloudfoundation.org/maturity-model/ for an overview of these topics.
To be honest I have seen quite a few enterprise organizations that went through so much organizational pain integrating their first cloud provider that implementing a second provider is not really that exciting anymore. Now of course, if you plan on eventually leveraging multi-cloud anyway you can save yourself a lot of pain by setting things up with multi-cloud in mind from day one.
My 2¢ is "most companies should not avoid vendor lock-in, but rather should lean in and make maximum productive use of the tools and features that their chosen cloud vendor provides". Engineer time and attention is more expensive than most people give it credit for and designing for a future, seamless cross-cloud migration is building a gold-plated pyramid of YAGNI for most companies.
I think small and medium companies are well suited to the cloud in areas where engineers are scarce or expensive. For me, cloud is more about pace, flexibility, and engineering efficiency than it is about scale.
Even my little side projects I run on cloud. I don’t want to futz around learning how to configure, secure, and scale a DB server if I can order one up or use DynamoDB.
I agree that very large and hyper scale companies are a worse fit for the cloud, largely in part because they can afford to dedicate centralized teams to running great ops. Tiny, small, and medium can’t.
Agree mostly. However, if you are an early-to-mid stage startup gaining traction and have enough of an engineering understanding not to become too coupled, then you can basically play the vendors against each other and get startup discounts or other types of discounts by swapping.
I've heard of cases where vendor A gives a big discount to the tune of the most recent year of annual billing from vendor B, if you can prove that your most recent bill from vendor B is 0 because you completed the migration.
How does this account for the engineering time, attention, and effort it takes to use these cloud services?
A common sentiment you hear online is that if you're on-prem you need to hire specialists to manage this software, while neglecting that the same is still true if you use a cloud provider. You'll still need to hire and train staff who are experts in AWS (or whatever cloud provider you use).
If you're just deploying with containers then I suppose all is good, but isn't the risk still there?
I think it outsources a slice of that effort (rack, stack, cool, basic networking) for all the AWS services and a lot of the server administration when you choose managed services.
It does not eliminate all learning curve, but the learning curve to get a static site hosted on S3/Cloudfront is a lot lower than the learning curve to do the same on prem starting from bare metal.
I agree with this in principle, however it seems proprietary cloud services often make development workflows more complicated, offsetting their operational simplicity and cost savings.
That makes it harder to tell what the right call is in any given situation.
Pre-plan your offramps in case of necessary migration, architect and document those plans, and maintain those plans in the documentation of each service, so you keep those off ramps in mind as the service evolves.
If my company decided all of a sudden to move from AWS to some other cloud (and nearly all of my experience as an engineer is with AWS), that’s a big headache for me and a lot of technology that I would have to re-learn. Or I could just go find another shop who is staying on AWS, and probably get a pay bump for myself too.
Not necessarily programming related, though I used programming to get myself out of this pickle.
Over the years I've created many Google Photos accounts for various trips and events (and to keep things free, since 15 GB comes with each account). Now I wanted to consolidate everything into a single account. You would think it would be easy, but it's not. You can move photos from one account to another, but not the painstakingly created albums and other customizations.
I've had to use a combination of Google Takeout and Google Photos API (which, in itself, is such a half-ass implementation) to move everything intact to a new account.
In a similar vein, I am doing a migration from iCloud Photo Library to a completely self-hosted, cloudless solution. Getting the photos out is easy enough, but there's no easy UI to export the albums, smart albums, tags, etc. I was doing quite a bit of AppleScript programming before I discovered https://github.com/RhetTbull/osxphotos
Much depends on what you count as "real" lock-in. Let's start with a dummy example: all cloud vendors offer APIs, and there is no formal lock-in there, since you can use them to push and pull data as you wish. BUT wait a minute: you USE THEM, which means you craft something of yours on top of certain third-party APIs. If they change, you have to change. If they don't work, your service won't work (at least not normally) either. You might say "hey, almost ALL software uses someone else's code", and yes, sure, but cloud APIs mean code you do not run; it's code that runs on someone else's iron. That's a TERRIBLE lock-in, even if in formal terms the data and the logic on top are at your disposal.
Habit is another form of soft but, in practice, VERY HARD lock-in. Let's say your employees already know, on average, at least a bit of Zoom or Teams or Meet. There is no lock-in in choosing one of those platforms, formally. In practice you get a certain UI for users, they get to know it, and they are lost if it changes, MOST users at least. Oh, just look at the sorry state of development of open VoIP tech; most sysadmins nowadays even HATE deskphones...
There are many examples like this that are NORMALLY not called lock-in but are in practice among the hardest forms of soft lock-in to break.
Oh, another form: you decide to quit, let's say, Backblaze. OK, no lock-in formally... Just... where do you want to put your gazillion PB of crap, ahem, backups?
One thing I haven't seen mentioned yet: The infrastructure setup. If you're only running a VM or a container this won't be a problem, but if you have any setup that creates stacks on demand, needs dynamic DNS entries, does service discovery or similar - which you will most likely have in any medium to large setup -, you'll discover that switching cloud providers will involve a lot of friction in your application. This will be especially fun if some steps are synchronous (i.e. single API call) with cloud provider A, but callback-based with cloud provider B.
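One way teams paper over the "synchronous on provider A, callback-based on provider B" mismatch is a thin wrapper that always presents a synchronous call. A hedged sketch, with invented provider and method names, just to illustrate the pattern:

```go
// Hypothetical wrapper that hides whether a provider's "create DNS record"
// call completes synchronously or only starts an async operation to poll.
package infra

import (
	"context"
	"time"
)

// DNSProvisioner is what deployment code programs against.
type DNSProvisioner interface {
	// EnsureRecord returns only once the record is actually usable.
	EnsureRecord(ctx context.Context, name, target string) error
}

// asyncProvider models a cloud whose API returns an operation ID immediately.
type asyncProvider struct {
	start  func(ctx context.Context, name, target string) (opID string, err error)
	isDone func(ctx context.Context, opID string) (bool, error)
}

func (p asyncProvider) EnsureRecord(ctx context.Context, name, target string) error {
	opID, err := p.start(ctx, name, target)
	if err != nil {
		return err
	}
	// Poll until the async operation settles, so callers see a synchronous API.
	for {
		done, err := p.isDone(ctx, opID)
		if err != nil || done {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}
```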
Terraform can help a bit, but a lot of examples I've seen are very AWS-dependent and it won't be as simple as changing the API key to deploy it somewhere else (however, you'll still have somewhat of a documented infrastructure, so there's that). OpenShift and Kubernetes help a lot, but you'll be paying extra for using the non-native abstractions of your specific cloud and, at least in my experience, some quirks will still end up in your app - most likely somewhere in inbound routing and monitoring.
That being said, vendor lock-in is a big topic, but depending on your situation, you really need to look at how much risk you are mitigating for your effort. None of the big clouds is likely to shut down unexpectedly (not even GCP), and no matter how much you prepare, moving a large infrastructure from cloud A to cloud B is always going to be both expensive and time-consuming; you will not do this for a minor reduction in the bill. If you really want to avoid lock-in, the actual way is to go multi-cloud, but this will be a lot of extra effort and I'd wager the expense is not worth it for most companies (except for backups).
Data systems or services like DynamoDB on AWS, which are not compatible with or available on other platforms.
Things like Security Groups can be another one, they don’t necessarily translate directly to what you’ll find on other platforms.
You don’t mention what clouds you’re evaluating, but avoiding services that aren’t available elsewhere or at least don’t have wire compatibility with those available elsewhere would be my recommendation.
I’ve been consulting on contract exclusively for the last 2.5 years, and the biggest issue I’ve seen is over-reliance on AWS Lambda. People tend to go crazy because Lambdas are so easy to spin up; however, they quickly find themselves with runaway AWS costs. The problem of infrastructure cost, once thought to be resolved by hyperscale cloud providers because it is so much cheaper than development cost, becomes salient once again. When your AWS spend, relative to your revenue, starts to impact your ability to add headcount, something is really wrong. The problem with how the teams I’ve worked with use Lambda is that they tend to use all the latest AWS-specific features and reject any abstraction framework. This makes it hard to move shop to a cheaper provider, so instead they opt to reunite the functions into a single executable application, and we’re back to an Express app deployed on EC2.
OpenShift is just k8s with batteries included, so the batteries are, in a real sense, a lock-in to the OpenShift community; take that trade-off as you like. I wouldn't mind, but since your initial post was driven by fear of lock-in, this may be a start.
Other examples:
- full spec S3 (the basics are well copied at this point)
- GraphDBs I've found are very different in many ways between cloud vendors
- k8s load balancer bridges are different for each vendor though the big three have more or less feature parity with one another, just different impls
Just as with OpenShift, you'll start to see trade-offs between vendor convenience, 'lock-in', and cost, and you'll ultimately have to choose what's most important to your business in the end.
I use openshift at work, it’s way way more than k8s with batteries. It has downstream, secured, stable versions of open source projects built into one supported product. I see what you mean though.
Just have a look at the list of AWS services and you'll find over a hundred examples. But when you look deeper, it turns out people get locked in just because they want to, not because they have to.
A good example is ECR. You really don't have to use it to use Docker effectively on AWS, but it's slightly more convenient than spinning up an EC2 instance with a private Docker repo. Also it's well documented, many people use it and so on. So you get hooked up, you write scripts, and when finally you start thinking about switching you realize the sheer amount of work needed to modify all this is just scary. So you say, "I can't afford the downtime" and continue with AWS.
A micro example is that you can't easily scale an RDS/cloud DB's disk size downward, only upward, so if you haven't sorted out archiving beforehand you may be stuck paying more for storage until you have both migrated data away and done the extra legwork to scale back down.
Macro examples:
- Cases where there are incentives to use a provider for a period, e.g. sustained-use or yearly discounts.
- Incentives to use proprietary technology, such as S3 or DynamoDB being cheap.
- Situations where migrations are hard (data), expensive (cold storage), or dangerous/slow to recover from, such as changing DNS.
Another one: IP addresses. If you offer custom domain functionality, it is prohibitively hard to reach out to all of your customers to coordinate "switch your A record from this IP to that one".
At a former company, our development teams addressed this by making their own API gateways to talk to the APIs in AWS and Azure so that the same primary codebase could run in multiple clouds. I do not have specific details, but this is absolutely a concern. I am not aware of a turn-key solution to address it. OpenShift would still need code to talk to the API interfaces of each cloud vendor; I have no idea if this is already baked in at this point. It is a bit of work up front but worthwhile in my opinion.
Monitoring and logging dashboards and the like are almost entirely vendor-specific unless you roll your own via structured logging, and even then how you consume it can be a migration pain point.
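On the roll-your-own structured logging point: one portable approach is to emit structured JSON to stdout and let whichever vendor agent is running scrape it. A minimal sketch with Go's standard log/slog package; the field names are arbitrary.

```go
// Vendor-neutral structured logging: JSON lines on stdout, which CloudWatch,
// Cloud Logging, Loki, etc. can all ingest. Only the dashboards and queries
// built on top end up vendor-specific.
package main

import (
	"log/slog"
	"os"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Info("payment processed",
		slog.String("order_id", "1234"),
		slog.Int("amount_cents", 4200),
	)
}
```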
The most obvious one is exclusive products or features that only exist within a single cloud provider.
Binding your scripts to a non-standard API will complicate any migrations outside of it and involve a lot of work. (example: Migrating away from Azure Resource Manager templates).
AWS outbound data is expensive. That complicates any data migration outside of AWS or communication between machines inside and outside AWS (example: Cheaper machines on another infrastructure provider with a lot of data transfers with AWS machines).
If you want the cloud to be a powerful tool you will use their higher level services. These are incompatible and cannot be reasonably abstracted away.
The simple services like files which you could reasonably abstract away in your code you can also migrate when needed.
If you don't want lock-in you might be better off with a traditional hosting provider. The cloud is only really useful if you go in with a mindset to embrace what it offers.
There are OpenShift components that are not present in native k8s, e.g. the OpenShift router, the OpenShift dashboard, and some management tools. All OpenShift commands use `oc` instead of `kubectl` as well. If you rely heavily on this stuff in build scripts, processes, or running applications, migrating at some later point could be a good amount of engineering work.
We migrated from OCP to EKS, and IIRC, OpenShift router is Kubernetes Ingress? Also, Kubernetes Dashboard [1] is a similar tool to OpenShift dashboard.
It's not too bad, I would imagine, but infra migrations always take extra time to do safely and correctly. If you want to use OpenShift over native k8s, though, I would think about how to avoid relying too heavily on the custom OpenShift stuff in case you need to switch.
Haven't used the data product, so I can't really comment there.
oc and kubectl are interchangeable for create/apply/edit/delete type actions; oc just also has commands for upgrading the cluster and things like that. You can absolutely use kubectl with an OpenShift cluster.
Having worked briefly with OpenStack and some years with AWS, I'd say OpenStack is great for what it does, but it does only a tiny fraction of AWS. For example, one of the more annoying findings was that security controls were utter garbage compared to AWS's policies. That said, if it does enough for you then avoiding lock-in is worth quite a lot in the long run.
The cost of "lock-in" is often less than building a "cloud agnostic" product.
Trying to work around cloud vendor specific nuances can increase the risk of that component/feature failing. The increased risk and longer development time might not be acceptable to management.
So I am contributing work to a product that is solving this exact problem.
Would you mind giving it a look and see if it fits your requirements?
www.nuvolaris.io
The most vexing secondary lock-in effect with Azure comes when you need Microsoft licensing: Microsoft discounts that licensing on Azure and offers only palliative adjustments to smaller resellers of cloud systems bundled with Microsoft licence agreements. I could almost ignore that, using up my quota of enterprise-licensing cynicism, except that margin extraction from competing cloud resellers can't help but affect the level of hardware those resellers purchase for Microsoft customers, who I believe are getting depressed specs as a consequence of this squeeze. That inevitably has a compounding effect on platform renewal schedules and planned performance purchase points, which can only push the package customers get downwards.
I am not convinced the impact is so directly causal, simply because of the relatively small scale of independent clouds selling Microsoft contracts, but I suspect this preferential self-dealing could easily be the motor behind slower upgrade cycles at lower-budget configurations, leading to increasing compression of the options available for Microsoft capacity. Anecdotally, I've found it increasingly difficult to find equivalent instances outside of Azure, which, if it isn't an anti-competitive practice, is certainly a very harsh environment for resellers and has a real effect on customer independence. I surmise that Microsoft probably sees its position in ten years as being a much bigger and more attractive single source by default, like Oracle. If being much more attractive than Oracle is attractive to you, I would like to hear how. At least for Oracle, now that installs are nearly only hard mission-critical F500 full-metal-jacket affairs, I can rationalise the position, because Oracle at size is going to run on Oracle hardware. But Microsoft is lurching down the same tortuous path, redeemed only by the fact that it's almost impossible for competition to follow, taking the whole intensity of x86 competition off the table and, with it, a huge part of the value proposition Redmond ought to be nurturing: passing on the difference between economy of scale plus the advantage of platform-innovation competition, less some reasonable vig. That is abhorrent, not least because a dominant swing-volume customer becoming insensitive to innovation benefits is tremendously bad for the industry ecosystem as a whole. This won't have to carry on for long before I conclude that Microsoft is going to be an ARM vertical within the next ten years.
I was involved in a long term project/service that moved from private bare metal to private vmware to private openstack to private vastly newer vmware to AWS.
You'll probably laugh, but the biggest problem we had was security. Moving from one thing to another is possible, but the "in between" needs to be as secure as normal operation, so you write a LOT more security-group-style access lists than you'd think. If your process involves five middleware servers, there is a large combination of partial moves possible, and each set of access lists requires security review.
The second biggest problem we had with "always be moving" was latency. "Sure, we can move the mysql server today and the front end servers next month." Then insert a surprise 30 ms latency that nobody ever expected, and suddenly round trips go from immeasurable and unimportant to a pretty big deal while parts have been moved. It's also funny watching front-end people who did "SELECT *" and filtered in their frontend (because it's a 10G connection) find that things are a little slower when the db is far away on the internet.
The third biggest problem we had was documentation. "Hey, it's Tuesday, so does anyone know where the 'ohio' database is today?" Anybody can move stuff; the bigger question is whether everyone can find it, LOL. Whatever you use, a wiki or some more elaborate project planner, everyone needs to be able to find where "stuff" currently is. How many people need to know where things are, and what's the labor cost of moving something?
The fourth biggest problem, which has mostly gone away with Docker, is the old-fashioned version bump. "Well, we're moving it anyway, and it would take extra work to either upgrade the old xyz-server from 3.99 to 4.00 or install 3.99 on the new cloud, so what could possibly go wrong?" Turns out a lot, mostly performance-tuning related, although occasionally "ha ha, we deprecated that feature two years ago, so we removed it last week because nobody would notice". So try not to merge an upgrade process with a cloud conversion process if at all possible.
The fifth biggest problem we had was budget. The bean counters liked how the electric bill was about constant and the bare metal servers were a fraction of HVAC and lighting, although you'd have to replace an HDD every couple of YEARS. Suddenly "operations" became bean-countable with the cloud, and different clouds count different beans, so suddenly someone else is driving "how we're going to save money", because overtime and weekend labor is "free" if it's salaried, but god help anyone provably "wasting" 25 cents on the AWS bill, and the best way to get a promotion is to force someone else's team to work Saturday and Sunday unpaid (salaried) to save fifty cents of AWS budget. The lock-in is that your internal procedures will come to depend on some cloud provider's crazy, arbitrary billing ideas; they're not all on the same page. The end users will never accept a truthful explanation along the lines of "sorry, we can't change the font on that webpage because of AWS", which in full technical detail would be an entire page of tech and billing nonsense.
I think really the meta-observation of cloud lockins is the front line salaried employees don't like putting in 30+ hours of unpaid maint window overtime just to lower the cloud bill by $5 so some manager can get a promotion. "Technically possible to move and reduce the bill" doesn't save money if employees quit or essentially rebel.