One reason for the explosive interest in service mesh over the last 24 months that this article glosses over is that it's deeply threatening to a range of existing industries that are now responding.
Most immediately to API gateways (e.g. Apigee, Kong, Mulesoft), which provide similar value to SM (centralized control and auditing of an organization's East-West service traffic) but implemented differently. This is why Kong, Apigee, NGINX etc. are all shipping service mesh implementations now, before their market gets snatched away from them.
Secondly to cloud providers, who hate seeing their customers deploy vendor-agnostic middleware rather than use their proprietary APIs. None of them want to get "Kubernetted" again. Hence Amazon's investment in the very Istio-like "App Mesh" and Microsoft's (who already had "Service Fabric") attempt to do an end run around Istio with the "Service Mesh Interface" spec. Both are part of a strategy to ensure that if you are running a service mesh, the cloud provider doesn't cede control.
Then there's a slew of monitoring vendors who aren't sure whether SM is a threat (by providing a bunch of metrics "for free" out of the box) or an opportunity to expand the footprint of their own tools by hooking into SM rather than requiring folks to deploy their agents everywhere.
Finally there's the multi-billion-dollar Software Defined Networking market, which is seeing a lot of its long-term growth and value being threatened by these open source projects that are solving at Layer 7 (and with much more application context) what they had been solving for at Layers 3-4. VMware NSX already has an SM implementation (NSX-SM) that is built on Istio, and while I have no idea what Nutanix et al. are doing I wouldn't be surprised if they launched something soon.
It will be interesting to see where it all nets out. If Google pulls off the same trick that they did with Kubernetes and creates a genuinely independent project with clean integration points for a wide range of vendors then it could become the open-source Switzerland we need. On the other hand it could just as easily become a vendor-driven tire fire. In a year or so we'll know.
This is a good overview. However, I think the reason we see a lot of service mesh variations is that the core tech, namely Envoy, contains all the "hard" tech (the data plane), while creating a "service mesh" basically comes down to building a management layer on top of it.
Another interesting note is that Google did NOT cede control over Istio to the CNCF.
> Envoy contains all the "hard" tech (the data plane), while creating a "service mesh" basically comes down to building a management layer on top of it.
I'd argue this is backwards. Envoy has a fairly tightly defined boundary with relatively strong guarantees of consistency given by hardware -- each instance is running on a single machine, or in a single pod, with a focus on that machine or pod.
The control plane is dealing with the nightmare of good ol' fashioned distributed consistency, with a dollop of "update the kernel's routing tables quickly but not too quickly" to go with it. It's "simple" insofar as you don't need to be good at lower-level memory efficiency and knowing shortcuts that particular CPUs give you. But that's detail complexity. The control plane faces dynamic complexity.
I was going to say re: "How is this different from an API Gateway?" – this was a lot harder a question for me to get my head around than William suggests, _because_ API Gateway vendors, I'm sure quite intentionally, have been positioning their stuff as service mesh solutions or alternatives, not as service mesh complements.
I don't get the part on API gateways. Gateways are north-south most of the time while a SM is east-west. They can work together perfectly fine and don't even have to be integrated.
Supporting North/South is how they have traditionally been marketed, but not how they are actually used much of the time. Inside enterprise they are often used as an internal "service catalog" and are effectively a shim providing discovery and consistency over a bunch of fairly scoped internal services.
> If Google pulls off the same trick that they did with Kubernetes and creates a genuinely independent project with clean integration points for a wide range of vendors then it could become the open-source Switzerland we need.
Istio is that project, but they'd rather it was Luxembourg than Switzerland.
The cloud vendors haven’t gotten “Kubernetted”. The whole concept of “lock-in” is exaggerated. At a certain scale, you’re always locked in to your infrastructure. The rewards are too low for doing a wholesale switch of vendors, and when you try to stay “cloud agnostic” you spend more time maintaining layers of abstraction and you don’t get many of the benefits of cloud services.
Kubernetes hasn’t changed the landscape as far as cloud provider market share goes. AWS is still the leader, Azure is still big in MS shops, and GCP is an also-ran.
What amuses me about this is that back in the day everyone thought the Mach guys were crazy for thinking things like network routing and IPC services should be implemented in user space... and others mocked the OSI model's 7 layers as overly complex (e.g. RFC 3439's "layering considered harmful").
Now we've moved all our network services onto a layer 7 protocol (HTTP), and we've discovered we need to reinvent layers we skipped over on top of it. We're doing it all in user space with comparatively new and untested application logic, somehow forgetting that this can be done far more efficiently and scalably with established and far more sophisticated networking tools... if only we'd give up on this silly notion that everything must go over HTTP.
You aren't wrong about the trend of solving the same problems over and over at higher points in the stack. But you missed the mark on responding to this article and the Service Mesh pattern, which usually operates at a lower network layer. In fact, if anything, the trend in microservices architecture is away from HTTP and back towards binary protocols over TCP (but, in this case, proxied via authenticated mTLS connections managed by an external service, but which also operate over TCP).
I think this is a bad take, especially saying HTTP is the reason. It's not. The real reason is that we want stateless connections and load balancing, and because of TLS we want to keep our handshakes warm too.
We also want our retries to work at the application layer, so lower layers would at least have to understand that chunks of a connection larger than a packet can fail even if all the packets were acked successfully... _and_ that a completely different machine should receive that chunk instead.
We want all that for free and we don't want to pay for hardware. HTTP/2 is the only thing that comes close to this, and you still need app-layer retries.
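To make that concrete, here's a minimal sketch of an application-layer retry in Go (the `retryGet` helper, endpoint, and backoff values are all hypothetical): each attempt is a complete request/response, so an L7 proxy or mesh sidecar in the path can send the retry to a different backend even though every packet of the failed attempt was acked.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// retryGet retries a GET at the application layer. Each attempt is a whole
// request, so an L7 proxy (or service mesh sidecar) between us and the
// backends can pick a different, healthy instance for the retry.
func retryGet(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // crude linear backoff
	}
	return nil, lastErr
}

func main() {
	resp, err := retryGet("http://orders.internal/health", 3) // hypothetical endpoint
	if err != nil {
		fmt.Println("all attempts failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```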
At my company, we have been migrating all our apps to a Kubernetes + Istio platform over the past couple of months, and my advice is this - don't use a service mesh unless you really, really need to.
We initially chose Istio because it seemed to satisfy all our requirements and more - mTLS, sidecar authz, etc. - but configuring it turned out to be a huge pain. Things like crafting a non-superadmin pod security policy for it, trying to upgrade versions via Helm, and trying to debug authz policies took up a non-trivial amount of time. In the end, we got everything working, but I probably wouldn't recommend it again.
It's funny - I was at KubeCon last week and there was a startup whose value prop was hassle-free Istio, and the Linkerd people stressed that they were less complex than Istio.
I would go as far as to say that I think the vast majority of people don't need a specialized service mesh. We unfortunately started with Linkerd and it actually is the cause of most of our reliability/troubleshooting issues. I don't think lack of complexity is actually a good selling point for it, because it's inherently more complex than not using a service mesh.
Istio may appear more complex, but that's because it has a superior abstraction model and supports greater flexibility. We're beginning to migrate from Linkerd to Istio at this point. I had the same initial frustrations with PodSecurityPolicy (and Linkerd suffers from the same), but istio-cni solves the superuser problem, and I believe even the Istio control plane is now much more locked down in the latest release.
However, if I had my way I would be telling every team they don't need a service mesh. We don't have any particular service large and complex enough to really take advantage of the features it's sold on.
I'm also curious about this (author here btw). The majority of people we see coming to Linkerd today are coming from Istio. They get the service mesh value props, but want Linkerd's simplicity and lower operational overhead. Would love some more details, especially GitHub issues.
It’s a pity “fat clients” are dismissed so quickly. I think that when your tech stack is uniform enough to use them, they can provide much more than service meshes, and do it faster as well. After all, why do “service is down” and “service is sending nonsense” have to be handled via completely different paths?
The main problem with fat clients is that for polyglot architectures (which most large companies that end up building a service mesh evolve into over time) you have to maintain a fat client library for every language. You can get very far leveraging existing tools like gRPC, which codegens fat clients for you, but the quality of tooling is very uneven depending on the language of choice. By pushing all of this into the network layer you skip all of that.
Right, polyglot architectures have no choice, but this text talks about “5 person startups” as well. Surely they can keep the set of languages limited?
Plus, it’s not an either/or situation. Fat clients for Go + Node.js, and a proxy for all the others. That way your core logic can enjoy increased introspection / more speed / higher reliability, while special-purpose services get a proxy which allows interoperability.
As someone who's familiar with the API gateway pattern, is it fair to say this is just another API gateway for internal services? It seems like it is, but it's also described in an extremely convoluted way, with 'control planes' and such.
The service mesh is a bit different from an API gateway -- in its current most popular implementations (Linkerd[0] & Istio[1]), there are basically small programs that run next to each individual instance of the programs you want to run. Linkerd has been around for a while and IMO there weren't that many companies at a scale where they needed it (I didn't see it deployed that often), but it's basically that same concept, just on a more granular level -- if you delegate all your requests to some intermediary, then the intermediaries can deal with the messy logic and tracing so your program doesn't have to.
A better way to describe it is "smart pipes, dumb programs". Imagine that all your circuit-breaking/retry/etc. robustness logic was moved into another process that happened to be running right next to the program actually doing the work.
You can have both an API gateway and a service mesh deployment -- for example Kong's Service Mesh[2] works this way. They're saying stuff like "inject gateway functionality in your service", but that only makes sense if you sent literally every request (whether intra-service or to/from the outside world) through the gateway. Maybe that's how some people used Kong, but I don't think everyone thought of API gateways as a place to send every single request through. You'll have a Kong API gateway at the edge and the Kong proxies (little programs that you send all your requests through) next to every compute workload.
Hmm, is the assumption that, because you're deploying an instance of the mesh as close to the application as possible, you don't need robust logic between the application and the service mesh? I can buy that I suppose.
Yes kind of -- except not in between the application and the service mesh, it's between application and application.
Imagine that for every application there is one small binary that runs and serves all its traffic, like a chauffeur. Your application stops talking to the outside world completely and sends all messages to the small chauffeur binary -- which then talks to other chauffeurs, over the network.
Keeping with the chauffeur analogy, there is a "head office" which calls the chauffeurs on CB radio at regular intervals that lets them know which cars go where and how to start them/etc.
"head office" => "control plane"
"chauffeur" => "side-car proxy"/"data plane"
In the end what this means for your application is that you just make calls to external services (whether your own or others) and since all your communication goes through this other binary, you get monitoring, traffic shaping, enhanced security, and robustness for free.
Another interesting feature is that if the side-car proxy can actually understand your traffic, it can do even more advanced things. For example you can prevent `DELETE`s from being sent to Postgres instances at the network level.
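Here's a minimal sketch of such a chauffeur in Go (the upstream address, the port, and the DELETE policy are made up for illustration; a real sidecar like Envoy gets this configuration pushed down by the control plane rather than hard-coding it). The application only ever talks to 127.0.0.1:15001 and never knows whether the upstream moved, got new certificates, or had retries and policy applied.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The "chauffeur": the application only ever talks to localhost;
	// this process forwards the traffic to the real upstream service.
	upstream, err := url.Parse("http://payments.internal:8080") // hypothetical service
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Every request passes through one place, so metrics, retries, mTLS,
	// and policy checks can be added here without touching the application.
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodDelete {
			http.Error(w, "DELETE blocked by policy", http.StatusForbidden)
			return
		}
		log.Printf("proxying %s %s", r.Method, r.URL.Path)
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe("127.0.0.1:15001", handler))
}
```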
Not really. API gateways are typically single-point-deployment proxies, designed to create a coherent public interface out of a disparate group of services. An API gateway change is almost always going to be a PR for the application team.
Service meshes are proxies deployed alongside all the services, and used as much by devops (and potentially security) as by the application developers.
So a goal of the service mesh is to keep it an API-agnostic appliance. I'm not sure that concept requires all this nomenclature.
It seems like a very thin API Gateway that forwards calls directly to the microservice API without enforcing much would be much easier to manage and have all the same benefits.
What's the purpose of deploying the reverse proxy alongside the microservice?
The idea behind a service mesh is that no matter how you design and implement a microservice architecture, you still have all these little services talking to each other somehow. By "default", they're communicating using native sockets and HTTP or gRPC.
There's lots that can go wrong any time you have a bunch of things talking to each other on a network (see Tanenbaum's "Critique of RPCs" for a classic explanation).
If everything was written in the same language, what you'd probably do is come up with a common networking/RPC library for all your services to use. That library would give you a common interface, so everything used the network the same way, and some measure of observability (maybe just by doing a standard log format), and maybe some security controls like a standard HTTPS connection and a standard token format.
But if you've got multiple languages, that option isn't very attractive anymore, because even if it's feasible to build that library once for each platform, now you're keeping an additional component (the library) in sync between the platforms, and that's a drag.
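As a rough sketch of what that shared library might look like (assuming Go; the bearer-token scheme and target URL are made up), the point is that every service gets the same timeout, the same log line, and the same auth header without thinking about it:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// ServiceClient is a sketch of the "common networking/RPC library" idea:
// every service written in this language uses it, so all outbound calls get
// the same timeout, the same log format, and the same auth header.
type ServiceClient struct {
	http  *http.Client
	token string
}

func NewServiceClient(token string) *ServiceClient {
	return &ServiceClient{
		http:  &http.Client{Timeout: 3 * time.Second},
		token: token,
	}
}

func (c *ServiceClient) Get(url string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+c.token)

	start := time.Now()
	resp, err := c.http.Do(req)
	// One standard log line for every outbound call, so observability is uniform.
	log.Printf("outbound method=GET url=%s duration=%s err=%v", url, time.Since(start), err)
	return resp, err
}

func main() {
	client := NewServiceClient("dev-token") // hypothetical service-to-service token
	if resp, err := client.Get("http://inventory.internal/items"); err == nil {
		resp.Body.Close()
	}
}
```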
The insight behind a service mesh is that containers make it cheap and easy to just stack a tiny out-of-process proxy alongside your services. So you can just move all the logic you would have had in that common service library into the proxy. The proxy will give you the same features no matter whether your service is a Rust binary, Clojure running on the JVM, or a shell script.
And, because it's an out-of-process standalone component, it can be built and maintained by a third party, which means the features it provides can be a lot more ambitious than you'd build in your own library. So now instead of just hoping to get some TLS and maybe a standard token, everything can be mTLS with client certificates, a standard dashboard for managing certificates, rule systems for what client certs will allow you to talk to which systems, graphical maps of who's communicating with who and trace collection for specific pairs of services, etc.
This is all very unlike what an API gateway does, and the whole concept sort of revolves around having little reverse proxies running alongside the services.
I had trouble finding this under the given title. For those searching, I believe tptacek is referring to Tanenbaum & Renesse, A Critique of the Remote Procedure Call Paradigm [0], published in 1987 or 1988.
A fun read, especially since "it's just the same as local" is a myth each generation gets to revisit. Compare to Waldo, Wyant, Wollrath & Kendall's A Note on Distributed Computing[1], published in 1994.
I think I get it. It's underpinned by the assumption that the application doesn't need robust logic around connecting to a local proxy like it would connecting to a remote one. In this sense it's going back to the days when instances had a local HAProxy running. That assumption didn't really pan out and we all decided service LBs were better, but OK, sure. We can have both.
I still think describing the concept more plainly would help a lot of the confusion.
I think this is right, yes. We're not worried about the connectivity between the service and the local proxy because they're (always) cotenants of a container and using localhost to communicate.
A more general way to think about service meshes is that the network layer we code to right now is actually really primitive; its service model was fixed in the 1980s, and its programming interface hasn't evolved much from the early 1990s. We'd be happier if we could level up the whole network, so that it had QoS controls, a really expressive security model that didn't rely on magic-number ports and address ranges, and observation capabilities that communicated application-layer details and didn't just try to approximate them the way flow logs do. You can get all that stuff, internally at least, by putting all your services on the same service mesh.
Another thing to look at is Slack's Nebula, which was just released last week:
Nebula is a service mesh that runs at the IP layer (where Istio and Linkerd ride on top of HTTPS proxies, Nebula rides on top of a somewhat Wireguard-ish VPN). Slack has been using it internally for 2 years now. It's solving the same problems Linkerd is, but with a radically different implementation. You can get your laptop connected to a Nebula service mesh in ways that would be clunky to do with a Linkerd mesh.
I would say it's even a bit more pessimistic than that: it's the assumption that the application _can't be relied on_ to provide robust logic around connecting to a remote service. You can address that problem in a library or in an intermediary service of some sort, but in an organization with a heterogeneous collection of languages, versions, and stacks, the library solution becomes expensive.
I'm not sure what you mean by "instances had a local HAProxy running" but if you're thinking about a bunch of reverse proxies handling incoming requests, be aware that service meshes are handling both inbound _and_ outbound traffic to/from your service. For example, you might use a reverse proxy in front of your instance to terminate TLS, but you cannot implement something like two-way TLS authentication between your services unless you're putting something in at the client side as well.
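For a sense of what "putting something in at the client side" means without a mesh, here's a rough Go sketch of a client doing two-way TLS (the file names and upstream URL are placeholders). A sidecar proxy takes exactly this burden off the application, which then just speaks plain HTTP to localhost while the proxy holds the certs.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

// Client-side half of two-way TLS: present our own certificate and verify
// the server against a private CA. Without a sidecar, every service in every
// language needs code like this.
func main() {
	clientCert, err := tls.LoadX509KeyPair("client.crt", "client.key") // placeholder paths
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{clientCert},
				RootCAs:      caPool,
			},
		},
	}
	resp, err := client.Get("https://billing.internal:8443/health") // hypothetical service
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```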
It's an RPC plugin but doing it out-of-process is weird (I need an RPC client to talk to my RPC client?). If you don't need a trust boundary or separate ulimits, loopback network traffic and context switches and reserialization are a really expensive way to mitigate a language having poor FFI plugin support.
If it were in-kernel and supported scatter-gather and zero-copy, that might be different (though some people even avoid going through the kernel).
The "RPC" that the proxy in Linkerd is doing is radically different and more expressive than the "RPC" that is running between the Linkerd proxy and your application, which is the whole point of the architecture.
A service mesh really is about connectivity among all the services you are deploying in your architecture. Some of these services are going to be databases, some of them will be APIs, some of them will be caches, and so on.
Once you have this underlying grid of services all connecting via a service mesh - in a reliable, secure and observable way - for some of these services you want to define governance and on-boarding rules, or you want to expose them externally as products that developers can consume, and then you use an API GW to do those things.
In this regard, an API GW is just a service among the services you are running inside the service mesh.
These get mixed up a lot, but wrongly, I think. They can be related, but not necessarily. The API Gateway pattern is at the application layer and builds on top of the service mesh, which is at the transport layer. Think of the API Gateway as HTTP and the service mesh as TCP/IP, just with more flexible hooks for authentication, observability, and reliability.
Every part of a service mesh could be baked into operating systems so that all this extra technology was just there by default. This would put a fair amount of start-ups out of business, but it would also mean a lot less people having to be hired to set up and maintain all this stuff. Devs could just... develop software, with a clear view into how their apps run at scale. And Ops wouldn't have to custom-integrate 100 different services.
This is really the future of distributed parallel computing, but we're still just bolting it on rather than baking it in.
I'm evaluating AWS App Mesh at the moment. We're a really small team, so we're choosing Fargate over Kubernetes - mainly because we don't have need of nodes, nor do we want to deal with them.
The appeal of App Mesh for us was initially around using it to facilitate canary deployments. AWS Code Deploy does a nice job with Blue / Green deployments and that may suffice for us, but it doesn't support canary for Fargate. Is that enough reason to add the additional complexity in our stack? Not sure, looking for input.
Also, much of the documentation is focused on K8s. I'm murky on how to implement an internal namespace for routing. Most of what I've seen is like myenv.myservice.svc.cluster.local, but it's not clear to me that using that pattern is needed in the context of Fargate.
Consistent observability is valuable, but again Fargate can do that pretty well- it just doesn't mandate access logging so that would be left to the app itself.
We want to implement OIDC on the edge for some services, but App Mesh doesn't support that yet as other meshes like Ambassador, Gloo, and Istio seem to. Since App Mesh doesn't really act as a front-proxy on AWS, we'll still be using ALB to handle auth which is fine, I think. I get mixed messages about the need for JWT validation, but if so, that would need to be implemented in the app level with ALB fronting it.
Can anybody help me find resources to sort this out? I've been through the `color-teller` example time and time again, but it still leaves lots of open questions about how to structure a larger project and handle deployments effectively.
> The appeal of App Mesh for us was initially around using it to facilitate canary deployments. AWS Code Deploy does a nice job with Blue / Green deployments and that may suffice for us, but it doesn't support canary for Fargate. Is that enough reason to add the additional complexity in our stack? Not sure, looking for input.
Maybe you should write a script for this? It sounds like you're about to take on a lot of complexity for just the ability to do canary deployments when you could probably hack up a script in a day or two.
> We want to implement OIDC on the edge for some services, but App Mesh doesn't support that yet as other meshes like Ambassador, Gloo, and Istio seem to. Since App Mesh doesn't really act as a front-proxy on AWS, we'll still be using ALB to handle auth which is fine, I think. I get mixed messages about the need for JWT validation, but if so, that would need to be implemented in the app level with ALB fronting it.
JWTs are only required for client-side identity tokens (you can use opaque ids and other kinds of stuff for backends) -- it seems like you're also at the same time looking for something to take authentication off your hands? App Mesh doesn't do that AFAIK, it's only the service<->service communication that it's trying to solve.
I think it might be a good idea to make a concise list of what you're trying to accomplish here; it seems kind of all over the place. From what I can tell it's:
- Ability to do Canary deployments
- The ability to shape traffic to services (?)
- Observability, with access logging
- AuthN via OIDC at the edge
A lot of meshes do the above list of things, but the question of whether it's worth adopting one just to get the pieces you don't have already (which is only #2 really, assuming you scripted up #1), is a harder question.
> Namespaces: In order to identify the versions of services for routing, you need independent virtual nodes and routes in a virtual router. You can reuse the DNS names or use cloudmap names with metadata to identify the versions/virtual nodes.
> OIDC at ingress - App Mesh does not do this yet, ALB / API Gateway is needed for this. App Mesh has this on the roadmap.
> Resources - You can reach the App Mesh team with specific questions on the App Mesh roadmap GitHub and we can help
> Sure, it only worked for JVM languages, and it had a programming model that you had to build your whole app around, but the operational features it provided were almost exactly those of the service mesh.
The thing is, all of our microservices communicate with each other using Kafka. Envoy has an issue open for Kafka protocol support [1], but it's a fundamentally difficult issue because adopting Kafka forces you to build out "fat client" code and building a network intercept that can work with pre-existing Kafka client code is non-trivial. On observability, Kafka produces its own metrics.
Granted, Kafka doesn't offer the same level of control. But Kafka does offer incredible request durability guarantees. We don't have "outages" - we have increased processing latency, and Istio/Envoy and other service meshes can't offer that because they do not replicate and persist network requests to disk.
Opinionated read, but interesting. That being said, Linkerd wasn't the first service mesh — SmartStack predates it by three years. [1] Although they didn't use the (then-nonexistent) "service mesh" term at the time, it pioneered the concept of userspace TCP proxies configured by a control plane management daemon. I doubt the Linkerd folks are unaware of it, so it was a surprising omission.
It's not as much about load as it is about complexity; it starts to make sense when you hit some threshold number of internal services, regardless of the amount of traffic you're doing. You use a service mesh to factor out network policy and observability from your services into a common layer.
This is more a religious question than a technical one. I tend to build monoliths. Some of our clients build microservices; some of them decompose into just a small number of services, and about half of them have monolithic API servers.
But if you're going to do the fine-grained microservice thing, the service mesh concept makes sense. You might choose not to use it, the same way I choose not to use gRPC, but like, it's clear why people like it.
* Codebase should be defined as 'the platform', where one team will most likely never look at the code of another team's microservices.
* These communication problems and this overhead start the moment you go from 2 to 3 or more teams.
* The term 'team' in this context should be interpreted very broadly. One dev working alone on a microservice should be considered "a team".
Also, things mentioned in the article: you don't want to implement TLS, circuit breakers, retries, ... in every single microservice. Keep them as simple as possible. Adding stuff like that creates bloat very quickly.
While nobody ever seems to want to hear it, the vast majority of companies utilizing service meshes and k8s are wasting huge amounts of time and money on things they don’t need.
Unfortunately these technologies are at peak hype, so everyone seems to be implementing them for their small-to-medium CRUD apps. But people get very sensitive if you try to point it out.
This is quite interesting. I used to work in more devopsy kind of roles but at the current gig it has been almost entirely removed from my purview. It's impressive to step away for a few years and return to see so many changes, but the article laid out the concepts in an easy to understand manner.
If one were to implement a service mesh of microservices wouldn’t the services need to be versioned similar to how the packages used by a microservice are version-pinned?
Sort of, but only for major versions, and it's preferable to bake that sort of thing into the API itself. The API exposed by a microservice should only ever be updated in backwards compatible ways unless you can verify that you have no callers, which is hard. New functionality should be introduced using backwards compatible constructs like adding fields to JSON or protobuf. Breaking changes go in a new API. This is easily managed conceptually by having the microservice expose version information as part of the API. A FooService might define "v1/DestroyFoo" and "v2/DestroyFoo" with different calling contracts. Perhaps v1 was eventually consistent and returns a completion token that can be used with a separate "v1/CheckFooDeletionStatus", but now with v2 the behavior has been made strongly consistent and there is no "v2/CheckFooDeletionStatus". The v2 of the API can thus be thought of really as a separate API that happens to be exposed by the same microservice, and pre-existing callers can continue to call the (perhaps now inefficient) v1 API.
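A toy sketch of that scheme (handlers, payloads, and the completion-token shape are all illustrative): both versions are exposed by the same service, so existing v1 callers keep working while new callers use the strongly consistent v2.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// v1 is eventually consistent: kick off async deletion and hand back a
	// token the caller can poll with v1/CheckFooDeletionStatus.
	mux.HandleFunc("/v1/DestroyFoo", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]string{"completionToken": "token-123"})
	})

	mux.HandleFunc("/v1/CheckFooDeletionStatus", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]string{"status": "IN_PROGRESS"})
	})

	// v2 is strongly consistent: the foo is gone by the time we respond,
	// so there is no v2/CheckFooDeletionStatus.
	mux.HandleFunc("/v2/DestroyFoo", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]string{"status": "DESTROYED"})
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```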
Good article - I must admit I’m vaguely familiar with the concept and this read certainly gave me some new insights.
One meta callout on the writing - I read and scrolled at least 30% of the way through the page on my iPhone before the author explained why I should care about a service mesh, i.e. what problems it tries to simplify or solve.
It seems to me there are some strong use cases here, but it’s only worth your while if you’re operating at sufficient scale.
For instance, if my team at some FAANG-scale company is responsible for vending the library that provides TLS or log rotation or <insert cross-cutting/common use case here>, and it requires some non-trivial onboarding and operational cost, migrating to this kind of architecture longer term, where these concerns are handled out of the box, may be beneficial.
Still - it doesn’t mean the service owners are off the hook. They still need to tune their retry logic, or confirm the proxy is configured to call the correct endpoints (let’s say my service is a client of another service B, and for us, B has a dedicated fleet because of our traffic patterns). This is an abstraction. Abstractions have cost.
Trust but verify.
The trap people fall into is, “Here’s a new technology or concept. Let’s all flock to it without considering the costs.”