Struggling with Network Costs? Discover How Asymmetric Routing Can Save You!
Hi. I'm Max Clark. As a network engineer, one of the gremlins that we constantly chase is asymmetric routing. It does all sorts of stuff to you. It makes it really hard to troubleshoot, and it just turns into a nightmare really quickly.
Speaker 1:And what I mean by asymmetric routing is that the path one way is different from the path on the return. Right? So say you have a network between Los Angeles and New York, and traffic going from LA to New York takes a Southern route. Right? So LA, San Diego, Phoenix, Dallas, Georgia, and then up the seaboard to New York.
Speaker 1:And on the return path, it takes a Northern route. So, you know, New York, Pennsylvania, Chicago, Denver, San Francisco, back down. Let's say you're doing that. When you get into these scenarios, you get into what we would call just strange things happening, or things just don't work right, or performance is just weird. It's just a nightmare.
Speaker 1:It's an absolute nightmare to troubleshoot. So you spend a lot of your time trying to prevent asymmetric routing, because you want predictability in the path on the egress and the ingress. That way you know what's going on. You know what network interfaces you're traversing. You can identify what routers you're on.
Speaker 1:You can control it, and it just makes everything much better to troubleshoot. You can identify problems much faster, and you can route around them much faster. About 20 years ago, I learned that asymmetric routing can also be your friend. When you know the rule, you know how to break it and when you can break it. And the first time that I really understood when and how to break the rule with asymmetric routing was with load balancers.
Speaker 1:And when the first load balancers came out, the first on the market were really proxy servers that could do traffic distribution, and then you had load balancers that were using routing to make traffic decisions. The load balancer still has to rewrite the source IP address information when it passes traffic to an application server. But just think of it from top to bottom: traffic would come in and hit the first box, the load balancer. The load balancer would say, okay, the next box it needs to go to is here, and it could select which server to send traffic to. That server would then see the source, and its default gateway is the load balancer, so it knows how to pass traffic back. Right? So you could have a route-based load balancer, or you could do a source NAT so the server would know which device to send traffic back to, and then that load balancer could send the traffic back. Right?
Speaker 1:So you're either doing a source NAT, or you're doing some sort of layer 3 routing with a default gateway on the server, so that traffic flowed back and forth. What was nice about this setup is you knew exactly how much traffic was flowing through the load balancer, and the load balancer could keep really close track of session capacity, application performance, and traffic flowing back from the server to the client. Now, of course, the disadvantage of this is that everything flows through the load balancer. And that means that as you size up and scale up the traffic to your servers, your load balancers become really critical points of failure in the path, and they also become very expensive devices, because they have to support all of the traffic that flows through them. So about 20 years ago, the first ones I was aware of that addressed this was Foundry Networks. They came out with their ServerIron product, and the ServerIron had the ability to do direct server return. We also refer to this as one-armed load balancing, or you hear things like a three-legged configuration on your load balancer. The technique and the terminology with the ServerIron was direct server return. The way this was implemented was: router device up here, virtual IP on the load balancer. That virtual IP became the IP address that you were publishing to the Internet, or that was published through your firewall. So traffic would come in and hit the load balancer, and the load balancer had a pool of application servers behind it.
Speaker 1:And the load balancer would say, okay, whatever the load balancing metric you were using, round robin or, most commonly, least connections. Right? You wanted to send traffic to the server that was being used the least. So the load balancer would say, okay.
Speaker 1:We would have, you know, server 3. Send traffic to server 3. Now, the tweak that direct server return required was that the server was now seeing traffic destined to the virtual IP on the load balancer, and the server wouldn't know what to do with it. So you would bind an IP address to the loopback interface of the server, and that address was the virtual IP. That way, when the load balancer redirected traffic to the server, the server would see it destined to itself, and the loopback interface would be able to respond to it.
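To make the loopback trick concrete, here's a minimal sketch of what that setup typically looks like on a Linux real server. The VIP address is hypothetical; the two sysctls are there so the server never answers ARP for the VIP, since the load balancer has to remain the VIP's owner on the shared segment.

```python
# Hypothetical virtual IP; in practice it's whatever the load balancer publishes.
VIP = "203.0.113.10"

def dsr_real_server_commands(vip: str) -> list[str]:
    """Commands a Linux real server typically needs for direct server return:
    bind the VIP to loopback so the box accepts packets destined to it, and
    tighten ARP behavior so it never advertises ownership of the VIP."""
    return [
        f"ip addr add {vip}/32 dev lo",
        "sysctl -w net.ipv4.conf.all.arp_ignore=1",
        "sysctl -w net.ipv4.conf.all.arp_announce=2",
    ]

for cmd in dsr_real_server_commands(VIP):
    print(cmd)
```

This is a sketch of the general technique, not the ServerIron-specific procedure; the exact knobs depend on the server OS.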
Speaker 1:And now here was the fun thing: the source IP address wasn't the load balancer, the source IP address was the client. So the server wouldn't respond back to the load balancer; the server goes straight back out. So now you have this triangle of traffic flowing. The disadvantage, of course, is that the load balancer doesn't see traffic flowing back from the server to the Internet. It only sees one way. So it could not tell if there was an application problem on the server, or if the server was responding slowly. You'd have to implement health checks in different ways to try to keep track of all this.
Speaker 1:But the positive of all of it was that the load balancer saw far less traffic and needed far fewer resources. You know, an average content-heavy web platform was serving maybe 10% ingress, 90% egress. So for every one megabit of traffic flowing in, roughly ten megabits were flowing out. Now, if you think about that in terms of sizing and scaling your load balancer: if only 10% of your overall traffic needs to flow through the load balancer, your load balancers can be much smaller and they can scale much more efficiently. And then we could do other fun things with them, because they were only seeing one side of the traffic.
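As a back-of-the-envelope sketch of why that ratio matters for sizing, assuming the 10% ingress / 90% egress split from the example:

```python
def lb_capacity_needed_mbps(total_mbps: float, ingress_share: float = 0.10) -> dict:
    """Compare the load balancer capacity needed for a proxy/routed design,
    which carries both directions, versus direct server return, which only
    carries the ingress leg of each connection."""
    return {
        "proxy_or_routed": total_mbps,           # LB sees ingress + egress
        "dsr": total_mbps * ingress_share,       # LB sees ingress only
    }

sizes = lb_capacity_needed_mbps(1000)  # 1 Gbps of total client traffic
print(sizes)
```

With a gigabit of total traffic, a DSR load balancer only has to carry about 100 Mbps, which is exactly why the cheap boxes scale.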
Speaker 1:We could load balance the load balancers. On a network level, you could do this with ECMP, which is just on the router itself. You can create multiple routes within the router, or the switch, or whatever your gateway device is, with equal cost, pointing to different endpoints. And depending on the device you were using, maybe you could do it 8 ways. It was really crude load balancing.
Speaker 1:But in a pinch, it worked great, because now you could have 2 cheap load balancers distributing traffic across pools of servers. Or you could have 3 cheap load balancers, and you could have redundancy and failover between them. This is one of those areas where, when you started doing it, if you looked at your infrastructure cost, you'd say, okay. Otherwise you had to go out and buy a really big load balancer. Right?
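A rough sketch of the ECMP idea in code; the next-hop addresses are hypothetical, and real routers hash the flow tuple in hardware, but the principle is the same: each flow hashes consistently to one of N equal-cost paths.

```python
import zlib

# Hypothetical addresses of the cheap load balancers behind equal-cost routes.
NEXT_HOPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def ecmp_next_hop(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                  next_hops: list[str]) -> str:
    """Crude per-flow ECMP: hash the flow tuple and pick one equal-cost
    next hop, so packets of the same flow always take the same path."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

hop = ecmp_next_hop("198.51.100.7", 51000, "203.0.113.10", 443, NEXT_HOPS)
print(hop)
```

The hash keeps a given client connection pinned to one load balancer, which is what makes this "crude" balancing workable even though nothing tracks sessions.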
Speaker 1:So we're the big ones. Right? F 5 big IPs were like the all the rage. Or you would have, you know, that was back what was popular back. Then I have to have to, like, you know, think through the cobweb, Sarah.
Speaker 1:So it was big IPs was the Altion 83 product became the 80 fours. Those are really really popular. This is right around the time that, the NetScalers were becoming popular. Phenomenal boxes. I mean, you could do stuff with a NetScaler that was just I mean, you could just you could manipulate a lot of network traffic.
Speaker 1:But now, you know, let's just talk about it from a cost standpoint. You know, if you were spending a $100,000 on a NetScaler, you would spend maybe $5,000 on a device that could do DSR for you. From a overall efficiency of being able to build out that infrastructure and scale it infrastructure, how you allocate and spend money starts to matter significantly at scale. Right? So so, you know, you could say it on the low end.
Speaker 1:You know, you've got a relatively small amount of infrastructure. You know, and that load balancer might represent, you know, 25% of your total budget for that pop. Or maybe you're on the other side where you're trying to scale it up and you start looking at replicating those load balancers that gets really expensive as well. This also creates a situation where, you know, projects get deferred because the cost of the project just gets too expensive. DSR 20 years ago was my entry point into understanding asymmetric routing and how to leverage it and take advantage of it and do fun things with it.
Speaker 1:Right? So first example is is load balancing. 2 other examples, that were really seminal moments for me in in in my career in network design and and really tweaking the network to take advantage and do fun things and and create a lot of operating leverage or leverage for the business. Second example share was 10, 15 years ago, and I had a customer who was a anti piracy organization. And so if you it's actually probably more than that, trying to think.
Speaker 1:Yeah. It's about right. 10, 10, 15 years. And so if you were trying to prevent people from, downloading and pirate and content on bit torrent, for instance, there there or if you were trying to find people, you know, doing these sorts of things, You had a a bunch of challenges that you had to overcome. And one of those challenges that you had to overcome was IP addresses.
Speaker 1:How did you actually acquire and simulate enough IP addresses that look like residential consumer IP addresses that way the anti anti piracy measures couldn't catch you and find you and what we did with this customer was we convinced them to go out subscribe to and sign up for dial up modems and take advantage of different dial up ISPs to get residential allocated IP addresses. Now the problem, of course, with this is as soon as you're on dial up, now you're talking about an application that first has to get bound to a server but also is limited. I mean, at the time, you know, dial up is 56 k. So trying to be effective on the Internet on a 56 k dial up modem isn't really effective. Well, again, here's the tweak.
Speaker 1:If you look at it from a network standpoint, an application, what the application was doing really small amount of ingress and a lot of egress. Sound familiar? Just like a it's just just like the example of a content website. Right? So what we did for them was we split the traffic from the inbound and the outbound, the ingress and the egress.
Speaker 1:And, it was fun. This customer went out and we got a d s 3. So a lot of dial up connections. So d s 3 is 648 channels. Oof.
Speaker 1:It's been a long time since I've done that math. I don't even know. I'll 66 100 and something channels. Let's just call it. So per d s 3, 600 something channels, it's 600 IP addresses per per d s 3.
Speaker 1:It's more than that. It'll it'll dawn on me. Comment to the notes. Tell me I'm wrong because, of course, I don't remember it offhand. It's been a long time since I've done d s threes.
Speaker 1:So the d s 3 could on their application servers, they could, send an instruction to the concentrator, have the concentrator file up a dial up, connect to the dial up ISP, acquire an IP address. So now inbound IP address getting bound to the application server. That application server then had and would overwrite its default gateway. So traffic dial up into server, server out to the Internet. In this case, the out to the Internet was multiple 10 gig paths with a lot of capacity.
Speaker 1:So their egress traffic and where the bulk of their content traffic was flying all flew out, you know, via these these high capacity 10 gig links. And it worked wonderfully. It was actually was awesome to watch. The modern version of this that we're doing is related to AWS and managing and attacking these really high cost centers within AWS for some customers. And if you look at if you look at your AWS bill, chances are you're spending a concentrated amount of money around compute.
Speaker 1:I mean, that's pretty obvious. It's compute, it's storage, it's managed services, and it's it's egress. It's data. Now we can get into an argument around compute and managed services. And, you know, should you be on Dynamo versus Sickla or Cassandra, or should you be running Kinesis or Kafka?
Speaker 1:Or, you know, if we just look at it from a standpoint of a steady state application. So, everybody wants to believe that they need infinite elasticity up and contraction down. The reality is your application is probably always running at basically the same amount of resources day over day, hour over hour, month over month, week over week, whatever. And once you start looking at that and you say, okay, your application is predictable in terms of consumption load, how much compute does it need? And if you have a application that has a a a predictable amount of data traffic coming in and out, or if you're heavy egress, these become scenarios that AWS becomes very expensive.
Speaker 1:Hey. You know okay. Just just to make this point in math. If you're serving, 10 petabyte to the Internet and you're doing it on list prices with AWS, it's costing you a half a $1,000,000 a month. If you're getting a PPA from AWS at 2.5¢ per gig or half price of what their public, you know, public published pricing is, you're spending $250,000 a month.
Speaker 1:If you take that and go to a bare metal infrastructure provider and do a flat rate network circuit, you're spending $2,000 a month. Half a1000000 a month, 250,000 a month on a PPA. So committed contract, embedded it with an EDP with Amazon, $2,000 a month. Right? Yes.
Speaker 1:There's a lot of other factors that have to apply that have to work out perfectly in order for you to be able to do this. But, you know, we're talking about saving 2 and a half $1,000,000 a year just just out of thin air. Right? Making this change. Just just changing how your network flows, how your data flows out of your environment.
Speaker 1:Going through this exercise and talking about shifting compute out of e c 2 or EKS into a bare metal infrastructure, challenges became perception. You know, can we do this? How do we run it? What happens if the data center goes down, if the bare metal goes down, if we need to scale it? You know, we're used to our our tools and our flow, you know, yada yada yada.
Speaker 1:And so there became a really simple, you know, solve for this. And again, you know, here's here's the the modern version of asymmetric routing. And the solve for it was to take and put bare metal into data centers in region with the AWS infrastructure. So, you know, if you're in US East 1, right, that means you're in the the Washington DC region AKA Ashburn, Virginia or Herndon, Virginia, and go out and get data center in that market. And, you know, you overlay this based on where you are.
Speaker 1:Right? You know, it can be it can be Ohio. Right? So you can Chicago. You can be in the West Coast.
Speaker 1:Right? You can be in London. You can be in Frankfurt. You can be in Amsterdam. You can be in Singapore.
Speaker 1:You can be in Hong Kong. I mean, you know, figure out what region is and where you need to be and where your infrastructure is and and overlay that with bare metal. And I'm using bare metal as a term because you don't have to go out and collocate and buy gear. You can go out to bare metal. You'll save more money if you colocate and you buy your own infrastructure, but it doesn't matter because you can get such a cost savings off of bare metal and you can leverage it anyways that you can you can just do it bare metal.
Speaker 1:So I'm just gonna I'm just gonna use the bare metal terminology. And again, we just take advantage of a tweak on how we serve traffic. Now, again, assuming that you're heavy egress and light ingress well, actually, it doesn't even matter if you're light ingress because guess what? Again, ingress to AWS is free, but it's a measure of your Direct Connect cost. Right?
Speaker 1:So this works best if you can be light ingress versus a heavy, you know, ratio light ingress to heavy egress. And the way this works is, you know, here's your here's your steps. Right? So Internet traffic to AWS could be WAF, could just be straight to an ALB, to direct connect, to bare metal back out to the Internet. Fun thing about doing this is you can use your tooling that you have in AWS.
Speaker 1:You can use Terraform. You can use CloudFormation, to manage your bare metal infrastructure at the same time. We have customers who have taken and extended their ALBs into their bare metal nodes. So they're running Kubernetes on top of bare metal, and the Kubernetes containers actually are getting their load balancing from AWS. There's lots of ways of of doing this tweak and looking at it and looking at it for your infrastructure.
Speaker 1:Now what we're seeing in terms of average cost reduction on bare metal over AWS is about 72%. So if you're spending a $100,000 a month on AWS, you're gonna spend $28,000 a month in bare metal. Spend $1,000,000 a month on AWS, you're gonna spend right? Do the math. $280,000 a month in bare metal.
Speaker 1:The reason why this is important is well, I mean, jeez. Let's talk about it. You're saving a metric boat ton of money here. And that's differential means a bunch of different things. Right?
Speaker 1:We start talking about what is your cost for revenue to run your cloud. If you're looking at it from a standpoint of saying, hey. We're spending 15% of, you know, cost of revenue on AWS. So we're making this much, and we're spending 15% of on AWS. You don't have to be a financer to understand that if you can cut that cost in half, it makes a big difference for your business.
Speaker 1:Makes a huge difference for you in terms of what you price or service or your product to your customers are, what your competitive advantages, what your efficiency is, you know, what your margin is. You wanna start talking about EBITDA, you know, if you're trying to do a transaction, you wanna start talking about gross margin efficiency. You know, if you're looking for VC, if you're trying to take investors, you gotta go through any metric you can imagine. Having a more efficient cost cost model and delivery of service is just better for the business. You will grow faster.
Speaker 1:You will be able to hire more people. You will be able to go out and be more competitive. You'll be able to dominate in your space. Any phrasing that you wanna say. My job here is to help you help you get to that goal.
Speaker 1:Right? I'm gonna get completely sidetracked on this rant. But it's a very simplistic tweak of just taking advantage of the fact that in some cases, it's good to have asymmetric routing because then you can do some fun things with them. And, you know, again, I learned this originally 20 years ago with direct server return and load balancers inside of a data center. We took advantage of this in a big way with a customer.
Speaker 1:I mean, they had thousands of servers, and they did this at a very large scale for anti piracy. And, of course, now the modern version of it is just taking advantage of bare metal and AWS and not doing wholesale migrations. I mean, this isn't about you know, I'm not an all or nothing guy. Right? Like, there's there's lots of middle ground here.
Speaker 1:Maybe the middle ground for you is keeping your AWS environment and keeping your Dynamo and keeping your Kinesis and keeping all these different things you wanna keep inside of AWS because it is a big engineering lift and shift. But being able to move a segment of your compute and your data egress out of AWS into something that you have a much better cost model on, it's gonna give you a pretty big advantage. Those are the kind of things that become really fun because when we can, you know, I mean, again, right, you know, if we can cut significant percentage points out of an operating budget for a business, that business can use those that money to do other things with. And that those things usually are around growth, and I love to see companies grow. It is a miserable conversation to help companies contract.
Speaker 1:I love helping companies grow. I'm x Clark. This is just a quick note on asymmetric routing and how you can take advantage of it, and, hopefully, this helps you.