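In the previous post I got bitten by an MTU problem. It was solved in the end, but one question was left unverified: when MSS negotiation goes wrong, why does `curl https://x.com` work while `curl https://accounts.google.com` does not?
All experiments in this post were run in virtual machines, so I did not hide the IP addresses and pasted the tcpdump output as is. Before reading on, it helps to read a good introduction to MTU first: 有关 MTU 和 MSS 的一切 (Everything About MTU and MSS, on this blog).
The guess in the previous post was that these sites implement PMTUD, which is fairly easy to verify.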
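PMTUD test
During the TCP handshake, each side advertises an MSS derived from its local network interface. If the NIC's MTU is 1500, the MSS is 1460; if the MTU is 1450, the MSS is 1410. Neither endpoint knows anything about the MTU of the devices in between, and what those devices can forward may well be smaller than either end (especially with a VPN or a tunnel on the path). PMTUD (Path MTU Discovery) handles this case. The idea is simple: when packets are lost, the sender tries smaller packets and checks whether they get ACKed. If they do, the path MTU is smaller than assumed, and the sender continues with smaller packets. Strictly speaking, classical PMTUD (RFC 1191) relies on ICMP feedback, and the probe-with-smaller-packets approach described here is the packetization-layer variant (PLPMTUD, RFC 4821).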
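The MTU-to-MSS arithmetic above can be written down as a small sketch. It assumes IPv4 and TCP headers of 20 bytes each with no IP options; the helper names and the 12-byte figure for the TCP timestamp option (a 10-byte option padded with two NOPs, as seen in the `nop,nop,TS` options of the captures in this post) are illustrative assumptions.

```python
IPV4_HEADER = 20           # bytes, assuming no IP options
TCP_HEADER = 20            # bytes, assuming no TCP options
TCP_TIMESTAMP_OPTION = 12  # bytes: 10-byte timestamp option plus two NOPs


def mss_from_mtu(mtu: int) -> int:
    """MSS advertised in the SYN for a given interface MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def payload_per_segment(mtu: int, tcp_options: int = TCP_TIMESTAMP_OPTION) -> int:
    """Payload bytes that fit in one segment once TCP options are counted."""
    return mss_from_mtu(mtu) - tcp_options


if __name__ == "__main__":
    for mtu in (1500, 1450, 800):
        print(mtu, mss_from_mtu(mtu), payload_per_segment(mtu))
```

With timestamps enabled, an MTU of 1500 leaves 1448 bytes of payload per segment and an MTU of 800 leaves 748, which are the segment lengths tcpdump reports later in the post.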
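Verification is simple. We build an environment whose maximum acceptable MTU is assumed to be 800: anything over 800 bytes is dropped outright, and no ICMP message is sent back.
We use iptables to DROP packets over 800 bytes. In a lab environment I like to log the drops as well.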
    iptables -I INPUT -p tcp --match multiport --sports 80,443 -m length --length 801:9900 -j LOG --log-prefix "pmtud-should-drop:"
    iptables -A INPUT -p tcp --match multiport --sports 80,443 -m length --length 801:9900 -j DROP
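Next, turn off Generic Receive Offload (and the other offloads too, which makes the capture easier to read). Otherwise, even if the peer sends small packets, the NIC merges them into large ones, which iptables then drops.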
    ethtool -K eth0 tso off
    ethtool -K eth0 gso off
    ethtool -K eth0 gro off
Finally, start tcpdump and send the request with `curl -v https://accounts.google.com`. The capture looks like this:
    $ tcpdump -n -i eth0 src port 443
    tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
    listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
    10:53:15.411226 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [S.], seq 2021053715, ack 2859282114, win 65535, options [mss 1412,sackOK,TS val 1211142548 ecr 883673807,nop,wscale 8], length 0
    10:53:15.415006 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], ack 518, win 261, options [nop,nop,TS val 1211142552 ecr 883673811], length 0
    10:53:15.415555 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211142553 ecr 883673811], length 1400
    10:53:15.415591 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [P.], seq 1401:6822, ack 518, win 261, options [nop,nop,TS val 1211142553 ecr 883673811], length 5421
    10:53:15.422637 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [P.], seq 5601:6822, ack 518, win 261, options [nop,nop,TS val 1211142560 ecr 883673811], length 1221
    10:53:15.626823 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211142764 ecr 883673811], length 1400
    10:53:16.034514 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211143172 ecr 883673811], length 1400
    10:53:16.882718 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211144020 ecr 883673811], length 1400
    10:53:18.546672 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211145684 ecr 883673811], length 1400
    10:53:21.810626 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211148948 ecr 883673811], length 1400
    10:53:25.424108 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [F.], seq 6822, ack 518, win 261, options [nop,nop,TS val 1211152561 ecr 883673811], length 0
    10:53:25.425154 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [P.], seq 1:5601, ack 518, win 261, options [nop,nop,TS val 1211152562 ecr 883683821], length 5600
    10:53:25.425316 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [P.], seq 5601:6822, ack 518, win 261, options [nop,nop,TS val 1211152563 ecr 883683821], length 1221
    10:53:25.635089 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211152772 ecr 883683821], length 1400
    10:53:26.042889 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211153180 ecr 883683821], length 1400
    10:53:26.866828 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211154004 ecr 883683821], length 1400
    10:53:28.531115 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211155668 ecr 883683821], length 1400
    10:53:31.794866 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211158932 ecr 883683821], length 1400
    10:53:38.515292 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211165652 ecr 883683821], length 1400
    10:53:40.521879 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], ack 519, win 261, options [nop,nop,TS val 1211167659 ecr 883698918], length 0
Sure enough, the peer keeps sending us 1400-byte packets, we keep dropping them, and it keeps retransmitting. Very persistent, but all in vain.
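How stubborn the retransmissions are can be read off the timestamps. The sketch below takes the six timestamps at which the segment `seq 1:1401` appears in the first half of the capture above and computes the gaps between them; the function and variable names are illustrative only.

```python
from datetime import datetime


def retransmission_gaps(timestamps):
    """Seconds between consecutive tcpdump timestamps (HH:MM:SS.ffffff)."""
    times = [datetime.strptime(t, "%H:%M:%S.%f") for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Timestamps of "seq 1:1401" copied from the accounts.google.com capture.
SEQ_1_1401 = [
    "10:53:15.415555",
    "10:53:15.626823",
    "10:53:16.034514",
    "10:53:16.882718",
    "10:53:18.546672",
    "10:53:21.810626",
]

gaps = retransmission_gaps(SEQ_1_1401)
ratios = [later / earlier for earlier, later in zip(gaps, gaps[1:])]

if __name__ == "__main__":
    print([round(g, 3) for g in gaps])
    print([round(r, 2) for r in ratios])
```

Each gap is roughly twice the previous one (from about 0.2 s up to about 3.3 s), which is the usual exponential backoff of the TCP retransmission timer. The segment size never changes, so every retry hits the same drop rule.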
Recall that when our MTU was misconfigured we could still reach x.com, so let's try that one next.
Here is the capture for `curl https://x.com`:
    10:58:24.860299 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [S.], seq 426985981, ack 35246320, win 65535, options [mss 1460,sackOK,TS val 3505824407 ecr 358095009,nop,wscale 8], length 0
    10:58:25.030946 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], ack 518, win 261, options [nop,nop,TS val 3505824578 ecr 358095181], length 0
    10:58:25.032070 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 1:2897, ack 518, win 261, options [nop,nop,TS val 3505824579 ecr 358095181], length 2896
    10:58:25.032070 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 2897:3518, ack 518, win 261, options [nop,nop,TS val 3505824579 ecr 358095181], length 621
    10:58:25.245503 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1:1449, ack 518, win 261, options [nop,nop,TS val 3505824793 ecr 358095349], length 1448
    10:58:25.809509 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1:1449, ack 518, win 261, options [nop,nop,TS val 3505825357 ecr 358095349], length 1448
    10:58:26.833545 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1:1449, ack 518, win 261, options [nop,nop,TS val 3505826381 ecr 358095349], length 1448
    10:58:28.881459 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1:513, ack 518, win 261, options [nop,nop,TS val 3505828429 ecr 358095349], length 512
    10:58:29.049007 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 513:1449, ack 518, win 261, options [nop,nop,TS val 3505828596 ecr 358099199], length 936
    10:58:29.585502 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 513:1025, ack 518, win 261, options [nop,nop,TS val 3505829133 ecr 358099199], length 512
    10:58:29.753129 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1025:1449, ack 518, win 261, options [nop,nop,TS val 3505829300 ecr 358099903], length 424
    10:58:29.753129 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1449:1961, ack 518, win 261, options [nop,nop,TS val 3505829300 ecr 358099903], length 512
    10:58:29.920937 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1961:2897, ack 518, win 261, options [nop,nop,TS val 3505829468 ecr 358100070], length 936
    10:58:30.481510 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1961:2473, ack 518, win 261, options [nop,nop,TS val 3505830029 ecr 358100070], length 512
    10:58:30.649066 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 2473:2897, ack 518, win 261, options [nop,nop,TS val 3505830196 ecr 358100799], length 424
    10:58:30.818291 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], ack 739, win 261, options [nop,nop,TS val 3505830365 ecr 358100968], length 0
    10:58:30.818703 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 3518:4030, ack 739, win 261, options [nop,nop,TS val 3505830365 ecr 358100968], length 512
    10:58:30.818728 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 4030:4077, ack 739, win 261, options [nop,nop,TS val 3505830365 ecr 358100968], length 47
    10:58:30.986303 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 4077:4589, ack 739, win 261, options [nop,nop,TS val 3505830533 ecr 358101136], length 512
    10:58:30.986349 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 4589:4852, ack 739, win 261, options [nop,nop,TS val 3505830533 ecr 358101136], length 263
    10:58:31.028397 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], ack 770, win 261, options [nop,nop,TS val 3505830576 ecr 358101136], length 0
    10:58:31.155381 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 4852:4876, ack 771, win 261, options [nop,nop,TS val 3505830702 ecr 358101305], length 24
    10:58:31.155381 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [F.], seq 4876, ack 771, win 261, options [nop,nop,TS val 3505830702 ecr 358101305], length 0
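As the capture shows, after the initial 2896-byte burst and three 1448-byte retransmissions all go unanswered, the server sends a 512-byte packet. It finds that this one gets ACKed, so it tries a larger 936-byte packet, which fails, and it falls back to 512. There are further attempts with large packets later on, and they fail as well. In the end, though, the transfer succeeds.
The exact PMTUD behaviour differs between sites. For example, facebook.com first tries 1024 and then falls back to 512.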
    11:04:14.189897 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [S.], seq 2553051582, ack 2622595605, win 65535, options [mss 1392,sackOK,TS val 420920143 ecr 125287188,nop,wscale 8], length 0
    11:04:14.193091 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], ack 518, win 261, options [nop,nop,TS val 420920147 ecr 125287192], length 0
    11:04:14.194893 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 1:3245, ack 518, win 261, options [nop,nop,TS val 420920148 ecr 125287192], length 3244
    11:04:14.200022 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 2761:3245, ack 518, win 261, options [nop,nop,TS val 420920154 ecr 125287192], length 484
    11:04:14.252009 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420920205 ecr 125287199], length 1380
    11:04:14.359043 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420920313 ecr 125287199], length 1380
    11:04:14.567999 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420920521 ecr 125287199], length 1380
    11:04:14.983062 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420920937 ecr 125287199], length 1380
    11:04:15.863042 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420921816 ecr 125287199], length 1380
    11:04:17.528072 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1025, ack 518, win 261, options [nop,nop,TS val 420923482 ecr 125287199], length 1024
    11:04:20.856058 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:513, ack 518, win 261, options [nop,nop,TS val 420926810 ecr 125287199], length 512
    11:04:20.856637 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 513:1025, ack 518, win 261, options [nop,nop,TS val 420926810 ecr 125293855], length 512
    11:04:20.856694 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1025:1381, ack 518, win 261, options [nop,nop,TS val 420926810 ecr 125293855], length 356
    11:04:20.857202 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 1381:2405, ack 518, win 261, options [nop,nop,TS val 420926811 ecr 125293856], length 1024
    11:04:20.857416 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 2405:2761, ack 518, win 261, options [nop,nop,TS val 420926811 ecr 125293856], length 356
    11:04:20.909052 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1381:1893, ack 518, win 261, options [nop,nop,TS val 420926862 ecr 125293857], length 512
    11:04:20.909668 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1893:2405, ack 518, win 261, options [nop,nop,TS val 420926863 ecr 125293908], length 512
    11:04:20.911844 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], ack 728, win 261, options [nop,nop,TS val 420926865 ecr 125293910], length 0
    11:04:20.912221 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 3245:3416, ack 728, win 261, options [nop,nop,TS val 420926866 ecr 125293910], length 171
    11:04:20.912323 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 3416:3490, ack 728, win 261, options [nop,nop,TS val 420926866 ecr 125293910], length 74
    11:04:20.912837 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 3490:3534, ack 728, win 261, options [nop,nop,TS val 420926866 ecr 125293911], length 44
    11:04:20.954040 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], ack 759, win 261, options [nop,nop,TS val 420926907 ecr 125293912], length 0
    11:04:21.128720 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 3534:3777, ack 759, win 261, options [nop,nop,TS val 420927082 ecr 125293956], length 243
    11:04:21.130507 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [F.], seq 3777, ack 760, win 261, options [nop,nop,TS val 420927084 ecr 125294129], length 0
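The fallback visible in the x.com and facebook.com captures (retry the full-size segment a few times, then drop to a small size such as 512) can be mimicked with a toy model. This is a deliberately simplified sketch, not the algorithm any of these servers actually run: the function name, the `base_mss` parameter and the fixed retry count are assumptions made for illustration, and real stacks also resend the remainder as one larger segment, as the captures show.

```python
def simulate_fallback(data_len, mss, base_mss, max_deliverable,
                      retries_before_fallback=3):
    """Return the (segment_size, delivered) attempts needed to send data_len bytes.

    Segments larger than max_deliverable are silently dropped, like the
    iptables rule in this post does. After retries_before_fallback drops in a
    row, the sender shrinks its segment size to base_mss.
    """
    attempts = []
    sent = 0
    size = mss
    failures = 0
    while sent < data_len:
        segment = min(size, data_len - sent)
        delivered = segment <= max_deliverable
        attempts.append((segment, delivered))
        if delivered:
            sent += segment
            failures = 0
            continue
        failures += 1
        if failures >= retries_before_fallback:
            if size == base_mss:
                raise RuntimeError("base_mss itself does not fit the path")
            size = base_mss  # give up on the large size, probe with the small one
            failures = 0
    return attempts


if __name__ == "__main__":
    # 748 bytes of payload is what fits in an 800-byte IP packet with timestamps.
    for attempt in simulate_fallback(1448, mss=1448, base_mss=512, max_deliverable=748):
        print(attempt)
```

In this model a 1448-byte segment is tried three times, then delivered as 512 + 512 + 424 bytes, which is close to the tail of the x.com exchange.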
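ICMP type 3 code 4 test
ICMP has a message dedicated to this kind of unreachable error. ICMP type 3 means Destination Unreachable. There are many possible reasons for a destination being unreachable, each with its own code, and code 4 means Fragmentation Needed and Don't Fragment was Set. (In other words: the packet is too large and would have to be split into several IP fragments, but you set Don't Fragment, so all I can do is drop it and tell you so with this ICMP message.)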
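To make the message format concrete, here is a minimal sketch that builds the ICMP part of such a message by hand with `struct`, following the layout from RFC 792 as extended by RFC 1191: type, code, checksum, two unused bytes, the 16-bit next-hop MTU, then the offending packet's IP header plus the first 8 bytes of its payload. The helper names are hypothetical, and the 28-byte quote assumes a 20-byte IP header without options.

```python
import struct

ICMP_DEST_UNREACHABLE = 3
ICMP_FRAG_NEEDED = 4


def inet_checksum(data: bytes) -> int:
    """Internet checksum: one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_frag_needed(next_hop_mtu: int, offending_ip_packet: bytes) -> bytes:
    """ICMP type 3 code 4 message quoting the start of the dropped packet."""
    quoted = offending_ip_packet[:28]  # 20-byte IP header + 8 bytes of payload
    header = struct.pack("!BBHHH", ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED,
                         0, 0, next_hop_mtu)
    checksum = inet_checksum(header + quoted)
    header = struct.pack("!BBHHH", ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED,
                         checksum, 0, next_hop_mtu)
    return header + quoted


if __name__ == "__main__":
    fake_packet = bytes(range(40))  # placeholder bytes, not a real IP packet
    message = build_frag_needed(800, fake_packet)
    print(len(message), message.hex())
```

The result is 8 bytes of ICMP header plus 28 quoted bytes, 36 bytes in total. The quoted bytes matter because they contain the TCP ports the receiver needs to match the error to a connection.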
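In the test above we never sent any ICMP message; we only dropped packets. Now we add one step: whenever a packet is dropped, send an ICMP message back. We use scapy for this. The code is very simple: it captures every packet over 800 bytes and sends an ICMP message to the source of each one. iptables still does the dropping; the scapy script only sends the ICMP.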
    from scapy.all import ICMP, IP, send, sniff, raw


    def send_icmp_to(dst, payload):
        # ICMP type 3 code 4 = "fragmentation needed", advertising a next-hop MTU of 800
        icmp_packet = IP(dst=dst) / ICMP(type=3, code=4, nexthopmtu=800) / payload
        icmp_packet.display()
        send(icmp_packet)


    def callback(packet):
        ip_packet = packet[IP]
        # Quote the offending IP header (20 bytes, assuming no IP options)
        # plus the first 8 bytes of its payload
        icmp_body = raw(ip_packet)[:28]
        icmp_dst = ip_packet.src
        send_icmp_to(icmp_dst, icmp_body)
        return packet.summary()


    def sniff_interface():
        # "greater 815" matches frames whose Ethernet-level length is >= 815 bytes
        sniff(filter="greater 815", iface="eth0", prn=callback)


    sniff_interface()
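Why is the filter `greater 815`? Because libpcap's `greater` measures the packet at the Ethernet layer, and the Ethernet header is 14 bytes, so the condition we want is >= 815 bytes. `greater` means greater than or equal to. (Yes, I find that odd too.)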
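The threshold arithmetic can be spelled out in a couple of lines. This is only a sketch: the helper names are made up for illustration, and it assumes plain Ethernet framing with a 14-byte header and no VLAN tag.

```python
ETHERNET_HEADER = 14  # bytes, assuming untagged Ethernet frames


def pcap_greater_threshold(max_ip_len: int) -> int:
    """Smallest frame length whose IP packet is larger than max_ip_len.

    pcap's "greater N" means "frame length >= N", hence the +1.
    """
    return max_ip_len + 1 + ETHERNET_HEADER


def pcap_filter(max_ip_len: int) -> str:
    return f"greater {pcap_greater_threshold(max_ip_len)}"


if __name__ == "__main__":
    print(pcap_filter(800))
```

For a limit of 800 bytes at the IP layer this yields the same `greater 815` used in the script.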
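Save the script above as a.py and run it with `python3 a.py`.
Then I ran the test again from this server. The result was... exactly the same as before. I told them the next-hop MTU is 800, but they have their own ideas, such as starting their probing from 512. It is as if the ICMP messages never reached their servers. I don't know whether the problem lies in how I build the packet or in how their servers handle ICMP. For example, a Cloudflare article I read a while ago explains that, because of ECMP, ICMP messages can be routed to the wrong load balancer, which makes PMTUD fail. Their fix is to broadcast ICMP type 3 code 4 messages to all load balancers.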
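ICMP type 3 code 4 test in virtual machines
To check whether my script was at fault, I set up a very simple network locally.
The capture is shown below. The 1448-byte packets sent from port 8000 keep being dropped. As soon as the server receives the ICMP message, it switches to 800-byte IP packets, each carrying 748 bytes of TCP payload (800 minus 20 bytes of IP header, 20 bytes of TCP header and 12 bytes of timestamp option). So it does look like the ICMP messages are being black-holed on the public Internet.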
    11:44:22.580672 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [S], seq 3856372240, win 64240, options [mss 1460,sackOK,TS val 3466746064 ecr 0,nop,wscale 7], length 0
    11:44:22.581286 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [S.], seq 3719990126, ack 3856372241, win 65160, options [mss 1460,sackOK,TS val 3542152433 ecr 3466746064,nop,wscale 7], length 0
    11:44:22.581311 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 1, win 502, options [nop,nop,TS val 3466746064 ecr 3542152433], length 0
    11:44:22.581524 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [P.], seq 1:87, ack 1, win 502, options [nop,nop,TS val 3466746065 ecr 3542152433], length 86
    11:44:22.582388 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], ack 87, win 509, options [nop,nop,TS val 3542152434 ecr 3466746065], length 0
    11:44:22.583786 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [P.], seq 1:204, ack 87, win 509, options [nop,nop,TS val 3542152436 ecr 3466746065], length 203
    11:44:22.583786 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 204:1652, ack 87, win 509, options [nop,nop,TS val 3542152436 ecr 3466746065], length 1448
    11:44:22.583786 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [P.], seq 1652:3100, ack 87, win 509, options [nop,nop,TS val 3542152436 ecr 3466746065], length 1448
    11:44:22.583807 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 204, win 501, options [nop,nop,TS val 3466746067 ecr 3542152436], length 0
    11:44:22.584961 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [FP.], seq 3100:3276, ack 87, win 509, options [nop,nop,TS val 3542152436 ecr 3466746065], length 176
    11:44:22.584973 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 204, win 501, options [nop,nop,TS val 3466746068 ecr 3542152436,nop,nop,sack 1 {3100:3277}], length 0
    11:44:22.585478 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 204:1652, ack 87, win 509, options [nop,nop,TS val 3542152437 ecr 3466746068], length 1448
    11:44:22.608337 IP 172.16.42.22 > 172.16.42.21: ICMP 172.16.42.22 unreachable - need to frag (mtu 800), length 36
    11:44:22.609059 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 204:952, ack 87, win 509, options [nop,nop,TS val 3542152437 ecr 3466746068], length 748
    11:44:22.609161 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 952, win 499, options [nop,nop,TS val 3466746092 ecr 3542152437,nop,nop,sack 1 {3100:3277}], length 0
    11:44:22.609958 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 952:1652, ack 87, win 509, options [nop,nop,TS val 3542152462 ecr 3466746092], length 700
    11:44:22.609958 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 1652:2400, ack 87, win 509, options [nop,nop,TS val 3542152462 ecr 3466746092], length 748
    11:44:22.610016 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 1652, win 499, options [nop,nop,TS val 3466746093 ecr 3542152462,nop,nop,sack 1 {3100:3277}], length 0
    11:44:22.610078 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 2400, win 494, options [nop,nop,TS val 3466746093 ecr 3542152462,nop,nop,sack 1 {3100:3277}], length 0
    11:44:22.610517 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 2400:3100, ack 87, win 509, options [nop,nop,TS val 3542152463 ecr 3466746093], length 700
    11:44:22.610644 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 3277, win 499, options [nop,nop,TS val 3466746094 ecr 3542152463], length 0
    11:44:22.611267 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [F.], seq 87, ack 3277, win 501, options [nop,nop,TS val 3466746094 ecr 3542152463], length 0
Moreover, this path MTU information is kept in the route's cache: later transmissions assume by default that the MTU of this path is 800 and do not try anything larger.
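The caching behaviour can be pictured as a per-destination table with an expiry time. The class below is only a toy model of that idea, not how the kernel stores it: the class name, the 1500-byte default and the 600-second lifetime are assumptions for illustration (Linux exposes a comparable lifetime as the `net.ipv4.route.mtu_expires` sysctl).

```python
import time


class PmtuCache:
    """Toy per-destination path MTU cache with expiry."""

    def __init__(self, default_mtu=1500, ttl=600.0, clock=time.monotonic):
        self.default_mtu = default_mtu
        self.ttl = ttl          # seconds a learned value stays valid (assumed)
        self.clock = clock      # injectable so the expiry can be tested
        self._entries = {}      # destination -> (mtu, expires_at)

    def learn(self, dst, mtu):
        """Record the MTU reported by an ICMP type 3 code 4 message."""
        self._entries[dst] = (mtu, self.clock() + self.ttl)

    def lookup(self, dst):
        """MTU to use for dst: the learned value until it expires."""
        entry = self._entries.get(dst)
        if entry is None:
            return self.default_mtu
        mtu, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[dst]
            return self.default_mtu
        return mtu


if __name__ == "__main__":
    cache = PmtuCache()
    cache.learn("172.16.42.22", 800)
    print(cache.lookup("172.16.42.22"), cache.lookup("172.16.42.23"))
```

Until the entry expires, every new packet to that destination is sized for 800 bytes without any further probing, which matches what the capture shows.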
Related links:
- nmap has a Path MTU discovery script: https://nmap.org/nsedoc/scripts/path-mtu.html
- Path MTU discovery in practice
- Iptables Tutorial 1.2.2 by Oskar Andreasson, a good iptables tutorial
- RFC 5508 NAT Behavioral Requirements for ICMP
- Resolve IPv4 Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPsec
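`greater` meaning >= really is odd; writing it as `len >= length` might be more intuitive. https://www.tcpdump.org/manpages/pcap-filter.7.html
A question: with PMTUD enabled, does receiving the ICMP message directly affect the size of subsequent packets on an already established TCP flow? In other words, does it take priority over the MSS negotiated in the SYN?
I had assumed that after the ICMP message arrives, the kernel updates the MTU in the route, and only the next TCP connection negotiates a smaller MSS.
Thanks, xintao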
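Hi, I think it does affect the already established TCP flow, and it takes priority over the MSS negotiation.
The reasoning: if the MSS negotiation were always correct, there would be no need for PMTUD. PMTUD only comes into play when the negotiated MSS is wrong.
When the kernel receives the ICMP message it updates the MTU of the route. From reading the code [1], this parameter seems to live at layer 3, so the MTU for the destination IP is updated and takes effect at the IP layer. The IP layer has no notion of a TCP connection, so the change takes effect immediately.
Wikipedia [2] also says that the first packet of a TCP connection that is larger than the MTU causes an ICMP type 3 code 4 message to be sent to the source. The PMTU should be updated at that point rather than on the next connection. This feature is transparent to TCP users. (If it were not, users would have to re-establish the connection.)
1. https://github.com/torvalds/linux/blob/v3.15/net/ipv4/route.c#L951
2. https://en.wikipedia.org/wiki/Path_MTU_Discovery#cite_ref-6
Right, I read the code and ran a test yesterday, and that is indeed the case. Thanks.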
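> Moreover, this path MTU information is kept in the route's cache: later transmissions assume by default that the MTU of this path is 800 and do not try anything larger.
A question for xintao: is there a source for this conclusion? What I found says `Starting with Linux kernel version 3.6, there is no routing cache for IPv4 anymore.`
The cache I am talking about and the routing cache you mention may not be the same thing.
What I mean is that once the PMTU has been discovered, it keeps being used afterwards (at least for some time), rather than every packet first being sent at 1500 bytes, triggering an ICMP message, and then being resent at 800 bytes.
In other words, the MTU for this route (the path to the destination address) is stored in some kind of cache, which is a reasonable design. I am not saying it is stored in the routing cache.
The evidence is my experiment above: for a period of time the MTU in use stays at the discovered value instead of being probed again every time.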