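In the previous post I was bitten by an MTU problem. The problem itself got solved, but one question was left unverified: when MSS negotiation goes wrong, why does curl https://x.com work while curl https://accounts.google.com does not?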
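All experiments in this post were run in virtual machines, so the IP addresses are not hidden and the tcpdump output is pasted as-is. Before reading this post, it is best to read a good introduction to MTU first: Everything about MTU and MSS (有关 MTU 和 MSS 的一切), on this same blog.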
The guess in the previous post was that the sites that kept working implement PMTUD. This is fairly easy to verify.
PMTUD test
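During the TCP handshake, each side announces an MSS derived from its own local network interface. If the interface MTU is 1500, the MSS is 1460; if the MTU is 1450, the MSS is 1410. Neither endpoint knows anything about the MTU of the devices in between, and the MTU a middle device can forward may well be smaller than both ends (especially when a VPN or a tunnel is on the path). PMTUD, short for Path MTU Discovery, handles exactly this case. The idea is simple: when packets are being lost, the sender tries smaller packets and checks whether they get ACKed. If they do, the path MTU is smaller than the sender assumed, so it switches to smaller packets. (Strictly speaking, this loss-driven probing is Packetization Layer PMTUD from RFC 4821; classic PMTUD from RFC 1191 relies on ICMP messages, which are tested later in this post.)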
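The MSS numbers above follow from simple header arithmetic. A minimal sketch, assuming IPv4 and TCP headers without options (20 bytes each); the function name `mss_for_mtu` is made up for illustration:

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """MSS a host would announce for a given interface MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


for mtu in (1500, 1450):
    print(mtu, mss_for_mtu(mtu))
```

This only describes what each endpoint announces from its own interface; it says nothing about the smallest MTU along the path.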
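Verification is simple: we create an environment whose maximum acceptable MTU is 800. Anything over 800 bytes is dropped silently, with no ICMP message sent back.
We use iptables to DROP packets larger than 800 bytes. In a lab setup I like to log what gets dropped as well.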
    iptables -I INPUT -p tcp --match multiport --sports 80,443 -m length --length 801:9900 -j LOG --log-prefix "pmtud-should-drop:"
    iptables -A INPUT -p tcp --match multiport --sports 80,443 -m length --length 801:9900 -j DROP
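Next, we need to turn off Generic Receive Offload (and the other offloads too, which makes the capture easier to read). Otherwise, even if the peer sends small packets, the NIC merges them into large ones, which iptables then drops.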
    ethtool -K eth0 tso off
    ethtool -K eth0 gso off
    ethtool -K eth0 gro off
Finally, we start tcpdump and send the request: curl -v https://accounts.google.com. The capture looks like this:
    $ tcpdump -n -i eth0 src port 443
    tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
    listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
    10:53:15.411226 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [S.], seq 2021053715, ack 2859282114, win 65535, options [mss 1412,sackOK,TS val 1211142548 ecr 883673807,nop,wscale 8], length 0
    10:53:15.415006 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], ack 518, win 261, options [nop,nop,TS val 1211142552 ecr 883673811], length 0
    10:53:15.415555 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211142553 ecr 883673811], length 1400
    10:53:15.415591 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [P.], seq 1401:6822, ack 518, win 261, options [nop,nop,TS val 1211142553 ecr 883673811], length 5421
    10:53:15.422637 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [P.], seq 5601:6822, ack 518, win 261, options [nop,nop,TS val 1211142560 ecr 883673811], length 1221
    10:53:15.626823 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211142764 ecr 883673811], length 1400
    10:53:16.034514 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211143172 ecr 883673811], length 1400
    10:53:16.882718 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211144020 ecr 883673811], length 1400
    10:53:18.546672 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211145684 ecr 883673811], length 1400
    10:53:21.810626 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211148948 ecr 883673811], length 1400
    10:53:25.424108 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [F.], seq 6822, ack 518, win 261, options [nop,nop,TS val 1211152561 ecr 883673811], length 0
    10:53:25.425154 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [P.], seq 1:5601, ack 518, win 261, options [nop,nop,TS val 1211152562 ecr 883683821], length 5600
    10:53:25.425316 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [P.], seq 5601:6822, ack 518, win 261, options [nop,nop,TS val 1211152563 ecr 883683821], length 1221
    10:53:25.635089 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211152772 ecr 883683821], length 1400
    10:53:26.042889 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211153180 ecr 883683821], length 1400
    10:53:26.866828 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211154004 ecr 883683821], length 1400
    10:53:28.531115 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211155668 ecr 883683821], length 1400
    10:53:31.794866 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211158932 ecr 883683821], length 1400
    10:53:38.515292 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], seq 1:1401, ack 518, win 261, options [nop,nop,TS val 1211165652 ecr 883683821], length 1400
    10:53:40.521879 IP 142.251.10.113.443 > 159.65.132.17.36852: Flags [.], ack 519, win 261, options [nop,nop,TS val 1211167659 ecr 883698918], length 0
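Sure enough, the peer keeps sending us 1400-byte segments. They keep getting dropped and keep getting retransmitted, very persistently, but all in vain.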
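The retransmission timestamps in the capture above also show TCP's retransmission backoff at work. A small sketch that computes the gaps between the first five retransmissions of seq 1:1401; the timestamps are copied from the capture, and the helper name `gaps` is illustrative:

```python
from datetime import datetime

# Timestamps of the retransmitted segment seq 1:1401, copied from the capture above.
stamps = [
    "10:53:15.626823",
    "10:53:16.034514",
    "10:53:16.882718",
    "10:53:18.546672",
    "10:53:21.810626",
]


def gaps(timestamps):
    """Seconds between consecutive tcpdump timestamps (same day assumed)."""
    times = [datetime.strptime(t, "%H:%M:%S.%f") for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


for gap in gaps(stamps):
    print(f"{gap:.3f}s")
```

Each gap is roughly double the previous one: the sender waits longer and longer, but it never tries a smaller segment.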
Remember that back when our MTU was misconfigured, x.com was still reachable. Let's try it again.
Here is the capture for curl https://x.com:
    10:58:24.860299 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [S.], seq 426985981, ack 35246320, win 65535, options [mss 1460,sackOK,TS val 3505824407 ecr 358095009,nop,wscale 8], length 0
    10:58:25.030946 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], ack 518, win 261, options [nop,nop,TS val 3505824578 ecr 358095181], length 0
    10:58:25.032070 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 1:2897, ack 518, win 261, options [nop,nop,TS val 3505824579 ecr 358095181], length 2896
    10:58:25.032070 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 2897:3518, ack 518, win 261, options [nop,nop,TS val 3505824579 ecr 358095181], length 621
    10:58:25.245503 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1:1449, ack 518, win 261, options [nop,nop,TS val 3505824793 ecr 358095349], length 1448
    10:58:25.809509 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1:1449, ack 518, win 261, options [nop,nop,TS val 3505825357 ecr 358095349], length 1448
    10:58:26.833545 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1:1449, ack 518, win 261, options [nop,nop,TS val 3505826381 ecr 358095349], length 1448
    10:58:28.881459 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1:513, ack 518, win 261, options [nop,nop,TS val 3505828429 ecr 358095349], length 512
    10:58:29.049007 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 513:1449, ack 518, win 261, options [nop,nop,TS val 3505828596 ecr 358099199], length 936
    10:58:29.585502 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 513:1025, ack 518, win 261, options [nop,nop,TS val 3505829133 ecr 358099199], length 512
    10:58:29.753129 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1025:1449, ack 518, win 261, options [nop,nop,TS val 3505829300 ecr 358099903], length 424
    10:58:29.753129 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1449:1961, ack 518, win 261, options [nop,nop,TS val 3505829300 ecr 358099903], length 512
    10:58:29.920937 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1961:2897, ack 518, win 261, options [nop,nop,TS val 3505829468 ecr 358100070], length 936
    10:58:30.481510 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 1961:2473, ack 518, win 261, options [nop,nop,TS val 3505830029 ecr 358100070], length 512
    10:58:30.649066 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 2473:2897, ack 518, win 261, options [nop,nop,TS val 3505830196 ecr 358100799], length 424
    10:58:30.818291 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], ack 739, win 261, options [nop,nop,TS val 3505830365 ecr 358100968], length 0
    10:58:30.818703 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 3518:4030, ack 739, win 261, options [nop,nop,TS val 3505830365 ecr 358100968], length 512
    10:58:30.818728 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 4030:4077, ack 739, win 261, options [nop,nop,TS val 3505830365 ecr 358100968], length 47
    10:58:30.986303 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], seq 4077:4589, ack 739, win 261, options [nop,nop,TS val 3505830533 ecr 358101136], length 512
    10:58:30.986349 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 4589:4852, ack 739, win 261, options [nop,nop,TS val 3505830533 ecr 358101136], length 263
    10:58:31.028397 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [.], ack 770, win 261, options [nop,nop,TS val 3505830576 ecr 358101136], length 0
    10:58:31.155381 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [P.], seq 4852:4876, ack 771, win 261, options [nop,nop,TS val 3505830702 ecr 358101305], length 24
    10:58:31.155381 IP 104.244.42.193.443 > 159.65.132.17.56282: Flags [F.], seq 4876, ack 771, win 261, options [nop,nop,TS val 3505830702 ecr 358101305], length 0
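As the capture shows, after three retransmissions of the 1448-byte segment the server sends a 512-byte segment instead. Once that gets ACKed, it tries a larger 936-byte segment (the remaining 1448 - 512 bytes of the original one), which fails, so it falls back to 512. Further down there are more attempts with larger segments, and they fail as well. In the end, though, the transfer succeeds.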
The exact PMTUD behaviour differs from site to site. For example, facebook.com first tries 1024 and then falls back to 512.
    11:04:14.189897 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [S.], seq 2553051582, ack 2622595605, win 65535, options [mss 1392,sackOK,TS val 420920143 ecr 125287188,nop,wscale 8], length 0
    11:04:14.193091 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], ack 518, win 261, options [nop,nop,TS val 420920147 ecr 125287192], length 0
    11:04:14.194893 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 1:3245, ack 518, win 261, options [nop,nop,TS val 420920148 ecr 125287192], length 3244
    11:04:14.200022 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 2761:3245, ack 518, win 261, options [nop,nop,TS val 420920154 ecr 125287192], length 484
    11:04:14.252009 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420920205 ecr 125287199], length 1380
    11:04:14.359043 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420920313 ecr 125287199], length 1380
    11:04:14.567999 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420920521 ecr 125287199], length 1380
    11:04:14.983062 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420920937 ecr 125287199], length 1380
    11:04:15.863042 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1381, ack 518, win 261, options [nop,nop,TS val 420921816 ecr 125287199], length 1380
    11:04:17.528072 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:1025, ack 518, win 261, options [nop,nop,TS val 420923482 ecr 125287199], length 1024
    11:04:20.856058 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1:513, ack 518, win 261, options [nop,nop,TS val 420926810 ecr 125287199], length 512
    11:04:20.856637 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 513:1025, ack 518, win 261, options [nop,nop,TS val 420926810 ecr 125293855], length 512
    11:04:20.856694 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1025:1381, ack 518, win 261, options [nop,nop,TS val 420926810 ecr 125293855], length 356
    11:04:20.857202 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 1381:2405, ack 518, win 261, options [nop,nop,TS val 420926811 ecr 125293856], length 1024
    11:04:20.857416 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 2405:2761, ack 518, win 261, options [nop,nop,TS val 420926811 ecr 125293856], length 356
    11:04:20.909052 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1381:1893, ack 518, win 261, options [nop,nop,TS val 420926862 ecr 125293857], length 512
    11:04:20.909668 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], seq 1893:2405, ack 518, win 261, options [nop,nop,TS val 420926863 ecr 125293908], length 512
    11:04:20.911844 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], ack 728, win 261, options [nop,nop,TS val 420926865 ecr 125293910], length 0
    11:04:20.912221 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 3245:3416, ack 728, win 261, options [nop,nop,TS val 420926866 ecr 125293910], length 171
    11:04:20.912323 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 3416:3490, ack 728, win 261, options [nop,nop,TS val 420926866 ecr 125293910], length 74
    11:04:20.912837 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 3490:3534, ack 728, win 261, options [nop,nop,TS val 420926866 ecr 125293911], length 44
    11:04:20.954040 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [.], ack 759, win 261, options [nop,nop,TS val 420926907 ecr 125293912], length 0
    11:04:21.128720 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [P.], seq 3534:3777, ack 759, win 261, options [nop,nop,TS val 420927082 ecr 125293956], length 243
    11:04:21.130507 IP 157.240.235.35.443 > 159.65.132.17.33794: Flags [F.], seq 3777, ack 760, win 261, options [nop,nop,TS val 420927084 ecr 125294129], length 0
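The segment sizes seen in the three captures can be checked against the 800-byte iptables rule. The length match works on the whole IP packet, so on top of each TCP payload there are 20 bytes of IPv4 header, 20 bytes of TCP header and, in these captures, 12 bytes of TCP options (nop, nop, timestamp). A sketch under those assumptions; `ip_packet_length` and `passes_filter` are illustrative names:

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20    # bytes, fixed part of the TCP header
TCP_OPTIONS = 12   # bytes: nop, nop, timestamp, as shown in the captures
LIMIT = 800        # largest IP packet the iptables rules let through


def ip_packet_length(payload: int) -> int:
    """Total IP packet length for a TCP segment with this payload size."""
    return payload + IPV4_HEADER + TCP_HEADER + TCP_OPTIONS


def passes_filter(payload: int) -> bool:
    """True if the resulting IP packet is not matched by the DROP rule."""
    return ip_packet_length(payload) <= LIMIT


# Payload sizes that appear as data segments in the captures above.
for payload in (1448, 1400, 1380, 1024, 936, 512, 424, 356):
    verdict = "passes" if passes_filter(payload) else "dropped"
    print(f"payload {payload:>4} -> IP length {ip_packet_length(payload):>4}: {verdict}")
```

Under these assumptions only the 512-, 424- and 356-byte segments fit, which is consistent with the segments that let the x.com and facebook.com transfers make progress.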
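ICMP type 3 code 4 test
ICMP has a message type dedicated to this kind of unreachable error. ICMP type 3 means Destination Unreachable, but a destination can be unreachable for many reasons, and each reason has its own code. Code 4 means Fragmentation Needed and Don't Fragment was Set. (That is: the packet is too big and would have to be split into several IP packets, but you set the flag that forbids fragmentation, so I can only drop it and tell you so with this ICMP message.)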
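To make the message format concrete, here is a sketch that packs a type 3 code 4 message by hand with Python's standard library. The layout (type, code, checksum, two unused bytes, next-hop MTU, then the offending IP header plus the first 8 bytes of its payload) follows RFC 792 and RFC 1191; the function names and the dummy packet bytes are illustrative only:

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, as used by ICMP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def frag_needed(next_hop_mtu: int, original: bytes) -> bytes:
    """Build an ICMP Destination Unreachable / Fragmentation Needed message.

    `original` is the dropped IP packet; only its first 28 bytes
    (a 20-byte header without options plus 8 bytes of payload) are quoted.
    """
    quoted = original[:28]
    header = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu)
    checksum = internet_checksum(header + quoted)
    return struct.pack("!BBHHH", 3, 4, checksum, 0, next_hop_mtu) + quoted


message = frag_needed(800, bytes(range(40)))  # dummy bytes stand in for a real packet
icmp_type, code, checksum, unused, mtu = struct.unpack("!BBHHH", message[:8])
print(icmp_type, code, mtu, len(message))
```

The message is 8 bytes of ICMP header plus 28 quoted bytes, 36 bytes in total.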
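In the test above we never sent any ICMP message; we only dropped packets. Now we add one step: when a packet is dropped, we send an ICMP message back. We use scapy for this. The code is very simple: it captures every packet larger than 800 bytes and sends an ICMP message to the source of each one. iptables is still responsible for dropping the packets; the scapy script only sends the ICMP messages.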
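    from scapy.all import ICMP, IP, raw, send, sniff


    def send_icmp_to(dst, payload):
        # ICMP type 3 code 4: Fragmentation Needed, advertising a next-hop MTU of 800.
        icmp_packet = IP(dst=dst) / ICMP(type=3, code=4, nexthopmtu=800) / payload
        icmp_packet.display()
        send(icmp_packet)


    def callback(packet):
        if IP not in packet:
            return None  # skip frames that carry no IPv4 packet
        ip_packet = packet[IP]
        # Quote the offending IP header (20 bytes, no options) plus the first 8 bytes of its payload.
        icmp_body = raw(ip_packet)[:28]
        icmp_dst = ip_packet.src
        send_icmp_to(icmp_dst, icmp_body)
        return packet.summary()


    def sniff_interface():
        # "greater 815" matches frames of at least 815 bytes (801-byte IP packet + 14-byte Ethernet header).
        sniff(filter="greater 815", iface="eth0", prn=callback)


    sniff_interface()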
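Why is the filter greater 815? Because libpcap's greater compares against the frame length at the Ethernet layer. The Ethernet header is 14 bytes, so an IP packet of 801 bytes or more means a frame of >= 815 bytes. And greater means greater than or equal. (Yes, I find that odd too.)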
Save the script above as a.py and run it with python3 a.py.
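Then I ran the test from this server again. The result... was exactly the same as above. I told them the next-hop MTU is 800, but they have their own ideas and start probing from 512 and so on, as if the ICMP message never reached their servers. I don't know whether the problem is how I construct the packet, or that their servers don't handle ICMP properly. For example, I once read a Cloudflare article explaining that, because of ECMP, ICMP messages can be routed to the wrong load balancer, which makes PMTUD fail. Their fix was to broadcast ICMP type 3 code 4 to all load balancers.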
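ICMP type 3 code 4 test in virtual machines
To check whether my script was at fault, I set up a very simple network environment locally.
The capture is shown below. The server on port 8000 tries to send 1448-byte segments, which never get acknowledged. As soon as it receives the ICMP message, the server switches to 800-byte packets (748 bytes of TCP payload per segment). So it still looks like an ICMP black hole on the public Internet.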
    11:44:22.580672 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [S], seq 3856372240, win 64240, options [mss 1460,sackOK,TS val 3466746064 ecr 0,nop,wscale 7], length 0
    11:44:22.581286 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [S.], seq 3719990126, ack 3856372241, win 65160, options [mss 1460,sackOK,TS val 3542152433 ecr 3466746064,nop,wscale 7], length 0
    11:44:22.581311 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 1, win 502, options [nop,nop,TS val 3466746064 ecr 3542152433], length 0
    11:44:22.581524 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [P.], seq 1:87, ack 1, win 502, options [nop,nop,TS val 3466746065 ecr 3542152433], length 86
    11:44:22.582388 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], ack 87, win 509, options [nop,nop,TS val 3542152434 ecr 3466746065], length 0
    11:44:22.583786 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [P.], seq 1:204, ack 87, win 509, options [nop,nop,TS val 3542152436 ecr 3466746065], length 203
    11:44:22.583786 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 204:1652, ack 87, win 509, options [nop,nop,TS val 3542152436 ecr 3466746065], length 1448
    11:44:22.583786 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [P.], seq 1652:3100, ack 87, win 509, options [nop,nop,TS val 3542152436 ecr 3466746065], length 1448
    11:44:22.583807 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 204, win 501, options [nop,nop,TS val 3466746067 ecr 3542152436], length 0
    11:44:22.584961 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [FP.], seq 3100:3276, ack 87, win 509, options [nop,nop,TS val 3542152436 ecr 3466746065], length 176
    11:44:22.584973 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 204, win 501, options [nop,nop,TS val 3466746068 ecr 3542152436,nop,nop,sack 1 {3100:3277}], length 0
    11:44:22.585478 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 204:1652, ack 87, win 509, options [nop,nop,TS val 3542152437 ecr 3466746068], length 1448
    11:44:22.608337 IP 172.16.42.22 > 172.16.42.21: ICMP 172.16.42.22 unreachable - need to frag (mtu 800), length 36
    11:44:22.609059 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 204:952, ack 87, win 509, options [nop,nop,TS val 3542152437 ecr 3466746068], length 748
    11:44:22.609161 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 952, win 499, options [nop,nop,TS val 3466746092 ecr 3542152437,nop,nop,sack 1 {3100:3277}], length 0
    11:44:22.609958 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 952:1652, ack 87, win 509, options [nop,nop,TS val 3542152462 ecr 3466746092], length 700
    11:44:22.609958 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 1652:2400, ack 87, win 509, options [nop,nop,TS val 3542152462 ecr 3466746092], length 748
    11:44:22.610016 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 1652, win 499, options [nop,nop,TS val 3466746093 ecr 3542152462,nop,nop,sack 1 {3100:3277}], length 0
    11:44:22.610078 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 2400, win 494, options [nop,nop,TS val 3466746093 ecr 3542152462,nop,nop,sack 1 {3100:3277}], length 0
    11:44:22.610517 IP 172.16.42.21.8000 > 172.16.42.22.60960: Flags [.], seq 2400:3100, ack 87, win 509, options [nop,nop,TS val 3542152463 ecr 3466746093], length 700
    11:44:22.610644 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [.], ack 3277, win 499, options [nop,nop,TS val 3466746094 ecr 3542152463], length 0
    11:44:22.611267 IP 172.16.42.22.60960 > 172.16.42.21.8000: Flags [F.], seq 87, ack 3277, win 501, options [nop,nop,TS val 3466746094 ecr 3542152463], length 0
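The segment boundaries after the ICMP message can be reproduced with a little arithmetic. With a path MTU of 800 and 52 bytes of IPv4 and TCP headers (including the 12-byte timestamp option seen in the capture), each packet has room for 748 bytes of payload, so each unacknowledged 1448-byte segment is resent as 748 + 700. A sketch with illustrative names:

```python
HEADERS = 20 + 20 + 12  # IPv4 + TCP + timestamp option, as seen in the capture


def resegment(start: int, end: int, path_mtu: int):
    """Split the byte range [start, end) into chunks that fit the path MTU."""
    max_payload = path_mtu - HEADERS
    chunks = []
    while start < end:
        stop = min(start + max_payload, end)
        chunks.append((start, stop))
        start = stop
    return chunks


# The two unacknowledged 1448-byte segments from the capture above.
for first, last in ((204, 1652), (1652, 3100)):
    for lo, hi in resegment(first, last, 800):
        print(f"seq {lo}:{hi}, length {hi - lo}")
```

The printed ranges are the same four data segments the server sends after the ICMP message arrives.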
Moreover, this path MTU is kept in the route's cache: later sends assume the MTU of this path is 800 and do not probe for anything larger.
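The caching behaviour can be pictured with a toy model. This is only a sketch of the idea (learn once, reuse until the entry expires), not how the kernel actually stores the value; the class name is made up, and the 600-second lifetime is an assumption modelled on the default of the Linux sysctl net.ipv4.route.mtu_expires:

```python
import time


class PathMtuCache:
    """Toy model of a per-destination path MTU cache (not the kernel's real structure)."""

    def __init__(self, default_mtu=1500, ttl=600.0, clock=time.monotonic):
        self.default_mtu = default_mtu
        self.ttl = ttl          # assumed lifetime of a learned value, in seconds
        self.clock = clock
        self._entries = {}      # destination -> (mtu, expiry time)

    def learn(self, dst, mtu):
        """Record the MTU reported by an ICMP type 3 code 4 message."""
        self._entries[dst] = (mtu, self.clock() + self.ttl)

    def lookup(self, dst):
        """Return the cached MTU while it is fresh, else the default."""
        entry = self._entries.get(dst)
        if entry is None or self.clock() >= entry[1]:
            self._entries.pop(dst, None)
            return self.default_mtu
        return entry[0]


cache = PathMtuCache()
print(cache.lookup("172.16.42.22"))  # nothing learned yet
cache.learn("172.16.42.22", 800)
print(cache.lookup("172.16.42.22"))  # reused for later packets to the same destination
```

The point of the model is that the ICMP message is needed only once per destination; later packets start from the learned value instead of rediscovering it.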
Related links:
- nmap has a Path MTU discovery script: https://nmap.org/nsedoc/scripts/path-mtu.html
- Path MTU discovery in practice
- Iptables Tutorial 1.2.2 by Oskar Andreasson, a good iptables tutorial
- RFC 5508 NAT Behavioral Requirements for ICMP
- Resolve IPv4 Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPsec
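It is indeed odd that greater means >=. Writing it as len >= length is probably more intuitive. https://www.tcpdump.org/manpages/pcap-filter.7.html
A question: with PMTUD enabled, does receiving the ICMP message directly affect the packet size of an already established TCP flow? In other words, does it take priority over the MSS negotiated in the SYN exchange?
I used to think that after the ICMP message arrives, the kernel updates the MTU on the route, and only the next TCP connection negotiates a smaller MSS.
Thanks, xintao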
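Hi, I think it does affect an established TCP flow, and it takes priority over the MSS negotiation.
The reasoning: if MSS negotiation were always correct, PMTUD would not need to exist. PMTUD matters exactly when the negotiated MSS is wrong.
As for the kernel updating the route's MTU when it receives the ICMP message: from reading the code [1], this parameter appears to live at layer 3. The MTU for the destination IP is updated and should take effect in the IP layer, which has no notion of a TCP connection, so it takes effect immediately.
The wiki [2] also says that the first packet of a TCP connection that exceeds the MTU causes an ICMP type 3 code 4 message to be sent to the source. The PMTU should be updated at that point, not on the next connection. This is transparent to the TCP user (otherwise the user would have to re-establish the connection).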
1. https://github.com/torvalds/linux/blob/v3.15/net/ipv4/route.c#L951
2. https://en.wikipedia.org/wiki/Path_MTU_Discovery#cite_ref-6
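Yep, I read the code and ran some tests yesterday, and that is indeed the case. Thanks.
> Moreover, this path MTU is kept in the route's cache: later sends assume the MTU of this path is 800 and do not probe for anything larger.
A question for xintao: is there a source for this conclusion? What I found says: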
Starting with Linux kernel version 3.6, there is no routing cache for IPv4 anymore.
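The cache I mean and the routing cache you mean are probably not the same thing.
What I mean is that once the PMTU has been discovered, it keeps being used afterwards (at least for some time). It is not the case that every send starts with a 1500-byte packet, receives an ICMP message, drops to 800 bytes and resends.
In other words, the MTU for this route (the path to the destination address) is stored in some cache. That is reasonable, but it does not mean it is stored in the routing cache.
The evidence is my experiment above: for a period of time the discovered MTU keeps being used, rather than being probed again each time.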