最近需要研究tcp拥塞算法,决定通过写博客的方式加深理解.这是第一篇,记录下拥塞避免和慢启动算法
拥塞避免阶段:以1/cwnd的速度增长.即每次收到一个ack(如果每个包都对应一个ack,不考虑延迟ack等复杂情况),就会将snd_cwnd_cnt加1,如果snd_cwnd_cnt超过了当前窗口大小,就会将发送窗口加1,snd_cwnd_cnt重新重置为0.其中snd_cwnd_cnt就是用来统计1/cwnd个数的,满了就加1.
拥塞避免关键部分,在net/ipv4/tcp_cong.c文件中.
/* linux-4.14.69 */
/* In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd (or alternative w),
* for every packet that was ACKed.
*/
void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked)
{
/* If credits accumulated at a higher w, apply them gently now. */
if (tp->snd_cwnd_cnt >= w) {
tp->snd_cwnd_cnt = 0;
tp->snd_cwnd++;
}
tp->snd_cwnd_cnt += acked;
if (tp->snd_cwnd_cnt >= w) {
u32 delta = tp->snd_cwnd_cnt / w;
tp->snd_cwnd_cnt -= delta * w;
tp->snd_cwnd += delta;
}
tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_cwnd_clamp);
}
EXPORT_SYMBOL_GPL(tcp_cong_avoid_ai);
如果对比3.10,2.6等早期代码就会发现,现在的拥塞避免算法多了一个acked参数.之前收到一个ack就把snd_cwnd_cnt加1,现在就不是加1了,而是加acked.这个更加合理,因为一个ack可能ack了多个包.那么是不是旧的逻辑有问题了?
/* linux-2.6 or linux-3.10 */
/* In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd (or alternative w) */
void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w)
{
if (tp->snd_cwnd_cnt >= w) {
if (tp->snd_cwnd < tp->snd_cwnd_clamp)
tp->snd_cwnd++;
tp->snd_cwnd_cnt = 0;
} else {
tp->snd_cwnd_cnt++;
}
}
EXPORT_SYMBOL_GPL(tcp_cong_avoid_ai);
这里面涉及到一个tcp_abc这个参数,在早期版本中,考虑到延迟ack等情况,就有了ABC这个功能.在ABC中,使用被ACK的字节数而不是ACK的数量来作为增窗的反馈信号.这个可以从旧版本的reno拥塞算法看出来
/* linux-2.6 */
/*
* TCP Reno congestion control
* This is special case used for fallback as well.
*/
/* This is Jacobson's slow start and congestion avoidance.
* SIGCOMM '88, p. 328.
*/
void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
{
struct tcp_sock *tp = tcp_sk(sk);
if (!tcp_is_cwnd_limited(sk, in_flight))
return;
/* In "safe" area, increase. */
if (tp->snd_cwnd <= tp->snd_ssthresh)
tcp_slow_start(tp);
/* In dangerous area, increase slowly. */
else if (sysctl_tcp_abc) {
/* RFC3465: Appropriate Byte Count
* increase once for each full cwnd acked
*/
if (tp->bytes_acked >= tp->snd_cwnd*tp->mss_cache) {
tp->bytes_acked -= tp->snd_cwnd*tp->mss_cache;
if (tp->snd_cwnd < tp->snd_cwnd_clamp)
tp->snd_cwnd++;
}
} else {
tcp_cong_avoid_ai(tp, tp->snd_cwnd);
}
}
EXPORT_SYMBOL_GPL(tcp_reno_cong_avoid);
当开启了ABC功能,在拥塞避免阶段将不会执行tcp_cong_avoid_ai逻辑,而是根据byte_acked(这个ack包ack的字节数)和mss_cache(上一次缓存下来的有效最大报文段长度)来增加窗口.
然而在linux-3.10和linux-4的代码来看,已经找不到ABC相关逻辑了,都是直接走tcp_cong_avoid_ai.不同的是linux-4中的tcp_cong_avoid_ai函数多了acked参数,这样其实相当于实现了ABC功能,只是ABC以字节为精度计算窗口增长,而acked是以包为精度计算窗口增长.
/* linux-3.10.1 */
/*
* TCP Reno congestion control
* This is special case used for fallback as well.
*/
/* This is Jacobson's slow start and congestion avoidance.
* SIGCOMM '88, p. 328.
*/
void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
{
struct tcp_sock *tp = tcp_sk(sk);
if (!tcp_is_cwnd_limited(sk, in_flight))
return;
/* In "safe" area, increase. */
if (tp->snd_cwnd <= tp->snd_ssthresh)
tcp_slow_start(tp);
/* In dangerous area, increase slowly. */
else
tcp_cong_avoid_ai(tp, tp->snd_cwnd);
}
EXPORT_SYMBOL_GPL(tcp_reno_cong_avoid);
慢启动阶段:以1cwnd的速度增长.即每次收到一个ack(如果每个包都对应一个ack,不考虑延迟ack等复杂情况),就会将cwnd加1
慢启动算法关键部分,在net/ipv4/tcp_cong.c文件中.
/* linux-4.14.69 */
/* Slow start is used when congestion window is no greater than the slow start
* threshold. We base on RFC2581 and also handle stretch ACKs properly.
* We do not implement RFC3465 Appropriate Byte Counting (ABC) per se but
* something better;) a packet is only considered (s)acked in its entirety to
* defend the ACK attacks described in the RFC. Slow start processes a stretch
* ACK of degree N as if N acks of degree 1 are received back to back except
* ABC caps N to 2. Slow start exits when cwnd grows over ssthresh and
* returns the leftover acks to adjust cwnd in congestion avoidance mode.
*/
u32 tcp_slow_start(struct tcp_sock *tp, u32 acked)
{
u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh);
acked -= cwnd - tp->snd_cwnd;
tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
return acked;
}
EXPORT_SYMBOL_GPL(tcp_slow_start);
代码很简洁,就是把窗口设置为min(发送窗口+acked,慢启动门限值).如果超过了门限值,返回的acked就不为0.多余的acked就会通过拥塞避免的方式进行计算,来增加窗口.如reno算法
/* linux-4.14.69 */
/*
* TCP Reno congestion control
* This is special case used for fallback as well.
*/
/* This is Jacobson's slow start and congestion avoidance.
* SIGCOMM '88, p. 328.
*/
void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked)
{
struct tcp_sock *tp = tcp_sk(sk);
if (!tcp_is_cwnd_limited(sk))
return;
/* In "safe" area, increase. */
if (tcp_in_slow_start(tp)) {
acked = tcp_slow_start(tp, acked);
if (!acked)
return;
}
/* In dangerous area, increase slowly. */
tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
}
EXPORT_SYMBOL_GPL(tcp_reno_cong_avoid);
注意慢启动的注释中提到,现在采用的是比ABC更好的方式.
如果与2.6中的代码对比的话,会发现2.6的慢启动复杂很多,仔细看其实就是多了ABC功能的处理了,窗口增长的逻辑写的也很绕,通过snd_cwnd_cnt来计算窗口值,它是当前窗口的多少倍,就把当前窗口加多少.首先判断bytes_acked,小于一个包大小直接退出.接着看窗口是否超过了最大门限值tcp_max_ssthresh(这个值默认是没有开的,可以忽略),超过了cnt就设置成tcp_max_ssthresh/2,正常情况下cnt就是当前窗口值了.如果开启了abc功能,而且ack包确认的字节数超过了2个包大小,cnt翻倍.从注释看,这个就是用来解决延迟ack的.所以cnt一般就是一倍或者两倍的窗口大小,接着将其加到snd_cwnd_cnt, 然后取窗口整数倍加到窗口上,不足整数倍的部分留着下一次其计算.(注意,这里和linux-4相比,就没有考虑最后一次慢启动,可能会超过ssthresh很多的情形)
/* linux-2.6.39 */
/*
* Slow start is used when congestion window is less than slow start
* threshold. This version implements the basic RFC2581 version
* and optionally supports:
* RFC3742 Limited Slow Start - growth limited to max_ssthresh
* RFC3465 Appropriate Byte Counting - growth limited by bytes acknowledged
*/
void tcp_slow_start(struct tcp_sock *tp)
{
int cnt; /* increase in packets */
/* RFC3465: ABC Slow start
* Increase only after a full MSS of bytes is acked
*
* TCP sender SHOULD increase cwnd by the number of
* previously unacknowledged bytes ACKed by each incoming
* acknowledgment, provided the increase is not more than L
*/
if (sysctl_tcp_abc && tp->bytes_acked < tp->mss_cache)
return;
if (sysctl_tcp_max_ssthresh > 0 && tp->snd_cwnd > sysctl_tcp_max_ssthresh)
cnt = sysctl_tcp_max_ssthresh >> 1; /* limited slow start */
else
cnt = tp->snd_cwnd; /* exponential increase */
/* RFC3465: ABC
* We MAY increase by 2 if discovered delayed ack
*/
if (sysctl_tcp_abc > 1 && tp->bytes_acked >= 2*tp->mss_cache)
cnt <<= 1;
tp->bytes_acked = 0;
tp->snd_cwnd_cnt += cnt;
while (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
tp->snd_cwnd_cnt -= tp->snd_cwnd;
if (tp->snd_cwnd < tp->snd_cwnd_clamp)
tp->snd_cwnd++;
}
}
EXPORT_SYMBOL_GPL(tcp_slow_start);