Linux Network Protocol Stack 3 -- the neighbor subsystem

Author: 苏苏林 | Published: 2021-06-16 10:56

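A neighbor can be understood simply as a host that is one hop away at layer 3. The next hop of a route does not have to be directly connected (recursive routing), but by the time a packet reaches the neighbor subsystem the next hop has been resolved to a directly connected one.
Using recursive routes on Linux: https://www.jianshu.com/p/070202b6d3ca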

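The neighbor subsystem maps layer-3 addresses to layer-2 addresses, caches layer-2 headers to speed up encapsulation, and performs the layer-2 header encapsulation itself.
Each entry in the neighbor table below says that the next hop with IP address x.x.x.x has MAC address xx:xx:xx:xx:xx:xx and is reachable through the outgoing interface ethX.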

# ip neigh
172.16.10.34 dev eth1 lladdr 52:54:00:8f:77:cd STALE
172.16.100.2 dev eth1 lladdr 00:1e:08:0a:53:01 STALE
192.168.122.1 dev eth2 lladdr 52:54:00:7a:39:1c STALE
172.16.100.3 dev eth1 lladdr 00:1e:08:0a:b2:f7 STALE
172.16.0.2 dev eth1 lladdr 00:1e:08:15:18:65 STALE
172.16.0.1 dev eth1 lladdr 50:c5:8d:b4:3e:81 REACHABLE
1.1.1.1 dev eth0 lladdr 52:54:00:e4:f7:11 PERMANENT
192.168.121.1 dev eth0 lladdr 52:54:00:8a:20:74 STALE
20.1.1.10 dev eth2.100 lladdr 52:54:00:e4:f7:2a STALE

Each entry also carries a state. Understanding what the states mean matters both when working on a live system and when reading the code.


/*
 *  Neighbor Cache Entry States.
 */

#define NUD_INCOMPLETE  0x01
#define NUD_REACHABLE   0x02
#define NUD_STALE   0x04
#define NUD_DELAY   0x08
#define NUD_PROBE   0x10
#define NUD_FAILED  0x20

/* Dummy states */
#define NUD_NOARP   0x40
#define NUD_PERMANENT   0x80
#define NUD_NONE    0x00

/* NUD_NOARP & NUD_PERMANENT are pseudostates, they never change
   and make no address resolution or NUD.
   NUD_PERMANENT also cannot be deleted by garbage collectors.
 */

#define NUD_IN_TIMER    (NUD_INCOMPLETE|NUD_REACHABLE|NUD_DELAY|NUD_PROBE)
#define NUD_VALID   (NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE|NUD_PROBE|NUD_STALE|NUD_DELAY)
#define NUD_CONNECTED   (NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE)
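
To make the composite masks concrete, here is a small user-space sketch (not kernel code) that copies the constants above and tests a state against each mask. The three helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* State bits and composite masks, copied from the listing above. */
#define NUD_INCOMPLETE 0x01
#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04
#define NUD_DELAY      0x08
#define NUD_PROBE      0x10
#define NUD_FAILED     0x20
#define NUD_NOARP      0x40
#define NUD_PERMANENT  0x80
#define NUD_NONE       0x00

#define NUD_IN_TIMER  (NUD_INCOMPLETE | NUD_REACHABLE | NUD_DELAY | NUD_PROBE)
#define NUD_VALID     (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                       NUD_PROBE | NUD_STALE | NUD_DELAY)
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Hypothetical helpers, not kernel functions. */

/* A usable hardware address is cached, so packets can be sent. */
bool nud_has_usable_lladdr(uint8_t state) { return (state & NUD_VALID) != 0; }

/* The entry may use the fast connected_output path. */
bool nud_uses_fast_output(uint8_t state) { return (state & NUD_CONNECTED) != 0; }

/* The per-entry timer is armed in this state. */
bool nud_runs_timer(uint8_t state) { return (state & NUD_IN_TIMER) != 0; }
```

A STALE entry, for example, still has a usable address (it is in NUD_VALID) but is neither in NUD_CONNECTED nor in NUD_IN_TIMER, which is why its packets go out through the slower output function and no timer runs until traffic moves it to NUD_DELAY.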

The state machine diagram below shows the transitions clearly.

[Figure: neighbor (NUD) state machine diagram]

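NUD_INCOMPLETE: a solicitation has been sent but no reply has arrived yet. The hardware address has not been resolved, so none is available, and packets destined for this neighbor are queued on the entry. A timer runs in this state: each time it expires without a reply the solicitation is resent, and once the number of solicitations reaches the limit the entry moves to NUD_FAILED.
NUD_REACHABLE: the neighbor's hardware address has been resolved and cached. On entering this state the entry's output function is set (to connected_output from the neigh_ops structure), and any packets waiting for this neighbor are sent. If the entry stays idle beyond the limit, it moves to NUD_STALE.
NUD_STALE: as soon as a packet has to be sent to this neighbor, the entry moves to NUD_DELAY and the packet is sent. If the entry stays idle beyond the limit and its reference count is 1, the garbage collector deletes it. Output is not restricted in this state, but it goes through the slow path.
NUD_DELAY: a packet has been sent while the entry was NUD_STALE, and confirmation of reachability is now awaited. If a confirmation arrives before the delay timer expires, the entry returns to NUD_REACHABLE; otherwise it moves to NUD_PROBE, where solicitations are sent. Output is not restricted in this state, but it goes through the slow path.
NUD_PROBE: a transitional state similar to NUD_INCOMPLETE. While no reply or confirmation arrives, the solicitation is resent periodically until a reply or confirmation is received or the number of attempts reaches the limit. A reply or confirmation moves the entry to NUD_REACHABLE; hitting the limit moves it to NUD_FAILED. Output is not restricted in this state either, and it goes through the slow path.
NUD_FAILED: the neighbor is unreachable because no reply was received.
NUD_NOARP: the neighbor needs no mapping from a layer-3 address to a layer-2 address, e.g. virtual interfaces of layer-3 overlays, loopback, and so on.
NUD_PERMANENT: the entry's hardware address is configured statically.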
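
The timer-driven part of these transitions can be sketched in user space as follows. This is a simplified model based on my reading of neigh_timer_handler, not the kernel code: all `toy_` names are hypothetical, times are plain integers with no jiffies wraparound handling, and sending a solicitation is reduced to incrementing a counter. Note that in this model NUD_DELAY does not send probes itself; it falls through to NUD_PROBE when no confirmation arrived in time.

```c
#include <assert.h>

enum toy_state {
    TOY_INCOMPLETE = 0x01, TOY_REACHABLE = 0x02, TOY_STALE = 0x04,
    TOY_DELAY = 0x08, TOY_PROBE = 0x10, TOY_FAILED = 0x20
};

struct toy_neigh {
    enum toy_state state;
    unsigned long confirmed;   /* last time reachability was confirmed */
    unsigned long used;        /* last time a packet was sent */
    int probes;                /* solicitations sent so far */
};

struct toy_parms {
    unsigned long reachable_time;
    unsigned long delay_probe_time;
    int max_probes;
};

/* Handle one timer expiry at time `now` and return the resulting state. */
enum toy_state toy_timer_fire(struct toy_neigh *n, const struct toy_parms *p,
                              unsigned long now)
{
    switch (n->state) {
    case TOY_REACHABLE:
        if (now <= n->confirmed + p->reachable_time)
            break;                              /* still fresh */
        /* Recently used entries get re-verified, idle ones go stale. */
        n->state = (now <= n->used + p->delay_probe_time) ? TOY_DELAY
                                                          : TOY_STALE;
        break;
    case TOY_DELAY:
        if (now <= n->confirmed + p->delay_probe_time) {
            n->state = TOY_REACHABLE;           /* confirmed in time */
        } else {
            n->state = TOY_PROBE;               /* start active probing */
            n->probes = 0;
        }
        break;
    default:
        break;
    }

    if (n->state == TOY_INCOMPLETE || n->state == TOY_PROBE) {
        if (n->probes >= p->max_probes)
            n->state = TOY_FAILED;              /* gave up */
        else
            n->probes++;                        /* a solicitation goes out */
    }
    return n->state;
}
```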

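Related data structures

struct neigh_table represents one neighbor protocol. Currently there are only ARP for IPv4 and ND for IPv6, each defined as a global variable: arp_tbl for IPv4 and nd_tbl for IPv6.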

// IPv4: arp_tbl, IPv6: nd_tbl
struct neigh_table {
    int         family;           // AF_INET or AF_INET6
    int         entry_size;    // size of one entry: struct neighbour plus its key; ARP looks entries up by IPv4 address, so sizeof(struct neighbour) + 4
    int         key_len;       // length of the lookup key mentioned above, i.e. the layer-3 address; an IPv4 address for ARP
    __be16          protocol;     // layer-3 protocol type, ETH_P_IP or ETH_P_IPV6
    __u32           (*hash)(const void *pkey,
                    const struct net_device *dev,
                    __u32 *hash_rnd);          // entry hash function, e.g. arp_hash
    bool            (*key_eq)(const struct neighbour *, const void *pkey);
    int         (*constructor)(struct neighbour *);
    int         (*pconstructor)(struct pneigh_entry *);
    void            (*pdestructor)(struct pneigh_entry *);
    void            (*proxy_redo)(struct sk_buff *skb);
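    char            *id;                   // table name, used to name the cache that neighbour entries are allocated from; arp_cache for arp_tbl
    struct neigh_parms  parms;      // tunable, protocol-specific parameters
    struct list_head    parms_list;   
    int         gc_interval;     // garbage collection settings: gc_interval is a time, gc_thresh1..3 are entry-count thresholds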
    int         gc_thresh1;
    int         gc_thresh2;
    int         gc_thresh3;
    unsigned long       last_flush;
    struct delayed_work gc_work;        // delayed work item that runs garbage collection
    struct timer_list   proxy_timer;
    struct sk_buff_head proxy_queue;
    atomic_t        entries;                          // total number of neighbour entries
    rwlock_t        lock;
    unsigned long       last_rand;
    struct neigh_statistics __percpu *stats;
    struct neigh_hash_table __rcu *nht;   // hash table holding the neighbour entries
    struct pneigh_entry **phash_buckets;  // hash buckets holding the proxy neighbour entries (pneigh_entry)
};

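struct neighbour defines a neighbor entry: its state, the layer-2 and layer-3 addresses, the cached layer-2 header, the outgoing interface and a few function pointers.

output is the function used to send a data packet to the neighbor; which callback it points to changes with the entry's state. While the neighbor is reachable it is connected_output. When the entry leaves NUD_CONNECTED for NUD_STALE or NUD_DELAY, neigh_suspect forces reachability to be re-confirmed by pointing neighbour->output at neigh_ops->output, i.e. neigh_resolve_output.
neigh_ops defines the function that sends address resolution requests and the functions that send data packets (a generic one and one for the connected state).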


struct neighbour {
    struct neighbour __rcu  *next;
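    struct neigh_table  *tbl;                    // back pointer to the owning table, e.g. arp_tbl
    struct neigh_parms  *parms;              // parameters that tune the neighbor protocol
    unsigned long       confirmed;         // time reachability was last confirmed; updated by the transport layer via neigh_confirm and by the neighbor subsystem via neigh_update
    unsigned long       updated;           // time the entry was last changed by neigh_update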
    rwlock_t        lock;
    atomic_t        refcnt;
    struct sk_buff_head arp_queue;      // when the first packet needs a new entry, the packet is queued here and solicit() is called to send a request
    unsigned int        arp_queue_len_bytes;
    struct timer_list   timer;
    unsigned long       used;
    atomic_t        probes;
    __u8            flags;
    __u8            nud_state;
    __u8            type;
    __u8            dead;
    seqlock_t       ha_lock;
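    unsigned char       ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];    // layer-2 hardware address matching the layer-3 address stored in primary_key
    struct hh_cache     hh;                                    // cached layer-2 header: the complete header, not just the layer-2 address
    // Output function used to send a packet to the neighbor; the callback changes with the state and is connected_output while the neighbor is reachable.
    // When the entry goes from NUD_REACHABLE to NUD_STALE or NUD_DELAY, neigh_suspect forces reachability confirmation by pointing neighbour->output at neigh_ops->output, i.e. neigh_resolve_output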
    int         (*output)(struct neighbour *, struct sk_buff *);   
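    const struct neigh_ops  *ops;          // operations of this entry: the layer-3 to layer-2 output path that ends in dev_queue_xmit
    struct rcu_head     rcu;
    struct net_device   *dev;              // device through which the neighbor is reached, i.e. the outgoing interface of the next hop
    u8          primary_key[0];     // layer-3 address (IPv4 or IPv6) used by the hash function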
};
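
The trailing primary_key[0] member together with entry_size = sizeof(struct neighbour) + key_len describes a common C layout: the key is stored directly behind the fixed part of the entry in one allocation. A minimal user-space sketch of that idea follows; `toy_neigh`, `toy_neigh_alloc` and `toy_key_eq` are illustrative names, an integer ifindex stands in for the net_device pointer, and a C99 flexible array member replaces the older [0] notation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_neigh {
    int ifindex;               /* stands in for struct net_device *dev */
    uint8_t ha[6];             /* resolved hardware address */
    uint8_t primary_key[];     /* layer-3 address stored behind the struct */
};

/* Allocate one entry with room for the key right behind it. */
struct toy_neigh *toy_neigh_alloc(const void *key, size_t key_len, int ifindex)
{
    struct toy_neigh *n = calloc(1, sizeof(*n) + key_len);

    if (!n)
        return NULL;
    n->ifindex = ifindex;
    memcpy(n->primary_key, key, key_len);
    return n;
}

/* An entry matches only if both the device and the layer-3 key match. */
int toy_key_eq(const struct toy_neigh *n, const void *key, size_t key_len,
               int ifindex)
{
    return n->ifindex == ifindex &&
           memcmp(n->primary_key, key, key_len) == 0;
}
```

Because the key length is a per-table value, the same entry structure can serve a 4-byte IPv4 key and a 16-byte IPv6 key without a union or a second allocation.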

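When a neighbor entry is created through __neigh_create, the neighbor constructor is called; for ARP that is arp_constructor. It initializes the entry and installs different output and ops functions depending on the device type and the features the device supports.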


struct neigh_ops {
    int            family;
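    void            (*solicit)(struct neighbour *, struct sk_buff *);// sends a solicitation. When a packet needs an entry that has to be resolved, the packet is queued on arp_queue and solicit is called to send the request.
    void            (*error_report)(struct neighbour *, struct sk_buff *); // reports an error to layer 3 when the entry holds unsent packets and the neighbor turns out to be unreachable
    int            (*output)(struct neighbour *, struct sk_buff *); // generic output function; it checks the neighbor state first, so it is somewhat slower than connected_output
    int            (*connected_output)(struct neighbour *, struct sk_buff *);// used while the entry is NUD_CONNECTED: the neighbor is known to be usable, so the layer-2 header is built and the packet sent directly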
};

// Devices without header cache support
static const struct neigh_ops arp_generic_ops = {
    .family =       AF_INET,
    .solicit =      arp_solicit,
    .error_report =     arp_error_report,
    .output =       neigh_resolve_output,
    .connected_output = neigh_connected_output,
};

// Devices with header cache support
static const struct neigh_ops arp_hh_ops = {
    .family =       AF_INET,
    .solicit =      arp_solicit,
    .error_report =     arp_error_report,
    .output =       neigh_resolve_output,
    .connected_output = neigh_resolve_output,
};
// Devices with no link-layer header handling: packets go straight out through a wrapper around dev_queue_xmit
static const struct neigh_ops arp_direct_ops = {
    .family =       AF_INET,
    .output =       neigh_direct_output,
    .connected_output = neigh_direct_output,
};

static int arp_constructor(struct neighbour *neigh)
{
    __be32 addr = *(__be32 *)neigh->primary_key;
    struct net_device *dev = neigh->dev;
    struct in_device *in_dev;
    struct neigh_parms *parms;

    rcu_read_lock();
    in_dev = __in_dev_get_rcu(dev);
    if (!in_dev) {
        rcu_read_unlock();
        return -EINVAL;
    }

    neigh->type = inet_addr_type_dev_table(dev_net(dev), dev, addr);

    parms = in_dev->arp_parms;
    __neigh_parms_put(neigh->parms);
    neigh->parms = neigh_parms_clone(parms);
    rcu_read_unlock();

    // No header_ops: no layer-2 encapsulation and no ARP is needed; every output function simply calls dev_queue_xmit
    if (!dev->header_ops) {
        neigh->nud_state = NUD_NOARP;
        neigh->ops = &arp_direct_ops;
        neigh->output = neigh_direct_output;
    } else {
        /* Good devices (checked by reading texts, but only Ethernet is
           tested)

           ARPHRD_ETHER: (ethernet, apfddi)
           ARPHRD_FDDI: (fddi)
           ARPHRD_IEEE802: (tr)
           ARPHRD_METRICOM: (strip)
           ARPHRD_ARCNET:
           etc. etc. etc.

           ARPHRD_IPDDP will also work, if author repairs it.
           I did not it, because this driver does not work even
           in old paradigm.
         */
        // Classify the neighbor: multicast, broadcast, point-to-point interfaces, loopback interfaces and interfaces flagged NOARP need no ARP
        if (neigh->type == RTN_MULTICAST) {
            neigh->nud_state = NUD_NOARP;
            arp_mc_map(addr, neigh->ha, dev, 1);
        } else if (dev->flags & (IFF_NOARP | IFF_LOOPBACK)) {
            neigh->nud_state = NUD_NOARP;
            memcpy(neigh->ha, dev->dev_addr, dev->addr_len);
        } else if (neigh->type == RTN_BROADCAST ||
               (dev->flags & IFF_POINTOPOINT)) {
            neigh->nud_state = NUD_NOARP;
            memcpy(neigh->ha, dev->broadcast, dev->addr_len);
        }
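        // The main difference between arp_generic_ops and arp_hh_ops is that the latter caches the layer-2 header, while the former builds it from the hardware address for each packet;
        // arp_hh_ops is only installed on devices that support layer-2 header caching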
        if (dev->header_ops->cache)
            neigh->ops = &arp_hh_ops;
        else
            neigh->ops = &arp_generic_ops;
        // An entry that is already valid gets connected_output; unlike ops->output, ops->connected_output skips the
        // neighbor state checks and is therefore faster
        if (neigh->nud_state & NUD_VALID)
            neigh->output = neigh->ops->connected_output;
        else
            neigh->output = neigh->ops->output;
    }
    return 0;
}
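
The decisions arp_constructor takes can be condensed into a small pure function. The sketch below is a user-space model of the listing above, not kernel code: the device is reduced to a few booleans and every `toy_` / `OPS_` / `ADDR_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_ops { OPS_DIRECT, OPS_HH, OPS_GENERIC };
enum toy_addr_type { ADDR_UNICAST, ADDR_MULTICAST, ADDR_BROADCAST };

struct toy_dev {
    bool has_header_ops;    /* dev->header_ops != NULL */
    bool can_cache_header;  /* dev->header_ops->cache != NULL */
    bool noarp;             /* IFF_NOARP */
    bool loopback;          /* IFF_LOOPBACK */
    bool point_to_point;    /* IFF_POINTOPOINT */
};

struct toy_result {
    bool noarp_state;        /* entry starts in NUD_NOARP */
    enum toy_ops ops;        /* which neigh_ops table is installed */
    bool skips_state_checks; /* output is the connected/direct variant */
};

struct toy_result toy_constructor(const struct toy_dev *dev,
                                  enum toy_addr_type type)
{
    struct toy_result r = { false, OPS_DIRECT, true };

    if (!dev->has_header_ops) {
        /* No layer-2 header at all: no resolution, direct output. */
        r.noarp_state = true;
        return r;
    }

    /* Destinations whose layer-2 address is known without asking. */
    if (type == ADDR_MULTICAST || type == ADDR_BROADCAST ||
        dev->noarp || dev->loopback || dev->point_to_point)
        r.noarp_state = true;

    r.ops = dev->can_cache_header ? OPS_HH : OPS_GENERIC;

    /* NUD_NOARP is part of NUD_VALID, so such entries start on
     * connected_output; a fresh unicast entry starts on ops->output. */
    r.skips_state_checks = r.noarp_state;
    return r;
}
```

An ordinary Ethernet unicast neighbor therefore starts unresolved, with arp_hh_ops installed and the state-checking output function, which is what lets the first packet trigger resolution.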

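In the last stage of sending an IP packet, ip_finish_output2 hands the packet to the network device through the neighbor subsystem.
1. ip_finish_output2 first looks up the neighbor entry; if none exists, it calls __neigh_create to create and initialize one.
2. It then calls dst_neigh_output to send the packet.

  • If the neigh is NUD_CONNECTED and a header is cached, the cached header is prepended and the packet is sent; this is the usual fast path. neigh->hh is set at two points: first, when a data packet is sent after the entry first becomes NUD_CONNECTED (normally one of the queued packets), neigh_hh_init builds the layer-2 header; second, when neigh_update changes the entry and the layer-2 address has changed, neigh_update_hhs refreshes the header.
  • Otherwise neigh->output is called. When the neigh enters NUD_CONNECTED, neigh_connect points neigh->output at neigh->ops->connected_output; the entry already holds the neighbor's layer-2 address, so the L2 header is filled in before dev_queue_xmit is called and the packet goes straight out. When the entry goes from NUD_REACHABLE to NUD_STALE or NUD_DELAY, neigh_suspect forces reachability confirmation by pointing neighbour->output at neigh_ops->output, i.e. neigh_resolve_output. What happens there differs considerably with the state of the neigh; the functions below are annotated in some detail.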
/*
 * This function hands the packet to the network device through the neighbor subsystem.
 */
static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct dst_entry *dst = skb_dst(skb);
    struct rtable *rt = (struct rtable *)dst;
    struct net_device *dev = dst->dev;
    unsigned int hh_len = LL_RESERVED_SPACE(dev);
    struct neighbour *neigh;
    u32 nexthop;

    if (rt->rt_type == RTN_MULTICAST) {
        IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
    } else if (rt->rt_type == RTN_BROADCAST)
        IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);

    /* Be paranoid, rather than too clever. */
    if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
        struct sk_buff *skb2;

        skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
        if (!skb2) {
            kfree_skb(skb);
            return -ENOMEM;
        }
        if (skb->sk)
            skb_set_owner_w(skb2, skb->sk);
        consume_skb(skb);
        skb = skb2;
    }

    if (lwtunnel_xmit_redirect(dst->lwtstate)) {
        int res = lwtunnel_xmit(skb);

        if (res < 0 || res == LWTUNNEL_XMIT_DONE)
            return res;
    }

    rcu_read_lock_bh();
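    // Take the next hop from the route. There are two cases: if a gateway was specified it comes from rt_gateway, otherwise the packet's destination IP is used.
    // This is the difference between configuring a route with and without a nexthop: without one, the ARP request built later asks for the MAC of the destination IP itself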
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);

    // Enter the transmit path of the neighbor subsystem: routing boils down to finding the next hop, and next hops are managed by the neighbor subsystem
    neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
    if (!IS_ERR(neigh)) {
        int res = dst_neigh_output(dst, neigh, skb);

        rcu_read_unlock_bh();
        return res;
    }
    rcu_read_unlock_bh();

    net_dbg_ratelimited("%s: No header cache and no neighbour!\n",
                __func__);
    kfree_skb(skb);
    return -EINVAL;
}


static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,
                   struct sk_buff *skb)
{
    const struct hh_cache *hh;

    if (dst->pending_confirm) {
        unsigned long now = jiffies;

        dst->pending_confirm = 0;
        /* avoid dirtying neighbour */
        if (n->confirmed != now)
            n->confirmed = now;
    }
    /*
     * If the neigh is NUD_CONNECTED and a header is cached, prepend the cached header and send: the fast path.
     * Otherwise call the output function that matches the current nud_state:
     *  - when the neigh enters NUD_CONNECTED, neigh_connect points neigh->output at neigh->ops->connected_output,
     *    e.g. neigh_connected_output, which fills in the L2 header before calling dev_queue_xmit;
     *  - when it goes from NUD_REACHABLE to NUD_STALE, neigh_suspect forces reachability confirmation by pointing
     *    neighbour->output at neigh_ops->output, i.e. neigh_resolve_output.
     */
    hh = &n->hh;
    if ((n->nud_state & NUD_CONNECTED) && hh->hh_len)
        return neigh_hh_output(hh, skb);
    else
        return n->output(n, skb);
}
int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)
{
    int rc = 0;
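    /* neigh_event_send checks the state of the entry when a data packet arrives for transmission. It is also one of the
        event entry points of the neighbor state machine. For example, the first data packet triggers
        NUD_NONE -> NUD_INCOMPLETE, is queued, and causes an ARP request to be sent.
        See the annotations inside the function for details.
        A return value of 1 means the packet cannot be sent right now (it was dropped or queued); 0 means it can be sent.
     */ 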
    if (!neigh_event_send(neigh, skb)) {
        int err;
        struct net_device *dev = neigh->dev;
        unsigned int seq;
        // Build the layer-2 header cache on devices that support it (cache callback not NULL), e.g. eth_header_cache on Ethernet
        if (dev->header_ops->cache && !neigh->hh.hh_len)
            neigh_hh_init(neigh);

        do {
            // Build and prepend the layer-2 header. The cached layer-2 header is consumed in dst_neigh_output,
            // so this point is only reached when no cached header was used
            __skb_pull(skb, skb_network_offset(skb));
            seq = read_seqbegin(&neigh->ha_lock);
            err = dev_hard_header(skb, dev, ntohs(skb->protocol),
                          neigh->ha, NULL, skb->len);
        } while (read_seqretry(&neigh->ha_lock, seq));

        if (err >= 0)
            // transmit the packet
            rc = dev_queue_xmit(skb);
        else
            goto out_kfree_skb;
    }
out:
    return rc;
out_kfree_skb:
    rc = -EINVAL;
    kfree_skb(skb);
    goto out;
}
EXPORT_SYMBOL(neigh_resolve_output);


/* As fast as possible without hh cache */

int neigh_connected_output(struct neighbour *neigh, struct sk_buff *skb)
{
    struct net_device *dev = neigh->dev;
    unsigned int seq;
    int err;
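    // In the connected state no state check is needed
    do {
        // Build and prepend the layer-2 header. The cached layer-2 header is consumed in dst_neigh_output,
        // so this point is only reached when no cached header was used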
        __skb_pull(skb, skb_network_offset(skb));
        seq = read_seqbegin(&neigh->ha_lock);
        err = dev_hard_header(skb, dev, ntohs(skb->protocol),
                      neigh->ha, NULL, skb->len);
    } while (read_seqretry(&neigh->ha_lock, seq));

    if (err >= 0)
        // transmit the packet
        err = dev_queue_xmit(skb);
    else {
        err = -EINVAL;
        kfree_skb(skb);
    }
    return err;
}
EXPORT_SYMBOL(neigh_connected_output);
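
The interplay between the state, the cached header and the output pointer can be modeled in a few lines. This is a user-space sketch with made-up `toy_` names; it only records which path a packet would take and sends nothing. `toy_set_state` plays the role that neigh_connect and neigh_suspect play in the kernel.

```c
#include <assert.h>

#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04
#define NUD_NOARP     0x40
#define NUD_PERMANENT 0x80
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

enum toy_path { PATH_HH_CACHE, PATH_CONNECTED, PATH_RESOLVE };

struct toy_neigh;
typedef enum toy_path (*toy_output_fn)(struct toy_neigh *);

struct toy_neigh {
    unsigned char nud_state;
    unsigned int hh_len;       /* non-zero once a header is cached */
    toy_output_fn output;      /* swapped when the state changes */
};

/* Stand-ins for neigh_connected_output and neigh_resolve_output. */
enum toy_path toy_connected_output(struct toy_neigh *n)
{
    (void)n;
    return PATH_CONNECTED;
}

enum toy_path toy_resolve_output(struct toy_neigh *n)
{
    (void)n;
    return PATH_RESOLVE;
}

/* Change the state and point output at the matching function. */
void toy_set_state(struct toy_neigh *n, unsigned char state)
{
    n->nud_state = state;
    n->output = (state & NUD_CONNECTED) ? toy_connected_output
                                        : toy_resolve_output;
}

/* Same decision as dst_neigh_output: cached header first, else output. */
enum toy_path toy_dst_neigh_output(struct toy_neigh *n)
{
    if ((n->nud_state & NUD_CONNECTED) && n->hh_len)
        return PATH_HH_CACHE;
    return n->output(n);
}
```

Swapping a function pointer on state change keeps the per-packet path free of state checks: the check is paid once per transition instead of once per packet.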

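neigh_event_send, called from neigh_resolve_output above, is important. It handles state transitions out of the non-stable states, and the address resolution triggered by a data packet happens here as well.
A state machine consists of events, states and actions: in a given state, an incoming event triggers an action and moves the machine to a new state. A good way to read one is from its event entry points. The entry points that update the state machine in the neighbor subsystem include:
neigh_timer_handler: state changes caused by timer expiry.
neigh_event_send: state changes caused by a data packet handed to the neighbor for transmission.
neigh_update: state changes caused by received protocol packets. Strictly speaking this is imprecise, because the state decision is taken in its callers, e.g. on receiving an ARP request/reply (arp_process) or on static configuration of an ARP entry (neigh_add).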


static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
{
    unsigned long now = jiffies;
    
    if (neigh->used != now)
        neigh->used = now;
    // For the data-packet event these states are stable and need no action, NUD_CONNECTED in particular;
    // NUD_DELAY and NUD_PROBE wait for a resolution reply or a timer expiry to decide the next state.
    if (!(neigh->nud_state&(NUD_CONNECTED|NUD_DELAY|NUD_PROBE)))
        return __neigh_event_send(neigh, skb);
    return 0;
}

int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
{
    int rc;
    bool immediate_probe = false;

    write_lock_bh(&neigh->lock);

    rc = 0;
    // Neighbor is usable: return 0 so the packet can be sent
    if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
        goto out_unlock_bh;
    // dead entry: unusable, free the packet
    if (neigh->dead)
        goto out_dead;
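    // A simple piece of the state machine; the rest is in neigh_timer_handler
    if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
        // NUD_NONE branch (NUD_FAILED also ends up here)

        // mcast_solicit and app_solicit under /proc/sys/net/ipv4/neigh/eth1/ control how many
        // solicitations may be sent; if their sum is non-zero probing starts, otherwise nud_state = NUD_FAILED and the packet is freed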
        if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
            NEIGH_VAR(neigh->parms, APP_PROBES)) {
            unsigned long next, now = jiffies;

            atomic_set(&neigh->probes,
                   NEIGH_VAR(neigh->parms, UCAST_PROBES));
            // NUD_NONE -> NUD_INCOMPLETE
            neigh->nud_state     = NUD_INCOMPLETE;
            neigh->updated = now;
            next = now + max(NEIGH_VAR(neigh->parms, RETRANS_TIME),
                     HZ/2);
            neigh_add_timer(neigh, next);
            // first packet: trigger an ARP request immediately
            immediate_probe = true;
        } else {
            // NUD_NONE -> NUD_FAILED
            neigh->nud_state = NUD_FAILED;
            neigh->updated = jiffies;
            write_unlock_bh(&neigh->lock);

            kfree_skb(skb);
            return 1;
        }
    } else if (neigh->nud_state & NUD_STALE) {
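        // A packet to send while NUD_STALE switches the entry to NUD_DELAY at once and arms the timer (handler: neigh_timer_handler),
        // which will call neigh_probe --> neigh->ops->solicit to build the ARP request
        // In NUD_CONNECTED | NUD_DELAY | NUD_PROBE | NUD_STALE data packets are sent normally and need not be queued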
        neigh_dbg(2, "neigh %p is delayed\n", neigh);
        neigh->nud_state = NUD_DELAY;
        neigh->updated = jiffies;
        neigh_add_timer(neigh, jiffies +
                NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME));
    }
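    // In NUD_INCOMPLETE the data packet is queued; an ARP request has been (or is about to be) sent, so just wait for the reply or the timer
    // Return 1: the caller does nothing further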
    if (neigh->nud_state == NUD_INCOMPLETE) {
        if (skb) {
            while (neigh->arp_queue_len_bytes + skb->truesize >
                   NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES)) {
                struct sk_buff *buff;

                buff = __skb_dequeue(&neigh->arp_queue);
                if (!buff)
                    break;
                neigh->arp_queue_len_bytes -= buff->truesize;
                kfree_skb(buff);
                NEIGH_CACHE_STAT_INC(neigh->tbl, unres_discards);
            }
            skb_dst_force(skb);
            __skb_queue_tail(&neigh->arp_queue, skb);
            neigh->arp_queue_len_bytes += skb->truesize;
        }
        rc = 1;
    }
out_unlock_bh:
    if (immediate_probe)
        // neigh_probe calls neigh->ops->solicit to send the address resolution request;
        // when immediate_probe is false (e.g. the switch to NUD_DELAY above), neigh_probe is called
        // later, from neigh_timer_handler when the timer expires
        neigh_probe(neigh);
    else
        write_unlock(&neigh->lock);
    local_bh_enable();
    return rc;

out_dead:
    if (neigh->nud_state & NUD_STALE)
        goto out_unlock_bh;
    write_unlock_bh(&neigh->lock);
    kfree_skb(skb);
    return 1;
}
EXPORT_SYMBOL(__neigh_event_send);
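
The queueing rule in the NUD_INCOMPLETE branch is worth isolating: when the byte budget would be exceeded, the oldest queued packets are dropped until the new one fits (or the queue is empty), and the new packet is then always queued. A user-space sketch of just that rule follows; the `toy_` names are hypothetical, packets are represented only by their truesize, and the fixed slot count is an artifact of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_SLOTS 16

struct toy_queue {
    size_t sizes[TOY_QUEUE_SLOTS]; /* truesize per packet, oldest first */
    size_t count;
    size_t len_bytes;              /* like arp_queue_len_bytes */
    size_t limit_bytes;            /* like QUEUE_LEN_BYTES */
    unsigned long discards;        /* like the unres_discards counter */
};

/* Drop the oldest packet; the caller guarantees count > 0. */
void toy_drop_oldest(struct toy_queue *q)
{
    size_t i;

    q->len_bytes -= q->sizes[0];
    for (i = 1; i < q->count; i++)
        q->sizes[i - 1] = q->sizes[i];
    q->count--;
    q->discards++;
}

/* Queue a packet of `truesize` bytes, evicting old ones to make room. */
void toy_enqueue(struct toy_queue *q, size_t truesize)
{
    while (q->count > 0 && q->len_bytes + truesize > q->limit_bytes)
        toy_drop_oldest(q);

    /* Sketch-only limit: the array has a fixed number of slots. */
    if (q->count == TOY_QUEUE_SLOTS)
        toy_drop_oldest(q);

    q->sizes[q->count++] = truesize;
    q->len_bytes += truesize;
}
```

Dropping from the head favors the most recent traffic, and a single packet larger than the budget is still queued once the queue has been emptied.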

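Taking ARP as the example, this is how a received protocol packet changes the state of a neighbor.
arp_process handles an ARP packet in the kernel. In short:
1. On an ARP request:
  1) If tip is a local address, reply with the MAC of the receiving interface (not necessarily the interface that owns tip) and learn an ARP entry for sip.
  2) If tip is not local (the route to tip is a forwarding route) and the receiving device has forwarding enabled, then with proxy ARP enabled the request is proxied: reply with our own MAC so that the traffic is drawn to this device (typically a gateway). An ARP entry for sip is learned as well.
  3) If tip is not a local IP and the receiving device has no proxy ARP configured, or no route to tip can be found at all, only a gratuitous ARP triggers learning an entry for sip. Other requests create no entry, which avoids filling the table with entries that are never used.
  4) An ARP request puts a newly created entry into the STALE state. For an existing entry STALE is requested too, but per neigh_update a valid entry whose hardware address has not changed keeps its current state.
2. On an ARP reply, the neighbor entry is set to NUD_REACHABLE (when the reply is addressed to this host; otherwise NUD_STALE is requested).
3. Side effects of updating the entry state:
  1) neigh->output is switched: ops->connected_output for NUD_CONNECTED, ops->output otherwise.
  2) A new layer-2 address carried in the ARP packet (be it the smac for sip or the dmac for tip) updates the layer-2 address of the entry and the layer-2 header cache.
  3) The state and its timeout timer are reset.
  4) A transition from !NUD_VALID to NUD_VALID means the neighbor has become usable, so the packets queued on the entry are sent.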
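
Two of the decisions in this summary are taken in the final "update our ARP tables" stage of arp_process and can be written as pure functions. The sketch below is a simplified user-space model of that stage only, with hypothetical `toy_` names; the arp_accept flag corresponds to the IN_DEV_ARP_ACCEPT check in the listing that follows.

```c
#include <assert.h>
#include <stdbool.h>

#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04

enum toy_arp_op { TOY_ARP_REQUEST, TOY_ARP_REPLY };

/* State handed to neigh_update: only a reply addressed to this host
 * asserts reachability, everything else merely refreshes the entry. */
int toy_requested_state(enum toy_arp_op op, bool addressed_to_us)
{
    return (op == TOY_ARP_REPLY && addressed_to_us) ? NUD_REACHABLE
                                                    : NUD_STALE;
}

/* Whether a missing entry for sip may be created at this stage. */
bool toy_may_create_entry(bool arp_accept, enum toy_arp_op op,
                          bool sip_is_unicast, bool sip_equals_tip)
{
    bool is_garp;

    if (!arp_accept)
        return false;   /* unsolicited ARP never creates entries */

    /* A gratuitous ARP request announces the sender's own address. */
    is_garp = op == TOY_ARP_REQUEST && sip_equals_tip && sip_is_unicast;

    return (op == TOY_ARP_REPLY && sip_is_unicast) || is_garp;
}
```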

static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct in_device *in_dev = __in_dev_get_rcu(dev);
    struct arphdr *arp;
    unsigned char *arp_ptr;
    struct rtable *rt;
    unsigned char *sha;
    __be32 sip, tip;
    u16 dev_type = dev->type;
    int addr_type;
    struct neighbour *n;
    struct dst_entry *reply_dst = NULL;
    bool is_garp = false;

    /* arp_rcv below verifies the ARP header and verifies the device
     * is ARP'able.
     */

    if (!in_dev)
        goto out_free_skb;

    arp = arp_hdr(skb);
    // validate the ARP packet
    switch (dev_type) {
    default:
        if (arp->ar_pro != htons(ETH_P_IP) ||
            htons(dev_type) != arp->ar_hrd)
            goto out_free_skb;
        break;
    case ARPHRD_ETHER:
    case ARPHRD_FDDI:
    case ARPHRD_IEEE802:
        /*
         * ETHERNET, and Fibre Channel (which are IEEE 802
         * devices, according to RFC 2625) devices will accept ARP
         * hardware types of either 1 (Ethernet) or 6 (IEEE 802.2).
         * This is the case also of FDDI, where the RFC 1390 says that
         * FDDI devices should accept ARP hardware of (1) Ethernet,
         * however, to be more robust, we'll accept both 1 (Ethernet)
         * or 6 (IEEE 802.2)
         */
        if ((arp->ar_hrd != htons(ARPHRD_ETHER) &&
             arp->ar_hrd != htons(ARPHRD_IEEE802)) ||
            arp->ar_pro != htons(ETH_P_IP))
            goto out_free_skb;
        break;
    case ARPHRD_AX25:
        if (arp->ar_pro != htons(AX25_P_IP) ||
            arp->ar_hrd != htons(ARPHRD_AX25))
            goto out_free_skb;
        break;
    case ARPHRD_NETROM:
        if (arp->ar_pro != htons(AX25_P_IP) ||
            arp->ar_hrd != htons(ARPHRD_NETROM))
            goto out_free_skb;
        break;
    }

    /* Understand only these message types */

    if (arp->ar_op != htons(ARPOP_REPLY) &&
        arp->ar_op != htons(ARPOP_REQUEST))
        goto out_free_skb;

/*
 *  Extract fields
 */
    // extract the ARP header fields
    arp_ptr = (unsigned char *)(arp + 1);
    sha = arp_ptr;
    arp_ptr += dev->addr_len;
    memcpy(&sip, arp_ptr, 4);
    arp_ptr += 4;
    switch (dev_type) {
#if IS_ENABLED(CONFIG_FIREWIRE_NET)
    case ARPHRD_IEEE1394:
        break;
#endif
    default:
        arp_ptr += dev->addr_len;
    }
    memcpy(&tip, arp_ptr, 4);
/*
 *  Check for bad requests for 127.x.x.x and requests for multicast
 *  addresses.  If this is one such, delete it.
 */
    if (ipv4_is_multicast(tip) ||
        (!IN_DEV_ROUTE_LOCALNET(in_dev) && ipv4_is_loopback(tip)))
        goto out_free_skb;

 /*
  * For some 802.11 wireless deployments (and possibly other networks),
  * there will be an ARP proxy and gratuitous ARP frames are attacks
  * and thus should not be accepted.
  */
    if (sip == tip && IN_DEV_ORCONF(in_dev, DROP_GRATUITOUS_ARP))
        goto out_free_skb;

/*
 *     Special case: We must set Frame Relay source Q.922 address
 */
    if (dev_type == ARPHRD_DLCI)
        sha = dev->broadcast;


    if (arp->ar_op == htons(ARPOP_REQUEST) && skb_metadata_dst(skb))
        reply_dst = (struct dst_entry *)
                iptunnel_metadata_reply(skb_metadata_dst(skb),
                            GFP_ATOMIC);

    /* Special case: IPv4 duplicate address detection packet (RFC2131) */
    // sip == 0: a probe used (for example with DHCP) to check whether an assigned address is already in use
    if (sip == 0) {
        if (arp->ar_op == htons(ARPOP_REQUEST) &&
            inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL &&
            !arp_ignore(in_dev, sip, tip))
            arp_send_dst(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip,
                     sha, dev->dev_addr, sha, reply_dst);
        goto out_consume_skb;
    }
    // ARP request: a route to tip must be found; normally tip is one of our own IPs
    if (arp->ar_op == htons(ARPOP_REQUEST) &&
        ip_route_input_noref(skb, tip, sip, 0, dev) == 0) {

        rt = skb_rtable(skb);
        addr_type = rt->rt_type;
        // local route: the request asks for the layer-2 address of one of our own IPs
        if (addr_type == RTN_LOCAL) {
            int dont_send;
            // two ARP filtering features, each with a matching sysctl
            dont_send = arp_ignore(in_dev, sip, tip);
            if (!dont_send && IN_DEV_ARPFILTER(in_dev))
                dont_send = arp_filter(sip, tip, dev);
            if (!dont_send) {
                // neigh_event_ns learns the sender: it creates or updates the neighbor entry for the source IP and requests the STALE state
                n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
                if (n) {
                    // send the ARP reply; whichever dev actually owns tip, the lladdr is that of the receiving dev
                    arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
                             sip, dev, tip, sha,
                             dev->dev_addr, sha,
                             reply_dst);
                    // neigh->refcnt--; a newly created entry still has at least 1 left (neigh_alloc and __neigh_create each take a reference)
                    neigh_release(n);
                }
            }
            goto out_consume_skb;
        } else if (IN_DEV_FORWARD(in_dev)) {
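            /* The address is not local: the route to tip is a forwarding route and the receiving device forwards. With proxy ARP enabled, act as proxy,
                 i.e. answer with our own MAC so that the traffic is drawn to this device (typically a gateway)
                 net.ipv4.conf.xx.proxy_arp: whether proxy ARP is enabled
                 net.ipv4.conf.xx.proxy_arp_pvlan: the proxy ARP reply is sent out of the interface that received the request
            */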
            if (addr_type == RTN_UNICAST  &&
                (arp_fwd_proxy(in_dev, dev, rt) ||
                 arp_fwd_pvlan(in_dev, dev, rt, sip, tip) ||
                 (rt->dst.dev != dev &&
                  pneigh_lookup(&arp_tbl, net, &tip, dev, 0)))) {
                // again learn the sender: create or update the neighbor entry for the source IP
                n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
                if (n)
                    neigh_release(n);

                if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
                    skb->pkt_type == PACKET_HOST ||
                    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) {
                    // send the proxy ARP reply
                    arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
                             sip, dev, tip, sha,
                             dev->dev_addr, sha,
                             reply_dst);
                } else {
                    // delayed proxy ARP: the skb is put on proxy_queue and handled when its timer fires
                    pneigh_enqueue(&arp_tbl,
                               in_dev->arp_parms, skb);
                    goto out_free_dst;
                }
                goto out_consume_skb;
            }
        }
    }

    /* Update our ARP tables */
    // 1. ARP reply: update the neighbor state
    // 2. Some ARP requests, e.g. no route to tip, or tip not local and proxy ARP off. No reply is sent, but the neighbor state is still updated
    n = __neigh_lookup(&arp_tbl, &sip, dev, 0);

    if (IN_DEV_ARP_ACCEPT(in_dev)) {
        unsigned int addr_type = inet_addr_type_dev_table(net, dev, sip);

        /* Unsolicited ARP is not accepted by default.
           It is possible, that this option should be enabled for some
           devices (strip is candidate)
         */
        is_garp = arp->ar_op == htons(ARPOP_REQUEST) && tip == sip &&
              addr_type == RTN_UNICAST;
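        // If no neighbor entry exists yet, an ARP reply creates one.
        // An ARP request that gets here either has no route to tip or a non-local tip; only a gratuitous ARP request creates an entry,
        // everything else is ignored, otherwise many entries could be created that are never used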
        if (!n &&
            ((arp->ar_op == htons(ARPOP_REPLY)  &&
                addr_type == RTN_UNICAST) || is_garp))
            n = __neigh_lookup(&arp_tbl, &sip, dev, 1);
    }

    if (n) {
        int state = NUD_REACHABLE;
        int override;

        /* If several different ARP replies follows back-to-back,
           use the FIRST one. It is possible, if several proxy
           agents are active. Taking the first reply prevents
           arp trashing and chooses the fastest router.
         */
        override = time_after(jiffies,
                      n->updated +
                      NEIGH_VAR(n->parms, LOCKTIME)) ||
               is_garp;

        /* Broadcast replies and request packets
           do not assert neighbour reachability.
         */
        // a reply addressed to this host sets the neighbor to NUD_REACHABLE; requests (and broadcast replies) ask for NUD_STALE
        if (arp->ar_op != htons(ARPOP_REPLY) ||
            skb->pkt_type != PACKET_HOST)
            state = NUD_STALE;
        neigh_update(n, sha, state,
                 override ? NEIGH_UPDATE_F_OVERRIDE : 0);
        neigh_release(n);
    }

out_consume_skb:
    consume_skb(skb);

out_free_dst:
    dst_release(reply_dst);
    return NET_RX_SUCCESS;

out_free_skb:
    kfree_skb(skb);
    return NET_RX_DROP;
}

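/*
    Update the neighbor state and reset the timer,
    update the layer-2 address and the layer-2 header cache,
    switch the output function used for data packets,
    and send the queued data packets once the neighbor becomes usable
*/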
int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
         u32 flags)
{
    u8 old;
    int err;
    int notify = 0;
    struct net_device *dev;
    int update_isrouter = 0;

    write_lock_bh(&neigh->lock);

    dev    = neigh->dev;
    old    = neigh->nud_state;
    err    = -EPERM;

    if (!(flags & NEIGH_UPDATE_F_ADMIN) &&
        (old & (NUD_NOARP | NUD_PERMANENT)))
        goto out;
    if (neigh->dead)
        goto out;
    // The new state is not valid (e.g. NUD_FAILED): release resources (timer, queued packets)
    if (!(new & NUD_VALID)) {
        neigh_del_timer(neigh);
        if (old & NUD_CONNECTED)
            neigh_suspect(neigh);
        neigh->nud_state = new;
        err = 0;
        notify = old & NUD_VALID;
        if ((old & (NUD_INCOMPLETE | NUD_PROBE)) &&
            (new & NUD_FAILED)) {
            // in (NUD_INCOMPLETE | NUD_PROBE) data packets or ARP packets may be queued on the entry; release them here
            neigh_invalidate(neigh);
            notify = 1;
        }
        goto out;
    }

    /* Compare new lladdr with cached one */
    // pick the layer-2 address to use
    if (!dev->addr_len) {
        /* First case: device needs no address. */
        lladdr = neigh->ha;
    } else if (lladdr) {
        /* The second case: if something is already cached
           and a new address is proposed:
           - compare new & old
           - if they are different, check override flag
         */
        if ((old & NUD_VALID) &&
            !memcmp(lladdr, neigh->ha, dev->addr_len))
            lladdr = neigh->ha;
    } else {
        /* No address is supplied; if we know something,
           use it, otherwise discard the request.
         */
        err = -EINVAL;
        if (!(old & NUD_VALID))
            goto out;
        lladdr = neigh->ha;
    }

    if (new & NUD_CONNECTED)
        neigh->confirmed = jiffies;
    neigh->updated = jiffies;

    /* If entry was valid and address is not changed,
       do not change entry state, if new one is STALE.
     */
    err = 0;
    update_isrouter = flags & NEIGH_UPDATE_F_OVERRIDE_ISROUTER;
    if (old & NUD_VALID) {
        if (lladdr != neigh->ha && !(flags & NEIGH_UPDATE_F_OVERRIDE)) {
            update_isrouter = 0;
            if ((flags & NEIGH_UPDATE_F_WEAK_OVERRIDE) &&
                (old & NUD_CONNECTED)) {
                lladdr = neigh->ha;
                new = NUD_STALE;
            } else
                goto out;
        } else {
            if (lladdr == neigh->ha && new == NUD_STALE &&
                !(flags & NEIGH_UPDATE_F_ADMIN))
                new = old;
        }
    }
    // reset the timer and change the neigh state
    if (new != old) {
        neigh_del_timer(neigh);
        if (new & NUD_PROBE)
            atomic_set(&neigh->probes, 0);
        if (new & NUD_IN_TIMER)
            neigh_add_timer(neigh, (jiffies +
                        ((new & NUD_REACHABLE) ?
                         neigh->parms->reachable_time :
                         0)));
        neigh->nud_state = new;
        notify = 1;
    }
    // update the layer-2 address and the layer-2 header cache
    if (lladdr != neigh->ha) {
        write_seqlock(&neigh->ha_lock);
        memcpy(&neigh->ha, lladdr, dev->addr_len);
        write_sequnlock(&neigh->ha_lock);
        neigh_update_hhs(neigh);
        if (!(new & NUD_CONNECTED))
            neigh->confirmed = jiffies -
                      (NEIGH_VAR(neigh->parms, BASE_REACHABLE_TIME) << 1);
        notify = 1;
    }
    if (new == old)
        goto out;
    // switch neigh->output: ops->connected_output for NUD_CONNECTED, ops->output otherwise
    if (new & NUD_CONNECTED)
        neigh_connect(neigh);
    else
        neigh_suspect(neigh);
    if (!(old & NUD_VALID)) {
        // new is NUD_VALID here; if old was not, the neighbor just went from unusable to usable and the queued data packets can be sent
        struct sk_buff *skb;

        /* Again: avoid dead loop if something went wrong */

        while (neigh->nud_state & NUD_VALID &&
               (skb = __skb_dequeue(&neigh->arp_queue)) != NULL) {
            struct dst_entry *dst = skb_dst(skb);
            struct neighbour *n2, *n1 = neigh;
            write_unlock_bh(&neigh->lock);

            rcu_read_lock();

            /* Why not just use 'neigh' as-is?  The problem is that
             * things such as shaper, eql, and sch_teql can end up
             * using alternative, different, neigh objects to output
             * the packet in the output path.  So what we need to do
             * here is re-lookup the top-level neigh in the path so
             * we can reinject the packet there.
             */
            n2 = NULL;
            if (dst) {
                n2 = dst_neigh_lookup_skb(dst, skb);
                if (n2)
                    n1 = n2;
            }
            n1->output(n1, skb);
            if (n2)
                neigh_release(n2);
            rcu_read_unlock();

            write_lock_bh(&neigh->lock);
        }
        __skb_queue_purge(&neigh->arp_queue);
        neigh->arp_queue_len_bytes = 0;
    }
out:
    if (update_isrouter) {
        neigh->flags = (flags & NEIGH_UPDATE_F_ISROUTER) ?
            (neigh->flags | NTF_ROUTER) :
            (neigh->flags & ~NTF_ROUTER);
    }
    write_unlock_bh(&neigh->lock);

    if (notify)
        neigh_update_notify(neigh);

    return err;
}
EXPORT_SYMBOL(neigh_update);
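
The least obvious part of neigh_update is how it arbitrates between the cached hardware address and a newly proposed one. The sketch below models only that branch (new state valid, an lladdr supplied) as a pure function in user space. The `TOY_F_*` flag values and all `toy_` names are invented for the sketch and do not match the kernel's NEIGH_UPDATE_F_* constants.

```c
#include <assert.h>
#include <stdbool.h>

#define NUD_INCOMPLETE 0x01
#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04
#define NUD_DELAY      0x08
#define NUD_PROBE      0x10
#define NUD_NOARP      0x40
#define NUD_PERMANENT  0x80
#define NUD_VALID      (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                        NUD_PROBE | NUD_STALE | NUD_DELAY)
#define NUD_CONNECTED  (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Toy flag values, chosen for this sketch only. */
#define TOY_F_OVERRIDE      0x1u
#define TOY_F_WEAK_OVERRIDE 0x2u
#define TOY_F_ADMIN         0x4u

struct toy_decision {
    bool ignored;           /* update dropped, entry left untouched */
    bool take_new_lladdr;   /* cached address gets replaced */
    unsigned char state;    /* state the entry ends up in */
};

struct toy_decision toy_update_decide(unsigned char old_state,
                                      unsigned char new_state,
                                      bool lladdr_differs,
                                      unsigned int flags)
{
    struct toy_decision d = { false, lladdr_differs, new_state };

    if (!(old_state & NUD_VALID))
        return d;           /* nothing valid cached: accept as proposed */

    if (lladdr_differs && !(flags & TOY_F_OVERRIDE)) {
        if ((flags & TOY_F_WEAK_OVERRIDE) && (old_state & NUD_CONNECTED)) {
            /* Keep the old address but distrust it from now on. */
            d.take_new_lladdr = false;
            d.state = NUD_STALE;
        } else {
            d.ignored = true;
            d.take_new_lladdr = false;
            d.state = old_state;
        }
    } else if (!lladdr_differs && new_state == NUD_STALE &&
               !(flags & TOY_F_ADMIN)) {
        /* Same address and only STALE requested: do not demote. */
        d.state = old_state;
    }
    return d;
}
```

The last branch is why an ARP request from a neighbor that is already REACHABLE, with an unchanged hardware address, leaves the entry REACHABLE instead of resetting it to STALE.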

