# Network Device Subsystem

## Overview

Call tree:

```c
dev_queue_xmit                    // select the TX queue
  __dev_xmit_skb                  // enqueue on the qdisc
    __qdisc_run
      qdisc_restart               // dequeue an skb
        sch_direct_xmit
          dev_hard_start_xmit     // GSO-related checks; ptype_all packet capture; hand the skb to the NIC driver
            ops->ndo_start_xmit   // driver transmit routine; for the igb NIC this is igb_xmit_frame
```

The neighbor subsystem enters the network device subsystem by calling `dev_queue_xmit`, as sketched below.
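For reference, here is a minimal sketch of that handoff. In the 3.x-era sources this walkthrough follows, `neigh_resolve_output` (net/core/neighbour.c) builds the Ethernet header once the neighbor entry is resolved, then calls `dev_queue_xmit`. This is abridged, with header-cache handling and error paths omitted, and details vary by kernel version:

```c
// file: net/core/neighbour.c (abridged sketch; error paths omitted)
int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)
{
	int rc = 0;

	if (!neigh_event_send(neigh, skb)) {      // neighbor entry is (or becomes) valid
		int err;
		struct net_device *dev = neigh->dev;
		unsigned int seq;

		do {                              // fill in the L2 header with the resolved MAC
			__skb_pull(skb, skb_network_offset(skb));
			seq = read_seqbegin(&neigh->ha_lock);
			err = dev_hard_header(skb, dev, ntohs(skb->protocol),
					      neigh->ha, NULL, skb->len);
		} while (read_seqretry(&neigh->ha_lock, seq));

		if (err >= 0)
			rc = dev_queue_xmit(skb); // enter the network device subsystem
	}
	return rc;
}
```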
## Analysis

```c
// file: net/core/dev.c
/**
 *	dev_queue_xmit - transmit a buffer
 *	@skb: buffer to transmit
 *
 *	Queue a buffer for transmission to a network device. The caller must
 *	have set the device and priority and built the buffer before calling
 *	this function. The function can be called from an interrupt.
 *
 *	A negative errno code is returned on a failure. A success does not
 *	guarantee the frame will be transmitted as it may be dropped due
 *	to congestion or traffic shaping.
 *
 * -----------------------------------------------------------------------------------
 *      I notice this method can also return errors from the queue disciplines,
 *      including NET_XMIT_DROP, which is a positive value.  So, errors can also
 *      be positive.
 *
 *      Regardless of the return value, the skb is consumed, so it is currently
 *      difficult to retry a send to this method.  (You can bump the ref count
 *      before sending to hold a reference for retry if you are careful.)
 *
 *      When calling this method, interrupts MUST be enabled.  This is because
 *      the BH enable code must have IRQs enabled so that it will not deadlock.
 *          --BLG
 */
int dev_queue_xmit(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	struct netdev_queue *txq;
	struct Qdisc *q;
	int rc = -ENOMEM;

	skb_reset_mac_header(skb);

	/* Disable soft irqs for various locks below. Also
	 * stops preemption for RCU.
	 */
	rcu_read_lock_bh();

	skb_update_prio(skb);

	txq = netdev_pick_tx(dev, skb);      // select the TX queue
	q = rcu_dereference_bh(txq->qdisc);  // the queuing discipline attached to this queue

#ifdef CONFIG_NET_CLS_ACT
	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
#endif
	trace_net_dev_queue(skb);
	if (q->enqueue) {
		// Taken whenever the qdisc has a queue. As the comment below says,
		// devices without a queue are typically loopback and tunnel devices.
		rc = __dev_xmit_skb(skb, q, dev, txq);
		goto out;
	}

	/* The device has no queue. Common case for software devices:
	   loopback, all the sorts of tunnels...

	   Really, it is unlikely that netif_tx_lock protection is necessary
	   here.  (f.e. loopback and IP tunnels are clean ignoring statistics
	   counters.)
	   However, it is possible, that they rely on protection
	   made by us here.

	   Check this and shot the lock. It is not prone from deadlocks.
	   Either shot noqueue qdisc, it is even simpler 8)
	 */
	if (dev->flags & IFF_UP) {
		int cpu = smp_processor_id(); /* ok because BHs are off */

		if (txq->xmit_lock_owner != cpu) {

			if (__this_cpu_read(xmit_recursion) > RECURSION_LIMIT)
				goto recursion_alert;

			HARD_TX_LOCK(dev, txq, cpu);

			if (!netif_xmit_stopped(txq)) {
				__this_cpu_inc(xmit_recursion);
				rc = dev_hard_start_xmit(skb, dev, txq);
				__this_cpu_dec(xmit_recursion);
				if (dev_xmit_complete(rc)) {
					HARD_TX_UNLOCK(dev, txq);
					goto out;
				}
			}
			HARD_TX_UNLOCK(dev, txq);
			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
					     dev->name);
		} else {
			/* Recursion is detected! It is possible,
			 * unfortunately
			 */
recursion_alert:
			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
					     dev->name);
		}
	}

	rc = -ENETDOWN;
	rcu_read_unlock_bh();

	kfree_skb(skb);
	return rc;
out:
	rcu_read_unlock_bh();
	return rc;
}
EXPORT_SYMBOL(dev_queue_xmit);
```

`__dev_xmit_skb` either enqueues the skb on the qdisc or, when the qdisc can be bypassed, transmits it directly:

```c
// file: net/core/dev.c
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
				 struct net_device *dev,
				 struct netdev_queue *txq)
{
	spinlock_t *root_lock = qdisc_lock(q);
	bool contended;
	int rc;

	qdisc_pkt_len_init(skb);
	qdisc_calculate_pkt_len(skb, q);
	/*
	 * Heuristic to force contended enqueues to serialize on a
	 * separate lock before trying to get qdisc main lock.
	 * This permits __QDISC_STATE_RUNNING owner to get the lock more often
	 * and dequeue packets faster.
	 */
	contended = qdisc_is_running(q);
	if (unlikely(contended))
		spin_lock(&q->busylock);

	spin_lock(root_lock);
	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
		kfree_skb(skb);
		rc = NET_XMIT_DROP;
	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
		   qdisc_run_begin(q)) {
		// The queuing layer can be bypassed
		/*
		 * This is a work-conserving queue; there are no old skbs
		 * waiting to be sent out; and the qdisc is not running -
		 * xmit the skb directly.
		 */
		if (!(dev->priv_flags & IFF_XMIT_DST_RELEASE))
			skb_dst_force(skb);

		qdisc_bstats_update(q, skb);

		if (sch_direct_xmit(skb, q, dev, txq, root_lock)) {
			if (unlikely(contended)) {
				spin_unlock(&q->busylock);
				contended = false;
			}
			__qdisc_run(q);
		} else
			qdisc_run_end(q);

		rc = NET_XMIT_SUCCESS;
	} else {
		// Normal path: enqueue, then kick the qdisc
		skb_dst_force(skb);
		rc = q->enqueue(skb, q) & NET_XMIT_MASK;  // enqueue
		if (qdisc_run_begin(q)) {
			if (unlikely(contended)) {
				spin_unlock(&q->busylock);
				contended = false;
			}
			__qdisc_run(q);                   // start transmitting
		}
	}
	spin_unlock(root_lock);
	if (unlikely(contended))
		spin_unlock(&q->busylock);
	return rc;
}
```
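Both the bypass branch and the normal enqueue branch above are gated by `qdisc_run_begin`, which is what makes `__QDISC_STATE_RUNNING` an ownership flag: whichever CPU sets it first becomes the one that drains the queue, while everyone else only enqueues. The helpers are small; a sketch from include/net/sch_generic.h of the same era (field names vary across versions):

```c
// file: include/net/sch_generic.h (sketch; details vary by kernel version)
static inline bool qdisc_is_running(const struct Qdisc *qdisc)
{
	return (qdisc->__state & __QDISC___STATE_RUNNING) ? true : false;
}

static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
	if (qdisc_is_running(qdisc))
		return false;                   // someone else is already draining this qdisc
	qdisc->__state |= __QDISC___STATE_RUNNING;
	return true;                            // caller now owns the dequeue loop
}

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	qdisc->__state &= ~__QDISC___STATE_RUNNING;
}
```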
Continuing with `__qdisc_run`:

```c
// file: net/sched/sch_generic.c
void __qdisc_run(struct Qdisc *q)
{
	int quota = weight_p;

	// Loop: take one skb off the queue and transmit it
	while (qdisc_restart(q)) {
		/*
		 * Ordered by possible occurrence: Postpone processing if
		 * 1. we've exceeded packet quota
		 * 2. another process needs the CPU;
		 */
		// Note: up to this point we are still burning the calling
		// process's system time (sys). Processing is postponed if:
		// 1. the quota is used up, or
		// 2. another process needs the CPU
		if (--quota <= 0 || need_resched()) {
			__netif_schedule(q);  // raises a NET_TX_SOFTIRQ softirq
			break;
		}
	}

	qdisc_run_end(q);
}
```

The scenarios that trigger this softirq are analyzed in <https://www.showdoc.com.cn/1832930169049935/10837052896313005>. Here we continue down the transmit path:

```c
// file: net/sched/sch_generic.c
/*
 * NOTE: Called under qdisc_lock(q) with locally disabled BH.
 *
 * __QDISC_STATE_RUNNING guarantees only one CPU can process
 * this qdisc at a time. qdisc_lock(q) serializes queue accesses for
 * this queue.
 *
 * netif_tx_lock serializes accesses to device driver.
 *
 * qdisc_lock(q) and netif_tx_lock are mutually exclusive,
 * if one is grabbed, another must be free.
 *
 * Note, that this procedure can be called by a watchdog timer
 *
 * Returns to the caller:
 *				0  - queue is empty or throttled.
 *				>0 - queue is not empty.
 *
 */
static inline int qdisc_restart(struct Qdisc *q)
{
	struct netdev_queue *txq;
	struct net_device *dev;
	spinlock_t *root_lock;
	struct sk_buff *skb;

	/* Dequeue packet */
	skb = dequeue_skb(q);  // take an skb off the queue
	if (unlikely(!skb))
		return 0;
	WARN_ON_ONCE(skb_dst_is_noref(skb));
	root_lock = qdisc_lock(q);
	dev = qdisc_dev(q);
	txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

	return sch_direct_xmit(skb, q, dev, txq, root_lock);
}

/*
 * Transmit one skb, and handle the return status as required. Holding the
 * __QDISC_STATE_RUNNING bit guarantees that only one CPU can execute this
 * function.
 *
 * Returns to the caller:
 *				0  - queue is empty or throttled.
 *				>0 - queue is not empty.
 */
int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
		    struct net_device *dev, struct netdev_queue *txq,
		    spinlock_t *root_lock)
{
	int ret = NETDEV_TX_BUSY;

	/* And release qdisc */
	spin_unlock(root_lock);

	HARD_TX_LOCK(dev, txq, smp_processor_id());
	if (!netif_xmit_frozen_or_stopped(txq))
		ret = dev_hard_start_xmit(skb, dev, txq);  // continue down the stack

	HARD_TX_UNLOCK(dev, txq);

	spin_lock(root_lock);

	if (dev_xmit_complete(ret)) {
		/* Driver sent out skb successfully or skb was consumed */
		ret = qdisc_qlen(q);
	} else if (ret == NETDEV_TX_LOCKED) {
		/* Driver try lock failed */
		ret = handle_dev_cpu_collision(skb, txq, q);
	} else {
		/* Driver returned NETDEV_TX_BUSY - requeue skb */
		if (unlikely(ret != NETDEV_TX_BUSY))
			net_warn_ratelimited("BUG %s code %d qlen %d\n",
					     dev->name, ret, q->q.qlen);

		ret = dev_requeue_skb(skb, q);
	}

	if (ret && netif_xmit_frozen_or_stopped(txq))
		ret = 0;

	return ret;
}
```
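`dequeue_skb` is worth a glance because of the requeue case above: when the driver returns `NETDEV_TX_BUSY`, `dev_requeue_skb` parks the skb in `q->gso_skb`, and the next dequeue must hand that skb back out before touching the real queue. A sketch from net/sched/sch_generic.c of the same era (may differ in other versions):

```c
// file: net/sched/sch_generic.c (sketch; may differ by kernel version)
static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
{
	struct sk_buff *skb = q->gso_skb;   // a previously requeued skb, if any

	if (unlikely(skb)) {
		/* check the reason of requeuing without tx lock first */
		struct netdev_queue *txq;
		txq = netdev_get_tx_queue(qdisc_dev(q), skb_get_queue_mapping(skb));
		if (!netif_xmit_frozen_or_stopped(txq)) {
			q->gso_skb = NULL;  // the queue woke up: release the parked skb
			q->q.qlen--;
		} else
			skb = NULL;         // still throttled: transmit nothing
	} else {
		skb = q->dequeue(q);        // normal case: ask the qdisc for the next skb
	}

	return skb;
}
```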
Continuing with `dev_hard_start_xmit`:

```c
// file: net/core/dev.c
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
			struct netdev_queue *txq)
{
	const struct net_device_ops *ops = dev->netdev_ops;
	int rc = NETDEV_TX_OK;
	unsigned int skb_len;

	if (likely(!skb->next)) {
		netdev_features_t features;

		/*
		 * If device doesn't need skb->dst, release it right now while
		 * its hot in this cpu cache
		 */
		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
			skb_dst_drop(skb);

		features = netif_skb_features(skb);

		if (vlan_tx_tag_present(skb) &&
		    !vlan_hw_offload_capable(features, skb->vlan_proto)) {
			skb = __vlan_put_tag(skb, skb->vlan_proto,
					     vlan_tx_tag_get(skb));
			if (unlikely(!skb))
				goto out;

			skb->vlan_tci = 0;
		}

		/* If encapsulation offload request, verify we are testing
		 * hardware encapsulation features instead of standard
		 * features for the netdev
		 */
		if (skb->encapsulation)
			features &= dev->hw_enc_features;

		// Note: this checks whether the NIC supports TSO; if it does
		// not, dev_gso_segment splits the skb in software
		if (netif_needs_gso(skb, features)) {
			if (unlikely(dev_gso_segment(skb, features)))
				goto out_kfree_skb;
			if (skb->next)
				goto gso;
		} else {
			if (skb_needs_linearize(skb, features) &&
			    __skb_linearize(skb))
				goto out_kfree_skb;

			/* If packet is not checksummed and device does not
			 * support checksumming for this protocol, complete
			 * checksumming here.
			 */
			if (skb->ip_summed == CHECKSUM_PARTIAL) {
				if (skb->encapsulation)
					skb_set_inner_transport_header(skb,
						skb_checksum_start_offset(skb));
				else
					skb_set_transport_header(skb,
						skb_checksum_start_offset(skb));
				if (!(features & NETIF_F_ALL_CSUM) &&
				     skb_checksum_help(skb))
					goto out_kfree_skb;
			}
		}

		if (!list_empty(&ptype_all))
			dev_queue_xmit_nit(skb, dev);  // ptype_all packet capture

		skb_len = skb->len;
		// Call the driver to transmit the data. For the igb NIC,
		// dev->netdev_ops = &igb_netdev_ops, initialized in igb_probe
		rc = ops->ndo_start_xmit(skb, dev);
		trace_net_dev_xmit(skb, rc, dev, skb_len);
		if (rc == NETDEV_TX_OK)
			txq_trans_update(txq);
		return rc;
	}

gso:
	do {
		struct sk_buff *nskb = skb->next;

		skb->next = nskb->next;
		nskb->next = NULL;

		if (!list_empty(&ptype_all))
			dev_queue_xmit_nit(nskb, dev);

		skb_len = nskb->len;
		rc = ops->ndo_start_xmit(nskb, dev);
		trace_net_dev_xmit(nskb, rc, dev, skb_len);
		if (unlikely(rc != NETDEV_TX_OK)) {
			if (rc & ~NETDEV_TX_MASK)
				goto out_kfree_gso_skb;
			nskb->next = skb->next;
			skb->next = nskb;
			return rc;
		}
		txq_trans_update(txq);
		if (unlikely(netif_xmit_stopped(txq) && skb->next))
			return NETDEV_TX_BUSY;
	} while (skb->next);

out_kfree_gso_skb:
	if (likely(skb->next == NULL)) {
		skb->destructor = DEV_GSO_CB(skb)->destructor;
		consume_skb(skb);
		return rc;
	}
out_kfree_skb:
	kfree_skb(skb);
out:
	return rc;
}
```

For the igb NIC, `dev->netdev_ops = &igb_netdev_ops`, initialized in the driver's `igb_probe` function. `igb_netdev_ops` is defined as follows:

```c
// file: igb_main.c
static const struct net_device_ops igb_netdev_ops = {
	.ndo_open		= igb_open,
	.ndo_stop		= igb_close,
	.ndo_start_xmit		= igb_xmit_frame,  // the transmit entry point
	.ndo_get_stats64	= igb_get_stats64,
	.ndo_set_rx_mode	= igb_set_rx_mode,
	.ndo_set_mac_address	= igb_set_mac,
	.ndo_change_mtu		= igb_change_mtu,
	.ndo_do_ioctl		= igb_ioctl,
	.ndo_tx_timeout		= igb_tx_timeout,
	.ndo_validate_addr	= eth_validate_addr,
	.ndo_vlan_rx_add_vid	= igb_vlan_rx_add_vid,
	.ndo_vlan_rx_kill_vid	= igb_vlan_rx_kill_vid,
	.ndo_set_vf_mac		= igb_ndo_set_vf_mac,
	.ndo_set_vf_vlan	= igb_ndo_set_vf_vlan,
	.ndo_set_vf_tx_rate	= igb_ndo_set_vf_bw,
	.ndo_set_vf_spoofchk	= igb_ndo_set_vf_spoofchk,
	.ndo_get_vf_config	= igb_ndo_get_vf_config,
#ifdef CONFIG_NET_POLL_CONTROLLER
	.ndo_poll_controller	= igb_netpoll,
#endif
	.ndo_fix_features	= igb_fix_features,
	.ndo_set_features	= igb_set_features,
};
```

So for the igb NIC, the `ndo_start_xmit` above is `igb_xmit_frame`.
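`igb_xmit_frame` itself mostly validates the skb and hands it to the ring-level routine; a sketch along these lines, abridged from the 3.x-era driver (details vary by version):

```c
// file: igb_main.c (abridged sketch; may differ by driver version)
static netdev_tx_t igb_xmit_frame(struct sk_buff *skb,
				  struct net_device *netdev)
{
	struct igb_adapter *adapter = netdev_priv(netdev);

	if (test_bit(__IGB_DOWN, &adapter->state)) {
		dev_kfree_skb_any(skb);    // interface is going down: drop
		return NETDEV_TX_OK;
	}

	if (skb->len <= 0) {
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

	/* The minimum packet size with TCTL.PSP set is 17 so pad the skb
	 * in order to meet this minimum size requirement.
	 */
	if (unlikely(skb->len < 17)) {
		if (skb_pad(skb, 17 - skb->len))
			return NETDEV_TX_OK;
		skb->len = 17;
		skb_set_tail_pointer(skb, 17);
	}

	// Pick the TX ring matching the skb's queue mapping, then build the
	// descriptors and ring the hardware doorbell in igb_xmit_frame_ring
	return igb_xmit_frame_ring(skb, igb_tx_queue_mapping(adapter, skb));
}
```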
