Network Layer: IP
COMS W6998
Spring 2010
Erich Nahum
Outline
IP Layer Architecture
Netfilter
Receive Path
Send Path
Forwarding (Routing) Path
Recall what IP Does
Encapsulate/decapsulate transport-layer messages into IP datagrams
Routes datagrams to destination
Handle static and/or dynamic routing updates
Fragment/reassemble datagrams
Unreliably
IP-packet format (bit positions 0, 3, 7, 15, 31):
Version | IHL | Codepoint | Total length
Fragment-ID | DF | MF | Fragment-Offset
Time to Live | Protocol | Checksum
Source address
Destination address
Options and payload
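The header layout above can be read field by field from a raw byte buffer. The following is a minimal user-space sketch, not kernel code: the struct `ipv4_fields`, the helpers `rd16`/`rd32`, and the function `ipv4_parse` are hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded view of the fixed 20-byte IPv4 header. */
struct ipv4_fields {
    uint8_t  version;         /* bits 0-3 */
    uint8_t  ihl;             /* header length in 32-bit words */
    uint8_t  codepoint;       /* TOS / codepoint byte */
    uint16_t total_length;    /* header plus payload, in bytes */
    uint16_t fragment_id;
    int      df;              /* Don't Fragment flag */
    int      mf;              /* More Fragments flag */
    uint16_t fragment_offset; /* in units of 8 bytes */
    uint8_t  ttl;
    uint8_t  protocol;
    uint16_t checksum;
    uint32_t src;
    uint32_t dst;
};

/* Read big-endian (network order) 16-bit and 32-bit values. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)rd16(p) << 16) | rd16(p + 2);
}

/* Returns 0 on success, -1 if fewer than 20 bytes are available. */
int ipv4_parse(const uint8_t *b, size_t len, struct ipv4_fields *f)
{
    uint16_t frag;

    if (len < 20)
        return -1;
    f->version = b[0] >> 4;
    f->ihl = b[0] & 0x0f;
    f->codepoint = b[1];
    f->total_length = rd16(b + 2);
    f->fragment_id = rd16(b + 4);
    frag = rd16(b + 6);
    f->df = (frag >> 14) & 1;
    f->mf = (frag >> 13) & 1;
    f->fragment_offset = frag & 0x1fff;
    f->ttl = b[8];
    f->protocol = b[9];
    f->checksum = rd16(b + 10);
    f->src = rd32(b + 12);
    f->dst = rd32(b + 16);
    return 0;
}
```

Reading bytes individually keeps the sketch independent of host byte order; the kernel instead overlays struct iphdr on the packet data and converts fields with ntohs()/ntohl().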
IP Implementation Architecture
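[Figure: call graph of the IP layer, between dev.c at the bottom and the higher layers at the top, with ROUTING (Forwarding Information Base) in the middle.]
Receive path (ip_input.c): netif_receive_skb (dev.c) -> ip_rcv -> NF_INET_PRE_ROUTING -> ip_rcv_finish -> ip_route_input -> ip_local_deliver -> NF_INET_LOCAL_IN -> ip_local_deliver_finish -> higher layers
Multicast: ip_rcv_finish -> ip_mr_input
Forwarding path (ip_forward.c): ip_rcv_finish -> ip_forward -> NF_INET_FORWARD -> ip_forward_finish -> ip_output
Send path (ip_output.c): higher layers -> ip_queue_xmit -> ip_route_output_flow -> ip_local_out -> NF_INET_LOCAL_OUT -> ip_output -> NF_INET_POST_ROUTING -> ip_finish_output -> ip_finish_output2 -> ARP, neigh_resolve_output -> dev_queue_xmit (dev.c)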
Sources of IP Packets
1. Packets arrive on an interface and are
passed to the ip_rcv() function.
2. TCP/UDP packets are packed into an IP
packet and passed down to IP via
ip_queue_xmit().
3. The IP layer generates IP packets itself:
1. Multicast packets
2. Fragmentation of a large packet
3. ICMP/IGMP packets.
What is Netfilter?
A framework for packet “mangling”
A protocol defines "hooks" which are well-defined
points in a packet's traversal of that protocol stack.
IPv4 defines 5 hooks
Other protocols with hooks include IPv6, ARP, Bridging, DECnet
At each of these points, the protocol will call the
netfilter framework with the packet and the hook
number.
Parts of the kernel can register to listen to the different
hooks for each protocol.
When a packet is passed to the netfilter framework, it
will call all registered callbacks for that hook and
protocol.
Netfilter IPv4 Hooks
NF_INET_PRE_ROUTING
Incoming packets pass this hook in ip_rcv() before routing
NF_INET_LOCAL_IN
All incoming packets addressed to the local host pass this
hook in ip_local_deliver()
NF_INET_FORWARD
All incoming packets not addressed to the local host pass
this hook in ip_forward()
NF_INET_LOCAL_OUT
All outgoing packets created by this local computer pass
this hook in ip_local_out() (reached from ip_queue_xmit() and ip_build_and_send_pkt())
NF_INET_POST_ROUTING
All outgoing packets (forwarded or locally created) will pass
this hook in ip_output(), before ip_finish_output()
Netfilter Callbacks
Kernel code can register a callback function to be
called when a packet arrives at each hook. Callbacks are
free to manipulate the packet.
The callback can then tell netfilter to do one of five
things:
NF_DROP: drop the packet; don't continue traversal.
NF_ACCEPT: continue traversal as normal.
NF_STOLEN: I've taken over the packet; stop traversal.
NF_QUEUE: queue the packet (usually for userspace
handling).
NF_REPEAT: call this hook again.
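How a list of callbacks and their verdicts interact can be sketched in user space. The verdict values below match those in linux/netfilter.h; everything else (`struct packet`, `hook_fn`, `run_hooks`, and the two sample callbacks) is a hypothetical stand-in for illustration and not the kernel's implementation. NF_QUEUE and NF_STOLEN are simply returned to the caller here.

```c
#include <stddef.h>

/* Verdict values as defined in linux/netfilter.h. */
enum { NF_DROP = 0, NF_ACCEPT = 1, NF_STOLEN = 2, NF_QUEUE = 3, NF_REPEAT = 4 };

/* Hypothetical stand-in for sk_buff. */
struct packet {
    int ttl;
    int visits;
};

/* Hypothetical callback type: inspect or modify the packet, return a verdict. */
typedef int (*hook_fn)(struct packet *pkt);

/*
 * Call every registered callback in order.
 * NF_ACCEPT moves on to the next callback, NF_REPEAT calls the same
 * callback again, and any other verdict stops the traversal.
 */
int run_hooks(const hook_fn *hooks, size_t n, struct packet *pkt)
{
    size_t i = 0;

    while (i < n) {
        int verdict = hooks[i](pkt);

        if (verdict == NF_REPEAT)
            continue;            /* same hook again */
        if (verdict != NF_ACCEPT)
            return verdict;      /* NF_DROP, NF_STOLEN or NF_QUEUE */
        i++;
    }
    return NF_ACCEPT;
}

/* Sample callback: asks to be called twice per packet. */
int count_visits(struct packet *pkt)
{
    pkt->visits++;
    return pkt->visits < 2 ? NF_REPEAT : NF_ACCEPT;
}

/* Sample callback: drops packets whose TTL has run out. */
int drop_expired(struct packet *pkt)
{
    return pkt->ttl <= 0 ? NF_DROP : NF_ACCEPT;
}
```

With the callbacks registered in the order count_visits, drop_expired, a packet with ttl 5 is accepted after count_visits has run twice; a packet with ttl 0 is dropped by the second callback.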
IPTables
A packet selection system called iptables has
been built over the netfilter framework.
It is a direct descendant of ipchains (which came from
ipfwadm, which came from BSD's ipfw), with added
extensibility.
Kernel modules can register a new table, and ask
for a packet to traverse a given table.
This packet selection method is used for:
Packet filtering (the `filter' table),
Network Address Translation (the `nat' table) and
General pre-route packet mangling (the `mangle' table).
Naming Conventions
Methods are frequently broken into two stages, where
the second has the same name with a suffix of finish
or slow. This is typical for networking kernel
code.
E.g., ip_rcv, ip_rcv_finish
In many cases the second method has a “slow”
suffix instead of “finish”; this usually happens when
the first method looks in some cache and the
second method performs a lookup in a more
complex data structure, which is slower.
Receive Path: ip_rcv
Packets that are not addressed to the host (packets received in promiscuous mode) are dropped.
Does some sanity checking:
Does the packet have at least the size of an IP header?
Is this IP Version 4?
Is the checksum correct?
Does the packet have a wrong length?
If the buffer holds more data than the IP total length (skb->len > iph->total_len), trims it with skb_trim(skb, iph->total_len)
Invokes netfilter hook NF_INET_PRE_ROUTING
ip_rcv_finish() is called
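The sanity checks listed for ip_rcv() can be sketched on a raw byte buffer. This is a simplified user-space illustration under stated assumptions, not the kernel routine: `ip_header_checksum` and `ip_rcv_checks` are hypothetical names, and the return convention (total length on success, -1 for drop) is chosen only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement checksum over the header bytes.
 * Summed over a header whose checksum field is already filled in,
 * the result is 0 when the header is intact.
 */
uint16_t ip_header_checksum(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < hdr_len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Returns the IP total length if the packet passes all checks
 * (the caller would trim the buffer to this length), or -1 to drop.
 */
int ip_rcv_checks(const uint8_t *pkt, size_t len)
{
    size_t ihl, total;

    if (len < 20)
        return -1;                      /* shorter than an IP header */
    if ((pkt[0] >> 4) != 4)
        return -1;                      /* not IP version 4 */
    ihl = (size_t)(pkt[0] & 0x0f) * 4;
    if (ihl < 20 || len < ihl)
        return -1;                      /* bad header length */
    if (ip_header_checksum(pkt, ihl) != 0)
        return -1;                      /* checksum mismatch */
    total = ((size_t)pkt[2] << 8) | pkt[3];
    if (total < ihl || len < total)
        return -1;                      /* wrong or truncated length */
    return (int)total;
}
```

A buffer longer than the returned total length corresponds to the trimming case: the extra bytes are padding and would be cut off before the packet moves on.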
Receive Path: ip_rcv_finish
If skb->dst is NULL, ip_route_input() is called to find the route of the packet.
Someone else could have filled it in already.
skb->dst is set to an entry in the routing cache, which stores both the destination IP and the pointer to an entry in the hard header cache (cache for the layer 2 frame packet header).
If the IP header includes options, an ip_options structure is created.
skb->dst->input() now points to the function that should be used to handle the packet (delivered locally or forwarded further):
ip_local_deliver()
ip_forward()
ip_mr_input()
Receive Path: ip_local_deliver
The only task of ip_local_deliver(skb) is to re-assemble fragmented packets by invoking ip_defrag().
The netfilter hook NF_INET_LOCAL_IN is invoked.
This in turn calls ip_local_deliver_finish
Recv: ip_local_deliver_finish
Remove the IP header from skb by __skb_pull(skb, ip_hdrlen(skb));
The protocol ID of the IP header is used to calculate the hash value in the inet_protos hash table.
Packet is passed to a raw socket if one exists (which copies skb)
If transport protocol is found, then the handler is invoked:
tcp_v4_rcv(): TCP
udp_rcv(): UDP
icmp_rcv(): ICMP
igmp_rcv(): IGMP
Otherwise dropped with an ICMP Destination Unreachable message returned.
Hash Table inet_protos
[Figure: inet_protos[MAX_INET_PROTOS] is an array with slots 0, 1, ..., MAX_INET_PROTOS, each pointing to a net_protocol structure.]
Fields of net_protocol: handler, err_handler, gso_send_check, gso_segment, gro_receive, gro_complete
Example entry: handler udp_rcv(), err_handler udp_err()
Example entry: handler igmp_rcv(), err_handler Null
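The lookup in inet_protos can be sketched as an array of handler tables indexed by a hash of the protocol number. This is a user-space illustration, assuming a table size of 256 and a mask-based hash; `struct net_protocol_sketch`, `proto_register`, `proto_deliver`, and `sketch_udp_rcv` are hypothetical names, and only two of the net_protocol fields are modeled.

```c
#include <stddef.h>

/* Assumed table size; a power of two so that the mask below works. */
#define MAX_INET_PROTOS 256

/* Hypothetical, reduced version of net_protocol. */
struct net_protocol_sketch {
    int  (*handler)(const void *payload);      /* e.g. udp_rcv() */
    void (*err_handler)(const void *payload);  /* e.g. udp_err(), may be NULL */
};

static const struct net_protocol_sketch *protos[MAX_INET_PROTOS];

/* Register a protocol; returns -1 if the slot is already taken. */
int proto_register(const struct net_protocol_sketch *p, unsigned char protocol)
{
    unsigned hash = protocol & (MAX_INET_PROTOS - 1);

    if (protos[hash] != NULL)
        return -1;
    protos[hash] = p;
    return 0;
}

/*
 * Dispatch on the protocol field of the IP header.
 * Returns the handler's result, or -1 when no handler is registered
 * (the point at which ICMP Destination Unreachable would be sent).
 */
int proto_deliver(unsigned char protocol, const void *payload)
{
    const struct net_protocol_sketch *p =
        protos[protocol & (MAX_INET_PROTOS - 1)];

    if (p == NULL || p->handler == NULL)
        return -1;
    return p->handler(payload);
}

/* Sample handler standing in for udp_rcv(); returns the protocol number. */
int sketch_udp_rcv(const void *payload)
{
    (void)payload;
    return 17;
}
```

After registering a table with sketch_udp_rcv for protocol 17, proto_deliver(17, ...) reaches the handler, while proto_deliver(6, ...) reports that no transport protocol was found.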
Send Path: ip_queue_xmit (1)
skb->dst is checked to see if it contains a pointer to an entry in the routing cache.
Many packets are routed through the same path, so storing a pointer to a routing entry in skb->dst saves expensive routing table lookup.
If route is not present (e.g., the first packet of a socket), then ip_route_output_flow() is invoked to determine a route.
Send Path: ip_queue_xmit (2)
Header is pushed onto packet:
skb_push(skb, sizeof(header) + length of options);
The fields of the IP header are filled in (version, header length, TOS, TTL, addresses and protocol).
If IP options exist, ip_options_build() is called.
ip_local_out() is invoked.
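Pushing a header in front of existing payload works because the buffer reserves headroom before the data. The sketch below imitates that idea in user space for a header without options; `struct buf_sketch`, `buf_init`, `buf_push`, and `build_ip_header` are hypothetical names and do not mirror the real sk_buff API beyond the push semantics.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical miniature packet buffer with headroom before the data. */
struct buf_sketch {
    uint8_t  storage[256];
    uint8_t *data;   /* start of valid data */
    size_t   len;    /* number of valid bytes */
};

/* Place the payload after `headroom` free bytes; -1 if it does not fit. */
int buf_init(struct buf_sketch *b, size_t headroom,
             const void *payload, size_t n)
{
    if (headroom > sizeof b->storage || n > sizeof b->storage - headroom)
        return -1;
    b->data = b->storage + headroom;
    b->len = n;
    memcpy(b->data, payload, n);
    return 0;
}

/* Like skb_push(): extend the data area towards the front. */
uint8_t *buf_push(struct buf_sketch *b, size_t n)
{
    if ((size_t)(b->data - b->storage) < n)
        return NULL;             /* not enough headroom */
    b->data -= n;
    b->len += n;
    return b->data;
}

/* Push a 20-byte header and fill in the fields named on the slide. */
int build_ip_header(struct buf_sketch *b, uint8_t ttl, uint8_t protocol,
                    const uint8_t src[4], const uint8_t dst[4])
{
    uint8_t *h = buf_push(b, 20);

    if (h == NULL)
        return -1;
    memset(h, 0, 20);
    h[0] = 0x45;                 /* version 4, header length 5 words */
    h[1] = 0;                    /* TOS */
    h[8] = ttl;
    h[9] = protocol;
    memcpy(h + 12, src, 4);      /* source address */
    memcpy(h + 16, dst, 4);      /* destination address */
    return 0;                    /* length and checksum are left to later steps */
}
```

No payload bytes are copied when the header is added; only the data pointer moves, which is the reason transport protocols leave headroom when they allocate the buffer.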
Send Path: ip_local_out
The checksum is computed: ip_send_check(iph)
Netfilter is invoked with NF_INET_LOCAL_OUT using dst_output(), which calls skb->dst->output()
This is ip_output()
If the packet is for the local machine:
dst->output = ip_output
dst->input = ip_local_deliver
ip_output() will send the packet on the loopback device
Then we will go into ip_rcv() and ip_rcv_finish(), but this time dst is NOT null; so we will end in ip_local_deliver().
Send Path: ip_output
ip_output() does very little, essentially an entry into the output path from the forwarding layer.
Updates some stats.
Invokes Netfilter with NF_INET_POST_ROUTING and ip_finish_output()
Send Path: ip_finish_output
Checks message length against the destination MTU
Calls either ip_fragment() or ip_finish_output2()
Latter is actually a very long inline, not a function
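The MTU decision and the resulting fragment layout can be sketched as plain arithmetic. This is an illustration under simplifying assumptions (a fixed header length repeated in every fragment, no options handling), not the logic of ip_fragment(); `struct frag_sketch` and `plan_fragments` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical description of one fragment. */
struct frag_sketch {
    unsigned offset_units;   /* Fragment-Offset field, in units of 8 bytes */
    unsigned payload_len;    /* payload bytes carried by this fragment */
    int      more_fragments; /* MF flag */
};

/*
 * Returns the number of entries written to `out`:
 * 1 when the packet fits into the MTU (no fragmentation needed),
 * more than 1 when it has to be split, -1 on error.
 */
int plan_fragments(unsigned payload_len, unsigned header_len, unsigned mtu,
                   struct frag_sketch *out, size_t max_out)
{
    unsigned per_frag, done = 0;
    size_t n = 0;

    if (mtu <= header_len || max_out == 0)
        return -1;
    if (header_len + payload_len <= mtu) {
        out[0].offset_units = 0;
        out[0].payload_len = payload_len;
        out[0].more_fragments = 0;
        return 1;
    }
    /* Every fragment except the last carries a multiple of 8 bytes. */
    per_frag = (mtu - header_len) & ~7u;
    if (per_frag == 0)
        return -1;
    while (done < payload_len) {
        unsigned chunk = payload_len - done;

        if (chunk > per_frag)
            chunk = per_frag;
        if (n == max_out)
            return -1;           /* caller's array is too small */
        out[n].offset_units = done / 8;
        out[n].payload_len = chunk;
        out[n].more_fragments = (done + chunk < payload_len);
        done += chunk;
        n++;
    }
    return (int)n;
}
```

For a 4000-byte payload, a 20-byte header and an MTU of 1500, this yields three fragments with payloads of 1480, 1480 and 1040 bytes at offsets 0, 185 and 370.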
Send Path: ip_finish_output2
Checks skb for room for MAC header. If not, call skb_realloc_headroom().
Send the packet to a neighbor by:
dst->neighbour->output(skb)
arp_bind_neighbour() sees to it that the L2 address (a.k.a. the MAC address) of the next hop will be known.
These eventually end up in dev_queue_xmit() which passes the packet down to the device.
Forwarding: ip_forward (1)
[Figure: forwarding path through ROUTING (Forwarding Information Base): ip_rcv_finish (ip_input.c) -> ip_forward -> NF_INET_FORWARD -> ip_forward_finish (ip_forward.c) -> ip_output (ip_output.c)]
Does some validation and checking, e.g.:
If skb->pkt_type != PACKET_HOST, drop
If TTL <= 1, the packet is discarded and an ICMP Time Exceeded message is sent back.
If the packet is larger than the MTU (skb->len > mtu) and no fragmentation is allowed (Don't Fragment bit is
set in the IP header), the packet is discarded and the ICMP
message with ICMP_FRAG_NEEDED is sent back.
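The validation order in ip_forward() can be sketched as a small decision function. It assumes that the TTL check rejects packets whose TTL is 1 or less and answers with ICMP Time Exceeded; the PACKET_* values match linux/if_packet.h, while `enum fwd_decision`, `struct fwd_input`, and `forward_checks` are hypothetical names for this illustration.

```c
/* Packet type values as defined in linux/if_packet.h. */
enum { PACKET_HOST = 0, PACKET_BROADCAST = 1, PACKET_MULTICAST = 2,
       PACKET_OTHERHOST = 3 };

/* Hypothetical outcome of the forwarding checks. */
enum fwd_decision {
    FWD_CONTINUE,            /* go on with forwarding */
    FWD_DROP,                /* silently discard */
    FWD_ICMP_TIME_EXCEEDED,  /* discard and report expired TTL */
    FWD_ICMP_FRAG_NEEDED     /* discard and report ICMP_FRAG_NEEDED */
};

/* Hypothetical summary of the fields the checks look at. */
struct fwd_input {
    int      pkt_type;       /* skb->pkt_type */
    unsigned ttl;            /* TTL from the IP header */
    unsigned len;            /* skb->len */
    unsigned mtu;            /* MTU of the outgoing route */
    int      dont_fragment;  /* DF bit from the IP header */
};

enum fwd_decision forward_checks(const struct fwd_input *in)
{
    if (in->pkt_type != PACKET_HOST)
        return FWD_DROP;
    if (in->ttl <= 1)
        return FWD_ICMP_TIME_EXCEEDED;
    if (in->len > in->mtu && in->dont_fragment)
        return FWD_ICMP_FRAG_NEEDED;
    return FWD_CONTINUE;
}
```

A packet that is too large but has the DF bit cleared passes these checks; it is fragmented later on the output path.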
Forwarding: ip_forward (2)
skb_cow(skb,headroom) is called to check whether there is still
sufficient space for the MAC header in the output device. If not,
skb_cow() calls pskb_expand_head() to create sufficient space.
The TTL field of the IP packet is decremented by 1.
ip_decrease_ttl() also incrementally modifies the header checksum.
The netfilter hook NF_INET_FORWARD is invoked.
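Updating the checksum incrementally avoids summing the whole header again: lowering the TTL by 1 lowers its 16-bit header word by 0x0100, so the stored checksum rises by 0x0100 with an end-around carry. The following user-space sketch operates on a raw header in network byte order; `decrease_ttl` and `header_sum_check` are hypothetical names, and the code is modeled on the idea behind ip_decrease_ttl() rather than copied from it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decrement the TTL (byte 8) and patch the checksum (bytes 10-11)
 * without re-summing the header. Returns the new TTL.
 */
uint8_t decrease_ttl(uint8_t *hdr)
{
    uint32_t check = ((uint32_t)hdr[10] << 8) | hdr[11];

    check += 0x0100;             /* TTL is the high byte of its 16-bit word */
    check += (check >= 0xffff);  /* end-around carry */
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)check;
    return --hdr[8];
}

/*
 * Full one's-complement check over the header, used here only to
 * verify the shortcut: the result is 0 for a consistent header.
 */
uint16_t header_sum_check(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < hdr_len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For a header with TTL 0x40 and checksum 0xb861, decrease_ttl() leaves TTL 0x3f and checksum 0xb961, and header_sum_check() over the patched header still returns 0.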
Forwarding: ip_forward_finish
Increments some stats.
Handles any IP options if they exist.
Calls the destination output function via skb->dst->output(skb), which is ip_output()
IP Backup
Recall the IP Header
IP-packet format
Bit positions: 0, 3, 7, 15, 31
Version | IHL | Codepoint | Total length
Fragment-ID | DF | MF | Fragment-Offset
Time to Live | Protocol | Checksum
Source address
Destination address
Options and payload
Recall the sk_buff structure
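[Figure: layout of struct sk_buff, from linux-2.6.31/include/linux/skbuff.h]
List linkage: next, prev (sk_buffs are chained on an sk_buff_head)
sk points to the struct sock, dev points to the net_device
tstamp, ...lots of stuff...
Header pointers: transport_header, network_header, mac_header
Buffer pointers: head, data, tail, end
Packet data, from head to end: "headroom", MAC-Header, IP-Header, UDP-Header, UDP-Data, "tailroom"
skb_shared_info (after the packet data): dataref: 1, nr_frags, ..., destructor_arg
Other fields: truesize, users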
Recall pkt_type in sk_buff
pkt_type: specifies the type of a packet
PACKET_HOST: a packet sent to the local host
PACKET_BROADCAST: a broadcast packet
PACKET_MULTICAST: a multicast packet
PACKET_OTHERHOST: a packet not destined for the
local host, but received in promiscuous mode.
PACKET_OUTGOING: a packet leaving the host
PACKET_LOOPBACK: a packet sent by the local host
to itself.