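ipvs in operation
How ipvs rules work: the principle
How do ipvs rules take effect? Let's start with the principle behind the implementation.
Put simply, ipvs does nothing more than rewrite packet header information to schedule traffic along the path client -> virtual server -> real server. The goal of scheduling is to keep the load across the real servers close to balanced. Two questions follow from this: how packets are modified, and which scheduling policy is used.
Let's look at how packets are modified first. The ipvs implementation in the 2.6 kernel differs somewhat from the original one. To quote Zhang Wensong, the author of ipvs: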
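"We modified the TCP/IP stack in Linux kernels 2.0 and 2.2, intercepting and rewriting/forwarding IP packets at the IP layer, implemented three IP load balancing techniques, and provided the ipvsadm program for configuring and managing virtual servers. In Linux kernels 2.4 and 2.6 we implemented it as a NetFilter module, with much of the code rewritten and further optimized. The current version has been released online and, according to feedback, is already fairly stable."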
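That is clear enough: ipvs relies on netfilter to modify packets. A brief look at how netfilter works is therefore worthwhile, as shown in the figure.
netfilter has five rule chains. Each chain can hold a number of rules, and the rules are ordered (that is, they have a priority). Once a rule matches and its action has been carried out (for a terminating action such as ACCEPT or DROP), processing leaves that chain. The five chains are PREROUTING, INPUT, FORWARD, OUTPUT and POSTROUTING. Connections on a machine fall into three cases: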
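- Connections entering the host from outside pass through PREROUTING -> INPUT
- Connections leaving the host pass through OUTPUT -> POSTROUTING
- Connections forwarded by the host pass through PREROUTING -> FORWARD -> POSTROUTING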
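The three cases above can be written down as a small lookup. This is only an illustrative user-space sketch: the `direction` enum and `chain_path()` are hypothetical names invented here, not kernel symbols.

```c
#include <stddef.h>

/* The five netfilter hook points, named the way the article names its chains. */
enum chain { PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING };

/* Hypothetical classification of a packet relative to this host. */
enum direction { TO_LOCAL, FROM_LOCAL, FORWARDED };

/*
 * Fill `path` with the chains a packet traverses, in order, and return
 * how many entries were written (at most 3).
 */
size_t chain_path(enum direction dir, enum chain path[3])
{
    switch (dir) {
    case TO_LOCAL:
        path[0] = PREROUTING; path[1] = INPUT;
        return 2;
    case FROM_LOCAL:
        path[0] = OUTPUT; path[1] = POSTROUTING;
        return 2;
    case FORWARDED:
        path[0] = PREROUTING; path[1] = FORWARD; path[2] = POSTROUTING;
        return 3;
    }
    return 0;
}
```

Calling `chain_path(FORWARDED, path)` fills `path` with PREROUTING, FORWARD, POSTROUTING, which is the route taken by replies from a real server in VS/NAT mode.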
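The rules in a chain take effect when data passes through that chain (that is, the corresponding functions are called to process it). It looks simple: ipvs, as a netfilter module, would only need to write its rules into these chains.
But wait. If netfilter has many modules that all write rules into the same chain, won't things get messy? How is priority controlled? For this reason the rules in a chain are grouped by purpose, and each group is given an integer priority: the smaller the value, the higher the priority. Rules of the same type are ordered by position (a linked list; the earlier the entry, the higher its priority).
netfilter itself serves three purposes, so its rules come in three types, represented by three tables: the filter table (filtering), the nat table (rewriting packet headers) and the mangle table (altering packets). The ipvs module is, in effect, like adding a new ipvs table to netfilter. For more on netfilter, see reference 1.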
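How ipvs rules are carried out
Whenever a packet of a connection passes through a netfilter chain, NF_HOOK() is invoked. It consults the global variable nf_hooks, which holds, for each chain, the functions registered by the netfilter tables (filter, nat, mangle, plus additions such as ipvs). The entries for the chain in question are then traversed and the functions are called in priority order; each function matches and processes the packet against its own rules for that chain.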
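To make the priority ordering concrete, here is a minimal user-space sketch of a chain that keeps its hooks sorted as they are registered. `struct hook_ops`, `struct chain` and `register_hook()` are simplified stand-ins made up for this illustration; the kernel's own structures and list handling are more involved.

```c
#include <stddef.h>

#define MAX_HOOKS 8

/* Simplified stand-in for the kernel's struct nf_hook_ops. */
struct hook_ops {
    const char *name;   /* label used only by this sketch */
    int priority;       /* smaller value runs earlier */
};

/* One chain: hooks kept sorted by ascending priority value. */
struct chain {
    struct hook_ops hooks[MAX_HOOKS];
    size_t count;
};

/*
 * Insert `op` so the array stays sorted by priority. An entry with the
 * same priority as an existing one is placed after it, so registration
 * order breaks ties. Returns 0 on success, -1 if the chain is full.
 */
int register_hook(struct chain *c, struct hook_ops op)
{
    size_t pos = 0, i;

    if (c->count >= MAX_HOOKS)
        return -1;

    /* Skip every entry that must run before (or tie with) the new one. */
    while (pos < c->count && c->hooks[pos].priority <= op.priority)
        pos++;

    /* Shift the tail up by one slot to make room at `pos`. */
    for (i = c->count; i > pos; i--)
        c->hooks[i] = c->hooks[i - 1];

    c->hooks[pos] = op;
    c->count++;
    return 0;
}
```

Registering a hook with priority 100 and then one with priority 99 leaves the second one at the front of the array, so a traversal from index 0 calls it first.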
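Remember nf_register_hook(), covered at the end of the previous part? It is precisely because ipvs calls ret = nf_register_hooks(ip_vs_ops, ARRAY_SIZE(ip_vs_ops)); to add entries to nf_hooks that netfilter executes the ipvs rules. Let's see what data is added.
The contents of ip_vs_ops are: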
net/ipv4/ipvs/ip_vs_core.c
- static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
- /* ... or VS/NAT(change destination), so that filtering rules can be
- * applied to IPVS. */
- {
- .hook = ip_vs_in,
- .owner = THIS_MODULE,
- .pf = PF_INET,
- .hooknum = NF_INET_LOCAL_IN,
- .priority = 100,
- },
-
- {
- .hook = ip_vs_out,
- .owner = THIS_MODULE,
- .pf = PF_INET,
- .hooknum = NF_INET_FORWARD,
- .priority = 100,
- },
-
- /* ... destined for 0.0.0.0/0, which is for incoming IPVS connections */
- {
- .hook = ip_vs_forward_icmp,
- .owner = THIS_MODULE,
- .pf = PF_INET,
- .hooknum = NF_INET_FORWARD,
- .priority = 99,
- },
-
- {
- .hook = ip_vs_post_routing,
- .owner = THIS_MODULE,
- .pf = PF_INET,
- .hooknum = NF_INET_POST_ROUTING,
- .priority = NF_IP_PRI_NAT_SRC-1,
- },
- };
As you can see, ipvs adds four handler functions across three chains: INPUT, FORWARD and POSTROUTING. Let's go through them one by one.
ip_vs_in()
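ip_vs_in() is placed on the INPUT chain and inspects every packet entering the local machine. Its job is to hand connections addressed to the vs (virtual server) over to an rs (real server), which is what achieves load balancing. How a real server is chosen depends on the scheduling algorithm configured; how the packet is modified depends on the VS type, of which there are three: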
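- VS/NAT rewrites d_addr and possibly d_port on the way in; s_addr is rewritten on the reply path (see ip_vs_out())
- VS/DR leaves the IP header unchanged and only redirects the frame to the real server's MAC address
- VS/TUN prepends a new IP header to the original packet, also known as encapsulation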
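The difference between the three types is easier to see side by side. The following is a user-space sketch under the usual description of the three LVS forwarding methods, not kernel code: `struct packet`, `struct real_server` and `forward_to_real_server()` are hypothetical, heavily simplified names.

```c
#include <stdint.h>
#include <string.h>

enum fwd_method { FWD_NAT, FWD_DR, FWD_TUN };

/* Heavily simplified packet: only the fields the three methods touch. */
struct packet {
    uint8_t  dmac[6];       /* destination MAC of the frame */
    uint32_t saddr, daddr;  /* IP source and destination address */
    uint16_t dport;         /* transport destination port */
    int      encapsulated;  /* 1 once an outer IP header was added */
    uint32_t outer_daddr;   /* destination of the outer header (TUN only) */
};

/* Hypothetical description of the chosen real server. */
struct real_server {
    uint8_t  mac[6];
    uint32_t addr;
    uint16_t port;
};

/* Apply the inbound (client -> real server) change for method `m`. */
void forward_to_real_server(struct packet *p, const struct real_server *rs,
                            enum fwd_method m)
{
    switch (m) {
    case FWD_NAT:   /* rewrite destination address and port */
        p->daddr = rs->addr;
        p->dport = rs->port;
        break;
    case FWD_DR:    /* IP header untouched, only the frame is redirected */
        memcpy(p->dmac, rs->mac, sizeof(p->dmac));
        break;
    case FWD_TUN:   /* original packet kept, an outer header is added */
        p->encapsulated = 1;
        p->outer_daddr = rs->addr;
        break;
    }
}
```

Only FWD_NAT changes `daddr` and `dport`, which is why VS/NAT replies must pass back through the director to have their source address restored, while VS/DR and VS/TUN real servers can answer the client directly.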
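This function handles every packet whose destination is the local machine (the director). It derives the protocol structure pp (ip_vs_protocol) from the skb (sk_buff), determines which skbs match the rules of a virtual service svc (ip_vs_service), and finds the corresponding cp (ip_vs_conn). If none is found, it creates a new cp and adds it to the ip_vs_conn_tab table. Finally the data is sent on via cp->packet_xmit(). Along the way many things are updated, such as the connection state and the counters of pp, cp and skb.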
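The lookup-or-create flow just described can be sketched in user space as follows. Everything here is an assumption made for illustration: `conn_lookup()`, `conn_schedule()` and `dispatch()` are invented names that only loosely mirror conn_in_get and conn_schedule, the table is scanned linearly instead of hashed, and the scheduler is plain round-robin.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CONNS 64

struct conn {
    uint32_t caddr;   /* client address */
    uint16_t cport;   /* client port */
    size_t   dest;    /* index of the real server serving this client */
    int      in_use;
};

struct service {
    size_t n_dests;   /* number of real servers */
    size_t next;      /* round-robin cursor */
    struct conn tab[MAX_CONNS];
};

/* Find the entry for this client, or NULL if there is none yet. */
static struct conn *conn_lookup(struct service *svc, uint32_t caddr,
                                uint16_t cport)
{
    for (size_t i = 0; i < MAX_CONNS; i++)
        if (svc->tab[i].in_use && svc->tab[i].caddr == caddr &&
            svc->tab[i].cport == cport)
            return &svc->tab[i];
    return NULL;
}

/* Pick a real server round-robin and record a new entry for the client. */
static struct conn *conn_schedule(struct service *svc, uint32_t caddr,
                                  uint16_t cport)
{
    if (svc->n_dests == 0)
        return NULL;                      /* no real server configured */
    for (size_t i = 0; i < MAX_CONNS; i++) {
        if (!svc->tab[i].in_use) {
            svc->tab[i] = (struct conn){ caddr, cport, svc->next, 1 };
            svc->next = (svc->next + 1) % svc->n_dests;
            return &svc->tab[i];
        }
    }
    return NULL;                          /* table full */
}

/* Return the real-server index for the packet, or -1 if none was chosen. */
int dispatch(struct service *svc, uint32_t caddr, uint16_t cport)
{
    struct conn *cp = conn_lookup(svc, caddr, cport);

    if (!cp)
        cp = conn_schedule(svc, caddr, cport);
    return cp ? (int)cp->dest : -1;
}
```

The important property is that a second packet from the same client address and port reuses the existing entry, so an established connection keeps going to the same real server no matter what the scheduler would pick next.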
net/ipv4/ipvs/ip_vs_core.c
- /*
- * Check if it's for virtual services, look it up,
- * and send it on its way...
- */
- static unsigned int
- ip_vs_in(unsigned int hooknum, struct sk_buff *skb,
- const struct net_device *in, const struct net_device *out,
- int (*okfn)(struct sk_buff *))
- {
- struct iphdr *iph;
- struct ip_vs_protocol *pp;
- struct ip_vs_conn *cp;
- int ret, restart;
- int ihl;
-
- /*
- * Big tappo: only PACKET_HOST (neither loopback nor mcasts)
- * ... don't know why 1st test DOES NOT include 2nd (?)
- */
- if (unlikely(skb->pkt_type != PACKET_HOST
- || skb->dev->flags & IFF_LOOPBACK || skb->sk)) {
- IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
- skb->pkt_type,
- ip_hdr(skb)->protocol,
- NIPQUAD(ip_hdr(skb)->daddr));
- return NF_ACCEPT;
- }
-
- iph = ip_hdr(skb);
- if (unlikely(iph->protocol == IPPROTO_ICMP)) {
- int related, verdict = ip_vs_in_icmp(skb, &related, hooknum);
-
- if (related)
- return verdict;
- iph = ip_hdr(skb);
- }
-
-
- pp = ip_vs_proto_get(iph->protocol);
- if (unlikely(!pp))
- return NF_ACCEPT;
-
- ihl = iph->ihl << 2;
-
- /*
- * Check if the packet belongs to an existing connection entry
- */
- cp = pp->conn_in_get(skb, pp, iph, ihl, 0);
-
- if (unlikely(!cp)) {
- int v;
-
- if (!pp->conn_schedule(skb, pp, &v, &cp))
- return v;
- }
-
- if (unlikely(!cp)) {
-
- IP_VS_DBG_PKT(12, pp, skb, 0,
- "packet continues traversal as normal");
- return NF_ACCEPT;
- }
-
- IP_VS_DBG_PKT(11, pp, skb, 0, "Incoming packet");
-
-
- if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
-
-
- if (sysctl_ip_vs_expire_nodest_conn) {
-
- ip_vs_conn_expire_now(cp);
- }
-
- /* ... drop the packet. */
- __ip_vs_conn_put(cp);
- return NF_DROP;
- }
-
- ip_vs_in_stats(cp, skb);
- restart = ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pp);
- if (cp->packet_xmit)
- ret = cp->packet_xmit(skb, cp, pp);
-
- else {
- IP_VS_DBG_RL("warning: packet_xmit is null");
- ret = NF_ACCEPT;
- }
-
- /* ... to be synchronized
- *
- * Sync connection if it is about to close to
- * encorage the standby servers to update the connections timeout
- */
- atomic_inc(&cp->in_pkts);
- if ((ip_vs_sync_state & IP_VS_STATE_MASTER) &&
- (((cp->protocol != IPPROTO_TCP ||
- cp->state == IP_VS_TCP_S_ESTABLISHED) &&
- (atomic_read(&cp->in_pkts) % sysctl_ip_vs_sync_threshold[1]
- == sysctl_ip_vs_sync_threshold[0])) ||
- ((cp->protocol == IPPROTO_TCP) && (cp->old_state != cp->state) &&
- ((cp->state == IP_VS_TCP_S_FIN_WAIT) ||
- (cp->state == IP_VS_TCP_S_CLOSE)))))
- ip_vs_sync_conn(cp);
- cp->old_state = cp->state;
-
- ip_vs_conn_put(cp);
- return ret;
- }
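The long condition near the end of ip_vs_in() decides when a connection is synchronized to the backup director. Below is a user-space restatement of that condition, written as a sketch: `struct conn_view`, the shortened state names and `needs_sync()` are hypothetical, and the IP_VS_STATE_MASTER check from the listing is left out for brevity.

```c
#include <stdbool.h>

/* Shortened state names for this sketch; the kernel uses IP_VS_TCP_S_*. */
enum tcp_state { S_ESTABLISHED, S_FIN_WAIT, S_CLOSE, S_OTHER };

struct conn_view {
    bool is_tcp;
    enum tcp_state state, old_state;
    unsigned in_pkts;   /* packets seen so far, already incremented */
};

/*
 * threshold[0] is the remainder to hit, threshold[1] the period
 * (must be greater than 0), matching how the listing uses
 * sysctl_ip_vs_sync_threshold.
 */
bool needs_sync(const struct conn_view *c, const unsigned threshold[2])
{
    /* Periodic sync: non-TCP, or TCP in ESTABLISHED, every N packets. */
    bool periodic = (!c->is_tcp || c->state == S_ESTABLISHED) &&
                    (c->in_pkts % threshold[1] == threshold[0]);

    /* Closing sync: a TCP connection that just moved to FIN_WAIT or CLOSE. */
    bool closing = c->is_tcp && c->old_state != c->state &&
                   (c->state == S_FIN_WAIT || c->state == S_CLOSE);

    return periodic || closing;
}
```

With example threshold values of {3, 50}, an established TCP connection is synchronized on its 3rd, 53rd, 103rd packet and so on, plus once more at the moment it enters FIN_WAIT or CLOSE.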
ip_vs_out()
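This function sits on the FORWARD chain, so every skb forwarded through the local machine is processed by it. In VS/NAT mode, the data an rs on the internal network returns to the client is forwarded through the gateway (this machine). At that point the source address of the packet has to be rewritten to the virtual service address (cp->vaddr in the code), that is, the address the client originally connected to. Only then can the connection continue; otherwise the client would see replies coming from the internal address of the rs, which it never contacted and cannot reach.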
net/ipv4/ipvs/ip_vs_core.c
- /*
- * It is hooked at the NF_INET_FORWARD chain, used only for VS/NAT.
- * Check if outgoing packet belongs to the established ip_vs_conn,
- * rewrite addresses of the packet and send it on its way...
- */
- static unsigned int
- ip_vs_out(unsigned int hooknum, struct sk_buff *skb,
- const struct net_device *in, const struct net_device *out,
- int (*okfn)(struct sk_buff *))
- {
- struct iphdr *iph;
- struct ip_vs_protocol *pp;
- struct ip_vs_conn *cp;
- int ihl;
-
- EnterFunction(11);
-
- if (skb->ipvs_property)
- return NF_ACCEPT;
-
- iph = ip_hdr(skb);
- if (unlikely(iph->protocol == IPPROTO_ICMP)) {
- int related, verdict = ip_vs_out_icmp(skb, &related);
-
- if (related)
- return verdict;
- iph = ip_hdr(skb);
- }
-
- pp = ip_vs_proto_get(iph->protocol);
- if (unlikely(!pp))
- return NF_ACCEPT;
-
-
- if (unlikely(iph->frag_off & htons(IP_MF|IP_OFFSET) &&
- !pp->dont_defrag)) {
- if (ip_vs_gather_frags(skb, IP_DEFRAG_VS_OUT))
- return NF_STOLEN;
- iph = ip_hdr(skb);
- }
-
- ihl = iph->ihl << 2;
-
- /*
- * Check if the packet belongs to an existing entry
- */
- cp = pp->conn_out_get(skb, pp, iph, ihl, 0);
-
- if (unlikely(!cp)) {
- if (sysctl_ip_vs_nat_icmp_send &&
- (pp->protocol == IPPROTO_TCP ||
- pp->protocol == IPPROTO_UDP)) {
- __be16 _ports[2], *pptr;
-
- pptr = skb_header_pointer(skb, ihl,
- sizeof(_ports), _ports);
- if (pptr == NULL)
- return NF_ACCEPT;
- if (ip_vs_lookup_real_service(iph->protocol,
- iph->saddr, pptr[0])) {
- /*
- * Notify the real server: there is no
- * existing entry if it is not RST
- * packet or not TCP packet.
- */
- if (iph->protocol != IPPROTO_TCP
- || !is_tcp_reset(skb)) {
- icmp_send(skb,ICMP_DEST_UNREACH,
- ICMP_PORT_UNREACH, 0);
- return NF_DROP;
- }
- }
- }
- IP_VS_DBG_PKT(12, pp, skb, 0,
- "packet continues traversal as normal");
- return NF_ACCEPT;
- }
-
- IP_VS_DBG_PKT(11, pp, skb, 0, "Outgoing packet");
-
- if (!skb_make_writable(skb, ihl))
- goto drop;
-
-
- if (pp->snat_handler && !pp->snat_handler(skb, pp, cp))
- goto drop;
- ip_hdr(skb)->saddr = cp->vaddr;
- ip_send_check(ip_hdr(skb));
-
- /* ... machine itself may be routed differently to packets
- * passing through. We want this packet to be routed as
- * if it came from this machine itself. So re-compute
- * the routing information.
- */
- if (ip_route_me_harder(skb, RTN_LOCAL) != 0)
- goto drop;
-
- IP_VS_DBG_PKT(10, pp, skb, 0, "After SNAT");
-
- ip_vs_out_stats(cp, skb);
- ip_vs_set_state(cp, IP_VS_DIR_OUTPUT, skb, pp);
- ip_vs_conn_put(cp);
-
- skb->ipvs_property = 1;
-
- LeaveFunction(11);
- return NF_ACCEPT;
-
- drop:
- ip_vs_conn_put(cp);
- kfree_skb(skb);
- return NF_STOLEN;
- }
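The two lines in the listing that assign cp->vaddr to saddr and then call ip_send_check() amount to "replace the source address, then recompute the IP header checksum". The sketch below shows that step on a raw IPv4 header in user space. `ip_checksum()` and `snat_rewrite()` are names invented for this example, and the byte offsets assume a plain IPv4 header.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a header as big-endian 16-bit words and fold to one's complement. */
uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}

/*
 * Replace the source address (bytes 12..15 of an IPv4 header) with
 * `vaddr` and refresh the header checksum (bytes 10..11).
 * `ihl` is the header length in bytes.
 */
void snat_rewrite(uint8_t *hdr, size_t ihl, const uint8_t vaddr[4])
{
    for (int i = 0; i < 4; i++)
        hdr[12 + i] = vaddr[i];

    /* The checksum field must be zero while the sum is computed. */
    hdr[10] = 0;
    hdr[11] = 0;

    uint16_t csum = ip_checksum(hdr, ihl);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xFF);
}
```

A quick sanity check after the rewrite is to run `ip_checksum()` over the whole header including the checksum field: for a valid header the result is 0.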
ip_vs_forward_icmp()
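This function sits on the same FORWARD chain as ip_vs_out() described above, but its priority is 99, which is smaller (and therefore higher) than the 100 of ip_vs_out(), so it runs first.
The function is very simple: it hands every ICMP packet passing through the FORWARD chain to ip_vs_in_icmp(). Why would data meant for the virtual service end up on the FORWARD chain? When a fwmark-based virtual service is used, for example a transparent cache cluster, TCP/UDP packets can fairly easily be marked and routed so that they reach INPUT, but ICMP is not that simple, so some ICMP packets related to incoming IPVS connections arrive on the FORWARD chain instead (the exact reason is unclear to the author). ip_vs_forward_icmp() therefore picks up those missed ICMP packets here.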
net/ipv4/ipvs/ip_vs_core.c
- /*
- * It is hooked at the NF_INET_FORWARD chain, in order to catch ICMP
- * related packets destined for 0.0.0.0/0.
- * When fwmark-based virtual service is used, such as transparent
- * cache cluster, TCP packets can be marked and routed to ip_vs_in,
- * but ICMP destined for 0.0.0.0/0 cannot not be easily marked and
- * sent to ip_vs_in_icmp. So, catch them at the NF_INET_FORWARD chain
- * and send them to ip_vs_in_icmp.
- */
- static unsigned int
- ip_vs_forward_icmp(unsigned int hooknum, struct sk_buff *skb,
- const struct net_device *in, const struct net_device *out,
- int (*okfn)(struct sk_buff *))
- {
- int r;
-
- if (ip_hdr(skb)->protocol != IPPROTO_ICMP)
- return NF_ACCEPT;
-
- return ip_vs_in_icmp(skb, &r, hooknum);
- }
ip_vs_post_routing()
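This function has priority NF_IP_PRI_NAT_SRC-1, so on POSTROUTING it runs just before source NAT (hooks with smaller priority values, such as mangle, have already run by then). Its purpose is to prevent packets already modified by ipvs from being modified again by netfilter's NAT: for packets carrying ipvs_property it returns NF_STOP, which lets the packet through and skips the remaining hooks on the chain. The code is as follows: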
net/ipv4/ipvs/ip_vs_core.c
- /*
- * It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
- * chain, and is used for VS/NAT.
- * It detects packets for VS/NAT connections and sends the packets
- * immediately. This can avoid that iptable_nat mangles the packets
- * for VS/NAT.
- */
- static unsigned int ip_vs_post_routing(unsigned int hooknum,
- struct sk_buff *skb,
- const struct net_device *in,
- const struct net_device *out,
- int (*okfn)(struct sk_buff *))
- {
- if (!skb->ipvs_property)
- return NF_ACCEPT;
-
- return NF_STOP;
- }
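To see why returning NF_STOP protects the packet, here is a small user-space model of a chain traversal. The verdict enum, `struct pkt`, `nat_src()` and `run_chain()` are all hypothetical stand-ins; only the decision inside `ipvs_post_routing()` is taken from the listing above.

```c
#include <stddef.h>

/* Verdict names mirror netfilter's; the numeric values here are arbitrary. */
enum verdict { V_ACCEPT, V_DROP, V_STOP };

struct pkt {
    int ipvs_property;  /* set once ipvs has handled the packet */
    int nat_touched;    /* set by the stand-in NAT hook below */
};

typedef enum verdict (*hook_fn)(struct pkt *);

/* Same decision as ip_vs_post_routing() in the listing above. */
enum verdict ipvs_post_routing(struct pkt *p)
{
    return p->ipvs_property ? V_STOP : V_ACCEPT;
}

/* Stand-in for a source-NAT hook registered at a later priority. */
enum verdict nat_src(struct pkt *p)
{
    p->nat_touched = 1;
    return V_ACCEPT;
}

/*
 * Call hooks in order. V_ACCEPT moves on to the next hook, V_STOP ends
 * the traversal and lets the packet through, V_DROP ends it and discards
 * the packet. Returns 1 if the packet continues, 0 if it was dropped.
 */
int run_chain(const hook_fn *hooks, size_t n, struct pkt *p)
{
    for (size_t i = 0; i < n; i++) {
        enum verdict v = hooks[i](p);

        if (v == V_STOP)
            return 1;
        if (v == V_DROP)
            return 0;
    }
    return 1;
}
```

With the hooks ordered as { ipvs_post_routing, nat_src }, a packet whose ipvs_property is set leaves the chain without nat_src ever seeing it, while an ordinary packet still goes through NAT as usual.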
References
- iptables-tutorial-cn
- netfilter: the implementation of the Linux firewall in the kernel