discussion questions: SR-IOV, virtualization, and bonding
Chris Friesen
2012-08-02 19:21:34 UTC
Hi all,

I wanted to just highlight some issues that we're seeing and see what
others are doing in this area.

Our configuration is that we have a host with SR-IOV-capable NICs with
bonding enabled on the PF. Depending on the exact system it could be
active/standby or some form of active/active.

In the guests we generally have several VFs (corresponding to several
PFs) and we want to bond them for reliability.

We're seeing a number of issues:

1) If the guests use arp monitoring then broadcast arp packets from the
guests are visible on the other guests and on the host, and can cause
them to think the link is good even if we aren't receiving arp packets
from the external network. (I'm assuming carrier is up.)

2) If both the host and guest use active/backup but pick different
devices as the active, there is no traffic between host/guest over the
bond link. Packets are sent out the active and looped back internally
to arrive on the inactive, then skb_bond_should_drop() suppresses them.

3) For active/standby the default is to set the standby to the MAC
address of the bond. If the host has already set the MAC address (using
some algorithm to ensure uniqueness within the local network) then the
guest is not allowed to change it.


So far the solutions to issue 1 seem to be either to use arp validation
(which currently doesn't exist for the load-balancing modes) or else to
have the underlying ethernet driver distinguish between packets coming
from the wire and packets looped back internally, and have the bonding
driver set last_rx only for external packets.
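
As a rough self-contained sketch of what I mean by the second option
(this is not real bonding or driver code -- the struct and function
names are invented purely for illustration): the driver flags whether a
frame arrived from the wire or was looped back by the embedded switch,
and only wire frames refresh the per-slave receive timestamp that the
ARP monitor checks.

/* Self-contained model only, NOT actual bonding/driver code -- all
 * names here are invented for illustration.  Only frames the driver
 * flags as having arrived from the wire refresh the timestamp that
 * the ARP monitor inspects; internally looped-back frames do not. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct model_frame {
	bool from_wire;		/* hint a driver could provide: false if the
				 * embedded switch looped the frame back */
};

struct model_slave {
	time_t last_rx;		/* what the ARP monitor would inspect */
};

static void note_rx(struct model_slave *s, const struct model_frame *f)
{
	if (f->from_wire)
		s->last_rx = time(NULL);	/* proof of life from the network */
}

static bool link_alive(const struct model_slave *s, int window_sec)
{
	return time(NULL) - s->last_rx <= window_sec;
}

int main(void)
{
	struct model_slave vf0 = { .last_rx = 0 };
	struct model_frame looped = { .from_wire = false };
	struct model_frame external = { .from_wire = true };

	note_rx(&vf0, &looped);		/* ignored: came from another guest */
	printf("after looped frame:   %s\n", link_alive(&vf0, 5) ? "up" : "down");
	note_rx(&vf0, &external);	/* counts: came from the wire */
	printf("after external frame: %s\n", link_alive(&vf0, 5) ? "up" : "down");
	return 0;
}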

For issue 2, it would seem beneficial for the host to be able to ensure
that the guest uses the same link as the host's active slave. I don't
see a tidy solution. One somewhat messy possibility is to have bonding
send a message to the standby PF, which then tells all its VFs to fake
loss of carrier.

For issue 3, the logical solution would seem to be some way of assigning
a list of "valid" MAC addresses to a given VF -- perhaps all of the MAC
addresses assigned to that VM. Anyone have any bright ideas?
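
As a rough sketch of the sort of thing I'm imagining (nothing like this
exists in the PF drivers today -- the names and structure below are
made up), the PF would honour a guest's MAC-change request only if the
address is on a host-supplied allow-list for that VF:

/* Hypothetical sketch only -- no such allow-list exists in current PF
 * drivers; all names are made up for illustration. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define ETH_ALEN	6
#define MAX_ALLOWED	8

struct vf_mac_policy {
	unsigned char allowed[MAX_ALLOWED][ETH_ALEN];	/* set by the host admin */
	int n_allowed;
};

/* Imagined hook: called when the guest asks to change a VF's MAC.
 * The request is honoured only if the address is on the list the host
 * associated with that VF (e.g. all MACs assigned to the VM). */
static bool vf_mac_change_allowed(const struct vf_mac_policy *p,
				  const unsigned char *requested)
{
	int i;

	for (i = 0; i < p->n_allowed; i++)
		if (memcmp(p->allowed[i], requested, ETH_ALEN) == 0)
			return true;
	return false;
}

int main(void)
{
	struct vf_mac_policy p = {
		.allowed = { { 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 } },
		.n_allowed = 1,
	};
	unsigned char ok[ETH_ALEN]  = { 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 };
	unsigned char bad[ETH_ALEN] = { 0x52, 0x54, 0x00, 0xde, 0xad, 0x01 };

	printf("known MAC:   %s\n", vf_mac_change_allowed(&p, ok)  ? "allowed" : "denied");
	printf("unknown MAC: %s\n", vf_mac_change_allowed(&p, bad) ? "allowed" : "denied");
	return 0;
}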


I'm sure we're not the only ones running into this, so what are others
doing? Is the only current option to use active/active with miimon?

Chris
--
Chris Friesen
Software Designer

3500 Carling Avenue
Ottawa, Ontario K2H 8E9
www.genband.com
Jay Vosburgh
2012-08-02 20:30:53 UTC
Post by Chris Friesen
Hi all,
I wanted to just highlight some issues that we're seeing and see what
others are doing in this area.
Our configuration is that we have a host with SR-IOV-capable NICs with
bonding enabled on the PF. Depending on the exact system it could be
active/standby or some form of active/active.
In the guests we generally have several VFs (corresponding to several
PFs) and we want to bond them for reliability.
1) If the guests use arp monitoring then broadcast arp packets from the
guests are visible on the other guests and on the host, and can cause
them to think the link is good even if we aren't receiving arp packets
from the external network. (I'm assuming carrier is up.)
2) If both the host and guest use active/backup but pick different
devices as the active, there is no traffic between host/guest over the
bond link. Packets are sent out the active and looped back internally
to arrive on the inactive, then skb_bond_should_drop() suppresses them.
Just to be sure that I'm following this correctly, you're
setting up active-backup bonds on the guest and the host. The guest
sets its active slave to be a VF from "SR-IOV Device A," but the host
sets its active slave to a PF from "SR-IOV Device B." Traffic from the
guest to the host then arrives at the host's inactive slave (its PF for
"SR-IOV Device A") and is then dropped.

Correct?
Post by Chris Friesen
3) For active/standby the default is to set the standby to the MAC
address of the bond. If the host has already set the MAC address (using
some algorithm to ensure uniqueness within the local network) then the
guest is not allowed to change it.
So far the solutions to issue 1 seem to be either to use arp validation
(which currently doesn't exist for the load-balancing modes) or else to
have the underlying ethernet driver distinguish between packets coming
from the wire and packets looped back internally, and have the bonding
driver set last_rx only for external packets.
As discussed previously, e.g.,:

http://marc.info/?l=linux-netdev&m=134316327912154&w=2

implementing arp_validate for load balance modes is tricky at
best, regardless of SR-IOV issues.

This is really a variation on the situation that led to the
arp_validate functionality in the first place (that multiple instances
of ARP monitor on a subnet can fool one another), except that the switch
here is within the SR-IOV device and the various hosts are guests.

The best long term solution is to have a user space API that
provides link state input to bonding on a per-slave basis, and then some
user space entity can perform whatever link monitoring method is
appropriate (e.g., LLDP) and pass the results to bonding.
Post by Chris Friesen
For issue 2, it would seem beneficial for the host to be able to ensure
that the guest uses the same link as the host's active slave. I don't
see a tidy solution. One somewhat messy possibility is to have bonding
send a message to the standby PF, which then tells all its VFs to fake
loss of carrier.
There is no tidy solution here that I'm aware of; this has been
a long standing concern in bladecenter type of network environments,
wherein all blade "eth0" interfaces connect to one chassis switch, and
all blade "eth1" interfaces connect to a different chassis switch. If
those switches are not connected, then there may not be a path from
blade A:eth0 to blade B:eth1. There is no simple mechanism to force a
gang failover across multiple hosts.

That said, I've seen a slight variation on this using virtualized
network devices (pseries ehea, which is similar in principle to SR-IOV,
although implemented differently). In that case, the single ehea card
provides all "eth0" devices for all lpars (logical partitions,
"guests"). A separate card (or individual per-lpar cards) provides the
"eth1" devices.

In this configuration, the bonding primary option is used to
make eth0 the primary, and thus all lpars use eth0 preferentially, and
there is no connectivity issue. If the ehea card itself fails, all of
the bonds will fail over simultaneously to the backup devices, and
again, there is no connectivity issue. This works because the ehea is a
single point of failure for all of the partitions.
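
For reference, the knobs involved on the Linux bonding side are the
"primary" and "miimon" options. A minimal sketch that sets them through
sysfs on an already-created bond follows; bond0, eth0 and the 100 ms
interval are placeholder assumptions, and normally these would be set
via the distro's bonding options rather than a one-off program.

/* Minimal sketch: prefer eth0 on an existing bond0 and poll link state
 * with the MII monitor.  Interface names and the 100 ms interval are
 * assumptions for illustration. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", value);
	return fclose(f);
}

int main(void)
{
	write_sysfs("/sys/class/net/bond0/bonding/miimon", "100");	/* check link every 100 ms */
	write_sysfs("/sys/class/net/bond0/bonding/primary", "eth0");	/* always prefer eth0 when up */
	return 0;
}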

Note that the ehea can propagate link failure of its external
port (the one that connects to a "real" switch) to its internal ports
(what the lpars see), so that bonding can detect the link failure. This
is an option to ehea; by default, all internal ports are always carrier
up so that they can communicate with one another regardless of the
external port link state. To my knowledge, this is used with miimon,
not the arp monitor.

I don't know how SR-IOV operates in this regard (e.g., can VFs
fail independently from the PF?). It is somewhat different from your
case in that there is no equivalent to the PF in the ehea case. If the
PFs participate in the primary setting it will likely permit initial
connectivity, but I'm not sure if a PF plus all its VFs fail as a unit
(from bonding's point of view).
Post by Chris Friesen
For issue 3, the logical solution would seem to be some way of assigning
a list of "valid" MAC addresses to a given VF -- perhaps all of the MAC
addresses assigned to that VM. Anyone have any bright ideas?
There's an option to bonding, fail_over_mac, that modifies
bonding's handling of the slaves' MAC address(es). One setting,
"active" instructs bonding to make its MAC be whatever the currently
active slave's MAC is, never changing any of the slave's MAC addresses.
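
A minimal sketch, assuming a bond named bond0 (note that per
bonding.txt, fail_over_mac can only be changed while the bond has no
slaves, so in practice it is normally set in the bonding options when
the bond is created):

/* Sketch only: select fail_over_mac=active on bond0 via sysfs. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/net/bond0/bonding/fail_over_mac", "w");

	if (!f) {
		perror("fail_over_mac");
		return 1;
	}
	/* "active": the bond takes on the current active slave's MAC and
	 * never rewrites the slaves' own hardware addresses. */
	fprintf(f, "active\n");
	return fclose(f) ? 1 : 0;
}
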
Post by Chris Friesen
I'm sure we're not the only ones running into this, so what are others
doing? Is the only current option to use active/active with miimon?
I think you're at least close to the edge here; I've only done
some basic testing of bonding with SR-IOV, although I'm planning to do
some more early next week (and what you've found has been good input for
me, so thanks for that, at least).

I suspect that some bonding configurations are simply not going
to work at all; e.g., I'm not aware of any SR-IOV devices that implement
LACP on the internal switch, and in any event, it would have to create
aggregators that span across physical network devices to be really
useful.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, ***@us.ibm.com
Chris Friesen
2012-08-02 22:26:53 UTC
Post by Jay Vosburgh
Post by Chris Friesen
2) If both the host and guest use active/backup but pick different
devices as the active, there is no traffic between host/guest over the
bond link. Packets are sent out the active and looped back internally
to arrive on the inactive, then skb_bond_should_drop() suppresses them.
Just to be sure that I'm following this correctly, you're
setting up active-backup bonds on the guest and the host. The guest
sets its active slave to be a VF from "SR-IOV Device A," but the host
sets its active slave to a PF from "SR-IOV Device B." Traffic from the
guest to the host then arrives at the host's inactive slave (its PF for
"SR-IOV Device A") and is then dropped.
Correct?
Yes, that's correct. The issue is that the internal switch on device A
knows nothing about device B. Ideally what should happen is that the
internal switch routes the packets out onto the wire so that they come
back in on device B and get routed up to the host. However, at least
with the Intel devices the internal switch has no learning capabilities.

The alternative is to have the external switch(es) configured to do the
loopback, but that puts some extra requirements on the selection of the
external switch.
Post by Jay Vosburgh
Post by Chris Friesen
So far the solutions to issue 1 seem to be either to use arp validation
(which currently doesn't exist for the load-balancing modes) or else to
have the underlying ethernet driver distinguish between packets coming
from the wire and packets looped back internally, and have the bonding
driver set last_rx only for external packets.
http://marc.info/?l=linux-netdev&m=134316327912154&w=2
implementing arp_validate for load balance modes is tricky at
best, regardless of SR-IOV issues.
Yes, I should have referenced that discussion. I thought I'd include it
here with the other issues to group everything together.
Post by Jay Vosburgh
This is really a variation on the situation that led to the
arp_validate functionality in the first place (that multiple instances
of ARP monitor on a subnet can fool one another), except that the switch
here is within the SR-IOV device and the various hosts are guests.
The best long term solution is to have a user space API that
provides link state input to bonding on a per-slave basis, and then some
user space entity can perform whatever link monitoring method is
appropriate (e.g., LLDP) and pass the results to bonding.
I think this has potential. This requires a virtual communication
channel between guest/host if we want the host to be able to influence
the guest's choice of active link, but I think that's not unreasonable.

Actually, couldn't we do this now? Turn off miimon and arpmon, then
just have the userspace thing write to
/sys/class/net/bondX/bonding/active_slave
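
Roughly, something like the following minimal sketch (assuming bond0 in
active-backup mode with slaves eth0/eth1; link_ok() is just a stub for
whatever monitoring the user space entity actually performs -- LLDP,
pinging a gateway, a message from the host, and so on):

/* Rough sketch: an external monitor decides which slave is healthy and
 * steers bonding by writing the slave name to the bond's active_slave
 * attribute (active-backup mode).  bond0/eth0/eth1 and link_ok() are
 * assumptions for illustration. */
#include <stdio.h>
#include <unistd.h>

static int link_ok(const char *slave)
{
	(void)slave;
	return 1;	/* stub: LLDP result, gateway ping, host message, ... */
}

static int set_active(const char *slave)
{
	FILE *f = fopen("/sys/class/net/bond0/bonding/active_slave", "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", slave);
	return fclose(f);
}

int main(void)
{
	const char *slaves[] = { "eth0", "eth1" };
	const char *current = NULL;
	int i;

	for (;;) {
		for (i = 0; i < 2; i++) {
			if (!link_ok(slaves[i]))
				continue;
			/* only rewrite active_slave when the choice changes */
			if (current != slaves[i] && set_active(slaves[i]) == 0)
				current = slaves[i];
			break;
		}
		sleep(1);
	}
	return 0;
}
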
Post by Jay Vosburgh
Post by Chris Friesen
For issue 2, it would seem beneficial for the host to be able to ensure
that the guest uses the same link as the host's active slave. I don't
see a tidy solution. One somewhat messy possibility is to have bonding
send a message to the standby PF, which then tells all its VFs to fake
loss of carrier.
There is no tidy solution here that I'm aware of; this has been
a long standing concern in bladecenter type of network environments,
wherein all blade "eth0" interfaces connect to one chassis switch, and
all blade "eth1" interfaces connect to a different chassis switch. If
those switches are not connected, then there may not be a path from
blade A:eth0 to blade B:eth1. There is no simple mechanism to force a
gang failover across multiple hosts.
In our blade server environment those two switches are indeed
cross-connected, so we haven't had to do gang-failover.
Post by Jay Vosburgh
Note that the ehea can propagate link failure of its external
port (the one that connects to a "real" switch) to its internal ports
(what the lpars see), so that bonding can detect the link failure. This
is an option to ehea; by default, all internal ports are always carrier
up so that they can communicate with one another regardless of the
external port link state. To my knowledge, this is used with miimon,
not the arp monitor.
I don't know how SR-IOV operates in this regard (e.g., can VFs
fail independently from the PF?). It is somewhat different from your
case in that there is no equivalent to the PF in the ehea case. If the
PFs participate in the primary setting it will likely permit initial
connectivity, but I'm not sure if a PF plus all its VFs fail as a unit
(from bonding's point of view).
With current Intel drivers at least, if the PF detects link failure it
fires a message to the VFs and they detect link failure within a short
time (milliseconds).

We can recommend the use of the "primary" option, but we don't always
have total control over what the guests do, and some of them don't want
to use "primary"; I'm not sure why.
Post by Jay Vosburgh
Post by Chris Friesen
For issue 3, the logical solution would seem to be some way of assigning
a list of "valid" MAC addresses to a given VF -- perhaps all of the MAC
addresses assigned to that VM. Anyone have any bright ideas?
There's an option to bonding, fail_over_mac, that modifies
bonding's handling of the slaves' MAC address(es). One setting,
"active" instructs bonding to make its MAC be whatever the currently
active slave's MAC is, never changing any of the slave's MAC addresses.
Yes, I'm aware of that option. It does have drawbacks though, as
described in the bonding.txt docs.
Post by Jay Vosburgh
Post by Chris Friesen
I'm sure we're not the only ones running into this, so what are others
doing? Is the only current option to use active/active with miimon?
I think you're at least close to the edge here; I've only done
some basic testing of bonding with SR-IOV, although I'm planning to do
some more early next week (and what you've found has been good input for
me, so thanks for that, at least).
Glad we could help. :)

Chris
Chris Friesen
2012-08-02 22:33:27 UTC
Post by Jay Vosburgh
The best long term solution is to have a user space API that
provides link state input to bonding on a per-slave basis, and then some
user space entity can perform whatever link monitoring method is
appropriate (e.g., LLDP) and pass the results to bonding.
I think this has potential. This requires a virtual communication
channel between guest/host if we want the host to be able to influence
the guest's choice of active link, but I think that's not unreasonable.
Actually, couldn't we do this now? Turn off miimon and arpmon, then just
have the userspace thing write to /sys/class/net/bondX/bonding/active_slave
Hmm...looks like the bonding code requires either miimon or arpmon. I
wonder if setting miimon to INT_MAX might work, at least for some
bonding modes.

Chris
Jay Vosburgh
2012-08-02 23:01:31 UTC
Post by Chris Friesen
Post by Jay Vosburgh
The best long term solution is to have a user space API that
provides link state input to bonding on a per-slave basis, and then some
user space entity can perform whatever link monitoring method is
appropriate (e.g., LLDP) and pass the results to bonding.
I think this has potential. This requires a virtual communication
channel between guest/host if we want the host to be able to influence
the guest's choice of active link, but I think that's not unreasonable.
Not necessarily; if something like LLDP runs across the virtual
link between the guest and slave, then the guest will notice when the
link goes down (although perhaps not very quickly). I'm pretty sure the
infrastructure to make LLDP work on inactive slaves is already there; as
I recall, the "no wildcard" or "deliver exact" business in the receive
path is at least partially for LLDP.

Still, though, isn't "influence the guest's choice" pretty much
satisfied by having the VF interface go carrier down in the guest when
the host wants it to? Or are you thinking of something more
fine-grained than that?
Post by Chris Friesen
Actually, couldn't we do this now? Turn off miimon and arpmon, then just
have the userspace thing write to /sys/class/net/bondX/bonding/active_slave
That might work for active-backup mode, yes, although it may not
handle the case when all slaves have failed if "failed" does not include
the slave being carrier down. It's not quite the same thing as input to
the link monitoring logic.
Post by Chris Friesen
Hmm...looks like the bonding code requires either miimon or arpmon. I
wonder if setting miimon to INT_MAX might work, at least for some bonding
modes.
Not true; it's legal to leave miimon and arp_interval set to 0.
Older versions of bonding will whine about it, but let you do it; in
mainline, it's a debug message you have to choose to turn on (because
current versions of initscripts, et al, create the bond first, and then
set those options, so it tended to whine all the time).

-J

---
-Jay Vosburgh, IBM Linux Technology Center, ***@us.ibm.com
Chris Friesen
2012-08-02 23:15:28 UTC
Post by Jay Vosburgh
Still, though, isn't "influence the guest's choice" pretty much
satisfied by having the VF interface go carrier down in the guest when
the host wants it to? Or are you thinking of something more
fine-grained than that?
That was the first thing we started looking at.

It would actually be better technically (since it would use the
back-channel between PF and VFs rather than needing an explicit virtual
network link between host/guest) but it would require work in all the
PF/VF drivers. We'd need to get support from all the driver maintainers.

The main advantage of doing it in bonding is that we'd only need to
modify the code in one place.

Chris
Jay Vosburgh
2012-08-02 23:36:42 UTC
Post by Chris Friesen
Post by Jay Vosburgh
Still, though, isn't "influence the guest's choice" pretty much
satisfied by having the VF interface go carrier down in the guest when
the host wants it to? Or are you thinking of something more
fine-grained than that?
That was the first thing we started looking at.
It would actually be better technically (since it would use the
back-channel between PF and VFs rather than needing an explicit virtual
network link between host/guest) but it would require work in all the
PF/VF drivers. We'd need to get support from all the driver maintainers.
It might also be better (for a different definition of "better")
to use the virtual network link and put more of the functionality in a generic
user space piece that's not in the kernel and wouldn't require special
driver support. Either way, I imagine there's going to have to be some
sort of message passing going on.
Post by Chris Friesen
The main advantage of doing it in bonding is that we'd only need to modify
the code in one place.
As long as it works with VLANs bonded together; that seems to be
more common these days.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, ***@us.ibm.com
John Fastabend
2012-08-03 04:50:06 UTC
Post by Jay Vosburgh
Post by Jay Vosburgh
The best long term solution is to have a user space API that
provides link state input to bonding on a per-slave basis, and then some
user space entity can perform whatever link monitoring method is
appropriate (e.g., LLDP) and pass the results to bonding.
I think this has potential. This requires a virtual communication
channel between guest/host if we want the host to be able to influence
the guest's choice of active link, but I think that's not unreasonable.
Not necessarily; if something like LLDP runs across the virtual
link between the guest and slave, then the guest will notice when the
link goes down (although perhaps not very quickly). I'm pretty sure the
infrastructure to make LLDP work on inactive slaves is already there; as
I recall, the "no wildcard" or "deliver exact" business in the receive
path is at least partially for LLDP.
Right, we run LLDP over the inactive bond. However, because LLDP uses
the nearest customer bridge, nearest bridge, or nearest non-TPMR
addresses, it should be dropped by switching components. The problem
with having VMs send LLDP and _not_ dropping the packets is that it
looks like multiple neighbors to the peer. The point is that there is
really an edge-relay-like component in the SR-IOV hardware, so using
LLDP to do this likely wouldn't work.

If you happen to have the 2010 revision of 802.1Q, section 8.6.3 "Frame
filtering" has some more details. The 802.1AB spec has details on the
multiple-neighbor case.
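
For reference, the three destination group addresses in question are
below (values as I recall them from 802.1AB, so please double-check
against the spec); bridges and edge relays constrain frames sent to
these addresses to well-defined scopes instead of flooding them like
ordinary multicast, which is the behaviour described above:

/* The three LLDP destination group addresses referred to above
 * (per IEEE 802.1AB -- please verify against the spec). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const uint8_t lldp_nearest_bridge[6]          = { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x0E };
static const uint8_t lldp_nearest_nontpmr_bridge[6]  = { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x03 };
static const uint8_t lldp_nearest_customer_bridge[6] = { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x00 };

/* True if a destination MAC is one of the LLDP group addresses that a
 * bridge/edge relay is expected to filter rather than forward. */
static bool is_lldp_group_addr(const uint8_t dst[6])
{
	return !memcmp(dst, lldp_nearest_bridge, 6) ||
	       !memcmp(dst, lldp_nearest_nontpmr_bridge, 6) ||
	       !memcmp(dst, lldp_nearest_customer_bridge, 6);
}

int main(void)
{
	const uint8_t dst[6] = { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x0E };

	printf("filtered by the relay: %s\n", is_lldp_group_addr(dst) ? "yes" : "no");
	return 0;
}
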
Post by Jay Vosburgh
Still, though, isn't "influence the guest's choice" pretty much
satisfied by having the VF interface go carrier down in the guest when
the host wants it to? Or are you thinking of something more
fine-grained than that?
Perhaps one argument against this is that if the hardware supports
loopback modes, or the edge relay in the hardware is acting like a VEB,
it may still be possible to support VF-to-VF traffic even if the
external link is down. I'm not sure how useful this is, though, or
whether any existing hardware even supports it.

Just in case it's not clear (it might not be), an edge relay (ER) is
defined in the new 802.1Qbg-2012 spec: "An ER supports local relay
among virtual stations and/or between a virtual station and other
stations on a bridged LAN". It is similar to a bridge, but without
spanning tree operations.

.John
Ben Hutchings
2012-08-03 17:49:23 UTC
[...]
Post by John Fastabend
Post by Jay Vosburgh
Still, though, isn't "influence the guest's choice" pretty much
satisfied by having the VF interface go carrier down in the guest when
the host wants it to? Or are you thinking of something more
fine-grained than that?
Perhaps one argument against this is that if the hardware supports
loopback modes, or the edge relay in the hardware is acting like a VEB,
it may still be possible to support VF-to-VF traffic even if the
external link is down. I'm not sure how useful this is, though, or
whether any existing hardware even supports it.
[...]

It seems to me that VF-to-VF traffic ought to still work. If it doesn't,
then that's an unfortunate regression when moving from software bridging
and virtio to hardware-supported network virtualisation. (But hybrid
network virtualisation may help to solve that.)

Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
Chris Friesen
2012-08-10 18:41:40 UTC
Post by Ben Hutchings
Post by John Fastabend
Perhaps one argument against this is that if the hardware supports
loopback modes, or the edge relay in the hardware is acting like a VEB,
it may still be possible to support VF-to-VF traffic even if the
external link is down. I'm not sure how useful this is, though, or
whether any existing hardware even supports it.
[...]
It seems to me that VF-to-VF traffic ought to still work. If it doesn't,
then that's an unfortunate regression when moving from software bridging
and virtio to hardware-supported network virtualisation. (But hybrid
network virtualisation may help to solve that.)
I would have thought this to be desirable as well. Apparently the Intel
engineers disagreed. The 82599 datasheet has the following:

"Loopback is disabled when the network link is disconnected. It is
expected (but not required) that system software (including VMs) does
not post packets for transmission when the link is disconnected. Note
that packets posted by system software for transmission when the link is
down are buffered."

Chris
