[Coral-dev] crl_to_pcap stats?

Mon Jun 13 12:35:55 PDT 2005

On Mon, Jun 13, 2005 at 11:36:58AM -0700, Ken Keys wrote:
> If there is packet loss in an interval, crl apps print a message like this
> to the coral error file (stderr by default) at the end of the interval:
> 
>     warning: interval 1118687740.000000: iface 0: dropped 13 packets
> 
> If there is no loss, no message is printed.  If no intervals are set, the
> entire run is considered one interval for this statistics reporting.
> 
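
A minimal sketch of totalling these warnings per interface from a saved error file. The regular expression is modelled only on the single sample line quoted above, so treat the exact format as an assumption; `sum_reported_drops` is a hypothetical helper name, not part of CoralReef.

```python
import re
from collections import defaultdict

# Pattern modelled on the one sample warning line quoted above (assumption:
# other CoralReef versions may word or space the message differently).
WARNING_RE = re.compile(
    r"warning: interval (?P<start>\d+(?:\.\d+)?): "
    r"iface (?P<iface>\d+): dropped (?P<count>\d+) packets"
)


def sum_reported_drops(lines):
    """Return {interface number: total dropped packets} over all warning lines."""
    totals = defaultdict(int)
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            totals[int(match.group("iface"))] += int(match.group("count"))
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "warning: interval 1118687740.000000: iface 0: dropped 13 packets",
    ]
    print(sum_reported_drops(sample))
```

Intervals with no loss produce no line at all, so an interface missing from the result simply had no reported drops.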

	It is perhaps worth pointing out that this is probably only one of
(at least) three possible sources of packet loss. I expect this number
reports the packet loss counter from libpcap, which covers the copy from
mbufs in kernel space into the libpcap buffer in user space. In addition, you
can be losing packets (invisibly to this counter) in the interface hardware
(between the physical wire and the interface hardware / device driver, before
the kernel buffers), or in the kernel by exhausting buffers at the mbuf level,
running out of CPU, or running out of memory bandwidth. Both of those loss
sources need to be detected at the kernel level (I think some of the kernel
buffer drops may not get counted or displayed, although I haven't yet had
time to dig through the source code to see if that's true), and they vary by
operating system just to keep life interesting.
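
As a rough illustration of reading kernel-level counters, here is a minimal sketch that pulls the per-interface receive counters out of Linux's /proc/net/dev. It is Linux-specific (other operating systems expose these numbers differently), the helper name `parse_rx_counters` is hypothetical, and, as noted above, these counters may well not cover every place a packet can be lost.

```python
# Receive-side column names, in the order /proc/net/dev lists them.
RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast")


def parse_rx_counters(text):
    """Parse /proc/net/dev text into {interface: {rx_field: value}}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        values = rest.split()
        if len(values) < len(RX_FIELDS):
            continue
        rx_values = [int(v) for v in values[:len(RX_FIELDS)]]
        counters[name.strip()] = dict(zip(RX_FIELDS, rx_values))
    return counters


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as proc_file:
            text = proc_file.read()
    except OSError:
        text = ""  # not on Linux, or /proc is not mounted
    for iface, rx in parse_rx_counters(text).items():
        print(iface, "errs", rx["errs"], "drop", rx["drop"], "fifo", rx["fifo"])
```

Sampling these counters before and after a capture run and comparing the differences with the libpcap drop count gives a first hint about which layer is losing packets.
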
	For example, this is the interface error printout from a SysKonnect
fibre Gig card in a SuSE 9.1 / Linux 2.6 kernel machine (this one without the
ntop libpcap ring buffer code that I usually use) that is being abused at wire
speed with 9K UDP packet bursts. As you can see, it isn't always happy about
this (and Intel copper Gig cards are even more unhappy; I haven't yet tried my
brand new Intel fibre GigE cards :-)). I believe (without, as I say, having
looked at the Linux device driver code, although I have looked at the FreeBSD
equivalent) that this complaint means that the 64K on-card ring buffer was
overwritten before the device driver serviced it (probably because of
insufficient CPU; the machine is a dual 1.2 GHz Athlon). That could be fixed
(since this is capture only) by changing the device driver to allocate more
memory to the receive side, or by using a faster / more modern card (the
SysKonnect is also 4 years old) that does interrupt merging (coalescing),
which has its own problems:

sniffer:~ # cat /proc/net/sk98lin/eth0

Detailed statistic for device eth0
=======================================

Board statistics

Active Port                    A
Preferred Port                 A
Bus speed (MHz)                66
Bus width (Bit)                64
Driver version                 6.23
Hardware revision              v1.2
Temperature (C)                27.05
Temperature (F)                81.00
Voltage PCI (V)                5.104
Voltage PCI-IO (V)             3.344
Voltage ASIC (V)               3.344
Voltage PMA (V)                3.278

Receive statistics

Received bytes                 11761348565
Received packets               1710606
Receive errors                 10
Receive dropped                0
Received multicast             431
Receive error types
   length                      0
   buffer overflow             10
   bad crc                     0
   framing                     0
   missed frames               0
   too long                    0
   carrier extension           0
   too short                   0
   symbol                      0
   LLC MAC size                0
   carrier event               0
   jabber                      0

Transmit statistics

Transmited bytes               2047881323
Transmited packets             1083310
Transmit errors                0
Transmit dropped               0
Transmit collisions            0
Transmit error types
   excessive collision         0
   carrier                     0
   fifo underrun               0
   heartbeat                   0
   window                      0

	As noted, we believe that we are seeing additional loss in kernel
space that we haven't found error counters for (if they exist), because this
error rate doesn't explain the amount of loss we are seeing at the
application level (although the application in our case isn't CoralReef, the
principles are unfortunately universal).
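
To put a number on the interface-level error rate, here is a minimal sketch that parses "label  value" lines in the style of the sk98lin printout and divides receive errors by received packets. The helper names are hypothetical, and whether the driver's "Received packets" count includes the errored frames is an assumption I have not checked. With the counters shown (10 errors against 1710606 packets) it comes out at a few packets per million.

```python
import re

# A label, at least two spaces, then a bare integer to the end of the line.
STAT_LINE_RE = re.compile(r"^\s*(.+?)\s{2,}(\d+)\s*$")


def parse_sk98lin_stats(text):
    """Map each 'label   integer' line to {label: int}; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE_RE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def receive_error_fraction(stats):
    """Receive errors as a fraction of received packets.

    Assumption: errored frames are not also counted in 'Received packets';
    if they are, the result is very slightly overstated.
    """
    return stats["Receive errors"] / stats["Received packets"]


if __name__ == "__main__":
    # Excerpt copied from the printout above.
    sample = """\
Received bytes                 11761348565
Received packets               1710606
Receive errors                 10
Receive dropped                0
"""
    stats = parse_sk98lin_stats(sample)
    print(f"receive error fraction: {receive_error_fraction(stats):.2e}")
```

Comparing this fraction with the loss fraction measured at the application shows how much loss is still unaccounted for.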

Peter Van Epp / Operations and Technical Support 
Simon Fraser University, Burnaby, B.C. Canada