Bug 3917 - spamd under Cygwin causes SpamC to report "failed sanity check" on some messages
Summary: spamd under Cygwin causes SpamC to report "failed sanity check" on some messages
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: 3.0.0
Hardware: PC Windows 2000
: P5 normal
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-10-21 18:58 UTC by Eugene Pressley
Modified: 2005-01-03 07:25 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
Sample #1 text/plain None Eugene Pressley [NoCLA]
Sample#2 text/plain None Eugene Pressley [NoCLA]
Sample #3 text/plain None Eugene Pressley [NoCLA]
Sample #4 text/plain None Eugene Pressley [NoCLA]
Zipped copy of attachment samples application/octet-stream None Eugene Pressley [NoCLA]
One more sample application/octet-stream None Eugene Pressley [NoCLA]
same zs2.out sample with resultant zs2a.out and Debug from SpamD application/octet-stream None Eugene Pressley [NoCLA]
my spamc.exe application/octet-stream None Eugene Pressley [NoCLA]
simplified server that demonstrates the problem (see following comments) text/plain None Sidney Markowitz [HasCLA]
even simpler server that demonstrates the problem text/plain None Sidney Markowitz [HasCLA]
patch to spamd.raw to use syswrite instead of print and printf patch None Sidney Markowitz [HasCLA]
patch to spamd.raw to use syswrite instead of print and printf patch None Sidney Markowitz [HasCLA]
patch to spamd.raw to use syswrite instead of print and printf patch None Sidney Markowitz [HasCLA]
3.0.2 SpamD with Syswrite Mods text/plain None Eugene Pressley [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Eugene Pressley 2004-10-21 18:58:55 UTC
Build SpamAssassin on Cygwin/Win2k3 system.  Using SpamC built from the 3.0.0 
distribution, using Cygwin.  When certain messages are submited to SpamD using 
SpamC I get the "failed sanity check" error reporting that the file size 
returned is smaller than what was submitted.  The SpamC client also returns an 
Error 76 "remote error in protocol" when I use the "-x" option.  I will be 
happy to supply samples of files that have gotten this error, just contact me 
and let me know where to send the files.
Comment 1 Eugene Pressley 2004-10-21 19:02:09 UTC
Created attachment 2468 [details]
Sample #1
Comment 2 Eugene Pressley 2004-10-21 19:02:32 UTC
Created attachment 2469 [details]
Sample#2
Comment 3 Eugene Pressley 2004-10-21 19:02:58 UTC
Created attachment 2470 [details]
Sample #3
Comment 4 Eugene Pressley 2004-10-21 19:03:17 UTC
Created attachment 2471 [details]
Sample #4
Comment 5 Eugene Pressley 2004-10-21 19:03:57 UTC
I have attached four messages that recieved this error.
Comment 6 Sidney Markowitz 2004-10-21 19:55:14 UTC
Can you check if it has anything to do with the mail message having DOS vs unix
newlines? That would be hidden in the attachments you posted unless you zipped
them and attached the zip files.
Comment 7 Eugene Pressley 2004-10-21 20:27:33 UTC
Created attachment 2472 [details]
Zipped copy of attachment samples
Comment 8 Eugene Pressley 2004-10-21 20:33:31 UTC
I submitted a copy of the samples as a zipped file.  Most messages are being 
properly processed only a small percentage are having this problem.  I have 
noticed that most if not all of the problem messages are multipart-mime files.  
As for the Dos v/ Unix newline issue I don't know.  I have set cygwin to DOS 
style newlines.  All of the files being sent to SpamD should have CRLF line 
endings.  If it is a problem with the newline ending why would only a small 
percentage of message have a problem.
Comment 9 Eugene Pressley 2004-10-21 20:42:14 UTC
As a test I tried converting a test message, using a DOS2UNIX utility, before 
sending it to SpamD.  I got the same results.  SpamD is only return a portion 
of the original message, which is what is causing the sanity check to fail.  I 
have noticed that the amount of the original message returned seems to vary a 
little with each submission.
Comment 10 Eugene Pressley 2004-10-21 21:06:58 UTC
Created attachment 2473 [details]
One more sample

The sample that I just attached I tried several times and recieved this:
D:\ESA>spamc300 < zs2.out > zs2a.out
failed sanity check, 109048 bytes claimed, 33721 bytes seen
Comment 11 Sidney Markowitz 2004-10-21 21:24:18 UTC
My initial test usaing zs2.out did not reproduce the problem, using 3.01 svn
branch and running spamc and spamd under Cygwin.

I'll have to check it more thoroughly when I get a chance to tonight.

What flags are you using to run spamd?

If you run spamd -D do you see any useful error messages?
Comment 12 Eugene Pressley 2004-10-22 05:42:47 UTC
Created attachment 2475 [details]
same zs2.out sample with resultant zs2a.out and Debug from SpamD

Here is the Debug from SpamD along with the in and out file that was produced
from the command:

spamc < zs2.out > zs2a.out

When I use the -x Command for spamc it will return an Error 76 
"EX_PROTOCOL	 76  remote error in protocol"
Comment 13 Eugene Pressley 2004-10-22 05:44:32 UTC
Created attachment 2476 [details]
my spamc.exe

Here is the spamc from my cygwin build, incase it might be useful.
Comment 14 Eugene Pressley 2004-10-22 10:07:15 UTC
Just tried the latest CVS build I got the following results:

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109044 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109044 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109044 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109044 bytes claimed, 97209 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109044 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109044 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 85945 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 54201 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 87993 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 34745 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out
failed sanity check, 109058 bytes claimed, 33721 bytes seen

D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out

You may notice that out of a number of runs some of them worked correctly, this 
is very bizar.  I think I will remove cygwin and rebuild.
Comment 15 Eugene Pressley 2004-10-22 11:55:36 UTC
Well I reinstalled Cygwin and SA using the latest CVS build, still having the 
same problem.
Comment 16 Sidney Markowitz 2004-10-22 14:12:13 UTC
It looks like spamd thinks everything is ok, but spamc is getting in error in
the socket connection. 76 is, as you said, "remote error in protocol" which I
see is defined as "The remote system returned something that was not possible
during a protocol exchange".

I could see how I could not reproduce it trying it once on my system.

Here are some other things to try in order to narrow it down: Do you get the
same results if you disable the network checks by running spamd with -L? How
about running without Bayes? I'm thinking in terms of what happens if spamd is
not doing anything that potemtially takes a long elapsed time.

Another interesting extra piece of information would be to add timestamps to the
spamd -D output. I guess you could do that by adding in the time to the warn
"debug: " $msg line in sub dbg in SpamAssassin.pm. That would show if there are
any long delays during the spamd processing in the cases that fail.
Comment 17 Justin Mason 2004-10-22 15:34:13 UTC
hmm.  what about character sets?  is everyone using the US codepage?
Comment 18 Sidney Markowitz 2004-10-22 17:17:23 UTC
> what about character sets?

I don't think it is a matter of the character counts not matching. A protocol
error and losing the connection explains better the big difference in counts and
the variation in results across runs with the same message.
Comment 19 Eugene Pressley 2004-10-22 21:05:07 UTC
I have the LANG var set to en_US.  The messages the are failing are a variety 
of language types including "us-ascii".  If it is a protocol failure what could 
be causing it.  The SpamC and SpamD are running on the same machine.
Comment 20 Sidney Markowitz 2004-10-23 02:17:08 UTC
> If it is a protocol failure what could be causing it

I don't know. I tried forcing a timeout by setting a snall value for the spamc
-t option, but that returned an error code 74, not 76 like you are seeing. The
effect is close to what you are seeing, though. Spamc is getting an error while
spamd thinks nothing is wrong.

What are all the command line options you are using for spamd and spamc?
Comment 21 Eugene Pressley 2004-10-23 09:57:39 UTC
SpamD command line:

perl.exe spamd -D -s /var/log/spamd.log -r /var/log/spamd.pid

SpamC:

spamc -x < infile > outfile
Comment 22 Sidney Markowitz 2004-10-23 23:48:07 UTC
I've reproduced this doing the following. Maybe someone can see if it happens
outside of Cygwin.

I did this with a build of the 3.0.1 branch. I haven't yet tried it in trunk.

I ran spamd using the command line

spamd -D -s /c/temp/zsaspderr.log -r /c/temp/spand.pid

I ran spamc repeatedly using the example input from the last attached sample
using the following script, with the output shown:

$ myvar=0 ; while [ $myvar -ne 10 ] ; do spamc -x < zs2.out > zsa2.out ; echo $?
; myvar=$(( $myvar + 1 )) ; done

0
0
0
76
76
0
76
76
76
0
Comment 23 Sidney Markowitz 2004-10-24 01:12:12 UTC
More results:

I got the same sporadic return code 76 using svn trunk under Cygwin (perl
5.8.5), and also when I added a sleep n delay of 5, 10, and 20 seconds between
successive calls to spamc. So it isn't a timing problem caused by sending rapid
messages to spamd.

I did not get any return code 76 when I repeated the test under Fedora Core 2,
perl 5.8.3.

Comment 24 Sidney Markowitz 2004-10-24 14:41:19 UTC
I get the same results if I use -L and disable auto_whitelist and Bayes in
user_prefs. I'm having difficulty getting a network capture, so I still don't
know what is going in with the protocol.
Comment 25 Sidney Markowitz 2004-10-24 16:48:18 UTC
I have some firewall problem trying tests between machines, but I did verify
that going through a ssh tunnel between my Cygwin box and my Fedora core 2 box,
running spamc in Cygwin and spamd in linux has no problem, running spamd in
Cygwin and spamc in linux does show the problem, with the return code 76
happening every time.

I substituted a large file made up of sample-nonspam.txt concatenated with
itself many times and the same error happens.

I reduced the size of the file. The smaller files never showed the return code
76, larger files did. At the borderline, a file with GTUBE (i.e., marked as
spam) began showing the problem at a smaller size than a non-spam file.

Comment 26 Justin Mason 2004-10-24 22:23:51 UTC
hmm.  possibility: a helper app is closing the fd?
Comment 27 Eugene Pressley 2004-10-28 10:19:01 UTC
Sorry for not checking in for a while.  I was away at a technical conference.  
I am happy to see that the problem can be replicated.  So does anyone have any 
idea why this is happening?  It seems to be a problem when the SpamD is running 
on a CygWin system.  Is the problem with SpamD or with CygWin?
Comment 28 Sidney Markowitz 2004-10-31 15:23:12 UTC
I found the firewall that was holding up my testing -- remnant of a VPN client I
had been using -- and captured the traffic from a test with spamc on a linux box
and spamd running under Cygwin. Ethereal shows that spamd is sending a truncated
message. There are no errors, just a data packet that should not be the last one
and then a FIN packet. A debugging statement showed that $client->close() is
being called in the right place in spamd. I tried adding a $client->flush() and
even a $client->flush(); sleep(5); before the $client->close() to no effect.

The truncated message is not consistent... It does not truncate in the same
place each time and it does not happen every time on identical input.
Comment 29 Sidney Markowitz 2004-11-01 10:39:45 UTC
I have a workaround that tests ok, but I am not submitting a patch because I
don't think it is a workable solution. Instead I'm putting it in this comment in
the hopes that someone has some idea what it implies about the problem and what
might be done about it.

Remember, this only seems to happen in Cygwin. I can't say if it Win32 would
also do it because I have not distilled the problem into a standalone example
and Win32 doesn't run spamd.

The problem seems to be that in Cygwin we cannot reliably send strings of around
30,000 bytes over a tcp socket. If we do some do not arrive and the packet loss
is not corrected by the tcp protocol as it is supposed to.

spamd sends the entire message using a single call to print $client "some
header" $msg_resp;

Without a standalone test I can't say if it happens at an integer overflow
boundary of 32767 or if it  is a smaller number. The error is sporadic, so it is
not a simple buffer or integer overflow. My test is to send the same message to
spamd 10 times in a row with a 5 second sleep between them and checking the
return code. With my 38k test message about two to four of the times the return
code from spamc -x is 76.

If I change that to print $client "some header"; followed by the followiong
code, which slleps for a second after every 10,000 bhytes it works fine.

If I use a $client->flush() instead of sleep(1) it still breaks.

Does anyone have any insight?

This goes in spamd.raw approximately at line 1176 and comes after removing the
$msg_resp from each of the preceding print lines.

 my @segs = unpack("a10000" x ($msg_resp_length/10000) . " a*", $msg_resp);
   foreach my $sg (@segs) {
     print $client $sg;
     sleep(1);
   }
Comment 30 Justin Mason 2004-11-01 11:31:02 UTC
wow, that's a serious bug!   again, a standalone test case would rock so that it
can be escalated upstream to cygwin.

in the meantime, I think the substr hack workaround should be fine, using
$Config{osname} or similar to detect if it's running on cygwin and whether the
fix should be used.  it'd slow down scanning a little on cygwin, but the
alternative is this breakage, and we don't know when it's going to be fixed (if
ever).
Comment 31 Eugene Pressley 2004-11-01 11:36:26 UTC
Do you think this workaround will make it into the official Spam-Assassin 
Build.  And If so how would I know when it is there, other that looking at the 
code myself.
Comment 32 Sidney Markowitz 2004-11-01 12:14:34 UTC
It's probably ok to slow down scanning on Cygwin only.

I'll see about putting together a standalone test for a Cygwin bug report. That
will do two other things that I think are necessary before this can be turned
into a patch: Determine the actual size of data sent over the socket at which
breakage starts, and determine the actual sleep time that is necessary.

Does sleep() allow fractional seconds in its argument? If not, is there a way to
sleep for less than a second?
Comment 33 Eugene Pressley 2004-11-01 12:30:14 UTC
I just wanted to let everyone know that I have see the error with files as 
small as 8K.  What you may want try is to break the message response string 
into packets that would fit in a single tcp frame, shoot for 1500 bytes.  You 
might try it without any delay.  Just as long as the packets are not 
fragmented, might be enough to keep the problem from occuring.
Comment 34 Eugene Pressley 2004-11-01 13:26:46 UTC
Also could you give a little more detail on where to place the new code.
Comment 35 Justin Mason 2004-11-01 14:09:25 UTC
Subject: Re:  SpamC is reporting "failed sanity check" on some messages 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> Does sleep() allow fractional seconds in its argument? If not, is there
> a way to sleep for less than a second?

using "select(undef, undef, undef, $nsecs);" is portable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBhrPxMJF5cimLx9ARAlWnAJ9jQ1xZq+CqxFPYs65VACHLtkvHQwCcD7Gt
pDVf/MkPOwQJoitiUCScia4=
=iYce
-----END PGP SIGNATURE-----

Comment 36 Sidney Markowitz 2004-11-01 15:13:24 UTC
> Just as long as the packets are not fragmented,
> might be enough to keep the problem from occuring

I'll have to try that. If nothing else, the implied flush that would occur after
each call to print may have the same effect as the sleep calls, so breaking it
up into half the size that works with the sleep(1) may do the same thing with no
sleep delays. It's worth a try.

I'll post it as a patch once I've played with a bit more to see what the optimal
fragment size is and what kind of sleep delay I can get away with. If you can't
wait, look around line 1176 in spamd.raw. There are a number of print $client
statements in the different branches of an if ... elsif statement. Each print
outputs somethng followed by $msg_resp. I took out the $msg_resp from each of
them, then followed the entire if with the code I pasted into comment #29

Or you can wait for patch, which will probably be ready when I can get to it 9
or 10 hours from when I am typing this comment.
Comment 37 Loren Wilton 2004-11-01 17:44:20 UTC
Subject: Re:  SpamC is reporting "failed sanity check" on some
 messages

From the MS socket documentation:

The semantics of closesocket are affected by the socket options SO_LINGER and SO_DONTLINGER as follows (SO_DONTLINGER is enabled by default; SO_LINGER is disabled).

If SO_LINGER is set with a zero time-out interval (that is, the LINGER structure members l_onoff is not zero and l_linger is zero), closesocket is not blocked even if queued data has not yet been sent or acknowledged. This is called a hard or abortive close, because the socket's virtual circuit is reset immediately, and any unsent data is lost. Any recv call on the remote side of the circuit will fail with WSAECONNRESET.

If SO_LINGER is set with a nonzero time-out interval on a blocking socket, the closesocket call blocks on a blocking socket until the remaining data has been sent or until the time-out expires. This is called a graceful disconnect. If the time-out expires before all data has been sent, the Windows Sockets implementation terminates the connection before closesocket returns.

Enabling SO_LINGER with a nonzero time-out interval on a nonblocking socket is not recommended. In this case, the call to closesocket will fail with an error of WSAEWOULDBLOCK if the close operation cannot be completed immediately. If closesocket fails with WSAEWOULDBLOCK the socket handle is still valid, and a disconnect is not initiated. The application must call closesocket again to close the socket. If SO_DONTLINGER is set on a stream socket by setting the l_onoff member of the LINGER structure to zero, the closesocket call will return immediately and does not receive WSAWOULDBLOCK whether the socket is blocking or nonblocking. However, any data queued for transmission will be sent, if possible, before the underlying socket is closed. This is also called a graceful disconnect. In this case, the Windows Sockets provider cannot release the socket and other resources for an arbitrary period, thus affecting applications that expect to use all available sockets. This is the default behavior (SO_DONTLINGER is set by default).

Note  To assure that all data is sent and received on a connection, an application should call shutdown before calling closesocket (see Graceful shutdown, linger options, and socket closure for more information). Also note, an FD_CLOSE network event is not posted after closesocket is called.

Here is a summary of closesocket behavior: 

If SO_DONTLINGER is enabled (the default setting) it always returns immediately—connection is gracefully closed in the background. 
If SO_LINGER is enabled with a zero time-out: it always returns immediately —connection is reset/terminated. 
If SO_LINGER is enabled with a nonzero time-out: 
– with a blocking socket, it blocks until all data sent or time-out expires. 

– with a nonblocking socket, it returns immediately indicating failure. 

For additional information please see Graceful shutdown, linger options, and socket closure for more information.


Comment 38 Sidney Markowitz 2004-11-01 18:57:46 UTC
Loren, thanks, it looked promising, but adding a call to shutdown before the
close did not help :-(
Comment 39 Loren Wilton 2004-11-02 00:58:19 UTC
Subject: Re:  SpamC is reporting "failed sanity check" on some messages

I was more thinking of making sure that SO_LINGER is set correctly.  This
really sounds like a classic case of linger being set to zero so the
trailing stuff gets flushed.  I have no idea if that is accessible at the
perl level, but you might be able to dig into the socket implementation code
and see what it is doing.

Alternate test that might work:  Try doing a shutdown, wait 1, then close.

I think you said a sleep before the close didn't help by itself, which is
potentially not unreasonable - the data might not have been flushed yet,
even with a longish delay.  I *think* that a shutdown and a short wait
before the close might maybe do as a workaround.  However, if it does, it is
still a workaround for quite possibly the incorrect value of SO_LINGER in
the original setsockopt call.

I do know that the Win NT/2K/XP socket implementation of TCP doesn't
normally lose data when used correctly.  There have been busted drivers in
the past that would cause this.  But that was pretty long ago, and it seems
highly unlikely that *everyone* would just happen to have one of these, and
it only fails when used from perl.  So my current guess is that there is/are
bug(s) in the perl socket code.  Linger is a pretty classic mistake a lot of
people make.

        Loren

Comment 40 Sidney Markowitz 2004-11-02 04:16:05 UTC
sleep(5) between the call to shutdown and the close doesn't help at all.

I added a -l option to the call to spamc -x to see the error. Tthe results are
interesting. I'm testing with two different email messages. Each test consists
of calling spamc with the message that I am testing with, printing out the
return ocde, sleeping 5 seconds, then repeating. The error output with -l is
about a mismatch in the number of bytes expected and the number received from
spamd. Each message is consistent in how many bytes are received when it
truncates. They are each different, but totally consistent for any one message.
It is not like a single packet loss. One shows 78130 bytes claimed, 33792 bytes
seen, the other 109053 bytes claimed, 41984 bytes seen.

But here's kicker. I tested using the first 39166 bytes of the message of the
first test. It also showed 33792 bytes seen, in other words shutting down the
connection at the exact same place. That indicates some kind of dependency on
the content of the message.
 
I tried adding immediately after $client = $server->accept()

 setsockopt($client, SOL_SOCKET, SO_LINGER, pack("II", 1, 0)) or 
      warn "Can't set SO_LINGER on for socket: $!";

and it did not help. Neither did

setsockopt($client, SOL_SOCKET, SO_LINGER, pack("II", 1, 5)) or 
      warn "Can't set SO_LINGER on for socket: $!";

I still don't have a test case that is independent of spamd, but I have a start
at the server side of it.
Comment 41 Loren Wilton 2004-11-02 04:28:01 UTC
Subject: Re:  SpamC is reporting "failed sanity check" on some messages

> But here's kicker. I tested using the first 39166 bytes of the message of
the
> first test. It also showed 33792 bytes seen, in other words shutting down
the

Well, that pretty well proves it isn't a linger problem.

I wonder what character or character pair is at 33792 or thereabouts.
Perhaps an FF or ^Z?  I know that Telnet uses in-band signaling and can get
its tail in a knot this way, but I wouldn't really expect that on a vanella
TCP socket connection.

Comment 42 Justin Mason 2004-11-02 09:53:58 UTC
hmmmmm....

have you guys got virus scanners installed?  might they do something to "bad"
content being sent over 127.0.0.1?
Comment 43 Sidney Markowitz 2004-11-02 10:30:50 UTC
I did recently install XP Service pack 2 and it has been interacting badly with
lots of things... How about if I format my disk, install linux and see how
everything works after that? :-)

Sigh... I guess this is what VMWare is for...
Comment 44 Fred T 2004-11-02 10:41:53 UTC
Is XP SP2 Firewall enabled?
Comment 45 Sidney Markowitz 2004-11-02 12:56:19 UTC
XP firewall is not enabled. Symantec Client Security is installed, but I have
tested with it disabled and it should not affect anything on localhost anyway.
Well, should is not the same as does not. Fred, what do you have installed?

I just checked back on the dates in the comments and I see that I was
reproducing this bug before I installed SP2.

Another data point: I modified my test server to read a file from disk so it can
send approximately the same as spamd is -- approximately because I used the
email as input without the spamd protocol headers. It should be doing close to
what spamd is doing, except not reading the message over the tcp socket first. I
don't yet have a client written to test it. I used telnet. So far I have not
reproduced the problem that way. My next step is to modify spamc so I can see
exactly where it is cutting off the message stream, but I am back in my 12 hour
delay window until I can back to it.
Comment 46 Eugene Pressley 2004-11-11 09:08:49 UTC
Has there been any progress on this issue?
Comment 47 Sidney Markowitz 2004-11-11 11:36:28 UTC
> Has there been any progress on this issue?

I had to put working on this on hold until after my final exam. I might be able
to get back to it this weekend. I wrote a simplified server side test that does
not fail when I use telnet to send data to it. What I really need to do is write
a client that sends data to the test server which then sends it back. That will
be close to what spamc/spamd are doing. If it fails, then I can start narrowing
down what is necessary to demonstrate the problem, report a bug to the Cygwin
people, and if we are lucky deduce a workaround for spamd.
Comment 48 Sidney Markowitz 2004-11-12 23:16:45 UTC
Created attachment 2513 [details]
simplified server that demonstrates the problem (see following comments)

I have attached a perl server program that does the network stuff the same as
spamd and which demonstrates the problem while being much simplified.

To run this program use

 perl test2.pm 783

where the argument is the port it will use, default 783.

To demonstrate the problem use spamc to repeatedly send it a file consisting of
685 lines each of 75 letter 'a's. Here is the shell script that I used to do
that, with the file zs8.err being the 685 line test file.

myvar=0 ; while [ $myvar -ne 10 ] ; do spamc -x -l < zs8.err > zsa2.out ; echo
$?; myvar=$(( $myvar + 1 )) ; done

Anintersting thing about this is that while it did nor always fail, when it did
fail it was consistently at the same spot in the input file, independent of the
contents of the file:

spamc: failed sanity check, 53365 bytes claimed, 49152 bytes seen
76
Comment 49 Sidney Markowitz 2004-11-12 23:19:01 UTC
changed summary to better reflect actual problem
Comment 50 Sidney Markowitz 2004-11-18 16:42:34 UTC
Narrowing it down just a little bit more: I modified the server in the last
attachment to cache the first message it receives and then keep using that. I
then called it using spamc to send one large message followed by repeatedly
sending a 0 length message. The server kept sending back to spamc the original
large message.

This produces the same intermittent error, truncating in the same place.

That indicates that the sending of the large message to spamd is not a factor in
the truncation of the message that spamd sends out.

My next step in simplifying the test will be to have the server respond to a
client connection by sending back fixed large message, and use something simpler
than spamc as the client to test it.
Comment 51 Sidney Markowitz 2004-11-19 14:51:20 UTC
Created attachment 2530 [details]
even simpler server that demonstrates the problem

I've attached a simpler server program that demonstrates the problem without
requiring spamc in the test and without a lot of data being sent from the
client to the server. Run this under Cygwin. It will run with -w and -T
options. It takes two optional command line arguments. The first is the ip
address to listen on, defaulting to 127.0.0.1. The second argument is the port
number, defaulting to 783.

This server waits for a client to send something on the port, then it sends
back a 68013 byte message with headers that allow it to work with spamc as the
client.  So you can test it by running it and then running spamc -x -l
repeatedly, looking at the return code and error messages from spamc. You can
also use telnet as a client. The message that the server sends back consists of
1000 identical lines prefixed with line numbers 000 to 999 followed by a line
that says END OF TEST. That makes it easy to see how the data receivfed from
teh server is trucated. Just use telnet locahost 783 and type the Enter key
when you get a connection to the server.

This test consistently for me truncates the received data after 49152 bytes.
Comment 52 Sidney Markowitz 2004-11-20 01:11:05 UTC
I think I have this tracked down to a Windows bug :-(

I rewrote the last server test I attached here in C, ported that to use winsock,
and duplicated the problem with no perl and no Cygwin involved.

According to Microsoft documentation the proper way to gracefully shut down a
tcp socket connection is to call shutdown and then wait for an indication that
any pending outgoing data has been sent. Leaving aside for now the question of
how to wait for such an indication, what I found was that if I send out 70000
bytes all at once then call shutdown, the connection is killed and data after
about 48000 bytes is lost. It doesn't matter what I do after the shutdown, so it
is not a matter of waiting for any indication.

The only way I found to avoid the data loss is to sleep before the call to
shutdown. I tried the select() function to look for something about the socket
that would indicate when it is ready for a clean shutdown and didn't find anything.

The only workaround that I can think of right now would be to modify the spamd
protocol to require an acknowledgment back from the client when it has recieved
the message. If the only reason to do that is to make allowances for Windows
being crappy as a server compared to other platforms, it is not something that I
am inclined to do.

I'm open to hearing any other ideas about this, especially from anhyone who has
more expertise in the vagaries of Windows sockets. Otherwise I'm afraid that
this will have to be closed as a WONTFIX and we'll have to document that spamd
cannot be practically run under Windows even using Cygwin.
Comment 53 Daryl C. W. O'Shea 2004-11-20 01:52:25 UTC
Subject: Re:  spamd under Cygwin causes SpamC to report "failed
 sanity check" on some messages

> The only workaround that I can think of right now would be to modify the spamd
> protocol to require an acknowledgment back from the client when it has recieved
> the message. If the only reason to do that is to make allowances for Windows
> being crappy as a server compared to other platforms, it is not something that I
> am inclined to do.

Is this really a bad thing?  How many protocols let the server decide 
when to close the connection (in a non-error state)?  I can't think of 
too many, and no email related ones.

Having spamc issue a QUIT command at the end of the session might not be 
too bad an idea.


Daryl

Comment 54 Sidney Markowitz 2004-11-20 15:21:36 UTC
I did some more Googling to see if *anybody* talks about this bug and was
surprised to find nothing for such an egregious problem.

Just before I gave up I stumbled on what might be a related thread in a Cygwin
mailing list.

I have to read ovedr the full thread to see if it really is the same problem, as
so far I have just skimmed it. But the resolution to that problem is here:

http://www.cygwin.com/ml/cygwin/2004-05/msg00129.html

"However I found source of bug.  Norton Internet Security was messing
with sockets.  Removing NIS related entries in
HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run\ and rebooting did
solve problem.  So, bye bye NIS, will have to tighten ifpw instead."

I have to leave right now, but will dig into this later.
Comment 55 Justin Mason 2004-11-20 15:29:36 UTC
told ya ;)
Comment 56 Eugene Pressley 2004-11-20 16:51:35 UTC
Subject: Re:  spamd under Cygwin causes SpamC to report
	"failed sanity check" on some messages

I do not have NIS running on my W2K3 server and I am still seeing the
problem.

>>> <bugzilla-daemon@bugzilla.spamassassin.org> 11/20/2004 6:29:37 PM
>>>
http://bugzilla.spamassassin.org/show_bug.cgi?id=3917 





------- Additional Comments From jm@jmason.org  2004-11-20 15:29
-------
told ya ;)



------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.



Comment 57 Sidney Markowitz 2004-11-21 00:25:25 UTC
Ok, I read the thread over more carefully and did some experiments. It doesn't
seem to have anything to do with our problem. I already early in this whole
issue eliminated my Symantec firewall as a cause by uninstalling it... I had
hoped when I read the excerpt that I quoted here which mentioned registry
entries that he had found some change to the registry that was not undone by an
uninstall. But no, he just talked about forcing NIS not to load at boot time,
which is certainly taken care of by an uninstall. Also, tracking back to the
beginning of the thread it turns out that he was seeing a problem with
shutdown() killing the connection immediately even with small messages. What we
are dealing with does seem like Windows behavior that happens when more than 50k
bytes are sent out before the shutdown.

I still find it really strange that I have found no mention of this anywhere
through Google. I guess it is unusual for a protocol to have one side send a
very long data stream without then waiting for some kind of acknowledgement from
the other side?

Here's something that might work with minimal impact on the use of spamd/spamc
on other platforms. The only problem is that it requires a small change to spamc
and a change to spamd under Cygwin that causes it to only work with the newer
spamc. But old spamc will continue to work with spamd on non-Windows platforms.

Right now spamc writes the raw message to the spamd socket, does a shutdown of
the write direction (SHUT_WR), then reads the filtered message from the spamd
socket, then does a shutdown of the read direction (SHUT_RD) and a closesocket.

What I propose is that it add a command line option to spamc to not do the first
shutdown. The second shutdown will always be SHUT_RDWR. That way, if spamc is
being used with a Windows spamd, the option will allow spamd to check for the
socket having been shut down by the client before it calls shutdown. Spamd can
do that check only if it is running under Cygwin. The check can be with a
timeout so the worst case it is equivalent to sleeping before the call to
shutdown. The behavior of spamd under Cygwin with an old spamc would be the same
as it is now, but no worse.

I'll test out the idea and put up a patch for spamd.raw and libspamc.c unless
someone has objections to that approach.
Comment 58 Sidney Markowitz 2004-11-21 03:51:48 UTC
I seem to be confused about what perl is doing.

Now that I am back to looking at the behavior in perl again in spamd, I see that
no matter what I do if spamd does

 print $client $msg_resp;

where $msg_resp is long (e.g., 70,000 bytes) the message that is received is
likely to be truncated, even if I sleep(5) before a $client->shutdown(2). The
only way to keep from losing data is to split up $msg_resp into chunks and print
each one separately followed by a sleep(1).

That behavior does not match what I narrowed this down to. Is it possible that
perl buffers the output to the socket so it doesn't really get sent until
shutdown is called, but it does send out the buffer during a sleep?

Speaking of which, it is time for me to sleep before thinking about this any more.
Comment 59 Sidney Markowitz 2004-11-21 12:57:43 UTC
I went back to my last attachment, which is in perl. With a sleep(5) after the
sending of the message to $client, followed by print $client "this is a test\n";
it ends up with a broken pipe error when trying to do that last print.

That implies that perl is doing its own buffering of the data sent out over the
socket and autoflush does not seem to force it to send after every print. If
there is a sleep(1) after a print, then the data is flushed. But without that
sleep call it appears that perl is buffering the print statements, resulting in
a single long send that dies when it is finally sent out.

If it turns out that the only workaround that lets us have a server on Windows
in perl is to break up the messages and insert calls to sleep, I'm inclined once
again to just say that the server does not work in Windows.
Comment 60 Justin Mason 2004-11-21 13:11:38 UTC
try using syswrite() instead of print().  that should evade perl's stdio layer...
Comment 61 Sidney Markowitz 2004-11-21 15:42:44 UTC
syswrite helps my test server, maybe even enough to solve the problem! It may be
a little while before I get to put it into spamd to see how that goes.
Comment 62 Sidney Markowitz 2004-11-21 17:25:08 UTC
Would using syswrite for output require that sysread be used for input on the
same socket instead of getline?
Comment 63 Sidney Markowitz 2004-11-21 19:02:11 UTC
I changed every print $client and printf $client in spamd into using syswrite
and everything works fine, with no other changes required.

I don't know how this bypasses the problem I saw with the C winsock test server.
Perhaps perl splits up the output sent by syswrite into chunks so as not to send
too big a chunk at a time over the socket?

In any case it works. Before I put up a patch I would like people's opinions as
to whether it is ok to just use syswrite or if it should only be under Cygwin
and stay with print if it is not Cygwin.
Comment 64 Justin Mason 2004-11-21 19:07:42 UTC
no, I don't think so; AFAIK it's ok as long as you don't intermix sysread() and
getline(), ie. both reading, not read-with-one-and-write-with-the-other.
Comment 65 Justin Mason 2004-11-21 20:07:25 UTC
btw 'no, I don't think so' was referring to this question: 
> Would using syswrite for output require that sysread be used for input on the
> same socket instead of getline?

sounds like perl's stdio layer in windows/cygwin is buggy.  should probably
report that upstream if possible...

changing spamd to use syswrite() on all OSes, is probably a good idea anyway. 
the use of print is a slight inefficiency, if syswrite() will work instead.

Comment 66 Sidney Markowitz 2004-11-21 20:47:00 UTC
Ok, a stylistic question: there are seven calls to print $client and two to
printf $client. Changing each of them to syswrite can be done by changing

 print $client "SPAMD/1.2 $resphash{EX_OK} PONG\r\n";

into

  my $str = "SPAMD/1.2 $resphash{EX_OK} PONG\r\n";
  syswrite($client, $str, length($str));

or into

  wprint( $client, "SPAMD/1.2 $resphash{EX_OK} PONG\r\n");

and have wprint be a two line sub that calls syswrite.

Any preference?
Comment 67 Sidney Markowitz 2004-11-21 20:58:43 UTC
Created attachment 2531 [details]
patch to spamd.raw to use syswrite instead of print and printf

This patch seems to fix the problem under Cygwin. I didn't get a chance to test
on other platforms. Please try it out and see if it helps if you are running on
Cygwin and if it causes new problems if you are not.

I took a guess at the style question, just to give people something to try.
This defines a wprint sub. It would be easy enough to change to inline calls to
syswrite if the perl gurus prefer that.
Comment 68 Justin Mason 2004-11-21 21:23:54 UTC
Subject: Re:  spamd under Cygwin causes SpamC to report "failed sanity check" on some messages 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


>   my $str = "SPAMD/1.2 $resphash{EX_OK} PONG\r\n";
>   syswrite($client, $str, length($str));

this gets my vote.  btw the length() call is redundant; syswrite
does that itself if needed.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBoXe8MJF5cimLx9ARApYGAKCB4Hi6KliByL5jb8URwfS3emY63QCfbRxi
GPY6cJgYN+HaC05aS6eHKlI=
=UuiY
-----END PGP SIGNATURE-----

Comment 69 Sidney Markowitz 2004-11-21 21:39:21 UTC
> the length() call is redundant

Oh, so all I have to do is replace

  print $client "arg 1", $arg2, $arg3;

with

  syswrite($client, "arg 1" . $arg2 . $arg3);

correct?
Comment 70 Sidney Markowitz 2004-11-21 22:01:42 UTC
Created attachment 2532 [details]
patch to spamd.raw to use syswrite instead of print and printf

Since the length argument to syswrite is optional, the patch can be much
simpler. This one just replaces print and printf with syswrite and concatenates
the data arguments using . instead of ,
Comment 71 Sidney Markowitz 2004-11-21 22:53:06 UTC
Created attachment 2533 [details]
patch to spamd.raw to use syswrite instead of print and printf

This version removes the two calls to sprintf which I had used to replace the
calls to printf $client. They were ugly and unnecessary different from the
similar code that used print.
Comment 72 Justin Mason 2004-11-21 23:49:14 UTC
Subject: Re:  spamd under Cygwin causes SpamC to report "failed sanity check" on some messages 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> Oh, so all I have to do is replace
> 
>   print $client "arg 1", $arg2, $arg3;
> 
> with
> 
>   syswrite($client, "arg 1" . $arg2 . $arg3);
> 
> correct?

yep!

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBoZnMMJF5cimLx9ARAg39AJ0XRDud7FvjA4NOb1WeWxtnEIMlvQCeMKb+
veD8wxISlEsJLjm5xp3X+fc=
=ndrY
-----END PGP SIGNATURE-----

Comment 73 Sidney Markowitz 2004-11-22 01:22:14 UTC
committed to trunk rev 106170

Changed target milestone to 3.1. I don't think running spamd under Cygwin is
important enough to put this in the 3.0 branch.

Closing as fixed, but please review and test on other platforms.
Comment 74 Eugene Pressley 2004-11-25 12:00:55 UTC
I just wanted to let you know that I have had the a 3.1.0 build from last night 
running since then and it is working great.  The SpamAssassin error 76 that I 
was getting is gone, so I think that the fix you created is working.
Comment 75 Eugene Pressley 2005-01-03 14:50:11 UTC
I just tried to install the 3.0.2 build and I was getting the "Result: 76" 
error.  I looked at the code for SpamD and I could not find the SysWrites that 
had been added to correct the problem.  I did a comparison of the 3.0.2 SpamD 
and the 20041125120626 build that I had beed running.  I found the sections of 
code that had the SysWrite commands and copied them over to the 3.0.2 SpamD.  
The corrected the "Result: 76" error.  Why where the corrections removed from 
the 3.0.2 code?
Comment 76 Eugene Pressley 2005-01-03 14:52:42 UTC
Created attachment 2588 [details]
3.0.2 SpamD with Syswrite Mods

Here is the 3.0.2 SpamD with just the SysWrite Mods.
Comment 77 Michael Parker 2005-01-03 16:25:40 UTC
This was closed as fixed and according to comment #73 there was a decision to
not backport to 3.0.2.  So it wasn't.