SA Bugzilla – Bug 3917
spamd under Cygwin causes SpamC to report "failed sanity check" on some messages
Last modified: 2005-01-03 07:25:40 UTC
Build SpamAssassin on Cygwin/Win2k3 system. Using SpamC built from the 3.0.0 distribution, using Cygwin. When certain messages are submited to SpamD using SpamC I get the "failed sanity check" error reporting that the file size returned is smaller than what was submitted. The SpamC client also returns an Error 76 "remote error in protocol" when I use the "-x" option. I will be happy to supply samples of files that have gotten this error, just contact me and let me know where to send the files.
Created attachment 2468 [details] Sample #1
Created attachment 2469 [details] Sample#2
Created attachment 2470 [details] Sample #3
Created attachment 2471 [details] Sample #4
I have attached four messages that recieved this error.
Can you check if it has anything to do with the mail message having DOS vs unix newlines? That would be hidden in the attachments you posted unless you zipped them and attached the zip files.
Created attachment 2472 [details] Zipped copy of attachment samples
I submitted a copy of the samples as a zipped file. Most messages are being properly processed only a small percentage are having this problem. I have noticed that most if not all of the problem messages are multipart-mime files. As for the Dos v/ Unix newline issue I don't know. I have set cygwin to DOS style newlines. All of the files being sent to SpamD should have CRLF line endings. If it is a problem with the newline ending why would only a small percentage of message have a problem.
As a test I tried converting a test message, using a DOS2UNIX utility, before sending it to SpamD. I got the same results. SpamD is only return a portion of the original message, which is what is causing the sanity check to fail. I have noticed that the amount of the original message returned seems to vary a little with each submission.
Created attachment 2473 [details] One more sample The sample that I just attached I tried several times and recieved this: D:\ESA>spamc300 < zs2.out > zs2a.out failed sanity check, 109048 bytes claimed, 33721 bytes seen
My initial test usaing zs2.out did not reproduce the problem, using 3.01 svn branch and running spamc and spamd under Cygwin. I'll have to check it more thoroughly when I get a chance to tonight. What flags are you using to run spamd? If you run spamd -D do you see any useful error messages?
Created attachment 2475 [details] same zs2.out sample with resultant zs2a.out and Debug from SpamD Here is the Debug from SpamD along with the in and out file that was produced from the command: spamc < zs2.out > zs2a.out When I use the -x Command for spamc it will return an Error 76 "EX_PROTOCOL 76 remote error in protocol"
Created attachment 2476 [details] my spamc.exe Here is the spamc from my cygwin build, incase it might be useful.
Just tried the latest CVS build I got the following results: D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109044 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109044 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109044 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109044 bytes claimed, 97209 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109044 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109044 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 85945 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 54201 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 87993 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 34745 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out failed sanity check, 109058 bytes claimed, 33721 bytes seen D:\ESA>zspamc-alt2 -x < zs2.out > zs2a.out You may notice that out of a number of runs some of them worked correctly, this is very bizar. I think I will remove cygwin and rebuild.
Well I reinstalled Cygwin and SA using the latest CVS build, still having the same problem.
It looks like spamd thinks everything is ok, but spamc is getting in error in the socket connection. 76 is, as you said, "remote error in protocol" which I see is defined as "The remote system returned something that was not possible during a protocol exchange". I could see how I could not reproduce it trying it once on my system. Here are some other things to try in order to narrow it down: Do you get the same results if you disable the network checks by running spamd with -L? How about running without Bayes? I'm thinking in terms of what happens if spamd is not doing anything that potemtially takes a long elapsed time. Another interesting extra piece of information would be to add timestamps to the spamd -D output. I guess you could do that by adding in the time to the warn "debug: " $msg line in sub dbg in SpamAssassin.pm. That would show if there are any long delays during the spamd processing in the cases that fail.
hmm. what about character sets? is everyone using the US codepage?
> what about character sets? I don't think it is a matter of the character counts not matching. A protocol error and losing the connection explains better the big difference in counts and the variation in results across runs with the same message.
I have the LANG var set to en_US. The messages the are failing are a variety of language types including "us-ascii". If it is a protocol failure what could be causing it. The SpamC and SpamD are running on the same machine.
> If it is a protocol failure what could be causing it I don't know. I tried forcing a timeout by setting a snall value for the spamc -t option, but that returned an error code 74, not 76 like you are seeing. The effect is close to what you are seeing, though. Spamc is getting an error while spamd thinks nothing is wrong. What are all the command line options you are using for spamd and spamc?
SpamD command line: perl.exe spamd -D -s /var/log/spamd.log -r /var/log/spamd.pid SpamC: spamc -x < infile > outfile
I've reproduced this doing the following. Maybe someone can see if it happens outside of Cygwin. I did this with a build of the 3.0.1 branch. I haven't yet tried it in trunk. I ran spamd using the command line spamd -D -s /c/temp/zsaspderr.log -r /c/temp/spand.pid I ran spamc repeatedly using the example input from the last attached sample using the following script, with the output shown: $ myvar=0 ; while [ $myvar -ne 10 ] ; do spamc -x < zs2.out > zsa2.out ; echo $? ; myvar=$(( $myvar + 1 )) ; done 0 0 0 76 76 0 76 76 76 0
More results: I got the same sporadic return code 76 using svn trunk under Cygwin (perl 5.8.5), and also when I added a sleep n delay of 5, 10, and 20 seconds between successive calls to spamc. So it isn't a timing problem caused by sending rapid messages to spamd. I did not get any return code 76 when I repeated the test under Fedora Core 2, perl 5.8.3.
I get the same results if I use -L and disable auto_whitelist and Bayes in user_prefs. I'm having difficulty getting a network capture, so I still don't know what is going in with the protocol.
I have some firewall problem trying tests between machines, but I did verify that going through a ssh tunnel between my Cygwin box and my Fedora core 2 box, running spamc in Cygwin and spamd in linux has no problem, running spamd in Cygwin and spamc in linux does show the problem, with the return code 76 happening every time. I substituted a large file made up of sample-nonspam.txt concatenated with itself many times and the same error happens. I reduced the size of the file. The smaller files never showed the return code 76, larger files did. At the borderline, a file with GTUBE (i.e., marked as spam) began showing the problem at a smaller size than a non-spam file.
hmm. possibility: a helper app is closing the fd?
Sorry for not checking in for a while. I was away at a technical conference. I am happy to see that the problem can be replicated. So does anyone have any idea why this is happening? It seems to be a problem when the SpamD is running on a CygWin system. Is the problem with SpamD or with CygWin?
I found the firewall that was holding up my testing -- remnant of a VPN client I had been using -- and captured the traffic from a test with spamc on a linux box and spamd running under Cygwin. Ethereal shows that spamd is sending a truncated message. There are no errors, just a data packet that should not be the last one and then a FIN packet. A debugging statement showed that $client->close() is being called in the right place in spamd. I tried adding a $client->flush() and even a $client->flush(); sleep(5); before the $client->close() to no effect. The truncated message is not consistent... It does not truncate in the same place each time and it does not happen every time on identical input.
I have a workaround that tests ok, but I am not submitting a patch because I don't think it is a workable solution. Instead I'm putting it in this comment in the hopes that someone has some idea what it implies about the problem and what might be done about it. Remember, this only seems to happen in Cygwin. I can't say if it Win32 would also do it because I have not distilled the problem into a standalone example and Win32 doesn't run spamd. The problem seems to be that in Cygwin we cannot reliably send strings of around 30,000 bytes over a tcp socket. If we do some do not arrive and the packet loss is not corrected by the tcp protocol as it is supposed to. spamd sends the entire message using a single call to print $client "some header" $msg_resp; Without a standalone test I can't say if it happens at an integer overflow boundary of 32767 or if it is a smaller number. The error is sporadic, so it is not a simple buffer or integer overflow. My test is to send the same message to spamd 10 times in a row with a 5 second sleep between them and checking the return code. With my 38k test message about two to four of the times the return code from spamc -x is 76. If I change that to print $client "some header"; followed by the followiong code, which slleps for a second after every 10,000 bhytes it works fine. If I use a $client->flush() instead of sleep(1) it still breaks. Does anyone have any insight? This goes in spamd.raw approximately at line 1176 and comes after removing the $msg_resp from each of the preceding print lines. my @segs = unpack("a10000" x ($msg_resp_length/10000) . " a*", $msg_resp); foreach my $sg (@segs) { print $client $sg; sleep(1); }
wow, that's a serious bug! again, a standalone test case would rock so that it can be escalated upstream to cygwin. in the meantime, I think the substr hack workaround should be fine, using $Config{osname} or similar to detect if it's running on cygwin and whether the fix should be used. it'd slow down scanning a little on cygwin, but the alternative is this breakage, and we don't know when it's going to be fixed (if ever).
Do you think this workaround will make it into the official Spam-Assassin Build. And If so how would I know when it is there, other that looking at the code myself.
It's probably ok to slow down scanning on Cygwin only. I'll see about putting together a standalone test for a Cygwin bug report. That will do two other things that I think are necessary before this can be turned into a patch: Determine the actual size of data sent over the socket at which breakage starts, and determine the actual sleep time that is necessary. Does sleep() allow fractional seconds in its argument? If not, is there a way to sleep for less than a second?
I just wanted to let everyone know that I have see the error with files as small as 8K. What you may want try is to break the message response string into packets that would fit in a single tcp frame, shoot for 1500 bytes. You might try it without any delay. Just as long as the packets are not fragmented, might be enough to keep the problem from occuring.
Also could you give a little more detail on where to place the new code.
Subject: Re: SpamC is reporting "failed sanity check" on some messages -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > Does sleep() allow fractional seconds in its argument? If not, is there > a way to sleep for less than a second? using "select(undef, undef, undef, $nsecs);" is portable. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFBhrPxMJF5cimLx9ARAlWnAJ9jQ1xZq+CqxFPYs65VACHLtkvHQwCcD7Gt pDVf/MkPOwQJoitiUCScia4= =iYce -----END PGP SIGNATURE-----
> Just as long as the packets are not fragmented, > might be enough to keep the problem from occuring I'll have to try that. If nothing else, the implied flush that would occur after each call to print may have the same effect as the sleep calls, so breaking it up into half the size that works with the sleep(1) may do the same thing with no sleep delays. It's worth a try. I'll post it as a patch once I've played with a bit more to see what the optimal fragment size is and what kind of sleep delay I can get away with. If you can't wait, look around line 1176 in spamd.raw. There are a number of print $client statements in the different branches of an if ... elsif statement. Each print outputs somethng followed by $msg_resp. I took out the $msg_resp from each of them, then followed the entire if with the code I pasted into comment #29 Or you can wait for patch, which will probably be ready when I can get to it 9 or 10 hours from when I am typing this comment.
Subject: Re: SpamC is reporting "failed sanity check" on some messages From the MS socket documentation: The semantics of closesocket are affected by the socket options SO_LINGER and SO_DONTLINGER as follows (SO_DONTLINGER is enabled by default; SO_LINGER is disabled). If SO_LINGER is set with a zero time-out interval (that is, the LINGER structure members l_onoff is not zero and l_linger is zero), closesocket is not blocked even if queued data has not yet been sent or acknowledged. This is called a hard or abortive close, because the socket's virtual circuit is reset immediately, and any unsent data is lost. Any recv call on the remote side of the circuit will fail with WSAECONNRESET. If SO_LINGER is set with a nonzero time-out interval on a blocking socket, the closesocket call blocks on a blocking socket until the remaining data has been sent or until the time-out expires. This is called a graceful disconnect. If the time-out expires before all data has been sent, the Windows Sockets implementation terminates the connection before closesocket returns. Enabling SO_LINGER with a nonzero time-out interval on a nonblocking socket is not recommended. In this case, the call to closesocket will fail with an error of WSAEWOULDBLOCK if the close operation cannot be completed immediately. If closesocket fails with WSAEWOULDBLOCK the socket handle is still valid, and a disconnect is not initiated. The application must call closesocket again to close the socket. If SO_DONTLINGER is set on a stream socket by setting the l_onoff member of the LINGER structure to zero, the closesocket call will return immediately and does not receive WSAWOULDBLOCK whether the socket is blocking or nonblocking. However, any data queued for transmission will be sent, if possible, before the underlying socket is closed. This is also called a graceful disconnect. In this case, the Windows Sockets provider cannot release the socket and other resources for an arbitrary period, thus affecting applications that expect to use all available sockets. This is the default behavior (SO_DONTLINGER is set by default). Note To assure that all data is sent and received on a connection, an application should call shutdown before calling closesocket (see Graceful shutdown, linger options, and socket closure for more information). Also note, an FD_CLOSE network event is not posted after closesocket is called. Here is a summary of closesocket behavior: If SO_DONTLINGER is enabled (the default setting) it always returns immediately—connection is gracefully closed in the background. If SO_LINGER is enabled with a zero time-out: it always returns immediately —connection is reset/terminated. If SO_LINGER is enabled with a nonzero time-out: – with a blocking socket, it blocks until all data sent or time-out expires. – with a nonblocking socket, it returns immediately indicating failure. For additional information please see Graceful shutdown, linger options, and socket closure for more information.
Loren, thanks, it looked promising, but adding a call to shutdown before the close did not help :-(
Subject: Re: SpamC is reporting "failed sanity check" on some messages I was more thinking of making sure that SO_LINGER is set correctly. This really sounds like a classic case of linger being set to zero so the trailing stuff gets flushed. I have no idea if that is accessible at the perl level, but you might be able to dig into the socket implementation code and see what it is doing. Alternate test that might work: Try doing a shutdown, wait 1, then close. I think you said a sleep before the close didn't help by itself, which is potentially not unreasonable - the data might not have been flushed yet, even with a longish delay. I *think* that a shutdown and a short wait before the close might maybe do as a workaround. However, if it does, it is still a workaround for quite possibly the incorrect value of SO_LINGER in the original setsockopt call. I do know that the Win NT/2K/XP socket implementation of TCP doesn't normally lose data when used correctly. There have been busted drivers in the past that would cause this. But that was pretty long ago, and it seems highly unlikely that *everyone* would just happen to have one of these, and it only fails when used from perl. So my current guess is that there is/are bug(s) in the perl socket code. Linger is a pretty classic mistake a lot of people make. Loren
sleep(5) between the call to shutdown and the close doesn't help at all. I added a -l option to the call to spamc -x to see the error. Tthe results are interesting. I'm testing with two different email messages. Each test consists of calling spamc with the message that I am testing with, printing out the return ocde, sleeping 5 seconds, then repeating. The error output with -l is about a mismatch in the number of bytes expected and the number received from spamd. Each message is consistent in how many bytes are received when it truncates. They are each different, but totally consistent for any one message. It is not like a single packet loss. One shows 78130 bytes claimed, 33792 bytes seen, the other 109053 bytes claimed, 41984 bytes seen. But here's kicker. I tested using the first 39166 bytes of the message of the first test. It also showed 33792 bytes seen, in other words shutting down the connection at the exact same place. That indicates some kind of dependency on the content of the message. I tried adding immediately after $client = $server->accept() setsockopt($client, SOL_SOCKET, SO_LINGER, pack("II", 1, 0)) or warn "Can't set SO_LINGER on for socket: $!"; and it did not help. Neither did setsockopt($client, SOL_SOCKET, SO_LINGER, pack("II", 1, 5)) or warn "Can't set SO_LINGER on for socket: $!"; I still don't have a test case that is independent of spamd, but I have a start at the server side of it.
Subject: Re: SpamC is reporting "failed sanity check" on some messages > But here's kicker. I tested using the first 39166 bytes of the message of the > first test. It also showed 33792 bytes seen, in other words shutting down the Well, that pretty well proves it isn't a linger problem. I wonder what character or character pair is at 33792 or thereabouts. Perhaps an FF or ^Z? I know that Telnet uses in-band signaling and can get its tail in a knot this way, but I wouldn't really expect that on a vanella TCP socket connection.
hmmmmm.... have you guys got virus scanners installed? might they do something to "bad" content being sent over 127.0.0.1?
I did recently install XP Service pack 2 and it has been interacting badly with lots of things... How about if I format my disk, install linux and see how everything works after that? :-) Sigh... I guess this is what VMWare is for...
Is XP SP2 Firewall enabled?
XP firewall is not enabled. Symantec Client Security is installed, but I have tested with it disabled and it should not affect anything on localhost anyway. Well, should is not the same as does not. Fred, what do you have installed? I just checked back on the dates in the comments and I see that I was reproducing this bug before I installed SP2. Another data point: I modified my test server to read a file from disk so it can send approximately the same as spamd is -- approximately because I used the email as input without the spamd protocol headers. It should be doing close to what spamd is doing, except not reading the message over the tcp socket first. I don't yet have a client written to test it. I used telnet. So far I have not reproduced the problem that way. My next step is to modify spamc so I can see exactly where it is cutting off the message stream, but I am back in my 12 hour delay window until I can back to it.
Has there been any progress on this issue?
> Has there been any progress on this issue? I had to put working on this on hold until after my final exam. I might be able to get back to it this weekend. I wrote a simplified server side test that does not fail when I use telnet to send data to it. What I really need to do is write a client that sends data to the test server which then sends it back. That will be close to what spamc/spamd are doing. If it fails, then I can start narrowing down what is necessary to demonstrate the problem, report a bug to the Cygwin people, and if we are lucky deduce a workaround for spamd.
Created attachment 2513 [details] simplified server that demonstrates the problem (see following comments) I have attached a perl server program that does the network stuff the same as spamd and which demonstrates the problem while being much simplified. To run this program use perl test2.pm 783 where the argument is the port it will use, default 783. To demonstrate the problem use spamc to repeatedly send it a file consisting of 685 lines each of 75 letter 'a's. Here is the shell script that I used to do that, with the file zs8.err being the 685 line test file. myvar=0 ; while [ $myvar -ne 10 ] ; do spamc -x -l < zs8.err > zsa2.out ; echo $?; myvar=$(( $myvar + 1 )) ; done Anintersting thing about this is that while it did nor always fail, when it did fail it was consistently at the same spot in the input file, independent of the contents of the file: spamc: failed sanity check, 53365 bytes claimed, 49152 bytes seen 76
changed summary to better reflect actual problem
Narrowing it down just a little bit more: I modified the server in the last attachment to cache the first message it receives and then keep using that. I then called it using spamc to send one large message followed by repeatedly sending a 0 length message. The server kept sending back to spamc the original large message. This produces the same intermittent error, truncating in the same place. That indicates that the sending of the large message to spamd is not a factor in the truncation of the message that spamd sends out. My next step in simplifying the test will be to have the server respond to a client connection by sending back fixed large message, and use something simpler than spamc as the client to test it.
Created attachment 2530 [details] even simpler server that demonstrates the problem I've attached a simpler server program that demonstrates the problem without requiring spamc in the test and without a lot of data being sent from the client to the server. Run this under Cygwin. It will run with -w and -T options. It takes two optional command line arguments. The first is the ip address to listen on, defaulting to 127.0.0.1. The second argument is the port number, defaulting to 783. This server waits for a client to send something on the port, then it sends back a 68013 byte message with headers that allow it to work with spamc as the client. So you can test it by running it and then running spamc -x -l repeatedly, looking at the return code and error messages from spamc. You can also use telnet as a client. The message that the server sends back consists of 1000 identical lines prefixed with line numbers 000 to 999 followed by a line that says END OF TEST. That makes it easy to see how the data receivfed from teh server is trucated. Just use telnet locahost 783 and type the Enter key when you get a connection to the server. This test consistently for me truncates the received data after 49152 bytes.
I think I have this tracked down to a Windows bug :-( I rewrote the last server test I attached here in C, ported that to use winsock, and duplicated the problem with no perl and no Cygwin involved. According to Microsoft documentation the proper way to gracefully shut down a tcp socket connection is to call shutdown and then wait for an indication that any pending outgoing data has been sent. Leaving aside for now the question of how to wait for such an indication, what I found was that if I send out 70000 bytes all at once then call shutdown, the connection is killed and data after about 48000 bytes is lost. It doesn't matter what I do after the shutdown, so it is not a matter of waiting for any indication. The only way I found to avoid the data loss is to sleep before the call to shutdown. I tried the select() function to look for something about the socket that would indicate when it is ready for a clean shutdown and didn't find anything. The only workaround that I can think of right now would be to modify the spamd protocol to require an acknowledgment back from the client when it has recieved the message. If the only reason to do that is to make allowances for Windows being crappy as a server compared to other platforms, it is not something that I am inclined to do. I'm open to hearing any other ideas about this, especially from anhyone who has more expertise in the vagaries of Windows sockets. Otherwise I'm afraid that this will have to be closed as a WONTFIX and we'll have to document that spamd cannot be practically run under Windows even using Cygwin.
Subject: Re: spamd under Cygwin causes SpamC to report "failed sanity check" on some messages > The only workaround that I can think of right now would be to modify the spamd > protocol to require an acknowledgment back from the client when it has recieved > the message. If the only reason to do that is to make allowances for Windows > being crappy as a server compared to other platforms, it is not something that I > am inclined to do. Is this really a bad thing? How many protocols let the server decide when to close the connection (in a non-error state)? I can't think of too many, and no email related ones. Having spamc issue a QUIT command at the end of the session might not be too bad an idea. Daryl
I did some more Googling to see if *anybody* talks about this bug and was surprised to find nothing for such an egregious problem. Just before I gave up I stumbled on what might be a related thread in a Cygwin mailing list. I have to read ovedr the full thread to see if it really is the same problem, as so far I have just skimmed it. But the resolution to that problem is here: http://www.cygwin.com/ml/cygwin/2004-05/msg00129.html "However I found source of bug. Norton Internet Security was messing with sockets. Removing NIS related entries in HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run\ and rebooting did solve problem. So, bye bye NIS, will have to tighten ifpw instead." I have to leave right now, but will dig into this later.
told ya ;)
Subject: Re: spamd under Cygwin causes SpamC to report "failed sanity check" on some messages I do not have NIS running on my W2K3 server and I am still seeing the problem. >>> <bugzilla-daemon@bugzilla.spamassassin.org> 11/20/2004 6:29:37 PM >>> http://bugzilla.spamassassin.org/show_bug.cgi?id=3917 ------- Additional Comments From jm@jmason.org 2004-11-20 15:29 ------- told ya ;) ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter.
Ok, I read the thread over more carefully and did some experiments. It doesn't seem to have anything to do with our problem. I already early in this whole issue eliminated my Symantec firewall as a cause by uninstalling it... I had hoped when I read the excerpt that I quoted here which mentioned registry entries that he had found some change to the registry that was not undone by an uninstall. But no, he just talked about forcing NIS not to load at boot time, which is certainly taken care of by an uninstall. Also, tracking back to the beginning of the thread it turns out that he was seeing a problem with shutdown() killing the connection immediately even with small messages. What we are dealing with does seem like Windows behavior that happens when more than 50k bytes are sent out before the shutdown. I still find it really strange that I have found no mention of this anywhere through Google. I guess it is unusual for a protocol to have one side send a very long data stream without then waiting for some kind of acknowledgement from the other side? Here's something that might work with minimal impact on the use of spamd/spamc on other platforms. The only problem is that it requires a small change to spamc and a change to spamd under Cygwin that causes it to only work with the newer spamc. But old spamc will continue to work with spamd on non-Windows platforms. Right now spamc writes the raw message to the spamd socket, does a shutdown of the write direction (SHUT_WR), then reads the filtered message from the spamd socket, then does a shutdown of the read direction (SHUT_RD) and a closesocket. What I propose is that it add a command line option to spamc to not do the first shutdown. The second shutdown will always be SHUT_RDWR. That way, if spamc is being used with a Windows spamd, the option will allow spamd to check for the socket having been shut down by the client before it calls shutdown. Spamd can do that check only if it is running under Cygwin. The check can be with a timeout so the worst case it is equivalent to sleeping before the call to shutdown. The behavior of spamd under Cygwin with an old spamc would be the same as it is now, but no worse. I'll test out the idea and put up a patch for spamd.raw and libspamc.c unless someone has objections to that approach.
I seem to be confused about what perl is doing. Now that I am back to looking at the behavior in perl again in spamd, I see that no matter what I do if spamd does print $client $msg_resp; where $msg_resp is long (e.g., 70,000 bytes) the message that is received is likely to be truncated, even if I sleep(5) before a $client->shutdown(2). The only way to keep from losing data is to split up $msg_resp into chunks and print each one separately followed by a sleep(1). That behavior does not match what I narrowed this down to. Is it possible that perl buffers the output to the socket so it doesn't really get sent until shutdown is called, but it does send out the buffer during a sleep? Speaking of which, it is time for me to sleep before thinking about this any more.
I went back to my last attachment, which is in perl. With a sleep(5) after the sending of the message to $client, followed by print $client "this is a test\n"; it ends up with a broken pipe error when trying to do that last print. That implies that perl is doing its own buffering of the data sent out over the socket and autoflush does not seem to force it to send after every print. If there is a sleep(1) after a print, then the data is flushed. But without that sleep call it appears that perl is buffering the print statements, resulting in a single long send that dies when it is finally sent out. If it turns out that the only workaround that lets us have a server on Windows in perl is to break up the messages and insert calls to sleep, I'm inclined once again to just say that the server does not work in Windows.
try using syswrite() instead of print(). that should evade perl's stdio layer...
syswrite helps my test server, maybe even enough to solve the problem! It may be a little while before I get to put it into spamd to see how that goes.
Would using syswrite for output require that sysread be used for input on the same socket instead of getline?
I changed every print $client and printf $client in spamd into using syswrite and everything works fine, with no other changes required. I don't know how this bypasses the problem I saw with the C winsock test server. Perhaps perl splits up the output sent by syswrite into chunks so as not to send too big a chunk at a time over the socket? In any case it works. Before I put up a patch I would like people's opinions as to whether it is ok to just use syswrite or if it should only be under Cygwin and stay with print if it is not Cygwin.
no, I don't think so; AFAIK it's ok as long as you don't intermix sysread() and getline(), ie. both reading, not read-with-one-and-write-with-the-other.
btw 'no, I don't think so' was referring to this question: > Would using syswrite for output require that sysread be used for input on the > same socket instead of getline? sounds like perl's stdio layer in windows/cygwin is buggy. should probably report that upstream if possible... changing spamd to use syswrite() on all OSes, is probably a good idea anyway. the use of print is a slight inefficiency, if syswrite() will work instead.
Ok, a stylistic question: there are seven calls to print $client and two to printf $client. Changing each of them to syswrite can be done by changing print $client "SPAMD/1.2 $resphash{EX_OK} PONG\r\n"; into my $str = "SPAMD/1.2 $resphash{EX_OK} PONG\r\n"; syswrite($client, $str, length($str)); or into wprint( $client, "SPAMD/1.2 $resphash{EX_OK} PONG\r\n"); and have wprint be a two line sub that calls syswrite. Any preference?
Created attachment 2531 [details] patch to spamd.raw to use syswrite instead of print and printf This patch seems to fix the problem under Cygwin. I didn't get a chance to test on other platforms. Please try it out and see if it helps if you are running on Cygwin and if it causes new problems if you are not. I took a guess at the style question, just to give people something to try. This defines a wprint sub. It would be easy enough to change to inline calls to syswrite if the perl gurus prefer that.
Subject: Re: spamd under Cygwin causes SpamC to report "failed sanity check" on some messages -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > my $str = "SPAMD/1.2 $resphash{EX_OK} PONG\r\n"; > syswrite($client, $str, length($str)); this gets my vote. btw the length() call is redundant; syswrite does that itself if needed. - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFBoXe8MJF5cimLx9ARApYGAKCB4Hi6KliByL5jb8URwfS3emY63QCfbRxi GPY6cJgYN+HaC05aS6eHKlI= =UuiY -----END PGP SIGNATURE-----
> the length() call is redundant Oh, so all I have to do is replace print $client "arg 1", $arg2, $arg3; with syswrite($client, "arg 1" . $arg2 . $arg3); correct?
Created attachment 2532 [details] patch to spamd.raw to use syswrite instead of print and printf Since the length argument to syswrite is optional, the patch can be much simpler. This one just replaces print and printf with syswrite and concatenates the data arguments using . instead of ,
Created attachment 2533 [details] patch to spamd.raw to use syswrite instead of print and printf This version removes the two calls to sprintf which I had used to replace the calls to printf $client. They were ugly and unnecessary different from the similar code that used print.
Subject: Re: spamd under Cygwin causes SpamC to report "failed sanity check" on some messages -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > Oh, so all I have to do is replace > > print $client "arg 1", $arg2, $arg3; > > with > > syswrite($client, "arg 1" . $arg2 . $arg3); > > correct? yep! - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFBoZnMMJF5cimLx9ARAg39AJ0XRDud7FvjA4NOb1WeWxtnEIMlvQCeMKb+ veD8wxISlEsJLjm5xp3X+fc= =ndrY -----END PGP SIGNATURE-----
committed to trunk rev 106170 Changed target milestone to 3.1. I don't think running spamd under Cygwin is important enough to put this in the 3.0 branch. Closing as fixed, but please review and test on other platforms.
I just wanted to let you know that I have had the a 3.1.0 build from last night running since then and it is working great. The SpamAssassin error 76 that I was getting is gone, so I think that the fix you created is working.
I just tried to install the 3.0.2 build and I was getting the "Result: 76" error. I looked at the code for SpamD and I could not find the SysWrites that had been added to correct the problem. I did a comparison of the 3.0.2 SpamD and the 20041125120626 build that I had beed running. I found the sections of code that had the SysWrite commands and copied them over to the 3.0.2 SpamD. The corrected the "Result: 76" error. Why where the corrections removed from the 3.0.2 code?
Created attachment 2588 [details] 3.0.2 SpamD with Syswrite Mods Here is the 3.0.2 SpamD with just the SysWrite Mods.
This was closed as fixed and according to comment #73 there was a decision to not backport to 3.0.2. So it wasn't.