5981 – Improved error detection and reporting

Bug 5981 - Improved error detection and reporting

Summary: Improved error detection and reporting

Status:	RESOLVED FIXED

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Libraries (show other bugs)
Version:	SVN Trunk (Latest Devel Version)
Hardware:	All All

Importance:	P5 enhancement
Target Milestone:	Undefined
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2008-09-18 09:42 UTC by Mark Martinec
Modified:	2009-11-26 11:07 UTC (History)
CC List:	0 users

Attachment	Type	Modified	Status	Actions	Submitter/CLA Status
proposed patch (just to give an idea on what I have in mind)	patch			None	Mark Martinec
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Mark Martinec 2008-09-18 09:42:11 UTC

This trouble ticket is about improved error detection and reporting
in SpamAssassin. My guideline is: a missed opportunity to early problem
detection & reporting leads to hard-to-troubleshoot secondary failures,
which adds unnecessary work for maintainers and to user dissatisfaction.
An extra bit of effort and discipline while coding brings tenfold benefits
to program maintainability.

First some background - you'll notice this has been my pet peeve
since a long time:


> Bug 4412:
> Comment  #24 From Mark Martinec  2005-08-18
>
> [...] The reasoning is that if something unexpected happens,
> it is better to be reported as soon as possible and with as much
> information as possible (the $!), instead of being possibly masked
> by further attempts.
> [...]
> Unfortunately it didn't report the failure, and wasted the whole round 
> of 64k attempts to not do anything useful but return undef in $sock. 
>  
> WHICH MADE ME FIX THE ERROR REPORTING FIRST. Ignoring status, or not 
> reporting the full error status is always bad, at least in the long run, 
> when one has to deal with hordes of users, and remote troubleshooting 
> is very hard to do. 
> 
> Comment  #26 From Justin Mason  2005-08-18
> I initially sided with Sidney, but Mark's statements all make sense to me.
> [...] So let's go with the improved error reporting, and only later see
> if there's a problem left that we need to worry about...
> 
> 
> Bug 5507:
> Reporter: Mark Martinec  2007-06-10
>
> I'm attaching a small patch to improve diagnostics when
> site rules directory was present during install, but was
> inaccessible to a process running SA.
>   It took us some running around in circles (on a mailing list)
> when a user reported SA failing [...] even though the directory
> was present and accessible to anyone.
>   It turned out that one directory above [...] was too strongly
> protected, and a sloppy -e() test in Perl does not distinguish
> between 'does not exist' and 'is inaccessible'. [...]
> Please incorporate the patch to facilitate future troubleshooting.
> 
> Comment #4 From Mark Martinec  2007-08-22
> Perhaps, although there are dozens of other cases which test
> for -f, -d, -e and similar, without calling stat() or lstat()
> first, and therefore can not distinguish between (for example)
> not being a directory or not existing or an access violation,
> and not being able to be specific whether a test should follow
> a symbolic link or not. In the least this causes a misleading
> diagnostics, or at worst can cause a wrong decision.
> 
> If you think this can be a good idea, I'd like to undertake
> another global sweep across modules and replace simple
> tests like -f $file with a: stat($file); ... -f _,
> along the lines of the patch above [...]
> 
> The amount of change would make this suitable for 3.3.0,
> and probably not worth fixing a single case for 3.2.* .
> 
> Comment #8 From Tom Schulz 2007-10-18
> Should a new bug be opened for comment #4 ?
> 
> Comment #9 From Mark Martinec 2007-10-18
> I won't forget about my intent anyway - have notes scribbled in
> many places on the printed listing of SA code, which I reviewed
> during my summer vacation.


It took me a year (after my code audit in summer of 2007) to finally
invest some time and go systematically through code, adding status
checks and corresponding problem reporting throughout the code.
With little exceptions I added a check to each end every system call:

  open, close, readline, sysread, syswrite, print, printf, binmode,
  opendir, closedir, stat, lstat, -f, -d, -e, (and the like),
  mkdir, rmdir, unlink, flush, seek, tell, waitpid, flock

Also improved detection and reporting of spawned process termination
failures, replacing manual ad-hoc testing of a $? by new routines
in module Util: proc_status_ok and exit_status_str, which I borrowed
from amavisd-new (it is my own code, I assume I can re-use it here
regardless of a GPL license of amavisd-new).

Where there already was a status check, I tried to keep it unchanged,
except for possible unification of wording in related messages.

To some extent I tried to keep a style of a local neighborhood code
regarding use of TAB, indentation and the like, although the existing
style is somewhat inconsistent and I didn't try much to unify or
keep it.

How a problem is reported varies from place to place.
A choice between die / warn / dbg / info / print STDERR / ignore
is often rather arbitrary in existing code. Where there is a good
reason for the choice I adopted a choice from neighboring code.
Otherwise my main choice was to use a die. Most code, especially
the plugins, are wrapped in some outer level evals so a die does not
imply a program termination, just a limited loss of functionality.
When there is a serious problem in outer layers, it is better for a
program to fail than for example to result in a partially chopped-off
mail (e.g. in case of disk quota exceeded, disk full, I/O errors).

It is all in a form of a single diff (single revision), so it is
easy to back it off in case of serious objections or difficulties.
The diff is quite hefty, consisting of 195 patch chunks and
2900 lines of a unified diff, although most of the changes follow
the same idioms.

There are some corners which I left mostly untouched:
- spamd.raw and spamc (as I'm not using them, except for testing),
- mass checks, tests, tools, rules
- sa-update.raw (I had this finished, but then I lost changes
  so I must do it again)
- file based Bayes and AWL stores (as I'm not using them, only SQL)

I'm running this code for a week now on our production mailer,
and it does pass self-tests. Still, as tests and our usage profile
does not cover all setup scenarios, it is possible there might
be some error still lurking somewhere, or that some reaction to an
error is too strong. I'd like to deal with these on a case-by-case
basis, while keeping the general direction.

Barring strong objections, I'd like to roll-in the patch to HEAD
and see how it goes, hoping to receive prompt feedback on problems
that might show up, and deal with them as soon as possible.

Perhaps tomorrow earlier in the day would be a good time to check
it in, covering European afternoon and US daytime local times
so that any problems could be promptly resolved. (or now, if you
are impatient)

Comment 1 Mark Martinec 2008-09-18 09:47:49 UTC

Created attachment 4369 [details]
proposed patch (just to give an idea on what I have in mind)

Comment 2 Justin Mason 2008-09-19 03:34:16 UTC

wow, that's comprehensive!  +1 for application to trunk.

the relicensing of those 2 GPL functions is, as you say, fine -- since you're their author, you can reuse them as you like.

Comment 3 Mark Martinec 2008-09-19 04:28:24 UTC

> wow, that's comprehensive!  +1 for application to trunk.

Thanks. Here it comes, let's see how it fares...

Committed revision 697056.

Comment 4 Sidney Markowitz 2008-09-19 05:13:36 UTC

I really like this. We should do our best to encourage people to test with this well before 3.3 is likely to be released. There's a lot of potential for old bugs to be revealed once people start running with this code.

Comment 5 Mark Martinec 2008-09-19 08:05:01 UTC

> There's a lot of potential for old
> bugs to be revealed once people start running with this code.

I see we have the first old-time bug uncovered (see Hudson report).
The ArchiveIterator is happily letting _run_file process directory files,
which causes read(2) to fail with EISDIR on NFS.
A question: should _run_file check by itself (stat) what type of a path
it got as an argument, or should _scan_directory take care not to include
directory names in a list of files, or should do a stat unconditionally?

Comment 6 Mark Martinec 2008-09-19 08:06:30 UTC

< directory names in a list of files, or should do a stat unconditionally?

> directory names in a list of files, or should _scan_file do a stat unconditionally?

Comment 7 Justin Mason 2008-09-19 08:23:11 UTC

(In reply to comment #6)
> < directory names in a list of files, or should do a stat unconditionally?
> 
> > directory names in a list of files, or should _scan_file do a stat unconditionally?

_run_file is already statting the file/dir:

  if (! $self->{opt_all} && -s INPUT > BIG_BYTES) {

so I say skip directories there.  stat()s in the _scan_file method can be expensive if there are millions of files.

Comment 8 Mark Martinec 2008-09-19 09:44:21 UTC

> _run_file is already statting the file/dir: 
>   if (! $self->{opt_all} && -s INPUT > BIG_BYTES) {
> so I say skip directories there.  stat()s in the _scan_file method can be
> expensive if there are millions of files.

Ok, done (and some more). This should deal with failing bayesdbm.t test 38
on Hudson; don't know yet what happened at test 39.


Committed revision 697147 :
  ArchiveIterator: prevent _scan_directory from passing directories
  to _scan_file (on NFS it would fail with EISDIR on read(2);
  let _run_file double-check what kind of a path it is dealing
  with, also allow it to deal with STDIN (which might become
  useful some day)

Comment 9 Mark Martinec 2009-08-06 12:25:36 UTC

Closing, as the bulk of it is now done.
Still have to keep the issues in mind
and not no become sloppy in future...

Comment 10 Mark Martinec 2009-11-26 11:07:13 UTC

  SATest.pm: include errno of a failing copy command in error messages;
  SATest.pm and 20_saw_ampersand.t: add some missing status tests and
  report failures; turn off debugging in sa_compile.t to reduce noise
Sending        t/SATest.pm
Sending        t/sa_compile.t
Sending        xt/20_saw_ampersand.t
Committed revision 884667.