Bug 4693

Summary: copy_config failure caused disk full -> "Attempt to free unreferenced scalar"
Product: Spamassassin Reporter: Dallas Engelken <dallas>
Component: spamc/spamdAssignee: SpamAssassin Developer Mailing List <dev>
Status: RESOLVED WORKSFORME    
Severity: major CC: apache, spamassassin
Priority: P5    
Version: 3.1.0   
Target Milestone: Undefined   
Hardware: Other   
OS: other   
Whiteboard:

Description Dallas Engelken 2005-11-18 21:56:33 UTC
something went terribly wrong here...  in approx 24 hours, i had a 34GB spam.log
file and a disk full.  

# grep 'Attempt to free' spamd.log | wc -l
4970786

Approximately 800-900 'Attempt to free' log entries/second is about what i see.
 I t started right after a copy_config failure, which alarm'd 10 seconds after
an incoming connection was logged.

2005-11-17 13:12:12.421475500 [20716] error: __alarm__
2005-11-17 13:12:12.427746500 [20716] warn: spamd: copy_config timeout,
respawning child process after 1 messages at /usr/bin/spamd line 964.

 from looking at the code in /usr/bin/spamd,  $@='__alarm__' which should have
made spamd exit and respawned the parent.

         if ($err =~ /__alarm__/) {
            warn("spamd: copy_config timeout, respawning child process after ".
                            ($i+1)." messages");
            exit;         # so that the master spamd can respawn

          } else {
            die $err;     # propagate failures
          }

instead, it went into this spin out of control...  bigger log snippet below.


2005-11-17 13:11:28.095485500 [18735] info: spamd: connection from
localhost.localdomain [127.0.0.1] at port 39950
2005-11-17 13:11:28.097515500 [18735] info: spamd: handle_user unable to find
user: $global
2005-11-17 13:11:28.189210500 [18735] info: spamd: processing message
<Kilauea169891-31594-111201751-1-1001@flonetwork.com> for $global:200
2005-11-17 13:11:33.943557500 [18735] error: __alarm__
2005-11-17 13:11:36.110337500 [18735] info: spamd: clean message (-99.1/4.5) for
$global:200 in 8.0 seconds, 22460 bytes.
2005-11-17 13:11:36.110834500 [18735] info: spamd: result: . -99 -
AWL,BAYES_50,DBL_12_LETTER_FLDR,DBL_12_LETTER_PGIMG,HTML_MESSAGE,MIME_HTML_ONLY,NO_RECEIVED,NO_RELAYS,USER_IN_WHITELIST
scantime=8.0,size=22460,user=$global,uid=200,required_score=4.5,rhost=localhost.localdomain,raddr=127.0.0.1,rport=39950,mid=<Kilauea169891-31594-111201751-1-1001@flonetwork.com>,bayes=0.487863122868568,autolearn=no
2005-11-17 13:11:36.592555500 [818] info: spamd: handled cleanup of child pid
18735 due to SIGCHLD
2005-11-17 13:11:36.623644500 [818] info: spamd: server successfully spawned
child process, pid 20844
2005-11-17 13:12:02.204952500 [20716] info: spamd: connection from
localhost.localdomain [127.0.0.1] at port 39954
2005-11-17 13:12:12.421475500 [20716] error: __alarm__
2005-11-17 13:12:12.427746500 [20716] warn: spamd: copy_config timeout,
respawning child process after 1 messages at /usr/bin/spamd line 964.
2005-11-17 13:12:15.065816500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.065995500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.066132500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.066269500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.066406500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.066542500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.066678500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.066815500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.067004500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.067146500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.067283500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.067494500 [20716] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-17 13:12:15.067630500 [20716] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-17 13:12:15.067766500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.067903500 [20716] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-17 13:12:15.068038500 [20716] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-17 13:12:15.068172500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.068305500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.068437500 [20716] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-17 13:12:15.068573500 [20716] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-17 13:12:15.068706500 [20716] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-17 13:12:15.068839500 [20716] warn: Attempt to free non-existent shared
string during global destruction.
Comment 1 Dallas Engelken 2005-11-28 16:44:00 UTC
another one bits the dust after 'copy_config timeout' occurred.....


2005-11-23 11:16:03.973510500 [31352] info: spamd: processing message
<OFA8CC8528.B5E84CFE-ON862570C2.005EC146-862570C2.005ED63A@worleywarehousing.com>
for bgombert@DOMAIN.TLD:200
2005-11-23 11:16:17.895227500 [31478] error: child processing timeout at
/usr/bin/spamd line 1085, <GEN2733> line 4032.
2005-11-23 11:16:18.743445500 [31478] warn: spamd: timeout: (60 second timeout
while trying to PROCESS) at /usr/bin/spamd line 1678, <GEN2733> line 4032.
2005-11-23 11:16:30.069835500 [31478] error: __alarm__
2005-11-23 11:16:31.661297500 [31478] warn: spamd: copy_config timeout,
respawning child process after 23 messages at /usr/bin/spamd line 964.
2005-11-23 11:16:33.374891500 [31311] info: spamd: connection from
localhost.localdomain [127.0.0.1] at port 56464
2005-11-23 11:16:34.073753500 [31648] error: child processing timeout at
/usr/bin/spamd line 1085, <GEN2735> line 4034.
2005-11-23 11:16:34.074582500 [31648] warn: spamd: timeout: (60 second timeout
while trying to PROCESS) at /usr/bin/spamd line 1678, <GEN2735> line 4034.
2005-11-23 11:16:41.301635500 [31352] error: __alarm__
2005-11-23 11:16:44.275401500 [31311] error: __alarm__
2005-11-23 11:16:44.321221500 [31352] error: __alarm__
2005-11-23 11:16:44.569257500 [31311] warn: spamd: copy_config timeout,
respawning child process after 25 messages at /usr/bin/spamd line 964.
2005-11-23 11:16:48.123209500 [31352] info: spamd: clean message (0.6/5.0) for
bgombert@DOMAIN.TLD:200 in 49.2 seconds, 15092 bytes.
2005-11-23 11:16:48.123665500 [31352] info: spamd: result: .  0 -
AWL,HTML_MESSAGE,NO_REAL_NAME
scantime=49.2,size=15092,user=bgombert@DOMAIN.TLD,uid=200,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=56461,mid=<OFA8CC8528.B5E84CFE-ON862570C2.005EC146-862570C2.005ED63A@worleywarehousing.com>,autolearn=no
2005-11-23 11:16:48.448167500 [31648] info: spamd: connection from
localhost.localdomain [127.0.0.1] at port 56473
2005-11-23 11:16:48.450129500 [31648] info: spamd: handle_user unable to find
user: bgombert@DOMAIN.TLD
2005-11-23 11:16:48.511387500 [31478] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-23 11:16:48.511391500 [31478] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-23 11:16:48.511394500 [31478] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-23 11:16:48.511396500 [31478] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-23 11:16:48.511399500 [31478] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-23 11:16:48.511436500 [31478] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-23 11:16:48.511439500 [31478] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-23 11:16:48.511441500 [31478] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-23 11:16:48.511444500 [31478] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-23 11:16:48.511497500 [31478] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-23 11:16:48.511499500 [31478] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-23 11:16:48.511502500 [31478] warn: Attempt to free unreferenced scalar
during global destruction.
2005-11-23 11:16:48.511505500 [31478] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-23 11:16:48.511527500 [31478] warn: Attempt to free non-existent shared
string during global destruction.
2005-11-23 11:16:48.511530500 [31478] warn: Attempt to free unreferenced scalar
during global destruction.
Comment 2 Dallas Engelken 2005-11-28 19:40:33 UTC
here is a warn that actually puts a line number to it.... first time i had seen
that.

@40000000438b36b405207a24 [3205] warn: Attempt to free unreferenced scalar at
/usr/bin/spamd line 1085, <GEN136> line 2.


Comment 3 Dallas Engelken 2005-11-28 20:08:03 UTC
okay.. this is semi-reproducable.   starting spamd up like this

# spamd -p 777 --timeout-child=1 -D

and passing a message to it via

# cat msg.txt | spamc -p 777

gets me...

[27182] dbg: prefork: ordered 27204 to accept
[27182] dbg: prefork: sysread(7) not ready, wait max 300 secs
[27204] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 
37675
[27182] dbg: prefork: child 27204: entering state 2
[27182] dbg: prefork: new lowest idle kid: 27205
[27204] info: spamd: setuid to dallase succeeded
[27204] error: child processing timeout at /usr/bin/spamd line 1085, <GEN7> 
line 2.
[27204] warn: spamd: timeout: (1 second timeout while trying to PROCESS) 
at /usr/bin/spamd line 1678, <GEN7> line 2.
[27204] dbg: config: copying current conf from backup
[27182] dbg: prefork: child 27204: entering state 1
[27182] dbg: prefork: new lowest idle kid: 27204
[27182] dbg: prefork: child reports idle
[27182] info: prefork: child states: II

then, if i let the spamd engine sit a bit before i issue a SIGINT to it, i get 
an error on the last line, which talks about "during global destruction".

[27204] dbg: prefork: sysread(8) not ready, wait max 300 secs
[27205] dbg: prefork: periodic ping from spamd parent
[27205] dbg: prefork: sysread(9) not ready, wait max 300 secs
[27204] dbg: prefork: periodic ping from spamd parent
[27204] dbg: prefork: sysread(8) not ready, wait max 300 secs
[27204] dbg: prefork: periodic ping from spamd parent
[27204] dbg: prefork: sysread(8) not ready, wait max 300 secs
[27205] dbg: prefork: periodic ping from spamd parent
[27205] dbg: prefork: sysread(9) not ready, wait max 300 secs
[27205] dbg: prefork: periodic ping from spamd parent
[27205] dbg: prefork: sysread(9) not ready, wait max 300 secs
[27182] dbg: prefork: new lowest idle kid: 27204
[27182] info: spamd: handled cleanup of child pid 27205 due to SIGCHLD
[27182] dbg: prefork: new lowest idle kid: none
[27182] info: spamd: handled cleanup of child pid 27204 due to SIGCHLD
[27182] info: spamd: server killed by SIGINT, shutting down
[27182] error: tcp timeout 
at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 
195 during global destruction.
tcp timeout 
at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 
195 during global destruction.


i'm not sure if these are related, but i'd wager a guess that they are... 
seeing that in Comment #1, a --timeout-child=60 was reached prior to the system 
spinning out of control from the global destruction log entries.
Comment 4 Justin Mason 2005-11-28 20:51:51 UTC
I hate to say it, but 'Attempt to free unreferenced scalar' and 'Attempt to free
non-existent shared string during global destruction' errors are extremely bad
news.  They typically indicate bugs in either the perl interpreter itself, or in
an XS module (ie. compiled C code).

I'd suggest trying another perl interpreter version to avoid these, if possible.
Comment 5 Dallas Engelken 2005-11-30 20:27:18 UTC
hmm.. strange.  this was just a stock perl 5.6.1 on redhat 7.3, which meets the 
requirements.   that said, i cant seem to reproduce that behavior on a 5.8.5 or 
5.8.6 perl.

Comment 6 Justin Mason 2006-01-04 23:11:51 UTC
hmm -- this sounds like the message on the users list today...
Comment 7 Dallas Engelken 2006-01-04 23:49:35 UTC
(In reply to comment #6)
> hmm -- this sounds like the message on the users list today...

thread at 
http://www.nabble.com/strange-bug%2C-filling-syslog-t843486.html

Comment 8 Justin Mason 2006-02-01 02:44:06 UTC
bug 4696 comment 31 From Dallas Engelken:

'i havent had the problem outlined in bug 4693 since
applying this patch..  FWIW.'

I'm not entirely sure this is closable just yet though -- what do you think Dallas?
Comment 9 Dallas Engelken 2006-02-01 04:54:43 UTC
well, it probably wouldnt hurt to leave open for a bit.   all my production 
boxes are currently set to override the $copy_config_p=1 setting 
in /usr/bin/spamd to just disable config copies since i load prefs via sql.

on my svn installs that i test with i cant seem to reproduce this anymore with 
the patch from bug 4696 applied.


Comment 10 Daryl C. W. O'Shea 2006-02-03 02:15:32 UTC
"all my production boxes are currently set to override the $copy_config_p=1
setting in /usr/bin/spamd to just disable config copies since i load prefs via sql"


Unless I'm forgetting something, it makes no difference if you're using SQL
prefs or not.  You'll still end up appending config options to multi-value
config options and replace the current value of single value config options.

Most noticable would be the leaking of white/black-list entries across users.
Comment 11 Dallas Engelken 2006-03-14 15:58:14 UTC
well i guess leaking white/black entries is a heck of a lot better than DoSing
the system due to a full disk.

FWIW, running 3.1.1 or latest svn with $copy_config_p set back to 1 has caused
this problem to resurface on rh7.3 boxes with perl 5.6.1.   i've seen a handful
of boxes in the last 2 days with full disks due to this.   i've never seen it
occur on fc3 with perl 5.8.5.

i thought the recent updates related to bug 4696 may have helped, but it doesnt
appear so.  Patrick von der Hagen is the only other person i have seen report
this issue... somehow i wish it was more widespread so i could stop banging my
head on my desk and yelling WHY ME!  ;)

so, i'd say leave this open for a while until i debug it down far enough to
pinpoint the issue.  
Comment 12 Daryl C. W. O'Shea 2006-03-15 03:44:51 UTC
"Other than" the nasty warnings causing your log volume to go down in flames,
does spamd continue to work OK?  I'm assuming that copy_config doesn't time out
on every message, right?

You could avoid logging the errors.  Obviously it wouldn't fix the problem, but
at least you wouldn't get logged to death.

Adding something like this at line 142 of Logger.pm would do it:

return if ($level eq "warn" && $message[0] =~ /Attempt to free .* during global
destruction/);
Comment 13 Dallas Engelken 2006-03-15 17:47:29 UTC
well, i suppose it does, bug i'm just guessing because i dont hear anyone 
complain until their disk is full and their mail stops coming in.

i've seen cases where daily logrotate will save it for a while.... in this 
case, it was happending daily for the last 8 days.

-rw-rw-r--    1 doublech doublech 1098107980 Mar 13 04:02 spamd.log.3.gz
-rw-rw-r--    1 doublech doublech 1142570815 Mar 12 04:01 spamd.log.4.gz
-rw-rw-r--    1 doublech doublech 1188325192 Mar 11 04:02 spamd.log.5.gz
-rw-rw-r--    1 doublech doublech 1231933772 Mar 10 04:03 spamd.log.6.gz
-rw-rw-r--    1 doublech doublech 1287468111 Mar  9 04:02 spamd.log.7.gz
-rw-rw-r--    1 doublech doublech 1327049798 Mar  8 04:02 spamd.log.8.gz
-rw-rw-r--    1 doublech doublech  1904957 Mar  7 03:11 spamd.log.9.gz
-rw-r--r--    1 doublech doublech  1294305 Mar  6 01:07 spamd.log.10.gz
-rwxr-xr-x    1 doublech doublech  1151532 Mar  3 19:19 spamd.log.11.gz
-rwxr-xr-x    1 doublech doublech   864265 Mar  1 22:15 spamd.log.12.gz
-rwxr-xr-x    1 doublech doublech   863982 Feb 28 22:03 spamd.log.13.gz
-rwxr-xr-x    1 doublech doublech   985953 Feb 27 21:00 spamd.log.14.gz

shouldnt be hard to identify when this this an issue if i monitor disk usage.  
When its in this high speed logging loop, the CPU is also hitting pretty 
hard... so i'm not sure how much of that CPU is due to the actual log write, 
and how much is due to it being stuck in a loop.  i can try adding that line to 
Logging.pm and see what happens.

i'll post more if i can catch it in progress.

Comment 14 Dallas Engelken 2006-03-15 18:14:19 UTC
ok, i caught one.   looks like child pid 23618 is the one that went nuts..  
child 6613 answered my spamc request just fine.

i've done a little grep magic to weed out the info lines and show how the warns 
from 23618 intertwine with them... otherwise you'd be looking at literally 
hundreds of warn lines between each info line.

[root@dcem spamassassin]# tail -f current  | grep -A1 -B1 info | tai64nlocal

2006-03-15 12:53:17.467291500 [23618] warn: Attempt to free unreferenced 
scalar, <GEN2263> line 255 during global destruction.
2006-03-15 12:53:17.471020500 [6613] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 53583
2006-03-15 12:53:17.496962500 [23618] warn: Attempt to free non-existent shared 
string, <GEN2263> line 255 during global destruction.
2006-03-15 12:53:17.497196500 [23618] warn: Attempt to free unreferenced 
scalar, <GEN2263> line 255 during global destruction.
2006-03-15 12:53:17.503950500 [6613] info: spamd: processing message (unknown) 
for root:200
2006-03-15 12:53:17.577813500 [23618] warn: Attempt to free non-existent shared 
string, <GEN2263> line 255 during global destruction.
--
2006-03-15 12:53:18.631218500 [6613] warn: pyzor: check failed: internal error
2006-03-15 12:53:18.658736500 [6613] info: spamd: clean message (1.8/5.0) for 
root:200 in 1.2 seconds, 7 bytes.
2006-03-15 12:53:18.659210500 [6613] info: spamd: result: .  1 - 
BAYES_40,MISSING_DATE,MISSING_SUBJECT,NO_RECEIVED,NO_RELAYS,TO_CC_NONE 
scantime=1.2,size=7,user=root,uid=200,required_score=5.0,rhost=localhost.localdo
main,raddr=127.0.0.1,rport=53583,mid=
(unknown),bayes=0.362223941701955,autolearn=no,urihits=none
2006-03-15 12:53:18.713748500 [23618] warn: Attempt to free unreferenced 
scalar, <GEN2263> line 255 during global destruction.
Comment 15 Henrik Krohns 2022-03-06 16:35:46 UTC
Closing ancient stale bug.