Bug 3839 - increase amount of memory shared between spamd processes
increase amount of memory shared between spamd processes
Status: NEW
Product: Spamassassin
Classification: Unclassified
Component: Libraries
3.0.0
Other other
: P5 normal
: Future
Assigned To: SpamAssassin Developer Mailing List
:
Depends on: 3852
Blocks:
  Show dependency tree
 
Reported: 2004-09-28 13:23 UTC by Justin Mason
Modified: 2007-04-16 05:11 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
spamd patch to dump Devel::Peek dumps after each scan patch None Justin Mason [HasCLA]
spamd dump directly after compile_now(), gzipped text/plain None Justin Mason [HasCLA]
spamd dump after first scan, gzipped application/octet-stream None Justin Mason [HasCLA]
diff -ru between those two dump files text/plain None Justin Mason [HasCLA]
"sharetest" script text/plain None Justin Mason [HasCLA]
experiment; parse config in a subprocess patch None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2004-09-28 13:23:20 UTC
this came up on the users list:
http://issues.apache.org/eyebrowse/ReadMsg?listId=275&msgNo=16398 . Basically,
we have this situation:

1. spamd starts up and parses a lot of configuration (rules, rule names, rule
descriptions, etc.), and allocates SVs for all the data it keeps in RAM.

2. spamd forks N subprocesses to do scanning

3a. *the plan*: that those N subprocesses never copy that memory; they can share
the read-only config data with the parent process.

3b. *reality*: shared RAM is very low; each subprocess has its own copy of these
pages, as this "top -b1 -n -c" output from linux 2.6.6 demonstrates:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
25963 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 /usr/bin/perl -T
25969 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25970 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25971 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25972 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25973 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child

"SHARE" is actually an accurate illustration of how much of that memory is being
shared between those processes; as you can see, the vast majority of the SIZE is
*not* being shared.

I believe this is due to a behaviour of ref-counted strings, as David Skoll
suggested.   for example:

    my $result;
    foreach my $string (@massive_array_of_strings) {
      $result .= substr($string,0,1);
    }
    print $result,"\n";

@massive_array_of_strings is a large array of short strings, like the multiple
arrays and hashes we maintain of SpamAssassin rules, rule names, and their code.
even though you never modify any string from @massive_array_of_strings, the act
of iterating over them does this:

    - increment refcount of SV at $massive_array_of_strings[0]
    - get substr, add to $result
    - decrement refcount of SV at $massive_array_of_strings[0]
    - increment refcount of SV at $massive_array_of_strings[1]
    - get substr, add to $result
    - decrement refcount of SV at $massive_array_of_strings[1]
    - ...

Every refcount change that takes place results in the page that that SV
structure is on, becoming marked as "dirty" for the purposes of copy-on-write.
This in turn results in those pages being copied by the VM into the child
process' RSS.

See "Are My Variables Shared?"  at
http://www.perl.com/lpt/a/2002/07/30/mod_perl.html .

Another question is, is this the problem we're running into?   Some 
grovelling through Devel::Peek output may be required. :(

Possible fixes:

1. use "Readonly":
http://mirror.eunet.fi/CPAN/modules/by-module/Readonly/Readonly-1.03.readme . 
however, may not work at all.

2. store memory-hungry array/hash structures in a single massive packed string!
yes, this should work, since the refcounting for a string is a single counter in
that string's SV, in the first page.  All the pages containing the string's data
would never be written.

3. avoid looking at some of the arrays/hashes in the child processes, at all.


btw, Nutssh on IRC was looking for these -- so here they are for anyone
else who wants them.

/proc/*/maps for parent:

08048000-08141000 r-xp 00000000 03:07 5900534    /usr/bin/perl
08141000-0814a000 rw-p 000f9000 03:07 5900534    /usr/bin/perl
0814a000-09384000 rwxp 00000000 00:00 0
40000000-40016000 r-xp 00000000 03:07 3211322    /lib/ld-2.3.2.so
40016000-40017000 rw-p 00015000 03:07 3211322    /lib/ld-2.3.2.so
40017000-40018000 rw-p 00000000 00:00 0
4002e000-40030000 r-xp 00000000 03:07 3031109    /lib/tls/libdl-2.3.2.so
40030000-40031000 rw-p 00001000 03:07 3031109    /lib/tls/libdl-2.3.2.so
40031000-40053000 r-xp 00000000 03:07 3031117    /lib/tls/libm-2.3.2.so
40053000-40054000 rw-p 00022000 03:07 3031117    /lib/tls/libm-2.3.2.so
40054000-40055000 rw-p 00000000 00:00 0
40055000-40061000 r-xp 00000000 03:07 3031132    /lib/tls/libpthread-0.60.so
40061000-40062000 rw-p 0000c000 03:07 3031132    /lib/tls/libpthread-0.60.so
40062000-40064000 rw-p 00000000 00:00 0
40064000-40194000 r-xp 00000000 03:07 3031105    /lib/tls/libc-2.3.2.so
40194000-4019d000 rw-p 0012f000 03:07 3031105    /lib/tls/libc-2.3.2.so
4019d000-4019f000 rw-p 00000000 00:00 0
4019f000-401a3000 r-xp 00000000 03:07 3031106    /lib/tls/libcrypt-2.3.2.so
401a3000-401a4000 rw-p 00004000 03:07 3031106    /lib/tls/libcrypt-2.3.2.so
401a4000-401ed000 rw-p 00000000 00:00 0
401ed000-401f1000 r-xp 00000000 03:07 345268     /usr/lib/perl/5.8.4/auto/IO/IO.so
401f1000-401f2000 rw-p 00004000 03:07 345268     /usr/lib/perl/5.8.4/auto/IO/IO.so
401f2000-401f8000 r-xp 00000000 03:07 426408    
/usr/lib/perl/5.8.4/auto/Socket/Socket.so
401f8000-401f9000 rw-p 00005000 03:07 426408    
/usr/lib/perl/5.8.4/auto/Socket/Socket.so
401f9000-401fb000 r-xp 00000000 03:07 1262348   
/usr/lib/perl/5.8.4/auto/Sys/Hostname/Hostname.so
401fb000-401fc000 rw-p 00001000 03:07 1262348   
/usr/lib/perl/5.8.4/auto/Sys/Hostname/Hostname.so
401fc000-401ff000 r-xp 00000000 03:07 7422476   
/usr/lib/perl/5.8.4/auto/Fcntl/Fcntl.so
401ff000-40200000 rw-p 00003000 03:07 7422476   
/usr/lib/perl/5.8.4/auto/Fcntl/Fcntl.so
40200000-4021c000 r-xp 00000000 03:07 7029299   
/usr/lib/perl/5.8.4/auto/POSIX/POSIX.so
4021c000-4021d000 rw-p 0001b000 03:07 7029299   
/usr/lib/perl/5.8.4/auto/POSIX/POSIX.so
4021d000-40220000 r-xp 00000000 03:07 6914140   
/usr/lib/perl/5.8.4/auto/MIME/Base64/Base64.so
40220000-40221000 rw-p 00002000 03:07 6914140   
/usr/lib/perl/5.8.4/auto/MIME/Base64/Base64.so
40221000-40223000 r-xp 00000000 03:07 3784793    /usr/lib/perl5/auto/Net/DNS/DNS.so
40223000-40224000 rw-p 00001000 03:07 3784793    /usr/lib/perl5/auto/Net/DNS/DNS.so
40224000-40228000 r-xp 00000000 03:07 3784710   
/usr/lib/perl5/auto/Digest/SHA1/SHA1.so
40228000-40229000 rw-p 00004000 03:07 3784710   
/usr/lib/perl5/auto/Digest/SHA1/SHA1.so
40229000-40237000 r-xp 00000000 03:07 3784853   
/usr/lib/perl/5.8.4/auto/DB_File/DB_File.so
40237000-40238000 rw-p 0000d000 03:07 3784853   
/usr/lib/perl/5.8.4/auto/DB_File/DB_File.so
4023a000-40249000 r-xp 00000000 03:07 3031133    /lib/tls/libresolv-2.3.2.so
40249000-4024a000 rw-p 0000f000 03:07 3031133    /lib/tls/libresolv-2.3.2.so
4024a000-4024c000 rw-p 00000000 00:00 0
4024c000-40255000 r-xp 00000000 03:07 1474596   
/usr/lib/perl5/auto/HTML/Parser/Parser.so
40255000-40256000 rw-p 00008000 03:07 1474596   
/usr/lib/perl5/auto/HTML/Parser/Parser.so
40256000-40259000 r-xp 00000000 03:07 5406739    /usr/lib/perl/5.8.4/auto/Cwd/Cwd.so
40259000-4025a000 rw-p 00002000 03:07 5406739    /usr/lib/perl/5.8.4/auto/Cwd/Cwd.so
4025a000-4025e000 r-xp 00000000 03:07 1802278   
/usr/lib/perl/5.8.4/auto/Time/HiRes/HiRes.so
4025e000-4025f000 rw-p 00003000 03:07 1802278   
/usr/lib/perl/5.8.4/auto/Time/HiRes/HiRes.so
40262000-40336000 r-xp 00000000 03:07 5931583    /usr/lib/libdb-4.2.so
40336000-40338000 rw-p 000d4000 03:07 5931583    /usr/lib/libdb-4.2.so
40338000-4033c000 r-xp 00000000 03:07 1278045   
/usr/lib/perl/5.8.4/auto/Sys/Syslog/Syslog.so
4033c000-4033d000 rw-p 00003000 03:07 1278045   
/usr/lib/perl/5.8.4/auto/Sys/Syslog/Syslog.so
4033d000-40345000 r-xp 00000000 03:07 6882044   
/usr/lib/perl/5.8.4/auto/List/Util/Util.so
40345000-40346000 rw-p 00007000 03:07 6882044   
/usr/lib/perl/5.8.4/auto/List/Util/Util.so
40353000-4035c000 r-xp 00000000 03:07 3031124    /lib/tls/libnss_files-2.3.2.so
4035c000-4035d000 rw-p 00008000 03:07 3031124    /lib/tls/libnss_files-2.3.2.so
4035d000-40364000 r-xp 00000000 03:07 3031120    /lib/tls/libnss_compat-2.3.2.so
40364000-40365000 rw-p 00007000 03:07 3031120    /lib/tls/libnss_compat-2.3.2.so
40365000-40377000 r-xp 00000000 03:07 3031119    /lib/tls/libnsl-2.3.2.so
40377000-40378000 rw-p 00011000 03:07 3031119    /lib/tls/libnsl-2.3.2.so
40378000-4037a000 rw-p 00000000 00:00 0
4037a000-40383000 r-xp 00000000 03:07 3031129    /lib/tls/libnss_nis-2.3.2.so
40383000-40384000 rw-p 00008000 03:07 3031129    /lib/tls/libnss_nis-2.3.2.so
40384000-4044c000 rw-p 00000000 00:00 0
404bc000-40524000 rw-p 00138000 00:00 0
bfff0000-c0000000 rw-p ffff1000 00:00 0
ffffe000-fffff000 ---p 00000000 00:00 0



/proc/*/maps for child:

08048000-08141000 r-xp 00000000 03:07 5900534    /usr/bin/perl
08141000-0814a000 rw-p 000f9000 03:07 5900534    /usr/bin/perl
0814a000-093a5000 rwxp 00000000 00:00 0 
40000000-40016000 r-xp 00000000 03:07 3211322    /lib/ld-2.3.2.so
40016000-40017000 rw-p 00015000 03:07 3211322    /lib/ld-2.3.2.so
40017000-40018000 rw-p 00000000 00:00 0 
4002e000-40030000 r-xp 00000000 03:07 3031109    /lib/tls/libdl-2.3.2.so
40030000-40031000 rw-p 00001000 03:07 3031109    /lib/tls/libdl-2.3.2.so
40031000-40053000 r-xp 00000000 03:07 3031117    /lib/tls/libm-2.3.2.so
40053000-40054000 rw-p 00022000 03:07 3031117    /lib/tls/libm-2.3.2.so
40054000-40055000 rw-p 00000000 00:00 0 
40055000-40061000 r-xp 00000000 03:07 3031132    /lib/tls/libpthread-0.60.so
40061000-40062000 rw-p 0000c000 03:07 3031132    /lib/tls/libpthread-0.60.so
40062000-40064000 rw-p 00000000 00:00 0 
40064000-40194000 r-xp 00000000 03:07 3031105    /lib/tls/libc-2.3.2.so
40194000-4019d000 rw-p 0012f000 03:07 3031105    /lib/tls/libc-2.3.2.so
4019d000-4019f000 rw-p 00000000 00:00 0 
4019f000-401a3000 r-xp 00000000 03:07 3031106    /lib/tls/libcrypt-2.3.2.so
401a3000-401a4000 rw-p 00004000 03:07 3031106    /lib/tls/libcrypt-2.3.2.so
401a4000-401ed000 rw-p 00000000 00:00 0 
401ed000-401f1000 r-xp 00000000 03:07 345268     /usr/lib/perl/5.8.4/auto/IO/IO.so
401f1000-401f2000 rw-p 00004000 03:07 345268     /usr/lib/perl/5.8.4/auto/IO/IO.so
401f2000-401f8000 r-xp 00000000 03:07 426408    
/usr/lib/perl/5.8.4/auto/Socket/Socket.so
401f8000-401f9000 rw-p 00005000 03:07 426408    
/usr/lib/perl/5.8.4/auto/Socket/Socket.so
401f9000-401fb000 r-xp 00000000 03:07 1262348   
/usr/lib/perl/5.8.4/auto/Sys/Hostname/Hostname.so
401fb000-401fc000 rw-p 00001000 03:07 1262348   
/usr/lib/perl/5.8.4/auto/Sys/Hostname/Hostname.so
401fc000-401ff000 r-xp 00000000 03:07 7422476   
/usr/lib/perl/5.8.4/auto/Fcntl/Fcntl.so
401ff000-40200000 rw-p 00003000 03:07 7422476   
/usr/lib/perl/5.8.4/auto/Fcntl/Fcntl.so
40200000-4021c000 r-xp 00000000 03:07 7029299   
/usr/lib/perl/5.8.4/auto/POSIX/POSIX.so
4021c000-4021d000 rw-p 0001b000 03:07 7029299   
/usr/lib/perl/5.8.4/auto/POSIX/POSIX.so
4021d000-40220000 r-xp 00000000 03:07 6914140   
/usr/lib/perl/5.8.4/auto/MIME/Base64/Base64.so
40220000-40221000 rw-p 00002000 03:07 6914140   
/usr/lib/perl/5.8.4/auto/MIME/Base64/Base64.so
40221000-40223000 r-xp 00000000 03:07 3784793    /usr/lib/perl5/auto/Net/DNS/DNS.so
40223000-40224000 rw-p 00001000 03:07 3784793    /usr/lib/perl5/auto/Net/DNS/DNS.so
40224000-40228000 r-xp 00000000 03:07 3784710   
/usr/lib/perl5/auto/Digest/SHA1/SHA1.so
40228000-40229000 rw-p 00004000 03:07 3784710   
/usr/lib/perl5/auto/Digest/SHA1/SHA1.so
40229000-40237000 r-xp 00000000 03:07 3784853   
/usr/lib/perl/5.8.4/auto/DB_File/DB_File.so
40237000-40238000 rw-p 0000d000 03:07 3784853   
/usr/lib/perl/5.8.4/auto/DB_File/DB_File.so
4023a000-40249000 r-xp 00000000 03:07 3031133    /lib/tls/libresolv-2.3.2.so
40249000-4024a000 rw-p 0000f000 03:07 3031133    /lib/tls/libresolv-2.3.2.so
4024a000-4024c000 rw-p 00000000 00:00 0 
4024c000-40255000 r-xp 00000000 03:07 1474596   
/usr/lib/perl5/auto/HTML/Parser/Parser.so
40255000-40256000 rw-p 00008000 03:07 1474596   
/usr/lib/perl5/auto/HTML/Parser/Parser.so
40256000-40259000 r-xp 00000000 03:07 5406739    /usr/lib/perl/5.8.4/auto/Cwd/Cwd.so
40259000-4025a000 rw-p 00002000 03:07 5406739    /usr/lib/perl/5.8.4/auto/Cwd/Cwd.so
4025a000-4025e000 r-xp 00000000 03:07 1802278   
/usr/lib/perl/5.8.4/auto/Time/HiRes/HiRes.so
4025e000-4025f000 rw-p 00003000 03:07 1802278   
/usr/lib/perl/5.8.4/auto/Time/HiRes/HiRes.so
40262000-40336000 r-xp 00000000 03:07 5931583    /usr/lib/libdb-4.2.so
40336000-40338000 rw-p 000d4000 03:07 5931583    /usr/lib/libdb-4.2.so
40338000-4033c000 r-xp 00000000 03:07 1278045   
/usr/lib/perl/5.8.4/auto/Sys/Syslog/Syslog.so
4033c000-4033d000 rw-p 00003000 03:07 1278045   
/usr/lib/perl/5.8.4/auto/Sys/Syslog/Syslog.so
4033d000-40345000 r-xp 00000000 03:07 6882044   
/usr/lib/perl/5.8.4/auto/List/Util/Util.so
40345000-40346000 rw-p 00007000 03:07 6882044   
/usr/lib/perl/5.8.4/auto/List/Util/Util.so
40353000-4035c000 r-xp 00000000 03:07 3031124    /lib/tls/libnss_files-2.3.2.so
4035c000-4035d000 rw-p 00008000 03:07 3031124    /lib/tls/libnss_files-2.3.2.so
4035d000-40364000 r-xp 00000000 03:07 3031120    /lib/tls/libnss_compat-2.3.2.so
40364000-40365000 rw-p 00007000 03:07 3031120    /lib/tls/libnss_compat-2.3.2.so
40365000-40377000 r-xp 00000000 03:07 3031119    /lib/tls/libnsl-2.3.2.so
40377000-40378000 rw-p 00011000 03:07 3031119    /lib/tls/libnsl-2.3.2.so
40378000-4037a000 rw-p 00000000 00:00 0 
4037a000-40383000 r-xp 00000000 03:07 3031129    /lib/tls/libnss_nis-2.3.2.so
40383000-40384000 rw-p 00008000 03:07 3031129    /lib/tls/libnss_nis-2.3.2.so
40384000-4044c000 rw-p 00000000 00:00 0 
404bc000-40524000 rw-p 00138000 00:00 0 
bfff0000-c0000000 rw-p ffff1000 00:00 0 
ffffe000-fffff000 ---p 00000000 00:00 0
Comment 1 Justin Mason 2004-09-28 16:51:27 UTC
daniel pointed out -- I'd screwed up the top output -- pasting between 2.4.x and
2.6.x does that, it seems.   so here's the correct header for that output, from
"top -b -n1":

  PID USER      PR  NI CODE  VIRT  RES  SHR nDRT %CPU %MEM   TIME COMMAND
25963 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 /usr/bin/perl -T
25969 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25970 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25971 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25972 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25973 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
Comment 2 Justin Mason 2004-09-29 10:20:47 UTC
Subject: Re:  New: perl's refcounting may be causing us trouble with copy-on-write 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Matt Sergeant writes:
> On 28 Sep 2004, at 21:23, bugzilla-daemon@bugzilla.spamassassin.org 
> wrote:
> 
> > 3a. *the plan*: that those N subprocesses never copy that memory; they 
> > can share
> > the read-only config data with the parent process.
> >
> > 3b. *reality*: shared RAM is very low; each subprocess has its own 
> > copy of these
> > pages, as this "top -b1 -n -c" output from linux 2.6.6 demonstrates:
> >
> >   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
> > 25963 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 
> > /usr/bin/perl -T
> > 25969 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 
> > spamd child
> > 25970 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 
> > spamd child
> > 25971 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 
> > spamd child
> > 25972 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 
> > spamd child
> > 25973 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 
> > spamd child
> >
> > "SHARE" is actually an accurate illustration of how much of that 
> > memory is being
> > shared between those processes; as you can see, the vast majority of 
> > the SIZE is
> > *not* being shared.
> 
> Your analysis is wrong. RSS is the memory being used by the process, 
> not SIZE. In the above, most of the memory *is* shared. Maybe this 
> isn't a problem at all?

see the bug -- I'd screwed up copying and pasting those lines, and the
header doesn't match the other lines.   correct output:

  PID USER      PR  NI CODE  VIRT  RES  SHR nDRT %CPU %MEM   TIME COMMAND
25963 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 /usr/bin/perl -T
25969 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25970 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25971 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25972 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child
25973 jm        25   0 1032 24416  21m 4140    0  0.0  2.5   0:00 spamd child

Note that SHR is 4140 KB for all procs, and RES 21 MB.   That's 20% of the
memory usage being shared (according to the maps, that's all the shared
library and interpreter segments) and 80% has been copied for each process
(pretty much all the process' heap).

> See the section "Process Memory Measurements" in the mod_perl guide 
> (not sure where it is - I'm referring to the book version).

great! I now have pretty much every other perl book O'Reilly have
printed, but I don't have a copy of that ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBWu7jQTcbUG5Y7woRAmUJAKDosKsNRBMP07RTqyB1z0wjhWd1pACfbxpg
WAstMuNYft5N7Nsk5KGS2S8=
=Cwmt
-----END PGP SIGNATURE-----

Comment 3 Justin Mason 2004-09-29 10:34:30 UTC
Subject: Re:  New: perl's refcounting may be causing us trouble with copy-on-write 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Matt Sergeant writes:
> > Note that SHR is 4140 KB for all procs, and RES 21 MB.   That's 20% of
> > the memory usage being shared (according to the maps, that's all the
> > shared library and interpreter segments) and 80% has been copied for
> > each process (pretty much all the process' heap).
> 
> IIRC you load the config in the child though, not in the parent. Or has
> that changed?

nah -- it was always loaded in the parent.  later, the user-prefs are
loaded in the child, but the system-wide config (all rules, all descs,
all scores et al) are loaded in the parent.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBWvIaQTcbUG5Y7woRAlc1AJwObhsLwkn7D5iDXH10+DJAFV+nJwCg37sE
1+ejC+J1umXQPVK7or5upIA=
=Vq4c
-----END PGP SIGNATURE-----

Comment 4 Justin Mason 2004-10-01 18:30:28 UTC
Created attachment 2402 [details]
spamd patch to dump Devel::Peek dumps after each scan
Comment 5 Justin Mason 2004-10-01 18:33:57 UTC
Created attachment 2403 [details]
spamd dump directly after compile_now(), gzipped
Comment 6 Justin Mason 2004-10-01 18:35:14 UTC
Created attachment 2404 [details]
spamd dump after first scan, gzipped
Comment 7 Justin Mason 2004-10-01 18:42:46 UTC
Created attachment 2405 [details]
diff -ru between those two dump files

OK, here's the scoop.	This is the diff between Devel::Peek::Dump
output from just after compile_now(), and just after the first
scan in a child spamd process.

Interesting things:  in the "scores" hash:

		     Elt "INVALID_DATE" HASH = 0xf102ea6e
		     SV = PVNV(0x8e3d838) at 0x8e0cf0c
		       REFCNT = 1
-		       FLAGS = (NOK,pNOK)
-		       IV = 0
+		       FLAGS = (IOK,NOK,pIOK,pNOK)
+		       IV = 1
		       NV = 1
		       PV = 0

so somehow a few rules are picking up new "IOK" flags -- ie. cached
integer-cast versions.	That means that all the pages with those SVs would be
copied. These seem to be the rules that are hitting in the message I scanned.
(I used "t/spamd.t".)

In addition, an "AWL" entry is being created in the "scoresets" arrays, and
picking up PV values  -- ie. cached stringified versions.

The message was autolearned as spam and added to the AWL.  Both these
are not used in compile_now().

and now my brain hurts. ;)
Comment 8 Loren Wilton 2004-10-01 20:19:31 UTC
Subject: Re:  perl's refcounting may be causing us trouble with copy-on-write

>      Elt "INVALID_DATE" HASH = 0xf102ea6e
>      SV = PVNV(0x8e3d838) at 0x8e0cf0c
>        REFCNT = 1

What you don't know from that diff is whether that refcnt changed during the
rule processing by being incremented on reference and decremented after the
reference.  So while you only see changes on some rules, and there is
doubtless a way to make those changes not happen, you *don't* know that you
aren't doing an inc/dec mod on each rule touched.

To figure that out you practically have to figure out how to get one of
those dumps in the middle of processing a rule.

Comment 9 Justin Mason 2004-10-02 12:14:30 UTC
Created attachment 2409 [details]
"sharetest" script

'What you don't know from that diff is whether that refcnt changed during
the rule processing by being incremented on reference and decremented after
the reference.	So while you only see changes on some rules, and there is
doubtless a way to make those changes not happen, you *don't* know that you
aren't doing an inc/dec mod on each rule touched. To figure that out you
practically have to figure out how to get one of those dumps in the middle
of processing a rule.'

Actually, here's a test script that appears to show that this is *not* the
case.	It emulates the data structures used in Conf, and the each() loops
used in PerMsgStatus, and dumps continually throughout -- including in the
middle of the each() iteration loop.  The REFCNTs do not change, even when
iterating.  So that's not the problem, after all.

It does demonstrate the bug, however:

top -b -n1 | grep sharetest
  PID USER	PR  NI CODE  VIRT  RES	SHR nDRT %CPU %MEM   TIME COMMAND
11784 jm	25   0 1032 37148  31m 2620    0 86.6  3.6   0:02 sharetest
11769 jm	25   0 1032 37148  31m 2620    0  0.0  3.6   0:04 sharetest

Interestingly, however, I notice something unusual -- those VIRT/RES/SHR
figures *never change* even when there's only *one* process there. I'm
starting to think that "top" is reporting the wrong thing -- namely, that
the "SHR" figure refers only to pages of code loaded from shared libraries,
and that the "RES" figure basically includes all the resident allocated
memory, even if that memory is shared behind the scenes.

If this is the case, this "bug" is a non-issue after all.

Anyway, take a look at the script if you're curious...

--j.
Comment 10 Justin Mason 2004-10-04 13:28:11 UTC
sigh...

The more I look into this, the less and less it appears to be an issue.
This script:

    #!/usr/bin/perl
    # original at http://www.fedoraforum.org/forum/showthread.php?t=5336
    use strict;
    my $a = "\x00" x (32*1024*1024); #32Mb of data
    print "[$$] forking to test perl mem sharing...\n";
    if ( fork == 0 ) {
    print "[$$] subprocess sleeping\n"; sleep 1;
    print "[$$] subprocess done\n"; exit;
    }
    print "[$$] reporting\n";
    my @statm = split(m/\s+/,`cat /proc/$$/statm`);
    my $statmstr = join(' ', @statm);
    my $ratio = ($statm[2]/$statm[1]);
    my $uname = `uname -sr`; chomp $uname;
    print "ratio: ".int($ratio*100)."\%   uname: $uname   statm: $statmstr\n";
    wait; print "[$$] done\n";

allocates a large 32-meg string, forks, and outputs the Linux kernel
per-process stats that "top" bases its info on (from /proc/$$/statm). Since the
large string has not been modified (or even accessed) since the fork, a large
portion of those pages should be shared if copy-on-write is doing anything
at all. ;)

In fact, the results are:

    - Linux 2.4.[12]*, vanilla kernels: 99% shared
    - Linux 2.4.[12]*, Red Hat kernels: 1-3% shared
    - Linux 2.6.*, vanilla kernels: 1-3% shared

Looking into the latter, it appears that yes, 2.6.x changed the data reported
from /proc/$$/statm so that it just reports (text + shared-lib-code) for the
"shared" field, not what quite a few people (myself included ;) thought it
means.   I'll bet the RH kernels are the same.

Anyway, I've found a machine where the results are reliable (reports 99%
from that script) so will test there.

Comment 11 Justin Mason 2004-10-04 13:46:04 UTC
ok, now *THAT* works a lot better ;)

after spamd startup:

top -b -n1 | grep perl
11842 root       9   0  4816  444   444 S     0.0  0.0   0:01 perl
 2666 jm         9   0 18460  18M 17892 S     0.0  3.5   0:03 perl
 2670 jm         9   0 18460  18M 17784 S     0.0  3.5   0:00 perl
 2671 jm         9   0 18460  18M 17880 S     0.0  3.5   0:00 perl
 2672 jm         9   0 18460  18M 17872 S     0.0  3.5   0:00 perl
 2673 jm         9   0 18460  18M 17872 S     0.0  3.5   0:00 perl
 2674 jm         9   0 18460  18M 17884 S     0.0  3.5   0:00 perl

nice sharing, there!  but then, we haven't done anything yet ;)  after scanning
a single message ("t/data/spam/001"):

 2666 jm         9   0 18460  18M 17888 S     0.0  3.5   0:03 perl
 2670 jm        13   0 18676  18M 11720 S     0.0  3.6   0:00 perl <<<*NOTE*
 2671 jm         9   0 18460  18M 17872 S     0.0  3.5   0:00 perl
 2672 jm         9   0 18460  18M 17872 S     0.0  3.5   0:00 perl
 2673 jm         9   0 18460  18M 17872 S     0.0  3.5   0:00 perl
 2674 jm         9   0 18460  18M 17884 S     0.0  3.5   0:00 perl

6064k dropped immediately (that's 1516 4k pages).  after scanning another 111
messages, this time a wide selection of "real" spam:

 2666 jm         9   0 18460  18M 10968 S     0.0  3.5   0:03 perl
 2670 jm         9   0 19532  19M 10880 S     0.0  3.7   0:09 perl
 2671 jm        13   0 19840  19M 10888 S     0.0  3.8   0:09 perl
 2672 jm         9   0 20620  20M 10832 S     0.0  4.0   0:09 perl
 2673 jm         9   0 20672  20M 10836 S     0.0  4.0   0:09 perl
 2674 jm         9   0 20848  20M 10844 S     0.0  4.0   0:09 perl

I'd say that's about the typical low-water mark.  we should be able to increase
that.

btw, if you're curious -- test methodology used:

  cd t
  /usr/bin/perl -T -w ../spamd/spamd -D -x -s stderr -C log/test_rules_copy 
--siteconfigpath log/localrules.tmp -p 48373 -L &
  [wait for spamd to start]
  top -b -n1 | grep perl

  ../spamc/spamc -d localhost -p 48373 < data/spam/001
  top -b -n1 | grep perl

  for m in [corpus] ; do ../spamc/spamc -d localhost -p 48373 < $m > /dev/null; done
  top -b -n1 | grep perl


don't forget, this only works on vanilla linux 2.4.*; other kernels do not
report valid figures for copy-on-write shared pages.
Comment 12 Justin Mason 2004-10-20 22:17:52 UTC
retitling to reflect the fact that the refcounting mechanism doesn't seem to be
an issue, after all.
Comment 13 Justin Mason 2004-10-20 22:27:10 UTC
after discussing with perl5-porters, there was a suggestion as to what's happening:

- spamd starts and reads config until 20MB ram is allocated
- clears up and frees memory, leaving ~10MB in use (but 20MB allocated to the
process)
- forks children.  each child gets a copy-on-write "copy" of that 20MB,
including 10MB of allocated memory and 10MB of "free" memory that hasn't been
returned to the OS
- children process messages...
- after several scans, the children have allocated and deallocated objects,
eventually overwriting those "free" 10MB of pages, causing them to be copied
into their own process.

So in other words, there's 10MB of copy-on-writed, but free, pages being brought
through the fork, making it look like spamd has higher memory usage and giving
the kernel more pages to keep track of.  If we could avoid those allocations
early on, that would probably help.

btw one tip from perl5-porters that may help:

  'An approach that might benefit you is to serialise your data structures
  using some module such as storable, then recreating them in a fresh
  process.'

Not sure how practical that would be, though, since fork() won't dispose of
those pages; only fork()/exec() would.

Comment 14 Justin Mason 2005-02-07 21:31:47 UTC
I'm pushing this off the 3.1.0 list.

The current 3.1.0 preforking code is a lot more efficient in dealing with
allocated memory, by using the same child as frequently as possible instead of
round-robining through all kids.  so efficiency is improved there.

in addition, spamd's actually doing a pretty good job of sharing RAM as it is. 
it's just linux 2.6.x and Red Hat's 2.4.x kernels that are reporting it entirely
incorrectly.
Comment 15 Daniel Quinlan 2005-03-30 01:08:46 UTC
move bug to Future milestone (previously set to Future -- I hope)
Comment 16 Justin Mason 2006-10-18 05:01:48 UTC
Created attachment 3718 [details]
experiment; parse config in a subprocess

as a curiosity, here's a hack I was just trying.

The idea is to move parsing of the config into a sub-process, using a nifty
Storable hack from Brad Fitzpatrick's djabberd code -- the in_sub_process()
method in the patch.

with the patch applied, run spamd and a couple of spamcs:

  cd t
  ../spamd/spamd -D -x -s stderr -C log/test_rules_copy  --siteconfigpath
log/localrules.tmp -p 42580 -L & sleep 3; ps axuww | grep spamd ;
../spamc/spamc -F data/spamc_blank.cf -d localhost -p 42580 < data/spam/001; ps
axuww | grep spamd ; ../spamc/spamc -F data/spamc_blank.cf -d localhost -p
42580 < data/spam/001; ps axuww | grep spamd

(fwiw you may have to "prep" the t dir with ./spamd.t first.)

Sadly there's no appreciable difference between spamd running the old way
(parsing the 600KB of config_text string in-process) or this way (parsing in a
subprocess).

top -b -n1 output --

with patch:
9910 jm        16   0 35400  32m 2376 S  0.0  3.2   0:01.84 spamd
9922 jm        18   0 35928  32m 2036 S  0.0  3.3   0:00.52 spamd
9923 jm        15   0 35400  30m  724 S  0.0  3.1   0:00.00 spamd

without:
10152 jm	16   0 35408  32m 2376 S  0.0  3.2   0:02.15 spamd
10160 jm	18   0 35916  32m 2036 S  0.0  3.2   0:00.28 spamd
10161 jm	17   0 35936  32m 2032 S  0.0  3.2   0:00.28 spamd

I think that indicates that the theory that config-parsing is memory-heavy,
causing a bloated arena full of deleted objects and scalars after it
completes, is probably not truly the case.

Maybe I've missed something, though. if anyone feels like carrying this on to a

stage where it works, feel free ;)

(note btw that this patch will cause bugs; some of the config data is not
copied over.  it's just a POC.)