Bug 7727 - New Plugin TesseractOcr
Summary: New Plugin TesseractOcr
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins (show other bugs)
Version: unspecified
Hardware: PC Linux
: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-24 21:01 UTC by John Mertz
Modified: 2020-03-02 13:25 UTC (History)
3 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
TesseractOcr plugin application/gzip None John Mertz [HasCLA]
Plugin Updated Jul 2 application/gzip None John Mertz [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description John Mertz 2019-06-24 21:01:24 UTC
Created attachment 5663 [details]
TesseractOcr plugin

Greetings,

On behalf of Fastnet SA/MailCleaner Software, Inc., I have developed a new OCR plugin.

The primary reasons behind this are the poor performance of FuzzyOCR in its default configuration and the need to maintain a separate set of rules with that plugin. For those who are not directly familiar, Fuzzy has a separate wordlist with a value for how approximate the match can be which is fairly opaque. When it finds a match, it simply passes the hit rule back.

TesseractOcr performs very well and efficiently. Instead of having a separate wordlist, the plugin I have written passed the parsed text back to the parent SpamAssassin process where the content can be matched by regular Body rules. Fuzzy-matching can then be handled through regular expressions which are already largely written with false-positives in mind and which are much easier to debug.

The plugin is currently operating on several MailCleaner machines without issue. Given the performance overhead of any OCR plugin, this is not proposed to be enabled by default.

I look forward to any feedback.

Regards,
John Mertz
john.mertz@mailcleaner.net
Comment 1 Henrik Krohns 2019-06-25 07:43:19 UTC
Hi there, some comments. I'm always blunt, so keep that in mind. :-)

- Quite hesitant about $p->set_rendered() use and the whole concept of adding "random crap" to rendered body, which could have many side effects. Nothing uses set_rendered() in current codebase, there needs to be some investigation if it actually works properly, how module priorities should be set, whether stuff end up in Bayes and other plugins that read rendered body etc. Of course this is also bad performance wise if all tests have to wait for OCR to complete. An async model would be much more generally usable, but it would have to have own rules (FuzzyOCR style?). Or perhaps body rules could be tflagged with "ocr" so they can be run independently against that data.

- Absolutely no reason to use a 9 year old Image::OCR::Tesseract module, from an author who seems to pretty much left CPAN business long time ago. The modules only purpose seems to be running tesseract/convert shell commands, for which a proper SA way is using $pms->enter_helper_run_mode() / Timeout / helper_app_pipe_open - see Pyzor.pm for example.

- It is _horrible_ security wise to use mime provided $fname for a filesystem name. Just use the static counter, and $ext if needed to construct a fixed safe name. This also removes need for the quite iffy chr(65+$unique++) loop..

- Unneeded dependency File::Which, should use Mail::SpamAssassin::Util::find_executable_in_env_path

- Proper way to combine path and filename string: File::Spec->catfile($tmpdir, $fname)

- "print PICT" should also check for errors (disk could be full?)

- Should not use parse_config to accept any random tocr_* line, use the set_config() stanza that defines exact supported options, and sets their defaults and types, validated

- tesseract_type() has unused $w,$h (remove?)

- should use more common dbg method (also use info() only for serious errors, not for "found attachment" things)

sub dbg {
  my $msg = shift;
  Mail::SpamAssassin::Plugin::dbg("TesseractOct: $msg", @_);
}

- A bit shame to lock into tesseract, as people had good success with gocr previously. Would have been nice to be able to select engine(s). Did you try gocr?


I would not delay 3.4.3 any further, there's really no need to include this there in this state. Just improve it in trunk, if voted so? I'm still +-0 because of all the architectural issues. Of course having an experimental and disabled by default plugin there is no problem.
Comment 2 Henrik Krohns 2019-06-25 08:11:43 UTC
Do you have any statistics on how well this has performed on your mail feeds? And have you actually verified what rules do hit the OCR'd body portion?
Comment 3 Henrik Krohns 2019-06-25 08:52:59 UTC
(In reply to Henrik Krohns from comment #1)
>
> Of course having an experimental and disabled by default plugin there is no problem.

I feel I have to clarify this comment from the SpamAssassin projects point of view:

- Plugins should not be committed to SA to just get some plugin "out there", they should be committed to be developed by SpamAssassin project and preferably widely used.
- Committed plugins are maintained by the SpamAssassin project. As such, if the intent is to also use and develop them internally by some other entity (corporate or not), it might be wiser to just leave maintaining and publishing to that entity. There are many third party plugins that are not part of SpamAssassin: https://wiki.apache.org/spamassassin/CustomPlugins
Comment 4 John Mertz 2019-06-25 15:36:50 UTC
(In reply to Henrik Krohns from comment #1)
> Hi there, some comments. I'm always blunt, so keep that in mind. :-)

Understood. I'm here for the advice. :)

> - Quite hesitant about $p->set_rendered() use and the whole concept of
>   adding "random crap" to rendered body, which could have many side
>   effects.

Well the idea is that it not be "random crap", but crap that has always
been in the message, but never considered to be part of the message in
the past. The email's reader doesn't really distinguish between a jpeg
of a word and the word in plaintext, and so the filter shouldn't really
either to the extent possible (from the point of view that it's all
text, not that the presence of an image is not an indicator for other
reasons).

In the testing I have done, very little content that is not text gets
added. This is mostly things like pipes being added where an actual
vertical line is being used as a separator. The characters matched are
not always perfect, of course, but I have not seen it add anything
completely unexpected.

>   Nothing uses set_rendered() in current codebase, there
>   needs to be some investigation if it actually works properly, how
>   module priorities should be set, whether stuff end up in Bayes and
>   other plugins that read rendered body etc.

I'm aware that set_rendered is not used elsewhere, it took quite some
time to discover exactly how to do the bits because I couldn't find any
other use of it for reference and there is limited documentation in
this area. My intent is that this content IS part of the text of the
message and should be treated the same as the literal text. This would
mean bayes consideration, etc.

>   Of course this is also
>   bad performance wise if all tests have to wait for OCR to complete.
>   An async model would be much more generally usable, but it would
>   have to have own rules (FuzzyOCR style?). Or perhaps body rules
>   could be tflagged with "ocr" so they can be run independently
>   against that data.

Of course. Given the intention that the content be treated as if it is
part of the body, the Fuzzy method isn't really possible. My
understanding of the tflags would be that the rules would run later but
that they would all be done after the message was completely parsed. I
didn't intend to have ocr-specific rules, so I don't think that
tflagging is constructive (I have limited understanding here).

> - Absolutely no reason to use a 9 year old Image::OCR::Tesseract
>   module, from an author who seems to pretty much left CPAN business
>   long time ago. The modules only purpose seems to be running
>   tesseract/convert shell commands, for which a proper SA way is using
>   $pms->enter_helper_run_mode() / Timeout / helper_app_pipe_open - see
>   Pyzor.pm for example.

This was a simple case of not wanting to duplicate effort. If there is
a proper SA way to do things I'll be happy to make that change.

> - It is _horrible_ security wise to use mime provided $fname for a
>   filesystem name. Just use the static counter, and $ext if needed to
>   construct a fixed safe name. This also removes need for the quite
>   iffy chr(65+$unique++) loop..

That part I stole directly from Fuzzy. I didn't see the need to use the
filename, but assumed there was a reason Fuzzy did it, so I did too.
Because fuzzy did it, I assumed that SA was sanitizing the $fname for
me. I'll be more than happy to switch to a counter.

> - Unneeded dependency File::Which, should use
>   Mail::SpamAssassin::Util::find_executable_in_env_path

Happy to do that as well. Thanks for bringing this to my attention.

> - Proper way to combine path and filename string:
>   File::Spec->catfile($tmpdir, $fname)

Thanks.

> - "print PICT" should also check for errors (disk could be full?)

Absolutely right. Will do.

> - Should not use parse_config to accept any random tocr_* line, use
>   the set_config() stanza that defines exact supported options, and
>   sets their defaults and types, validated

Ah. Yes. I was crude about this initially and neglected to improve it.
I'll do that.

> - tesseract_type() has unused $w,$h (remove?)

Correct. _skip was originally part of _type until being give it's own
function. I'll clean that up too.

> - should use more common dbg method (also use info() only for serious
>   errors, not for "found attachment" things)
>
> sub dbg {
>  my $msg = shift;
>  Mail::SpamAssassin::Plugin::dbg("TesseractOct: $msg", @_);
>}

Great. I'll re-jigger this too.

> - A bit shame to lock into tesseract, as people had good success with
>  gocr previously. Would have been nice to be able to select engine(s).
>  Did you try gocr?

I was having exceedingly poor results with Fuzzy when it was set up
with gocr. To the point that bold, unobscured names of certain
pharmaceuticals were not registering at all despite a very loose
matching valiue. When I configured it to use Tesseract, the results
were much better on the same images, so when I wanted to make a plugin
with differing operation Tesseract was the obvious choice.

I had considered making the OCR package configurable as in Fuzzy, but
was dissuaded because it adds unnecessary development and user
complexity. I certainly have nothing against others using gocr, if they
prefer it, but in my opinion a fork for each is simpler. To make
either/all engines usable in the same module either requires complexity
for handling differences within the module itself, or complexity
configuration for the user, or both. I'm not opposed to putting the
burden on the module to properly handle all of the differences, but
more complexity also means more potential bugs.

I'll address the comments mentioned above and provide an updated
tarball.
Comment 5 John Mertz 2019-06-25 15:38:24 UTC
(In reply to Henrik Krohns from comment #2)

> Do you have any statistics on how well this has performed on your
> mail feeds? 

At the moment the module is installed on a couple of mildly trafficked
machines. I wanted to get all of your lovely feedback to make sure
there wasn't any glaring errors that I had missed which would destroy
a busy machine. All of our machines already had FuzzyOcr enabled, so I
don't have a baseline for what the performance is vs no OCR at all.

For the machines on which it is running, there has not been a
significant change in performance when compared to FuzzyOcr (using
gocr). The stats seem to show that it is actually a little bit lighter
on load, but not more than could be explained by fluctuations in
traffic. It seems that it is somewhat more efficient than Fuzzy, but
because Fuzzy runs conditionally if the score is already over a
threshold, this balances things out more.

Very large images can noticeably impact scantimes. A 1920x1080 image of
12pt lorem ipsum text takes about 0.8 seconds of actual scantime on my
machine. Obviously this is a worst-case. It is going to be much faster
for images with little actual text and the plugin is configurable to
only scan messages within size and dimension constraints of your choice.

> And have you actually verified what rules do hit the
> OCR'd body portion?

I can verify that the OCR'd content hits just as if it were part of the
text in the body of the email.
Comment 6 John Mertz 2019-06-25 15:40:06 UTC
(In reply to Henrik Krohns from comment #3)

> I feel I have to clarify this comment from the SpamAssassin projects
> point of view:
> 
> - Plugins should not be committed to SA to just get some plugin "out
> there", they should be committed to be developed by SpamAssassin
> project and preferably widely used.
> - Committed plugins are maintained by the SpamAssassin project. As
> such, if the intent is to also use and develop them internally by
> some other entity (corporate or not), it might be wiser to just leave
> maintaining and publishing to that entity. There are many third party
> plugins that are not part of SpamAssassin:
> https://wiki.apache.org/spamassassin/CustomPlugins

Certainly, if that is the point of view of the project, I'm not here to
push back against any of that. 

I initially introduced the idea to Kevin McGrail because I was
interested in using CPAN as a distribution platform. There are
several plugins there which appear to be authored and presumably
maintained by entities outside of the project itself, so trying to do
the same seemed like the most logical coarse of action and that is why
I'm here.

The module varies from many on the Custom Plugins page in that it is
simply used to interface between SA and the OCR engine available within
all major distribution repositories. Aside from those dependencies, it
is self-contained and does not rely on any technology, data-feed,
licensing, or infrastructure provided by myself or the company. Any
development that I/we do is expected to only be granted by approval of
the project to ensure that it is useful and properly implemented for
everyone.
Comment 7 Henrik Krohns 2019-06-25 18:51:24 UTC
(In reply to John Mertz from comment #6)
> There are several plugins there which appear to be authored and presumably
> maintained by entities outside of the project itself, so trying to do
> the same seemed like the most logical coarse of action and that is why
> I'm here.

There are no plugins in SA that have any outside maintenance. They might have originated from elsewhere, but once in SA it's really the projects responsibility. Well most of the plugins don't require much "maintenance" anyway and possible original authors are usually long gone anyway.

But if you think this is ok, has no effect on your possible corporate business, I may consider +1 later as you seem happy to work along. :-)
Comment 8 Henrik Krohns 2019-06-25 19:17:50 UTC
(In reply to John Mertz from comment #4)
> In the testing I have done, very little content that is not text gets
> added. This is mostly things like pipes being added where an actual
> vertical line is being used as a separator. The characters matched are
> not always perfect, of course, but I have not seen it add anything
> completely unexpected.

Ok sounds promising, I might try extracting images from my corpus and see what tesseract is outputting.

> I'm aware that set_rendered is not used elsewhere, it took quite some
> time to discover exactly how to do the bits because I couldn't find any
> other use of it for reference and there is limited documentation in
> this area. My intent is that this content IS part of the text of the
> message and should be treated the same as the literal text. This would
> mean bayes consideration, etc.

I didn't even know set_rendered exists, and I've been at this for quite a while. ;-) Just a matter of checking that it does things properly and is good for possible other use cases too.

> Of course. Given the intention that the content be treated as if it is
> part of the body, the Fuzzy method isn't really possible. My
> understanding of the tflags would be that the rules would run later but
> that they would all be done after the message was completely parsed. I
> didn't intend to have ocr-specific rules, so I don't think that
> tflagging is constructive (I have limited understanding here).

Thinking about it more, flagging would be pointless. Just scan it like it would be a textual mime part.

> This was a simple case of not wanting to duplicate effort. If there is
> a proper SA way to do things I'll be happy to make that change.

We should always try to use modules that are packaged in most common OS repositories. And ones that actually do something worthwhile to take the risk of version updates breaking something. I'm really wary of manually CPANning these days, sometimes it's a dependency nightmare and hard verifying that stuff doesn't contain backdoors etc (just have a look at the recent npm event-stream bitcoin hack..).
Comment 9 John Mertz 2019-07-02 13:03:14 UTC
Created attachment 5665 [details]
Plugin Updated Jul 2

I have updated the plugin according to the comments received. Changes made include:

- removed dependency on Image::OCR::Tesseract and replaced with local helper_app_pipe_open functions.
- Replaced parse_config with set_config
- removed unneccessary variables from tesseract_type()
- debug error if storage of tmpfile fails
- tmpfile path generated by catfile instead of string concatenation
- use simple counter for tmpfile name instead of MIME filename
- switched to Plugin::dbg instead of manually defined Logger method
- switched to find_executable_in_env_path instead of Which
Comment 10 spamassassin 2019-12-12 12:17:19 UTC
This thread has been mentioned on the users mailing list, so I gave the second attached version a try.

0. I like the idea to provide a more general approach of passing recognized text back to SA.
1. There is a call to cleanup() where it should be clean_up().
2. There is a call to kill_pid() which is undefined.
3. Some tests:
3.1:
I trained a bayes database from a personal ham and a spam corpus without TesseractOcr. Then I compared the classification of enabled vs disabled TesseractOcr.
3.1.1. A run against the sample provided in [1]:
3.1.1.1. Without ocr it hits BAYES_40.
3.1.1.2. With ocr it hits BAYES_50 and additionally FUZZY_BROWSER.
3.1.2. A run against a current "Deutsche Burger werden reich" sample:
3.1.2.1. Without ocr it hits BAYES_99/BAYES_999.
3.1.2.2. With ocr it hits BAYES_95 and provides nothing additional, so the total score actually decreased.

3.2:
I trained a new bayes database from the same corpora with TesseractOcr and made the same quick tests.
3.2.1. A run against the sample provided in [1] provided same results as in 3.1.1.
3.2.2. A run against a current "Deutsche Burger werden reich" sample provides identical test results, i.e. the bayes scores match. This is good, as one can improve the situation with custom rules.

Some my takeaway is, that one should probably retrain bayes.

[1] https://mail-archives.apache.org/mod_mbox/spamassassin-users/201912.mbox/browser
Comment 12 Henrik Krohns 2019-12-13 14:49:25 UTC
It's reading all the STDERR stuff tesseract outputs into $content

 dbg: TesseractOcr: Found content: Page 1
 dbg: TesseractOcr: [...] Warning. Invalid resolution 0 dpi. Using 70 instead.
 dbg: TesseractOcr: [...] Estimating resolution as 938
 dbg: TesseractOcr: [...] Empty page!!
 dbg: TesseractOcr: [...] Estimating resolution as 938
 dbg: TesseractOcr: [...] Empty page!!
 dbg: TesseractOcr: [...] _

Fix is changing helper_app_pipe_open third argument to 0, so it doesn't read stderr.
Comment 13 Henrik Krohns 2019-12-13 15:35:41 UTC
I think the basic concept is sound, rendered() data seems to be working as intended, things are picked up by Bayes and URIBL etc.

Could use bit more code cleanup, atleast using global variables for $tmpdir/file is a bit messy and against object oriented coding (we could have multiple instances of TesseractOcr / $self..).

I still don't know if there's reason to consider this for inclusion in SA, if John wants to keep developing it (I see there's even recent commits on Github). Might make more sense to just advertise this on CustomPlugins page.
Comment 14 John Mertz 2019-12-15 06:15:23 UTC
Thanks for the recent comments!

I am currently working on a major revision to the plugin. The changes on GitHub are not stable. I'll ensure that the recommended changes are considered.

I've so far added optional image preprocessing which substitutes the dependency on ImageMagick for OpenCV. The results with this are much improved. Modern versions of Tesseract also get much better results but require languages to be specified in order to use the correct training data with the machine learning algorithm, so a configuration for this needs to be added.

I will post the code again when this is complete. Again, my reason for suggesting that it be taken in the the official repository is that it contains no proprietary code and so being more easily discoverable serves to benefit the most possible people. However, I accept whatever decision is made on that point.
Comment 15 John Mertz 2020-03-02 12:54:00 UTC
Hello all,

As mentioned in the previous comment, a fairly major revision to the plugin has been made which adds image preprocessing with OpenCV. I've been testing succesfully for more than a week on a handful of machines and am currently looking at deploying to some larger installations.

The code is now on our GitHub:

https://github.com/MailCleaner/TesseractOcr

Note that there are two significant branches. master is configured to work correctly with Tesseract version 4 and above which uses training data which provides significantly better results than earlier versions. Using this version is encouraged unless your system does not support it. This also provides additional configuration variables to define the training data location and languages to be passed when executing Tesseract.

If you system does not provide Tesseract version 4 or above, there is a branch called 3.00 which will continue to support that.

On the issue of adopting the plugin into the core distribution or advertising it as a 3rd party plugin. I don't mind either way now that we've moved distribution of the plugin to GitHub. I would note that the dependency on tesseract-ocr and libopencv-dev does add over 100MB to the plugin's otherwise modest size. Including this bloat for a disabled-by-default plugin probably is not great. We can close this thread and I will announce further updates to the relevant mailing list if no one has any additional input.

For emails with many/large images, the impact on scantimes can be significant. As it is, the plugin has configuration variables for various time, size and dimension limits, so the outlier messages cannot be too catastrophic to overall performance, but the next feature is likely to be caching to help reduce some additional load.
Comment 16 Henrik Krohns 2020-03-02 13:25:34 UTC
I've added it to wiki (feel free to register and edit if needed)

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/CustomPlugins

I suggest engaging spamassassin-users list if you need testers. :-)