Bug 8026 - t/extracttext.t tesseract test fails on some installations
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Regression Tests
Version: 4.0.0
Hardware: All All
Importance: P2 normal
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
Reported: 2022-08-14 04:08 UTC by Sidney Markowitz
Modified: 2022-08-14 11:28 UTC



Description Sidney Markowitz 2022-08-14 04:08:49 UTC
On my copy of FreeBSD 13.1-RELEASE installed on a VirtualBox VM with tesseract 5.1.0 installed from FreeBSD's pkg repository, test t/extracttext.t consistently fails because tesseract reads the "XJ" characters in the test jpg file as "X]J".

Recreating the test file using a font that is more tesseract-friendly seems to help. Since the test is not intended to test the limits of tesseract's OCR capabilities, this seems like a proper fix. I've redone the test data using the TeX Gyre Bonum font, as per the results in https://superuser.com/a/1543382
Comment 1 Sidney Markowitz 2022-08-14 11:28:54 UTC
As pointed out in another comment on the superuser article linked in the previous comment, the font used seems to be less important than the font size. After initial experiments worked on FreeBSD but failed in different ways on macOS, I found settings that succeed using the available versions of tesseract on all platforms I tried.

These tests revealed a bug when tesseract is installed in a directory that has a space in the pathname, but that is a more minor issue. See bug 8027.
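
For illustration only, here is a minimal Python sketch of one common way a space in an install path can break a command invocation. It is not taken from SpamAssassin, and the actual cause is whatever bug 8027 describes; the install path and image name below are hypothetical. The sketch compares splitting a command string on whitespace with keeping each argument as a separate list element.

```python
import shlex

# Hypothetical values, chosen only to demonstrate the problem.
tesseract_path = "/opt/my tools/bin/tesseract"
image = "gtube.png"

# Naive approach: build one string, then split it on whitespace.
# The space inside the path splits the executable name in two.
naive_argv = f"{tesseract_path} {image} stdout".split()

# Safer approach: keep every argument as its own list element,
# so no re-parsing of the path ever happens.
safe_argv = [tesseract_path, image, "stdout"]

# If a single command string is unavoidable, quote each piece first;
# shlex.split() then recovers the original arguments.
quoted_cmd = " ".join(shlex.quote(arg) for arg in safe_argv)
roundtrip_argv = shlex.split(quoted_cmd)

print("naive executable:", naive_argv[0])
print("safe executable: ", safe_argv[0])
```

With the naive split, the program that would be executed is the truncated "/opt/my" rather than tesseract, which is the kind of failure that only shows up on installations with unusual paths.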

trunk % svn ci -m "bug 8026 - Update extracttest.t with test data that works with more versions of tesseract"
Sending        MANIFEST
Deleting       t/data/spam/extracttext/gtube_jpg.eml
Adding         t/data/spam/extracttext/gtube_png.eml
Sending        t/extracttext.t
Transmitting file data ...done
Committing transaction...
Committed revision 1903411.