SA Bugzilla – Bug 8026
t/extracttext.t tesseract test fails on some installations
Last modified: 2022-08-14 11:28:54 UTC
On my copy of FreeBSD 13.1-RELEASE installed on a VirtualBox VM with tesseract 5.1.0 installed from FreeBSD's pkg repository, test t/extracttext.t consistently fails because tesseract reads the "XJ" characters in the test jpg file as "X]J". Recreating the test file using a font that is more tesseract-friendly seems to help. Since the test is not intended to test the limits of tesseract's OCR capabilities, this seems like a proper fix. I've redone the test data using the TeX Gyre Bonum font as per the results in https://superuser.com/a/1543382
As pointed out in another comment on the superuser article linked in the previous comment, the font used seems to be less important than the font size. After initial experiments worked on FreeBSD but failed in different ways on macOS, I found settings that succeed with the available versions of tesseract on all platforms I tried. These tests also revealed a bug when tesseract is installed in a directory that has a space in the pathname, but that is a more minor issue; see bug 8027.

trunk % svn ci -m "bug 8026 - Update extracttest.t with test data that works with more versions of tesseract"
Sending        MANIFEST
Deleting       t/data/spam/extracttext/gtube_jpg.eml
Adding         t/data/spam/extracttext/gtube_png.eml
Sending        t/extracttext.t
Transmitting file data ...done
Committing transaction...
Committed revision 1903411.
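
For reference on the space-in-pathname issue mentioned above: the actual cause is tracked in bug 8027 and is not described here, so the following is only a general illustration, in Python rather than the Perl of the test suite, of how a command line built as a single string can break on such a path. The install path and image name are hypothetical.

```python
import shlex

# Hypothetical install path containing a space (not taken from this bug).
tesseract = "/opt/my tools/bin/tesseract"
image = "gtube.png"  # hypothetical image name

# Naive approach: build one string and split on whitespace.
# The executable path is cut in two at the space.
naive_argv = f"{tesseract} {image} stdout".split()

# Safer approach: keep the arguments as a list so each element stays intact.
argv = [tesseract, image, "stdout"]

# If a single shell string is unavoidable, quote every element first.
quoted = " ".join(shlex.quote(arg) for arg in argv)

print(naive_argv)                   # path split into two separate elements
print(argv)                         # path preserved as one element
print(shlex.split(quoted) == argv)  # quoting round-trips to the same list
```

Passing the list form to a process-spawning call (for example `subprocess.run(argv)`) avoids shell word splitting entirely, which is usually simpler than quoting.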