Looking at the Common Crawl regression results, we see many documents reported as an "UNKNOWN" file:

java.lang.IllegalArgumentException: The document is really a UNKNOWN file

Most of these are probably HTML files. The following patch covers at least the failure to identify the already known magics.
Patched via r1847429
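
For illustration only, here is a minimal sketch of how leading bytes could be checked for an HTML-like start. It is not the code from the patch: the class name `HtmlMagicSketch`, the method `looksLikeHtml`, and the prefix list are all hypothetical and chosen just to show the idea of matching known magics at the start of a stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class HtmlMagicSketch {
    // Hypothetical prefixes for illustration; not the list used by the actual patch.
    private static final String[] HTML_PREFIXES = {
        "<!doctype html", "<html", "<head", "<body"
    };

    /** Returns true if the leading bytes look like the start of an HTML document. */
    public static boolean looksLikeHtml(byte[] head) {
        int start = 0;
        // Skip a UTF-8 byte order mark if present.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            start = 3;
        }
        // Skip leading ASCII whitespace, which often precedes the first tag.
        while (start < head.length
                && (head[start] == ' ' || head[start] == '\t'
                    || head[start] == '\r' || head[start] == '\n')) {
            start++;
        }
        // Decode byte-for-byte and compare case-insensitively against the prefixes.
        String text = new String(head, start, head.length - start,
                StandardCharsets.ISO_8859_1).toLowerCase(Locale.ROOT);
        for (String prefix : HTML_PREFIXES) {
            if (text.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] html = "\n<!DOCTYPE html><html>".getBytes(StandardCharsets.US_ASCII);
        byte[] zip = {0x50, 0x4B, 0x03, 0x04}; // ZIP local file header signature
        System.out.println(looksLikeHtml(html));
        System.out.println(looksLikeHtml(zip));
    }
}
```

A check like this only recognizes documents that begin with one of the listed prefixes, so HTML fragments that start with other markup or plain text would still end up as unknown.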