Created attachment 35269 [details]
example file

Over on TIKA-2448, I found that our DOM model is not extracting runs within "ruby" sections. This means that neither the primary text ("東京") nor the phonetic text ("とうきょう") is extracted.

The more general point is that a run can contain a run...ugh!

<w:body>
  <w:p>
    <w:r>
      <w:rPr> ... </w:rPr>
      <w:ruby>
        <w:rt>
          <w:r>
            <w:rPr> ..... </w:rPr>
            <w:t>とうきょう</w:t>
          </w:r>
        </w:rt>
        <w:rubyBase>
          <w:r w:rsidR="001B7DA3">
            <w:rPr> .... </w:rPr>
            <w:t>東京</w:t>
          </w:r>
        </w:rubyBase>
      </w:ruby>
    </w:r>
  </w:p>
</w:body>
r1806712 Added extraction of runs within ruby elements; added ability for users to select whether or not to concatenate phonetic runs; set default toString() behavior to include phonetic runs.
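For readers who want to see the nested-run problem in isolation, here is a minimal sketch that walks the same XML shape with the JDK's DOM parser. It is not the code committed in r1806712 and does not use any POI API: the class `RubyRunSketch`, its methods, the `includePhonetic` flag, and the "base(phonetic)" output format are all assumptions made for illustration. The strings in `main` are "東京" and "とうきょう", written as Unicode escapes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration class; not part of POI.
class RubyRunSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of all w:t elements, descending into runs nested in w:ruby.
    static String extract(String xml, boolean includePhonetic) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), includePhonetic, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    static void walk(Node node, boolean includePhonetic, StringBuilder out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE
                    || !W_NS.equals(c.getNamespaceURI())) {
                continue;
            }
            String name = c.getLocalName();
            if ("t".equals(name)) {
                out.append(c.getTextContent());
            } else if ("ruby".equals(name)) {
                appendRuby((Element) c, includePhonetic, out);
            } else {
                walk(c, includePhonetic, out); // covers w:r inside w:r
            }
        }
    }

    // Emits the base text first, then (optionally) the phonetic text in parentheses.
    static void appendRuby(Element ruby, boolean includePhonetic, StringBuilder out) {
        Element base = child(ruby, "rubyBase");
        if (base != null) {
            walk(base, includePhonetic, out);
        }
        Element rt = child(ruby, "rt");
        if (includePhonetic && rt != null) {
            StringBuilder phonetic = new StringBuilder();
            walk(rt, true, phonetic);
            if (phonetic.length() > 0) {
                out.append('(').append(phonetic).append(')');
            }
        }
    }

    static Element child(Element parent, String localName) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE
                    && W_NS.equals(c.getNamespaceURI())
                    && localName.equals(c.getLocalName())) {
                return (Element) c;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\"><w:r><w:ruby>"
            + "<w:rt><w:r><w:t>\u3068\u3046\u304D\u3087\u3046</w:t></w:r></w:rt>"
            + "<w:rubyBase><w:r><w:t>\u6771\u4EAC</w:t></w:r></w:rubyBase>"
            + "</w:ruby></w:r></w:p>";
        System.out.println(extract(xml, true));
        System.out.println(extract(xml, false));
    }
}
```

The sketch puts the base text before the phonetic text even though w:rt precedes w:rubyBase in the markup. That ordering is a design choice of the example, not a statement about what POI does.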
Created attachment 35271 [details] Test script
Spam?
Given this example: 162.242.228.174/docs/commoncrawl2/WI/WIFC2FI3QH64A6KHOBEDNQKLN5O5EYSS I wonder if we should cache the phonetic content as we read through the document and then dump it at the end. That way a document could still be found via its phonetic info, but the phonetic text would not be interleaved with the main text, so it wouldn't completely wreck NLP applications. For another issue...
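One possible shape for the caching idea, again as a sketch over plain DOM rather than POI's actual classes: text found under w:rt is collected in a side list and appended after the main text. The class `DeferredPhoneticSketch`, the newline separator before the phonetic block, and the space between phonetic entries are all assumptions chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical illustration class; not part of POI.
class DeferredPhoneticSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Main text first; cached phonetic text is dumped on a final line.
    static String extract(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder body = new StringBuilder();
            List<String> phonetic = new ArrayList<>();
            walk(doc.getDocumentElement(), false, body, phonetic);
            if (phonetic.isEmpty()) {
                return body.toString();
            }
            return body + "\n" + String.join(" ", phonetic);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // inRt is true while we are anywhere below a w:rt element.
    static void walk(Node node, boolean inRt, StringBuilder body, List<String> phonetic) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE
                    || !W_NS.equals(c.getNamespaceURI())) {
                continue;
            }
            String name = c.getLocalName();
            if ("t".equals(name)) {
                if (inRt) {
                    phonetic.add(c.getTextContent()); // cache, do not interleave
                } else {
                    body.append(c.getTextContent());
                }
            } else {
                walk(c, inRt || "rt".equals(name), body, phonetic);
            }
        }
    }
}
```

With this layout the main text stays contiguous for downstream tokenizers, while the phonetic strings remain in the extracted output for search.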
Created attachment 35275 [details] reason to cache...for posterity and a later issue
Created attachment 35276 [details] reason to cache for posterity and a later issue
Correct file attached this time.