Bug 61470

Summary: Text with phonetic runs isn't extracted in docx
Product: POI
Reporter: Tim Allison <tallison>
Component: XWPF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: major    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: example file
Test script
reason to cache...for posterity and a later issue
reason to cache for posterity and a later issue

Description Tim Allison 2017-08-29 17:42:19 UTC
Created attachment 35269
example file

Over on TIKA-2448, I found that our DOM model is not extracting runs within "ruby" sections.  This means that neither the primary text ("東京") nor the phonetic text ("とうきょう") is extracted.

The more general point is that a run can contain a run...ugh!


  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
           ...
        </w:rPr>
        <w:ruby>
          <w:rt>
            <w:r>
              <w:rPr>
               .....
              </w:rPr>
              <w:t>とうきょう</w:t>
            </w:r>
          </w:rt>
          <w:rubyBase>
            <w:r w:rsidR="001B7DA3">
              <w:rPr>
               ....
              </w:rPr>
              <w:t>東京</w:t>
            </w:r>
          </w:rubyBase>
        </w:ruby>
      </w:r>
    </w:p>
  </w:body>
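
A minimal sketch of why the nesting matters, written in Python with only the standard library rather than with POI itself. A reader that looks only at the w:t children directly under each top-level w:r finds nothing in this paragraph, while one that descends into w:ruby finds both strings. The sample is a trimmed copy of the fragment above with the WordprocessingML namespace declaration added (an assumption, since the fragment omits it); direct_text and ruby_text are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Trimmed copy of the fragment from the report, with the namespace declared.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r>
      <w:ruby>
        <w:rt><w:r><w:t>とうきょう</w:t></w:r></w:rt>
        <w:rubyBase><w:r><w:t>東京</w:t></w:r></w:rubyBase>
      </w:ruby>
    </w:r>
  </w:p>
</w:body>"""


def direct_text(run):
    """Text from w:t elements that are immediate children of the run."""
    return "".join(t.text or "" for t in run.findall("w:t", NS))


def ruby_text(run):
    """(base, phonetic) text from the runs nested inside w:ruby, if any."""
    base = "".join(
        t.text or "" for t in run.findall("w:ruby/w:rubyBase/w:r/w:t", NS)
    )
    phonetic = "".join(
        t.text or "" for t in run.findall("w:ruby/w:rt/w:r/w:t", NS)
    )
    return base, phonetic


body = ET.fromstring(SAMPLE)
outer_run = body.find("w:p/w:r", NS)

flat = direct_text(outer_run)          # nothing: the outer run has no w:t child
base, phonetic = ruby_text(outer_run)  # text lives one level down, in nested runs
print(repr(flat), base, phonetic)
```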
Comment 1 Tim Allison 2017-08-30 16:31:08 UTC
r1806712

Added extraction of runs within ruby elements; added ability for users to select whether or not to concatenate phonetic runs; set default toString() behavior to include phonetic runs.
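
The behaviour described in this comment can be illustrated with a rough standard-library Python sketch. It is not the code committed in r1806712: the concatenate_phonetic flag, the function names, and the "base (phonetic)" output format are all assumptions made for illustration, and the sample paragraph (including the trailing 駅 run) is made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Fully qualified tag names in ElementTree's "{namespace}local" form.
T, R, RUBY, RT, BASE = (
    f"{{{W}}}{name}" for name in ("t", "r", "ruby", "rt", "rubyBase")
)

# Illustrative paragraph: one ruby run followed by one plain run.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r>
    <w:ruby>
      <w:rt><w:r><w:t>とうきょう</w:t></w:r></w:rt>
      <w:rubyBase><w:r><w:t>東京</w:t></w:r></w:rubyBase>
    </w:ruby>
  </w:r>
  <w:r><w:t>駅</w:t></w:r>
</w:p>"""


def run_text(run, concatenate_phonetic=True):
    """Text of a run, recursing into runs nested under w:ruby."""
    parts = []
    for child in run:
        if child.tag == T:
            parts.append(child.text or "")
        elif child.tag == RUBY:
            # Base text always contributes to the output.
            for inner in child.findall(f"{BASE}/{R}"):
                parts.append(run_text(inner, concatenate_phonetic))
            # Phonetic text is appended only when the caller asks for it.
            if concatenate_phonetic:
                phonetic = "".join(
                    run_text(inner, concatenate_phonetic)
                    for inner in child.findall(f"{RT}/{R}")
                )
                parts.append(f" ({phonetic})")
    return "".join(parts)


def paragraph_text(paragraph, concatenate_phonetic=True):
    """Concatenate the text of the paragraph's top-level runs."""
    return "".join(
        run_text(run, concatenate_phonetic) for run in paragraph.findall(R)
    )


paragraph = ET.fromstring(SAMPLE)
with_phonetic = paragraph_text(paragraph)
without_phonetic = paragraph_text(paragraph, concatenate_phonetic=False)
print(with_phonetic)
print(without_phonetic)
```

Defaulting the flag to True mirrors the stated default of including phonetic runs; callers that feed the text to downstream analysis can switch it off.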
Comment 2 studio test 2017-08-31 08:13:56 UTC
Created attachment 35271
Test script
Comment 3 Tim Allison 2017-08-31 12:17:06 UTC
Spam?
Comment 4 Tim Allison 2017-08-31 19:39:53 UTC
Given this example:
162.242.228.174/docs/commoncrawl2/WI/WIFC2FI3QH64A6KHOBEDNQKLN5O5EYSS

I wonder if we should cache the phonetic content as we read through the document and then dump it at the end.  This would allow a document to be found via its phonetic info, and it wouldn't completely wreck NLP applications.

For another issue...
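
One way the caching idea above might look, as a standard-library Python sketch rather than anything in POI: base text goes into the main stream in document order, phonetic text is collected on the side and appended after the body. The function name, the newline between body and tail, and the space-separated tail are assumptions; the sample body is illustrative.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Fully qualified tag names in ElementTree's "{namespace}local" form.
P, T, R, RUBY, RT, BASE = (
    f"{{{W}}}{name}" for name in ("p", "t", "r", "ruby", "rt", "rubyBase")
)

SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r>
      <w:ruby>
        <w:rt><w:r><w:t>とうきょう</w:t></w:r></w:rt>
        <w:rubyBase><w:r><w:t>東京</w:t></w:r></w:rubyBase>
      </w:ruby>
    </w:r>
    <w:r><w:t>駅</w:t></w:r>
  </w:p>
</w:body>"""


def extract_with_phonetic_tail(body):
    """Main text in reading order, then all cached phonetic text at the end."""
    main, cached = [], []

    def walk(run):
        for child in run:
            if child.tag == T:
                main.append(child.text or "")
            elif child.tag == RUBY:
                # Base text stays inline so the sentence reads normally.
                for inner in child.findall(f"{BASE}/{R}"):
                    walk(inner)
                # Phonetic text is held back instead of interrupting the flow.
                for inner in child.findall(f"{RT}/{R}"):
                    cached.append("".join(t.text or "" for t in inner.iter(T)))

    for paragraph in body.iter(P):
        for run in paragraph.findall(R):
            walk(run)
        main.append("\n")

    return "".join(main) + " ".join(cached)


text = extract_with_phonetic_tail(ET.fromstring(SAMPLE))
print(text)
```

The body text stays uninterrupted for NLP consumers, while the phonetic strings remain searchable in the tail.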
Comment 5 Tim Allison 2017-08-31 19:44:50 UTC
Created attachment 35275
reason to cache...for posterity and a later issue
Comment 6 Tim Allison 2017-08-31 19:46:05 UTC
Created attachment 35276
reason to cache for posterity and a later issue

Correct file attached this time.