Bug 61470 - Text with phonetic runs aren't extracted in docx
Summary: Text with phonetic runs aren't extracted in docx
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: unspecified
Hardware: PC All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
Reported: 2017-08-29 17:42 UTC by Tim Allison
Modified: 2017-08-31 19:46 UTC

Attachments
example file (12.23 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-08-29 17:42 UTC, Tim Allison
Test script (16.59 KB, application/xml)
2017-08-31 08:13 UTC, studio test
reason to cache...for posterity and a later issue (476.27 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-08-31 19:44 UTC, Tim Allison
reason to cache for posterity and a later issue (45.41 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-08-31 19:46 UTC, Tim Allison

Description Tim Allison 2017-08-29 17:42:19 UTC
Created attachment 35269
example file

Over on TIKA-2448, I found that our DOM model is not extracting runs within "ruby" sections.  This means that neither the primary text ("東京") nor the phonetic text ("とうきょう") is extracted.

The more general point is that a run can contain a run...ugh!


  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
           ...
        </w:rPr>
        <w:ruby>
          <w:rt>
            <w:r>
              <w:rPr>
               .....
              </w:rPr>
              <w:t>とうきょう</w:t>
            </w:r>
          </w:rt>
          <w:rubyBase>
            <w:r w:rsidR="001B7DA3">
              <w:rPr>
               ....
              </w:rPr>
              <w:t>東京</w:t>
            </w:r>
          </w:rubyBase>
        </w:ruby>
      </w:r>
    </w:p>
  </w:body>
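
To make the nesting concrete, here is a minimal sketch of a recursive walk over this structure, using only the JDK's DOM parser rather than POI's own classes. The class name RubyRunSketch, the includePhonetic flag, and the "base (phonetic)" output format are assumptions made for illustration; this is not the code that was committed to POI.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical sketch: all names here are illustrative, not POI API.
class RubyRunSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:t, descending into runs nested inside w:ruby.
    static String extract(String xml, boolean includePhonetic) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Node root = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            StringBuilder out = new StringBuilder();
            walk(root, includePhonetic, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    static boolean isW(Node n, String localName) {
        return n.getNodeType() == Node.ELEMENT_NODE
            && W_NS.equals(n.getNamespaceURI())
            && localName.equals(n.getLocalName());
    }

    // Recurses through every child, so a run inside a run is handled naturally.
    static void walk(Node node, boolean includePhonetic, StringBuilder out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isW(c, "t")) {
                out.append(c.getTextContent());
            } else if (isW(c, "ruby")) {
                appendRuby(c, includePhonetic, out);
            } else {
                walk(c, includePhonetic, out);
            }
        }
    }

    // Emits the base text first, then (optionally) the phonetic text in parentheses.
    static void appendRuby(Node ruby, boolean includePhonetic, StringBuilder out) {
        StringBuilder base = new StringBuilder();
        StringBuilder phonetic = new StringBuilder();
        for (Node c = ruby.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isW(c, "rubyBase")) {
                walk(c, includePhonetic, base);
            } else if (isW(c, "rt")) {
                walk(c, includePhonetic, phonetic);
            }
        }
        out.append(base);
        if (includePhonetic && phonetic.length() > 0) {
            out.append(" (").append(phonetic).append(")");
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p><w:r><w:ruby>"
            + "<w:rt><w:r><w:t>とうきょう</w:t></w:r></w:rt>"
            + "<w:rubyBase><w:r><w:t>東京</w:t></w:r></w:rubyBase>"
            + "</w:ruby></w:r></w:p></w:body></w:document>";
        System.out.println(extract(xml, true));   // base text plus the reading
        System.out.println(extract(xml, false));  // base text only
    }
}
```

The walk never assumes that w:t sits directly under the outer w:r, which is the assumption that breaks on the document above.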
Comment 1 Tim Allison 2017-08-30 16:31:08 UTC
r1806712

Added extraction of runs within ruby elements; added ability for users to select whether or not to concatenate phonetic runs; set default toString() behavior to include phonetic runs.
Comment 2 studio test 2017-08-31 08:13:56 UTC
Created attachment 35271
Test script
Comment 3 Tim Allison 2017-08-31 12:17:06 UTC
Spam?
Comment 4 Tim Allison 2017-08-31 19:39:53 UTC
Given this example:
162.242.228.174/docs/commoncrawl2/WI/WIFC2FI3QH64A6KHOBEDNQKLN5O5EYSS

I wonder if we should cache the phonetic content as we read through the document and then dump it at the end of the extracted text.  This would still allow a document to be found via the phonetic info, and the inline readings wouldn't completely wreck NLP applications.

For another issue...
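
A rough sketch of that caching idea, again using only the JDK's DOM parser and not POI's classes. The class name PhoneticCacheSketch, the newline before the cached block, and the single-space separator between readings are all assumptions for illustration; nothing like this has been committed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical sketch: names and output layout are illustrative, not POI API.
class PhoneticCacheSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Body text comes first and uninterrupted; cached readings follow on a final line.
    static String extract(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Node root = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            StringBuilder body = new StringBuilder();
            List<String> phoneticCache = new ArrayList<>();
            walk(root, body, phoneticCache);
            if (!phoneticCache.isEmpty()) {
                body.append('\n').append(String.join(" ", phoneticCache));
            }
            return body.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    static boolean isW(Node n, String localName) {
        return n.getNodeType() == Node.ELEMENT_NODE
            && W_NS.equals(n.getNamespaceURI())
            && localName.equals(n.getLocalName());
    }

    static void walk(Node node, StringBuilder body, List<String> cache) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isW(c, "t")) {
                body.append(c.getTextContent());
            } else if (isW(c, "rt")) {
                // Divert the phonetic text into the cache instead of the body.
                StringBuilder reading = new StringBuilder();
                walk(c, reading, cache);
                if (reading.length() > 0) {
                    cache.add(reading.toString());
                }
            } else {
                walk(c, body, cache);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
            + "<w:r><w:ruby><w:rt><w:r><w:t>とうきょう</w:t></w:r></w:rt>"
            + "<w:rubyBase><w:r><w:t>東京</w:t></w:r></w:rubyBase></w:ruby></w:r>"
            + "<w:r><w:t>と</w:t></w:r>"
            + "<w:r><w:ruby><w:rt><w:r><w:t>おおさか</w:t></w:r></w:rt>"
            + "<w:rubyBase><w:r><w:t>大阪</w:t></w:r></w:rubyBase></w:ruby></w:r>"
            + "</w:p></w:body></w:document>";
        System.out.println(extract(xml));
    }
}
```

With this layout the running text stays contiguous for downstream NLP, while the readings remain in the output so the document is still searchable by them.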
Comment 5 Tim Allison 2017-08-31 19:44:50 UTC
Created attachment 35275
reason to cache...for posterity and a later issue
Comment 6 Tim Allison 2017-08-31 19:46:05 UTC
Created attachment 35276
reason to cache for posterity and a later issue

correct file attached this time