Bug 61470

Summary:	Text with phonetic runs isn't extracted in docx
Product:	POI	Reporter:	Tim Allison <tallison>
Component:	XWPF	Assignee:	POI Developers List <dev>
Status:	RESOLVED FIXED
Severity:	major
Priority:	P2
Version:	unspecified
Target Milestone:	---
Hardware:	PC
OS:	All
Attachments:	example file; Test script; reason to cache...for posterity and a later issue; reason to cache for posterity and a later issue

Description Tim Allison 2017-08-29 17:42:19 UTC

Created attachment 35269 [details]
example file

Over on TIKA-2448, I found that our DOM model is not extracting runs within "ruby" sections.  This means that neither the primary text ("東京") nor the phonetic text ("とうきょう") is extracted.

The more general point is that a run can contain a run...ugh!


  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
           ...
        </w:rPr>
        <w:ruby>
          <w:rt>
            <w:r>
              <w:rPr>
               .....
              </w:rPr>
              <w:t>とうきょう</w:t>
            </w:r>
          </w:rt>
          <w:rubyBase>
            <w:r w:rsidR="001B7DA3">
              <w:rPr>
               ....
              </w:rPr>
              <w:t>東京</w:t>
            </w:r>
          </w:rubyBase>
        </w:ruby>
      </w:r>
    </w:p>
  </w:body>
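
For illustration only, here is a minimal sketch of why nesting matters: a walker that only reads the direct w:t children of a top-level w:r sees nothing in the fragment above, while a recursive walk finds both nested runs. It uses the JDK's DOM parser rather than POI's XWPF classes. The class name NestedRunSketch, the helper names, and the trimmed SAMPLE fragment are assumptions made for this sketch, not the code committed for this bug. The strings are written as Unicode escapes for the phonetic text ("とうきょう") and the base text ("東京").

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical sketch: NOT the POI implementation, just a stdlib DOM walk.
class NestedRunSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Trimmed version of the fragment in the report (rPr elements omitted).
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\"><w:r><w:ruby>"
        + "<w:rt><w:r><w:t>\u3068\u3046\u304D\u3087\u3046</w:t></w:r></w:rt>"
        + "<w:rubyBase><w:r><w:t>\u6771\u4EAC</w:t></w:r></w:rubyBase>"
        + "</w:ruby></w:r></w:p>";

    // Appends the text of every w:t element below 'node', at any depth,
    // so runs nested inside w:ruby are reached as well.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(node.getNamespaceURI())
                && "t".equals(node.getLocalName())) {
            out.append(node.getTextContent());
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    // Parses the XML string and returns all w:t text in document order.
    static String extract(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder sb = new StringBuilder();
            collect(doc.getDocumentElement(), sb);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE));
    }
}
```

Because w:rt precedes w:rubyBase in the fragment, a plain document-order walk emits the phonetic text before the base text; a real extractor would probably want to reorder or separate them.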

Comment 1 Tim Allison 2017-08-30 16:31:08 UTC

r1806712

Added extraction of runs within ruby elements; added ability for users to select whether or not to concatenate phonetic runs; set default toString() behavior to include phonetic runs.
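
As a rough sketch of the behavior described in this comment, the following shows a switch that controls whether phonetic text is concatenated, with the no-argument form defaulting to including it. The Run class, the method names, and the " (phonetic)" output format are all hypothetical stand-ins chosen for this example; they are not taken from r1806712.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a "concatenate phonetic runs" switch.
class PhoneticToggleSketch {
    // Stand-in for a run: base text plus optional phonetic (ruby) text.
    static final class Run {
        final String base;
        final String phonetic; // null when the run has no ruby annotation
        Run(String base, String phonetic) {
            this.base = base;
            this.phonetic = phonetic;
        }
    }

    // Joins the runs; phonetic text is appended only when the flag is set.
    static String text(List<Run> runs, boolean concatenatePhonetic) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.base);
            if (concatenatePhonetic && r.phonetic != null) {
                // Assumed format: phonetic text in parentheses after the base.
                sb.append(" (").append(r.phonetic).append(")");
            }
        }
        return sb.toString();
    }

    // Default mirrors the comment above: phonetic runs are included.
    static String text(List<Run> runs) {
        return text(runs, true);
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
            new Run("\u6771\u4EAC", "\u3068\u3046\u304D\u3087\u3046"),
            new Run(" is big", null));
        System.out.println(text(runs));        // phonetic text included
        System.out.println(text(runs, false)); // base text only
    }
}
```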

Comment 2 studio test 2017-08-31 08:13:56 UTC

Created attachment 35271 [details]
Test script

Comment 3 Tim Allison 2017-08-31 12:17:06 UTC

Spam?

Comment 4 Tim Allison 2017-08-31 19:39:53 UTC

Given this example:
162.242.228.174/docs/commoncrawl2/WI/WIFC2FI3QH64A6KHOBEDNQKLN5O5EYSS

I wonder if we should cache the phonetic content as we read through the document and then dump it at the end.  This would allow a document to be found via the phonetic info without completely wrecking NLP applications.

For another issue...
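
A minimal sketch of the caching idea from this comment, assuming a simplified in-memory model: base text is emitted in reading order, phonetic text is collected on the side, and the collected text is appended once at the end so it stays searchable without interrupting the main text. The Run class, the method name, and the newline and space separators are assumptions for illustration; nothing here is implemented in POI as part of this bug.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: defer phonetic text to the end of the extracted output.
class PhoneticCacheSketch {
    // Stand-in for a run: base text plus optional phonetic (ruby) text.
    static final class Run {
        final String base;
        final String phonetic; // null when the run has no ruby annotation
        Run(String base, String phonetic) {
            this.base = base;
            this.phonetic = phonetic;
        }
    }

    static String extract(List<Run> runs) {
        StringBuilder body = new StringBuilder();
        List<String> cached = new ArrayList<>();
        for (Run r : runs) {
            body.append(r.base);        // main text stays contiguous
            if (r.phonetic != null) {
                cached.add(r.phonetic); // remember phonetic text for later
            }
        }
        if (!cached.isEmpty()) {
            // Dump the cached phonetic text after the main text.
            body.append('\n').append(String.join(" ", cached));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
            new Run("\u6771\u4EAC", "\u3068\u3046\u304D\u3087\u3046"),
            new Run(" is big", null));
        System.out.println(extract(runs));
    }
}
```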

Comment 5 Tim Allison 2017-08-31 19:44:50 UTC

Created attachment 35275 [details]
reason to cache...for posterity and a later issue

Comment 6 Tim Allison 2017-08-31 19:46:05 UTC

Created attachment 35276 [details]
reason to cache for posterity and a later issue

correct file attached this time