61470 – Text with phonetic runs aren't extracted in docx


Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	XWPF
Version:	unspecified
Hardware:	PC All

Importance:	P2 major
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-08-29 17:42 UTC by Tim Allison
Modified:	2017-08-31 19:46 UTC
CC List:	0 users

Attachments
example file (12.23 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) 2017-08-29 17:42 UTC, Tim Allison
Test script (16.59 KB, application/xml) 2017-08-31 08:13 UTC, studio test
reason to cache...for posterity and a later issue (476.27 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) 2017-08-31 19:44 UTC, Tim Allison
reason to cache for posterity and a later issue (45.41 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) 2017-08-31 19:46 UTC, Tim Allison

Description Tim Allison 2017-08-29 17:42:19 UTC

Created attachment 35269
example file

Over on TIKA-2448, I found that our DOM model is not extracting runs within "ruby" sections.  This means that neither the primary text ("東京") nor the phonetic text ("とうきょう") is extracted.

The more general point is that a run can contain a run...ugh!


  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
           ...
        </w:rPr>
        <w:ruby>
          <w:rt>
            <w:r>
              <w:rPr>
               .....
              </w:rPr>
              <w:t>とうきょう</w:t>
            </w:r>
          </w:rt>
          <w:rubyBase>
            <w:r w:rsidR="001B7DA3">
              <w:rPr>
               ....
              </w:rPr>
              <w:t>東京</w:t>
            </w:r>
          </w:rubyBase>
        </w:ruby>
      </w:r>
    </w:p>
  </w:body>
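
To make the nesting concrete, here is a minimal Python sketch (not POI's code, and not the fix committed for this bug) that walks a trimmed copy of the fragment above. The function names, the `include_phonetic` flag, and the "base(phonetic)" output format are all assumptions made for illustration. It shows that looking only at a run's direct `w:t` children finds nothing, while recursing into `w:ruby` recovers both strings.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Trimmed version of the fragment from the report (rPr elements omitted).
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r>
        <w:ruby>
          <w:rt><w:r><w:t>とうきょう</w:t></w:r></w:rt>
          <w:rubyBase><w:r><w:t>東京</w:t></w:r></w:rubyBase>
        </w:ruby>
      </w:r>
    </w:p>
  </w:body>
</w:document>"""


def naive_paragraph_text(p):
    """Only reads w:t elements that are direct children of top-level runs."""
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


def run_text(run, include_phonetic=True):
    """Hypothetical helper: text of a run, recursing into nested ruby runs."""
    parts = []
    for child in run:
        if child.tag == f"{{{W}}}t":
            parts.append(child.text or "")
        elif child.tag == f"{{{W}}}ruby":
            base_runs = child.findall("w:rubyBase/w:r", NS)
            parts.append("".join(run_text(r, include_phonetic) for r in base_runs))
            if include_phonetic:
                rt_runs = child.findall("w:rt/w:r", NS)
                phonetic = "".join(run_text(r, include_phonetic) for r in rt_runs)
                if phonetic:
                    # Output format is an illustrative choice, not POI's.
                    parts.append(f"({phonetic})")
    return "".join(parts)


def paragraph_text(p, include_phonetic=True):
    """Concatenate the text of every top-level run in a paragraph."""
    return "".join(run_text(r, include_phonetic) for r in p.findall("w:r", NS))


if __name__ == "__main__":
    paragraph = ET.fromstring(SAMPLE).find("w:body/w:p", NS)
    print(repr(naive_paragraph_text(paragraph)))
    print(paragraph_text(paragraph))
    print(paragraph_text(paragraph, include_phonetic=False))
```

The flag mirrors the idea of letting callers decide whether phonetic text is concatenated with the base text.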

Comment 1 Tim Allison 2017-08-30 16:31:08 UTC

r1806712

Added extraction of runs within ruby elements; added ability for users to select whether or not to concatenate phonetic runs; set default toString() behavior to include phonetic runs.

Comment 2 studio test 2017-08-31 08:13:56 UTC

Created attachment 35271
Test script

Comment 3 Tim Allison 2017-08-31 12:17:06 UTC

Spam?

Comment 4 Tim Allison 2017-08-31 19:39:53 UTC

Given this example:
162.242.228.174/docs/commoncrawl2/WI/WIFC2FI3QH64A6KHOBEDNQKLN5O5EYSS

I wonder if we should cache the phonetic content as we read through the document and then dump it at the end.  This would allow a document to be found via its phonetic info without completely wrecking NLP applications that expect the base text to be contiguous.
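
A rough Python sketch of that caching idea, under stated assumptions: the input is a hypothetical list of (base text, phonetic text or None) pairs already pulled from the document, and the function name and the newline/space separators are illustrative choices, not anything in POI.

```python
def extract_with_deferred_phonetics(segments):
    """Emit base text in document order and append cached phonetic text at the end.

    `segments` is a hypothetical list of (base, phonetic) pairs, where
    `phonetic` is None for runs that carry no ruby annotation.
    """
    body = []
    cached_phonetics = []
    for base, phonetic in segments:
        body.append(base)
        if phonetic:
            # Hold the phonetic text back instead of interleaving it.
            cached_phonetics.append(phonetic)
    text = "".join(body)
    if cached_phonetics:
        text += "\n" + " ".join(cached_phonetics)
    return text


if __name__ == "__main__":
    print(extract_with_deferred_phonetics([("東京", "とうきょう"), ("に行く", None)]))
```

The base text stays uninterrupted for downstream processing, while the phonetic strings remain in the output so the document can still be matched by them.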

For another issue...

Comment 5 Tim Allison 2017-08-31 19:44:50 UTC

Created attachment 35275
reason to cache...for posterity and a later issue

Comment 6 Tim Allison 2017-08-31 19:46:05 UTC

Created attachment 35276
reason to cache for posterity and a later issue

Correct file attached this time.