Summary: | [Patch] Enhance XWPF Paragraph to parse (nested) smart tags | ||
---|---|---|---|
Product: | POI | Reporter: | Fabian Lange <fabian.lange> |
Component: | XWPF | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | patch tar.gz and xml file |
Thanks for this, applied (with a few little tweaks) in r1210774. (For future reference, it might be better to avoid reformatting the rest of the code, as it makes reviewing a bit harder. We should really standardise, but not always in a feature patch!)
Created attachment 28026: patch tar.gz and xml file

Word sometimes adds smart tags to text entered by the user. They might be simple, like this:

    <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="country-region">
      <w:r>
        <w:rPr>
          <w:lang w:val="en-US" />
        </w:rPr>
        <w:t>India</w:t>
      </w:r>
    </w:smartTag>

or even nested:

    <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
      <w:smartTag w:uri="urn:schemas:contacts" w:element="GivenName">
        <w:r>
          <w:rPr>
            <w:lang w:val="en-US" />
          </w:rPr>
          <w:t>Marilyn</w:t>
        </w:r>
      </w:smartTag>
      <w:r>
        <w:rPr>
          <w:lang w:val="en-US" />
        </w:rPr>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:smartTag w:uri="urn:schemas:contacts" w:element="Sn">
        <w:r>
          <w:rPr>
            <w:lang w:val="en-US" />
          </w:rPr>
          <w:t>Monroe</w:t>
        </w:r>
      </w:smartTag>
    </w:smartTag>

The current paragraph implementation simply ignores instances of CTSmartTagRun, so the runs inside a smart tag never show up in the paragraph.

My proposed patch introduces recursive parsing for CTSmartTagRun. I did consider making all tags recursive, but that caused other tests to fail. I think this might be an option for further improvement.

With the patch, the test cases that check for smart tags pass, and two issues in Tika are fixed. Note that my implementation discards the information from the smart tag itself.

The patch also contains a minor cleanup of the mixed tab/space indentation in this class, and removes a duplicate document != null check.
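For anyone reading this later, the idea of recursing into smart tags can be sketched roughly as below. This is not the code from the patch: it uses the JDK's DOM parser instead of POI's XMLBeans types such as CTSmartTagRun, and the class and method names (SmartTagTextSketch, extractText, collectRuns, appendRunText) are made up for illustration. It assumes the fragment is wrapped in a w:p element that declares the WordprocessingML namespace. As in the description above, the smart tag's own uri/element attributes are thrown away and only the run text is kept.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of POI.
class SmartTagTextSketch {
    // WordprocessingML main namespace, bound to the "w" prefix in the samples.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Parses a w:p fragment and returns the text of all runs, including nested ones. */
    static String extractText(String paragraphXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Element paragraph = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(paragraphXml)))
                .getDocumentElement();
        StringBuilder text = new StringBuilder();
        collectRuns(paragraph, text);
        return text.toString();
    }

    /** Visits direct children: runs are read, smart tags are descended into. */
    private static void collectRuns(Element parent, StringBuilder text) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (!(child instanceof Element) || !W_NS.equals(child.getNamespaceURI())) {
                continue; // skip whitespace text nodes and foreign elements
            }
            Element element = (Element) child;
            String name = element.getLocalName();
            if ("r".equals(name)) {
                appendRunText(element, text);
            } else if ("smartTag".equals(name)) {
                // Recursive step: a smart tag may contain runs or further smart tags.
                // Its w:uri and w:element attributes are deliberately ignored.
                collectRuns(element, text);
            }
            // Any other element (e.g. w:pPr) is ignored in this sketch.
        }
    }

    /** Appends the content of every w:t directly inside a run. */
    private static void appendRunText(Element run, StringBuilder text) {
        for (Node child = run.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element
                    && W_NS.equals(child.getNamespaceURI())
                    && "t".equals(child.getLocalName())) {
                text.append(child.getTextContent());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:smartTag w:uri=\"urn:schemas-microsoft-com:office:smarttags\" w:element=\"PersonName\">"
                + "<w:smartTag w:uri=\"urn:schemas:contacts\" w:element=\"GivenName\">"
                + "<w:r><w:t>Marilyn</w:t></w:r></w:smartTag>"
                + "<w:r><w:t xml:space=\"preserve\"> </w:t></w:r>"
                + "<w:smartTag w:uri=\"urn:schemas:contacts\" w:element=\"Sn\">"
                + "<w:r><w:t>Monroe</w:t></w:r></w:smartTag>"
                + "</w:smartTag></w:p>";
        System.out.println(extractText(xml)); // prints "Marilyn Monroe"
    }
}
```

Only w:smartTag is treated as a container here, which mirrors the choice described above of not making every tag recursive; a version that descends into all elements would also pick up text from places the existing tests do not expect.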