Bug 63215 - Microsoft Excel files(.xlsx - created using Microsoft Office 2016) getting corrupted while manipulated
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-28 05:17 UTC by Sushmita Nag
Modified: 2019-04-07 09:28 UTC
Description Sushmita Nag 2019-02-28 05:17:58 UTC
Specific Microsoft Excel files (.xlsx, created with Microsoft Office 2016) are getting corrupted when manipulated with the apache-poi-3.10-FINAL libraries. Could you please let us know the root cause of this issue?

Attached sample file here.

Use case: we upload and manipulate the attached sample file on our server and then download it. When we open the downloaded file, we get a warning saying that the file is corrupted. We have a legacy application that still uses the POI 3.10-FINAL libraries. Upgrading POI to 3.15 solves the problem. However, we would still like to know the root cause, that is, why this specific .xlsx file gets corrupted when manipulated with POI 3.10-FINAL. Could you please let us know?

Thanks in advance.

Regards,
Sushmita
Comment 1 Sushmita Nag 2019-02-28 05:21:09 UTC
Link to the sample .xlsx file:

https://drive.google.com/open?id=1I3k3TA82iVlCkX8fZe_vmvRwKikqK4Yq
Comment 2 PJ Fanning 2019-02-28 16:22:26 UTC
Apache POI is maintained by volunteers and I'm not sure if this task is going to be taken on by someone in their spare time. Apache POI 3.10 is no longer maintained.

If you do find the issue, please share it.
Comment 3 Greg Woolsey 2019-02-28 17:58:32 UTC
You can start by looking through the changelog [1] and examining issues that sound relevant. The SVN repository and its Git mirror are publicly available for read access [2]. Commit comments and change diffs are where you would find the answer you are looking for. I doubt you will get much more help than that for a question about an open source release that is over five years old. Also, without knowing exactly what manipulations are being done, there is no way to tell what is causing the corruption in that old version. There have been a huge number of changes, new features, bug fixes, optimizations, and security adjustments in the last five years.

[1] https://poi.apache.org/changes.html
[2] https://poi.apache.org/devel/subversion.html
Comment 4 Andreas Beeker 2019-02-28 19:32:48 UTC
To find the root cause, I would first roughly identify the version when it started to work, e.g. not working in 3.13, but working in 3.14.

Then identify the change in the XML documents, i.e. unzip the .xlsx files written by 3.13 and by 3.14 and compare their contents. Update the 3.13 version of the .xlsx iteratively to match the 3.14 version until it works. Then remove all other modifications until you know the minimum set of changes that caused the error.

Then you can use "git bisect" [1] plus a test driver that checks the affected document part for the modifications. Alternatively, locate the code that makes the modification and check its commit log.

[1] https://git-scm.com/docs/git-bisect-lk2009.html
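The unzip-and-compare step described above can be sketched in a few lines. This is only a minimal sketch, assuming both files are ordinary ZIP packages; the function names and the file names in the usage note are hypothetical and not part of POI.

```python
import hashlib
import zipfile


def part_digests(xlsx_path):
    """Map each part name inside the .xlsx (a ZIP package) to a SHA-256 of its bytes."""
    with zipfile.ZipFile(xlsx_path) as package:
        return {
            name: hashlib.sha256(package.read(name)).hexdigest()
            for name in package.namelist()
        }


def compare_packages(path_a, path_b):
    """Return (only_in_a, only_in_b, changed) as sorted lists of part names."""
    a = part_digests(path_a)
    b = part_digests(path_b)
    only_in_a = sorted(set(a) - set(b))
    only_in_b = sorted(set(b) - set(a))
    # A part counts as changed when it exists in both packages with different bytes.
    changed = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_in_a, only_in_b, changed
```

Calling `compare_packages("working.xlsx", "corrupted.xlsx")` with two local copies narrows the search to the parts that differ at all. Parts reported as changed may differ only in whitespace or attribute order, so they still need a closer look.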
Comment 5 Sushmita Nag 2019-03-18 09:31:56 UTC
Hi Andreas,

I unzipped the files and compared their contents. I found a one-line difference between the drawing1.xml.rels files of the working and the non-working .xlsx documents.

File path: Corrupted\xl\drawings\_rels\drawing1.xml.rels

Both the corrupted and the working file can be found at the link below:

https://drive.google.com/open?id=1DoOYmJRtwJVug0A_4PgcIuP7Kq1aRDnt


The line that is missing from drawing1.xml.rels in the corrupted file is:

<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="javascript:" TargetMode="External"/>


I guess this issue occurs only when a hyperlink is present. In all other valid cases, I could see only images.

Could you please let us know your thoughts on this?

Regards,
Sushmita
Comment 6 Sushmita Nag 2019-03-18 10:24:10 UTC
Also, when opening the corrupted .xlsx document, Excel shows the error message below.

Excel completed file level validation and repair. Some parts of this workbook may have been repaired or discarded.
Repaired Records: Drawing from /xl/drawings/drawing1.xml part (Drawing shape)
Comment 7 Andreas Beeker 2019-03-18 19:25:35 UTC
I've downloaded both of your files and found a ton of differences, not to mention that their file sizes differ by 400 KB.
So this is what I did for the *.rels, *.vml and *.xml files (shown here for *.rels).

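# Pretty-print every .rels file in place so that the comparison is line-based.
find . -name "*.rels" -exec xmllint --format {} --output {} \;
# Strip the standalone="yes" attribute in place (assumes GNU sed for -i).
find . -name "*.rels" -exec sed -i -e "s/ standalone=.yes.//" {} \;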

Then use WinMerge or Meld to compare the files.
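Where xmllint or a graphical diff tool is not at hand, the same normalise-then-compare idea can be approximated with a short script. This is a rough sketch only: `pretty_part` and `diff_part` are hypothetical helper names, and it assumes the part is well-formed XML that fits in memory.

```python
import difflib
import zipfile
from xml.dom import minidom


def pretty_part(xlsx_path, part_name):
    """Read one XML part from the package and return it as indented lines."""
    with zipfile.ZipFile(xlsx_path) as package:
        raw = package.read(part_name)
    document = minidom.parseString(raw)
    # Serialise from the root element so that the XML declaration (and its
    # standalone="yes" attribute) never takes part in the comparison.
    pretty = document.documentElement.toprettyxml(indent="  ")
    # Drop the whitespace-only lines produced by existing indentation.
    return [line for line in pretty.splitlines() if line.strip()]


def diff_part(path_a, path_b, part_name):
    """Unified diff of the same part taken from two packages."""
    return list(difflib.unified_diff(
        pretty_part(path_a, part_name),
        pretty_part(path_b, part_name),
        fromfile=path_a,
        tofile=path_b,
        lineterm="",
    ))
```

For example, `diff_part("corrupted.xlsx", "working.xlsx", "xl/drawings/_rels/drawing1.xml.rels")` (hypothetical local file names) would list a relationship that exists in only one of the two files as a `+` or `-` line.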

> Only in-case of hyperlink, this issue is occurring i guess.

The guess can easily be validated by adding the line to the .rels file and trying again ...
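One way to run that experiment without editing the package by hand is to write a patched copy. The sketch below is an illustration under several assumptions: the part is UTF-8, it contains an explicit `</Relationships>` closing tag, and the Id `rId1` is not already used in the corrupted file. `add_relationship` is a hypothetical helper, not a POI API.

```python
import zipfile

RELS_PART = "xl/drawings/_rels/drawing1.xml.rels"
# The relationship line quoted earlier in this report.
MISSING_LINE = (
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" '
    'Target="javascript:" TargetMode="External"/>'
)


def add_relationship(src_path, dst_path, part_name=RELS_PART, line=MISSING_LINE):
    """Copy src_path to dst_path, inserting `line` before </Relationships> in one part."""
    patched = False
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part_name:
                text = data.decode("utf-8")  # assumption: the part is UTF-8
                if "</Relationships>" not in text:
                    raise ValueError("no </Relationships> closing tag in " + part_name)
                text = text.replace("</Relationships>", line + "</Relationships>", 1)
                data = text.encode("utf-8")
                patched = True
            # All other parts are copied byte for byte.
            dst.writestr(info.filename, data)
    if not patched:
        raise ValueError("part not found: " + part_name)
```

Opening the file produced by `add_relationship("corrupted.xlsx", "patched.xlsx")` (hypothetical file names) in Excel would then show whether the repair warning goes away. If it does not, the missing relationship is probably not the only relevant difference.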

If I wanted to find that difference, I would try to reduce the example files to the bare minimum in which the error still occurs.

I'm just curious: why would you invest so much time in finding the reason for an already solved problem?