Apache OpenOffice (AOO) Bugzilla – Issue 51151
performance issue when using spreadsheet data pilot with a database
Last modified: 2017-05-20 10:45:21 UTC
- register the attached hsql database document under the name "testdb" - open a new spreadsheet - use DATA / DATAPILOT / START - select the database "testdb" and the table "performance" - press OK - wait ~15 seconds after click OK (part 1) - D&D "id" into the data area - D&D "part" into the column area - D&D "customer" into the row area - press OK - wait ~29 seconds (part 2) ==>> The result is displayed Maybe both parts have the same reason, so I submit one issue.
Created attachment 27441 [details] bugdoc (hsqldb 1.8.0)
reassign to fs
*** Issue 40682 has been marked as a duplicate of this issue. ***
please have also a look into issue 40682 while fixing and testing this one
I've experienced the same problem on 1.9.109 on Windows connecting to a Postgres database via ODBC. Logging the queries to the database shows that when the DataPilot dialog is opened, the column IDs are requested twice, and the entire table once: SELECT * FROM "public"."vsm_timesheets" WHERE 0 = 1 SELECT * FROM "public"."vsm_timesheets" WHERE 0 = 1 SELECT * FROM "public"."vsm_timesheets" I may be mistaken, but requesting the entire table would appear to be unnecessary when only the column names are required at this stage? Later, when the table is created (after pressing OK on the DataPilot dialog), there are three requests for the column IDs and two for the entire table: SELECT * FROM "public"."vsm_timesheets" WHERE 0 = 1 SELECT * FROM "public"."vsm_timesheets" WHERE 0 = 1 SELECT * FROM "public"."vsm_timesheets" SELECT * FROM "public"."vsm_timesheets" WHERE 0 = 1 SELECT * FROM "public"."vsm_timesheets" I imagine it should be possible to do this with a single request for the column IDs and another for the relevant columns only in the table (rather than all of the columns)
fs->nn: several issues, as I see it: - when OK is pressed in the second dialog, the table is completely traversed, and *all* values from *all* columns are obtained. This happens in ScDatabaseDPData::InitAllColumnEntries Two suggestions: - Move collecting the values to later place. At the moment, all those values are collected for no reason, since the user does not even need them. They are only needed if the user double-clicks a column in the 4th dialog (opening the "DataField" dialog this way), and from there, opens Yet Another Dialog via "Options". I think collecting the values should happen when opening the 5th dialog - as late as possible. - Let the database do more of the work. Collecting all distinct values for a column is something the database can usually do more efficient, if you give it a SELECT DISTINCT <column> FROM <table>. - after the user presses OK in the 3rd dialog, InitAllColumnEntries is called a second time, and seems to be doing exactly the same data collection it already did before. This sounds like a complete waste to me. - after this, the RowSet is executed again, and seems to be completely traversed, again - a third time. This means that all in all, the RowSet is traversed three times - at least two of those rounds sound superfluous to me :)
nn->fs: "Usually more efficient" doesn't help much. If the data from attached example is saved as a dBase file, "SELECT DISTINCT ID FROM PERFORMANCE" takes more than 3 minutes, so I can't do that change. One might wonder why it takes 40 seconds to read the whole table from HSQL, when the same from a dBase file is done in 3 seconds. But if that can't be changed, we'll have to retarget this to "later". Long-term, we might need a separate cache for the source data in Calc anyway (issue 32989).
fs->nn: The problem that our dBase implementation is *that* slow with SELECT DISTINCT was not known (to me), and is now filed as issue 53779, blocking this issue here. I agree that with the amount of work involved, 2.0.1 doesn't make much sense. However, when the DISTINCT problem of dBase solved, I still suggest that Calc should get rid off the 3 iterations through the complete result set, as described above.
Reset the assignee to the default "issues@openoffice.apache.org".