Issue 51151 - performance issue when using spreadsheet data pilot with a database
Summary: performance issue when using spreadsheet data pilot with a database
Alias: None
Product: Base
Classification: Application
Component: code (show other issues)
Version: recent-trunk
Hardware: All All
: P3 Trivial with 11 votes (vote)
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
: 40682 (view as issue list)
Depends on:
Reported: 2005-06-23 11:46 UTC by marc.neumann
Modified: 2017-05-20 10:45 UTC (History)
2 users (show)

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---

bugdoc (hsqldb 1.8.0) (170.06 KB, application/vnd.sun.xml.base)
2005-06-23 12:02 UTC, marc.neumann
no flags Details

Note You need to log in before you can comment on or make changes to this issue.
Description marc.neumann 2005-06-23 11:46:14 UTC
- register the attached hsql database document under the name "testdb"
- open a new spreadsheet
- select the database "testdb" and the table "performance"
- press OK
- wait ~15 seconds after click OK (part 1)
- D&D "id" into the data area
- D&D "part" into the column area
- D&D "customer" into the row area
- press OK
- wait ~29 seconds (part 2)
==>> The result is displayed

Maybe both parts have the same reason, so I submit one issue.
Comment 1 marc.neumann 2005-06-23 12:02:48 UTC
Created attachment 27441 [details]
bugdoc (hsqldb 1.8.0)
Comment 2 marc.neumann 2005-06-23 12:03:17 UTC
reassign to fs
Comment 3 marc.neumann 2005-06-23 13:45:57 UTC
*** Issue 40682 has been marked as a duplicate of this issue. ***
Comment 4 marc.neumann 2005-06-23 13:48:30 UTC
please have also a look into issue 40682 while fixing and testing this one
Comment 5 markcam 2005-07-12 10:32:38 UTC
I've experienced the same problem on 1.9.109 on Windows connecting to a Postgres
database via ODBC.  

Logging the queries to the database shows that when the DataPilot dialog is
opened, the column IDs are requested twice, and the entire table once: 

SELECT * FROM "public"."vsm_timesheets" WHERE  0 = 1 
SELECT * FROM "public"."vsm_timesheets" WHERE  0 = 1 
SELECT * FROM "public"."vsm_timesheets"

I may be mistaken, but requesting the entire table would appear to be
unnecessary when only the column names are required at this stage?

Later, when the table is created (after pressing OK on the DataPilot dialog),
there are three requests for the column IDs and two for the entire table:

SELECT * FROM "public"."vsm_timesheets" WHERE  0 = 1 
SELECT * FROM "public"."vsm_timesheets" WHERE  0 = 1 
SELECT * FROM "public"."vsm_timesheets"
SELECT * FROM "public"."vsm_timesheets" WHERE  0 = 1 
SELECT * FROM "public"."vsm_timesheets"

I imagine it should be possible to do this with a single request for the column
IDs and another for the relevant columns only in the table (rather than all of
the columns)
Comment 6 Frank Schönheit 2005-07-22 14:48:50 UTC
fs->nn: several issues, as I see it:
- when OK is pressed in the second dialog, the table is completely traversed,
  and *all* values from *all* columns are obtained. This happens in
  Two suggestions:
  - Move collecting the values to later place. At the moment, all those
    values are collected for no reason, since the user does not even need
    them. They are only needed if the user double-clicks a column in the 4th
    dialog (opening the "DataField" dialog this way), and from there, opens
    Yet Another Dialog via "Options". I think collecting the values should
    happen when opening the 5th dialog - as late as possible.
  - Let the database do more of the work. Collecting all distinct values for
    a column is something the database can usually do more efficient, if
    you give it a SELECT DISTINCT <column> FROM <table>.
- after the user presses OK in the 3rd dialog, InitAllColumnEntries is called
  a second time, and seems to be doing exactly the same data collection
  it already did before. This sounds like a complete waste to me.
- after this, the RowSet is executed again, and seems to be completely
  traversed, again - a third time. This means that all in all, the RowSet
  is traversed three times - at least two of those rounds sound superfluous
  to me :)
Comment 7 niklas.nebel 2005-08-24 18:08:02 UTC
nn->fs: "Usually more efficient" doesn't help much. If the data from attached
example is saved as a dBase file, "SELECT DISTINCT ID FROM PERFORMANCE" takes
more than 3 minutes, so I can't do that change.

One might wonder why it takes 40 seconds to read the whole table from HSQL, when
the same from a dBase file is done in 3 seconds. But if that can't be changed,
we'll have to retarget this to "later".

Long-term, we might need a separate cache for the source data in Calc anyway
(issue 32989).
Comment 8 Frank Schönheit 2005-08-25 12:01:36 UTC
The problem that our dBase implementation is *that* slow with SELECT DISTINCT
was not known (to me), and is now filed as issue 53779, blocking this issue here.

I agree that with the amount of work involved, 2.0.1 doesn't make much sense.
However, when the DISTINCT problem of dBase solved, I still suggest that Calc
should get rid off the 3 iterations through the complete result set, as
described above.
Comment 9 Marcus 2017-05-20 10:45:21 UTC
Reset the assignee to the default "".