For queries that use joins, the counts incorrect and inconsistent, even if the results

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Rows fetched and Hydrate calls counts are incorrect when using joins about steampipe HOT 3 CLOSED

cbruno10 commented on June 20, 2024

Rows fetched and Hydrate calls counts are incorrect when using joins

from steampipe.

Comments (3)

kaidaguerre commented on June 20, 2024 1

I can see 3 options here:

Option 1

Simply sum the rows returned by each scan. This may result in a higher row count that the actual result rows, but would allow granular reporting on cached and uncached rows

Option 2

Report the muber of rows actually returned, and only mark the rows as cached if all scans had a cache hit

Option 3

Expand the timing display to include more detailed information about each scan - perhaps have a verbose mode for the . timing option

Time: 158ms. Rows fetched: 36. Hydrate calls: 10.

Scans:
  1) Table: aws_ec2_instance. Time: 78ms. Rows fetched: 36. Hydrate calls: 10.  Quals: subnet_id = "subnet-a2c499fc37a6c1fe"

  2) Table: aws_vpc_subnet. Time: 90ms. Rows fetched: 56. Hydrate calls: 0.

(numbers made up)

I quite like option 3

from steampipe.

kaidaguerre commented on June 20, 2024

@cbruno10 @e-gineer I've found the issue - the CLI code to read the timing data does not take into account multiple scans.

tl;dr - I've found the bug but need some guidance on the correct way to display timing info for multi-scan queries. Please see next comment.

Details

For each scan that is executed, the FDW stores ScanMetadata in the Hub.

type ScanMetadata struct {
	Id           int
	Table        string
	CacheHit     bool
	RowsFetched  int64
	HydrateCalls int64
	Columns      []string
	Quals        map[string]*proto.Quals
	Limit        int64
	StartTime    time.Time
	Duration     time.Duration
}

This is then used to populate steampipe_internal.steampipe_scan_metadata foreign table.:

> select * from steampipe_internal.steampipe_scan_metadata 
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| id  | table            | cache_hit | rows_fetched | hydrate_calls | start_time                | duration | columns                              | limit | quals                                 |
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| 191 | aws_ec2_instance | false     | 1            | 0             | 2024-04-04T09:29:52+01:00 | 439      | ["instance_id","vpc_id","subnet_id"] | 0     | [                                     |
|     |                  |           |              |               |                           |          |                                      |       |  {                                    |
|     |                  |           |              |               |                           |          |                                      |       |   "column": "subnet_id",              |
|     |                  |           |              |               |                           |          |                                      |       |   "operator": "=",                    |
|     |                  |           |              |               |                           |          |                                      |       |   "value": "subnet-0a2c499fc37a6c1fe" |
|     |                  |           |              |               |                           |          |                                      |       |  }                                    |
|     |                  |           |              |               |                           |          |                                      |       | ]                                     |
|     |                  |           |              |               |                           |          |                                      |       |                                       |
| 192 | aws_ec2_instance | false     | 0            | 0             | 2024-04-04T09:29:53+01:00 | 433      | ["instance_id","vpc_id","subnet_id"] | 0     | [                                     |
|     |                  |           |              |               |                           |          |                                      |       |  {                                    |

The CLI keeps track of the last row that was read from this table and reads all rows since the previously read row. (i.e. it reads the scan metadata for all scans in the current query). However at present it only takes a single row from the result and uses that to display the timing.

So, in the case of a join or other multi-scan the timing being displayed is the result of just one of the scans, and it therefore incorrect.

It is simple to update the code to include all scan metadata, but not so simple knowing how to combine this data to give useful output. (see next comment)

from steampipe.

kaidaguerre commented on June 20, 2024

The currently logic for building the timing output from the scan metadata is as follows:

The displayed timing result is stored in a TimingMetadata struct:

type TimingMetadata struct {
	RowsFetched       int64
	CachedRowsFetched int64
	HydrateCalls      int64
}

This is populated from the steampipe_scan_metadata data as follows:

	timingResult.Metadata = &queryresult.TimingMetadata{}
	timingResult.Metadata.HydrateCalls += scanRow.HydrateCalls
	if scanRow.CacheHit {
		timingResult.Metadata.CachedRowsFetched += scanRow.RowsFetched
	} else {
		timingResult.Metadata.RowsFetched += scanRow.RowsFetched
	}

IOW, either CachedRowsFetched or RowsFetched is set, depending on whether the scan had a cache hit.

With multiple scans there are a couple of problems with this:

1) Incorrect row count

If we sum the RowsFetched for each scan, it is likely we will have a number greater than the rows returned by the query (as there may be joining/filtering)

We could avoid this by setting RowsFetched to simply be the actual rows returned by the query.

This causes issues with the following point:

2) How to handle mix of cache hits and cache misses

If some scans have a cache hit and some do not, do we count the result rows as cached or not?

from steampipe.

Rows fetched and Hydrate calls counts are incorrect when using joins about steampipe HOT 3 CLOSED

Comments (3)

Option 1

Option 2

Option 3

Details

1) Incorrect row count

2) How to handle mix of cache hits and cache misses

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent