Giter Club home page Giter Club logo

Comments (3)

kaidaguerre avatar kaidaguerre commented on June 20, 2024 1

I can see 3 options here:

Option 1

Simply sum the rows returned by each scan. This may result in a higher row count that the actual result rows, but would allow granular reporting on cached and uncached rows

Option 2

Report the muber of rows actually returned, and only mark the rows as cached if all scans had a cache hit

Option 3

Expand the timing display to include more detailed information about each scan - perhaps have a verbose mode for the . timing option

Time: 158ms. Rows fetched: 36. Hydrate calls: 10.

Scans:
  1) Table: aws_ec2_instance. Time: 78ms. Rows fetched: 36. Hydrate calls: 10.  Quals: subnet_id = "subnet-a2c499fc37a6c1fe"

  2) Table: aws_vpc_subnet. Time: 90ms. Rows fetched: 56. Hydrate calls: 0.  

(numbers made up)

I quite like option 3

from steampipe.

kaidaguerre avatar kaidaguerre commented on June 20, 2024

@cbruno10 @e-gineer I've found the issue - the CLI code to read the timing data does not take into account multiple scans.

tl;dr - I've found the bug but need some guidance on the correct way to display timing info for multi-scan queries. Please see next comment.

Details

For each scan that is executed, the FDW stores ScanMetadata in the Hub.

type ScanMetadata struct {
	Id           int
	Table        string
	CacheHit     bool
	RowsFetched  int64
	HydrateCalls int64
	Columns      []string
	Quals        map[string]*proto.Quals
	Limit        int64
	StartTime    time.Time
	Duration     time.Duration
}

This is then used to populate steampipe_internal.steampipe_scan_metadata foreign table.:

> select * from steampipe_internal.steampipe_scan_metadata 
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| id  | table            | cache_hit | rows_fetched | hydrate_calls | start_time                | duration | columns                              | limit | quals                                 |
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| 191 | aws_ec2_instance | false     | 1            | 0             | 2024-04-04T09:29:52+01:00 | 439      | ["instance_id","vpc_id","subnet_id"] | 0     | [                                     |
|     |                  |           |              |               |                           |          |                                      |       |  {                                    |
|     |                  |           |              |               |                           |          |                                      |       |   "column": "subnet_id",              |
|     |                  |           |              |               |                           |          |                                      |       |   "operator": "=",                    |
|     |                  |           |              |               |                           |          |                                      |       |   "value": "subnet-0a2c499fc37a6c1fe" |
|     |                  |           |              |               |                           |          |                                      |       |  }                                    |
|     |                  |           |              |               |                           |          |                                      |       | ]                                     |
|     |                  |           |              |               |                           |          |                                      |       |                                       |
| 192 | aws_ec2_instance | false     | 0            | 0             | 2024-04-04T09:29:53+01:00 | 433      | ["instance_id","vpc_id","subnet_id"] | 0     | [                                     |
|     |                  |           |              |               |                           |          |                                      |       |  {                                    |

The CLI keeps track of the last row that was read from this table and reads all rows since the previously read row. (i.e. it reads the scan metadata for all scans in the current query). However at present it only takes a single row from the result and uses that to display the timing.

So, in the case of a join or other multi-scan the timing being displayed is the result of just one of the scans, and it therefore incorrect.

It is simple to update the code to include all scan metadata, but not so simple knowing how to combine this data to give useful output. (see next comment)

from steampipe.

kaidaguerre avatar kaidaguerre commented on June 20, 2024

The currently logic for building the timing output from the scan metadata is as follows:

The displayed timing result is stored in a TimingMetadata struct:

type TimingMetadata struct {
	RowsFetched       int64
	CachedRowsFetched int64
	HydrateCalls      int64
}

This is populated from the steampipe_scan_metadata data as follows:

	timingResult.Metadata = &queryresult.TimingMetadata{}
	timingResult.Metadata.HydrateCalls += scanRow.HydrateCalls
	if scanRow.CacheHit {
		timingResult.Metadata.CachedRowsFetched += scanRow.RowsFetched
	} else {
		timingResult.Metadata.RowsFetched += scanRow.RowsFetched
	}

IOW, either CachedRowsFetched or RowsFetched is set, depending on whether the scan had a cache hit.

With multiple scans there are a couple of problems with this:

1) Incorrect row count

If we sum the RowsFetched for each scan, it is likely we will have a number greater than the rows returned by the query (as there may be joining/filtering)

We could avoid this by setting RowsFetched to simply be the actual rows returned by the query.

This causes issues with the following point:

2) How to handle mix of cache hits and cache misses

If some scans have a cache hit and some do not, do we count the result rows as cached or not?

from steampipe.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.