Comments (3)
I can see 3 options here:
Option 1
Simply sum the rows returned by each scan. This may result in a higher row count that the actual result rows, but would allow granular reporting on cached and uncached rows
Option 2
Report the muber of rows actually returned, and only mark the rows as cached if all scans had a cache hit
Option 3
Expand the timing display to include more detailed information about each scan - perhaps have a verbose
mode for the . timing
option
Time: 158ms. Rows fetched: 36. Hydrate calls: 10.
Scans:
1) Table: aws_ec2_instance. Time: 78ms. Rows fetched: 36. Hydrate calls: 10. Quals: subnet_id = "subnet-a2c499fc37a6c1fe"
2) Table: aws_vpc_subnet. Time: 90ms. Rows fetched: 56. Hydrate calls: 0.
(numbers made up)
I quite like option 3
from steampipe.
@cbruno10 @e-gineer I've found the issue - the CLI code to read the timing data does not take into account multiple scans.
tl;dr - I've found the bug but need some guidance on the correct way to display timing info for multi-scan queries. Please see next comment.
Details
For each scan that is executed, the FDW stores ScanMetadata
in the Hub
.
type ScanMetadata struct {
Id int
Table string
CacheHit bool
RowsFetched int64
HydrateCalls int64
Columns []string
Quals map[string]*proto.Quals
Limit int64
StartTime time.Time
Duration time.Duration
}
This is then used to populate steampipe_internal.steampipe_scan_metadata
foreign table.:
> select * from steampipe_internal.steampipe_scan_metadata
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| id | table | cache_hit | rows_fetched | hydrate_calls | start_time | duration | columns | limit | quals |
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| 191 | aws_ec2_instance | false | 1 | 0 | 2024-04-04T09:29:52+01:00 | 439 | ["instance_id","vpc_id","subnet_id"] | 0 | [ |
| | | | | | | | | | { |
| | | | | | | | | | "column": "subnet_id", |
| | | | | | | | | | "operator": "=", |
| | | | | | | | | | "value": "subnet-0a2c499fc37a6c1fe" |
| | | | | | | | | | } |
| | | | | | | | | | ] |
| | | | | | | | | | |
| 192 | aws_ec2_instance | false | 0 | 0 | 2024-04-04T09:29:53+01:00 | 433 | ["instance_id","vpc_id","subnet_id"] | 0 | [ |
| | | | | | | | | | { |
The CLI keeps track of the last row that was read from this table and reads all rows since the previously read row. (i.e. it reads the scan metadata for all scans in the current query). However at present it only takes a single row from the result and uses that to display the timing.
So, in the case of a join or other multi-scan the timing being displayed is the result of just one of the scans, and it therefore incorrect.
It is simple to update the code to include all scan metadata, but not so simple knowing how to combine this data to give useful output. (see next comment)
from steampipe.
The currently logic for building the timing output from the scan metadata is as follows:
The displayed timing result is stored in a TimingMetadata
struct:
type TimingMetadata struct {
RowsFetched int64
CachedRowsFetched int64
HydrateCalls int64
}
This is populated from the steampipe_scan_metadata
data as follows:
timingResult.Metadata = &queryresult.TimingMetadata{}
timingResult.Metadata.HydrateCalls += scanRow.HydrateCalls
if scanRow.CacheHit {
timingResult.Metadata.CachedRowsFetched += scanRow.RowsFetched
} else {
timingResult.Metadata.RowsFetched += scanRow.RowsFetched
}
IOW, either CachedRowsFetched
or RowsFetched
is set, depending on whether the scan had a cache hit.
With multiple scans there are a couple of problems with this:
1) Incorrect row count
If we sum the RowsFetched
for each scan, it is likely we will have a number greater than the rows returned by the query (as there may be joining/filtering)
We could avoid this by setting RowsFetched to simply be the actual rows returned by the query.
This causes issues with the following point:
2) How to handle mix of cache hits and cache misses
If some scans have a cache hit and some do not, do we count the result rows as cached or not?
from steampipe.
Related Issues (20)
- [ARM64 OCI CloudShell] Initializing database... FAILED! HOT 2
- Uninstalling a plugin don't remove the rate-limiter config from the table `steampipe_plugin_limiter`
- steampipe postgres size management HOT 5
- Ollama Steampipe Plugin
- `Hydrates` count in timing verbose mode should use integer formatting(e.g. `119,138`) HOT 1
- Bad build of Linux Arm FDW version 1.11.0
- Raspberry PI: initializing database failed HOT 3
- Steampipe returns 0 exit code even if `failed to export snapshot`(query)
- Tables from Azure and AWS returns this invalid memory address or nil pointer dereference HOT 2
- Query command should support legacy 'true' and 'false' for `--timing` flag
- Ability to get ALL inventory (exported to csv/whatever), so I can diff it between runs HOT 1
- Unusual query timing results with aggregator connection with caching enabled when some connections have errors
- Query time decreasing on subsequent queries with caching disabled HOT 1
- Docker Image for latest steampipe HOT 4
- setting autocomplete in workspaces files has no effect
- JSON metadata output should not contain scan info when `--timing=on`
- `sps` output not working
- fatal error: concurrent map read and map write HOT 1
- AWS authentication fails on a large number of accounts when using credential_process HOT 1
- Stemapipe service restart stuck after first attempt
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from steampipe.