Comments (12)
Thank you for the update; now I think everything is clear. Still, my recommendation is to change the data type of the column to jsonb instead of performing the conversion inside the query. In our performance tests, changing the data type reduced CPU consumption by almost 50% compared to the previous implementation.
from harbor.
hi @peresureda, we tried it on our staging env; with the conversion to jsonb and the added index it is indeed much faster, thank you.
@wy65701436 @Vad1mo do you think this change could be included in a 2.10 release? thank you
Thank you for the improvement, this makes sense.
one question: does ```ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb``` convert existing data?
Hi @Vad1mo ,
Important points:
- JSONB availability: PostgreSQL 9.4 (2014) added support for the jsonb data type.
- PostgreSQL offers two types for storing JSON data: json and jsonb. Both accept almost identical sets of input values. The json type stores an exact copy of the input text, which processing functions must reparse on each execution, while jsonb is stored in a decomposed binary format that is slightly slower to write because of the conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.
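As a rough analogy for the reparse cost described above (plain Python, illustrative only, not PostgreSQL internals; the document value is the example from this thread):

```python
import json
import timeit

# The kind of document Harbor stores in task.extra_attrs (value from this thread).
doc_text = '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'
uuid = "6526b828-f6ef-4864-a27e-9b0dc38f1502"

def query_json_style():
    # json column: the raw text must be reparsed on every evaluation
    return uuid in json.loads(doc_text)["report_uuids"]

doc_parsed = json.loads(doc_text)  # jsonb column: parsed once, at write time

def query_jsonb_style():
    # jsonb column: the decomposed form is queried directly, no reparse
    return uuid in doc_parsed["report_uuids"]

t_json = timeit.timeit(query_json_style, number=100_000)
t_jsonb = timeit.timeit(query_jsonb_style, number=100_000)
print(query_json_style(), query_jsonb_style())  # both find the uuid
print(t_json > t_jsonb)  # reparsing on every access is the slower path
```

The same trade-off applies in the database: jsonb pays a one-time conversion cost on write and avoids the per-query parse.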
In summary: you can migrate a json column to jsonb simply by altering the table. However, it is not possible to revert, since jsonb does not store the JSON in its raw format. If you want to be able to revert, first make a copy of the table:
```create table task_aux as select * from task;```
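Putting the steps from this thread together, a forward migration could be sketched as follows (the backup table name is arbitrary; the ALTER and the index are the ones discussed elsewhere in this thread):

```sql
-- 1. Keep a raw copy so the change can be rolled back (jsonb cannot be
--    converted back to the original raw json text).
CREATE TABLE task_aux AS SELECT * FROM task;

-- 2. Convert the column in place; existing rows are rewritten as jsonb.
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;

-- 3. Add a GIN index so @> containment queries can use an index scan.
CREATE INDEX indx_task_jsonb ON task USING GIN (extra_attrs jsonb_path_ops);
```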
Hi,
we probably have the same issue.
Since we updated Harbor from 2.10.0 to 2.10.2, we have a lot of slowness in Harbor when we access the artifacts page in the UI, and the same with the artifacts API.
We have seen huge CPU usage on our PostgreSQL database since the update.
It looks like it is due to this change: https://github.com/goharbor/harbor/pull/20169/files#diff-529262240b8fd8c4239ff9411c3c13ae66343e674ff7ce1fa4978f692079089eR118
EXPLAIN output for one of the `select * from task` queries:
Difference in CPU usage between today (yellow line) and the previous week (gray dotted line):
Query analysis:
Before applying the fix:
```
                                       Table "public.task"
     Column      |            Type             | Collation | Nullable |             Default
-----------------+-----------------------------+-----------+----------+----------------------------------
 id              | integer                     |           | not null | nextval('task_id_seq'::regclass)
 execution_id    | integer                     |           | not null |
 job_id          | character varying(64)       |           |          |
 status          | character varying(16)       |           | not null |
 status_code     | integer                     |           | not null |
 status_revision | integer                     |           |          |
 status_message  | text                        |           |          |
 run_count       | integer                     |           |          |
 extra_attrs     | jsonb                       |           |          |
 creation_time   | timestamp without time zone |           |          | now()
 start_time      | timestamp without time zone |           |          |
 update_time     | timestamp without time zone |           |          |
 end_time        | timestamp without time zone |           |          |
 vendor_type     | character varying(64)       |           | not null |
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
    "idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))
```
When you run a query like the one below, the EXPLAIN output indicates a sequential scan. That means the index "idx_task_extra_attrs_report_uuids" defined on the task table is not used.
```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );```
If you just apply :
```ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb```
You might think the index would then be used, but it is not: the plan is still a sequential scan.
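A side note (my observation, not part of the original fix): the pre-existing index is built on the expression `extra_attrs -> 'report_uuids'`, so the planner can only use it when that exact expression appears on the left-hand side of the operator, e.g.:

```sql
-- Sketch, assuming extra_attrs is already jsonb: this query shape can match the
-- expression index, because the indexed expression appears verbatim.
EXPLAIN SELECT * FROM task
WHERE extra_attrs -> 'report_uuids' @> '["6526b828-f6ef-4864-a27e-9b0dc38f1502"]'::jsonb;
```

Querying the whole column with `extra_attrs @> ...` instead requires a GIN index on the column itself, which is what the new index provides.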
After applying the new index :
```CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);```
Here is the result of the explain command :
```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on task (cost=8.01..10.02 rows=1 width=253)
Recheck Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
-> Bitmap Index Scan on indx_task_jsonb (cost=0.00..8.01 rows=1 width=0)
Index Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
(4 rows)```
In our staging environment, here is the performance impact of the modification:
- Number of rows in the task table: 13k
- Average time of the painful query before: 175 ms/query
- Average time after (ALTER + INDEX): 0.25 ms/query
When the task table is heavily loaded (more than 300k rows, like our production), the average query time is around 1 s.
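To reproduce measurements like these, one approach (a sketch using standard PostgreSQL tooling; the query is the one from this thread) is to let EXPLAIN ANALYZE execute the statement and report actual timings:

```sql
-- ANALYZE runs the query for real and prints actual row counts and timings;
-- BUFFERS additionally shows how many pages were read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM task
WHERE extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb;
```

In psql, `\timing on` also prints the client-side elapsed time for each statement.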
Modifying the type of the column means we have to check all the internal Harbor services that might use this table; the impact is not easy to evaluate.
Here is a query that helps validate the correct interactions between the task and execution tables:
```WITH task_error AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Error' GROUP BY b.vendor_type, b.execution_id
), task_success AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Success' GROUP BY b.vendor_type, b.execution_id
), task_stopped AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Stopped' GROUP BY b.vendor_type, b.execution_id
)
SELECT a.vendor_type, TO_CHAR(a.start_time, 'YYYY-MM-DD'), a.status,
       COUNT(DISTINCT a.id)  AS nb_execution,
       SUM(task_error.cnt)   AS task_error_cnt,
       SUM(task_success.cnt) AS task_success_cnt,
       SUM(task_stopped.cnt) AS task_stopped_cnt
FROM execution AS a
LEFT JOIN task_error   ON task_error.execution_id = a.id
LEFT JOIN task_success ON task_success.execution_id = a.id
LEFT JOIN task_stopped ON task_stopped.execution_id = a.id
GROUP BY a.vendor_type, TO_CHAR(a.start_time, 'YYYY-MM-DD'), a.status
ORDER BY TO_CHAR(a.start_time, 'YYYY-MM-DD') DESC LIMIT 20;```
To validate the performance fix, this query should indicate no regression for all executions and their task dependencies.
As of now, in our staging environment the fix seems to have no negative impact, but it needs to be confirmed by the maintainers.
``` vendor_type | to_char | status | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt
--------------------+------------+---------+--------------+----------------+------------------+------------------
EXECUTION_SWEEP | 2024-05-31 | Success | 3 | | 3 |
GARBAGE_COLLECTION | 2024-05-31 | Success | 1 | | 1 |
IMAGE_SCAN | 2024-05-31 | Running | 7 | | |
REPLICATION | 2024-05-31 | Success | 10 | | 16 |
SCAN_ALL | 2024-05-31 | Running | 1 | | 2082 |
SCHEDULER | 2024-05-31 | Running | 1 | | | ```
An easy way to reproduce the performance issue is to create a replication rule that scans all the Harbor projects.
Example of a configuration with a single check on a label:
![image](https://github.com/goharbor/harbor/assets/57966036/ee5bec28-e14a-4202-bdd4-bd3fd0cb4a33)
The fix seems legitimate and operational; we just need confirmation from the maintainers and to wait for a new release.