
Comments (12)

peresureda commented on July 21, 2024

Thank you for the update. Now I think everything is clear. Anyway, my recommendation is to change the data type (to jsonb) instead of performing the conversion when making the query. Performance tests have shown CPU consumption reduced by almost 50% with the data-type change compared to how it was implemented before.
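
A minimal sketch of the difference, assuming the query shape shown later in this thread (the UUID value is a placeholder):

```
-- Query-time cast: a json column is cast to jsonb for every row,
-- so the text is reparsed on each execution.
SELECT * FROM task
WHERE extra_attrs::jsonb @> '{"report_uuids": ["<some-report-uuid>"]}'::jsonb;

-- Native jsonb column: the cast disappears and a GIN index on the
-- column becomes usable (see the index discussion further down).
SELECT * FROM task
WHERE extra_attrs @> '{"report_uuids": ["<some-report-uuid>"]}'::jsonb;
```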


flbla commented on July 21, 2024

hi @peresureda, we tried it on our staging env; indeed, with the conversion to jsonb and the addition of the index it's much faster, thank you

@wy65701436 @Vad1mo do you think this change could be included in a 2.10 version? Thank you


Vad1mo commented on July 21, 2024

Thank you for the improvement, this makes sense.

One question: does `ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb` convert existing data?


peresureda commented on July 21, 2024

Hi @Vad1mo ,

Important points:

  1. JSONB availability: PostgreSQL 9.4 (2014) added support for the JSONB data type.
  2. PostgreSQL offers two types for storing JSON data: json and jsonb. The json and jsonb data types accept almost identical sets of values as input. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

In summary: you can migrate a JSON column to JSONB simply by altering the table. However, it is not possible to revert back, since JSONB does not store the JSON in its raw format. If you want to be able to revert, you should first make a copy of the table:
`create table task_aux as select * from task;`
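
A minimal sketch of the migration path described above; the cast is written out explicitly in a USING clause here, and the revert shown in the comment is only at the type level, not a byte-for-byte restore:

```
-- Convert in place; existing rows are rewritten into the binary jsonb format.
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb USING extra_attrs::jsonb;

-- Reverting the type is possible, but it yields re-serialized JSON rather than
-- the original raw text (key order, whitespace and duplicate keys are lost),
-- hence the copy of the table taken beforehand:
-- ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE json USING extra_attrs::json;
```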


flbla commented on July 21, 2024

Hi,
we probably have the same issue.
Since we updated Harbor from 2.10.0 to 2.10.2, we see a lot of slowness when we access the artifacts page in the UI, and the same with the artifacts API.
We have seen huge CPU usage on our Postgres database since the update.
It looks like it's due to this change: https://github.com/goharbor/harbor/pull/20169/files#diff-529262240b8fd8c4239ff9411c3c13ae66343e674ff7ce1fa4978f692079089eR118

EXPLAIN output for one of the `select * from task` queries:
[screenshot]

Difference in CPU usage between today (yellow line) and the previous week (gray dotted line):
[screenshot]


AlenversFr commented on July 21, 2024

> hi @peresureda, we tried it on our staging env; indeed, with the conversion to jsonb and the addition of the index it's much faster, thank you
>
> @wy65701436 @Vad1mo do you think this change could be included in a 2.10 version? Thank you

Query analysis:

Before applying the fix:

```
                                           Table "public.task"
     Column      |            Type             | Collation | Nullable |             Default              
-----------------+-----------------------------+-----------+----------+----------------------------------
 id              | integer                     |           | not null | nextval('task_id_seq'::regclass)
 execution_id    | integer                     |           | not null | 
 job_id          | character varying(64)       |           |          | 
 status          | character varying(16)       |           | not null | 
 status_code     | integer                     |           | not null | 
 status_revision | integer                     |           |          | 
 status_message  | text                        |           |          | 
 run_count       | integer                     |           |          | 
 extra_attrs     | jsonb                       |           |          | 
 creation_time   | timestamp without time zone |           |          | now()
 start_time      | timestamp without time zone |           |          | 
 update_time     | timestamp without time zone |           |          | 
 end_time        | timestamp without time zone |           |          | 
 vendor_type     | character varying(64)       |           | not null | 
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
    "idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))```


When you run a query like the one just below, the result of the EXPLAIN command indicates a seq scan. It means the index "idx_task_extra_attrs_report_uuids" defined on the task table is not used.

```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );```

If you just apply:
```ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb```
You might think the index will now be used, but that's not the case: it is still a seq scan.
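
The reason is that the planner only considers an expression index when the predicate is written against the same expression. A sketch of the two predicate shapes (placeholder UUID):

```
-- This predicate matches the indexed expression, so in principle it could use
-- "idx_task_extra_attrs_report_uuids" (with extra_attrs stored as jsonb):
EXPLAIN SELECT * FROM task
WHERE extra_attrs -> 'report_uuids' @> '["<some-report-uuid>"]'::jsonb;

-- Harbor's query applies @> to the whole column, so it needs a GIN index on
-- the column itself, which is what the index added below provides:
EXPLAIN SELECT * FROM task
WHERE extra_attrs @> '{"report_uuids": ["<some-report-uuid>"]}'::jsonb;
```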

After applying the new index:
```CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);```

Here is the result of the explain command:

```
explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on task  (cost=8.01..10.02 rows=1 width=253)
   Recheck Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
   ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..8.01 rows=1 width=0)
         Index Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
(4 rows)
```
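
As a side note on the operator class (not part of the proposed fix): jsonb_path_ops gives a smaller, faster index but only supports containment-style operators such as @>, while the default jsonb operator class also covers the key-existence operators:

```
-- Alternative with the default operator class: a larger index that also supports
-- the ?, ?| and ?& key-existence operators in addition to @> (index name is illustrative).
CREATE INDEX indx_task_jsonb_default ON task USING GIN (extra_attrs);
```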

In our staging environment, here is the performance impact of the modification:
Number of rows in the task table = 13k rows
Average of the painful query before = 175 ms/query
Average of the painful query after (ALTER +INDEX) = 0.25 ms/query

When the task table is more heavily loaded (more than 300k rows, like our production), the average query time is around 1 s.
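
One way to measure the per-query latency quoted above (placeholder UUID; figures depend on table size and hardware):

```
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM task
WHERE extra_attrs @> '{"report_uuids": ["<some-report-uuid>"]}'::jsonb;
```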


Modifying the type of the column means we have to check all the internal services  of Harbor that might use this table.
It's not easy to evaluate the impact.

Here is a query that helps to validate the correct interactions between the task and execution tables:
```
WITH task_error as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Error' group by b.vendor_type, b.execution_id
), task_success as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Success' group by b.vendor_type, b.execution_id
), task_stopped as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Stopped' group by b.vendor_type, b.execution_id
)
 select a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status,
 count(distinct a.id) as nb_execution,
 SUM(task_error.cnt) as task_error_cnt,
 SUM(task_success.cnt) as task_success_cnt,
 SUM(task_stopped.cnt) as task_stopped_cnt
 from execution as a
 left Join task_error on task_error.execution_id=a.id
 left Join task_success on task_success.execution_id=a.id
 left Join task_stopped on task_stopped.execution_id=a.id
 group by a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status
 order by to_char(a.start_time,'YYYY-MM-DD') desc limit 20;
```
 
In order to validate the performance fix, this query should indicate no regression for any of the executions and task dependencies.

As of now, in our staging environment the fix seems to have no negative impact, but it needs to be confirmed by the maintainers.

```
   vendor_type     |  to_char   | status  | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt 
--------------------+------------+---------+--------------+----------------+------------------+------------------
 EXECUTION_SWEEP    | 2024-05-31 | Success |            3 |                |                3 |                 
 GARBAGE_COLLECTION | 2024-05-31 | Success |            1 |                |                1 |                 
 IMAGE_SCAN         | 2024-05-31 | Running |            7 |                |                  |                 
 REPLICATION        | 2024-05-31 | Success |           10 |                |               16 |                 
 SCAN_ALL           | 2024-05-31 | Running |            1 |                |             2082 |                 
 SCHEDULER          | 2024-05-31 | Running |            1 |                |                  |  
```
 
An easy way to reproduce the performance issue is to create a replication rule that scans all the Harbor projects.
Example of a configuration with a single check on a label:
 
![image](https://github.com/goharbor/harbor/assets/57966036/ee5bec28-e14a-4202-bdd4-bd3fd0cb4a33)


The fix seems legit and operational; we just need confirmation from the maintainers and to wait for a new release.

