
Comments (12)

peresureda commented on July 21, 2024

Thank you for the update. Now I think everything is clear. Anyway, my recommendation is to change the data type (to jsonb) instead of performing the conversion when making the query. Performance tests have shown CPU consumption reduced by almost 50% with the data-type change compared to how it was implemented before.
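
A minimal sketch of the difference, assuming the query shape shown later in this thread (the UUID value is a placeholder):

```
-- Query-time cast: a json column is cast to jsonb for every row,
-- so the text is reparsed on each execution.
SELECT * FROM task
WHERE extra_attrs::jsonb @> '{"report_uuids": ["<some-report-uuid>"]}'::jsonb;

-- Native jsonb column: the cast disappears and a GIN index on the
-- column becomes usable (see the index discussion further down).
SELECT * FROM task
WHERE extra_attrs @> '{"report_uuids": ["<some-report-uuid>"]}'::jsonb;
```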


flbla commented on July 21, 2024

hi @peresureda, we tried it on our staging env; indeed, with the conversion to jsonb and the addition of the index it's much faster, thank you

@wy65701436 @Vad1mo do you think this change could be included in a 2.10 version? Thank you


Vad1mo commented on July 21, 2024

Thank you for the improvement, this makes sense.

One question: does `ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb` convert existing data?


peresureda commented on July 21, 2024

Hi @Vad1mo ,

Important points:

  1. JSONB availability: PostgreSQL 9.4 (2014) added support for the JSONB data type.
  2. PostgreSQL offers two types for storing JSON data: json and jsonb. The json and jsonb data types accept almost identical sets of values as input. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

In summary: you can migrate a JSON column to JSONB simply by altering the table. However, it is not possible to revert back, since JSONB does not store the JSON in its raw format. If you want to be able to revert, you should first make a copy of the table:
`create table task_aux as select * from task;`
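
A minimal sketch of the migration path described above; the cast is written out explicitly in a USING clause here, and the revert shown in the comment is only at the type level, not a byte-for-byte restore:

```
-- Convert in place; existing rows are rewritten into the binary jsonb format.
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb USING extra_attrs::jsonb;

-- Reverting the type is possible, but it yields re-serialized JSON rather than
-- the original raw text (key order, whitespace and duplicate keys are lost),
-- hence the copy of the table taken beforehand:
-- ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE json USING extra_attrs::json;
```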


flbla commented on July 21, 2024

Hi,
we probably have the same issue.
Since we updated Harbor from 2.10.0 to 2.10.2, we see a lot of slowness when we access the artifacts page in the UI, and the same with the artifacts API.
We have seen huge CPU usage on our Postgres database since the update.
It looks like it's due to this change: https://github.com/goharbor/harbor/pull/20169/files#diff-529262240b8fd8c4239ff9411c3c13ae66343e674ff7ce1fa4978f692079089eR118

EXPLAIN output for one of the `select * from task` queries:
[screenshot]

Difference in CPU usage between today (yellow line) and the previous week (gray dotted line):
[screenshot]


AlenversFr commented on July 21, 2024

> hi @peresureda, we tried it on our staging env; indeed, with the conversion to jsonb and the addition of the index it's much faster, thank you
>
> @wy65701436 @Vad1mo do you think this change could be included in a 2.10 version? Thank you

Query analysis:

Before applying the fix:

```
                                           Table "public.task"
     Column      |            Type             | Collation | Nullable |             Default              
-----------------+-----------------------------+-----------+----------+----------------------------------
 id              | integer                     |           | not null | nextval('task_id_seq'::regclass)
 execution_id    | integer                     |           | not null | 
 job_id          | character varying(64)       |           |          | 
 status          | character varying(16)       |           | not null | 
 status_code     | integer                     |           | not null | 
 status_revision | integer                     |           |          | 
 status_message  | text                        |           |          | 
 run_count       | integer                     |           |          | 
 extra_attrs     | jsonb                       |           |          | 
 creation_time   | timestamp without time zone |           |          | now()
 start_time      | timestamp without time zone |           |          | 
 update_time     | timestamp without time zone |           |          | 
 end_time        | timestamp without time zone |           |          | 
 vendor_type     | character varying(64)       |           | not null | 
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
    "idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))```


When you run a query like the one just below, the result of the EXPLAIN command indicates a seq scan. It means the index "idx_task_extra_attrs_report_uuids" defined on the task table is not used.

```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );```

If you just apply:
```ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb```
You might think the index will now be used, but that's not the case: it is still a seq scan.
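
The reason is that the planner only considers an expression index when the predicate is written against the same expression. A sketch of the two predicate shapes (placeholder UUID):

```
-- This predicate matches the indexed expression, so in principle it could use
-- "idx_task_extra_attrs_report_uuids" (with extra_attrs stored as jsonb):
EXPLAIN SELECT * FROM task
WHERE extra_attrs -> 'report_uuids' @> '["<some-report-uuid>"]'::jsonb;

-- Harbor's query applies @> to the whole column, so it needs a GIN index on
-- the column itself, which is what the index added below provides:
EXPLAIN SELECT * FROM task
WHERE extra_attrs @> '{"report_uuids": ["<some-report-uuid>"]}'::jsonb;
```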

After applying the new index:
```CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);```

Here is the result of the explain command:

```
explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on task  (cost=8.01..10.02 rows=1 width=253)
   Recheck Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
   ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..8.01 rows=1 width=0)
         Index Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
(4 rows)
```
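
As a side note on the operator class (not part of the proposed fix): jsonb_path_ops gives a smaller, faster index but only supports containment-style operators such as @>, while the default jsonb operator class also covers the key-existence operators:

```
-- Alternative with the default operator class: a larger index that also supports
-- the ?, ?| and ?& key-existence operators in addition to @> (index name is illustrative).
CREATE INDEX indx_task_jsonb_default ON task USING GIN (extra_attrs);
```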

In our staging environment, here is the performance impact of the modification:
Number of rows in the task table = 13k rows
Average of the painful query before = 175 ms/query
Average of the painful query after (ALTER +INDEX) = 0.25 ms/query

When the task table is more heavily loaded (more than 300k rows, like our production), the average query time is around 1 s.
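
One way to measure the per-query latency quoted above (placeholder UUID; figures depend on table size and hardware):

```
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM task
WHERE extra_attrs @> '{"report_uuids": ["<some-report-uuid>"]}'::jsonb;
```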


Modifying the type of the column means we have to check all the internal services  of Harbor that might use this table.
It's not easy to evaluate the impact.

Here is a query that helps to validate the correct interactions between the task and execution tables:
```
WITH task_error as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Error' group by b.vendor_type, b.execution_id
), task_success as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Success' group by b.vendor_type, b.execution_id
), task_stopped as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Stopped' group by b.vendor_type, b.execution_id
)
 select a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status,
 count(distinct a.id) as nb_execution,
 SUM(task_error.cnt) as task_error_cnt,
 SUM(task_success.cnt) as task_success_cnt,
 SUM(task_stopped.cnt) as task_stopped_cnt
 from execution as a
 left Join task_error on task_error.execution_id=a.id
 left Join task_success on task_success.execution_id=a.id
 left Join task_stopped on task_stopped.execution_id=a.id
 group by a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status
 order by to_char(a.start_time,'YYYY-MM-DD') desc limit 20;
```
 
In order to validate the performance fix, this query should indicate no regression for any of the executions and task dependencies.

As of now, in our staging environment the fix seems to have no negative impact, but it needs to be confirmed by the maintainers.

```
   vendor_type     |  to_char   | status  | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt 
--------------------+------------+---------+--------------+----------------+------------------+------------------
 EXECUTION_SWEEP    | 2024-05-31 | Success |            3 |                |                3 |                 
 GARBAGE_COLLECTION | 2024-05-31 | Success |            1 |                |                1 |                 
 IMAGE_SCAN         | 2024-05-31 | Running |            7 |                |                  |                 
 REPLICATION        | 2024-05-31 | Success |           10 |                |               16 |                 
 SCAN_ALL           | 2024-05-31 | Running |            1 |                |             2082 |                 
 SCHEDULER          | 2024-05-31 | Running |            1 |                |                  |  
```
 
An easy way to reproduce the performance issue is to create a replication rule that scans all the Harbor projects.
Example of a configuration with a single check on a label:
 
![image](https://github.com/goharbor/harbor/assets/57966036/ee5bec28-e14a-4202-bdd4-bd3fd0cb4a33)


The fix seems legit and operational; we just need confirmation from the maintainers and to wait for a new release.

