My Environment ArangoDB Version : 3.1

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Optimize statistic large data with COLLECT INTO too slow,about arangodb/arangodb

Comments (25)

jsteemann commented on June 1, 2024 2

@son2408 : according to the query profile you supplied in this ticket, the main driver of the query's execution time is the EnumerateViewNode and not the COLLECT. The EnumerateViewNode takes 12 out of the 18 seconds total runtime.
So this should be the key part to optimize.

I tried running the query locally with the provided dataset, and I get a similar execution plan, with higher execution times (potentially due to my inferior hardware):

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000   5000000          0      34.32340     - FOR i IN search_hosotiepnhan_2 SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.56718       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       2.18918       - LET #9 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       4.63421       - LET #13 = { "tinhId" : i.`tinhId`, "daTra" : i.`daTra` }   /* simple expression */
  7   CollectNode             1        37          0       1.83643       - COLLECT tinhId = #9 AGGREGATE count = LENGTH() INTO group = #13   /* hash */
 17   SortNode                1        37          0       0.04750       - SORT tinhId ASC   /* sorting strategy: standard */
 18   SubqueryStartNode      37        74          0       0.05996       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5019   5000037          0       1.03633         - FOR i IN group   /* list iteration */
 10   CalculationNode      5019   5000037          0       1.45216           - LET #15 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             37   4228828     771209       0.39480           - FILTER #15
 12   CollectNode             1        74          0       0.00047           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 19   SubqueryEndNode         1        37          0       0.00015           - RETURN  countDaTra ) /* subquery end */
 15   CalculationNode         1        37          0       0.00004       - LET #17 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra }   /* simple expression */
 16   ReturnNode              1        37          0       0.00002       - RETURN #17

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2
  3   handle-arangosearch-views
  4   remove-unnecessary-calculations-2
  5   move-calculations-down
  6   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0     771209      329908224        46.54266

It seems that the problem is that the query produces 5M matches in the EnumerateViewNode, but in order to build the resulting documents for them, it will need to do 5M additional document lookups.
This could be avoided by adding the required attributes (tinhId and daTra for this query) to the view definition as "stored values".

I just tried this, and created another view (search_hosotiepnhan_3) with that definition:

db._createView("search_hosotiepnhan_3", "arangosearch", {"storedValues":[["tinhId"],["daTra"]], links: {vertex_hosotiepnhan_2:{analyzers:["identity"], includeAllFields: true, trackListPositions: false}}});

Using that adjusted view in the query, while keeping everything else the same, dramatically speeds up the query:

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000   5000000          0       1.19117     - FOR i IN search_hosotiepnhan_3 SEARCH (i.`tinhId` > 0) LET #21 = i.`daTra` LET #9 = i.`tinhId`   /* view query without materialization */
  3   LimitNode            5000   5000000          0       0.43068       - LIMIT 0, 5000000
  6   CalculationNode      5000   5000000          0       1.95566       - LET #13 = { "tinhId" : #9, "daTra" : #21 }   /* simple expression */
  7   CollectNode             1        37          0       1.23645       - COLLECT tinhId = #9 AGGREGATE count = LENGTH() INTO group = #13   /* hash */
 17   SortNode                1        37          0       0.04437       - SORT tinhId ASC   /* sorting strategy: standard */
 18   SubqueryStartNode      37        74          0       0.05579       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5018   5000037          0       0.97822         - FOR i IN group   /* list iteration */
 10   CalculationNode      5018   5000037          0       1.37251           - LET #15 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             37   4213311     786726       0.37546           - FILTER #15
 12   CollectNode             1        74          0       0.00039           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 19   SubqueryEndNode         1        37          0       0.00014           - RETURN  countDaTra ) /* subquery end */
 15   CalculationNode         1        37          0       0.00003       - LET #17 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra }   /* simple expression */
 16   ReturnNode              1        37          0       0.00002       - RETURN #17

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2
  3   handle-arangosearch-views
  4   remove-unnecessary-calculations-2
  5   move-calculations-down
  6   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0     786726      329973760         7.64167

I also tried running the query (with the adjusted view) on our current devel, which will become ArangoDB 3.12 at some point in Q1 next year.
The query performance there will be even better:

Execution plan:
 Id   NodeType            Calls    Par     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1      -         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000      -   5000000          0       1.04602     - FOR i IN search_hosotiepnhan_3 SEARCH (i.`tinhId` > 0) LET #15 = i.`daTra` LET #9 = i.`tinhId`   /* view query without materialization */
  3   LimitNode            5000      -   5000000          0       0.07833       - LIMIT 0, 5000000
  6   CalculationNode      5000   4999   5000000          0       1.80908       - LET #11 = { "tinhId" : #9, "daTra" : #15 }   /* simple expression */
  7   CollectNode             1      0        37          0       1.16470       - COLLECT tinhId = #9 AGGREGATE count = LENGTH() INTO group = #11   /* hash */
 17   SortNode                1      0        37          0       0.02060       - SORT tinhId ASC   /* sorting strategy: standard */
 18   SubqueryStartNode      37      -        74          0       0.02330       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5018      -   5000037          0       0.77800         - FOR i IN group   /* list iteration */
 10   CalculationNode      5018      -   5000037          0       1.07768           - LET #12 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             37      -   4213311     786726       0.31608           - FILTER #12
 12   CollectNode             1      -        74          0       0.00023           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 19   SubqueryEndNode         1      -        37          0       0.00012           - RETURN  countDaTra ) /* subquery end */
 15   CalculationNode         1      0        37          0       0.00003       - LET #13 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra }   /* simple expression */
 16   ReturnNode              1      -        37          0       0.00001       - RETURN #13

Indexes used:
 none

Optimization rules applied:
 Id   Rule Name                                 Id   Rule Name                                 Id   Rule Name                        
  1   move-calculations-up                       4   remove-unnecessary-calculations-2          7   splice-subqueries                
  2   move-calculations-up-2                     5   move-calculations-down           
  3   handle-arangosearch-views                  6   async-prefetch                   

Query Statistics:
 Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
           0               0                 0              0         5000000                  0 / 0        786726         330006528            4.10029

Query Profile:
 Query Stage                Duration [s]         Query Stage                Duration [s]         Query Stage                Duration [s]
 initializing                    0.00001         loading collections             0.00002         instantiating executors         0.00007
 parsing                         0.00009         instantiating plan              0.00004         executing                       4.09961
 optimizing ast                  0.00001         optimizing plan                 0.00042         finalizing                      0.00004

from arangodb.

jsteemann commented on June 1, 2024 1

One idea is to loop over the results only once instead of creating an arbitrary number of subqueries:

FOR i IN search_hosotiepnhan_3
    SEARCH i.tinhId > 0
    limit 0,5000000
    COLLECT tinhId = i.tinhId
    AGGREGATE count = length(1)
    INTO group = {
      daTra: i.daTra, 
      ngayTra: i.ngayTra_ts, 
      choBoSung: i.choBoSung, 
      thoiDiemChoBoSung: i.thoiDiemChoBoSung_ts, 
      ngayXuLy: i.ngayXuLy_ts, 
      ngayTiepNhan:i.ngayTiepNhan_s
    }
    LET counters = (
      FOR i IN group
        COLLECT AGGREGATE 
          countDaTra = SUM(i.daTra == 1 ? 1 : 0),
          countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
          countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
          countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
          countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
          countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
        RETURN {countDaTra, countNgayTra, countChoBoSung, countThoiDiemChoBoSung, countNgayXuLy, countNgayTiepNhan}
    )
    RETURN {
      val: tinhId, 
      count: count, 
      countDaTra: counters[0].countDaTra, 
      countNgayTra: counters[0].countNgayTra, 
      countChoBoSung: counters[0].countChoBoSung,
      countThoiDiemChoBoSung: counters[0].countThoiDiemChoBoSung,
      countNgayXuLy: counters[0].countNgayXuLy,
      countNgayTiepNhan: counters[0].countNgayTiepNhan
    }

I can't estimate how this will impact performance (positively or negatively), but at least it is something you could try if you are worried about the amount of subqueries.

from arangodb.

jsteemann commented on June 1, 2024 1

Ok, I think there isn't anything left that I can do now.
The only suggestion I have is to try the upcoming version of ArangoDB (3.12), which is currently in development. I can't say if that will help or not, but maybe it is worth a try.
Linux nightly builds can be found here: https://download.arangodb.com/nightly/devel/Linux/x86_64/index.html
Please do not run a nightly build on the original data, but on a copy of your data!

from arangodb.

jsteemann commented on June 1, 2024 1

Hi @son2408 , Unfortunately, I am not aware of any good way to speed up COLLECT ... INTO further.

from arangodb.

son2408 commented on June 1, 2024

@jsteemann yes, with your view definition the query speeds up dramatically. If my query have subqueries count then runtime query too slowly. It takes 16.40239s for 5M documents. Let help me optimize it.

Query

FOR i IN search_hosotiepnhan_3
    SEARCH i.tinhId > 0
    limit 0,5000000
    COLLECT tinhId = i.tinhId
    AGGREGATE count = length(1)
    INTO group = {daTra: i.daTra, ngayTra: i.ngayTra_ts, choBoSung:i.choBoSung, thoiDiemChoBoSung:i.thoiDiemChoBoSung_ts, ngayXuLy: i.ngayXuLy_ts, ngayTiepNhan:i.ngayTiepNhan_ts}
    LET countDaTra = (FOR i IN group
                        FILTER i.daTra == 1
                        COLLECT WITH COUNT INTO countDaTra
                        RETURN countDaTra)
    LET countNgayTra = (FOR i IN group
                        FILTER i.ngayTra > 0
                        COLLECT WITH COUNT INTO countNgayTra
                        RETURN countNgayTra)
                        
    LET countChoBoSung = (FOR i IN group
                        FILTER i.choBoSung == 1
                        COLLECT WITH COUNT INTO countChoBoSung
                        RETURN countChoBoSung)
    LET countThoiDiemChoBoSung = (FOR i IN group
                        FILTER i.thoiDiemChoBoSung > 0
                        COLLECT WITH COUNT INTO countThoiDiemChoBoSung
                        RETURN countThoiDiemChoBoSung)
    LET countNgayXuLy = (FOR i IN group
                        FILTER i.ngayXuLy > 0
                        COLLECT WITH COUNT INTO countNgayXuLy
                        RETURN countNgayXuLy)
                        
    LET countNgayTiepNhan = (FOR i IN group
                        FILTER i.ngayTiepNhan > 0
                        COLLECT WITH COUNT INTO countNgayTiepNhan
                        RETURN countNgayTiepNhan)
    RETURN {val: tinhId, count : count, countDaTra: countDaTra, countNgayTra: countNgayTra, countChoBoSung: countChoBoSung,countThoiDiemChoBoSung: countThoiDiemChoBoSung,countNgayXuLy:countNgayXuLy,countNgayTiepNhan:countNgayTiepNhan}

Query profile

Query String (1837 chars, cacheable: false):
 FOR i IN search_hosotiepnhan_3
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {daTra: i.daTra, ngayTra: i.ngayTra_ts, choBoSung:i.choBoSung, 
 thoiDiemChoBoSung:i.thoiDiemChoBoSung_ts, ngayXuLy: i.ngayXuLy_ts, ngayTiepNhan:i.ngayTiepNhan_s}
     LET countDaTra = (FOR i IN group
                         FILTER i.daTra == 1
                         COLLECT WITH COUNT INTO countDaTra
                         RETURN countDaTra)
     LET countNgayTra = (FOR i IN group
                         FILTER i.ngayTra > 0
                         COLLECT WITH COUNT INTO countNgayTra
                         RETURN countNgayTra)
                         
     LET countChoBoSung = (FOR i IN group
                         FILTER i.choBoSung == 1
                         COLLECT WITH COUNT INTO countChoBoSung
                         RETURN countChoBoSung)
     LET countThoiDiemChoBoSung = (FOR i IN group
                      ...

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00000   * ROOT
  2   EnumerateViewNode    5000   5000000          0       0.99799     - FOR i IN search_hosotiepnhan_3 SEARCH (i.`tinhId` > 0) LET #68 = i.`ngayTra_ts` LET #70 = i.`choBoSung` LET #34 = i.`tinhId` LET #74 = i.`ngayXuLy_ts` LET #66 = i.`daTra` LET #72 = i.`thoiDiemChoBoSung_ts` LET #76 = i.`ngayTiepNhan_s`   /* view query without materialization */
  3   LimitNode            5000   5000000          0       0.07333       - LIMIT 0, 5000000
  6   CalculationNode      5000   5000000          0       2.80703       - LET #38 = { "daTra" : #66, "ngayTra" : #68, "choBoSung" : #70, "thoiDiemChoBoSung" : #72, "ngayXuLy" : #74, "ngayTiepNhan" : #76 }   /* simple expression */
  7   CollectNode             1        37          0       0.96351       - COLLECT tinhId = #34 AGGREGATE count = LENGTH() INTO group = #38   /* hash */
 52   SortNode                1        37          0       0.07302       - SORT tinhId ASC   /* sorting strategy: standard */
 63   SubqueryStartNode      37        74          0       0.16044       - LET countNgayTiepNhan = ( /* subquery begin */
 44   EnumerateListNode    5020   5000037          0       0.75889         - FOR i IN group   /* list iteration */
 45   CalculationNode      5020   5000037          0       0.76964           - LET #50 = (i.`ngayTiepNhan` > 0)   /* simple expression */
 46   FilterNode             37        37    5000000       0.14499           - FILTER #50
 47   CollectNode             1        74          0       0.00011           - COLLECT AGGREGATE countNgayTiepNhan = LENGTH()   /* count */
 64   SubqueryEndNode         1        37          0       0.00002           - RETURN  countNgayTiepNhan ) /* subquery end */
 61   SubqueryStartNode      37        74          0       0.16176       - LET countNgayXuLy = ( /* subquery begin */
 37   EnumerateListNode    5020   5000037          0       0.75396         - FOR i IN group   /* list iteration */
 38   CalculationNode      5020   5000037          0       0.87656           - LET #48 = (i.`ngayXuLy` > 0)   /* simple expression */
 39   FilterNode             37   5000037          0       0.14957           - FILTER #48
 40   CollectNode             1        74          0       0.00012           - COLLECT AGGREGATE countNgayXuLy = LENGTH()   /* count */
 62   SubqueryEndNode         1        37          0       0.00002           - RETURN  countNgayXuLy ) /* subquery end */
 59   SubqueryStartNode      37        74          0       0.15743       - LET countThoiDiemChoBoSung = ( /* subquery begin */
 30   EnumerateListNode    5020   5000037          0       0.75541         - FOR i IN group   /* list iteration */
 31   CalculationNode      5020   5000037          0       0.86033           - LET #46 = (i.`thoiDiemChoBoSung` > 0)   /* simple expression */
 32   FilterNode             37   1123639    3876398       0.16305           - FILTER #46
 33   CollectNode             1        74          0       0.00012           - COLLECT AGGREGATE countThoiDiemChoBoSung = LENGTH()   /* count */
 60   SubqueryEndNode         1        37          0       0.00002           - RETURN  countThoiDiemChoBoSung ) /* subquery end */
 57   SubqueryStartNode      37        74          0       0.16859       - LET countChoBoSung = ( /* subquery begin */
 23   EnumerateListNode    5020   5000037          0       0.75705         - FOR i IN group   /* list iteration */
 24   CalculationNode      5020   5000037          0       0.84098           - LET #44 = (i.`choBoSung` == 1)   /* simple expression */
 25   FilterNode             37     57277    4942760       0.15686           - FILTER #44
 26   CollectNode             1        74          0       0.00013           - COLLECT AGGREGATE countChoBoSung = LENGTH()   /* count */
 58   SubqueryEndNode         1        37          0       0.00002           - RETURN  countChoBoSung ) /* subquery end */
 55   SubqueryStartNode      37        74          0       0.15836       - LET countNgayTra = ( /* subquery begin */
 16   EnumerateListNode    5020   5000037          0       0.75944         - FOR i IN group   /* list iteration */
 17   CalculationNode      5020   5000037          0       0.90365           - LET #42 = (i.`ngayTra` > 0)   /* simple expression */
 18   FilterNode             37   4219307     780730       0.16860           - FILTER #42
 19   CollectNode             1        74          0       0.00012           - COLLECT AGGREGATE countNgayTra = LENGTH()   /* count */
 56   SubqueryEndNode         1        37          0       0.00002           - RETURN  countNgayTra ) /* subquery end */
 53   SubqueryStartNode      37        74          0       0.08596       - LET countDaTra = ( /* subquery begin */
  9   EnumerateListNode    5020   5000037          0       0.76448         - FOR i IN group   /* list iteration */
 10   CalculationNode      5020   5000037          0       0.84268           - LET #40 = (i.`daTra` == 1)   /* simple expression */
 11   FilterNode             37   4211854     788183       0.16787           - FILTER #40
 12   CollectNode             1        74          0       0.00011           - COLLECT AGGREGATE countDaTra = LENGTH()   /* count */
 54   SubqueryEndNode         1        37          0       0.00002           - RETURN  countDaTra ) /* subquery end */
 50   CalculationNode         1        37          0       0.00004       - LET #52 = { "val" : tinhId, "count" : count, "countDaTra" : countDaTra, "countNgayTra" : countNgayTra, "countChoBoSung" : countChoBoSung, "countThoiDiemChoBoSung" : countThoiDiemChoBoSung, "countNgayXuLy" : countNgayXuLy, "countNgayTiepNhan" : countNgayTiepNhan }   /* simple expression */
 51   ReturnNode              1        37          0       0.00001       - RETURN #52

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-up-2
  3   handle-arangosearch-views
  4   remove-unnecessary-calculations-2
  5   move-calculations-down
  6   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0   15388071     1829371904        16.40413

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00013
 optimizing ast                 0.00002
 loading collections            0.00001
 instantiating plan             0.00012
 optimizing plan                0.00122
 instantiating executors        0.00028
 executing                     16.40239
 finalizing                     0.00007

I have multi subqueries. I see it spends alot of time for EnumerateListNode and CalculationNode for each subquery. Do you have the best way to statistic data with multi count by query in a FOR .. IN ? Because I see subqueries similar FOR i IN group

from arangodb.

son2408 commented on June 1, 2024

@jsteemann thanks you, I tried query as your suggestion, but time query not as expected, it takes 26.65674 seconds. It's slower old query :(

Profile

Query String (1307 chars, cacheable: false):
 FOR i IN search_hosotiepnhan_3
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {
       daTra: i.daTra, 
       ngayTra: i.ngayTra_ts, 
       choBoSung: i.choBoSung, 
       thoiDiemChoBoSung: i.thoiDiemChoBoSung_ts, 
       ngayXuLy: i.ngayXuLy_ts, 
       ngayTiepNhan:i.ngayTiepNhan_s
     }
     LET counters = (
       FOR i IN group
         COLLECT AGGREGATE 
           countDaTra = SUM(i.daTra == 1 ? 1 : 0),
           countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
           countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
           countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
           countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
           countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
         RETURN {countDaTra, countNgayTra, countChoBoSung, countThoiDiemChoBoSung, countNgayXuLy, 
 countNgayTiepNhan}
     )
     RETURN {
       val: tinhId, 
       count: count, 
       co...

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00000   * ROOT
  2   EnumerateViewNode    5000   5000000          0      13.14875     - FOR i IN search_hosotiepnhan_3 SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.07553       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       0.62708       - LET #14 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       5.60139       - LET #18 = { "daTra" : i.`daTra`, "ngayTra" : i.`ngayTra_ts`, "choBoSung" : i.`choBoSung`, "thoiDiemChoBoSung" : i.`thoiDiemChoBoSung_ts`, "ngayXuLy" : i.`ngayXuLy_ts`, "ngayTiepNhan" : i.`ngayTiepNhan_s` }   /* simple expression */
  7   CollectNode             1        36          0       0.98457       - COLLECT tinhId = #14 AGGREGATE count = LENGTH() INTO group = #18   /* hash */
 22   SortNode                1        36          0       0.07236       - SORT tinhId ASC   /* sorting strategy: standard */
 24   SubqueryStartNode       1        72          0       0.08490       - LET counters = ( /* subquery begin */
  9   EnumerateListNode    5001   5000036          0       0.74345         - FOR i IN group   /* list iteration */
 10   CalculationNode      5001   5000036          0       0.75794           - LET #20 = ((i.`daTra` == 1) ? 1 : 0)   /* simple expression */
 11   CalculationNode      5001   5000036          0       0.80083           - LET #22 = ((i.`ngayTra` > 0) ? 1 : 0)   /* simple expression */
 12   CalculationNode      5001   5000036          0       0.74203           - LET #24 = ((i.`choBoSung` == 1) ? 1 : 0)   /* simple expression */
 13   CalculationNode      5001   5000036          0       0.75944           - LET #26 = ((i.`thoiDiemChoBoSung` > 0) ? 1 : 0)   /* simple expression */
 14   CalculationNode      5001   5000036          0       0.77572           - LET #28 = ((i.`ngayXuLy` > 0) ? 1 : 0)   /* simple expression */
 15   CalculationNode      5001   5000036          0       0.86129           - LET #30 = ((i.`ngayTiepNhan` > 0) ? 1 : 0)   /* simple expression */
 16   CollectNode             1        72          0       0.62120           - COLLECT AGGREGATE countDaTra = SUM(#20), countNgayTra = SUM(#22), countChoBoSung = SUM(#24), countThoiDiemChoBoSung = SUM(#26), countNgayXuLy = SUM(#28), countNgayTiepNhan = SUM(#30)   /* sorted */
 23   LimitNode               1        72          0       0.00001           - LIMIT 0, 1
 17   CalculationNode         1        72          0       0.00005           - LET #32 = { "countDaTra" : countDaTra, "countNgayTra" : countNgayTra, "countChoBoSung" : countChoBoSung, "countThoiDiemChoBoSung" : countThoiDiemChoBoSung, "countNgayXuLy" : countNgayXuLy, "countNgayTiepNhan" : countNgayTiepNhan }   /* simple expression */
 25   SubqueryEndNode         1        36          0       0.00002           - RETURN  #32 ) /* subquery end */
 20   CalculationNode         1        36          0       0.00009       - LET #34 = { "val" : tinhId, "count" : count, "countDaTra" : counters[0].`countDaTra`, "countNgayTra" : counters[0].`countNgayTra`, "countChoBoSung" : counters[0].`countChoBoSung`, "countThoiDiemChoBoSung" : counters[0].`countThoiDiemChoBoSung`, "countNgayXuLy" : counters[0].`countNgayXuLy`, "countNgayTiepNhan" : counters[0].`countNgayTiepNhan` }   /* simple expression */
 21   ReturnNode              1        36          0       0.00001       - RETURN #34

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   optimize-subqueries
  3   move-calculations-up-2
  4   handle-arangosearch-views
  5   remove-unnecessary-calculations-2
  6   move-calculations-down
  7   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0          0     1805746176        26.65769

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00010
 optimizing ast                 0.00001
 loading collections            0.00001
 instantiating plan             0.00006
 optimizing plan                0.00067
 instantiating executors        0.00015
 executing                     26.65674
 finalizing                     0.00006

from arangodb.

jsteemann commented on June 1, 2024

Please note that the extraction from the view is again the slowest part of your query.
You will need to adjust the view definition again so it includes all used attributes as "stored values".

from arangodb.

son2408 commented on June 1, 2024

@jsteemann yes, I adjusted the view defintion.

{
  "globallyUniqueId": "h64D5FDE00828/162169",
  "cleanupIntervalStep": 2,
  "writebufferIdle": 64,
  "writebufferActive": 0,
  "id": "162169",
  "consolidationIntervalMsec": 1000,
  "name": "search_hosotiepnhan_3",
  "commitIntervalMsec": 1000,
  "type": "arangosearch",
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "primarySort": [
    {
      "field": "tinhId",
      "asc": true
    }
  ],
  "primarySortCompression": "lz4",
  "storedValues": [
    {
      "fields": [
        "tinhId"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "ngayTra_ts"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "choBoSung"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "thoiDiemChoBoSung_ts"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "ngayXuLy_ts"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "ngayTiepNhan_s"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "dayTiepNhan_ts"
      ],
      "compression": "lz4"
    },
    {
      "fields": [
        "daTra"
      ],
      "compression": "lz4"
    }
  ],
  "writebufferSizeMax": 33554432,
  "links": {
    "vertex_hosotiepnhan_2": {
      "analyzers": [
        "identity"
      ],
      "fields": {},
      "includeAllFields": true,
      "storeValues": "none",
      "trackListPositions": false
    }
  }
}

from arangodb.

son2408 commented on June 1, 2024

@jsteemann I see if used attributes the more the slow, pls check help me. Thanks you.

from arangodb.

MBkkt commented on June 1, 2024

@son2408 it's expected, using additional columns slow down query
You can try to unify in single column few fields (just write [["a", "b"]] instead of [["a"], ["b"]], if they're always used together it commonly will be faster
Ofc there will be some limit, starting from what read whole document from rocksdb will be faster

from arangodb.

son2408 commented on June 1, 2024

@MBkkt I tried to unify in single column few fields, but it's still too slow. Total runtime is 26.95196 seconds.
Profile

Query String (1306 chars, cacheable: false):
 FOR i IN search_hosotiepnhan
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {
       daTra: i.daTra, 
       ngayTra: i.ngayTra_ts, 
       choBoSung: i.choBoSung, 
       thoiDiemChoBoSung: i.thoiDiemChoBoSung_ts, 
       ngayXuLy: i.ngayXuLy_ts, 
       ngayTiepNhan:i.ngayTiepNhan_ts
     }
     LET counters = (
       FOR i IN group
         COLLECT AGGREGATE 
           countDaTra = SUM(i.daTra == 1 ? 1 : 0),
           countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
           countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
           countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
           countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
           countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
         RETURN {countDaTra, countNgayTra, countChoBoSung, countThoiDiemChoBoSung, countNgayXuLy, 
 countNgayTiepNhan}
     )
     RETURN {
       val: tinhId, 
       count: count, 
       cou...

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  2   EnumerateViewNode    5000   5000000          0      12.71218     - FOR i IN search_hosotiepnhan SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.07812       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       0.63890       - LET #14 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       6.13356       - LET #18 = { "daTra" : i.`daTra`, "ngayTra" : i.`ngayTra_ts`, "choBoSung" : i.`choBoSung`, "thoiDiemChoBoSung" : i.`thoiDiemChoBoSung_ts`, "ngayXuLy" : i.`ngayXuLy_ts`, "ngayTiepNhan" : i.`ngayTiepNhan_ts` }   /* simple expression */
  7   CollectNode             1        37          0       1.08334       - COLLECT tinhId = #14 AGGREGATE count = LENGTH() INTO group = #18   /* hash */
 22   SortNode                1        37          0       0.07627       - SORT tinhId ASC   /* sorting strategy: standard */
 24   SubqueryStartNode       1        74          0       0.08903       - LET counters = ( /* subquery begin */
  9   EnumerateListNode    5001   5000037          0       0.74561         - FOR i IN group   /* list iteration */
 10   CalculationNode      5001   5000037          0       0.75636           - LET #20 = ((i.`daTra` == 1) ? 1 : 0)   /* simple expression */
 11   CalculationNode      5001   5000037          0       0.79988           - LET #22 = ((i.`ngayTra` > 0) ? 1 : 0)   /* simple expression */
 12   CalculationNode      5001   5000037          0       0.74970           - LET #24 = ((i.`choBoSung` == 1) ? 1 : 0)   /* simple expression */
 13   CalculationNode      5001   5000037          0       0.75335           - LET #26 = ((i.`thoiDiemChoBoSung` > 0) ? 1 : 0)   /* simple expression */
 14   CalculationNode      5001   5000037          0       0.77716           - LET #28 = ((i.`ngayXuLy` > 0) ? 1 : 0)   /* simple expression */
 15   CalculationNode      5001   5000037          0       0.92626           - LET #30 = ((i.`ngayTiepNhan` > 0) ? 1 : 0)   /* simple expression */
 16   CollectNode             1        74          0       0.63191           - COLLECT AGGREGATE countDaTra = SUM(#20), countNgayTra = SUM(#22), countChoBoSung = SUM(#24), countThoiDiemChoBoSung = SUM(#26), countNgayXuLy = SUM(#28), countNgayTiepNhan = SUM(#30)   /* sorted */
 23   LimitNode               1        74          0       0.00001           - LIMIT 0, 1
 17   CalculationNode         1        74          0       0.00006           - LET #32 = { "countDaTra" : countDaTra, "countNgayTra" : countNgayTra, "countChoBoSung" : countChoBoSung, "countThoiDiemChoBoSung" : countThoiDiemChoBoSung, "countNgayXuLy" : countNgayXuLy, "countNgayTiepNhan" : countNgayTiepNhan }   /* simple expression */
 25   SubqueryEndNode         1        37          0       0.00002           - RETURN  #32 ) /* subquery end */
 20   CalculationNode         1        37          0       0.00008       - LET #34 = { "val" : tinhId, "count" : count, "countDaTra" : counters[0].`countDaTra`, "countNgayTra" : counters[0].`countNgayTra`, "countChoBoSung" : counters[0].`countChoBoSung`, "countThoiDiemChoBoSung" : counters[0].`countThoiDiemChoBoSung`, "countNgayXuLy" : counters[0].`countNgayXuLy`, "countNgayTiepNhan" : counters[0].`countNgayTiepNhan` }   /* simple expression */
 21   ReturnNode              1        37          0       0.00001       - RETURN #34

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   optimize-subqueries
  3   move-calculations-up-2
  4   handle-arangosearch-views
  5   remove-unnecessary-calculations-2
  6   move-calculations-down
  7   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0          0     1886453760        26.95313

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00012
 optimizing ast                 0.00001
 loading collections            0.00001
 instantiating plan             0.00007
 optimizing plan                0.00069
 instantiating executors        0.00031
 executing                     26.95196
 finalizing                     0.00005

View configuration

{
  "globallyUniqueId": "h64D5FDE00828/196510",
  "cleanupIntervalStep": 2,
  "writebufferIdle": 64,
  "writebufferActive": 0,
  "id": "196510",
  "consolidationIntervalMsec": 1000,
  "name": "search_hosotiepnhan",
  "commitIntervalMsec": 1000,
  "type": "arangosearch",
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "primarySort": [],
  "primarySortCompression": "lz4",
  "storedValues": [
    {
      "fields": [
        "tinhId",
        "daTra",
        "ngayTra_ts",
        "choBoSung",
        "thoiDiemChoBoSung_ts",
        "ngayXuLy_ts",
        "ngayTiepNhan_ts"
      ],
      "compression": "lz4"
    }
  ],
  "writebufferSizeMax": 33554432,
  "links": {
    "vertex_hosotiepnhan_2": {
      "analyzers": [
        "identity"
      ],
      "fields": {},
      "includeAllFields": true,
      "storeValues": "none",
      "trackListPositions": false
    }
  }
}

from arangodb.

son2408 commented on June 1, 2024

@jsteemann I included used attributes as "stored values", but it can't improve. I see the more atributes, the slower speed. Can you help me try it with data is shared in above, thanks you.
Query

FOR i IN search_hosotiepnhan_3
    SEARCH i.tinhId > 0
    limit 0,5000000
    COLLECT tinhId = i.tinhId
    AGGREGATE count = length(1)
    INTO group = {
      daTra: i.daTra, 
      ngayTra: i.ngayTra_ts, 
      choBoSung: i.choBoSung, 
      thoiDiemChoBoSung: i.thoiDiemChoBoSung_ts, 
      ngayXuLy: i.ngayXuLy_ts, 
      ngayTiepNhan:i.ngayTiepNhan_ts
    }
    LET counters = (
      FOR i IN group
        COLLECT AGGREGATE 
          countDaTra = SUM(i.daTra == 1 ? 1 : 0),
          countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
          countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
          countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
          countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
          countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
        RETURN {countDaTra, countNgayTra, countChoBoSung, countThoiDiemChoBoSung, countNgayXuLy, countNgayTiepNhan}
    )
    RETURN {
      val: tinhId, 
      count: count, 
      countDaTra: counters[0].countDaTra, 
      countNgayTra: counters[0].countNgayTra, 
      countChoBoSung: counters[0].countChoBoSung,
      countThoiDiemChoBoSung: counters[0].countThoiDiemChoBoSung,
      countNgayXuLy: counters[0].countNgayXuLy,
      countNgayTiepNhan: counters[0].countNgayTiepNhan
    }

from arangodb.

jsteemann commented on June 1, 2024

Hard to say without any profile what the limiting factor is now.
You could also try removing the one left subqueries for the counting and fuse its COLLECT statement with the initial COLLECT. Then only one pass over the data is needed.

from arangodb.

son2408 commented on June 1, 2024

@jsteemann yes, i tried to remove some attributes, runtime query speed up. With initial COLLECT is 6 attributes then runtime query is 32s, now attributes numbers is 2 then runtime query is 20s
profile now

Query String (598 chars, cacheable: false):
 FOR i IN search_hosotiepnhan
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE count = length(1)
     INTO group = {
       daTra: i.daTra, 
       ngayTra: i.ngayTra_ts
     }
     LET counters = (
       FOR i IN group
         COLLECT AGGREGATE 
           countDaTra = SUM(i.daTra == 1 ? 1 : 0),
           countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0)
         RETURN {countDaTra, countNgayTra}
     )
     RETURN {
       val: tinhId, 
       count: count, 
       countDaTra: counters[0].countDaTra, 
       countNgayTra: counters[0].countNgayTra
     }

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00000   * ROOT
  2   EnumerateViewNode    5000   5000000          0      12.83193     - FOR i IN search_hosotiepnhan SEARCH (i.`tinhId` > 0)   /* view query */
  3   LimitNode            5000   5000000          0       0.07608       - LIMIT 0, 5000000
  4   CalculationNode      5000   5000000          0       0.64449       - LET #10 = i.`tinhId`   /* attribute expression */
  6   CalculationNode      5000   5000000          0       3.21296       - LET #14 = { "daTra" : i.`daTra`, "ngayTra" : i.`ngayTra_ts` }   /* simple expression */
  7   CollectNode             1        37          0       0.81389       - COLLECT tinhId = #10 AGGREGATE count = LENGTH() INTO group = #14   /* hash */
 18   SortNode                1        37          0       0.02310       - SORT tinhId ASC   /* sorting strategy: standard */
 20   SubqueryStartNode       1        74          0       0.02810       - LET counters = ( /* subquery begin */
  9   EnumerateListNode    5001   5000037          0       0.70404         - FOR i IN group   /* list iteration */
 10   CalculationNode      5001   5000037          0       0.70010           - LET #16 = ((i.`daTra` == 1) ? 1 : 0)   /* simple expression */
 11   CalculationNode      5001   5000037          0       0.90486           - LET #18 = ((i.`ngayTra` > 0) ? 1 : 0)   /* simple expression */
 12   CollectNode             1        74          0       0.30604           - COLLECT AGGREGATE countDaTra = SUM(#16), countNgayTra = SUM(#18)   /* sorted */
 19   LimitNode               1        74          0       0.00001           - LIMIT 0, 1
 13   CalculationNode         1        74          0       0.00004           - LET #20 = { "countDaTra" : countDaTra, "countNgayTra" : countNgayTra }   /* simple expression */
 21   SubqueryEndNode         1        37          0       0.00002           - RETURN  #20 ) /* subquery end */
 16   CalculationNode         1        37          0       0.00004       - LET #22 = { "val" : tinhId, "count" : count, "countDaTra" : counters[0].`countDaTra`, "countNgayTra" : counters[0].`countNgayTra` }   /* simple expression */
 17   ReturnNode              1        37          0       0.00000       - RETURN #22

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   optimize-subqueries
  3   move-calculations-up-2
  4   handle-arangosearch-views
  5   remove-unnecessary-calculations-2
  6   move-calculations-down
  7   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      5000000               0 / 0          0      547389440        20.24671

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00010
 optimizing ast                 0.00001
 loading collections            0.00001
 instantiating plan             0.00007
 optimizing plan                0.00046
 instantiating executors        0.00027
 executing                     20.24583
 finalizing                     0.00003

from arangodb.

son2408 commented on June 1, 2024

@jsteemann i think "stored values" is no efficient with COLLECT INTO

from arangodb.

jsteemann commented on June 1, 2024

"stored values" and COLLECT INTO are unrelated.

from arangodb.

jsteemann commented on June 1, 2024

Can you try fusing the two COLLECT statements into a single one, so you will need only one iteration over the dataset?

from arangodb.

son2408 commented on June 1, 2024

@jsteemann two COLLECT on two other collections. COLLECT 1 for initial search statement, COLLECT 2 for group of COLLECT 1. How to fusing them ?

from arangodb.

jsteemann commented on June 1, 2024

Have you tried something along the lines of

 FOR i IN search_hosotiepnhan_3
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE 
           count = length(1),
           countDaTra = SUM(i.daTra == 1 ? 1 : 0),
           countNgayTra = SUM(i.ngayTra > 0 ? 1 : 0),
           countChoBoSung = SUM(i.choBoSung == 1 ? 1 : 0),
           countThoiDiemChoBoSung = SUM(i.thoiDiemChoBoSung > 0 ? 1 : 0),
           countNgayXuLy = SUM(i.ngayXuLy > 0 ? 1 : 0),
           countNgayTiepNhan = SUM(i.ngayTiepNhan > 0 ? 1 : 0)
...

from arangodb.

son2408 commented on June 1, 2024

@jsteemann i tried it, runtime query is 4 seconds. thanks you.
With more complex usecase is I need subquery collect by key huyenId field with query daTra == 1 so I must use INTO. How to rebuild this query without INTO as this above your query? becuase using INTO it run query too slowly

FOR i IN search_hosotiepnhan_3
     SEARCH i.tinhId > 0
     limit 0,5000000
     COLLECT tinhId = i.tinhId
     AGGREGATE 
           count = length(1),
           countDaTra = SUM(i.daTra == 1 ? 1 : 0)
     INTO group = {daTra: i.daTra, huyenId: i.huyenId}
     LET statHuyen = (FOR g IN group
                         FILTER g.daTra == 1
                         COLLECT huyenId = g.huyenId
                         AGGREGATE
                              countHuyen = length(1)
                              RETURN {val: huyenId, count: countHuyen }
                     )
      ....
      ```

from arangodb.

MBkkt commented on June 1, 2024

/* view query */

view used without stored values

/* view query without materialization */

view used stored values

from arangodb.

son2408 commented on June 1, 2024

Ok, I think there isn't anything left that I can do now. The only suggestion I have is to try the upcoming version of ArangoDB (3.12), which is currently in development. I can't say if that will help or not, but maybe it is worth a try. Linux nightly builds can be found here: https://download.arangodb.com/nightly/devel/Linux/x86_64/index.html Please do not run a nightly build on the original data, but on a copy of your data!

Ok. Let me to try it. Thanks @jsteemann

from arangodb.

son2408 commented on June 1, 2024

hi @jsteemann I upgraded to version 3.12.0-NIGHTLY.20231219. I see runtime query with COLLECT..INTO improved more than old version. But It is not as expect. For detail usecase.
My query with three times INTO takes 8.5.
View Configuration

{
  "globallyUniqueId": "h64D5FDE00828/16843",
  "cleanupIntervalStep": 2,
  "writebufferIdle": 64,
  "writebufferActive": 0,
  "id": "16843",
  "consolidationIntervalMsec": 1000,
  "name": "search_hosotiepnhan",
  "commitIntervalMsec": 1000,
  "type": "arangosearch",
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "primarySort": [
    {
      "field": "hoSoTiepNhanId",
      "asc": true
    }
  ],
  "primarySortCompression": "lz4",
  "storedValues": [
    {
      "fields": [
        "tinhId",
        "huyenId",
        "daTra",
        "ngayTra_ts",
        "choBoSung",
        "thoiDiemChoBoSung_ts",
        "ngayXuLy_ts",
        "ngayTiepNhan_ts",
        "dayTiepNhan",
        "monthTiepNhan",
        "quarterTiepNhan"
      ],
      "compression": "lz4"
    }
  ],
  "writebufferSizeMax": 33554432,
  "links": {
    "vertex_hosotiepnhan": {
      "analyzers": [
        "identity"
      ],
      "fields": {},
      "includeAllFields": true,
      "storeValues": "none",
      "trackListPositions": false
    }
  }
}

Query

FOR i IN search_hosotiepnhan
    limit 0, 5000000
    collect tinhId = i.tinhId
    aggregate
      countTinh = length(1)
    into group = {daTra: i.daTra, dayTiepNhan: i.dayTiepNhan, monthTiepNhan: i.monthTiepNhan, quarterTiepNhan: i.quarterTiepNhan}
    
    //dayTiepNhan
    let statDayTiepNhan = (
      FOR g in group
        collect collDayTiepNhan = g.dayTiepNhan
        aggregate
          countDayTiepNhan = length(1)
        into groupDayTiepNhan = {daTra:g.daTra}
        let statDayTiepNhanDaTra = (
          FOR gDayTiepNhan in groupDayTiepNhan
            collect daTra = gDayTiepNhan.daTra
            aggregate
              countDaTra = length(1)
              return {val: daTra, val: countDaTra}
        )
        return {val: collDayTiepNhan, count: countDayTiepNhan, buckets: statDayTiepNhanDaTra}
    )
    //monthTiepNhan
    let statMonthTiepNhan = (
      FOR g in group
        collect monthTiepNhan = g.monthTiepNhan
        aggregate
          countMonthTiepNhan = length(1)
        into groupMonthTiepNhan = {daTra:g.daTra}
        let statDaTra = (
          FOR gMonthTiepNhan in groupMonthTiepNhan
            collect daTra = gMonthTiepNhan.daTra
            aggregate
              countDaTra = length(1)
            return {val: daTra, val: countDaTra}
        )
        return {val: monthTiepNhan, count: countMonthTiepNhan}
    )
  return {val: tinhId, count: countTinh, statDayTiepNhan: statDayTiepNhan, statMonthTiepNhan: statMonthTiepNhan}

Profile

Query String (1520 chars, cacheable: false):
FOR i IN search_hosotiepnhan
   limit 0, 5000000
   collect tinhId = i.tinhId
   aggregate
     countTinh = length(1)
   into group = {daTra: i.daTra, dayTiepNhan: i.dayTiepNhan, monthTiepNhan: i.monthTiepNhan, 
quarterTiepNhan: i.quarterTiepNhan}
   
   //dayTiepNhan
   let statDayTiepNhan = (
     FOR g in group
       collect collDayTiepNhan = g.dayTiepNhan
       aggregate
         countDayTiepNhan = length(1)
       into groupDayTiepNhan = {daTra:g.daTra}
       let statDayTiepNhanDaTra = (
         FOR gDayTiepNhan in groupDayTiepNhan
           collect daTra = gDayTiepNhan.daTra
           aggregate
             countDaTra = length(1)
             return {val: daTra, val: countDaTra}
       )
       return {val: collDayTiepNhan, count: countDayTiepNhan, buckets: statDayTiepNhanDaTra}
   )
   //monthTiepNhan
   let statMonthTiepNhan = (
     FOR g in group
       collect monthTiepNhan = g.monthTiepNhan
       aggregate
...

Execution plan:
Id   NodeType            Calls    Par     Items   Filtered   Runtime [s]   Comment
1   SingletonNode           1      -         1          0       0.00000   * ROOT
2   EnumerateViewNode    5000      -   5000000          0       0.51076     - FOR i IN search_hosotiepnhan LET #30 = i.`tinhId` LET #110 = i.`monthTiepNhan` LET #109 = i.`dayTiepNhan` LET #108 = i.`daTra` LET #111 = i.`quarterTiepNhan`   /* view query without materialization */
3   LimitNode            5000      -   5000000          0       0.01663       - LIMIT 0, 5000000
6   CalculationNode      5000   4999   5000000          0       2.16173       - LET #32 = { "daTra" : #108, "dayTiepNhan" : #109, "monthTiepNhan" : #110, "quarterTiepNhan" : #111 }   /* simple expression */
7   CollectNode             1      0        37          0       1.00809       - COLLECT tinhId = #30 AGGREGATE countTinh = LENGTH() INTO group = #32   /* hash */
44   SortNode                1      0        37          0       0.04790       - SORT tinhId ASC   /* sorting strategy: standard */
52   SubqueryStartNode       1      -        74          0       0.11422       - LET statMonthTiepNhan = ( /* subquery begin */
26   EnumerateListNode    5001      -   5000037          0       0.74329         - FOR g IN group   /* list iteration */
27   CalculationNode      5001      -   5000037          0       0.37468           - LET #40 = g.`monthTiepNhan`   /* attribute expression */
29   CalculationNode      5001      -   5000037          0       1.03349           - LET #42 = { "daTra" : g.`daTra` }   /* simple expression */
30   CollectNode             1      -       457          0       0.29846           - COLLECT monthTiepNhan = #40 AGGREGATE countMonthTiepNhan = LENGTH()   /* hash */
47   SortNode                1      -       457          0       0.00010           - SORT monthTiepNhan ASC   /* sorting strategy: standard */
39   CalculationNode         1      -       457          0       0.00017           - LET #46 = { "val" : monthTiepNhan, "count" : countMonthTiepNhan }   /* simple expression */
53   SubqueryEndNode         1      -        37          0       0.00005           - RETURN  #46 ) /* subquery end */
48   SubqueryStartNode       1      -        74          0       0.06846       - LET statDayTiepNhan = ( /* subquery begin */
9   EnumerateListNode    5001      -   5000037          0       0.73775         - FOR g IN group   /* list iteration */
10   CalculationNode      5001      -   5000037          0       0.35109           - LET #33 = g.`dayTiepNhan`   /* attribute expression */
12   CalculationNode      5001      -   5000037          0       1.03256           - LET #35 = { "daTra" : g.`daTra` }   /* simple expression */
13   CollectNode             1      -       292          0       0.44655           - COLLECT collDayTiepNhan = #33 AGGREGATE countDayTiepNhan = LENGTH() INTO groupDayTiepNhan = #35   /* hash */
45   SortNode                1      -       292          0       0.00862           - SORT collDayTiepNhan ASC   /* sorting strategy: standard */
50   SubqueryStartNode       1      -       547          0       0.00820           - LET statDayTiepNhanDaTra = ( /* subquery begin */
15   EnumerateListNode    5001      -   5000292          0       0.31771             - FOR gDayTiepNhan IN groupDayTiepNhan   /* list iteration */
16   CalculationNode      5001      -   5000292          0       0.33699               - LET #36 = gDayTiepNhan.`daTra`   /* attribute expression */
18   CollectNode             1      -       793          0       0.27310               - COLLECT daTra = #36 AGGREGATE countDaTra = LENGTH()   /* hash */
46   SortNode                1      -       793          0       0.00023               - SORT daTra ASC   /* sorting strategy: standard */
19   CalculationNode         1      -       793          0       0.00039               - LET #38 = { "val" : daTra, "val" : countDaTra }   /* simple expression */
51   SubqueryEndNode         1      -       292          0       0.00009               - RETURN  #38 ) /* subquery end */
22   CalculationNode         1      -       292          0       0.00016           - LET #39 = { "val" : collDayTiepNhan, "count" : countDayTiepNhan, "buckets" : statDayTiepNhanDaTra }   /* simple expression */
49   SubqueryEndNode         1      -        37          0       0.00005           - RETURN  #39 ) /* subquery end */
42   CalculationNode         1      0        37          0       0.00004       - LET #47 = { "val" : tinhId, "count" : countTinh, "statDayTiepNhan" : statDayTiepNhan, "statMonthTiepNhan" : statMonthTiepNhan }   /* simple expression */
43   ReturnNode              1      -        37          0       0.00000       - RETURN #47

Indexes used:
none

Optimization rules applied:
Id   Rule Name                                 Id   Rule Name                                 Id   Rule Name                        
1   move-calculations-up                       5   remove-collect-variables                   9   async-prefetch                   
2   remove-redundant-calculations              6   handle-arangosearch-views                 10   splice-subqueries                
3   remove-unnecessary-calculations            7   remove-unnecessary-calculations-2
4   move-calculations-up-2                     8   move-calculations-down           

Query Statistics:
Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
         0               0                 0              0         5000000                  0 / 0             0        1563394048            8.63782

Query Profile:
Query Stage                Duration [s]         Query Stage                Duration [s]         Query Stage                Duration [s]
initializing                    0.00000         loading collections             0.00001         instantiating executors         0.00018
parsing                         0.00014         instantiating plan              0.00022         executing                       8.63007
optimizing ast                  0.00001         optimizing plan                 0.00719         finalizing                      0.00004

if this query has two times COLLECT..INTO. It takes 6s.
Query

FOR i IN search_hosotiepnhan
    limit 0, 5000000
    collect tinhId = i.tinhId
    aggregate
      countTinh = length(1)
    into group = {daTra: i.daTra, dayTiepNhan: i.dayTiepNhan, monthTiepNhan: i.monthTiepNhan, quarterTiepNhan: i.quarterTiepNhan}
    
    //dayTiepNhan
    let statDayTiepNhan = (
      FOR g in group
        collect collDayTiepNhan = g.dayTiepNhan
        aggregate
          countDayTiepNhan = length(1)
        into groupDayTiepNhan = {daTra:g.daTra}
        let statDayTiepNhanDaTra = (
          FOR gDayTiepNhan in groupDayTiepNhan
            collect daTra = gDayTiepNhan.daTra
            aggregate
              countDaTra = length(1)
              return {val: daTra, val: countDaTra}
        )
        return {val: collDayTiepNhan, count: countDayTiepNhan, buckets: statDayTiepNhanDaTra}
    )
  return {val: tinhId, count: countTinh, statDayTiepNhan: statDayTiepNhan}

Profile

Query String (925 chars, cacheable: false):
 FOR i IN search_hosotiepnhan
     limit 0, 5000000
     collect tinhId = i.tinhId
     aggregate
       countTinh = length(1)
     into group = {daTra: i.daTra, dayTiepNhan: i.dayTiepNhan, monthTiepNhan: i.monthTiepNhan, 
 quarterTiepNhan: i.quarterTiepNhan}
     
     //dayTiepNhan
     let statDayTiepNhan = (
       FOR g in group
         collect collDayTiepNhan = g.dayTiepNhan
         aggregate
           countDayTiepNhan = length(1)
         into groupDayTiepNhan = {daTra:g.daTra}
         let statDayTiepNhanDaTra = (
           FOR gDayTiepNhan in groupDayTiepNhan
             collect daTra = gDayTiepNhan.daTra
             aggregate
               countDaTra = length(1)
               return {val: daTra, val: countDaTra}
         )
         return {val: collDayTiepNhan, count: countDayTiepNhan, buckets: statDayTiepNhanDaTra}
     )
   return {val: tinhId, count: countTinh, statDayTiepNhan: statDayTiepNhan}

Execution plan:
 Id   NodeType            Calls    Par     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1      -         1          0       0.00000   * ROOT
  2   EnumerateViewNode    5000      -   5000000          0       0.51039     - FOR i IN search_hosotiepnhan LET #17 = i.`tinhId` LET #58 = i.`monthTiepNhan` LET #56 = i.`daTra` LET #57 = i.`dayTiepNhan` LET #59 = i.`quarterTiepNhan`   /* view query without materialization */
  3   LimitNode            5000      -   5000000          0       0.01718       - LIMIT 0, 5000000
  6   CalculationNode      5000   4999   5000000          0       2.15355       - LET #19 = { "daTra" : #56, "dayTiepNhan" : #57, "monthTiepNhan" : #58, "quarterTiepNhan" : #59 }   /* simple expression */
  7   CollectNode             1      0        37          0       1.00086       - COLLECT tinhId = #17 AGGREGATE countTinh = LENGTH() INTO group = #19   /* hash */
 27   SortNode                1      0        37          0       0.05242       - SORT tinhId ASC   /* sorting strategy: standard */
 30   SubqueryStartNode       1      -        74          0       0.06866       - LET statDayTiepNhan = ( /* subquery begin */
  9   EnumerateListNode    5001      -   5000037          0       0.72884         - FOR g IN group   /* list iteration */
 10   CalculationNode      5001      -   5000037          0       0.35558           - LET #20 = g.`dayTiepNhan`   /* attribute expression */
 12   CalculationNode      5001      -   5000037          0       1.02633           - LET #22 = { "daTra" : g.`daTra` }   /* simple expression */
 13   CollectNode             1      -       292          0       0.43884           - COLLECT collDayTiepNhan = #20 AGGREGATE countDayTiepNhan = LENGTH() INTO groupDayTiepNhan = #22   /* hash */
 28   SortNode                1      -       292          0       0.00886           - SORT collDayTiepNhan ASC   /* sorting strategy: standard */
 32   SubqueryStartNode       1      -       547          0       0.00816           - LET statDayTiepNhanDaTra = ( /* subquery begin */
 15   EnumerateListNode    5001      -   5000292          0       0.31943             - FOR gDayTiepNhan IN groupDayTiepNhan   /* list iteration */
 16   CalculationNode      5001      -   5000292          0       0.34112               - LET #23 = gDayTiepNhan.`daTra`   /* attribute expression */
 18   CollectNode             1      -       793          0       0.26691               - COLLECT daTra = #23 AGGREGATE countDaTra = LENGTH()   /* hash */
 29   SortNode                1      -       793          0       0.00022               - SORT daTra ASC   /* sorting strategy: standard */
 19   CalculationNode         1      -       793          0       0.00038               - LET #25 = { "val" : daTra, "val" : countDaTra }   /* simple expression */
 33   SubqueryEndNode         1      -       292          0       0.00009               - RETURN  #25 ) /* subquery end */
 22   CalculationNode         1      -       292          0       0.00014           - LET #26 = { "val" : collDayTiepNhan, "count" : countDayTiepNhan, "buckets" : statDayTiepNhanDaTra }   /* simple expression */
 31   SubqueryEndNode         1      -        37          0       0.00004           - RETURN  #26 ) /* subquery end */
 25   CalculationNode         1      0        37          0       0.00002       - LET #27 = { "val" : tinhId, "count" : countTinh, "statDayTiepNhan" : statDayTiepNhan }   /* simple expression */
 26   ReturnNode              1      -        37          0       0.00000       - RETURN #27

Indexes used:
 none

Optimization rules applied:
 Id   Rule Name                                 Id   Rule Name                                 Id   Rule Name                        
  1   move-calculations-up                       4   move-calculations-up-2                     7   move-calculations-down           
  2   remove-redundant-calculations              5   handle-arangosearch-views                  8   async-prefetch                   
  3   remove-unnecessary-calculations            6   remove-unnecessary-calculations-2          9   splice-subqueries                

Query Statistics:
 Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
           0               0                 0              0         5000000                  0 / 0             0        1246724096            6.03617

Query Profile:
 Query Stage                Duration [s]         Query Stage                Duration [s]         Query Stage                Duration [s]
 initializing                    0.00000         loading collections             0.00001         instantiating executors         0.00014
 parsing                         0.00013         instantiating plan              0.00009         executing                       6.03324
 optimizing ast                  0.00001         optimizing plan                 0.00255         finalizing                      0.00004

from arangodb.

son2408 commented on June 1, 2024

@jsteemann

from arangodb.

son2408 commented on June 1, 2024

@jsteemann Can you let me the best way to speedup query with multi COLLECT .. INTO ? . I see the more COLLECT .. INTO the slower runtime query. Thanks you

from arangodb.

Optimize statistic large data with COLLECT INTO too slow about arangodb HOT 25 CLOSED

Comments (25)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent