Currently the formula usage is checked by deparsing a formula into a string and runnin

Redo formula handling for well-formedness checks (contrastable, closed)

tsostarics commented on June 3, 2024

Redo formula handling for well-formedness checks

from contrastable.

Comments (4)

tsostarics commented on June 3, 2024

New recursion-based functions .omit_function_calls and .simplify_formula solve this issue. In the process I cleaned up a few of the other error checks.
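As a rough illustration of the recursive approach, here is a minimal sketch of stripping function calls out of a formula. The name omit_calls_sketch, the placeholder symbol .fn, and the operator set are assumptions made for this example; the package's actual .omit_function_calls and .simplify_formula may work differently.

```r
# Hypothetical sketch, not the package's implementation.
# Operators assumed to be meaningful in the formula syntax.
formula_operators <- c("~", "+", "-", "*", "|")

omit_calls_sketch <- function(node) {
  # Symbols and constants are leaves: return them unchanged
  if (!is.call(node)) return(node)

  fn_head <- node[[1]]
  is_operator <- is.symbol(fn_head) &&
    (as.character(fn_head) %in% formula_operators)

  # Any call that is not a formula operator collapses to a placeholder symbol
  if (!is_operator) return(quote(.fn))

  # Keep the operator and recurse into each of its arguments
  for (i in seq_along(node)[-1]) {
    node[[i]] <- omit_calls_sketch(node[[i]])
  }
  node
}

f <- y ~ matrix(c(1, -1), nrow = 2) + 4 * 2
deparse1(omit_calls_sketch(f))
# "y ~ .fn + 4 * 2"
```

Collapsing each call to a single symbol keeps operators that appear inside function arguments (such as the -1 above) from being mistaken for formula operators later on.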

tsostarics commented on June 3, 2024

Quick note that the .is_dominated_by_identity function was updated so that it can keep track of repeated operators more effectively. Consider two cases:

1) varName ~ sum_code + 1 + 2
2) varName ~ sum_code + 1 * 2 + 2

If you only check whether a node is directly dominated by an identical node (=*Identity constraint), then 1 is invalid but 2 is not. To fix this, we can keep track of the operators we've encountered so far while recursing through the tree.
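The idea of remembering operators during the recursion can be sketched as follows. This is a minimal, hypothetical helper (repeats_operator_sketch and its ops argument are not part of the package), and it assumes the expression has already been stripped of function calls.

```r
# Hypothetical sketch: TRUE if any operator in `ops` occurs more than once
# anywhere in the expression tree, not only directly under itself.
repeats_operator_sketch <- function(node, ops = c("+", "-", "*", "|")) {
  seen <- character(0)  # operators encountered so far

  walk <- function(x) {
    if (!is.call(x)) return(FALSE)  # leaves cannot repeat an operator

    fn_head <- x[[1]]
    if (is.symbol(fn_head)) {
      op <- as.character(fn_head)
      if (op %in% ops) {
        if (op %in% seen) return(TRUE)  # second occurrence found
        seen <<- c(seen, op)
      }
    }

    # Recurse into the arguments, stopping at the first repeat
    for (arg in as.list(x)[-1]) {
      if (walk(arg)) return(TRUE)
    }
    FALSE
  }

  walk(node)
}

case_1 <- varName ~ sum_code + 1 + 2
case_2 <- varName ~ sum_code + 1 * 2 + 2
case_3 <- varName ~ sum_code + 1 * 2

repeats_operator_sketch(case_1[[3]])  # TRUE
repeats_operator_sketch(case_2[[3]])  # TRUE
repeats_operator_sketch(case_3[[3]])  # FALSE
```

Because the set of seen operators is shared across the whole walk, a repeat is flagged regardless of how many other operators sit between the two occurrences.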

tsostarics commented on June 3, 2024

Another note on this: given that .simplify_formula already recurses through the structure, it perhaps doesn't make sense to recurse through it a second time with .is_dominated_by_identity. It's a bit faster to just use a regular expression on the simplified formula. With this, we never need to deparse the raw formula, so we can remove that function parameter and its code entirely.

## Matrix call is included to give .simplify_formula something to work with
# Case where the repeated operator appears later
tst <- gear ~ matrix(c(0.75, -0.25, -0.25,
                       -0.25/4, -0.25*2, 0.75^2.3,
                       -0.25, -0.25, -0.25%in%c(1,2,3),
                       -0.25, 0.75, -0.25) %>% abs(), nrow = 4) + 4 * 2 + - 2 + 2

# Case where the repeated operator appears immediately after the first occurrence
tst2 <- gear ~ matrix(c(0.75, -0.25, -0.25,
                        -0.25/4, -0.25*2, 0.75^2.3,
                        -0.25, -0.25, -0.25%in%c(1,2,3),
                        -0.25, 0.75, -0.25) %>% abs(), nrow = 4) + 4 + 4

# Case where there is no repeated operator
tst3 <- gear ~ matrix(c(0.75, -0.25, -0.25,
                        -0.25/4, -0.25*2, 0.75^2.3,
                        -0.25, -0.25, -0.25%in%c(1,2,3),
                        -0.25, 0.75, -0.25) %>% abs(), nrow = 4) + 4 * 2

# Case where the repeated operator appears MUCH later
tst4 <- gear ~ matrix(c(0.75, -0.25, -0.25,
                        -0.25/4, -0.25*2, 0.75^2.3,
                        -0.25, -0.25, -0.25%in%c(1,2,3),
                        -0.25, 0.75, -0.25) %>% abs(), nrow = 4) + 4 * 2 - 1 | 3 / 3 & 2 ^ 5 + 4

test_1 <- function(f) {
  s_f <- .simplify_formula(f)

  .any_dominated_by_identity(s_f[[3]])
}

test_2 <- function(f) {
  s_f <- .simplify_formula(f)
  c_f <- deparse1(s_f)
  
  grepl("([|+*-]).+(\\1)", c_f)
}

library(microbenchmark)
set.seed(111)
mb1 <- microbenchmark(test_1(tst), test_2(tst), times = 3000)
mb2 <- microbenchmark(test_1(tst2), test_2(tst2), times = 3000)
mb3 <- microbenchmark(test_1(tst3), test_2(tst3), times = 3000)
mb4 <- microbenchmark(test_1(tst4), test_2(tst4), times = 3000)

mb1;mb2;mb3;mb4

Unit: microseconds
        expr   min    lq    mean median     uq     max neval cld
 test_1(tst) 358.2 371.4 536.008  379.7 445.15 20494.9  3000   b
 test_2(tst) 319.0 330.4 474.503  337.8 402.70 29234.3  3000  a 
Unit: microseconds
         expr   min     lq     mean median     uq     max neval cld
 test_1(tst2) 230.3 243.45 343.1459  248.4 271.65 11962.1  3000   b
 test_2(tst2) 191.4 202.50 283.7148  206.4 227.00 13673.7  3000  a 
Unit: microseconds
         expr   min     lq     mean median    uq     max neval cld
 test_1(tst3) 284.9 298.10 406.0027  303.7 335.8 10351.7  3000   b
 test_2(tst3) 196.1 206.25 285.4946  210.7 233.0 14352.8  3000  a 
Unit: microseconds
         expr   min    lq     mean median    uq     max neval cld
 test_1(tst4) 503.3 520.7 778.1386  532.7 663.6 20459.4  3000   b
 test_2(tst4) 316.6 331.7 512.9367  340.0 406.3 13944.8  3000  a

The overall run time is under a millisecond in either case, but the regex method is a bit faster (the line in the plot marks where the times for both methods are equal). So .is_dominated_by_identity doesn't actually seem to be needed, which makes things simpler. A bit sad to see it go unused in the end, but it might be useful elsewhere at a later point; we'll see.
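To make the regex check concrete, here is a small sketch that applies the same pattern to a few hand-written strings. The strings are made-up stand-ins for deparsed, simplified formulas, and .fn is a hypothetical placeholder for a removed function call.

```r
# Same pattern as in test_2: capture one operator, then require the same
# operator to appear again later in the string (backreference \\1).
repeated_op_pattern <- "([|+*-]).+(\\1)"

# Made-up examples of already-simplified formulas
simplified <- c(
  "gear ~ .fn + 4 + 4",              # `+` repeated immediately
  "gear ~ .fn + 4 * 2",              # no repeated operator
  "gear ~ .fn + 4 * 2 - 1 | 3 + 4"   # `+` repeated much later
)
grepl(repeated_op_pattern, simplified)
# TRUE FALSE TRUE

# Without simplification, operators inside function arguments also match,
# which is why the pattern is only applied to the simplified formula.
grepl(repeated_op_pattern, "gear ~ f(-1) + 2 - 1")
# TRUE
```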

tsostarics commented on June 3, 2024

Last note on this, and I'm surprised I didn't consider it earlier: adding the error handling to .omit_function_calls directly and letting it throw an error as soon as a problem is found means it no longer needs to recurse through the whole structure 🤦‍♂️ This yields a substantial improvement on the deeply nested case (tst4) above, and an improvement on the simple case (tst3) since the final formula doesn't need to be checked.

Same tests as above (run on a better computer, though), with a tryCatch added to each for consistency across the three:

test_1 <- function(f) {
  s_f <- .simplify_formula(f)
  
  .any_dominated_by_identity(s_f[[3]])
  tryCatch(stop(1), error = \(e) TRUE)
}

test_2 <- function(f) {
  s_f <- .simplify_formula(f)
  c_f <- deparse1(s_f)
  
  grepl("([|+*-]).+(\\1)", c_f)
  tryCatch(stop(1), error = \(e) TRUE)
}

test_3 <- function(f) {
  tryCatch(.simplify_formula2(f), error = \(e) TRUE)
}

library(microbenchmark)
set.seed(111)
mb1 <- microbenchmark(test_1(tst), test_2(tst), test_3(tst), times = 3000)
mb2 <- microbenchmark(test_1(tst2), test_2(tst2), test_3(tst2), times = 3000)
mb3 <- microbenchmark(test_1(tst3), test_2(tst3), test_3(tst3), times = 3000)
mb4 <- microbenchmark(test_1(tst4), test_2(tst4), test_3(tst4), times = 3000)

mb1;mb2;mb3;mb4

Unit: microseconds
        expr   min    lq     mean median     uq     max neval
 test_1(tst) 142.9 150.3 196.2080  155.2 163.05 87013.7  3000
 test_2(tst) 147.9 154.7 180.3039  159.5 167.70  4260.9  3000
 test_3(tst)  31.8  34.9  41.5482   36.5  38.40  5217.0  3000
Unit: microseconds
         expr  min   lq      mean median    uq    max neval
 test_1(tst2) 90.7 97.0 111.18980   99.1 101.9 4042.5  3000
 test_2(tst2) 92.9 97.9 108.21883   99.9 102.8 3876.9  3000
 test_3(tst2) 31.9 34.7  37.92533   36.0  37.3 3828.6  3000
Unit: microseconds
         expr   min    lq     mean median     uq    max neval
 test_1(tst3) 114.3 121.2 136.9549  123.7 127.45 3937.6  3000
 test_2(tst3)  92.9  98.3 109.9073  100.4 103.40 4052.2  3000
 test_3(tst3)  32.2  34.8  37.7552   36.0  37.30 3680.9  3000
Unit: microseconds
         expr   min    lq     mean median    uq    max neval
 test_1(tst4) 209.5 220.5 254.6805  224.5 230.4 5924.3  3000
 test_2(tst4) 147.6 154.8 168.0258  157.4 161.5 4122.5  3000
 test_3(tst4)  31.7  35.1  38.2686   36.5  37.8 3835.9  3000
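The early-exit idea behind test_3 can be sketched as a single recursive pass that both removes function calls and stops at the first repeated operator. This is only an illustration under assumptions: simplify_or_stop_sketch, the .fn placeholder, and the error message are hypothetical and are not taken from .simplify_formula2.

```r
# Hypothetical sketch: simplify and validate in one pass, throwing an error
# at the first repeated operator instead of walking the rest of the tree.
simplify_or_stop_sketch <- function(f, ops = c("+", "-", "*", "|")) {
  seen <- character(0)  # operators encountered so far

  walk <- function(node) {
    if (!is.call(node)) return(node)  # symbols and constants are leaves

    fn_head <- node[[1]]
    keep <- is.symbol(fn_head) && (as.character(fn_head) %in% c("~", ops))
    if (!keep) return(quote(.fn))     # collapse ordinary function calls

    op <- as.character(fn_head)
    if (op %in% ops) {
      if (op %in% seen) {
        stop("Operator `", op, "` is used more than once", call. = FALSE)
      }
      seen <<- c(seen, op)
    }

    for (i in seq_along(node)[-1]) {
      node[[i]] <- walk(node[[i]])
    }
    node
  }

  walk(f)
}

# A well-formed formula is returned in simplified form
ok <- simplify_or_stop_sketch(y ~ f(1, 2) + 4 * 2)
deparse1(ok)
# "y ~ .fn + 4 * 2"

# A malformed formula raises an error, caught here the same way as in test_3
bad <- tryCatch(simplify_or_stop_sketch(y ~ f(1, 2) + 4 + 4),
                error = function(e) conditionMessage(e))
bad
# "Operator `+` is used more than once"
```

Since the error is raised during the walk, no separate check of the simplified formula is needed afterwards.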
