The esceval from mentalisttraceur

Old performance test scraps

Just jotting this down so that there's a record of it somewhere.

Years ago (pretty sure it was before COVID-19), I did some performance testing of different shell implementations of esceval.

This was the code I used. The way I grabbed the timing information is really bad practice if you need high precision (it only has one-second resolution), but with large enough test inputs like in this case, the actual implementation performance dominated, so it was fine. This was rather manual: I was basically just sourcing these definitions and then calling the functions.

setup_test_data()
{
 d1=`cat "$1"`
 d2="$d1 $d1"
 d4="$d2 $d2"
 d8="$d4 $d4"
 d16="$d8 $d8"
 d32="$d16 $d16"
 d64="$d32 $d32"
 d128="$d64 $d64"
 d256="$d128 $d128"
 d512="$d256 $d256"
 d1024="$d512 $d512"
 d2048="$d1024 $d1024"
}

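timed_test_run()
{
 # Usage: timed_test_run COMMAND OUTPUT_FILE [ARGUMENTS...]
 # Prints "COMMAND :: STATUS SECONDS :: STATUS SECONDS": first with the
 # output written to a file, then with it captured by command substitution.
 command=$1
 file=$2
 shift 2
 printf '%s :: ' "$command"

 start_time=`date +%s`
 "$command" "$@" >"$file"
 status=$?  # save now, before the next command overwrites $?
 end_time=`date +%s`
 printf '%s :: ' "$status $((end_time - start_time))"

 start_time=`date +%s`
 captured_output=`"$command" "$@"`
 status=$?
 end_time=`date +%s`
 printf '%s\n' "$status $((end_time - start_time))"
}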

esceval0()
{
 case $# in 0) return 0; esac
 (
  b='\\'
  while :
  do
   escaped=`
    printf '%s\n' "'$1" \
    | sed "
     s/'/'$b''/g
     1 s/^'$b''/'/
     $ s/$/'/
    "
   `
   shift
   case $# in 0) break; esac
   printf '%s ' "$escaped"
  done
  printf '%s\n' "$escaped"
 )
}

escevalp()
{
 sed "
  s/'/'\\\\''/g
  1 s/^/'/
  $ s/$/'/
 "
}

esceval1()
{
 case $# in 0) return 0; esac
 (
  set -e
  while :
  do
   escaped=`printf '%s\n' "$1" | escevalp`
   shift
   case $# in 0) break; esac
   printf '%s ' "$escaped"
  done
  printf '%s\n' "$escaped"
 )
}

esceval2()
{
 case $# in 0) return 0; esac
 (
  set -e
  while :
  do
   printf \'
   unescaped=$1
   while :
   do
    case $unescaped in
    *\'*)
     printf %s "${unescaped%%\'*}""'\''"
     unescaped=${unescaped#*\'}
    ;;
    *)
     break
    esac
   done
   printf %s "$unescaped"
   shift
   case $# in 0) break; esac
   printf "' "
  done
  printf "'\n"
 )
}

esceval3()
{
 case $# in 0) return 0; esac
 (
  set -e
  while :
  do
   escaped=\'
   unescaped=$1
   while :
   do
    case $unescaped in
    *\'*)
     escaped=$escaped${unescaped%%\'*}"'\''"
     unescaped=${unescaped#*\'}
     ;;
    *)
     break
    esac
   done
   escaped=$escaped$unescaped\'
   shift
   case $# in 0) break; esac
   printf '%s ' "$escaped"
  done
  printf '%s\n' "$escaped"
 )
}

esceval4()
{
 case $# in 0) return 0; esac
 (
  set -e
  escaped=\'
  while :
  do
   unescaped=$1
   while :
   do
    case $unescaped in
    *\'*)
     escaped=$escaped${unescaped%%\'*}"'\''"
     unescaped=${unescaped#*\'}
     ;;
    *)
     break
    esac
   done
   escaped=$escaped$unescaped"' '"
   shift
   case $# in 0) break; esac
  done
  escaped=${escaped%" '"}
  printf '%s\n' "$escaped"
 )
}

# Not POSIX: relies on ${parameter//pattern/replacement} (bash, zsh, ksh93).
esceval5()
{
 case $# in 0) return 0; esac
 (
  set -e
  while :
  do
   escaped=\'${1//\'/"'\''"}\'
   shift
   case $# in 0) break; esac
   printf "%s " "$escaped"
  done
  printf "%s\n" "$escaped"
 )
}

The random data files were created by reading from /dev/urandom. Note that if you just create a big data file from /dev/urandom, you can end up with a few very large shell words or very tiny ones, but on average you'll get words about 85 characters long. (With the default three IFS characters, roughly 3 out of every 256 bytes will be word-splitters for the shell.) So we might do head -c 4096 </dev/urandom >test-input.dat, and then setup_test_data 'test-input.dat'. But if you have small words, you can then create large words by quoting one of the d{{number}} variables (d1 through d2048) created by setup_test_data.
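The 85-character estimate can be sanity-checked by counting the default IFS bytes in a sample of random data. This is only a rough sketch, not part of the original tests: the file name sample.dat and the 65536-byte sample size are arbitrary choices made here, and it assumes a head that supports -c.

```sh
# Sketch: estimate the average shell word length in random data.
# sample.dat and the sample size are arbitrary, hypothetical choices.
size=65536
head -c "$size" </dev/urandom >sample.dat

# Count the bytes that are default IFS characters (space, tab, newline).
# LC_ALL=C keeps tr working byte-by-byte on binary input.
splitters=`LC_ALL=C tr -cd ' \t\n' <sample.dat | wc -c`
splitters=$((splitters + 0))  # drop padding that some wc implementations add

# About size * 3 / 256 bytes should be splitters, so the words between
# them should average somewhere around 256 / 3 bytes.
printf 'splitter bytes: %s (expected about %s)\n' \
 "$splitters" "$((size * 3 / 256))"
printf 'average bytes per word: about %s\n' "$((size / (splitters + 1)))"
```

The numbers vary from run to run because the input is random, and runs of adjacent splitter bytes collapse into a single word boundary, so treat the result as a ballpark figure.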

The biggest finding was that this basically doesn't matter, since the performance difference doesn't become significant until you're working with huge inputs - megabytes of data, etc. Even on a 600 MHz single-core ARMv7 CPU with 256 MiB RAM.

The most interesting finding was that somehow on zsh the naive sed-based implementation out-performed the shell's native variable substitution expansion (esceval0 vs esceval5), despite the opposite being true in bash. Or something like that, it's been years - I definitely remember almost questioning if zsh somehow optimized invoking sed or provided its own built-in of it, it was that surprising.

The obvious, expected finding was that there's an inflection point: for large enough words, the sed-based implementation starts beating the in-shell implementation. Process forking overhead dominates for small words, but once the word gets long enough, doing it in-shell spends a bunch of time mucking with entire copies of the word in memory, probably doing a bunch of allocations and deallocations and unoptimized string copies (all while looping at the level of the relatively unoptimized interpreter), while sed can stream the word and do it in constant space (all while looping in machine code).
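To look for that inflection point on a given machine, something like the following could be used. It is a sketch under assumptions, not the original test procedure: escape_with_sed and escape_in_shell are hypothetical single-word condensations of escevalp and esceval3 above, time_runs is a hypothetical helper, and the word sizes and repetition counts are arbitrary. It reuses the same coarse date +%s timing, so the repetition counts need raising until the differences exceed a second or so.

```sh
# Hypothetical single-word version of the sed approach (see escevalp).
escape_with_sed()
{
 printf '%s\n' "$1" | sed "
  s/'/'\\\\''/g
  1 s/^/'/
  \$ s/\$/'/
 "
}

# Hypothetical single-word version of the in-shell approach (see esceval3).
escape_in_shell()
{
 escaped=\'
 unescaped=$1
 while :
 do
  case $unescaped in
  *\'*)
   escaped=$escaped${unescaped%%\'*}"'\''"
   unescaped=${unescaped#*\'}
   ;;
  *)
   break
  esac
 done
 printf '%s\n' "$escaped$unescaped'"
}

# time_runs FUNCTION REPETITIONS WORD: whole seconds spent escaping WORD.
time_runs()
{
 start_time=`date +%s`
 i=0
 while [ "$i" -lt "$2" ]
 do
  "$1" "$3" >/dev/null
  i=$((i + 1))
 done
 end_time=`date +%s`
 printf '%s: %s s for %s runs on %s characters\n' \
  "$1" "$((end_time - start_time))" "$2" "${#3}"
}

small="it's"
big=$small
n=0
while [ "$n" -lt 10 ]  # ten doublings: 4 * 2^10 = 4096 characters
do
 big=$big$big
 n=$((n + 1))
done

# Many small words stress fork overhead; a few big words stress string copying.
time_runs escape_with_sed 200 "$small"
time_runs escape_in_shell 200 "$small"
time_runs escape_with_sed 3 "$big"
time_runs escape_in_shell 3 "$big"
```

Growing the number of doublings and comparing the two big-word lines should show where, if anywhere, the sed version overtakes the in-shell one for a particular shell.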

I don't remember what else I found, if anything. It was all micro-optimizations beyond that.
