dgsh's Introduction

The Directed Graph Shell (dgsh)

The directed graph shell, dgsh, allows the expressive specification of efficient pipelines for processing big data sets and streams, using existing Unix tools as well as custom-built components. It is a Unix-style shell that allows the specification of pipelines with non-linear scatter-gather operations. These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the operation's processing throughput.

You can find a complete introduction, reference documentation, and illustrated examples in the suite's web site.

See also a quick video overview and the associated open-access paper, "Extending Unix pipelines to DAGs", published in IEEE Transactions on Computers, 66(9):1547–1561, 2017.

dgsh's Issues

Explore suitability of sgsh for stream processing on data windows

#!/usr/bin/env sgsh -s /bin/bash
#
# SYNOPSIS Web log statistics
# DESCRIPTION
# Provides continuous statistics over web log stream data.
# Demonstrates stream processing.
# Provide as an argument either the name of a growing web log file
# or -s and a static web log file, which will be processed at a rate
# of about 100 lines per second.
#
#  Copyright 2013 Diomidis Spinellis
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
#

# Size of the window to report in seconds
WINDOW=10
WINDOW_OLD=$(expr $WINDOW \* 2)

# Update interval in seconds
UPDATE=2

# Print the sum of the numbers read from the standard input
sum()
{
    awk '{ sum += $1 } END {print sum}'
}

# Print the rate of change as a percentage
# between the first (old) and second (new) value
change()
{
    # Can't use bc, because we have numbers in scientific notation
    awk "END {OFMT=\"%.2f%%\"; print ($2 - $1) * 100 / $1}" </dev/null
}

if [ "$1" = "-s" ]
then
    # Simulate log lines coming from a file
    while read line
    do
        echo $line
        sleep 0.01
    done  <"$2"
else
    tail -f "$1"
fi |
scatter |{
    # Window of accessed pages
    -| awk -Winteractive '{print $7}' |store:page -b $WINDOW -u s

    # Get the bytes requested
    -| awk -Winteractive '{print $10}' |{
        # Store total number of bytes
        -| awk -Winteractive '{ s += $1; print s}' |store:total_bytes
        # Store total number of pages requested
        -| awk -Winteractive '{print ++n}' |store:total_pages
        # Window of bytes requested
        -||store:bytes -b $WINDOW -u s
        # Previous window of bytes requested
        -||store:bytes_old -b $WINDOW_OLD -e $WINDOW -u s
    |}

# Gather and print the results
|} gather |{
    # Produce periodic reports
    while :
    do
        WINDOW_PAGES=$(store:bytes -c | wc -l)
        WINDOW_BYTES=$(store:bytes -c | sum )
        WINDOW_PAGES_OLD=$(store:bytes_old -c | wc -l)
        WINDOW_BYTES_OLD=$(store:bytes_old -c | sum)
        clear
        cat <<EOF
Total
-----
Pages: $(store:total_pages -c)
Bytes: $(store:total_bytes -c)
Over last ${WINDOW}s
--------------------
Pages: $WINDOW_PAGES
Bytes: $WINDOW_BYTES
kBytes/s: $(awk "END {OFMT=\"%.0f\"; print $WINDOW_BYTES / $WINDOW / 1000}" </dev/null )
Top page: $(store:page -c | sort | uniq -c | sort -rn | head -1)
Change
------
Requests: $(change $WINDOW_PAGES_OLD $WINDOW_PAGES)
Data bytes: $(change $WINDOW_BYTES_OLD $WINDOW_BYTES)
EOF
    sleep $UPDATE
    done
|}
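
One detail of the change helper in the script above: awk applies OFMT only to non-integral values, so an exactly integral result is printed without the percent sign, and a previous value of zero makes the division fail. A possible variant, offered as a sketch rather than as the script's intended fix, uses printf and guards the zero case. The "n/a" placeholder is an assumption made for illustration.

```shell
# Print the rate of change as a percentage
# between the first (old) and second (new) value.
# Sketch only: the "n/a" output for a zero old value is an assumed convention.
change()
{
    awk -v old="$1" -v new="$2" 'BEGIN {
        if (old + 0 == 0)
            print "n/a"
        else
            printf "%.2f%%\n", (new - old) * 100 / old
    }'
}

change 200 250
change 3 4
change 0 5
```

Passing the values with -v keeps them out of the awk program text, so values in scientific notation are still read as numbers.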

Dgsh freezes when drawing a graph

The following command blocks

DRAW_EXIT=1 DGSH_DOT_DRAW=graphdot/compress-compare ./unix-dgsh-tools/bash/bash --dgsh example/compress-compare.sh </dev/null

Automatically wrap non-dgsh-compatible commands

Have the bash shell check whether a command is dgsh compatible when processing a dgsh pipeline.
If it isn't compatible, prepend dgsh-wrap to the command for execution.
As a result, bash will invoke dgsh-wrap which will first engage in a dgsh negotiation and then exec the passed command.
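
The issue does not say how compatibility would be detected. As a rough sketch, assuming a fixed list of natively compatible command names, the decision could look like the following. The DGSH_COMPATIBLE list and the wrap_if_needed helper are hypothetical and not part of dgsh. The function only prints the command line that would be run.

```shell
# Hypothetical list of commands assumed to negotiate natively.
DGSH_COMPATIBLE="cat tee paste sort comm join"

# Print the command line to execute: unchanged for a compatible
# command, prefixed with dgsh-wrap otherwise.
wrap_if_needed()
{
    local cmd c
    cmd=$(basename "$1")
    for c in $DGSH_COMPATIBLE
    do
        if [ "$c" = "$cmd" ]
        then
            printf '%s\n' "$*"
            return
        fi
    done
    printf '%s\n' "dgsh-wrap $*"
}

wrap_if_needed paste -d,
wrap_if_needed /usr/bin/uniq -c
```

A real implementation inside bash would need a more reliable test than a name list, for example some marker in the executable, but that mechanism is left open here.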

Fail to run simple script

Running sgsh on a file with the following contents produces no output.

{{
        echo hi &
        echo there &
}} | {{
        sed s/^/a/ &
        sed s/^/b/ &
}}

Same with

sgsh -c '{{ echo hi & echo there & }} | {{ sed s/^/a/ & sed s/^/b/ & }}'

Rationalize grep behavior

  • By default, list only the matching elements.
  • If exactly one of the options -L, -l, -v, -o, -c, etc. is specified, output only what that option specifies.
  • If --matching-lines is specified in addition to an option in the above list, also output the corresponding matching lines.
  • If more than one of the options -L, -l, -v, -o, -c, etc. is specified, generate one output per option.
  • If more than one -e option is specified, output one stream per option (do not implement this if the REs are internally merged into one).
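
To make the rules above concrete, here is a small sketch that counts the output streams a given option set would produce. It covers only the options named in the list and leaves out the conditional -e rule. The grep_outputs helper is hypothetical.

```shell
# Print the number of output streams implied by the given grep options,
# following the proposed rules (the -e rule is deliberately not modelled).
grep_outputs()
{
    local n=0 lines=0 opt
    for opt in "$@"
    do
        case "$opt" in
        -L|-l|-v|-o|-c) n=$((n + 1)) ;;
        --matching-lines) lines=1 ;;
        esac
    done
    if [ "$n" -eq 0 ]
    then
        # Default: only the matching elements
        echo 1
    else
        # One output per option, plus the matching lines if requested
        echo $((n + lines))
    fi
}

grep_outputs
grep_outputs -c
grep_outputs -c --matching-lines
grep_outputs -l -c -o
```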

Pipelines involving printf hang

The following command works fine.

dgsh -c 'echo hi | {{  cat & echo there & }} | cat'

The following command hangs.

dgsh -c 'echo hi | {{  cat & printf there & }} | cat'

Incorrect graph in nested dgsh invocations

When running example/5x5.sh with DGSH_DOT_DRAW the following file is created.

digraph {
        n0 [label="bash --dgsh-negotiate -c matrix"];
        n0 -> n1;
        n1 [label="scatter-gather"];
}
digraph {
        n0 [label="bash --dgsh --dgsh-negotiate /tmp/dgsh-parallel-31690"];
        n0 -> n1;
        n1 [label="scatter-gather"];
}
digraph {
        n0 [label="bash --dgsh --dgsh-negotiate /tmp/dgsh-parallel-31726"];
        n0 -> n1;
        n1 [label="paste"];
}
digraph {
        n0 [label="bash --dgsh --dgsh-negotiate /tmp/dgsh-parallel-31732"];
        n0 -> n1;
        n1 [label="paste"];
}
digraph {
        n0 [label="bash --dgsh --dgsh-negotiate /tmp/dgsh-parallel-31728"];
        n0 -> n1;
        n1 [label="paste"];
}
digraph {
        n0 [label="bash --dgsh --dgsh-negotiate /tmp/dgsh-parallel-31729"];
        n0 -> n1;
        n1 [label="paste"];
}
digraph {
        n0 [label="bash --dgsh --dgsh-negotiate /tmp/dgsh-parallel-31730"];
        n0 -> n1;
        n1 [label="paste"];
}

Problems with recursive invocations

Consider the following command

#!/usr/bin/env dgsh

row()
{
  dgsh -c "dgsh-parallel -n 5 'echo C{}' | paste"
}

matrix()
{
  dgsh -c "/home/dds/libexec/dgsh/dgsh-parallel -n 5 row"
}

export -f row matrix

call matrix | cat

  1. It won't work without the explicit specification of the dgsh-parallel path.
  2. With the absolute path it hangs after outputting the following.

[dds@stereo dgsh]$ example/5x5.sh
▒▒▒▒▒▒▒▒▒7▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒7/home/dds/libexec/dgsh/bash --dgsh --dgsh-negotiate /tmp/dgsh-parallel-14282

I realize that this can be fixed by attaching cat in the matrix statement, but we should somehow report an error and abort when dgsh commands have dangling I/O.

Document adapted tools

A notes column in the index.html file should suffice. What is needed is a one-sentence explanation on the allocation of input and output channels.

text-properties.sh, line 55, tee <$1 causes "ambiguous redirect"

The documentation for this example makes it clear that text-properties.sh
is meant to work for either one filename argument or none.
But I get an "ambiguous redirect" error if I do not give this script a filename
on the command line.

This change made it so that it works for me in both cases:

@@ -52,7 +52,7 @@
export -f ranked_frequency
export -f ngram

-tee <$1 |
+tee < "${1-/dev/fd/0}" |
{{
# Split input one word per line
tr -cs a-zA-Z \\n |

Or, you might prefer using 'cat $1 | ...'

-- Guy Shaw
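
For readers unfamiliar with the expansion used in the patch above, the following sketch shows how "${1-/dev/fd/0}" behaves. The input_source helper and the words.txt file name are made up for illustration.

```shell
# Print the input the script would read from: the first argument,
# or /dev/fd/0 (standard input) when no argument is given.
input_source()
{
    printf '%s\n' "${1-/dev/fd/0}"
}

input_source            # no argument: falls back to /dev/fd/0
input_source words.txt  # hypothetical file name: used as given
```

Note that the form without a colon substitutes the default only when the argument is unset. An explicitly empty argument is kept as is, whereas "${1:-/dev/fd/0}" would replace it too.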

Bug when running code-metrics.sh

The merge at 86cb776 introduces a tricky bug on Mac OS X (testing on other platforms is required) when executing examples/code-metrics.sh.
A dgsh-tee command suddenly exits the negotiation with an error, but no cause can be traced prior to this event.

Gracefully terminate negotiation when an error occurs

Consider the following example, which freezes, because cat never negotiates.

$ dgsh -c 'echo hi | cat -n'
getopt: invalid option -- 'n'
Usage: cat [-u] [file ...]

Here is a sketch of the proposed solution.

  • Modify dgsh_negotiate to set negotiation_completed to true
  • Add the following functions.
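#include <stdbool.h>
#include <stdlib.h>

/* Set to true by dgsh_negotiate once negotiation has finished */
bool negotiation_completed = false;

/* Called at process exit; on_exit() is a nonstandard glibc function */
static void
dgsh_on_exit_handler(int v, void *ptr)
{
  if (negotiation_completed)
    return;
  // Run negotiation indicating a failure
}

/* Runs before main() and registers the exit handler */
static __attribute__((constructor)) void
install_on_exit_handler(void)
{
  on_exit(dgsh_on_exit_handler, NULL);
}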
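
The same flag-plus-exit-hook pattern can be tried out in shell terms, which may help when reasoning about the proposal. This is only an analogy, not dgsh code: the run_with_guard helper is hypothetical, an EXIT trap stands in for on_exit, and the message is printed where the real handler would run the failure negotiation.

```shell
# Run a command in a subshell guarded by an EXIT trap.
# If the subshell exits before the flag is set, the trap reports a failure.
run_with_guard()
{
    (
        negotiation_completed=false
        trap '[ "$negotiation_completed" = true ] || echo "negotiation failed"' EXIT
        "$@" || exit 1
        negotiation_completed=true
    )
}

run_with_guard true             # completes: the trap stays silent
run_with_guard false || true    # exits early: the trap reports the failure
```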

Crash in nested dgsh execution

$ dgsh -c 'dgsh -c "{{ echo hi & echo there & }}" | paste'
/home/dds/libexec/dgsh/bash: line 1: 16110 Done                    dgsh -c "{{ echo hi & echo there & }}"
     16111 Segmentation fault      (core dumped) | paste

The second command (e.g. paste or cat) crashes.

DGSH_DOT_DRAW_EXIT ignores specified directory

The following command produces the output in the current directory.

DRAW_EXIT=1 DGSH_DOT_DRAW_EXIT=graphdot/reorder-columns.dot ./unix-dgsh-tools/bash/bash --dgsh example/reorder-columns.sh

Document branching for Unix tools

On which branch do modifications take place, and what is the procedure for merging upstream changes?

I would expect a dgsh branch where upstream changes are merged from master.

sgsh/bin/cat isn't plug compatible with /bin/cat

$ cat /etc/motd
Usage cat [-b size] [-i file] [-IMs] [-o file] [-m size] [-t char]
-b size Specify the size of the buffer to use (used for stress testing)
-f      Overflow buffered data into a temporary file
-I      Input-side buffering
-i file Gather input from specified file
-m size[k|M|G]  Specify the maximum buffer memory size
-M      Provide memory use statistics on termination
-o file Scatter output to specified file
-p d1[,d2...]   Permute inputs to specified outputs
-s      Scatter the input across the files, rather than copying it to all
-T dir  Specify directory for storing temporary file
-t char Process char-terminated records (newline default)

  • Should the sgsh path be set differently from PATH=/usr/local/sgsh/bin:$PATH?
  • Should the command exec /bin/cat in some cases?

Improve error reporting

Consider the following error report.

ERROR: No solution was found to satisfy the I/O requirements of the participating processes.dgsh-wrap: dgsh negotiation failed for /usr/bin/perl /home/dds/libexec/dgsh/dgsh-merge-sum with status code 4

dgsh-wrap: dgsh negotiation failed for /usr/bin/uniq -c with status code 4

dgsh-wrap: dgsh negotiation failed for /usr/bin/tr -s  \t\n\r\f with status code 4

dgsh-wrap: dgsh negotiation failed for /usr/bin/uniq -c with status code 4

dgsh-wrap: dgsh negotiation failed for /usr/bin/tr -s  \t\n\r\f with status code 4

dgsh-wrap: dgsh negotiation failed for /usr/bin/uniq -c with status code 4

dgsh-wrap: dgsh negotiation failed for /usr/bin/tr -s  \t\n\r\f with status code 4

dgsh-wrap: dgsh negotiation failed for /usr/bin/uniq -c with status code 4

dgsh-wrap: dgsh negotiation failed for /usr/bin/tr -s  \t\n\r\f with status code 4

dgsh-tee: dgsh negotiation failed with status code 4

The first line should offer more explanation. The remaining lines should not be output.

Dgsh hangs on error

$ git rev-parse HEAD
29aa232b9d48a4664a99e5b15ea8cc986fc8c34f

dgsh -c 'echo hi | {{ cat & cat & }} | cat'
ERROR: More than one edges are flexible. Cannot compute solution. Exiting.
ERROR: No solution was found to satisfy the I/O requirements of the following participating processes:
