[mu TECH] wc : faster, more gnarly

From: Alfie Costa (agcosta@gis.net)
Date: Sun Mar 19 2000 - 01:07:04 CET


Friends, Romans, ash-enthusiasts,

Attached is a slightly improved 'wc'. It's maybe 50-100% faster than my last
attempt because it only calls 'ls' once per run, instead of once for each file.
The code is a little different too, more on which later, but first some
questions...

1) How "compatible" should a 'mu' util be? I've been looking at the GNU 'wc'
to see how it does things, and the attached 'wc' (which I'll call 'mu-wc')
does two things differently:

a) GNU 'wc' sends an error message to standard error if you tell it to count a
directory, and then it keeps on going. The present 'mu-wc' just ignores
directories, because the output looks prettier that way.

b) GNU 'wc' sends an error message to standard error if you tell it to count a
filename that doesn't exist, and then keeps going. 'mu-wc' is stunned when it
reads a nonsense filename and then refuses to do any work unless it gets a
command line that's nicer. The theory was that this might be more convenient
for the real-time user.

It shouldn't be difficult to make 'mu-wc' handle a) and b) more like GNU 'wc'
however. Would this conformity be a good thing?
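
For what it's worth, here is a minimal sketch of how pass 3 might mimic GNU
'wc' on both counts: complain on standard error, then carry on. The function
name filter_files and the wording of the messages are invented for
illustration and are not part of the attached script; it also assumes that
filenames contain no newlines.

```shell
# Hypothetical helper, not in the attached script: print the countable
# arguments one per line and warn on stderr about everything else.
filter_files()
{
    for b in "$@"
    do
        if [ -d "$b" ]
        then
            echo "wc: $b: is a directory" >&2
        elif [ ! -r "$b" ]
        then
            echo "wc: $b: can't read" >&2
        else
            echo "$b"
        fi
    done
}

# Demo: one real file, one directory, one name that doesn't exist.
demo_dir=${TMPDIR:-/tmp}/wcdemo.$$
mkdir "$demo_dir"
echo hello > "$demo_dir/real.txt"
kept=`filter_files "$demo_dir/real.txt" "$demo_dir" "$demo_dir/missing" 2>/dev/null`
echo "kept: $kept"
rm -r "$demo_dir"
```

Only the readable regular file survives; the other two names would each
produce one warning if stderr weren't thrown away in the demo.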

2) When binary files are the input, GNU 'wc' gives better numbers for counting
words and lines. 'mu-wc' uses an awk routine to do this, and awk only seems to
handle text correctly; so 'mu-wc' gives erroneous counts for binary files.
Perhaps this doesn't need fixing, as nobody should want to do word counts on
binary files. Still, it wouldn't be a bad thing to fix this, assuming there's
an easy solution, such as a clever 'sed' script. Does anyone have such a 'sed'
script?

Notes on the code:

It now takes three passes to parse the command line. First pass reads and sets
the options. Second pass deletes the options. Just after that, a hyphen is
added if there's nothing left. Third pass is a workhorse: it looks for a
hyphen, which of course means stdIO; it checks if the files are readable, and
if they're directories. (Last time this was done in the main loop.) The
directories it forgets, and the unreadable files make it give up.

Several variables weren't really necessary: the flags $checkW $checkL $checkC
and $NoFiles, so they're gone. This should make the logic simpler because
there are fewer variables, but maybe not. That is, the variables are gone, but
the flags aren't; $c does double duty as its own flag, etc., which works well.
On the other hand, some schools of programming preach that such variable re-use
is a sin of vainglory.
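
To make the double duty concrete, here is a small sketch. The function name
'report' and the count 1234 are made up for the demo; only the idea (test the
variable for non-emptiness, then overwrite it with the real count) comes from
the script.

```shell
# Hypothetical demo: $c is first the "count chars?" flag, later the count.
report()
{
    c=$1
    if [ "$c" ]            # any non-empty value, even "0", means "wanted"
    then
        c=1234             # the flag is now overwritten with a count
        echo "chars: $c"
    else
        echo "chars: not requested"
    fi
}

with_flag=`report 0`       # as if -c had set c=0
without_flag=`report ''`   # as if -c was never given
echo "$with_flag / $without_flag"
```

The trick relies on the test being for an empty string rather than for zero:
a zero-length file leaves "0" in $c, which still counts as "flag set" for the
next file.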

Before the main loop this next bit does an 'ls -Uo' (the '-Uo' means: detailed
and unsorted; it has to be unsorted so the sizes come out in the same order as
the files on the command line) and creates a space-separated list of the file
sizes, which is stored in $charlist. It only does this if $c isn't null, so
there's $c acting as its own flag...

        # get list of char counts? Kludge to call 'ls' only once...
        [ "$c" ] && charlist=`ls -Uo "$@" | awk '{ printf "%s ", $4 }'`
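
A quick way to see why the order matters is to try it on two throwaway files
whose names sort the "wrong" way round. This is only a sketch: the wrapper
name size_list and the demo files are invented here, and it assumes a GNU-style
'ls' where '-U' leaves command-line arguments in the order given and '-o' puts
the size in field 4.

```shell
# Hypothetical wrapper around the one-liner from the post.
size_list()
{
    ls -Uo "$@" | awk '{ printf "%s ", $4 }'
}

demo_dir=${TMPDIR:-/tmp}/lsdemo.$$
mkdir "$demo_dir"
printf 'abc' > "$demo_dir/zebra"      # 3 bytes
printf 'hello' > "$demo_dir/apple"    # 5 bytes

# zebra is named first, so its size has to come first in the list
charlist=`size_list "$demo_dir/zebra" "$demo_dir/apple"`
echo "sizes: $charlist"
rm -r "$demo_dir"
```

Without '-U', 'ls' would sort apple ahead of zebra and the sizes would be
handed to the wrong files in the main loop.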

The main loop is simpler than last time. The file checks are now done in the
parsing phase, which helps. Here's another improvement:

    if [ "$f" = "$stdIOfile" ] # stdIO?
    then filename=$hyphen # display a hyphen or not?
    else filename="$f"
    fi

The flag $hyphen is now either "-" or null.

There's less use of 'awk' and more of 'set' now. It turns out 'set' is faster.

BEFORE:
        w=`echo $TmpWL | awk '{ print $2 }'`
AFTER:
        w=`(set $TmpWL ; echo $2)`

The parentheses start a subshell, which is needed; otherwise 'set' would
defile the main loop's command line.
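
Here is a small self-contained check of that claim. It is wrapped in a
function (demo_set, a name invented for this sketch) so that the function's
own arguments stand in for the script's command line; the value of $TmpWL is
a stand-in for what awk would print.

```shell
demo_set()
{
    TmpWL="12 345"                      # stand-in for awk's "lines words"
    w=`(set $TmpWL ; echo $2)`          # field 2, picked out by 'set'
    lines=`(set $TmpWL ; echo $1)`      # field 1
    echo "$lines $w $#"                 # $# still counts the caller's args
}

result=`demo_set one two three`
echo "$result"
```

The last number shows that the three original arguments are still in place
after both 'set' calls, because each 'set' only changed the subshell's copy.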

That's about all that's new in the main loop.

Also scattered about is some ugly logical shorthand that may be interesting.
Consider this:

        cat > $stdIOfile || Bail # if 'cat' fails, quit.

The ash '||' (logical OR) runs its right-hand command only if the left-hand
one fails, so it works the same as an IF...NOT would, if ash had one. The last
bit might be written like this:

        cat > $stdIOfile
        if [ "$?" != "0" ] # did 'cat' return an error?
        then Bail
        fi

Or without the '!=' like:

        cat > $stdIOfile
        if [ "$?" = "0" ] # did 'cat' succeed?
        then # do nothing if so.
        else Bail
        fi
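
A tiny runnable version of the same shorthand, for anyone who wants to poke at
it. The helper name step_result is invented for the demo, and instead of
calling Bail (which would exit) it just prints what happened.

```shell
# Hypothetical demo helper: run a command, say whether it succeeded.
step_result()
{
    "$@" 2>/dev/null || { echo "bailed" ; return ; }   # only on failure
    echo "ok"
}

good=`step_result true`
bad=`step_result false`
echo "$good $bad"
```

The braces group two commands on the right of the '||', which is handy when
the failure branch needs more than a single call.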

Here's a small change...

        eval set dummyoption ${opts:--}
        shift # remove dummyoption

Last time it was just '$opts', not '${opts:--}'. The new version saves an
if...then. If $opts is null, the expansion yields a "-" instead, and the
'dummyoption' protects the easily confused 'set' from that hyphen.
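
Both cases can be tried in isolation with the sketch below. The function name
reparse is made up, and the quoted string passed in the second call imitates
what pass 2 builds up in $opts.

```shell
# Hypothetical demo: rebuild the argument list from $opts, or from a
# lone hyphen if $opts is empty, then report "count:first-argument".
reparse()
{
    opts=$1
    eval set dummyoption ${opts:--}
    shift                      # remove dummyoption
    echo "$#:$1"
}

no_files=`reparse ''`               # nothing left after the options
two_files=`reparse ' "a b" "c"'`    # two names, one with a space
echo "$no_files / $two_files"
```

With an empty $opts the new command line is a single hyphen; otherwise the
'eval' re-reads the embedded quotes, so a name with a space in it stays in one
piece.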

Well, that's it for now...


#!/bin/ash
# rustique wc (3/17/00 by A. Costa)
# writes a temp file to count stdIO chars, uses awk and ls...
# (NB: Currently formatted to 4 spaces per tab.)

# Functions

Help()
{
echo "Usage: wc [-clw | -a] [filename]"
exit
}

CleanUp() # get rid of temp files if necessary
{
[ -w "$stdIOfile" ] && rm "$stdIOfile" 2>/dev/null
}

Bail()
{
CleanUp
exit 2
}

ShowLine() # syntax: ShowLine lines# words# chars# filename
{
echo $1:$2:$3:"$4" | awk -F: '{printf "%7s%10s%12s %s\n", $1, $2, $3, $4}'
}

CheckHyphen()
{
if [ "$hyphen" ] # Chastise user?...
then
    echo "error: only one stdIO hyphen allowed." >& 2
    Bail
fi
hyphen="-"
}

#Parse options...

for b in "$@" # Pass 1, get options, wherever they are...
do
    case "Z$b" in
        Z-) CheckHyphen;;
        Z-d) set -x;; # debug mode
        Z-a) c=0 lines=0 w=0;;
        Z-c) c=0;;
        Z-w) w=0;;
        Z-l) lines=0;;
        Z-cw|Z-wc) c=0 w=0;;
        Z-cl|Z-lc) c=0 lines=0;;
        Z-lw|Z-wl) lines=0 w=0;;
        Z-h|Z-?*) Help ;;
        Z*) ;;
    esac
done

# no flagged options? Then set 'em to default...
[ -z "$c$w$lines" ] && c=0 lines=0 w=0

for b in "$@" # Pass 2, remove all options from command line...
do
    case "Z$b" in
        Z-?*) ;;
        Z*) opts="$opts"" "\""$b"\" ;; # for vfat filenames with spaces
    esac
done

# 'eval' is needed to parse $opts. If $opts is null, make it a hyphen
# for stdio, and use dummyoption to protect this hyphen from 'set'.
eval set dummyoption ${opts:--} # new commandline has no switches...
shift # remove dummyoption

unset opts
for b in "$@" # Pass 3, replace hyphen, proofread filenames
do
    case "Z$b" in
        Z-) trap 'Bail' 1 2 3 15
            stdIOfile=/tmp/$$RusticWc.tmp
            cat > $stdIOfile || Bail # if 'cat' fails, quit.
            b=$stdIOfile ;;
        Z*) if [ ! -r "$b" ] # is the file readable?
            then
                echo "error: can't read \"$b\" " >& 2
                Bail
            fi
            [ -d "$b" ] && continue ;; # skip any directories...
    esac
    opts="$opts"" "\""$b"\"
done

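eval set $opts

# start the running totals at zero; some 'expr' versions reject an empty operand
cSum=0 wSum=0 linesSum=0 n=0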

# get list of char counts? Kludge to call 'ls' only once...
[ "$c" ] && charlist=`ls -Uo "$@" | awk '{ printf "%s ", $4 }'`

for f in "$@"
do
    if [ "$f" = "$stdIOfile" ] # stdIO?
    then filename=$hyphen # display a hyphen or not?
    else filename="$f"
    fi

    if [ $c ] # get how many chars it is?
    then
        c=`(set $charlist ; echo $1)`
        cSum=`expr "$cSum" + $c`
        charlist=`(set $charlist ; shift ; echo "$@")`
    fi

    # check words and lines.
    # this first "if-then" is a wrapper, so awk is only called once per file.
    if [ $w$lines ]
    then
        TmpWL=`awk 'BEGIN { w=0 } { w+=NF } END { print NR, w }' < "$f"`

        if [ $w ]
        then
            w=`(set $TmpWL ; echo $2)`
            wSum=`expr "$wSum" + $w`
        fi

        if [ $lines ]
        then
            lines=`(set $TmpWL ; echo $1)`
            linesSum=`expr "$linesSum" + $lines`
        fi
    fi
 
    ShowLine "$lines" "$w" "$c" "$filename"
    n=`expr "$n" + 1`
done

[ "$n" -gt "1" ] && ShowLine "$linesSum" "$wSum" "$cSum" total

CleanUp




