BATCH ANALYSIS OF HTML CODE IN A UNIX ENVIRONMENT WITH SAPID
Riccardo Magliocchetti, riccardo --at-- datahost --dot-- it
v. 1.0.1
Getting into the code of a software project that has been developed by various
people over the years isn't an easy task, since such code tends towards
inconsistency. Several factors contribute: people's programming habits and
level of knowledge, and the passage of time, which improves and refines
technology. This is especially true of international, long-term projects which,
thanks to free / open source software, are now quite common. Fortunately, the
process can be controlled by adhering to strict coding style guidelines and
W3C [1] standards. But in projects that are often done in contributors' free
time and that consist of hundreds of thousands of lines of code, it can be
difficult to apply these rules.
Recently I've been involved in a project called Seagull [2], which is a PHP
framework under the BSD license. Seagull's HTML templates were created by
several people and a common coding style was never fully enforced. On top of
that, the HTML was run through Tidy, which damaged the indentation even more.
The result was HTML that was not standards compliant and that had no common
indentation approach. Furthermore, the project had changed the doctype from HTML
4.0 to XHTML 1.0, generating even more validation errors, mostly missing
trailing slashes.
My goal was not only to fix the obvious errors, but to make the HTML code
clean and easily maintainable, which was only possible through the strict
adoption of standards and a sane use of CSS. So I needed the big picture, but
since no one could give me one I set out to solve the problem myself: I created
a small and simple script called Sapid [3] (Stats About Patterns In a
Directory) for collecting stats and grouping identical occurrences.
Sapid is a shell script that does only two things: it looks recursively in a
directory for occurrences of a pattern, and it collects stats about the usage
of the specified pattern. These actions are not performed at the same time: the
--stat option selects the second one, and in both cases the result is written
to standard output. Sapid is covered by a BSD-style licence.
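To make the first operation concrete, here is a minimal sketch that does not
use Sapid itself. It builds a scratch directory with two invented template
files (names and contents are only for illustration) and lists every line that
contains a pattern, recursively. It assumes a POSIX find and grep, plus
mktemp(1); the function name list_matches is hypothetical.

```shell
#!/bin/sh
# list_matches PATTERN DIRECTORY
# Print file:line:text for every line in DIRECTORY (searched recursively)
# that matches PATTERN. /dev/null makes grep print the file name even when
# find hands it a single file.
list_matches() {
    find "$2" -type f -exec grep -n -e "$1" /dev/null {} + | sort
}

# Scratch directory with invented sample templates
demo=$(mktemp -d) || exit 1
mkdir "$demo/sub"
printf '%s\n' '<p>' '<br>' '</p>' > "$demo/a.html"
printf '%s\n' '<br>' > "$demo/sub/b.html"

list_matches '<br>' "$demo"
rm -rf "$demo"
```

Each output line has the form file:line:text, so a match can be traced back to
the template it came from.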
Listing 1. sapid.sh version 0.8
#!/bin/sh
# Copyright (c) Riccardo Magliocchetti
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
# notice immediately at the beginning of the file, without modification,
# this list of conditions, and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
# OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
# sapid.sh - Stats About Patterns In a Directory
# Print a usage message and exit
help() {
    printf '%s\n' "sapid looks recursively for a pattern in a directory"
    printf '%s\n\n' "and displays the results on standard output."
    printf '%s\n' "USAGE: sapid.sh -p PATTERN [-d DIRECTORY] [--stat] [--help]"
    exit 1
}
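# Flag for statistics, default is off
STAT=0
# Default directory is current
DIR=$(pwd)
PATTERN=
# Consume the arguments one option at a time
while [ $# -gt 0 ]; do
    case "$1" in
        -d) [ $# -ge 2 ] || help ; DIR=$2 ; shift 2 ;;
        -p) [ $# -ge 2 ] || help ; PATTERN=$2 ; shift 2 ;;
        --stat) STAT=1 ; shift ;;
        --help) help ;;
        *) help ;;
    esac
done
# A pattern is mandatory
[ -n "$PATTERN" ] || help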
# Create the temporary file atomically: mktemp(1) avoids the race between
# choosing a name and creating the file
TMP=$(mktemp "${TMPDIR:-/tmp}/sapid.XXXXXX") || exit 1
# Remove the temporary file however the script ends
trap 'rm -f "$TMP"' EXIT
if [ "$STAT" -ne 1 ]; then
    # List every match as file:line:text; /dev/null forces the file name
    # to be printed even when find passes a single file to grep
    find "$DIR" -type f -exec grep -n -e "$PATTERN" /dev/null {} + > "$TMP"
    sort "$TMP"
else
    # Collect only the matching text, without file names
    find "$DIR" -type f -exec grep -h -e "$PATTERN" {} + > "$TMP"
    # Remove spaces at line beginning, sort, count duplicated lines,
    # sort in numerical reverse order
    sed "s/^ *//" "$TMP" | sort | uniq -c | sort -rn
fi
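The statistics pipeline at the end of the listing can be tried on its own. The
sketch below wraps it in a function (the name count_lines is hypothetical) and
feeds it a few invented lines, so it is an illustration of the technique rather
than a part of Sapid.

```shell
#!/bin/sh
# count_lines: read lines on standard input, strip leading spaces, and
# print each distinct line preceded by its number of occurrences, most
# frequent first.
count_lines() {
    sed 's/^ *//' | sort | uniq -c | sort -rn
}

# Invented sample input: the same tag written with different indentation
printf '%s\n' '  <br>' '<img src="a.png">' '<br>' '    <br>' | count_lines
```

With this input the three differently indented br lines are counted as one
entry with three occurrences, followed by the single img line. Stripping the
leading spaces first is what lets uniq treat them as duplicates.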
Sapid is not very useful all by itself; for it to be practical we must first
create a set of rules for it. Here is a little shell script (I know, make(1)
would be more convenient) that finds missing trailing slashes in some common
tags. Sapid could also be used for finding accessibility deficiencies such as
missing attributes.
Listing 2. missing_trailing_slashes.sh
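#!/bin/sh
SAPID="/path/to/sapid.sh"
DIR="bad/html/code"
TARGET="trailing_slashes"
# The patterns are assumptions: each one matches a tag that is closed
# without the trailing slash required by XHTML
"$SAPID" -d "$DIR" -p '<img[^>]*[^/]>' > "img.$TARGET" 2> /dev/null
"$SAPID" -d "$DIR" -p '<input[^>]*[^/]>' > "input.$TARGET" 2> /dev/null
"$SAPID" -d "$DIR" -p '<br>' > "br.$TARGET" 2> /dev/null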
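The accessibility checks mentioned earlier could follow the same approach. As a
hedged sketch, independent of Sapid and run on invented sample data, the
function below (missing_alt is a hypothetical name) lists img tags that carry
no alt attribute. It assumes that each tag sits on a single line, which real
templates may not guarantee.

```shell
#!/bin/sh
# missing_alt DIRECTORY
# Print file:line:text for every line that contains an <img tag but no
# alt= attribute. Assumes one tag per line.
missing_alt() {
    find "$1" -type f -exec grep -n -e '<img' /dev/null {} + |
        grep -v -e 'alt=' | sort
}

# Scratch directory with an invented sample page
demo=$(mktemp -d) || exit 1
printf '%s\n' '<img src="logo.png" alt="Logo" />' '<img src="spacer.gif" />' \
    > "$demo/page.html"
missing_alt "$demo"
rm -rf "$demo"
```

Only the second line of the sample page is reported, since the first one
already has an alt attribute.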
The other functionality of Sapid is collecting statistics. In the example we'll
see the usage of