BATCH ANALYSIS OF HTML CODE IN A UNIX ENVIRONMENT WITH SAPID

Riccardo Magliocchetti, riccardo --at-- datahost --dot-- it
v. 1.0.1

Getting into the code of a software project that has been developed by various people over the years isn't an easy task, since such code tends to become inhomogeneous. Several factors need to be taken into consideration: people's programming habits and level of knowledge, and the passage of time, which improves and refines technology. This is especially true of international and long-term projects which, thanks to free / open source software, are now quite common.

Fortunately, the process can be controlled by adhering to strict coding style guidelines and W3C [1] standards. But in projects that are often done in contributors' free time and that consist of hundreds of thousands of lines of code, it can be difficult to apply these rules.

Recently I've been involved in a project called Seagull [2], a PHP framework released under the BSD license. Seagull's HTML templates were created by several people, and a common coding style was never fully enforced. On top of that, the HTML was run through Tidy, which destroyed the indentation even more. The result was HTML that was not standards compliant and had no common indentation approach. Furthermore, the project had changed the doctype from HTML 4.0 to XHTML 1.0, generating even more validation errors, mostly missing trailing slashes.

My goal was not only to fix the obvious errors, but to make the HTML code clean and easily maintainable, which was only possible through the strict adoption of standards and a sane use of CSS. So I needed the "big picture", but since no one could give me one I set out to solve the problem myself: I created a small and simple script called Sapid [3] (Stats About Patterns In a Directory) for collecting stats and grouping common entity usage.
Sapid is a shell script that does only two things: it looks recursively through a directory for occurrences of a pattern, and it collects stats about the usage of the specified pattern. These actions are not performed at the same time; each operation writes its results to standard output. Sapid is covered by a BSD-style licence.

Listing 1. sapid.sh version 0.8

    # Copyright (c) Riccardo Magliocchetti
    #
    # Redistribution and use in source and binary forms, with or without
    # modification, are permitted provided that the following conditions
    # are met:
    # 1. Redistributions of source code must retain the above copyright
    #    notice immediately at the beginning of the file, without modification,
    #    this list of conditions, and the following disclaimer.
    # 2. Redistributions in binary form must reproduce the above copyright
    #    notice, this list of conditions and the following disclaimer in the
    #    documentation and/or other materials provided with the distribution.
    # 3. The name of the author may not be used to endorse or promote products
    #    derived from this software without specific prior written permission.
    #
    # THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
    # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    # ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR
    # ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    # DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
    # OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
    # HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
    # OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    # SUCH DAMAGE.
    #!/bin/sh
    # sapid.sh - Stats About Patterns In a Directory

    help() {
        echo "sapid looks recursively for a pattern in a directory"
        echo "and displays the results on standard output."
        echo
        echo "USAGE: sapid.sh -p PATTERN [-d DIRECTORY] [--stat] [--help]"
        exit 1
    }

    # Flag for statistics, default is off
    STAT=0
    # Default directory is the current one
    DIR=$(pwd)
    PATTERN=

    # Consume the command line one option at a time
    while [ $# -gt 0 ]; do
        case "$1" in
            -d) DIR=$2 ; shift 2 ;;
            -p) PATTERN=$2 ; shift 2 ;;
            --stat) STAT=1 ; shift ;;
            --help) help ;;
            *) shift ;;
        esac
    done

    # Let mktemp(1) pick a unique name, in order to avoid a race
    # condition on the TMP file
    TMP=$(mktemp "${TMPDIR:-/tmp}/sapid.XXXXXX") || exit 1

    if [ "$STAT" -ne 1 ]; then
        # Search mode: file name, line number and matching line
        grep -rn -- "$PATTERN" "$DIR" >> "$TMP"
        sort -n "$TMP"
    else
        # Statistics mode: matching lines only, without file names
        grep -rh -- "$PATTERN" "$DIR" >> "$TMP"
        # Remove spaces at line beginning, sort, count duplicated lines,
        # sort in numerical reverse order
        sed "s/^ *//" "$TMP" | sort | uniq -c | sort -rn
    fi

    rm "$TMP"

Sapid is not very useful by itself; to make it practical we must first build a set of rules around it. Here is a little shell script (I know, make(1) would be more convenient) that finds missing trailing slashes in some common tags. Sapid could also be used to find accessibility deficiencies such as missing attributes.

Listing 2. missing_trailing_slashes.sh

    #!/bin/sh
    SAPID="/path/to/sapid.sh"
    DIR="bad/html/code"
    TARGET="trailing_slashes"

    $SAPID -d "$DIR" -p "" > img.$TARGET 2> /dev/null
    $SAPID -d "$DIR" -p "" > input.$TARGET 2> /dev/null
    $SAPID -d "$DIR" -p "" > br.$TARGET 2> /dev/null

The other function of Sapid is collecting statistics. In this example we'll look at the usage of the <table> tag in Seagull's HTML code.

Listing 3. table_usage.sh

    #!/bin/sh
    SAPID="/path/to/sapid.sh"
    DIR="www/themes/default"

    $SAPID -d "$DIR" -p "<table" --stat > table.stat 2> /dev/null

The figures below show the time the script needs to run on my workstation (Debian GNU/Linux, P3 1000 MHz, Ultra Wide SCSI) over Seagull's www/themes/default directory (400 files), followed by its output.

    riccardo@debbie:~/seagull$ time ./table_usage.sh

    real    0m0.090s
    user    0m0.034s
    sys     0m0.033s
    riccardo@debbie:~/seagull$ cat table.stat
         67
         37
         15
          1 var cteTxti = ' '
          1
          1
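Both modes can also be tried by hand, without Sapid itself, on a throwaway directory. What follows is only a minimal sketch and not part of Sapid: the sample files, the variable names and the img pattern are assumptions made for illustration (the pattern is just one plausible way to spot a missing trailing slash), and it assumes a grep that supports -r, as GNU grep does. The statistics pipeline is the one used in Listing 1.

```shell
#!/bin/sh
# Sketch only: the sample files and the img pattern are made up for
# illustration; they are not taken from Seagull or from Listing 2.

# Build a throwaway directory with two small templates
SAMPLE_DIR=$(mktemp -d "${TMPDIR:-/tmp}/sapid_demo.XXXXXX") || exit 1
mkdir "$SAMPLE_DIR/sub"
cat > "$SAMPLE_DIR/a.html" <<'EOF'
<table class="wide">
  <img src="logo.png">
  <img src="ok.png" />
</table>
EOF
cat > "$SAMPLE_DIR/sub/b.html" <<'EOF'
    <table class="wide">
<img src="photo.png" alt="photo">
</table>
<table>
EOF

# Hypothetical basic regular expression: an img tag whose last
# character before the closing ">" is not a slash
IMG_PATTERN='<img[^>]*[^/]>'

# Search mode: file name, line number and matching line
SEARCH_REPORT=$(grep -rn -- "$IMG_PATTERN" "$SAMPLE_DIR" | sort)

# Statistics mode: strip leading spaces, then count identical lines
# and list the most frequent first
STAT_REPORT=$(grep -rh -- '<table' "$SAMPLE_DIR" |
    sed 's/^ *//' | sort | uniq -c | sort -rn)

echo "$SEARCH_REPORT"
echo "$STAT_REPORT"

# Clean up the sample directory
rm -r "$SAMPLE_DIR"
```

On this sample, the search report lists the two img tags that lack a trailing slash and skips the one that has it, while the statistics report counts the `<table class="wide">` line twice, because the leading spaces are stripped before the lines are compared.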
Sapid is fast because it runs in batch mode and uses tools written in C. With Sapid we can check thousands of files and have a full report in a few seconds. Can you imagine how much time it would take to interactively check your pages via the W3C (or a similar) checker [4]? Another advantage is Sapid's ability to check embedded HTML code.

The script is portable because it uses a small subset of tools available on every GNU system, on Windows (TM) (under Cygwin [5]) and on other Unix-like systems too.

Sapid is a work in progress. Its major disadvantage compared to an interactive checker is its lack of awareness of syntax: it cannot perform syntactic or semantic checks on the code, it only does pattern matching with simple regular expressions.

LINKS:

1. World Wide Web Consortium: http://www.w3.org
2. Seagull framework: http://seagull.phpkitchen.com
3. Sapid homepage: http://foundmeone!
4. W3C (X)HTML validator: http://validator.w3.org
5. Cygwin: http://www.cygwin.com

Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.

The author would like to thank Demian Turner for the dedication in correcting his bad English.