3  Input and Output: The Conventional Way

Unfortunately it isn't always possible to format data in the S-expression way. Some people just haven't heard that S-expressions are superior to plain, unstructued text.5 Some people are paren-phobic. We may have to interact with data files that such people want to exchange with us. Whatever the reason, we may need to deal with plain text files. This section presents a first introduction to this topic.

3.1  Pre-1958 Input and Output: The Lazy Approach

Help Desk: read-line, printf, reading and printing

Before people discovered in 1958 that reading parenthesized data is far easier than reading plain text, people kept grade records such as these:

  Adam 78 88 69 66
  Brad 88 87 86 22
  Cath 99 88 88 90
  Dave 77 78 77 78
  Fawn 90 89 81 60
  Gege 67 78 81 85

Indeed, some people still do, and you may have to write a program that reads this kind of data. Ideally we could just wait for a genie to come around and to turn this stone-age form of data into something we can just read.

Of course, the genie exists. It is a Scheme program -- known as a pre-processor -- that adds a pair of parentheses around each line, an opening parenthesis at the top, and a closing parenthesis at the bottom of the file. In other words, we turn the above file into something such as this:

  (
  (Adam 78 88 69 66)
  (Brad 88 87 86 22)
  (Cath 99 88 88 90)
  (Dave 77 78 77 78)
  (Fawn 90 89 81 60)
  (Gege 67 78 81 85)
  )

The exact distribution of spaces is irrelevant, because the next program reads this file as an S-expression and white space plays no true role in S-expressions.

To design a function that adds something to each line of a file, we must provide something like a data definition for files. For the given example, we should think of a file as a sequence of lines whose length we don't know. This suggests the following description:

A file is either

  1. empty
  2. a line followed by a file.

This data definition is similar to the data definitions for lists and natural numbers. The difference is that a file is not a piece of Scheme data, and that we therefore don't have the usual functions, that is, constructors, selectors, and predicates. Still, there are functions that play similar roles, and they are specified in figure 11.

;; read-line : -> (union string eof)
;; to read a line from the default input device

;; eof-object? : _ -> boolean
;; to read a line from the default input device

;; printf : string internal-S-expression ... -> void
;; to print a series of Scheme values according to format

Figure 11:  Scheme's basic line-oriented procedures for input and output

Unlike cons, first, rest, cons?, the primitive functions for lists, or add1, sub1, and positive?, the primitive procedures for natural numbers, the functions in figure 11 are not mathematical functions. With the exception of eof-object?, the functions affect the state of our computer and/or produce different results for different uses. For an example, consider this expression:

(with-input-from-file "line-grades.dat"
  (lambda ()
    (list
      (read-line)
      (read-line)
      (read-line))))      

Assuming "line-grades.dat" refers to the line-oriented version of our grade file, this expression reads the first three lines of the file and produces a list with three strings:

(list "Adam 78 88 69 66"
      "Brad 88 87 86 22"
      "Cath 99 88 88 90")      

Also, read-line deals with the empty file, too. Once all lines have been read, it no longer produces a string but a unique Scheme value: the end-of-file object (or eof-object).

We can still design our file-processing programs in the style of ``How to Design Programs,'' but we must be careful to use read-line exactly once per line. Otherwise we proceed as always:

;; fun-for-file : -> ???
;; to read and process each line in a file
(define (fun-for-file)
  (let ([a-line (read-line)])
    (cond
      [(eof-object? a-line) ...]
      [else ... a-line ... (fun-for-file) ...])))

The template reads one line, names the string a-line, and then processes it depending on whether it represents the empty file or the first line of a file with additional content. The recursion in the second cond clause naturally corresponds to the self-reference in the data definition for files.

Working with this template as a starting point, we can quickly design the pre-processor. We start with a function that adds a pair of parentheses to each line in standard input and prints it to the standard output:

(define OPEN "(")
(define CLOSE ")")

;; add-parens-to-each-line : -> void
;; to read all lines from the standard input
;; to print each line surrounded by ( and )
(define (add-parens-to-each-line)
  (let ([a-line (read-line)])
    (cond 
      [(eof-object? a-line) (void)]
      [else (begin
	      ;; --- print "(", followed by the line, followed by ")"
              (display OPEN)
              (display a-line)
              (display CLOSE)
              ( newline)
	      ;; --- do the rest
              (add-parens-to-each-line))])))

All we need to finish the program is a function that adds the pair of parentheses to the top and bottom of the file. Since this is a unique action, we separate it from the above function:

;; add-parens-to-file : -> void
;; to add a pair of parens to each line in the standard input
;; to add a pair of parens around the entire file content
(define (add-parens-to-file)
  (begin
    ;; --- the opening parenthesis
    (display OPEN)
    ( newline)
    ;; --- process the file
    (for-each-line)
    ;; --- the closing parenthesis   
    (display CLOSE)
    ( newline)))

;; add-parens-to-file : -> void
;; effect: add ( ... ) around contents of stdin,
;; prints to stdout; calls add-parens-to-lines
(define (add-parens-to-file)
  (printf "~a~n" OPEN)
  (add-parens-to-lines)
  (printf "~a~n" CLOSE))

;; add-parens-to-lines : -> void
;; effect: reads all lines from stdin, puts ( ... ) around each
;; prints to stdout
(define (add-parens-to-lines)
  (let ((a-line (read-line)))
    (cond
      [(eof-object? a-line) (void)]
      [else (begin
              (printf "(~a)~n" a-line)
              (add-parens-to-lines))])))

Figure 12:  Transforming a line-based file into an S-expression

The two functions (and several others we have seen) contain s with several applications of output functions in a row. For example, to add a pair of parentheses around a string and to print that string as a line requires five function applications. Since this pattern occurs again and again, Scheme provides sveeral functions for printing sequences of Scheme values and formatting them according to some directive. The most important one is printf, which forms a string from a string and a number of Scheme values and prints it to the standard output. More precisely, its first argument is a formatting string and a variable number of arguments. A formatting string contains special sequences which printf replaces with the additional arguments.

Here are the three formatting sub-strings that are relevant to our problem:

ñ newline
ã insert the corresponding argument as a string without quotes
s insert the corresponding argument as an S-expression

For example:

> (printf "hello: ( ~a ) ~s ~n" "world" "world")

hello: ( world ) "world" 

The application inserts "world" twice into the format string, once without doublequotes and once with the quotes.

Using printf, we can write the pre-processor in a more concise manner: see figure 12. The core function is now a plain ``three-line program.''6 If we now compose add-parens-to-file with grader-say, with pipe from figure 7, we get a grader program for our grade files in plain text format. Of course, the output is still in S-expression format, but with printf this is easy to fix, too.


Exercises

Exercise 3.1.1.   Use pipe from figure 7 to compose add-parens-to-file with grader

Exercise 3.1.2.   Design a grader program for people who keep grades in plain text files. The result should be a program that reads plain text files like those used in this section and that prints the grades in a tabular form.

For example, when given the following file:

  Adam 78 88 69 66
  Brad 88 87 86 22
  Cath 99 88 88 90
  Dave 77 78 77 78
  Fawn 90 89 81 60
  Gege 67 78 81 85

the program should produce this file:

  Name | Average
  ------------
  Adam | 75.25
  Brad | 70.75
  Cath | 91.25
  Dave | 77.5
  Fawn | 80.0
  Gege | 77.75

Initially assume that all names have four letters. Then refine the program so that it can deal with names of arbitrary length. Hint: Append enough blank spaces to all strings that are shorter than the longest one. See part III or the Help Desk for more information on strings. 

Exercise 3.1.3.    


3.2  More on Plain Text Input/Output

Help Desk: string, regular expressions

On occasion it just isn't enough to read each line and to print it so that some other Scheme program can use read. One problem is that read was designed to read programs -- which are represented with external S-expressions -- and that certain keyboard characters mean something special in programs and are treated in a special way. We know, for example, that ``;'' is such a character. Others with special meaning are: period (``.''), comma (``,''), quote (``'''), backquote ```'', and hash (``#'').

Let's look at a specific example. Suppose some application produces files with the following format:

  { {5 , 12} 
    {3 , 4} 
    {0 , 0} }

We could trivially process the data in this file with a Scheme function if it didn't contain the commas. A single (read) would create an S-expression and we could go from here.

Thus the goal is to read each line, to replace the commas with blank spaces, and to print the line. Clearly, this function is just another instance of fun-for-file, though this time we need to process the string, not just print it with additional characters. Here is the solution:

;; commas-to-blanks : -> void
;; to replace all commas in a file with blank spaces
(define (commas-to-blanks)
  (let ([line (read-line)])
    (cond
      [(eof-object? line) (void)]
      [else (begin
              (display (regexp-replace* COMMA line SPACE))
              (commas-to-blanks))])))

(define COMMA ",")
(define SPACE " ")

The only new part of the definition is the underlined expression. It uses the function regexp-replace*, which searches all the occurrences of a string in another string and replaces it with a third string. The use here replaces all occurrences of ``,'' with ``'' in line. The result is a braced S-expression, which Scheme can read directly.


Exercise

Exercise 3.2.1.   Here is a function that computes the distance of a list of two-dimensional points to the origin:

;; point = (list num num)

;; distance : (listof point) -> (listof num)
(define (distance alop)
  (map (lambda (p) (sqrt (+ (square (first p)) (square (second p))))) 
       alop))

Develop a program that reads files of the following shape:

  { {num , num} 
    ...
    {num , num} }

and prints the list of distances to the origin. For the above example, the program would print

  (13 5 0)

Printing the list with parentheses instead of braces is acceptable. 


Consider the problem of recognizing proper HTTP headers and counting them according to their method type. An HTTP header has one the following shapes:

  1. GET ... HTTP/N...N.N...N
  2. POST ... HTTP/N...N.N...N
  3. HEAD ... HTTP/N...N.N...N
  4. PUT ... HTTP/N...N.N...N
  5. DELETE ... HTTP/N...N.N...N
  6. TRACE ... HTTP/N...N.N...N

The plain dots ... stand for a non-empty sequence of arbitrary characters and the N...N represents a non-empty sequence of digits. The first word determines the method type of the request.

Here the problem is not to replace one substring by another but to find one. For this purpose, we use another function from the regular expression world: regexp-match. In contrast to regexp-replace*, it merely searches for occurrences of certain substrings in a given string.

The example also demonstrates the need for recognizing not just a string within a string but a pattern. Unlike a plain string, a pattern contains characters with special meaning and it may match more than one string. For example, all three of the following matching attempst succeed

(regexp-match "1 [A-Za-z]* 2000" "1 November 2000")
(regexp-match "1 [A-Za-z]* 2000" "1 July 2000")
(regexp-match "1 [A-Za-z]* 2000" "1 January 2000")

because the pattern matches any string that contains a 1, followed by a sequence of the characters A ...Z or a ...z, followed by 2000. The regexp-match function indicates success with a list of matching substrings. If it fails to find a matching substring, it produces false.

In many cases, we also wish to know the subpattern that a string contains. We can do that by surrounding a subpattern with parentheses:

  (regexp-match "1 ([A-Za-z]*) 2000" "1 November 2000") 
= ("1 November 2000" "November")

  (regexp-match "1 ([A-Za-z]*) 2000" "1 July 2000")
= ("1 July 2000" "July")

  (regexp-match "1 ([A-Za-z]*) 2000" "1 January 2000")
= ("1 January 2000" "January")

If regexp-match succeeds now, it produces a list with a second item: the matching subpattern.

Using the pattern matching language, we can specify an HTTP header as follows:

(define HTTP-HEADER
  "^(GET|HEAD|POST|PUT|DELETE|TRACE) .+ HTTP/[0-9]+\\.[0-9]+\$")

This pattern uses several constructs from the regular expression pattern language. For a detailed explanation, see part III. Here we just need to know that HTTP-HEADER matches strings that

  1. start with one of these six strings: "GET", "HEAD", "POST", "PUT", "DELETE", or "TRACE";

  2. followed by an arbitrary sequence of characters that do not contain the string "HTTP";

  3. ending in "HTTP/" and two sequences of digits separated by a dot.

The string may not contain any additional characters.

Here are three examples of HTTP headers:

  1. "GET /UL/paintings/77.jpg HTTP/1.0"

  2. "POST / HTTP/1.0"

  3. "HEAD /UL/index.html HTTP/1.1"

They represent one 'get, a 'post, and a 'head request, respectively.

Based on this discusion of regexp-match and our knowledge of file processing, we can design and implement the HTTP classifying program. Figure 13 presents the result. The file processing follows the standard template. The counting routines use a hash table (see part [***]) to keep track of the counts. This may seem a superflous exercise because there are only six different headers, which we could count using six state variables (or six parameters for the inner read-log). Our design is, however, inherently extensible. To classify an additional kind of header, a change to HTTP-HEADER, the pattern describing headers, suffices. Any other program design would require additional changes.

;; Processing a server LOG file

;; read-log : -> (listof (list sym num))
;; to compute how many GET, PUT, POST, HEAD, DELETE, TRACE
;; and bad requests are in the standard input
(define (read-log)
  (local ((define (read-log)
	    (let ([a-line (read-line)])
	      (cond
		[(eof-object? a-line) (show)]
		[else (let ([r (regexp-match HTTP-HEADER a-line)])
			(begin
			  (count (if (pair? r) (stringsymbol (cadr r))) 'BAD)
			  (read-log)))]))))
    (begin
      (init)
      (read-log))))
                  
;; TABLE : (hash-table-of sym (box num))
;; to count how many times a symbol has been encountered
(define TABLE (make-hash-table))
                  
;; init : -> void
;; to set TABLE to a fresh hash-table
(define (init) (set! TABLE (make-hash-table)))
                  
;; count : sym -> void
;; to increment the counter for a or to add a with count 1
(define (count a)
  (let/ec escape
    (let ([r (hash-table-get TABLE a (lambda ()
				       (hash-table-put! TABLE a (box 1))
				       (escape)))])
      (set-box! r (+ (unbox r) 1)))))
                      
;; show : -> (listof (list sym num))
;; to turn TABLE into a list
(define (show)
  (hash-table-map TABLE (lambda (key count) (list key (unbox count)))))

Figure 13:  Classfying HTTP requests


Exercises

Exercise 3.2.2.   ... For example, if we wish to double the occurrences of ``', we modify the table as follows:

(define CONVERSION-TABLE
  '(("," " comma ")
    ("." " dot ")
    (";" " semi ")))

...

Exercise 3.2.3.   UNIX exercise

(system ``ls -l'')

create a ``.aux'' file for all .tex files and and



5 This is odd, because people saw this as early as 1958 (LISP: 1958-1972. RIP).

6 If we had used an unless expression, it would indeed be three lines long. See Help Desk for unless.