SED: Pattern Matching Across More than 1 Line

Yes, this is something that a lot of people want to do (whether they realise it or not), because sed reads its input one line at a time and so the s/pattern/replacement/ command does not work if the string spans more than one line.

Example

Suppose we want to replace every instance of Microsoft Windows 95 with Linux (I mean, just replace the text!). Our first attempt is this:

s/Microsoft Windows 95/Linux/g

Unfortunately, the script fails if our file looks like this:

Microsoft
Windows 95

This is because neither line on its own matches the pattern Microsoft Windows 95.
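The failure is easy to reproduce from a shell. The following is only a small sketch: the variable names split_input and first_attempt are made up for the illustration, and the two-line input simply mirrors the sample file above.

```shell
# Sample input: the phrase is split across two lines.
split_input='Microsoft
Windows 95'

# The single-line substitution from the first attempt.
first_attempt=$(printf '%s\n' "$split_input" | sed 's/Microsoft Windows 95/Linux/g')

# sed sees "Microsoft" and "Windows 95" as two separate pattern spaces,
# so neither matches and the text comes back unchanged.
printf '%s\n' "$first_attempt"
```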

So we need to do better. We need the "multiline next" or N command.

The next command N appends the next line to the pattern space.

So our second attempt is this:

N
N
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g

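Now note that we have made reference to \t and \n. These are the tab and end-of-line characters respectively. (The \t escape is understood by GNU sed; other versions may need a literal tab character instead.) The end-of-line character only appears in the pattern space once more than one line has been read into it. In multiline patterns, it should also be noted that ^ and $ match the beginning and end of the pattern space.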
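As a quick check, here is the second attempt run against a file where the phrase is spread over exactly three lines. This is a sketch that assumes GNU sed (for \t and \n inside the brackets); the variable names three_lines and second_attempt are invented for the example.

```shell
# The phrase spread over exactly three lines.
three_lines='Microsoft
Windows
95'

# Two N commands pull lines 2 and 3 into the pattern space,
# which then holds "Microsoft\nWindows\n95" when s runs.
second_attempt=$(printf '%s\n' "$three_lines" | sed 'N
N
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g')

printf '%s\n' "$second_attempt"
```

With GNU sed this should print the single line Linux, because the embedded newlines are matched by the bracket expressions.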

The above is a start, but it breaks if we have a file that looks like this:

Foo
Microsoft
Windows
95

Why does it break? Let’s look at what the script does.

  1. First, it reads the line "Foo" into the pattern space.
  2. It sees the N command and appends line 2 to the pattern space. The pattern space now looks like:

Foo\nMicrosoft

  3. Executing the second N command, it reads line 3 into the pattern space. At this stage, the pattern space looks like this:

Foo\nMicrosoft\nWindows

  4. Now the script runs the substitute command on:

Foo\nMicrosoft\nWindows

This doesn’t match the search pattern, so no substitution is performed.

  5. Since the end of the script is reached, the contents of the pattern space are written to STDOUT, and the script starts again from its first command.
  6. The last line of the file, "95", is read into the pattern space.

This is the main error in the script: once the end of the script is reached, the next line read is the first line that *has not been read into the pattern space already*. It is NOT true that the Nth iteration of the script reads from the Nth line of the file.

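The first N command then fails, because there is no next line to append, and the script exits without writing "95" to STDOUT. (GNU sed is an exception: it prints the pattern space before exiting, unless POSIXLY_CORRECT is set.)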
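The walk-through can be replayed from a shell. This is a sketch assuming GNU sed for the \t and \n escapes; foo_input and trace_out are names chosen for the example only.

```shell
# Four-line sample file from the walk-through.
foo_input='Foo
Microsoft
Windows
95'

# Second attempt: N, N, then the substitution.
trace_out=$(printf '%s\n' "$foo_input" |
  sed -e 'N' -e 'N' -e 's/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g')

# The first cycle consumes Foo, Microsoft and Windows; "95" arrives
# alone in the second cycle, so the phrase is never seen as a whole.
printf '%s\n' "$trace_out"
```

No Linux appears in the output. Whether the trailing 95 is printed depends on the sed in use, as described above.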

So there are two things to be learned from this:

  • Each line of the file is read in exactly once. After you read a line into the pattern space, you cannot read it again.
  • It’s good practice to use $!N in place of N to avoid errors, since the N command doesn’t make sense on the last line of a file.
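A small illustration of $!N, separate from the Microsoft example: joining lines in pairs when the file has an odd number of lines. The three-letter input and the name joined are invented for this sketch.

```shell
# Three input lines: a, b, c. Join them in pairs.
# $!N means "run N on every line except the last", so the
# unpaired final line is still printed by every sed.
joined=$(printf 'a\nb\nc\n' | sed '$!N
s/\n/ /')

printf '%s\n' "$joined"
```

The first cycle prints "a b"; the second cycle starts on the last line, skips N, and prints "c" untouched.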

A better version is as follows:

/Microsoft[ \t]*$/{
N
}
/Microsoft[ \t\n]*Windows[ \t]*$/{
N
}
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g

This only reads in extra lines when necessary: a line is appended only if the pattern space ends in a partial match.
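Running the better version against the four-line file that defeated the second attempt gives a feel for the difference. Again a sketch assuming GNU sed; better_script, better_input and better_out are illustrative names.

```shell
# The improved script, kept in a variable for readability.
better_script='/Microsoft[ \t]*$/{
N
}
/Microsoft[ \t\n]*Windows[ \t]*$/{
N
}
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g'

better_input='Foo
Microsoft
Windows
95'

# "Foo" triggers no N, so it is printed on its own; the next cycle
# starts at "Microsoft" and pulls in "Windows" and "95" as needed.
better_out=$(printf '%s\n' "$better_input" | sed "$better_script")

printf '%s\n' "$better_out"
```

With GNU sed the output should be Foo followed by Linux.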

Example: removing text between matching pairs of delimiters

Suppose we want to eliminate all text enclosed by a matching pair of delimiters. This is a problem that comes up frequently, for example when removing HTML tags from HTML documents. We will use <angle brackets> in this example. So the task then is to eliminate anything between matching pairs of these brackets.

Our first attempt is shown as follows:

s/<[^>]*>//g

But this might break: the angle brackets might span more than one line, or there may be nested angle brackets. Actually, the latter is unlikely to happen if the HTML is correct. (It is only possible to nest angle brackets inside HTML comments.) But we will assume that it might happen anyway (since it makes the problem more fun). So here is the improved version.

:top
/<.*>/{
s/<[^<>]*>//g
t top
}
/</{
N
b top
}
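To see the improved version cope with a tag that spans two lines, here is a sketch using a made-up fragment of HTML. The names strip_script, html_input and stripped are assumptions for the example, and the behaviour described is what GNU sed does.

```shell
# The improved delimiter-stripping script.
strip_script=':top
/<.*>/{
s/<[^<>]*>//g
t top
}
/</{
N
b top
}'

# A tag whose opening bracket and closing bracket are on different lines.
html_input='one <a
href="x">two</a> three'

# Line 1 contains "<" but no complete pair, so N appends line 2 and
# the script branches back to :top, where both tags are removed.
stripped=$(printf '%s\n' "$html_input" | sed "$strip_script")

printf '%s\n' "$stripped"
```

The newline inside the tag is removed along with the tag itself, so the result is the single line "one two three".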

A fine point: why didn’t we replace the third line of the script with

s/<[^>]*>//g

and remove the t command that follows? Well, consider this sample file:

<<hello>

hello>

The desired output would be empty, since everything is enclosed in angle brackets. However, the output will look like this:

hello>

since the whole of the first line matches the expression <[^>]*> and is deleted, leaving the second line with an unmatched closing bracket. So the point is that we have set up the script to recursively remove the contents of the innermost matching pair of delimiters.
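The two variants can be compared side by side. This sketch assumes GNU sed and a sample file with no blank line between its two lines; nested_input, innermost_out and naive_out are names invented for the comparison.

```shell
# Nested brackets spanning two lines.
nested_input='<<hello>
hello>'

# Innermost-first version: [^<>] refuses to cross a nested "<".
innermost_out=$(printf '%s\n' "$nested_input" | sed ':top
/<.*>/{
s/<[^<>]*>//g
t top
}
/</{
N
b top
}')

# Naive variant: [^>] swallows the nested "<", and there is no t loop.
naive_out=$(printf '%s\n' "$nested_input" | sed ':top
/<.*>/{
s/<[^>]*>//g
}
/</{
N
b top
}')

printf 'innermost: [%s]\n' "$innermost_out"
printf 'naive:     [%s]\n' "$naive_out"
```

The innermost-first script leaves nothing but an empty line, while the naive variant empties the first line and then prints hello> from the second.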
