### Forming the regular expression

Having analysed the dataset, we know the problems we are going to encounter, and we have already decided that each line of data presents two basic problems. The first is the driver and car; the second is separating the individual times. Let's solve the driver and car problem first.

 1 Richard Pinkett MG TA 00 76.52 74.73 73.56 72.92 1000 1000 74.69 75.23 72.92 74.69 147.61 5 131

We've decided we're not going to split the driver and car, so the problem is now much simpler. What we notice from this line of data is that the single driver/car field is surrounded by numbers. This is a satisfactory (if not perhaps good) way to decide where to put the commas. The trouble comes from the make, and particularly the model, of the car. Some of the car model names have numbers in them. How do we tell the difference between a model name with a number in it and the timing data?

Well, we're in luck; May the farce be with us! If you look at the data you'll see that the CLS field (the 00 in the sample line above) is immediately to the right of the model name for the car. The CLS field is very regular, and it's always composed of two digits. That's just what we need. A quick look through the whole dataset reveals that no car model name listed contains exactly two digits surrounded by spaces. We can use this information to identify the end of the car model name field.
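The "two digits surrounded by spaces" rule is easy to check. The following is a minimal Python sketch, assuming the sample line above with its leading indent removed; the variable names are illustrative only and are not part of the original walkthrough.

```python
import re

# Sample line from the dataset (leading indent removed).
line = ("1 Richard Pinkett MG TA 00 76.52 74.73 73.56 72.92 "
        "1000 1000 74.69 75.23 72.92 74.69 147.61 5 131")

# Exactly two digits with a space on either side: the CLS field.
cls_match = re.search(r"\x20\d\d\x20", line)

# Everything before the match is the position number plus driver/car.
before_cls = line[:cls_match.start()]
print(repr(before_cls))         # '1 Richard Pinkett MG TA'
print(repr(cls_match.group()))  # ' 00 '
```

Note that 1000 and 76.52 do not trigger the rule, because neither is exactly two digits with a space on each side.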

We've not really said anything yet about that digit all on its own at the beginning of the line. Suffice to say it's not a big problem. Regex is naturally greedy: it wants to select things. Normally you're trying to work out how to constrain it. The important thing is that we have a way to identify where it should stop selecting things, i.e. the two digits surrounded by spaces in the CLS field.

The first thing we want to do is to capture the first digit field on the line (the leading 1 in the sample line). To do this with regex is very simple. We use the following expression ^\d+\x20. The ^ simply mandates a match at the beginning of a line. When we look at the whole dataset we see that some lines start with two digits, so we must match one or perhaps more digits. This is done with \d+: the backslash d indicates digits, and the plus indicates one or more. The plus is greedy, so it will try to match as much as it can. We know that this field is always followed by a space. We can use this knowledge to terminate the greedy plus. This is the function of \x20. It terminates the repeat match with a space.
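As a quick check of ^\d+\x20, here is a small Python sketch. The second sample is a made-up line standing in for the two-digit case, since the text mentions such lines but does not show one.

```python
import re

pattern = re.compile(r"^\d+\x20")

samples = [
    "1 Richard Pinkett MG TA 00 76.52",  # from the sample line
    "12 Another Driver Car 00 80.00",    # hypothetical two-digit line
]

# The greedy \d+ takes every leading digit, then \x20 takes the space.
matched = [pattern.match(text).group() for text in samples]
print(matched)  # ['1 ', '12 ']
```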

The reason for using the hexadecimal value rather than an actual space character is that, unlike a space, it stands out. A space would be valid, but with \x20 it's easier to see what's going on. To choose the hexadecimal value we just looked up ASCII in a search engine. It returned links to character sets, and we were able to use these to identify the space character as hexadecimal 20. If you're unsure about this, ASCII isn't going to change and \x20 is space. It's pretty much the only hex character you'll want to use, so don't worry about it.
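If you would rather confirm the value than look it up, a few lines of Python will do it. This is only a sketch for reassurance, and the sample text is abbreviated.

```python
import re

# The ASCII space is character code 32, which is 20 in hexadecimal.
space_code = ord(" ")
print(space_code, hex(space_code))  # 32 0x20
print("\x20" == " ")                # True

# Inside a regular expression the two spellings select the same text.
with_hex = re.match(r"^\d+\x20", "1 Richard").group()
with_space = re.match(r"^\d+ ", "1 Richard").group()
print(with_hex == with_space)       # True
```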

So far, so good. But what we wanted to do was insert commas. What we're going to do is modify the previous expression to allow this. We extend the previous expression to form ^(\d+(?=\x20))\x20, from the original ^\d+\x20. This new expression selects exactly the same information as the old one did, but you've probably noticed that there are a few more brackets in the new expression. What we've done is to "mark" a part of the search expression such that we can use that part in the replacement expression.

In the replacement expression we can refer to any group (in brackets) by its reading order index. When we do this, we place whatever was matched in that group into the replacement string. What we're aiming to do with this change to the expression is to get the part of the expression that matches search text we want to keep into a bracketed group. Anything we want to change (the space) we want to keep out of the bracketed group.

When we perform the actual replacement, whatever was matched and in a bracketed group can be put back into the search text. Whatever is matched but not in a bracketed group, will not end up back in the search text. If we wish we can put a new character where the unbracketed token was removed. In our case this is the comma that replaces the space, and we do this part in the same way we would with a normal search and replace.
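To see what lands inside the group and what does not, the match can be inspected directly. This Python sketch uses an abbreviated sample line. In Python, group 0 is the whole match and group 1 is the first bracketed group; other tools may expose these differently.

```python
import re

line = "1 Richard Pinkett MG TA 00 76.52"
m = re.match(r"^(\d+(?=\x20))\x20", line)

whole = m.group(0)  # everything the expression selected: digits and space
kept = m.group(1)   # only what the first bracketed group matched
print(repr(whole))  # '1 '
print(repr(kept))   # '1'
```

The space is part of the selection but not part of group 1, so it is the only character a replacement of \1 would drop.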

Contrary to what has just been said, when looking at this new expression it is apparent that we now have two \x20's, and one of them is still in brackets. This has occurred because we are using an important technique to get the \x20 outside the bracket. Earlier we spoke of the \x20 terminating the greedy \d+. We still need this terminating action, but because of the replacement we don't want the \x20 captured in the brackets with \d+. The trouble is that when we simply move the \x20 outside the brackets, the group no longer states for itself where the \d+ should stop. In a case as simple as this one, most engines would still stop \d+ at the first non-digit, so ^(\d+)\x20 would select the same text. We nevertheless want each group to carry its own terminating condition, because the match text of each group is obtained separately for use in our replacement, and this matters more once a group holds something less simple than a run of digits.

The solution to this problem is the (?=\x20) that we see inside the first group. This is a special group, known as a positive lookahead, that tests for its content without making it part of the match. Although this group must find the \x20 in order to succeed, it does not capture any characters for inclusion in the replacement text. In addition, it does not consume those characters as a part of the match. This is why we now have the additional \x20 outside the first group. Both match the same character.
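The "matches but does not consume" behaviour can be observed by comparing where each match ends. This is a minimal Python sketch; the variable names are illustrative.

```python
import re

text = "1 Richard"

with_lookahead = re.match(r"\d+(?=\x20)", text)
without_lookahead = re.match(r"\d+\x20", text)

# The lookahead checks for the space but leaves it unconsumed.
print(repr(with_lookahead.group()), with_lookahead.end())        # '1' 1
print(repr(without_lookahead.group()), without_lookahead.end())  # '1 ' 2

# The lookahead must still succeed: with no space after the digits
# there is no match at all.
no_space = re.match(r"\d+(?=\x20)", "1Richard")
print(no_space)  # None
```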

The first \x20 matches invisibly inside the replacement group to terminate the digit find. The other matches visibly outside the replacement group. The digits and the space are selected. The digits are available as replacement text, but the space is not. When we create the replacement expression, \1,, we use it in conjunction with the search expression, ^(\d+(?=\x20))\x20. When a match is found the replacement expression is used. The \1 reinserts the digits found, and the , overwrites the space. 1 becomes 1,.
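Putting the search and replacement expressions together, here is a Python sketch of the whole step. It assumes Python's re.sub stands in for whatever search-and-replace tool you are using, and the second data line is hypothetical.

```python
import re

# Abbreviated sample line, plus a hypothetical two-digit line.
data = "1 Richard Pinkett MG TA 00 76.52\n12 Another Driver Car 00 80.00"

# re.MULTILINE makes ^ match at the start of every line, not just the first.
result = re.sub(r"^(\d+(?=\x20))\x20", r"\1,", data, flags=re.MULTILINE)
print(result)
# 1,Richard Pinkett MG TA 00 76.52
# 12,Another Driver Car 00 80.00
```

Only the first space on each line is replaced, because the expression is anchored to the start of the line.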

This technique is very useful, and usually crucial to the use of regular expressions for text processing. Although explaining it has perhaps been long-winded, its importance is great. Using this technique, we can match virtually anything and then choose what actually goes inside the groups we use to choose unmodified replacements. It means that we can think of matches as being spreadsheet cells, and only modify some smaller part of the cell data. More importantly, it leads the way to being able to develop the whole expression by parts, merely concatenating them once the whole expression is complete.
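To illustrate what developing by parts might look like, here is a Python sketch. Only the first part comes from the text above. The driver/car and CLS parts are assumptions made for the sake of the example, built with the same lookahead technique, and are not necessarily the expressions this walkthrough goes on to use.

```python
import re

line = ("1 Richard Pinkett MG TA 00 76.52 74.73 73.56 72.92 "
        "1000 1000 74.69 75.23 72.92 74.69 147.61 5 131")

position = r"^(\d+(?=\x20))\x20"           # from the text above
driver_car = r"(.+?(?=\x20\d\d\x20))\x20"  # assumption: lazy match up to CLS
cls_field = r"(\d\d(?=\x20))\x20"          # assumption: two digits, then space

# Each part keeps its data in a group and its separator outside it,
# so the parts can simply be concatenated.
search = position + driver_car + cls_field
result = re.sub(search, r"\1,\2,\3,", line)
print(result)
```

Each part can be tested on its own before it is joined to the others, which is the main attraction of working this way.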

It is important to understand this technique now, because we'll use it again, but not visit it in such detail.