Solid Fluid System Solutions  

Analysing the data to be processed

We've seen how to capture the data, but the crux of the task is forming the regular expression. Obviously all the lines in the data file are different, but when you look carefully they all follow a general format. Let's take just the first line, and have a look at that:

1 Richard Pinkett MG TA 00 76.52 74.73 73.56 72.92 1000 1000 74.69 75.23 72.92 74.69 147.61 5 131

At this point what we're looking to do is insert commas at strategic points, in order to define the cells on the line. On the face of it, we could just replace spaces with commas. We wouldn't need regex at all, but it wouldn't solve the problem completely, because we have some spaces where we don't want commas to be. Using that scheme we would end up dividing the name field and the car field into as many cells as we have words. In the end, we'd either have to manually remove some of those errant commas, or we'd end up with a whole host of cell manipulation in the spreadsheet. Using regex we shall submit to neither scourge.
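To make the problem concrete, here is a minimal Python sketch of the blanket replacement applied to the sample line above. Python is simply a convenient choice for illustration, and the variable names are invented here; the point is only to show where the errant commas land.

```python
# The sample line from the data file, split over two source lines for width.
line = ("1 Richard Pinkett MG TA 00 76.52 74.73 73.56 72.92 1000 1000 "
        "74.69 75.23 72.92 74.69 147.61 5 131")

# Blanket replacement: every space becomes a comma, no regex required.
naive = line.replace(" ", ",")
cells = naive.split(",")

print(naive)
print(len(cells))   # one cell per word or number on the line
print(cells[1:5])   # the driver and the car are each spread over several cells
```

The numeric cells come out exactly as wanted. The damage is confined to the text fields, where every word has become a cell of its own.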

The blanket replacement with commas is an important idea because, were it not for the text data, it would be a completely satisfactory solution. What it reveals is that the dissection problem for this particular data lies in the text fields; the numbers are easy. Focussing on the text fields reveals that the difficult part is going to be separating the driver from the manufacturer and model of car. The regular expression parser is sufficiently complex to handle this problem in a generic way, but the overhead for us in trying to devise the expression is large.

The immediate thought is, perhaps, to define a regular expression which already knows all the possible makes and models of car, and identifies the fields by their content. Regex could do this, but telling the parser all the possible makes and models of car is not satisfactory, because this is the sort of information we want from the spreadsheet we're trying to create. The same is true for the name field, except there are probably more combinations of name than models of car.

If we look back at the whole data, it reveals that the name field is actually more regular than the make and model of the car. In general the name field is made up of a forename and a surname, whereas the make and model of the car is really complex. Sometimes the manufacturer and model is three words, sometimes just one. Sometimes it's got digits in it. At this point we need to think about what we are trying to achieve. What we want is graphs. It's the numeric data we're after. The name of the driver, and the car driven are important, but only in the context of working back from the graph. We have to ask ourselves, do we really need to separate the name of the driver from the make and model of his car? In this case the answer is no.

At this point, I'm going to explain how to separate the driver from the car, just to illustrate the point. Critically, the idea of not doing so is just as important. Forming up the regular expression is a design process. You have to make tradeoffs. Everything is possible (within reason!) but here we'll see the law of diminishing returns in action. The harder we try, the more difficult it becomes to achieve the goal.

Well then, the best bet for the driver/car separation is to look at the name field as being a two part name. Since we know that the name starts in the second field of the line and consists of two fields, it's easy enough to tell regex to place a comma between the first and second fields, and between the third and fourth fields. What is not clear from our reduced example dataset is that one name of the 150 is actually made up of three words, i.e. a middle name was used. In another three cases hyphenated names were used. You really need an eagle eye to spot these things, and to decide how hard you want to try.
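The two part name idea can be sketched as follows. This is an illustration in Python rather than the expression developed in this series; the pattern, the name `TWO_PART_NAME` and the function `split_driver` are assumptions made for the example, and the second input line uses an invented three word name purely to show the failure.

```python
import re

# Assumed pattern: field 1, then exactly two space-separated name words.
TWO_PART_NAME = re.compile(r"^(\S+) (\S+ \S+) ")

def split_driver(text):
    """Put a comma after field 1 and after a two word name; leave the rest."""
    return TWO_PART_NAME.sub(r"\1,\2,", text, count=1)

line = ("1 Richard Pinkett MG TA 00 76.52 74.73 73.56 72.92 1000 1000 "
        "74.69 75.23 72.92 74.69 147.61 5 131")
print(split_driver(line))

# Hypothetical line with a middle name (not taken from the real data):
# the surname ends up in the car cell.
three_words = "7 Anne Mary Smith MG TA 00 76.52"
print(split_driver(three_words))
```

For the sample line the driver lands neatly in a cell of his own. For the invented three word name the comma falls after the second word, which is exactly the kind of case that would need fixing up by hand.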

It would be completely valid to go the extra mile and split the drivers from their cars. The best approach would then be to fix things up in these four cases once the data is in the spreadsheet. Indeed, it is probably possible to devise a regex which will cope with even these extreme cases. Having spotted these particular situations as potential difficulties, it would be easy enough to code the regular expression for one generic case plus these additional specific cases. For the purposes of this example, though, we're not going to go there.

Copyright © Solid Fluid 2007-2022
Last modified: SolFlu  Thu, 25 Jun 2009 19:31:26 GMT