Solid Fluid System Solutions  

Extending the regular expression

Up to this point we've shown how to separate the red, green and blue fields with commas, using the following expressions:

  • Search expression - ^(\d+(?=\x20))\x20([\d\w\s\-()]+(?=\x20\d{2}\x20))\x20
  • Replace expression - \1,\2,
1 Richard Pinkett MG TA 00 76.52 74.73 73.56 72.92 1000 1000 74.69 75.23 72.92 74.69 147.61 5 131

We've already considered the general approach to the digit fields (in black) right at the beginning of this exercise. At the time we considered that a good (if not perfect) approach would be to simply replace the spaces. The problem is that now we have partially processed the file, some of our comma separated fields have spaces within them. We could simply replace spaces with commas, but our existing cell definitions would then be broken into separate fields.

This, then, is the clue as to how to proceed. If we can find the last comma on each line, all we need to do is find the next space after it and replace that space with a comma. We implement this behaviour as follows:

  • Search expression - (.+,.+?(?=\x20))\x20

Having come this far with us, we're going to assume that you're beginning to get the hang of all this.

This expression works by saying "Find anything, as many times as possible, terminated by a comma. Then find anything, as few times as possible, up to but not including a space, and capture all of it so that it can be put back. Finally, match the space itself outside the group, so that it may be replaced.". It's a bit of a leap. The only important thing that we've not already discussed is the question mark following the second plus. The first plus is greedy, so on each test it will select anything it can until it finds the last comma. At this point what we want to do is gobble up the digit field that we know follows the comma. We can do this by terminating with a space, but the problem is that there will be more than one digit field followed by a space. If the second plus were greedy as well, it would simply gobble up every digit field up to the last space on the line.

The question mark acts on the second plus, to make it lazy. This ensures that it will only capture the first digit field. As usual, the space is declared as a metatoken, and then matched formally outside the group so that it may be replaced. The important thing to notice here is that we've not really considered where the match begins. It's quite likely that the match in the first bracket is quite large, possibly almost the whole line. This matters because, with a really big operation, this isn't a very efficient approach. There are other ways of implementing this method, but this is sufficient for our need.
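
The difference the question mark makes can be seen by running both forms against a partially processed line. This is a minimal sketch, again assuming Python's re module rather than the editor used here; the greedy variant is shown only for comparison and is not part of the method.

```python
import re

# A partially processed line, shortened for brevity (assumption).
line = "1,Richard Pinkett MG TA,00 76.52 74.73"

# Lazy second plus: stops at the first space after the last comma.
lazy = re.sub(r"(.+,.+?(?=\x20))\x20", r"\1,", line)

# Greedy second plus: runs on to the last space on the line.
greedy = re.sub(r"(.+,.+(?=\x20))\x20", r"\1,", line)

print(lazy)    # 1,Richard Pinkett MG TA,00,76.52 74.73
print(greedy)  # 1,Richard Pinkett MG TA,00 76.52,74.73
```

The lazy form separates the next column, while the greedy form puts its comma in front of the final field and leaves the earlier spaces behind.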

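If we use this expression for replacement then, again, we just use the simple replacement expression \1,. When it runs as a "Replace All" operation, it replaces the first space after the last comma on every line throughout the document, so each time the expression runs another column has commas inserted. The .csv format we are trying to create treats each line end as a row, and the dot does not normally match a line end, so we don't have to worry about cells from other rows appearing on the current one. We simply rerun the expression until every digit field has been separated; for the sample data that is 13 runs, one for each space between the 14 digit fields. Lines with fewer cells run out of spaces sooner, but beware that they are not simply skipped from then on. Greediness is only a preference: once no space remains after the last comma, the expression is still free to match from an earlier comma, and it will then split a field that contains spaces, such as the name. For that reason, stop as soon as the digit fields are complete, rather than running until no more replacements are made.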
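
The repeated runs can be sketched as a loop. The following assumes Python's re module, and the helper name to_csv_row is hypothetical. One caveat is worth guarding against: when no space is left after the last comma, the expression can still match from an earlier comma and split the name field, so this sketch checks the text after the last comma before each pass instead of looping until nothing changes.

```python
import re

FIRST = re.compile(r"^(\d+(?=\x20))\x20([\d\w\s\-()]+(?=\x20\d{2}\x20))\x20")
NEXT_SPACE = re.compile(r"(.+,.+?(?=\x20))\x20")


def to_csv_row(line: str) -> str:
    """Hypothetical helper: convert one space separated line to a .csv row."""
    row = FIRST.sub(r"\1,\2,", line)
    # Continue only while a space remains after the last comma; otherwise
    # the expression would fall back to an earlier comma and split the name.
    while "," in row and " " in row.rsplit(",", 1)[1]:
        new_row = NEXT_SPACE.sub(r"\1,", row, count=1)
        if new_row == row:  # nothing replaced, so stop rather than loop forever
            break
        row = new_row
    return row


source = ("1 Richard Pinkett MG TA 00 76.52 74.73 73.56 72.92 1000 1000 "
          "74.69 75.23 72.92 74.69 147.61 5 131")
print(to_csv_row(source))
```

Because the check is made per line, rows with fewer cells simply finish earlier, which is the behaviour the repeated "Replace All" runs are aiming for.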

1,Richard Pinkett MG TA,00,76.52,74.73,73.56,72.92,1000,1000,74.69,75.23,72.92,74.69,147.61,5,131
2,Ian Anderson BL Mini GT,00,1000,67.36,66.09,64.88,68.15,67.69,67.46,67.33,64.88,67.33,132.21,4,129
3,Rob Choules Suzuki Swift Gti,00,1000,61.69,61.85,61.44,64.07,61.36,62.87,61.85,61.44,61.85,123.29,2,120
4,James Tapner Peugeot 106 Rallye,00,1000,60.66,59.43,59.26,63.6,62.14,63.34,61.81,59.26,61.81,121.07,1,116
5,Andy Thomas Rover Metro Gti,00,64.93,63.68,63.32,63.32,66.36,65.6,65.53,65.4,63.32,65.4,128.72,3,126
10,Stephen Biggs VW Golf Gti,01,65.23,66.26,62.38,60.45,64.18,64.33,65.13,71.68,60.45,65.13,125.58,6,125
11,Dave Penycate Volkswagen Golf Gti,01,61.68,61.09,61.98,61.26,65.75,64.86,64.62,62.89,61.26,62.89,124.15,5,122
12,Andrew Till MG ZR160,01,55.51,55.09,56.86,54.83,60.21,58.6,57.78,57.47,54.83,57.47,112.3,2,85
14,Jeremy Parker Honda S2000,01,52.82,51.19,51.27,51.84,57.17,57.37,56.77,55.97,51.27,55.97,107.24,1,55
15,Vicki Lawrence Nissan Sunny Gti,01,58.31,56.45,57.17,57.84,63.29,61.14,60.47,60.37,57.17,60.37,117.54,3,108
16,Peter Lawrence Nissan Sunny Gti,01,59.78,59.78,59.07,59.59,61.42,62.27,61.4,61.62,59.07,61.4,120.47,4,113
21,Tim Cole Mini Cooper,02,55.43,54.45,54.23,53.76,57.05,56.25,55.83,55.91,53.76,55.83,109.59,1,68
22,Nigel Patten Renault 8 Gordini,02,66.03,63.35,62.2,61.38,65.2,63.35,63.66,63.23,61.38,63.23,124.61,2,123
29,Lee Whittaker Subaru Impreza,05,55.47,52.99,52.08,52.85,56.05,55.18,55.19,54.23,52.08,54.23,106.31,3,53

The original .pdf file is now in .csv format, suitable for import into most spreadsheets.

Copyright © Solid Fluid 2007-2022
Last modified: SolFlu  Thu, 25 Jun 2009 19:31:27 GMT