woensdag 23 december 2015

Advanced pattern matching in SQL Server

This week I had the privilege of extracting times from a varchar column. This had to be done in T-SQL, which as you probably know does not support regular expressions. So I could only use PATINDEX, which is a lousy replacement for real regexes. Nevertheless, I found a solution that you may find useful.

The code is below, I will start with an explanation of the idea.

The problem at hand:

  1. A varchar column that contains a start time and end time, separated by a dash or a space.
  2. Times can be at the start, the middle or the end.
  3. The text may contain no time, just a start time, just an end time or both.
  4. Time can be denoted with three or four digits, with or without a leading zero for times before 10am.
  5. Time can be denoted with or without a dot as separator between hours and minutes.
  6. The dash may be surrounded by spaces, either before, after or both.
This is a lot of variation, that leads to many patterns to check. This could be done in a very ugly case statement. I don't like ugly statements so I came up with a Common Table Expression (CTE) holding all the possible patterns. This CTE contains the patterns from longest to shortest, so the first match that is found is always the most specific. Beside the pattern, the CTE contains a decription (just for human readability) and two numbers (startLenght and endLength) that help in finding the actual time piece from the start of the pattern. (PATINDEX just returns the position where the pattern is found. Nothing like multiple matching groups that you like so much about regex...)

Then I join the patterns CTE with the table that contains the text to be searched, and perform a PATINDEX for every combination. That results in zero, one or more matches per text. If there are more than one, the most specific will have row number 1. So then I filter on ROW_NUMBER() = 1.

From there is it a matter of smart substringing and converting.

I resolved the variations following from problems steps 5 and 6 by applying the REPLACE function for each of the variations. This could also have been solved by adding separate patterns. If you decide to do that, you will probably also need an extra index number for the position of the end-time in the pattern.

The CTE 'UnitTests' is what the name suggests: a list of test notes to retrieve the times from, including the correct answers.

The environment

SQL Server 11.0 SP2, aka SQL Server 2012

The code

 DECLARE @ThreeDigitTimePattern nvarchar(20) = '[0-9][0-5][0-9]';  
 DECLARE @FourDigitTimePattern nvarchar(20) = '[0-2]' + @ThreeDigitTimePattern;   
   
 WITH   
 StartEndTimePatterns AS  
 (  
     -- Patterns in order of matching length. This makes sure that the longest match is the first match.  
     SELECT patternID, startLength, endLength, pattern, [description]  
     FROM  
     (  
         VALUES   
          (1, 4, 4, '%[^0-9,-]' + @FourDigitTimePattern + '[- ]' + @FourDigitTimePattern + '[^0-9]%', 'FourDigitTime/FourDigitTime/MiddleOfSentence: blabla 1135-1345 blabla')  
   
         , (2, 3, 4, '%[^0-9,-]' + @ThreeDigitTimePattern + '[- ]' + @FourDigitTimePattern + '[^0-9]%', 'ThreeDigitTime/FourDigitTime/MiddleOfSentence: blabla 735-1145 blabla')  
   
         , (3, 4, 3, '%[^0-9,-]' + @FourDigitTimePattern + '[- ]' + @ThreeDigitTimePattern + '[^0-9]%', 'FourDigitTime/ThreeDigitTime/MiddleOfSentence: blabla 735-945 blabla')  
   
         , (4, 3, 3, '%[^0-9,-]' + @ThreeDigitTimePattern + '[- ]' + @ThreeDigitTimePattern + '[^0-9]%', 'ThreeDigitTime/ThreeDigitTime/MiddleOfSentence: blabla 735-945 blabla')  
   
         , (5, 4, 0, '%[^0-9,-]' + @FourDigitTimePattern + '[^0-9]%', 'FourDigitTime/null/MiddleOfSentence: blabla 1135 blabla')  
   
         , (6, 3, 0, '%[^0-9,-]' + @ThreeDigitTimePattern + '[^0-9]%', 'ThreeDigitTime/null/MiddleOfSentence: blabla 735 blabla')  
   
         , (7, 0, 3, '%-' + @ThreeDigitTimePattern + '[^0-9]%', 'null/ThreeDigitTime/MiddleOfSentence: blabla -735 blabla')  
   
         , (8, 0, 4, '%-' + @FourDigitTimePattern + '[^0-9]%', 'null/ThreeDigitTime/MiddleOfSentence: blabla -735 blabla')  
     ) as T(patternID, startLength, endLength, pattern, [description])  
 )  
 , UnitTests AS   
 (  
     -- A list of examples for unittesting.  
     -- First step: remove '.', and extra spaces around the dash.  
     SELECT testID, [description], '__' + Replace(Replace(Replace(NoteText, '.', ''), '- ', '-'), ' -', '-') + '__' as NormalizedNoteText, CAST(startTime_ref as time) as startTime_ref, CAST(endTime_ref as time) as endTime_ref  
     FROM  
     (  
         VALUES (1, 'ThreeDigitTime/ThreeDigitTime/WholeSentence', '735-945', '7:35', '9:45')  
         , (2, 'ThreeDigitTime/FourDigitTime/WholeSentence', '735-1145', '7:35', '11:45')  
         , (3, 'FourDigitTime/FourDigitTime/WholeSentence', '1135-1345', '11:35', '13:45')  
         , (4, 'ThreeDigitTime/ThreeDigitTime/StartOfSentence', '735-945 blabla', '7:35', '9:45')  
         , (5, 'ThreeDigitTime/FourDigitTime/StartOfSentence', '735-1145 blabla', '7:35', '11:45')  
         , (6, 'FourDigitTime/FourDigitTime/StartOfSentence', '1135-1345 blabla', '11:35', '13:45')  
         , (7, 'ThreeDigitTime/ThreeDigitTime/MiddleOfSentence', 'blabla 735-945 blabla', '7:35', '9:45')  
         , (8, 'ThreeDigitTime/FourDigitTime/MiddleOfSentence', 'blabla 735-1145 blabla', '7:35', '11:45')  
         , (9, 'FourDigitTime/FourDigitTime/MiddleOfSentence', 'blabla 1135-1345 blabla', '11:35', '13:45')  
         , (10, 'ThreeDigitTime/ThreeDigitTime/EndOfSentence', 'blabla 735-945', '7:35', '9:45')  
         , (11, 'ThreeDigitTime/FourDigitTime/EndOfSentence', 'blabla 735-1145', '7:35', '11:45')  
         , (12, 'FourDigitTime/FourDigitTime/EndOfSentence', 'blabla 1135-1345', '11:35', '13:45')  
         , (13, 'ThreeDigitTime/null/WholeSentence', '735', '7:35', null)  
         , (14, 'FourDigitTime/null/WholeSentence', '1135', '11:35', null)  
         , (15, 'ThreeDigitTime/null/StartOfSentence', '735 blabla', '7:35', null)  
         , (16, 'FourDigitTime/null/StartOfSentence', '1135 blabla', '11:35', null)  
         , (17, 'ThreeDigitTime/null/MiddleOfSentence', 'blabla 735 blabla', '7:35', null)  
         , (18, 'FourDigitTime/null/MiddleOfSentence', 'blabla 1135 blabla', '11:35', null)  
         , (17, 'ThreeDigitTime/null/EndOfSentence', 'blabla 735', '7:35', null)  
         , (18, 'FourDigitTime/null/EndOfSentence', 'blabla 1135', '11:35', null)  
         , (19, 'TooLongNumber/null/WholeSentence', '73545', null, null)  
         , (20, 'TooLongNumber/null/StartOfSentence', '73545 blabla', null, null)  
         , (21, 'TooLongNumber/null/MiddleOfSentence', 'blabla 73545 blabla', null, null)  
         , (22, 'TooLongNumber/null/EndOfSentence', 'blabla 73545', null, null)  
         , (23, 'LeadingZero/ThreeDigitTime/StartOfSentence', '0852-912 blabla', '08:52', '9:12')  
         , (24, 'LeadingZero/ThreeDigitTime/MiddleOfSentence', 'blabla 0852-912 blabla', '08:52', '9:12')  
         , (25, 'LeadingZero/ThreeDigitTime/EndOfSentence', 'blabla 0852-912', '08:52', '9:12')  
         , (26, 'Point/ThreeDigitTime/StartOfSentence', '8.52-912 blabla', '8:52', '9:12')  
         , (27, 'Point/Point/StartOfSentence', '8.52-10.12 blabla', '8:52', '10:12')  
         , (28, 'null/ThreeDigitTime/StartOfSentence', '-912 blabla', null, '9:12')  
         , (29, 'null/ThreeDigitTime/MiddleOfSentence', 'blabla -912 blabla', null, '9:12')  
         , (30, 'null/ThreeDigitTime/EndOfSentence', 'blabla -912', null, '9:12')  
         , (31, 'null/FourDigitTime/StartOfSentence', '-1012 blabla', null, '10:12')  
         , (32, 'null/FourDigitTime/MiddleOfSentence', 'blabla -1012 blabla', null, '10:12')  
         , (33, 'null/FourDigitTime/EndOfSentence', 'blabla -1012', null, '10:12')  
         , (34, 'ThreeDigitTimeWithSpace/FourDigitTime/MiddleOfSentence', 'blabla 735 -1145 blabla', '7:35', '11:45')  
         , (35, 'PointAndLeadingZero/ThreeDigitTime/StartOfSentence', '08.52-912 blabla', '8:52', '9:12')  
         , (36, 'PointAndLeadingZero/Point/StartOfSentence', '08.52-10.12 blabla', '8:52', '10:12')  
         , (37, 'Point/ThreeDigitTime/MiddleOfSentence', 'blabla 08.52-912 blabla', '8:52', '9:12')  
         , (38, 'Point/Point/MiddleOfSentence', 'blabla 08.52-10.12 blabla', '8:52', '10:12')  
         , (39, 'ThreeDigitTime/LeadingZero/StartOfSentence', '852-0912 blabla', '08:52', '9:12')  
         , (40, 'ThreeDigitTime/LeadingZero/MiddleOfSentence', 'blabla 852-0912 blabla', '08:52', '9:12')  
         , (41, 'ThreeDigitTime/LeadingZero/EndOfSentence', 'blabla 852-0912', '08:52', '9:12')  
         , (50, 'PracticeTest', 'pkno 09.20-10.00 ASA 2', '9:20', '10:00')  
         , (51, 'PracticeTest', 'PVAT 08.09-9.08/2', '8:09', '9:08')  
         , (52, 'PracticeTest', 'PVVO 08.14-8.52/1', '8:14', '8:52')  
         , (53, 'PracticeTest', 'ppch 0847-0915', '8:47', '9:15')  
         , (54, 'PracticeTest', 'pver 856-0945', '8:56', '9:45')  
         , (55, 'PracticeTest', 'PNCH 0900-0945', '9:00', '9:45')  
         , (56, 'PracticeTest', 'PKCH ASA 2 9.53-10.10%0D%0A', '9:53', '10:10')  
         , (57, 'PracticeTest', 'pgyn -1609 +', null, '16:09')  
   
     ) as T(testID, [description], NoteText, startTime_ref, endTime_ref)  
 )  
   
 , StartEndTimeIndices as   
 (  
     -- Second step: find the position where the pattern is found  
     select t.testID, t.[description] as TestDescription, NormalizedNoteText, p.patternID, p.startLength, p.endLength, PATINDEX(p.pattern, NormalizedNoteText) as patternFoundAt, startTime_ref, endTime_ref  
     from UnitTests t, StartEndTimePatterns p  
 )  
 , StartEndTimeOrderedIndices as  
 (  
     -- Third step: filter the rows where any pattern matched, add a row_number  
     select testID, TestDescription, NormalizedNoteText, patternID, startLength, endLength, patternFoundAt, startTime_ref, endTime_ref  
     , ROW_NUMBER() OVER (PARTITION BY testID ORDER BY testID, patternID asc, patternFoundAt asc) as rownumber  
     from StartEndTimeIndices  
     where patternFoundAt > 0  
 )  
 , StartEndTimeStrings as  
 (  
     -- Fourth step: use only the first matches (those are the longest); extract the start- and endTime based on their lengths  
     select testID, patternID, patternFoundAt, NormalizedNoteText,   
         startLength, IIF(startLength > 0, SUBSTRING(NormalizedNoteText, patternFoundAt + 1, startLength), null) as startTime, startTime_ref,   
         endLength, IIF(endLength > 0, SUBSTRING(NormalizedNoteText, patternFoundAt + 1 + CAST(startLength as BIT) + startLength, endLength), null) as endTime, endTime_ref  
         from StartEndTimeOrderedIndices  
     where rownumber = 1  
 )  
 , StartEndTimeIntegers as  
 (  
     -- Fifth step: cast the start- end endTime to integers if possible; makes for easy calculation of the time in the next step.  
     select testID, patternID, patternFoundAt, NormalizedNoteText,   
         startLength, IIF(ISNUMERIC(startTime) = 1, CAST(startTime as INT), null) as startTime, startTime_ref,   
         endLength, IIF(ISNUMERIC(endTime) = 1, CAST(endTime as INT), null) as endTime, endTime_ref  
         from StartEndTimeStrings  
 )  
 , StartEndTimes as  
 (  
     -- Sixth step: calculate time values from the integers.  
     select testID, patternID, patternFoundAt, NormalizedNoteText, startLength, endLength  
     , IIF(startTime is not null AND startTime/100 < 24 AND startTime % 100 < 60, TIMEFROMPARTS(startTime / 100, startTime % 100, 0, 0, 0), null) as startTime, startTime_ref  
     , IIF(endTime is not null AND endTime/100 < 24 AND endTime % 100 < 60, TIMEFROMPARTS(endTime / 100, endTime % 100, 0, 0, 0), null) as endTime, endTime_ref  
     from StartEndTimeIntegers  
 )  
 -- Check whether the computed times are equal to the reference times from the unittests.  
 -- Ideally, this yields no results :-)  
 select * from StartEndTimes  
 where   
 (  
     (startTime is null and startTime_ref is not null)  
 or    (startTime is not null and startTime_ref is null)  
 or    (startTime != startTime_ref)  
 or    (endTime is null and endTime_ref is not null)  
 or    (endTime is not null and endTime_ref is null)  
 or    (endTime != endTime_ref))