Unicode in VBScript — the AscW() rant

Today, I spent most of my time at work with VBScript. Not that I know it or want to know it, but we need it for signature generation using CAPICOM in Internet Explorer. Enough scary words for a sentence? I'll add Unicode to that.

So we want to be able to sign some UTF8 encoded data for approvals in OpenXPKI. The data is contained in a form field and should be passed on to a SignedData object from CAPICOM. Unluckily, just doing SignedData.Content = theForm.text.value does not work. Apparently (did I mention that I barely know VBScript), everything is UTF-16LE internally, so even a simple SignedData.Content = "test" won't do and will sign a »t« (our guess is that CAPICOM assumes that what it gets is 0-terminated, so it stops reading after the UTF-16LE encoded »t«).

Some more or less extensive googling showed that one can go through the string using Mid() (VBScript's weird name for substr) and get the Unicode codepoint using AscW. The first idea was to encode these values in hex, add a delimiter and sign just that. But of course that's not really readable and not really space-efficient, too. Someone suggested that converting the whole string to UTF8 might be feasible from here — and offered the additional benefit of not having to code different validation methods on the server for different browsers, as Mozilla's crypto.signText() method already signs UTF8. I have to admit that although I've used it quite a bit, I've never looked at how UTF8 works. Luckily, it is actually quite easy (just some bitshifting). Some time later, the UnicodeToUTF8 method was done. It looked quite similar to this:

Function UnicodeToUTF8(ByRef pstrUnicode)
    ' Written 2007 by Alexander Klink for the OpenXPKI Project
    ' (c) 2007 by the OpenXPKI Project, released under the Apache License v2.0
    ' converts a unicode string to UTF8
    ' reference: http://en.wikipedia.org/wiki/UTF8
    Dim i, result

    result = ""
    For i = 1 To Len(pStrUnicode)
        CurrentChar = Mid(PstrUnicode, i, 1)
        CodePoint = AscW(CurrentChar)

        MaskSixBits   = 2^6 - 1 ' the lower 6 bits are 1
        MaskFourBits  = 2^4 - 1 ' the lower 4 bits are 1
        MaskThreeBits = 2^3 - 1 ' the lower 3 bits are 1
        MaskTwoBits   = 2^2 - 1 ' the lower 3 bits are 1

        'MsgBox CurrentChar & " : " & CodePoint
        If (CodePoint >= 0) And (CodePoint < 128) Then
            ' for codepoints < 128, just add one byte with the
            ' value of the codepoint (this is the ASCII subset)
            Zs = CodePoint
            result = result & ChrB(Zs)
        End If
        ' this is common for all of the following
        Zs = CodePoint And MaskSixBits
        If (CodePoint >= 128) And (CodePoint < 2048) Then
            ' for naming, see the Wikipedia article referenced above
            Ys = RightShift(CodePoint, 6)
            FirstByte  = LeftShift(6, 5) Xor Ys ' 110yyyy 
            SecondByte = LeftShift(2, 6) Xor Zs ' 10zzzzz
            'MsgBox "Case 1: " & FirstByte & ", " & SecondByte
            result = result & ChrB(FirstByte) & ChrB(SecondByte)
        End If
        If (CodePoint >= 2048) And (CodePoint < 65536) Then
            Ys = RightShift(CodePoint, 6) And MaskSixBits
            Xs = RightShift(CodePoint, 12) And MaskFourBits
            FirstByte  = LeftShift(14, 4) Xor Xs ' 1110xxxx
            SecondByte = LeftShift(2, 6) Xor Ys  ' 10yyyyyy
            ThirdByte  = LeftShift(2, 6) Xor Zs  ' 10zzzzzz
            'MsgBox "Case 2: " & FirstByte & ", " & SecondByte & ", " & ThirdByte
            result = result & ChrB(FirstByte) & ChrB(SecondByte) & ChrB(ThirdByte)
        End If
        If (CodePoint >= 65536) And (CodePoint < 1114112) Then
            Ws = RightShift(CodePoint, 18) And MaskThreeBits
            Xs = RightShift(CodePoint, 12) And MaskSixBits
            Ys = RightShift(CodePoint, 6)  And MaskSixBits
            FirstByte  = LeftShift(30, 3) Xor Ws ' 11110www
            SecondByte = LeftShift(2, 6) Xor Xs  ' 10xxxxxx
            ThirdByte  = LeftShift(2, 6) Xor Ys  ' 10yyyyyy
            FourthByte = LeftShift(2, 6) Xor Zs  ' 10zzzzzz
            'MsgBox "Case 3: " & FirstByte & ", " & SecondByte & ", " & ThirdByte & FourthByte
            result = result & ChrB(FirstByte) & ChrB(SecondByte) & ChrB(ThirdByte) & ChrB(FourthByte)
        End If
    Next
    UnicodeToUTF8 = result
End Function

Note that VBScript does not have left shifts or right shifts, so I had to implement these, too. Bummer. Not that it is particularly difficult, but what language does not have builtin shift operators?

Anyways, testing showed that it worked fine with both the german and the russian I18N data that we are using. But of course I wanted to be sure that it worked in the more obscure cases, too. I copied some four-byte UTF8 characters from fileformat.info and tried them out. Weirdly enough, that did not seem to work. Of course, at first I thought that it was my fault, but some debugging showed that AscW returned negative numbers for the corresponding characters. Negative numbers? What? Yes, AscW returns an Integer, which ranges from -32768 to +32767. So how can they accomodate Unicode codepoints, which go up to 1114111? Well, they can't. Microsoft even noticed that this might be a problem and published a Knowledgebase article about it. But did they really get the problem? Of course not, as they only talk about Unicode characters from 32768 to 65535 (I really like the quote »Unicode numbers occupy a 16-bit positive range from 0 to 65535 (0xFFFF)« – yeah, sure), for which AscW actually returns the one's complement of the number, so adding 65536 works. Anyhow, you're screwed with everything that is above 65535 (not that one reason for Unicode is to get rid of all the two-byte crap).

Here is the new version of UnicodeToUTF8, which probably should be called UnicodeToUTF8SortOf:

Function UnicodeToUTF8(ByRef pstrUnicode)
    ' Written 2007 by Alexander Klink for the OpenXPKI Project
    ' (c) 2007 by the OpenXPKI Project, released under the Apache License v2.0
    ' converts a unicode string to UTF8 (well, sort of)
    ' reference: http://en.wikipedia.org/wiki/UTF8
    Dim i, result

    result = ""
    For i = 1 To Len(pStrUnicode)
        CurrentChar = Mid(PstrUnicode, i, 1)
        CodePoint = AscW(CurrentChar)

        If (CodePoint < 0) Then
            ' AscW is broken. Badly. It can only return an integer,
            ' which is 32767 at most. So everything up to 65535 is
            ' AscW() + 65536. That Unicode chars exist beyond 65535
            ' is apparently unknown to Microsoft ...
            CodePoint = CodePoint + 65536
        End If

        MaskSixBits   = 2^6 - 1 ' the lower 6 bits are 1
        MaskFourBits  = 2^4 - 1 ' the lower 4 bits are 1
        MaskThreeBits = 2^3 - 1 ' the lower 3 bits are 1
        MaskTwoBits   = 2^2 - 1 ' the lower 3 bits are 1

        'MsgBox CurrentChar & " : " & CodePoint
        If (CodePoint >= 0) And (CodePoint < 128) Then
            ' for codepoints < 128, just add one byte with the
            ' value of the codepoint (this is the ASCII subset)
            Zs = CodePoint
            result = result & ChrB(Zs)
        End If
        ' this is common for all of the following
        Zs = CodePoint And MaskSixBits
        If (CodePoint >= 128) And (CodePoint < 2048) Then
            ' for naming, see the Wikipedia article referenced above
            Ys = RightShift(CodePoint, 6)
            FirstByte  = LeftShift(6, 5) Xor Ys ' 110yyyy 
            SecondByte = LeftShift(2, 6) Xor Zs ' 10zzzzz
            'MsgBox "Case 1: " & FirstByte & ", " & SecondByte
            result = result & ChrB(FirstByte) & ChrB(SecondByte)
        End If
        If (CodePoint >= 2048) And (CodePoint < 65536) Then
            Ys = RightShift(CodePoint, 6) And MaskSixBits
            Xs = RightShift(CodePoint, 12) And MaskFourBits
            FirstByte  = LeftShift(14, 4) Xor Xs ' 1110xxxx
            SecondByte = LeftShift(2, 6) Xor Ys  ' 10yyyyyy
            ThirdByte  = LeftShift(2, 6) Xor Zs  ' 10zzzzzz
            'MsgBox "Case 2: " & FirstByte & ", " & SecondByte & ", " & ThirdByte
            result = result & ChrB(FirstByte) & ChrB(SecondByte) & ChrB(ThirdByte)
        End If
    Next
    UnicodeToUTF8 = result
End Function

I still hope that in spite of all the ranting, the above will be useful to someone.

shift or die

security. photography. foobar.

Unicode in VBScript — the AscW() rant