2016-04-12

UTF-8 Character in C

UTF-8 is the most widely used Unicode encoding, but the C language offers little native support for it. So how do we deal with it?

In C we usually treat one byte as one character (type char). UTF-8 is a variable-length encoding: the ASCII characters 0x00 to 0x7F are encoded as themselves in a single byte, while every other character takes a sequence of two or more bytes. This makes UTF-8 a good fit for staying compatible with traditional ASCII text files.
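
To make the variable-length idea concrete, here is a minimal sketch that reads the sequence length from the first (lead) byte. The helper name utf8_seq_len is hypothetical and used only for illustration, and the sketch assumes modern UTF-8 as defined by RFC 3629, where a sequence is at most 4 bytes long.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with lead byte c, or 0 if c cannot start a sequence. */
int utf8_seq_len(uint8_t c)
{
    if ((c & 0x80) == 0x00) return 1;   /* 0xxxxxxx: ASCII, 0x00 to 0x7F */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                           /* 10xxxxxx continuation byte, or invalid */
}
```

The high bits of the lead byte say how many bytes follow, and the remaining bits carry the top part of the character's value.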

In fact, decoding is pretty simple. If we convert the variable-length encoding to a fixed-length one, a font rendering system can use it easily. Here is my method to read one UTF-8 character from a string and convert it to a 32-bit Unicode code point.
#include <stdint.h>

uint32_t utf8_getc(char* s, char** sp)
{
    uint32_t unicode = 0;

    if((*s & 0x80) == 0){
        // ASCII code 0x00 - 0x7F
        uint32_t b0 = *s++;
        unicode = b0;
    }else if((*s & 0xE0) == 0xC0){
        // 2 bytes
        uint32_t b1 = (*s++ & 0x1F);
        uint32_t b0 = (*s++ & 0x3F);
        unicode = b0 | (b1<<6);
    }else if((*s & 0xF0) == 0xE0){
        // 3 bytes
        uint32_t b2 = (*s++ & 0x0F);    // lead byte 1110xxxx carries 4 payload bits
        uint32_t b1 = (*s++ & 0x3F);
        uint32_t b0 = (*s++ & 0x3F);
        unicode = b0 | (b1<<6) | (b2<<12);
    }else if((*s & 0xF8) == 0xF0){
        // 4 bytes
        uint32_t b3 = (*s++ & 0x07);    // lead byte 11110xxx carries 3 payload bits
        uint32_t b2 = (*s++ & 0x3F);
        uint32_t b1 = (*s++ & 0x3F);
        uint32_t b0 = (*s++ & 0x3F);
        unicode = b0 | (b1<<6) | (b2<<12) | (b3<<18);
    }else if((*s & 0xFC) == 0xF8){
        // 5 bytes (obsolete form; not valid UTF-8 since RFC 3629)
        uint32_t b4 = (*s++ & 0x03);    // lead byte 111110xx carries 2 payload bits
        uint32_t b3 = (*s++ & 0x3F);
        uint32_t b2 = (*s++ & 0x3F);
        uint32_t b1 = (*s++ & 0x3F);
        uint32_t b0 = (*s++ & 0x3F);
        unicode = b0 | (b1<<6) | (b2<<12) | (b3<<18) | (b4<<24);
    }else if((*s & 0xFE) == 0xFC){
        // 6 bytes (obsolete form; not valid UTF-8 since RFC 3629)
        uint32_t b5 = (*s++ & 0x01);    // lead byte 1111110x carries 1 payload bit
        uint32_t b4 = (*s++ & 0x3F);
        uint32_t b3 = (*s++ & 0x3F);
        uint32_t b2 = (*s++ & 0x3F);
        uint32_t b1 = (*s++ & 0x3F);
        uint32_t b0 = (*s++ & 0x3F);
        unicode = b0 | (b1<<6) | (b2<<12) | (b3<<18) | (b4<<24) | (b5<<30);
    }else{
        // invalid lead byte (e.g. a stray continuation byte); *sp is left unchanged
        return 0;
    }

    if(sp)
        *sp = s;

    return unicode;
}
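Here, s points to a string of UTF-8 characters, and sp is the address of a pointer that receives the position of the next character; it may be NULL if that position is not needed. To walk a string, pass the address of s itself as sp: because the routine updates *sp to the next UTF-8 character, it can be called in a loop to process the whole string. It returns 0 at the terminating NUL or when the first byte is not a valid lead byte. Note that it assumes well-formed input: continuation bytes are not checked.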
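
The loop described above might look like the following sketch. So that it compiles on its own, it carries a condensed stand-in for utf8_getc that handles only 1- to 4-byte sequences and, unlike the routine above, also checks continuation bytes. The helper utf8_decode_all is a hypothetical name introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Condensed stand-in for utf8_getc (1- to 4-byte sequences only), so that
   this sketch compiles on its own. */
static uint32_t utf8_getc(char *s, char **sp)
{
    unsigned char c = (unsigned char)*s++;
    uint32_t unicode;
    int extra;                          /* continuation bytes still to read */

    if (c < 0x80)                { unicode = c;        extra = 0; }
    else if ((c & 0xE0) == 0xC0) { unicode = c & 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { unicode = c & 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { unicode = c & 0x07; extra = 3; }
    else return 0;                      /* invalid lead byte */

    while (extra-- > 0) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            return 0;                   /* missing continuation byte */
        unicode = (unicode << 6) | ((unsigned char)*s++ & 0x3F);
    }

    if (sp)
        *sp = s;
    return unicode;
}

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into out[],
   writing at most max code points, and return how many were written. */
size_t utf8_decode_all(char *s, uint32_t *out, size_t max)
{
    size_t n = 0;
    uint32_t cp;

    /* utf8_getc returns 0 at the terminating NUL (and on invalid input),
       which ends the loop; passing &s advances s past each character. */
    while (n < max && (cp = utf8_getc(s, &s)) != 0)
        out[n++] = cp;

    return n;
}
```

Each element of out is then a fixed-width 32-bit value that can be handed to a glyph lookup one at a time. Because 0 doubles as the end-of-string and error value, this loop simply stops at the first invalid byte.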