Dealing with LSP Encoding

Microsoft's Language Server Protocol transfers your text as UTF-8, but by default it makes you measure offsets and positions in UTF-16 "code units", a unit from a different encoding entirely. This mismatch makes the specification awkward to comply with, unless you happen to be writing in Java, JavaScript, or C#, whose native strings are all indexed by UTF-16 code units anyway.

How to opt out of the Code Units

Your first goal should be to negotiate your way to the sane system; since version 3.17, LSP lets the client and server agree on a position encoding during initialization.

As a client, advertise a positionEncodings array containing "utf-8" among the general capabilities of your initialize request.
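A minimal sketch of the relevant fragment of that request, with every unrelated field omitted:

```json
{
  "method": "initialize",
  "params": {
    "capabilities": {
      "general": {
        "positionEncodings": ["utf-8", "utf-16"]
      }
    }
  }
}
```

Listing "utf-16" as well is wise, since many servers support nothing else.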

The server should choose utf-8 whenever the client offers it, announcing the choice in the positionEncoding field of its capabilities.
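The matching fragment of the server's initialize response, again with everything unrelated omitted:

```json
{
  "result": {
    "capabilities": {
      "positionEncoding": "utf-8"
    }
  }
}
```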

If the server leaves positionEncoding blank, that implies utf-16, even if utf-16 wasn't among the options the client offered.

When this negotiation goes right, both sides can avoid all the trouble below and measure positions in plain UTF-8 bytes.

How to deal with the Code Units

If the above negotiation fails, then you have to deal with UTF-16 code units.

Here is one character of each of the four UTF-8 byte lengths:

| Character              | a | α | あ | 𝕒 |
|------------------------|---|---|---|---|
| # of bytes             | 1 | 2 | 3 | 4 |
| # of codepoints        | 1 | 1 | 1 | 1 |
| # of UTF-16 code units | 1 | 1 | 1 | 2 |

You can see that counting UTF-16 code units is not the same as counting codepoints: 4-byte characters sit outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair, i.e. two code units, to represent them. Your Unicode library probably gives you codepoint counts, but for LSP compliance you need code units.

Rather than pulling a codepoint count from a library and being subtly wrong, I recommend computing the code unit offsets yourself. This just means iterating over your string and keeping a running total.

One way to count UTF-16 code units is to decode each character and consult the table above: count every codepoint as +1, except 4-byte codepoints, which count as +2.
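A minimal sketch of that approach in Rust (utf16_len is a made-up name; char::len_utf8 is from the standard library):

```rust
/// Count the UTF-16 code units in `s` by decoding each codepoint.
fn utf16_len(s: &str) -> usize {
    let mut total = 0;
    for c in s.chars() {
        // Per the table: 4-byte codepoints need a surrogate pair.
        total += if c.len_utf8() == 4 { 2 } else { 1 };
    }
    total
}

fn main() {
    // One character of each UTF-8 byte length, as in the table.
    assert_eq!(utf16_len("a"), 1);
    assert_eq!(utf16_len("α"), 1);
    assert_eq!(utf16_len("あ"), 1);
    assert_eq!(utf16_len("𝕒"), 2);
}
```

(Rust happens to ship char::len_utf16, which encodes the same 1-or-2 rule directly.)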

Another way, if you're willing to skip validation, is to iterate over the raw bytes of your string, counting each byte 0 <= b <= 191 as +1 code unit, 192 <= b <= 223 as 0, and 224 <= b <= 247 as -1. (248 <= b <= 255 is never valid in UTF-8 text.) Each character's bytes then sum to its code-unit count: the +1s come from its ASCII or continuation bytes, and the lead byte corrects the total for 3- and 4-byte characters.
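The same trick as a Rust sketch (utf16_len_bytes is a made-up name; the input is assumed to be valid UTF-8, since nothing here checks it):

```rust
/// Count UTF-16 code units from raw bytes, without decoding.
/// Assumes `bytes` is valid UTF-8; no validation is performed.
fn utf16_len_bytes(bytes: &[u8]) -> usize {
    let mut total: isize = 0;
    for &b in bytes {
        total += match b {
            0..=191 => 1,    // ASCII or continuation byte
            192..=223 => 0,  // lead byte of a 2-byte character
            224..=247 => -1, // lead byte of a 3- or 4-byte character
            _ => 0,          // 248..=255 never appears in valid UTF-8
        };
    }
    total as usize
}

fn main() {
    assert_eq!(utf16_len_bytes("aα".as_bytes()), 2); // 1 + 1
    assert_eq!(utf16_len_bytes("𝕒".as_bytes()), 2);  // surrogate pair
}
```

The running total can dip negative partway through a 3- or 4-byte character, hence the signed accumulator.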

See also