Dealing with LSP Encoding
Microsoft's Language Server Protocol requires UTF-8 encoding for your text, but by default it wants you to measure offsets and positions in UTF-16 "code units", which belong to a different encoding. This mismatch makes it difficult to comply with the specification, unless you happen to be writing in Java, JavaScript, or C#, whose strings all match the strange UTF-16 "code unit" behavior by default.
How to opt out of the Code Units
Your first goal should be to try to negotiate for the sane system.
As a client, include a `positionEncodings` array with a `utf-8` option in your `initialize` message:
{"capabilities": { "general": {"positionEncodings": ["utf-8"] }}}
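As a rough sketch of how a client might send this, here is one way to build the full `initialize` request in Python. The function name `build_initialize_request` is hypothetical, and listing `utf-16` as a second, less preferred entry is my assumption about how to express the fallback; the `Content-Length` framing is the standard LSP base protocol header.

```python
import json


def build_initialize_request(request_id: int = 1) -> bytes:
    """Build a framed LSP `initialize` request that prefers UTF-8 positions.

    Hypothetical helper for illustration; a real client would fill in
    many more capabilities.
    """
    params = {
        "processId": None,
        "rootUri": None,
        "capabilities": {
            "general": {
                # Listed in decreasing order of preference.
                "positionEncodings": ["utf-8", "utf-16"],
            }
        },
    }
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": "initialize", "params": params}
    ).encode("utf-8")
    # Content-Length counts the bytes of the body, not its characters.
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body
```

The header length is taken from the encoded bytes, which is the same bytes-versus-characters distinction this whole post is about.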
The server should choose `utf-8` if the client offers the option:
{"capabilities": {"positionEncoding": "utf-8"}}
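On the server side, the choice could look something like the sketch below. The names `choose_position_encoding` and `build_initialize_result` are made up for this example, and it assumes the server can work in either encoding.

```python
def choose_position_encoding(client_capabilities: dict) -> str:
    """Pick utf-8 when the client offers it, otherwise fall back to utf-16."""
    general = client_capabilities.get("general") or {}
    # A missing list is treated the same as a client that only knows utf-16.
    offered = general.get("positionEncodings") or ["utf-16"]
    return "utf-8" if "utf-8" in offered else "utf-16"


def build_initialize_result(client_capabilities: dict) -> dict:
    """Build the part of the initialize result that announces the choice."""
    return {
        "capabilities": {
            "positionEncoding": choose_position_encoding(client_capabilities),
        }
    }
```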
If the server leaves its choice blank, that implies `utf-16`, even if `utf-16` wasn't among the options the client offered.
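A client therefore has to treat a missing field as `utf-16` when it reads the response. A minimal sketch, with `negotiated_encoding` as a hypothetical helper name:

```python
def negotiated_encoding(initialize_result: dict) -> str:
    """Return the position encoding the server picked.

    A blank choice means utf-16, regardless of what the client offered.
    """
    capabilities = initialize_result.get("capabilities") or {}
    return capabilities.get("positionEncoding", "utf-16")
```

If this returns anything other than `utf-8`, you are in the situation the next section deals with.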
When this negotiation goes right, both sides can avoid all the trouble and measure positions in plain bytes.
How to deal with the Code Units
If the above negotiation fails, then you have to deal with UTF-16 code units.
Here are the four types of UTF-8 character, one for each encoded length:
Character | a | α | ａ | 𝕒 |
---|---|---|---|---|
# of bytes | 1 | 2 | 3 | 4 |
# of codepoints | 1 | 1 | 1 | 1 |
# of UTF-16 code units | 1 | 1 | 1 | 2 |
You can see that counting UTF-16 code units is different from counting codepoints: the 4-byte character is a single codepoint but two code units. Your Unicode library probably supports codepoints, but for LSP compliance, you need code units.
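If you want to check these numbers yourself, Python makes it easy because its strings are sequences of codepoints. This is only a verification sketch: `measure` is a name I made up, and the fullwidth "ａ" (U+FF41) is my pick for a 3-byte example character.

```python
def measure(text: str) -> tuple[int, int, int]:
    """Return (UTF-8 bytes, codepoints, UTF-16 code units) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    codepoints = len(text)  # Python 3 strings are indexed by codepoint
    # utf-16-le emits no byte order mark, and each code unit is 2 bytes.
    utf16_units = len(text.encode("utf-16-le")) // 2
    return utf8_bytes, codepoints, utf16_units


# One sample character for each UTF-8 length: a, α, ａ, 𝕒
for ch in ("a", "\u03b1", "\uff41", "\U0001d552"):
    print(f"U+{ord(ch):04X}", measure(ch))
```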
Instead of incorrectly using codepoints from a library, I recommend computing the code unit offsets yourself. This involves iterating over your string and keeping a running total.
One way to count the number of UTF-16 code units is to parse each character and refer to the table above, and count each codepoint as +1, except 4-byte codepoints, which are +2.
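Here is a sketch of that first approach working directly on UTF-8 bytes. It reads the lead byte of each character to find its length, then applies the +1 / +2 rule. The function name is hypothetical, and the validation is deliberately light: it rejects bad lead bytes but does not check that continuation bytes are well formed.

```python
def utf16_len_by_sequence(data: bytes) -> int:
    """Count UTF-16 code units by walking UTF-8 sequences one character at a time."""
    units = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1  # ASCII
        elif lead < 0xC0:
            raise ValueError(f"unexpected continuation byte at offset {i}")
        elif lead < 0xE0:
            width = 2
        elif lead < 0xF0:
            width = 3
        elif lead < 0xF8:
            width = 4
        else:
            raise ValueError(f"byte {lead:#x} is not allowed in UTF-8")
        # Only 4-byte characters need a surrogate pair in UTF-16.
        units += 2 if width == 4 else 1
        i += width
    return units
```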
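Another way, if you feel like skipping validation, is to iterate over each byte in your string and count bytes in `0 <= b <= 191` as +1 code unit, `192 <= b <= 223` as 0 code units, and `224 <= b <= 247` as -1 code units. (`248 <= b <= 255` is not allowed in UTF-8 text.) This works because every ASCII byte and every continuation byte adds 1, and the lead byte corrects the total: a 2-byte character sums to 0 + 1 = 1, a 3-byte character to -1 + 2 = 1, and a 4-byte character to -1 + 3 = 2.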
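The byte-by-byte rule can be sketched as below, along with two helpers that use it to convert between byte offsets and UTF-16 columns within a single line. All three function names are made up for this example. The code assumes the input is already valid UTF-8 and that byte offsets fall on character boundaries. What to do with a column that lands in the middle of a surrogate pair is my own choice here: it rounds up to the next character.

```python
def utf16_delta(b: int) -> int:
    """Per-byte contribution to the UTF-16 code unit count."""
    if b <= 191:
        return 1   # ASCII or continuation byte
    if b <= 223:
        return 0   # lead byte of a 2-byte character
    if b <= 247:
        return -1  # lead byte of a 3- or 4-byte character
    raise ValueError(f"byte {b:#x} is not allowed in UTF-8")


def utf16_units_from_bytes(data: bytes) -> int:
    """Count UTF-16 code units without decoding any characters."""
    return sum(utf16_delta(b) for b in data)


def byte_offset_to_utf16_column(line: bytes, byte_offset: int) -> int:
    """Convert a byte offset (on a character boundary) to a UTF-16 column."""
    return utf16_units_from_bytes(line[:byte_offset])


def utf16_column_to_byte_offset(line: bytes, column: int) -> int:
    """Convert a UTF-16 column to a byte offset, clamping to the line length."""
    units = 0
    for i, b in enumerate(line):
        starts_character = b < 0x80 or b >= 0xC0
        # The running total is only meaningful at character boundaries.
        if starts_character and units >= column:
            return i
        units += utf16_delta(b)
    return len(line)
```

For the line `a𝕒b`, whose UTF-8 form is 6 bytes, the `b` sits at byte offset 5 but at UTF-16 column 3, and these helpers translate between the two.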