Writing an open source Apex syntax highlighter for the Monaco editor
Oli Lane on September 3rd 2018
Microsoft's VSCode has come on in leaps and bounds in the last couple of years, both in functionality and popularity. One of my favourite things about it is that it's open source, which means that not only can you check out the source code for yourself, you can contribute, too!
At its core, VSCode is powered by the Monaco editor, which handles all of the code editing functionality. Since it's all built on top of browser technology, it's actually possible to use the Monaco editor by itself inside a browser - and in fact, that's exactly what we do to render most of our code and XML diffs within Gearset.
Monaco handles things like diff rendering, quick navigation and syntax highlighting for you, but when it came to displaying Apex diffs, there was no Apex syntax highlighter available. Since Java and Apex are syntactically pretty similar, we got away with using the inbuilt Java syntax highlighting for a little while, but we wanted to both improve our product and give back to the open source community that helps build this fantastic editor. So, I decided to take a stab at writing an Apex syntax highlighter and contributing it back to the project.
What does a syntax highlighter do, anyway?
Syntax highlighting requires understanding the code and "tokenizing" it. This entails marking which bits of text correspond to different "token types", where a token type might be something like "identifier" or "string" or "comment". Each of those token types can then be mapped to a certain colour to provide syntax highlighting. In a lot of editors, those colours can be customised by themes - and the tokenization can also help the editor provide other functionality such as highlighting matching brackets and code folding.
The code/configuration that carries out this tokenization process is sometimes called a tokenizer or a grammar, or even a lexical specification.
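To make the idea of tokenizing concrete, here is a toy sketch of a line tokenizer. It is not how Monaco does it; the token type names and the rules are illustrative assumptions chosen to keep the example small.

```typescript
// A toy tokenizer: token type names and rules are illustrative only.
type TokenType = "comment" | "string" | "number" | "identifier" | "whitespace" | "other";

interface Token {
  type: TokenType;
  text: string;
}

// Rules are tried in order; the first one that matches at the current position wins.
const rules: Array<[RegExp, TokenType]> = [
  [/^\/\/.*/, "comment"],
  [/^'(?:[^'\\]|\\.)*'/, "string"],
  [/^\d+(?:\.\d+)?/, "number"],
  [/^[A-Za-z_]\w*/, "identifier"],
  [/^\s+/, "whitespace"],
];

function tokenizeLine(line: string): Token[] {
  const tokens: Token[] = [];
  let rest = line;
  while (rest.length > 0) {
    let matched = false;
    for (const [pattern, type] of rules) {
      const m = pattern.exec(rest);
      if (m !== null && m[0].length > 0) {
        tokens.push({ type, text: m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Anything unrecognised is emitted one character at a time.
      tokens.push({ type: "other", text: rest.charAt(0) });
      rest = rest.slice(1);
    }
  }
  return tokens;
}
```

Calling `tokenizeLine("x = 4.5 // 4.5")` marks the first `4.5` as a number and the second as part of the comment. An editor theme would then map each token type to a colour.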
How does Monaco do syntax highlighting?
If you're familiar with Salesforce's Apex Code Editor plugin for VSCode, you might be wondering why I needed to create a syntax highlighter for Monaco, given that the plugin provides Apex syntax highlighting.
It turns out that there are actually several ways to implement syntax highlighting in Monaco:
TextMate grammars were created for the TextMate editor, but are now used by many editors and are something of a de facto standard when it comes to syntax highlighting. Since so many TextMate grammars exist for so many different languages, when you make a new editor it makes sense to support TextMate grammars for your syntax highlighting. So, in VSCode, Monaco has a module which supports TextMate grammars.
Unfortunately this support relies on a native regex library for performance reasons, and therefore the support only works in VSCode, and not when you run Monaco in a browser.
Microsoft also has a specification called the Language Server Protocol (LSP), which can be used to let an editor communicate with a separate server which will provide language-specific features like autocomplete and syntax highlighting.
The Apex Code Editor plugin uses this functionality - it provides a separate Apex Language Server which communicates with VSCode over LSP. Unfortunately, this separate server is written in Java and designed to run locally, so it doesn't make sense in the browser either.
Monaco also has its own way of creating syntax highlighters by specifying rules in a JSON format, using a library called Monarch. This library is designed to be efficient (so that it can run fast in a browser environment) - and is the only syntax highlighting method that runs in Monaco in the browser.
Since no Monarch Apex grammar existed, we were stuck when it came to highlighting Apex in the browser.
Creating a Monarch grammar for Apex
How Monarch grammars are defined
The Monaco editor website has a great page on how to write a Monarch grammar. Not only does it explain the Monarch specification pretty clearly, it also has a playground where you can modify a language definition and see the resulting syntax highlighting applied to some code in real time. The real time editor was invaluable when trying out different things during development.
As mentioned above, in Monarch you provide a series of attributes for your language in a JSON format in order to create a language definition. The main configuration happens within the tokenizer attribute, which contains a series of states and rules. The rules match on the input and tell the tokenizer to perform a certain action - usually to transition into a different state or mark the matched text with a certain token.
The states are needed to keep track of context - the string 4.5 will probably match some sort of number token normally, but it shouldn't do so when it appears within a comment.
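As a rough sketch of what states and rules look like, here is a minimal Monarch-style definition with a root state and a comment state. It is an illustrative fragment, not the real Apex grammar: the name `sketchLanguage` and the token names are assumptions, and the comment rules follow the block comment idiom commonly shown in Monarch examples.

```typescript
// Minimal Monarch-style language definition (illustrative, not the real Apex grammar).
const sketchLanguage = {
  defaultToken: "",
  tokenizer: {
    // "root" is the first state listed, so tokenizing starts here.
    root: [
      // Seeing "/*" emits a comment token and pushes the "comment" state.
      [/\/\*/, "comment", "@comment"],
      [/\d+\.\d+/, "number.float"],
      [/\d+/, "number"],
      [/[a-zA-Z_$][\w$]*/, "identifier"],
      [/[ \t\r\n]+/, "white"],
    ],
    // Inside a comment, everything is a comment token until "*/" pops back to root.
    comment: [
      [/[^\/*]+/, "comment"],
      [/\*\//, "comment", "@pop"],
      [/[\/*]/, "comment"],
    ],
  },
};
```

With a definition like this, 4.5 in ordinary code is matched by the float rule in root, while 4.5 inside a block comment never reaches that rule, because the comment state only contains comment rules. An object of this shape is the kind of thing passed to `monaco.languages.setMonarchTokensProvider` when registering a language.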
A starting point
As a novice when it comes to writing language grammars, I would have been a bit lost trying to start from scratch!
Luckily, I had a good jumping off point in the form of the Java Monarch grammar which already exists for Monaco (and which we were already using to highlight our Apex code in Gearset). Java and Apex are very similar in many ways, so it made sense to copy the existing grammar and modify it to suit some of the differences that Apex has.
One of the ways I went about finding these differences was to grab a fairly diverse set of Apex code from open source repositories on GitHub which I could run the current highlighter on for testing. The Monarch website's live playground editor was really useful for this.
One of the things that immediately jumps out when you use a Java highlighter on Apex code is that some of the keywords aren't highlighted - keywords like future don't exist in Java, and so they usually get interpreted as identifiers instead. Merging in the list of Apex keywords immediately improved the results.
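A sketch of that merge might look like the following. The keyword lists are small illustrative subsets rather than the real lists, and the names `javaKeywords`, `apexKeywords` and `keywordSketch` are made up for this example. The `cases` action with `@keywords` is the usual Monarch way of checking a matched identifier against a keyword list.

```typescript
// Illustrative subsets only; the real keyword lists are much longer.
const javaKeywords: string[] = ["abstract", "class", "public", "static", "void"];
const apexKeywords: string[] = ["class", "future", "global", "trigger", "virtual", "webservice"];

// Merge the two lists and drop duplicates, keeping first-seen order.
const mergedKeywords: string[] = javaKeywords
  .concat(apexKeywords)
  .filter((keyword, index, all) => all.indexOf(keyword) === index);

const keywordSketch = {
  keywords: mergedKeywords,
  tokenizer: {
    root: [
      // An identifier found in the keywords list becomes a keyword token;
      // anything else stays an ordinary identifier.
      [/[a-zA-Z_$][\w$]*/, { cases: { "@keywords": "keyword", "@default": "identifier" } }],
    ],
  },
};
```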
Another difference is that Apex is case insensitive, and Java isn't. You can mark Monarch grammars as case insensitive, which sounds like a great solution - unfortunately, it clashed with another feature which I wanted to include. When you set your grammar to be case insensitive, it stops any of your rules discriminating based on the casing of the text (which makes sense).
The problem was, I also wanted to highlight identifiers that started with an uppercase letter differently to those that didn't, because it's a good clue that the identifier is a type rather than a variable name. This rule wasn't possible to write without case sensitivity turned on.
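The clash is easy to see with a plain regular expression. In this sketch the pattern and the `classifyIdentifier` helper are hypothetical; the `i` flag stands in for what marking a grammar as case insensitive (Monarch's `ignoreCase` attribute) does to its rules.

```typescript
// Hypothetical rule pattern: an identifier that starts with an uppercase letter.
const typeLike = /^[A-Z][\w$]*$/;
// The same pattern with the ignore-case flag, standing in for a case insensitive grammar.
const typeLikeIgnoreCase = /^[A-Z][\w$]*$/i;

function classifyIdentifier(name: string, pattern: RegExp): string {
  return pattern.test(name) ? "type.identifier" : "identifier";
}

classifyIdentifier("Account", typeLike);           // "type.identifier"
classifyIdentifier("account", typeLike);           // "identifier"
classifyIdentifier("account", typeLikeIgnoreCase); // "type.identifier"
```

Once case is ignored, `[A-Z]` matches lowercase letters too, so every identifier looks like a type and the distinction is lost.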
As a compromise, I created a small function which takes all of the keywords and generates some common casing variations (specifically, all uppercase and with the first letter upper cased). This means that the highlighter will correctly match things like Decimal as keywords, but it won't match tomORROW. This seemed like a decent compromise to me, but it's not perfect. In particular, PascalCase keywords like TestMethod seem perfectly sensible but won't be highlighted correctly.
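A function along those lines could be sketched as follows. This is not the code from the pull request: the function names are made up, and it assumes the input keywords are plain lowercase words.

```typescript
// Upper-cases only the first character of a word.
function upperCaseFirstLetter(word: string): string {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

// For each keyword, produce the lowercase, all-uppercase and first-letter-uppercase forms.
function withCaseVariations(keywords: string[]): string[] {
  const result: string[] = [];
  for (const keyword of keywords) {
    const lower = keyword.toLowerCase();
    for (const variant of [lower, lower.toUpperCase(), upperCaseFirstLetter(lower)]) {
      // Skip duplicates, e.g. for keywords with no letters to change.
      if (result.indexOf(variant) === -1) {
        result.push(variant);
      }
    }
  }
  return result;
}

withCaseVariations(["decimal"]); // ["decimal", "DECIMAL", "Decimal"]
```

For "testmethod" this yields testmethod, TESTMETHOD and Testmethod, which shows the gap: TestMethod is never generated, so it isn't matched.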
Some other small changes to the Java highlighter which I made included:
- Removing binary, hex and octal numbers (Apex doesn't support them)
- Changing the javadoc tokens to apexdoc tokens
Making sure it works
Every Salesforce developer knows the value of a good test suite - so it was time to make sure the highlighter had a good set of tests to verify the functionality (and help out the next people who come to make improvements).
Tests in the monaco-languages repository basically consist of a set of test inputs and the expected token output. Again, the existing test suite for the Java highlighter was a big help.
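To give a feel for the "input plus expected tokens" style, here is a self-contained sketch. The shapes and the `checkCase` helper are hypothetical and the repository's real test helpers may differ; `fakeTokenizer` is a stand-in so that the example runs on its own.

```typescript
// Hypothetical shapes: the real test helpers in the repository may differ.
interface ExpectedToken {
  startIndex: number; // offset in the line where the token starts
  type: string; // token type name
}

interface TokenizationCase {
  line: string;
  tokens: ExpectedToken[];
}

type LineTokenizer = (line: string) => ExpectedToken[];

// Returns a list of readable mismatches; an empty list means the case passed.
function checkCase(tokenize: LineTokenizer, testCase: TokenizationCase): string[] {
  const actual = tokenize(testCase.line);
  const problems: string[] = [];
  const count = Math.max(actual.length, testCase.tokens.length);
  for (let i = 0; i < count; i++) {
    const a = actual[i];
    const e = testCase.tokens[i];
    if (!a || !e || a.startIndex !== e.startIndex || a.type !== e.type) {
      problems.push(`token ${i}: expected ${JSON.stringify(e)}, got ${JSON.stringify(a)}`);
    }
  }
  return problems;
}

// Stand-in tokenizer: a line starting with "//" is one comment token, anything else is "source".
const fakeTokenizer: LineTokenizer = (line) =>
  line.indexOf("//") === 0
    ? [{ startIndex: 0, type: "comment" }]
    : [{ startIndex: 0, type: "source" }];

const commentCase: TokenizationCase = {
  line: "// a comment",
  tokens: [{ startIndex: 0, type: "comment" }],
};
```

Here `checkCase(fakeTokenizer, commentCase)` returns an empty list, while a case whose expected tokens don't match the tokenizer's output returns one message per mismatched token.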
Submitting a pull request
Finally, after running through the checklist of things to do when adding a new language, I submitted a pull request with my work.
It was accepted, and is now released in version 0.14.0 of the Monaco editor - and it's also now live in the Gearset app for your Apex code diffs! 🎉
Of course, although it's a good starting point, it's not perfect by any means. Here's a list of things that could do with some improvement:
- SOQL - although lots of SOQL keywords like FROM are highlighted correctly, the tokenizer won't recognise SOQL as a different context, and therefore the highlighting within SOQL queries isn't perfect.
- I didn't manage to find a fully comprehensive list of Apex keywords. There's one in the docs, but some keywords such as switch are marked as "for future use" despite being in use now, and some keywords such as void are missing. It also doesn't include any of the built-in types. Because of this, I actually merged this list with the list of keywords from the Java highlighter to get my final list.
- As mentioned above, the highlighter isn't totally case insensitive.
- Some keywords are context dependent - for example, sharing are keywords when used in defining a class but aren't reserved words and can be used as identifiers in different contexts. I didn't tackle this issue in my first pass - the highlighter simply won't detect these as keywords.
- Although apexdoc is recognised as a whole block, individual usages of things like @param within the apexdoc block are not parsed separately.
Like I said, it's all open source, so if one of these issues is bugging you and you think you can take a crack at making it better, you can!