Proposal to make backslash escaping stable
Escaping markup using a backslash character (“backslash escaping”) is one of the weaker areas of the AsciiDoc syntax. As currently described in the user docs (which reflect how it's implemented in Asciidoctor), a backslash character is only treated as meaningful if it precedes a markup element (markup that would have otherwise been interpreted). For example, \*stars* becomes *stars*. If the backslash is used in front of a character that isn't part of a markup element (i.e., doesn't match a grammar rule), such as \*star, the input remains as is, \*star. To a writer not well-versed in the rules of the AsciiDoc syntax, this behavior appears broken.
What we can say is that backslash escaping is contextual. Rather than instructing the parser to pass through the escaped character without interpreting it (\* becomes *), the backslash only takes effect if that markup character is enlisted in a markup element. That puts the onus on the writer to track where the markup character is being used and whether that usage gives it special meaning. Expecting the writer to take on this responsibility makes backslash escaping feel unstable. Writers often avoid using this escaping mechanism and resort to more brute-force methods such as inline passthroughs.
As part of formalizing the AsciiDoc language, I feel strongly that we should stabilize this mechanism to make it more approachable.
I see three ways we could define backslash escaping:
- contextual - the backslash prevents a markup element from being interpreted; in this case, the backslash is consumed (\*stars* becomes *stars*); if no markup element is found immediately following the backslash, the backslash is left in place (\*star remains as \*star)
- universal - a backslash can be used in front of any character and is consumed; the character it escapes will not be considered when looking for markup elements (\look at that \* becomes look at that *); a literal backslash must escape itself (\\ becomes \)
- reserved - a backslash can be used in front of any reserved character in the markup and is always consumed; if used in front of any other character, the backslash is left as is (\n is a line feed; \* is an asterisk becomes \n is a line feed; * is an asterisk); a literal backslash in front of a reserved markup character would have to itself be escaped (\\*word* becomes \<strong>word</strong>); otherwise, the backslash can be written as \\ or \
One exception to maintain backwards compatibility is a macro prefix, which is treated as a single markup expression (\link: starts a link macro becomes link: starts a link macro). Another option would be to switch to contextual backslash escaping in this case, though it would add a dependency on using semantic predicates in the parser. Regardless, moving forward, escaping the colon would be preferred (link\: starts a link macro). Another exception is a bare URL, which is treated as a single markup expression and thus a contextual escape (this wouldn't rely on a semantic predicate since there is no intention to interpret the identified URL any other way).
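To make the three definitions easier to compare, here is a minimal Python sketch that applies each one to plain text. It is an illustration under stated assumptions, not how Asciidoctor or any AsciiDoc parser is implemented: the function names are hypothetical, markup interpretation is ignored entirely, the macro prefix and bare URL exceptions are not modeled, and the contextual variant knows only one toy markup rule (constrained strong text such as *stars*).

```python
import re

# Reserved characters as listed in the proposal so far (the final set is an open question).
RESERVED = "\\`_*#~^:[<({"

# Toy stand-in for "a markup element": constrained strong text such as *stars*.
# A real parser would consult its whole grammar here (assumption for illustration).
ESCAPED_STRONG = re.compile(r"\\(\*\w[^*\n]*\*)")

# A backslash followed by any single character.
ESCAPED_ANY = re.compile(r"\\(.)", re.DOTALL)

# A backslash followed by one of the reserved characters.
ESCAPED_RESERVED = re.compile(r"\\([" + re.escape(RESERVED) + "])")


def unescape_contextual(text: str) -> str:
    """Consume the backslash only when a markup element follows it."""
    return ESCAPED_STRONG.sub(lambda m: m.group(1), text)


def unescape_universal(text: str) -> str:
    """Consume a backslash in front of any character."""
    return ESCAPED_ANY.sub(lambda m: m.group(1), text)


def unescape_reserved(text: str) -> str:
    """Consume a backslash only in front of a reserved character."""
    return ESCAPED_RESERVED.sub(lambda m: m.group(1), text)
```

With these definitions, C:\projects passes through unchanged under the contextual and reserved variants, while the universal variant drops its backslash.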
As mentioned above, AsciiDoc is currently described as permitting contextual backslash escaping. We want to move past this. However, universal backslash escaping may be a step too far if we consider the impact on compatibility. The most notable problem will be Windows file paths. Under universal backslash escaping rules, C:\projects becomes C:projects. We can't expect writers to go back and fix all these cases. Besides, there's no expectation that a backslash has a meaning in this case (and requiring it to be escaped quickly introduces leaning toothpick syndrome).
Therefore, reserved backslash escaping may offer the best compromise. By choosing reserved backslash escaping, the writer no longer has to worry about escaped markup that doesn't match a syntax rule, but also won't be faced with the Windows file path problem. The only thing that still must be considered is that escaping markup could cause different markup to be found, which then must be escaped.
One open question is which markup characters to define as reserved. Should we say that all symbol/punctuation characters in the ASCII charset can be escaped, or limit it to just the ASCII characters that the AsciiDoc syntax currently uses? For reference, CommonMark allows escaping all ASCII punctuation.
Here are the reserved markup characters identified thus far:
\ ` _ * # ~ ^ : [ < ( {
Note that it shouldn't be necessary to escape the closing bracket of a markup element, which is why those characters are not listed here as reserved.
Another open question is how to escape unconstrained marked text. Currently, AsciiDoc requires that the opening unconstrained mark be double escaped (\\**stars**). However, this is both context-dependent and ambiguous (as escaping a backslash should make a literal backslash). Therefore, we may have to change this rule so that each character of the opening mark is escaped (\*\*stars**). This will introduce a slight incompatibility, but one that is reasonable to explain and to justify with the goal of making backslash escaping stable.
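The difference between the two spellings can be sketched with a toy tokenizer. This is a hypothetical illustration of the proposed rule, not the behavior of any existing processor: `tokenize`, `find_mark`, and `render_strong` are made-up names, the reserved set is the one listed in this proposal, and the only markup recognized is unconstrained strong text (**text**). The key assumption is that an escaped character is flagged as literal and therefore never counts toward a mark.

```python
# Reserved characters as listed in this proposal (the final set is still open).
RESERVED = set("\\`_*#~^:[<({")


def tokenize(text):
    """Return (char, is_literal) pairs; a backslash before a reserved char is consumed."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in RESERVED:
            tokens.append((text[i + 1], True))   # escaped: keep the char, flag it literal
            i += 2
        else:
            tokens.append((text[i], False))
            i += 1
    return tokens


def find_mark(tokens, start):
    """Index of the next ** made of two unescaped asterisks, or -1 if there is none."""
    for i in range(start, len(tokens) - 1):
        if tokens[i] == ("*", False) and tokens[i + 1] == ("*", False):
            return i
    return -1


def render_strong(text):
    """Toy renderer that recognizes only unconstrained strong text."""
    tokens = tokenize(text)
    out = []
    pos = 0
    while True:
        opening = find_mark(tokens, pos)
        closing = find_mark(tokens, opening + 2) if opening != -1 else -1
        if closing == -1:
            # No complete pair of marks left: emit the rest as literal text.
            out.extend(ch for ch, _ in tokens[pos:])
            break
        out.extend(ch for ch, _ in tokens[pos:opening])
        out.append("<strong>")
        out.extend(ch for ch, _ in tokens[opening + 2:closing])
        out.append("</strong>")
        pos = closing + 2
    return "".join(out)
```

In this sketch, \*\*stars** comes out as the literal text **stars**, whereas \\**stars** yields a literal backslash followed by strong text, which is the unambiguous reading of an escaped backslash.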
The examples provided thus far focus on where backslash escaping is used in inline syntax. It should also be considered for the following block-level constructs:
- preprocessor directive (\include::target[])
- block macro (\image::target[])
- list item (\* is an asterisk)
- dlist term (App\:: is a Ruby namespace) (or should it be \App:: is a Ruby namespace?)
- heading (\= is an equals sign)
Open Question: For block-level constructs, are we interpreting the backslash because it's at the beginning of the line, or because it is escaping a character? I think we should consider it because it's used at the beginning of the line. (I think this would translate to removing the backslash at the beginning of a paragraph). That reduces how much markup we have to designate as reserved.
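The "beginning of the line" reading could look roughly like the following Python sketch. Everything here is an assumption made for illustration: `classify_line` and the three simplified block rules are hypothetical, the dlist term case (where the backslash may not be at the start of the line) is not covered, and how the removed backslash should interact with inline markup at the start of a paragraph is left unresolved, as it is in the proposal.

```python
import re

# Simplified, hypothetical block rules; a real grammar has many more.
BLOCK_RULES = [
    ("heading", re.compile(r"^=+ \S")),
    ("list_item", re.compile(r"^\*+ \S")),
    ("block_macro", re.compile(r"^\w+::\S*\[.*\]$")),  # also stands in for directives
]


def classify_line(line):
    """Return (block_kind, text) for a single line of input."""
    if line.startswith("\\"):
        # Assumption: a backslash at the beginning of the line is always removed
        # and the rest of the line is treated as paragraph text, regardless of
        # which character follows the backslash.
        return ("paragraph", line[1:])
    for kind, rule in BLOCK_RULES:
        if rule.match(line):
            return (kind, line)
    return ("paragraph", line)
```

Because the check looks only at the position of the backslash, characters such as = would not need to be added to the reserved set for this purpose.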
In terms of parsing efficiency, we have identified the following optimization for processing backslash escaping. During parsing, only consider backslash characters that are escaping a grammar rule that is being considered. Once parsing is complete, drop the backslash in front of all reserved characters in the transformation from a parse tree to an AST/ASG. This can minimize the number of checks that the grammar has to consider.
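The second half of that optimization, dropping the remaining backslashes during the tree transformation, might be sketched as follows. The `Node` class and `to_ast` function are hypothetical and stand in for whatever parse tree and AST/ASG types an implementation uses. The sketch assumes the parser has already handled the backslashes that suppressed a grammar rule and has left the raw source text, backslashes included, in its text nodes.

```python
import re
from dataclasses import dataclass, field

# A backslash followed by one of the reserved characters listed in this proposal.
RESERVED_ESCAPE = re.compile(r"\\([\\`_*#~^:\[<({])")


@dataclass
class Node:
    kind: str                                       # e.g. "paragraph", "strong", "text"
    text: str = ""                                  # raw source, used only by "text" nodes
    children: list = field(default_factory=list)


def to_ast(node: Node) -> Node:
    """Copy the parse tree, dropping backslashes in front of reserved characters."""
    if node.kind == "text":
        return Node("text", RESERVED_ESCAPE.sub(lambda m: m.group(1), node.text))
    return Node(node.kind, children=[to_ast(child) for child in node.children])
```

Since the substitution runs once per text node after parsing, the grammar itself never has to test every backslash against the full reserved set.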