Proposal to transition inline syntax from substitutions to a formal grammar
The way the inline syntax is processed in AsciiDoc is one of the most ambiguous aspects of the language. This proposal seeks to address that matter.
Background
Inline syntax refers to the identification and interpretation of markup in regular (non-verbatim) text (e.g., paragraphs, headings/titles, reftext, etc.). It’s fair to say that, up to this point, the inline syntax in AsciiDoc has been largely implementation-defined (or, at the very least, implementation-biased). From our research, we’ve concluded that the transition away from substitutions for processing the inline syntax is more pressing than we originally anticipated. It should be noted that the request to revisit the inline syntax traces back nearly a decade. This issue identifies the challenges with the current approach and proposes a new path forward for defining the inline syntax in the AsciiDoc language.
Current state and challenges
In its current state, the AsciiDoc language defines the inline syntax as a sequence of markup substitutions, effectively search and replace. However sophisticated it may be, this methodology still amounts to a battery of regex-based substitutions for inline formatting, in contrast to constructing a proper AST. For regular text, these substitutions consist of: specialchars, attributes, quotes, replacements, macros, and post_replacements.
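To make the methodology concrete, here is a minimal sketch in TypeScript (illustrative only, not Asciidoctor’s actual code; the function names are ours) of two such substitutions applied in sequence:

// Each substitution rewrites the working string in place.
const specialchars = (s: string): string =>
  s.replace(/</g, "&lt;").replace(/>/g, "&gt;")
const quotes = (s: string): string =>
  s.replace(/\*([^*]+)\*/g, "<strong>$1</strong>")

let text = "press *<Enter>* to continue"
text = specialchars(text) // "press *&lt;Enter&gt;* to continue"
text = quotes(text)       // "press <strong>&lt;Enter&gt;</strong> to continue"

// By the second step, the working text is a hybrid of input and output: the
// quotes regex matched against "&lt;Enter&gt;", not what the writer typed, and
// no tree of inline nodes (with source locations) is ever produced.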
There are two glaring problems with this methodology:
- The input changes while substitutions are being applied.
Each successive substitution operates on a slightly different source, a hybrid of input and output. As a consequence, the interpretation of the text can be affected (sometimes dramatically) by the output format and the order of substitutions. The result is often unexpected or surprising for the writer, forcing them to resort to workarounds or defensive markup (e.g., passthroughs).
It also means that the inline syntax cannot be accurately described, since the interpretation is so context-dependent. That, in turn, means it’s not possible to consistently and logically control where substitutions get applied. Another concern is that the specialchars and replacements substitutions cater to the needs of a specific family of converters, at the cost of compromising the integrity of the input.
- The substitution methodology does not produce an abstract syntax tree (AST).
The lack of a parse tree makes it impossible for tools to analyze the structure or locate inline nodes (using source range mappings) in the input without resorting to converter hacks. This deficiency also makes it difficult to extract information from the document or to transform the input before converting it. The location of inline markup is also essential for generating accurate diagnostic messages when things go wrong.
In effect, the parse phase never completes, since parsing is still happening while the document is being converted.
These are severe problems. If this specification were to describe the inline syntax as a sequence of substitutions now, we fear these problems would become so ingrained in the language that it would be impossible to address them in future versions of the specification.
Proposed change
We’ve determined that the substitution methodology is fundamentally incompatible with a parsing grammar. It’s necessary to rethink how the inline syntax is defined to make it possible to parse the AsciiDoc inline syntax into structured data. Thus, in order to specify the AsciiDoc language unambiguously, we’re proposing that the specification effort make the transition now from describing the inline syntax as substitutions to instead defining it using a formal, parsable grammar (the input to a parser generator). This change would also make the inline syntax easier to teach, understand, and process.
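As a rough illustration of the difference, consider a toy grammar (hypothetical, far simpler than what AsciiDoc requires) fed to the peggy parser generator; parsing yields typed nodes with source locations rather than a rewritten string:

import * as peggy from "peggy"

// Toy inline grammar: ordered alternatives produce typed nodes, and each
// action records the node's source location via peggy's location().
const parser = peggy.generate(`
  root = (strong / text)*
  strong = "*" value:$[^*]+ "*" { return { type: "strong", value, location: location() } }
  text = value:$([^*]+ / "*") { return { type: "text", value, location: location() } }
`)

console.log(parser.parse("some *strong* words"))
// -> [ { type: "text", value: "some ", location: {...} },
//      { type: "strong", value: "strong", location: {...} },
//      { type: "text", value: " words", location: {...} } ]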
In a blog post titled An AsciiDoc processor and Pandoc front-end in Haskell, Guillem Marpons provides strong justifications for why the specification should aim to define a grammar for the inline syntax, and also makes the case that it’s achievable.
Consequences
While this change represents a completely different mental model for how the text is interpreted, it does not mean abandoning compatibility. Far from it. Although the inline syntax is not currently defined using a parsable grammar, the grammar is inherent in the writer’s perception of the inline structure. We aim to tease out that grammar so the text is interpreted in a way that matches that perception and, consequently, allows a processor to match the current output as closely as possible. Where the behavior differs, it will likely differ in a way that more closely matches the writer’s expectation, thus being a welcome change. In cases where the current behavior cannot be matched, the text should be interpreted in such a way that no information is lost. We’ll also need to address explicit subs in the source document, likely by limiting permutations and relying on different parsing profiles where necessary.
There are notable benefits ushered in by this change. Some aspects of the syntax that were previously very difficult become much simpler. For example, we’ll be able to support backslash escaping for reserved characters (instead of contextual escaping), as sketched below. Syntax rules will naturally nest (except for marked text directly inside of like marked text), avoiding ambiguity in how boundaries are interpreted and removing the need for authors to resort to hacks. It also allows the syntax to be better behaved, since markup can be interpreted or not based on where it resides in the flow.
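Here’s a hypothetical sketch (not the proposed rules) of escaping as a grammar rule: an escape alternative tried before any markup rule makes the backslash work uniformly wherever it appears:

import * as peggy from "peggy"

// Trying the escape rule first means an escaped mark is consumed as literal
// text before any markup rule can interpret it.
const parser = peggy.generate(String.raw`
  root = (escaped / strong / text)*
  escaped = "\\" char:[*_] { return char }
  strong = "*" value:$[^*]+ "*" { return { strong: value } }
  text = $([^*\\]+ / "*" / "\\")
`)

console.log(parser.parse("real *star*, escaped \\*star\\*"))
// -> [ "real ", { strong: "star" }, ", escaped ", "*", "star", "*" ]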
Feasibility
This proposed change raises the question of whether the inline syntax can even be described using a grammar. We don’t want to end up writing fiction, so we did thorough research to answer this question. We can now say with confidence that it’s possible to accurately describe the inline syntax using a parsing expression grammar (PEG) and that the inline syntax can be parsed efficiently from that grammar (with or without packrat parsing). Not only that, but the parsing becomes much more accurate, handling situations such as nested syntax and non-interpreted regions flawlessly and addressing many scenarios that were previously ambiguous.
However, it does require doing away with the existing substitution order and introducing phases. Specifically, it means introducing the concept of a preprocessor phase to handle passthroughs and attribute references, and a postprocessing step (perhaps a responsibility of the converter) to replace special characters and typography shorthands. (The reason attribute references have to be included in the inline preprocessor is that the replaced value can contribute interpreted text and change the boundary conditions for interpreted text.) The details of these proposed phases will be resolved in separate issues.
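To illustrate the phase ordering (a hypothetical sketch; none of these names are specified): the attribute value below contributes interpreted text, so the reference must be resolved before the grammar parses the line:

const attributes: Record<string, string> = { product: "*AsciiDoc*" }

// Inline preprocessor sketch: resolve attribute references, leaving unknown
// references untouched. A real preprocessor would also track source offsets.
const preprocess = (s: string): string =>
  s.replace(/\{(\w+)\}/g, (ref: string, name: string) => attributes[name] ?? ref)

const resolved = preprocess("try {product} today")
// "try *AsciiDoc* today" -- the strong markup now sits in the text, where the
// generated inline parser can interpret it; resolving after parsing would be
// too late to affect the boundaries of interpreted text.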
Why PEG?
A key question that might come up is: why PEG?
While we’re not saying that the inline syntax can only be parsed using a PEG grammar, PEG has thus far proven to be the most suitable. We don’t think, however, that defining the inline syntax using PEG grammar rules closes the door on using another parsing technology, and we’re happy to receive a proof of how it can be done with one. But we do want to focus on the reasons using PEG will allow us to proceed in a timely fashion.
In recent years, the Python language migrated from a homegrown parser to a PEG parser. In the enhancement proposal for this change (PEP 617), they clearly describe the strengths of PEG. We think the same reasoning applies well to the AsciiDoc syntax. There are a few key points we want to highlight.
- the way (a PEG grammar) is written more closely reflects how the parser will operate when parsing it
- a PEG parser will check if the first alternative succeeds and only if it fails, will it continue with the second or the third one in the order in which they are written
Since the goal of this specification is to clearly describe the AsciiDoc syntax, there’s a key advantage in the grammar being clear, hence point 1. But the critical point is point 2. In the inline AsciiDoc syntax, all sequences of characters are valid. Only some sequences have special meaning, and interpreted text activates additional rules. Thus, it’s essential that the parsing move from start to end, consider the next alternative if a rule fails, and allow a result of no interpreted text if no alternatives match. One could say that PEG is tailor-made for parsing AsciiDoc.
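A toy grammar (hypothetical, not the proposed rules) makes this behavior visible: the markup alternative is tried first, and when it fails (here, an unclosed mark), the parser falls back to plain text, so every sequence of characters is valid input:

import * as peggy from "peggy"

// Ordered choice: strong is attempted first; on failure, the same characters
// are reinterpreted as plain text instead of producing a parse error.
const parser = peggy.generate(`
  root = (strong / text)*
  strong = "*" value:$[^*]+ "*" { return { strong: value } }
  text = $([^*]+ / "*")
`)

console.log(parser.parse("a *b* and *unclosed"))
// -> [ "a ", { strong: "b" }, " and ", "*", "unclosed" ]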
In the blog post by Guillem Marpons cited above, Guillem identifies why a PEG or PEG-like grammar is required for parsing the inline syntax in AsciiDoc.
Exhibit
To conclude this proposal, we present an exhibit of an abbreviated PEG grammar, expressed using the peggy DSL, that demonstrates how to define the inline syntax in AsciiDoc. It demonstrates some of the nuance of consuming characters efficiently while being careful not to consume markup that must be interpreted, all without relying on semantic predicates. With the right rules and strategic use of lookaheads, the AsciiDoc inline syntax can be parsed using a parser generated from a PEG grammar. (And semantic predicates are always an option to fall back on where needed.)
root = node*
node = marked_text / macro / text
marked_text = code / emphasis / strong
code = unconstrained_code / constrained_code
emphasis = unconstrained_emphasis / constrained_emphasis
strong = unconstrained_strong / constrained_strong
macro = xref_shorthand / url_macro / ...
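// Each span tries its unconstrained (double-mark) form before its constrained (single-mark) form; the rules below spell this out for code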
unconstrained_code = pre:($wordy+)? '``' main:(contents:(!'``' @constrained_code / emphasis / strong / macro / unconstrained_code_text)+ '``')
unconstrained_code_text = $(wordy ('`' !'`' / '_' / '*')) / $(not_mark_or_space+ (space not_mark_or_space+)* (space+ / &'``')) / [^`]
constrained_code = '`' !space contents0:(unconstrained_code / emphasis / strong / macro / @'`' !wordy / constrained_code_text) contents1:(unconstrained_code / emphasis / strong / macro / constrained_code_text)* '`' !wordy
constrained_code_text = $(wordy* constrained_left_mark_in_code) / $(not_mark_or_space+ (space not_mark_or_space+)* &('`' !wordy)) / $(space+ (!'`' / &'``' &unconstrained_code / '`')) / @'`' &wordy / escaped / [^ `]
// repeat previous four rules for emphasis, strong, and mark
xref_shorthand = '<<' target:(!space @$[^,>]+) contents:(',' space? @(marked_text / xref_shorthand_text)+)? '>>'
url_macro = protocol:('link:' @'' / @('https://' / 'http://')) !space target:$[^\[]+ '[' contents:(marked_text / macro_contents_text)* ']'
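// text is the fallback rule: it efficiently consumes runs of ordinary characters, including marks that cannot open a span (e.g., a mark preceded by a word character)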
text = $(wordy* ('`' / '_' / '*')) / $(not_mark_or_space+ (space / ':'? !.))+ / space / escaped / .
escaped = '\\' ([`_*<] / $(wordy* ':'))
wordy = [\p{Alpha}0-9]
not_mark_or_space = [^ `_*:<\\]
constrained_left_mark_in_code = '_' / '*'
constrained_left_mark_in_emphasis = '`' / '*'
constrained_left_mark_in_strong = '`' / '_'
space = ' '
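Assuming the elided rules (the emphasis, strong, and mark variants, the remaining *_text helpers, and so on) are filled in, such a grammar can be handed directly to a parser generator. A minimal sketch with peggy (inline.peg is a hypothetical file name):

import * as peggy from "peggy"
import { readFileSync } from "node:fs"

// Generate a parser from the completed grammar and parse an inline fragment
// into structured data instead of applying substitutions to a string.
const inlineParser = peggy.generate(readFileSync("inline.peg", "utf8"))
console.log(JSON.stringify(inlineParser.parse("see the `docs` at <<ref>>")))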
Much of the specification work will be focused on what syntax, and which combinations of syntax, these rules permit.
There are notable absences in the grammar above (as it’s not yet complete). Among them are passthroughs and attribute references. Those will necessarily need to be handled in a preprocessing phase. A proposal to introduce that phase, and how to preserve the source ranges of inlines, will be covered in a separate issue.