# Decide whether the attribute value reader and inline preprocessor preemptively resolve escaped backslashes

Issue by Dan Allen · https://gitlab.eclipse.org/eclipse/asciidoc-lang/asciidoc-lang/-/issues/43

The text of a paragraph goes through two phases of inline parsing, the inline preprocessor and the inline parser. The value of an attribute entry goes through two phases as well, the value assembler and the inline preprocessor. (That value will subsequently go through the inline parser when referenced in a paragraph.) At each stage, there's syntax that can be escaped using a backslash. For example, in the value of an attribute entry, an attribute reference or a value continuation can be escaped using a backslash. Consider the following case:
```
:hint: Use \{backslash} to insert \\
```
When the `hint` attribute is referenced in a paragraph, we expect to see the following in the rendered document:
```
Use {backslash} to insert \
```
We need to consider how we end up at that result.
There are two strategies for how escaped backslashes can be handled as the processor works through the inline parsing phases.
## Strategy 1: Resolve escaped backslashes per phase (strict)
Following this strategy, each time the processor looks for escaped backslashes, it resolves (or normalizes) them. What that entails is consuming the odd backslash (if present) as an escape, then reducing the number of backslashes by half. Consider this sequence:
```
\\\
```
That would resolve to:
```
\
```
The benefit of this strategy is that it can account for every permutation of backslash escaping. If you want the backslash to be treated as a literal backslash, you just add more backslashes. However, this strategy quickly leads to the leaning toothpick problem, which is essentially an exponential increase in the number of required backslashes.
Let's assume that we start with the following AsciiDoc source:
```
:command: *begin*
:text: Use ??
{command} to begin a block.
```
We want to see the following in the output document:
```
Use \<strong>begin</strong> to begin a block.
```
The question is, how many trailing backslashes do we need to use in place of ?? to produce a literal backslash without impacting the attribute reference and text formatting it contains? The answer is, we need 9.
```
:text: Use \\\\\\\\\
{command} to begin a block.
```
The last backslash acts as a value continuation. The value assembler then reduces the even number of backslashes that precede it by half, leaving us with 4. At this stage, this is what the processor sees:
```
Use \\\\{command} to begin a block.
```
Now we resolve the attribute reference, once again reducing the even number of backslashes that precede it by half. At this stage, here's what the processor sees:
```
Use \\*begin* to begin a block.
```
When the `{text}` attribute reference is used in the paragraph, the inline parser will locate the escaped backslash in the resolved value and once again reduce the backslashes by half, reducing it to 1 (which will not impact the text formatting). Thus, we arrive at the following result:
```
Use \<strong>begin</strong> to begin a block.
```
While this works, it's hard to explain to an author—especially someone not familiar with the low-level phases—why 9 backslashes are needed. Thus, I think we should consider strategy 2.
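To make the chain of reductions concrete, here is a minimal sketch (a hypothetical helper, not spec text or implementation code) that counts how many literal backslashes survive the phases under strategy 1:

```javascript
// Hypothetical sketch: how many literal backslashes survive strategy 1,
// starting from a run of trailing backslashes in an attribute entry value.
function strategy1(count) {
  let n = count - 1        // the odd trailing backslash is consumed as a value continuation
  n = Math.floor(n / 2)    // value assembler halves the remaining pairs
  n = Math.floor(n / 2)    // inline preprocessor halves again when resolving the reference
  n = Math.floor(n / 2)    // inline parser halves one final time
  return n
}

console.log(strategy1(9)) // 1 literal backslash, as in the example above
```

Working backward from the one backslash we want in the output, each phase doubles the count, which is where the 9 comes from.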
## Strategy 2: Only resolve escaped backslashes once, during inline parsing
In this strategy, the escaped backslashes are still considered at each phase, but they are left as is until inline parsing (the last phase). That way, they remain stable through the phases rather than being reduced at each stage. As a result, the user only needs to escape a backslash once.
Revisiting the previous example, the author only needs 3 trailing backslashes to achieve the desired result.
```
:command: *begin*
:text: Use \\\
{command} to begin a block.
```
The odd backslash is consumed as the value continuation. The remaining escaped backslash is reduced to a literal backslash by the inline parser.
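The arithmetic under strategy 2 can be sketched the same way (again a hypothetical helper, not spec text):

```javascript
// Hypothetical sketch: under strategy 2, the trailing backslash is consumed
// as a value continuation, and the remaining run is halved only once, by the
// inline parser. The intermediate phases leave the backslashes untouched.
function strategy2(count) {
  const afterContinuation = count - 1
  return Math.floor(afterContinuation / 2) // single reduction during inline parsing
}

console.log(strategy2(3)) // 1 literal backslash from 3 trailing backslashes
```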
The drawback of this strategy is that it's not possible to use a backslash to escape the resolved value of an attribute. Let's assume that we want the following output instead:
```
Use *begin* to begin a block.
```
If we use `\\{command}`, then we're going to end up with `\\*begin*` rather than `\*begin*`. So we've sacrificed some flexibility for simplicity. However, there's still a mechanism available to achieve the desired result. If we set the `esc` attribute to a single backslash, then it becomes possible to insert an escape character in front of the resolved value of the attribute. Consider this case:
```
:esc: \
:text: Use {esc}\
{command} to begin a block.
```
Now when we reference `{text}`, the inline parser will see `\*begin*`. That means the output will show:
```
Use *begin* to begin a block.
```
**NOTE:** The value of the implicit `backslash` attribute will need to be `\\` rather than `\` so it produces a literal backslash as expected.
## Proposed decision
Given that the audience for AsciiDoc is more than just programmers, I think the simpler approach is best here. We want to avoid the leaning toothpick problem, and we want to be able to easily explain the AsciiDoc rules without having to make the user aware of all the low-level phases. There are still plenty of mechanisms available in AsciiDoc to escape syntax without having to rely on strict backslash escaping.
It's worth noting that none of the scenarios mentioned in this issue are even available in Asciidoctor or its predecessor. That's because both only consider whether the character that immediately precedes the reserved syntax (e.g., an attribute reference) is a backslash, not whether that backslash is itself escaped. So this issue is primarily a refinement of #25.

# Clarify syntax and parsing rules for continuing an attribute entry value across multiple lines

Issue by Dan Allen · https://gitlab.eclipse.org/eclipse/asciidoc-lang/asciidoc-lang/-/issues/41

Most of the time, an attribute entry occupies a single line. For example:
```
:source-language: java
```
When the value is very long, the AsciiDoc syntax allows that value to be split across multiple lines by ending each previous line in a backslash, called an *attribute continuation*. This feature is inspired by shell interpreters, such as Bash. For example:
```
:description: This page is a migration guide. \
It only covers the migration between each LTS release.
```
The attribute continuation has never been very well defined beyond a basic example. This issue aims to resolve the syntax and parsing rules while also making the feature more robust and universal.
The attribute continuation serves two purposes. First, it tells the parser to append the next line to the value as long as that line is not an interrupting line. If the line is taken, the continuation and the newline that follows it are dropped. If the line is not taken, the continuation is preserved (meaning it remains as part of the value), but not the trailing newline.
Thus, the resolved value of the previous example is as follows:
```
This page is a migration guide. It only covers the migration between each LTS release.
```
**NOTE:** In addition to the interrupting lines for a paragraph, an attribute entry is interrupted by an adjacent attribute entry. Asciidoctor does not always get this requirement right. (Also, it's still unclear whether a list continuation should only be an interrupting line when inside of a list, or at any time).
Both Asciidoctor and its predecessor required the attribute continuation to be preceded by a space. However, this is an unnecessary requirement, and it makes it impossible to continue the value without introducing a space. It should be possible to use the continuation directly at the end of the line.
```
:product-code: ISV-\
1234
```
This attribute entry would produce the value `ISV-1234`.
Any time we rely on a character to have special meaning, especially a backslash, it should be possible to escape that character. Like with the inline preprocessor, we will want to apply contextual escaping here. What that means is that if there are an even number of backslashes at the end of the line, the last backslash does not act as an attribute continuation and those backslashes are reduced by half. If there are an odd number, the last backslash is an attribute continuation and the remaining backslashes are reduced by half. Escaped backslashes anywhere else in the line are not considered.
Here's an example of how to use a literal backslash at the end of a value:
```
:instructions: escape markup using \\
```
However, keep in mind that most of the time this won't be necessary. That's because the backslash is preserved if the attribute entry is interrupted, which it almost always is. So this is unlikely to affect existing documents. Consider this case:
```
:instructions: escape markup using \
{instructions}
```
Here's an example of how to use a literal backslash and then continue the value:
```
:instructions: escape an autolink using \\\
https://example.org
```
Again, these are pretty rare events, so we're just defining the rules for completeness.
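The even/odd rule for trailing backslashes can be sketched like this (a hypothetical helper; only backslashes at the end of the line are considered, per the contextual escaping rule above):

```javascript
// Hypothetical sketch of contextual escaping at the end of an attribute
// entry line: an odd run of trailing backslashes means the last one acts as
// an attribute continuation; either way, the rest collapse to half as many
// literal backslashes.
function parseLineEnd(line) {
  const match = line.match(/\\+$/)
  const count = match ? match[0].length : 0
  const continued = count % 2 === 1
  const literals = '\\'.repeat(Math.floor(count / 2))
  return { text: line.slice(0, line.length - count) + literals, continued }
}
```

Two trailing backslashes yield one literal backslash and no continuation; three yield one literal backslash followed by a continuation, matching the two examples above.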
An attribute continuation allows the continued value to be aligned with the value on the previous line. Yet, the indentation is dropped from the value. Consider this case:
```
:description: This page is a migration guide. \
It only covers the migration between each LTS release.
```
Shell interpreters also support this feature. In shell interpreters, the repeating spaces are always normalized to a single space. However, I don't think we want that behavior. Instead, all leading indentation should be removed and only the space to the left of the attribute continuation should be kept. That gives the user better control over where the space ends up in the resolved value.
Of course, we have to consider whether we even want to normalize the spaces at all or just keep them as entered. In other words, do we want to encourage this style of formatting in the AsciiDoc source, or should the wrapped line always start at the left margin?
The final point to consider is how to specify a hard wrap. Consider the case when the value of the attribute entry is going to be used in a verbatim block or a paragraph with the hardbreaks option. The author is going to want to be able to preserve the newlines in the attribute value so that they carry over. But this is not possible in AsciiDoc.
Asciidoctor offers a partial compromise by enhancing the attribute continuation to recognize a hard line break shorthand before the continuation. When Asciidoctor detects this case, it preserves the newline. Consider this case:
```
:lines: one + \
two + \
three
```
The resolved value would be as follows:
```
one +
two +
three
```
This is not a general-purpose feature, and thus I think we can do better. I see two possible ways to express that the newline should be preserved, and there's no need to link it to the hard line break shorthand.
The first option is to use a double attribute continuation offset by a space. For example:
```
:lines: one\ \
two\ \
three
```
The escaped space in front of the attribute continuation would tell the processor to keep the newline after the attribute continuation. This is not likely a syntax that would interfere with content. However, it may be costly to parse.
Another option is to take a page from YAML and use the `|` character in front of the continuation as a hint to keep the newline.
```
:lines: one|\
two|\
three
```
However, the risk here is that the pipe character is used to separate table cells, so it could cause an AsciiDoc table cell to end prematurely. Though it could be escaped in that case.
Yet another option is to take a hint from Markdown and use multiple spaces in front of the continuation as a hint to preserve the ensuing newline.
```
:lines: one \
two \
three
```
This may be the safest and most portable option, and it's not terribly difficult to parse. It's rare that you need spaces at the end of a line, so we're able to take advantage of characters that would otherwise have no meaning. That's ideal for introducing a new feature. When newlines are preserved, indentation on wrapped lines is also preserved.
We can apply this to the earlier example of the partial syntax offered by Asciidoctor to see how it compares:
```
:lines: one + \
two + \
three
```
It's nearly the same syntax, but now it's not coupled to the hard line break shorthand.
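Putting the proposed rules together, value assembly might be sketched as follows (a hypothetical helper; it ignores backslash escaping for brevity and assumes the two-spaces-before-the-continuation rule for preserving newlines):

```javascript
// Hypothetical sketch of assembling a multi-line attribute value: a trailing
// backslash continues the value, two spaces before it preserve the newline
// (and the indentation of the following line), and soft-wrapped lines have
// their indentation dropped. Backslash escaping is ignored here for brevity.
function assembleValue(lines) {
  let value = ''
  let hardWrap = false
  lines.forEach((line, i) => {
    if (i > 0 && !hardWrap) line = line.replace(/^ +/, '') // drop indentation on soft wraps
    const last = i === lines.length - 1
    if (!last && line.endsWith('\\')) {
      hardWrap = line.endsWith('  \\')
      value += hardWrap ? line.slice(0, -3) + '\n' : line.slice(0, -1)
    } else {
      value += line
      hardWrap = false
    }
  })
  return value
}
```

A soft wrap keeps only the space to the left of the continuation, so the description example resolves to a single sentence, while the `:lines:` example resolves with its newlines intact.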
* In summary, an attribute entry is interrupted by an adjacent attribute entry or paragraph interrupting line
* An attribute value can be continued to the next line by ending the line in an attribute continuation (trailing backslash)
* If the attribute continuation is unused, it is preserved at the end of the value
* Indentation is removed from wrapped lines
* The attribute continuation can be escaped using a backslash (any even number of backslashes at the end of the line)
* Newlines in an attribute value can be preserved by preceding the attribute continuation with two spaces

Milestone: 0.4.0 (milestone build)

# Proposal to clarify the behavior of the preprocessor

Issue by Dan Allen · https://gitlab.eclipse.org/eclipse/asciidoc-lang/asciidoc-lang/-/issues/26

## Purpose
The line-oriented preprocessor in AsciiDoc (herein the AsciiDoc preprocessor) is one of the biggest hurdles we face in formalizing the AsciiDoc language syntax (using a grammar formalism). That’s because the preprocessor directives are both coupled with the document structure and work outside of it.
The purpose of this issue is to figure out how to describe the behavior of the preprocessor in a way that allows the language to be formalized. We’ll look at it from various perspectives, ranging from untangling the preprocessor from the document structure so it’s easier to parse, to carefully defining grammar rules, actions, and parsing requirements to match the existing functionality. In the end, it may just come down to accepting the existing behavior and figuring out how to describe it in terms of a grammar formalism.
This issue can be resolved once we've decided on the behavior of the AsciiDoc preprocessor and resolved how to describe it in such a way that it doesn't prevent formalizing the grammar for the AsciiDoc language.
## Background
> In this section, I define the AsciiDoc preprocessor, describe its purpose, and explain how it works today (according to the user documentation and how it’s implemented in Asciidoctor).
The AsciiDoc preprocessor provides directives that add or remove lines from the source document ahead of (block-level) parsing. The preprocessor is strictly line-oriented.
There are two types of preprocessor directives: conditional directives and the include directive. The conditional directives (ifdef, ifndef, and ifeval) are for filtering lines in the source document. Conditional directives are useful for producing variations of a single document, such as to repurpose it for different audiences. The include directive (include) is for adding new lines to the source document from an external file, thus allowing documents to be composed.
Once a preprocessor directive is processed, the parser does not see it (and thus does not introduce a boundary in the structure). Rather, the parser only sees the outcome of the directive (which is either more or less lines than what’s in the source document).
In its purest form, a [preprocessor](https://en.wikipedia.org/wiki/Preprocessor) is applied to the source document before the document is parsed (i.e., called a lexical preprocessor). But that’s not how the AsciiDoc preprocessor works. The AsciiDoc preprocessor is somewhere between a lexical preprocessor and a syntactic preprocessor.
The AsciiDoc preprocessor is able to see the value of attributes set or unset by attribute entries in the document. This pertains to attribute entries set in either the header or the body. However, since attribute entries are not permitted just anywhere in an AsciiDoc document, knowing how to find them means the preprocessor must have at least some awareness of the document structure. On the other hand, the preprocessor directives themselves can appear anywhere in the document (except perhaps in comment blocks, which is open for discussion), meaning they exist outside the structure of the document. Thus, the preprocessor must be able to recognize the structure of the document enough to process attribute entries, but not restrict where the preprocessor directives can be used.
To summarize, the preprocessor has access to document attributes as soon as they are defined in the document (in addition to ones passed to the processor), but does not otherwise recognize or honor the document’s block structure. Thus, we can say that the AsciiDoc preprocessor is not a lexical preprocessor, but rather a syntactic preprocessor (at least in part). I prefer to think of it as a priority (or contextual) preprocessor.
Unfortunately, the behavior I just described presents a real problem for defining a grammar for AsciiDoc. The requirements it calls for are really at odds with a grammar formalism and potentially compromise our ability to define one. Addressing this problem may call for a separate parsing phase which handles preprocessor directives with just enough parsing of the block structure baked in to also handle attribute entries. Or it may be possible to fold that behavior into the primary grammar, hiding it behind select annotated rules to keep the grammar tidy. Either way, it’s going to put some real constraints on which parsing technologies can be used to parse AsciiDoc.
Let’s consider different approaches that make the described behavior compatible with a grammar formalism and/or change the behavior so it can be.
## Proposed Models
There are at least four ways we can consider defining the behavior of the preprocessor:
* lexical preprocessor
* lexical preprocessor in body
* priority block processor
* priority line processor
We’ll look at each of these in detail.
### Lexical Preprocessor
One approach we could take is to redefine the AsciiDoc preprocessor as a strict lexical preprocessor. Using this model, the preprocessor only looks for preprocessor directives as reserved, line-oriented tokens. In other words, it looks at the source document as a series of lines, but otherwise doesn’t acknowledge the structure of the document. This is by far the easiest to implement. The grammar rules only have to look for lines that are preprocessor directives, adding or removing lines as prescribed. The grammar does not have to try to find or process attribute entries. However, what it means is that the only attributes the preprocessor directives can see are the ones passed into the processor (in other words, attributes defined in the document don’t affect the operation of the preprocessor).
Although this model may sound alluring, it has a tremendous impact on compatibility. The assumption that preprocessor directives can reference attributes defined in the document, at the very least in the document header, is ingrained and, as such, documents that use preprocessor directives are often written in this way. Changing this model now would almost certainly violate our commitment to creating a specification that’s reasonably compatible with existing content. In other words, it would be a significant departure from AsciiDoc as we know it.
### Lexical Preprocessor in Body
To address the problem of the lexical preprocessor not being able to see attribute entries defined in the document header, where they are most often defined, we could consider different preprocessing rules for the document header. This would work because the structure of the document header is flat and thus lends itself to line-oriented processing. Locating the end of the header is only complicated by having to consider preprocessor directives, which would already be addressed here.
The document header could be processed line-by-line, allowing the preprocessor directives to see the result of the previous line and filtering lines in advance of the header parser. While the grammar for the header would be slightly less formal (or require extra parsing requirements), that exception would be confined to the document header. Once the header is cleared, the preprocessor would switch to being a lexical preprocessor, ignoring all remaining structure (and, as such, attribute entries).
This model is certainly something worth considering, but I'd need to see a proof of concept of it working. It does have an impact on compatibility, but only in the case where documents are written to use preprocessor directives that rely on attributes defined in the body of the document. If there are documents that rely on this behavior, then this change will break compatibility in a way that is significant.
An impact assessment would certainly need to be done here. It’s not uncommon for documents to change the value of an attribute to modify the target of a subsequent include. We often see this pattern used in books, where the target of a chapter file is controlled by an attribute. Documents may also use attributes to change the location of an example file that is included in a code block. My instinct tells me that we’re going to find that it’s going to present major problems.
### Priority Block Processor
Another model for the preprocessor is to process it as a transparent block that does not appear in the ASG. In this model, the preprocessor would be part of the document structure and thus would naturally be able to see attributes set or unset by attribute entries. However, it would impose a lot of new restrictions on how and where the preprocessor directives can be used. It could also introduce side effects in the parsing.
For one, conditional preprocessor directives would have to be balanced within the document structure. In other words, they couldn’t overlap boundaries of a block like they can today. They would also not be permitted in places where blocks are not allowed, such as around block attribute lines. There’s also a question of how they would be processed within verbatim content, something that’s permitted today. It’s also not clear whether lines contributed by adjacent include directives would be stitched back together to create a single block. If not, the preprocessor directives would end up introducing artificial boundaries in the block structure.
While this model has some merit, it also has tremendous consequences for compatibility. And while it may simplify the grammar, it would also require additional processing to transform the parse tree. Thus, I don’t really see how we can consider it.
### Priority Line Processor
A priority line processor is the closest model to what we have in AsciiDoc today. Thus, this is the preferred proposal.
In this model, every line must be checked for a preprocessor directive before it’s considered by the grammar parser. If a preprocessor directive is found, it needs to be processed and the input modified so the pending grammar rule only sees the outcome. If the preprocessor directive leaves behind a preprocessor directive on the same line (such as by an include directive), that directive must also be processed. Once the current line is confirmed to not be a preprocessor directive, the pending grammar rule may proceed.
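The check-and-rescan loop described above might look like this (a hypothetical sketch; `resolve` stands in for whatever expands a conditional or include directive into its resulting lines):

```javascript
// Hypothetical sketch of a priority line processor: before a grammar rule may
// consume the line at the cursor, any preprocessor directive found there is
// expanded in place and the same position is rescanned, so directives
// contributed by an include are processed too.
const isDirective = (line) =>
  /^(ifdef|ifndef|ifeval|endif)::/.test(line) || /^include::/.test(line)

function preprocessAt(lines, i, resolve) {
  while (i < lines.length && isDirective(lines[i])) {
    lines.splice(i, 1, ...resolve(lines[i])) // replace the directive with its outcome
  }
  return lines
}
```

For example, `preprocessAt(['include::intro.adoc[]', 'body'], 0, read)` would leave the included lines at index 0, which the pending grammar rule then sees instead of the directive.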
The priority line processor can either be integrated with the grammar parser (thus a single parsing phase), or it can be done as a separate parsing phase. If done as a separate phase, it will still have to consider the structure of the document in order to locate and process attribute entries, but this mode is effectively a lightweight parse rather than a complete one. By lightweight, I mean that it would be doing just enough to identify valid attribute entries.
While choosing this model has no impact on compatibility, it puts rather substantial restrictions on what parsing technologies can be used for parsing AsciiDoc. We are essentially requiring the parser to allow the input ahead of the cursor to be modified while parsing is taking place. It also has to be possible to instruct the parser to backtrack to the location of the preprocessor directive after the directive has been processed. That, in turn, means that any information cached about the input at that point forward needs to be cleared. These requirements are distinctly at odds with a grammar formalism.
With that said, it’s not likely that an implementation will use a grammar-based parser to handle the preprocessor requirement. Instead, it may decide to employ bespoke line-based processing logic for this step, such as we see in Asciidoctor (and downdoc). But we still may be able to describe the behavior of the preprocessor using the grammar from a grammar-based parser that can accommodate the stated requirements. In doing so, we will have achieved the goal of communicating the normative rules using a grammar while, at the same time, not mandating that an implementation do it that way.
Here’s a partial exhibit of a dedicated grammar for the preprocessor that shows how a priority line processor might work:
```
document = header? body lf*
header = ...
body = pp_block*
pp_block = (pp (lf / attribute_entry / block_attribute_line))* block
pp = (pp_directive* . !.)?
pp_directive = pp_conditional / pp_conditional_short / pp_include
pp_conditional_short = operator:pp_conditional_name '::' attribute_name:attribute_name '[' contents:$([^\n\]]+ &(']' eol) / ([^\n\]] / ']' !eol)+) ']' eol
{
// see action for pp_conditional rule
}
pp_conditional = operator:pp_conditional_name '::' attribute_name:attribute_name '[]\n' contents:conditional_lines 'endif::[]' eol
{
const { start: { offset: startOffset }, end: { offset: endOffset } } = location()
const drop = operator === 'ifdef' ? !(attribute_name in options.attributes) : (attribute_name in options.attributes)
// TODO record line offsets
input = input.slice(0, (peg$currPos = startOffset)) + (drop ? '' : contents.join('')) + input.slice(endOffset)
peg$posDetailsCache = [{ line: 1, column: 1 }]
return true
}
conditional_lines = (!('endif::[]' eol) @(pp_conditional_pair / $([^\n]+ eol) / '\n'))*
pp_conditional_pair = opening:$(pp_conditional_name '::' attribute_name '[]\n') contents:conditional_lines closing:$('endif::[]' eol)?
pp_conditional_name = 'ifdef' / 'ifndef'
pp_include = 'include::' target:$[^\[\n]+ '[]' eol
{
const { start: { offset: startOffset }, end: { offset: endOffset } } = location()
const contents = require('fs').readFileSync(target, 'utf8').split(/(?<=\n)/)
// TODO record line offsets
input = input.slice(0, (peg$currPos = startOffset)) + contents.join('') + input.slice(endOffset)
peg$posDetailsCache = [{ line: 1, column: 1 }]
return true
}
block = example / listing / list / ... / paragraph
attribute_entry = ':' name:attribute_name ':' value:(' ' @$[^\n]+ / '') eol
{
options.attributes[name] = value
}
attribute_name = $[a-z]+
example = ...
listing = '----\n' contents:$(pp !('----' eol) line / '\n')* pp '----' eol
list = ...
paragraph = (pp !(block_attribute_line / any_parent_block_delimiter_line) @line)+
line = value:$([^\n]+ eol)
eol = '\n' / eof
eof = !.
lf = '\n'
```
There are a couple of things to notice about this grammar. Any time the grammar looks for a line, it must run the `pp` rule to make sure the line has been preprocessed. When the grammar looks for preprocessor directives, it must keep looking until it doesn’t find any at that location. It must then fail that rule so that the cursor is not advanced. (In practice, I found it necessary to reset the cursor manually since I couldn’t find a way to fail each `pp_directive` rule individually but still continue checking for preprocessor directives). The action that processes a preprocessor directive must modify the input to replace the directive with its contents (either the conditional lines or the contents of the include). It then needs to move the cursor back to the start offset of the directive so the input can be reprocessed starting at that point. The grammar needs to walk the block structure, but does not have to get into the finer details of how to parse the blocks. In particular, it doesn’t need to consider the inline syntax at all.
The behavior of the priority line processor is thus described formally as follows:
* a lightweight parse of the block structure in order to identify and process attribute entries
* the inclusion of the `pp` rule to identify and process preprocessor directives
* a rule to read the contents of a conditional preprocessor directive without processing the lines
It’s debatable whether it helps to have a separate preprocessing phase, though it’s certainly useful as a tool (consider the role of Asciidoctor Reducer). I think when we define the primary grammar for the language, we may want to do so without including the preprocessor rules (thus thinking about them as a separate phase). But an implementation may combine the grammars to avoid having to maintain separate grammars.
One very important factor to consider in all these models is how to map nodes to the original source. In other words, how to track line offsets as a result of resolving the preprocessor directives.
## Line Offsets
Another big challenge with the preprocessor (irrespective of the model) is tracking line offsets. When reporting problems, or to allow a document to be properly analyzed, we want to be able to map the location of nodes in the parsed document / ASG to the source document or documents. If a preprocessor comes through and moves lines around, it compromises the parser’s ability to provide this information accurately. Thus, when the preprocessor runs, it must build a map of processed lines to source lines. The parser then needs to run the reported location through this map to resolve the correct location in the source document. (From experience, building this map can be quite tricky).
Although the logic is difficult, the result of the mapping is quite easy to understand. Consider the following AsciiDoc source, where `flag` is a stand-in for an attribute that is set:
```asciidoc
début
ifdef::flag[]
conditional content
endif::[]
fin
```
Here’s the source the parser will see (after the preprocessor does its thing):
```asciidoc
début
conditional content
fin
```
A line offset mapping may look something like this:
```json
{
"1": { "line": 1, "column": 1, "delta": 0 },
"2": { "line": 3, "column": 1, "delta": 1 },
  "3": { "line": 5, "column": 1, "delta": 2 }
}
```
The parser can run the reported line through this map to get the source line. Obviously, it gets a little trickier when we have to consider included lines, but the idea is still the same.
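A lookup through such a map might be sketched as follows (hypothetical; it assumes entries carry a running `delta`, so processed lines that fall between entries reuse the nearest preceding one):

```javascript
// Hypothetical sketch: resolve a line reported by the parser to a line in the
// original source using the preprocessor's offset map. An exact entry wins;
// otherwise the delta of the nearest preceding entry is applied.
function toSourceLine(map, reported) {
  if (map[reported]) return map[reported].line
  const keys = Object.keys(map).map(Number).filter((k) => k < reported)
  if (keys.length === 0) return reported // no preprocessing before this line
  return reported + map[Math.max(...keys)].delta
}
```

With the map shown above, `toSourceLine(map, 2)` resolves to source line 3.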
## Conclusion
After careful analysis, I don’t see any way to make the preprocessor less complicated than it currently is (both in terms of how to define it using a grammar formalism and how to implement it). Of the models presented, I think the priority line processor is the best choice to pursue. That’s partly because it maintains compatibility with existing usage. It’s also because I’ve proved that it’s possible to use a grammar formalism to describe its behavior given we can make use of specialized parsing features to do it. Namely, we have to assume that the parser is capable of modifying the input as it proceeds and to reprocess input that was modified by resolving preprocessor directives.
I do think it’s at least worth discussing the switch to a lexical preprocessor for the document body. However, I’m not convinced that actually makes AsciiDoc simpler to parse (in addition to the incompatibility problem it introduces). Thus, it may be better to stick to defining the preprocessor as it currently works in Asciidoctor (a priority line processor), but to do a better job of tracking line offsets accurately.