# Proposal to clarify the behavior of the preprocessor

## Purpose
The line-oriented preprocessor in AsciiDoc (herein the AsciiDoc preprocessor) is one of the biggest hurdles we face in formalizing the AsciiDoc language syntax (using a grammar formalism). That’s because the preprocessor directives are both coupled with the document structure and work outside of it.
The purpose of this issue is to figure out how to describe the behavior of the preprocessor in a way that allows the language to be formalized. We’ll look at it from various perspectives, ranging from untangling the preprocessor from the document structure so it’s easier to parse, to carefully defining grammar rules, actions, and parsing requirements that match the existing functionality. In the end, it may just come down to accepting the existing behavior and figuring out how to describe it in terms of a grammar formalism.
This issue can be resolved once we've decided on the behavior of the AsciiDoc preprocessor and resolved how to describe it in such a way that it doesn't prevent formalizing the grammar for the AsciiDoc language.
## Background
> In this section, I define the AsciiDoc preprocessor, describe its purpose, and explain how it works today (according to the user documentation and how it’s implemented in Asciidoctor).
The AsciiDoc preprocessor provides directives that add or remove lines from the source document ahead of (block-level) parsing. The preprocessor is strictly line-oriented.
There are two types of preprocessor directives: conditional directives and the include directive. The conditional directives (ifdef, ifndef, and ifeval) are for filtering lines in the source document. Conditional directives are useful for producing variations of a single document, such as to repurpose it for different audiences. The include directive (include) is for adding new lines to the source document from an external file, thus allowing documents to be composed.
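To make the two directive types concrete, here’s a minimal sketch (the attribute name and include target are hypothetical):

```asciidoc
// kept only when the env-draft attribute is set (ifndef would invert the test)
ifdef::env-draft[]
This paragraph only appears in draft builds.
endif::[]

// pulls the lines of the target file into the document at this position
include::shared/terms.adoc[]
```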
Once a preprocessor directive is processed, the parser does not see it (and thus the directive does not introduce a boundary in the structure). Rather, the parser only sees the outcome of the directive (which may be more or fewer lines than what’s in the source document).
In its purest form, a [preprocessor](https://en.wikipedia.org/wiki/Preprocessor) is applied to the source document before the document is parsed (i.e., called a lexical preprocessor). But that’s not how the AsciiDoc preprocessor works. The AsciiDoc preprocessor is somewhere between a lexical preprocessor and a syntactic preprocessor.
The AsciiDoc preprocessor is able to see the value of attributes set or unset by attribute entries in the document. This pertains to attribute entries in either the header or the body. However, since attribute entries are not permitted just anywhere in an AsciiDoc document, knowing how to find them means the preprocessor must have at least some awareness of the document structure. On the other hand, the preprocessor directives themselves can appear anywhere in the document (except perhaps in comment blocks, which is open for discussion), meaning they exist outside the structure of the document. Thus, the preprocessor must recognize enough of the document’s structure to process attribute entries, yet not restrict where the preprocessor directives can be used.
To summarize, the preprocessor has access to document attributes as soon as they are defined in the document (in addition to ones passed to the processor), but does not otherwise recognize or honor the document’s block structure. Thus, we can say that the AsciiDoc preprocessor is not a lexical preprocessor, but rather a syntactic preprocessor (at least in part). I prefer to think of it as a priority (or contextual) preprocessor.
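A minimal sketch of that behavior (the attribute name is hypothetical): the conditional below only keeps its content because the preprocessor already saw the attribute entry in the header.

```asciidoc
= Document Title
// this attribute entry is visible to the preprocessor as soon as it's processed
:audience-internal:

ifdef::audience-internal[]
This paragraph is kept because audience-internal was set in the document header.
endif::[]
```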
Unfortunately, the behavior I just described presents a real problem for defining a grammar for AsciiDoc. The requirements it calls for are really at odds with a grammar formalism and potentially compromise our ability to define one. Addressing this problem may call for a separate parsing phase that handles preprocessor directives, with just enough parsing of the block structure baked in to also handle attribute entries. Or it may be possible to fold that behavior into the primary grammar, hiding it behind select annotated rules to keep the grammar tidy. Either way, it’s going to put some real constraints on which parsing technologies can be used to parse AsciiDoc.
Let’s consider different approaches that make the described behavior compatible with a grammar formalism and/or change the behavior so it can be.
## Proposed Models
There are at least four ways we can consider defining the behavior of the preprocessor:
* lexical preprocessor
* lexical preprocessor in body
* priority block processor
* priority line processor
We’ll look at each of these in detail.
### Lexical Preprocessor
One approach we could take is to redefine the AsciiDoc preprocessor as a strict lexical preprocessor. Using this model, the preprocessor only looks for preprocessor directives as reserved, line-oriented tokens. In other words, it looks at the source document as a series of lines, but otherwise doesn’t acknowledge the structure of the document. This is by far the easiest to implement. The grammar rules only have to look for lines that are preprocessor directives, adding or removing lines as prescribed. The grammar does not have to try to find or process attribute entries. However, what it means is that the only attributes the preprocessor directives can see are the ones passed into the processor (in other words, attributes defined in the document don’t affect the operation of the preprocessor).
Although this model may sound alluring, it has a tremendous impact on compatibility. The assumption that preprocessor directives can reference attributes defined in the document, at the very least in the document header, is ingrained and, as such, documents that use preprocessor directives are often written in this way. Changing this model now would almost certainly violate our commitment to creating a specification that’s reasonably compatible with existing content. In other words, it would be a significant departure from AsciiDoc as we know it.
### Lexical Preprocessor in Body
To address the problem of the lexical preprocessor not being able to see attribute entries defined in the document header, where they are most often defined, we could consider different preprocessing rules for the document header. This would work because the structure of the document header is flat and thus lends itself to line-oriented processing. Locating the end of the header is only complicated by having to consider preprocessor directives, which would already be addressed here.
The document header could be processed line-by-line, allowing the preprocessor directives to see the result of the previous line and filtering lines in advance of the header parser. While the grammar for the header would be slightly less formal (or require extra parsing requirements), that exception would be confined to the document header. Once the header is cleared, the preprocessor would switch to being a lexical preprocessor, ignoring all remaining structure (and, as such, attribute entries).
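Here’s a sketch of what header-only preprocessing would still support (the attribute names are hypothetical): each directive in the header can see the attribute entries on the lines before it.

```asciidoc
= Document Title
:product: Falcon
// the preprocessor has already seen :product: when it evaluates this directive
ifdef::product[]
:product-title: {product} Reference Guide
endif::[]
```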
This model is certainly worth considering, but I'd need to see a proof of concept of it working. It does have an impact on compatibility, but only where documents use preprocessor directives that rely on attributes defined in the body of the document. If there are documents that rely on this behavior, then this change will break compatibility in a significant way.
An impact assessment would certainly need to be done here. It’s not uncommon for documents to change the value of an attribute to modify the target of a subsequent include. We often see this pattern used in books, where the target of a chapter file is controlled by an attribute. Documents may also use attributes to change the location of an example file that is included in a code block. My instinct tells me we’re going to find that this presents major problems.
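Here’s a sketch of that pattern (the attribute name and paths are hypothetical); under a body-level lexical preprocessor, these includes would no longer see the updated attribute value:

```asciidoc
// select which edition's chapters get included
:chaptersdir: chapters/en

include::{chaptersdir}/chapter-01.adoc[]

:chaptersdir: chapters/fr

include::{chaptersdir}/chapter-02.adoc[]
```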
### Priority Block Processor
Another model for the preprocessor is to process it as a transparent block that does not appear in the ASG. In this model, the preprocessor would be part of the document structure and thus would naturally be able to see attributes set or unset by attribute entries. However, it would impose a lot of new restrictions on how and where the preprocessor directives can be used. It could also introduce side effects in the parsing.
For one, conditional preprocessor directives would have to be balanced within the document structure. In other words, they couldn’t overlap boundaries of a block like they can today. They would also not be permitted in places where blocks are not allowed, such as around block attribute lines. There’s also a question of how they would be processed within verbatim content, something that’s permitted today. It’s also not clear whether lines contributed by adjacent include directives would be stitched back together to create a single block. In other words, the preprocessor directives would end up introducing artificial boundaries in the block structure.
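For instance, this contrived sketch (the attribute name is hypothetical) is permitted today because the preprocessor ignores block structure, but it could not be expressed if conditionals had to be balanced within a block:

```asciidoc
ifdef::include-example[]
====
endif::[]
A paragraph that may or may not end up inside an example block.
ifdef::include-example[]
====
endif::[]
```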
While this model has some merit, it also has tremendous consequences for compatibility. And while it may simplify the grammar, it would also require additional processing to transform the parse tree. Thus, I don’t really see how we can consider it.
### Priority Line Processor
A priority line processor is the closest model to what we have in AsciiDoc today. Thus, this is the preferred proposal.
In this model, every line must be checked for a preprocessor directive before it’s considered by the grammar parser. If a preprocessor directive is found, it needs to be processed and the input modified so the pending grammar rule only sees the outcome. If processing the directive leaves another preprocessor directive at the same line position (as an include directive can), that directive must also be processed. Once the current line is confirmed not to be a preprocessor directive, the pending grammar rule may proceed.
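For example (the file names are hypothetical), resolving the include below leaves another directive at the same input position, which must also be processed before the grammar rule resumes:

```asciidoc
// main.adoc
include::part.adoc[]

// part.adoc (its first line is itself a preprocessor directive)
ifdef::flag[]
Conditional first line of the included file.
endif::[]
```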
The priority line processor can either be integrated with the grammar parser (thus a single parsing phase), or it can be done as a separate parsing phase. If done as a separate phase, it will still have to consider the structure of the document in order to locate and process attribute entries, but that phase is effectively a lightweight parse rather than a complete one. By lightweight, I mean that it would do just enough to identify valid attribute entries.
While choosing this model has no impact on compatibility, it puts rather substantial restrictions on what parsing technologies can be used for parsing AsciiDoc. We are essentially requiring the parser to allow the input ahead of the cursor to be modified while parsing is taking place. It also has to be possible to instruct the parser to backtrack to the location of the preprocessor directive after the directive has been processed. That, in turn, means that any information cached about the input at that point forward needs to be cleared. These requirements are distinctly at odds with a grammar formalism.
With that said, it’s not likely that an implementation will use a grammar-based parser to handle the preprocessor requirement. Instead, it may decide to employ bespoke line-based processing logic for this step, such as we see in Asciidoctor (and downdoc). But we still may be able to describe the behavior of the preprocessor using the grammar from a grammar-based parser that can accommodate the stated requirements. In doing so, we will have achieved the goal of communicating the normative rules using a grammar while, at the same time, not mandating that an implementation do it that way.
Here’s a partial exhibit of a dedicated grammar for the preprocessor that shows how a priority line processor might work:
```
document = header? body lf*
header = ...
body = pp_block*
pp_block = (pp (lf / attribute_entry / block_attribute_line))* block
pp = (pp_directive* . !.)?
pp_directive = pp_conditional / pp_conditional_short / pp_include
pp_conditional_short = operator:pp_conditional_name '::' attribute_name:attribute_name '[' contents:$([^\n\]]+ &(']' eol) / ([^\n\]] / ']' !eol)+) ']' eol
{
// see action for pp_conditional rule
}
pp_conditional = operator:pp_conditional_name '::' attribute_name:attribute_name '[]\n' contents:conditional_lines 'endif::[]' eol
{
const { start: { offset: startOffset }, end: { offset: endOffset } } = location()
const drop = operator === 'ifdef' ? !(attribute_name in options.attributes) : (attribute_name in options.attributes)
// TODO record line offsets
input = input.slice(0, (peg$currPos = startOffset)) + (drop ? '' : contents.join('')) + input.slice(endOffset)
peg$posDetailsCache = [{ line: 1, column: 1 }]
return true
}
conditional_lines = (!('endif::[]' eol) @(pp_conditional_pair / $([^\n]+ eol) / '\n'))*
pp_conditional_pair = opening:$(pp_conditional_name '::' attribute_name '[]\n') contents:conditional_lines closing:$('endif::[]' eol)?
pp_conditional_name = 'ifdef' / 'ifndef'
pp_include = 'include::' target:$[^\[\n]+ '[]' eol
{
const { start: { offset: startOffset }, end: { offset: endOffset } } = location()
const contents = require('fs').readFileSync(target, 'utf8').split(/(?<=\n)/)
// TODO record line offsets
input = input.slice(0, (peg$currPos = startOffset)) + contents.join('') + input.slice(endOffset)
peg$posDetailsCache = [{ line: 1, column: 1 }]
return true
}
block = example / listing / list / ... / paragraph
attribute_entry = ':' name:attribute_name ':' value:(' ' @$[^\n]+ / '') eol
{
options.attributes[name] = value
}
attribute_name = $[a-z]+
example = ...
listing = '----\n' contents:$(pp !('----' eol) line / '\n')* pp '----' eol
list = ...
paragraph = (pp !(block_attribute_line / any_parent_block_delimiter_line) @line)+
line = value:$([^\n]+ eol)
eol = '\n' / eof
eof = !.
lf = '\n'
```
There are a couple of things to notice about this grammar. Any time the grammar looks for a line, it must run the `pp` rule to make sure the line has been preprocessed. When the grammar looks for preprocessor directives, it must keep looking until it doesn’t find any at that location. It must then fail that rule so that the cursor is not advanced. (In practice, I found it necessary to reset the cursor manually since I couldn’t find a way to fail each `pp_directive` rule individually but still continue checking for preprocessor directives.) In the action that processes the preprocessor directive, it must modify the input to replace the directive with its contents (either the conditional lines or the contents of the include). It then needs to move the cursor back to the start offset of the directive so the input can be reprocessed starting at that point. The grammar needs to walk the block structure, but does not have to get into the finer details of how to parse the blocks. In particular, it doesn’t need to consider the inline syntax at all.
In this approach, the behavior of the priority line processor is described formally as follows:
* a lightweight parse of the block structure in order to identify and process attribute entries
* the inclusion of the `pp` rule to identify and process preprocessor directives
* a rule to read the contents of a conditional preprocessor directive without processing the lines
It’s debatable whether it helps to have a separate preprocessing phase, though it’s certainly useful as a tool (consider the role of Asciidoctor Reducer). I think when we define the primary grammar for the language, we may want to do so without including the preprocessor rules (thus thinking about them as a separate phase). But an implementation may combine the grammars to avoid having to maintain separate grammars.
One very important factor to consider in all these models is how to map nodes to the original source. In other words, how to track line offsets as a result of resolving the preprocessor directives.
## Line Offsets
Another big challenge with the preprocessor (irrespective of the model) is tracking line offsets. When reporting problems, or to allow a document to be properly analyzed, we want to be able to map the location of nodes in the parsed document / ASG to the source document or documents. If a preprocessor comes through and moves lines around, it compromises the parser’s ability to provide this information accurately. Thus, when the preprocessor runs, it must build a map of processed lines to source lines. The parser then needs to run the reported location through this map to resolve the correct location in the source document. (From experience, building this map can be quite tricky).
Although the logic is difficult, the result of the mapping is quite easy to understand. Consider the following AsciiDoc source:
```asciidoc
début
ifdef::flag[]
conditional content
endif::[]
fin
```
Here’s the source the parser will see after the preprocessor does its thing (assuming the flag attribute is set):
```asciidoc
début
conditional content
fin
```
A line offset mapping may look something like this:
```json
{
"1": { "line": 1, "column": 1, "delta": 0 },
"2": { "line": 3, "column": 1, "delta": 1 },
"3": { "line": 5, "column": 1, "delta": 2 },
}
```
The parser can run the reported line through this map to get the source line. Obviously, it gets a little trickier when we have to consider included lines, but the idea is still the same.
## Conclusion
After careful analysis, I don’t see any way to make the preprocessor less complicated than it currently is (both in terms of how to define it using a grammar formalism and how to implement it). Of the models presented, I think the priority line processor is the best choice to pursue. That’s partly because it maintains compatibility with existing usage. It’s also because I’ve proved that it’s possible to use a grammar formalism to describe its behavior, given that we can make use of specialized parsing features to do it. Namely, we have to assume that the parser is capable of modifying the input as it proceeds and of reprocessing input that was modified by resolving preprocessor directives.
I do think it’s at least worth discussing the switch to a lexical preprocessor for the document body. However, I’m not convinced that actually makes AsciiDoc simpler to parse (in addition to the incompatibility problem it introduces). Thus, it may be better to stick to defining the preprocessor as it currently works in Asciidoctor (a priority line processor), but to do a better job of tracking line offsets accurately.