Proposal to introduce an inline preprocessor step
If we decide to move forward with the transition from substitutions to a formal grammar (#16 (closed)) to define the inline syntax in AsciiDoc, which looks probable, it will require rethinking how attribute references and inline passthroughs are processed. This issue identifies an inherent conflict in the inline syntax and offers several proposals for how to solve it. The leading proposal is to introduce an inline preprocessor step.
Challenges
The primary issue with parsing the inline syntax using grammar rules is dealing with attribute references. We can’t consider attribute references without also considering inline passthroughs. So, in fact, there are two markup elements that present a problem: attribute references and inline passthroughs. At first glance, it may seem as though these are just part of the parsable grammar. However, there are expected behaviors with both markup elements that can fundamentally change the stream of text being parsed, and thus what gets interpreted.
Based on how AsciiDoc is currently processed in Asciidoctor, attribute references are replaced after the quotes substitution (i.e., marked text), which itself happens after the specialchars substitution. When an attribute reference is replaced, there’s an expectation that the value of that attribute is considered when parsing the text (or at least the remaining substitutions, notably macros). If the attribute reference were treated as merely a node in the parse tree, it would be too late for the resolved value to influence the parsing. It’s also not possible to go back and reparse the input at this stage as it’s no longer in text form, but rather in the form of a parse tree. That’s why we say that there is an inherent conflict in the syntax (at least when trying to apply a formal grammar to it).
Consider the case of a URL macro where the URL is stored in an attribute (e.g., {url-chat}[Chat]). It’s not possible to replace the attribute reference, then back up to parse the text that it added. One, because most parsing models don’t support modifying the input mid-parse or backtracking. But two, even if one could, how far back would you go to look for markup? It’s not obvious or deterministic.
We also have to consider the case when an attribute reference adds syntax that would impact a markup element that’s currently being tested. Again, it’s not possible (or not easy) to tell the parser to integrate this new text.
It’s clear that attribute references live outside of the primary parsing step, and thus must be expanded before the text is parsed (or just ahead of it).
Now let’s consider inline passthroughs. On the one hand, we could view an inline passthrough as a span of marked text that doesn't allow the markup in its contents to be interpreted. However, one of the things that an inline passthrough can hide is an attribute reference. If attribute references are expanded in a preprocessing step, then the inline passthroughs must be considered there as well. That would mean that the inline passthroughs have to be discovered in both the preprocessor step and the primary parsing step. In the simplest case, this duplicate check is redundant (not to mention that it complicates the grammar). But there’s a larger issue at hand, which is the fact that expanding an attribute reference can defuse an inline passthrough (either because it changes the constrained boundaries or it introduces a significant backslash).
If an attribute reference is directly adjacent to an inline passthrough, that attribute reference can contribute text which may cause the boundaries of the inline passthrough to change. As a consequence, that inline passthrough may be found in the preprocessor step, but not in the primary parsing step. Thus, it’s necessary to process and replace both the inline passthroughs and attribute references together (then later patch in the contents of the inline passthrough into the appropriate slot in the parse tree).
Fortunately, there’s no expectation that the contents of an inline passthrough can influence the parsing of the surrounding text, so we can safely reinsert its contents after the fact.
If an inline preprocessor step is introduced, it will need to account for the source range deltas caused by expanding attribute references and extracting inline passthroughs. The parse tree still needs to be able to map each node back to its original location in the input. For attribute references, all characters in the attribute value should be attributed to the start location of the attribute reference. Otherwise, the source range could refer to an invalid range that extends beyond the attribute reference in the input.
We may also need to reconsider the perplexing order in which the value of an attribute reference is introduced. Currently, the value is introduced after the marked text has been discovered. (Note that the value itself has already had the specialchars substitution applied to it). This ordering is not possible to replicate in a formal grammar. However, this behavior has always been suspect anyway.
We may need to view an attribute reference strictly as a reusable snippet of source text and allow it to be interpreted as though its value had been added directly to the text. This will certainly be easier to explain and would avoid the kinds of hacks (read as: passthroughs with custom subs) authors often use to get marked text to be interpreted in the value of an attribute. In #16 (closed), we’ve already explained how the responsibility of the specialchars substitution would be delegated to the converter, which would make it a non-issue here. Thus, all we would be changing is allowing the attribute to hold a span (or partial span) of marked text.
Solutions
There are at least three solutions for integrating attribute references and inline passthroughs when interpreting the inline syntax using a formal grammar. One solution is to introduce a preprocessor step, which is the preferred approach being proposed. The second and third solutions involve relying on semantic predicates to examine the value of attributes during parsing, effectively limiting the scope of where an attribute reference can influence the parsing (the third solution being more limited than the second).
Let’s examine how these solutions will work in detail.
Solution A: Expand attributes references up front
With this strategy, attribute references that are outside of inline passthroughs will be expanded first. The boundaries of the passthroughs do have to be considered in this phase in addition to the main parsing phase, but since they are a flat structure, there isn’t too much overhead involved. The passthroughs could be extracted during this phase and replaced with placeholders, which then get restored after parsing (see Appendix A for details). The latter approach, while more drastic, avoids the need to use semantic predicates to resolve side effects caused by expanding attribute values.
The main drawback of this strategy is that it changes what is interpreted in the attribute value. Once the value of the attribute is inserted, marked text (aka quotes) in that value will be interpreted. This is a departure from the current behavior, although one that may not come up all that often and that may be welcomed when it does.
The other drawback of this strategy is that character offsets become more difficult to track. Once the attribute reference is replaced, the character offset of all text that follows will potentially shift. That means the character offsets identified by the parser have to be corrected during the main parsing phase to account for these shifts. See Appendix B for details.
Solution B: Expand attributes in text nodes and reparse node's text
With this strategy, attribute references are kept as text until all text is parsed into a parse tree. During semantic extraction of the tree (AST -> ASG), all text nodes are visited and the attribute references in the value (if any are present) are expanded at that time.
Since there is an expectation that an attribute value can hold all or part of a macro, the parser must be run again on the modified value of the text node with the attribute references expanded. The text node must then be replaced with the normalized AST from this secondary parse. However, it will also need to avoid passthrough text (which is now integrated into the value of the text node). Thus, it may be necessary to keep passthrough marks until after this phase is complete.
Aside from being absurdly complex, this strategy has another fundamental drawback. If the macro structure is only formed after the attribute reference has been replaced, but the literal/textual portions (such as the attrlist) have already been consumed by other rules and parsed into a tree, the run of text that remains will no longer be recognized as a macro structure. It will surprise a writer that this happens when the attrlist contains formatted text, but not when it doesn’t. This doesn’t seem like a viable strategy, so another approach is needed.
One way to deal with these problems is to recognize the macro structure during the initial parse. However, to do so means relying on a semantic predicate. Using a semantic predicate, the parser can "peek inside" an attribute reference to see if it contains the characters the grammar is looking for. If we consider the input {url-chat}[Chat], where url-chat contains a URL, peeking into the {url-chat} attribute reference would reveal the URL that starts the macro. The attribute reference would later be replaced in the macro element’s target during AST -> ASG (and not parsed further). The parser could then do the same for any kind of macro for which we want to allow this attribute replacement.
There would still be limits to where an attribute reference can introduce interpreted text, but limits that might be considered reasonable and which can be logically explained. For example, it would not be possible to write a URL macro as {h}{t}{t}{p}{s}://, as that would be unreasonable (if even possible) to implement.
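The peek described above can be sketched as a small predicate function. This is a hypothetical helper (the name, signature, and scheme list are illustrative assumptions, not part of any grammar); it only shows how a semantic predicate could consult the resolved value of an attribute reference before committing to a macro rule.

```python
import re

# Illustrative scheme list; not normative for AsciiDoc.
URL_SCHEME = re.compile(r'(?:https?|ftp|irc)://|mailto:')

def begins_url_macro(token, attributes):
    """Semantic predicate: does this token start a URL macro target?

    For an attribute reference such as {url-chat}, peek at the resolved
    value rather than the raw reference text."""
    if token.startswith('{') and token.endswith('}'):
        token = attributes.get(token[1:-1], '')
    return URL_SCHEME.match(token) is not None
```

A grammar host (PEG generators commonly support `&{...}`-style predicates) would invoke such a check before descending into the macro rule, then substitute the value into the macro target during AST -> ASG.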
Solution C: Expand attributes in text nodes and parse attribute value
With this strategy, like the previous strategy, attribute references are kept as text until all the text is parsed into a parse tree. What changes is that the secondary parse only runs on the attribute’s value, not on the text into which the attribute value is inserted. This limits the scope of what markup can be found in the attribute value. The advantage this has over the previous strategy is that it’s a bit easier for the author to understand. However, it still has severe drawbacks in terms of compatibility. An attribute reference would no longer be able to arbitrarily layer in interpreted text.
Conclusion
In order to make the transition from substitutions to a formal grammar, the inherent conflict in the syntax caused by attribute references (and how they relate to inline passthroughs) must be addressed. We first considered the challenges that attribute references present and how they can be reframed to mitigate these challenges. We then considered three solutions for handling attribute references in a way that’s compatible with formal grammar rules.
We considered expanding attribute references up front using an inline preprocessor and allowing markup in the attribute value to be processed (Solution A). We also considered expanding attribute references in the text nodes of the parse tree and parsing the expanded text again, either reparsing the entire run of text (Solution B) or only the inserted value (Solution C).
After exploring these three solutions, and considering how they impact compatibility with existing documents and complexity of the language and implementations, it’s clear that Solution A, introducing an inline preprocessor step, is the best path forward.
Appendices A and B provide details about how to fully implement Solution A.
Appendix A: Passthrough Processing
Inline passthroughs mark regions of text that should not be interpreted (or that should be interpreted differently, as is the case with custom subs). In order to achieve this, the inline parser must identify these regions and pass over them, removing the enclosure in the process.
If attribute references are to be expanded before the text is otherwise interpreted, then inline passthroughs have to be considered at that point as well so attribute references within a passthrough are not expanded. That warrants a dedicated parsing phase, which we might refer to as the inline preprocessor.
While this phase can both identify inline passthroughs and expand attribute references, the question comes up of what to do with these passthroughs. One option is to leave them in place, which means the second phase has to once again identify them. While this could work, it’s not very efficient since the grammar rules have to run again, and have to be considered in all descents. It’s also necessary to consider the case when an attribute value changes the boundaries of a passthrough such that it is defused. Handling that situation likely requires use of a semantic predicate to consult the boundaries from the first phase. The characters in the passthrough region could also cause the optimization of sniffing for parsable text to fail, resulting in incremental parsing when it wasn’t necessary.
One way to address these shortcomings is to replace the passthrough regions with a placeholder or placeholder characters. The placeholder would necessarily have to be a non-word character, and one that would not appear in the source text. Control characters (including NUL) are ideally suited for this purpose. This technique would entail capturing the passthrough into a table and replacing each passthrough region with "data link escape" (DLE) followed by NUL characters to fill the region. (If a single-character replacement is used instead, the parser would need to account for the offset it creates.) In the second phase, the parser would first look for this placeholder sequence, look up the contents of the passthrough in the table, and put the contents into the parse tree as passthrough (or raw) text. The parser would then continue on looking for other rules. But it would not need to worry about any of those rules extending into a passthrough region since a match would be impossible.
Using placeholders for passthrough content during the second parsing phase is not mandatory, but can substantially simplify the parsing rules and make room for optimizations (specifically those that avoid parsing when not necessary). For implementations that choose to take this route, the AsciiDoc Language specification should designate characters for this purpose (e.g., DLE and NUL).
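The equal-length placeholder technique can be sketched as follows (a Python sketch under the assumption that passthrough spans have already been discovered by the preprocessor; restoration into the parse tree is elided). Because each region is replaced by a string of identical length, every character offset in the masked text still matches the original text.

```python
DLE, NUL = '\x10', '\x00'

def mask_passthroughs(text, spans):
    """Replace each passthrough region (given as (start, end) source spans)
    with DLE followed by NUL padding of equal length, so character offsets
    in the masked text line up with the original text."""
    table = []
    chars = list(text)
    for start, end in spans:
        table.append(text[start:end])          # capture contents for later
        chars[start:end] = DLE + NUL * (end - start - 1)
    return ''.join(chars), table
```

In the second phase, matching DLE tells the parser to emit the next entry from the table as raw text; no other rule can match into the masked region, since DLE and NUL never appear in source text.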
Appendix B: Offsets and Attribute Reference Expansion
One of the drawbacks of expanding attribute references in a preprocessor phase is that it modifies character offsets, causing the second parse phase to report character offsets that don’t line up with the source text. If we want an accurate parse tree from the inline syntax, this problem needs to be addressed. This problem can be solved by tracking the accumulated delta between the source text and the expanded text during the first phase.
First, an offsets table should be built that maps the 0-based position of every character in the source text to an offset object. At a minimum, each offset object contains a delta property/key, which is initially 0.
When an attribute reference is encountered, the delta between the length of the attribute reference (e.g., {name}) and the length of the resolved value should be computed. That value should then be assigned to the delta property/key in the offsets table for every position that follows the attribute reference. In other words, we track how much the following characters have shifted as a result of the expansion.
If a delta already exists at the start position of the attribute reference, it should be added to the delta (which becomes the accumulated delta). For each subsequent position in the attribute reference, 1 should be subtracted from the delta. This allows the position of every character in the attribute value to be attributed to the start position of the attribute reference (since that value is not actually in the source text). All replaced positions in the offsets table should be assigned the attribute name that was replaced so that this can be traced. The offsets table could optionally track the offset of that character within the attribute value (for use in application messages).
During the second parsing phase, the parser must translate the beginning position of a node to the position in the source text by adding the delta from the offsets table for that position. This ensures that the character offset for a node is the one from the source text, not the expanded text.
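The bookkeeping above can be sketched in Python. This is a simplified model (the regex, function name, and table shape are illustrative assumptions): the offsets table is indexed by position in the expanded text, each entry carries the accumulated delta, and every character contributed by an attribute value maps back to the start position of the reference in the source text.

```python
import re

REF = re.compile(r'\{([a-z][a-z0-9_-]*)\}')

def expand_with_offsets(text, attributes):
    """Expand attribute references, recording for each position in the
    expanded text a delta that maps it back to the source text
    (source_pos = expanded_pos + delta)."""
    out, offsets = [], []
    delta, pos = 0, 0
    for m in REF.finditer(text):
        value = attributes.get(m.group(1))
        if value is None:
            continue                            # unresolved refs stay literal
        chunk = text[pos:m.start()]
        out.append(chunk)
        offsets.extend({'delta': delta} for _ in chunk)
        # every character of the value points at the start of the reference,
        # tagged with the attribute name for traceability
        out.append(value)
        offsets.extend({'delta': delta - k, 'attr': m.group(1)}
                       for k in range(len(value)))
        delta += len(m.group(0)) - len(value)   # accumulated delta
        pos = m.end()
    tail = text[pos:]
    out.append(tail)
    offsets.extend({'delta': delta} for _ in tail)
    return ''.join(out), offsets
```

For example, expanding 'see {p}!' with p set to 'docs' yields 'see docs!'; the '!' sits at expanded position 8, and adding its delta recovers position 7, its location in the source text, while every character of 'docs' maps back to position 4, the start of the reference.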