AsciiDoc Language issues
https://gitlab.eclipse.org/eclipse/asciidoc-lang/asciidoc-lang/-/issues
2024-02-28T00:22:45Z

# Proposal to make backslash escaping stable

https://gitlab.eclipse.org/eclipse/asciidoc-lang/asciidoc-lang/-/issues/25
Dan Allen, 2024-02-28T00:22:45Z

Escaping markup using a backslash character (“backslash escaping”) is one of the weaker areas of the AsciiDoc syntax. As currently described in the user docs (which reflects how it's implemented in Asciidoctor), a backslash character is only treated as meaningful if it precedes a markup element (markup that would have otherwise been interpreted). For example, `\*stars*` becomes `*stars*`. If the backslash is used in front of a character that isn't a markup element (i.e., doesn't match a grammar rule), such as `\*star`, the input remains as is, `\*star`. To a writer not well-versed in the rules of the AsciiDoc syntax, this behavior appears broken.
What we can say is that backslash escaping is contextual. Rather than instructing the parser to pass through the escaped character without interpreting it (`\*` becomes `*`), it's dependent on whether that markup character is enlisted in a markup element. That puts the onus on the writer to track where the markup character is being used and whether that usage gives it special meaning. Expecting the writer to take on this responsibility makes backslash escaping feel unstable. Writers often avoid using this escaping mechanism and resort to more brute-force methods such as inline passthroughs.
As part of formalizing the AsciiDoc language, I feel strongly that we should stabilize this mechanism to make it more approachable.
I see three ways we could define backslash escaping:
* **contextual** - the backslash prevents a markup element from being interpreted; in this case, the backslash is consumed (`\*stars*` becomes `*stars*`); if no markup element is found immediately following the backslash, the backslash is left in place (`\*star` remains as `\*star`)
* **universal** - a backslash can be used in front of any character and consumed; the character it escapes will not be considered when looking for markup elements (`\look at that \*` becomes `look at that *`); a literal backslash must escape itself (`\\` becomes `\`)
* **reserved** - a backslash can be used in front of any reserved character in the markup and is always consumed; if used in front of any other character, the backslash is left as is (`\n is a line feed; \* is an asterisk` becomes `\n is a line feed; * is an asterisk`); a literal backslash in front of a reserved markup character would have to itself be escaped (`\\*word*` becomes `\<strong>word</strong>`); otherwise, the backslash can be written as `\\` or `\`
One exception, to maintain backwards compatibility, is a macro prefix, which is treated as a single markup expression (`\link: starts a link macro` becomes `link: starts a link macro`). Another option would be to switch to contextual backslash escaping in this case, though it would add a dependency on using semantic predicates in the parser. Regardless, moving forward, escaping the colon would be preferred (`link\: starts a link macro`). Another exception is a bare URL, which is treated as a single markup expression and thus a contextual escape. (This wouldn't rely on a semantic predicate since there is no intention to interpret the identified URL any other way.)
As mentioned above, AsciiDoc is currently described to permit contextual backslash escaping. We want to move past this. However, universal backslash escaping may be a step too far if we consider the impact on compatibility. The most notable problem will be Windows file paths. Under universal backslash escaping rules, `C:\projects` becomes `C:projects`. We can't expect writers to go back and fix all these cases. Besides, writers don't expect a backslash to have any special meaning in a file path, and requiring them to double every backslash quickly leads to leaning toothpick syndrome.
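To make the compatibility concern concrete, here's a minimal Python sketch of universal backslash escaping applied to plain text. The function name `unescape_universal` is hypothetical, and the sketch ignores markup parsing entirely. It only shows which backslashes would be consumed.

```python
def unescape_universal(text: str) -> str:
    """Hypothetical helper: consume every backslash and keep the character it escapes."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # The escaped character passes through literally; the backslash is dropped.
            out.append(text[i + 1])
            i += 2
        else:
            # Ordinary character (or a trailing backslash with nothing to escape).
            out.append(ch)
            i += 1
    return "".join(out)


print(unescape_universal(r"\look at that \*"))  # look at that *
print(unescape_universal(r"C:\projects"))       # C:projects
```

The second call shows the Windows file path problem: the backslash silently disappears even though the writer never intended it as an escape.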
Therefore, reserved backslash escaping may offer the best compromise. By choosing reserved backslash escaping, the writer no longer has to worry about escaped markup that doesn't match a syntax rule, but also won't be faced with the Windows file path problem. The only thing that still must be considered is that escaping markup could cause different markup to be found, which then must be escaped.
One **open question** is which markup characters to define as reserved? Should we say that all symbol/punctuation characters in the ASCII charset can be escaped, or limit it to just the ASCII characters that the AsciiDoc syntax currently uses? For reference, CommonMark allows escaping all ASCII punctuation.
Here are the reserved markup characters identified thus far:
```
\ ` _ * # ~ ^ : [ < ( {
```
Note that it shouldn't be necessary to escape the closing bracket of a markup element, which is why those characters are not listed here as reserved.
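As a rough illustration of the reserved rule, here's a minimal Python sketch that drops a backslash only when it precedes one of the reserved characters listed above. The names `RESERVED` and `unescape_reserved` are hypothetical. The sketch covers only the backslash handling on plain text; it doesn't interpret markup, and it doesn't implement the macro prefix or bare URL exceptions.

```python
# Reserved markup characters identified thus far (taken from the list above).
RESERVED = set("\\`_*#~^:[<({")


def unescape_reserved(text: str) -> str:
    """Hypothetical helper: drop a backslash only if it precedes a reserved character."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in RESERVED:
            # Backslash is consumed; the reserved character is kept as a literal.
            out.append(text[i + 1])
            i += 2
        else:
            # Any other backslash is left as is.
            out.append(ch)
            i += 1
    return "".join(out)


print(unescape_reserved(r"\n is a line feed; \* is an asterisk"))  # \n is a line feed; * is an asterisk
print(unescape_reserved(r"C:\projects"))                           # C:\projects
```

Unlike the universal rule, the Windows file path survives untouched because `p` is not a reserved character.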
Another **open question** is how to escape unconstrained marked text. Currently, AsciiDoc requires that the opening unconstrained mark be double escaped (`\\**stars**`). However, this is both context-dependent and ambiguous (as escaping a backslash should make a literal backslash). Therefore, we may have to change this rule to be (`\*\*stars**`). This will introduce a slight incompatibility, but one that is reasonable to explain and to justify with the goal of making backslash escaping stable.
The examples provided thus far focus on where backslash escaping is used in inline syntax. It should also be considered for the following block-level constructs:
* preprocessor directive (`\include::target[]`)
* block macro (`\image::target[]`)
* list item (`\* is an asterisk`)
* dlist term (`App\:: is a Ruby namespace`) (or should it be `\App:: is a Ruby namespace`?)
* heading (`\= is an equals sign`)
**Open Question:** For block-level constructs, are we interpreting the backslash because it's at the beginning of the line, or because it is escaping a character? I think we should consider it because it's used at the beginning of the line. (I think this would translate to removing the backslash at the beginning of a paragraph). That reduces how much markup we have to designate as reserved.
In terms of parsing efficiency, we have identified the following optimization for processing backslash escaping. During parsing, only consider backslash characters that are escaping a grammar rule that is being considered. Once parsing is complete, drop the backslash in front of all reserved characters in the transformation from a parse tree to an AST/ASG. This can minimize the number of checks that the grammar has to consider.

# Clarify how empty lines and list continuations impact list boundaries

https://gitlab.eclipse.org/eclipse/asciidoc-lang/asciidoc-lang/-/issues/39
Dan Allen, 2024-02-27T21:17:44Z

Lists have implicit boundaries in AsciiDoc (and in most lightweight markup languages). Hence, a common matter for an author is how to maintain the boundaries of a list or how to break out of them. Our goal is to ensure that it's easy for an author to keep a list together when needed, but also easy to separate lists when they shouldn't be adjoined.
In AsciiDoc, there are two forms that impact this outcome, the list continuation and the empty line. In this issue, we'll clarify what impact these two forms have on list parsing with the intent to solidify the rules of list boundaries.
## Scenarios
To understand how the grammar rules are defined, we'll be examining several scenarios:
* **[l-1]** Empty lines between list items (after the list item definitively ends)
```
* first item

* second item
```
* **[l-2]** Empty lines above a block attached with a list continuation, as well as between its metadata lines
```
* item
+
attached block
```
```
* item
+
[#idname]
attached block with an ID
```
* **[l-3]** Empty lines above a new list with or without block metadata following a list item
```
* item
. nested list
```
```
* item
[]
. nested or sibling list?
```
* **[l-4]** Empty lines above an indented (literal) block with or without block metadata following a list item
```
* item
 indented
```
```
* item
[]
 indented
```
Although it won't be discussed in this issue, an empty line above a list continuation applies that list continuation to an ancestor list. The number of empty lines equates to how many levels it ascends (e.g., one empty line means it applies to the parent).
### l-1
Let's start with **[l-1]**, since a decision here sets the foundation for what rules are available for the other scenarios. We often refer to **[l-1]** as ventilated list items. The reason is, authors have a tendency to want to put some space between list items to make them more readable. The question is, how much space is allowed? Consider the following case:
```
* first item
* second item
```
In pre-spec AsciiDoc, any number of empty lines is tolerated between list items and the list will still stay together. To tighten this rule, and make it easier to separate lists, one proposal is to only permit a single empty line between list items. Any more and the list would be severed. However, this proposal could have major compatibility implications as there are many documents in the wild that rely on arbitrary ventilation. Furthermore, this new rule would be inconsistent with other lightweight markup languages such as Markdown and reStructuredText. It just seems to be part of the unwritten code of lightweight markup languages to allow list items to be separated by an arbitrary number of empty lines. (One notable exception is Textile.) If we decide to honor that code, then we won't be able to allow adjacent lists that have the same marker to be separated using empty lines alone.
Currently, the main way to separate adjacent lists that are congruent (i.e., same list marker) is to insert a block attribute line between them (with or without a preceding empty line). The block attribute line can be empty (i.e., nothing between the square brackets). Since a sibling list item cannot have metadata lines above it, this line effectively acts as an interrupting line. As a result, it causes the first list to end and a new one to begin. For example:
```
* first list
[]
* second list
```
This technique also works if the second list is preceded by a block title line, though it usually has to follow an empty line in order to be recognized as a block title line (typical rules).
If the lists are not congruent, the empty line above the block attribute line is required (to account for the scenario in **[l-3]**).
To make the intent of the interrupting line more clear, a non-functional option could be used to communicate the block attribute line's function:
```
* first list
[%interrupt]
* second list
```
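The rules discussed so far for congruent lists can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles a single level of `* ` items, treats any line of the form `[...]` as a block attribute line, and rejects everything else. The names `BLOCK_ATTR` and `split_lists` are hypothetical.

```python
import re

# Simplified block attribute line: square brackets around anything, including nothing.
BLOCK_ATTR = re.compile(r"^\[.*\]$")


def split_lists(lines):
    """Hypothetical helper: group `* ` items into lists, tolerating empty lines."""
    lists, current = [], []
    for line in lines:
        if line.startswith("* "):
            current.append(line[2:])
        elif not line.strip():
            # Any number of empty lines between items keeps the list together.
            continue
        elif BLOCK_ATTR.match(line):
            # A block attribute line acts as an interrupting line: end the current list.
            if current:
                lists.append(current)
                current = []
        else:
            raise ValueError(f"line not handled by this sketch: {line!r}")
    if current:
        lists.append(current)
    return lists


print(split_lists(["* first item", "", "", "* second item"]))
# [['first item', 'second item']]
print(split_lists(["* first list", "[%interrupt]", "* second list"]))
# [['first list'], ['second list']]
```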
Another technique to keep them apart is to enclose one of the lists in an open block:
```
* first list
~~~~
* second list
~~~~
```
We are still considering whether there are other ways to separate adjacent lists, such as using a line comment. However, we generally prefer comments to not impact parsing, so this may not be pursued. Either way, it will be addressed in a separate issue.
### l-2
Let's now consider **[l-2]**. Here, there's an argument to once again be tolerant of multiple empty lines, but for different reasons. Before going on, it's important to emphasize that the list continuation cannot be preceded by an empty line (otherwise, it becomes a list continuation for an ancestor list). When the list continuation is found, it effectively tells the parser to expect a single block. Normally, a block can have empty lines above it (either above the metadata or in between it). Thus, it seems like it would be safe and consistent to allow them here too. The intent of the author is clear: "find one block to attach". If empty lines were not tolerated, then the parser would essentially be ignoring the request of the author and leave the list continuation dangling.
The AsciiDoc style guide should certainly encourage authors to not leave empty lines after a list continuation as it makes the attachment less clear. But from the standpoint of the parser, there's no real benefit of giving empty lines special meaning here.
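The "find one block to attach" behavior can be sketched as follows. This is a minimal Python illustration, not the grammar itself: it assumes metadata consists only of block attribute lines of the form `[...]`, and the names `BLOCK_ATTR` and `find_attached_block` are hypothetical.

```python
import re

# Simplified block attribute line, assumed to be the only kind of metadata line.
BLOCK_ATTR = re.compile(r"^\[.*\]$")


def find_attached_block(lines, plus_index):
    """Hypothetical helper: after the `+` at plus_index, skip empty lines and
    collect metadata lines until the first content line of the block.

    Returns (metadata_lines, index_of_first_content_line or None).
    """
    if lines[plus_index] != "+":
        raise ValueError("expected a list continuation line")
    metadata = []
    for i in range(plus_index + 1, len(lines)):
        line = lines[i]
        if not line.strip():
            # Empty lines above or between metadata lines have no special meaning.
            continue
        if BLOCK_ATTR.match(line):
            metadata.append(line)
            continue
        return metadata, i
    # The list continuation is left dangling: no block was found.
    return metadata, None


source = ["* item", "+", "", "[#idname]", "", "attached block with an ID"]
print(find_attached_block(source, 1))  # (['[#idname]'], 5)
```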
### l-3
In a list, there are two cases of an implicit list continuation, **[l-3]** a nested list and **[l-4]** an indented (literal) block. In these cases, the rules about when empty lines are tolerated are more strict.
Let's look at **[l-3]** first. If an adjacent list is encountered that's different and has no metadata lines, that list is attached as a child of the current list item regardless of how many empty lines are above it. Again, this comes from the empty line tolerance in lists across lightweight markup languages. While we could forbid consecutive empty lines, we'd be introducing a special rule just for this case, which will be hard to remember.
The primary question is, how many empty lines should be permitted if the adjacent list has metadata lines? Typically in AsciiDoc, a block attribute line acts as an interrupting line. Intuition would then tell us that a block attribute line above an adjacent list will cause the previous list to end. Consider this case:
```
* disc
[square]
** square
```
However, the AsciiDoc syntax grants a special exception here. If there's no empty line above the block attribute line, it acts as though there's an implicit list continuation above it. Thus, the second list becomes a child of the list item in the first list (hence a nested list).
But what happens if there's an empty line above the block attribute line? Consider this case:
```
* disc

[square]
** square
```
Now we are torn between two standard rules. On the one hand, we said earlier that a block attribute line is one way to separate adjacent lists (i.e., prevent nesting). On the other hand, there's an implicit list continuation above an adjacent list when that list is different.
There are two possibilities here. The first choice is that we stick with the idea that *empty line + block metadata line* acts as an interrupting line with a list. This rule matches pre-spec AsciiDoc. In that case, here's how the second list would need to be attached if preceded by an empty line:
```
* disc
+
[square]
** square
```
The second choice is that we tolerate at least one empty line, but not consecutive ones. This rule borrows from an earlier proposal. Since nowhere else in the AsciiDoc syntax do consecutive empty lines have a different meaning than a single empty line (especially above a block), I think the second rule would be a risky choice to introduce here. I'm inclined to reject the idea.
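Under the first choice, the decision for a list that follows a list item can be sketched as a small Python function. This is only an illustration of the rules as described in this issue, with simplifying assumptions: the list marker is the first space-delimited token on the line, and metadata consists only of block attribute lines of the form `[...]`. The name `classify_next_list` is hypothetical.

```python
def classify_next_list(item_marker, following):
    """Hypothetical helper: decide how the list line at the end of `following`
    relates to the preceding list item, whose marker is `item_marker`.

    `following` holds the lines after the list item principal, ending with
    the next list line.
    """
    *between, list_line = following
    next_marker = list_line.split(" ", 1)[0]
    congruent = next_marker == item_marker
    # Position of the first block attribute line, if there is one.
    first_attr = next(
        (i for i, line in enumerate(between) if line.startswith("[") and line.endswith("]")),
        None,
    )
    if first_attr is None:
        # No metadata: empty lines are tolerated in any number.
        return "sibling item" if congruent else "nested list"
    empty_above = any(not line.strip() for line in between[:first_attr])
    if congruent or empty_above:
        # The block attribute line acts as an interrupting line.
        return "separate list"
    # No empty line above the block attribute line: implicit list continuation.
    return "nested list"


print(classify_next_list("*", ["[square]", "** square"]))      # nested list
print(classify_next_list("*", ["", "[square]", "** square"]))  # separate list
```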
### l-4
Finally, we arrive at **[l-4]**. Like with a nested list, an indented (literal) block has an implicit list continuation. If the indented block has no metadata lines, then it must be offset by at least one empty line or else it gets soaked up as part of the list item principal. Consider this case:
```
* item

 indented
```
Since we've already established that a nested list without metadata lines can be preceded by an arbitrary number of empty lines, it's both logical and consistent to allow it in this case as well.
Once again, we need to consider what happens if the indented block has metadata lines. Consider this case:
```
* item

[.output]
 indented
```
Pre-spec does not apply the implicit list continuation if the indented block has at least one metadata line. So the indented block would not be attached to the list item in this case. Instead, it would require an explicit list continuation to do so. However, if we want the rules of an implicit list continuation to be consistent, then we **should** attach the indented block if not preceded by any empty lines:
```
* item
[.output]
 indented
```
The block attribute line interrupts the list item principal, so the indented block should be a candidate for attachment in this case. This is not supported in pre-spec AsciiDoc, but we could add it now.
## Summary and decisions
As we've stated in other issues, while formalizing AsciiDoc, we're trying to remain as consistent with how the language is currently interpreted as possible. At the same time, we need to address idiosyncrasies so the language is easy to understand, remember, and use.
With that in mind, we want to make it easy to keep lists together and also easy to separate them. The main subject of concern is empty lines. When are empty lines tolerated in a list, and are consecutive empty lines allowed? We established that pre-spec AsciiDoc (and lightweight markup languages in general) is quite tolerant of empty lines in a list. Any number of empty lines is permitted between list items of the same list, and empty lines are permitted following an explicit or implicit list continuation.
We considered the proposal of assigning meaning to consecutive empty lines so they act like a list interrupting line. While enticing, this proposal would greatly threaten compatibility and deviate from the unwritten code of lightweight markup languages. Thus, we don't think it's worth the risk.
We then clarified that adjacent lists can be separated using an empty line followed by a block attribute line. (If the two lists are congruent, the empty line is not required.) This is a pattern that was heavily promoted in pre-spec AsciiDoc and, as a result, plenty of documents now depend on it. It offers a definitive way to separate lists that can be made to be self-documenting.
We then accepted that empty lines should be permitted above a block attached using an explicit list continuation. The justification is that the intent of the explicit list continuation is clear and there's no reason to counter that intent by giving empty lines special meaning. The goal of the list continuation is to find a block, and the parser should proceed until it does.
We then considered whether an empty line or lines should be permitted in the case an implicit list continuation is being used to attach a block with metadata lines. Here we decided that the syntax should not be tolerant of empty lines. The reason is that it would break the contract that *empty line + block attribute line* can be used to separate adjacent lists. The block metadata line either must not be preceded by an empty line or the list must be attached using an explicit list continuation.

# Decide whether a non-indented line interrupts an indented block form

https://gitlab.eclipse.org/eclipse/asciidoc-lang/asciidoc-lang/-/issues/38
Dan Allen, 2024-02-27T20:59:42Z

An indented block form is defined as one or more contiguous lines indented by at least one space. This is an implicit structure that produces a literal block in the parsed document (the ASG).
(In pre-spec AsciiDoc, this was referred to as a literal paragraph, but we've since decided to name it a literal block with the indented form to make the terminology more accurate and consistent).
A question that has come up when defining the grammar is what to do if a subsequent line is not indented (and not otherwise an interrupting line). In other words, can a paragraph interrupt an indented literal block? Consider the following case:
```
 indented
not indented
```
In both Asciidoctor and its predecessor, the non-indented line does not interrupt the indented block form. Thus, only the first line has to be indented by at least one space. This parsing behavior mandates that an adjacent paragraph must be separated by at least one empty line. In other words, a non-indented line cannot interrupt the indented block form, but is rather consumed as part of it.
There are two reasons why this behavior may be problematic:
* It's not consistent with other markup languages, Markdown in particular. (rST also treats it as an interrupting line, though the indented block is a blockquote.)
  * According to CommonMark, "A blank line is not needed ... between a code block and a following paragraph."
* It makes the behavior more nuanced to explain and to identify in the source.
There's one other important reason this should be considered. The next list item should be allowed to interrupt the indented block.
```
* first item
 indented
* second next item
```
However, it currently is not permitted, which is definitely surprising. And yet the list item is permitted to interrupt an attached paragraph. The interruption rules just seem inconsistent in this regard.
It's very unlikely that existing documents rely on this behavior since the general practice is to surround the indented block by empty lines. But in the event that it does occur, the parser must have deterministic behavior. I think we should at least discuss changing the rule so that a non-indented line acts as an interrupting line, meaning it's not consumed as part of the indented block.
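To compare the two behaviors side by side, here's a minimal Python sketch of reading an indented block form. The function name `read_indented_block` and the `non_indented_interrupts` flag are hypothetical. The sketch treats an empty line as the only other terminator and ignores all other interrupting lines.

```python
def read_indented_block(lines, start, non_indented_interrupts):
    """Hypothetical helper: read an indented block form beginning at lines[start].

    Returns (block_lines, index_of_next_unconsumed_line).
    """
    if not lines[start].startswith(" "):
        raise ValueError("an indented block form must start with an indented line")
    block = [lines[start]]
    i = start + 1
    # An empty line always ends the block.
    while i < len(lines) and lines[i].strip():
        if non_indented_interrupts and not lines[i].startswith(" "):
            # Proposed rule: the non-indented line is not consumed.
            break
        # Current rule: any contiguous line is consumed, indented or not.
        block.append(lines[i])
        i += 1
    return block, i


source = [" indented", "not indented"]
print(read_indented_block(source, 0, non_indented_interrupts=False))
# ([' indented', 'not indented'], 2)
print(read_indented_block(source, 0, non_indented_interrupts=True))
# ([' indented'], 1)
```

With the flag off, the sketch matches the behavior described for Asciidoctor and its predecessor. With the flag on, the non-indented line is left for the next block to claim.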