Decide which grammar (or grammars) to use to formally describe the AsciiDoc language

mentioned in issue #4 (closed)

I am not sure what problems you see in describing it in a grammar. A grammar defines allowed sequences of characters in a document, and it seems to me that's exactly what you want to describe. Common grammars tend to silently drop whitespace since that fits in the model of the target language (eg computer programming languages), but there is no rule you have to do this. Also, common grammars tend to split tokenizing from rule matching which reduces recognition strength (the tokenizer may make the wrong choice in token for the context). Again there is no rule you have to do this. You can perfectly fine write a grammar at character level including white space. https://en.wikipedia.org/wiki/Scannerless_parsing

Also, please consider the description separate from the parsing problem. It's feasible to describe allowed input in a grammar beyond what the standard parsing techniques like LL(1) or LALR(1) can handle. The parsing techniques add additional constraints to the form of the grammar. However from a pure "describe the input language" point of view the grammar doesn't even need to be parsable at all.

Note that nowadays there are more powerful GLR or GLL parsers that practically accept everything you throw at it to my knowledge, including ambiguous parses (and you get a forest of rule deductions then).

Coming back to this now that I've thoroughly studied AsciiDoc parsing from the perspective of a formalized grammar...

Also, please consider the description separate from the parsing problem. It's feasible to describe allowed input in a grammar beyond what the standard parsing techniques like LL(1) or LALR(1) can handle. The parsing techniques add additional constraints to the form of the grammar. However from a pure "describe the input language" point of view the grammar doesn't even need to be parsable at all.

Absolutely not. Figuring out how to parse AsciiDoc in a reproducible way, and one that doesn't require a PhD to implement, is one of the prime reasons this group got together in the first place to launch this initiative. We need more implementations and ones that are compatible to avoid fragmentation.

It would be irresponsible for us to define the grammar and not present how it can be used to parse AsciiDoc. Worse than that, it would likely leave behind a plethora of ambiguities due to a fictional, untested grammar. Many of those ambiguities are the very ones that live in the language today and that we're trying to weed out. We have to consider parsing the language as part and parcel of the grammar. I would not accept the grammar (or composite grammar, if necessary) otherwise.

The challenges we're going to face are described nicely by Guillem in the following blog post: https://www.tweag.io/blog/2021-06-15-asciidoc-haskell-pandoc/

The main concern we have is compatibility. AsciiDoc has never been officially described by a grammar, so while it may work in theory, we're likely going to encounter edge cases that don't work in practice. And we will have to deal with those situations. Our goal here is to set a course that is going to put us in the best position to address these challenges.

changed milestone to %0.1.0 (milestone build)

mentioned in issue #16 (closed)

added area/grammar triagepassed labels

I've done extensive research into this question over the course of the last month. What I concluded is that the AsciiDoc syntax—most notably the inline syntax—is best described using a PEG grammar (or a parsing formalism that has similar characteristics).

There's one characteristic of AsciiDoc in particular that leads me to this conclusion. Within reason, any sequence of Unicode characters is considered a valid AsciiDoc document. That's because AsciiDoc is first and foremost a language for writing, not programming. We assume that if no grammar rules match (i.e., no markup is found), then the text is understood to be content for the reader's eyes. On the other hand, if a sequence of characters (such as a block delimiter line) matches a grammar rule (i.e., markup), then that sequence of characters is interpreted and used to infer semantic meaning (such as enclosing content in a sidebar block).

Once a grammar rule for markup is matched, it may activate additional requirements, such as closing a delimited block or requiring a reference to be valid. However, these problems are generally considered warnings to the writer (rather than halting the processing), and the parser does its best to recover gracefully (how gracefully is still open to discussion in the specification). But there are many cases when it's valid to fail a grammar rule, such as an inline macro that's not closed. In these cases, the parser should fall back to the next alternative, or pass through the text as uninterpreted if there are no other rules to consider.

In #16 (closed), I shared an exhibit of a PEG grammar that is capable of accurately describing much of the inline syntax in AsciiDoc. I have backed up this grammar with tests that prove it works (above and beyond) what's expected. So I'm confident in saying that a PEG grammar is viable.

Arriving at the same conclusion that Guillem suggests in this post, I'm also of the belief that we should use a separate parser (and thus a separate root grammar) for blocks and inlines. The first reason is that blocks and inlines are parsed in fundamentally different ways. Blocks are (almost always) line-oriented. Thus, every rule has to consider a line ending as a meaningful character, often a boundary. Inlines see line endings as just another type of spacing character and will comfortably span them. The second reason is that the block context (specifically the content model plus any explicit subs hints) will determine which profile to use to parse the inline syntax. To try to account for all these permutations in a single grammar would make it unnecessarily complex. It's better to have a pause point in the parsing where the processor transitions from block to inline parsing and can choose the appropriate parser for the context. And the syntax lends itself to this strategy since it was designed for and has always been parsed this way. We may find this is unnecessary, but it's a reasonable approach to start.

With that information in hand, I feel comfortable saying that we should use a PEG grammar (such as the peggy DSL or PEGN) to describe the grammar of the markup in an AsciiDoc document, at least to get us started. If someone comes along and offers a better alternative, we will absolutely consider it.

I'll follow-up in a separate comment to provide more justification for why we should focus on PEG to specify the AsciiDoc grammar.

I've discovered several articles which may help explain why most parsing methods are not well-suited (or simply challenging) for parsing a language like AsciiDoc, why we're having so much success with PEG (which is unambiguous by definition), and why many other languages are moving in that direction.

I think the last article hits the nail on the head:

Parsing expression grammars avoid ambiguity by forcing all grammar rules to be defined in terms of prioritized choice

That's precisely the situation we have in AsciiDoc. However, we have to be careful with regard to this point:

every PEG rule could be hiding a “conceptual” ambiguity

The syntax in AsciiDoc is fairly restricted (as far as languages go) and we have so many tests and real world documents to work with, so I'm resonably confident we can account for those cases (more tests is never a bad thing IMO). Furthermore, the failure scenario, which is that the text just isn't interpreted, is not of grave consequence (not like a program failing to run).

I also like this point from the last article:

Many opt for hand-written parsers that are not based on any formalism at all. Language specifications are often defined in terms of a formalism like BNF, but it’s almost never the case that real parsers can be generated directly from this formalism.

I'd like to avoid that. Hopefully we can arrive at something that can be mostly reused or at least heavily referenced.

The third article goes on to say:

I think it is safe to say that pure LL and LR parsers have proven to be largely inadequate for real-world use cases.

Interesting, though the article does goes on to say that they can be if extended, hence something like ANTLR.

changed the description

When I looked into describing the AsciiDoc language using a parsing grammar (i.e., a grammar formalism), my initial focus was on the inline syntax (#16 (closed)) since I perceived that to be the more difficult syntax to tackle. Since then, I've looked closer at the block parsing. As it turns out, there are some characteristics of the block syntax that present unique challenges for a grammar formalism.

The most notable challenge is how delimited blocks are parsed (and, by association, how block delimiters are interpreted). And even the behavior I'm about to describe isn't exactly how Asciidoctor works, but rather what Asciidoctor is supposed to be doing. So I'm going to focus on the way it's supposed to work. Consider this delimited block:

====
example
====

In AsciiDoc, a block delimiter line starts and ends a delimited block. The parser will look for the opening delimiter, then parse input as children of that block up until the closing delimiter (not allowing the parser to consume the closing delimiter). That part is simple enough. Except that the delimiter line can be of variable length (typically 4 or more characters) and the parser has to match the opening delimiter to the closing delimiter that has the same length. Now consider this delimited block:

======
example
======

This presents a challenge in the grammar. In order to represent this, the grammar would conceivably have to consider an infinite number of permutations of delimiter pairs, represented by an infinite number of grammar rules. That's obviously unreasonable. This is where it would be necessary to rely on memory (i.e., state) in the parser.

Corollary: permitting variable length block delimiters (fenced boundaries) that must be balanced mandates the use of semantic predicates

When the parser finds a block delimiter line, it remembers that line by pushing it onto a stack. It then parses input until it finds a block delimiter line that matches the top one on the stack. In order for this to work, the parser must use a semantic predicate (or action) on the match for the starting block delimiter. When it finds the closing delimiter, it should pop it from the stack.

The only complication with this strategy (aside from the parser needing to support this feature) is that if the rule fails, then the wrong delimiter line is left sitting on the stack. That's where the next characteristic of the grammar comes into play. The block syntax in AsciiDoc is predominantly a forward-only grammar. In other words, there's limited backtracing. And this is one of those cases. Once the opening delimiter line is found, the parser should never reconsider that line (and hence a cached result never comes into play, so long as rules are defined carefully). At that point, it must find a closing delimiter or consider the rest of the document to be the contents of that delimited block (with a warning). Thus, it's safe to push and pop delimiter lines from a stack maintained by the parser as it will always be accurate.

We'll see this come up a lot in the block-level parsing. Generally speaking, once the block-level parser matches the rule that starts a block (at which point state is saved), that rule should succeed (thus, the parser should not have to revisit that rule). There are exceptions/workarounds, which I'll talk about later.

NOTE: It's possible we could permit an implementation to only support a fixed number of delimiter lengths (up to a length of 8, for example). On the contrary, if we don't, we'd have to find a way to assert that the implementation isn't doing this.

Consulting a stack to match a block delimiter also satisfies the requirement of nested delimited blocks whose delimiters vary in length. Though it does mean relying on a semantic predicate who's value will different depending on visit. Consider these delimited blocks:

====
======
example
======
====

When the parser encounters a delimiter line of the kind (i.e., structural container) it's looking for, but the length is different, it starts a new block instead of closing the current one.

Note that varying the delimiter line length is only required for a direct descendant of the same kind. Consider these delimited blocks:

====
****
[NOTE]
====
example
====
****
====

If there's an intermediary delimited block (in this case a sidebar block), it will start a new parsing context, thus effectively hide the delimiter line of the inner admonition block from the ancestor example block (until the intermediary block is closed). This is an aspect of the AsciiDoc language that Asciidoctor gets wrong (and we want to correct).

Here's an exhibit of a peggy (pegjs) grammar that can be used to parse a sample of delimited blocks:

{{
function enterBlock (options, delimiter) {
  if (closesBlock(options, delimiter)) return false
  const { containerStack } = options
  containerStack.push({ delimiter })
  return true
}
function exitBlock(options) {
  const { containerStack } = options
  return containerStack.pop().delimiter
}
function closesBlock ({ containerStack }, delimiter) {
  return containerStack.length && delimiter === containerStack[containerStack.length - 1].delimiter
}
}}
{
options.containerStack = []
}
body = block*

block = lf* @(example / sidebar / paragraph)

example_delimiter_line = @$('====' [=]*) &eol

example = (openingDelim:example_delimiter_line &{ return enterBlock(options, openingDelim) }) blocks:(lf @example / lf @sidebar / paragraph)* closingDelim:(lf @example_delimiter_line / eof)
  {
    const delimiter = exitBlock(options)
    if (closingDelim === undefined) console.log('unclosed example block')
    return { type: 'block', name: 'example', form: 'delimited', delimiter, blocks, location: location().start }
  }

sidebar_delimiter_line = @$('****' [*]*) &eol

sidebar = (openingDelim:sidebar_delimiter_line &{ return enterBlock(options, openingDelim) }) blocks:(lf @example / lf @sidebar / paragraph)* closingDelim:(lf @sidebar_delimiter_line / eof)
  {
    const delimiter = exitBlock(options)
    if (closingDelim === undefined) console.log('unclosed sidebar block')
    return { type: 'block', name: 'sidebar', form: 'delimited', delimiter, blocks, location: location().start }
  }

paragraph = lines:(lf? !(delim:any_compound_block_delimiter_line &{ return closesBlock(options, delim) }) @line)+
  {
    const inlines = [{ type: 'cdata', name: 'text', value: lines.join('\n') }]
    return { type: 'block', name: 'paragraph', inlines, location: location().start }
  }

any_compound_block_delimiter_line = example_delimiter_line / sidebar_delimiter_line

line = $[^\n]+

lf = '\n'

eol = lf / eof

eof = !.

Take note that a closing block delimiter can interrupt a paragraph (though we're considering making this a universal rule).

This exhibit does not consider the fact that a delimited block should temporarily reset the list parsing context (meaning lists inside a delimited block aren't directly related to lists outside of it and can thus reuse the same markers).

Another notable challenge is how sections are parsed, which shares a lot in common with how lists are parsed. Neither sections or lists have explicit boundaries. Instead, they both begin with a line that starts with a designated marker (e.g., == for a section, ** for a list). And that marker can be of variable length (just like with block delimiters). Both sections and lists consume blocks until a sibling or ancestor is found. If no sibling or ancestor is found, a section runs to the end of the document, whereas a list will stop once an interrupting line is found that's not followed by a nested list.

Lists are even more challenging since the list itself isn't actually represented in the syntax. Rather, it's implied by the presence of one or more sibling list items. Thus, the first list item is important because it establishes what list items in the list look like (i.e., what marker they use).

At first, I thought to myself, "how in the world are we going to be able to represent section and list hierarchy in the grammar"? Fortunately, I thought of at least two ways to approach it, one of which relies on semantic predicates.

The first approach is to not try to describe the hierarchy with the grammar but rather keep the sections flat just like they appear in the document. Each time a section heading is found, it captures blocks until the next section heading. Once the document is parsed, the AST is transformed to reorganize the sections into a hierarchy based on the section level. (I still need to do some experimentation to prove this will work). The grammar would have to document it, but it could not describe it using rules.

This approach is more tricky with lists. That's because sibling list items need to be initially organized into a list or else there's no concept of a list. A semantic action can be used here to track the current list marker and a semantic predicate can be used to avoid consuming a sibling list item as a child. However, all nested lists would be siblings too and thus need to be reorganized later in the AST transformation. I haven't fully tested this theory yet, so I may be misrepresenting something.

The second approach is to rely heavily on semantic predicates to maintain the hierarchy of sections and lists during parsing. While we know this strategy can work from the experience parsing delimited blocks, it may get disrupted if caching (packrat parsing) is turned on (though I haven't fully tested this theory). Regardless, the need for such caching is unlikely since the block-level parsing only revisits offsets in the input on rare occasion. Thus, it could be an acceptable tradeoff.

Corollary: a side-effect of using a semantic predicate + state for list nesting means that another semantic predicate must be used to reset that context within a delimited block; if semantic predicates are not used to track listing nested, this correlation would not exist

Here's an exhibit of a peggy (pegjs) grammar that can be used to parse sections and lists:

{{
function isNestedSection ({ sectionStack }, heading) {
  if (!sectionStack.length || heading.level > sectionStack[sectionStack.length - 1]) {
    sectionStack.push(heading.level)
    return true
  }
  return false
}
function isCurrentList ({ listStack }, marker) {
  if (marker[marker.length - 1] === '.' && marker !== '.') marker = '1.'
  return marker === listStack[listStack.length - 1]
}
function isNewList ({ listStack }, marker) {
  if (marker[marker.length - 1] === '.' && marker !== '.') marker = '1.'
  if (listStack.length && ~listStack.indexOf(marker)) return false
  listStack.push(marker)
  return true
}
}}
{
options.sectionStack = []
options.listStack = []
}
body = block*

block = lf* @(section / list / paragraph)

section = heading:heading &{ return isNestedSection(options, heading) } blocks:block*
  {
    options.sectionStack.pop()
    return Object.assign(heading, { name: 'section', blocks })
  }

heading = marker:$'='+ ' ' title:line
  {
    const inlines = [{ type: 'cdata', name: 'text', value: title }]
    return { type: 'block', name: 'heading', title: inlines, level: marker.length - 1, location: location().start }
  }

list = &(marker:list_start &{ return isNewList(options, marker) }) items:(lf* @list_item)+
  {
    const marker = options.listStack.pop()
    const variant = marker[0] === '*' ? 'unordered' : 'ordered'
    return { type: 'block', name: 'list', variant, marker, items: items, location: location().start }
  }

list_start = @list_marker ![ \n]

list_marker = @($[*]+ / $([0-9]* [.]+) / '-') ' '

list_item_principal = $(line (lf !(list_continuation_line / list_start) @line)*)

list_continuation_line = '+' &eol

list_item = marker:list_marker &{ return isCurrentList(options, marker) } principal:list_item_principal blocks:(lf '+' lf @paragraph / lf+ @list)*
  {
    return { type: 'block', name: 'listItem', marker, principal: { inlines: [{ type: 'cdata', name: 'text', value: principal }] }, blocks, location: location().start }
  }

paragraph = !heading lines:(lf? @line)+
  {
    const inlines = [{ type: 'cdata', name: 'text', value: lines.join('\n') }]
    return { type: 'block', name: 'paragraph', inlines, location: location().start }
  }

line = $[^\n]+

lf = '\n'

eol = lf / eof

eof = !.

What I haven't yet addressed in the two exhibits shown are block attribute lines. That just requires adding more checks than just newlines above a block and treating them as interrupting lines when parsing paragraphs and list items. But it raises draws attention to another nuance.

The parser often needs to be able consult the block attributes to determine how to proceed. For example, if [discrete] is present above a heading, the parser should treat it as a discrete heading instead of a section/section title. Short of having to reparse the block attribute lines for each kind of block, that means storing more state in the parser. That's something we'll need to account for in the grammar.

I've identified three characteristics of a grammar with respect to how it uses semantic predicates.

No use of semantic predicates to complete a match; I don't think this is achievable for block-level parsing in AsciiDoc
Semantic predicates that don't disrupt caching (packrat parsing); they are only used on a successful rule that never fails, hence the rule is never revisited
Semantic predicates that disrupt caching; they may run multiple times on the same offset and return different values

Ideally, we can achieve (2). Regardless, we may want to represent the grammar using (3) to better show the intent. Neither are a requirement for an implementation. All that matters is that the implementation can extract the same semantic structure from the document (which comes down to passing the TCK).

My conclusion is that PEG parsing may not be a great fit for parsing the block-level syntax in AsciiDoc. For one, there's a lot of assertions around line boundaries that could be avoided if the input was already split into lines. Second, we've established that the block-level syntax is predominantly forward-only parsing (with retries only needed to pop out of a context). And third, it relies pretty heavily on state to maintain parsing context and to determine how to proceed. I think we can likely use a PEG grammar to describe the syntax, but implementations may decide to take that information to build a custom parser instead.

One issue we'll need to follow-up on with regard to list parsing is when a list continuation is needed to attach a child/nested list. Generally, a nested list is implicitly attached, even if offset by empty lines. But if we permit lists that have block metadata (i.e., block attribute lines), then the parser would have to look past those lines to see if they are associated with a nested list. Should that be done only if the block metadata is directly adjacent, or would empty lines preceding it be permitted? We need to solidify this conditions for implicit attachment.

added assessmentactive label

Some time ago (before I knew about this attempt to create a specification for asciidoc), I started my own project to parse asciidoc as a base for my personal literate programming tool. It uses a peg based grammar to parse the asciidoc source. Since I code it mostly in my spare time it's far from complete but maybe it can serve as an inspiration/starting point for how a peg based grammar might be used in the wild.

You can find it here https://github.com/lammermann/literate_programming_toolsuite/blob/master/asciidoctrine/src/reader/asciidoc.pest

Greetings, Benjamin

Thanks for sharing, Benjamin! That will certainly be helpful. Any materials I can find to help me look at the syntax from various angles helps me better understand how to structure and define the rules, and more importantly to account for the exceptions.

One thing I picked up on from your grammar that I had also arrived at is that it's necessary to push and pop the block delimiter when parsing delimited blocks, using a semantic predicate to avoid ending the block when a child block that uses the same structural container and a different delimiter length is encountered. I think this is worth drawing attention to in the spec as a requirement of the block-level parser.

changed milestone to %0.2.0 (milestone build)

An open question that's come up is where the parsing of the block attribute line fits in. Let's explore it.

At first, it might seem obvious to define it in the rule for the block attribute line in the block parser. Something like:

block_attribute_line = '[' @block_attrlist ']' eol
block_attrlist = anchor:(block_anchor / ' '* {}) attrs:block_attr|.., ',' !' ' / ' '* ',' ' '*| ' '*

However, there's a feature of AsciiDoc that disrupts this strategy. As AsciiDoc is currently implemented, attribute references in the block attribute line get resolved/substituted before the attrlist is parsed. For example:

[{attrlist-1}]

The block parser is not currently set up to modify the input as it parsed (see #26). Thus, it would not be able to parse this attrlist directly in the block attribute line rule. This raises a few questions:

Should attribute references be resolved in a block attribute line before parsing?
1. If they should, should it be limited to a single attribute reference at the start of the boxed value?
Should attribute references be resolved by the preprocessor?
1. If not, should the attrlist be parsed by a subparser in the action on the block attribute line rule in the block parser?

The drawback of resolving the attribute references in the preprocessor is two fold. First, it introduces the possibility that the value left behind is no longer a block attribute line, hence allowing the attrlist to breach the boundaries of the box. Second, there would be no (straightforward) way to track character offsets caused by the replacement of the attribute reference since that's not currently a feature of the preprocessor. In other words, the preprocessor would become even more difficult to implement.

I really get the sense that the right approach is a subparser in the action of the attrlist rule. The benefit here is that value won't be allowed to breach the box (we are parsing within), both the attrlist rule that captures the attrlist and the attrlist parser would be simpler since they don't have to worry about running over the closing boundary of the box, and character offsets can be tracked, if necessary. Here's the revised rule:

block_attribute_line = '[' @block_attrlist ']' eol
block_attrlist = attrlist:$[^\n]*
  {
    return parseAttrlist(resolveAttributeReferences(attrlist, documentAttributes))
  }

(The details of how to track character offsets is not shown here)

The main drawback is that it requires firing up a subparser, which may have overhead. However, simple cases could be detected and optimized, so it may be a non-issue.

Additionally, an action is likely needed anyway as attribute values that are quoted must be run through the inline parser. I almost feel like the answer of how to parse a block attribute line and the attrlist it contains is inherent in the question.

The benefit here is that value won't be allowed to breach the box

This is a strong benefit and in my opinion a good rule that we should enforce.

Maybe we should generalize this rule?

Attribute references cannot breach block boundaries?

Yes, .

added assessmentcomplete label and removed assessmentactive label

After months of research, what we've discovered is that PEG is most well-suited for describing AsciiDoc using a formal grammar. We've focused on Peggy (a JavaScript-based PEG implementation) in particular since it has proven to be the most flexible.

So why PEG?

AsciiDoc is not a context-free language. Specifically, it relies on the ordered choice operator, which can also tap into predicates to make that choice. A predicate is an arbitrarily complex expression or action to look ahead into the input string without actually consuming it. The parsing of AsciiDoc cannot be ambiguous. Only one parse tree is valid. In many aspects of the language, AsciiDoc relies on recursive descent (section hierarchy, delimited blocks, nested text formatting, etc). PEG also supports regex matches, which is crucial for supporting all written languages. Actions can be associated with rules to transform the parsed result into a semantic node, which is vital for describing how an element must be interpreted. We don't have to worry about the inefficiency of PEG as much since AsciiDoc is already naturally designed to be able to veto rules early, and is thus well suited to parsers generated by PEG. In short, we rely on the power of PEG parsing to describe the AsciiDoc language, at least for the purpose of writing the specification.

The PEG page on wikipedia points out that not all languages that can be expressed using parsing expression grammars can be parsed by LL or LR parsers. We think this is the case for AsciiDoc as well, though it remains an open question to prove or disprove.

We've confirmed that AsciiDoc cannot be described using a single grammar alone. That's because there are various preprocessing steps involved with parsing AsciiDoc. Instead, it must be described using at least 5 different grammars:

line preprocessor (may be possible to integrate with block grammar)
block
attrlist
inline preprocessor
inline

There's no question that the line preprocessor is the biggest hurdle for describing AsciiDoc using a formal grammar. We still aren't entirely convinced this part should be done using PEG, or even that it can be described that way. We're also still considering how to tighten the rules to decouple it from block parsing (#26). While the preprocessor could be integrated directly into the block grammar, it relies on the PEG parsing being allowed to modify the input at or ahead of what it's already parsed. The integration also makes the block grammar more difficult to understand.

Because of how the language evolved, there are restarts between each of these steps that require shifting to a subparser. Trying to describe the language using a single grammar would permit behavior which is forbidden by the design, such as a paragraph running over the end of a delimited block or a block attribute value running over the end of a line.

It's not necessary that an implementation use a PEG parser, or even that it uses each of these discrete steps. Ultimately, what it comes down to is whether the implementation produces the expected ASG.

As it stands now, the normative sections of the specification will use a PEG grammar and rule actions as a way to communicate the syntax rules, relationships, and expected behaviors of the AsciiDoc language.

closed

mentioned in merge request !19 (merged)

Closed by SDR-8. See !19 (merged).

assigned to @mojavelinux

Decide which grammar (or grammars) to use to formally describe the AsciiDoc language

Designs

Child items ...

Activity