Clarify how block attribute lines are parsed and aggregated
While working through the block-level grammar, it became clear that there's a lot of gray area with regard to how the block attribute lines are parsed. This issue seeks to clarify these rules.
First, it should be stated how the attrlist in a block attribute line is found and parsed. We might be tempted to think that the attrlist should be parsed incrementally using a top-level rule in the grammar within the block attribute line rule. At a high level, something like:
block_attribute_line = '[' block_attrs ']' eol
block_attrs = attrs:(block_attr|.., ',' !' ' / ' '* ',' ' '*|)
block_attr = block_attr_name '=' block_attr_value / block_attr_value
...
However, this approach is not compatible with the rule that a block attribute line must be restricted to a single line and that line must start and end with matching square brackets. In other words, the closing square bracket at the end of the line is a hard boundary (especially important when we get into the matter of resolving attribute references).
Instead, the attrlist should be parsed using a subparser, then aggregated with the result from other attrlists, in a rule action:
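block_with_metadata = metadata:(attrlists:block_attribute_line* {
    // no block attribute lines means no metadata (and no metadata property on the block)
    if (!attrlists.length) return undefined
    const attributes = {}
    for (const { source: attrlist, location: loc } of attrlists) {
      // parse each attrlist in isolation, passing along where it starts in the document
      const theseAttributes = parseAttrlist(attrlist, { ..., line: loc[0].line, startCol: loc[0].col })
      ...
    }
    // return the metadata object itself; the outer action assigns it to block.metadata
    return { attributes }
  }) block:block
  {
    return metadata ? Object.assign(block, { metadata }) : block
  }
block_attribute_line = '[' block_attrlist ']' eol
block_attrlist = !space source:$(!(lf / space? ']') .)*
  {
    return { source, location: toSourceLocation(getLocation()) }
  }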
(We're saying here that the attrlist cannot start or end with a space)
This approach also has the benefit of making the attrlist parser easier to implement since it doesn't have to worry about overrunning the end of the line. However, since it's a subparser, it does require more effort to propagate the location information.
R1: Use subparser to parse attrlist matched by block attribute line rule
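To illustrate the location concern in R1, here is a minimal sketch of a subparser that receives the attrlist source together with its starting line and column and reports locations in document coordinates. The name parseAttrlist comes from the grammar above; the options shape ({ line, startCol }), the returned entry shape, and the naive comma splitting (which ignores quoted values and escapes) are assumptions made purely for illustration.

```javascript
// Hypothetical sketch: parse an attrlist in isolation and map columns back
// to the source document. Quoted values and escapes are deliberately ignored.
function parseAttrlist (attrlist, { line = 1, startCol = 1 } = {}) {
  const entries = []
  let positionalIndex = 0
  let offset = 0 // offset of the current segment within the attrlist
  for (const segment of attrlist.split(',')) {
    const leading = segment.length - segment.trimStart().length
    const text = segment.trim()
    const col = startCol + offset + leading
    offset += segment.length + 1 // + 1 accounts for the comma
    const eq = text.indexOf('=')
    // positional attributes get a 1-based index key stored as a string
    const name = eq > 0 ? text.slice(0, eq) : String(++positionalIndex)
    const value = eq > 0 ? text.slice(eq + 1) : text
    // the attrlist sits on a single line, so only the column needs shifting
    entries.push({ name, value, location: { line, col } })
  }
  return entries
}

// the attrlist of "[source,ruby,id=intro]" on line 5 starts at column 2
const entries = parseAttrlist('source,ruby,id=intro', { line: 5, startCol: 2 })
```

Because the closing bracket is a hard boundary, the subparser only ever sees one line, so propagating the location reduces to adding a column offset.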
The next issue has to do with attribute references. AsciiDoc has always allowed attribute references to be resolved before the attrlist is parsed. Changing that now would be very problematic. However, keeping this feature does raise some questions that need to be answered.
First, having to preprocess the attrlist by resolving attribute references makes it clear that the attrlist requires a subparser. It's not possible to replace attribute references as the block attribute line is being matched. It could be done in the block/line preprocessor. However, the consequence of that is that an attribute value could introduce additional lines that could breach the boundaries of the block attribute line and actually change the structure of the document. This is not allowed today, and we don't want users to start exploiting that loophole. Therefore, it's imperative that the block attribute line be found first, then the contents within it (the attrlist) be parsed.
If attribute references are resolved first, we need a new mode for the inline preprocessor that only processes attributes. At the same time, it must not resolve attribute references inside inline passthroughs, so those ranges still need to be considered. It then needs to return the same result as it does for the inline parser, except the source mapping only contains information about resolved attribute references, not inline passthroughs. At this point, parsing of the attrlist may proceed. That parser need not worry about newline characters added by an attribute value, since the block parser still considers the whole attrlist to be on a single line in the source document.
R2: Use inline preprocessor to resolve attributes before parsing attrlist
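As a rough sketch of what an attributes-only preprocessing mode might look like: the function below resolves references outside passthrough spans and records where each resolved value landed, so a later phase can tell authored text from inserted text. The function name, the return shape ({ input, resolvedRanges }), and the limited passthrough forms it recognizes (+++...+++ and pass:[...]) are assumptions for illustration, not the actual preprocessor interface.

```javascript
// Hypothetical sketch of an "attributes only" preprocessing mode.
// Passthrough spans are recognized so references inside them are skipped,
// but they are left in place for a later phase to extract.
const TOKEN_RX = /\+\+\+.*?\+\+\+|pass:\[.*?\]|\{([A-Za-z0-9_][A-Za-z0-9_-]*)\}/g

function resolveAttributeReferences (attrlist, documentAttributes) {
  const resolvedRanges = [] // where resolved values landed in the output
  let output = ''
  let lastIndex = 0
  for (const match of attrlist.matchAll(TOKEN_RX)) {
    const [token, name] = match
    output += attrlist.slice(lastIndex, match.index)
    lastIndex = match.index + token.length
    const known = name !== undefined &&
      Object.prototype.hasOwnProperty.call(documentAttributes, name)
    if (known) {
      const value = String(documentAttributes[name])
      resolvedRanges.push({ name, start: output.length, end: output.length + value.length })
      output += value
    } else {
      output += token // passthrough or unknown reference: copy verbatim
    }
  }
  output += attrlist.slice(lastIndex)
  return { input: output, resolvedRanges }
}

const resolved = resolveAttributeReferences(
  'id={idprefix}intro,title=+++{idprefix}+++',
  { idprefix: 'sec-' }
)
```

In this example the first reference is resolved while the one inside the passthrough is left alone; the recorded range covers only the inserted text.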
The next question has to do with when inline parsing is performed on an attribute value. As part of a larger commit a decade ago, Asciidoctor introduced the feature into the AsciiDoc Language that substitutions (i.e., inline parsing) are applied to an attribute value enclosed in single quotes. If we were to preserve this feature, then we would expect the location on the inline nodes to be accurate (even though location information is not stored for unparsed attribute values).
But we have to wonder whether we should support this feature. After all, there are numerous attributes whose value should never be parsed, such as id, role, opts, cols, etc. Should the parser ignore the single quotes in these cases? And what about the title attribute? In Asciidoctor, the value is always parsed even if it is not enclosed in single quotes. Should that behavior be preserved?
Assuming we still want to parse single-quoted attribute values, the next question is how to handle inline preprocessing. Since attribute references have already been resolved, we don't want them to be resolved again. However, the inline passthroughs need to be processed just as they would have been had that processing been done at the same time as attribute references. That means the inline preprocessor must only work on the ranges between resolved attribute references. We have effectively split the inline preprocessor into two modes or phases, yet it needs to behave as if it were all done in a single phase.
When the attrlist has at least one attribute reference and one single-quoted value, the parsing order is:
resolve attribute references -> parse attrlist -> extract passthroughs -> parse inlines -> restore passthroughs
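The constraint that the second preprocessor phase only looks at text between resolved attribute references could be sketched as follows. The function names, the placeholder characters, the shape of resolvedRanges (sorted, non-overlapping, relative to the value), and the restriction to the +++...+++ form are all assumptions for illustration.

```javascript
// Hypothetical sketch: second preprocessor phase for a single-quoted value.
// Passthroughs are extracted only from authored text, never from text that
// was inserted by a resolved attribute reference.
function extractPassthroughs (value, resolvedRanges = []) {
  const passthroughs = []
  const extract = (segment) =>
    segment.replace(/\+\+\+(.*?)\+\+\+/g, (_, content) => {
      passthroughs.push(content)
      return `\u0010${passthroughs.length - 1}\u0011` // placeholder for later restore
    })
  let input = ''
  let cursor = 0
  for (const { start, end } of resolvedRanges) {
    input += extract(value.slice(cursor, start)) // authored text: scan it
    input += value.slice(start, end) // resolved text: copy verbatim
    cursor = end
  }
  input += extract(value.slice(cursor))
  return { input, passthroughs }
}

// Put the passthrough contents back once inline parsing is done
function restorePassthroughs (text, passthroughs) {
  return text.replace(/\u0010(\d+)\u0011/g, (_, idx) => passthroughs[Number(idx)])
}

// the first 7 characters are assumed to have come from an attribute value
const extracted = extractPassthroughs('+++a+++ b +++*c*+++', [{ start: 0, end: 7 }])
const restored = restorePassthroughs(extracted.input, extracted.passthroughs)
```

Here the +++a+++ that arrived via an attribute value is left untouched, while the authored +++*c*+++ is extracted before inline parsing and restored afterwards.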
R3: Either a) universally parse single-quoted attribute values, b) ignore the single-quoted enclosure and parse attribute values of certain attribute names, like title, or c) parse attribute values of certain attribute names if the value is single-quoted
The next question is how to aggregate block attributes. Although multiple block attribute lines are permitted, in the end we need a single map of attributes. Here is a sketch of the aggregation rules:
- role values are always aggregated; duplicates are ignored
- options is an alias for opts
- options are always aggregated; duplicates are ignored
- if any other attribute appears twice (whether it is named or positional), it is overwritten
- positional attributes are stored using 1-based index keys as a string (do we want to add a $ in front of the number?)
- the order of the names in the attribute map matches the order in which they first appear in the document
- the first positional attribute permits shorthands (#idname, .rolename, %optname, [idname,reftext])
- what's left, if any, is interpreted as the style (e.g., source from source%linenums)
- if nothing remains, the existing style is not overwritten
Note that attributes that are overwritten don't appear in the ASG.
R4: Aggregate certain attribute values (role, opts) and overwrite the rest
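The aggregation rules listed above might be sketched as follows. This is only an illustration under several assumptions: each attrlist is already parsed into [name, value] pairs, the style and id extracted from the first positional attribute are stored under the keys style and id, role values are space-separated and opts values comma-separated, and the [idname,reftext] shorthand is not handled. The function names are hypothetical.

```javascript
// Hypothetical sketch: split the first positional attribute into its parts,
// e.g. "source#intro.lead%linenums"
function parseShorthand (value) {
  const result = { style: undefined, id: undefined, roles: [], options: [] }
  for (const [, marker, text] of value.matchAll(/([#.%]?)([^#.%]+)/g)) {
    if (marker === '#') result.id = text
    else if (marker === '.') result.roles.push(text)
    else if (marker === '%') result.options.push(text)
    else result.style = text // whatever is left is the style
  }
  return result
}

// Hypothetical sketch of R4: aggregate role and opts, overwrite the rest
function aggregateBlockAttributes (attrlists) {
  const attributes = new Map() // keeps the position of a key's first insertion
  const roles = new Set() // a Set ignores duplicates
  const options = new Set()
  for (const attrlist of attrlists) {
    for (const [rawName, value] of attrlist) {
      const name = rawName === 'options' ? 'opts' : rawName // options is an alias
      if (name === 'role') {
        value.split(/\s+/).filter(Boolean).forEach((it) => roles.add(it))
      } else if (name === 'opts') {
        value.split(',').map((it) => it.trim()).filter(Boolean).forEach((it) => options.add(it))
      } else if (name === '1') {
        const shorthand = parseShorthand(value)
        // if no style remains, the existing style is not overwritten
        if (shorthand.style !== undefined) attributes.set('style', shorthand.style)
        if (shorthand.id !== undefined) attributes.set('id', shorthand.id)
        shorthand.roles.forEach((it) => roles.add(it))
        shorthand.options.forEach((it) => options.add(it))
      } else {
        attributes.set(name, value) // any other repeated attribute is overwritten
      }
    }
  }
  return { attributes, roles: [...roles], options: [...options] }
}

const aggregated = aggregateBlockAttributes([
  [['1', 'source#intro.lead%linenums'], ['2', 'ruby'], ['role', 'lead wide']],
  [['1', '%nowrap'], ['options', 'linenums'], ['2', 'js'], ['title', 'Example']]
])
```

A Map is used here rather than a plain object because JavaScript objects enumerate integer-like keys such as "2" before all other keys, which would not preserve first-appearance order for positional attributes.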
Certain attributes are promoted to a top-level property on the node in the ASG. These attributes are:
- id
- title
- reftext
- roles (an array form of the role attribute); may want to move this to metadata.roles
The values of the title and reftext properties are always an array of inlines, even if the attribute value is not parsed (because it was not enclosed in single quotes).
R5: The metadata property is only defined if the block has metadata lines; metadata contains the properties attributes, options, roles, and location. The title and reftext attributes are promoted to top-level properties. The value of the top-level id property either comes from the attributes or is generated, if applicable.
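A small sketch of the promotion described in R5 follows. The node shape, the function name applyMetadata, and the single text inline used to wrap an unparsed value are assumptions; id generation and inline parsing of the title are left out, and the sketch keeps the promoted attributes in metadata.attributes, which this issue does not decide.

```javascript
// Hypothetical sketch: wrap an unparsed value so the property is always an
// array of inlines (the inline node shape is an assumption)
const toInlines = (value) => [{ name: 'text', type: 'string', value }]

// Hypothetical sketch of R5: attach metadata and promote selected attributes
function applyMetadata (block, metadata) {
  if (!metadata) return block // no metadata lines: no metadata property
  const { attributes, options, roles, location } = metadata
  const node = { ...block, metadata: { attributes, options, roles, location } }
  const has = (name) => Object.prototype.hasOwnProperty.call(attributes, name)
  if (has('id')) node.id = attributes.id
  for (const name of ['title', 'reftext']) {
    if (has(name)) node[name] = toInlines(attributes[name])
  }
  return node
}

const paragraph = { name: 'paragraph', type: 'block' }
const node = applyMetadata(paragraph, {
  attributes: { id: 'intro', title: 'Introduction' },
  options: [],
  roles: ['lead'],
  location: [{ line: 1, col: 1 }, { line: 1, col: 29 }]
})
```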