While developing a text and binary data format (https://concise-encoding.org/), I ran into trouble building a formal description for them. Many formats use EBNF or ABNF, or just embed their own BNF-style metalanguage in the spec itself (I started taking cues from the XML spec).
But describing a binary format is deceptively complex, and eventually I had to split out the metalanguage. KBNF is the result.
I'd really appreciate some proofreading, as right now the only way I'm able to effectively find issues is to let it sit for a couple of weeks so that I can look at it from a fresh viewpoint, and even then I'm not confident that I'm discovering as much as multiple eyes could.
When you give an example of things being non-greedy by default why does
azzzbzzzczzz
represent a document with 3 records rather than 3 documents? We intuitively understand why, but without defining document as special, making it consume records eagerly, or defining anchors for the start and end of the data, I think the example may be incorrect.
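For comparison, Python's lazy quantifiers show the same effect: without an anchor at the end of the data, a lazy repetition stops as early as it can, and only anchoring forces it to consume the whole input (a sketch of the ambiguity, using `re` rather than KBNF):

```python
import re

s = "azzzbzzzczzz"

# A lazy repetition with no anchor stops as early as possible:
assert re.match(r"[abc]z*?", s).group(0) == "a"

# Anchoring the match to the end of the data forces the lazy
# repetitions to expand until everything is consumed:
assert re.fullmatch(r"(?:[abc]z*?)+", s).group(0) == s
```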
What can you do with it? Do you have a parser generator or grammar validator?
Is the & really necessary? It seems to me like it could be removed without losing anything (but it is late at night here so I may be missing something).
Related to the previous question, if & is necessary are there typos in "byte(type) byte(bind(length, ~)) & byte(~){length};" and "'(' & TOKEN_SEP item TOKEN_SEP & ')';", which are missing some?
There's no parser generator; its primary purpose is to document how a format looks and works (similar to what ABNF and EBNF do). I am trying to make it consistent enough that tools could be made for it, however.
The & is something I added later, and I thought I'd updated all of the examples but I must have missed some. Using whitespace as an operator was just causing too much trouble and making the more complex grammars harder for a human to follow, which is why I opted to make every operation require an actual symbol, and leave whitespace without semantic value.
> Using whitespace as an operator was just causing too much trouble and making the more complex grammars harder for a human to follow…
Something I’ve been playing with lately (it’s late and I don’t quite remember the name) defines a rule named ‘_’ as the whitespace token, so the grammars stay both easy to read and explicit. Be like:
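Something along these lines, perhaps (a hypothetical sketch, not any particular tool's actual syntax):

```
_     = (' ' | '\t' | '\n')*;
pair  = key _ '=' _ value;
list  = '(' _ item (_ item)* _ ')';
```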
Yes, TBH I'm a little stuck coming up with a name. I want it to be evident that this follows in the footsteps of the Backus-Naur form (for familiarity's sake), but I don't want to do something crass like call it "Ultimate Backus-Naur form" or something. We already have "Advanced Backus-Naur Form" and "Extended Backus-Naur form" and will eventually run out of superlatives...
A long time ago, I started working on something similar. There are some binary editors that support writing grammars, and I worked on reverse engineering several binary file formats. When working on the QuarkXPress file format, I discovered that you do need two layers of grammars, as the file contains a header followed by allocated blocks, where the blocks are linked (by indices) and chains of linked blocks contain streams of data. I remember that there are also file formats that simply have pointers to certain sections of the data contained in the file. I once implemented a memory-mapped file data storage where all pointers to allocated blocks were represented by an offset from the start of the file. (I later came up with a better idea: make them relative to the location of the pointer itself, so that you don't need to know the offset at which the memory-mapped file is placed in memory. So basically, there are lots of 'pointers' in the file, and you can only read the data in the file by following those pointers.)
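The self-relative pointer trick can be sketched like this (Python, my illustration rather than the code being described):

```python
import struct

def resolve(buf, field_pos):
    # A self-relative pointer stores an offset from its own location,
    # so the address at which the file is mapped never matters.
    (offset,) = struct.unpack_from("<i", buf, field_pos)
    return field_pos + offset

buf = bytearray(16)
struct.pack_into("<i", buf, 4, 8)   # field at byte 4 points 8 bytes ahead
assert resolve(buf, 4) == 12
```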
Lord. I still have whole books I laid out in QuarkXPress for commercial printing that I can't open. Reverse engineering the file format was not something that had occurred to me.
I have made the code available on GitHub. The code was developed on Windows and makes use of some Windows-specific functions, but I think that dependency can be removed.
Unity asset files killed my attempt at a comprehensive binary encoding format spec language. There’s a header containing a list of block pointers. The blocks are compressed and the (several) files are sequential. Typically the first file is a resource containing a block of metadata describing object type formats, then object references keyed to the specified formats. The objects can have transparent references (PPtrs) to other objects in the resource file, or other files in the asset, or other files or objects in other asset files. One asset file usually contains the master list of what object is in what file.
None of this is a true blocker, but your format has to contain sigma types and at that point you’re just writing a parser normally.
You are conflating, in a typical model/metamodel confusion, two kinds of numbers: the numbers in the data structures you are describing, whose machine representations you are modeling decently, and the numbers in your language itself, which should probably be just integers without complications.
The former are patterns in the data, e.g.
rpm = float(32, -1000~1000);
which specifies a subset of the 2^32 floating point values, and the latter are actual numbers, e.g.
identifier = 'a'~'z'{5~8};
which shouldn't worry about rounding, signed zero, NaN values etc.
What arithmetic do you expect to need, in practice, at each of the two levels?
> real: any value from the set of reals, including qnan and snan unless otherwise specified
You mean floating point, not actual real numbers. Then, even within the IEEE-754 options, you need to specify what variant of floating point.
Note that floating point constants by themselves have ambiguous type: 1.3e5 can be float(32...), float(64...), etc.
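Concretely (Python, with `struct` standing in for the two widths):

```python
import struct

# Round-tripping the literal 1.3 through binary32 changes its value,
# so the same constant denotes different bit patterns at different widths:
f32 = struct.unpack(">f", struct.pack(">f", 1.3))[0]
assert f32 != 1.3   # binary32's nearest value differs from binary64's
```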
> Note: Calculations can produce a quiet NaN value under certain conditions in accordance with the IEEE 754 specification. If different processing is required (such as traps or exceptions), this must be documented in your specification.
This means specifying what to do with NaN results and similar errors: a heavy burden for the user.
> unsigned: limited to positive integers and 0
> signed: limited to positive and negative integers, and 0 (but excluding -0)
These are floating point values pretending to be integers. You should have actual unlimited precision integers with constraints like explicit maximum and minimum values.
Moreover, "-0" is an IEEE-754 technical detail that has no place in a formal, abstract language.
Hmm good point. My plan was to allow concepts such as inf, NaN and negative 0 to be available in the abstract, and then convertible to a concrete implementation such as ieee754. But in actual fact it's mostly muddying the waters, and those kinds of values would be better represented as functions that produce ieee754 values rather than abstract concepts that don't really work abstractly anyway (I was thinking that as an abstract concept it could be passed to a general function for any given floating type implementation such as binary float, decimal float, bfloat, etc).
> unsigned: limited to positive integers and 0
> signed: limited to positive and negative integers, and 0 (but excluding -0)
> These are floating point values pretending to be integers. You should have actual unlimited precision integers with constraints like explicit maximum and minimum values.
The idea here is to have all numbers be treated as reals (in the mathematical sense) for abstract calculation purposes, and then use the pseudo-types "signed" and "unsigned" to represent common invariants on those reals (such as "can only be 0 or an integer" or "can only be 0 or a positive integer"). For example, ('a'~'z'{chars_per_record}){char_count / chars_per_record} (where char_count and chars_per_record are bound from another part of the document) is a common enough method used to envelop data in binary formats. But if char_count = 100 and chars_per_record = 3, you can't accept a fractional count (i.e. it would be an indication that the data in this document is corrupt or malformed).
Only the functions `uint`, `sint`, and `float` would convert a number into an actual expression (provided their invariants are respected).
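A sketch of that invariant check (using Python's `Fraction` for exactness; the names here are mine, not KBNF's):

```python
from fractions import Fraction

def repeat_count(char_count, chars_per_record):
    # The repetition count must be a non-negative integer;
    # anything else indicates corrupt or malformed data.
    n = Fraction(char_count, chars_per_record)
    return int(n) if n.denominator == 1 and n >= 0 else None

assert repeat_count(99, 3) == 33
assert repeat_count(100, 3) is None   # fractional count: no match
```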
> The idea here is to have all numbers be treated as reals (in the mathematical sense)
No. You can't represent, compare etc. most real numbers, so they aren't any good for computational purposes.
Practically usable number types include integers and things that have sufficiently number-like behaviour (e.g. fields) and can be represented with a finite number of integers: rational numbers, rational numbers with field extensions, vector spaces, finite sets, intervals, etc.
In a language for grammars you can probably stop at integers; general purpose programming languages might want more (for instance, Python has rational numbers) but advanced number types can usually be implemented with libraries.
I think maybe I'm not understanding you correctly...
For binary formats, an integer type will be expected in 99% of cases. Something like `uint(8,a/b)` will only generate an alternates set within the realm of integers, so a rule containing such an expression could only match when this particular byte matches a/b, which can only happen when a/b results in an integer from 0 to 255. For all a/b where this isn't true, the rule will not match.
But support for matching float encodings is still needed, such as for `reading = float(32,0~1);`, which would happily match an ieee754 32-bit binary float approximation of 0.3 or 0.8 or anything from 0 to 1 that can be represented in this float format.
You are talking about the limited precision numbers in data, not about appropriate numbers for use in your language to express counts, repetitions, aggregate sizes etc. They are two related but categorically different kinds of entity, like screws and screwdrivers or sheet music and audio files.
Grammar rules aren't supposed to work "for 99% of cases" depending on complex invisible constraints such as whether finite precision causes floating point arithmetic to depart from exact integer arithmetic.
I'm not saying that grammar rules are supposed to work for only 99% of cases; that comment was referring to the fact that most binary data of interest for binding (counts etc) will be stored in the document as integers. But I'm still not understanding what you're saying.
All counts, repetitions, aggregate sizes etc are going to come from limited precision data in the document being examined (except in cases where you're passing in a numeric literal). In some cases these values will be used directly to produce expressions via functions and repetition, and in others they will be used in calculations that are then passed to functions or repetition to produce expressions.
The grammar rules only match expressions, which are bit patterns, not numbers. Some of those bit patterns are arrived at through functions such as `float()`, taking a number range to give an expression composed of a set of possible (ieee754 binary) bit patterns. The bit patterns that the `float(32,1~10)` expression will match are far more numerous than the bit patterns that the `uint(32,1~10)` expression will match (0x00000001, 0x00000002, ... 0x0000000a), even though they both are taking in the same numeric range (each picks out only the values it can represent as discrete bit patterns from the range). `uint(32,1.5)` is a malformed grammar because it violates the invariant of `uint()`, whereas `uint(32,2)` is fine (giving an expression of the big endian bit pattern 00000000000000000000000000000010).
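For instance, in Python's `struct` terms (just to illustrate the bit patterns, not KBNF itself):

```python
import struct

# The IEEE-754 binary32 pattern nearest to 0.3 - one of the many
# patterns a float(32,0~1) expression would match:
bits = struct.unpack(">I", struct.pack(">f", 0.3))[0]
assert bits == 0x3E99999A

# uint(32,2), by contrast, pins down exactly one big-endian pattern:
assert struct.pack(">I", 2) == b"\x00\x00\x00\x02"
```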
The problem is slightly subtler and, I need to reiterate, at the grammar level.
Suppose you define some constants a, b, c, etc. with integer values.
First of all, it isn't obvious that these constants do not overflow the mantissa size of whatever floating point type you process them as; but we can generously assume that they are representable exactly.
Then you use these fake integers to compute something that should be an actual integer, for example the number of occurrences of something as
(a*b)/c
in one place and
(a/c)*b
in another place (probably obfuscated by layers of variables and very simple computations that would be harmless with exact arithmetic), and the two values might differ (and they might be both wrong).
So without even getting into the bit patterns of valid "example" values, you are unable to specify your grammar reliably. Floating point arithmetic in the grammar is an unnecessary nightmare that you are inflicting on your users.
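Concretely, in IEEE-754 double precision (a Python sketch of the hazard):

```python
a, b, c = 1, 49, 49

x = (a * b) / c   # exact: 49/49 = 1.0
y = (a / c) * b   # 1/49 rounds, and the error survives the multiply

assert x == 1.0
assert y != x
assert int(y) == 0   # truncated as a repetition count, x and y disagree
```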
But these are reals, not floats. With reals, associativity holds, even though for ieee754 floats it does not.
There's nothing forcing an implementation to calculate (2*6)/3 or (2/3)*6 using binary floats rather than some other method that it can guarantee will give a correct result. These are implementation details, which the grammar doesn't concern itself with. An analyzer could easily discover that the ultimate destination of the calculation is an unsigned integer, and rejig the calculation as necessary to produce the expected integer result (or produce a no-match expression if the result of the calculation would violate the unsigned integer invariant).
Computerized math is hard no matter what you do (rounding, range, precision, overflow behavior, impossible calculations, infinities, etc), and those will still exist whether the grammar prescribes a particular computerized approach or not. So it's better to not force implementation details when you don't have to.
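For example, an implementation could evaluate grammar calculations with exact rationals (a sketch, not anything the metalanguage mandates):

```python
from fractions import Fraction

# (2*6)/3 and (2/3)*6 agree when evaluated exactly, unlike with floats:
assert Fraction(2 * 6, 3) == Fraction(2, 3) * 6 == 4
```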
Another thing to consider is that implementations of this metalanguage won't have to be 100% correct or handle crazy complex calculations, because real-world data formats won't do such things: they want speed and accuracy in codecs that don't fall over on platform subtleties. A real-world format won't expect the precise bits 00111110100110011001100110011010 (~0.3 in ieee754 binary float 32) for anything, and even if (god forbid) it did, one could just as easily write uint(32,0x3e99999a) instead to make sure there's no mistake (subnormals notwithstanding). You could have provably correct (but slow) implementations, and less-correct-but-super-fast-and-actually-useful-for-the-real-world implementations. A performant implementation might even require, for example, that calculations whose destination is an integer encoding be calculable solely using integer math - i.e. (a * b) / c, not (a / c) * b. Nothing wrong with that if it allows you to maximize performance.
On a side note, even the float() function is fraught with subtleties. Different algorithms exist for converting decimal strings to binary floats, which produce subtly different bit patterns depending on the value. We can't get away from that, but once again for the real world it almost never matters because we don't need that level of precision so we just live with it (which is why ieee754 binary has enjoyed such success, and one reason among many why ieee754 decimal is slow to catch on).
The math is pure and should remain pure (especially in the documentation, which this metalanguage is designed for). Making it actually work in silicon is a job for a computer.
You might be interested in Guy Steele’s talk “It’s time for a new old language”. He talks about similar notations and the problem of inventing new ones (although most of his talk is about things other than BNF-like notations: he just talks about them for a bit in the beginning). It’s an interesting talk and is useful in thinking about proposing new notations.
In creating and enhancing XTRAN, my Expert System that knows over 40 computer languages, I got tired of hard-coding parsers, so I created a parsing engine that takes EBNF (specified in a friendlier syntax I created), parses the EBNF to an internal format, and then, when asked to parse a language, executes its EBNF in order to parse. I would be interested to know if anyone else is executing EBNF to parse.
Since code must come back out after XTRAN rules have created / changed / translated it, I also created a rendering engine that is responsible for rendering XTRAN's internal format of language content out as text source code, complete with extensive styling controls. The rendering engine is also driven by EBNF, which it executes in order to render code content to text source code.
If you're interested in learning more, see WWW.XTRAN-LLC.com. I'll be glad to answer any questions.
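The "execute the grammar directly" idea can be sketched in miniature (a hypothetical toy, vastly simpler than XTRAN's engine): rules live in a data structure, and a tiny interpreter walks them against the input instead of running generated parser code.

```python
# Grammar as data: each rule is a sequence of terms to match.
GRAMMAR = {
    "list":  [("lit", "("), ("rule", "items"), ("lit", ")")],
    "items": [("star", ("rule", "item"))],
    "item":  [("chars", "abc")],
}

def match(term, text, pos):
    """Interpret one grammar term; return the new position or None."""
    kind, arg = term
    if kind == "lit":
        return pos + len(arg) if text.startswith(arg, pos) else None
    if kind == "chars":
        return pos + 1 if pos < len(text) and text[pos] in arg else None
    if kind == "rule":
        for t in GRAMMAR[arg]:
            pos = match(t, text, pos)
            if pos is None:
                return None
        return pos
    if kind == "star":
        while True:
            nxt = match(arg, text, pos)
            if nxt is None:
                return pos
            pos = nxt

assert match(("rule", "list"), "(abc)", 0) == 5   # full input consumed
assert match(("rule", "list"), "(ab!", 0) is None  # no closing paren
```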
Not really a fan of using ‘&’ as the concatenation operator instead of the traditional EBNF way. The examples show both ways with the TOKEN_SEP (? space I guess) not seeming to need it.
Also the string rules look like they don’t allow an empty string, maybe intentionally as that’s probably a bug in someone’s grammar.
About as far as I got on a quick glance, got distracted by the mathematical operators and other things like concatenation sharing the same tokens and started thinking how hard it would be to parse it.
If you would like to include unicode support rather than supporting bytes and punting unicode to a library or the end user, it may be good to have opinions about unicode normalization.
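For instance, NFC vs NFD (Python's `unicodedata`):

```python
import unicodedata

composed   = "\u00e9"      # 'é' as one code point
decomposed = "e\u0301"     # 'e' + combining acute accent

# Byte-wise different, canonically equivalent - a grammar matching
# literal strings has to pick a stance on which form it expects:
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```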
"copy-pasta"? Is that a spaghetti replicator? :) I can reverse that, with XTRAN rules I have to automate removal of GOTOs by structuring code. (It uses XTRAN's code match/replace facility.)
I want it to be useful for other people, so I've decided to release it standalone here: https://github.com/kstenerud/kbnf/blob/master/kbnf_v1.md
Cheers!