Topic: MNML: a minimalistic structured data format (Read 1034 times)

Nadaka · « **on:** August 21, 2012, 06:22:42 pm »

I am creating this thread so I don't spam up the if self.isCoder(): post() #Programming Thread

MNML

Preface:

What is bad about XML:
1: verbosity. each element name is written twice, and each of those names is enclosed in its own set of angle brackets.
2: whitespace. xml has fiddly rules for what white space is and is not important, and there are inconsistent implementations in the interpretation of this (thank you microsoft).
3: validity. An entire document must be well formed or discarded entirely. Combine this with the single root element requirement and you have a format unsuitable for streaming data sources.
4: There is more, but screw it for the moment.

What is bad about json:
json has difficulty serializing certain types of objects. In javascript all arrays are objects, so an array can carry named properties and an object can hold indexed values as if it were an array. JSON notation has difficulty dealing with this because it does not allow setting named attributes on arrays, nor does it allow putting values into an object as if it were an array, even though both are possible and indeed frequent in javascript.
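To make the complaint concrete, here is a rough analogue in Python (not javascript, so it only approximates the situation described above). `TaggedList` is a made-up name for illustration: a list that also carries a named attribute loses that attribute on a JSON round trip.

```python
import json


class TaggedList(list):
    """Hypothetical list subclass that also carries named attributes."""


items = TaggedList([1, 2, 3])
items.unit = "cm"  # a named attribute set on an array-like object

encoded = json.dumps(items)    # only the indexed members are serialized
decoded = json.loads(encoded)  # comes back as a plain list, "unit" is gone
```

The usual workaround is to wrap the members and the attributes in an outer object by convention, which is exactly the kind of extra structure MNML is trying to avoid.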

Introduction:

M Minimalistic (Model Driven?)
N Non-xml
M Markup
L Language

The actual definition of this acronym is subject to change. The goal of MNML is to create a simple, efficient but effective language to replace the use of XML and/or JSON as a data transport and storage specification in web and other software development. I will use every day English to describe the language explicitly as well as provide a formal grammar.

The design philosophies I am aiming for are simplicity and compactness.

Overview:

This section provides an overview of the structure and format of MNML. The MNML language consists of collections, attributes, values, strings, words, whitespace and comments. Collections, strings and words are collectively referred to as elements. Collections and attributes are structured data: they can contain child elements. Strings and words are unstructured: they do not contain child elements. A MNML document minimally consists of an implied root collection, its attributes (if any) and its child elements (if any). The implied root collection allows most conventional text documents to be valid MNML code if they do not contain MNML reserved characters.

A collection is a container structure that can contain 0 or more other elements. When writing MNML, a collection is enclosed by square brackets "[]". Inside that enclosure are the attributes and member elements of the collection. An element within a collection is a child of the collection. The collection containing an element or attribute is called the element's or attribute's parent. The ancestors of an element or attribute are its parent and all of the ancestors of its parent, up to and including the root collection.
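The parent/child/ancestor vocabulary above can be sketched as a small in-memory model. This is only an illustration, not the pending implementations mentioned later in this post; the class name `Collection` and the methods `add` and `ancestors` are assumptions of mine.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare collections by identity, not by content
class Collection:
    """Hypothetical model of a MNML collection."""
    attributes: dict = field(default_factory=dict)  # attribute name -> value or None
    children: list = field(default_factory=list)    # words, strings, collections
    parent: Optional["Collection"] = field(default=None, repr=False)

    def add(self, child):
        """Append a child element and record this collection as its parent."""
        if isinstance(child, Collection):
            child.parent = self
        self.children.append(child)
        return child

    def ancestors(self):
        """Return the parent, then the parent's ancestors, ending at the root."""
        found = []
        node = self.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found


root = Collection()              # stands in for the implied root collection
inner = root.add(Collection())
leaf = inner.add(Collection())
```

Here `leaf.ancestors()` yields `inner` followed by `root`, matching the definition of ancestors given above.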

An attribute is a named property of a collection that may or may not have a value associated with it. A collection may only have a single instance of any given attribute. When writing MNML, an attribute consists of an attribute identifier '@' followed by an attribute name consisting of any number of alphanumeric characters and optionally followed by an attribute value separator '|' followed by a value. Attributes are always located before any child elements.

A value is a single element associated with an attribute of a collection. When writing MNML, a value is separated from its attribute by a vertical bar "|". Not all attributes have values.

NOTE: THE SEPARATOR HAS BEEN CHANGED FROM SEMI-COLON TO VERTICAL BAR BECAUSE SEMI-COLONS ARE RELATIVELY COMMON IN BOTH STANDARD TEXT AND PROGRAMMING LANGUAGES.
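As a rough sketch of the attribute syntax, a regular expression can pick apart the '@', the name and the optional '|' value. Two simplifications are assumed here that the description does not state: the name needs at least one character, and the value is a bare word without escapes (a real value could be any single element). The function name `parse_attribute` is hypothetical.

```python
import re

# '@', then the name with no whitespace in between, then optionally '|' and a value.
# Assumption: the value is a bare word made of non-reserved, non-whitespace characters.
ATTRIBUTE = re.compile(r'@([A-Za-z0-9]+)(?:\s*\|\s*([^\s@|()\[\]"\\]+))?')


def parse_attribute(text):
    """Return (name, value); value is None when the attribute has no value."""
    match = ATTRIBUTE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid attribute: {text!r}")
    return match.group(1), match.group(2)
```

With this sketch, `parse_attribute("@id|42")` gives the name `id` with the value `42`, `parse_attribute("@hidden")` gives a name with no value, and `"@ id"` is rejected because whitespace is not allowed between the identifier and the name.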

A word is a basic token in the MNML language with arbitrary meaning. A word consists of any number of consecutive printable non-whitespace characters. A word can include MNML reserved characters by using an escape sequence. A valid MNML document may consist only of words if desired. This means that text documents are valid MNML documents if they do not contain reserved characters.
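The "plain text is already MNML" claim can be illustrated with a few lines: if a text has no reserved characters, reading it is just splitting on whitespace, and the result is the list of words in the implied root collection. The function name `read_plain_text` is made up for this sketch.

```python
RESERVED = set('@|()[]"\\')  # the MNML reserved characters


def read_plain_text(text):
    """Return the words of a reserved-character-free text as root-level children."""
    found = RESERVED & set(text)
    if found:
        raise ValueError(f"text contains reserved characters: {sorted(found)}")
    # str.split() with no argument splits on any run of whitespace,
    # so the amount and kind of whitespace does not matter.
    return text.split()
```

For example, `read_plain_text("the quick  brown\nfox")` yields the four words regardless of how they were spaced, while a text containing '@' would need escaping or quoting first.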

A string is another basic token in the MNML language with arbitrary meaning. A string consists of any number of consecutive printable characters enclosed within double quotes '"'. Inside a string, whitespace is part of the data. The only escape sequences valid within a string are those for the escape character '\' and the quote '"'; the other escape sequences are not valid.
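A minimal sketch of the string rules, assuming the only two escapes are the backslash and the quote as described above. The names `quote_string` and `unquote_string` are hypothetical. Because there is exactly one escaped form for every raw string, converting back and forth always gives the same result.

```python
def quote_string(raw):
    """Turn raw text into a MNML string token, escaping only '\\' and '"'."""
    return '"' + raw.replace('\\', '\\\\').replace('"', '\\"') + '"'


def unquote_string(token):
    """Turn a MNML string token back into raw text; reject any other escape."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError("a string must be enclosed in quotes")
    body = token[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '\\':
            # Only the escape character and the quote may follow a backslash.
            if i + 1 >= len(body) or body[i + 1] not in '\\"':
                raise ValueError("invalid escape sequence in string")
            out.append(body[i + 1])
            i += 2
        elif ch == '"':
            raise ValueError("unescaped quote inside string")
        else:
            out.append(ch)  # whitespace and everything else is data
            i += 1
    return ''.join(out)
```

A sequence such as `\n` inside a string is rejected by this sketch, since the word-level escapes are not valid there; a literal new line can simply be written as is.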

Whitespace consists of the space, horizontal tab, new line characters and any other Unicode whitespace character. Whitespace is in most cases optional and ignored. The only place in a MNML file/stream where whitespace is required is between two adjacent words. The only place where whitespace is prohibited is between the attribute identifier and its corresponding attribute name. The type, amount and arrangement of whitespace used outside a string has no bearing on the meaning of a MNML document at all.

Comments are for documenting the markup used with MNML. When writing MNML, a comment is enclosed by parentheses "()". Comments are ignored as if they were whitespace.
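The whitespace and comment rules can be sketched as a tiny tokenizer that throws both away. This is only an illustration under a few assumptions not settled above: comments do not nest, a backslash always protects the character after it, and an attribute name is kept together with its '@' as one token. The function name `tokenize` is hypothetical.

```python
def tokenize(text):
    """Split MNML text into tokens, ignoring whitespace and comments."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1                                   # whitespace carries no meaning
        elif ch == '(':
            i += 1
            while i < n and text[i] != ')':          # assumption: no nested comments
                i += 2 if text[i] == '\\' else 1
            if i >= n:
                raise ValueError("unterminated comment")
            i += 1                                   # skip the closing ')'
        elif ch == ')':
            raise ValueError("')' without a matching '('")
        elif ch in '[]|':
            tokens.append(ch)
            i += 1
        elif ch == '"':
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == '\\' else 1     # skip escaped characters
            if j >= n:
                raise ValueError("unterminated string")
            tokens.append(text[i:j + 1])             # keep the quotes on the token
            i = j + 1
        else:
            j = i + 1 if ch == '@' else i            # '@' stays glued to its name
            while j < n and not text[j].isspace() and text[j] not in '@|()[]"':
                j += 2 if text[j] == '\\' else 1
            tokens.append(text[i:j])
            i = j
    return tokens


compact = tokenize('[@id|7 hello "a b"]')
spread = tokenize('[ @id | 7\n\thello (a greeting) "a b" ]')
```

Both inputs produce the same token list, which is the point of the rule that whitespace outside a string has no bearing on meaning.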

Reserved characters for MNML are '@', '|', '(', ')', '[', ']', '"' and '\'. These can be escaped by using the '\' character. Other escape sequences include space '\s', tab '\t', new line '\n'. These escape sequences produce characters that may be part of words or strings.
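A minimal sketch of escaping and unescaping a word, using only the sequences listed above. The names `escape_word` and `unescape_word` are hypothetical, and whitespace characters other than space, tab and new line are passed through untouched here because no escape is defined for them.

```python
RESERVED = '@|()[]"\\'                      # each is escaped as '\' plus itself
WHITESPACE_ESCAPES = {' ': 's', '\t': 't', '\n': 'n'}
WHITESPACE_UNESCAPES = {code: ch for ch, code in WHITESPACE_ESCAPES.items()}


def escape_word(raw):
    """Encode raw text so that it can be written as a single MNML word."""
    out = []
    for ch in raw:
        if ch in RESERVED:
            out.append('\\' + ch)
        elif ch in WHITESPACE_ESCAPES:
            out.append('\\' + WHITESPACE_ESCAPES[ch])
        else:
            out.append(ch)
    return ''.join(out)


def unescape_word(word):
    """Decode a MNML word back into raw text; reject unknown escapes."""
    out = []
    chars = iter(word)
    for ch in chars:
        if ch != '\\':
            out.append(ch)
            continue
        following = next(chars, None)       # the character after the backslash
        if following is None:
            raise ValueError("dangling escape character")
        if following in RESERVED:
            out.append(following)
        elif following in WHITESPACE_UNESCAPES:
            out.append(WHITESPACE_UNESCAPES[following])
        else:
            raise ValueError(f"unknown escape sequence: \\{following}")
    return ''.join(out)
```

For instance, `escape_word("user@host")` produces `user\@host`, and `escape_word("a b")` produces the single word `a\sb`.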

Spoiler: Example MNML compliant document: (click to show/hide)

Spoiler: Formal Grammar: (click to show/hide)

javascript implementation:
pending...

java implementation:
pending...

Spoiler: c# implementation in progress (click to show/hide)

Recent Changes:
Removed alternate escape sequences from strings because they create an issue when converting strings between escaped and unescaped values. When a string transitions back and forth between escaped and unescaped it should always be the same when in the same state, and allowing mixed use of escapes prevents that.

Removed escape sequences for collapse space because it is problematic. MNML doesn't really have a concept of "display", and if display means unescaping and merging two escaped words, it runs into the same consistency issue that faces strings with mixed escapes. This may be able to make a comeback by using one of Unicode's 'skinny' whitespace characters as the unescaped value for \c, though it may have inconsistent results for systems using fixed width fonts.

Removed escape sequence for unicode character because it is unnecessary.

Simplified the whitespace rule because the new rule is both simpler and more compact.

Feedback:
Comments: I am growing unsatisfied with comments. The '(' and ')' characters are common enough that they will require a lot of escaping. Yet I do not want to remove the concept of comments.

I've been working fairly steadily for the last 3 or 4 days to produce and unit test a decent java implementation, and I should have that up soon.

I am also looking for any other comments or criticisms.

JanusTwoface · « **Reply #1 on:** August 21, 2012, 10:33:22 pm »

I guess my biggest question would be why? There are already dozens of markup languages:

Spoiler: List of markup languages (from the bottom of the JSON Wikipedia page) (click to show/hide)

Surely something in there can scratch that itch you've got going there. Which, come to think of it, I'm not really sure what your itch is. You don't like that XML is so verbose. Got that. And JSON has issues with some types of objects. It's work-around-able, but fair enough. And it seems, though, that you want something capable of data transport rather than just marking up documents.

Have you looked at YAML? It's a lot less verbose than XML and can handle all sorts of arbitrary objects. (And it's actually a superset of JSON).

Spoiler: Example YAML (click to show/hide)

(source)


--- !clarkevans.com/^invoice
invoice: 34843
date   : 2001-01-23
bill-to: &id001
    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city    : Royal Oak
        state   : MI
        postal  : 48046
ship-to: *id001
product:
    - sku         : BL394D
      quantity    : 4
      description : Basketball
      price       : 450.00
    - sku         : BL4438H
      quantity    : 1
      description : Super Hoop
      price       : 2392.00
tax  : 251.42
total: 4443.52
comments: >
    Late afternoon is best.
    Backup contact is Nancy
    Billsmer @ 338-4338.

Alternatively, consider s-expressions, the basis of languages like LISP and Scheme, where data and code have the same format. Basically, you can have something like this:

Spoiler: Example s-expression (click to show/hide)

So you get the same nice nested structure of JSON/XML but without the redundant closing tags of XML and a bit more flexibility than JSON gives you.

Side note: I really don't mean this to be harsh and if you're just in it to see if you can make a markup language with parsers / generators, well power to you. It's a neat project. Just don't spend too much time reinventing such a popular wheel.

Thief^ · « **Reply #2 on:** August 22, 2012, 04:27:28 am »

What about SGML?

It's what both html and xml are based on, and lets you do things like this:
<TITLE/abcdef/
and this:
<TITLE>abcdef</>
which are both equivalent to (and considerably shorter than):
<TITLE>abcdef</TITLE>

Nadaka · « **Reply #3 on:** August 22, 2012, 11:56:37 am »

YAML: I am not fond of whitespace being important because it makes human reading and writing more error prone.
S-Expressions: no way to distinguish between named elements (attributes) and members of a collection and comments are sketchy.
SGML: it suffers from the same issues as xml and is further complicated by having multiple representations for the same data.

Also: because I can, and I don't really get to use most of my education at work. And it might be useful to someone else.

JanusTwoface · « **Reply #4 on:** August 22, 2012, 12:06:44 pm »

Quote from: Nadaka on August 22, 2012, 11:56:37 am

YAML: I am not fond of whitespace being important because it makes human reading and writing more error prone.

That's a fair enough point and actually part of the reason I personally don't use it.

Quote from: Nadaka on August 22, 2012, 11:56:37 am

S-Expressions: no way to distinguish between named elements (attributes) and members of a collection and comments are sketchy.

What's the difference? It all really depends on how you want to structure your data.

And what are comments really?

They're either actually part of the data / important to understanding the data (in which case why can't you encode them as any other data, just with a 'comment' tag of some sort) or they aren't, in which case why are they even there?

And comments in s-expressions (if you're dealing with Scheme) are really easy as well. You can comment out a single line with ; , an entire block with #| ... |# and a single s-expression with #; (the commented-out expression can contain any amount of information or span multiple lines).

Nadaka · « **Reply #5 on:** August 23, 2012, 08:34:00 pm »

As for comments in s-expressions, I was going off the basic concept, not any particular language implementation. I didn't see a specification for comments.

You can have fields labeled comment that are real data that shouldn't be discarded as a comment when dealing with arbitrary data structures, for instance . This would result in an inherently ambiguous grammar if you start making that assumption.

However, I would not need separate comment reserved chars if I eventually added a way to mark elements with some kind of metadata. But that is opening a whole new can of worms, and I don't want to go there just yet.

On a different note: I will be adding an escape sequence for null (\0), because I can't just assume that every collection won't have null values in it.

Bay 12 Games Forum
