I am creating this thread so I don't spam up the "if self.isCoder(): post() #Programming Thread".
MNML
Preface:
What is bad about XML:
1: Verbosity. Each element name appears twice (in the opening and the closing tag), and each of those names is enclosed in its own set of angle brackets.
2: Whitespace. XML has fiddly rules for which whitespace is and is not significant, and implementations interpret them inconsistently (thank you, Microsoft).
3: Validity. An entire document must be well formed or be discarded entirely. Combine this with the single root element requirement and you have a format unsuitable for streaming data sources.
4: There is more, but screw it for the moment.
What is bad about JSON:
JSON has difficulty serializing certain types of objects. In JavaScript every array is an object, and any object can be indexed as if it were an array. JSON cannot represent this because it does not allow named attributes on arrays, nor does it allow values to be put into an object as if it were an array, even though both are possible and indeed frequent in JavaScript.
Introduction:
M Minimalistic (Model Driven?)
N Non-XML
M Markup
L Language
The actual definition of this acronym is subject to change. The goal of MNML is to create a simple and efficient yet effective language to replace XML and/or JSON as a data transport and storage format in web and other software development. I will describe the language explicitly in everyday English as well as provide a formal grammar.
The design philosophies I am aiming for are simplicity and compactness.
This section provides an overview of the structure and format of MNML. The MNML language consists of collections, attributes, values, strings, words, whitespace and comments. Collections, strings and words are collectively referred to as elements. Collections and attributes are structured data: they can contain child elements. Strings and words are unstructured: they do not contain child elements. A MNML document minimally consists of an implied root collection, its attributes (if any) and its child elements (if any). The implied root collection allows most conventional text documents to be valid MNML if they do not contain MNML reserved characters.
A collection is a container structure that can contain 0 or more other elements. When writing MNML, a collection is enclosed by square brackets "[]". Inside that enclosure are the attributes and member elements of the collection. An element within a collection is a child of the collection. The collection containing an element or attribute is called the element's or attribute's parent. The ancestors of an element or attribute are its parent and all of the ancestors of its parent, including the root collection.
An attribute is a named property of a collection that may or may not have a value associated with it. A collection may only have a single instance of any given attribute. When writing MNML, an attribute consists of an attribute identifier '@', followed by an attribute name consisting of one or more alphanumeric characters, optionally followed by an attribute value separator '|' and a value. Attributes are always located before any child elements.
A value is a single element associated with an attribute of a collection. When writing MNML, a value is separated from its attribute by a vertical bar "|". Not all attributes have values.
NOTE: THE SEPARATOR HAS BEEN CHANGED TO '|' FROM SEMI-COLON BECAUSE SEMI-COLONS ARE RELATIVELY COMMON IN BOTH STANDARD TEXT AND PROGRAMMING LANGUAGES.
A word is a basic token in the MNML language with arbitrary meaning. A word consists of one or more consecutive printable non-whitespace characters. A word can include MNML reserved characters by using an escape sequence. A valid MNML document may consist only of words if desired. This means that text documents are valid MNML documents if they do not contain reserved characters.
A string is another basic token in the MNML language with arbitrary meaning. A string consists of any number of consecutive printable characters enclosed within quotes '""'. Inside a string, whitespace is part of the data. The only escape sequences needed within a string are those for the escape character '\' and the quote '"'; the other escape sequences are not valid.
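As a rough sketch of the string rule above: escaping only has to touch the escape character and the quote, and unescaping rejects every other sequence. The class and method names here are made up for illustration and are not part of the C# implementation further down.

```csharp
using System;
using System.Text;

// Hypothetical helper, shown only to illustrate the string escaping rule.
public static class MnmlStringEscapes
{
    // Turns raw data into the text that goes between the quotes.
    public static string Escape(string raw)
    {
        var sb = new StringBuilder();
        foreach (char c in raw)
        {
            if (c == '\\' || c == '"') sb.Append('\\');
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Reverses Escape. Any escape other than \\ and \" is rejected.
    public static string Unescape(string escaped)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < escaped.Length; i++)
        {
            char c = escaped[i];
            if (c == '"')
            {
                throw new FormatException("Unescaped quote at index " + i);
            }
            if (c == '\\')
            {
                i++;
                if (i >= escaped.Length || (escaped[i] != '\\' && escaped[i] != '"'))
                {
                    throw new FormatException("Invalid escape at index " + (i - 1));
                }
                c = escaped[i];
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Because each raw string has exactly one escaped form, going back and forth between the two always gives the same text in the same state.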
Whitespace consists of the space, horizontal tab and new line characters and any other Unicode whitespace character. Whitespace is in most cases optional and ignored. The only place in a MNML file or stream where whitespace is required is between two adjacent words. The only place where whitespace is prohibited is between the attribute identifier and its corresponding attribute name. The type, amount and arrangement of whitespace used outside a string has no bearing on the meaning of a MNML document at all.
Comments are for documenting the markup used with MNML. When writing MNML, a comment is enclosed by parentheses "()". Comments are ignored as if they were whitespace.
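One way to honor "ignored as if they were whitespace" is a pre-pass that replaces each comment with a single space before parsing. The sketch below is an assumption about how that could be done, not the posted implementation; `MnmlComments.Strip` is a made-up name. It leaves parentheses alone inside strings and after the escape character, and it treats a comment as ending at the first ')'.

```csharp
using System;
using System.Text;

// Hypothetical pre-pass: replaces every comment with one space.
public static class MnmlComments
{
    public static string Strip(string source)
    {
        var sb = new StringBuilder();
        bool inString = false;
        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            if (c == '\\' && i + 1 < source.Length)
            {
                // Copy an escape sequence verbatim so \( and \" are not misread.
                sb.Append(c).Append(source[++i]);
            }
            else if (c == '"')
            {
                inString = !inString;
                sb.Append(c);
            }
            else if (c == '(' && !inString)
            {
                int close = source.IndexOf(')', i);
                if (close < 0)
                {
                    throw new FormatException("Unterminated comment at index " + i);
                }
                sb.Append(' ');
                i = close; // resume after the closing parenthesis
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Putting a space where the comment was keeps two words apart when a comment was the only thing between them.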
Reserved characters for MNML are '@', '|', '(', ')', '[', ']', '"' and '\'. These can be escaped by prefixing them with the '\' character. Other escape sequences include space '\s', tab '\t' and new line '\n'. These escape sequences produce characters that may be part of words or strings.
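For words, the escaping rule could be sketched like this. The names are hypothetical, and rejecting whitespace that has no escape sequence (a carriage return, for example) is my assumption, since the rules above do not say what should happen to it.

```csharp
using System;
using System.Text;

// Hypothetical helper, shown only to illustrate the word escaping rule.
public static class MnmlWordEscapes
{
    private const string Reserved = "@|()[]\"\\";

    // Turns arbitrary text into a single MNML word.
    public static string Escape(string raw)
    {
        if (raw.Length == 0)
        {
            throw new ArgumentException("A word needs at least one character.");
        }
        var sb = new StringBuilder();
        foreach (char c in raw)
        {
            if (c == ' ') sb.Append("\\s");
            else if (c == '\t') sb.Append("\\t");
            else if (c == '\n') sb.Append("\\n");
            else if (char.IsWhiteSpace(c))
            {
                // Assumption: whitespace without an escape sequence cannot be in a word.
                throw new ArgumentException("No escape sequence exists for this whitespace character.");
            }
            else
            {
                if (Reserved.IndexOf(c) >= 0) sb.Append('\\');
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Reverses Escape and rejects text that is not a valid word.
    public static string Unescape(string escaped)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < escaped.Length; i++)
        {
            char c = escaped[i];
            if (c != '\\')
            {
                if (Reserved.IndexOf(c) >= 0 || char.IsWhiteSpace(c))
                {
                    throw new FormatException("Unescaped character at index " + i);
                }
                sb.Append(c);
                continue;
            }
            i++;
            if (i >= escaped.Length) throw new FormatException("Dangling escape character.");
            char e = escaped[i];
            if (e == 's') sb.Append(' ');
            else if (e == 't') sb.Append('\t');
            else if (e == 'n') sb.Append('\n');
            else if (Reserved.IndexOf(e) >= 0) sb.Append(e);
            else throw new FormatException("Unknown escape sequence at index " + (i - 1));
        }
        return sb.ToString();
    }
}
```

With this, `Escape("a b@c")` gives the single word `a\sb\@c`, and `Unescape` turns it back into the original text.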
@title|[@huh|what? This is the title of this document.]
@some|stuff
This is some content of the document. (this is a comment) [this is some more content and is the 8th element in the root element.] [@foo This has/is a foo.] [@bar|baz This thing's bar is baz.] \n \( this is not a comment but is on a new line. The parentheses are separate words \) [this element has 5 words.] " This element is a string with a lot of extra whitespace and some characters that would otherwise have to be escaped []@()| "
Note: The current whitespace rules cannot be formalized in the grammar in a simple and consistent way.
Note: Whitespace characters terminate every token except for strings.
Note: Tokens are italics.
Note: Terminals are bold.
Note: The '-' character is being used as a separator to indicate that no form of whitespace may be present.
Note: The word epsilon is being used in place of the traditional epsilon character.
Note: There may be some reordering of rules and epsilon optimizations, or minor errors.
Note: This grammar assumes that comments are stripped out before processing.
document -> collectionContents
collectionContents -> attribute attributes elements | elements
attributes -> attribute attributes | epsilon
attribute -> attributeIdentifier-attributeName value
attributeIdentifier -> @
attributeName -> alphaNumericCharacter-alphaNumericCharacters
value -> '|' element | epsilon
alphaNumericCharacters -> alphaNumericCharacter-alphaNumericCharacters | epsilon
elements -> element elements | epsilon
element -> collection | string | word
collection -> collectionBeginChar collectionContents collectionEndChar
collectionBeginChar -> [
collectionEndChar -> ]
string -> "-stringCharacters-"
stringCharacters -> stringCharacter-stringCharacters | epsilon
stringCharacter -> nonEscapeNonQuoteCharacter | stringEscapeSequence
nonEscapeNonQuoteCharacter -> [ALL CHARACTERS THAT ARE NOT \ OR "]
stringEscapeSequence -> escapeCharacter-stringReservedCharacter
escapeCharacter -> \
stringReservedCharacter -> \ | "
word -> wordCharacter-wordCharacters
wordCharacter -> wordEscapeSequence | nonWhitespaceNonReservedCharacter
wordCharacters -> wordCharacter-wordCharacters | epsilon
wordEscapeSequence -> escapeCharacter-sequence
sequence -> reservedCharacter | s | t | n
reservedCharacter -> [ | ] | ( | ) | \ | @ | '|' | "
nonWhitespaceNonReservedCharacter -> [ALL CHARACTERS THAT ARE NOT IN reservedCharacter AND ARE NOT WHITESPACE]
alphaNumericCharacter -> [a-zA-Z0-9 AND UNICODE CHARACTERS OF TYPE LETTER]
whiteSpace -> whiteSpaceCharacter whiteSpaceCharacters
whiteSpaceCharacters -> whiteSpaceCharacter whiteSpaceCharacters | epsilon
whiteSpaceCharacter -> [ALL CHARACTERS THAT ARE WHITESPACE] | comment
comment -> (-nonCloseParenCharacters-)
nonCloseParenCharacters -> nonCloseParenCharacter nonCloseParenCharacters | epsilon
nonCloseParenCharacter -> [ALL CHARACTERS THAT ARE NOT )]
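To show how the grammar maps onto code, here is a minimal recursive-descent sketch. It is not the reader for the implementation further down: `Node` and `MnmlSketchParser` are hypothetical names. Like the grammar, it assumes comments have already been stripped. It also uses `char.IsLetterOrDigit` for attribute names, which accepts slightly more than alphaNumericCharacter does (non-ASCII digits, for example).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical node type for this sketch: Kind is "word", "string" or "collection".
public sealed class Node
{
    public string Kind, Text; // Text is the decoded content of a word or string
    public Dictionary<string, Node> Attributes = new Dictionary<string, Node>();
    public List<Node> Children = new List<Node>();
}

public sealed class MnmlSketchParser
{
    private const string Reserved = "@|()[]\"\\";
    private readonly string _src;
    private int _pos;
    private MnmlSketchParser(string src) { _src = src; }
    private FormatException Error(string message) { return new FormatException(message + " at index " + _pos); }
    private void SkipWhitespace() { while (_pos < _src.Length && char.IsWhiteSpace(_src[_pos])) _pos++; }
    // Skips whitespace, then reports whether the next character is c.
    private bool NextIs(char c) { SkipWhitespace(); return _pos < _src.Length && _src[_pos] == c; }

    // document -> collectionContents
    public static Node Parse(string src)
    {
        var p = new MnmlSketchParser(src);
        Node root = p.CollectionContents();
        if (p._pos < src.Length) throw p.Error("Unexpected character");
        return root;
    }

    // collectionContents -> attributes elements
    private Node CollectionContents()
    {
        var node = new Node { Kind = "collection" };
        while (NextIs('@'))
        {
            int start = ++_pos; // no whitespace is allowed between '@' and the name
            while (_pos < _src.Length && char.IsLetterOrDigit(_src[_pos])) _pos++;
            if (_pos == start) throw Error("Attribute name expected");
            string name = _src.Substring(start, _pos - start);
            if (node.Attributes.ContainsKey(name)) throw Error("Duplicate attribute");
            Node value = null; // stays null for an attribute without a value
            if (NextIs('|'))
            {
                _pos++;
                value = Element();
                if (value == null) throw Error("Attribute value expected");
            }
            node.Attributes[name] = value;
        }
        for (Node child = Element(); child != null; child = Element()) node.Children.Add(child);
        return node;
    }

    // element -> collection | string | word; returns null when no element starts here
    private Node Element()
    {
        if (NextIs('['))
        {
            _pos++;
            Node inner = CollectionContents();
            if (!NextIs(']')) throw Error("']' expected");
            _pos++;
            return inner;
        }
        if (NextIs('"')) return ReadText(true);
        bool startsWord = _pos < _src.Length && (NextIs('\\') || Reserved.IndexOf(_src[_pos]) < 0);
        return startsWord ? ReadText(false) : null;
    }

    // Reads a quoted string or a bare word, decoding escape sequences as it goes.
    private Node ReadText(bool quoted)
    {
        var sb = new StringBuilder();
        if (quoted) _pos++; // opening quote
        for (; _pos < _src.Length; _pos++)
        {
            char c = _src[_pos];
            if (c == '\\')
            {
                if (++_pos >= _src.Length) throw Error("Dangling escape character");
                c = _src[_pos];
                // Strings allow only \\ and \"; words allow every reserved character plus s, t, n.
                int i = (quoted ? "\\\"" : Reserved + "stn").IndexOf(c);
                if (i < 0) throw Error("Invalid escape sequence");
                if (!quoted && i >= Reserved.Length) c = " \t\n"[i - Reserved.Length];
            }
            else if (quoted ? c == '"' : (Reserved.IndexOf(c) >= 0 || char.IsWhiteSpace(c))) break;
            sb.Append(c);
        }
        if (quoted && !NextIs('"')) throw Error("Unterminated string");
        if (quoted) _pos++; // closing quote
        return new Node { Kind = quoted ? "string" : "word", Text = sb.ToString() };
    }
}
```

Calling `MnmlSketchParser.Parse("@title|[@huh|what? The title] @flag hello")` should give a root with two attributes (`title`, whose value is a collection, and `flag`, which has no value) and one child word. An attribute that appears after a child element is reported as an error, since attributes always come first.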
JavaScript implementation:
pending...
Java implementation:
pending...
C# implementation:
using System;
using System.Collections.Generic;
using System.Collections;
using System.Linq;
using System.Text;
namespace us.nadaka.mnml
{
    // Placeholder: the reader has not been written yet.
    public class simpleMnmlReader
    {
    }
    public class simpleMnmlWriter
    {
        // Serializes a document. The implied root collection is written without brackets.
        public static void buildDocument(StringBuilder builder, MnmlCollection impliedRoot)
        {
            buildCollectionContents(builder, impliedRoot);
        }
        // Writes the attributes of a collection, then its child elements.
        private static void buildCollectionContents(StringBuilder builder, MnmlCollection collection)
        {
            bool hasChildren = (collection.ChildCount > 0);
            bool lastElementWasWord = false;
            foreach (string name in collection.AttributeList)
            {
                builder.Append("@");
                builder.Append(name);
                MnmlElement value;
                // An attribute stored with a null value is written as a bare name.
                if (collection.tryGetAttribute(name, out value) && value != null)
                {
                    builder.Append("|");
                    buildElement(builder, value, false);
                }
            }
            if (collection.AttributeCount > 0 && hasChildren)
            {
                // Separates the last attribute from the first child.
                builder.Append(" ");
            }
            foreach (MnmlElement element in collection.Children)
            {
                lastElementWasWord = buildElement(builder, element, lastElementWasWord);
            }
        }
        // Writes one element and reports whether it was a word, so the caller
        // knows to put a space before the next word. Word and string values are
        // written verbatim, so they must already be in escaped form.
        private static bool buildElement(StringBuilder builder, MnmlElement value, bool spaceIfWord)
        {
            MnmlCollection mCollection = value as MnmlCollection;
            if (mCollection != null)
            {
                builder.Append("[");
                buildCollectionContents(builder, mCollection);
                builder.Append("]");
                return false;
            }
            MnmlString mString = value as MnmlString;
            if (mString != null)
            {
                builder.Append("\"");
                builder.Append(mString.Value);
                builder.Append("\"");
                return false;
            }
            MnmlWord mWord = value as MnmlWord;
            if (mWord != null)
            {
                if (spaceIfWord)
                {
                    builder.Append(" ");
                }
                builder.Append(mWord.Value);
                return true;
            }
            return false;
        }
    }
    public class MnmlCollection : MnmlElement
    {
        private Dictionary<string, MnmlElement> _attributes = new Dictionary<string, MnmlElement>();
        private List<MnmlElement> _children = new List<MnmlElement>();
        public int AttributeCount
        {
            get { return _attributes.Count; }
        }
        public List<string> AttributeList
        {
            get { return new List<string>(_attributes.Keys); }
        }
        public void removeAttribute(string name)
        {
            _attributes.Remove(name);
        }
        // A null value stores an attribute that has no value. Invalid names are ignored.
        public void setAttribute(string name, MnmlElement value)
        {
            if (validateName(name))
            {
                _attributes[name] = value;
            }
        }
        // An attribute name is one or more ASCII digits or Unicode letters.
        private bool validateName(string name)
        {
            return !string.IsNullOrEmpty(name)
                && name.All(c => char.IsLetter(c) || (c >= '0' && c <= '9'));
        }
        public bool tryGetAttribute(string name, out MnmlElement value)
        {
            return _attributes.TryGetValue(name, out value);
        }
        public int ChildCount
        {
            get { return _children.Count; }
        }
        public List<MnmlElement> Children
        {
            get { return _children; }
        }
    }
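    public class MnmlWord : MnmlElement
    {
        private string _value;
        private MnmlWord() { }
        public MnmlWord(string value)
        {
            Value = value;
        }
        // Assumed to hold the escaped form, since the writer emits it verbatim.
        // Invalid values are ignored.
        public string Value
        {
            get { return _value; }
            set
            {
                if (validate(value))
                {
                    _value = value;
                }
            }
        }
        // A word is one or more characters, each either an escape sequence or a
        // character that is neither whitespace nor reserved.
        public static bool validate(string value)
        {
            if (string.IsNullOrEmpty(value))
            {
                return false;
            }
            for (int i = 0; i < value.Length; i++)
            {
                char c = value[i];
                if (c == '\\')
                {
                    // Skip the escaped character after checking it is a valid one.
                    i++;
                    if (i >= value.Length || "@|()[]\"\\stn".IndexOf(value[i]) < 0)
                    {
                        return false;
                    }
                }
                else if (char.IsWhiteSpace(c) || "@|()[]\"".IndexOf(c) >= 0)
                {
                    return false;
                }
            }
            return true;
        }
    }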
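    public class MnmlString : MnmlElement
    {
        private string _value;
        private MnmlString() { }
        public MnmlString(string value)
        {
            Value = value;
        }
        // Assumed to hold the escaped form, since the writer emits it verbatim.
        // Invalid values are ignored.
        public string Value
        {
            get { return _value; }
            set
            {
                if (validate(value))
                {
                    _value = value;
                }
            }
        }
        // A string may be empty. The escape character must be followed by
        // '\' or '"', and a quote may only appear escaped.
        public static bool validate(string value)
        {
            if (value == null)
            {
                return false;
            }
            for (int i = 0; i < value.Length; i++)
            {
                char c = value[i];
                if (c == '\\')
                {
                    i++;
                    if (i >= value.Length || (value[i] != '\\' && value[i] != '"'))
                    {
                        return false;
                    }
                }
                else if (c == '"')
                {
                    return false;
                }
            }
            return true;
        }
    }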
    // Common base type for collections, strings and words.
    public class MnmlElement
    {
    }
}
Recent Changes:
Removed alternate escape sequences from strings because they create an issue when converting strings between escaped and unescaped values. When a string transitions back and forth between escaped and unescaped it should always be the same when in the same state, and allowing mixed use of escapes prevents that.
Removed the escape sequence for collapse space because it is problematic. MNML doesn't really have a concept of "display", and if display means unescaping and merging two escaped words, it runs into the same consistency issue that faces strings with mixed escapes. This may be able to make a comeback by using one of Unicode's 'skinny' whitespace characters as the unescaped value for \c, though it may have inconsistent results for systems using fixed width fonts.
Removed the escape sequence for a Unicode character because it is unnecessary.
Simplified the whitespace rule because the new rule is both simpler and more compact.
Feedback:
Comments: I am growing unsatisfied with comments. The '(' and ')' characters are common enough that they will require a lot of escaping, yet I do not want to remove the concept of comments.
I've been working fairly steadily for the last 3 or 4 days to produce and unit test a decent Java implementation, and I should have that up soon.
I am also looking for any other comments or criticisms.