XML vs JSON is such a fascinating topic: a clash of different approaches. Academic vs practical might be one way of describing it, but I'm sure there are many others. Are there any good books or essays on it?
And having been there in the 90s, seeing XML and thinking "they've tried to do everything and got into a big mess" (XSLT, anyone?), it's so satisfying to see something simple and flawed displace it.
Sometimes I just don't really see it as a competition. For my research work, XML is often a markup language, which JSON is not. If I want to qualitatively code a text, this is perfectly doable by hand:
<note>
<paragraph>This is a <code vibe="positive">remarkable</code> text.
</paragraph>
</note>
If you do it in JSON, it is not really readable anymore, and I would have to write a GUI for the input, e.g.:
{"note":
{"paragraph":
["This is a ", {"code": [{"vibe":"positive"}, "remarkable"]}, "text."]
}
}
IMO, you made the JSON unreadable with all the extra whitespace that you didn't add to the XML.
Also, in this particular case I think the child/property dichotomy works in the favor of XML, but typically I find it to be more of a liability than an asset.
Yeah, S-expressions are way more pleasant to write.
With the benefit of a lot of hindsight, S-expressions seem like a superior choice for writing web applications (instead of HTML + JavaScript + some JS framework that writes HTML again (regardless of DOM vs Virtual DOM)).
Even though I prefer dialects like Fennel for programming rather than Common Lisp (I'd probably be fine with Clojure and Janet as well, but I haven't tried them), I wouldn't mind any dialect if it meant I could use S-expressions instead of HTML+JS, assuming as much effort were put into sandboxing that approach as has been put into the current one.
I love Lisp, but the quotes, and the structure... markup really shines here. Sorry. Same with JsonML, even though I'd never heard of it before; I will have a look, because it just sounds so interesting.
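For the curious: JsonML's convention is an array of [tag, optional attribute object, children]. A minimal sketch of the earlier `<note>` example, with a toy renderer of my own (not part of any JsonML tooling):

```python
# JsonML encodes an element as [tag, optional attribute dict, *children];
# the <note> example from above becomes:
note = ["note",
        ["paragraph",
         "This is a ",
         ["code", {"vibe": "positive"}, "remarkable"],
         " text."]]

def jsonml_to_xml(node):
    """Render a JsonML node back to an XML string (no escaping, for brevity)."""
    if isinstance(node, str):
        return node
    tag, rest = node[0], node[1:]
    attrs = ""
    if rest and isinstance(rest[0], dict):
        attrs = "".join(f' {k}="{v}"' for k, v in rest[0].items())
        rest = rest[1:]
    return f"<{tag}{attrs}>" + "".join(map(jsonml_to_xml, rest)) + f"</{tag}>"

print(jsonml_to_xml(note))
```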
In many respects, the failure of XML to live up to the grandiose promises made in the 90s was due to the lack of decent tools around it. Decent structure-aware, refactoring editors didn't exist then, and barely do now. XML cries out for something like paredit, but dumbed down and tag-aware. The author shouldn't even have to see the textual representation of the tags.
But, primarily, the thing that made JSON win out is that it reflects the existing semantic/data-structure model of an actual programming language (and approximates that of many others.)
XML (and XML Schema, oh god) has this mixture of associative and sequentially ordered data, and two separate kinds of associative/nesting models (elements vs attributes). It wants to eat the world with its new semantic model, but it doesn't offer any particular semantic advantage, and it doesn't match what programming languages (or relational databases, for that matter) work with.
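The element/attribute duality is easy to see from code. A tiny illustration with Python's stdlib ElementTree, using a made-up `<point>` document:

```python
import xml.etree.ElementTree as ET

# The same fact can live in an attribute or in a child element, and XML
# gives no guidance on which -- two documents, two access paths:
a = ET.fromstring('<point x="1" y="2"/>')
b = ET.fromstring('<point><x>1</x><y>2</y></point>')

print(a.get("x"))        # attribute access
print(b.find("x").text)  # child-element access: different code, same datum
```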
And then on top of that the syntax is just ugly to read.
Still, there was a moment in the 90s where I found XML kind of exciting and interesting. I don't really recall why now. Everybody seemed to.
I also remember being excited about XML in the 90s and I can recall why. At the time, every application I used had its own bespoke file format and needed a custom parser to read its output. This especially applied to several applications storing hierarchical information in CSV files with their own way of delimiting CSV data as content in a CSV cell. The switch to XML meant that I'd have to figure out the tag soup that the developer had chosen for their XML data, but that was so much more comfortable than figuring out the goulash of ASCII escape tags in the old format.
XML is like the pager. It might seem clunky today, but it's better than having to tell the baby-sitter the phone number for the restaurant and theatre you'd be visiting that night.
The interesting thing for me is that I've circled back to "CSV is good, actually" since for loosely structured source input, it has few equals - it lets you organize by cell, and you can use spreadsheet software to edit it. The balance shifts towards schema generation once you want to apply a specific model on the data, but "import CSV to my schema" is convenient as an interface to that.
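A minimal sketch of that "import CSV to my schema" boundary step in Python (the column names and data here are invented for illustration):

```python
import csv
import io

# Loosely structured input, editable cell-by-cell in a spreadsheet,
# saved as CSV...
raw = "name,age\nAda,36\nGrace,45\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# ...and applying a specific model is then a single mapping step at the
# boundary, where types and required fields get imposed:
people = [{"name": r["name"], "age": int(r["age"])} for r in rows]
print(people)
```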
XML also has a place as an interface, but it's much more niche, since most interfaces that are hierarchical-document-shaped are going to parse text instead of bracketed entities. I think we got a bit excited about how well HTML worked at the time.
Any graph can be represented as the set of tuples representing its edges, and in fact this is the most flexible form.
It depends on whether you need it "human readable" or not, but a hierarchical/network representation of a graph is only one representation.
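A toy sketch of the edge-tuple form in Python. The node ids are invented so that repeated tags keep their identity, and text nodes are prefixed with "text:":

```python
# A small document tree flattened into (parent, child) edge tuples:
edges = [
    ("note#1", "paragraph#1"),
    ("paragraph#1", "text:This is a "),
    ("paragraph#1", "code#1"),
    ("code#1", "text:remarkable"),
    ("paragraph#1", "text: text."),
]

def children(node):
    """Recover the hierarchical view from the flat edge set."""
    return [c for p, c in edges if p == node]

print(children("paragraph#1"))
```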
XML has a unique spot in the intersection of payload and human crafted file.
JSON is something I see as payload-only really. And YAML etc are human-craft only.
One thing I would change in XML is that closing a tag should not need to specify the name. <> should close any tag. It would make editing a bit easier and save a byte or two.
SGML is also the reason why an HTML parser is often much more complex than an XML parser, even though modern HTML is no longer SGML but a simplified version of it.
- XML parsers were quite bad in the old days (2000), even in Java land.
- Parser speed is a function of the input size, and XML inputs are quite big. Too big, in my humble opinion.
- You have tag attributes and tag values, so people get confused about which to use in simple scenarios.
- & must be escaped, but no one does it. So when 'AT&T' ended up in a stream created by "hand" by a COBOL procedure, the XML suddenly broke.
- < and > must be escaped, and so your SQL queries must be escaped.
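The AT&T failure mode is easy to reproduce. A small Python sketch using the stdlib escaping helper:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "AT&T"

# Splicing the raw string into hand-built XML yields a document that is
# not well-formed, because '&' starts an entity reference:
try:
    ET.fromstring(f"<company>{raw}</company>")
except ET.ParseError:
    print("broken")

# Escaping first ('&' becomes '&amp;') keeps it valid and round-trips:
doc = f"<company>{escape(raw)}</company>"
print(ET.fromstring(doc).text)
```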
JSON is simply an associative, hierarchical map.
A lot of 1995-era PHP code used the same data structure, and it just works.
No attributes, only a recursive hierarchical structure.
This is true but in the early days XML framework libraries gained a well deserved reputation for poor performance because they would parse into a DOM first and then serialize from there. I improved the runtimes of several projects from hours to seconds by switching the serialization from DOM to SAX. The prevalence and focus on the document model for every usage of XML resulted in a reputation hit. One that XML has really struggled to overcome.
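A toy illustration of the DOM-versus-streaming distinction, using Python's stdlib ElementTree with iterparse standing in for SAX (the document here is invented):

```python
import io
import xml.etree.ElementTree as ET

xml_doc = "<orders>" + "".join(f'<order id="{i}"/>' for i in range(1000)) + "</orders>"

# DOM-style: build the whole tree in memory first, then walk it.
tree = ET.fromstring(xml_doc)
dom_count = len(tree.findall("order"))

# Streaming-style (the SAX idea): react to events as they arrive and
# discard each element immediately, so memory use stays flat.
stream_count = 0
for _event, elem in ET.iterparse(io.StringIO(xml_doc), events=("end",)):
    if elem.tag == "order":
        stream_count += 1
        elem.clear()

print(dom_count, stream_count)
```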
That's all true, I'd just add that in JSON you have to escape " in strings just like you have to escape & < and > in XML, so there is still the potential for e.g. COBOL software to produce invalid files.
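Same failure mode, JSON edition; a small Python sketch:

```python
import json

s = 'She said "hi"'
print(json.dumps(s))  # the serializer escapes the inner quotes for you

# Hand-built JSON with unescaped quotes is just as broken as unescaped XML:
try:
    json.loads('{"msg": "She said "hi""}')
except json.JSONDecodeError:
    print("broken")
```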
Around the time XML was released, I asked a (smarter than me) friend what the hubbub was. He got a pained look on his face and said: people think it'll solve a problem, and it won't.
His reasoning was that the problem isn't standard file formats. It's turning those into something a program can operate on as data[1]. And XML does not do that for you. No, you have to suck it in and then manually transform it into 'data'. So it just moves the problem from one place to another, which is usually a trash solution.
Far as I know JSON broken as it is does do that. At least with dynamic/ish languages.
[1] gross stuff people used to do like copy data structures in memory directly to a file actually solves that problem while being a maintenance nightmare.
I consider this a feature. XML specifies a data exchange format. How this is loaded into memory depends on your use case, language etc. You can use a DOM if this is appropriate or serialize into custom or native data structures. A browser rendering a web page will use different data structures than a crawler indexing a document.
> Far as I know JSON broken as it is does do that.
JSON in scripting languages does "as well" as Java's java.io.ObjectInputStream/java.io.ObjectOutputStream and dotnet's System.Runtime.Serialization.XmlObjectSerializer.
aka: it will explode in your face in spectacular fashion [causing security vulnerabilities or denial of service along the way] unless you take special care to only use explicitly safe serializable objects, and you handle corner cases correctly (for example, naive deserialization of JavaScript's `__proto__`).
Converting it to some object is not converting it to usable data. Anyone who has tried to convert arbitrary JSON to Python will know the struggle of `json[0][0]["root"]`... This is also possible with XML and is similarly completely useless. At least the latter has tools around this fundamental problem.
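A minimal Python illustration of that blind-guess navigation (the document is invented):

```python
import json

doc = json.loads('{"root": [[{"root": 42}]]}')

# Parsing produced native dicts and lists, but not *meaning*: every index
# below is a guess that only out-of-band knowledge (a schema, or docs)
# can justify, and any of them can raise KeyError/IndexError at runtime.
value = doc["root"][0][0]["root"]
print(value)
```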
I used to love XSLT (I would probably still love it if I still used it, but the projects have mostly disappeared). Pattern matching is akin to event driven programming: just deal with what you have, if and when you have it. It was very clean and never broke.
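For readers who never saw it, a minimal XSLT 1.0 stylesheet in that pattern-matching style (the element names assume an input like the `<note>` example earlier in the thread):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Each template declares what to do *if* a node matches; there is
       no driver loop, the processor walks the tree for you. -->
  <xsl:template match="paragraph">
    <p><xsl:apply-templates/></p>
  </xsl:template>
  <xsl:template match="code[@vibe='positive']">
    <strong><xsl:apply-templates/></strong>
  </xsl:template>
</xsl:stylesheet>
```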
My web-technology professor at my uni (around 2007) was a big fan of XML and all the surrounding technologies. He saw it as this beautiful interconnected system where the web being pure XML would allow all data in the world to be queried like a database and transformed into any format you wanted.
Basically, he was in love with serialization and connecting services together, that part is cool. It's just a lot more diverse in serialization now than before. And a lot of these projects compete in reducing overhead - they are orders of magnitude more efficient than XML.
I wrote many thousands of lines of XSLT back in the day, converting XML into XSL-FO and pumping it through FOP. It worked remarkably well. It was my second experience with a declarative language (SQL being the first).
There were some great resources - libraries of code fragments, similar in spirit to tailwind ui - and people did crazy things with them.
We eventually abandoned it, after many years. It was indeed terribly difficult to work out how a document got transformed, and when returning to a transform after a long time working in other languages, you could easily spend a day just trying to work out how some trivial thing worked.
What XSLT really needed was a gui where you could play with the template and the input and see the output change in real time, a little like how those regex websites work. With a few colours and arrows to elucidate what bit was doing what. Not a trivial undertaking, granted. I wonder if such a thing was ever produced?
I do remember xsltproc being a game changer. It was still a CLI, but it was really fast, so at least you could turn things around quickly. And IIRC Preview on the Mac would reload a PDF automatically if it changed. So you could get pretty close to a GUI flow sometimes… as long as your document was short!
Until xsltproc, I’d been using whatever XSLT processor came with Java, which (as always) was fine on a warmed-up production server but sucked for a REPL.
But what a GUI could have helped with is working back from the output element to the node that generated it. That would have been sweet. But if memory serves, anything commercial for XSLT back then was “enterprise licensing”, which we couldn’t afford.
I think the true lesson of XSLT is that XML makes a terrible syntax for a programming language. The language itself, minus the syntax, is fine, although I think many people prefer a more imperative approach.
XSLT was also a terrible combination of two principles: imperative coding plus data pattern matching. For a beginner - which I was at the time - it was never clear which pattern to use or how they best fit together.
In a sense, though, the circle has been closed. With the latest JSON Schema spec, we were able to do a 1:1 mapping between our XSDs and JSON Schema, so now we can accept JSON as well as XML.
For small messages though I agree JSON is smoother.
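For a flavor of such a mapping (a hypothetical sketch, not the commenter's actual schemas): an XSD declaring a `note` element with a required string `paragraph` child maps roughly onto a JSON Schema like:

```json
{
  "type": "object",
  "properties": {
    "note": {
      "type": "object",
      "properties": { "paragraph": { "type": "string" } },
      "required": ["paragraph"]
    }
  },
  "required": ["note"]
}
```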
Maybe not a popular thing to say: a schema capability is needed. You want to verify input. You want a contract among partners. You want Intellisense in a code editor.
And are schema languages a complicated beast? Hell yes. Do we need them nevertheless? Yes.
I despise XML, not because it was a bad idea, but because of all the bad ways I've seen people use it.
For example, I used to work at an ed-tech company that bought bubble test questions from Pearson, who provided their data in HUGE XML documents. If I remember correctly, they would do things like splitting the sentences of the test questions up so half the question was a tag attribute and the other half was an element. So, instead of just parsing questions, we'd have to parse them and then stitch them back together to make complete sentences. They did that with the answers too, I believe. So weird.
The JSON format makes it harder to abuse data like my example above and it is a lot easier to parse, so I'd reach for it before XML any day.
That would be a really good resource. I mean, anybody who has worked remotely with web technologies has an opinion about this topic but it would be interesting to compile an objective knowledge base and interpretation of why things evolved this way.
XML was flawed due to XHTML. The key spec was XML Infoset. It removed many capabilities that XML had and focused on data transfer, not document representation. That is basically JSON, but more powerful (comments + namespaces + attributes).
Personally, I think you cannot operate an interface without a schema. There is always a contract. Within your own team you may not care, but as soon as two teams work on things in a different order, contracts are needed.
And yes, SOAP, WS-*, XSLT, etc. is where the madness starts. But to be honest, they used XML; they are not XML.
I've always had a soft spot for idref attributes, intended to support internal linking and graph structures in documents and never seriously adopted; I suspect that more important XML mechanisms exhausted the annoyance/complexity budget of implementations, until the tree-structured data applications crushed document applications as you describe.