
XML vs JSON is such a fascinating topic, a clash of different approaches; academic vs practical might be one way of describing it, but I'm sure there are many others. Are there any good books or essays on it?

And having been there in the 90s, seeing XML and thinking "they've tried to do everything and got into a big mess" (XSLT, anyone?), it's so satisfying to see something simple and flawed displace it.



Sometimes I just don't really see it as a competition. For my research work, XML is often a markup, which JSON is not. If I want to qualitatively code a text, this is perfectly doable by hand:

  <note>
    <paragraph>This is a <code vibe="positive">remarkable</code> text.
    </paragraph>
  </note>
If you do it in JSON, it is not really readable anymore, and I would have to write a GUI for the input, e.g.,

  { "note": 
    { "paragraph": 
      [ "This is a ",
        { "code": [ { "vibe": "positive" }, "remarkable" ] },
        "text." 
      ]
    }
  }


IMO, you made the JSON unreadable with all the extra whitespace that you didn't add to the XML.

    {"note": 
      {"paragraph": 
        ["This is a ", {"code": [{"vibe":"positive"}, "remarkable"]}, "text."]
      }
    }
Also, in this particular case I think the child/property dichotomy works in the favor of XML, but typically I find it to be more of a liability than an asset.


If you use arrays instead of objects, it becomes JsonML[1]:

    ["note",
        ["paragraph",
            "This is a ", ["code", {"vibe": "positive"}, "remarkable"], " text."]]
[1]: https://en.wikipedia.org/wiki/JsonML

This is also why I find Mithril.js so pleasant to use without JSX, because it's basically JsonML.


Use parens and you have S-expressions. No need for all that noise.

    (note
      (paragraph
        "This is a " (code (:vibe . "positive") "remarkable") " text."))
Lisp was invented in 1958 and we routinely rediscover it, and reimplement it badly.


Yeah, S-expressions are way more pleasant to write.

With the benefit of a lot of hindsight, S-expressions seem like a superior choice for writing web applications (instead of HTML + JavaScript + some JS framework that writes HTML again (regardless of DOM vs Virtual DOM)).

Even though I prefer dialects like Fennel for programming rather than Common Lisp (I'd probably be fine with Clojure and Janet as well, but I haven't tried them), I wouldn't mind any dialect if that means I could use S-expressions instead of HTML+JS, assuming the amount of effort put into sandboxing that approach were as much as the effort that has been put into the current approach.


I love Lisp, but the quotes, and the structure ... Markup really shines here. Sorry. Same with JsonML, even though I had never heard of it before, and will have a look because it just sounds so interesting.


How do you parse such JSON?

A "note" contains a "paragraph" object, which is an array of...? Strings, or objects where there are keys, but the values are arrays of...?
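The only answer I can see is a recursive walk that type-checks every node. A sketch, assuming the convention from the example above (a single-key dict per element, an optional leading attribute dict, mixed strings and elements in the child list; the `flatten` helper name is mine):

```python
def flatten(node):
    """Recursively extract the plain text from the XML-in-JSON convention.

    A node is either a string, a list of child nodes, or a single-key dict
    {tag: children}; an attribute dict (only plain-string values) yields
    no text and is skipped.
    """
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        return "".join(flatten(child) for child in node)
    if isinstance(node, dict):
        # Attribute dicts like {"vibe": "positive"} hold only strings,
        # so the isinstance filter drops them; element dicts wrap children.
        return "".join(flatten(v) for v in node.values()
                       if isinstance(v, (list, dict)))
    return ""

doc = {"note": {"paragraph": ["This is a ",
                              {"code": [{"vibe": "positive"}, "remarkable"]},
                              " text."]}}
```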


In many respects, the failure of XML to live up to the grandiose promises made in the 90s was due to the lack of decent tools around XML. Decent structure-aware refactoring editors didn't exist then, and barely do now. XML cries out for something like paredit, but dumbed down and tag-aware. The author shouldn't even have to see the textual representation of the tags.

But, primarily, the thing that made JSON win out is that it reflects the existing semantic/data-structure model of an actual programming language (and approximates that of many others.)

XML (and XSchema, oh god), has this mixture of associative and sequentially ordered data, and two separate kinds of associative/nesting models (elements vs attributes). It wants to eat the world with its new semantic model, but it doesn't offer any particular semantic advantages and doesn't match what programming languages (or relational databases for that matter) work with.

And then on top of that the syntax is just ugly to read.

Still, there was a moment in the 90s where I found XML kind of exciting and interesting. I don't really recall why now. It seemed everybody did.


I also remember being excited about XML in the 90s and I can recall why. At the time, every application I used had its own bespoke file format and needed a custom parser to read its output. This especially applied to several applications storing hierarchical information in CSV files with their own way of delimiting CSV data as content in a CSV cell. The switch to XML meant that I'd have to figure out the tag soup that the developer had chosen for their XML data, but that was so much more comfortable than figuring out the goulash of ASCII escape tags in the old format.

XML is like the pager. It might seem clunky today, but it's better than having to tell the baby-sitter the phone number for the restaurant and theatre you'd be visiting that night.


The interesting thing for me is that I've circled back to "CSV is good, actually" since for loosely structured source input, it has few equals - it lets you organize by cell, and you can use spreadsheet software to edit it. The balance shifts towards schema generation once you want to apply a specific model on the data, but "import CSV to my schema" is convenient as an interface to that.

XML also has a place as an interface, but it's much more niche, since most interfaces that are hierarchical-document-shaped are going to parse text instead of bracketed entities. I think we got a bit excited about how well HTML worked at the time.
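The cell-oriented reading really is trivial with a stdlib parser; a sketch (the field names are made up):

```python
import csv
import io

# Hypothetical loosely structured input, as it might come out of a
# spreadsheet export.
raw = "name,role\nada,engineer\ngrace,admiral\n"

# Each row arrives as a dict keyed by the header cells -- the
# "organize by cell" property, with no schema imposed up front.
rows = list(csv.DictReader(io.StringIO(raw)))
```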


CSV is nice for tabular data, especially if it is plain and straightforward in its nature. This covers a lot of use cases.

For an object graph, CSV is less nice.


Any graph can be represented as the set of tuples representing the edges, and in fact this is the most flexible form. It depends on whether you need it "human readable" or not, but a hierarchical/network representation of a graph is only one representation.
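As a sketch (the node names are made up), the flat edge-tuple form fits naturally into CSV or a relational table, and any hierarchical view can be recovered from it:

```python
# A small graph as a set of (parent, child) edge tuples.
edges = {
    ("note", "paragraph"),
    ("paragraph", "code"),
    ("paragraph", "text"),
}

def children(edges, node):
    """Recover one hierarchical view from the flat edge set."""
    return sorted(child for parent, child in edges if parent == node)
```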


Yeah, it is perfectly possible, I just don't think it is very nice. Neither for parsing nor for human reading.

If I need a file or network message to represent a handful of object types with a bunch of relations that form a graph, I prefer XML (or JSON).


> XML meant that I'd have to figure out the tag soup that the developer had chosen for their XML data

There was DTD for that, but it was probably rarely used.
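A DTD for the note example further up the thread might look like this (element names taken from that example, the content models assumed):

```
<!DOCTYPE note [
  <!ELEMENT note (paragraph+)>
  <!ELEMENT paragraph (#PCDATA | code)*>
  <!ELEMENT code (#PCDATA)>
  <!ATTLIST code vibe CDATA #IMPLIED>
]>
```

A validating parser can then reject any document whose tag soup strays from the declared structure.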


XML has a unique spot in the intersection of payload and human crafted file.

JSON is something I see as payload-only really. And YAML etc are human-craft only.

One thing I would change in XML is that closing a tag should not need to specify the name. <> should close any tag. It would make editing a bit easier and save a byte or two.


It's true, I can bang together an XML doc by hand with less worry than if I type out some JSON.


SGML allowed you to minimize a closing tag. And HTML supported this in theory, although mainstream browsers never supported it.

XML went for a syntax with fewer options because the complexity of SGML was considered a hindrance to its adoption.


SGML is also the reason why an HTML parser is often much more complex than an XML parser, even though modern HTML is no longer SGML but a simplified version of it.


In my humble experience:

- XML parsers were quite bad in the old days (2000), even in Java land.

- Parser speed is a function of the input size. And XML inputs are quite big, too big in my humble opinion.

- You have tag attributes and tag values, and so people get confused about how to use them even in simple scenarios.

- & must be escaped, but no one does it. So when 'AT&T' ended up in a stream created "by hand" by a COBOL procedure, the XML suddenly broke.

- < and > must be escaped, and so your SQL queries must be escaped.

JSON is simply an associative hierarchical map. A lot of 1995 PHP code used the same data structure and just works. No attributes, only a hierarchical recursive structure.

Recursion always wins.
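For what it's worth, the AT&T breakage only bites when the XML is concatenated by hand; any real serializer escapes for you. A sketch with Python's stdlib (the tag name is made up):

```python
from xml.sax.saxutils import escape

# Escaping payload text before splicing it into a tag body avoids the
# broken "AT&T" stream described above: &, <, and > become entities.
body = escape("AT&T says x < y")
record = "<company>%s</company>" % body
```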


> Parser speed is a function of the input size. And XML inputs are quite big, too big

You don't have to parse the complete document to do something about it. You can use a streaming parser, and you can execute XSLT on the fly.
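A sketch of the streaming approach with Python's stdlib `iterparse` (the feed content is made up): elements are handled as they arrive and then discarded, so memory stays flat regardless of document size.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical large feed; in practice this would be a file or socket.
feed = io.BytesIO(b"<items><item>1</item><item>2</item><item>3</item></items>")

total = 0
for event, elem in ET.iterparse(feed, events=("end",)):
    if elem.tag == "item":
        total += int(elem.text)
        elem.clear()  # drop the finished subtree so memory stays bounded
```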


This is true but in the early days XML framework libraries gained a well deserved reputation for poor performance because they would parse into a DOM first and then serialize from there. I improved the runtimes of several projects from hours to seconds by switching the serialization from DOM to SAX. The prevalence and focus on the document model for every usage of XML resulted in a reputation hit. One that XML has really struggled to overcome.


That's all true, I'd just add that in JSON you have to escape " in strings just like you have to escape & < and > in XML, so there is still the potential for e.g. COBOL software to produce invalid files.
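A sketch of the symmetry with Python's stdlib (the values are made up): the serializer escapes the quote and the backslash, while a hand-concatenated string would not.

```python
import json

# json.dumps escapes the embedded quote and backslash; hand-building the
# same string by concatenation is exactly the COBOL-style failure mode.
s = json.dumps({"quote": 'He said "hi"', "path": "C:\\new"})
back = json.loads(s)
```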


JSON:

"path\to\file"

XML:

<a>path\to\file</a>


Around the time XML was released, I asked a (smarter than me) friend what the hubbub was about. He got a pained look on his face and said: people think it'll solve a problem, and it won't.

His reasoning was that the problem isn't standard file formats; it's turning those into something a program can operate on as data[1]. And XML does not do that for you. No, you have to suck it in and then manually transform it into 'data'. So it just moves the problem from one place to another, which is usually a trash solution.

As far as I know, JSON, broken as it is, does do that. At least with dynamic(-ish) languages.

[1] gross stuff people used to do like copy data structures in memory directly to a file actually solves that problem while being a maintenance nightmare.


I consider this a feature. XML specifies a data exchange format. How this is loaded into memory depends on your use case, language etc. You can use a DOM if this is appropriate or serialize into custom or native data structures. A browser rendering a web page will use different data structures than a crawler indexing a document.


The problem is the schema and validation is the hard part and XML doesn't help with that at all.


XML contains a schema/validation language, the DTD.

The hard part is getting independent parties to agree on a common data exchange format. Processing the data in code is the easy part.


> Far as I know JSON broken as it is does do that.

JSON in scripting languages does "as well" as Java's java.io.ObjectInputStream/java.io.ObjectOutputStream and .NET's System.Runtime.Serialization.XmlObjectSerializer.

aka: it will explode in your face in spectacular fashion [causing security vulnerabilities or denial of service along the way] unless you take special care to only use explicitly safe serializable objects, and you handle corner cases correctly (for example, naive deserialization of JavaScript's `__proto__`).


Converting it to some object is not converting it to usable data. Anyone who has tried to convert arbitrary JSON to Python will know the struggle of `json[0][0]["root"]…`. This is also possible with XML and is similarly completely useless. At least the latter has tools around this fundamental problem.


> XSLT anyone?

I used to love XSLT (I would probably still love it if I still used it, but the projects have mostly disappeared). Pattern matching is akin to event driven programming: just deal with what you have, if and when you have it. It was very clean and never broke.
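A minimal example of that pattern-matching style (element names assumed, not from any real project): the template fires for every `<code>` element wherever it occurs, and you never write the traversal yourself.

```
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Fire whenever a <code> element shows up, at any depth. -->
  <xsl:template match="code">
    <em><xsl:apply-templates/></em>
  </xsl:template>
</xsl:stylesheet>
```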


My web-technology professor at my uni (around 2007) was a big fan of XML and all the surrounding technologies. He saw it as this beautiful interconnected system where the web being pure XML would allow all data in the world to be queried like a database and transformed into any format you wanted.

The web sure didn't go that way...


Basically, he was in love with serialization and connecting services together, that part is cool. It's just a lot more diverse in serialization now than before. And a lot of these projects compete in reducing overhead - they are orders of magnitude more efficient than XML.


I think it was called semantic web or Web 3.0.


And then history repeats and we call this now blockchain and the mess around it ;)


That's "XML, the religion".


XSLT is great, the problem is that the browsers stopped at XSLT 1.

XSLT 3 is a different beast, and XSLT 4 is being worked on at the moment.


I wrote many thousands of lines of XSLT back in the day, converting XML into XSL:FO and pumping it through FOP. It worked remarkably well. It was my second experience with a declarative language (SQL being the first).

There were some great resources - libraries of code fragments, similar in spirit to tailwind ui - and people did crazy things with them.

We eventually abandoned it, after many years. It was indeed terribly difficult to work out how a document got transformed, and when returning to a transform after a long time working in other languages, you could easily spend a day just trying to work out how some trivial thing worked.

I remember it fondly, but I wouldn’t do it again.


What XSLT really needed was a gui where you could play with the template and the input and see the output change in real time, a little like how those regex websites work. With a few colours and arrows to elucidate what bit was doing what. Not a trivial undertaking, granted. I wonder if such a thing was ever produced?


I never saw one.

I do remember xsltproc being a game changer. It was still a CLI, but it was really fast, so at least you could turn things around quickly. And IIRC Preview on the Mac would reload a PDF automatically if it changed, so you could get pretty close to a gui flow sometimes… as long as your document was short!

Until xsltproc, I’d been using whatever XSLT processor that came with Java, which (as always) was fine on a warmed up production server but sucked for REPL.

But what a gui could have helped with is working back from the output element to the node that generated it. That would have been sweet. But if memory serves anything commercial for XSLT back then was “enterprise licensing”, which we couldn’t afford.


https://xsltfiddle.liberty-development.net/

I use this if I want to play around with XSLT.


I think the true lesson of XSLT is that XML makes a terrible syntax for a programming language. The language itself, minus the syntax, is fine, although I think many people prefer a more imperative approach.


XSLT was also an awkward combination of two principles: imperative coding and declarative pattern matching. As a novice at the time, it was never clear to me which approach to use and how they best fit together.


> XML vs JSON

Fun fact: Tim Bray was the/an editor on both specs:

* https://www.rfc-editor.org/rfc/rfc8259

* https://www.w3.org/TR/xml/

I wonder if he'll commemorate/note the anniversary in some way:

* https://www.tbray.org/ongoing/


In a sense, though, the circle has been closed. With the latest JSON Schema spec, we were able to do a 1:1 mapping between our XSDs and JSON Schema, so now we can accept JSON as well as XML.

For small messages though I agree JSON is smoother.
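A rough sketch of what one JSON Schema produced by such a mapping might look like for the note example further up the thread (field names assumed from that example, not from any real XSD):

```
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "note": {
      "type": "object",
      "properties": {
        "paragraph": { "type": "array" }
      },
      "required": ["paragraph"]
    }
  },
  "required": ["note"]
}
```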


Whenever I hear people talk about JSON schema spec it reminds me of XML in the 90s.


Maybe not a popular thing to say: a schema capability is needed. You want to verify input. You want a contract among partners. You want to IntelliSense something in a code editor.

Are schema languages a complicated beast? Hell yes. Do we need them nevertheless? Yes.


I despise XML, not because it was a bad idea, but because of all the bad ways I've seen people use it.

For example, I used to work at an ed-tech company that bought bubble test questions from Pearson, who provided their data in HUGE XML documents. If I remember correctly, they would do things like splitting the sentences of the test questions up so half the question was a tag attribute and the other half was an element. So, instead of just parsing questions, we'd have to parse them and then stitch them back together to make complete sentences. They did that with the answers too, I believe. So weird.

The JSON format makes it harder to abuse data like in my example above, and it is a lot easier to parse, so I'd reach for it before XML any day.


Sounds like something that XSLT could have solved. When in doubt use more XML.


> Are there any good books or essays on it?

That would be a really good resource. I mean, anybody who has worked remotely with web technologies has an opinion about this topic but it would be interesting to compile an objective knowledge base and interpretation of why things evolved this way.


XML seems like the apex of "design by committee" (and I mean, didn't we have some thick books about it? What for?)

Nobody cares about 99% of that


I am happy we are not using SGML, the base for XML, which is MUCH MORE complex and is also a standard...


XML itself is a pretty focused spec. XML schemas and the whole SOAP stack is where it went off the rails.


XML was flawed due to XHTML. The key spec was XML Infoset. It removed many capabilities that XML had and focused on data transfer, not document representation. That is basically JSON but more powerful (comments + namespaces + attributes).

Personally, I think you cannot operate an interface without a schema. There is always a contract. Within your own team you may not care, but as soon as two teams order things differently, contracts are needed.

And yes, SOAP, WS-*, XSLT, etc. is where the madness starts. But to be also honest, they used XML, they are not XML.


I've always had a soft spot for idref attributes, intended to support internal linking and graph structures in documents but never seriously adopted. I suspect that more important XML mechanisms exhausted the annoyance/complexity budget of implementations, until the tree-structured data applications crushed the document applications as you describe.


Graphs are always hard. I would love to see graph databases become more common, but no.



