The XML spec is 25 years old today (w3.org)
134 points by klez on Feb 10, 2023 | 193 comments


XML is nowhere near "hip" these days, but I think it is underrated. Yes, a lot of bad things have been done with XML, but the technology itself is just fine and sometimes hated for the wrong reasons.

XML is considered overly verbose, which is not always the fault of XML, but of how it is used.

    <dependency>
      <groupId>org.junit.jupiter</groupId>
      <artifactId>junit-jupiter-engine</artifactId>
      <version>5.9.2</version>
    </dependency>
Could (in principle) be written as:

    <dependency groupId="org.junit.jupiter" artifactId="junit-jupiter-engine" version="5.9.2" />
What really stands out is XML Schema. It is so much more powerful and precise compared to Json Schema.

See also: http://www.nichesoftware.co.nz/2017/04/25/in-defense-of-xml....


The problem is that XML is a markup language, which is entirely incompatible with the way data is usually structured in programs.

Your example is simple enough, but consider the following:

    <root>
        This is some text.
        <type1 attributes="yes">
            This is content with <i>markup</i>.
        </type1>
        <type1>
            Another element of the same type.
        </type1>
        <type2>
            And this is an element of an entirely different type.
        </type2>
        This is more text.
    </root>
How the hell can I map that to my data model? My application data will, at worst, be a graph of class instances with named fields, containers (lists, sets, etc), and primitives (strings, ints, etc).

The fact that XML has both attributes and subelements is already a huge impedance mismatch because classes have a single namespace. But what do I do with the repeated element? With the text interspersed with tags? What's the API to navigate this mess?

I'm pretty sure that's why JSON won: it started from the application data model and just simplified it (no custom types, trees only). A deserialized JSON object doesn't even need an API, it's just a nested structure of built-in types found in any modern programming language.
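
To make that concrete, here is a rough sketch in Java with Jackson (one library among many; the class name and JSON string are made up for illustration):

    import java.util.Map;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonIsJustMaps {
        public static void main(String[] args) throws Exception {
            // One call, no schema: you land in plain built-in containers
            // (maps, lists, strings, numbers, booleans, null).
            Map<?, ?> doc = new ObjectMapper().readValue(
                    "{\"groupId\":\"org.junit.jupiter\",\"version\":\"5.9.2\"}", Map.class);
            System.out.println(doc.get("version")); // prints 5.9.2
        }
    }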


This is a made-up problem that never existed in reality. In practice, if you have to deal with a mix of text and elements, it is always a deliberate and explicit choice which probably makes sense. In my decades-long career I have never seen it, so it must be a rare case.

You ask about the API: the XML mapping problem was solved more than 20 years ago, and standard cross-platform APIs have existed since then. Have you ever heard of DOM? Have you ever heard of data mappers? XML Schema-based code generation? In modern Java you can even use the same mapping both for JSON and XML.
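
For example, a rough sketch with Jackson and its XML module (one way to do it, not the only one; the class here is made up): the same plain class binds to both representations.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.xml.XmlMapper;

    public class Dependency {
        public String groupId;
        public String version;

        public static void main(String[] args) throws Exception {
            // One mapping class, two wire formats.
            Dependency fromJson = new ObjectMapper().readValue(
                    "{\"groupId\":\"org.junit.jupiter\",\"version\":\"5.9.2\"}", Dependency.class);
            Dependency fromXml = new XmlMapper().readValue(
                    "<dependency><groupId>org.junit.jupiter</groupId><version>5.9.2</version></dependency>",
                    Dependency.class);
            System.out.println(fromJson.version.equals(fromXml.version)); // true
        }
    }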

There are two reasons why JSON dominates in certain applications: it is easier to write a small structure, and it is browser-friendly. XML still has advantages in large datasets and schema validation.


Disagree.

JSON can be deserialized to hashmaps / lists with a single call that has no schema or parsing instructions. The spec unambiguously deserializes a "data document" to a data structure.

XML cannot do this:

- possibility of text being just ... somewhere ... it wasn't intended.

- tag data vs attribute data: ambiguous

- a list CAN be represented in XML, but there is no array begin/end indicator, like JSON has, that will confirm that the tags are a list versus some other screwy thing

- finally, JSON can represent a "data document" that is just an array/list, while XML cannot do this unambiguously. It has to have a root tag.

As for schema validation, that was always a hack. A schema will ultimately encounter domain-dependent constraints, like "is this a valid ID in our database?", that a schema rule can't encode.

Because JSON is so readily deserialized to, at a minimum, hashmaps and lists, it can be readily validated with code. Virtually all JSON deserialization libraries also provide the ability (in typed languages) to provide a desired type to deserialize into, and the type information of the member variables, even lists/maps, is decorated such that those can be deserialized as well.

In that case, the JSON deserialization-to-type provides an automatic schema validator with no schema rules to maintain alongside the class specification.

If you REALLY like XML because it kind of self-documents a bit better or structures better or ... well I can't really think of a reason why to use XML, but anyway I recommend YAML as a modern replacement.

God I wish XPath was ported to json/yaml/toml land though. That was the bee's knees, as my grandma used to say.


Isn't that element vs attribute dichotomy one of the main drawbacks of XML?

You don't have that with other formats.


It's usually forgotten that XML is intended as a markup language: text is the main content, text is contained in elements, and elements might have properties (i.e. attributes). Elements with no text children are a special case of mixed content, not the norm.


By the time the XML specification was standardized from the more general SGML, I don't think there was any idea of using it only as a markup language. The XML specification lists support for a wide variety of applications as one of the design goals. Even schema validation, which long preceded XML, is something more commonly used for data not text. By this point XML is so rarely used as a markup language (by comparison with its usage as a data format) that the "Markup Language" part of the title is almost a misnomer.


It’s ambiguous only until you define a design convention for your schema. The rule can be as simple as using attributes only for simple non-user-generated values (e.g. an order, postal code, birth date, or name is an element; a UUID or timestamp is an attribute).

And clients just map whatever is defined in the schema.


Somehow this is only mentioned as a drawback of XML, not of HTML --- usually.


Because HTML is a markup language. XML is also a markup language but people somehow insist on using it for things where a markup language is not appropriate (which is almost everything).


XML is rarely used as a markup language, in fact it is almost a failure as a markup language, witness the non-use of XHTML. It is a markup language that turned out to have more utility as a general purpose data format, and it certainly didn't get that way by people using it for applications it was not intended to support.


Because in HTML it's clear what's an attribute and what's content

In the case above, not so much (though I'd say it's the artifactId)

But since XMLers like wallowing in verbosity and bureaucracy, it seems they shot themselves in the foot with it


It's been so long since I was part of a debate cycle about whether a piece of data was an 'attribute' or an 'element' in XML. I had forgotten how much angst was eliminated when JSON came along like some kind of uncivilized philistine and just didn't have attributes.


In JSON everything that is not a map or an array is effectively an attribute. It is perfectly possible to use the same logic when defining an XML schema.


Yes! The Caucho Resin server https://caucho.com/ used / uses RelaxNG as its schema, and, amazingly flexibly, allowed for either attribute- or element-based configuration, the latter being much more pleasing to the eye in many cases. I don't think I've seen such flexibility anywhere else.


oops! I mean "the former" -- attribute-based configuration being much more pleasing to the eye


> What really stands out is XML Schema. It is so much more powerful and precise compared to Json Schema.

As someone who was involved in making XML-based international standards (HR-XML), I think what made JSON great was that it was so simple that there was initially no industry to be made out of making schemas. XML had so many ways to do things that a schema was needed. JSON? Not so much.


A lot of the bad rep for XML stems from the decades when it was "when all you've got is a hammer, everything looks like a nail".

XML shall be used for Good, not Evil.


The saying I remember from back then is

"XML is like violence: if it doesn’t solve your problem, you aren’t using enough of it."


See also: WS-Deathstar


XML gets used in a lot of places it shouldn't be used. It also has two competing standards for parsing, one of which is far too heavyweight for many of the use cases it has been used in.

9 times out of 10, for data transport you want SAX parsing, not DOM. Unless your XML is actually a marked-up document, DOM is overkill. But many data transport libs used to use DOM, which has a very real performance cost. I'm not sure how much of that is still true, but the reputation for bloat and being heavyweight was often caused by parsing to DOM and then deserializing from there.
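
For anyone who hasn't done it: streaming parsing really isn't much code. A rough sketch with the JDK's pull parser (StAX here rather than classic SAX, same idea; the element names are made up):

    import java.io.Reader;
    import java.io.StringReader;
    import javax.xml.stream.*;

    public class StreamingExample {
        // Walk the events and pick out what you need; no tree is ever built in memory.
        static void printDeps(Reader xml) throws XMLStreamException {
            XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(xml);
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT && "dep".equals(r.getLocalName())) {
                    String version = r.getAttributeValue(null, "version"); // attribute lookup by name
                    System.out.println(r.getElementText() + " " + version); // text content up to </dep>
                }
            }
        }

        public static void main(String[] args) throws Exception {
            printDeps(new StringReader("<deps><dep version=\"5.9.2\">junit-jupiter-engine</dep></deps>"));
        }
    }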


Lots and lots of applications make the exact same mistakes with Json.

Read the entire document into memory and let Jackson parse it out into a heavyweight object graph. Blech.


Well, yes. But the difference between a JSON object graph and an XML Document Object Model is night and day in terms of size. Not even close to comparable.


> XML gets used in a lot of places it shouldn't be used. It also has two competing standards for parsing, one of which is far too heavyweight for many of the use cases it has been used in.

That's not the standard's fault. DOM and SAX (or rather streaming parsing in general) both have their places, it's up to the developer to know when to use which.

Of course, developers often make bad choices, but the only way a standard can "solve" that is by forcing a developer to make bad choices in some contexts, so it's not the developer's fault anymore.


It's not the standard's fault per se, but that doesn't change the fact that the standard now has a reputation, whether it wanted it or not.


I have some experience with libxml2. I would say DOM parsing is partly inefficient due to the many small allocations, and the library does not offer fine-grained control over them. The allocator is globally replaceable, but that's not very flexible.

I have wished many times that I could just use an arena with a dumb allocator for a small document and just drop it when I'm finished with the document, but that's not very easy to do with libxml2, unless you write your own DOM parser on top of SAX.


I’ve made good use of PugiXML: https://pugixml.org/. Faster than most SAX parsers, with the simpler interface of DOM. It gets around the small-allocation problem by making a mutable copy of the entire input, then modifying it in place, and only allocating (using arenas) when absolutely necessary. It's C++ though, so not as easy to use from other languages as libxml2 is.


It looks interesting, but it seems to have no or only limited XML namespace support, which is a shame.


Probably true, but DOM parsing is usually not necessary and streaming parsing is not that hard, so why not just skip the DOM and go straight to streaming instead?


DOM parsing is not necessary until it is. Sometimes you can get away with parsing the stream twice and still beat DOM parsing, but not always.

Stream parsing is indeed preferable when you can get away with it.


> It is so much more powerful and precise compared to Json Schema.

Maybe so, but the continuing popularity of JSON, YAML, TOML, and recent competitors shows that developers (in general) prefer something that is both easily readable and editable by hand.

Personal opinion of course, but I wouldn't call XML precise, I would call it the opposite because it's too flexible. The older I get, the more I think programming languages/markup languages/config languages should be opinionated and strict.


The fact that you can use it in two ways, one of which is bad, is not really a good thing.


"Welcome to Niche Software, a MicroISV based in Wellington, New Zealand. "

Ahh, "MicroISV". That is a term I don't hear enough these days!(Greetings from a UK based micro MicroISV)


JSX is standard today?


It's hard to imagine nowadays how much hype there was around XML. It was the buzzword of the day for a while. It felt as pervasive as "AI" or "blockchain" have been recently. There were many conferences on the subject, and every acronym had an X for XML in there (cf. AJAX). All this for a markup language (ok, and a schema/query/transform language). It seems so weird now.


The underlying idea was that this self describing format would enable easy interoperability. It wasn’t 100% wrong since before standards like this you had a million ugly bespoke formats and fragile CSV formats.

In the end it was JSON plus API description and generation tooling that delivered, though XML lives on under the hood in lots of places. It’s usually used for cases closer to markup like word processor documents or for very complex configuration files.

Hype is just something you have to get used to in this industry. It’s really bad. Any new thing usually gives birth to a wave of breathless unhinged hype whose main purpose is to get corporate purchasers and VCs to cough up money.


I'd argue that (JSON plus API description and generation tooling) didn't really deliver everything.

XML has schemas and validation as part of the main standard. JSON still doesn't have that. There are some things tacked on that 90% of people don't use.

XML has comments, JSON doesn't (!).

JSON is okish and it's easy to pick up. As Visual Basic, Javascript, PHP, Ruby, Python, Golang have proved, convenience at the start beats almost everything else.


I am not under the impression that JSON has replaced XML as a data interchange format for general usage at all. It is useful for APIs, problematic for configuration files, and relatively uncommon as a file format. I don't think there is any risk of SVG being supplanted by a JSON equivalent for example, nor most business data interchange formats either.


Yes! XML was a major feature / selling point of so many industry tools and services. What does it do? No idea, but it uses XML to do it so take my money. So many businesses just screamed about XML. It was honestly kind of a bizarre furor in the early aughts.


> Yes! Cryptocurrencies were a major feature / selling point of so many industry tools and services. What does it do? No idea, but it uses cryptocurrencies to do it so take my money. So many businesses just screamed about cryptocurrencies. It was honestly kind of a bizarre furor in the early 2020s.


> Yes! Rust was a major feature / selling point of so many industry tools and services. What does it do? No idea, but it uses Rust to do it so take my money. So many businesses just screamed about Rust. It was honestly kind of a bizarre furor in the early 2020s.


Those times were all about XML and application servers like ATG Dynamo, Coldfusion, BEA, JBoss, and WebSphere, with very delicate installation procedures that probably also involved XML.


That is crazy to think about. I wonder what this sentence will be about in another 25 years.


"AI", of course.


I remember people saying stuff like "if you're not using XML everywhere then you haven't understood XML".


As you work through your XML implementation, you'll have many questions, and the answer to every one of them is more XML.


XML was supposed to be a silver bullet for the myriad of ad-hoc formats for structuring documents that existed in the wild pre-2000 (I got this information contested here, but that's what an old professor told me).

In the XML "stack" you can find the XSL to ensure the validity of an XML document, the XPath to be a querying language, the XSLT to be a language for writing transformations, etc. I think one must at least respect XML stack vision of coherence and completeness.

Then you have things like JSON where the whole wheel is being reinvented, in my opinion, for no clear purpose. You have for instance hundreds of libraries for creating a JSON schema and everything looks like a waste of time overall.

Before anyone asks me, yes I have used JSON and YAML, and I never could really understand why people keep recreating basic stuff like markup languages instead of sticking to whatever is available. And yes, verbosity because of the XML tag system always sounded like a pretty silly argument to me.


The reason is pretty clear: XML and its ecosystem was (is) enormously complex. When naive - but correct - parsing introduces serious security vulnerabilities [1], you kinda have a problem as a format. XML is just incredibly hard to get correct, understand, and make performant. JSON from the get-go was very easy to understand and start with, so it gained the upper hand.

[1] https://owasp.org/www-community/vulnerabilities/XML_External...


I wouldn't consider that a parsing vulnerability - XML is relatively easy to parse - but rather an XML processing vulnerability, as in you must disable certain features (like arbitrary filesystem access) when processing an untrusted document.

With hindsight, an XML processing library should require that you turn features like that on, rather than require that you turn them off. Opt in to dangerous features rather than opt out of them in other words.
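
For reference, this is roughly what the opt-out dance looks like with the JDK's DOM parser today (a sketch, not exhaustive; the feature URIs are the standard SAX/Xerces ones):

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;

    public class HardenedParser {
        static DocumentBuilder newHardenedBuilder() throws ParserConfigurationException {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Refuse DTDs entirely, which shuts off entity expansion and XXE in one go.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // Belt and braces: no external entities, no XInclude, no entity expansion.
            dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
            dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            dbf.setXIncludeAware(false);
            dbf.setExpandEntityReferences(false);
            return dbf.newDocumentBuilder();
        }
    }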


> for no clear purpose

performance may have been one driver, though for the "API web" which is the main use case there are better performing options than json

> everything looks like a waste of time overall

maybe all that matters in the end is to be gainfully employed. having multiple formats and standards helps with that. when the Euro got introduced in Europe thousands of FX traders busy trading national currencies had to switch to trading real estate.


>maybe all that matters in the end is to be gainfully employed. having multiple formats and standards helps with that. when the Euro got introduced in Europe thousands of FX traders busy trading national currencies had to switch to trading real estate.

Agreed.


XML was a bad fit for almost everything people did with it. The natural data model of XML is alien to all programming languages people use: it is alien to scripting languages because you want hashmaps and lists and it is alien to static languages because you want structs and lists. So you usually ended up having to have a two stage parser, with the first stage turning XML into some kind of DOM and then your handrolled DOM-to-internal-representation thing. That's also most of the justification for XML's ecosystem:

- You need XSD to make sure that your DOM-to-internal-representation code doesn't have to deal with garbage

- You need XPath to query XML directly if you don't want to convert it to something usable

- You need XSLT as a scripting language where the DOM isn't alien and awkward to manipulate

This is why there is no XSLT equivalent for JSON (although you could argue that jq is that) and the various JSON schema solutions don't see much adoption, despite JSON being almost as old as XML at this point.


Honestly, I don't agree with you: the abstract concept behind XML is just to provide a metalanguage for the representation of hierarchical data, and that's exactly what JSON and YAML do. They are completely interchangeable. XML is the general case for hierarchical data and provides tools for this general case. JSON got closer to the representation of list and hashmaps, but I see this as a mere convenience.

>This is why there is no XSLT equivalent for JSON (although you could argue that jq is that) and the various JSON schema solutions don't see much adoption, despite JSON being almost as old as XML at this point.

I think it's incorrect to start from the assumption that nobody's gonna need a schema language, a query language, a mapping language or whatever. For the json schema this is blatantly untrue and you see many people creating their own version of a json schema because it's really needed. With XSLT you have a language for processing XML that's written in XML itself, which I find pretty nice. We're not gonna see a YAML processing language written in YAML any time soon though.


> the abstract concept behind XML is just to provide a metalanguage for the representation of hierarchical data

What you are describing here is s-expressions; XML represents a far more specific type of data structure: a tree of nodes where each node is either a text node or a tag node. Text nodes are always terminal, tag nodes can either be terminal or not. Tag nodes also have a list of attributes, while text nodes do not. Each attribute is a key value pair where both the key and the value are text, but the key can only contain certain characters, whereas the value is unrestricted.
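
Spelled out as types, it is roughly this (a sketch, names made up), which is worth comparing with "a map or a list of built-ins" on the JSON side:

    import java.util.List;
    import java.util.Map;

    // A sketch of the XML data model as Java 17 sealed types.
    sealed interface Node permits Text, Element {}
    record Text(String content) implements Node {}          // always a leaf
    record Element(String tag,
                   Map<String, String> attributes,          // keys restricted, values are plain text
                   List<Node> children) implements Node {}  // mixed text and element children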

This is all fairly complicated and it doesn't really map well to almost anything.

Even the description is longwinded. You say:

> JSON got closer to the representation of list and hashmaps, but I see this as a mere convenience

I say that calling it "mere" convenience is underselling it. The data model of XML is of so little use that no programming language, besides XSLT, offers it as part of the language.

Vice versa, hashmaps and lists are nearly universal.

> I think it's incorrect to start from the assumption that nobody's gonna need a schema language, a query language, a mapping language or whatever

I didn't say that. I said that with XML the need for those tools is especially pressing because XML is such a poor fit for any programming language.

> you see many people creating their own version of a json schema because it's really needed

Sure, but you also see very low adoption for json schema, because it isn't needed as much as xml schema was.


So object graphs are rare in your opinion?


> the abstract concept behind XML is just to provide a metalanguage for the representation of hierarchical data

No it isn't, for the umpteenth time. It is a language for marking up text. Markup languages deal with grammars - data serialization, not data models - and if anything, having regular content models is a more characteristic property of markup languages than hierarchy is.


I fail to see why any typical use of XML _needs_ XSD, XPath, or XSLT. You certainly don't need XSD validation. It is trivial with stream processing to detect, skip over, and report on elements that shouldn't be there, and every general-purpose XML alternative has exactly the same issue, if not a worse one, through the use of eval-style import of objects of questionable provenance.

Then you have no idea what your object structure now imported natively in your language contains, and you would have to post process it for vulnerabilities using introspection to find out.

In my view there just is no substitute for careful analysis of data submitted by any third party, and parsing an object into a tree structure and validating it after the fact is a questionable practice with untrusted data from anywhere. It may consume an arbitrary amount of resources before you have any idea whether the data in question bears even a passing resemblance to what it is supposed to be. If you have untrusted data you really shouldn't be using an unrestricted DOM constructing parser or eval style processing at all.


JSON is not a reinvention. They're two different things. XML is a markup language. It's for markup, so: do you need markup? Then don't use JSON. You don't need markup? Then don't use XML.


Markup is syntax that gives structure to text. Ironically there's a port of HTML to JSON syntax. Guess who did it.


> that's what an old professor told me

Well then your prof has no idea what (s)he's talking about I'm afraid. XML is just what the spec says on page 1 sentence 1 [1]:

> The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.

Where "generic SGML" refers to the property of XML being always fully tagged and not needing markup declarations, but still able to make use of those for validation and limited forms of entity expansion. Full SGML OTOH uses markup declarations for tag inference/omission (as in HTML), attribute short form canonicalization (also as in HTML), Wiki syntaxes (as in markdown), stylesheets (as in CSS), and other features such as type safe/injection free templating, processing pipelines/document "views" and as general extension mechanism.

[1]: https://www.w3.org/TR/REC-xml/


Completely agree! But I think you meant "XSD" here:

> you can find the XSL to ensure the validity of an XML document


Thanks!


Because worse is better (which I mean in the Unix sense of the phrase).

On the specific subject of schema, I think most of that is that most users have no need or interest in a schema.


> most users have no need or interest in a schema

that is a strange statement. users must reach out for a schema practically as soon as some data is exchanged

the only users that don't need a schema are the ones operating within a controlled environment where everything has been somehow agreed or imposed (e.g. within a walled garden or when you control both sides of the exchange)

missing, incomplete, low-information schemas about what data is supposed to mean and how it can be used correctly etc is a major, major waste of time. in data science circles the joke is that 90% of the work is "data cleaning". what do you think data cleaning is about?


I think that the point there is that most users (and especially for RPC-ish applications, which is what most of the existing XSD documents are about) do not need a formal schema but the schema is only implied by the implementation. The issue there is that even if you have detailed formal schema and generate the interface code from that, there is still some ad-hoc contract between the generated interface code and rest of this application, and details of this ad-hoc contract will invariably leak through the interface code.

Designing formats for long term storage of some kind of data is another thing and there formal schemas make sense. But that is not something that most users (meaning developers) do frequently or often even ever.


Even when you control both sides, code can still create unintended data due to bugs.


But "should care about" and "actually care about" are different things.


not to mention that future versions of yourself would think that past versions surely had mental bugs :-)


>On the specific subject of schema, I think most of that is that most users have no need or interest in a schema.

You have no idea, my friend. I have come across the situation many times where people needed to have JSON conformant with a schema, and they just resorted to the ad-hoc, badly written json-schema library du jour.


JSON is not a markup language, it is a serialization format.


A bad one at that. Not the worst (which would be yaml), but certainly not great.


I actually didn't realise the spec was so young. I started using the internet around the same time as it was finalised. It's something you just assume has always been there.

XML has its issues and personally I am always happy when I am dealing with an API or data source and see JSON rather than XML. However, there are a lot of cool things built on top of it. RSS is an obvious one that pretty much everyone here will appreciate. GPX is another really useful one. A lot of financial reporting is now also being standardised using XML. Even derivatives trading: https://www.fpml.org/


I had a similar feeling. I think it's because XML's relatives HTML and SGML (GML anyone?) go so much further back.


HTML was invented in 1993, which is not that much further back. GML was invented in 1969, and SGML in 1986. I guess you could say that HTML resembles GML quite a bit with the addition of angle brackets and urls. XML on the other hand is basically SGML with a number of optional features left out.


It's a quarter of a century already... Enough of an empirical observation window to conclude that the way different entities choose to implement or adopt online information exchange is a tortured, highly inefficient process that is ultimately just a collection of historical accidents; technological aspects or merit are secondary.

On the face of it, it is strange that there is so much friction [0]. I will just enumerate a number of bizarre outcomes that I am aware of, and I am sure there are others:

* the xml web still goes strong both as-is and in large number of popular or niche-but-important domains (from html, svg to xbrl, sdmx, ...)

* the semantic web (rdf, owl etc.) and making some sense of all the data exchange with metadata has gone nowhere

* the API web (json) is reinventing the wheel for validation, annotation (schema etc)

[0] xkcd jokes aside, people do have, in principle, strong incentives to adopt common languages and standards.


XML basically came out of SGML which had its origins in GML, which dates back to the 1960s. With XML they sort of dropped all the hard/difficult bits and used it as a simpler serialization format for tree shaped stuff. That happened after people got annoyed with another vaguely SGML inspired thing called HTML which of course was a bit lacking in specification and uniformity at the time. Which is why this whole thing came out of w3.org.

So XML was specified and then right along came XHTML 1.0 and later 1.1, which was just whatever HTML 4 was supposed to be (Netscape, Opera, Microsoft, etc. had some conflicting points of view on this) but in well formed XML shape.

Of course XHTML became a bit of a problem when it turned out that browsers had to deal with mostly not so well formed HTML coming from servers. People were just slapping whatever header on top of their html to trick browsers into various parsing modes. Eventually, the whole notion was dropped with HTML 5 just standardizing all the not so well formed bits in a way that browsers could make sense of. The WhatWG was responsible for that one. When it became clear that was the only HTML/CSS interpretation that still mattered, it was grandfathered into the w3.org and since then people have stopped talking about xhtml as a thing to pursue.

Around the same time the w3c lost the plot a little bit and went off and built a whole sandcastle of standards on top of XML that mostly did not get a whole lot of traction. Capital S Semantic Web vs. lower case semantic web became a thing. One was a new web reimagined on things like SOAP, XHTML, RDF, and a whole lot of other things. The other was web 2.0 featuring things like RSS (almost but not quite well formed xml, depending on where you got it), micro formats, hashtags, and a whole lot of people doing rather hacky things with wordpress, php, and related things. Then AJAX became a thing when MS added XMLHttpRequest. Soon, people figured out that this stuff just got a whole lot easier if you sent JSON over that instead of XML. Much easier to parse from Javascript for example.


> thing called HTML which of course was a bit lacking in specification and uniformity at the time.

HTML was actually pretty well specified except for one crucial omission: How a user agent should treat invalid HTML. XML simply declared that a user agent should stop processing and report an error - something you can get away with when you don't need to worry about backwards compatibility. Unfortunately this left XHTML dead in the water.


XML shines at the intersection of structured documents and structured data. For pure data you can use JSON or s-expressions, but this does not work very well with more document-oriented data. There is a reason frontend frameworks use HTML templates or JSX even though js supports JSON natively.


Why do people advocate for JSON so much? It's simple, I can see that, but it also got most of its stuff wrong. If I had the liberty to choose a format in which I want to receive data, JSON would be very far down the list.

These are just the problems that actually bit me in my professional use of JSON, not some theoretical possibilities:

* No bounds on how big integers may be, and consequently an ambiguity about what to do if the number is bigger than what would be considered a typical integer size for the platform: should it be interpreted as a float or as an integer? In my case, I had ids coming from a database that used 128 bits to represent them, and the client side silently converted them to floats but, on the way back, converted them to integers again, which resulted in rare collisions and lost data. (See the sketch after this list.)

* No defined semantics for repeated keys in hash-tables. I used a library that tried to stream JSON in a SAX-like way, but the authors didn't realize that duplicate keys are a possibility. Again, this caused subtle and hard to identify data integrity issues.

* Unnecessary null value. For some languages null and false are the same thing (and I prefer it that way too), but converting data back and forth between JSON and language-native data would sometimes produce inconsistent results, and, again, caused a once-a-year data integrity issue.

* Mandatory Unicode string encoding. Caused problems with DICOMs from East Asia. DICOMs have some special encodings that can have characters not representable in Unicode. You can guess that I've learned it the hard way too. And there isn't a way to transfer raw data (of course, you can Base64 encode it etc. but that's not a function of the format, the parties need to agree on format extension in order to understand that).
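
To make the first bullet concrete, a sketch (the id is made up; the point is only that it sits just past 2^53):

    public class JsonIntDemo {
        public static void main(String[] args) {
            long id = 9_007_199_254_740_993L;          // 2^53 + 1, perfectly representable as a 64-bit integer
            double asDouble = (double) id;             // what a consumer that reads JSON numbers as doubles sees
            System.out.println((long) asDouble == id); // false: it silently rounded to 2^53
        }
    }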


> For some languages null and false are the same thing (and I prefer it that way too)

Ugh, it's hard to pin the blame on JSON for this one. Seriously?

My other gripe is languages/libraries that can't distinguish between missing key and present key with null value. It's more subtle, but these are also different.


There's absolutely no need for two nulls / nils. But the story of JSON is that it was created by extracting part of JavaScript... so, that part was kept because it was part of JavaScript, which has never been known for making good language choices. So, bad choices made once just stuck with this format. I don't think the author gave it much thought really.


It is an implementation detail in C that 0, false, and NULL have the same representation. Other languages are different, e.g. SQL.


XML does not support non-unicode characters either. For an interchange format this is a good thing, not a drawback.


Isn't your complaint about integers the reverse of your complaint about nulls? I think it's reasonable to want better semantics for precisely specifying common types (I'd add dates to that list), but it seems like in both cases what you want is a way to say exactly what your source and destination types should be rather than having silent lossy conversion.


No. Two different nulls are unnecessary. Integers are just poorly defined (but are actually quite handy to have; I'm not against integers). I wouldn't care if there was only one type of integer, or if there were hundreds, as long as the definition doesn't leave it up to the program writing the data to define the meaning of the data. It's very similar to the problem of single-byte text encodings, where programs wrote data somehow, and afterwards had to guess how they wrote it.


> Two different nulls are unnecessary

false and null have different semantic meanings; do you mean “null” and “undefined”?


False and null are the same thing. Undefined may be a property of a variable (what in Prolog-like languages would be called "variable" or "non-ground"). But in this capacity it wouldn't be a value, it would be a property, similar to how in an OO language you could say "it's a big class" (meaning it's a class that has more than five methods specializing on it), but you wouldn't start modeling your problem domain by creating a BigClass class and then inheriting from it any time you want a class with more than five methods.


JSON sucks as a format. You forgot to mention the lack of optional commas, and the need to quote keys.

However 'jq' is absolutely great, and that unfortunately is why we used it for some command line tools that had to output structured data.


“As a format” depends on what you use it for. No comments and simple syntax is great for an interchange format but a drawback for a configuration file format.


The problem is json is not a simple format. In one major respect it is more difficult to generate than XML. The inability to handle commas as terminators rather than separators is a disaster that immensely complicates generating arbitrary json. That aspect only appears to be useful with trusted documents and dangerous eval style processing in Javascript in the first place.

Of course it originated as JavaScript Object Notation, but it has significant downsides even as a replacement for a restricted dialect of XML. It is (regrettably) both difficult to read and difficult to generate by comparison. The absence of support for comments is a problem for ubiquitous use for configuration files as well.


> The inability to handle commas as terminators rather than separators is a disaster that immensely complicates generating arbitrary json.

This is a bit over the top. At worst you need a flag to special-case the first item.
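
Something like this, a sketch of hand-generating a JSON array (real code would also have to escape quotes and backslashes in the items):

    public class JsonArrays {
        // The separator goes only between items, hence the flag; a terminator-friendly
        // syntax would let you append "item," unconditionally after every element.
        static String toJsonArray(Iterable<String> items) {
            StringBuilder sb = new StringBuilder("[");
            boolean first = true;
            for (String item : items) {
                if (!first) sb.append(',');
                sb.append('"').append(item).append('"');
                first = false;
            }
            return sb.append(']').toString();
        }
    }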


>And there isn't a way to transfer raw data (of course, you can Base64 encode it etc. but that's not a function of the format, the parties need to agree on format extension in order to understand that).

That seems like an odd complaint. The particular structure of the JSON is also not part of JSON itself, and thus parties need to agree on that too. Why are some implicit agreements okay while others aren't?


I think what you mean by the structure of JSON is the structure of individual documents encoded in JSON, not the JSON format itself. In the case of the JSON format itself, no agreement is necessary. As long as both programs implement the format definition, they are good to go and don't need to be aware of the existence of the other. This is evidenced by, e.g., the ability of text editors to edit JSON files without needing any knowledge about the producer of the file.

Compare this to, for example, Protobuf, where once you get the binary Protobuf message, you absolutely have to know what program generated it and how, otherwise you will not be able to read it. You cannot make a generic Protobuf editor the same way you can make a generic JSON editor.


As someone who has circled back into Java land recently after a number of years away, it is startling the extent to which the use of XML in this environment has boiled away to nearly nothing (the only XML file in our projects is the Maven POM), when these languages were once joined at the hip. The addition of annotations into Java has been a game-changer here, but also related ecosystems like OpenApi are using Json and even Yaml in preference.


> but also related ecosystems like OpenApi are using Json and even Yaml in preference.

And that explodes my mind. Yaml is such a steaming pile of shit compared to Json, let alone XML. Yaml is the perfect example of the dichotomy of easiness versus simplicity. At first glance, it looks so easy. No tags, no braces, no commas, and you can use comments. It seems to be so much better than XML and Json. After a while you begin to realize that the easiness is only perceived at first glance. In truth it is an agglomeration of corner cases that got hyped like crazy. See https://noyaml.com or the top answer here https://stackoverflow.com/questions/3790454/how-do-i-break-a... for examples of why I have such a strong opinion of Yaml.


My guess is yaml triggers exactly the same initial response as jquery; its surface seems "small", "simple" and "intuitive", but once you need to actually do things and dig deeper into how it is implemented, you often wish you hadn't used it.


I don't think that's fair to jQuery.


YAML fucked up bad with 1.1 and YAML 1.2 has virtually no adoption. It could have been great.


My impression from perusing a book on Android development is that it feels like XML-oriented programming. It gave me EJB deployment descriptor vibes. A very curious contrast to Kotlin being a modern language.


XML vs JSON is such a fascinating topic, a clash of different approaches; academic vs practical might be one way of describing it, but I'm sure there are many others. Are there any good books or essays on it?

And having been there in the 90s, seeing XML and thinking "they've tried to do everything and got into a big mess" (XSLT anyone?), it's so satisfying to see something simple and flawed displace it.


Sometimes I just don't really see it as a competition. For my research work XML is often a markup language, which JSON is not. If I want to qualitatively code a text, this is perfectly doable by hand:

  <note>
    <paragraph>This is a <code vibe="positive">remarkable</code> text.
    </paragraph>
  </note>
If you do it in JSON, it is not really readable anymore, and I would have to write a GUI for the input, e.g.,

  { "note": 
    { "paragraph": 
      [ "This is a ",
        { "code": [ { "vibe": "positive" }, "remarkable" ] },
        "text." 
      ]
    }
  }


IMO, you made the JSON unreadable with all the extra whitespace that you didn't add to the XML.

    {"note": 
      {"paragraph": 
        ["This is a ", {"code": [{"vibe":"positive"}, "remarkable"]}, "text."]
      }
    }
Also, in this particular case I think the child/property dichotomy works in the favor of XML, but typically I find it to be more of a liability than an asset.


If you use arrays instead of objects, it becomes JsonML[1]:

    ["note",
        ["paragraph",
            "This is a ", ["code", {"vibe": "positive"}, "remarkable"], " text."]]
[1]: https://en.wikipedia.org/wiki/JsonML

This is also why I find Mithril.js so pleasant to use without JSX, because it's basically JsonML.


Use parens and you have S-expressions. No need for all that noise.

    (note
      (paragraph
        "This is a " (code (:vibe . "positive") "remarkable") " text."))
Lisp was invented in 1958 and we routinely rediscover it, and reimplement it badly.


Yeah, S-expressions are way more pleasant to write.

With the benefit of a lot of hindsight, S-expressions seem like a superior choice for writing web applications (instead of HTML + JavaScript + some JS framework that writes HTML again (regardless of DOM vs Virtual DOM)).

Even though I prefer dialects like Fennel for programming rather than Common Lisp (I'd probably be fine with Clojure and Janet as well, but I haven't tried them), I wouldn't mind any dialect if that means I could use S-expressions instead of HTML+JS, assuming the amount of effort put into sandboxing that approach were as much as the effort that has been put into the current approach.


I love Lisp, but the quotes, and the structure ... Markup really shines here. Sorry. Same with JsonML, even though I never heard of it before, and will have a look because it just sounds so interesting.


How do you parse such JSON?

A "note" contains a "paragraph" object, which is an array of...? Strings, or objects where there are keys, but the values are arrays of...?


In many respects the failure of XML to live up to the grandiose promises made in the 90s was due to the lack of decent tools around XML. Decent structure-aware, refactoring editors didn't exist then, and barely do now. XML cries out for something like paredit, but dumbed down and tag-aware. The author shouldn't even have to see the textual representation of the tags.

But, primarily, the thing that made JSON win out is that it reflects the existing semantic/data-structure model of an actual programming language (and approximates that of many others.)

XML (and XSchema, oh god) has this mixture of associative and sequentially ordered data, and two separate kinds of associative/nesting models (elements vs attributes). It wants to eat the world with its new semantic model, but it doesn't offer any particular semantic advantages and doesn't match what programming languages (or relational databases for that matter) work with.

And then on top of that the syntax is just ugly to read.

Still there was a moment in the 90s where I found XML kind of exciting and interesting. I don't really recall why now? Seemed everybody did.


I also remember being excited about XML in the 90s and I can recall why. At the time, every application I used had its own bespoke file format and needed a custom parser to read its output. This especially applied to several applications storing hierarchical information in CSV files with their own way of delimiting CSV data as content in a CSV cell. The switch to XML meant that I'd have to figure out the tag soup that the developer had chosen for their XML data, but that was so much more comfortable than figuring out the goulash of ASCII escape tags in the old format.

XML is like the pager. It might seem clunky today, but it's better than having to tell the baby-sitter the phone number for the restaurant and theatre you'd be visiting that night.


The interesting thing for me is that I've circled back to "CSV is good, actually" since for loosely structured source input, it has few equals - it lets you organize by cell, and you can use spreadsheet software to edit it. The balance shifts towards schema generation once you want to apply a specific model on the data, but "import CSV to my schema" is convenient as an interface to that.

XML also has a place as an interface, but it's much more niche, since most interfaces that are hierarchical-document-shaped are going to parse text instead of bracketed entities. I think we got a bit excited about how well HTML worked at the time.


CSV is nice for tabular data. Especially if it is plain and straight forward in its nature. This covers a lot of use cases.

For an object graph, CSV is less nice.


any graph can be represented as the set of tuples representing the edges. and in fact this is the most flexible form. it depends on if you need it "human readable" or not, but a hierarchical/network representation of a graph is only one representation


Yeah, it is perfectly possible, I just don't think it is very nice. Neither for parsing nor for human reading.

If I need a file or network message to represent a handful of object types with a bunch of relations that form a graph, I prefer XML (or JSON).


> XML meant that I'd have to figure out the tag soup that the developer had chosen for their XML data

There was DTD for that but probably rarely used.


XML has a unique spot in the intersection of payload and human crafted file.

JSON is something I see as payload-only really. And YAML etc. are human-crafted only.

One thing I would change in XML is that closing a tag should not need to specify the name. <> should close any tag. It would make editing a bit easier and save a byte or two.


It's true, I can bang together an XML doc by hand with less worry than if I type out some JSON.


SGML allowed you to minimize a closing tag. And HTML supported this in theory, although mainstream browsers never supported it.

XML went for a syntax with fewer options because the complexity of SGML was considered a hindrance to its adoption.


SGML is also the reason why an HTML parser is often much more complex than an XML parser, even though modern HTML is no longer SGML but instead a simplified version of it.


In my humble experience:

- XML parsers were quite bad in the old days (2000), even in java land.

- Parser speed is a function of the input size. And XML inputs are quite big, too big in my humble opinion.

- you have tag attributes and tag values, and so people get confused about how to use them for simple scenarios.

- & must be escaped but no one does it. So when 'AT&T' ended up in a stream created by "hand" by a COBOL procedure, the XML got suddenly broken

- < and > musty be escaped, and so your SQL queries must be escaped

JSON is simply associative hierarchical maps. A lot of 1995 PHP code used the same data structure and it just works. No attributes, only a hierarchical recursive structure.

Recursion always wins.


> Parser speed is a function of the input size. And XML inputs are quite big, too big

You don't have to parse the complete document to do something with it. You can use a streaming parser, and you can execute XSLT on the fly.


This is true, but in the early days XML framework libraries gained a well-deserved reputation for poor performance because they would parse into a DOM first and then serialize from there. I improved the runtimes of several projects from hours to seconds by switching the serialization from DOM to SAX. The prevalence of, and focus on, the document model for every usage of XML resulted in a reputation hit. One that XML has really struggled to overcome.


That's all true, I'd just add that in JSON you have to escape " in strings just like you have to escape & < and > in XML, so there is still the potential for e.g. COBOL software to produce invalid files.


JSON:

"path\to\file"

XML:

<a>path\to\file</a>


Around the time XML was released I asked a (smarter than me) friend what the hubbub was. And he got a pained look on his face and said: people think it'll solve a problem, and it won't.

His reasoning was that the problem isn't standard file formats. It's turning those into something a program can operate on as data[1]. And XML does not do that for you. No, you have to suck it in and then manually transform it into 'data'. So it just moves the problem from one place to another place. Which is a trash solution usually.

As far as I know, JSON, broken as it is, does do that. At least with dynamic-ish languages.

[1] Gross stuff people used to do, like copying data structures in memory directly to a file, actually solves that problem while being a maintenance nightmare.


I consider this a feature. XML specifies a data exchange format. How this is loaded into memory depends on your use case, language etc. You can use a DOM if this is appropriate or serialize into custom or native data structures. A browser rendering a web page will use different data structures than a crawler indexing a document.


The problem is the schema and validation is the hard part and XML doesn't help with that at all.


XML contains a schema/validation language, the DTD.

The hard part is getting independent parties to agree on a common data exchange format. Processing the data in code is the easy part.


> As far as I know, JSON, broken as it is, does do that.

JSON in scripting languages does "as well" as Java's java.io.ObjectInputStream/java.io.ObjectOutputStream and dotnet's System.Runtime.Serialization.XmlObjectSerializer.

aka: it will explode in your face in spectacular fashion [causing security vulnerabilities or denial of service along the way] unless you take special care to only use explicitly safe serializable objects, and you handle corner cases correctly (for example, naive deserialization of Javascript `__proto__`).


Converting it to some object is not converting it to usable data. Anyone who tried to convert arbitrary json to python will know the struggle of `json[0][0]["root"]...`. This is also possible with xml and is similarly completely useless. At least the latter has tools around this fundamental problem.


> XSLT anyone?

I used to love XSLT (I would probably still love it if I still used it, but the projects have mostly disappeared). Pattern matching is akin to event driven programming: just deal with what you have, if and when you have it. It was very clean and never broke.


My web-technology professor at my uni (around 2007) was a big fan of XML and all the surrounding technologies. He saw it as this beautiful interconnected system where the web being pure XML would allow all data in the world to be queried like a database and transformed into any format you wanted.

The web sure didn't go that way...


Basically, he was in love with serialization and connecting services together; that part is cool. It's just a lot more diverse in serialization now than before. And a lot of these projects compete in reducing overhead - they are orders of magnitude more efficient than XML.


I think it was called semantic web or Web 3.0.


And then history repeats and we call this now blockchain and the mess around it ;)


Thats "XML - the religion"


XSLT is great, the problem is that the browsers stopped at XSLT 1.

XSLT 3 is a different beast, and XSLT 4 is being worked on at the moment.


I wrote many thousands of lines of XSLT back in the day, converting XML into XSL-FO and pumping it through FOP. It worked remarkably well. It was my second experience with a declarative language (SQL being the first).

There were some great resources - libraries of code fragments, similar in spirit to tailwind ui - and people did crazy things with them.

We eventually abandoned it, after many years. It was indeed terribly difficult to work out how a document got transformed, and when returning to a transform after a long time working in other languages, you could easily spend a day just trying to work out how some trivial thing worked.

I remember it fondly, but I wouldn’t do it again.


What XSLT really needed was a gui where you could play with the template and the input and see the output change in real time, a little like how those regex websites work. With a few colours and arrows to elucidate what bit was doing what. Not a trivial undertaking, granted. I wonder if such a thing was ever produced?


I never saw one.

I do remember xsltproc being a game changer. It was still a CLI but it was really fast. So at least you could turn things around quickly. And IIRC Preview on the Mac would reload a PDF automatically if it changed. So you could get pretty close to a gui flow sometimes… as long as your document was short!

Until xsltproc, I’d been using whatever XSLT processor came with Java, which (as always) was fine on a warmed-up production server but sucked for REPL-style use.

But what a gui could have helped with is working back from the output element to the node that generated it. That would have been sweet. But if memory serves, anything commercial for XSLT back then was "enterprise licensing", which we couldn't afford.


https://xsltfiddle.liberty-development.net/

I use this if I want to play around with XSLT.


I think the true lesson of XSLT is that XML makes a terrible syntax for a programming language. The language itself, minus the syntax, is fine, although I think many people prefer a more imperative approach.


XSLT was also a terrible combination of two principles: imperative coding + data pattern matching. For a novice beginner - and me at the time - it was never clear which pattern to use and how it best fit together.


> XML vs JSON

Fun fact: Tim Bray was the/an editor on both specs:

* https://www.rfc-editor.org/rfc/rfc8259

* https://www.w3.org/TR/xml/

I wonder if he'll commemorate/note the anniversary in some way:

* https://www.tbray.org/ongoing/


In a sense though the circle has been closed. With the latest JSON Schema spec, we were able to do a 1:1 mapping between our XSDs and JSON Schema, so now we can accept JSON as well as XML.

For small messages though I agree JSON is smoother.


Whenever I hear people talk about JSON schema spec it reminds me of XML in the 90s.


Maybe not a popular thing to say: a schema capability is needed. You want to verify input. You want a contract among partners. You want to IntelliSense something in a code editor.

And are schema languages a complicated beast? Hell yes. Do we need them nevertheless? Yes.


I despise XML, not because it was a bad idea, but because of all the bad ways I've seen people use it.

For example, I used to work at an ed-tech company that bought bubble test questions from Pearson, who provided their data in HUGE XML documents. If I remember correctly, they would do things like splitting the sentences of the test questions up so half the question was a tag attribute and the other half was an element. So, instead of just parsing questions, we'd have to parse them and then stitch them back together to make complete sentences. They did that with the answers too, I believe. So weird.

The JSON format makes it harder to abuse data like my example above and it is a lot easier to parse, so I'd reach for it before XML any day.


Sounds like something that XSLT could have solved. When in doubt use more XML.


> Are there any good books or essays on it?

That would be a really good resource. I mean, anybody who has worked remotely with web technologies has an opinion about this topic but it would be interesting to compile an objective knowledge base and interpretation of why things evolved this way.


XML seems like the apex of "design by committee" (and I mean, didn't we have some thick books about it? What for?)

Nobody cares about 99% of that


I am happy we are not using SGML, which is a base for XML and is MUCH MORE complex and is also a standard...


XML itself is a pretty focused spec. XML schemas and the whole SOAP stack is where it went off the rails.


XML was flawed due to XHTML. The key spec was XML Infoset. It removed many capabilities that XML had and focused on data transfer rather than document representation. That is basically JSON, but more powerful (comments + namespaces + attributes).

Personally, I think you cannot operate an interface without a schema. There is always a contract. Within your own team you may not care, but as soon as two teams work on things in a different order, contracts are needed.

And yes, SOAP, WS-*, XSLT, etc. is where the madness starts. But to be honest, they used XML; they are not XML.


I've always had a soft spot for idref attributes, intended to support internal linking and graph structures in documents but never seriously adopted; I suspect that more important XML mechanisms exhausted the annoyance/complexity budget of implementations, until the tree-structured data applications crushed document applications as you describe.


Graphs are always hard. I would love to see graph databases become more common, but no.


I worked on XML-EDI "back in the day". When I came to the blockchain (Ethereum team 2014) I had high hopes that it might provide the architectural solution to the problems with XML in terms of workflow, API endpoints, and the (infamous) failure of WSDL ("web services description language"). I wrote it up in the Ethereum launch piece here: https://medium.com/humanizing-the-singularity/by-the-end-of-...

=== quote begins ===

Of course, there are attempts to clarify this mess — to introduce standards and code reusability to help streamline these operations and make business interoperability a fact. You can choose from EDI, XMI-EDI, JSON, SOAP, XML-RPC, JSON-RPC, WSDL and half a dozen more standards to assist your integration processes.

Needless to say, the reason there are so many standards is because none of them work properly.

Finally, there is the problem of scaling collaboration. Say that two of us have paid the upfront costs of collaboration and have achieved seamless technical harmony, and now a third partner joins our union. And now a fourth, and a fifth. By five partners, we have 13 connections to debug. Six, seven… by ten the number is 45. The cost of collaboration keeps going up for each new partner as they join our network, and the result is small pools of collaboration which just will not grow.

Remember, this isn’t just an abstract problem — this is banking, this is finance, medicine, electrical grids, food supplies, and the government.

Our computers are a mess.

=== quote ends ===

My company uses XML in a sort of "semantic web" paradigm to describe products, and links to these documents from smart contracts to provide accurate machine-readable descriptions of things which are for sale. It works reasonably well, and maybe it'll turn into something a lot more powerful as operations expand. So we do walk the talk to some degree.


XML 1.0 was clearly a huge success. What I'm curious about is the seeming failure of XML 1.1. What happened there? Is it just like IPv6, in that there are too many non-upgradable legacy systems?


XML 1.1 did not "fail". It was designed from the start as a slight relaxation of XML's syntax rules for applications that can benefit from it, which essentially means applications running on platforms where the usual encoding is not derived from ASCII (e.g. EBCDIC). It was never meant to be widely used.


Of course XML 1.1 was meant to be widely used. The design goals[1] for XML 1.1 are the same as for XML 1.0, including being "straightforwardly usable", "support[ing] a wide variety of applications", and being "easy to create".

The relaxation of names in XML 1.1 is good for internationalisation[2], by not limiting names to Unicode 2.0 characters. Allowing NEL characters, as used on IBM mainframes, was not the sole motivation for XML 1.1.

[1] https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-origin-go... [2] https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-xml11


It's like YAML 1.2



> They are not identical. The aspects you are willing to ignore are more important than the aspects you are willing to accept. Robbery is not just another way of making a living, rape is not just another way of satisfying basic human needs, torture is not just another way of interrogation. And XML is not just another way of writing S-exps. There are some things in life that you do not do if you want to be a moral being and feel proud of what you have accomplished.

Wow.


If there is an argument beneath the hyperbole and I-am-very-smart posturing, it seems very weak. He ends up advocating a binary format (like TCP/IP) instead of a text-based one like XML because it will be faster to parse. And because a binary format is harder to change, people will be sure to get it right the first time.

Yeah right.


Am I the only one who utterly despises namespaces in XML? Query a document with a namespace? Oh, no results? Ah, forgot to add the namespace...

I never had a use case for namespaces, and the ones I can envision would have been better served by a property or attribute.


Namespaces are great as a concept: I can take your standard document and extend it without any chance of conflict in the future if you later add a property or element of the same name. This is really important for interchange systems, since you inevitably end up with cases where a standard format has almost but not all of the vocabulary you need.

Unfortunately, namespaces as implemented in most XML tools are terrible and that's 100% the fault of the implementers. The problem is basically this:

    <foo xmlns:bar="http://example.org/foobarspec">
       <bar:a href="baaz">hello world</bar:a>
    </foo>
To a human reading that, you should be able to write a selector like /foo/bar:a or even /foo/a, but as an XML parser sees the world, bar:a is actually:

    {http://example.org/foobarspec}a
That's an implementation detail but a shocking percentage of XML library developers did not implement a lookup mechanism to resolve it when given "bar:a".

I would maintain that paying a single developer to make libxml2 suck less would have done more to help XML adoption than everything done by any XML-related standards group combined after Y2K.
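
For what it's worth, the better bindings do let you supply the prefix-to-URI mapping yourself. A minimal sketch with Python's lxml (which wraps libxml2), reusing the namespace URI from the snippet above:

    from lxml import etree

    doc = etree.XML(b"""
    <foo xmlns:bar="http://example.org/foobarspec">
       <bar:a href="baaz">hello world</bar:a>
    </foo>
    """)

    # The prefix used in the query is whatever you bind here; it does not have to
    # match the prefix used in the document, only the namespace URI does.
    hits = doc.xpath("/foo/bar:a", namespaces={"bar": "http://example.org/foobarspec"})
    print(hits[0].text)  # -> hello world

The pain point the parent describes is exactly that many libraries never exposed (or documented) that namespaces mapping, so "bar:a" silently matched nothing.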


First thing I do in any XML parser I write: search/replace the namespaces. I know this is awful and impacts speed, but namespaces are so f*** annoying to handle with most parsers. Getting rid of them is the simplest solution for my sanity.


Recently I had to convert a bunch of well-structured XML files to CSVs (flattening). I downloaded the Apache Xalan-C (XSLT engine) and Xerces-C (XML engine) Windows binaries from 2004 (!) and, to my surprise, they worked fine on Windows 10.

After that it was half an hour of work to write a quick XSLT and convert the XML to CSV.

I know that there are other (even easier) ways to extract data from an XML file, but I have to admit that writing an XSL transform and seeing it work gives me joy.

Probably I'm getting too old, remembering the old days...
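
For anyone curious what that kind of flattening looks like, here is a minimal sketch, not the Xalan command line used above but the same idea, driving an XSLT 1.0 stylesheet from Python's lxml; the record and field names are made up:

    from lxml import etree

    # Hypothetical input: a flat list of <record> elements
    doc = etree.XML(b"""
    <records>
      <record><id>1</id><name>alpha</name></record>
      <record><id>2</id><name>beta</name></record>
    </records>
    """)

    # Hypothetical stylesheet: emit a header row, then one CSV line per record
    xslt = etree.XML(b"""
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <xsl:template match="/records">
        <xsl:text>id,name&#10;</xsl:text>
        <xsl:for-each select="record">
          <xsl:value-of select="id"/>
          <xsl:text>,</xsl:text>
          <xsl:value-of select="name"/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(xslt)
    print(str(transform(doc)))  # id,name / 1,alpha / 2,beta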


Here's how JavaScript's XSD validator works:

> A (XSD) schema validator for NodeJS that uses Java to perform the actual validation. Why? Because Java can do schema validation and NodeJS cannot. https://www.npmjs.com/package/xsd-schema-validator

<facepalm/>


That’s not “JavaScript’s”, that’s just a library someone published on NPM, right?


Yep. JS doesn't have a standard library so npm is the best we've got.

And speaking of the best we've got, the XSD validator above is the only one I'm aware of on npm—kind of shocking given how many overall libraries there are.

It speaks to the unpopularity of XSDs among web devs, which is ironic given how close HTML is to XML (and XHTML literally is XML).


And here's to the next 25!


I highly doubt computers will still exist in 1.551121e+25 years!


[factorial joke]


never explain a joke! :)


And coffee was spilled! Well played!


I would want to experiment with the root structure of XML.

Make it possible to have multiple root elements. One problem is that however big an XML document is, it has to be closed at the end, so it ends up as one atomic element; you can't partially parse an XML document correctly.


XSLT 3.0 allows streaming for large XML files. See e.g. https://www.saxonica.com/html/documentation10/sourcedocs/str...


Isn't that what SAX (event-based) and reader (pull-based) XML parsing are for? Those allow you to incrementally/partially parse XML.


Well... There is nothing stopping you from partially parsing an XML document. What you can't do is validate it, which is the same for any other file format. You can't be sure the file is fully valid without fully parsing the whole file.

However, the only order in which you can partially parse an XML file is linear order, which clashes badly with the fact that it's a tree-based format. Depending on how tree-like your schema is, this might be a massive hindrance.

This flaw isn't unique to XML. All text encoded file formats share this characteristic, and can only be parsed linearly. If you move across to the world of binary file formats, it's extremely common for them to have indexes of offsets so a parser can navigate a tree-like structure in tree order without having to fully parse it, along with other types of non-linear data structures.


> All text encoded file formats share this characteristic, and can only be parsed linearly. If you move across to the world of binary file formats, it's extremely common for them to have indexes of offsets so a parser can navigate a tree-like structure in tree order without having to fully parse it, along with other types of non-linear data structures.

You don't always have the luxury of randomly accessing a file (obvious example: shell pipelines with a producer and consumer exchanging lots of temporary data), so taking advantage of indexing might require saving a temporary file and stalling processing until the file is ready.

Personally, I'm used to parsing large XML files with event-based APIs and throwing away data aggressively, keeping in memory only one unfinished element of interest, the stack of its ancestors, and my collected data (instead of a DOM for the whole document).
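
That pattern looks roughly like this with Python's stdlib ElementTree; the file name and element names are hypothetical:

    import xml.etree.ElementTree as ET

    # Hypothetical layout: <items><item>...</item><item>...</item>... millions of them
    count = 0
    context = ET.iterparse("big-feed.xml", events=("start", "end"))
    _, root = next(context)            # the first "start" event hands us the root element
    for event, elem in context:
        if event == "end" and elem.tag == "item":
            count += 1                 # collect whatever you need from elem here...
            root.clear()               # ...then discard finished children to keep memory flat
    print(count)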


It's super common for XML parsers to support partial documents.


and, AIUI, XMPP is based entirely around that very idea so it's not even a niche concept (well, aside from XMPP being arguably a niche protocol)


What do you mean by "partially parse"?


If you imagine a log file that gets lines added to it now and then, it's actually quite fiddly to do that with XML because of the closing outer tag.


Well, I see the awkwardness of emitting a log like that, but it's not any worse than emitting a single JSON array and waiting for that closing ']'.

Both can be emitted and parsed in a streaming fashion, but I wouldn't say that either XML or JSON is suitable for logs. Maybe NDJSON, but it's more like a hack around this limitation.


> but I wouldn't say that either XML or JSON is suitable for logs

Of course. Better use a binary format for this.

https://systemd.io/JOURNAL_FILE_FORMAT/

:-)


For JSON it's common to have multiple records per file. Then it's just one object per line, for example.


Yes, that's NDJSON, as in newline delimited JSON.

http://ndjson.org/
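
The appeal is that each line is a complete document, so appending and tailing both work. A minimal sketch in Python, with a made-up file name and fields:

    import json

    # events.ndjson (hypothetical): one JSON object per line, e.g. {"level": "info", "msg": "started"}
    with open("events.ndjson") as f:
        for line in f:
            record = json.loads(line)   # each line parses on its own,
            print(record["msg"])        # so the "file" never needs a closing bracket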


I remember being excited about XML when it first came out, since there were so many clunky languages used for configuration and data storage. That feeling lasted for a couple of years, and the 2000s were basically a long disappointment as the parts I liked were buried under huge piles of enterprise ejecta.

The main lessons I've drawn looking back:

1. Tool quality matters more than it might seem, and you can't count on other people to fix your mess for you. A couple of things which come to mind were the lack of a good formatter and validator (people spent too much time on things like formatting errors, especially if they were starting with a very lenient parser and things seemed to work for a while until they hit a standard one) and also the way things like XSLT/XPath shipped major spec updates but since nobody implemented them in libxml2 most developers couldn't use them, use them in browsers, etc.

2. Usability similarly matters far more than experts think: I'm using namespaces here because in my experience that soured more people on XML than any other single feature. The requirement that you pedantically specify the namespace in things like XPath or XSLT meant that people couldn't write those using the values they saw in front of them without getting errors or, worse, silent data loss. This reliably tripped up even experienced users and it could have been easily avoided by making a handful of key client libraries smarter. Everyone deep in the XML community has learned and internalized this but it came up over and over and over as something users hated.

3. Continuing that theme, documentation and especially examples matter a lot for the same kind of reasons. I remember about a decade ago getting evangelized by a semweb person and evaluating a couple of the things they mentioned. The only documentation available for any of the half-dozen referenced specs were the specs themselves (which were not written targeting implementers or mutually compatible), and there were no examples which I could find which matched the current versions of the specs, worked with the few public tools, or, in a couple of cases, were even valid XML.

The common thread in all of this was hubris and especially the assumption of inevitability. A lot of people were aware of these as sources of friction but seemed to assume that someone else would make better tools, train people, write good documentation, maintain examples, etc. — especially since if you were the kind of person sitting on a W3C community, these were less of a problem for you since you had the knowledge & skills to keep these from being significant impediments.


Maybe someone can help me find an obscure XML feature that I’m fairly certain I didn’t imagine:

Is there a way to refer to another part of the XML tree from elsewhere in the document?

I vaguely recall hearing about this feature, but now I can’t seem to find it, even after skimming the spec.



That’s it. Thanks!

Edit: id/idref
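
For anyone else looking for it: DTDs let you declare attributes with the ID and IDREF types, and the idea carries over informally even without a DTD. A minimal sketch of following such a reference with Python's lxml, using made-up element names:

    from lxml import etree

    # A "see" element points back at another node via an idref-style attribute
    doc = etree.XML(b"""
    <doc>
      <section id="intro">Introduction</section>
      <see ref="intro"/>
    </doc>
    """)

    ref = doc.find("see").get("ref")
    target = doc.xpath("//*[@id=$r]", r=ref)[0]   # resolve the reference to its target element
    print(target.text)                            # -> Introduction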


There's lots of comments here on XML as a file format, but that's not precisely correct (even the opening of the wikipedia article on it has the same conceptual flaw in the opening description). I think understanding what it actually is, and why it exists can really offer a lot of perspective. As somebody who lived through the emergence of XML (and HTML) from SGML, and remembers SGML actually in use in some settings, I think it's important.

XML is a specification for how to create data interchange formats using a semantic, hierarchical, organizational scheme -- it's not a file format.

You create data formats that follow XML, but every different schema is a different data format.

What XML did, and the reason it was important, was that it made it much easier to design and specify data formats that had standard tooling for writing, validating, and reading, than pretty much any other earlier conceptualization. Before then, pretty much every piece of software, and even every different version of the same software, had a custom, usually binary, undocumented, and proprietary data format that it used to serialize data (usually as a file format on disk) or to transmit data across a network.

Today, it's more or less assumed that when you serialize data to disk, or build up some data object to send over a network, it'll probably be in some kind of human-readable format; figuring out how to encode your data into bits, or dumping out serialized C structs or whatever, isn't the common practice anymore. Even using a binary serialization format like Cap'n Proto or BSON is considered kind of unusual outside of some large organizations. Developers back in those days were dealing with a variety of CPU and memory architectures, endianness, word size, and other things that made cross-platform movement of data a total nightmare. And almost nobody documented their formats in any way; reverse engineering formats was a valuable skill in some contexts.

There was some precedent for XML-like approaches. A very common one was CSV, which was often used to move tabular data between systems (and is useful to view as another "specification" for making data formats). But again, it lacks a standard specification, lacks validation, and has other flaws that make it frustrating to use for data interchange.

When XML hit, it was kind of like the enlightenment happened. Developers suddenly had tooling and standards that let them not only pump out data formats in a self-describing way, but schemas that also let them know they were writing the data correctly and let readers know how to parse it. Even better, a human can usually just open up a data format created to the XML specification in a text editor and just read off the data fields and values. Long-term archival became easier, networking systems became easier, RPC and API development became almost trivial, and so on. SOAP was created because XML was created. Most of Java's built-in data format stuff and RPC functionality assumes XML. Thousand-page books written by a dozen authors showed up in stores trying to explain to developers all of the weird edge bits and use-cases.

Then the ugly parts of XML started to settle in. The verbosity, the overengineered bits, the lack of clarity over fields and attributes, stylistic disagreements over how to organize data using it. Turns out most developers don't want to go through the hassle of writing out the validation schemas so that was usually left unused, and then most of the tooling actually wasn't very good -- for a long time good implementations of readers/writers/validators just didn't even exist for many languages. It also turns out just handling and parsing through XML, even with decent tooling, is kind of an arcane chore.

Since the early 2000s then, there's been an explosion of approaches that all have learned the lessons of XML, the good and the bad, and seek to tackle various perceived shortcomings: size, ease of use, etc.

XML is an exquisite crystal castle, built using the purest, most elegant thinking, over decades. It turns out most people just want good enough to get it done. So most developers, it seems, just serialize to JSON or JSONL, gzip it, and send it over the wire or write it to disk. If you're lucky, somebody will write down the fields and expected data values in a team wiki somewhere, and off everybody goes. Gone are validators, carefully crafted hierarchical data values, schemas with carefully considered semantics.

As somebody who lived through the dregs of binary formats and the heights of XML, I think this is okay. But it's good to know that XML exists if extreme formalism is still needed, and binary formats are relegated to highly niche use-cases.



