> with no restrictions on the text in fields or the need to try and escape characters.
Maybe I'm missing something, but wouldn't it still need escaping for those ASCII separator characters (or alternatively, a restriction for the stored text not to have them)?
It's true that having to deal with escaping much less often (since the ASCII separator characters are rarer than commas/quotes) would be convenient for manual reading/writing, but I feel that's canceled out by the characters being hard to type/see (likely the reason why they're rare) - and it wouldn't necessarily save on writer/parser code complexity.
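To make the objection concrete, here's a minimal sketch (my own illustration, not from any comment here) of a split-based writer/reader and the round-trip failure when a field happens to contain a separator:

```python
# Unit (field) and record separators from ASCII.
US, RS = "\x1f", "\x1e"

def write_rows(rows):
    # Naive writer: join fields with US, records with RS, no escaping.
    return RS.join(US.join(fields) for fields in rows)

def read_rows(blob):
    # Naive reader: plain string splits.
    return [rec.split(US) for rec in blob.split(RS)]

clean = [["name", "note"], ["alice", "hello"]]
assert read_rows(write_rows(clean)) == clean

# A field that itself contains the unit separator round-trips wrong:
dirty = [["alice", "a\x1fb"]]
assert read_rows(write_rows(dirty)) != dirty  # parses as three fields
```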
See, this is why I once used moon-viewing-ceremony-separated-values (MVCSV). The Moon Viewing Ceremony emoji was unlikely to show up in my dataset, and not only is the emoji visible, it's quite visually pleasing.
Not if you just say those characters are invalid data. I first heard about them decades ago, but I don't think I have ever once seen them in use.
The real problem is that there is no easy universal way to type them with a keyboard. So it would require software interfaces in the application, and at that point it’s basically binary.
> but these specific characters are not text. they exist solely to be delimiters.
Even if people used them only for their intended purpose, someone could use them as delimiters within the text you want to store (e.g: list of tags in filename) - unless I'm misunderstanding.
> It would be like trying to escape a column in your spreadsheet.
Other formats do allow escaping their delimiters, so that you can use that character literally or even nest a string of that format within an entry.
> But there is never any reason to use these characters literally, they are just delimiters.
I can put []* in my comment (maybe because I'm demonstrating the format, referencing the characters, or just being capricious), and now someone scraping and storing comments has a need to use those characters literally. Sometimes fine to ignore certain content or store it lossily, but often not.
yes, you would still need to clean your inputs before randomly adding it to your table. Your contrived example brings me back to my original assertion that as long as you’re ok with those characters not being valid data it works fine. So, sure if someone really wanted to store those two literal non-visible characters in a text file that would not work. Everyone else could just not do that.
> yes, you would still need to clean your inputs before randomly adding it to your table.
Lossy is fine in some cases, but in many cases you do actually need the specific text you're trying to store - not just something similar to it. Hence my objection to "never any reason to use these characters literally".
> Your contrived example [...] if someone really wanted to store those two literal non-visible characters in a text file
Needing to store these specific characters is rare, but needing to store arbitrary text (possibly from adversarial/mischievous parties, or just a large enough dataset that encountering all edge-cases is inevitable) is common. For instance, for security reasons a log shouldn't break or have a blindspot for folders with those characters.
> as long as you’re ok with those characters not being valid data it works fine
Which is what I'm saying in my original comment with "or alternatively, a restriction for the stored text not to have them"
if you’re storing arbitrary text from untrusted sources you will always need to clean it first. Plus in those instances you’ll probably want a db or json or whatever works well with your language anyway.
I guess I’m saying that if this had caught on early with ubiquitous support it could have saved us from the mess that is csv/tsv/etc. It wouldn’t have negated the need for more advanced storage and serialization formats.
> if you’re storing arbitrary text from untrusted sources you will always need to clean it first
Reversible escaping of characters is pretty common (though not always; length-before-text formats don't require it). But to "clean" as in deleting characters such that you can no longer get back to the original string is definitely not required for all formats, and is a fairly undesirable property.
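For example, a reversible escaping scheme for the separators might look like this (the choice of ESC 0x1B as the escape character and the letter codes are my own illustration, not anything the thread specifies):

```python
US, RS, ESC = "\x1f", "\x1e", "\x1b"

def escape(field):
    # ESC must be escaped first so the other substitutions stay unambiguous.
    return (field.replace(ESC, ESC + "e")
                 .replace(US, ESC + "u")
                 .replace(RS, ESC + "r"))

def unescape(field):
    out, i = [], 0
    while i < len(field):
        c = field[i]
        if c == ESC:
            out.append({"e": ESC, "u": US, "r": RS}[field[i + 1]])
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out)

# Round-trips losslessly, and escaped output never contains the separators.
for s in ["plain", "a\x1fb", "\x1b\x1e", ""]:
    assert unescape(escape(s)) == s
    assert US not in escape(s) and RS not in escape(s)
```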
> Plus in those instances you’ll probably want a db or json or whatever works well with your language anyway.
You'd want to use some format that doesn't have the problem this one has, yeah. IMO ASCII delimited text just isn't really anywhere on the Pareto front of formats you'd want to use - it's unpleasant to work with manually, and once you're writing the file through code or a tabular editor you may as well use a format that can handle arbitrary text.
> I guess I’m saying that if this had caught on early with ubiquitous support it could have saved us from the mess that is csv/tsv/etc
I think you could say the same of RFC 4180. In reality, I don't see why this wouldn't also spawn dialects, like people adding newlines between rows so they can open it in a text editor without it being in one huge long line, or inventing an escaping scheme so that it can handle arbitrary text.
I feel like this (and some of the replies to this) is missing the point a bit.
I don’t think the goal was to make a bullet-proof delimiter that fails at nothing.
The goal was to solve the problem of not allowing things like commas, quotes, newlines, tabs, pipes, etc. in text files.
I feel like using the proposed ASCII characters would eliminate these limitations, while also allowing a machine-creatable and machine-readable format (emphasis on machine as opposed to human).
Yes, it would still be tough for a human to type or read these delimiters, so in that case, go with traditional CSV or TSV (or MVCSV!).
But if you only need to use a machine to create/read the text, this sounds like a great solution, allowing all of the normal characters you might see in text.
If you need a machine-readable format, why not go with escaping like most other formats, or length-before-text, to include all characters - instead of a format that fails on some (albeit rare) characters?
Both of those are fine, but they add additional complexities (even if small) where there is very little, if any, complexity added with using these two characters as delimiters.
For machine-readable formats, I'd argue length-before-text is simpler to parse than splitting by separators - even before having to add extra application logic to handle this method being fallible (e.g: if you want to use it to store filenames, you now need a check and pop-up about some OS-valid filenames not being supported).
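As a sketch of what length-before-text looks like (netstring-style framing; the exact `length:bytes` layout here is my own choice, not a standard the thread references):

```python
def encode(fields):
    # Each field is framed as "<byte length>:<utf-8 bytes>"; no character
    # is special inside the payload, so any text is representable.
    return b"".join(f"{len(f.encode())}:".encode() + f.encode() for f in fields)

def decode(buf):
    fields, i = [], 0
    while i < len(buf):
        j = buf.index(b":", i)          # end of the length prefix
        n = int(buf[i:j])               # payload length in bytes
        fields.append(buf[j + 1 : j + 1 + n].decode())
        i = j + 1 + n
    return fields

# Separators, commas, and newlines in the data are all fine.
row = ["any text", "even \x1e\x1f, commas, and\nnewlines"]
assert decode(encode(row)) == row
```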
> For machine-readable formats, I'd argue length-before-text is simpler to parse than splitting by separators
Yes.
> even before having to add extra application logic to handle this method being fallible (e.g: if you want to use it to store filenames, you now need a check and pop-up about some OS-valid filenames not being supported).
I don't see why you'd need that. CSV does not have anything like that.
That's a higher-level concern. This ascii-delimited format (like CSV) is supposed to be a stupid row/column format. And also simpler to implement than CSV.
If you're using a fallible CSV dialect, your application does need to handle that case in some way (or in some cases it may be fine just to let it crash). Something like length-before-text is convenient because you don't have to worry about that case.
Yup exactly, it just pushes the problem around, without solving it.
The delimiters can occur in binary data, or when there's nesting -- trying to store TSV in TSV, or JSON in JSON, etc. The latter definitely happens a lot, for better or worse
The title says ASCII Delimited Text not ASCII Delimited Binary Data.
For the purposes of CSV, I consider text to be anything that satisfies the regex ^\P{Cc}+$ (https://www.compart.com/en/unicode/category/Cc) and I normally strip chars in that category before saving some text (for single-line text). ^[\p{Cc}&&[^\n]]+$ is a regex that can be used to strip all control chars except for the newline.
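Python's stdlib `re` doesn't support `\p{Cc}`, but the same stripping can be sketched with `unicodedata` (the `keep` parameter mirrors the commenter's newline exception; the function name is mine):

```python
import unicodedata

def strip_control(s, keep="\n"):
    # Drop characters in Unicode category Cc (control), optionally
    # keeping a whitelist such as the newline.
    return "".join(c for c in s
                   if unicodedata.category(c) != "Cc" or c in keep)

assert strip_control("a\x1fb\nc") == "ab\nc"   # newline survives
assert strip_control("a\x1fb\nc", keep="") == "abc"
```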
You can disallow those metacharacters in the data proper. Then you have a format that can store any utf8 or whatever except the non-whitespace control codes without any escaping. That solves a problem in an opinionated way. Just like how json is opinionated (utf8 only).
You can convert to another format if you need something crazier than rows and columns consisting of normal text.
Then I don't understand. Like the sibling comment said the really "problematic" character in TSV is the line feed. But a tab can occur as well.
The format that I described does not already exist in the form of TSV. And further based on your original comment I would have thought that both TSV and this format would be discarded as not-useful.
I like this format best of all; CSV is my #2 favorite.
At work we settled on using ^G (0x07) as a delimiter instead of TABs for file transfers and loading data into various databases.
The reason was Excel. People/systems who create these files sometimes source from Excel. And Excel can have a habit of placing odd characters in text fields. We found the one character never encountered was BEL.
For text fields we tend to remove embedded whitespace after replacing TABs with a single space.
In the early 2000s, back at the beginning of the world, Yahoo's web code used ^A and ^B for field and record separators to avoid having to escape commas and quotes and newlines. That was probably the last time I ever saw ASCII control characters used as intended in the wild.
There is no technical reason why CSV should have won out, except that keyboards have a comma key and almost never a ^A key.
That's a huge technical obstacle for most people though. The whole point of XSV formats is to be human-editable. There are better formats for computer-to-computer records. If you can't type the core delimiters on a keyboard, your format is going to lose out despite any of its other benefits.
It's an editor issue then. "Back in the old days", people used to understand how to input ^A and ^B. Showing these characters would also be a mere addition to the character set.
Sure, there is inertia to change, but even rich text format is/was supported by windows.
Being unable to deal with this is a laziness of the developer that spilled into the user being unable to deal with it.
This is nothing a user can't be trained on, and I'd argue it makes more sense than weird escaping sequences in the event you actually do want a ",".
P.S.: But then again, with proper editors the escaping issue vanishes - and no, I do not mean IDEs. Lots of people decided it was worth it to support RTF, I figure the decision to support 2 additional characters is way easier in a user-friendly way.
That doesn't need to be an obstacle, a graphical editor that lets you click a button to add a row/column exists for nearly every other tabular format. It only matters if you want to edit the file using a plaintext editor. If the format were popular, shortcuts would be created to enter the delimiters in many plaintext editors too, which is a chicken/egg problem but let's not kid ourselves into thinking it's not a solvable problem.
If you're using a bespoke editor anyway, why are you restricting yourself to a texty format? XSV formats inhabit this weird space halfway between data exchange and human readable. They're a compromise. If you go full data exchange you might as well use something even better suited for the job.
You don't really have to have a bespoke editor, you just have to display the control characters sanely in an otherwise normal plaintext editor and add a shortcut or menu item to insert the control characters.
And how do we escape those characters? With ESC (27)? Inside a SI/SO (15/14) pair?
I think CSV or TSV are good enough. People keep trying to find a format where you can separate the records and fields with a simple string.split and there's no need to contemplate escapes.
But that's not possible, no matter the format you'll have to parse it right. And then, a format that uses visual delimiters has the obvious advantage of being editable with any text editor.
Kind of a short sighted take. Sticking special characters that (in many early editors) would be invisible complicates development and maintenance. Even tabs have a visual, albeit inconsistent (if your editor wants to align columns for you) manifestation you can work with.
Technically, XML is superior for data representation on many fronts. But likewise, it is an absolute PITA to maintain without significant editor support.
> The most anoying thing about the whole problem is that it was solved by design in the ASCII character set.
This is a great example of not understanding what “the problem” actually is, and then assuming that because part of a technical solution exists, that everyone should be using it and if they’re not it’s because of ignorance rather than choice. I think we all do this, at least I know I’m sometimes guilty, but it’s amusing when faced with what happens in the real world at scale, to jump to the conclusion that the world is wrong rather than to first question our own assumptions.
Personally, I think it’s funny to assume that ASCII == text. Obviously not all ASCII is “text” in the sense that most people will assume. When people say “text file” I assume it contains nothing that you can’t type on a physical typewriter, other than the annoying and persistent difference between LF and CRLF. ASCII has lots of characters you can’t type on a typewriter, and are not intended to print as a character.
But if you want to invent new “text” characters for a “text” file, the problem suddenly becomes not just having a char code, but how to easily type it, how to easily display it, how to teach everyone to recognize and use it, and how to standardize these things so everyone knows them. Personally at this point I probably wouldn’t call a file with ASCII chars 28..31 in them “text”. The ASCII characters haven’t solved the overall problem, they have created several more and bigger problems that remain unsolved, and are much easier to solve in practice by using a comma instead, which is why people aren’t using the special ASCII characters in practice.
Some notes from when the USV project tried using control characters:
> We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.
> First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.
> Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.
> Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content).
If people used these for ASCII delimited text, they'd have to not use them for anything else, like some other text format otherwise you might insert an entire ASCII delimited file into a text field of that other thing and break that other thing's parsing. You couldn't even insert part of a file into a string field in another ASCII-delimited file. You only get to use them once so they wouldn't be part of general purpose plain text and an ASCII delimited file wouldn't be a plain text file that you could treat in the same way as other text files, so it's effectively a binary format or has restrictions on what text characters can appear in its records without escaping - oh no, that was its entire value proposition!
They break all the time; the whole point is to have less pain.
Two invisible ASCII characters that evidently no one's really used for anything else (other than next-value/next-line) sound like an order of magnitude less pain than parsing the escaped escapes, along with the whole junk paste of LLM-generated markdown+code probably pasted in there.
You still have to escape the delimiters to be safe except now they're more rare so easier to forget about.
The bigger problem with CSV is all the inconsistent implementations. For example, some people want semicolons instead of commas because their culture uses commas as decimal points, so I suppose semicolon should really be the standard, if there was one.
Yep, the only advantage I see with using ASCII control characters is that you can save a few bytes depending on the content. To make this approach robust, escaping is still needed.
How big an issue is CSV format really? I work in bioinformatics where it seems like everything is one odd CSV-like format or another. In Python, I have access to tools like pandas, duckdb, and polars, which have detailed ingestion options and sometimes a sniffer. I can read part of a file and check in seconds if it looks right.
Dealing with the variety of formats certainly isn’t the bottleneck in my productivity. Is it for others? I’d be curious why.
I also work with foreign CSVs regularly. I'll have to try the Python Way next time I have a weird file to work with.
I typically use PowerShell to process the files from an unknown CSV format to a known one so it's easier to work with, and I've found it easy to use to iterate on.
Oh yeah that does sound challenging. If you’re interested, here’s my take on the three libraries I mentioned.
1. Pandas is more mature with much better batch reading of larger than memory CSV files than Polars. But it’s slower and the syntax is worse.
2. Polars is my go-to for one-off analysis of CSV files that fit in memory. When max performance isn’t a concern, sometimes I’ll iterate through the CSV using Pandas to get it in batches, then immediately convert to Polars to do any analysis. ChatGPT has been poisoned by Polars’ early syntax changes so it often makes mistakes, but Polars’ syntax is so clean and consistent this often doesn’t matter much as it’s easy to fix.
3. DuckDB is a different beast obviously as it’s a full database, not just a single dataframe. It’s slightly more setup, but it has a CSV sniffer, does out of memory processing really well (no need to batch iterate) and lets you use SQL. I’m not too experienced at SQL yet, and it’s nice that ChatGPT is really pretty good at creating complex SQL queries. I am now gravitating to DuckDB for any larger than memory processing that can be handled in SQL. If line by line streaming is needed for the algorithm I’m implementing then I still use pandas or the pandas+polars approach.
I’m working on a PWA which includes a dictionary search[1] feature and only a static web server (so no server-side database to optimize the search). I did want searching to work in offline mode anyway. I decided it was best to generate an index file which the users download on first visit. For some reason I found USV[2] to be the best fit for this. USV I think allows separating with ASCII control characters, but I used the Unicode variants (␟, ␞, and ␝).
I really liked this as it allowed me to add the glossary as an array in one of the columns. I wrote the parser myself which searches through the text structure, and it was simple enough. The reason I opted not to use a CSV or a TSV was that I didn’t want to deal with escaping surprise commas or tabs I would find in the dictionary data, plus the extra dimension was nice. Since the file is generated, I didn’t have to type the characters myself, so it had none of the downsides of this format honestly.
You can type them in the terminal by prefixing with Ctrl-V, so you can enter a record separator by pressing Ctrl-V, Ctrl-Shift-6. Typing Ctrl-\ is tricky because some programs interpret it to mean end of input, e.g. it exits the Python repl, but I don't think that one in particular is super important to type manually. In hindsight if these were assigned to Ctrl-<letter>, they would have been a lot easier to type and use.
Straightforward is completely subjective. But a comma is relatively much simpler in an absolute sense.
Ctrl+shift+6 is a 3-key chord, it’s potentially hard to discover (I can’t say I’ve actually ever seen it), it seems likely to be overridden by applications, and caret isn’t a natural separator and is more commonly used for other things, like exponents.
A comma is 1 key on the keyboard, and it’s already a natural separator; the very meaning of comma is separator. Note how many commas are used in this thread compared to the number of record separators. :P
Having to type both ctrl+shift+6 and ctrl+shift+minus a lot seems like a small physical and mental friction compared to using commas and returns each time a character is typed, that adds up to a lot of physical and mental friction over time. Enough that the eventual implication is that you need better tooling than a text editor provides in order to author delimited files, enough that it sort of undermines the idea of having a text file. It’s a mistake to think that because a key chord exists that it’s a solved problem, and a mistake to underestimate the value of making commonly used items as simple as possible, especially if it’s going to affect a lot of different people.
While I agree with the sentiment of that, it still seems a lot easier than escaping characters. I would probably opt for a mix. Control separation with new lines
Isn’t escaping a separate orthogonal issue? Or am I misunderstanding your point? Several people have pointed out that the special ascii field separator will have to be escaped if used within a field, just like a comma is. It seems like escaping is an issue either way, and aside from that, a comma is easier in practice than a special character at every level of interaction; discovery, typing, displaying, tooling, printing, standards, etc..?
I would concede that having the special characters inside fields will be less common than having commas inside CSV field is. I guess that is worth a lot even if it doesn’t fully solve the problem.
ctrl+_ to separate cells, ctrl+^ to separate rows - works perfectly in notepad++.
I think the proposal can be improved by using ctrl+^ followed by a newline as a row separator, it looks much more readable plus will allow various line-based CLI tools to be used unless there are newlines in the cells.
That’s the whole point of having field and record separators as distinct values in ASCII. There is no other valid use for them, so no escaping is necessary. Have you ever used ASCII value 30 for anything, anywhere, in your life?
> I have seen no end of CSV data that embeds CSV data in a field.
So in CSV [1] a record is separated by a CRLF (0x0D 0x0A[2]) and a field value is separated by a comma (0x2C). In ADT (ASC?), the record separator is 0x1E and the unit/field separator is 0x1F.
But there are two more separators defined: file (0x1C) and group (0x1D).
I'm not sure if it's defined anywhere, but if you wish to embed ADT data within an ADT file, and have it as part of the CSV-equivalent field (unit), you could say that:
after the 0x1E record separator, put a group separator (0x1D)
which will denote the beginning of ADT sequence which will
be treated as a unit value. The end of the value shall
("MUST"?) be denoted by another group separator, after which
a unit separator will indicate the next field.
The fact that there are four separation characters would allow for some to be used for embedding applications to tell parsers that a new 'level' of parsing is being done.
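A sketch of that embedding rule (single-level nesting only; the GS-wrapping convention follows the comment's proposal, the function names are my own):

```python
US, RS, GS = "\x1f", "\x1e", "\x1d"  # unit, record, group separators

def encode(rows):
    # Nested lists become GS-wrapped sub-tables (one nesting level only;
    # GS pairs can't themselves nest without further rules).
    def enc_field(f):
        return GS + encode(f) + GS if isinstance(f, list) else f
    return RS.join(US.join(enc_field(f) for f in row) for row in rows)

def _split(s, sep):
    # Split on sep, but not inside a GS...GS pair.
    parts, cur, inside = [], [], False
    for c in s:
        if c == GS:
            inside = not inside
            cur.append(c)
        elif c == sep and not inside:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(c)
    parts.append("".join(cur))
    return parts

def decode(blob):
    rows = []
    for rec in _split(blob, RS):
        row = []
        for f in _split(rec, US):
            if len(f) >= 2 and f[0] == GS and f[-1] == GS:
                row.append(decode(f[1:-1]))   # unwrap embedded sub-table
            else:
                row.append(f)
        rows.append(row)
    return rows

nested = [["a", [["x", "y"], ["z", "w"]]], ["b", "c"]]
assert decode(encode(nested)) == nested
```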
Out of curiosity, in which general circumstances do you tend to see this sort of data? A lot of the support I see for ASV/USV/whatever comes from the idea that this is negligibly rare, which has always sounded to me like a dangerous assumption for a general-purpose format.
Some (many?) years ago i wrote a FOSS tool for messing with CSV called CSVfix. People used to mail me CSV files and ask how to parse them. Also, in my general consulting/contracting work I came across all sorts of weird stuff, a lot of it machine-generated.
Sometimes the content of a cell in e.g. a CSV file will be another CSV file. This inevitably happens with pretty much every format, whether CSV, TSV, JSON, etc, so you'll need some sort of quoting rule no matter what you do.
But that’s my point. The person I responded to was objecting that they might want to use control characters like ASCII 30 as values in a field. As you and I agree, that will never happen.
> There is no other valid use for them, so no escaping is necessary.
Famous last words...
> Have you ever used ASCII value 30 for anything, anywhere, in your life?
No but if this format took off I would be.
And then because text editors start displaying them so the files make sense, people start using them for other purposes, like separating values on a line or copying spreadsheet cells to the clipboard, because it's more "semantic", and those lines get pasted into a field in a tool that exports its data as CSV.
This is a bit silly. Any modern text editor (whether vim or VSCode or BBEdit or Notepad++ or whatever) is capable of displaying control characters and of copy/pasting them. Keyboard shortcuts for inserting any characters whatsoever are easy to add. And even with CSV files, if you’re editing them by hand rather than manipulating them with code, you’re probably doing it wrong.
All of this is easy (use the proper editor, configure it for this particular weirdass situation, do something other than the thing you want to do, etc) in a way that’s exactly analogous to ‘you can spin up your own dropbox over the weekend with ftp and rsync’.
I assert that this is still not convenient or scalable. You will need to mentally parse each number (two characters) to ensure it is correct. Compare this with a simple glyph (i.e. a single comma) which is easy to eyeball.
"Then you have a text file format that is trivial to write out and read in, with no restrictions on the text in fields or the need to try and escape characters."
Not being a "developer", I have been productively using these non-printing separators for personal use as a UNIX-like OS and text-only internet user for close to three decades. Of course I have a bias for ASCII and against Unicode and I only use the English language for computing. Perhaps this is why using the ASCII charactors, including the record and file separators, work so well for me.
Using ASCII non-printing separators might not work for everybody but it would be false to assume it will not work for anybody.
Historically ASCII worked for some computer users. It still does today. For those who still use it like myself.
The author states, "The most anoying[sic] thing about the whole problem is that it was solved by design in the ASCII character set."
"Developers" might not use the ASCII solution but that does not prevent other computer owners from using it.
That assumes you control both the producer and consumer. But if you're doing CSV, it's likely because you're looking to integrate with someone else's system. So you have to deal with whatever they're doing that they call CSV. And if "they" are "all your customers", you're going to encounter every weird quirk of different system's CSV parsing from that guy who just used
String.split(",").map(it.replace("\"\"", "\""))
to Spark insisting that backslash escapes exist in CSV.
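The `split(",")` approach above fails as soon as a field legitimately contains a quoted comma, which Python's `csv` module handles correctly:

```python
import csv
import io

line = 'id,"hello, world","say ""hi"""\n'

# Naive split-and-unescape, in the spirit of the snippet above.
naive = [f.replace('""', '"') for f in line.strip().split(",")]

# A real RFC-4180-style parser.
proper = next(csv.reader(io.StringIO(line)))

assert proper == ["id", "hello, world", 'say "hi"']
assert naive != proper  # the quoted comma splits the field in two
```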
For extra fun consider the German-speaking world, where CSV files are actually semicolon-separated but everyone still calls them CSV and looks at you like you drooled on yourself when you point out that “;” is not a “,”.
This appears to be because we use “,” as decimal separator and were too dense to learn how to use " properly in CSV.
How do you know the variant? If the user is just using an "import csv" on the web form, how do you know? You can't even ask the user because it's not even clear they'll know, it's just the CSV they got from Jira, their other vendor, whatever.
And then the other direction is even worse, when you emit a RFC compliant CSV and the other party complains their batch job chokes on "" escaped quotes, etc., so you end up holding a mapping of clients -> "CSV" formats
I remember Klarna using ", " as their separator. Not ",". There had to be a space as well, which most CSV parsers cannot handle. So when they gave us a CSV file with currency amounts - and Swedish kronor uses "," as the decimal separator - you'd get some fun results. Pretty much every CSV parser we tried would assume that kronor and øre were two separate fields.
you don't always control how csv files are made. Most of the time you are just given them, and this is when you start pulling your hair.
CSV is a terrible, terrible format, because it fails in too many use cases.
I've used these when I've had some code with thousands of strings. I concatenated them with the ASCII separators in the source code, then called String.split as needed. The speedup was noticeable, probably since the runtime choked on instantiating so many strings at one time when launched.
But really you could have used any other character that wasn't going to appear in your strings, especially a visible one like “␟” U+241F "Symbol For Unit Separator".
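A quick sketch of that approach with the visible U+241F glyph (one code point to split on, though it costs three bytes per separator in UTF-8):

```python
SEP = "\u241f"  # "␟" SYMBOL FOR UNIT SEPARATOR: printable, visible

# Pack many strings into one literal, split them back on demand.
packed = SEP.join(["one", "two", "three"])
assert packed.split(SEP) == ["one", "two", "three"]

assert len(SEP) == 1                      # a single code point
assert len(SEP.encode("utf-8")) == 3      # but three bytes on the wire
```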
I actually used [U+263A] and [U+263B] for this purpose, ignorantly (in good faith), in pro/gov/civ projects, not realizing the canonized name wasn't "Smiley/Inverted Smiley" at the time, which may have been an oversight.
Ironically, I looked for that very control character, and I think it may not have worked with Excel/Clipboard, so was a no-go for biz ops.
I never understood how people were able to abbreviate the U S into ␟ without triggering the USA flag, like on youtube.
After reading historical/proposed Unicode RFCs, having scrolled past every glyph that could combine into grapheme clusters and fuzzed Unicode input on many systems...
today I am humbled to learn that ␟ is not nor US, but the exact unit by which its own proliferation would itself de-nomen-ize itself.
I dug up where I used it. It was a bible in HTML and JS. At first, it was using arrays of arrays of strings (for chapters and verses), but I refactored it to really long strings for each book with those separators. The entire bible is one JSON object in the source, keyed by book and the values are those really long strings.
Splitting, even if it's not re-concatenated (unlikely in practice), would still benefit from cache optimization / lack of garbage-collection overhead / fewer resulting bytes actually in RAM, no?
All we need is native Excel support, and HTML5 web support. In web browsers it should be the default copy formatting, and if you’re writing an HTML document these characters should be an alternative to using TD and TR tags.
There are several formats like this, the most well-known is probably canonical S-expressions. Or, if I'm being sadistic, PDF is somewhat like this, too.
For the unlikely event that you are dealing with data with the metacharacters: qsv will use some other control character as the “quote” character to deal with that.
I think this would catch on much more quickly if text editors treated the Record Separator character as a new line, and there was a special character for the Unit Separator.
people saying \034 / \035 are not readable / printable so they don't make good human readable delimiters: make it ,\034 and \n\035. looks like csv, but is actually ascii delimited. just remove last character from all entries.
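A sketch of that hybrid scheme; splitting on the full two-character delimiters avoids even the "remove last character" step (delimiter choices as proposed in the comment, function names mine - and fields still must not contain the control characters themselves):

```python
# Octal \034 and \035 from the comment, paired with visible characters
# so the file renders like CSV in a plain text editor.
FS, RS = ",\x1c", "\n\x1d"

def write(rows):
    return RS.join(FS.join(row) for row in rows)

def read(blob):
    # Splitting on the two-char delimiters means a bare comma or a bare
    # newline inside a field is harmless.
    return [rec.split(FS) for rec in blob.split(RS)]

rows = [["a", "b,c"], ["line\nbreak", "d"]]
assert read(write(rows)) == rows
```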