Hacker News
ASCII Delimited Text – Not CSV or Tab Delimited Text (ronaldduncan.wordpress.com)
114 points by bonsai_spool on Nov 10, 2024 | 117 comments


> with no restrictions on the text in fields or the need to try and escape characters.

Maybe I'm missing something, but wouldn't it still need escaping for those ASCII separator characters (or alternatively, a restriction for the stored text not to have them)?

It's true that having to deal with escaping much less often (since the ASCII separator characters are rarer than commas/quotes) would be convenient for manual reading/writing, but I feel that's canceled out by the characters being hard to type/see (likely the reason why they're rare) - and it wouldn't necessarily save on writer/parser code complexity.
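To make the concern concrete, here's a quick Python sketch (my own illustration, not from the article) of a naive ASCII-delimited writer/reader and how it silently breaks when a field happens to contain a separator:

```python
# Hypothetical sketch: ASCII-delimited text using the unit separator
# (0x1F) between fields and the record separator (0x1E) between records.
US, RS = "\x1f", "\x1e"

def write_adt(rows):
    return RS.join(US.join(fields) for fields in rows)

def read_adt(blob):
    return [record.split(US) for record in blob.split(RS)]

rows = [["name", "note"], ["alice", "hello"]]
assert read_adt(write_adt(rows)) == rows

# But a field containing a separator round-trips incorrectly -
# exactly the escaping problem being discussed:
bad = [["contains \x1f a separator"]]
assert read_adt(write_adt(bad)) != bad
```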


See, this is why I once used moon-viewing-ceremony-separated-values (MVCSV). The Moon Viewing Ceremony emoji was unlikely to show up in my dataset, and not only is the emoji visible, it's quite visually pleasing.


I’m now free-falling down a moon viewing ceremony rabbit hole of emoji history, and enjoying the ride!


Wait until you expand into the Japanese market and all your users are talking about 月見


Not if you just say those characters are invalid data. I first heard about them decades ago, but I don't think I have ever once seen them in use.

The real problem is that there is no easy universal way to type them with a keyboard. So it would require software interfaces in the application, and at that point it’s basically binary.


> Not if you just say those characters are invalid data

The author's claim is "with no restrictions on the text".

It's easy if you can forbid certain characters, but then you can't store arbitrary text (e.g: filepaths, or scraped comments).


but these specific characters are not text. they exist solely to be delimiters.

It would be like trying to escape a column in your spreadsheet.


> but these specific characters are not text. they exist solely to be delimiters.

Even if people used them only for their intended purpose, someone could use them as delimiters within the text you want to store (e.g: list of tags in filename) - unless I'm misunderstanding.

> It would be like trying to escape a column in your spreadsheet.

Other formats do allow escaping their delimiters, so that you can use that character literally or even nest a string of that format within an entry.


Right, nesting wouldn’t be possible. But there is never any reason to use these characters literally, they are just delimiters.


> But there is never any reason to use these characters literally, they are just delimiters.

I can put []* in my comment (maybe because I'm demonstrating the format, referencing the characters, or just being capricious), and now someone scraping and storing comments has a need to use those characters literally. Sometimes fine to ignore certain content or store it lossily, but often not.

*: (copy-paste between the brackets into https://bobpritchett.com/unicode-inspector)


yes, you would still need to clean your inputs before randomly adding it to your table. Your contrived example brings me back to my original assertion that as long as you’re ok with those characters not being valid data it works fine. So, sure if someone really wanted to store those two literal non-visible characters in a text file that would not work. Everyone else could just not do that.


> yes, you would still need to clean your inputs before randomly adding it to your table.

Lossy is fine in some cases, but in many cases you do actually need the specific text you're trying to store - not just something similar to it. Hence my objection to "never any reason to use these characters literally".

> Your contrived example [...] if someone really wanted to store those two literal non-visible characters in a text file

Needing to store these specific characters is rare, but needing to store arbitrary text (possibly from adversarial/mischievous parties, or just a large enough dataset that encountering all edge-cases is inevitable) is common. For instance, for security reasons a log shouldn't break or have a blindspot for folders with those characters.

> as long as you’re ok with those characters not being valid data it works fine

Which is what I'm saying in my original comment with "or alternatively, a restriction for the stored text not to have them"


if you’re storing arbitrary text from untrusted sources you will always need to clean it first. Plus in those instances you’ll probably want a db or json or whatever works well with your language anyway.

I guess I’m saying that if this had caught on early with ubiquitous support it could have saved us from the mess that is csv/tsv/etc. It wouldn’t have negated the need for more advanced storage and serialization formats.


> if you’re storing arbitrary text from untrusted sources you will always need to clean it first

Reversible escaping of characters is pretty common (though not always; length-before-text formats don't require it). But to "clean" as in deleting characters such that you can no longer get back to the original string is definitely not required for all formats, and is a fairly undesirable property.

> Plus in those instances you’ll probably want a db or json or whatever works well with your language anyway.

You'd want to use some format that doesn't have the problem this one has, yeah. IMO ASCII delimited text just isn't really anywhere on the Pareto front of formats you'd want to use - it's unpleasant to work with manually, and once you're writing the file through code or a tabular editor you may as well use a format that can handle arbitrary text.

> I guess I’m saying that if this had caught on early with ubiquitous support it could have saved us from the mess that is csv/tsv/etc

I think you could say the same of RFC 4180. In reality, I don't see why this wouldn't also spawn dialects, like people adding newlines between rows so they can open it in a text editor without it being in one huge long line, or inventing an escaping scheme so that it can handle arbitrary text.


I feel like this (and some of the replies to this) is missing the point a bit.

I don’t think the goal was to make a bullet-proof delimiter that fails at nothing.

The goal was to solve the problem of not allowing things like commas, quotes, newlines, tabs, pipes, etc. in text files.

I feel like using the proposed ASCII characters would eliminate these limitations, while also allowing a machine-creatable and machine-readable format (emphasis on machine as opposed to human).

Yes, it would still be tough for a human to type or read these delimiters, so in that case, go with traditional CSV or TSV (or MVCSV!).

But if you only need to use a machine to create/read the text, this sounds like a great solution, allowing all of the normal characters you might see in text.


If you need a machine-readable format, why not go with escaping like most other formats, or length-before-text, to include all characters - instead of a format that fails on some (albeit rare) characters?


Both of those are fine, but they add additional complexities (even if small) where there is very little, if any, complexity added with using these two characters as delimiters.


For machine-readable formats, I'd argue length-before-text is simpler to parse than splitting by separators - even before having to add extra application logic to handle this method being fallible (e.g: if you want to use it to store filenames, you now need a check and pop-up about some OS-valid filenames not being supported).


> For machine-readable formats, I'd argue length-before-text is simpler to parse than splitting by separators

Yes.

> even before having to add extra application logic to handle this method being fallible (e.g: if you want to use it to store filenames, you now need a check and pop-up about some OS-valid filenames not being supported).

I don't see why you'd need that. CSV does not have anything like that.

That's a higher-level concern. This ascii-delimited format (like CSV) is supposed to be a stupid row/column format. And also simpler to implement than CSV.


> CSV does not have anything like that.

If you're using a fallible CSV dialect, your application does need to handle that case in some way (or in some cases it may be fine just to let it crash). Something like length-before-text is convenient because you don't have to worry about that case.


Yup exactly, it just pushes the problem around, without solving it.

The delimiters can occur in binary data, or when there's nesting -- trying to store TSV in TSV, or JSON in JSON, etc. The latter definitely happens a lot, for better or worse.


The title says ASCII Delimited Text not ASCII Delimited Binary Data.

For the purposes of CSV, I consider text to be anything that satisfies the regex ^\P{Cc}+$ (https://www.compart.com/en/unicode/category/Cc) and I normally strip chars in that category before saving some text (for single-line text). ^[\p{Cc}&&[^\n]]+$ is a regex that can be used to strip all control chars except for the newline.
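For anyone without a regex engine that supports \p{Cc} (Python's stdlib `re` doesn't, for instance), the same stripping can be done by category lookup - a small illustrative sketch:

```python
import unicodedata

def strip_cc(text, keep_newline=False):
    """Remove Unicode category Cc (control) characters, optionally
    keeping '\\n' - equivalent in spirit to the \\p{Cc} regexes above."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or (keep_newline and ch == "\n")
    )

assert strip_cc("a\x1fb\tc") == "abc"          # US and TAB are both Cc
assert strip_cc("a\nb", keep_newline=True) == "a\nb"
```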


Thanks for that. Handy. I think I’ll have use for that myself.


You can disallow those metacharacters in the data proper. Then you have a format that can store any utf8 or whatever except the non-whitespace control codes without any escaping. That solves a problem in an opinionated way. Just like how json is opinionated (utf8 only).

You can convert to another format if you need something crazier than rows and columns consisting of normal text.


That already exists -- TSV. You disallow the tab metacharacter.


Then I don't understand. Like the sibling comment said the really "problematic" character in TSV is the line feed. But a tab can occur as well.

The format that I described does not already exist in the form of TSV. And further based on your original comment I would have thought that both TSV and this format would be discarded as not-useful.


And the newline character. Both of which commonly occur in normal text.


Thank you for writing that complaint out so I don't have to. It solves nothing.


I like this format best of all; CSV is my #2 favorite.

At work we settled on using ^G (0x07) as a delimiter instead of TABs for file transfers and loading data into various databases.

The reason was Excel. People/systems who create these files sometimes source from Excel. And Excel can have a habit of placing odd characters in text fields. We found the one character never encountered was BEL.

For text fields we tend to remove embedded whitespace, after first replacing TABs with a single space.
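A sketch of that cleaning step as I understand it (my own illustration of the described approach, not their actual code):

```python
BEL = "\x07"  # 0x07, the ^G delimiter described above

def to_bel_row(fields):
    # Replace TABs with a single space first, then join the
    # fields with BEL as the delimiter.
    cleaned = [f.replace("\t", " ") for f in fields]
    return BEL.join(cleaned)

assert to_bel_row(["a\tb", "c"]) == "a b\x07c"
```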


This sounds like a noisy format.


Silently ignore it.


In the early 2000s, back at the beginning of the world, Yahoo's web code used ^A and ^B for field and record separators to avoid having to escape commas and quotes and newlines. That was probably the last time I ever saw ASCII control characters used as intended in the wild.

There is no technical reason why CSV should have won out, except that keyboards have a comma key and almost never a ^A key.


That's a huge technical obstacle for most people though. The whole point of XSV formats is to be human-editable. There are better formats for computer-to-computer records. If you can't type the core delimiters on a keyboard, your format is going to lose out despite any of its other benefits.


It's an editor issue then. "Back in the old days", people used to understand how to input ^A and ^B. Showing these characters is also only a mere addition to the character set. Sure, there is inertia to change, but even rich text format is/was supported by Windows.

Being unable to deal with this is laziness on the part of the developer that spilled over into the user being unable to deal with it. This is nothing a user can't be trained on, and I'd argue it makes more sense than weird escaping sequences in the event you actually do want a ",".

P.S.: But then again, with proper editors the escaping issue vanishes - and no, I do not mean IDEs. Lots of people decided it was worth it to support RTF; I figure the decision to support 2 additional characters in a user-friendly way is way easier.


That doesn't need to be an obstacle, a graphical editor that lets you click a button to add a row/column exists for nearly every other tabular format. It only matters if you want to edit the file using a plaintext editor. If the format were popular, shortcuts would be created to enter the delimiters in many plaintext editors too, which is a chicken/egg problem but let's not kid ourselves into thinking it's not a solvable problem.


If you're using a bespoke editor anyway, why are you restricting yourself to a texty format? XSV formats inhabit this weird space halfway between data exchange and human readable. They're a compromise. If you go full data exchange you might as well use something even better suited for the job.


You don't really have to have a bespoke editor, you just have to display the control characters sanely in an otherwise normal plaintext editor and add a shortcut or menu item to insert the control characters.


The FIX financial protocol still uses ^A.


And how do we escape those characters? With ESC (27)? Inside a SI/SO (15/14) pair?

I think CSV or TSV are good enough. People keep trying to find a format where you can separate the records and fields with a simple string.split and there's no need to contemplate escapes.

But that's not possible, no matter the format you'll have to parse it right. And then, a format that uses visual delimiters has the obvious advantage of being editable with any text editor.


Kind of a short-sighted take. Sticking special characters that (in many early editors) would be invisible into your data complicates development and maintenance. Even tabs have a visual, albeit inconsistent (if your editor wants to align columns for you), manifestation you can work with.

Technically, XML is superior for data representation on many fronts. But likewise, it is an absolute PITA to maintain without significant editor support.

It is no accident that CSV/tabs 'won'.


Seems like a tooling problem though. I don't think it would be that difficult to have editors draw them in a readable way.


If you are okay with needing tooling to be able to edit half decently, then the problem it "fixes" in CSV doesn't exist to begin with.


Yeah, and many devs overlook that with separator, enclosure, and escape char specified and used, CSV even supports newlines in its cell values.


> The most anoying thing about the whole problem is that it was solved by design in the ASCII character set.

This is a great example of not understanding what “the problem” actually is, and then assuming that because part of a technical solution exists, that everyone should be using it and if they’re not it’s because of ignorance rather than choice. I think we all do this, at least I know I’m sometimes guilty, but it’s amusing when faced with what happens in the real world at scale, to jump to the conclusion that the world is wrong rather than to first question our own assumptions.

Personally, I think it’s funny to assume that ASCII == text. Obviously not all ASCII is “text” in the sense that most people will assume. When people say “text file” I assume it contains nothing that you can’t type on a physical typewriter, other than the annoying and persistent difference between LF and CRLF. ASCII has lots of characters you can’t type on a typewriter, and are not intended to print as a character.

But if you want to invent new “text” characters for a “text” file, the problem suddenly becomes not just having a char code, but how to easily type it, how to easily display it, how to teach everyone to recognize and use it, and how to standardize these things so everyone knows them. Personally at this point I probably wouldn’t call a file with ASCII chars 28..31 in them “text”. The ASCII characters haven’t solved the overall problem, they have created several more and bigger problems that remain unsolved, and are much easier to solve in practice by using a comma instead, which is why people aren’t using the special ASCII characters in practice.


Some notes from when the USV project tried using control characters:

> We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.

> First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.

> Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.

> Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content).

https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...


If people used these for ASCII-delimited text, they'd have to not use them for anything else, like some other text format; otherwise you might insert an entire ASCII-delimited file into a text field of that other thing and break its parsing. You couldn't even insert part of a file into a string field in another ASCII-delimited file. You only get to use them once, so they wouldn't be part of general-purpose plain text, and an ASCII-delimited file wouldn't be a plain text file that you could treat the same way as other text files. So it's effectively a binary format, or has restrictions on what text characters can appear in its records without escaping - oh no, that was its entire value proposition!


they break all the time; the whole point is to have less pain.

2 invisible ASCII characters that evidently no one's really used for anything else (other than next-value/next-line) sounds like an order of magnitude less pain than parsing escaped escapes, along with the whole junk paste of LLM-generated markdown+code probably dumped in there.


You still have to escape the delimiters to be safe except now they're more rare so easier to forget about.

The bigger problem with CSV is all the inconsistent implementations. For example, some people want semicolons instead of commas because their culture uses commas as decimal points, so I suppose semicolon should really be the standard, if there was one.


The fact that CSV is still strong is that it already covers all „shortcomings” (I.e. presence of quotations in the content) mentioned by this article.


Yep, the only advantage I see with using ASCII control characters is that you can save a few bytes depending on the content. To make this approach robust, escaping is still needed.


How big an issue is CSV format really? I work in bioinformatics where it seems like everything is one odd CSV-like format or another. In Python, I have access to tools like pandas, duckdb, and polars, which have detailed ingestion options and sometimes a sniffer. I can read part of a file and check in seconds if it looks right.

Dealing with the variety of formats certainly isn’t the bottleneck in my productivity. Is it for others? I’d be curious why.


The Python csv module has a dialect option.

https://docs.python.org/3/library/csv.html


I also work with foreign CSVs regularly. I'll have to try the Python Way next time I have a weird file to work with.

I typically use PowerShell to process the files from a unknown CSV format to a known one so it's easier to work with, and I've found it easy to use to iterate on.


Oh yeah that does sound challenging. If you’re interested, here’s my take on the three libraries I mentioned.

1. Pandas is more mature with much better batch reading of larger than memory CSV files than Polars. But it’s slower and the syntax is worse.

2. Polars is my goto for one off analysis of CSV files that fit in memory. When max performance isn’t a concern, sometimes I’ll iterate through the CSV using Pandas to get it in batches, then immediately convert to Polars to do any analysis. ChatGPT has been poisoned by Polars’ early syntax changes so it often makes mistakes, but Polars’ syntax is so clean and consistent this often doesn’t matter much as it’s easy to fix.

3. DuckDB is a different beast obviously as it’s a full database, not just a single dataframe. It’s slightly more setup, but it has a CSV sniffer, does out of memory processing really well (no need to batch iterate) and lets you use SQL. I’m not too experienced at SQL yet, and it’s nice that ChatGPT is really pretty good at creating complex SQL queries. I am now gravitating to DuckDB for any larger than memory processing that can be handled in SQL. If line by line streaming is needed for the algorithm I’m implementing then I still use pandas or the pandas+polars approach.


This is awesome, thanks for the advice! I'll definitely give these tools a shot for my next import job.


I’m working on a PWA which includes a dictionary search[1] feature and only a static web server (so no server-side database to optimize the search). I did want searching to work in offline mode anyway, so I decided it was best to generate an index file which users download on first visit. For some reason I found USV[2] to be the best fit for this. USV, I think, allows separating with ASCII control characters, but I used the Unicode variants (␟, ␞, and ␝).

I really liked this as it allowed me to add the glossary as an array in one of the columns. I wrote the parser myself, which searches through the text structure, and it was simple enough. The reason I opted not to use a CSV or a TSV was that I didn’t want to deal with escaping surprise commas or tabs I would find in the dictionary data, plus the extra dimension was nice. Since the file is generated, I didn’t have to type the characters myself, so it had none of the downsides of this format, honestly.

1: https://shodoku.app/dictionary

2: https://github.com/SixArm/usv


The shortcoming of using the control characters is that there is no easy way to type them on a keyboard. You can trivially edit csv in a text editor.


Technically there is (not sure about easy, most of these require ctrl+shift, but they are on the keyboard):

    CTRL   DEC HEX CHR NAME
    Ctrl-\  28  1C  FS File Separator (Right Arrow)
    Ctrl-]  29  1D  GS Group Separator (Left Arrow)
    Ctrl-^  30  1E  RS Record Separator (Up Arrow)
    Ctrl-_  31  1F  US Unit Separator (Down Arrow)
From https://www3.rocketsoftware.com/bluezone/help/v42/en/bzadmin...

You can type them in the terminal by prefixing with Ctrl-V, so you can enter a record separator by pressing Ctrl-V, Ctrl-Shift-6. Typing Ctrl-\ is tricky because some programs interpret it to mean end of input, e.g. it exits the Python repl, but I don't think that one in particular is super important to type manually. In hindsight if these were assigned to Ctrl-<letter>, they would have been a lot easier to type and use.


ctrl-letters are used for other things


It looks like it would just be ctrl+^, which seems pretty straightforward.


Straightforward is completely subjective. But a comma is relatively much simpler in an absolute sense.

Ctrl+shift+6 is a 3-key chord, it’s potentially hard to discover (I can’t say I’ve actually ever seen it), it seems likely to be overridden by applications, and caret isn’t a natural separator and is more commonly used for other things, like exponents.

A comma is 1 key on the keyboard, and it’s already a natural separator; the very meaning of comma is separator. Note how many commas are used in this thread compared to the number of record separators. :P

Having to type ctrl+shift+6 and ctrl+shift+minus a lot seems like only a small physical and mental friction compared to using commas and returns, but that friction is paid on every keystroke and adds up over time. Enough that the eventual implication is that you need better tooling than a text editor provides in order to author delimited files, enough that it sort of undermines the idea of having a text file. It’s a mistake to think that because a key chord exists the problem is solved, and a mistake to underestimate the value of making commonly used items as simple as possible, especially if it’s going to affect a lot of different people.


While I agree with the sentiment of that, it still seems a lot easier than escaping characters. I would probably opt for a mix: control separators combined with newlines.


Isn’t escaping a separate orthogonal issue? Or am I misunderstanding your point? Several people have pointed out that the special ascii field separator will have to be escaped if used within a field, just like a comma is. It seems like escaping is an issue either way, and aside from that, a comma is easier in practice than a special character at every level of interaction; discovery, typing, displaying, tooling, printing, standards, etc..?

I would concede that having the special characters inside fields will be less common than having commas inside CSV field is. I guess that is worth a lot even if it doesn’t fully solve the problem.


ctrl+_ to separate cells, ctrl+^ to separate rows - works perfectly in notepad++.

I think the proposal can be improved by using ctrl+^ followed by a newline as a row separator, it looks much more readable plus will allow various line-based CLI tools to be used unless there are newlines in the cells.
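A parser for that variant is only a few lines - an illustrative Python sketch of the proposed RS-plus-newline row terminator (my own sketch, not any existing tool):

```python
US, RS = "\x1f", "\x1e"

def parse(text):
    # Rows end with RS followed by a newline; the newline is cosmetic,
    # so line-based CLI tools keep working. Cells are split on US.
    rows = text.split(RS + "\n")
    if rows and rows[-1] == "":
        rows.pop()                       # drop the trailing terminator
    return [row.split(US) for row in rows]

sample = "a\x1fb\x1e\nc\x1fd\x1e\n"
assert parse(sample) == [["a", "b"], ["c", "d"]]
```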


> no easy way to type them on a keyboard

or probably to see them on the screen in any sensible form.

Also, what to do if you want to embed those ctrl characters in a field? you are likely back to the way that CSV does it with quotes, commas and CRLFs.


That’s the whole point of having field and record separators as distinct values in ASCII. There is no other valid use for them, so no escaping is necessary. Have you ever used ASCII value 30 for anything, anywhere, in your life?


I have seen no end of CSV data that embeds CSV data in a field. I'm sure this will happen for whatever character format you pick.


> I have seen no end of CSV data that embeds CSV data in a field.

So in CSV [1] a record is separated by a CRLF (0x0D 0x0A [2]) and a field value is separated by a comma (0x2C). In ADT (ASC?), the record separator is 0x1E and the unit/field separator is 0x1F.

But there are two more separators defined: file (0x1C) and group (0x1D).

I'm not sure if it's defined anywhere, but if you wish to embed ADT data within an ADT file, and have it as part of the CSV-equivalent field (unit), you could say that:

    after the 0x1E record separator, put a group separator (0x1D)
    which will denote the beginning of ADT sequence which will
    be treated as a unit value. The end of the value shall
    ("MUST"?) be denoted by another group separator, after which
    a unit separator will indicate the next field.
The fact that there are four separation characters would allow for some to be used for embedding applications to tell parsers that a new 'level' of parsing is being done.

[1] https://datatracker.ietf.org/doc/html/rfc4180

[2] https://www.man7.org/linux/man-pages/man7/ascii.7.html


Out of curiosity, in which general circumstances do you tend to see this sort of data? A lot of the support I see for ASV/USV/whatever comes from the idea that this is negligibly rare, which has always sounded to me like a dangerous assumption for a general-purpose format.


Some (many?) years ago I wrote a FOSS tool for messing with CSV called CSVfix. People used to mail me CSV files and ask how to parse them. Also, in my general consulting/contracting work I came across all sorts of weird stuff, a lot of it machine-generated.


No, no one has ever used that character, because it can't be typed or displayed - which is the entire point of the thread you're replying to.


Sometimes the content of a cell in e.g. a CSV file will be another CSV file. This inevitably happens with pretty much every format, whether CSV, TSV, JSON, etc, so you'll need some sort of quoting rule no matter what you do.


But that’s my point. The person I responded to was objecting that they might want to use control characters like ASCII 30 as values in a field. As you and I agree, that will never happen.


What about the cases where people don’t want to use it but still do?

People C&P values from a source that includes those characters, and they'll break your export file.


> There is no other valid use for them, so no escaping is necessary.

Famous last words...

> Have you ever used ASCII value 30 for anything, anywhere, in your life?

No but if this format took off I would be.

And then because text editors start displaying them so the files make sense, people start using them for other purposes, like separating values on a line or copying spreadsheet cells to the clipboard, because it's more "semantic", and those lines get pasted into a field in a tool that exports its data as CSV.

What then?


And equally as important, you can cat a CSV file and easily understand it.


Or concatenate them, or diff or grep.


This is a bit silly. Any modern text editor (whether vim or VSCode or BBEdit or Notepad++ or whatever) is capable of displaying control characters and of copy/pasting them. Keyboard shortcuts for inserting any characters whatsoever are easy to add. And even with CSV files, if you’re editing them by hand rather than manipulating them with code, you’re probably doing it wrong.


All of this is easy (use the proper editor, configure it for this particular weirdass situation, do something other than the thing you want to do, etc) in a way that’s exactly analogous to ‘you can spin up your own dropbox over the weekend with ftp and rsync’.


Maybe you with a hex editor that shows dual pane: hex and text.


I assert that is still not convenient or scalable. You would need to mentally parse each number (two hex characters) to ensure it is correct. Compare this with a simple glyph (i.e. a single comma), which is easy to eyeball.


"Then you have a text file format that is trivial to write out and read in, with no restrictions on the text in fields or the need to try and escape characters."

Not being a "developer", I have been productively using these non-printing separators for personal use, as a UNIX-like OS and text-only internet user, for close to three decades. Of course I have a bias for ASCII and against Unicode, and I only use the English language for computing. Perhaps this is why the ASCII characters, including the record and file separators, work so well for me.

Using ASCII non-printing separators might not work for everybody but it would be false to assume it will not work for anybody.

Historically ASCII worked for some computer users. It still does today, for those who still use it, like myself.

The author states, "The most anoying[sic] thing about the whole problem is that it was solved by design in the ASCII character set."

"Developers" might not use the ASCII solution but that does not prevent other computer owners from using it.


I sometimes use them for machine to machine transfer. The biggest problem is that regular editors don't handle it in a sensible way.


CSV isn‘t that complicated if done right.

1. if a value includes the field separator, the row separator, or the text qualifier, surround the value with the text qualifier.

2. if the value contains the text qualifier, double it within the value.
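Those two rules are essentially RFC 4180, and e.g. Python's stdlib csv module applies them by default - a quick sanity check:

```python
import csv, io

buf = io.StringIO()
writer = csv.writer(buf)                 # default dialect quotes only when needed
writer.writerow(['plain', 'has,comma', 'has "quote"', 'multi\nline'])
encoded = buf.getvalue()

# Rule 1: values containing the delimiter, quote char, or a newline
# get wrapped in the text qualifier.
assert '"has,comma"' in encoded
# Rule 2: embedded qualifiers are doubled.
assert '"has ""quote"""' in encoded

# And it round-trips, newline in the cell included:
row = next(csv.reader(io.StringIO(encoded)))
assert row == ['plain', 'has,comma', 'has "quote"', 'multi\nline']
```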


That assumes you control both the producer and consumer. But if you're doing CSV, it's likely because you're looking to integrate with someone else's system. So you have to deal with whatever they're doing that they call CSV. And if "they" are "all your customers", you're going to encounter every weird quirk of different systems' CSV parsing, from that guy who just used

    String.split(",").map(it.replace("\"\"", "\""))
to Spark insisting that backslash escapes exist in CSV.


That's the real problem: those systems claim to read or write CSV, but in reality it's a bastard form of CSV.

Never had a problem when both sides know the rules.

But I doubt those other systems even support ASCII separated value files.


It’s all bastard CSV out here in the real world.

For extra fun, consider the German-speaking world, where CSV files are actually semicolon-separated but everyone still calls them CSV and looks at you like you drooled on yourself when you point out that ";" is not a ",".

This appears to be because we use "," as the decimal separator and were too dense to learn how to use " properly in CSV.

Bastard CSVs as far as the eye can see.


It's easier in most cases to just use a parser flexible enough for you to specify whatever variant the producer actually emitted.


How do you know the variant? If the user is just using an "import csv" on the web form, how do you know? You can't even ask the user because it's not even clear they'll know, it's just the CSV they got from Jira, their other vendor, whatever.

And then the other direction is even worse, when you emit a RFC compliant CSV and the other party complains their batch job chokes on "" escaped quotes, etc., so you end up holding a mapping of clients -> "CSV" formats


But it's so rarely done right.

I remember Klarna using ", " as their separator. Not ",". There had to be a space as well, which most CSV parsers cannot handle. So when they gave us a CSV file with currency amounts, and Swedish kronor uses "," as the decimal separator, you'd get some fun results. Pretty much every CSV parser we tried would assume that kronor and øre were two separate fields.


I doubt ASV would be done right more often


You don't always control how CSV files are made. Most of the time you are just given them, and this is when you start pulling your hair out. CSV is a terrible, terrible format, because it fails in too many use cases.


CSV doesn't fail; they fail to handle CSV.

And I doubt those who create wrong CSVs can even handle ASVs.

CSV can be read and edited with any text editor.


I've used these when I've had some code with thousands of strings. I concatenated them with the ASCII separators in the source code, then called String.split as needed. The speedup was noticeable, probably since the runtime choked on instantiating so many strings at one time when launched.
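The trick, roughly, in Python terms (the speedup claim is the commenter's; the Unit Separator as the join character is an assumption):

```python
US = "\x1f"  # ASCII Unit Separator used as the in-literal delimiter
# One long literal in the source instead of thousands of short ones...
BOOK = "verse one\x1fverse two\x1fverse three"
# ...split once at runtime, only when the individual strings are needed.
verses = BOOK.split(US)
assert verses[2] == "verse three"
```

The runtime only has to intern a single source literal; the short strings are materialized in one `split` call instead of one allocation per literal at load time.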


But really you could have used any other character that wasn't going to appear in your strings, especially a visible one like “␟” U+241F "Symbol For Unit Separator".
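A visible delimiter like that needs no special tooling, since it is just an ordinary printable character (sketch; assumes the symbol itself never appears in the data):

```python
SEP = "\u241f"  # "␟" SYMBOL FOR UNIT SEPARATOR: printable, yet unlikely in real data
fields = ["one", "two, with a comma", 'three "quoted"']
line = SEP.join(fields)
assert line.split(SEP) == fields  # round-trips with no quoting at all
```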


I actually used [U+263A] and [U+263B] for this purpose, ignorantly (in good faith)... in pro/gov/civ projects... not realizing the canonized name wasn't "Smiley/Inverted Smiley" at the time, which may have been an oversight.

Ironically, I looked for that very control character, and I think it may not have worked with Excel/Clipboard, so it was a no-go for biz ops.

I never understood how people were able to abbreviate the U S into ␟ without triggering the USA flag, like on youtube.

After reading historical/proposed Unicode RFCs, having scrolled past every glyph that could combine into grapheme clusters and fuzzed Unicode input on many systems...

today I am humbled to learn that ␟ is not nor US, but the exact unit by which its own proliferation would itself de-nomen-ize itself.


StringBuilder in both JS and .Net has a class especially for this.


I dug up where I used it. It was a bible in HTML and JS. At first, it was using arrays of arrays of strings (for chapters and verses), but I refactored it to really long strings for each book with those separators. The entire bible is one JSON object in the source, keyed by book and the values are those really long strings.

https://raw.githubusercontent.com/theandrewbailey/OfflineBib...

A StringBuilder wouldn't work, since there's nothing left to concatenate.


That's for concatenating. He's splitting.


Splitting, even if it's not re-concatenated (unlikely in practice) would still benefit from the cache-optimization/ lack of garbage collection overhead / consequential bytes in literal RAM, no?


Yea sounds bizarre but he observed that.


Nice idea, but as others have pointed out, non-printable characters pose their own problems. People expect to be able to edit CSV files.

Someone mentioned XML, but for most use cases XML is stupidly over-engineered. JSON is simpler - the entire specification is just a dozen or so pages.


I still see a lot of XML in SOAP/WSDL APIs, typically in Microsoft shops, but thankfully JSON feels like the norm when IIS isn't involved.


All we need is native Excel support, and HTML5 web support. In web browsers it should be the default copy formatting, and if you’re writing an HTML document these characters should be an alternative to using TD and TR tags.


Perhaps we should someday have length-delimited text formats, and editors should recalculate the length on the fly.

Something like:

5:hello

2:pi

Maybe with one blank line with no delimiter as a record separator.

All fields on the same line could work, and would be more greppable, but harder to read for humans.
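A minimal parser for the proposed `length:value` lines, with a blank line ending a record (this is my reading of the sketch above, not a defined format):

```python
def parse_length_delimited(text):
    records, current = [], []
    for line in text.split("\n"):
        if line == "":                # blank line: record separator
            if current:
                records.append(current)
                current = []
            continue
        length, _, value = line.partition(":")
        if len(value) != int(length): # the editor was supposed to keep this in sync
            raise ValueError(f"declared length {length} != actual {len(value)}")
        current.append(value)
    if current:
        records.append(current)
    return records

assert parse_length_delimited("5:hello\n2:pi") == [["hello", "pi"]]
```

The length check is exactly the part that makes hand-editing tiresome, and exactly what the on-the-fly recalculation would automate.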


1966 called and wants its idea back

https://en.wikipedia.org/wiki/Hollerith_constant

As soon as DJB pays up:

http://cr.yp.to/proto/netstrings.txt


> but harder to read for humans

harder to write too - very easy to get the length wrong, and tiresome to have to count the length


There are several formats like this, the most well-known is probably canonical S-expressions. Or, if I'm being sadistic, PDF is somewhat like this, too.


That is the annoying format that Wordpress uses to store text. It does not lend itself for search and replace very well.


You can use ASCII-separated values in qsv.[1]

For the unlikely event that you are dealing with data with the metacharacters: qsv will use some other control character as the “quote” character to deal with that.


Whoops, meant to link to qsv https://github.com/jqnatividad/qsv


I think this would catch on much more quickly if text editors treated the Record Separator character as a new line, and there was a special character for the Unit Separator.


Tabs and commas are ASCII characters, so a CSV file and a TDF are ASCII-delimited by definition.

This lack of precision in writing is annoying.


To people saying \034 / \035 are not readable/printable, so they don't make good human-readable delimiters: make the delimiters ",\034" and "\n\035". It looks like CSV, but is actually ASCII-delimited; just remove the trailing cosmetic character from each entry when parsing.
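The suggestion, as I read it: keep a cosmetic comma/newline next to each control character so the file still looks like CSV in an editor, but split only on the control characters. A sketch (names are mine):

```python
FIELD_SEP = ",\x1c"    # comma for the eye, FS (0x1C) for the parser
RECORD_SEP = "\n\x1d"  # newline for the eye, GS (0x1D) for the parser

def dump_pretty_asv(records):
    return RECORD_SEP.join(FIELD_SEP.join(fields) for fields in records)

def load_pretty_asv(text):
    # Splitting on the two-character delimiters strips the cosmetic
    # comma/newline; real commas and newlines in the data are untouched.
    return [rec.split(FIELD_SEP) for rec in text.split(RECORD_SEP)]

rows = [["a,b", "c"], ["line\nbreak", "d"]]
assert load_pretty_asv(dump_pretty_asv(rows)) == rows
```

As with plain ASV, the only restriction is that the data never contains the control characters themselves.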


Would love to see an explanation and some examples of what this would look like to work with for common use cases.


2009. Has been shared here many times before.


plaintext is obsolete. Only good for storing passwords.



