That is correct: that approach only works when the data is simple enough that delimiter characters are never embedded inside a quoted field. I wrote a simple (and fast) utility to ensure that CSV files are handled properly by all the standard UNIX command line data tools. If you like using awk, sed, cut, tr, etc., then it may be useful to you.
There is a small program I wrote called csvquote[1] that can be used to sanitize input for awk, so that awk can rely on delimiter characters (commas) always meaning delimiters. The results from awk then get piped through the same program at the end to restore the commas inside the field values.
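The sanitize/restore idea can be sketched in a few lines of Python. This is only a rough illustration of the technique, not the actual csvquote implementation (which is written in C); the placeholder characters and function names here are my own:

```python
# Nonprinting placeholders stand in for delimiters that appear
# inside quoted fields, so downstream tools never see them as
# structural commas or newlines.
SUB_COMMA = "\x1f"
SUB_NEWLINE = "\x1e"

def sanitize(text):
    """Replace commas and newlines that occur inside quoted fields."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == ',' and in_quotes:
            out.append(SUB_COMMA)
        elif ch == '\n' and in_quotes:
            out.append(SUB_NEWLINE)
        else:
            out.append(ch)
    return ''.join(out)

def restore(text):
    """Undo the substitution after the data has been processed."""
    return text.replace(SUB_COMMA, ',').replace(SUB_NEWLINE, '\n')

row = 'id,"name, with comma",value\n'
clean = sanitize(row)
assert ',' not in clean.split('"')[1]  # no raw comma inside the quoted field
assert restore(clean) == row           # the round trip is lossless
```

Note that doubled quotes (`""` escapes inside a quoted field) happen to work with this simple quote-toggling approach, since no delimiter can appear between the two quote characters.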
Good idea! Looks similar to something I wrote called csvquote https://github.com/dbro/csvquote , which enables awk and other command line text tools to work with CSV data that contains embedded commas and newlines.
To simplify working with CSV data using command line tools, I wrote csvquote ( https://github.com/dbro/csvquote ). There are some examples on that page that show how it works with awk, cut, sed, etc.
While not exactly what you asked for, I wrote something similar called csvquote ( https://github.com/dbro/csvquote ) which transforms "typical" CSV or TSV data to use the ASCII characters for field separators and record separators, and also allows for a reverse transform back to regular CSV or TSV files.
It is handy in UNIX command pipelines, letting each command handle data that includes commas and newlines inside fields. The typical pattern uses csvquote twice in the pipeline: first at the beginning, to make the transformation to ASCII separators, and again at the end, to undo the transformation so that the separators are human-readable.
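A minimal Python sketch of that separator transform (illustrative only; the helper names are mine, and the real tool handles edge cases this glosses over):

```python
import csv
import io

US, RS = "\x1f", "\x1e"  # ASCII unit separator and record separator

def to_ascii_seps(csv_text):
    """Re-emit CSV with ASCII unit/record separators. Field contents
    are kept verbatim, and no quoting is needed, because the separator
    characters are assumed never to occur in the data itself."""
    rows = csv.reader(io.StringIO(csv_text))
    return RS.join(US.join(row) for row in rows) + RS

def to_csv(sep_text):
    """Reverse transform: back to regular quoted CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for record in sep_text.rstrip(RS).split(RS):
        writer.writerow(record.split(US))
    return buf.getvalue()

data = 'a,"b,c"\n"d\ne",f\n'
encoded = to_ascii_seps(data)
assert encoded.split(RS)[0].split(US) == ['a', 'b,c']  # embedded comma intact
assert to_csv(encoded) == data                         # lossless round trip
```

Splitting on dedicated separator characters is what lets tools like cut and awk treat every occurrence as structural, with no quote parsing required.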
It doesn't yet have any built-in awareness of UTF-8 or other multi-byte encodings, but I'd be happy to receive a pull request if that's something you're able to offer.
You might want to check out https://github.com/dbro/csvquote which helps awk and other text tools handle CSV files that have quoted strings as values.
If I'm going to preprocess before invoking awk, I think I'd rather switch the separators to the ASCII record/unit separator characters than replace the content of the actual fields.
(1) I filter on column content using regex and dealing with a sub character adds complexity.
(2) Many of my columns are free-form text containing commas, carriage returns, newlines, tabs, vertical tabs, and file separators (0x1C). Occasionally, text is in UCS-2/UTF-16 or uses UTF-8 and foreign characters (a non-trivial quantity of the text I process is in French, for example).
(If you read between the lines here, some columns can contain MLLP-encoded HL7 messages, others contain free-form text and I'm in the medical field.)
Here's another suggestion for the criticism section (which is a good idea for any open-minded project to include):
Instead of using a separate set of tools to work with CSV data, use an adapter to allow existing tools to work around CSV's quirky quoting methods.
csvquote (https://github.com/dbro/csvquote) enables the regular UNIX command line text toolset (like cut, wc, awk, etc.) to work properly with CSV data.
I do think there is room for both tools though. One of the cooler things I did with `xsv` was implement a very basic form of indexing. It's just a sequence of byte offsets where records start in some CSV data. Once you have that, you can do things like process the data in parallel, or slice out records instantly regardless of where they occur in the file.
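A toy sketch of that kind of offset index in Python (my own illustration, not xsv's actual implementation or on-disk index format):

```python
def index_offsets(data: bytes):
    """Return the byte offsets at which each CSV record starts.
    Tracks quote state so that newlines inside quoted fields are
    not mistaken for record boundaries."""
    offsets = [0]
    in_quotes = False
    for i, b in enumerate(data):
        if b == ord('"'):
            in_quotes = not in_quotes
        elif b == ord('\n') and not in_quotes and i + 1 < len(data):
            offsets.append(i + 1)
    return offsets

def slice_record(data: bytes, offsets, n):
    """O(1) jump to record n, no matter where it sits in the file."""
    start = offsets[n]
    end = offsets[n + 1] if n + 1 < len(offsets) else len(data)
    return data[start:end].rstrip(b'\n')

data = b'a,b\n"x\ny",z\nc,d\n'
offs = index_offsets(data)
assert offs == [0, 4, 12]
assert slice_record(data, offs, 1) == b'"x\ny",z'  # quoted newline handled
```

With offsets in hand, splitting the offset list into chunks gives each worker an independent byte range to parse, which is what makes parallel processing straightforward.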
csvquote: https://github.com/dbro/csvquote
Especially for use with existing shell text processing tools, e.g. cut, sort, wc.