Rather than splitting, we can positively tokenize: recognize lines as items that match lexical tokens. They will include the terminating newlines that we need to trim away.
We need to be tolerant of files that are not terminated by a final newline.
This regex seems to do the trick (under the TXR Lisp tok function):
1> (tok #/[^\n]*./ "")
nil
2> (tok #/[^\n]*./ "no newline at end of file")
("no newline at end of file")
3> (tok #/[^\n]*./ "\nno-newline")
("\n" "no-newline")
4> (tok #/[^\n]*./ "line\nno-newline")
("line\n" "no-newline")
5> (tok #/[^\n]*./ "line\nline\n")
("line\n" "line\n")
6> (tok #/[^\n]*./ "\n")
("\n")
7> (tok #/[^\n]*./ "\n\n")
("\n" "\n")
8> (tok #/[^\n]*./ "\n\n\n")
("\n" "\n" "\n")
A line is a (maximally long) sequence of zero or more non-newline characters, followed by one more character: either the terminating newline, or the last character of an unterminated final line.
LOL; I don't think I've ever used regex-driven tokenizing to recognize Unix-style lines in a text stream.
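For comparison, here is a rough Python sketch of the same tokenizing approach, assuming re.findall with re.DOTALL can stand in for tok (the DOTALL flag lets "." match the newline too; line_re is just an illustrative name):
>>> import re
>>> line_re = re.compile(r"[^\n]*.", re.DOTALL)
>>> line_re.findall("")
[]
>>> line_re.findall("no newline at end of file")
['no newline at end of file']
>>> line_re.findall("line\nno-newline")
['line\n', 'no-newline']
>>> line_re.findall("\n\n\n")
['\n', '\n', '\n']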
Yes, line splitting is built-in. But this is more about split versus tokenize.
I have split and tokenize functions which have the same interface.
Split places the focus on identifying separators: the "negative space" between what we want to keep. When we use that for lines, it has problems with edge cases, like not giving us an empty list of pieces when the string being split is empty.
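For instance, the same wart shows up with a plain split in Python: splitting the empty string yields a list containing one empty piece, whereas splitlines gives an empty list:
>>> "".split("\n")
['']
>>> "".splitlines(True)
[]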
I find it difficult to be interested in this philosophical difference between the two when the only difference is in how they treat the empty string.
That is, splitting with regex look-behind/look-ahead, like:
(?<=\n)(?=.)
with a suitable definition of dot, also matches "negative space", and produces the same results as Python's str.splitlines(True), except for the empty string.
>>> import re
>>> p = re.compile(r"(?<=\n)(?=.)", re.DOTALL)
>>> def check(s):
... lines = p.split(s)
... assert lines == s.splitlines(True), lines
... return lines
...
>>> check("no newline at end of file")
['no newline at end of file']
>>> check("\nno-newline")
['\n', 'no-newline']
>>> check("line\nno-newline")
['line\n', 'no-newline']
>>> check("line\nline\n")
['line\n', 'line\n']
>>> check("\n")
['\n']
>>> check("\n\n")
['\n', '\n']
>>> check("\n\n\n")
['\n', '\n', '\n']
>>> check("")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in check
AssertionError: ['']
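If we want the split approach to agree with splitlines(True) on the empty string as well, a small wrapper (a sketch; the name split_lines is mine) can special-case it:
>>> def split_lines(s):
...     return p.split(s) if s else []
...
>>> split_lines("")
[]
>>> split_lines("line\nline\n")
['line\n', 'line\n']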