
Rather than splitting, we can tokenize positively: recognize lines as lexical tokens matched by a pattern. The tokens will include the terminating newlines, which we then need to trim away.

We need to be tolerant of files that are not terminated by a final newline.

This regex seems to do the trick (under the TXR Lisp tok function):

  1> (tok #/[^\n]*./ "")
  nil
  2> (tok #/[^\n]*./ "no newline at end of file")
  ("no newline at end of file")
  3> (tok #/[^\n]*./ "\nno-newline")
  ("\n" "no-newline")
  4> (tok #/[^\n]*./ "line\nno-newline")
  ("line\n" "no-newline")
  5> (tok #/[^\n]*./ "line\nline\n")
  ("line\n" "line\n")
  6> (tok #/[^\n]*./ "\n")
  ("\n")
  7> (tok #/[^\n]*./ "\n\n")
  ("\n" "\n")
  8> (tok #/[^\n]*./ "\n\n\n")
  ("\n" "\n" "\n")
A line is a maximally long sequence of zero or more non-newline characters, followed by one more character: either the terminating newline, or the last character of an unterminated final line.
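
For comparison, the same positive-match idea can be sketched in Python with re.findall; re.DOTALL is needed so the trailing dot can consume the newline:

  >>> import re
  >>> p = re.compile(r"[^\n]*.", re.DOTALL)
  >>> p.findall("")
  []
  >>> p.findall("no newline at end of file")
  ['no newline at end of file']
  >>> p.findall("line\nno-newline")
  ['line\n', 'no-newline']
  >>> p.findall("\n\n")
  ['\n', '\n']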

LOL; I don't think I've ever used regex-driven tokenizing to recognize Unix-style lines in a text stream.



Or we can do like the Python code does and depend either on line iteration of the input file, or use str.splitlines(). ;)
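
With keepends=True, splitlines retains the terminating newlines, matching the tok output above:

  >>> "line\nno-newline".splitlines(True)
  ['line\n', 'no-newline']
  >>> "".splitlines(True)
  []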


All those tools conceal finite automata which have to do the equivalent of that regex pattern.
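
Spelled out by hand, that automaton is just a character scan; a minimal Python sketch (mine, not any particular tool's implementation):

  # Hand-rolled equivalent of the regex [^\n]*. :
  # scan characters, cutting a token at each newline.
  def tokenize_lines(s):
      out, start = [], 0
      for i, ch in enumerate(s):
          if ch == "\n":
              out.append(s[start:i + 1])  # token keeps its newline
              start = i + 1
      if start < len(s):
          out.append(s[start:])           # unterminated final line
      return out

  assert tokenize_lines("") == []
  assert tokenize_lines("line\nno-newline") == ["line\n", "no-newline"]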


Yes, of course. Lines of text are a classic regular language, so anything parsing them is a DFA at heart.

Is there a take-home message I'm supposed to get from your comment?

My comment was meant to be a light-hearted observation that you're overthinking the problem, since your language of choice likely has this built-in.

Or a boast that Python has a built-in feature for what requires a custom TXR and/or TypeScript solution. ;)


Yes, line splitting is built-in. But this is more about split versus tokenize.

I have split and tokenize functions which have the same interface.

Split places the focus on identifying separators: the "negative space" between what we want to keep. When we use that for lines, it has problems with edge cases, like not giving us an empty list of pieces when the string being split is empty.
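
In Python terms, that empty-string edge case looks like this:

  >>> "".split("\n")       # the split view: one empty piece
  ['']
  >>> "".splitlines(True)  # the tokenize view: no lines at all
  []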


I find it difficult to be interested in this philosophical difference between the two when the only difference is in how they treat the empty string.

That is, splitting with regex look-behind/look-ahead, like:

  (?<=\n)(?=.)
with a suitable definition of dot, also matches "negative space", and produces the same results as Python's str.splitlines(True), except for the empty string.

  >>> import re
  >>> p = re.compile(r"(?<=\n)(?=.)", re.DOTALL)
  >>> def check(s):
  ...   lines = p.split(s)
  ...   assert lines == s.splitlines(True), lines
  ...   return lines
  ...
  >>> check("no newline at end of file")
  ['no newline at end of file']
  >>> check("\nno-newline")
  ['\n', 'no-newline']
  >>> check("line\nno-newline")
  ['line\n', 'no-newline']
  >>> check("line\nline\n")
  ['line\n', 'line\n']
  >>> check("\n")
  ['\n']
  >>> check("\n\n")
  ['\n', '\n']
  >>> check("\n\n\n")
  ['\n', '\n', '\n']
  >>> check("")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "<stdin>", line 3, in check
  AssertionError: ['']



