
Rather than splitting, we can tokenize positively: recognize lines as lexical tokens matched by a pattern. The tokens will include the terminating newlines, which we then need to trim away.

We need to be tolerant of files that are not terminated by a final newline.

This regex seems to do the trick (under the TXR Lisp tok function):

  1> (tok #/[^\n]*./ "")
  nil
  2> (tok #/[^\n]*./ "no newline at end of file")
  ("no newline at end of file")
  3> (tok #/[^\n]*./ "\nno-newline")
  ("\n" "no-newline")
  4> (tok #/[^\n]*./ "line\nno-newline")
  ("line\n" "no-newline")
  5> (tok #/[^\n]*./ "line\nline\n")
  ("line\n" "line\n")
  6> (tok #/[^\n]*./ "\n")
  ("\n")
  7> (tok #/[^\n]*./ "\n\n")
  ("\n" "\n")
  8> (tok #/[^\n]*./ "\n\n\n")
  ("\n" "\n" "\n")
A line is a maximally long sequence of zero or more non-newline characters, followed by one more character: either the terminating newline, or the last character of an unterminated final line.
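
For comparison, the same positive-match idea can be sketched in Python with re.findall; re.DOTALL is needed so the trailing dot can consume the newline:

  >>> import re
  >>> p = re.compile(r"[^\n]*.", re.DOTALL)
  >>> p.findall("")
  []
  >>> p.findall("no newline at end of file")
  ['no newline at end of file']
  >>> p.findall("line\nno-newline")
  ['line\n', 'no-newline']
  >>> p.findall("\n\n")
  ['\n', '\n']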

LOL; I don't think I've ever used regex-driven tokenizing to recognize Unix-style lines in a text stream.



Or we can do like the Python code does and depend either on line iteration of the input file, or use str.splitlines(). ;)
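
With keepends=True, splitlines retains the terminating newlines, matching the tok output above:

  >>> "line\nno-newline".splitlines(True)
  ['line\n', 'no-newline']
  >>> "".splitlines(True)
  []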


All those tools conceal finite automata which have to do the equivalent of that regex pattern.
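
Spelled out by hand, that automaton is just a character scan; a minimal Python sketch (mine, not any particular tool's implementation):

  # Hand-rolled equivalent of the regex [^\n]*. :
  # scan characters, cutting a token at each newline.
  def tokenize_lines(s):
      out, start = [], 0
      for i, ch in enumerate(s):
          if ch == "\n":
              out.append(s[start:i + 1])  # token keeps its newline
              start = i + 1
      if start < len(s):
          out.append(s[start:])           # unterminated final line
      return out

  assert tokenize_lines("") == []
  assert tokenize_lines("line\nno-newline") == ["line\n", "no-newline"]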


Yes, of course. Lines of text are a classic regular language, so anything parsing them is a DFA at heart.

Is there a take-home message I'm supposed to get from your comment?

My comment was meant to be a light-hearted observation that you're overthinking the problem, since your language of choice likely has this built-in.

Or a boast that Python has a built-in feature for what requires a custom TXR and/or TypeScript solution. ;)


Yes, line splitting is built-in. But this is more about split versus tokenize.

I have split and tokenize functions which have the same interface.

Split places the focus on identifying separators: the "negative space" between what we want to keep. When we use that for lines, it has problems with edge cases, like not giving us an empty list of pieces when the string being split is empty.
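
In Python terms, that empty-string edge case looks like this:

  >>> "".split("\n")       # the split view: one empty piece
  ['']
  >>> "".splitlines(True)  # the tokenize view: no lines at all
  []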


I find it difficult to be interested in this philosophical difference between the two when the only difference is in how they treat the empty string.

That is, splitting with regex look-behind/look-ahead, like:

  (?<=\n)(?=.)
with a suitable definition of dot, also matches "negative space", and produces the same results as Python's str.splitlines(True), except for the empty string.

  >>> import re
  >>> p = re.compile(r"(?<=\n)(?=.)", re.DOTALL)
  >>> def check(s):
  ...   lines = p.split(s)
  ...   assert lines == s.splitlines(True), lines
  ...   return lines
  ...
  >>> check("no newline at end of file")
  ['no newline at end of file']
  >>> check("\nno-newline")
  ['\n', 'no-newline']
  >>> check("line\nno-newline")
  ['line\n', 'no-newline']
  >>> check("line\nline\n")
  ['line\n', 'line\n']
  >>> check("\n")
  ['\n']
  >>> check("\n\n")
  ['\n', '\n']
  >>> check("\n\n\n")
  ['\n', '\n', '\n']
  >>> check("")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "<stdin>", line 3, in check
  AssertionError: ['']



