The biggest challenge is figuring out how to store the text document in memory. My first thought was to use an array, but that has horrible performance if the user inserts text anywhere other than the end of the document.
The counter-argument to that is that processors are ridiculously fast in human timescales --- copying memory at gigabytes per second --- so unless you're focusing your use-case on editing correspondingly huge files, there's no real need to make your implementation more complicated than a single array. Even when DOS machines with <640K of memory and memcpy() speeds in the low MB/s were the norm, people edited text files of similar sizes, with editors that used a single array buffer, and for that purpose they weren't noticeably slower than ones today.
My ideas for challenging projects are a little less open-ended, so they exercise a different set of skills: being able to implement a specification correctly and efficiently.
- TTF renderer
- GIF or JPEG en/decoder
- Video decoder (start with H.261)
IMHO being able to consume existing content with code you wrote is very rewarding.
>> A rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate a very long string. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.
> there's no real need to make your implementation more complicated than a single array
Yeah, good luck enabling line numbers in such an editor.
In Emacs, which uses a gap buffer for storing text, line numbers have been notoriously slow. It's gotten a bit better lately, but suffice it to say, a naïve flat array / gap buffer approach is not good enough for some relatively common scenarios even on modern hardware.
I don't think there should be a problem with line numbers. I would make two helper arrays containing the indices of the new-line characters, corresponding to the two gap buffer text arrays (new-line positions are sorted ascending for the first array, descending for the second array).
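Roughly what I mean, as a C sketch (made-up names, fixed capacities, single-byte encoding, no bounds checks):

```c
#include <stddef.h>

#define CAP (1 << 20)

struct gapbuf {
    char   text[CAP];
    size_t gap_start, gap_end;   /* text[gap_start..gap_end) is the gap */
    size_t nl_before[CAP];       /* newline offsets < gap_start, ascending */
    size_t nl_after[CAP];        /* newline offsets >= gap_end, descending */
    size_t n_before, n_after;
};

/* The cursor sits at the gap, so its 1-based line number is O(1). */
size_t cursor_line(const struct gapbuf *b) { return b->n_before + 1; }

/* Inserting at the cursor appends at most one entry to the "before" stack. */
void insert_char(struct gapbuf *b, char c)
{
    b->text[b->gap_start] = c;
    if (c == '\n')
        b->nl_before[b->n_before++] = b->gap_start;
    b->gap_start++;
}

/* Moving the gap right transfers at most one newline index between the two
   stacks, so line tracking stays in step with the gap at no extra cost. */
void move_gap_right(struct gapbuf *b)
{
    char c = b->text[b->gap_end];
    b->text[b->gap_start] = c;
    if (c == '\n') {
        b->n_after--;                               /* nearest newline after the gap */
        b->nl_before[b->n_before++] = b->gap_start; /* its new offset */
    }
    b->gap_start++;
    b->gap_end++;
}
```

The cursor's line is then n_before + 1, and the total line count is n_before + n_after + 1, without ever rescanning the text.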
Speaking as someone who's gone all the way from implementing a Red-black tree to making a rope data structure using the RB tree, to making a text editor that can edit almost arbitrarily large text files (dozens of gigabytes) without user-perceivable latency ;-)
But line numbers are trivial to update when a gap buffer needs a move.
A list of strings is more elegant, of course, where only the line being edited becomes a gap buffer. It taxes the allocator a bit more, though, which might be a concern on computers of the time when Emacs was born.
While that is true to an extent, I've made a lot of money cleaning up after people who didn't architect and design their code to cleanly grow into a fairly obvious potential use case, requiring major rewrites. It isn't a premature optimization to avoid walling yourself into a corner.
I agree strongly with designing your code so it's easily changeable into whatever new features are needed. This is much easier said than done, and I don't know if anyone has written well about the tricks of that trade.
But anyway, if you have that kind of code, swapping out whatever you need to make line numbers happen is no more work later than sooner.
Code bases with features implemented that are never used, but you still have to keep working through all changes, because someone imagines it will be a real requirement someday, are what my nightmares are made of.
It's all about the interfaces. More performant solutions require (in general) more complex interfaces.
If your application has grown as long as it could with the simple implementation, and now it is all too slow, chances are there's a lot of code depending on the interface. If your interface (and the implementation) is too simplistic, then all of that code will need rearchitecting, too.
IIRC many DOS-era editors used an array/list of line buffers. Which to me seems like a good middle-ground solution. Certainly for today's typical use case when you care about performance, i.e. editing a >100MB text data file by hand, which is a giant pain in Emacs because the gap buffer is simply not a good structure for doing a few simple edits across three places 10MB apart (you spend most of the time moving the gap around, while touching essentially all of the memory).
As far as the Emacs API goes, it should not be an issue, but on the other hand there is a bunch of elisp code that really expects a gap-buffer-based implementation, and that code is in turn depended on by who knows what...
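For what it's worth, the line-buffer middle ground can be sketched like this (made-up names, no error handling): the document is an array of independently allocated lines, so a keystroke only touches one line, and inserting a line shifts an array of pointers rather than megabytes of text.

```c
#include <stdlib.h>
#include <string.h>

struct line {
    char  *text;
    size_t len, cap;
};

struct doc {
    struct line *lines;
    size_t nlines, cap;
};

/* Insert a character within one line: memmove over that line's tail only. */
void line_insert_char(struct line *l, size_t col, char c)
{
    if (l->len == l->cap) {
        l->cap = l->cap ? l->cap * 2 : 32;
        l->text = realloc(l->text, l->cap);
    }
    memmove(l->text + col + 1, l->text + col, l->len - col);
    l->text[col] = c;
    l->len++;
}

/* Insert an empty line: shift pointers, not the text itself. */
void doc_insert_line(struct doc *d, size_t at)
{
    if (d->nlines == d->cap) {
        d->cap = d->cap ? d->cap * 2 : 64;
        d->lines = realloc(d->lines, d->cap * sizeof *d->lines);
    }
    memmove(d->lines + at + 1, d->lines + at,
            (d->nlines - at) * sizeof *d->lines);
    d->lines[at] = (struct line){0};
    d->nlines++;
}
```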
I don't edit really huge files that often (maybe a couple of times a week), but when I do I want to be able to use the same editor I use for everything else. A really great text editor is fast and flexible and powerful regardless of the size of the file you're trying to edit.
That's one issue I have with these proposed programming tasks.
You are not going to write a really great text editor as a learning exercise. It has been done by better programmers who had a better overview of the problems, over thousands of man-hours.
This automatically means the task is as useless as a gameboy emulator or basic compiler. The underlying "Things to learn" points are good, but the tasks themselves are not.
Writing experimental text editors for fun in various programming languages has been one of the most rewarding learning exercises of my life.
It's not really clear what your point is. You say the task is "useless"—what does that mean? Personally I can say that you are categorically wrong, because the skills I gained building things that are not completely new ideas fueled my passion for programming and opened up doors for me that otherwise would have remained closed. Even if I didn't still use a lot of these projects myself (because I built them to fit me), the value I derived from them would still be significant in the "grand" scheme of my life.
If a programmer is excited about the idea of writing her own text editor, what would you suggest she build instead that will sustain that same excitement and offer exploration into the same diverse subject matter but also satisfy your nebulous criterion of not being "useless"?
> there's no real need to make your implementation more complicated than a single array.
I think you have misunderstood the concept of an array. An array gives you 1) an interface that is easy to use; on the other hand, by definition, an array is 2) contiguous in memory. Property 1 is good, but 2 can cause problems. I think you want only 1.
The solution is to create a data type that has the interface of an array but a different implementation under the hood. You can have a linked-list of arrays, a tree of strings, etc.
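A rough sketch of what that could look like (made-up names): a linked list of fixed-size chunks that still exposes an array-style indexed read.

```c
#include <stddef.h>

#define CHUNK_SIZE 4096

struct chunk {
    char   data[CHUNK_SIZE];
    size_t len;                  /* bytes in use in this chunk */
    struct chunk *next;
};

struct textbuf { struct chunk *head; };

/* Same signature you'd expect from a flat array: get the i-th character. */
int textbuf_char_at(const struct textbuf *t, size_t i)
{
    for (const struct chunk *c = t->head; c != NULL; c = c->next) {
        if (i < c->len)
            return (unsigned char)c->data[i];
        i -= c->len;
    }
    return -1; /* out of range */
}
```

Insertion then only has to shift or split within one chunk, while callers keep indexing as if it were one flat array.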
I think the original commenter knows full well what an array is.
Vague justifications like "can cause problems" are probably exactly what he's referring to, in fact --- people who know that inserting elements into an array is "slow" and end up writing large and complex code as a result. Yes, it's O(N) on the length of your text, but the point is that for a couple of megs of text, O(N) is perfectly acceptable.
At least on a desktop, that'll fit in L3 cache, which these days runs at around 175GB/sec. Or to put it another way, inserting that single char can probably be done around 40,000 times per second. Which is faster than I can type, at any rate.
You'd be correct if people used editors only for opening source code files. The problem is that people usually open data files too, which can be not only larger than the L3 cache but larger than the entire system memory. The magic of a good editor like Vim is the capability to handle such files.
The other problem with your comment is support for the undo operation. Even if you use a flat array, you need a more sophisticated data structure for storing previous changes. Storing a separate copy of the array for every single change is not an option.
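One common way to do that (sketched here with hypothetical names, no redo and no grouping of keystrokes) is to store deltas instead of snapshots: each record remembers what changed and where, so undoing is just applying the inverse edit.

```c
#include <stdlib.h>
#include <string.h>

enum edit_kind { EDIT_INSERT, EDIT_DELETE };

struct edit {
    enum edit_kind kind;
    size_t pos;      /* offset in the document where the change happened */
    char  *bytes;    /* the inserted or deleted text, owned by this record */
    size_t len;
};

struct undo_stack {
    struct edit *edits;
    size_t count, cap;
};

/* The buffer code calls this right after performing an edit. */
void undo_record(struct undo_stack *u, enum edit_kind kind,
                 size_t pos, const char *s, size_t len)
{
    if (u->count == u->cap) {
        u->cap = u->cap ? u->cap * 2 : 64;
        u->edits = realloc(u->edits, u->cap * sizeof *u->edits);
    }
    struct edit *e = &u->edits[u->count++];
    e->kind  = kind;
    e->pos   = pos;
    e->len   = len;
    e->bytes = malloc(len);
    memcpy(e->bytes, s, len);
}

/* Undo pops the most recent record and applies the inverse edit through
   whatever insert/delete functions the buffer exposes (not shown here). */
```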
Whether an array is contiguous in memory depends on the language (and the specific implementation of that language). JavaScript arrays, for example, are really objects, and engines may not store them contiguously at all (sparse arrays can degrade to hash-table-like storage).
Good point! Dynamic languages are different in their terminology. AFAIK statically typed languages have a clear definition of arrays. The OP was talking about arrays having "horrible performance if the user inserts text anywhere other than the end of the document". I think this statement has an implicit assumption that arrays are contiguous, which is not guaranteed in JavaScript.
But did you try editing a multi-megabyte file with that method? Because I have seen enough editors struggle with big files (especially if the file is a single line and you're going through it with word-wrap on) that I think the basic straightforward approach already isn't sufficient on such workloads.
from the article:
> Luckily, there are some nice data structures to learn to solve this.
You could have also learned a new data structure!
I mean, it should be obvious that "this thing that the cursor does when moving lines" isn't the big takeaway from this challenge. It's almost cute that the author never noticed it (as a programmer), because I actually use that behaviour to navigate code sometimes. Who hasn't done a quick arrow-left/right to make the cursor lose its memory of which column it used to be on?
> Even when DOS machines with <640K of memory and memcpy() speeds in the low MB/s were the norm, people edited text files of similar sizes, with editors that used a single array buffer, and for that purpose they weren't noticeably slower than ones today.
No way. Every reasonably performant text editor in those days used special data structures and not just an array. Imagine having to copy the entire buffer on each key press (say, when inserting at the start of the file). Believe me, on a 640K DOS machine you'll feel that.
This isn't new stuff, I learned about these data structures in uni -- except I don't remember them because back then I was young and arrogant and didn't think you'd need these fancy data structures for something as simple as an editor :) :)
... but if you never tried to write one, it's hard to see in what ways these editors are not as easy as you think.
I don't see any problem with an array. Make it huge so you only have to reallocate every megabyte or so. Keep track of the document length and only move as much as needed. Your processor can do this for every character faster than you can type. No need for fancy data structures, and it's trivial to load and save files. The interesting part then becomes formatting for the display.
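Something like this, as a sketch (hypothetical names; the megabyte-step growth via realloc is omitted):

```c
#include <string.h>

struct flatbuf {
    char  *data;   /* allocated in ~1 MB steps, so realloc is rare */
    size_t len;    /* bytes actually used */
    size_t cap;    /* bytes allocated */
};

/* Assumes len < cap; a real version would grow the allocation first. */
void flatbuf_insert_char(struct flatbuf *b, size_t pos, char c)
{
    memmove(b->data + pos + 1, b->data + pos, b->len - pos);  /* shift the tail */
    b->data[pos] = c;
    b->len++;
}
```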
Imagine inserting text in the middle of 1GB file. Moving 500MB of data will definitely take longer than 18ms, and thus will cause at least some visible lag.
Takes less than 100 ms to read in the file (calling realloc in a loop), insert a byte in the middle (realloc + memmove), and write the modified file out on stdout. The byte insertion amounts to about 4ms.
That's hardly fast, yet still a lot snappier than most modern editors' UI or the web, where apparently achieving 60 fps for a few hundred dynamic DOM nodes is some kind of an achievement.
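The insertion part of that measurement is essentially just the following (a sketch assuming the buffer was loaded with at least one spare byte of capacity; uses POSIX clock_gettime):

```c
#include <string.h>
#include <time.h>

double time_mid_insert(char *buf, size_t len)
{
    struct timespec t0, t1;
    size_t mid = len / 2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memmove(buf + mid + 1, buf + mid, len - mid);  /* shift the tail right by one */
    buf[mid] = 'x';                                /* the inserted byte */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```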
This approach really starts to suck when you implement macros that are going to perform a lot of one-char inserts quickly. Or when you're editing multi-gigabyte files.
I must admit I was surprised, although I shouldn't be. Are we at > 10GB/s memory bandwidth now?
> This approach really starts to suck when you implement macros that are going to perform a lot of one-char inserts quickly. Or when you're editing multi-gigabyte files.
I'm working on an editor that I've optimized for such cases. In a test it made random edits to a 4GB file in < 50 microseconds. But, it cost a load of sweat and blood to get that rope data structure right. And it loads files only at about 100MB/s (should optimize for bulk inserts). https://github.com/jstimpfle/astedit
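Not the code from that repo --- just the textbook rope lookup as a sketch, to show why random access stays logarithmic in a balanced rope: each internal node stores the total length of its left subtree, so indexing is a walk down the tree rather than a scan (assumes internal nodes always have two children).

```c
#include <stddef.h>

struct rope {
    struct rope *left, *right;   /* both NULL for leaves */
    const char  *str;            /* leaf payload */
    size_t       weight;         /* leaf: strlen(str); internal: length of left subtree */
};

char rope_char_at(const struct rope *r, size_t i)
{
    while (r->left != NULL) {    /* internal node */
        if (i < r->weight) {
            r = r->left;
        } else {
            i -= r->weight;
            r = r->right;
        }
    }
    return r->str[i];            /* leaf */
}
```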
Yes, I make sure it compiles on both platforms from time to time. The current commit should compile with MSVC, GCC, and Clang, I believe. I'm happy to fix any issues if you find them :-)
>> This approach really starts to suck when you implement macros that are going to perform a lot of one-char inserts quickly.
What operation is that? Search and replace might have that effect but could be done by copying the entire buffer with replacement happening along the way.
> The counter-argument to that is that processors are ridiculously fast in human timescales
Until you actually have to implement your algorithm on a mobile device that is both memory- and power-constrained and that doesn't have a swap file. The OS will either kill your program for being too memory- or power-inefficient, kill another program running in the background (not a great user experience), and/or force the use of the high-power cores, burning battery life unnecessarily when a more efficient algorithm could have used the lower-power cores.
Attitudes like this also explains why developers don’t think twice about delivering battery consuming Electron apps.
Umm, the DOS 640K was paged and non-contiguous. Additionally, smoothly editing larger texts back then required some clever linked lists of blocks to give a truly instantaneous editing experience when inserting text at the beginning of a large text. Those were the 286/386 days.
Today you have fancy rendering, and for that reason an instantaneous editing experience again suggests a more sophisticated data structure for the editor.
Which all text editors have, when you look inside vi/emacs/nano/whatever...
> so unless you're focusing your use-case on editing correspondingly huge files, there's no real need to make your implementation more complicated than a single array.
I think the other major corner case is when you need concurrent, distributed editing (although that's not popular or anything these days), in which case an array is a very poor datastructure.
For TIFF (and most formats) that's heavily dependent on if you're talking about implementing a reader or a writer for the format. TIFF readers need to handle JPEG streams, so in that sense implementing a general purpose TIFF reader is a superset of implementing a general purpose JPEG reader.
On the other hand, TIFF writers can (very conveniently!) be almost as simple as you want, including no compression at all, just blobs of raw pixel values, and a smattering of tags for width, height, pixel format, and that's it. The only thing simpler to output IMO would be uncompressed ASCII formats like XPM.
So in that sense you're correct --- the simplest possible JPEG writer is much more complicated than the simplest possible TIFF writer, but TIFF in general is extensible to a fault (arguably), in the sense that the number of possible combinations of pixel and metadata encodings you have to prepare yourself for when opening arbitrary .tif files is far greater than when opening arbitrary .jpg files, including JPEGs within TIFFs.
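To make the "almost as simple as you want" point concrete, here's a rough sketch of a minimal TIFF writer --- uncompressed 8-bit grayscale, one strip, little-endian, no resolution tags (strict baseline asks for those too, but most readers cope). Names are made up.

```c
#include <stdint.h>
#include <stdio.h>

static void put16(FILE *f, uint16_t v) { fputc(v & 0xff, f); fputc(v >> 8, f); }
static void put32(FILE *f, uint32_t v) { put16(f, v & 0xffff); put16(f, v >> 16); }

/* One 12-byte IFD entry: tag, type, count, value. In a little-endian file a
   SHORT (type 3) count-1 value lands in the correct low bytes via put32. */
static void entry(FILE *f, uint16_t tag, uint16_t type, uint32_t count, uint32_t value)
{
    put16(f, tag); put16(f, type); put32(f, count); put32(f, value);
}

int write_gray_tiff(const char *path, const uint8_t *pixels, uint32_t w, uint32_t h)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;

    uint32_t data_offset = 8;           /* pixel data right after the 8-byte header */
    uint32_t ifd_offset  = 8 + w * h;   /* IFD after the single strip (assumes w*h fits) */

    fputc('I', f); fputc('I', f);       /* "II" = little-endian */
    put16(f, 42);                       /* TIFF magic */
    put32(f, ifd_offset);

    fwrite(pixels, 1, (size_t)w * h, f);

    put16(f, 8);                        /* number of IFD entries, ascending by tag */
    entry(f, 256, 4, 1, w);             /* ImageWidth */
    entry(f, 257, 4, 1, h);             /* ImageLength */
    entry(f, 258, 3, 1, 8);             /* BitsPerSample = 8 */
    entry(f, 259, 3, 1, 1);             /* Compression = none */
    entry(f, 262, 3, 1, 1);             /* PhotometricInterpretation = BlackIsZero */
    entry(f, 273, 4, 1, data_offset);   /* StripOffsets */
    entry(f, 278, 4, 1, h);             /* RowsPerStrip = whole image in one strip */
    entry(f, 279, 4, 1, w * h);         /* StripByteCounts */
    put32(f, 0);                        /* no next IFD */

    fclose(f);
    return 0;
}
```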
Back in the day, TIFF was just a large uncompressed file.
The initial format is older than GIF87a (which had no animation, the feature people associate GIF with nowadays). It had a header, but that was pretty much it. Of course the format developed with time and even added LZW once the patent expired. Currently TIFF is all kinds of things, so writing a fully featured reader is a proper challenge (perhaps not coding-wise, but understanding it and implementing the myriad of types/extensions, etc.)
I'll also link Cristi Cuturicu's "A note about the JPEG decoding algorithm", which is where I started my decoder implementation from, and it was indeed a ton of fun.
"no real need to make your implementation more complicated than a single array"
That's our industry in a nutshell. Our computers, instead of becoming more capable over time, can barely keep pace with the increasing naivety of our programmers.
Yes, this is engineering in a nutshell: determining a course of action within a set of constraints that meets your objectives. Where constraints can be time, cost, physical limitations (processor speed, memory size, disk space), etc; and objectives can be functional (user can edit files), nonfunctional (user can edit large files in < X seconds, energy usage), personal learning, or any number of other requirements.
The GP offered a valid decision point to consider based upon what an engineer is solving for. I don’t think he said that an array was the solution he’d ship in a production text editor to millions of end-users.
Actually, that's exactly what I was saying --- plenty of existing text editors use the "stupid" single array, yet no one complains about their performance.
Just because people tend not to edit large files in Notepad doesn't mean they'll complain about it when they do. Actual complaints are of course sparse because hardly anyone uses Notepad for anything serious if they can use an alternative. BUT when they do, oh they will complain.
I believe an older version of Notepad even had a (fairly low) limit on file size it would open.
I mean that's the reverse argument, computers have gigabytes of memory today, and are super fast, so you should be able to load a multi gigabyte text file and edit it, on a single line, with word wrapping.
In webdev, increasingly often, the expectation is that programmers not only do the backend, but also the database management, the frontend (which used to be graphic design, CSS/HTML, and JS, separately), and everything devops.
Outside of webdev, Unity springs to mind, as another great example of this: The stuff you can do as a single game developer is mind boggling, or at least used to be, until indie devs everywhere started boggling our minds on a daily basis and thus raising the standard of what consumers expect an indie game to be.
This is, of course, not because within 50 years humans evolved to be a lot better or smarter or faster than their predecessors. It is made possible through more flexible, higher-level tooling that you don't have to understand the inner workings of to take advantage of, and through more abundant computing resources, which in tandem enable work that will be in the "good enough" territory for most use cases.
This is also not a choice that programmers as individuals or even a group make. It's a choice that the market makes.
There is nothing naive about it. Naive is assuming it would be any other way.
In web dev I have observed the opposite trend: when I first started my career, everyone was expected to be full stack and know how to deploy a thing. Nowadays devs tend to be strictly front end, back end, devops, etc. Devs who can optimize a SQL query, model a DB schema, and then write a well-organized React or Angular front end seem to be the exception, not the rule.
> In webdev, increasingly often, the expectation is that programmers not only do the backend, but also the database management, the frontend (which used to be graphic design, CSS/HTML, and JS, separately), and everything devops.
What do you mean, increasingly often? This was the case 15 years ago already and I see only examples that it has gotten less, because of all the frameworks that exist.
Also it's exactly what I liked about webdev. When your existing talents for graphics design and explainer-of-technical-things shine in a tech context, that feels good. A lot of programmers have no feel for this, and a lot of designers write awful code. Which could have, but historically did NOT improve at all with higher level tooling, mainly because of this "good enough" attitude. Feel free to prove me otherwise, but what did happen: Thanks to things like Bootstrap, now programmers can avoid the worst design mistakes without having to learn design. Graphics Designers, however, well .. I don't know? Are there tools that allow them to write or generate code that doesn't suck? (Without programming skills, like the coders without design skills).
> This is also not a choice that programmers as individuals or even a group make. It's a choice that the market makes.
> There is nothing naive about it. Naive is assuming it would be any other way.
I don't know ... Do you believe there no longer exist people that deliver quality over this entire skill set? Or that they somehow exist outside of the market?
People jump for complex and overly optimized solutions too quickly, IMHO. From a conceptual perspective, I enjoy these sort of challenges but that's where it ends.
For product demands where deadlines are constantly unrealistic, underfunded, underscoped and demands are ever changing, I'm a fan of providing the simplest conceptual solution to the task at hand and not focusing on developing complex abstractions and optimizations too early.
From my experience, that time is typically wasted until functionality is zeroed in and real money is available to pay for the work, as the early complex abstractions typically fail to keep pace with demands and the optimizations break when the ever-changing requirements... change. That's just my experience, YMMV.
Are you suggesting that it’s bad to use the simple uncomplicated approach because it’s inefficient, or that it’s bad to add layers upon layers of complexity which end up bringing modern computers to their knees?
Personally I’m in the latter camp. There’s so many layers of abstraction nowadays which each in theory make programming better/safer/easier which in practice end up creating an incredibly inefficient mess.
Today's software suffers from too many layers of complexity that are each pretty dumb and serve mostly bookkeeping. The result looks like an overinflated bureaucracy. In the example above, using a more efficient data structure for text representation will add at most one layer of abstraction (but there's a good chance you'd create that layer to hide the array anyway), but offer significant benefits in terms of performance, at a cost of little and well-isolated complexity.
This is the best kind of abstraction: complex, deep behavior hidden behind simple interface.
Same. I generally write in C without too many layers between my code and the CPU, and it is just incredible how fast modern CPUs are with naive code that doesn't even attempt to be optimal.
I wish others understood that, because the things I work on are losing performance (and a massive amount of developer time, which could be used for optimization or other useful work) to excess complexity, not too simplistic code.
Yet if you read the rest of the comment you would realize this specific use case (editing text) was handled fine with a single array buffer when computers had less than 1MB of memory to work with.
This is a perfect example of when it’s stupid to keep optimizing.
I wish that were our industry. Instead, we make things super complicated and make them slower at the same time.
Let's take the text editor example. Let's say we use it to edit a large document. Is Moby Dick large enough? It's around a megabyte of (pure) text. Let's figure out a persistence solution. How about "we save the entire text to disk"? So a megabyte to disk. My laptop's SSD does (large) writes at 2GB/s. So the ultra simple solution could save the entire text around 2000 times per second.
Your laptop's SSD, sure, 2GB/s --- but that 5400 RPM laptop hard disk that your user has is writing at a measly 1MB/s because the disk is also being accessed by 5 other programs.
Now the user is either queuing up a bunch of background saves leading to overload or is forced to wait 1s per keystroke.
Well done!
I guess the simple solution then is to tell the user to buy a $3000 laptop just so it's capable of running notepad.
Hmm...Mac laptops have been SSD-only for how many years?
Anyway, even laptop drives are well over 40-50 MB/s these days, and any disk scheduler worth its salt will schedule this kind of write (one contiguous chunk) near optimally, so still 40-50 writes/s.
And of course, you queue these writes asynchronously, dropping as needed, so if you actually manage to out-type your disk, all that happens is that your save-rate drops to every couple of characters. Big whoop.
Also remember that this is Moby Dick we're talking about. 700+ pages, so something that vastly exceeds the size of the kinds of documents people are likely to attempt with a Notepad class application.
Last but not least, this is a thought experiment to demonstrate just how incredibly fast today's machines are, and that if something is slow, it is almost certainly because someone did something stupid, often in the name of an "optimization" that turned into a pessimization, because Doing the Simplest Thing that Could Possibly Work™, i.e. brute-forcing, would have been not only simpler but significantly faster.
Raise your hand if just running your web browser has pegged your top-end, multiprocessor, high-mem system in the last month. Both Firefox and Chrome have for me.
I think that's largely due to JavaScript and its ecosystem of abstraction-bloat that has been mentioned in another comment here, along with the trend of "appifying" what should really be static sites. A static page that contains even dozens of MB of content won't stress a browser as much as a "web app" containing only a few hundred KB of countless JavaScript frameworks glued together --- despite the latter presenting a fraction of the actual content.
I used to read the blog on virtualdub.org (video capture and processing) and enjoy his rants on bundled library bloat. VirtualDub was small in footprint and great to use. So do programmers become too reliant on scaffolding, or is it a necessity as you learn?
Why not both? I mean, I wouldn't say that it's actually necessary, but scaffolding exists to hide away the incidental complexities of the problem being solved, revealing the problem for what it is. Demonstrations of recursion and pattern matching tend to use the same problems because they're such a good fit that there's a very close correspondence between the high-level explanation of how to solve the problem and the code itself.
At the same time we ought to be aware of that scaffolding and how it works (or could work), and how to build such abstractions ourselves. Not just because all abstractions leak[1][2] and potentially introduce bloat, but also because it means I don't have to pull in another dependency to save me a page (or three lines) of trivial code. Or maybe because the "standard" solution doesn't quite support your use case (I can't count the number of times that I've rewritten python's lru_cache[3] because of it not accepting lists and dicts).
I'm not sure that GBs of throughput make editor memory representations unimportant. Just this week my friend said he killed Emacs with a few-MB text file. I was astonished that software of that esteem would struggle with that.
In my first job I would routinely open 10-20 MB files in Emacs. It handled it just fine. I mean, it gives a warning that this is considered big, but I ignored it.
Now if you open a large text file in something other than text mode, it could bring it to its knees depending on the mode. As an example, opening an XML file in the nXML mode is quite expensive, because nXML mode is powerful and utilizes your XML structure. I just tried a 12 MB XML file and told it to go to the end of the file. It's taking Emacs forever to do it (easily over 30s). But if I switch to text mode for that same file, it handles it just fine.
I just tried an 800 MB text file. It handled it fine.
The one thing where you can easily get in trouble: Long lines. Emacs cannot handle long lines well. Kinda sad.
But that's massively better than just an array. It changes the time complexity from O(char_inserts x file_len) to O(cursor_moves x file_len), which is likely a couple orders of magnitude better.