
I'm dealing with this very question right now, only I'm coming from a different angle. I've already picked the language, but I'm attempting to build a framework in it that makes it work in a very different domain.

What I'm working on is sort of an answer to node.js. It is a coffeescript platform (use js if you prefer) built on top of erlang. So, the coffeescript runs in an erlang environment. This means, when you call into the DB, this spawns off an erlang process. Your collection of coffeescript functions can be executed on any number of cores, or any number of hosts. In fact, your handler for a web request can be spread out all over a cluster, with each function running on the node that has the data... or it can all run on a single node, but across many processes. (to some extent the amount of distribution will be controllable as a configuration parameter-- so if you're doing processing that analyzes big data, you can move your code to the data and run it there, lowering the cluster communications load, but if the data is small, it may make sense to keep handling a request constrained to a single node where everything is conveniently in RAM.)

This is accomplished by compiling the coffeescript into javascript and running the javascript on a vm, specifically erlang_js, though I'm looking at going with V8 via erlv8. Your code and the libraries are all rendered into a single ball of javascript that we'll call the "application", which is handed off to various nodes.

How do I plan to get sequential code to work in a fundamentally distributed environment? That's the $64,000 question and why I'm bringing this up here-- I could be doing it wrong.

The plan is simple:

1. The developer needs to know that their application is not running in a single environment, and account for that.

2. Each entry point the developer provides to the platform's API is assumed to be potentially running in isolation in a separate process.

3. There's a shared context that all the processes have access to (an in-RAM Riak database where the bucket is unique to a given request, but the keys are up to the developer).

4. The APIs let the developer supply callback functions which will be called when the data is available. (E.g.: "go fetch a list of blog posts" could have a callback that is invoked when the list is returned from Riak.)

5. There's a set of known phases that each request goes through, in a known sequence, and we don't move on to the next phase until the processes spawned by the previous phase are finished. All of the phases are optional, so the developer can implement as many as they want or only a single one. The phases are: init, setup, start, query, evaluate, render, composite, and finish. The assumption is that you can get your app to work with 9 opportunities to do a bunch of DB queries and get the results.

6. Init will be called when the request comes in. Init can cause any number of processes to be started (DB queries, map reduce, etc.). They will all be finished, and their callbacks called (if any), before setup is called. Setup can also spin up any number of processes, and so on. All of these are optional, and a hello-world app might implement just one (it doesn't matter which).
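To make the phase barrier concrete, here is a minimal sketch in plain javascript. Everything in it is illustrative: `contextSet`/`contextGet`, `lookupData`, and the `runRequest` runner are invented names standing in for the platform's real API and the in-RAM Riak context, and the "DB" is faked with a timer.

```javascript
// Sketch of the phase barrier: a handler implements only the phases it
// needs; each phase may spawn async work (stand-in for DB queries), and
// the runner waits for all of it before calling the next phase.
const context = new Map();                 // stands in for the per-request Riak bucket
const contextSet = (k, v) => context.set(k, v);
const contextGet = (k) => context.get(k);

// Fake DB lookup: resolves later, firing the developer's callback first.
const lookupData = (key, cb) =>
  new Promise((resolve) => setImmediate(() => { cb("value-for-" + key); resolve(); }));

// A hello-world-ish handler that only implements two phases.
const handler = {
  init(pending) {
    pending.push(lookupData("posts", (rows) => contextSet("posts", rows)));
  },
  render() {
    // init's query is guaranteed to have finished by now.
    contextSet("html", "<ul>" + contextGet("posts") + "</ul>");
  },
};

// Platform side: run the named phases in order, draining pending work
// between them so each phase sees the previous phase's results.
async function runRequest(h) {
  for (const phase of ["init", "setup", "start", "query",
                       "evaluate", "render", "composite", "finish"]) {
    const pending = [];
    if (h[phase]) h[phase](pending);
    await Promise.all(pending);            // the phase barrier
  }
  return contextGet("html");
}
```

The point of the sketch is only the control flow: the developer writes sequentially-styled phases, and the barrier between them is what lets each phase trust that the previous phase's queries have landed in the context.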

So, the developer can write in a sequential style: the phases are called reliably in sequence, and in each phase the previous phase's queries are known to have data. Each phase can cause more queries, or even spin up other apps, which will be rendered before the next phase. And they get the results from a context that is always available.

This way, init, start, query and render could all run on different nodes, though they would run in sequence and each one would have access to the shared context for the request.

Another way of looking at this, and the way it might be implemented, is that each of those phases is a long running process that lives on, and is invoked with different contexts each time to handle its part of handling a query. (So this lets us, or the developer, experiment with the right way to arrange things for best resource utilization, since the results can be dramatically different depending on the kind of work the application needs to do.)

That's how I'm running a sequential language in a genuinely distributed manner: you can think in callbacks, or in phases, or both, and your coffeescript really can run in parallel.

A downside of this, though, is that you couldn't write a request handler that, say, generated a random key, did a lookup on the database, and then looped and did that again until it got a result it liked. You have your 9 phases, and that's it, for a given request. However, there is an API to invoke another application (e.g.: you could have a login application that is responsible for part of your page; so, rather than implement a login/logged-in area on each page, you write it once and include it as a sub-application). Conceivably you could do recursion, but I haven't thought about the consequences of that yet. This does sort of lock you into a specific way of doing things, which is why there are 9 phases: if you only need 3, only implement 3... but if you need all 9, you have them.
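The sub-application idea might look something like the following sketch. The registry, `defineApp`, and `invokeApp` are hypothetical names I'm using for illustration; the real API is not specified in the comment.

```javascript
// Sketch of sub-application composition: a page delegates a region
// (the login box) to another application instead of reimplementing it.
const apps = new Map();
const defineApp = (name, render) => apps.set(name, render);
const invokeApp = (name, params) => apps.get(name)(params);

// Written once, included everywhere it is needed.
defineApp("login", ({ user }) =>
  user ? "<div>Hello, " + user + "</div>" : '<a href="/login">Log in</a>');

defineApp("blogPage", async ({ user }) => {
  const loginBox = await invokeApp("login", { user });  // the sub-application
  return "<header>" + loginBox + "</header><main>posts go here</main>";
});
```

Recursion would be a matter of an app invoking itself through the same registry, which is exactly where the unexamined consequences mentioned above would show up.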

I'm sure I've managed to make something that is not so complicated sound muddy. This works for me, since coffeescript is convenient and it is easy for me to think in terms of erlang concurrency, but it might be an adjustment for js programmers who are used to setting variables and expecting them to be there later on (you'd just have an API that stores the values under a key).

If you're interested in this project, you can find periodic announcements on twitter @nirvanacore. I expect to have an alpha sometime in late September, and a beta sometime after Riak 1.0 (on which this is based) ships.

Apologies if it seems like I'm hijacking a thread here... obviously my thoughts are about concurrency, but I differ from the author in assuming json for common data structures, and in programming directly in coffeescript/javascript. I'm not too worried about compiled speed; I'm more interested in concurrency than performance. I'd rather add an additional node and have a homogeneous server infrastructure, with no thinking about server architecture, than try to optimize for single-CPU performance, etc.



Are you going to be able to pass data between the phases other than through the DB? It doesn't sound like it from your description, but living without a closure equivalent would be painful. Maybe some way to add data that gets message-passed to the next phase?

Sounds interesting.


My "through the DB" solution is not as good as a heap or stack would be, but it's not as bad as it might sound, because the DB lives in memory. If, in a given phase, you have some data, you add it to the context, and it will be there in the next phase.

It would be easy to have an API along the lines of "in the next phase, call this function and pass it this data". I could make an API that does that, or you could put the data under a key in the context and then call that function at the beginning of the next phase. If the set of functions you'd like to have called that way varies from request to request, they could be stuffed in a list under a key, and you just process each of the functions in that list.
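That "call these in the next phase" pattern can be sketched in a few lines. The names (`deferToNextPhase`, `runDeferred`, the `"deferred"` key) are all invented for illustration; the real mechanism would live behind the platform's context API.

```javascript
// Sketch: stash callbacks (with their data) in a list under a context
// key during one phase, then drain that list when the next phase starts.
const context = new Map();

const deferToNextPhase = (fn, data) => {
  const queue = context.get("deferred") || [];
  queue.push({ fn, data });
  context.set("deferred", queue);
};

const runDeferred = () => {
  const queue = context.get("deferred") || [];
  context.set("deferred", []);             // clear before running
  for (const { fn, data } of queue) fn(data);
};
```

The platform (or the developer, by convention) would call `runDeferred()` first thing in each phase, which gives a crude but workable substitute for closures carrying state across the phase boundary.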

I think it will be quite possible to provide something equivalent to closures via an API. I can't yet say how syntactically convenient they will be, but really not too bad, I don't think.

On further thought, I think it would be quite possible to do actor-style message passing. I'm focusing a bit much on the mechanics of implementation right now rather than on making this transparent, but the context could easily be used to manage a set of mailboxes and "processes" where, in each phase, or even between phases, whenever a message is available in a mailbox, the function it was sent to gets woken up and executed. In fact, not a function, but a process.

So, I can add an API that provides an actor-model interface. The actors can be identified by a process ID, they can send messages to each other (addressed by PID) with arbitrary data, and this can happen concurrently in coffeescript.
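A minimal sketch of such an actor layer, in javascript. Here a "pid" is just an id for a function plus its mailbox; `spawn`, `send`, and `runUntilQuiet` are hypothetical names, and the single-threaded drain loop stands in for whatever scheduling the platform would really do across erlang processes.

```javascript
// Pseudo-processes: each pid maps to a function and a mailbox; send()
// enqueues a message; the scheduler invokes the function once per message.
let nextPid = 0;
const actors = new Map();                  // pid -> { fn, mailbox }

const spawn = (fn) => {
  const pid = ++nextPid;
  actors.set(pid, { fn, mailbox: [] });
  return pid;
};

const send = (pid, message) => actors.get(pid).mailbox.push(message);

// Drain all mailboxes until quiescent: the "no more messages waiting"
// condition that would let the platform move on to the next phase.
const runUntilQuiet = () => {
  let progressed = true;
  while (progressed) {
    progressed = false;
    for (const [pid, a] of actors) {
      while (a.mailbox.length) {
        progressed = true;
        a.fn(a.mailbox.shift(), pid);      // message delivered as an argument
      }
    }
  }
};
```

Messages sent during delivery simply land in mailboxes and get picked up on a later pass, which is what makes send-from-within-an-actor safe in this sketch.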


Wouldn't it be cleaner if you sent messages to a computation state (this request in a future phase) as an indirection, since the pid might not be allocated yet?


I think the pids are getting confused. When I say pid, I mean an id for a combination of a given function and some data: an instance, a fake sort of process, facilitated by my code invoking the function with the data from its mailbox whenever a message is sent to the function by another "process". I'm not talking about erlang processes or "real" processes. So you wouldn't have the problem of the pid not being allocated yet, because you would allocate it yourself.

example in pseudo coffeerlangscript:

    init ->
      pidOne = spawn(functionA, argumentlist)
      pidTwo = spawn(functionA, differentarguments)
      contextSet("pidOne", pidOne)
      contextSet("pidTwo", pidTwo)
      lookupData(bucket, key, pidOne)
      lookupData(bucket, key, functionB)

    functionA(message) -> doStuff()

So, here you're "spawning" two processes. For a function to act like a process, it is written such that it takes any messages it gets as arguments. I could set up their own contexts too, so contextSet in pidOne and pidTwo would be unique namespaces. lookupData, instead of taking a function to invoke, takes a process, and sends it a message when it has retrieved the data off of the disk.

functionB could send a message to pidOne and pidTwo (which it can find in the context).

So, the init phase happens here, and later the start phase will be called. But the thread of execution would be: init runs; the database queries happen in parallel; when they succeed, pidOne gets a message and functionB is called (possibly running in different environments). functionB sends a message to pidOne and pidTwo, both of which are invoked with these new messages. When there are no more messages waiting for any of these pseudo-processes, and no more database queries or other long-running processes are still running in parallel, the next phase is called.
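That walkthrough can be sketched end to end. As before, every name here (`spawn`, `send`, `lookupData`, `drainMailboxes`, `initPhase`) is invented for illustration, the "DB" is a timer, and a single drain loop stands in for the platform's real quiescence check.

```javascript
// Sketch of the init-phase execution order described above: queries run
// in parallel, results are delivered as messages to pseudo-processes,
// and the phase ends only once queries are done and mailboxes are empty.
const actors = new Map();
let nextPid = 0;
const spawn = (fn) => { actors.set(++nextPid, { fn, mailbox: [] }); return nextPid; };
const send = (pid, msg) => actors.get(pid).mailbox.push(msg);

// Fake async lookup: addressed to a pid, it delivers the value as a message.
const lookupData = (bucket, key, pid) =>
  new Promise((res) => setImmediate(() => { send(pid, bucket + "/" + key); res(); }));

const drainMailboxes = () => {
  let again = true;
  while (again) {
    again = false;
    for (const a of actors.values())
      while (a.mailbox.length) { again = true; a.fn(a.mailbox.shift()); }
  }
};

async function initPhase() {
  const results = [];
  const pidOne = spawn((msg) => results.push("A:" + msg));
  const pidTwo = spawn((msg) => results.push("B:" + msg));
  await Promise.all([                      // queries run in parallel
    lookupData("posts", "latest", pidOne),
    lookupData("users", "ana", pidTwo),
  ]);
  drainMailboxes();                        // quiescent: next phase may start
  return results;
}
```

The interesting property is the termination condition: "queries resolved" plus "mailboxes empty" together are what the platform would test before moving from init to start.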

If you're saying there's a better way to do this, my ears are open, I just need a little more explanation.


Ah, ok. By "pid" I took you to mean a unixy pid or an Erlang mailbox. What you are saying is what I was thinking...





