I have always wondered why assembler is written the first way
MOVE src, dst
ADD src, dst
rather than the far more intuitive (and slightly more compact) second, something like
dst := src
dst += src
This also completely eliminates questions about which direction the data moves: is 'mov a, b' a := b or b := a, for example?
I can't see any reason for not using the established C-type notation, so why is the original style always perpetuated?
I'm aware that C approximately maps onto the original PDP ISA and has been called a high-level assembler, true or not that's irrelevant, but why the higher-level syntax has never made its way to lower level ASM has baffled me.
> why the higher-level syntax has never made its way to lower level ASM has baffled me
Traditionally, assembler was used both for bootstrap processes and on older, heavily resource-constrained systems, so there was a lot of emphasis on making it as simple to parse as possible: opcode first, then arguments, because the opcode comes first in the byte stream of almost all variable-length-instruction systems.
And of course no support for complex expressions, so no point in building a full mathematic expression parser.
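As a concrete illustration of how little machinery opcode-first syntax needs, here's a toy sketch (in Python; the function and its behavior are mine, not any real assembler's) — one split per line, no precedence rules, no expression grammar:

```python
def parse_line(line: str):
    """Parse 'OPCODE arg1, arg2' — one split, no expression grammar needed."""
    line = line.split(";")[0].strip()   # drop comment and surrounding whitespace
    if not line:
        return None                     # blank or comment-only line
    mnemonic, _, rest = line.partition(" ")
    operands = [op.strip() for op in rest.split(",")] if rest.strip() else []
    return mnemonic.upper(), operands

print(parse_line("add r1, r2   ; r2 += r1"))   # ('ADD', ['r1', 'r2'])
```

An infix syntax like `dst += src` needs at least operator tables and lookahead; this needs neither, which mattered when the assembler itself had to fit in a few KB.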
We don't have such resource-constrained systems now, and I doubt parsing asm ever took up many resources anyway.
I'm not suggesting we permit complex (i.e. general opcode-combining) expressions. I deliberately never suggested it.
As for straightforward FMAC-type instructions, that's even clearer in my notation: a += b * c
As for the proliferation of those add types you linked to: OK, possibly valid, but how much of this are you going to be writing by hand compared to the very mundane non-SIMD, non-packed instructions? My guess is very little, as you have relatively few such instructions (although they do a great deal of work over streams of data, but that's not relevant).
Oddly, I think it's the weird instructions that people are going to write more often: hardly anyone writes assembler compared to the far more likely use case of reading disassembly, and when people do write it, it's usually specifically to do something that's difficult or impossible to get a high-level language to emit.
I'll be damned, it still exists, in all its glory: http://terse.com/
> TERSE represents a whole new concept in low-level programming and is the first real advance in assembly language programming since the invention of the Macro Assembler.
> TERSE is an x86 specific programming language compatible with the entire processor family from the 8088 through the Pentium 4 and beyond. It is a machine-level language that gives you all of the control available in assembly language with the ease-of-use and the look-and-feel of a high-level language like C.
> TERSE is a very mature language. Conceived in 1986, implemented in 1987, proven in real world applications for over a decade, and used by Fortune 250 corporations, universities, and programmers on six continents since 1996! TERSE has virtually replaced assembly language for time and/or space critical applications in embedded x86 based and PC applications.
Good ol' Terse. Terse & HLA [0] always tended too far toward the ASM side, which made writing them feel more like messy ASM. I wanted something more like C. Then I realized I already had working C.
I like to say that AT&T syntax follows natural English flow and the byte order of the finally assembled binary code, while Intel followed a more mathematically oriented syntax. It might be a bit of baby duck syndrome, since I learned math and Intel first, but Intel always seemed more intuitive to me as well.
However, ISAs developed later that are primarily MIPS-like follow the Intel convention and extend it, because it aligns better with 3AC [0] and SSA [1] based optimizations.
It's very useful to have the actual instruction mnemonics spelled out, instead of relying on an implicit (often overloaded) definition of "+". E.g. there might be a 8-bit, 16-bit, 32-bit integer add, a 32-bit, 64-bit floating point add, a 16x8 vector integer add, etc.
So an annotated operator for a long type? Or have a non-annotated opcode for the native machine word size (just '+') and annotated ones for anything smaller (edit: anything other) - '+B' for byte addition.
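A quick sketch of what that annotation scheme could look like (Python; the operator table and mnemonic names are hypothetical, just to show the idea that a bare '+' defaults to the native word size):

```python
# Hypothetical mapping from annotated operators to mnemonics:
# bare '+' is the native machine word, suffixes pick other widths.
OP_TABLE = {
    "+":  "ADD",    # native machine word size
    "+B": "ADDB",   # byte
    "+W": "ADDW",   # 16-bit word
}

def lower(dst: str, op: str, src: str) -> str:
    """Translate 'dst <op>= src' notation to a src-first mnemonic form."""
    return f"{OP_TABLE[op]} {src}, {dst}"

print(lower("r1", "+B", "r2"))   # ADDB r2, r1
print(lower("r1", "+", "r2"))    # ADD r2, r1
```

The width ambiguity the parent raises doesn't go away, but it moves into a small, explicit suffix rather than a whole separate mnemonic.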
An alternative, and perhaps much better, higher-level solution is typed ASM. I believe there is work in this area (a quick search turns up http://www.cs.cornell.edu/talc/), which would allow the type to be implicit but checkably safe, except for the occasional explicitly needed coercion.
Randall Hyde created a higher-level assembler back in the '90s. It supported writing C-like control statements, etc. It doesn't look like it has been updated to x86-64; it's still just 32-bit.
Very recently I picked up an FPGA dev board and started playing with implementing my first toy soft-CPU. For fun, and definitely not for profit, I decided to design my own ISA for it.
I decided to see if I could make a RISC-y MISC (minimal instruction set computer) design. From what I could see a lot of MISC-based computers had quite complex instructions.
While my resulting ISA is likely quite crap, being my very first ISA ever, it's been quite a fun exercise so far. I programmed quite a lot of asm back in the day, but thinking about which instructions are needed and why was something else.
Oh yeah, I also forgot to add that once you have some hardware you'll most likely want a cheap logic analyzer that supports sigrok[1], like this[2] one. Some LEDs, a breadboard, etc. are useful too.
I've so far been quite happy with the iCE40UP5k[1] based dev kit I got, though there are a lot of options out there[2].
The iCE40 FPGAs are a bit wimpy compared to Altera and Xilinx offerings from what I understand, but I really liked the idea of an open-source toolchain[3][4] being available.
To get a taste without committing cash you could just use a simulator[5], which I imagine you'd be using a fair bit anyway as it allows you easier access to the internal state.
As to actually programming, I've found the following resources useful. I started with Verilog mainly because code-gen tools like nMigen[6] generate Verilog, but for writing by hand it seems VHDL is preferred.
Anyway, links, first off some exercises to get going[7]. Introduction to Verilog[8], also has a nice general overview of HDL. Details of how non-blocking vs blocking statements in Verilog works[9], quite specific but was very informative for me.
There's also quite a lot of activity over at Reddit[10], and good experiences over at the EEVBlog forums[11].
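For anyone skimming, the classic illustration of blocking vs. non-blocking is a two-register swap (a Verilog sketch; clk, a, and b are assumed declared elsewhere). Non-blocking assignments sample both right-hand sides at the clock edge before either register updates, so the swap works; blocking assignments clobber the value the second statement needs:

```verilog
// Non-blocking (<=): both RHS values sampled at the edge, swap works.
always @(posedge clk) begin
  a <= b;
  b <= a;
end

// Blocking (=): a is overwritten first, so both registers
// end up holding b's old value -- no swap.
always @(posedge clk) begin
  a = b;
  b = a;
end
```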
Additionally, I'd like to clarify that nMigen generates Verilog indirectly; it actually generates RTLIL, which is the intermediate representation of Yosys, and then Yosys turns it into Verilog after some cleanup passes.
I'll happily admit to being biased, but nMigen is so much easier for me to work in than Verilog ever was.
Ah, I saw the redirect but figured it was better to use the plain URL. Sadly too late for an edit now.
Hadn't quite gotten the connection with Yosys right yet, as I haven't yet started with nMigen, but yeah I definitely want to go there. But like I said, I prefer getting a good handle on Verilog first so I know what to look for when things go wrong.
I was inspired to all of this by the YouTube video series by Robert Baruch[1], in which he recreates the 6800 CPU on an FPGA using nMigen with formal verification.
Verilog for writing RTL is fine, especially if you use the synthesizable subset of SystemVerilog. There used to be a bit of a religious war between VHDL and Verilog, as VHDL had some syntax that prevented certain errors, but with SV and some basic coding guidelines it's fine. While I'm sure some will still be hanging on to VHDL, I'd say most of the industry is going the SV way.
SystemVerilog actually improves things a lot, but the problem is that there is (currently) no freely available, robust synthesis frontend that supports the majority of the useful SV features (e.g. interfaces). In fact I don't think there are any robust, ~complete FOSS SystemVerilog simulators either -- though Icarus and Verilator support SV to varying degrees...
So if you want to stick with FOSS tools, then you're stuck with synthesizable Verilog-2005 at best, for the moment. And standard Verilog very much sucks in a lot of ways, I would argue, synthesizable subset or not. It's an understandable choice though, in a field full of awful options. One day I'm hopeful Yosys will support most of the necessary SystemVerilog features people want for synthesis... (Then, it can also serve as an effective SystemVerilog -> Verilog translation tool, which would be very useful on its own.)
Also it's very important to keep in mind that HDLs describe hardware, the HDL code is not executed on the hardware. Think of it as writing C++ templates.
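For example (a Verilog sketch, module and signal names mine): a generate loop doesn't "run" at all — it elaborates into eight parallel XOR gates, much like a C++ template expands at compile time rather than executing at run time:

```verilog
module xor8 (input  [7:0] a, b,
             output [7:0] y);
  genvar i;
  generate
    for (i = 0; i < 8; i = i + 1) begin : g
      assign y[i] = a[i] ^ b[i];   // all eight gates exist simultaneously
    end
  endgenerate
endmodule
```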
I really like conditional subroutine calls. I wish x86 had them. They make it really easy to inline the fast path of some safety check (e.g. null check, bounds check, write barrier, etc.) and have the slow path factored out to a common place. What PL implementations like the JVM and JavaScript engines typically do without this is insert a conditional branch to "deferred code" at the end of the function (statically predicted not-taken), but that deferred code can't be shared, because it needs to branch back to the mainline code. That costs code size. A conditional subroutine call is exactly the right mechanism to solve this!
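To make the contrast concrete (a hand-written sketch; label and function names are mine): on x86 the deferred slow path needs its own jump back, so each call site needs its own copy, whereas 32-bit ARM's conditional branch-and-link (BLEQ is a real encoding) folds the whole pattern into one instruction and returns to the next instruction automatically:

```asm
; x86: branch out to deferred code, which must branch back
        test  rdi, rdi
        jz    .slow            ; statically predicted not-taken
.resume:
        ; ... fast path continues ...
.slow:
        call  null_check_fail
        jmp   .resume          ; return point baked in -> not shareable

; 32-bit ARM: conditional branch-and-link
        cmp   r0, #0
        bleq  null_check_fail  ; called only if r0 == 0; slow path shareable
```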
> 16 registers divided into two areas: R0 to R7 are in fact a window to a register bank containing 256 times 8 registers while R8 to R15 are fixed. This architecture makes subroutine calls and saving registers very easy (just increment/decrement the register bank pointer which is part of the status register). All in all QNICE features 256 * 8 + 8 = 2056 registers.
This is indeed nice and the kind of thing whose availability can affect the design of low-level languages, e.g. non-ISO local variables in a Forth dialect.
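A toy Python model of the windowing scheme from the quote (my own sketch, not from the QNICE docs): R0-R7 index into the current bank, R8-R15 are fixed, and a call/return is just a bank-pointer bump.

```python
class WindowedRegs:
    """Toy model: R0-R7 are a window into 256 banks of 8; R8-R15 are fixed."""
    def __init__(self):
        self.banks = [[0] * 8 for _ in range(256)]  # 256 * 8 windowed registers
        self.fixed = [0] * 8                        # R8-R15, shared by all windows
        self.bank = 0                               # bank pointer (in the status reg)

    def read(self, r):
        return self.banks[self.bank][r] if r < 8 else self.fixed[r - 8]

    def write(self, r, v):
        if r < 8:
            self.banks[self.bank][r] = v
        else:
            self.fixed[r - 8] = v

    def call(self): self.bank = (self.bank + 1) % 256  # fresh R0-R7, caller's preserved
    def ret(self):  self.bank = (self.bank - 1) % 256  # caller's R0-R7 restored

regs = WindowedRegs()
regs.write(0, 42)    # caller's R0
regs.call()          # subroutine gets a fresh window
regs.write(0, 7)     # does not clobber caller's R0
regs.ret()
print(regs.read(0))  # prints 42
```

No register saving on the stack at all — which is exactly the appeal, and also why window overflow (a 257th nested call here) needs special handling in real designs.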
I got it from a very reliable source, someone who was involved in evolving that hardware, that this feature caused Sun "an awful lot of pain" (his words, best I can recall).
My understanding is that it badly hampered the ability to do out-of-order execution, on top of the original designers' failure to understand what the compiler could do with inlining, which would largely negate the value of the feature. From what I've read over the years, the SPARC hardware guys didn't talk to the compiler guys - a trap the designers of the DEC Alpha very carefully did not fall into.
Since this is a teaching project, I'm sure that feature is fine, but just saying.
I heard it said by a guy at a conference. This guy: https://en.wikipedia.org/wiki/Ivan_Sutherland Sutherland was interested in the SPARC, which is surprising given that he's better known for graphics, but there you go.
As it happened, Steve Furber was there too. Irrelevant but bragging rights & all that.
Sutherland was something like a cofounder of Sun Labs, so I guess he was likely to develop some interest. I'm afraid I couldn't track down anything more specific, but I'll keep my eyes peeled. An interesting point I did find: the SPARC architecture was later extended so that register windows could be saved and restored other than via SUB calls, allowing instruction reordering and their use for context switching.
The Register window article doesn't talk about Sun's experience or the issue of reordering instructions.
I didn't know he was a Sun co-founder. I remember him saying that he had an interest in geometry rather than graphics, which related to chip design - but that link is to me a bit nebulous, and it was a long time ago anyway.
Sun would not advertise a horrifically expensive design mistake, so it's no surprise you can't find much. I've picked up a fair bit from random reading over the years, so I can't remember where much of it came from.
Perhaps email Mr. Sutherland and just ask? Worst that can happen is he doesn't respond.
(thanks for the bit about using other than SUB calls, I didn't know).
Not Sun but Sun Labs. Ivan's firm, Sutherland, Sproull and Associates, was bought by Sun in 1990 to become the seed of the new Sun Labs.
As for register windows: overflows and underflows generated traps to the operating system, and for the combination of SPARC version 8 and SunOS that meant thousands of clock cycles. That was improved in later products.
Berkeley's RISC I through IV all had register windows, but RISC-V doesn't, with the argument that we have far better compilers now. Altera's NIOS processor had register windows, which were dropped in NIOS II because doing so made the processor smaller without reducing performance too much.
The AMD 29000 had a more flexible register window scheme and the Itanium a very complex scheme.
Computer history is often more like a spiral than a line and old ideas that have become bad might be good again in the future. With out-of-order execution and register renaming you might once again get better performance out of a binary with register windows.
It doesn't surprise me that having more visible registers and doing a good job of register allocation in the compiler is a better strategy (i.e., both more efficient and more flexible), but doing so means you are not so close to the metal. From the point of view of an efficiency feature that doesn't ask for complexity in the compiler, I see some appeal to register windows.
It's not just that. If you have a huge internal register file and register renaming, along with a few other tricks, then it's at least as efficient (and potentially more efficient) to just push and pop the registers that actually get used.
Is 'efficient' what you want though? For embedded, yes. For maximum speed where you are willing to blow large power use, you can be much faster but it will cost you efficiency, badly (think Xeons). Without quantifying what you're trying to do, 'efficient' is only a metric not a target.
Have there ever been any attempts to systematically generate processor instruction set designs, evaluate them against 'real' code, and measure the results?
Edit: You'd need to generate compiler back ends for each design as well, which might be fun...