--- Log opened Sun Aug 28 00:00:45 2016
SMDhome1 | ZipCPU: great news regarding UART, and not so great news about OpenRISC performance! | 02:20 |
SMDhome1 | Also, what's happening now at google summer of code? | 03:29 |
ZipCPU | SMDhome1: I'm not so sure I would declare the performance we measured "not so great." | 08:36 |
ZipCPU | A better characterization is that it "makes sense". | 08:36 |
ZipCPU | OpenRISC has no memcpy() or strcmp() instructions like the VAX did, so ... you'd expect a bit of a difference. | 08:37 |
ZipCPU | It's actually reflective of the RISC vs CISC tradeoff, and much to be expected. | 08:40 |
kc5tja | ZipCPU: How so? I was under the impression that a good RISC microarchitecture can compete with a CISC in pretty much all benchmarks. | 13:14 |
kc5tja | I guess I don't know too much about Dhrystones. | 13:14 |
ZipCPU | kc5tja: We were actually comparing Dhrystone MIPS / MHz. It's a clock independent measure of CPU speed. Multiply it by your clock speed, and you get a measure of Dhrystone MIPS. | 13:15 |
kc5tja | What numbers are you getting for OR1K? | 13:16 |
ZipCPU | Dhrystone MIPS is a measure of your CPU speed when compared with a VAX at a 1 MHz clock speed, which is deemed to be 1 DMIPS. | 13:16 |
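[Editor's note: a minimal sketch of the arithmetic being described. The raw score and clock below are invented placeholders, not the measured OR1K or ZipCPU results; 1757 Dhrystones/second is the conventional VAX reference defined as 1 DMIPS.]

    /* Illustrative only: convert a raw Dhrystone score into DMIPS and
     * DMIPS/MHz.  1757 Dhrystones/second is the VAX reference score,
     * defined as 1 DMIPS.  The input numbers here are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        double dhrystones_per_sec = 175700.0;  /* hypothetical raw score */
        double clock_mhz          = 80.0;      /* hypothetical clock     */

        double dmips         = dhrystones_per_sec / 1757.0;
        double dmips_per_mhz = dmips / clock_mhz;

        printf("DMIPS = %.2f, DMIPS/MHz = %.3f\n", dmips, dmips_per_mhz);
        return 0;
    }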
ZipCPU | I'm convinced the OR1K and even the ZipCPU would beat the VAX at DMIPS alone, simply because the instruction sets are simpler and so the clock rates can be faster. | 13:17 |
kc5tja | That much I know, but it's hard to keep it in perspective with respect to other architectures. | 13:17 |
ZipCPU | Well, consider this, RISC machines tend to have higher clock speeds than CISC machines, right? | 13:18 |
ZipCPU | And CISC machines can do more logic per clock, no? | 13:18 |
kc5tja | More logic, sure, but more useful logic remains debatable. | 13:19 |
kc5tja | The papers describing RISC-I used VAX as their benchmark, and showed how RISC-I basically had a constant factor performance gain over VAX. | 13:20 |
ZipCPU | Yes ... but ... for the same clock rate? | 13:21 |
ZipCPU | (BTW ... my plan is not to release OR1K's score until ORCONF ... sorry, but we can discuss hard numbers then) | 13:22 |
kc5tja | I'd have to check again; if there was a difference, it wasn't much. Original RISCs were clocked around 12MHz, IIRC. | 13:22 |
ZipCPU | Hmm ... looking for numbers today, I've got Brakefield's comparisons. Is a PDP11 or PDP8 architecture at all related to the VAX? | 13:24 |
kc5tja | PDP8 not so much, but the PDP-11 is the VAX's spiritual predecessor. | 13:27 |
ZipCPU | Now, looking at Brakefield's work, there's a Spartan-6 implementation of a PDP-11 that can run at 64MHz. | 13:28 |
kc5tja | VAX in fact stands for Virtual Address eXtensions, as the original VAXes had the ability to run PDP-11 code in hardware. The 32-bit CPU was always there of course, but VAX was to be an upgrade path from the PDP-11. | 13:28 |
ZipCPU | That implementation comes from the pdp11-34verilog project found (at least at one time) on www.heeltoe.com | 13:30 |
ZipCPU | I know that I can run the ZipCPU at 80MHz on a Spartan-6, so ... there's a normalizing difference there. | 13:30 |
kc5tja | This page: http://heather.cs.ucdavis.edu/RISC.pdf suggests that RISC-I/II software was no more than 50% bigger than an equivalent VAX program. Put another way, the CISCness of the VAX should not contribute much to the DMIPS benchmarks. | 13:32 |
ZipCPU | I disagree completely. I got one benchmark score with no movc5 instruction, and then taught my DMA to do a multi-word move and got quite the performance boost. | 13:32 |
* ZipCPU is reading Norman Matloff's paper now | 13:37 | |
* ZipCPU was just called away to lunch ... | 13:38 | |
kc5tja | Sure, but a RISC with an unrolled loop would compete with your DMA transfer quite well. | 13:39 |
kc5tja | Move 20 or 30 words per iteration, and you reduce the looping overhead to 1/20 or 1/30 of what it would be normally. Hence, RISC code is bigger, but just as fast. | 13:40 |
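[Editor's note: a rough sketch of the unrolled block-move idea kc5tja describes; the function name and the factor of eight are arbitrary choices, not code anyone in the channel actually ran.]

    /* Copy 'nwords' 32-bit words, eight per loop iteration, so the
     * compare-and-branch overhead is paid once per eight moves rather
     * than once per move.  Assumes nwords is a multiple of 8 for brevity. */
    #include <stdint.h>
    #include <stddef.h>

    void copy_words_unrolled8(uint32_t *dst, const uint32_t *src, size_t nwords)
    {
        for (size_t i = 0; i < nwords; i += 8) {
            dst[i + 0] = src[i + 0];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
            dst[i + 4] = src[i + 4];
            dst[i + 5] = src[i + 5];
            dst[i + 6] = src[i + 6];
            dst[i + 7] = src[i + 7];
        }
    }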
kc5tja | DMIPS/MHz is looking like a bogus benchmark. What you want is DMIPS/total cycles executed. | 13:40 |
kc5tja | https://en.wikipedia.org/wiki/Instructions_per_second -- that moment when you realize the 6502 is 4x faster than an equivalently clocked 68000. | 13:41 |
kc5tja | This I *know* to be bogus, and amply demonstrates why I don't believe DMIPS/MHz. | 13:41 |
kc5tja | The 6502 is about 1/2 the speed of a 65816 in 16-bit mode, and the 65816 only gets about 80% the performance of a 68000. | 13:42 |
kc5tja | (and, yes, the 65816 has a block move instruction that takes 7 cycles per byte transferred.) | 13:42 |
kc5tja | Cycle per cycle, this makes OR1K and RISC-V vastly more efficient at block moves with only their basic instruction set. | 13:43 |
SMDhome1 | I think I've found what's wrong with the OpenRISC Dhrystone results | 13:59 |
SMDhome1 | ZipCPU uses the cycle count printed after the simulation is over, but that includes the cycles spent printing the results, etc. | 14:01 |
SMDhome1 | In this case we have two options: either we delete the printfs, or we increase the number of Dhrystone loops to eliminate the printfs' influence | 14:01 |
SMDhome1 | I'm running 1M Dhrystone loops now, but for 200k I got better results than ZipCPU | 14:02 |
kc5tja | Another question is which version of Dhrystone is being used. 1.0, 1.1, and 2.1 will all report different values for the same architecture. | 14:03 |
SMDhome1 | I'm using 2.1 | 14:03 |
ZipCPU | SMDhome1: Not so. I counted cycles (including printfs) for a 20k-iteration run, and cycles (including printfs) for a 10k-iteration run. I then took the difference as the time it took to do 10k iterations; the printf time and reset time should've been otherwise constant between the two runs. | 14:31 |
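[Editor's note: the differencing ZipCPU describes amounts to a small calculation. The cycle counts below are invented placeholders, not the actual measurements.]

    /* Two simulation runs that differ only in the Dhrystone loop count.
     * Subtracting them cancels the (roughly constant) reset and printf
     * cycles, leaving only the cycles spent on the extra 10k iterations.
     * All numbers here are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long cycles_20k = 9000000UL;  /* hypothetical total, 20k loops */
        unsigned long cycles_10k = 5000000UL;  /* hypothetical total, 10k loops */
        unsigned long iterations = 10000UL;    /* 20k - 10k */

        double cycles_per_iter = (double)(cycles_20k - cycles_10k) / iterations;
        printf("cycles per Dhrystone iteration: %.1f\n", cycles_per_iter);
        return 0;
    }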
ZipCPU | kc5tja: Let's discuss loop unrolling for a moment. When measuring the ZipCPU's performance, I unrolled the loops of the strcmp, strcpy, and memcpy manually. | 14:38 |
ZipCPU | While the Dhrystone rules state that the code must be compiled, i.e. it must come from GCC, they don't necessarily state that the library routines can't be hand-optimized. | 14:39 |
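[Editor's note: for illustration, a hand-unrolled strcpy in the spirit of what ZipCPU describes; this is a sketch, not his actual routine, and the four-way unroll factor is arbitrary.]

    /* A four-way unrolled byte-wise strcpy: the loop branch is taken once
     * per four characters instead of once per character.  Purely a sketch
     * of the hand-optimization being discussed, not benchmark code. */
    char *strcpy_unrolled4(char *dst, const char *src)
    {
        char *d = dst;
        for (;;) {
            if ((*d++ = *src++) == '\0') return dst;
            if ((*d++ = *src++) == '\0') return dst;
            if ((*d++ = *src++) == '\0') return dst;
            if ((*d++ = *src++) == '\0') return dst;
        }
    }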
SMDhome1 | ZipCPU you can unroll loops as you want, I guess | 14:39 |
ZipCPU | Well ... not quite. Dhrystone is not meant to be hand optimized. I'm sure there are those that do it, but it's *supposed* to be a measure that includes compiler performance. | 14:40 |
SMDhome1 | seems like false alarm, I'm rechecking | 15:25 |
_franck_ | ZipCPU, kc5tja : there are Dhrystone numbers here: http://www.juliusbaxter.net/openrisc-irc/search?q=Dhrystone | 15:38 |
_franck_ | coming from stekern_ | 15:39 |
kc5tja | gcc can be made to unroll loops, so don't feel too bad over it. :) | 15:47 |
kc5tja | Also, it's a common complaint against Dhrystone that you're really testing the compiler's standard library performance more than you are the CPU itself. | 15:47 |
kc5tja | Geez. 1.2 to 1.5 are not all that bad. In fact, it's positively stellar compared to many other, more commercially successful CPUs. | 15:52 |
kc5tja | *cough* Intel *cough* | 15:52 |
kc5tja | (Although, to be fair, they did soundly destroy the 68060 when the Pentium came out.) | 15:53 |
kc5tja | Now, see, I want to find out what Dhrystone ranking I get with my own RISC-V core, as well as with the S64X7. Should be enlightening. :) | 15:55 |
ZipCPU | _franck_: That's all well and good, but what I need is something that I can repeat and therefore observe. I'd like to know how that was done. So far, all I've heard about prior runs of the benchmark is that they are not to be trusted. | 16:15 |
ZipCPU | stekern: You were the one who ran Dhrystone last: Do you have any of the system, software, and/or assembly left behind from when you did it? | 16:16 |
ZipCPU | I was also told that the prior number I was using came from an unrealistic simulator ... not from actual logic. | 16:17 |
stekern_ | ZipCPU: http://oompa.chokladfabriken.org/tmp/dhry/ | 16:21 |
olofk | If you want to get rid of the printf overhead when running in simulations, maybe you should use the l.nop method to print bytes | 16:27 |
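[Editor's note: a sketch of what olofk's suggestion might look like. It assumes the common OpenRISC simulator convention where "l.nop 0x4" (often named NOP_PUTC) prints the character held in r3, bypassing the full printf path; the function names are hypothetical and this is not code from the project.]

    /* Assumes the or1ksim/testbench convention that "l.nop 0x4" emits the
     * character in r3 to the simulator console.  GCC-style inline asm. */
    static inline void sim_putc(char c)
    {
        register unsigned long r3 __asm__("r3") = (unsigned long)c;
        __asm__ volatile ("l.nop 0x4" : : "r"(r3));
    }

    static void sim_puts(const char *s)
    {
        while (*s)
            sim_putc(*s++);
    }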
stekern_ | surely the printfs should not be part of the measurements | 16:28 |
-!- Netsplit *.net <-> *.split quits: Amadiro | 16:29 | |
olofk | I think that sounds strange too | 16:30 |
olofk | And have we copied the optimized memset routine I wrote for the Linux port to newlib? | 16:30 |
olofk | Or are we still using the byte-by-byte copies? | 16:31 |
olofk | SMDhome1: GSoC has just finished. I submitted my final evaluation earlier today | 16:33 |
stekern_ | looking through old irc logs, the last dhrystone result I've mentioned seems to be 1.44 | 16:34 |
olofk | SMDhome1: Also, your pull request looks a bit odd. The first patch adds the pcu support, which isn't in upstream mor1kx yet, and the other adds the branch predictor files, which are already present in the .core file | 16:35 |
olofk | stekern_: While you're here, when can we expect another mor1kx release? :) | 16:35 |
stekern_ | any day ;) | 16:37 |
olofk | :) | 16:37 |
stekern_ | olofk: what's happened to you? booking hotels more than a month in advance. | 16:46 |
olofk | stekern_: Yeah. It's kind of cheating, I know | 16:49 |
stekern_ | to keep up the traditions, I've booked a room there as well now ;) | 17:10 |
ZipCPU | stekern_: Thank you! That's what I've been looking for. What score did you say it achieved? 1.44 was it? | 19:55 |
ZipCPU | stekern_: Looking through your code, two questions come to mind: 1) Is there a particular reason that you combined the two files? and 2) Why did you skip the Proc_6 processing? | 22:34 |
kc5tja | That moment when you're trying to add interrupt support to your RISC-V core, and realize it can trap in only two cycles. | 22:45 |
kc5tja | Eat it, 6502! ;) | 22:45 |
SMDhome1 | stekern_: yeah, I know, I've failed an intelligence check and I have to admit I don't know how to use git in the proper way | 23:13 |
SMDhome1 | olofk: previous message should be reply to you | 23:15 |
stekern_ | ZipCPU: I just took that from somewhere else | 23:23 |
-!- stekern_ is now known as stekern | 23:24 | |
stekern | I never really looked inside it too carefully | 23:25 |
--- Log closed Mon Aug 29 00:00:47 2016 |