--- Log opened Sun Aug 28 00:00:45 2016
SMDhome1 | ZipCPU: great news regarding UART, and not so great news about OpenRISC performance! | 02:20 |
SMDhome1 | Also, what's happening now at google summer of code? | 03:29 |
ZipCPU | SMDhome1: I'm not so sure I would declare the performance we measured "not so great." | 08:36 |
ZipCPU | A better characterization is that it "makes sense". | 08:36 |
ZipCPU | OpenRISC has no memcpy() or strcmp() instructions like the VAX did, so ... you'd expect a bit of a difference. | 08:37 |
ZipCPU | It's actually reflective of the RISC vs CISC tradeoff, and much to be expected. | 08:40 |
kc5tja | ZipCPU: How so? I was under the impression that a good RISC microarchitecture can compete with a CISC in pretty much all benchmarks. | 13:14 |
kc5tja | I guess I don't know too much about Dhrystones. | 13:14 |
ZipCPU | kc5tja: We were actually comparing Dhrystone MIPS / MHz. It's a clock independent measure of CPU speed. Multiply it by your clock speed, and you get a measure of Dhrystone MIPS. | 13:15 |
kc5tja | What numbers are you getting for OR1K? | 13:16 |
ZipCPU | Dhrystone MIPS is a measure of your CPU speed when compared with a VAX at a 1 MHz clock speed, which is deemed to be 1 DMIPS. | 13:16 |
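[Editor's note: a minimal sketch of the arithmetic being described. The raw score and clock below are invented placeholders, not the measured OR1K or ZipCPU results; 1757 Dhrystones/second is the conventional VAX reference defined as 1 DMIPS.]

    /* Illustrative only: convert a raw Dhrystone score into DMIPS and
     * DMIPS/MHz.  1757 Dhrystones/second is the VAX reference score,
     * defined as 1 DMIPS.  The input numbers here are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        double dhrystones_per_sec = 175700.0;  /* hypothetical raw score */
        double clock_mhz          = 80.0;      /* hypothetical clock     */

        double dmips         = dhrystones_per_sec / 1757.0;
        double dmips_per_mhz = dmips / clock_mhz;

        printf("DMIPS = %.2f, DMIPS/MHz = %.3f\n", dmips, dmips_per_mhz);
        return 0;
    }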
ZipCPU | I'm convinced the OR1K and even the ZipCPU would beat the VAX at DMIPS alone, simply because the instruction sets are simpler and so the clock rates can be faster. | 13:17 |
kc5tja | That much I know, but it's hard to keep it in perspective with respect to other architectures. | 13:17 |
ZipCPU | Well, consider this, RISC machines tend to have higher clock speeds than CISC machines, right? | 13:18 |
ZipCPU | And CISC machines can do more logic per clock, no? | 13:18 |
kc5tja | More logic, sure, but more useful logic remains debatable. | 13:19 |
kc5tja | The papers describing RISC-I used VAX as their benchmark, and showed how RISC-I basically had a constant factor performance gain over VAX. | 13:20 |
ZipCPU | Yes ... but ... for the same clock rate? | 13:21 |
ZipCPU | (BTW ... my plan is not to release OR1K's score until ORCONF ... sorry, but we can discuss hard numbers then) | 13:22 |
kc5tja | I'd have to check again; if there was a difference, it wasn't much. Original RISCs were clocked around 12MHz, IIRC. | 13:22 |
ZipCPU | Hmm ... looking for numbers today, I've got Brakefield's comparisons. Is a PDP11 or PDP8 architecture at all related to the VAX? | 13:24 |
kc5tja | PDP8 not so much, but the PDP-11 is the VAX's spiritual predecessor. | 13:27 |
ZipCPU | Now, looking at Brakefield's work, there's a Spartan-6 implementation of a PDP-11 that can run at 64MHz. | 13:28 |
kc5tja | VAX in fact stands for Virtual Address eXtensions, as the original VAXes had the ability to run PDP-11 code in hardware. The 32-bit CPU was always there of course, but VAX was to be an upgrade path from the PDP-11. | 13:28 |
ZipCPU | That implementation comes from the pdp11-34verilog project found (at least at one time) on www.heeltoe.com | 13:30 |
ZipCPU | I know that I can run the ZipCPU at 80MHz on a Spartan-6, so ... there's a normalizing difference there. | 13:30 |
kc5tja | This page: http://heather.cs.ucdavis.edu/RISC.pdf suggests that RISC-I/II software was no more than 50% bigger than an equivalent VAX program. Put another way, the CISCness of the VAX should not contribute much to the DMIPS benchmarks. | 13:32 |
ZipCPU | I disagree completely. I got one benchmark score with no movc5 instruction, and then taught my DMA to do a multi-word move and got quite the performance boost. | 13:32 |
* ZipCPU is reading Norman Matloff's paper now | 13:37 | |
* ZipCPU was just called away to lunch ... | 13:38 | |
kc5tja | Sure, but a RISC with an unrolled loop would compete with your DMA transfer quite well. | 13:39 |
kc5tja | Move 20 or 30 words per iteration, and you reduce the looping overhead to 1/20 or 1/30 of what it would be normally. Hence, RISC code is bigger, but just as fast. | 13:40 |
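[Editor's note: a rough sketch of the unrolled block-move idea kc5tja describes; the function name and the factor of eight are arbitrary choices, not code anyone in the channel actually ran.]

    /* Copy 'nwords' 32-bit words, eight per loop iteration, so the
     * compare-and-branch overhead is paid once per eight moves rather
     * than once per move.  Assumes nwords is a multiple of 8 for brevity. */
    #include <stdint.h>
    #include <stddef.h>

    void copy_words_unrolled8(uint32_t *dst, const uint32_t *src, size_t nwords)
    {
        for (size_t i = 0; i < nwords; i += 8) {
            dst[i + 0] = src[i + 0];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
            dst[i + 4] = src[i + 4];
            dst[i + 5] = src[i + 5];
            dst[i + 6] = src[i + 6];
            dst[i + 7] = src[i + 7];
        }
    }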
kc5tja | DMIPS/MHz is looking like a bogus benchmark. What you want is DMIPS/total cycles executed. | 13:40 |
kc5tja | https://en.wikipedia.org/wiki/Instructions_per_second -- that moment when you realize the 6502 is 4x faster than an equivalently clocked 68000. | 13:41 |
kc5tja | This I *know* to be bogus, and amply demonstrates why I don't believe DMIPS/MHz. | 13:41 |
kc5tja | The 6502 is about 1/2 the speed of a 65816 in 16-bit mode, and the 65816 only gets about 80% the performance of a 68000. | 13:42 |
kc5tja | (and, yes, the 65816 has a block move instruction that takes 7 cycles per byte transferred.) | 13:42 |
kc5tja | Cycle per cycle, this makes OR1K and RISC-V vastly more efficient at block moves with only their basic instruction set. | 13:43 |
SMDhome1 | I think I've found what's wrong with the OpenRISC Dhrystone results | 13:59 |
SMDhome1 | ZipCPU uses the cycle count printed after the simulation is over, but that includes the cycles spent printing the results, etc. | 14:01 |
SMDhome1 | In this case we have two options: either we delete the printfs, or we increase the number of Dhrystone loops to eliminate the printfs' influence | 14:01 |
SMDhome1 | I'm running 1M Dhrystone loops now, but for 200k I got better results than ZipCPU | 14:02 |
kc5tja | Another question is which version of Dhrystone is being used. 1.0, 1.1, and 2.1 will all report different values for the same architecture. | 14:03 |
SMDhome1 | I'm using 2.1 | 14:03 |
ZipCPU | SMDhome1: Not so. I counted cycles (including printfs) for a 20k-iteration run, and cycles (including printfs) for a 10k-iteration run. I then took the difference as the time it took to do 10k iterations; the printf time and reset time should've been otherwise constant between the two runs. | 14:31 |
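[Editor's note: the differencing ZipCPU describes amounts to a small calculation. The cycle counts below are invented placeholders, not the actual measurements.]

    /* Two simulation runs that differ only in the Dhrystone loop count.
     * Subtracting them cancels the (roughly constant) reset and printf
     * cycles, leaving only the cycles spent on the extra 10k iterations.
     * All numbers here are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long cycles_20k = 9000000UL;  /* hypothetical total, 20k loops */
        unsigned long cycles_10k = 5000000UL;  /* hypothetical total, 10k loops */
        unsigned long iterations = 10000UL;    /* 20k - 10k */

        double cycles_per_iter = (double)(cycles_20k - cycles_10k) / iterations;
        printf("cycles per Dhrystone iteration: %.1f\n", cycles_per_iter);
        return 0;
    }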
ZipCPU | kc5tja: Let's discuss loop unrolling for a moment. When measuring the ZipCPU's performance, I unrolled the loops of the strcmp, strcpy, and memcpy manually. | 14:38 |
ZipCPU | While the Dhrystone rules state that the code must be compiled, i.e. it must come from GCC, they don't necessarily state that the library routines can't be hand-optimized. | 14:39 |
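[Editor's note: for illustration, a hand-unrolled strcpy in the spirit of what ZipCPU describes; this is a sketch, not his actual routine, and the four-way unroll factor is arbitrary.]

    /* A four-way unrolled byte-wise strcpy: the loop branch is taken once
     * per four characters instead of once per character.  Purely a sketch
     * of the hand-optimization being discussed, not benchmark code. */
    char *strcpy_unrolled4(char *dst, const char *src)
    {
        char *d = dst;
        for (;;) {
            if ((*d++ = *src++) == '\0') return dst;
            if ((*d++ = *src++) == '\0') return dst;
            if ((*d++ = *src++) == '\0') return dst;
            if ((*d++ = *src++) == '\0') return dst;
        }
    }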
SMDhome1 | ZipCPU you can unroll loops as you want, I guess | 14:39 |
ZipCPU | Well ... not quite. Dhrystone is not meant to be hand optimized. I'm sure there are those that do it, but it's *supposed* to be a measure that includes compiler performance. | 14:40 |
SMDhome1 | seems like false alarm, I'm rechecking | 15:25 |
_franck_ | ZipCPU, kc5tja : there are Dhrystone numbers here: http://www.juliusbaxter.net/openrisc-irc/search?q=Dhrystone | 15:38 |
_franck_ | coming from stekern_ | 15:39 |
kc5tja | gcc can be made to unroll loops, so don't feel too bad over it. :) | 15:47 |
kc5tja | Also, it's a common complaint against Dhrystone that you're really testing the compiler's standard library performance more than you are the CPU itself. | 15:47 |
kc5tja | Geez. 1.2 to 1.5 are not all that bad. In fact, it's positively stellar compared to many other, more commercially successful CPUs. | 15:52 |
kc5tja | *cough* Intel *cough* | 15:52 |
kc5tja | (Although, to be fair, they did soundly destroy the 68060 when the Pentium came out.) | 15:53 |
kc5tja | Now, see, I want to find out what Dhrystone ranking I get with my own RISC-V core, as well as with the S64X7. Should be enlightening. :) | 15:55 |
ZipCPU | _franck_: That's all well and good, but what I need is something that I can repeat and therefore observe. I'd like to know how that was done. So far, all I've heard about prior runs of the benchmark is that they are not to be trusted. | 16:15 |
ZipCPU | stekern: You were the one who ran Dhrystone last: Do you have any of the system, software, and/or assembly left behind from when you did it? | 16:16 |
ZipCPU | I was also told that the prior number I was using came from an unrealistic simulator ... not from actual logic. | 16:17 |
stekern_ | ZipCPU: http://oompa.chokladfabriken.org/tmp/dhry/ | 16:21 |
olofk | If you want to get rid of the printf overhead when running in simulations, maybe you should use the l.nop method to print bytes | 16:27 |
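[Editor's note: a sketch of what olofk's suggestion might look like. It assumes the common OpenRISC simulator convention where "l.nop 0x4" (often named NOP_PUTC) prints the character held in r3, bypassing the full printf path; the function names are hypothetical and this is not code from the project.]

    /* Assumes the or1ksim/testbench convention that "l.nop 0x4" emits the
     * character in r3 to the simulator console.  GCC-style inline asm. */
    static inline void sim_putc(char c)
    {
        register unsigned long r3 __asm__("r3") = (unsigned long)c;
        __asm__ volatile ("l.nop 0x4" : : "r"(r3));
    }

    static void sim_puts(const char *s)
    {
        while (*s)
            sim_putc(*s++);
    }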
stekern_ | surely the printfs should not be part of the measurements | 16:28 |
-!- Netsplit *.net <-> *.split quits: Amadiro | 16:29 | |
olofk | I think that sounds strange too | 16:30 |
olofk | And have we copied the optimized memset routine I wrote for the Linux port to newlib? | 16:30 |
olofk | Or are we still using the byte-by-byte copies? | 16:31 |
olofk | SMDhome1: GSoC has just finished. I submitted my final evaluation earlier today | 16:33 |
stekern_ | looking through old irc logs, the last dhrystone result I've mentioned seems to be 1.44 | 16:34 |
olofk | SMDhome1: Also, your pull request looks a bit odd. The first patch adds the pcu support, which isn't in upstream mor1kx yet, and the other adds the branch predictor files, which are already present in the .core file | 16:35 |
olofk | stekern_: While you're here, when can we expect another mor1kx release? :) | 16:35 |
stekern_ | any day ;) | 16:37 |
olofk | :) | 16:37 |
stekern_ | olofk: what's happened to you? booking hotels more than a month in advance. | 16:46 |
olofk | stekern_: Yeah. It's kind of cheating, I know | 16:49 |
stekern_ | to keep up the traditions, I've booked a room there as well now ;) | 17:10 |
ZipCPU | stekern_: Thank you! That's what I've been looking for. What score did you say it achieved? 1.44 was it? | 19:55 |
ZipCPU | stekern_: Looking through your code, two questions come to mind: 1) Is there a particular reason that you combined the two files? and 2) Why did you skip the Proc_6 processing? | 22:34 |
kc5tja | That moment when you're trying to add interrupt support to your RISC-V core, and realize it can trap in only two cycles. | 22:45 |
kc5tja | Eat it, 6502! ;) | 22:45 |
SMDhome1 | stekern_: yeah, I know, I've failed an intelligence check and I have to admit I don't know how to use git in the proper way | 23:13 |
SMDhome1 | olofk: previous message should be reply to you | 23:15 |
stekern_ | ZipCPU: I just took that from somewhere else | 23:23 |
-!- stekern_ is now known as stekern | 23:24 | |
stekern | I never really looked inside it too carefully | 23:25 |
--- Log closed Mon Aug 29 00:00:47 2016 |