IRC logs for #openrisc Thursday, 2016-09-22

--- Log opened Thu Sep 22 00:00:22 2016
olofkHoolootwo: Yeah, I've been bitten by that a few times. It's really annoying. We should fix it02:10
Hoolootwoat least for now, could something be thrown in the readme?02:11
olofkHoolootwo: Will do. I forgot about it as I generally don't use the xterm at all, but connect via telnet instead02:11
Hoolootwoah okay, I will probably end up doing that too eventually02:12
olofkI found it works a bit better than xterm02:12
olofkThe issue there is that if you boot linux, it will just stop at the prompt without any indication that it's waiting for a telnet connection02:13
olofkI guess that or1ksim in general could be a bit more descriptive :)02:13
olofkWe should also make 32MB RAM default, so you don't strictly need to use a config file02:14
olofkYou know what, I'll file a few bugs so I don't forget02:14
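[Editor's aside: a minimal or1ksim sim.cfg sketch covering the two points above — 32MB of RAM as the default, and a UART that listens for a telnet connection instead of spawning an xterm. Section and parameter names are written from memory and may differ between or1ksim versions; the base addresses, IRQ, and port number are assumptions, so check the or1ksim user guide before relying on this.]

```
/* Hypothetical sim.cfg fragment -- verify against the or1ksim docs */

section memory
  name = "RAM"
  type = unknown
  baseaddr = 0x00000000
  size = 0x02000000       /* 32MB, so a config file isn't strictly needed */
end

section uart
  enabled = 1
  baseaddr = 0x90000000   /* assumed; match your board's device tree */
  irq = 2
  channel = "tcp:10084"   /* wait for `telnet localhost 10084` instead of xterm */
end
```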
olofkZipCPU|Laptop: I think a potential improvement could be to store the provider info outside of the core file, as a separate file. This has crossed my mind a few times, but there are of course some drawbacks to this approach too02:24
olofkSo for now, the general workflow I use myself is to store a .core file in the repo without a provider section. This is always up-to-date with the head of the repo02:25
olofkFor orpsoc-cores I prefer to store only proper releases and point to a specific version, tag, commit in the provider section02:26
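[Editor's aside: for illustration, a sketch of the workflow olofk describes — a CAPI1 .core file whose provider section pins a tagged release. The repository, user, and tag names here are hypothetical, and the exact field names should be checked against the FuseSoC documentation.]

```
CAPI=1

[main]
depend = wb_common

; In-repo .core files omit this section and track HEAD; for orpsoc-cores,
; the provider pins a specific released version.
[provider]
name = github
user = olofk
repo = wb_intercon
version = v1.0
```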
olofkAnd regarding MicroBlaze. I noticed the same thing a few years ago. It's an interesting architecture, and I'm not convinced it's all bad02:27
olofkIt allows you to have a high bandwidth connection to your RAM that doesn't have to wait for slow transfers on the peripheral bus02:28
Hoolootwofrom what I have seen for applications where you really need i/o, you do DMA on the microblaze02:28
olofkThere are however some complications. I worked with a dual-core setup that had one part of the RAM shared, which meant it couldn't be cached. That was a bit tricky to get right, since I had to feed a segment of the peripheral bus to the RAM02:29
olofkHoolootwo: Yes. If you're transferring large amounts of data02:30
olofkBut many of the I/O transfers are just small, slow accesses, like talking to a SPI controller or UART02:31
olofkAnd having separate buses will avoid having the CPU wait for the bus to be free when it wants to talk to the RAM02:31
shornestekern: I am looking at jonas's change to sys_rt_sigreturn, I understand that he made the change to switch the return path from the normal syscall return path to the exception return path.  But it's not that big of a difference, as the syscall return path checks for pending work then jumps to the exception return path if there is any.03:43
shorneSo other than some restored registers its not too much different.  Do you have any idea what jonas means when he said he reworked that patch?03:44
shorneI didn't see any in my rebase.  If you don't know I'll just send him a mail03:44
stekernafair, that is the reworked patch, and he did get feedback from sebastian macke about it. But, I might be remembering wrong03:45
stekernI'm pretty sure I would have picked up the latest version if there was a more recent one though03:46
shorneyeah, I can't find anything in any history; also the comment says "comment from the original patch"03:46
shorneDo you know if it was discussed on the linux kernel mailing list before, or just on the openrisc list?03:46
shornesince the openrisc archive seems gone now03:47
stekernjust the openrisc list03:47
shorneI see, ok might just have to shoot jonas a mail.  I read through everything, it seems ok03:47
stekernI can try to forward the messages to you, I have them in my own archive03:48
shornethat would be great if you can03:48
stekerndone03:50
shorneHmm, so it seems the first patch still did return via the syscall path, the second did return via the exception path03:58
shornebut then Sebastian says strace is really broken, and jonas says he will look again03:58
shorneso there might be a 3rd patch03:58
olofkJonas hasn't been active for about four years, so don't get your hopes up that he did another one04:01
shorneyeah, I am kind of thinking that04:01
shornewell, I guess I have to try this patch with strace and see if it breaks04:02
stekernshorne: I strongly remember that there was no follow-up after that last mail, sebastian might remember if he did some more testing of it04:08
stekernpoke53281 <- sebastian04:08
shornestekern: thanks, it's good to know you remember no updates.  Interestingly it seems that thread is just between you, Jonas, and Sebastian04:10
shornelooks like it's off the mailing list (judging by the forwarded headers)04:10
stekernthat might very well be the case04:10
shornewayback machine only has lists.openrisc.net till 201204:11
shorneit doesn't have the mails though, I'll test it04:14
shornehttps://www.mail-archive.com/[email protected]/index.html#0043204:17
shorneI found this04:17
shornegood04:17
shornethose mails were definitely not on the list, anyway, I got work to do04:24
shorne(not those patches)04:24
ZipCPU|Laptopolofk: I can understand Xilinx's purpose in having four busses.  1. Separate instruction and data busses avoids a bottleneck, 2. Caching a bus that can be cached is an advantage, 3. Having a wide bus speeds things up with memory--especially since the default DDR3 width is 128bits.06:20
ZipCPU|LaptopWhat gets me is that none of these busses are truly pipelined.06:21
ZipCPU|LaptopThey are heavy, feature laden beasts, that are (in terms of performance) inherently slow.06:21
wallentoabout that core file provider, there is still the plan to bring up an api.librecores.org that provides all cores as .core files or others06:24
olofkZipCPU|Laptop: AXI4 is definitely pipelined06:26
ZipCPU|LaptopNot the way Xilinx implemented it for their MicroBlaze.06:28
ZipCPU|LaptopAccording to the docs, they only allow one request in flight at a time--even though the bus allows more.06:28
ZipCPU|LaptopFor the peripheral bus, that's one 32-bit word request.06:28
ZipCPU|LaptopFor the memory/cachable busses, that's one 128-bit request that may be pipelined if the bus is smaller.06:29
olofkah ok06:29
olofkThe problem with pipelined accesses is that you lose exact exceptions06:29
ZipCPU|LaptopNot necessarily.  I was reading through the LM32 wishbone spec yesterday, and they handled that this way:06:30
ZipCPU|LaptopEvery STB pulse gets either an ACK, ERR, or RTY signal in return.06:30
ZipCPU|LaptopSo, as long as the ERR signal doesn't come before your ACK signal (which could happen if you cross devices ...) your exceptions remain exact.06:31
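[Editor's aside: a toy Python model of the scheme ZipCPU describes, under the assumption (stated above) that responses come back in request order. By tracking which in-flight request each ACK or ERR answers, the faulting access — and hence the faulting instruction — stays identifiable even with several requests outstanding. The addresses and register names are invented for illustration.]

```python
from collections import deque

def run_bus(requests, responses):
    """Track in-flight pipelined reads so an ERR can be matched to the
    exact request that caused it. Assumes responses arrive in request
    order (the LM32/Wishbone per-STB ACK/ERR discipline above)."""
    in_flight = deque(requests)        # each entry: (addr, dest_reg)
    completed, fault = [], None
    for resp in responses:             # one ACK/ERR per outstanding STB
        addr, reg = in_flight.popleft()
        if resp == "ACK":
            completed.append(reg)      # this load retired cleanly
        else:                          # ERR: we know exactly which access failed
            fault = (addr, reg)
            break
    return completed, fault

done, fault = run_bus([(0x100, "r3"), (0x104, "r4"), (0x108, "r5")],
                      ["ACK", "ERR"])
# r3's load completed; the fault is pinned to the access destined for r4
```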
olofkBut that is without pipelining06:32
olofkYou have to wait for the ack, err or rty to come back before sending another one06:33
olofkThis is basically what all non-pipelined wb masters do06:33
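[Editor's aside: a back-of-the-envelope sketch of why that waiting hurts. The cycle counts are hypothetical, but the shape is the point — a classic WB master pays the slave's latency on every transfer, while a B4 pipelined master pays it once and then streams.]

```python
def wb_classic_cycles(n_xfers: int, slave_latency: int) -> int:
    """Classic (non-pipelined) Wishbone: the master waits for each ACK
    before issuing the next STB, so the latency is paid per transfer."""
    return n_xfers * (1 + slave_latency)

def wb_pipelined_cycles(n_xfers: int, slave_latency: int) -> int:
    """WB B4 pipelined mode: one request per clock (barring stalls),
    with ACKs streaming back slave_latency clocks behind."""
    return n_xfers + slave_latency

# e.g. a 16-beat run against a slave that acks 4 clocks after each request
assert wb_classic_cycles(16, 4) == 80
assert wb_pipelined_cycles(16, 4) == 20
```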
ZipCPU|LaptopWhy wait?  The alternative is that you are prepared to roll back several operations.  You need to maintain that information anyway, in order to know which register to place the result into for a read request.06:35
olofkYes. Rolling back is another option, but it's also more complex06:35
olofkBut I doubt that lm32 implements wb4 pipelined mode06:36
olofkuBlaze sends a burst request to fill a cache line, when needed. It's the same thing we do with mor1kx06:39
olofkI don't see how pipelining would help here06:40
olofk(Can't believe I'm defending uBlaze after all the bad things I have said about it) :)06:41
ZipCPU|LaptopSo here's a question there: is the flash on the cache line, or just the DDR3 memory?07:02
ZipCPU|LaptopQSPI flash can get a *big* benefit from pipelining.07:03
ZipCPU|Laptopolofk: Regarding rollback ... for loads, I don't retire instructions until the memory operation is complete.  There's nothing that needs to be rolled back as a result.  Writes can be (but aren't yet) done the same way.07:04
ZipCPU|LaptopNothing truly then needs to be rolled back.07:05
olofkNot sure how they do QSPI Flash. It's likely not pipelined, so either they read from the peripheral bus, or they DMA to the memory07:05
olofkDo you have a cache?07:05
olofkWithout a cache, pipelined accesses are definitely a benefit07:08
shorneolofk: any word from opencores.org?07:13
ZipCPU|Laptopolofk: Currently, I have an instruction cache but no data cache.  I also have no MMU, so concurrency is not a big issue for me (yet).  I intend to fix/change both of these, but that hasn't happened yet.07:18
olofkshorne: Not from what I have heard07:20
kc5tjaWithOUT a cache, pipelining is a benefit?  Unless you're streaming data, that has been the case in my models.  Software reads and writes to random locations in memory more frequently than not, so you have to incur the 70ns (for DRAM) penalty with sufficient frequency that SDRAM protocol overhead actually makes it a slower choice than asynchronous 70ns DRAM.10:47
kc5tjas/has been the case/has not been the case/10:47
kc5tjaQSPI is not pipelined; it is, however, a burst transfer device.10:48
kc5tjaIts protocol is a lot like SDRAM's protocol, only with more clock cycles.10:49
kc5tjaYou send it the read command (which includes the address) in about 6 cycles or so, then you wait some more cycles (with no transfers) while the device accesses the flash contents, and then you start streaming data back.10:50
kc5tjaIf a particular device does support some kind of pipelining, it's at most six cycles (or however many it takes to receive a command), which is likely to be but a tiny fraction of your burst length.10:51
HoolootwoI can't seem to get to opencores.org from any of my various locations, is it down for anyone else?14:44
Hoolootwodns seems to resolve, but no http gets through14:44
wallentoits dead14:59
Hoolootwohow dead?15:00
Hoolootwogone forever, or should it be back relatively soon?15:01
mafmit's not dead, it's pining for the fjords15:01
* mafm wearing a Cleese t-shirt right now, conveniently15:02
Hoolootwomafm++15:03
ZipCPU|Laptopkc5tja: I intend to discuss the benefit of pipelining without a cache at ORCONF.  Indeed, part of my presentation will show Dhrystone measures with and without pipelining.15:30
ZipCPU|LaptopAs for QSPI, if every access requires the six clocks for address plus two dummy clocks before data shows up, then you've just made my point.15:31
ZipCPU|LaptopA first access in a group requires 8 clocks to start, then can produce one 32-bit data value every 8 clocks.15:31
ZipCPU|LaptopIf you string your operations together, sequentially, then you can read from the flash in 8+8N clocks.15:32
ZipCPU|LaptopThis is one form of "pipelining".15:32
ZipCPU|LaptopAs for SDRAM, the DDR3 SDRAM I'm working with will have an access time of (roughly) 9+N clocks.15:34
ZipCPU|LaptopPipelining lets you exploit the N instead of requiring that N be 1 every time.15:34
ZipCPU|LaptopBTW: N is the number of 128-bit words you wish to read (or write--you just can't switch mid transfer without stalling)15:35
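[Editor's aside: the 8+8N and 9+N figures quoted above, written out as throwaway cycle-count models. These are sketches of the arithmetic in the conversation, not measured hardware, and the "isolated read" cost is an assumption that each access re-pays the full setup.]

```python
def qspi_strung_cycles(n_words: int) -> int:
    """Sequential QSPI reads strung together: one 8-clock command/address
    setup, then one 32-bit word every 8 clocks (the 8+8N figure above)."""
    return 8 + 8 * n_words

def qspi_isolated_cycles(n_words: int) -> int:
    """Each read issued on its own: the 8-clock setup is re-paid per word."""
    return n_words * (8 + 8)

def ddr3_cycles(n_words: int) -> int:
    """Roughly 9 clocks of access latency, then one 128-bit word per clock
    (the 9+N figure above)."""
    return 9 + n_words

# A 16-word run: stringing the flash reads nearly halves the cycle count
assert qspi_strung_cycles(16) == 136
assert qspi_isolated_cycles(16) == 256
assert ddr3_cycles(16) == 25
```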
ZipCPU|LaptopOne more comment on the Dhrystone measure: that is with and without pipelining on the *data* channel.  The *instruction* channel is both pipelined and cached as soon as the cache in the CPU is enabled, and hence the CPU is pipelined.  (The option connects the two within the ZipCPU.)15:37
ZipCPU|LaptopIndeed, I get a rough 50% improvement in my Dhrystone score by implementing pipelining ... even without a data cache.15:37
-!- Dan__ is now known as Guest6386015:42
-!- Guest63860 is now known as ZipCPU15:42
kc5tjaZipCPU|Laptop: Nice.  I wish I could attend.  :(15:50
kc5tjare: 8+8N clocks -- that's not pipelining.  That's bursting.  Pipelining is when you can *start* transaction N+1 *before* transaction N completes.15:51
kc5tjaOtherwise, RS-232 communications is highly pipelined transmission of data.  ;-)15:52
ZipCPUPerhaps I'm not defining pipelining the same way.  Hmm ... here's my definition: the controller never relinquishes control of the bus and would, barring stalls from the WB slave, issue one request per clock and then wait for one ack per request.15:56
ZipCPUWith that definition, RS-232 is not "pipelined" unless the wishbone master holds onto the bus and the RS-232 device stalls everything while waiting for its next byte.15:56
ZipCPUSimilarly, a more traditional RS-232 device would maintain a FIFO buffer, which could easily be set up for M+N clocks, where M is any bus propagation time and N is the number of transfers necessary to either fill the buffer or finish the message.15:57
kc5tjaClocking doesn't really enter into any definition I've seen (only used as examples).16:02
kc5tjaFor example, a CPU with a pipeline can still take 5 cycles to execute an instruction, but if it has a 5-deep pipeline, it can "appear" to have an instruction latency of 1 cycle.16:02
kc5tjaThat's because it's busy processing 5 instructions at any given time (bubbles notwithstanding).16:02
kc5tjaBut, a pipeline doesn't always have to be clock synchronous.  The 80386 actually had a limited pipeline which allowed up to three instructions to execute at once, but the minimum latency was 2 cycles.16:03
kc5tjaAnother example more relevant to flash QSPI devices is the RapidIO interconnect.16:05
kc5tjaIn as little as four clocks (but can be more depending on the kind of packet and how wide your interconnect is), you can kick off any bus transaction you like.16:05
kc5tjaIn a couple of other clocks, you'll get back an acknowledgement that the packet was received by the network.16:06
kc5tjaHowever, that doesn't mean your transaction has completed.  It just means that the interconnect is now free for another if you support that.16:06
kc5tjaThe receipt of the "ok I'm done" for your transaction might come hundreds of clocks later.  In the meantime, you can "queue up" a number of other transactions, some of which might even complete out of order (!!).16:07
ZipCPUI still think "pipeline" is appropriate.  This is for two reasons: 1) the WB spec calls this type of access "pipelined", and 2) the bus (not necessarily the peripheral) is acting in a pipelined fashion -- even by your definition.16:07
kc5tjaHowever, what you _cannot_ do is break up an individual transaction's burst of data.16:07
kc5tjaI cannot agree with that definition.16:08
ZipCPUConsider a bus with multiple stages within it.  If there's one request in each stage, you then have a pipeline.16:09
kc5tja(1) WB treats bus transactions as single-beat things, even in a pipelined implementation.  The "pipeline" depth in the controller _must_ match the interconnect's register depth (plus the pipeline depth in the peripheral), or it will fall out of synchronization.16:09
ZipCPUIf the peripheral at the far end only accepts one access every x clocks, that doesn't negate the fact that the bus itself was pipelind.16:10
kc5tjaYes, because the other x-1 clocks are impossible to use for a transaction.16:10
ZipCPUNow wait a second here ... if a CPU has five pipeline stages, and whenever you perform a multiply the multiply stage takes 8 clocks, that doesn't mean the CPU isn't pipelined.16:12
kc5tjaThat's not what I said.16:12
kc5tjaWhat makes it pipelined is the fact that you can have up to 5 instructions in flight at the same time.  Key word: SAME time.16:13
ZipCPUThen ... what have I missed?  You offered a CPU as an example of what defined a pipeline, and I'm pointing out that any pipeline can stall.16:13
ZipCPUOk, but I can still have five bus transactions in flight at the SAME time, even if the peripheral at the end stalls the bus.16:13
kc5tjaSimply stuffing a stream of data down a pipe doesn't make it pipelined.  What makes it pipelined is the _capability_ for that pipe to have _multiple_ and _independent_ transactions in flight at once.16:14
kc5tjaFor example, if your CPU can request to read the next instruction from program space _before_ the currently executing instruction completes a data fetch from data space, then you have a pipelined bus.16:15
kc5tjaThis is why I say flash QSPI devices are bursted, not pipelined.  You cannot start read #2 until read #1 has completely finished.16:16
ZipCPUSo ... if a CPU issues a write command to address 0, 1, 2, and 3, before being stalled by the peripheral that needs to wait 14 clocks before the first request completes, and 8 clocks for every request thereafter ... that's not a pipeline?16:16
ZipCPUAnd then once that first request completes, the CPU issues a command to write address 4 -- even before 1, 2, and 3 have completed ... that's not a pipeline?16:17
kc5tjaNope.  That's just burst-mode with lots of wait-states.16:17
kc5tjaWait, you just said that write to 3 stalls the CPU.16:17
ZipCPUYes.  The 3rd write stalled the CPU, not the first two.16:18
kc5tjaI need to see a timing diagram, because this is too confusing to disentangle on IRC alone.16:18
ZipCPUDo you have WB B4 spec available to you?16:19
kc5tjaYes.  It's on my desktop.16:19
ZipCPUOkay, let's compare illustration 3-10 on page 49 with ...16:20
ZipCPU3-11 on page 51.16:20
ZipCPU3-11 is what I'm calling "pipelined"16:20
kc5tjaOK, that is pipelined by virtue of the fact that the address bus, WE, and other control signals changes value every (non-stalled) cycle.  That is to say, EVERY cycle is potentially a unique read or write transaction.16:23
ZipCPUYes!16:23
kc5tjaWhat raised my objection is (let me type it out)16:23
kc5tjaFlash QSPI devices cannot support that mode of operation.  At all.16:24
kc5tjaWhat they DO have, is a set of clock cycles where you send an address and a WE bit,16:24
kc5tjafollowed by some access time latency,16:24
kc5tjafollowed by one or more cycles of contiguously addressed data.16:24
ZipCPU(Let me know when you are done ...)16:25
kc5tja(heh, sorry, had an interruption at the door)16:27
kc5tjaBut, all the while, it's one transaction.16:27
kc5tjaThese 23 clocks (or whatever) all correspond to a _single_ WB bus cycle.16:27
kc5tjaI guess the spec calls them block transactions instead of burst transactions.16:28
kc5tjaStill got Motorola terminology in my brain.16:28
kc5tjaDoes that make sense?16:28
ZipCPUI think so ... perhaps our confusion is in the difference between the device itself and the controller.16:29
kc5tjaNow, can you USE pipelining for this operation?  Absolutely.  And I honestly would probably prefer it over block transactions because it seems to give more control over timing.16:29
kc5tjaMight be.  Different things have different pipelines, which is semantically equivalent to different clock domains.16:30
kc5tjaEither way, no matter which terminology you use, I only ask that you be consistent with it.  :)16:30
ZipCPUSo, the multiple QSPI controllers I've written have all been both internally pipelined (especially this last one), and used the pipeline bus mode.16:31
ZipCPUI can understand why you might say, though, that the interface itself is not pipelined.16:31
* kc5tja nods16:32
shornewallento: FYI, I am building musl with a host gcc version of 6.11. It seems that gcc-6 cannot build gcc-5 due to this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=6995917:36
-!- Netsplit *.net <-> *.split quits: hammond, simoncoo1, andrzejr, eliask, ssvb, jeremybennett, wallento, Hoolootwo, SMDwrk, nurelin, (+7 more, use /NETSPLIT to show all of them)17:37
shornewallento ... just got split17:37
shornegreat17:37
-!- Netsplit over, joins: eliask, simoncoo1, kc5tja, rokka17:38
shorneIt seems we need to bump up to gcc 5.4.0 (which I did here) https://github.com/stffrdhrn/or1k-gcc/tree/musl-5.4.017:39
shorneI just did a merge from gcc 5.4.0 release into or1k to create or1k-5.4.0, and then rebase the musl-5.3.0 on or1k-5.4.0, to create musl-5.4.017:41
shornewallento: repeating since you got split: I did the bump to gcc 5.4.0 here https://github.com/stffrdhrn/or1k-gcc/tree/musl-5.4.017:41
shorneit was very smooth17:41
shorneno conflicts17:41
shornenow my musl build is running again17:42
shorneHoolootwo: what do you need from opencores?  openrisc related data here: http://openrisc.io/,  opencores repo here http://freecores.github.io/18:46
Hoolootwoshorne, I was just looking at the topic, and I keep finding broken opencores links on google about openrisc stuff23:09
--- Log closed Fri Sep 23 00:00:24 2016

Generated by irclog2html.py 2.15.2 by Marius Gedminas - find it at mg.pov.lt!