Renesas RX core question

Status
Not open for further replies.
May 11, 2008
22,142
1,401
126
I was doing some reading on this new core from Renesas.

http://www.renesasrulz.com/communit...h-memory-can-dictate-system-level-performance

According to the page I linked, the RX core does not rely on a smart buffer strategy combined with a wide datapath to the flash memory. By comparison, among the ARM7TDMI parts, the older LPC series from NXP used a 128-bit wide datapath to the flash memory to fetch four 32-bit instructions at once. This way, the slow access time of the flash memory can be partly hidden as long as contiguous memory addresses are accessed. Another solution is to copy the time-critical code to local SRAM to reduce the number of wait states to the absolute minimum (1 wait state?).
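To put rough numbers on the wide-datapath idea, here is a back-of-the-envelope model in Python. It is only a sketch under my own assumptions: every flash access costs one cycle plus the wait states, one access delivers `width_bits / 32` instructions, the code is purely sequential, and the core never consumes more than one instruction per cycle. The function name and the 3 wait states in the usage note are made up for illustration, not taken from a datasheet.

```python
def cycles_per_instruction(wait_states: int, width_bits: int, instr_bits: int = 32) -> float:
    """Average fetch cycles per instruction for purely sequential code (toy model)."""
    instructions_per_access = width_bits // instr_bits
    cycles_per_access = 1 + wait_states
    # In this model the core cannot consume more than one instruction per cycle,
    # so the average never drops below 1.0.
    return max(1.0, cycles_per_access / instructions_per_access)
```

With a hypothetical 3 wait states, `cycles_per_instruction(3, 32)` gives 4.0 cycles per instruction for a 32-bit path and `cycles_per_instruction(3, 128)` gives 1.0 for a 128-bit path. As soon as the code branches, the model no longer holds.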

ARM7TDMI
I know my good old SAM7S256 does not allow the flash memory to be accessed with 1 wait state cycle when the system clock is above 30 MHz. With my trusty ARM7TDMI core running at 48 MHz, I need 2 for reading from the flash memory controller and 3 for writing to it. I need only 1 wait state for the SRAM, but this is a fixed number.

Cortex M3
With the Cortex M3, the memory bus seems to be optimized, and the flash interface has some sort of preload mechanism into a buffer known as a flash accelerator (AKA cache?) to hide the wait states of the flash memory when it is run at a higher clock speed than the maximum clock speed for 0 (?) wait states. Thus, if I understand correctly, as long as contiguous memory addresses are accessed, the flash interface of the Cortex M3 chips behaves as if it has zero wait states.
I assume zero wait states means just 1 wait state, similar to SRAM?

The RX core from renesas
With the RX core (for example the RX600), it is claimed that the flash memory itself can be accessed within 10-nanosecond clock cycles, without the use of some sort of buffer or an extremely wide datapath to the flash memory.
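The link between a flash access time and the number of wait states is simple arithmetic, sketched here in Python. The assumption is that an access takes a whole number of clock periods and that 0 wait states means exactly one period; real parts add setup margins that are ignored here. The function name is mine, and the 30 ns figure in the usage note is a made-up example.

```python
def wait_states_needed(access_time_ns: int, clock_mhz: int) -> int:
    """Wait states so that (1 + wait_states) clock periods cover the flash access time."""
    # cycles = ceil(access_time / period) = ceil(access_time_ns * clock_mhz / 1000),
    # computed with integers to avoid floating-point rounding at exact boundaries.
    cycles = -(-(access_time_ns * clock_mhz) // 1000)
    return max(0, cycles - 1)
```

Under this model, a 10 ns flash at 100 MHz needs `wait_states_needed(10, 100)`, which is 0, while a hypothetical 30 ns flash at the same clock would need 2.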

Renesas claims that the various Cortex M3 chips are not really that fast at all because these chips have slow flash. Is this true?

I am going to munch through some datasheets to compile some information.

I do know that even if this RX600 chip is that much faster and the Cortex M3 chips are really that much slower, the Cortex M3 chips still have the advantage of several different manufacturers and GNU GCC support.
As such, there is no need to lock yourself in with one vendor, development tools are lower in price, and there is large community support.

Does anybody have some experience when it comes to the various Cortex M3 cores and / or the RX600 core ?

I want to build experience with the new Cortex M3 chips. I have been looking at a lot of different chips, but I think I will select the LPC1768 or LPC1769 from NXP over the LM3S1968 from Texas Instruments or the other chips from ST.

As for the RX600 core: I would just like to know more about it because of my work.
 
May 11, 2008
22,142
1,401
126
I found this about the STM32 chips from ST.

http://www.newelectronics.co.uk/article/22813/Flash-accelerator-delivers-performance-boost.aspx

According to ST, the STM32 embedded Flash performance gets a double boost with the 90nm production availability and adaptive real time (ART) accelerator, enabling programme execution up to 120MHz.

A proprietary ART memory accelerator has been incorporated to balance the performance of the ARM Cortex-M3 over Flash memory technologies. ST says that the cpu can now operate up to 120MHz without waiting.

To release the processor's full 150 DMIPS performance at this frequency the accelerator implements an instruction pre-fetch queue and branch cache, designed to enable programme execution from Flash at up to 120MHz with zero wait states.

The new 90nm devices featuring the ART memory accelerator are sampling at lead customers.

I found this about NXP and ST


http://www.newelectronics.co.uk/art...rs-differentiating-Cortex-M3-based-mcus-.aspx

Both companies have identified flash memory and data handling as one of the most important design parameters. By moving from a 140nm process to 90nm, NXP has been able to double memory datawidth from 128bit to 256bit. "This means users can get more performance from flash memory and that's achieved without a penalty in density or die size," Lees asserted.

Alexander Czajor, ST's microcontroller marketing manager, EMEA, underlined the importance of wide flash memory. "If you have flash and it's slow, you have to add wait states because you can't fetch data at the speed it's wanted. If you have 128bit wide flash, you can load four to eight instructions at a time." But Czajor added this approach only works if the code is linear. "If it's not, you have to jump, get code and then wait."


This device has 64 128bit registers for code and a further eight 128bit registers for data. Czajor explained the approach. "If you have branches in the code, the instruction has to be fetched from memory the first time it's needed. After that, the accelerator will become the branch target and, in most cases, the instruction will be in the matrix and will be accessible with zero wait states."
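The behaviour Czajor describes (pay the flash latency the first time a branch target is fetched, then get it with zero wait states) can be illustrated with a toy cache model in Python. This is only a sketch, not ST's actual design: the 64 lines of 128 bits come from the article, but the least-recently-used replacement policy, the 3 wait states and all names are my own assumptions.

```python
from collections import OrderedDict


class BranchCache:
    """Toy model of a small cache of flash lines (not the real ART accelerator)."""

    def __init__(self, lines: int = 64, line_bytes: int = 16, wait_states: int = 3):
        self.lines = lines              # 64 lines, as quoted in the article
        self.line_bytes = line_bytes    # 128 bits = 16 bytes per line
        self.wait_states = wait_states  # hypothetical flash penalty
        self._cache = OrderedDict()     # line number -> True, oldest entry first

    def fetch(self, address: int) -> int:
        """Return the wait states paid to fetch the line that holds `address`."""
        line = address // self.line_bytes
        if line in self._cache:
            self._cache.move_to_end(line)    # mark as most recently used
            return 0                         # hit: no wait states
        if len(self._cache) >= self.lines:
            self._cache.popitem(last=False)  # evict the least recently used line (assumed policy)
        self._cache[line] = True
        return self.wait_states              # miss: pay the full flash penalty
```

The first `fetch(0x1000)` on a fresh `BranchCache()` returns 3, and any later fetch from the same 16-byte line returns 0 until that line is evicted.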

Claiming users will get a performance increase from this approach, Czajor added that performance could be pushed without ART, but at a cost. "There would be the need for a bigger flash memory, which would draw more current and create more emi. The most economic solution is to make the flash smaller or slower, but with the memory accelerator. By integrating the memory accelerator with slower flash, you can still get the full performance of the core up to 120MHz." And because ART reduces the number of flash accesses, less power is consumed.

ST says competing Cortex-M3 mcus can now only outperform the STM32 by pushing the clock to more than 120MHz, which will increase power consumption and heat dissipation.

NXP has implemented up to 1Mbyte of on chip flash in two banks (see fig 2) for the same reason: power.

Perhaps I should rethink my decision not to use the ST Cortex M3 models:

http://www.dataofic.com/201012/cortexm3-play-ultimate-performance-stm32-f2-series.html


The new STM32 F-2 microcontroller product line from STMicroelectronics combines ST's advanced 90nm manufacturing process with its innovative adaptive real-time memory accelerator (ART Accelerator™) to get the full performance out of the Cortex-M3 architecture. When executing code from flash memory at 120MHz, the STM32 F-2 microcontroller delivers up to 150 Dhrystone MIPS, the highest performance of a Cortex-M3 processor at this frequency. CoreMark test results show that, when executing from flash, the series has a dynamic power consumption of 188uA/MHz, the equivalent of 22.5mA at 120MHz. In addition to having the largest-capacity built-in flash memory of any Cortex-M3 microcontroller on the market, the new series also enhances support for video and imaging, device interconnection, security, encryption, audio and control applications.
 

Modelworks

Lifer
Feb 22, 2007
16,240
7
76

The RX core from renesas
With the rx core (For example the RX600)it is claimed that the flash memory itself can be accessed within 10 nanosecond clock cycles without the use of some sort of buffer or extreme wide data path to the flash memory.

Renesas claims that the various cortex M3 chips are not really that fast at all because these chips have slow flash. Is this true ?

I have the Renesas development board based on the RX core. The RX core is an interesting design with a lot of promise. The core requires 3 clock cycles to read or write data to the flash. The total time is anywhere from 10ns to a max of 15ns for internal flash, and 15 to 30ns for external flash. They included phase change flash memory on the board since the RX core can use it like RAM, and because phase change memory lasts 100x longer than NAND flash it is a perfect match.

For the RX core, programming 128 bytes of flash takes under 5ms. Phase change memory is also random access and bit addressable, unlike NAND, which is another reason to use it like RAM. Internally the core uses 2 buses, one for RAM and one for ROM. Both are 32 bits wide and both can be used simultaneously, so you can send data to a peripheral and read and write to RAM at the same time. Or you can process an instruction and access RAM at the same time. It is almost like having dual cores, with the second core knowing what the first core wants it to do without the first having to stop and tell it.


I am going to munch through some datasheets to compile some information.

Did you get the main datasheet ? The main one is 2000 pages, ugh:(

I do know that even if this RX600 chip is that much faster and the cortex M3 chips are really that much slower, the cortex M3 chips have still the advantage of having several different manufacturers and GNU GCC support.
As such locking yourself in with the vendor is not necessary and development tools are lower in price while there is a large community support.

That would be the one disadvantage I can think of, but development tools are becoming much more common now. I use IAR mostly and they have added support for the core.


Does anybody have some experience when it comes to the various Cortex M3 cores and / or the RX600 core ?

I want to build experience with the new cortex M3 chips. I have been looking at a lot of different chips but i think will select the LPC1768 or LPC1769 from NXP over the LM3S1968 from Texas Instruments or the other chips from ST.

The RX600 core question : I just would like to know more about because of my work.


I think it comes down to the target application. There is no question that ARM is going to be big, but it doesn't work for everything. The RX core packs a lot into a small package. You have a chip that runs at 100MHz and contains an FPU, DSP, 32 and 64 bit registers, 4Gbyte address space, real time clock calendar, Ethernet controller, USB 2.0 host or slave, 6 serial ports, CAN, 2 D/A, 2 A/D, CRC calculator, I2C buses, 512KB flash and 96KB RAM, for a total price of $7 per chip. That price point can't be touched by ARM for the features included.

On top of all that, Renesas is getting ready to release a 200MHz version.

If I were going to spend some time with the newer ARM chips I would go with ST or NXP. TI is a great company, but they tend to price things high, and getting support can be difficult unless marketing feels you might bring in some big bucks for the company. You can get an M3 board cheap if you can find some of the STM8S discovery boards that ST was selling last year. It was designed to show off a different ST part, but a chip included on the board just happens to be the STM32F103 running at 60MHz, with all the JTAG pins provided and the USB port soldered on. I bought 2 last year for $8 each.
 
May 11, 2008
22,142
1,401
126
CORTEX M3
Well, judging by what NXP and ST do to hide the latency of the flash, it is obvious they use slower flash. Interesting what you mention about TI. I just looked at the ST F2 Cortex M3 series. They are still under evaluation, but they seem interesting to me. I think I will hold off diving into Cortex M3 chips until the ST F2 series is available. This is hobby material for me :).

RXCORE
Yes, from what I have read, it seems very promising. According to my colleague, GCC support is available. It is called GNURX. I have not looked at the website yet, but since it has GNU in the name I assume it is free. Then the only question is how to program the chip. I would not be surprised if they provide an onboard bootloader, as is very common with most manufacturers of ARM-based microcontrollers.

http://www.kpitgnutools.com/releaseNotes.php?view=RNDET&RN=440
 

Modelworks

Lifer
Feb 22, 2007
16,240
7
76
RXCORE
Yes, from what i have read, it seems very promising. According to my colleague GCC support is available. It is called GNURX. I have not looked at the website yet. But since it has GNU in the name i assume it is free. Then the only question is how to program the chip. I would not be surprised if they provide an onboard bootloader as is very common with most manufacturers of ARM based microcontrollers.

http://www.kpitgnutools.com/releaseNotes.php?view=RNDET&RN=440

I had forgotten about those GNU tools. Yes, they are free, but you have to register and get a key to install them. Why, I am not sure.

The RX cores are programmed via JTAG. I currently use this product from Segger. At $60 for non-business use, it is the best JTAG on the market.
http://shop-us.segger.com/J_Link_EDU_p/8.08.90.htm

The J-Link is probably the easiest JTAG interface I have ever used. Simply connect it to USB and plug in the micro. From that point on you have full control: upload, download, breakpoints, real-time memory and registers, etc. The same J-Link works with ARM and RX cores, and with lots of other devices as well with a bit of tweaking.
 

exdeath

Lifer
Jan 29, 2004
13,679
10
81
Screw flash memory and all the rest of our ancient achilles heel storage tech. Wake me up when STT-MRAM is in production and I can move gigs of data as fast as the CPU can process it or when I can saturate a HT or QPI bus momentarily by copying a 100GB file.

I want non volatile storage that can keep up with our cpus and busses, otherwise all human technology will remain in the stone age with progress bars and hourglasses.

I always wonder where we would be if capacitor-based DRAM hadn't been so cheap and easy in its infancy, and we had developed and refined magnetic core memory over the years instead of hard drives and DRAM. Now all our technology is a slave to slow hard drives and flash memory.

What a paradigm shift it will be when non-volatile RAM and "hard disk" are one and the same and there is no longer a need to "load" or "save" stuff to inferior, slow third-tier storage media. Just go to where it is in directly addressable storage and run it from where it stays in non-volatile RAM forever, no more loading and waiting for hourglasses.

There hasn't been a MRAM chip since Freescale's 4 Mbit part, wonder when it will start being tested in small scale consumer electronics like cell phones.
 

exdeath

Lifer
Jan 29, 2004
13,679
10
81
PS ARM is the coolest instruction set ever! Totally orthogonal, and I especially love the optional 4th operand logical operation on every arithmetic instruction. Makes DSP code especially clean.
 
May 11, 2008
22,142
1,401
126
PS ARM is the coolest instruction set ever! Totally orthogonal, and I especially love the optional 4th operand logical operation on every arithmetic instruction. Makes DSP code especially clean.

I like ARM too. I wanted to step away from assembly, but ARM assembly is like Jennifer Love Hewitt: one cannot get enough. I just recoded some interrupt and exception handlers. It is so much easier with ARM compared with other architectures. I love the conditional execution the most, although it has been removed from the Thumb-2 instruction set for economic reasons. Thumb-2 has an "if-then" instruction that does something similar, but only for up to 4 instructions. ARM just delivers. I do wish the ARM people would rethink the interrupt scheme again. The FIQ and IRQ are easy to use, and the RX uses the FIQ/IRQ principle of the ARM architecture but has enhanced it to lower interrupt latencies. The Cortex M3 cores have a fixed but predictable interrupt scheme of 12 cycles before entering the interrupt. The ARM7TDMI is a lot slower. The RX core from Renesas seems to have all the best of the ARM7TDMI architecture, with the bad spots removed and floating point instructions added. Yet it is not an ARM architecture, it seems.
 
May 11, 2008
22,142
1,401
126
It is true :
The SAM7S256 (ARM7TDMI) has a special place in my heart...:wub:
Thus it is logical that I would use this MCU as the comparison. I will cry lonely tears when Atmel decides this chip is no longer available. I weep in fear of that dreadful day, which will come in the future anyway :'(..
Jackie Wilson - Lonely Teardrops.
http://www.youtube.com/watch?v=2nEfuE8Pw4U


Thus i turn into my logical self and write with a cold electric heart that it is inevitable. But still...

Back to the point that I was about to make. I am comparing the RX core with the SAM7S256. One of the beautiful aspects of the SAM7S is that there are 2 different registers to set and to clear a bit, for example for I/O. This means you do not have to do OR logic on a register to set a bit and inverted AND logic to clear that same bit
(or use the BIC instruction in the ARM architecture, which does the same as AND NOT).

One of the many best things since sliced bread, is memory mapped i/o.

In general, on the ARM architecture and on any read-modify-write architecture, this is what needs to be done to modify an I/O port bit:

Set a bit :
[1] read i/o port register data into arm core register. [Peripheral to Core].
[2] modify data with ORR instruction to set a bit. [ Core only].
[3] write data in register back to i/o port register. [Core to peripheral].

Clear a bit:
[1] read i/o port register data into arm core register. [Peripheral to Core].
[2] modify data with BIC instruction to clear a bit. [ Core only].
[3] write data in register back to i/o port register. [Core to peripheral].
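The two three-step sequences above can be mimicked in Python with a simulated register that counts bus accesses. This is just a model to make the cost visible, not real driver code; `Register32` and the two helper functions are names invented for this sketch.

```python
MASK32 = 0xFFFFFFFF


class Register32:
    """Simulated 32-bit memory-mapped register that counts bus accesses."""

    def __init__(self, value: int = 0):
        self.value = value & MASK32
        self.reads = 0
        self.writes = 0

    def read(self) -> int:
        self.reads += 1
        return self.value

    def write(self, value: int) -> None:
        self.writes += 1
        self.value = value & MASK32


def set_bits_rmw(reg: Register32, mask: int) -> None:
    data = reg.read()        # [1] peripheral to core
    data |= mask             # [2] core only: ORR
    reg.write(data)          # [3] core to peripheral


def clear_bits_rmw(reg: Register32, mask: int) -> None:
    data = reg.read()        # [1] peripheral to core
    data &= ~mask & MASK32   # [2] core only: BIC, i.e. AND NOT
    reg.write(data)          # [3] core to peripheral
```

Each set or clear costs one read and one write on the peripheral bus, plus the instruction in between.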


But the SAM7S uses 2 registers:

One (SET BIT) register where, when you OR a bit, then that bit becomes set.
And one (CLEAR BIT) register where, when you OR a bit, then that bit becomes cleared.

Examples are these registers of the Port I/O:
PIO_SODR = Pio Set Output Data Register.
PIO_CODR = Pio Clear Output Data Register.

Thus with the SAM7S256 you do it this way :

[1] modify data with ORR instruction to set a bit.[ Core only].
[2] write data in register to i/o port register. [Core to peripheral].

On some registers there is no need to read the data first, because of the separated nature of the registers. In all honesty, I have to say this is not something special to the SAM7 series from Atmel. The (ARM7TDMI) LPC series from NXP uses the same system for GPIO.

(Of course, the C compiler does this for you, but not when you write assembly.)



Why am I writing this? Because I noticed that with the SAM7S all peripheral registers are 32 bits wide. Thus every I/O register is treated as a 32-bit wide value. Since the core is a 32-bit wide architecture, this makes sense.
Setting large values is easy because of the barrel shifter built into the ALU.

With the RX core, a lot of registers vary from 8 to 32 bits wide. My question is thus: would this hurt programs that need fast access? To be fair, the Ethernet part of the RX core, for example, is 32 bits wide.

My problem is that in reality, all MCUs and CPUs have trade-offs. The RX core seems to be promoted as the fastest model (in calculations), but how about the I/O control? That is just as important as raw calculation power for MCUs in the embedded world. My issue is that the RX core seems just too good to be true. Even my SAM7S256 has a drawback: because of its design, it uses slightly more flash and RAM. But that is hardly an issue these days. Besides, to me it seems to be able to do some things faster than the competition (similar ARM7TDMI MCUs) when the design optimization features are used.
 

Modelworks

Lifer
Feb 22, 2007
16,240
7
76
I think this explains the way the RX core handles i/o better than I could:

http://www.electronicsweekly.com/Ar...as-rx-series-microcontrollers-an-overview.htm

Studies were conducted into the effect of register number on speed, code and chip size. Customer code was utilized to find the best number of general purpose registers. The study found that performance suffers significantly and code size is unnecessarily large with eight registers. As the number of registers increases, the code size shrinks and performance improves. However, chip size and complexity increase. The research concluded that a configuration of sixteen 32-bit registers (R0…R15) is the “sweet spot” in this trade-off. It provides the best performance and code size versus silicon area used.

The RX CPU core also has a Floating Point Unit (FPU), an unusual feature in its class. The FPU implementation is smart because it operates using main CPU registers rather than with its own register set. This avoids the communication overhead that is usually associated with normal FPU implementations, which are more like a coprocessor.
For integer data, the RX CPU offers further computation units such as a multiplier, divider and a barrel shifter. These support DSP-algorithms, for example with Repeat Multiply Accumulate (RMAC), as is common in filter calculations.
Another important factor is fast reaction to interrupts. For this reason, the RX core has a very flexible interrupt system, so engineers can finely tune the microcontroller operation to best suit their application. It has three different kinds of interrupt:
• Normal interrupt
• Fast interrupt
• High speed interrupt
Normal interrupts store the relevant registers on stack by using push/pop instructions, so all general registers are free to be used by the Interrupt Service Routine. In case of fast interrupts, Program Counter and Processor status word are automatically stored in special backup registers (BPSW and BPC), so response time is faster. High speed interrupts allocate up to four of the general registers for dedicated use by the interrupt to increase speed even further.



The internal bus structure provides 5 internal busses to ensure data handling is not slowed down by bottlenecks. The instruction bus is 64-bit, while all other busses are 32-bit wide. The structure supports one dedicated bus each for DMA (Direct Memory Access), DTC (Data Transfer Control) and E-DMA (Ethernet Direct Memory Access) transfers.
RX600 has yet another DMA module, the EXDMA, which can transfer data directly between two external resources in one bus cycle. A common usage example is where an external SRAM is connected to a TFT display. The SRAM acts as frame buffer and the EXDMA transfers stored image data to the TFT without interfering at all with the internal operation of the RX MCU.



Instruction fetches occur via a wide 64-bit bus, so that due to the variable length instructions used in CISC architectures, a single clock cycle loads between 1 and 8 instructions. They are fed into a 5-stage pipeline with four 8-byte instruction queues.
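The "between 1 and 8 instructions" figure follows from dividing the 8-byte fetch by the instruction length. A small Python sketch of that arithmetic, assuming instruction lengths of 1 to 8 bytes (which is what the quoted range implies) and ignoring instructions that straddle a fetch boundary; the function name is made up:

```python
def instructions_per_fetch(lengths_bytes, fetch_bytes: int = 8) -> float:
    """Average number of instructions delivered per fetch for a given instruction mix."""
    if not lengths_bytes or any(not 1 <= n <= fetch_bytes for n in lengths_bytes):
        raise ValueError("instruction lengths must be between 1 and fetch_bytes")
    mean_length = sum(lengths_bytes) / len(lengths_bytes)
    return fetch_bytes / mean_length
```

A stream of only 1-byte instructions gives 8 per fetch, and a stream of only 8-byte instructions gives 1 per fetch. Real code lands somewhere in between.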
Frequently used instructions are short to keep code size small and many instructions execute in just one clock cycle to speed up the software. The CPU has 73 basic, 8 floating-point and 9 DSP instructions. It has 10 addressing modes, with register-register operations, register-memory operations, and bitwise operations included.
I can say from using the demo board of the RX62 core that I have been pretty impressed at what it can accomplish.
Also check out the Renesas channel on YouTube
http://www.youtube.com/user/RenesasPresents


Review of the board I have, cost about $99
http://www.youtube.com/user/RenesasPresents#p/u/8/D9bpB1UgcRs
 
May 11, 2008
22,142
1,401
126
In the post above I mentioned how using separate set and clear registers can eliminate the need to do a read-modify-write. This can considerably speed up code that sets and clears pins, or functions such as interrupt enabling and disabling, or output data registers. I was wondering how this is implemented. I realized I made an error in my post and in my code as well. It must be:

Set a bit :
[1] read i/o port register data into arm core register. [Peripheral to Core].
[2] modify data with ORR instruction to set a bit. [ Core only].
[3] write data in register back to i/o port register. [Core to peripheral].

Clear a bit:
[1] read i/o port register data into arm core register. [Peripheral to Core].
[2] modify data with BIC instruction to clear a bit. [ Core only].
[3] write data in register back to i/o port register. [Core to peripheral].


But the SAM7S uses 2 registers:

One (SET BIT)register where when you write(1) a bit , then that bit becomes set.
And one (CLEAR BIT)register where when you write(1) a bit , then that bit becomes cleared.

Examples are these registers of the Port I/O :
PIO_SODR = Pio Set Output Data Register.
PIO_CODR = Pio Clear Output Data Register.

Thus with the SAM7S256 you do it this way :


A 1 will set or clear the corresponding bit.
[1] write data in register to i/o port register. [Core to peripheral].
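As a sketch of the difference, here is the same kind of Python model for a port with separate set and clear registers. The names PIO_SODR and PIO_CODR are the ones listed above; the class, the method names and the access counter are invented for illustration, and the real peripheral has more registers than this.

```python
MASK32 = 0xFFFFFFFF


class PioPort:
    """Toy model of a port with write-only set and clear registers."""

    def __init__(self):
        self.output = 0        # current state of the 32 output bits
        self.bus_accesses = 0  # every register write is one bus access

    def write_sodr(self, mask: int) -> None:
        """Model of a write to PIO_SODR: every 1 in mask sets that output bit."""
        self.bus_accesses += 1
        self.output |= mask & MASK32

    def write_codr(self, mask: int) -> None:
        """Model of a write to PIO_CODR: every 1 in mask clears that output bit."""
        self.bus_accesses += 1
        self.output &= ~mask & MASK32
```

Setting or clearing bits is a single write with no read beforehand, and bits written as 0 are left untouched.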

How does it work ?
I made an example and it is quite easy. By using OR gates together with flip-flops and separate access registers, the logical OR from the read-modify-write and the logical AND NOT (BIC in the ARM architecture) from the read-modify-write are performed in hardware. I think this is how it is set up.
The extra AND gates are there to prevent a race condition in my simulation and are not really needed if the RS flip-flop is reset dominant. I did the simulation with a free program called Digital Works 95: http://www.spsu.edu/cs/faculty/bbrown/circuits/howto.html. I think this is basically how the hardware functions.

dataregisters.png


If you copy this 31 more times and you write 32 bits at a time, you will notice that the bits that are set stay set and the bits that are cleared stay cleared.
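The same experiment can be repeated in a few lines of Python instead of a schematic. This is only a behavioural sketch of the circuit as I understand it: one reset-dominant RS flip-flop per bit, with the set input driven by the bit written to the set register and the reset input by the bit written to the clear register. The function names are invented.

```python
def rs_flipflop(q: int, s: int, r: int) -> int:
    """Next state of one reset-dominant RS flip-flop (all values are 0 or 1)."""
    return (q | s) & (1 - r)  # reset wins when s and r are both 1


def write_port(state, set_word: int = 0, clear_word: int = 0):
    """Apply one 32-bit write to 32 independent flip-flops; state[0] is bit 0."""
    return [
        rs_flipflop(q, (set_word >> i) & 1, (clear_word >> i) & 1)
        for i, q in enumerate(state)
    ]
```

Starting from `[0] * 32`, a write with `set_word=0b101` sets bits 0 and 2. A later write with `clear_word=0b001` clears only bit 0, and a write of all zeros changes nothing.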
 