Building efficient multimedia architectures for mobile computing

If you think of smartphones and tablets as the go-to devices for personal computing, you’ve probably wondered how technology has evolved to the point where consumers can run demanding applications with unique, feature-rich user experiences on mobile chipsets that draw only a few watts of power.

The answer lies in efficient processing. Hardware IP must be designed with a focus on efficiency: delivering the best performance at the lowest possible power consumption. This makes these technologies better suited for current and future workloads in mobile and embedded processing.

Building processor architectures with efficiency in mind has multiple implications. In this article we look at how certain features of a mobile GPU are linked to the memory subsystem of a typical mobile processor.

A crash course in memory hierarchy for mobile processors

There are three layers of memory inside a mobile device:

– The first tier (cache or on-chip memory) is usually the most expensive, but the fastest to access. Mobile GPUs have L0, L1, and L2 caches around the unified shading engine (PowerVR SGX) or unified shading cluster (PowerVR ‘Rogue’), while the shared system level cache (SLC) contains partitioned cache memory (Cache 0, 1, 2, etc.) for improved memory bandwidth.

– The second tier consists of RAM (Random Access Memory), the fast memory used by applications to hold data and instructions. You can think of it as a relatively large filing cabinet that the processing units in your mobile device can access in a timely manner.

– The third tier typically consists of flash memory, which is several orders of magnitude slower than RAM and is primarily used for bulk storage in MultiMediaCards, SD memory cards, USB flash drives, solid-state drives, and similar products.
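The value of this hierarchy comes from keeping most accesses in the fastest tier. A minimal sketch in Python, using rough order-of-magnitude latencies (assumed for illustration, not measured from any specific SoC), shows how the average access time stays close to cache speed as long as the hit rate is high:

```python
# Illustrative three-tier model of the hierarchy described above.
# Latency figures are order-of-magnitude assumptions, not measurements.
TIERS = {
    "on-chip cache": 1e-9,   # ~1 ns
    "DRAM":          1e-7,   # ~100 ns
    "flash storage": 1e-4,   # ~100 us
}

def average_access_time(cache_hit_rate, dram_hit_rate):
    """Expected access time when misses fall through to the next tier."""
    cache_t = TIERS["on-chip cache"]
    dram_t = TIERS["DRAM"]
    flash_t = TIERS["flash storage"]
    return (cache_hit_rate * cache_t
            + (1 - cache_hit_rate) * dram_hit_rate * dram_t
            + (1 - cache_hit_rate) * (1 - dram_hit_rate) * flash_t)

# With a 90% cache hit rate, the average stays near DRAM/cache speed
# even though flash is a thousand times slower than DRAM.
print(average_access_time(0.90, 0.99))
```

Even a small drop in cache hit rate shifts traffic to DRAM, which is exactly the off-chip traffic that the next section identifies as a major power cost.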

Mobile SoC - memory hierarchy

In a smartphone or tablet, mobile DRAM memory is usually stacked on top of the chip, while flash memory is a separate component on the board.

Memory bandwidth is an important driver of power consumption

One of the biggest consumers of power in a System-on-Chip (SoC) is off-chip memory traffic. According to Rambus, advanced multimedia applications (3D gaming, Full HD video decode/encode, augmented reality, etc.) push memory bandwidth requirements well above 12 Gbps. At a typical LPDDR3 power cost of 60-80 mW/Gbps, total RAM power consumption due to memory traffic can reach 1 W or more, depending on the type of memory used, its frequency, and so on. To put that into perspective, a smartphone SoC peaks at around 1-2 W, while a tablet SoC can reach 4-5 W; this means that power consumption due to memory traffic can range from 25% to 50% of total system power.
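A quick back-of-the-envelope check of these figures, using the Rambus bandwidth number and the LPDDR3 cost range quoted above:

```python
# Sanity check of the quoted figures: 12 Gbps at 60-80 mW/Gbps.
bandwidth_gbps = 12          # memory bandwidth demand (Gbps)
mw_per_gbps = (60, 80)       # LPDDR3 power cost range (mW per Gbps)

ram_power_w = [bandwidth_gbps * cost / 1000 for cost in mw_per_gbps]
print(ram_power_w)           # roughly 0.72 W to 0.96 W

# Against a ~2 W smartphone SoC peak, that is close to half the budget.
soc_power_w = 2.0
share = ram_power_w[1] / soc_power_w
print(f"memory traffic share: {share:.0%}")
```

The result lands at just under 1 W, consistent with the 25-50% share of system power cited in the article.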

This is because the processing units in a mobile SoC usually work in parallel and must therefore compete with each other for access to RAM. If not planned correctly, both system- and unit-level design decisions can degrade system performance while increasing cost and power consumption. Brute-force, power-hungry processors not only hurt their own efficiency within the system; they are also likely to drag down the chipset’s overall performance, because such designs tend to consume memory bandwidth at an accelerated pace.

PowerVR Series6 ‘Rogue’: Improving memory bandwidth with TBDR and triple compression

PowerVR Series6 is the latest generation of GPU IP from Imagination, delivering exceptional levels of performance while maintaining the low power and high system latency tolerance of the Tile-Based Deferred Rendering (TBDR) architecture. TBDR provides an elegant solution to drawing only the visible elements in the scene.

TBDR architecture

All parts of the TBDR pipeline are handled fully in hardware and are completely invisible to software developers, ensuring maximum compatibility and performance across our different GPU families. PowerVR’s unique smart parameter management technology allows TBDR to operate within a limited memory footprint, so highly complex titles remain compatible without excessive memory usage.

Triple compression offers lower power consumption and superior memory bandwidth efficiency

The first bandwidth-reduction feature in PowerVR Series6 cores is support for the PVRTC2 texture compression format at both 2 bpp and 4 bpp bit rates. PVRTC2 adds new features such as high-contrast textures, NPOT support, and sub-texturing, and sits alongside the existing PVRTC format, which has been built into PowerVR hardware since the release of the Series4 GPU family. PVRTC and PVRTC2 achieve between 8:1 and 16:1 compression for texture data in graphically intensive applications.
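The 8:1 and 16:1 figures follow directly from the fixed bit rates: PVRTC stores textures at 4 or 2 bits per pixel, versus 32 bits per pixel for uncompressed RGBA8888. A quick sketch:

```python
# Where the 8:1 to 16:1 ratios come from: PVRTC's fixed 4 bpp or 2 bpp
# rates compared against uncompressed 32 bpp RGBA8888.
def texture_bytes(width, height, bits_per_pixel):
    """Storage footprint of a texture at a given fixed bit rate."""
    return width * height * bits_per_pixel // 8

uncompressed = texture_bytes(1024, 1024, 32)   # 4 MiB
pvrtc_4bpp = texture_bytes(1024, 1024, 4)      # 512 KiB -> 8:1
pvrtc_2bpp = texture_bytes(1024, 1024, 2)      # 256 KiB -> 16:1

print(uncompressed // pvrtc_4bpp, uncompressed // pvrtc_2bpp)
```

Because the bit rate is fixed, the GPU can compute a texel’s address directly, which is what makes keeping textures compressed in memory practical, as described next.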

The PVRTC technology defines an efficient hardware texture decompression standard which minimizes the bandwidth needed between the GPU and main memory by enabling the graphics pipeline to maintain the texture in compressed form.

Keeping the texture compressed at all times means the on-chip cache effectively holds several times more texture data. PowerVR GPUs only decompress texture data when it is needed in the final rendering process, resulting in significant reductions in on-chip and off-chip bandwidth and processing power. This is one of many reasons why support for texture compression has been, and continues to be, a key part of every generation of the PowerVR graphics architecture, including all upcoming ‘Rogue’-based platforms.

Secondly, there are two distinct types of lossless compression integrated into the PowerVR Series6 architecture:

– Lossless geometry compression

This is included in all PowerVR ‘Rogue’ cores and compresses the intermediate geometry data as part of the updated tile-based processing. Geometry compression provides a 3:2 compression ratio, thus saving memory footprint and memory bandwidth, which in turn leads to lower power consumption.

– Lossless image compression (also referred to as lossless framebuffer (de)compression, FBC/FBDC)

This optional component (included in PowerVR G6x30 cores) operates on uncompressed image/texture data flows. With lossless image compression, render targets can be written out by the GPU in compressed form (render to texture and framebuffer) and read back by the GPU with decompression (texturing). This means the GPU saves on both write and read bandwidth, typically achieving a compression ratio of 2:1 (rising as high as 30:1 for blocks of constant colour). Textures are losslessly compressed by the GPU hardware as part of the texture upload, without extra effort from the developer. Decompression can also be integrated into the display controller, allowing final framebuffer writes and reads (by the display pipe) to be compressed, leading to system-level bandwidth savings.
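To get a feel for what a typical 2:1 ratio saves, consider a hypothetical 1080p, 60 fps display pipeline in which the GPU writes each frame once and the display controller reads it once. The resolution, refresh rate, and pixel format here are illustrative assumptions, not figures from this article:

```python
# Rough estimate of the bandwidth saved by 2:1 lossless framebuffer
# compression for an assumed 1080p RGBA8888 display at 60 fps.
width, height, bytes_per_pixel, fps = 1920, 1080, 4, 60

frame_bytes = width * height * bytes_per_pixel
# The GPU writes the frame once; the display controller reads it once.
uncompressed_bps = frame_bytes * fps * 2
compressed_bps = uncompressed_bps / 2      # typical 2:1 ratio

saved_mb_s = (uncompressed_bps - compressed_bps) / 1e6
print(f"saved: {saved_mb_s:.0f} MB/s")
```

Under these assumptions the uncompressed pipeline moves close to 1 GB/s of framebuffer traffic, so even a modest 2:1 ratio removes hundreds of MB/s from the memory interface.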

Triple compression - designed for low power and bandwidth efficiency

This technology is particularly useful for browsers and certain GUI components that cannot accept any quality degradation. In memory-bandwidth-constrained systems, lossless image compression can also lead to higher sustained frame rates.

In conclusion, by choosing IP blocks that are proven to reduce memory bandwidth and increase system efficiency, system architects can achieve smooth, solid performance while extending device battery life and lowering bill-of-materials costs. The technologies described above enable mobile SoCs to come very close to the image quality and processing capabilities of desktop PCs and consoles while consuming only a small fraction of their power, offering consumers superior graphics performance in a compact form factor.
