IraqiGeek's Blog

A general overview of the XCORE architecture

DISCLAIMER: This is a very brief introduction to the XMOS XCore architecture, based on the XMOS XS1 Architecture documentation. Large portions and many details of the architecture document have been omitted for the sake of brevity, so this overview is not in any way a substitute for the XS1 Architecture document.

The XS1 is a family of programmable, general purpose processors that can execute languages such as C. They have direct support for concurrent processing (multi-threading), communication and input-output. The XS1 products are intended to make it practical to use software to perform many functions which would normally be done by hardware; an important example is interfacing and input-output controllers.
  • The XCore Instruction Set: The main features of the instruction set used by the XCore processors are:
    • Short instructions provide efficient access to the stack and other data regions, as well as efficient branching and subroutine calling.
    • Memory is byte addressed; however all accesses must be aligned on natural boundaries so that, for example, the addresses used in 32-bit loads and stores must have the two least significant bits zero.
    • Each thread has its own set of registers.
    • Input and output instructions allow very fast communication between threads and support high-speed, low-latency input and output operations.
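As a small illustration of the alignment rule above, an address used in a 32-bit load or store must have its two least-significant bits zero, and a 16-bit access must fall on an even byte address. A minimal sketch (the helper names are ours, not part of the architecture):

```python
def is_word_aligned(addr: int) -> bool:
    # 32-bit (word) accesses must have the two least-significant
    # address bits zero, per the natural-alignment rule.
    return (addr & 0x3) == 0

def is_halfword_aligned(addr: int) -> bool:
    # 16-bit accesses must be on an even byte address.
    return (addr & 0x1) == 0
```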

  • Threads and Concurrency: Each XCore tile has hardware support for executing a number of concurrent threads. This includes:
    • A set of registers for each thread.
    • A thread scheduler which dynamically selects which thread to execute.
    • A set of synchronisers to synchronise thread execution.
    • A set of channels used for communication with other threads.
    • A set of ports used for input and output.
    • A set of timers to control real-time execution.
    • A set of clock generators to enable synchronisation of the input-output with an external time domain.

    The set of threads on each XCore tile can be used to:
    • Implement input-output controllers executed concurrently with applications software.
    • Allow communications or input-output to progress concurrently with processing.
    • Hide latency in the interconnect by allowing some threads to continue whilst others are waiting for communication to or from other cores or other external hardware.
    The instruction set enables threads to communicate with each other and to perform input and output operations, and these communications and input-output operations are event-driven. Waiting threads are automatically descheduled, making more processing power available to the threads that are running; this also allows the processor to idle with its clocks disabled to save power when all of its threads are waiting. The instruction set also supports streamed, packetised, or synchronous communication between threads, allowing the interconnect to be pipelined and input-output to be buffered.

  • Instruction Execution And The Thread Scheduler: The processor is implemented using a short pipeline to maximise responsiveness. It is optimised to provide deterministic execution of multiple threads. Typically, over 80% of instructions executed are 16-bit, so the XS1 processor can fetch two instructions every cycle. As typically less than 30% of instructions require a memory access, the processor can run at full speed using a unified memory system.
    Threads on a tile are intended to be used to perform several simultaneous real-time tasks, so it is important that the performance of an individual thread can be guaranteed.
    The scheduling method used allows any number of threads to share a single unified memory system and input-output system whilst guaranteeing that with N threads able to execute, each will get at least 1/N processor cycles.
    This means that the minimum performance of a thread is the XCore tile's processing power divided by the number of concurrent threads at a specific point in the program. In practice, performance will almost always be higher than this, because individual threads can be delayed waiting for input or output and their unused processor cycles are taken by other threads.
    The time taken to re-start a waiting thread is at most one thread cycle. The set of N threads can therefore be thought of as a set of virtual processors, each with a clock rate of at least 1/N of the clock rate of the processor itself. The only exception is that if the number of threads is less than the pipeline depth P, the clock rate of each thread is at most 1/P.
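The guarantee above reduces to a short calculation. Assuming, purely for illustration, a 400 MHz tile and a pipeline depth of 4 (the real figures depend on the specific XS1 device), the minimum issue rate of a single thread comes out as:

```python
def min_thread_rate_mhz(core_mhz, n_threads, pipeline_depth):
    """Lower bound on one thread's instruction issue rate (MHz).

    With N runnable threads, each gets at least 1/N of the cycles;
    with fewer runnable threads than the pipeline depth P, a single
    thread can still issue at most once every P cycles.
    """
    return core_mhz / max(n_threads, pipeline_depth)

# Illustrative figures only:
# 8 runnable threads on a 400 MHz tile -> at least 50 MHz each;
# 2 runnable threads -> capped at 100 MHz each by the pipeline depth.
```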
    Each thread has a 64-bit instruction buffer which is able to hold four short instructions or two long ones. Instructions are issued from the runnable threads in a round-robin manner, ignoring threads which are not in use or are paused waiting for a synchronisation or input/output operation.
    The pipeline has a memory access stage which is available to all instructions. If the instruction buffer is empty when an instruction should be issued, a special fetch no-op is issued; this will use its memory access stage to load the issuing thread’s instruction buffer.
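A toy model of the per-thread instruction buffer and the fetch no-op described above (the class and its fields are ours, and real fetch behaviour is more involved than this sketch):

```python
class ThreadIBuf:
    CAPACITY_BITS = 64  # holds four short (16-bit) or two long (32-bit) instructions

    def __init__(self):
        self.bits = 0  # instruction bits currently buffered

    def issue(self, instr_bits=16):
        """Issue one instruction for this thread.

        If the buffer is empty when an instruction should issue, a
        special fetch no-op is issued instead; it uses its memory
        access stage to refill the issuing thread's buffer.
        """
        if self.bits < instr_bits:
            self.bits = self.CAPACITY_BITS  # refilled by the fetch no-op
            return "fetch-noop"
        self.bits -= instr_bits
        return "instr"
```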
    Certain instructions cause threads to become non-runnable, for example while waiting for an input channel that has no available data. When the data becomes available, the thread will resume from the point where it paused.
    The tile scheduler therefore allows threads to be treated as virtual processors whose performance can be predicted by design tools. There is no possibility of performance being reduced below these predicted levels when virtual processors are combined.
    Instruction execution from each thread is managed by the thread scheduler. This scheduler maintains a set of runnable threads from which it takes instructions in turn. When a thread is unable to continue, it is paused by removing it from the run set.
    A thread may be paused because:
    • Its registers are being initialised prior to it being able to run.
    • It is waiting to synchronise with another thread before continuing.
    • It is waiting to synchronise with another thread and terminate (a join).
    • It has attempted an input from a channel which has no data available, or a port which is not ready, or a timer which has not reached a specified time.
    • It has attempted an output to a channel or a port which has no room for the data.
    • It has executed an instruction causing it to wait for an event or interrupt which may be generated when channels, ports or timers become ready for input.
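The run-set behaviour described above can be sketched as a toy round-robin scheduler (a simplified model for illustration, not the hardware implementation):

```python
from collections import deque

class ThreadScheduler:
    """Toy model of the run set: instructions are taken from runnable
    threads in turn; a paused thread is removed from the run set and
    re-added once its wait condition is satisfied."""

    def __init__(self, thread_ids):
        self.run_set = deque(thread_ids)

    def next_thread(self):
        # Round-robin: issue from the head, then rotate it to the back.
        tid = self.run_set[0]
        self.run_set.rotate(-1)
        return tid

    def pause(self, tid):
        # e.g. waiting on a channel, port, timer, or synchronisation.
        self.run_set.remove(tid)

    def resume(self, tid):
        # e.g. data became available on the channel the thread was reading.
        self.run_set.append(tid)
```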

  • Communication: Communication between threads is performed using channels. Channels provide full-duplex data transfer between threads. Channels carry messages constructed from control and data tokens between the two channel ends. The control tokens are used to encode communication protocols. A channel end can be used to generate events and interrupts when data becomes available. This allows a thread to monitor several channels, ports, or timers, while only servicing those that are ready.
    Channel ends have a buffer able to hold sufficient tokens to allow at least one word to be buffered. If an output instruction is executed when the channel is too full to take the data then the thread which executed the instruction is paused. The thread is restarted when there is enough room in the channel for the instruction to successfully complete. Likewise, when an input instruction is executed and there is not enough data available then the thread is paused and will be restarted when enough data becomes available.
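A toy model of the channel-end buffering and pausing just described, assuming for illustration a buffer of four 8-bit tokens, i.e. exactly one 32-bit word (the text only guarantees "at least one word"):

```python
class ChannelEnd:
    """Toy model of a channel end. output()/input() return a sentinel
    instead of actually pausing: False/None mean the calling thread
    would be paused until the transfer can complete."""

    BUFFER_TOKENS = 4  # one 32-bit word of 8-bit tokens (illustrative)

    def __init__(self):
        self.tokens = []

    def output(self, token):
        if len(self.tokens) >= self.BUFFER_TOKENS:
            return False  # buffer full: outputting thread would pause
        self.tokens.append(token)
        return True

    def input(self):
        if not self.tokens:
            return None  # no data: inputting thread would pause
        return self.tokens.pop(0)
```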

  • Timers and Clocks: XCore provides a reference clock output which ticks at a standard frequency of 100MHz. In addition to that, a set of programmable timers is provided that can be used by threads to provide timed program execution relative to the reference clock. Each timer can be used by a thread to read its current time or to wait until a specified time.
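Since the reference clock ticks at 100MHz, one tick corresponds to 10 ns, and a timed wait is naturally expressed as a tick count. A small helper (the names are ours):

```python
REFERENCE_HZ = 100_000_000  # 100 MHz reference clock => 10 ns per tick

def ticks_for(seconds):
    """Number of reference-clock ticks spanning the given delay."""
    return round(seconds * REFERENCE_HZ)

# e.g. a 1 ms delay is 100 000 reference-clock ticks.
```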
    A set of programmable clocks is also provided and each can be used to produce a clock output to control the action of one or more ports and their associated port timers. Each clock can use a one bit port as its clock source.
    The data output on the pins of an output port changes state synchronously with that port's clock. If several output ports are driven from the same clock, they will appear to operate as a single output port, provided that the processor is able to supply new data to all of them during each clock cycle.
    Similarly, the data input by an input port from the port pins is sampled synchronously with that port's clock. If several input ports are driven from the same clock they will appear to operate as a single input port provided that the processor is able to take the data from all of them during each clock cycle.
    The use of clocked ports therefore decouples the internal timing of input and output program execution from the operation of synchronous input and output interfaces.

  • Ports, Input and Output: Ports are interfaces to physical pins. They can be tri-stated, or can be configured with pull-ups or pull-downs. A port can be used for input or output. It can use the reference clock as its port clock, or it can use one of the programmable clocks, which in turn can use an external clock source. Transfers to and from the pins can be synchronised with the execution of input and output instructions, or the port can be configured to buffer the transfers and to convert automatically between serial and parallel form (serialization and deserialization). Ports can also be timed to provide precise timing of values appearing on output pins or taken from input pins. When inputting, a condition can be used to delay the input until the data in the port meets the condition. When the condition is met, the captured data is time stamped with the time at which it was captured.
    A port has an associated condition which can be used to prevent the processor from taking input from the port when the condition is not met. When the condition is met a timestamp is set and the port becomes ready for input. When a port is used for conditional input, the data which satisfied the condition is held in the transfer register and the timestamp is set. The value returned by a subsequent input on the port is guaranteed to meet the condition and to correspond to the timestamp even if the value on the port has changed.
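A toy model of conditional input with timestamping as described above (the condition, capture, and timestamp names are ours; the real port hardware holds the value in a transfer register):

```python
class ConditionalPort:
    """Toy model of conditional input: the port captures the first pin
    value that meets its condition together with a timestamp, and a
    later input returns that captured pair even if the value on the
    pins has since changed."""

    def __init__(self, condition):
        self.condition = condition  # predicate over the sampled value
        self.captured = None        # (value, timestamp) once condition met

    def sample(self, value, now):
        # Called on each port-clock edge with the current pin value.
        if self.captured is None and self.condition(value):
            self.captured = (value, now)

    def input(self):
        # None means the inputting thread would pause until the
        # condition is met; otherwise the guaranteed (value, timestamp).
        return self.captured
```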

  • Events and Interrupts: Events and interrupts allow timers, ports and channel ends to automatically transfer control to a pre-defined event handler. A thread normally enables one or more events and then waits for one of them to occur. The thread can perform input and output operations using the port, channel or timer which gave rise to an event whilst leaving some or all of the event information unchanged. This allows the thread to complete handling an event and immediately wait for another similar event.
    Timers, ports and channel ends all support events, the only difference being the ready conditions used to trigger the event.