# Network Systems Design (Intel IXP2xxx)

**Douglas Comer** 

Computer Science Department Purdue University 250 N. University Street West Lafayette, IN 47907-2066

http://www.cs.purdue.edu/people/comer

© Copyright 2004. All rights reserved. This document may not be reproduced by any means without written consent of the author.

Copy permission: these materials are copyright © 2004 by Pearson Education and Douglas Comer, and may not be reproduced by any means without written permission from the author or the publisher. Permission is granted to use the materials in any course for which Comer's text *Network Systems Design Using Network Processors* is a required textbook. In addition to use for in-class presentation, each student who purchases a copy of the textbook is authorized to receive an electronic or paper copy. For permission to use the materials in any way other than the above, contact the author or the publisher.

#### Ι

#### Course Introduction And Overview

#### **Topic And Scope**

The concepts, principles, and technologies that underlie the design of hardware and software systems used in computer networks and the Internet, focusing on the emerging field of network processors.

# You Will Learn

- Review of
  - Network systems
  - Protocols and protocol processing tasks
- Hardware architectures for protocol processing
- Software-based network systems and software architectures
- Classification
  - Concept
  - Software and hardware implementations
- Switching fabrics

# You Will Learn (continued)

- Network processors: definition, architectures, and use
- Design tradeoffs and consequences
- Survey of commercial network processors
- Details of one example network processor
  - Architecture and instruction set(s)
  - Programming model and program optimization
  - Cross-development environment

# What You Will NOT Learn

- EE details
  - VLSI technology and design rules
  - Chip interfaces: ICs and pin-outs
  - Waveforms, timing, or voltage
  - How to wire wrap or solder
- Economic details
  - Comprehensive list of vendors and commercial products
  - Price points

# **Background Required**

- Basic knowledge of
  - Network and Internet protocols
  - Packet headers
- Basic understanding of hardware architecture
  - Registers
  - Memory organization
  - Typical instruction set
- Willingness to use an assembly language

# **Schedule Of Topics**

- Quick review of basic networking
- Protocol processing tasks and classification
- Software-based systems using conventional hardware
- Special-purpose hardware for high speed
- Motivation and role of network processors
- Network processor architectures

# Schedule Of Topics (continued)

- An example network processor technology in detail
  - Hardware architecture and parallelism
  - Programming model
  - Testbed architecture and features
- Design tradeoffs
- Scaling a network processor
- Survey of network processor architectures

# **Course Administration**

- Textbook
  - D. Comer, *Network Systems Design Using Network Processors*, Intel IXP2xxx version, Prentice Hall, 2005.
- Grade
  - Quizzes 5%
  - Midterm and final exam 35%
  - Programming projects 60%

# Lab Facilities Available

- Extensive network processor testbed facilities
- Donations from
  - Agere Systems
  - IBM (now sold to Hifn)
  - Intel
- Includes hardware and cross-development software

# What You Will Do In The Lab

- Write and compile software for an NP
- Download software into an NP
- Monitor the NP as it runs
- Interconnect Ethernet ports on an NP board
  - To other ports on other NP boards
  - To other computers in the lab
- Send Ethernet traffic to the NP
- Receive Ethernet traffic from the NP

# **Example Programming Projects**

- A packet analyzer
  - IP datagrams
  - TCP segments
- An Ethernet bridge
- An IP fragmenter
- A classification program
- A bump-in-the-wire system using low-level packet processors

# **Questions?**

# A QUICK OVERVIEW

# **OF NETWORK PROCESSORS**

# **The Network Systems Problem**

- Data rates keep increasing
- Protocols and applications keep evolving
- System design is expensive
- System implementation and testing take too long
- Systems often contain errors
- Special-purpose hardware designed for one system cannot be reused

#### **The Challenge**

Find ways to improve the design and manufacture of complex networking systems.

# **The Big Questions**

- What systems?
  - Everything we have now
  - New devices not yet designed
- What physical communication mechanisms?
  - Everything we have now
  - New communication systems not yet designed / standardized
- What speeds?
  - Everything we have now
  - New speeds much faster than those in use

# **More Big Questions**

- What protocols?
  - Everything we have now
  - New protocols not yet designed / standardized
- What applications?
  - Everything we have now
  - New applications not yet designed / standardized

#### The Challenge (restated)

Find flexible, general technologies that enable rapid, low-cost design and manufacture of a variety of scalable, robust, efficient network systems that run a variety of existing and new protocols, perform a variety of existing and new functions for a variety of existing and new, higher-speed networks to support a variety of existing and new applications.

# **Special Difficulties**

- Ambitious goal
- Vague problem statement
- Problem is evolving with the solution
- Pressure from
  - Changing infrastructure
  - Changing applications

# Desiderata

- High speed
- Flexible and extensible to accommodate
  - Arbitrary protocols
  - Arbitrary applications
  - Arbitrary physical layer
- Low cost

# Desiderata

- High speed
- Flexible and extensible to accommodate
  - Arbitrary protocols
  - Arbitrary applications
  - Arbitrary physical layer
- Low cost

#### Statement Of Hope (1995 version)

#### If there is hope, it lies in ASIC designers.

#### Statement Of Hope (1999 version)





#### Statement Of Hope (2004 version)

programmers!



# Programmability

- Key to low-cost hardware for next generation network systems
- More flexibility than ASIC designs
- Easier / faster to update than ASIC designs
- Less expensive to develop than ASIC designs
- What we need: a programmable device with more capability than a conventional CPU

# The Idea In A Nutshell

- Devise new hardware building blocks
- Make them programmable
- Include support for protocol processing and I/O
  - General-purpose processor(s) for control tasks
  - Special-purpose processor(s) for packet processing and table lookup
- Include functional units for tasks such as checksum computation
- Integrate as much as possible onto one chip
- Call the result a *network processor*

# **The Rest Of The Course**

- We will
  - Examine the general problem being solved
  - Survey some approaches vendors have taken
  - Explore possible architectures
  - Study example technologies
  - Consider how to implement systems using network processors

#### **Disclaimer #1**

In the field of network processors, I am a tyro.

#### Definition

Tyro  $\langle Ty'ro \rangle$ , n.; pl. *Tyros*. A beginner in learning; one who is in the rudiments of any branch of study; a person imperfectly acquainted with a subject; a novice.

#### **By Definition**

In the field of network processors, you are all tyros.

#### In Our Defense

When it comes to network processors, everyone is a tyro.

# **Questions?**

#### Π

## Basic Terminology And Example Systems (A Quick Review)

## **Packets Cells And Frames**

- Packet
  - Generic term
  - Small unit of data being transferred
  - Travels independently
  - Upper and lower bounds on size

# Packets Cells And Frames (continued)

- Cell
  - Fixed-size packet (e.g., ATM)
- Frame or layer-2 packet
  - Packet understood by hardware
- IP datagram
  - Internet packet

# **Types Of Networks**

- Paradigm
  - Connectionless
  - Connection-oriented
- Access type
  - *Shared* (i.e., multiaccess)
  - Point-To-Point

# **Connection-Oriented Networks**

- Telephone paradigm (connection, use, disconnect)
- Examples
  - Frame Relay
  - Asynchronous Transfer Mode (ATM)

## **Point-To-Point Network**

- Connects exactly two systems
- Often used for long distance
- Example: data circuit connecting two routers

# **Data Circuit**

- Leased from phone company
- Also called *serial line* because data is transmitted bit-serially
- Originally designed to carry digital voice
- Cost depends on speed and distance
- T-series standards define low speeds (e.g. T1)
- STS and OC standards define high speeds

# **Digital Circuit Speeds**

| Standard Name | Bit Rate        | <b>Voice Circuits</b> |
|---------------|-----------------|-----------------------|
| _             | 0.064 Mbps      | 1                     |
| T1            | 1.544 Mbps      | 24                    |
| Т3            | 44.736 Mbps     | 672                   |
| <b>OC-1</b>   | 51.840 Mbps     | 810                   |
| <b>OC-3</b>   | 155.520 Mbps    | 2430                  |
| OC-12         | 622.080 Mbps    | 9720                  |
| <b>OC-24</b>  | 1,244.160 Mbps  | 19440                 |
| <b>OC-48</b>  | 2,488.320 Mbps  | 38880                 |
| OC-192        | 9,953.280 Mbps  | 155520                |
| OC-768        | 39,813.120 Mbps | 622080                |

# **Digital Circuit Speeds**

| Standard Name | Voice Circuits  |        |
|---------------|-----------------|--------|
| _             | 0.064 Mbps      | 1      |
| T1            | 1.544 Mbps      | 24     |
| Т3            | 44.736 Mbps     | 672    |
| <b>OC-1</b>   | 51.840 Mbps     | 810    |
| OC-3          | 155.520 Mbps    | 2430   |
| OC-12         | 622.080 Mbps    | 9720   |
| OC-24         | 1,244.160 Mbps  | 19440  |
| <b>OC-48</b>  | 2,488.320 Mbps  | 38880  |
| OC-192        | 9,953.280 Mbps  | 155520 |
| <b>OC-768</b> | 39,813.120 Mbps | 622080 |

• Holy grail of networking: devices capable of accepting and forwarding data at 10 Gbps (OC-192).

## Local Area Networks

- Ethernet technology dominates
- Layer 1 standards
  - Media and wiring
  - Signaling
  - Handled by dedicated interface chips
  - Unimportant to us
- Layer 2 standards
  - MAC framing and addressing

# **MAC Addressing**

- Three address types
  - Unicast (single computer)
  - Broadcast (all computers in broadcast domain)
  - Multicast (some computers in broadcast domain)

# **More Terminology**

- Internet
  - Interconnection of multiple networks
  - Allows heterogeneity of underlying networks
- Network scope
  - Local Area Network (LAN) covers limited distance
  - Wide Area Network (WAN) covers arbitrary distance

## **Network System**

- Individual hardware component
- Serves as fundamental building block
- Used in networks and internets
- May contain processor and software
- Operates at one or more layers of the protocol stack

## **Example Network Systems**

- Layer 2
  - Bridge
  - Ethernet switch
  - VLAN switch

# **VLAN Switch**

- Similar to conventional layer 2 switch
  - Connects multiple computers
  - Forwards frames among them
  - Each computer has unique *unicast address*
- Differs from conventional layer 2 switch
  - Allows manager to configure *broadcast domains*
- Broadcast domain known as *virtual* network

## **Broadcast Domain**

- Determines propagation of broadcast/multicast
- Originally corresponded to fixed hardware
  - One per cable segment
  - One per hub or switch
- Now configurable via VLAN switch
  - Manager assigns ports to VLANs

## Example Network Systems (continued)

- Layer 3
  - Internet host computer
  - IP router (layer 3 switch)
- Layer 4
  - Basic Network Address Translator (NAT)
  - Round-robin Web load balancer
  - TCP terminator

# Example Network Systems (continued)

- Layer 5
  - Firewall
  - Intrusion Detection System (IDS)
  - Virtual Private Network (VPN)
  - Softswitch running SIP
  - Application gateway
  - TCP splicer (also known as NAPT Network Address and Protocol Translator)
  - Smart Web load balancer
  - Set-top box

NSD-Intel -- Chapt. 2

## Example Network Systems (continued)

- Network control systems
  - Packet / flow analyzer
  - Traffic monitor
  - Traffic policer
  - Traffic shaper

# **Questions?**

#### Π

#### **Review Of Protocols And Packet Formats**

# **Protocol Layering**



- Five-layer Internet reference model
- Multiple protocols can occur at each layer

## **Layer 2 Protocols**

- Two protocols are important
  - Ethernet (widely used)
  - ATM (defines per-flow QoS)
- We will concentrate on Ethernet

# **Ethernet Addressing**

- 48-bit addressing
- Unique address assigned to each station (NIC)
- Destination address in each packet can specify delivery to
  - A single computer (unicast)
  - All computers in broadcast domain (broadcast)
  - Some computers in broadcast domain (multicast)

## Ethernet Addressing (continued)

- Broadcast address is all 1s
- Single bit determines whether remaining addresses are unicast or multicast



• Multicast bit travels first on the wire

## **Ethernet Frame Processing**



- Dedicated physical layer hardware
  - Checks and removes preamble and CRC on input
  - Computes and appends CRC and preamble on output
- Layer 2 systems use source, destination and (possibly) type fields

## Internet

- Set of (heterogeneous) computer networks interconnected by *IP routers*
- End-user computers, called *hosts*, each attach to specific network
- Protocol software
  - Runs on both hosts and routers
  - Provides illusion of homogeneity

## **Internet Protocols Of Interest**

- Layer 2
  - Address Resolution Protocol (ARP)
- Layer 3
  - Internet Protocol (IP)
- Layer 4
  - User Datagram Protocol (UDP)
  - Transmission Control Protocol (TCP)

## **IP Datagram Format**

| 0                                   | 4                    | 8       | 16           | 19        | 24     | 31 |
|-------------------------------------|----------------------|---------|--------------|-----------|--------|----|
| VERS                                | HLEN                 | SERVICE | TOTAL LENGTH |           |        |    |
| ID                                  |                      | FLAGS   |              | F. OFFSET |        |    |
| т                                   | TL                   | ТҮРЕ    | HDR CHECKSUM |           | ECKSUM |    |
|                                     | SOURCE               |         |              |           |        |    |
| DESTINATION                         |                      |         |              |           |        |    |
| IP OPTIONS (MAY BE OMITTED) PADDING |                      |         |              |           |        |    |
|                                     | BEGINNING OF PAYLOAD |         |              |           |        |    |
|                                     |                      |         |              |           |        |    |

- Format of each packet sent across Internet
- Fixed-size fields make parsing efficient

## **IP Datagram Fields**

| Field        | Meaning                                    |
|--------------|--------------------------------------------|
| VERS         | Version number of IP being used (4)        |
| HLEN         | Header length measured in 32-bit units     |
| SERVICE      | Level of service desired                   |
| TOTAL LENGTH | Datagram length in octets including header |
| ID           | Unique value for this datagram             |
| FLAGS        | Bits to control fragmentation              |
| F. OFFSET    | Position of fragment in original datagram  |
| TTL          | Time to live (hop countdown)               |
| TYPE         | Contents of payload area                   |
| HDR CHECKSUM | One's-complement checksum over header      |
| SOURCE       | IP address of original sender              |
| DESTINATION  | IP address of ultimate destination         |
| IP OPTIONS   | Special handling parameters                |
| PADDING      | To make options a 32-bit multiple          |

# **IP addressing**

- 32-bit Internet address assigned to each computer
- Virtual, hardware independent value
- Prefix identifies network; suffix identifies host
- Network systems use an address mask to specify the boundary between prefix and suffix

# **Next-Hop Forwarding**

- Routing table
  - Found in both hosts and routers
  - Stores (destination, mask, next\_hop) tuples
- Route lookup
  - Takes destination address as argument
  - Finds next hop
  - Uses longest-prefix match

# **Next-Hop Forwarding**

- Routing table
  - Found in both hosts and routers
  - Stores (destination, mask, next\_hop) tuples
- Route lookup
  - Takes destination address as argument
  - Finds next hop
  - Uses longest-prefix match

## **UDP Datagram Format**

| 16                   |                  |  |  |
|----------------------|------------------|--|--|
| SOURCE PORT          | DESTINATION PORT |  |  |
| MESSAGE LENGTH       | CHECKSUM         |  |  |
| BEGINNING OF PAYLOAD |                  |  |  |
| · · ·                |                  |  |  |
|                      |                  |  |  |

| Field                   | Meaning                                        |
|-------------------------|------------------------------------------------|
| SOURCE PORT             | ID of sending application                      |
| <b>DESTINATION PORT</b> | ID of receiving application                    |
| MESSAGE LENGTH          | Length of datagram including the header        |
| CHECKSUM                | One's-complement checksum over entire datagram |

## **TCP Segment Format**

| 0                        | 4                    | 10        | 16     | 24 | 31 |
|--------------------------|----------------------|-----------|--------|----|----|
| SOURCE PORT              |                      | DESTINAT  |        |    |    |
|                          | SEQUENCE             |           |        |    |    |
|                          | ACKNOWLEDGEMENT      |           |        |    |    |
| HLEN                     | NOT USED             | CODE BITS | WINDOW |    |    |
| CHECKSUM                 |                      | URGEI     | NT PTR |    |    |
| OPTIONS (MAY BE OMITTED) |                      | PADDING   |        |    |    |
|                          | BEGINNING OF PAYLOAD |           |        |    |    |
|                          |                      |           |        |    |    |

- Sent end-to-end
- Fixed-size fields make parsing efficient

# **TCP Segment Fields**

| Field                   | Meaning                                       |
|-------------------------|-----------------------------------------------|
| SOURCE PORT             | ID of sending application                     |
| <b>DESTINATION PORT</b> | ID of receiving application                   |
| SEQUENCE                | Sequence number for data in payload           |
| ACKNOWLEDGEMENT         | Acknowledgement of data received              |
| HLEN                    | Header length measured in 32-bit units        |
| NOT USED                | Currently unassigned                          |
| CODE BITS               | URGENT, ACK, PUSH, RESET, SYN, FIN            |
| WINDOW                  | Receiver's buffer size for additional data    |
| CHECKSUM                | One's-complement checksum over entire segment |
| URGENT PTR              | Pointer to urgent data in segment             |
| OPTIONS                 | Special handling                              |
| PADDING                 | To make options a 32-bit multiple             |

# **Illustration Of Encapsulation**



• Field in each header specifies type of encapsulated packet

#### **Example ARP Packet Format**

| 0                                 |                                    | 8               | 16                     | 24                                | 31 |
|-----------------------------------|------------------------------------|-----------------|------------------------|-----------------------------------|----|
|                                   | ETHERNET ADDRESS TYPE (1)          |                 | IP ADDRESS TYPE (0800) |                                   |    |
|                                   | ETH ADDR LEN (6)                   | IP ADDR LEN (4) |                        | OPERATION                         |    |
|                                   | SENDER'S ETH ADDR (first 4 octets) |                 |                        |                                   |    |
|                                   | SENDER'S ETH ADDR (last 2 octets)  |                 |                        | SENDER'S IP ADDR (first 2 octets) |    |
| SENDER'S IP ADDR (last 2 octets)  |                                    |                 | ТА                     | RGET'S ETH ADDR (first 2 octets)  |    |
| TARGET'S ETH ADDR (last 4 octets) |                                    |                 |                        |                                   |    |
| TARGET'S IP ADDR (all 4 octets)   |                                    |                 |                        |                                   |    |

- Format when ARP used with Ethernet and IP
- Each Ethernet address is six octets
- Each IP address is four octets

#### **End Of Review**

# **Questions?**

#### IV

#### **Conventional Computer Hardware Architecture**

## **Software-Based Network System**

- Uses conventional hardware (e.g., PC)
- Software
  - Runs the entire system
  - Allocates memory
  - Controls I/O devices
  - Performs all protocol processing

#### Why Study Protocol Processing On Conventional Hardware?

- Past
  - Employed in early IP routers
  - Many algorithms developed / optimized for conventional hardware
- Present
  - Used in low-speed network systems
  - Easiest to create / modify
  - Costs less than special-purpose hardware

## Why Study Protocol Processing On Conventional Hardware? (continued)

- Future
  - Processors continue to increase in speed
  - Some conventional hardware present in all systems

## Why Study Protocol Processing On Conventional Hardware? (continued)

- Future
  - Processors continue to increase in speed
  - Some conventional hardware present in all systems
  - You will build software-based systems in lab!

#### **Serious Question**

- Which is growing faster?
  - Processing power
  - Network bandwidth
- Note: if network bandwidth growing faster
  - Need special-purpose hardware
  - Conventional hardware will become irrelevant

#### **Growth Of Technologies**



## **Conventional Computer Hardware**

- Four important aspects
  - Processor
  - Memory
  - I/O interfaces
  - One or more buses

#### Illustration Of Conventional Computer Architecture



network interfaces and other I/O devices

- Bus is central, shared interconnect
- All components *contend* for use

#### **Bus Organization And Operations**



- Parallel wires (C+A+D total)
- Used to pass
  - Control information (*C* bits)
  - An address (A bits)
  - A data value (*D* bits)

## **Bus Width**

- Number of parallel data bits known as *width* of bus
- Wider bus
  - Transfers more data per unit time
  - Costs more
  - Requires more physical space
- Compromise: to simulate wider bus, use hardware that multiplexes transfers

# **Bus Paradigm**

- Only two basic operations
  - Fetch
  - Store
- All operations cast as forms of the above

#### **Fetch/Store**

- Fundamental paradigm
- Used throughout hardware, including network processors

## **Fetch Operation**

- Place address of a device on address lines
- Issue *fetch* on control lines
- Use control lines to wait for device that owns the address to respond
- If operation successful, extract value (response) from data lines
- If not successful, report error

## **Store Operation**

- Place address of a device on address lines
- Place value on data lines
- Issue *store* on control lines
- Use control lines to wait for device that owns the address to respond
- If operation does not succeed, report error

#### **Example Of Operations Mapped Into Fetch/Store Paradigm**

- Imagine disk device attached to a bus
- Assume disk hardware supports three (nontransfer) operations:
  - Start disk spinning
  - Stop disk
  - Determine current status

## Example Of Operations Mapped Into Fetch/Store Paradigm (continued)

- Assign the disk two contiguous bus addresses D and D+1
- Arrange for store of nonzero to address D to start disk spinning
- Arrange for store of zero to address D to stop disk
- Arrange for fetch from address D+1 to return current status
- Note: effect of store to address D+1 can be defined as
  - Appears to work, but has no effect
  - Returns an error

#### **Bus Address Space**

- Arbitrary hardware can be attached to bus
- K address lines result in 2<sup>k</sup> possible bus addresses
- Address can refer to
  - Memory (e.g., RAM or ROM)
  - I/O device
- Arbitrary devices can be placed at arbitrary addresses
- Address space can contain "holes"

#### **Bus Address Terminology**

- Device on bus known as *memory mapped* I/O
- Locations that correspond to nontransfer operations known as *Control and Status Registers (CSRs)*

#### **Example Bus Address Space**



#### Network I/O On Conventional Hardware

- Network Interface Card (NIC)
  - Attaches between bus and network
  - Operates like other I/O devices
  - Handles electrical/optical details of network
  - Handles electrical details of bus
  - Communicates over bus with CPU or other devices

# Making Network I/O Fast

- Key idea: migrate more functionality onto NIC
- Four techniques used with bus
  - Onboard address recognition & filtering
  - Onboard packet buffering
  - Direct Memory Access (DMA)
  - Operation and buffer chaining

# **Onboard Address Recognition And Filtering**

- NIC given set of addresses to accept
  - Station's unicast address
  - Network broadcast address
  - Zero or more multicast addresses
- When packet arrives, NIC checks destination address
  - Accept packet if address on list
  - Discard others

## **Onboard Packet Buffering**

- NIC given high-speed local memory
- Incoming packet placed in NIC's memory
- Allows computer's memory/bus to operate slower than network
- Handles small packet bursts

#### **Direct Memory Access (DMA)**

- CPU
  - Allocates packet buffer in memory
  - Passes buffer address to NIC
  - Goes on with other computation
- NIC
  - Accepts incoming packet from network
  - Copies packet over bus to buffer in memory
  - Informs CPU that packet has arrived

# **Buffer Chaining**

- CPU
  - Allocates multiple buffers
  - Passes linked list to NIC
- NIC
  - Receives next packet
  - Divides into one or more buffers
- Advantage: a buffer can be smaller than a packet

# **Operation Chaining**

- CPU
  - Allocates multiple buffers
  - Builds linked list of operations
  - Passes list to NIC
- NIC
  - Follows list and performs instructions
  - Interrupts CPU after each operation
- Advantage: multiple operations proceed without CPU intervention

#### **Illustration Of Operation Chaining**



• Optimizes movement of data to memory

#### **Data Flow Diagram**



- Depicts flow of data through hardware units
- Size of arrow represents throughput
- Used throughout the course and text

#### Summary

- Software-based network systems run on conventional hardware
  - Processor
  - Memory
  - I/O devices
  - Bus
- Network interface cards can be optimized to reduce CPU load

# **Questions?**

#### V

#### **Basic Packet Processing: Algorithms And Data Structures**

# Copying

- Used when packet moved from one memory location to another
- Expensive
- Must be avoided whenever possible
  - Leave packet in buffer
  - Pass buffer address among threads/layers

#### **Possibilities For Buffer Allocation**

- Fixed-sze buffers
  - \* Large enough for largest packet
    - \* Small, with bultiple buffers linked together for large packets
- Variable-size buffers

### **Buffer Addressing**

- Buffer address must be resolvable in all contexts
- Easiest implementation: keep buffers in kernel space

### **Integer Representation**

- Two standards
  - Little endian (least-significant byte at lowest address)
  - Big endian (most-significant byte at lowest address)

### Illustration Of Big And Little Endian Integers



#### **Integer Conversion**

- Needed when heterogeneous computers communicate
- Protocols define *network byte order*
- Computers convert to network byte order
- Typical library functions

| Function | data size | Translation                             |
|----------|-----------|-----------------------------------------|
| ntohs    | 16 bits   | Network byte order to host's byte order |
| htons    | 16 bits   | Host's byte order to network byte order |
| ntohl    | 32 bits   | Network byte order to host's byte order |
| htonl    | 32 bits   | Host's byte order to network byte order |

### Examples Of Algorithms Implemented With Software-Based Systems

- Layer 2
  - Ethernet bridge
- Layer 3
  - IP forwarding
  - IP fragmentation and reassembly
- Layer 4
  - TCP connection recognition and splicing
- Other
  - Hash table lookup

• Provide insight to packet processing tasks

- Provide insight to packet processing tasks
- Reinforce concepts

- Provide insight to packet processing tasks
- Reinforce concepts
- Help students recall protocol details

- Provide insight to packet processing tasks
- Reinforce concepts
- Help students recall protocol details
- You will need them in lab!

#### **Ethernet Bridge**



- Used between a pair of Ethernets
- Provides transparent, layer 2 connection
- Listens in promiscuous mode
- Forwards frames in both directions
- Uses addresses to filter

# **Bridge Filtering**

- Uses source address in frames to identify computers on each network
- Uses destination address to decide whether to forward frame

# **Bridge Algorithm**

```
Assume: two network interfaces each operating in promiscuous
mode.
Create an empty list, L, that will contain pairs of values;
Do forever {
     Acquire the next frame to arrive;
     Set I to the interface over which the frame arrived;
     Extract the source address, S;
     Extract the destination address, D;
     Add the pair (S, I) to list L if not already present.
     If the pair (D, I) appears in list L {
          Drop the frame;
     } Else {
          Forward the frame over the other interface;
```

### **Implementation Of Table Lookup**

- Need high speed (more on this later)
- Software-based systems typically use *hashing* for table lookup

# Hashing

- Optimizes number of *probes*
- Works well if table not full
- Practical technique: *double hashing*

# **Hashing Algorithm**

```
Given: a key, a table in memory, and the table size N.
Produce: a slot in the table that corresponds to the key
    or an empty table slot if the key is not in the table.
Method: double hashing with open addressing.
Choose P_1 and P_2 to be prime numbers;
Fold the key to produce an integer, K;
Compute table pointer Q equal to (P_1 \times K) modulo N;
Compute increment R equal to (P_2 \times K) modulo N;
While (table slot Q not equal to K and nonempty) {
    Q \leftarrow (Q + R) \text{ modulo N};
}
At this point, Q either points to an empty table slot or to the
    slot containing the key.
```

#### **Address Lookup**

- Computer can compare integer in one operation
- Network address can be longer than integer (e.g., 48 bits)
- Two possibilities
  - Use multiple comparisons per probe
  - Fold address into integer key

## Folding

- Maps N-bit value into M-bit key, M < N
- Typical technique: exclusive or
- Potential problem: two values map to same key
- Solution: compare full value when key matches

### **IP Forwarding**

- Used in hosts as well as routers
- Conceptual mapping

(next hop, interface)  $\leftarrow f(\text{datagram, routing table})$ 

• Table driven

# **IP Routing Table**

- One entry per destination
- Entry contains
  - 32-bit IP address of destination
  - 32-bit address mask
  - 32-bit next-hop address
  - N-bit interface number

### **Example IP Routing Table**

| Destination | Address              | Next-Hop       | Interface |
|-------------|----------------------|----------------|-----------|
| Address     | Mask                 | Address        | Number    |
| 192.5.48.0  | <b>255.255.255.0</b> | 128.210.30.5   | 2         |
| 128.10.0.0  | 255.255.0.0          | 128.210.141.12 | 1         |
| 0.0.0.0     | 0.0.0.0              | 128.210.30.5   | 2         |

- Values stored in binary
- Interface number is for internal use only
- Zero mask produces *default* route

# **IP Forwarding Algorithm**

```
Given: destination address A and routing table R.
Find: a next hop and interface used to route datagrams to A.
For each entry in table R {
Set MASK to the Address Mask in the entry;
Set DEST to the Destination Address in the entry;
If (A & MASK) == DEST {
Stop; use the next hop and interface in the entry;
}
If this point is reached, declare error: no route exists;
```

• Note: algorithm assumes table is sorted in longest-prefix order

### **IP Fragmentation**

- Needed when datagram larger than network MTU
- Divides IP datagram into *fragments*
- Uses FLAGS bits in datagram header



### IP Fragmentation Algorithm (Part 1: Initialization)

```
Given: an IP datagram, D, and a network MTU.
Produce: a set of fragments for D.
If the DO NOT FRAGMENT bit is set {
    Stop and report an error;
}
Compute the size of the datagram header, H;
Choose N to be the largest multiple of 8 such
    that H+N≤MTU;
Initialize an offset counter, O, to zero;
```

### IP Fragmentation Algorithm (Part 2: Processing)

Repeat until datagram empty {

Create a new fragment that has a copy of D's header; Extract up to the next N octets of data from D and place the data in the fragment; Set the *MORE FRAGMENTS* bit in fragment header; Set *TOTAL LENGTH* field in fragment header to be H+N; Set *FRAGMENT OFFSET* field in fragment header to O; Compute and set the *CHECKSUM* field in fragment header; Increment O by N/8;

### Reassembly

- Complement of fragmentation
- Uses IP SOURCE ADDRESS and IDENTIFICATION fields in datagram header to group related fragments
- Joins fragments to form original datagram

# **Reassembly Algorithm**

Given: a fragment, F, add to a partial reassembly.
Method: maintain a set of fragments for each datagram.
Extract the IP source address, S, and ID fields from F;
Combine S and ID to produce a lookup key, K;
Find the fragment set with key K or create a new set;
Insert F into the set;
If the set contains all the data for the datagram {
Form a completely reassembled datagram and process it;

#### **Data Structure For Reassembly**

- Two parts
  - Buffer large enough to hold original datagram
  - Linked list of pieces that have arrived



### **TCP Connection**

- Involves a pair of endpoints
- Started with SYN segment
- Terminated with FIN or RESET segment
- Identified by 4-tuple

(src addr, dest addr, src port, dest port)

### TCP Connection Recognition Algorithm (Part 1)

Given: a copy of traffic passing across a network.
Produce: a record of TCP connections present in the traffic.
Initialize a connection table, C, to empty;
For each IP datagram that carries a TCP segment {
Extract the IP source, S, and destination, D, addresses;
Extract the source, P<sub>1</sub>, and destination, P<sub>2</sub>, port numbers;
Use (S,D,P<sub>1</sub>,P<sub>2</sub>) as a lookup key for table C and create a new entry, if needed;

### TCP Connection Recognition Algorithm (Part 2)

If the segment has the RESET bit set, delete the entry; Else if the segment has the FIN bit set, mark the connection closed in one direction, removing the entry from C if the connection was previously closed in the other; Else if the segment has the SYN bit set, mark the connection as being established in one direction, making it completely established if it was previously marked as being established in the other;

}

# **TCP Splicing**

- Join two TCP connections
- Allow data to pass between them
- To avoid termination overhead translate segment header fields
  - Acknowledgement number
  - Sequence number

#### **Illustration Of TCP Splicing**



| Connection<br>& Direction | Sequence<br>Number | Connection<br>& Direction | Sequence<br>Number |
|---------------------------|--------------------|---------------------------|--------------------|
| Incoming #1               | 200                | Incoming #2               | 1200               |
| Outgoing #2               | 860                | Outgoing #1               | <b>50</b>          |
| Change                    | 660                | Change                    | -1150              |

### TCP Splicing Algorithm (Part 1)

Given: two TCP connections.

Produce: sequence translations for splicing the connection.
Compute D1, the difference between the starting sequences on incoming connection 1 and outgoing connection 2;
Compute D2, the difference between the starting sequences on incoming connection 2 and outgoing connection 1;

#### TCP Splicing Algorithm (Part 2)

For each segment {
 If segment arrived on connection 1 {
 Add D1 to sequence number;
 Subtract D2 from acknowledgement number;
 } else if segment arrived on connection 2 {
 Add D2 to sequence number;
 Subtract D1 from acknowledgement number;
 }
}

#### Summary

- Packet processing algorithms include
  - Ethernet bridging
  - IP fragmentation and reassembly
  - IP forwarding
  - TCP splicing
- Table lookup important
  - Full match for layer 2
  - Longest prefix match for layer 3

# **Questions?**

# **For Hands-On Experience With**

# **A Software-Based System:**

# Enroll in CS 636 !

#### VI

#### **Packet Processing Functions**

#### Goal

- Identify functions that occur in packet processing
- Devise set of operations sufficient for all packet processing
- Find an efficient implementation for the operations

#### **Packet Processing Functions We Will Consider**

- Address lookup and packet forwarding
- Error detection and correction
- Fragmentation, segmentation, and reassembly
- Frame and protocol demultiplexing
- Packet classification
- Queueing and packet discard
- Scheduling and timing
- Security: authentication and privacy
- Traffic measurement, policing, and shaping

#### **Address Lookup And Packet Forwarding**

- Forwarding requires address lookup
- Lookup is table driven
- Two types
  - Exact match (typically layer 2)
  - Longest-prefix match (typically layer 3)
- Cost depends on size of table and type of lookup

## **Error Detection And Correction**

- Data sent with packet used as verification
  - Checksum
  - CRC
- Cost proportional to size of packet
- Often implemented with special-purpose hardware

#### An Important Note About Cost

The cost of an operation is proportional to the amount of data processed. An operation such as checksum computation that requires examination of all the data in a packet is among the most expensive.

#### **Fragmentation, Segmentation, And Reassembly**

- IP fragments and reassembles datagrams
- ATM segments and reassembles AAL5 packets
- Same idea; details differ
- Cost is high because
  - State must be kept and managed
  - Unreassembled fragments occupy memory

#### **Frame And Protocol Demultiplexing**

- Traditional technique used in layered protocols
- Type appears in each header
  - Assigned on output
  - Used on input to select "next" protocol
- Cost of demultiplexing proportional to number of layers

#### **Packet Classification**

- Alternative to demultiplexing
- Crosses multiple layers
- Achieves lower cost
- More on classification later in the course

# **Queueing And Packet Discard**

- General paradigm is *store-and-forward* 
  - Incoming packet placed in queue
  - Outgoing packet placed in queue
- When queue is full, choose packet to discard
- Affects throughput of higher-layer protocols

## **Queueing Priorities**

- Multiple queues used to enforce priority among packets
- Incoming packet
  - Assigned priority as function of contents
  - Placed in appropriate priority queue
- Queueing discipline
  - Examines priority queues
  - Chooses which packet to send

### **Examples Of Queueing Disciplines**

- Priority Queueing
  - Assign unique priority number to each queue
  - Choose packet from highest priority queue that is nonempty
  - Known as *strict priority* queueing
  - Can lead to starvation

#### Examples Of Queueing Disciplines (continued)

- Weighted Round Robin (WRR)
  - Assign unique priority number to each queue
  - Process all queues round-robin
  - Compute N, max number of packets to select from a queue proportional to priority
  - Take up to N packets before moving to next queue
  - Works well if all packets equal size

13

#### Examples Of Queueing Disciplines (continued)

- Weighted Fair Queueing (WFQ)
  - Make selection from queue proportional to priority
  - Use packet size rather than number of packets
  - Allocates priority to amount of data from a queue rather than number of packets

# **Scheduling And Timing**

- Important mechanisms
- Used to coordinate parallel and concurrent tasks
  - Processing on multiple packets
  - Processing on multiple protocols
  - Multiple processors
- Scheduler attempts to achieve fairness

#### **Security: Authentication And Privacy**

- Authentication mechanisms
  - Ensure sender's identity
- Confidentiality mechanisms
  - Ensure that intermediaries cannot interpret packet contents
- Note: in common networking terminology, *privacy* refers to confidentiality
  - Example: Virtual Private Networks

#### **Traffic Measurement And Policing**

- Used by network managers
- Can measure aggregate traffic or per-flow traffic
- Often related to Service Level Agreement (SLA)
- Cost is high if performed in real-time

# **Traffic Shaping**

- Make traffic conform to statistical bounds
- Typical use
  - Smooth bursts
  - Avoid packet trains
- Only possibilities
  - Discard packets (seldom used)
  - Delay packets

#### **Example Traffic Shaping Mechanisms**

- Leaky bucket
  - Easy to implement
  - Popular
  - Sends steady number of packets per second
  - Rate depends on number of packets waiting
  - Does not guarantee steady data rate

#### Example Traffic Shaping Mechanisms (continued)

- Token bucket
  - Sends steady number of bits per second
  - Rate depends on number of bits waiting
  - Achieves steady data rate
  - More difficult to implement

#### **Illustration Of Traffic Shaper**



- Packets
  - Arrive in bursts
  - Leave at steady rate

#### **Timer Management**

- Fundamental piece of network system
- Needed for
  - Scheduling
  - Traffic shaping
  - Other protocol processing (e.g., retransmission)
- Cost
  - Depends on number of timer operations (e.g., set, cancel)
  - Can be high

#### Summary

- Primary packet processing functions are
  - Address lookup and forwarding
  - Error detection and correction
  - Fragmentation and reassembly
  - Demultiplexing and classification
  - Queueing and discard
  - Scheduling and timing
  - Security functions
  - Traffic measurement, policing, and shaping

# **Questions?**

#### VII

#### **Protocol Software On A Conventional Processor**

#### Possible Implementations Of Protocol Software

- In an application program
  - Easy to program
  - Runs as user-level process
  - No direct access to network devices
  - High cost to copy data from kernel address space
  - Cannot run at *wire speed*

### Possible Implementations Of Protocol Software (continued)

- In an embedded system
  - Special-purpose hardware device
  - Dedicated to specific task
  - Ideal for stand-alone system
  - Software has full control

### Possible Implementations Of Protocol Software (continued)

- In an embedded system
  - Special-purpose hardware device
  - Dedicated to specific task
  - Ideal for stand-alone system
  - Software has full control
  - You will experience this in lab!

## Possible Implementations Of Protocol Software (continued)

- In an operating system kernel
  - More difficult to program than application
  - Runs with kernel privilege
  - Direct access to network devices

## **Interface To The Network**

- Known as Application Program Interface (API)
- Can be
  - *Asynchronous*
  - Synchronous
- Synchronous interface can use
  - Blocking
  - Polling

#### **Asynchronous API**

- Also known as *event-driven*
- Programmer
  - Writes set of functions
  - Specifies which function to invoke for each event type
- Programmer has no control over function invocation
- Functions keep state in shared memory
- Difficult to program
- Example: function *f*() called when packet arrives

## **Synchronous API Using Blocking**

- Programmer
  - Writes main flow-of-control
  - Explicitly invokes functions as needed
  - Built-in functions block until request satisfied
- Example: function *wait\_for\_packet()* blocks until packet arrives
- Easier to program

# **Synchronous API Using Polling**

- Nonblocking form of synchronous API
- Each function call returns immediately
  - Performs operation if available
  - Returns error code otherwise
- Example: function *try\_for\_packet()* either returns next packet or error code if no packet has arrived
- Closer to underlying hardware

# **Typical Implementations And APIs**

- Application program
  - Synchronous API using blocking (e.g., socket API)
  - Another application thread runs while an application blocks
- Embedded systems
  - Synchronous API using polling
  - CPU dedicated to one task
- Operating systems
  - Asynchronous API
  - Built on interrupt mechanism

## **Example Asynchronous API**

- Design goals
  - For use with network processor
  - Simplest possible interface
  - Sufficient for basic packet processing tasks
- Includes
  - I/O functions
  - Timer manipulation functions

#### Example Asynchronous API (continued)

- Initialization and termination functions
  - on\_startup()
  - on\_shutdown()
- Input function (called asynchronously)
  - recv\_frame()
- Output functions
  - new\_fbuf()
  - send\_frame()

## Example Asynchronous API (continued)

- Timer functions (called asynchronously)
  - delayed\_call()
  - periodic\_call()
  - cancel\_call()
- Invoked by outside application
  - console\_command()

# **Processing Priorities**

- Determine which code CPU runs at any time
- General idea
  - Hardware devices need highest priority
  - Protocol software has medium priority
  - Application programs have lowest priority
- Queues provide buffering across priorities

#### **Illustration Of Priorities**



## Implementation Of Priorities In An Operating System

- Two possible approaches
  - Interrupt mechanism
  - Kernel threads

## **Interrupt Mechanism**

- Built into hardware
- Operates asynchronously
- Saves current processing state
- Changes processor status
- Branches to specified location

## **Two Types Of Interrupts**

- *Hardware interrupt* 
  - Caused by device (bus)
  - Must be serviced quickly
- Software interrupt
  - Caused by executing program
  - Lower priority than hardware interrupt
  - Higher priority than other OS code

#### Software Interrupts And Protocol Code

- Protocol stack operates as software interrupt
- When packet arrives
  - Hardware interrupts
  - Device driver raises software interrupt
- When device driver finishes
  - Hardware interrupt clears
  - Protocol code is invoked

## **Kernel Threads**

- Alternative to interrupts
- Familiar to programmer
- Finer-grain control than software interrupts
- Can be assigned arbitrary range of priorities

## **Conceptual Organization**

- Packet passes among multiple threads of control
- Queue of packets between each pair of threads
- Threads synchronize to access queues

## **Possible Organization Of Kernel Threads For Layered Protocols**

- One thread per layer
- One thread per protocol
- Multiple threads per protocol
- Multiple threads per protocol plus timer management thread(s)
- One thread per packet

## **One Thread Per Layer**

- Easy for programmer to understand
- Implementation matches concept
- Allows priority to be assigned to each layer
- Means packet is enqueued once per layer

#### **Illustration Of One Thread Per Layer**



#### **One Thread Per Protocol**

- Like one thread per layer
  - Implementation matches concept
  - Means packet is enqueued once per layer
- Advantages over one thread per layer
  - Easier for programmer to understand
  - Finer-grain control
  - Allows priority to be assigned to each protocol

#### **Illustration Of One Thread Per Protocol**



- TCP and UDP reside at same layer
- Separation allows priority

## **Multiple Threads Per Protocol**

- Further division of duties
- Simplifies programming
- More control than single thread
- Typical division
  - Thread for incoming packets
  - Thread for outgoing packets
  - Thread for management/timing

#### **Illustration Of Multiple Threads Used With TCP**



• Separate timer makes programming easier

## **Timers And Protocols**

- Many protocols implement timeouts
  - TCP
    - \* Retransmission timeout
    - \* 2MSL timeout
  - ARP
    - \* Cache entry timeout
  - IP
    - \* Reassembly timeout

#### Multiple Threads Per Protocol Plus Timer Management Thread(s)

- Observations
  - Many protocols each need timer functionality
  - Each timer thread incurs overhead
- Solution: consolidate timers for multiple protocols

29

#### **Is One Timer Thread Sufficient?**

- In theory
  - Yes
- In practice
  - Large range of timeouts (microseconds to tens of seconds)
  - May want to give priority to some timeouts
- Solution: two or more timer threads

## **Multiple Timer Threads**

- Two threads usually suffice
- Large-granularity timer
  - Values specified in seconds
  - Operates at lower priority
- Small-granularity timer
  - Values specified in microseconds
  - Operates at higher priority

## **Thread Synchronization**

- Thread for layer *i* 
  - Needs to pass a packet to layer i + 1
  - Enqueues the packet
- Thread for layer i + 1
  - Retrieves packet from the queue

## **Thread Synchronization**

- Thread for layer *i* 
  - Needs to pass a packet to layer i + 1
  - Enqueues the packet
- Thread for layer i + 1
  - Retrieves packet from the queue
- Context switch required!

## **Context Switch**

- OS function
- CPU passes from current thread to a waiting thread
- High cost
- Must be minimized

#### **One Thread Per Packet**

- Preallocate set of threads
- Thread operation
  - Waits for packet to arrive
  - Moves through protocol stack
  - Returns to wait for next packet
- Minimizes context switches

## Summary

- Packet processing software usually runs in OS
- API can be synchronous or asynchronous
- Priorities achieved with
  - Software interrupts
  - Threads
- Variety of thread architectures possible

# **Questions?**

#### VIII

#### Hardware Architectures For Protocol Processing And Aggregate Rates

## A Brief History Of Computer Hardware

- 1940s
  - Beginnings
- 1950s
  - Consolidation of von Neumann architecture
  - I/O controlled by CPU
- 1960s
  - I/O becomes important
  - Evolution of third generation architecture with interrupts

2

# **I/O Processing**

- Evolved from after-thought to central influence
- Low-end systems (e.g., microcontrollers)
  - Dumb I/O interfaces
  - CPU does all the work (polls devices)
  - Single, shared memory
  - Low cost, but low speed

## I/O Processing (continued)

- Mid-range systems (e.g., minicomputers)
  - Single, shared memory
  - I/O interfaces contain logic for transfer and status operations
  - CPU
    - \* Starts device then resumes processing
  - Device
    - \* Transfers data to / from memory
    - \* Interrupts when operation complete

## I/O Processing (continued)

- High-end systems (e.g., mainframes)
  - Separate, programmable I/O processor
  - OS downloads code to be run
  - Device has private on-board buffer memory
  - Examples: IBM channel, CDC peripheral processor

## **Networking Systems Evolution**

- Twenty year history
- Same trend as computer architecture
  - Began with central CPU
  - Shift to emphasis on I/O
- Three main generations

## **First Generation Network Systems**

- Traditional software-based router
- Used conventional (minicomputer) hardware
  - Single general-purpose processor
  - Single shared memory
  - I/O over a bus
  - Network interface cards use same design as other I/O devices

#### **Protocol Processing In First Generation Network Systems**



- General-purpose processor handles most tasks
- Sufficient for low-speed systems
- Note: we will examine other generations later in the course

#### How Fast Does A CPU Need To Be?

- Depends on
  - Rate at which data arrives
  - Amount of processing to be performed

#### **Two Measures Of Speed**

- Data rate (bits per second)
  - Per interface rate
  - Aggregate rate
- Packet rate (packets per second)
  - Per interface rate
  - Aggregate rate

### **How Fast Is A Fast Connection?**

- Definition of fast data rate keeps changing
  - 1960: 10 Kbps
  - 1970: 1 Mbps
  - 1980: 10 Mbps
  - 1990: 100 Mbps
  - 2000: 1000 Mbps (1 Gbps)
  - 2004: 2400 Mbps

#### **How Fast Is A Fast Connection?**

- Definition of fast data rate keeps changing
  - 1960: 10 Kbps
  - 1970: 1 Mbps
  - 1980: 10 Mbps
  - 1990: 100 Mbps
  - 2000: 1000 Mbps (1 Gbps)
  - 2004: 2400 Mbps
  - Soon: 10 Gbps???

#### Aggregate Rate Vs. Per-Interface Rate

- Interface rate
  - Rate at which data enters / leaves
- Aggregate
  - Sum of interface rates
  - Measure of total data rate system can handle
- Note: aggregate rate crucial if CPU handles traffic from all interfaces

#### A Note About System Scale

The aggregate data rate is defined to be the sum of the rates at which traffic enters or leaves a system. The maximum aggregate data rate of a system is important because it limits the type and number of network connections the system can handle.

#### Packet Rate Vs. Data Rate

- Sources of CPU overhead
  - Per-bit processing
  - Per-packet processing
- Interface hardware handles much of per-bit processing

#### A Note About System Scale

For protocol processing tasks that have a fixed cost per packet, the number of packets processed is more important than the aggregate data rate.

#### **Example Packet Rates**

| Technology    | Network<br>Data Rate<br>In Gbps | Packet Rate<br>For Small Packets<br>In Kpps | Packet Rate<br>For Large Packets<br>In Kpps |
|---------------|---------------------------------|---------------------------------------------|---------------------------------------------|
| 10Base-T      | 0.010                           | 19.5                                        | 0.8                                         |
| 100Base-T     | 0.100                           | 195.3                                       | 8.2                                         |
| OC-3          | 0.156                           | 303.8                                       | 12.8                                        |
| OC-12         | 0.622                           | 1,214.8                                     | 51.2                                        |
| 1000Base-T    | 1.000                           | 1,953.1                                     | 82.3                                        |
| <b>OC-48</b>  | 2.488                           | 4,860.0                                     | 204.9                                       |
| OC-192        | 9.953                           | 19,440.0                                    | 819.6                                       |
| <b>OC-768</b> | <b>39.813</b>                   | 77,760.0                                    | 3,278.4                                     |

• Key concept: maximum packet rate occurs with minimumsize packets

#### **Bar Chart Of Example Packet Rates**



#### **Bar Chart Of Example Packet Rates**



• Gray areas show rates for large packets

NSD-Intel -- Chapt. 8

#### **Time Per Packet**

| Technology    | Time per packet<br>for small packets<br>(in μs) | Time per packet<br>for large packets<br>( in μs ) |
|---------------|-------------------------------------------------|---------------------------------------------------|
| 10Base-T      | 51.20                                           | 1,214.40                                          |
| 100Base-T     | 5.12                                            | 121.44                                            |
| OC-3          | 3.29                                            | 78.09                                             |
| OC-12         | 0.82                                            | 19.52                                             |
| 1000Base-T    | 0.51                                            | 12.14                                             |
| <b>OC-48</b>  | 0.21                                            | 4.88                                              |
| OC-192        | 0.05                                            | 1.22                                              |
| <b>OC-768</b> | 0.01                                            | 0.31                                              |

• Note: these numbers are for a single connection!

#### Conclusion

Software running on a general-purpose processor is an insufficient architecture to handle high-speed networks because the aggregate packet rate exceeds the capabilities of a CPU.

## Possible Ways To Solve The CPU Bottleneck

- Fine-grain parallelism
- Symmetric coarse-grain parallelism
- Asymmetric coarse-grain parallelism
- Special-purpose coprocessors
- NICs with onboard processing
- Smart NICs with onboard stacks
- Cell switching
- Data pipelines

## **Fine-Grain Parallelism**

- Multiple processors
- Instruction-level parallelism
- Example:
  - Parallel checksum: add values of eight consecutive memory locations at the same time
- Assessment: insignificant advantages for packet processing

## Symmetric Coarse-Grain Parallelism

- Symmetric multiprocessor hardware
  - Multiple, identical processors
- Typical design: each CPU operates on one packet
- Requires coordination
- Assessment: coordination and data access means *N* processors cannot handle *N* times more packets than one processor

## **Asymmetric Coarse-Grain Parallelism**

- Multiple processors
- Each processor
  - Optimized for specific task
  - Includes generic instructions for control
- Assessment
  - Same problems of coordination and data access as symmetric case
  - Designer must choose how many copies of each processor type

## **Special-Purpose Coprocessors**

- Special-purpose hardware
- Added to conventional processor to speed computation
- Invoked like software subroutine
- Typical implementation: ASIC chip
- Choose operations that yield greatest improvement in speed

#### **General Principle**

To optimize computation, move operations that account for the most CPU time from software into hardware.

#### **General Principle**

To optimize computation, move operations that account for the most CPU time from software into hardware.

• Idea known as *Amdahl's law* (performance improvement from faster hardware technology is limited to the fraction of time the faster technology can be used)

# **NICs And Onboard Processing**

- Basic optimizations
  - Onboard address recognition and filtering
  - Onboard buffering
  - DMA
  - Buffer and operation chaining
- Further optimization possible

## **Smart NICs With Onboard Stacks**

- Add hardware to NIC
  - Off-the-shelf chips for layer 2
  - ASICs for layer 3
- Allows each NIC to operate independently
  - Effectively a multiprocessor
  - Total processing power increased dramatically

## **Illustration Of Smart NICs With Onboard Processing**



- NIC handles layers 2 and 3
- CPU only handles exceptions

# **Cell Switching**

- Alternative to new hardware
- Changes
  - Basic paradigm
  - All details (e.g., protocols)
- Connection-oriented

# **Cell Switching Details**

- Fixed-size packets
  - Allows fixed-size buffers
  - Guaranteed time to transmit/receive
- Relative (connection-oriented) addressing
  - Smaller address size
  - Label on packet changes at each switch
  - Requires connection setup
- Example: ATM

## **Data Pipeline**

- Move each packet through series of processors
- Each processor handles some tasks
- Assessment
  - Well-suited to many protocol processing tasks
  - Individual processor can be fast

## **Illustration Of Data Pipeline**



- Pipeline can contain heterogeneous processors
- Packets pass through each stage

## Summary

- Packet rate can be more important than data rate
- Highest packet rate achieved with smallest packets
- Rates measured per interface or aggregate
- Special hardware needed for highest-speed network systems
  - Smart NIC can include part of protocol stack
  - Parallel and pipelined hardware also possible

# **Questions?**

#### IX

# Classification And Forwarding

### Recall

- Packet demultiplexing
  - Used with layered protocols
  - Packet proceeds through one layer at a time
  - On input, software in each layer chooses module at next higher layer
  - On output, type field in each header specifies encapsulation

#### The Disadvantage Of Demultiplexing

Although it provides freedom to define and use arbitrary protocols without introducing transmission overhead, demultiplexing is inefficient because it imposes sequential processing among layers.

## **Packet Classification**

- Alternative to demultiplexing
- Designed for higher speed
- Considers all layers at the same time
- Linear in number of fields
- Two possible implementations
  - Software
  - Hardware

## **Example Classification**

- Classify Ethernet frames carrying traffic to Web server
- Specify exact header contents in *rule set*
- Example
  - Ethernet type field specifies IP
  - IP type field specifies TCP
  - TCP destination port specifies Web server

# Example Classification (continued)

- Field sizes and values
  - 2-octet Ethernet type is  $0800_{16}$
  - 1-octet IP type is 6
  - 2-octet TCP destination port is 80

#### **Illustration Of Encapsulated Headers**

| 0                          | 4      | 8  | 10        | 16                   | 19  | 24          | 31 |
|----------------------------|--------|----|-----------|----------------------|-----|-------------|----|
|                            |        |    |           | ETHERNET DEST. (0-1) |     |             |    |
| ETHERNET DESTINATION (2-5) |        |    |           |                      |     |             |    |
| ETHERNET SOURCE (0-3)      |        |    |           |                      |     |             |    |
| ETHERNET SOURCE (4-5)      |        |    |           | ETHERNET TYPE        |     |             |    |
| VERS                       | HLEN   |    | SERVICE   | IP TOTAL LENGTH      |     |             |    |
| IP IDENT                   |        |    |           | FLAGS                | F   | RAG. OFFSET | 1  |
| IP TTL                     |        |    | IP TYPE   | IP HDR. CHECKSUM     |     |             |    |
| IP SOURCE ADDRESS          |        |    |           |                      |     |             |    |
| IP DESTINATION ADDRESS     |        |    |           |                      |     |             |    |
| TCP SOURCE PORT            |        |    |           | TCP DESTINATION PORT |     |             | т  |
| TCP SEQUENCE               |        |    |           |                      |     |             |    |
| TCP ACKNOWLEDGEMENT        |        |    |           |                      |     |             |    |
| HLEN                       | NOT US | ED | CODE BITS |                      | ТСР | WINDOW      |    |
| TCP CHECKSUM               |        |    |           | TCP URGENT PTR       |     |             |    |
| Start Of TCP Data          |        |    |           |                      |     |             |    |

• Highlighted fields are used for classification of Web server traffic

## Software Implementation Of Classification

- Compare values in header fields
- Conceptually a *logical and* of all field comparisons
- Example

else

declare the packet does not match the classification;

# **Optimizing Software Classification**

- Comparisons performed sequentially
- Can reorder comparisons to minimize effort

## **Example Of Optimizing Software Classification**

- Assume
  - 95.0% of all frames have frame type  $0800_{16}$
  - 87.4% of all frames have IP type 6
  - 74.3% of all frames have TCP port 80
- Also assume values 6 and 80 do not occur in corresponding positions in non-IP packet headers
- Reordering tests can optimize processing time

# Example Of Optimizing Software Classification (continued)

if ((TCP port == 80) && (IP type == 6) && (frame type == 0x0800))
 declare the packet matches the classification;
else

declare the packet does not match the classification;

• At each step, test the field that will eliminate the most packets

## **Note About Optimization**

Although the maximum number of comparisons in a software classifier is fixed, the average number of comparisons is determined by the order of the tests; minimum comparisons result if, at each step, the classifier tests the field that eliminates the most packets.

# **Hardware Implementation Of Classification**

- Can build special-purpose hardware
- Steps
  - Extract needed fields
  - Concatenate bits
  - Place result in register
  - Perform comparison
- Hardware can operate in parallel

## **Illustration Of Hardware Classifier**



• Constant for Web classifier is 08.00.06.00.50<sub>16</sub>

# **Special Cases Of Classification**

- Multiple categories
- Variable-size headers
- Dynamic classification

# **In Practice**

- Classification usually involves multiple categories
- Packets grouped together into *flows*
- May have a default category
- Each category specified with rule set

# **Example Multi-Category Classification**

- Flow 1: traffic destined for Web server
- Flow 2: traffic consisting of ICMP echo request packets
- Flow 3: all other traffic (default)

# **Rule Sets**

- Web server traffic
  - 2-octet Ethernet type is  $0800_{16}$
  - 1-octet IP type is 6
  - 2-octet TCP destination port is 80
- ICMP echo traffic
  - 2-octet Ethernet type is  $0800_{16}$
  - 1-octet IP type is 1
  - 1-octet ICMP type is 8

# **Software Implementation Of Multiple Rules**

```
if (frame type != 0x0800) {
    send frame to flow 3;
} else if (IP type == 6 && TCP destination port == 80) {
    send packet to flow 1;
} else if (IP type == 1 && ICMP type == 8) {
    send packet to flow 2;
} else {
    send frame to flow 3;
}
```

• Further optimization possible

## **Variable-Size Packet Headers**

- Fields not at fixed offsets
- Easily handled with software
- Finite cases can be specified in rules

#### **Example Variable-Size Header: IP Options**

- Rule Set 1
  - 2-octet frame type field contains  $0800_{16}$
  - 1-octet field at the start of the datagram contains  $45_{16}$
  - 1-octet type field in the IP datagram contains 6
  - 2-octet field 22 octets from start of the datagram contains 80
- Rule Set 2
  - 2-octet frame type field contains  $0800_{16}$
  - 1-octet field at the start of the datagram contains  $46_{16}$
  - 1-octet type field in the IP datagram contains 6
  - 2-octet field 26 octets from the start of datagram contains 80

# **Effect Of Protocol Design On Classification**

- Fixed headers fastest to classify
- Each variable-size header adds one computation step
- In worst case, classification no faster than demultiplexing
- Extreme example: IPv6

# **Hybrid Classification**



- Combines hardware and software mechanisms
  - Hardware used for standard cases
  - Software used for exceptions
- Note: software classifier can operate at slower rate

# **Two Basic Types Of Classification**

- Static
  - Flows specified in rule sets
  - Header fields and values known a priori
- Dynamic
  - Flows created by observing packet stream
  - Values taken from headers
  - Allows fine-grain flows
  - Requires state information

# **Example Static Classification**

- Allocate one flow per service type
- One header field used to identify flow
  - IP TYPE OF SERVICE (TOS)
- Use DIFFSERV interpretation
- Note: Ethernet type field also checked

# **Example Dynamic Classification**

- Allocate flow per TCP connection
- Header fields used to identify flow
  - IP source address
  - IP destination address
  - TCP source port number
  - TCP destination port number
- Note: Ethernet type and IP type fields also checked

# **Implementation Of Dynamic Classification**

- Usually performed in software
- State kept in memory
- State information created/updated at wire speed

#### **Two Conceptual Bindings**

classification: packet  $\rightarrow$  flow forwarding: flow  $\rightarrow$  packet disposition

- Classification binding is usually 1-to-1
- Forwarding binding can be 1-to-1 or many-to-1

# **Flow Identification**

- Connection-oriented network
  - Per-flow SVC can be created on demand
  - Flow ID equals connection ID
- Connectionless network
  - Flow ID used internally
  - Each flow ID mapped to (*next hop, interface*)

## Relationship Of Classification And Forwarding In A Connection-Oriented Network

In a connection-oriented network, flow identifiers assigned by classification can be chosen to match connection identifiers used by the underlying network. Doing so makes forwarding more efficient by eliminating one binding.

# **Forwarding In A Connectionless Network**

- Route for flow determined when flow created
- Indexing used in place of route lookup
- Flow identifier corresponds to index of entry in forwarding cache
- Forwarding cache must be changed when route changes

## **Second Generation Network Systems**

- Designed for greater scale
- Use classification instead of demultiplexing
- Decentralized architecture
  - Additional computational power on each NIC
  - NIC implements classification and forwarding
- High-speed internal interconnection mechanism
  - Interconnects NICs
  - Provides fast data path

## Illustration Of Second Generation Network Systems Architecture



# **Classification And Forwarding Chips**

- Sold by vendors
- Implement hardware classification and forwarding
- Typical configuration: rule sets given in ROM

# Summary

- Classification faster than demultiplexing
- Can be implemented in hardware or software
- Dynamic classification
  - Uses packet contents to assign flows
  - Requires state information

# **Questions?**

## XI

## **Network Processors: Motivation And Purpose**

# **Second Generation Network Systems**

- Concurrent with ATM development (early 1990s)
- Purpose: scale to speeds faster than single CPU capacity
- Features
  - Use classification instead of demultiplexing
  - Decentralized architecture to offload CPU
  - Design optimized for fast data path

## Second Generation Network Systems (details)

- Multiple network interfaces
  - Powerful NIC
  - Private buffer memory
- High-speed hardware interconnects NICs
- General-purpose processor only handles exceptions
- Sufficient for medium speed interfaces (100 Mbps)

## **Reminder: Protocol Processing In Second Generation Network Systems**



- NIC handles most of layers 1 3
- Fast-path forwarding avoids CPU completely

# **Third Generation Network Systems**

- Late 1990s
- Functionality partitioned further
- Additional hardware on each NIC
- Almost all packet processing off-loaded from CPU

# **Third Generation Design**

- NIC contains
  - ASIC hardware
  - Embedded processor plus code in ROM
- NIC handles
  - Classification
  - Forwarding
  - Traffic policing
  - Monitoring and statistics

## **Embedded Processor**

- Two possibilities
  - Complex Instruction Set Computer (CISC)
  - Reduced Instruction Set Computer (RISC)
- RISC used often because
  - Higher clock rates
  - Smaller
  - Lower power consumption

## Purpose Of Embedded Processor In Third Generation Systems

Third generation systems use an embedded processor to handle layer 4 functionality and exception packets that cannot be forwarded across the fast path. An embedded processor architecture is chosen because ease of implementation and amenability to change are more important than speed.

# **Protocol Processing In Third Generation Systems**



- Special-purpose ASICs handle lower layer functions
- Embedded (RISC) processor handles layer 4
- CPU only handles low-demand processing

## **Are Third Generation Systems Sufficient?**

# **Are Third Generation Systems Sufficient?**

• Almost

# **Are Third Generation Systems Sufficient?**

• Almost ... but not quite.

# **Problems With Third Generation Systems**

- High cost
- Long time to market
- Difficult to simulate/test
- Expensive and time-consuming to change
  - Even trivial changes require silicon respin
  - 18-20 month development cycle
- Little reuse across products
- Limited reuse across versions

## **Problems With Third Generation Systems** (continued)

- No consensus on overall framework
- No standards for special-purpose support chips
- Requires in-house expertise (ASIC designers)

# **A Fourth Generation**

- Goal: combine best features of first generation and third generation systems
  - Flexibility of programmable processor
  - High speed of ASICs
- Technology called *network processors*

## **Definition Of A Network Processor**

A network processor is a special-purpose, programmable hardware device that combines the low cost and flexibility of a RISC processor with the speed and scalability of custom silicon (i.e., ASIC chips). Network processors are building blocks used to construct network systems.

# **Network Processors: Potential Advantages**

- Relatively low cost
- Straightforward hardware interface
- Facilities to access
  - Memory
  - Network interface devices
- Programmable
- Ability to scale to higher
  - Data rates
  - Packet rates

# **Network Processors: Potential Advantages**

- Relatively low cost
- Straightforward hardware interface
- Facilities to access
  - Memory
  - Network interface devices
- Programmable
- Ability to scale to higher
  - Data rates
  - Packet rates

# **The Promise Of Programmability**

- For producers
  - Lower initial development costs
  - Reuse software in later releases and related systems
  - Faster time-to-market
  - Same high speed as ASICs
- For consumers
  - Much lower product cost
  - Inexpensive (firmware) upgrades

# **Choice Of Instruction Set**

- Programmability alone insufficient
- Also need higher speed
- Should network processors have
  - Instructions for specific protocols?
  - Instructions for specific protocol processing tasks?
- Choices difficult

# **Instruction Set**

- Need to choose one instruction set
- No single instruction set best for all uses
- Other factors
  - Power consumption
  - Heat dissipation
  - Cost
- More discussion later in the course

# **Scalability**

- Two primary techniques
  - Parallelism
  - Data pipelining
- Questions
  - How many processors?
  - How should they be interconnected?
- More discussion later

19

# **Costs And Benefits Of Network Processors**

- Currently
  - More expensive than conventional processor
  - Slower than ASIC design
- Where do network processors fit?
  - Somewhere in the middle

#### **Where Network Processors Fit**



• Network processors: the middle ground

# **Achieving Higher Speed**

- What is known
  - Must partition packet processing into separate functions
  - To achieve highest speed, must handle each function with separate hardware
- What is unknown
  - Exactly what functions to choose
  - Exactly what hardware building blocks to use
  - Exactly how building blocks should be interconnected

## **Variety Of Network Processors**

- Economics driving a gold rush
  - NPs will dramatically lower production costs for network systems
  - A good NP design potentially worth lots of \$\$
- Result
  - Wide variety of architectural experiments
  - Wild rush to try yet another variation

# **An Interesting Observation**

- System developed using ASICs
  - High development cost (\$1M)
  - Lower cost to replicate
- System developed using network processors
  - Lower development cost
  - Higher cost to replicate
- Conclusion: amortized cost favors ASICs for most highvolume systems

# Summary

- Third generation network systems have embedded processor on each NIC
- Network processor is programmable chip with facilities to process packets faster than conventional processor
- Primary motivation is economic
  - Lower development cost than ASICs
  - Higher processing rates than conventional processor

# **Questions?**

#### XII

## The Complexity Of Network Processor Design

## How Should A Network Processor Be Designed?

- Depends on
  - Operations network processor will perform
  - Role of network processor in overall system

# Goals

- Generality
  - Sufficient for all protocols
  - Sufficient for all protocol processing tasks
  - Sufficient for all possible networks
- High speed
  - Scale to high bit rates
  - Scale to high packet rates
- Elegance
  - Minimality, not merely comprehensiveness

## **The Key Point**

A network processor is not designed to process a specific protocol or part of a protocol. Instead, designers seek a minimal set of instructions that are sufficient to handle an arbitrary protocol processing task at high speed.

## **Network Processor Design**

- To understand network processors, consider problem to be solved
  - Protocols being implemented
  - Packet processing tasks

# **Packet Processing Functions**

- Error detection and correction
- Traffic measurement and policing
- Frame and protocol demultiplexing
- Address lookup and packet forwarding
- Segmentation, fragmentation, and reassembly
- Packet classification
- Traffic shaping
- Timing and scheduling
- Queueing
- Security: authentication and privacy

# Questions

- Does our list of functions encompass all protocol processing?
- Which function(s) are most important to optimize?
- How do the functions map onto hardware units in a typical network system?
- Which hardware units in a network system can be replaced with network processors?
- What minimal set of instructions is sufficiently general to implement all functions?

# **Division Of Functionality**

- Partition problem to reduce complexity
- Basic division into two parts
- Functions applied when packet arrives known as

ingress processing

• Functions applied when packet leaves known as

egress processing

# **Ingress Processing**

- Security and error detection
- Classification or demultiplexing
- Traffic measurement and policing
- Address lookup and packet forwarding
- Header modification and transport splicing
- Reassembly or flow termination
- Forwarding, queueing, and scheduling

# **Egress Processing**

- Addition of error detection codes
- Address lookup and packet forwarding
- Segmentation or fragmentation
- Traffic shaping
- Timing and scheduling
- Queueing and buffering
- Output security processing

## **Illustration Of Packet Flow**



## **A Note About Scalability**

Unlike a conventional processor, scalability is essential for network processors. To achieve maximum scalability, a network processor offers a variety of special-purpose functional units, allows parallel or pipelined execution, and operates in a distributed environment.

## How Will Network Processors Be Used?

- For ingress processing only?
- For egress processing only?
- For combination?

#### How Will Network Processors Be Used?

- For ingress processing only?
- For egress processing only?
- For combination?
- Answer: No single role

## Potential Architectural Roles For Network Processor

- Replacement for a conventional CPU
- Augmentation of a conventional CPU
- On the input path of a network interface card
- Between a network interface card and central interconnect
- Between central interconnect and an output interface
- On the output path of a network interface card
- Attached to central interconnect like other ports

#### An Interesting Potential Role For Network Processors

In addition to replacing elements of a traditional third generation architecture, network processors can be attached directly to a central interconnect and used to implement stages of a macroscopic data pipeline. The interconnect allows forwarding among stages to be optimized.

# **Conventional Processor Design**

- Design an instruction set, *S*
- Build an emulator/simulator for *S* in software
- Build a compiler that translates into *S*
- Compile and emulate example programs
- Compare results to
  - Extant processors
  - Alternative designs

## **Network Processor Emulation**

- Can emulate low-level logic (e.g., Verilog)
- Software implementation
  - Slow
  - Cannot handle real packet traffic
- FPGA implementation
  - Expensive and time-consuming
  - Difficult to make major changes

## **Network Processor Design**

- Unlike conventional processor design
- No existing code base
- No prior hardware experience
- Each design differs

#### Hardware And Software Design

Because a network processor includes many low-level hardware details that require specialized software, the hardware and software designs are codependent; software for a network processor must be created along with the hardware.

# Summary

- Protocol processing divided into ingress and egress operations
- Network processor design is challenging because
  - Desire generality and efficiency
  - No existing code base
  - Software designs evolving with hardware

# **Questions?**

#### XIII

#### **Network Processor Architectures**

#### **Architectural Explosion**

An excess of exuberance and a lack of experience have produced a wide variety of approaches and architectures.

# **Principle Components**

- Processor hierarchy
- Memory hierarchy
- Internal transfer mechanisms
- External interface and communication mechanisms
- Special-purpose hardware
- Polling and notification mechanisms
- Concurrent and parallel execution support
- Programming model and paradigm
- Hardware and software dispatch mechanisms

3

# **Processing Hierarchy**

- Consists of hardware units
- Performs various aspects of packet processing
- Includes onboard and external processors
- Individual processor can be
  - Programmable
  - Configurable
  - Fixed

#### **Typical Processor Hierarchy**

| Level | Processor Type       | Programmable? | On Chip?  |
|-------|----------------------|---------------|-----------|
| 8     | General purpose CPU  | yes           | possibly  |
| 7     | Embedded processor   | yes           | typically |
| 5     | I/O processor        | yes           | typically |
| 6     | Coprocessor          | no            | typically |
| 4     | Fabric interface     | no            | typically |
| 3     | Data transfer unit   | no            | typically |
| 2     | Framer               | no            | possibly  |
| 1     | Physical transmitter | no            | possibly  |

# **Memory Hierarchy**

- Memory measurements
  - Random access latency
  - Sequential access latency
  - Throughput
  - Cost
- Can be
  - Internal
  - External

## **Typical Memory Hierarchy**

| Memory Type     | Rel. Speed | Approx. Size    | On Chip? |
|-----------------|------------|-----------------|----------|
| Control store   | 100        | 10 <sup>3</sup> | yes      |
| G.P. Registers† | 90         | 10 <sup>2</sup> | yes      |
| Onboard Cache   | 40         | 10 <sup>3</sup> | yes      |
| Onboard RAM     | 7          | 10 <sup>3</sup> | yes      |
| Static RAM      | 2          | 10 <sup>7</sup> | no       |
| Dynamic RAM     | 1          | 10 <sup>8</sup> | no       |

# **Internal Transfer Mechanisms**

- Internal bus
- Hardware FIFOs
- Transfer registers
- Onboard shared memory

## **External Interface And Communication Mechanisms**

- Standard and specialized bus interfaces
- Memory interfaces
- Direct I/O interfaces
- Switching fabric interface

#### **Example Interfaces**

- System Packet Interface Level 3 or 4 (SPI-3 or SPI-4)
- SerDes Framer Interface (SFI)
- CSIX fabric interface

Note: The Optical Internetworking Forum (OIF) controls the SPI and SFI standards.

# **Polling And Notification Mechanisms**

- Handle asynchronous events
  - Arrival of packet
  - Timer expiration
  - Completion of transfer across the fabric
- Two paradigms
  - Polling
  - Notification

# **Concurrent Execution Support**

- Improves overall throughput
- Multiple threads of execution
- Processor switches context when a thread blocks

# **Support For Concurrent Execution**

- Embedded processor
  - Standard operating system
  - Context switching in software
- I/O processors
  - No operating system
  - Hardware support for context switching
  - Low-overhead or zero-overhead

# **Concurrent Support Questions**

- Local or global threads (does thread execution span multiple processors)?
- Forced or voluntary context switching (are threads preemptable)?

# Hardware And Software Dispatch Mechanisms

- Refers to overall control of parallel operations
- Dispatcher
  - Chooses operation to perform
  - Assigns to a processor

# **Implicit And Explicit Parallelism**

- Explicit parallelism
  - Exposes parallelism to programmer
  - Requires software to understand parallel hardware
- Implicit parallelism
  - Hides parallel copies of functional units
  - Software written as if single copy executing

# Architecture Styles, Packet Flow, And Clock Rates

- Embedded processor plus fixed coprocessors
- Embedded processor plus programmable I/O processors
- Parallel (number of processors scales to handle load)
- Pipeline processors
- Dataflow

#### **Embedded Processor Architecture**



- Single processor
  - Handles all functions
  - Passes packet on
- Known as run-to-completion

#### **Parallel Architecture**



• Each processor handles 1/N of total load

#### **Pipeline Architecture**



- Each processor handles one function
- Packet moves through "pipeline"

20

## **Clock Rates**

- Embedded processor runs at > wire speed
- Parallel processor runs at < wire speed
- Pipeline processor runs at wire speed

# **Software Architecture**

- Central program that invokes coprocessors like subroutines
- Central program that interacts with code on intelligent, programmable I/O processors
- Communicating threads
- Event-driven program
- RPC-style (program partitioned among processors)
- Pipeline (even if hardware does not use pipeline)
- Combinations of the above

#### **Example Uses Of Programmable Processors**

#### **General purpose CPU**

Highest level functionality Administrative interface System control Overall management functions Routing protocols

#### **Embedded processor**

Intermediate functionality Higher-layer protocols Control of I/O processors Exception and error handling High-level ingress (e.g., reassembly) High-level egress (e.g., traffic shaping)

#### I/O processor

Basic packet processing Classification Forwarding Low-level ingress operations Low-level egress operations

#### **Using The Processor Hierarchy**

To maximize performance, packet processing tasks should be assigned to the lowest level processor capable of performing the task.

#### **Packet Flow Through The Hierarchy**



# Summary

- Network processor architectures characterized by
  - Processor hierarchy
  - Memory hierarchy
  - Internal buses
  - External interfaces
  - Special-purpose functional units
  - Support for concurrent or parallel execution
  - Programming model
  - Dispatch mechanisms

# **Questions?**

# XVII

#### **Overview Of The Intel Network Processor**

### **An Example Network Processor**

- We will
  - Choose one example
  - Examine the hardware
  - Gain first-hand experience with software
- The choice: Intel

# **Intel Network Processor Terminology**

- Intel Exchange Architecture (IXA)
  - Broad reference to architecture
  - Both hardware and software
  - Control plane and data plane
- Intel Exchange Processor (IXP)
  - Network processor that implements IXA

# Intel IXP2xxx

- Refers to second generation IXP chip
- Several models available

| Model   | Intended      | Typical         | Data      | Support For  |
|---------|---------------|-----------------|-----------|--------------|
| Number  | Use           | Input           | Rate      | cryptography |
| IXP2400 | Access & edge | OC-12 to OC-48  | 2.5 Gbps  | no           |
| IXP2800 | Edge & core   | OC-48 to OC-192 | 10.0 Gbps | no           |
| IXP2850 | Edge & core   | OC-48 to OC-192 | 10.0 Gbps | yes          |

- Differences in speed, power consumption, parallelism, interfaces, packaging
- Term *IXP2xxx* refers to any model

#### **IXP2xxx Features**

- One embedded RISC processor
- Eight to sixteen programmable packet processors
- Multiple, independent onboard buses
- Processor synchronization mechanisms
- Shared and non-shared onboard memory
- One low-speed serial line interface
- Multiple interfaces for external memories
- Multiple interfaces for external I/O buses
- Coprocessor for hash computation and cryptography
- Other functional units

#### **IXP2xxx External Connections**



#### **IXP2400 External Connection Speeds**

| Туре          | Bus Width          | <b>Clock Rate</b> | Data Rate       |
|---------------|--------------------|-------------------|-----------------|
| Serial line   | (NA)               | (NA)              | 38.4 Kbps       |
| PCI bus       | 64 bits            | 66 MHz            | <b>2.2 Gbps</b> |
| MSF interface | 32 bits in and out | unspecified       | unspecified     |
| DDR DRAM      | 64 bits            | 150 MHz           | 2.4 GBps        |
| QDR SRAM      | 32 bits            | 200 MHz           | 1.6 GBps        |

†GBps abbreviates Giga Bytes per second.

- Notes
  - MBps abbreviates Mega Bytes per second
  - IXP2800 operates at higher speed

#### **IXP2xxx Internal Units**

| Quantity | Component                          | Purpose                                                |
|----------|------------------------------------|--------------------------------------------------------|
| 1        | Embedded RISC<br>processor         | Control, higher layer protocols,<br>and exceptions     |
| 8/16     | Packet processing<br>engines       | I/O, basic packet processing,<br>and packet forwarding |
| 1+       | SRAM access unit                   | Coordinate access to the<br>external SRAM bus          |
| 1+       | DRAM access unit                   | Coordinate access to the<br>external DRAM bus          |
| 1        | Media/Switch Fabric<br>access unit | Coordinate access to the<br>external I/O devices       |
| 1        | PCI bus access unit                | Coordinate access to the<br>external PCI bus           |
| 1        | Hash unit                          | Compute a hash function for<br>high-speed lookup       |
| 0 or 1   | Crypto unit                        | Compute cryptographic encoding for secure transfer     |
| several  | Onboard buses                      | Internal control and data transfer                     |

#### **IXP2xxx Internal Architecture**



#### **Processors On The IXP2xxx**

| Processor Type            | Onboard? | Programmable? |
|---------------------------|----------|---------------|
| General-Purpose Processor | no       | yes           |
| Embedded RISC Processor   | yes      | yes           |
| I/O Processors            | yes      | yes           |
| Coprocessors              | yes      | no            |
| Physical Interfaces       | no       | no            |

#### **IXP2xxx Memory Hierarchy**

| Memory<br>Type      | Maximum<br>Size                   | On<br>Chip? | Typical<br>Use                    |
|---------------------|-----------------------------------|-------------|-----------------------------------|
| <b>GP Registers</b> | 256 (2 banks)                     | yes         | Intermediate computation          |
| Inst. Cache         | 32 Kbytes                         | yes         | <b>Recently used instructions</b> |
| Data Cache          | 32 Kbytes                         | yes         | Recently used data                |
| Mini Cache          | 2 Kbytes                          | yes         | Data that is reused once          |
| Write buffer        | unspecified                       | yes         | Write operation buffer            |
| Local memory        | <b>2560 bytes /</b> μ <b>eng.</b> | yes         | Register spills and caching       |
| Scratchpad          | 16 Kbytes                         | yes         | IPC and synchronization           |
| Inst. Store         | <b>4 Kbytes/</b> μ <b>eng.</b>    | yes         | Microengine instructions          |
| FlashROM            | unspecified                       | no          | Bootstrap                         |
| SRAM                | 64 Mbytes                         | no          | Tables or packet headers          |
| DRAM                | 2 Gbytes                          | no          | Packet storage                    |

#### **IXP2xxx Memory Characteristics**

| Memory<br>Type | Access Unit<br>(bytes) | Relative<br>Access Time | Special<br>Features                                                                                       |
|----------------|------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------|
| local          | 4                      | 1                       | accessed using the<br>LM_ADDR registers                                                                   |
| Scratch        | 4                      | 10                      | synchronization via<br>atomic read-modify-write<br>support for IPC (rings)<br>push/pull reflector<br>mode |
| SRAM           | 4                      | 14                      | follows QDR specification<br>atomic operations<br>support for queues and rings<br>bit manipulation        |
| DRAM           | 8                      | 20                      | connects to: Xscale,<br>microengines, and<br>PCI bus master                                               |

#### **Memory Access**

- Each memory specifies *minimum access unit* 
  - Two-byte unit is word
  - Four-byte unit is *longword*
  - Eight-byte unit is *quadword*
- When program accesses item in memory, physical memory system fetches entire access unit

#### **The Point About Memory Acceess**

The underlying memory is organized into data units of words or longwords. To achieve optimal performance, programmers must understand the memory organization and allocate items to minimize access times.

# **Example Of Complexity: PCI Access Unit**



# Summary

- We will use Intel IXP2xxx as example
- IXP2xxx offers
  - Embedded processor plus parallel packet processors
  - Connections to external memories and buses

# **Questions?**

#### XX

# Reference System And Software Development Kit (ENP-2611, SDK)

### **Reference System**

- Provided by vendor
- Targeted at potential customers
- Usually includes
  - Hardware testbed
  - Development software
  - Simulator or emulator
  - Download and bootstrap software
  - Reference implementations

# **Intel Reference Hardware**

- Single-board network processor testbed
- Plugs into PCI bus on a PC
- Part number *ENP-2611*
- Internal code name *Mt. Hood*
- Manufactured by Radisys Corporation

#### Items On The Intel ENP-2611 Reference System

| Quantity or Size | Item                                                                                                |
|------------------|-----------------------------------------------------------------------------------------------------|
| 1                | IXP2400 network processor (600MHz)                                                                  |
| 8                | Mbytes of QDR-SRAM memory                                                                           |
| 256              | Mbytes of DDR-SDRAM memory                                                                          |
| 16               | Mbytes of Flash ROM memory                                                                          |
| 1                | SPI-3 bridge FPGA to connect to:<br>– PM3386/7 Gigabit Ethernet MACs<br>– MSF running in SPI-3 mode |
| 3                | 10/100/1000 optical Ethernet ports                                                                  |
| 1                | 10/100 Ethernet management port                                                                     |
| 1                | Serial interface (on the XScale)                                                                    |
| 1                | PCI bus interface                                                                                   |

#### **Intel Reference Software**

- Known as *Software Development Kit (SDK)*
- Runs on PC
- Includes:

| Software              | Purpose                                                                                   |  |
|-----------------------|-------------------------------------------------------------------------------------------|--|
| C compiler            | Compile C programs for the XScale                                                         |  |
| MicroC compiler       | Compile C programs for the microengines                                                   |  |
| Assembler             | Assemble programs for the microengines                                                    |  |
| Simulator             | Simulate an IXP2xxx for debugging (Windows)                                               |  |
| Resource Manager      | XScale kernel module used to control<br>and communicate with hardware                     |  |
| Workbench Server      | Load software into the network processor                                                  |  |
| Workbench Backend Svr | Remote (Windows) application that controls<br>and communicates with the network processor |  |
| Bootstrap             | Start the network processor running                                                       |  |
| Reference Code        | Example programs for the IXP2xxx that show<br>how to implement basic functions            |  |

# **Operating System On XScale**

- XScale processor powerful enough to run an OS
- Version of *Embedded Linux* used that supports
  - Telnet server that allows remote login
  - Shell that allows a user to run commands
  - Access to a remote file system via NFS
  - Other servers that are used for control and status

# **Operating System On External Host**

- Cross-development tools run on external host (PC)
  - Some SDK compilers require Linux
  - Workbench Backend Server (WB Backend Server) used for download runs under Windows
- Site can avoid using Workbench software

# **External File Access And Storage**

- DRAM accessed via DRAM bus
- SRAM and Flash accessed via SRAM bus
- Ethernet ports accessed via MSF
- Code and data downloaded via control Ethernet
- NFS accessed via control Ethernet
- XScale accessed via
  - Serial line (console)
  - Telnet

# **Basic Paradigm**

- Build software on conventional computer
- Load into reference system
- Test/measure results

# **Bootstrapping Procedure**

- 1. Restarting the ENP-2611 causes the boot manager on the XScale to load and run a copy of *RedBoot* program out of the Flash ROM.
- The RedBoot program running on the XScale sends a BOOTP request to obtain an IP address for the management Ethernet port.
- 3. The RedBoot program running on the XScale uses the IP address obtained via BOOTP to contact a TFTP server and download a copy of the Linux kernel image.
- 4. The Linux kernel boots and uses NFS to mount a remote file system.
- 5. After the kernel is operational, a script runs that starts a telnet server on the management interface as well as other servers that accessible to external hosts.

# **Starting Software**

- 1. Compile code for the XScale and microengines, and place the resulting files in a directory, D, on the computer that runs the NFS server. The Intel SDK uses the terms *core component* and *microblock* for the compiled files.
- 2. Copy the entire contents of directory D to the read-write public download directory, W, that has been mounted by the testbed. (This step is not necessary if only one programmer has access to the testbed.)
- 3. Run a telnet client program on the host that forms a connection to the testbed system, and log onto the XScale.
- 4. Load a set of Linux kernel modules, including microengine drivers, the Resource Manager library, and a module to configure the SPI-3 interface.

# Starting Software (continued)

- 5. Load a module that reads microblock code and places the code into microengines. Note that no such module comes with the SDK; a programmer must write the module. However, the module does not need to access hardware directly because the *Resource Manager* can be used to load code into microengines.
- 6. Start the programs on the XScale that were compiled in Step 1.

# Summary

- Reference systems
  - Provided by vendor
  - Targeted at potential customers
  - Usually include
    - \* Hardware testbed
    - \* Cross-development software
    - \* Download and bootstrap software
    - \* Reference implementations

# **Questions?**

# XVIII

#### **Embedded RISC Processor (XScale Core)**

#### **XScale Role**



- (a) Single IXP2xxx
- (b) Multiple IXP2xxxs
- Role of XScale differs

# Tasks That Can Be Performed By XScale

- Bootstrapping
- Exception handling
- Higher-layer protocol processing
- Interactive debugging
- Diagnostics and logging
- Memory allocation
- Application programs (if needed)
- User interface and/or interface to the GPP
- Control of packet processors
- Other administrative functions

# **XScale Characteristics**

- Reduced Instruction Set Computer (RISC)
- Thirty-two bit arithmetic
- Extra functionality can be provided via a coprocessor
- Byte addressable memory
- Virtual memory support
- Built-in serial port
- Facilities for a kernelized operating system
- Performance monitoring unit

# Arithmetic

- XScale is configurable in two modes
  - Big endian
  - Little endian
- Choice made at run-time

# **Performance Monitoring Unit**

- Can measure
  - Instruction cache miss rate
  - Translation Lookaside Buffer (TLB) miss rate
  - Stalls in the instruction pipeline
  - Number of branches initiated by software
- Useful for program tuning

## **XScale Memory Organization**

- Single, uniform address space
- Includes memories and devices
- Byte addressable

#### **XScale Address Space**



## **Shared Memory And Address Translation**

- Memory shared with microengines
- Microengines use separate physical address spaces

#### **Consequence For Programmers**

Because the Xscale and packet processors do not use the same memory architecture, linked lists and other data structures in which pointers cross from one memory to another do not make sense in the microengine address space.

# **Internal Peripheral Units**

- One UART
- Four 32-bit countdown timers (one watchdog)
- Eight General-Purpose I/O (GPIO) pins
- One Slowport interface

## Summary

- Embedded processor on IXP2xxx is XScale
- XScale addressing
  - Single, uniform address space
  - Includes all memories
  - Byte addressable

# **Questions?**

#### XIX

#### Packet Processor Hardware (Microengines)

# Microengines

- Parallel hardware units
  - Eight on IXP2400
  - Sixteen on IXP28x0
- Handle fast data path processing
- Known as *microengine version 2 MEv2*)

# **Role Of Microengines**

- Packet ingress from physical layer hardware
- Checksum verification
- Header processing and classification
- Packet buffering in memory
- Table lookup and forwarding
- Header modification
- Checksum computation
- Packet egress to physical layer hardware

# **Microengine Characteristics**

- Programmable microcontroller
- RISC design
- Two hundred fifty-six general-purpose registers
- Five hundred twelve transfer registers
- One hundred twenty-eight Next Neighbor registers
- Hardware support for four threads and context switching
- Six hundred forty words of local memory
- Sixteen entry CAM with thirty-two bits per entry

## Microengine Characteristics (continued)

- Control of an Arithmetic Logic Unit (ALU)
- Direct access to various functional units
- A unit to compute a Cyclic Redundancy Check (CRC)

# **Microengine Level**

- Not a typical CPU
- Does not contain native instruction for each operation
- Controls other units on the chip
- Really a *microsequencer*

#### **Consequence Of Microsequencing**

Because it functions as a microsequencer, a microengine does not provide native hardware instructions for arithmetic operations, nor does it provide addressing modes for direct memory access. Instead, a program running on a microengine controls and uses functional units on the chip to access memory and perform operations.

## **Microengine Instruction Set (Part 1)**

| Description                                          | Instruction                             |  |  |  |
|------------------------------------------------------|-----------------------------------------|--|--|--|
| General instructions (Arithmetic, Rotate, And Shift) |                                         |  |  |  |
| ALU                                                  | Perform an ALU operation                |  |  |  |
| ALU_SHF                                              | Perform an ALU operation and shift      |  |  |  |
| ASR                                                  | Perform an arithmetic right shift       |  |  |  |
| BYTE_ALIGN_BE, BYTE_ALIGN_LE                         | Concatenate registers and select bytes  |  |  |  |
| CRC_LE, CRC_BE                                       | Compute CRC (big or little endian)      |  |  |  |
| DBL_SHF                                              | Concatenate and shift two longwords     |  |  |  |
| MUL_STEP                                             | Multiply two unsigned integers          |  |  |  |
| FFS                                                  | Find position of LSB in register        |  |  |  |
| POP_COUNT                                            | Count 1 bits in a register              |  |  |  |
| IMMED                                                | Load immediate 16-bit value to register |  |  |  |
| IMMED_B0 through IMMED_B3                            | Load immediate byte to a field          |  |  |  |
| IMMED_W0, IMMED_W1                                   | Load immediate 16-bit word to a field   |  |  |  |
| LD_FIELD, LD_FIELD_W_CLR                             | Load bytes to specified fields          |  |  |  |
| LOAD_ADDR                                            | Load instruction address                |  |  |  |
| LOCAL_CSR_RD, LOCAL_CSR_WR                           | Read or write local microengine CSRs    |  |  |  |
| NOP                                                  | No operation                            |  |  |  |

## **Microengine Instruction Set (Part 2)**

| Description                  | Instruction                                  |  |  |
|------------------------------|----------------------------------------------|--|--|
| Branch and Jump Instructions |                                              |  |  |
| BCC                          | Branch on condition code                     |  |  |
| BR                           | Branch unconditionally                       |  |  |
| BR_BCLR, BR_BSET             | Branch if bit clear or set                   |  |  |
| BR=BYTE, BR!=BYTE            | Branch if byte equal or not equal to literal |  |  |
| BR=CTX, BR!=CTX              | Branch on current context                    |  |  |
| BR_INP_STATE, BR_!INP_STATE  | Branch on event state                        |  |  |
| BR_SIGNAL, BR_!SIGNAL        | Branch if signal deasserted                  |  |  |
| JUMP                         | Jump to label                                |  |  |
| RTN                          | Return from branch or jump                   |  |  |

## **Microengine Instruction Set (Part 3)**

| Description                                   | Instruction                               |  |  |  |
|-----------------------------------------------|-------------------------------------------|--|--|--|
| Content Addressable Memory (CAM) Instructions |                                           |  |  |  |
| CAM_CLEAR                                     | Clear all entries in local CAM            |  |  |  |
| CAM_WRITE_STATE                               | Write state bits into specified CAM entry |  |  |  |
| CAM_READ_TAG                                  | Read tag for specified CAM entry          |  |  |  |
| CAM_READ_STATE                                | Read state bits for specified CAM entry   |  |  |  |
| CAM_LOOKUP                                    | Search local CAM for tag value            |  |  |  |
| CAM_WRITE                                     | Write tag value for specified CAM entry   |  |  |  |

# **Microengine Instruction Set (Part 4)**

| Instruction                                 | Description                             |  |  |  |
|---------------------------------------------|-----------------------------------------|--|--|--|
| I/O And Context Swap Instructions           |                                         |  |  |  |
| DRAM (read and write)                       | Perform an ALU operation                |  |  |  |
| DRAM (RBUF and TBUF)                        | Perform an ALU operation and shift      |  |  |  |
| CAP (CSR addressing)                        | Perform an arithmetic right shift       |  |  |  |
| CAP (calculated addressing)                 | Concatenate registers and select bytes  |  |  |  |
| CAP (reflect)                               | Compute CRC (big or little endian)      |  |  |  |
| CTX_ARB Concatenate and shift two longwords |                                         |  |  |  |
| HALT                                        | Multiply two unsigned integers          |  |  |  |
| HASH Find position of LSB in register       |                                         |  |  |  |
| MSF Count 1 bits in a register              |                                         |  |  |  |
| PCI                                         | Load immediate 16-bit value to register |  |  |  |
| SCRATCH (read and write)                    | Load immediate byte to a field          |  |  |  |
| SCRATCH (atomic operation)                  | Load immediate 16-bit word to a field   |  |  |  |
| SCRATCH (ring operation)                    | Load bytes to specified fields          |  |  |  |
| SRAM (read and write)                       | Load instruction address                |  |  |  |
| SRAM (atomic operation)                     | Read or write local microengine CSRs    |  |  |  |
| SRAM (CSR)                                  | Load or store values in CSR registers   |  |  |  |
| SRAM (read queue descriptor)                | Access queue in SRAM                    |  |  |  |
| SRAM (write queue descriptor)               | Change queue in SRAM                    |  |  |  |
| SRAM (enqueue)                              | Enqueue item in SRAM queue              |  |  |  |
| SRAM (dequeue)                              | Dequeue item from SRAM queue            |  |  |  |
| SRAM (ring operation)                       | Manipulate a communication ring in SRAM |  |  |  |
| SRAM (journal operation)                    | Perform atomic operation in SRAM        |  |  |  |

## **Microengine View Of Memory**

- Separate address spaces
- Specific instruction to reference each memory type
  - Instruction *dram* to access DRAM memory
  - Instruction *sram* to access SRAM memory
  - Instruction *scratch* to access Scratchpad memory
- Consequence: early binding of data to memory

#### **Six-Stage Instruction Pipeline**

| Stage | Description                                                           |
|-------|-----------------------------------------------------------------------|
| 1     | Fetch the next instruction (part 1)                                   |
| 2     | Fetch the next instruction (part 2)                                   |
| 3     | Decode the instruction and get register address(es)                   |
| 4     | Extract the operands from registers                                   |
| 5     | Perform ALU, shift, or compare operations and set the condition codes |
| 6     | Write the results to the destination register                         |

#### **Example Of Pipeline Execution**

|      | clock | stage 1 | stage 2 | stage 3 | stage 4 | stage 5 | stage 6 |
|------|-------|---------|---------|---------|---------|---------|---------|
| Time | 1     | inst. 1 | -       | -       | -       | -       |         |
| I    | 2     | inst. 2 | inst. 1 | -       | -       | -       | -       |
|      | 3     | inst. 3 | inst. 2 | inst. 1 | -       | -       | -       |
|      | 4     | inst. 4 | inst. 3 | inst. 2 | inst. 1 | -       | -       |
|      | 5     | inst. 5 | inst. 4 | inst. 3 | inst. 2 | inst. 1 | -       |
|      | 6     | inst. 6 | inst. 5 | inst. 4 | inst. 3 | inst. 2 | inst. 1 |
| •    | 7     | inst. 7 | inst. 6 | inst. 5 | inst. 4 | inst. 3 | inst. 2 |
|      | 8     | inst. 8 | inst. 7 | inst. 6 | inst. 5 | inst. 4 | inst. 3 |

• Once pipeline is started, one instruction completes per cycle

# **Instruction Stall**

- Occurs when operand not available
- Processor temporarily stops execution
- Reduces overall speed
- Should be avoided when possible

#### **Example Instruction Stall**

• Consider two instructions:

K: ALU operation to add the contents of R1 to R2K+1: ALU operation to add the contents of R2 to R3

- Second instruction cannot access R2 until value has been written
- Stall occurs

#### **Effect Of Instruction Stall**

|            | clock | stage 1   | stage 2   | stage 3   | stage 4   | stage 5   | stage 6   |
|------------|-------|-----------|-----------|-----------|-----------|-----------|-----------|
| <b>T</b> : | 1     | inst. K   | inst. K-1 | inst. K-2 | inst. K-3 | inst. K-4 | inst. K-5 |
| Time<br>I  | 2     | inst. K+1 | inst. K   | inst. K-1 | inst. K-2 | inst. K-3 | inst. K-4 |
|            | 3     | inst. K+2 | inst. K+1 | inst. K   | inst. K-1 | inst. K-2 | inst. K-3 |
|            | 4     | inst. K+3 | inst. K+2 | inst. K+1 | inst. K   | inst. K-1 | inst. K-2 |
|            | 5     | inst. K+3 | inst. K+2 | inst. K+1 | -         | inst. K   | inst. K-1 |
|            | 6     | inst. K+3 | inst. K+2 | inst. K+1 | -         | -         | inst. K   |
| •          | 7     | inst. K+4 | inst. K+3 | inst. K+2 | inst. K+1 | -         | -         |
|            | 8     | inst. K+5 | inst. K+4 | inst. K+3 | inst. K+2 | inst. K+1 | -         |

- Bubble develops in pipeline
- Bubble eventually reaches final stage

#### **A Note For Programmers**

Understanding the execution pipeline is important for programmers because dependencies among instructions can cause the processor to stall, which lowers performance.

# **Sources Of Delay**

- Access to result of previous / earlier operation
- Conditional branch
- Memory access

#### **Memory Access Delays**

| Type Of      | Approx. Access Time In Clock Cycles |         |  |  |
|--------------|-------------------------------------|---------|--|--|
| Memory       | IXP2400                             | IXP28x0 |  |  |
| Local Memory | 1                                   | 1       |  |  |
| Scratchpad   | 60                                  | 60      |  |  |
| SRAM         | 150                                 | 90      |  |  |
| DRAM         | 300                                 | 120     |  |  |

• Delay is surprisingly large

## **Threads Of Execution**

- Technique used to speed processing
- Multiple *threads of execution* remain ready to run
- Program defines threads and informs processor
- Processor runs one thread at a time
- Processor automatically switches context to another thread when current thread blocks
- Known as *hardware threads*
- Microengine has eight threads

# **Illustration Of Hardware Threads**



- White ready but idle
- Blue being executed by microengine
- Gray blocked (e.g., during memory access)

#### **The Point Of Hardware Threads**

Hardware threads increase overall throughput by allowing a microengine to handle up to four packets concurrently; with threads, computation can proceed without waiting for memory access.

# **Context Switching Time**

- *Low-overhead context switch* means one instruction delay as hardware switches from one thread to another
- Zero-overhead context switch means no delay during context switch
- IXP2xxx offers zero-overhead context switch

# **Microengine Instruction Store**

- Private instruction store per microengine
- Advantage: no contention
- Disadvantage: smaller size (4000 instructions)

## **General-Purpose Registers**

- two hundred fifty-six per microengine
- Thirty-two bits each
- Used for computation or intermediate values
- Divided into banks
- Context-relative or absolute addresses

# **Forms Of Addressing**

- Absolute
  - Entire set available
  - Uses integer from 0 to 255
- Context-relative
  - One eighth of set available to each thread
  - Uses integer from 0 to 31
  - Allows same code to run on multiple microengines

#### **Register Banks**

- Mechanism commonly used with RISC processor
- Registers divided into A bank and B bank
- Maximum performance achieved when each instruction references a register from the A bank and a register from the B bank

## **Summary Of General-Purpose Registers**

| Number Of<br>Active<br>Contexts | Active<br>Context<br>Number | General Purpose Register<br>Absolute Addresses |           | S Transfer<br>or Neighbor | D Transfer |
|---------------------------------|-----------------------------|------------------------------------------------|-----------|---------------------------|------------|
|                                 |                             | A Port                                         | B Port    | Index                     | Index      |
| 8                               | 0                           | 0 - 15                                         | 0 - 15    | 0 - 15                    | 0 - 15     |
|                                 | 1                           | 16 - 31                                        | 16 - 31   | 16 - 31                   | 16 - 31    |
|                                 | 2                           | 32 - 47                                        | 32 - 47   | 32 - 47                   | 32 - 47    |
|                                 | 3                           | 48 - 63                                        | 48 - 63   | 48 - 63                   | 48 - 63    |
|                                 | 4                           | 64 - 79                                        | 64 - 79   | 64 - 79                   | 64 - 79    |
|                                 | 5                           | 80 - 95                                        | 80 - 95   | 80 - 95                   | 80 - 95    |
|                                 | 6                           | 96 - 111                                       | 96 - 111  | 96 - 111                  | 96 - 111   |
|                                 | 7                           | 112 - 127                                      | 112 - 127 | 112 - 127                 | 112 - 127  |
| 4                               | 0                           | 0 - 31                                         | 0 - 31    | 0 - 31                    | 0 - 31     |
|                                 | 2                           | 32 - 63                                        | 32 - 63   | 32 - 63                   | 32 - 63    |
|                                 | 4                           | 64 - 95                                        | 64 - 95   | 64 - 95                   | 64 - 95    |
|                                 | 6                           | 96 - 127                                       | 96 - 127  | 96 - 127                  | 96 - 127   |

• Note: half of the registers for each context are from A bank and half from B bank

# **Transfer Registers**

- Used to buffer external memory transfers
- Example: read a value from memory
  - Copy value from memory into transfer register
  - Move value from transfer register into general-purpose register
- Five hundred twelve per microengine
- Divided into four types
  - SRAM or DRAM
  - Read or write

# **Next Neighbor Registers**

- Provide high-speed, synchronized communication
- Allows data to pass between microengines
- Handle small values
- Typically used to pass buffer address, not entire packet
- Used to build software pipeline

#### Next Neighbor Register Hardware

| Primitive | Туре     | Purpose                                    |
|-----------|----------|--------------------------------------------|
| NN_Get    | Register | Extract next item from next neighbor ring  |
| NN_Put    | Register | Insert item in a next neighbor ring        |
| NN_FULL   | Signal   | Test whether a next neighbor ring is full  |
| NN_EMPTY  | Signal   | Test whether a next neighbor ring is empty |

• Hardware polls state bit

## **Local Memory**

- Private (one per microengine)
- Small size (2560 bytes)
- Low latency (one instruction cycle after setup)
- Read or written under program control
- Accessed via special hardware registers
  - Address placed in *LM\_ADDR0*
  - Value accessed via *LM\_ADDR1*

# **Content Addressable Memory (CAM)**

- Used to speed searches
- Characteristics
  - Sixteen entries
  - Thirty-two bit search key per entry
  - Four-bit status value per entry
  - Single instruction lookup
  - Hardware reports first entry that matches

# **Organization Of CAM**

| tag for entry | state bits |
|---------------|------------|
|               |            |

### Hardware Bits Returned For CAM Operation



# **Local Control And Status Registers**

- Used to interrogate or control the IXP2xxx
- All mapped into XScale address space
- Microengine can only access its own local CSRs

### **Example Local CSRs**

| Local CSR            | Purpose                                                         |
|----------------------|-----------------------------------------------------------------|
| USTORE_ADDRESS       | Load the microengine control store                              |
| USTORE_DATA_LOWER    | Lower 20 bits of the instruction                                |
| USTORE_DATA_UPPER    | Upper 12 bits of the instruction                                |
| USTORE_ERROR_STATUS  | Error status bits                                               |
| ALU_OUT              | Debugging: allows XScale to read<br>GPRs and transfer registers |
| CTX_ARB_CTL          | Context arbiter control                                         |
| CTX_ENABLES          | Context arbiter control                                         |
| CC_ENABLE            | Debugging: read condition codes                                 |
| CSR_CTX_POINTER      | Used to modify context-specific CSRs                            |
| INDIRECT_CTX_STS     | To access context-specific PC                                   |
| ACTIVE_CTX_STS       | Find context currently running                                  |
| TIMESTAMP_HIGH       | Clock (high-order bits)                                         |
| TIMESTAMP_LOW        | Clock (low-order bits)                                          |
| PSEUDO_RANDOM_NUMBER | Random value                                                    |

• Note: *n* is digit from 0 through 7 (hardware contains a separate CSR for each of the eight contexts).

# **Interprocessor Communication Mechanisms**

- Context-to-XScale communication
- Context-to-Context communication (one or more IXP2xxx's)

# **Context-To-XScale Communication**

- Interrupts (SHaC unit)
- Shared memory
- Memory ring mechanism
  - SRAM
  - Scratchpad

### **Context-to-Context communication**

- Signal event mechanism
- Memory ring mechanism
- Next neighbor registers
- Reflector bus mechanism

# **SHaC Unit**

- Operates as coprocessor
- Controls
  - Scratchpad memory
  - Hash unit
  - Communication mechanism used by microengines
  - CSR bus interface
  - Push / pull reflector

### **SHaC Architecture (simplified)**



43

# **ScratchPad Memory**

- Organized into 4K words of 4 bytes each
- Offers special facilities
  - Atomic operations
    - \* Set or clear bits
    - \* Increment, decrement, add, or subtract
    - \* Swap values
  - Communication rings

# Hash Unit

- Configurable coprocessor
- Operates asynchronously
- Intended for fast table lookup

### **Hash Unit Computation**

• Computes quotient Q(x) and remainder R(x):

 $A(x) \, \ast \, M(x) \, / \, G(x) \ \rightarrow \ Q(x) + R(x)$ 

- A(x) is input value
- M(x) is hash multiplier (configurable)
- G(x) is built-in value
- Three values for G (48-bit, 64-bit, or 128-bit hash)

### **Hash Mathematics**

- Integer value interpreted as polynomial over field [0,1]
- Example:

20401<sub>16</sub>

• Is interpreted as

 $x^{17} + x^{10} + 1$ 

• Similarly, value G(x) used in 48-bit hash

1001002000401<sub>16</sub>

• Is interpreted as

$$x^{48} + x^{36} + x^{25} + x^{10} + 1$$

#### Hash Example

 $A = 8000000001_{16} (x^{47} + 1)$   $G = 1001002000401_{16} (x^{48} + x^{36} + x^{25} + x^{10} + 1)$  $M = 20D_{16} (x^9 + x^3 + x^2 + 1)$ 

• Hash computes R, remainder of M times A divided by G

H(X) = R = A \* M % G

# Hash Example (continued)

• We see that

$$A(x) * M(x) = x^{56} + x^{50} + x^{49} + x^{47} + x^9 + x^3 + x^2 + 1$$

• Furthermore:

A \* M = Q \* G + R

# Hash Example (continued)

• Where

$$Q(x) = x^8 + x^2 + x^1$$

• Thus, Q is  $106_{16}$  and R is

$$R(x) = x^{47} + x^{44} + x^{38} + x^{37} + x^{33} + x^{27} + x^{26} + x^{18} + x^{12} + x^{11} + x^9 + x^8 + x^3 + x^1 + 1$$

• The hash unit returns R as the value of the computation:

$$H(A) = R = 90620C041B0B_{16}$$

### **Other IXP2xxx Hardware**

- The IXP2xxx contains registers used for
  - Configuration and bootstrapping
  - Control of functional units and buses
  - Checking status of processors, threads, and onboard functional units

### **The Point About Registers**

In addition to basic functional units, the IXP contains hundreds of registers that allow software to configure, control, or interrogate the status of functional units, buses, and attached devices.

### **Media Switch Fabric Interface**

- Complex unit
- Primary interface to high-speed external devices
- Configurable to handle standard MACs such as

| - UTOPIA0 | (IXP2400 only) |
|-----------|----------------|
| – SPI-3   | (IXP2400 only) |
| – SPI-4.2 | (IXP28x0 only) |

### **Transmit And Receive BUFs**

- Used for I/O
- Contained in MSF unit
- Function as randomly accessible memory
- Transfer in chuncks of 64, 128, or 256 bytes
- Two types
  - Receive BUFs (RBUFs) handle input
  - Transmit BUFs (TBUFs) handle output

# **Crypto Unit**

- Available on the IXP2850
- Two units
- Can be used for
  - Two 3DES/DES (*Data Encryption Standard*) cores for data encryption/decryption
  - One AES (*Advanced Encryption Standard*) core for data encryption/decryption that can use 128, 192 or 256 bit keys
  - Two SHA-1 (*Secure Hash Algorithm*) cores for authentication
- Programmer chooses

# Ctypto Unit (continued)

- Support for
  - The *Electronic Code Book* standard (*ECB*)
  - The Cipher Block Chaining standard (CBC)
- Sufficient for
  - IPsec
  - SSL

# **Crypto Unit API**

- Input RAM (read/write to on-board RAM)
- State (set crypto parameters, e.g. keys)
- Cipher (initiate cypher algorithm execution)
- Hash (initiate hash algorithm execution)
- Utils (functions such as checksum calculation)

### Summary

- Microengines
  - Low-level, programmable packet processors
  - Use RISC design with instruction pipeline
  - Have hardware threads for higher throughput
  - Use transfer registers to access memory
  - Use BUFs for I/O
  - Have access to hash and crypto units

# **Questions?**

#### XXI

# **Programming Model**

### Assumptions About Support Software And Overall Structure

- XScale runs MontaVista Embedded Linux®
- Code for XScale compiled to run under Linux
- Microengines do not run any OS
- Code for microengines compiled to run on bare machine
- Consequences for programmers:
  - Microcode handles all hardware details
  - XScale code relies on libraries and OS

### **Major Pieces Of Software**

- One or more *microblocks* that run on the microengines
- A *core component* that runs on the XScale
- User interface code that runs on the XScale
- Note: we will concentrate on the first two

### **Interconnections Among Microblocks**

- Each microblock is asynchronous
- Fast data path code built as series of microblock stages
- Basic pipeline architecture

# **Typical Network Systems**

- At least three microblocks
  - Ingress
  - Processing
  - Egress

### **Example Microblock Pipeline**



• Ingress and egress microblocks

- Interface with MSF
- Handle packet I/O
- Process microblock
  - Performs protocol processing

## **Assignment Of Microblocks To Microengines**

- Approach #1
  - Multiple types of microblocks run on each microengine
  - Each thread runs one microblock
- Approach #2
  - Each microblock runs on separate microengine
  - Multiple copies (threads) used to increase performance
- In practice, most systems use approach #2

## **Mpackets And Transfers**

- MSF divides incoming packet into fixed-size units called *mpackets*
- Mpacket size can be configured to be 64, 128, or 256 octets
- Each mpacket received independently
- Hardware sets bit to indicate first and last mpacket of a packet
- Software divides outgoing packet into mpackets
- Each mpacket transmitted independently

#### **Ingress And Egress Microblocks**

- Available from *building blocks library*
- Ingress microblock
  - Named *Receive* (*RX*)
  - Invoked whenever an mpacket arrives
  - Places mpacket in buffer in memory
- Egress microblock
  - Named Transmit (TX)
  - Invoked whenever an mpacket is ready for egress
  - Releases buffer after mpacket is sent

#### **Microblocks And Parallel Execution**

- Microengine can be dedicated to run exactly one microblock
- Microengine can run multiple microblocks in a pipeline
- Multiple microengines can run copies of a pipeline
- Each approach has advantages and disadvantages

#### **Packet Buffers**

- Mechanism provided by SDK
- Set of fixed-size buffers allocates in DRAM
- Typical buffer size is 2048
- Buffer holds
  - Packet (e.g., Ethernet frame)
  - Control information called *metadata*

## **Placement Of A Packet In A Buffer**

- Metadata occupies first 128 bytes of buffer
- Padding separates metadata from packet
- Optimizes memory access
  - DRAM organized into four banks
  - All banks can be accessed in parallel
  - Bank starts at 128 bytes beyond previous
- Goal: distribute packet headers evenly over banks

## **Illustration Of Buffer Layout**



• Padding is selected at random

NSD-Intel -- Chapt. 21

## **Buffer Queues And Buffer Allocation**

- Hardware support for high-speed buffer allocation
- Mechanism implements *First-In-First-Out* (*FIFO*)
  - Singly-linked list
  - Head and tail pointers
  - Two basic operations
    - \* Enqueue adds item at tail
    - \* Dequeue removes item from head

## Buffer Queues And Buffer Allocation (continued)

- FIFO mechanism
  - Provided by SRAM memory controller
  - Known as *Queue Array*
- Packet queues kept in DRAM
- Linear mapping between items in SRAM Queue Array and DRAM buffers

## **Mapping Between Queue Array And DRAM Buffers**



• Linear mapping between address of queue element in SRAM and address of buffer in DRAM

# **Example Address Mapping**

- Thirty-two buffers allocated in DRAM
  - Let *B* denote starting location
  - Assume buffers are contiguous
- Queue of thirty-two list elements allocated in SRAM
  - Let *F* denote starting location
  - Assume elements are contiguous
  - Known as free buffer list
- Let *A* be address of list element in SRAM returned by dequeue operation

#### Example Address Mapping (continued)

• Address of corresponding buffer in DRAM is

buffer address =  $B + \frac{A - F}{free \ list \ element \ size} \times buffer \ size$ 

- Where *buffer size* denotes the size of a packet buffer
- Optimization: to avoid division, precompute

DL\_DS\_RATIO = 
$$\frac{buffer \ size}{free \ list \ element \ size}$$

and

$$DL\_REL\_BASE = B - F \times DL\_DS\_RATIO$$

## Example Address Mapping (continued)

• Once constants are precomputed, a buffer address is found by:

*bufferaddress* = A × DL\_DS\_RATIO + DL\_REL\_BASE

• Note: if buffer and free list elements are powers of two, multiplication can be replaced by bitwise shift

#### **Buffer Handle**

- List element address in SRAM
- Used as packet ID
- Is translated to buffer address as needed
- Consists of four bytes
  - Three bytes correspond to SRAM address
  - One byte used to pass additional information
    - \* Beginning of packet
    - \* End of packet
    - \* Count of segments in the packet

#### **Packet Discard**

- Uses buffer handle
- Extremely efficient
- To discard, dequeue buffer handle on free list
- No other action required

## **Packet Forwarding Mechanism**

- Hardware mechanism used for interprocess communication
- Supported by both SRAM and Scratch memories
- Known as *Memory Ring*
- Controller implements insertion and extraction
- Requests serialized and atomic
- Used for high-speed forwarding of packets among microengines and XScale

#### **Illustration Of Scratch Rings**



#### **Queue Array Hardware Limitation**

- Hardware only has sixty-four Queue Array entries per SRAM channel.
  - IXP2400 has 2 SRAM channels
  - IXP28xx has 4 SRAM channels
- Consequence: cannot have arbitrarily many Rings

## **Overcoming The Queue Array Limitation**

- Resource manager allocates backup store in SRAM for each Queue Array entry
- When Queue Array exhausted, software
  - Chooses one Queue Array entry (typically LRU)
  - Copies entry to backup store
  - Uses hardware slot for new Queue Array
- Process is known as *spilling*
- CAM can be used used to choose entry to spill

#### **Core Processing**

- Terminology
  - XScale is referred to as *core processor*
  - Software running on the XScale is called *core component*
- Core processing
  - Insufficient speed for fast path
  - Reserved for *exceptions*
- Note: core processor has access to memory rings and queues

## Summary

- Two major types of software
  - One or more *microblocks*
  - Core component
- Microblocks interconnected in pipline
- SDK includes ingress and egress microblocks
- Packet buffers allocated in DRAM; free list kept in SRAM
- Address of list element in SRAM called *buffer handle*

#### Summary (continued)

- Linear mapping translates buffer handle to DRAM address when needed
- Packet discard is trivial (return buffer handle to free list)
- Memory ring mechanism used for IPC
- Queue array hardware provides finite set of queues; values spilled to SRAM when necessary

# **Questions?**

#### XXII

#### **XScale Facilities**

## **XScale Responsibilities**

- Loading microengine code
- Creating rings and / or queues for communication
- Allocating and reclaiming resources such as memory
- Patching symbols in microcode
- Starting and controlling microengine operation
- Providing an external interface for management
- Providing an interface to operating system facilities
- Doing slow path processing of exception packets

## **Conceptual Organization Of XScale Software**

- Several pieces of support software available
- Core component uses each piece directly or indirectly
- Each piece of support software implemented as a Linux *loadable kernel module*
- Core component also implemented as a kernel module

#### **Organization Of Software On XScale**



## **Core Component Infrastructure (CCI)**

- Support module
- Provides facilities to
  - Create and run core components
  - Setup timers
  - Give each core component a private *execution engine*

## **Resource Manager (RM)**

- Support module
- Among most important
- Provides facilities used to
  - Access operating system services
  - Manage memory and translate addresses
  - Control microengines
  - Load code into microengines
  - Manage queue of exception packets sent to core component

# **Operating System Specific Library (OSSL)**

- Unfortunate name
- Role of the OSSL is operating system independence
- Core component calls OSSL function
- OSSL function calls undelrying OS function
- Allows core component to remain independent of the underlying OS

#### **Hardware Abstraction Layer (HAL)**

- Two HAL modules
  - Library *halMev2\_lib* provides control functions
  - Module *halMeDrv* provides access to microengine CSRs
- Isolate programmer from hardware details

#### **Memory Management**

- SRAM, DRAM, and Scratch memories shared among microengines and XScale
- However
  - Addressing schemes used by microengines and XScale differ
  - Parallel processors cannot attempt to allocate memory without mutual exclusion
- SDK solution: all memory management (allocation and deallocation occurs through Resource Manager on XScale

#### Memory Management Functions In The Resource Manager

- ix\_rm\_mem\_alloc
  - Argument specifies DRAM, SRAM, or Scratch memory
  - Although XScale uses single address space, Resource Manager ensures allocation is made from specified memory
- ix\_rm\_mem\_free

#### **Memory Allocation By Microengine**

- Microengine
  - Can make an allocation directly
  - Must inform Resource Manager after allocation complete
  - Known as a *reservation*
- To make a reservation, core component calls
  - ix\_rm\_mem\_reserve

## **Allocation Of Local Memory**

- Only visible to individual microengine
- However, Resource Manager provides functions that allow microengine to allocate and free Local Memory
  - ix\_rm\_mem\_local\_alloc
  - ix\_rm\_mem\_local\_free
  - ix\_rm\_mem\_local\_reserve

#### **Address Translation**

- Functions in Resource Manager always return a virtual address in the XScale's address space
- Address must be translated to physical address before microengine can use it
- Address from microengine must be translated to virtual address before core component can use it
- All translation performed by core component using Resource Manager functions
  - ix\_rm\_get\_physical\_offset
  - ix\_rm\_get\_virtual\_address

# **Ring And Queue Creation**

- Provided by Resource Manager
- Three functions
  - ix\_rm\_hw\_queue\_create
  - ix\_rm\_hw\_sram\_ring\_create
  - ix\_rm\_hw\_scratch\_ring\_create
- Return a *handle*
- Value of handle can be predeclared and used as argument

# **Ring And Queue Deletion**

- Also performed by Resource Manager
- Two functions
  - ix\_rm\_hw\_queue\_delete
  - ix\_rm\_hw\_ring\_delete

# **Ring And Queue Manipulation**

- Performed by Resource Manager
- Functions are
  - ix\_rm\_hw\_enqueue
  - ix\_rm\_hw\_dequeue
  - ix\_rm\_hw\_ring\_put
  - ix\_rm\_hw\_ring\_get

#### **Buffer Management Facilities**

- Special case of queue
- Function to create free list is
  - ix\_rm\_buffer\_free\_list\_create

#### **Basic Form Of Core Component**

do forever {
 wait for next packet from the microengines;
 process the packet;
}

• Note: although macroengines typically send a packet, the mechanism allows a microengine to send a "message"

### **Core Processing**

- Questions
  - How does a core component avoid using the CPU while waiting for a packet from the microengines?
  - If multiple core components exist to process packets, how is a given packet sent to the correct core component?
  - Can messages that arrive from microengines be processed out-or-order?
- Answer
  - Introduce an additional demultiplexing stage
  - To send a message, microengine interrupts the demultiplexing stage.

#### **Illustration Of Core Architecture**



# Patching Symbols And Loading Microcode

- XScale loads microengine control store
- Symbolic references in code replaced by constant value
- Known as *patching*
- Allows values to change without recompilation
- Assembly *import* directive used to specify name and value to the assembler
  - .import\_var MY\_CONSTANT

#### Difference Between Import And Defined Constants

• The code below produces an error (constant is too large)

#define MY\_CONSTANT 0x12345678
alu[addr, MY\_CONSTANT, +, 4]

• The following code compiles without an error (constant is truncated during patching)

.import\_var MY\_CONSTANT alu [ addr, MY\_CONSTANT, +, 4 ]

#### Use Of Immed32

• To avoid compiler error

.import\_var MY\_CONSTANT
.reg temp\_value
immed32(tmp\_value, MY\_CONSTANT)
alu[addr, tmp\_value, +, 4]

23

### **Resource Manager API**

- Allows a core component to
  - 1. Read microcode from an external binary file and place it in a structure.
  - 2. Define the names and values of a set of imported symbols.
  - 3. Patch the microcode by replacing each imported variable reference with the appropriate value.
  - 4. Load the patched microcode into the microengine instruction store.

# **Example Of Using The Resource Manager**

• Load code from *my\_file* and patch imported constant *MY\_CONSTANT* 

ix\_rm\_ueng\_set\_ucode(my\_name); importSymbols[0].m\_Name = "MY\_CONSTANT"; importSymbols[0].m\_Value = 0x12345678; ix\_rm\_ueng\_patch\_symbols(me\_number, 1, importSymbols); ix\_rm\_ueng\_load();

#### **Resource Manager Functions To Control A Microengine**

• To start a microengine

ix\_rm\_ueng\_start()

• To stop a microengine

ix\_rm\_ueng\_stop()

## Summary

- XScale software includes
  - Base operating system (Linux)
  - Core components
  - Support software
- Core component runs as a loadable Linux loadable kernel module
- Each support software system also runs as Linux loadable kernel module

27

# Summary (continued)

- Support software includes
  - Resource Manager
  - Core Component Infrastructure
  - Operating System Specific Library
  - Hardware Abstraction Layer
- Resource Manager offers API for items such as
  - Memory management and address translation
  - Queue and Ring allocation
  - Control of microengines

# **Questions?**

#### XXIII

#### **Microengine Programming I**

# **Microengine Code**

- Many low-level details
- Close to hardware
- Written in (micro)assembly language

### **Features Of Intel's Microengine Assembler**

- Directives to control assembly
- Symbolic register names
- Macro preprocessor (extension of C preprocessor)
- Set of structured programming macros

#### **Statement Syntax**

• General form:

label: operator operands tokens

- *label* is optional
- Interpretation of *tokens* depends on instruction

#### **Comment Statements**

- Three styles available
  - C style (between /\* and \*/)
  - C++ style (// until end of line)
  - Traditional assembly style (; until end of line)
- Only traditional comments remain in code for intermediate steps of assembly

#### **Assembler Directives**

- Begin with period in column one
- Can
  - Generate code
  - Control assembly process
- Example: associate *myname* with register five in the A register bank

.areg myname 5 a

## **Example Operand Syntax**

• Instruction *alu* invokes the ALU

alu [dst,  $src_1$ , op,  $src_2$ ]

- Four operands
  - Destination register
  - First source register
  - Operation
  - Second source register
- Two minus signs (--) can be specified for destination, if none needed

# **Major ALU Operations**

| Operator   | Meaning                                                                       |
|------------|-------------------------------------------------------------------------------|
| +          | Result is src <sub>1</sub> + src <sub>2</sub>                                 |
| -          | Result is src <sub>1</sub> - src <sub>2</sub>                                 |
| B-A        | Result is src <sub>2</sub> - src <sub>1</sub>                                 |
| В          | Result is src <sub>2</sub>                                                    |
| ~ <b>B</b> | Result is the bitwise inversion of src <sub>2</sub>                           |
| AND        | Result is bitwise and of src <sub>1</sub> and src <sub>2</sub>                |
| OR         | Result is bitwise or of src <sub>1</sub> and src <sub>2</sub>                 |
| XOR        | Result is bitwise exclusive or of src <sub>1</sub> and src <sub>2</sub>       |
| +carry     | Result is src <sub>1</sub> + src <sub>2</sub> + carry from previous operation |
| -carry     | Result is src <sub>1</sub> - src <sub>2</sub> - carry from previous operation |
| ~AND       | Result is bitwise (not src <sub>1</sub> ) and src <sub>2</sub>                |
| AND~       | Result is bitwise (src <sub>1</sub> and (not src <sub>2</sub> )               |
| +8         | Result is $src_1 + src_2$ with the first 24 bits set to zero                  |
| +16        | Result is $src_1 + src_2$ with the first 16 bits set to zero                  |

# **ALU Shift Operations**

- Shifts or rotates src<sub>2</sub> before operation
- Syntax is

alu\_shf [ dst, src1, op, src2, src2\_shift\_op ]

# **Memory Operations**

- Programmer specifies
  - Type of memory
  - Direction of transfer
  - Address in memory (two registers used)
  - Starting transfer register
  - Count of words to transfer
  - Optional tokens

#### Memory Operations (continued)

#### • General forms

sram [direction, xfer\_reg, addr<sub>1</sub>, addr<sub>2</sub>, count], optional\_tokens
dram [direction, xfer\_reg, addr<sub>1</sub>, addr<sub>2</sub>, count], optional\_tokens
scratch [direction, xfer\_reg, addr<sub>1</sub>, addr<sub>2</sub>, count], optional\_tokens

# **Special Memory Operations**

- Some memories offer special operations such as
  - Test-and-set
  - Atomic increment
- Operand *direction* used to specify special operations

# **Memory Addressing**

- Specified with operands  $addr_1$  and  $addr_2$
- Each operand corresponds to register
- Use of two operands can be used to
  - Scale to large memory
  - Use base + offset form

#### **Immediate Instruction**

• Place constant in thirty-two bit register

immed [ dst, ival, shift ]

- Upper sixteen bits of *ival* must be all zeros or all ones
- Operand *shift* specifies bit shift
  - 0 No shift
  - <<0 No shift (same as 0)
  - << 8 Shift to the left by eight bits
  - <<16 Shift to the left by sixteen bits

#### **Other Forms Of Immed Instruction**

- Used to load part of a register
  - immed\_b0 Load byte zero (low-order byte) only
  - immed\_b1 Load byte one only
  - immed\_b2 Load byte two only
  - immed\_b3 Load byte three only
  - immed\_w0 Load word zero (low-order 16 bits) only
  - immed\_w1 Load word one only

#### **Register Names**

- Usually automated by assembler
- Directives available for manual assignment

| Directive    | Type Of Assignment          |
|--------------|-----------------------------|
| .addr name a | Manual assignment to bank A |
| .addr name b | Manual assignment to bank B |
| .reg name    | Automatic assignment        |

2004

#### **Automated Register Assignment**

Intel's microengine assembler uses symbolic names for registers, and then maps each name to a specific register. A programmer can use directives to specify the mapping manually or can allow the assembler to choose a mapping; for generalpurpose and next-neightbor registers, a programmer cannot mix automatic and manual assignments.

#### **Register Names And Meanings**

• Name denotes type of register

| Register Type        | Relative                     | Absolute       |
|----------------------|------------------------------|----------------|
| General-purpose      | register_name                | @register_name |
| SRAM transfer        | <pre>\$register_name</pre>   | -              |
| <b>DRAM</b> transfer | <pre>\$\$register_name</pre> | -              |
| Next neighbor        | n\$register_name             | -              |

#### **Register Allocation**

- Hardware provides both *read* and *write* transfer registers
- Same numbers used
- Separate allocation functions

.xfer\_read *name* .xfer\_write *name* 

# Local Register Scope, Nesting, And Shadowing

- Programmer
  - Uses .*begin* directive to declare register names
  - Defines register names
  - References names in instructions
  - Uses .*end* to terminate scope
- Assembler
  - Assigns registers
  - Chooses bank for each register
  - Replaces names in code with correct reference

# **Illustration Of Automated Register Naming**

- One or more register names specified after .begin
- Example



• Names valid only within scope

#### **Nested Scopes**

- Programmer specifies .begin and .end pair inside a .begin .end pair
- Innermost scope has precedence
- Intel says inner declarations *shadow* outer declarations

#### **Illustration Of Nested Register Scope**



## **Register Assignments And Conflicts**

- Operands must come from separate banks
- Some code sequences cause *conflict*
- Example:

 $Z \leftarrow Q + R;$   $Y \leftarrow R + S;$  $X \leftarrow Q + S;$ 

- No assignment is valid
- Programmer must change code

#### **Macro Preprocessor Features**

- File inclusion
- Symbolic constant substitution
- Conditional assembly
- Parameterized macro expansion
- Arithmetic expression evaluation
- Iterative generation of code

# **Macro Preprocessor Statements**

| Keyword      | Use                                                                                                |
|--------------|----------------------------------------------------------------------------------------------------|
| #include     | Include a file                                                                                     |
| #define      | Definition of a symbolic constant (unparameterized)                                                |
| #define_eval | Definition of a symbolic constant equal to an arithmetic expression                                |
| #undef       | Remove a previous symbolic constant definition                                                     |
| #macro       | Start the definition of a parameterized assembly language macro                                    |
| #endm        | End a macro definition started with #macro                                                         |
| #ifdef       | Start conditional compilation if specified symbolic constant has been defined                      |
| #ifndef      | Start conditional compilation if specified symbolic constant has<br>not been defined               |
| #if          | Start conditional compilation if expression is true                                                |
| #else        | Terminate current conditional compilation and start alternative<br>part of conditional compilation |
| #elif        | Terminate current conditional compilation and start another<br>if expression is true               |
| #endif       | Terminate current conditional compilation                                                          |
| #for         | Start definite iteration to generate a code segment a fixed number of times                        |
| #while       | Start indefinite iteration to generate a code segment while a condition holds                      |
| #repeat      | Start indefinite iteration to repeat a code segment as long as a condition holds                   |
| #endloop     | Terminate an iteration                                                                             |

#### **Macro Definition**

- Can occur at any point in program
- General form:

#### **Macro Example**

• Compute a = b + c + 5

```
/* example macro add5 computes a=b+c+5 */
#macro add5[a, b, c]
    .begin
    .reg tmp
        alu[tmp, c, +, 5]
        alu[a, b, +, tmp]
    .end
#endm
```

• Assumes values a, b, and c in registers

# **Macro Expansion Example**

• Call of *add5[var1, var2, var3]* expands to:

```
.begin
.reg tmp
    alu[tmp, var3, +, 5]
    alu[var1, var2, +, tmp]
.end
```

• Warning: because macros use textual substitution, illegal arguments can generate illegal code

# **Repeated Generation Of A Code Segment**

- Macro preprocessor
  - Supports *#while* statement for iteration
  - Uses *#define\_eval* for arithmetic evaluation
- Can be used to generate sequence of code blocks

#### **Example Of Repeated Code**

• Preprocessor code:

#define LOOP 1
#while (LOOP < 4)
 alu\_shf[reg, -, B, reg, >>LOOP]
#define\_eval LOOP LOOP + 1
#endloop

• Expands to:

alu\_shf[reg, -, B, reg, >>1]
alu\_shf[reg, -, B, reg, >>2]
alu\_shf[reg, -, B, reg, >>3]

# **Structured Programming Directives**

- Make code appear to follow structured programming conventions
- Include *break* statement al la C

| Directive       | Meaning                                                                        |
|-----------------|--------------------------------------------------------------------------------|
| .if             | Conditional execution                                                          |
| .if_unsigned    | Unsigned version of .if                                                        |
| .elif           | Terminate previous conditional execution and start a new conditional execution |
| .elif_unsigned  | Unsigned version of .elif                                                      |
| .else           | Terminate previous conditional execution and define an alternative             |
| .endif          | End .if conditional                                                            |
| .while          | Indefinite iteration with test before                                          |
| .while_unsigned | Indefinite iteration (unsigned)                                                |
| .endw           | End .while loop                                                                |
| .repeat         | Indefinite iteration with test after                                           |
| .until          | End .repeat loop                                                               |
| .until_unsigned | Unsigned version of .until                                                     |
| .break          | Leave a loop                                                                   |
| .continue       | Skip to next iteration of loop                                                 |

#### **Example Of Conditional Compliation**

.if ( conditional\_expression )
 /\* block of microcode statements \*/
.elif ( conditional\_expression )
 /\* block of microcode statements \*/
.elif ( conditional\_expression )
 /\* block of microcode statements \*/
 .
 .
 .

.else

•

/\* block of microcode statements \*/
.endif

## Tests That Can Be Used In A Conditional Expression

| Operator  | Meaning                                                  |
|-----------|----------------------------------------------------------|
| BIT       | Test whether a bit in a register is set                  |
| BYTE      | Test whether a byte in a register equals a constant      |
| COUT      | Test whether a carry occurred on the previous operation  |
| СТХ       | Test the currently executing thread number               |
| SIGNAL    | Test whether a specified signal has arrived for a thread |
| INP_STATE | Test whether the thread is in a specified state          |

## **Mechanisms For Context Switching**

- Context switching is voluntary
- Thread can execute:
  - *ctx\_arb* instruction
  - Reference instruction (e.g., memory reference)

## **Argument To ctx\_arb Instruction**

- Determines disposition of thread
  - *voluntary*: thread suspended until later
  - *signal\_event*: thread suspended until specified event occurs
  - *kill*: thread terminated

## **Context Switch On Reference Instruction**

- Token added to instruction to control context switch
- Two possible values
  - *ctx\_swap*: thread suspended until operation completes
  - *sig\_done*: thread continues to run, and signal posted when operation completes
- Signals available for SRAM, DRAM, PCI bus, etc.

37

## **Example Of Context Switch**

• To perform context switch while waiting for DRAM access:

dram [read, \$\$rbuf0, base, 2, 4], sig\_done [sig\_name]

#### **Indirect Reference**

- Poor choice of name
- Hardware optimization
- Found on other RISC processors
- Result of one instruction modifies next instruction
- Avoids stalls
- Typical use
  - Compute *N*, a count of words to read from memory
  - Modify memory access instruction to read *N* words

## **Fields That Can Be Modified**

- Microengine associated with a memory reference
- Starting transfer register
- Count of words of memory to transfer
- Context number of the hardware context executing the instruction (i.e., context to signal upon completion)
- The mask that specifies a set of signals

## **How Indirect Reference Operates**

- Programmer codes two instructions
  - ALU operation
  - Instruction with *indirect reference* set
- Note: destination of ALU operation is -- (i.e., no destination)
- Hardware
  - Executes ALU instruction
  - Uses result of ALU instruction to modify field in next instruction

## **Example Of Indirect Reference**

• Example code

alu\_shf [ --, --, b, 0x13, << 16 ] scratch [ read, \$reg0, addr1, addr2, 0 ], indirect\_ref

- Memory instruction coded with count of zero
- ALU instruction computes count

## **External Transfers**

- Microengine cannot directly access
  - Memory
  - Buses (I/O devices)
- Intermediate hardware units used
  - Known as *transfer registers*
  - Multiple registers can be used as large, contiguous buffer

#### **External Transfer Procedure**

- Allocate contiguous set of transfer registers to hold data
- Start reference instruction that moves data to or from allocated registers
- Arrange for thread to wait until the operation completes

# **Allocating Contiguous Registers**

- Registers assigned by assembler
- Programmer needs to ensure transfer registers contiguous
- Assembler provides *.xfer\_order* directive
- Example: allocate four continuous SRAM input transfer registers

.reg \$reg1 \$reg2 \$reg3 \$reg4 .xfer\_order\_rd \$reg1 \$reg2 \$reg3 \$reg4

- Notes
  - Ordering affects both read and write registers
  - Directive .*xfer\_order\_wr* available for output

## Summary

- Microengines programmed in assembly language
- Intel's assembler provides
  - Directives for structuring code
  - Macro preprocessor
  - Automated register assignment
- External data access performed through transfer registers

# **Questions?**

#### XXIV

#### **Microengine Programming II**

# **Specialized Memory Operations**

- Ring and Queue manipulation
- Processor coordination (e.g., via atomic bit operations)
- Atomic memory operations (e.g *incr*, and *decr*)

# **Ring And Queue Manipulation**

- SRAM controller provides Queue Array mechanism
- XScale used to create buffers in Queue Array
- Microengine can allocate buffer from free list

sram [ dequeue, xfer, src\_op<sub>1</sub>, src\_op<sub>2</sub> ], tokens

- Handle placed in *xfer* register
- *src* operands encode memory channel and Queue Array number

# **Ring And Queue Manipulation**

• Microengine can return buffer to free lsit

sram [ enqueue, --, src\_op<sub>1</sub>, src\_op<sub>2</sub> ]

• Note: macros assume variables *DL\_DS\_RATIO* and *DL\_REL\_BASE* have been defined

## Processor Coordination Via Bit Testing

- Provided by SRAM and Scratchpad memories
- Atomic operations on individual bits
- Mask used to specify bit in a word
- General form

scratch [ cmd, \$xfer, addr1, addr2 ]

- Operands  $addr_1$  and  $addr_2$  added to form address
- Register *\$xfer* contains 32-bit mask

#### **Bit Manipulation Commands**

| Operation    | Meaning                                                                                   |
|--------------|-------------------------------------------------------------------------------------------|
| set          | Set the specified bits to one                                                             |
| clr          | Set the specified bits to zero                                                            |
| test_and_set | Place the original word in the read transfer register, and set the specified bits to one  |
| test_and_clr | Place the original word in the read transfer register, and set the specified bits to zero |

## **Atomic Memory Operations**

- Memory shared among
  - XScale
  - Microengines
- Need atomic increment to avoid incorrect results
- General form

scratch [ operation, optional\_value, addr1, addr2 ]

• *optional\_value* depends on operation being performed

#### Atomic Memory Operations (continued)

• Possible atomic operations include:

| Operation | Meaning                                                                   |
|-----------|---------------------------------------------------------------------------|
| incr      | Increment the specified word in memory                                    |
| decr      | Decrement the specified work in memory                                    |
| add       | Add the value in a transfer register to the specified word in memory      |
| sub       | Subtract the value in a transfer register to the specified word in memory |

# **Critical Sections And Folding**

- Piece of code that referenes shared variables known as *critical section*
- To ensure correctness, only one thread can execute a critical section at any time (*mutual exclusion*)
- IXP2xxx solution: *sequencing* 
  - Set of threads placed in circular order
  - Thread passes control the "next" thread

# **Steps Used For Sequencing**

```
Let C be the context number (thread ID);
if (C == 0) {
         wait for signal from "previous" microengine;
} else {
         wait for signal from context C-1;
}
access the critical section;
if (C == 7) {
         signal "next" microengine;
} else {
         signal context C + 1;
}
```

# **Optimized Sequencing**

- Steps for sequencing assume threads on a given microengine in consecutive positions of sequence
- To optimize data access
  - First thread on microengine copies shared variable into Local Memory
  - Threads on microengine sequence and use local copy
  - Last thread on microengine copies value back to external memory
- Can be dynamic: use CAM to test whether variable is in Local Memory

# **Control And Status Registers (CSRs)**

- IXP2xxx has dozens of CSRs
- Provide access to hardware units on the chip
- Allow processors to
  - Configure
  - Control
  - Interrogate
  - Monitor
- Access
  - XScale: mapped into address space
  - Microengines: special instructions

# **Cap Instruction**

- Used on microengines to access CSRs
- General form

cap [ cmd, \$xfer\_data, csr\_address ]

• *cmd* is *read*, *write*, or *fast\_wr* 

# **High-Speed CSR Access**

- Some CSRs reachable through fast data path
- Command *fast\_wr* provides fast-path access
- General form

cap [ fast\_wr, immediate\_data, CSR ]

# Reflection

- Move data from a transfer register on one microengine to a transfer register on another microengine
- Uses *cap* instruction
- General form

cap[cmd, xfer, rem\_ME, rem\_reg, rem\_ctx, ref\_count]

## Local CSRs

- Refer to individual microengine
- Can be accessed in single cycle
- Microengine issues *local\_csr\_rd* or *local\_csr\_wr* instruction
- Example

local\_csr\_wr[CSR, src]

#### Local CSRs

• Reading from local CSR requires two steps

local\_csr\_rd[CSR]
immed[destreg, 0]

# **Intel Dispatch Loop Macros**

- Each microengine executes infinite loop
  - Each iteration checks for event and processes event
  - Events are low level (e.g., hardware device becomes ready)
  - Known as *dispatch loop*
- SDK includes over forty predefined macros related to dispatch loop

#### **Examples Of Predefined Dispatch Loop Macros**

| Macro                   | Purpose                                          |
|-------------------------|--------------------------------------------------|
| dl_buf_init             | Initialize the buffer API                        |
| dl_buf_alloc            | Allocate a packet buffer                         |
| dl_buf_free             | Deallocate a packet buffer                       |
| dl_buf_get_desc         | Return SRAM pointer to metadata from a buffer    |
| dl_buf_get_data         | Return DRAM pointer to buffer data area          |
| dl_meta_init_cache      | Populate a metadata cache                        |
| dl_meta_flush_cache     | Flush metadata cache to SRAM                     |
| dl_meta_get_buffer_next | Move to next buffer in a chain                   |
| dl_meta_get_offset      | Find offset of data within a buffer              |
| dl_meta_get_free_list   | Find free list from which buffer was allocated   |
| dl_meta_get_rx_stat     | Extract receive status from a buffer             |
| dl_meta_get_buffer_size | Find the size of data in a given buffer          |
| dl_meta_get_packet_size | Find the total size of a packet                  |
| dl_meta_get_input_port  | Find the input port over which packet arrived    |
| dl_meta_set_output_port | Set the output port to which packet will be sent |

# **Traffic Management And Packet Scheduling**

- Scheduing requires keeping one queue per scheduled flow
- Cannot be achieved with straightforward data pipeline
- Solution: add microblock outside main pipeline

### Arrangement Of Microblocks When Packet Scheduling Used



• Queue manager and scheduler operate independently

# **Accessing Packet Header Fields**

- Microengine insturctions do not address memory directly
- Packet header loaded into transfer registers
- Many details

## **Example Header Access Code (Part 1)**

/\* Allocate eight DRAM transfer registers to hold the packet header \*/ xbuf\_alloc [ \$\$hdr, 8 ]

/\* Reserve two general-purpose registers for the computation \*/
.begin
.reg base offset

/\* Compute the DRAM address of the data buffer \*/
dl\_buf\_get\_data [ base, dl\_buffer\_handle ]

/\* Compute the byte offset of the start of the packet in the buffer \*/ dl\_meta\_get\_offset [ offset ]

#### **Example Header Access Code (Part 2)**

/\* Load thirty-two bytes of data from DRAM into eight DRAM \*/
/\* transfer registers. Start at DRAM address base + offset \*/
dram [ read, \$\$hdr0, base, offset, 4 ]

/\* Inform the assembler that we have finished using the two \*/
/\* registers: base and offset \*/
.end

/\* Process the packet header in the DRAM transfer registers
/\* starting at register \$\$hdr \*/

/\* Free the DRAM transfer registers when finished \*/
xbuf\_free [ \$\$hdr ]

• • •

# **Dispatch Loop And Associated Variables**

- Typical operation
  - Check for arrival of packet on Hardware Ring from previous microengine
  - Invoke procedure to process packet
  - Place packet on Hardware Ring that leads to next microengine
- Set of variables (registers) control operation of dispatch loop

#### **Examples Of Intel Dispatch Loop Variables**

| Variable           | Size    | Value And Meaning                           |
|--------------------|---------|---------------------------------------------|
| exception_id       | 8 bits  | ID of an exception handler on the XScale    |
| exception_code     | 8 bits  | A value passed with an exception packet     |
| dl_next_block      | 8 bits  | ID of next logical block for a packet       |
| dl_buf_handle      | 32 bits | Buffer handle for start of the packet       |
| dl_eop_buf_handle  | 32 bits | Buffer handle for end of the packet         |
| buffer_size        | 16 bits | Length of the buffer containing the packet  |
| packet_siz         | 16 bits | Total length of packet (across all buffers) |
| buffer_offset      | 16 bits | Offset of data from the start of the buffer |
| input_port         | 16 bits | Logical port over which the packet arrived  |
| rx_stat            | 4 bits  | Status flag bits (unicast, broadcast, etc.) |
| output_port_egress | 24 bits | Port over which packet is to be sent        |
| output_port_fabric | 8 bits  | Blade ID when multiple blades used          |
| output_port_type   | 4 bits  | Hardware type of output interface           |
| cache_flags        | 4 bits  | Control header caching (64 bytes of packet) |
| next_hop_id        | 32 bits | ID of the next hop for the packet           |
| flow_id            | 32 bits | Flow ID for metering / policing             |
| queue_id           | 16 bits | Output queue for traffic management         |

# **Header Caching**

- Packets reside in DRAM
- Accessing header fields is expensive
- To optimize access, copy header into Local Memory
- Think of copy as a *cache*
- SDK includes mechanisms to perfrom header caching

# Packet I/O

- Physical frame divided into sixty-four octet units for transfer
- Each unit known as *mpacket*
- Division performed by interface hardware
- Microengine uses MSF interface to transfer each mpacket separately
- Hardware set two bits in RBUF to specify whether
  - Mpacket is first packet of a frame
  - Mpacket is last packet of a frame
- Note: cell or small packet has both bits set

# Packet I/O (continued)

- No interrupts
- No DMA
- Dispatch loop in ingress or egress microblock uses polling

29

• Microengine performs transfer

## **Ingress Packet Transfer**

- Incoming mpacket moved from Receive BUF into
  - SRAM transfer registers
  - Directly into DRAM
- DRAM transfer has form

msf [ *cmd*, --, *addr*<sub>1</sub>, *addr*<sub>2</sub>, *count* ], *tokens* 

#### Ingress Packet Transfer (continued)

• Transfer to SRAM transfer register has the form

msf [ *cmd*, \$*xfer*, *addr*<sub>1</sub>, *addr*<sub>2</sub>, *count* ]

# **Other I/O Details**

- Microengine must
  - Check status of mpacket to determine if
    - \* MAC hardware detected problem (e.g., bad CRC)
    - \* Mpacket arrived with no problems
  - Check whether mpacket is first mpacket of a frame

## **Example Of Packet Processing**

• If mpacket is first of a frame, branch to *start\_of\_packet#* 

alu [--, \$*rc*, AND, 1] br!=0 [ start\_of\_packet# ]

# Summary

- Special hardware facilities support
  - Hardware Queues and Rings
  - Bit testing
  - Atomic memory operations
  - Sequencing and folding
  - CSR access
- Microengine executes event loop known as *dispatch loop* 
  - Checks for packets arriving
  - Calls macro(s) to process each packet
  - Sends packets to next specified destination

# Summary (continued)

- Intel supplies large set of dispatch loop macros
- Intel's SDK provides microblocks for ingress and egress
- Frame is divided into mpackets for transfer
- Hardware sets bits to specify whether incoming mpacket is first or last of a frame
- Microengine can transfer mpacket to SRAM transfer registers or diretly to DRAM

# **Questions?**

#### XXV

#### An Example Program

# We Will

- Consider an example
- Examine all the user-written code
- See how the pieces fit together

# **Choice Of Network System**

- Used to demonstrate
  - Basic concepts
  - Code structure and organization
- Need to
  - Minimize code size and complexity
  - Avoid excessive detail
  - Ignore performance optimizations
- Example: Network Address Translator (NAT)

# **NAT System Assumptions**

- Only two connections: one to the ISP and one to a local network
- Both connections are Ethernet
- Traffic restricted to
  - TCP
  - UDP
  - ICMP echo and reply (ping)
- Applications do not pass IP address or protocol port information in the data stream

## NAT System Assumptions (continued)

- System will not handle fragmented datagrams or datagrams with IP options
- System will only handle communication initiated from local computers (i.e., computers within the site)
- Use XScale to handle all exceptions
- Will translate port numbers as well as addresses (NAPT)

# **Conceptual NAT Topology**



- NAT located between site and rest of Internet
- All packets between the site and the Internet pass through the NAT box

## **Assumptions About Addresses**

- Site has single valid IP address 192.168.0.2
- Default router at ISP has IP address 192.168.0.100
- Computers behind NAT box use net 10 addresses such as
  - 10.0.0.1
  - -10.0.0.5
  - 10.0.0.13

## **Illustration Of NAT Addressing**



### NAT

- Changes fields in packet headers
  - Source fields in outgoing packet
  - Destination fields in incoming packet
- Uses a table to store translation information

#### **Illustration Of NAT Translation Table**

| Local IP<br>Address | Local Port<br>or ID | Remote IP<br>Address | Remote Port<br>or ID | Protocol | New Port<br>or ID |
|---------------------|---------------------|----------------------|----------------------|----------|-------------------|
| 10.0.0.2            | 29000               | 128.10.2.1           | 80                   | ТСР      | 1180              |
| 10.0.0.3            | 29000               | 128.10.2.1           | 80                   | ТСР      | 1239              |
| 10.0.0.4            | 12                  | 192.5.3.1            | -                    | ICMP     | <b>1630</b>       |

- Table shows three simultaneous connections
  - Computer 10.0.0.2 contacts 128.10.2.1:80
  - Computer 10.0.0.3 contacts 128.10.2.1:80
  - Computer 10.0.0.4 pings 192.5.3.1

## **Ports, Identifiers, And Ping**

- Each entry in NAT table corresponds to flow
- For TCP or UDP, flow is identified by
  - Source IP address
  - Source port number
  - Destination IP address
  - Destination port number
  - Replacement source port used by NAT
  - Protocol

### Ports, Identifiers, And Ping (continued)

- For ping, flow is identified by
  - Source IP address
  - ID value in packet
  - Destination IP address
  - Repalcement ID used by NAT
  - Protocol

# **Dynamic NAT Table**

- Outgoing packet used to create entry in NAT table
- Table is fixed size
- Consequence: when table is full, must delete old entry when adding a new entry

# **NAT Table Management**

- Each entry contains countdown timer field
- Timer value
  - Reset whenever entry used
  - Decremented every second
- When timer reaches zero, entry available for reuse
- When entry must be removed from full table, entry with oldest timer value is selected (LRU)

# Optimization

- To avoid arithmetic operations: use bit shift
- Timer value initialized with high-order bit set
- On each tick of the clock, shift right one bit
- When bit is shifted all the way to right, value becomes zero

# **Organization Of The Code**

- Uses Intel's RX and TX microblocks to receive and send packets
- Single NAT microblock handles fast-path translation and forwarding
- Core component handles exceptions.

#### **Five Main Pieces Of Code**

- Ingress (RX) microblock from Intel's SDK
- NAT microblock to handle the fast data path
- Egress (TX) microblock from Intel's SDK
- Core Component to handle exceptions ,User interface

#### **Illustration Of Interconnections**



• Hardware rings used for interonnection

### **Purpose Of Core Component**

- System initialization. The core component performs the usual startup tasks by patching symbols in the microcode, loading microcode into microengines, and allocating memory.
- Exception packet processing. The core component handles packets for which address translation fails, and inserts new entries in the address translation table as necessary.
- Timer aging. Once each second, the core component decrements the timer associated with each entry in the address translation table.
- User interface interaction. The code component interacts with the user interface application to provide information or respond to commands.

### **ARP Processing**

- ARP processing needed to find hardware addresses of
  - Router at ISP
  - Local computers
- Local computers
  - Treat NAT box as default router
  - Send ARP request
- Router at ISP
  - Is default router for NAT box
  - Expects to receive ARP request

# Handling ARP

- NAT box assumes local computers will semd ARP requests
- Single ARP request sent to router at ISP
  - Performed at startup
  - Packets for the ISP are discarded until a response arrives
- Values left in ARP cache indefinitely

# **Implementation Of The NAT Microblock**

- Poll *POS\_RX\_RING\_OUT* in infinite loop
- When packet available, extract buffer pointer from ring
- Read and classify packet
  - Place first 40 octets of packet in DRAM transfer registers
  - Nore: caching described later
- Check destination Ethernet address
- Verify packet is IP carrying TCP, UDP, or ping

# **Steps Taken In NAT Microblock (1)**

```
do forever {
  if (input ring nonempty) {
    obtain buffer handle for next packet;
    if (Ethernet destination address invalid)
         discard the packet;
         continue;
    if (not an IP packet || not one of TCP,
        UDP, or ICMP echo) {
      send packet to core component;
      continue;
    ł
    if (packet originates from local computer) {
      if (destination is local) {
        send packet to core component;
        continue;
      Check NAT table for outgoing match;
```

# **Steps Taken In NAT Microblock (2)**

```
} else /* packet originates from Internet */ {
    if (destination is not the NAT system) {
        send packet to core component;
        continue;
    }
    Check NAT table for incoming match;
}
if (NAT table lookup failed) {
    send packet to core component;
    continue;
}
```

### **Steps Taken In NAT Microblock (3)**

Replace fields in packet headers; Perform ARP lookup and set the Ethernet source address and Ethernet destination address; Pass packet to TX microblock;

}

# **Header Caching And Alignment**

- DRAM access extremely slow
- To optimize: cache packet header in Local memory
- Alignment
  - Local memory optimized for access by multiples of 4 bytes
  - Ethernet header contains 14 bytes
- Further performance enhancement: shift header right by two bytes when moving to Local memory (and shift back when storing in DRAM).
- Hardware instruction available

#### **Summary Of Comparisons Performed**

| Field In Packet Header                | Field In NAT Table       |  |  |  |  |
|---------------------------------------|--------------------------|--|--|--|--|
| For outgoing packet (to the Internet) |                          |  |  |  |  |
| Source IP address                     | Local IP address         |  |  |  |  |
| Source Port (or ID)                   | Local Port or ID         |  |  |  |  |
| Destination IP address                | Remote IP address        |  |  |  |  |
| <b>Destination Port (or ID)</b>       | <b>Remote Port or ID</b> |  |  |  |  |
| IP Proto field                        | Protocol                 |  |  |  |  |
| For incoming packet (from th          | ne Internet)             |  |  |  |  |
| Source IP address                     | Remote IP address        |  |  |  |  |
| Source Port (or ID)                   | Remote port or ID        |  |  |  |  |
| Destination port (or ID)              | New Port or ID           |  |  |  |  |
| IP Proto field                        | Protocol                 |  |  |  |  |

## **Implementation Of NAT Lookup**

- Hashing used to identify bucket
  - Extract fields from packet header
  - Hash to get bucket number 0 through N-1
- Sequential search within bucket

# **Hashing Details**

- Fields selected for hashing depend on direction of packet
- Two hash tables
  - Forward table for packets traveling to the Internet (f\_nat\_table)
  - Reverse table for packets arriving from the Internet (r\_nat\_table)
  - Must be linked together

#### **Fields In NAT Table Entry**

| Size    | Purpose                                 |
|---------|-----------------------------------------|
| 1 byte  | Valid flag (only left-most bit is used) |
| 1 byte  | protocol                                |
| 2 bytes | New port or ID                          |
| 2 bytes | Local port (or ID)                      |
| 2 bytes | Remote port (or ID)                     |
| 4 bytes | Local IP address                        |
| 4 bytes | Remote IP address                       |

• Entry is exact multiple of DRAM access size

### **Auxiliary Parallel Arrays**

- Store timers and pointers
- Are parallel to the hash tables: the i item in an auxiliary table corresponds to the i item in a hash table
- Four auxiliary arrays used
  - Timer for forward entries, *f\_timer*
  - Timer for reverse entries, *r\_timer*
  - Index for forward entries, *f-index*
  - Index for reverse entries, *r*-index

## **Illustration Of Auxiliary Arrays**



#### **Header Fields That NAT Changes**

**Outgoing packet (to the Internet)** 

SOURCE IP←NAT system IP addressSOURCE PORT (or ID)←NAT New Port (or ID)IP CHECKSUM←Adjusted IP header checksumTCP or UDP CHECKSUM←Adjusted transport checksum

**Incoming packet (from the Internet)** 

**IP CHECKSUM** 

- **DEST. PORT (or ID)**  $\leftarrow$  Local Port (or ID)
  - ← Adjusted IP header checksum

**TCP or UDP CHECKSUM**  $\leftarrow$  Adjusted transport checksum

#### **Definition Of Constants For Entire System (1)**

| /* NAT_shared_defs.h - constants sh                                                         | nared by microcode and core code     | */ |  |  |  |  |
|---------------------------------------------------------------------------------------------|--------------------------------------|----|--|--|--|--|
| #define NAT_DEF_MAJOR_NUMBER 50 /*                                                          | major number of NAT psuedo-device    | */ |  |  |  |  |
| #define NAT_DRIVER_NAME "NAT" /*                                                            | Name of driver for NAT pseudo-device | */ |  |  |  |  |
| #define NAT_DEV_FILE "/dev/NAT" /*                                                          | File name for NAT pseudo-device      | */ |  |  |  |  |
| #define PORTS_NUM 2 /*                                                                      | number of network interfaces         | */ |  |  |  |  |
| #define NAT_IFC 0 /*                                                                        | external interfaces to outside world | */ |  |  |  |  |
| <pre>#define GW_IP 0xC0A80064 /*</pre>                                                      | Router's IP address (192.168.0.100)  | */ |  |  |  |  |
| #define NAT_CC_ID 65 /*                                                                     | ID of core component for exceptions  | */ |  |  |  |  |
| /* Packet buffer parameters: 64MB c<br>#define NUM_BUFFERS 32*1024<br>#define BUF_SIZE 2048 | of buffers, 2048 bytes per buffer    | */ |  |  |  |  |
| /* Memory channels for free buffer list */                                                  |                                      |    |  |  |  |  |

/\* Memory channels for free buffer list \*, #define BUF\_SRAM\_CHAN 0 #define BUF\_DRAM\_CHAN 0

/\* Counter sizes. These are implicitly defined in TX and RX building \*/
/\* blocks. Namely, there are four 4-byte counters per port, which, for \*/
/\* three ports gives 4\*4\*3=48 bytes for each counter region \*/
#define RX\_CNTR\_SIZE 48
#define TX\_CNTR\_SIZE 48

#### **Definition Of Constants For Entire System (2)**

```
/* NAT table size, which must be a power of two */
#define NAT_TABLE_SIZE 128*1024 /* 128K entries */
```

```
/* Hash bucket size for NAT table = 2^HASH_BUCKET_SHIFT */
#define HASH_BUCKET_SHIFT 3
#define HASH_BUCKET_SIZE (1<<HASH_BUCKET_SHIFT)</pre>
```

```
/* NAT table bit mask */
#define NAT_TABLE_BIT_MASK ((NAT_TABLE_SIZE>>HASH_BUCKET_SHIFT)-1)
```

```
/* ARP table size, which must be a power of two */
#define ARP_TABLE_SIZE 256
```

```
/* ARP table bit mask */
#define ARP_TABLE_BIT_MASK (ARP_TABLE_SIZE-1)
```

```
/* Ethernet packet types that are recognized */
#define ETH_ARP 0x0806 /* ARP */
#define ETH_IP 0x0800 /* IP */
```

```
/* IP protocol types that are recognized */
#define IPT_UDP 17
#define IPT_TCP 6
#define IPT_ICMP 1
```

```
/* ICMP message types that are recognized */
#define ICMP_ECHO_REQ 8
#define ICMP_ECHO_REP 0
```

#### **Definition Of Constants For Entire System (3)**

```
/* ARP operation types */
#define ARP REO 1
#define ARP_REP 2
/* timer aging interval in ms */
#define AGING INTERVAL 1000 /* 1 sec */
/* maximum number of attempts to select a new (unused) NAT port value */
#define NEW NPORT ATTEMPS 30
/* maximum number of attempts to resolve gateway MAC address */
#define GW MAC RES ATTEMPTS 3
/* number of microengines */
#define ME NUM 8
/* size of one microengine cluster */
#define ME CL SZ 4
/* macro to represent Ethernet address as a byte array */
\#define ETH2B(X) \setminus
((char^*)_{\&}(X))[0], ((char^*)_{\&}(X))[1], ((char^*)_{\&}(X))[2], ((char^*)_{\&}(X))[3], \
((char^*)\&(X))[4], ((char^*)\&(X))[5]
/* macro to represent IP address as a byte array */
\#define IP2B(X) \setminus
((char^*)\&(X))[0], ((char^*)\&(X))[1], ((char^*)\&(X))[2], ((char^*)\&(X))[3]
```

#### **Constants And Types For The User Interface (1)**

```
/* NAT_types.h - types used by the core component and user interface */
```

```
typedef struct nat_entry_s { /* an entry in a NAT table */
       unsigned int valid : 1;
       unsigned int unused : 7;
       unsigned int prot : 8;
       unsigned int nport : 16;
       unsigned int lport : 16;
       unsigned int rport : 16;
       unsigned int ip addr loc;
       unsigned int ip addr rem;
} nat_entry;
typedef struct arp entry s {
                              /* an entry in the ARP cache */
       unsigned int ip addr;
       unsigned int eth w0;
       unsigned short eth_w1;
       unsigned short ifnum : 15;
       unsigned short valid : 1;
       unsigned int unused;
} arp_entry;
```

#### **Constants And Types For The User Interface (2)**

```
/* network interface structure */
typedef struct net_if_s {
       unsigned int ip addr;
       unsigned int eth w0;
       unsigned int eth w1;
} net_if;
typedef enum nat_cmd_t {
                         /* possible ioctl commands for */
                              /* the NAT pseudo-device
                                                              */
       SILENT=0, VERBOSE,
       GET_RX_COUNTER, CLR_RX_COUNTER,
       GET TX COUNTER, CLR TX COUNTER,
       GET ARP TABLE, GET NAT TABLE, GET TIMER TABLE
} nat_cmd;
#define INVALID CMD -1
```

#### **Definitions Of Scratch Ring Constants**

/\* NAT\_scratch\_rings.h - constants used for Scratch Memory rings \*/

/\* Ring used between the PACKET\_RX and NAT microblocks \*/

#definePKT\_RX\_TO\_NAT\_SCR\_RING4#definePKT\_RX\_TO\_NAT\_SCR\_RING\_SIZEIX\_SCRATCH\_RING\_SIZE\_1K

/\* First ring for communicating between NAT and Packet TX microblocks \*/

#define PACKET\_TX\_SCR\_RING\_0 6
#define PACKET\_TX\_SCR\_RING\_0\_SIZE IX\_SCRATCH\_RING\_SIZE\_256

/\* Second ring for communicating between NAT and Packet TX microblocks \*/

#define PACKET\_TX\_SCR\_RING\_17#define PACKET\_TX\_SCR\_RING\_1\_SIZEIX\_SCRATCH\_RING\_SIZE\_256

/\* Third ring for communicating between NAT and Packet TX microblocks \*/

#definePACKET\_TX\_SCR\_RING\_28#definePACKET\_TX\_SCR\_RING\_2\_SIZEIX\_SCRATCH\_RING\_SIZE\_256

/\* Fourth ring for communicating between NAT and Packet TX microblocks \*/

#define PACKET\_TX\_SCR\_RING\_3 9
#define PACKET\_TX\_SCR\_RING\_3\_SIZE IX\_SCRATCH\_RING\_SIZE\_256

#### **Basic NAT Microblock (1)**

/\* NAT\_microblock.uc - microcode for NAT processing \*/

#include <dl\_system.h>
#include <stdmac.uc>
#include <dispatch\_loop.uc>
#include <hardware.h>
#include <NAT\_shared\_defs.h>
#include <NAT\_macros.uc>

/\* Define NAT table location and parameters \*/ .import\_var NAT\_TABLE\_BASE #define\_eval NAT\_TABLE\_BM NAT\_TABLE\_BIT\_MASK #define\_eval HASH\_BUCKET\_SZ HASH\_BUCKET\_SIZE #define\_eval SHIFT\_VAL 4+HASH\_BUCKET\_SHIFT #define\_eval NAT\_TABLE\_SZ NAT\_TABLE\_SIZE

/\* Define ARP table location and parameters \*/
.import\_var ARP\_TABLE\_BASE
#define\_eval ARP\_TABLE\_BM ARP\_TABLE\_BIT\_MASK

#### **Basic NAT Microblock (2)**

```
/* Define timer table location */
.import_var TIMER_TABLE_BASE
```

```
/* Obtain the default gateway IP address */
.import_var GATEWAY_IP_ADDR
```

/\* Obtain configurations for each network interface \*/
.import\_var IF0\_IP
.import\_var IF1\_IP
.import\_var IF0\_ETH\_W0
.import\_var IF0\_ETH\_W1
.import\_var IF1\_ETH\_W0
.import\_var IF1\_ETH\_W1

```
#define_eval NAT_IP_ADDR IF/**/NAT_IFC/**/_IP
```

```
/* Define Local memory addresses */
#define STEP 64
#define LM_ADDR0_0 0
#define_eval LM_ADDR0_1 LM_ADDR0_0+STEP
#define_eval LM_ADDR0_2 LM_ADDR0_1+STEP
#define_eval LM_ADDR0_3 LM_ADDR0_2+STEP
#define_eval LM_ADDR0_4 LM_ADDR0_3+STEP
#define_eval LM_ADDR0_5 LM_ADDR0_4+STEP
#define_eval LM_ADDR0_6 LM_ADDR0_5+STEP
#define_eval LM_ADDR0_7 LM_ADDR0_6+STEP
```

#### **Basic NAT Microblock (3)**

.sig sig scr get ; signal for scratch get .sig sig scr put ; signal for scratch put .siq siq pkt hdr ; signal for packet header read .sig sig dram wr ; signal for dram write done ; GPR for intermediate data .reg temp ; GPR containing constant value 0 .reg zero ; GPR containing constant value 1 .reg one ; scratch ring .req rinq ; input port number .req port ; tx request to put on scratch rings .reg \$txreq .reg eth ipt ; GPR containing ETH\_IP constant (0x0800) .reg NAT ip ; GPR containing NAT box IP address ; context number of the current thread .reg ctx num .reg if out ; output interface to forward packet to ; Ethernet address registers .req EthDstW0 EthDstW1 .reg f nat table ; GPR with NAT table base .reg nat tab bit mask ; GPR with NAT table bit mask (size - 1) .reg r nat table ; GPR with reverse NAT table base ; GPR with ARP table base .reg arp\_tab .reg arp tab bit mask ; GPR with ARP table bit mask (size - 1) .reg f\_timer ; GPR with timer table base .reg r\_timer ; GPR with reverse timer table base ; GPR with default gateway IP address .reg gateway\_ip ; GPR with port to substitute .reg nat port .reg if ip if eth w0 if eth w1 ; network interface settings .reg IpHlen IpSrc IpDst IpProt SrcPort DstPort ; flow 5-tuple

#### **Basic NAT Microblock (4)**

/\* Allocation of transfer registers \*/
xbuf\_alloc[\$\$pkt\_hdr,2,read\_write]
xbuf\_alloc[\$\$entry\_w,4,read\_write]
xbuf\_alloc[\$\$iphdr,10,read\_write]
xbuf\_alloc[\$rdata, RX\_TO\_FUNC\_MSG\_SIZE, read]

/\* Frequently used constants \*/

immed[zero, 0] /\* 0 \*/
immed[one, 1] /\* 1 \*/
immed32(eth\_ipt,ETH\_IP) /\* Ethernet type IP \*/

```
/* Constants that are specific to NAT */
immed32(NAT_ip,NAT_IP_ADDR)
immed32(f_nat_table,NAT_TABLE_BASE)
immed32(nat_tab_bit_mask,NAT_TABLE_BM)
immed32(arp_tab,ARP_TABLE_BASE)
immed32(arp_tab_bit_mask,ARP_TABLE_BM)
immed32(f_timer,TIMER_TABLE_BASE)
immed32(gateway_ip,GATEWAY_IP_ADDR)
#define_eval NAT_TABLE_SZ_B (NAT_TABLE_SZ<4)
immed32(temp,NAT_TABLE_SZ_B)
alu[r_nat_table,f_nat_table,+,temp]
immed32(temp,NAT_TABLE_SZ)
alu[r_timer,f_timer,+,temp]</pre>
```

#### **Basic NAT Microblock (5)**

```
/* Byte alignment setting */
local csr wr[BYTE INDEX,2]
/* Obtain the current context number */
local csr rd[active ctx sts]
immed[ctx_num,0]
alu[ctx_num, ctx_num, AND, 0x07]
/* Set a Local memory address */
.if (ctx()==0)
        immed[temp,LM ADDR0 0]
.elif (ctx()==1)
        immed[temp,LM ADDR0 1]
.elif (ctx()==2)
        immed[temp,LM_ADDR0_2]
.elif (ctx()==3)
        immed[temp,LM ADDR0 3]
.elif (ctx()==4)
        immed[temp,LM_ADDR0_4]
.elif (ctx()==5)
        immed[temp,LM_ADDR0_5]
.elif (ctx()==6)
        immed[temp,LM ADDR0 6]
.else
        immed[temp,LM_ADDR0_7]
.endif
local csr wr[ACTIVE LM ADDR 0, temp]
```

# **Basic NAT Microblock (6)**

```
/*
          Main loop
                              */
start#:
       /* Read a packet from RX scratch ring */
       alu shf[ring, --, B, PKT RX TO NAT SCR RING, <<2]
       scratch[get,$rdata0,0,ring,RX TO FUNC MSG SIZE],
                                   sig done[sig scr get]
       /* Reset the exception register */
       alu[dl exception reg, --, b, 0]
       /* Wait for the RX ring read to finish */
       ctx arb[siq scr qet]
       /* Check if ring is empty */
       alu[--, $rdata0, -, 0]
       beg[ring empty#]
       /* Ring is not empty */
       alu[dl_buf_handle,--,b,$rdata0] /* set buffer handle */
       alu[dl_eop_buf_handle, --,b,$rdata1] /* get eop parameter */
       alu[dl_meta1,--,b,$rdata2] /* get data offset */
       alu[port, 0xF, AND, $rdata4, >>16] /* get input port */
       /* Ignore packets from ports other than 0 or 1 */
       alu[--,port,-,PORTS_NUM]
       bge[drop#]
```

# **Basic NAT Microblock (7)**

/\* Read the packet header (40 bytes) and assume Ethernet \*/
eth\_iphdr\_load(dl\_buf\_handle, sig\_pkt\_hdr)

```
/* If frame type is not IP, send to the core */
alu[temp,--,b,$$iphdr3,>>16]
alu[--,temp,xor,eth_ipt]
bne[exception#], defer[2] /* defer - save some cycles here */
alu[EthDstW0,--,b,$$iphdr0]
ld_field_w_clr[EthDstW1,1100,$$iphdr1]
```

/\* At this point the code has an IP packet; check the type \*/
alu[IpProt,0xFF,and,\$\$iphdr5]
br=byte[IpProt,0,IPT\_TCP,tcp\_udp\_icmp#] /\* check for TCP \*/
br=byte[IpProt,0,IPT\_UDP,tcp\_udp\_icmp#] /\* check for UDP \*/
br!=byte[IpProt,0,IPT\_ICMP,exception#] /\* check for ICMP \*/

#### tcp\_udp\_icmp#:

/\* The packet carries TCP, UDP or ICMP \*/

```
/* Find the network interface data for the input port */
net_if_data_get(port,if_ip,if_eth_w0,if_eth_w1)
```

```
/* Verify that the Ethernet destination matches our address */
alu[--,if_eth_w0,xor,EthDstW0]
bne[exception#],defer[1]
alu[--,if_eth_w1,xor,EthDstW1]
bne[exception#],defer[2]
```

### **Basic NAT Microblock (8)**

```
/* Compute the IP header size */
alu[IpHlen,0xF,and,$$iphdr3,>>8]
```

```
/* Store a copy of the IP header in local memory */
byte_align_be[--,$$iphdr3]
byte_align_be[*l$index0[0],$$iphdr4]
byte_align_be[*l$index0[1],$$iphdr5]
byte_align_be[*l$index0[2],$$iphdr6]
byte_align_be[*l$index0[3],$$iphdr7]
byte_align_be[*l$index0[4],$$iphdr8]
byte_align_be[*l$index0[5],$$iphdr9]
byte_align_be[*l$index0[6],0]
```

```
/* Obtain the IP source and destination addresses */
alu[IpDst,--,b,*l$index0[4]]
alu[--,if_ip,xor,IpDst]
/* Branch if destination IP is local (i.e., the NAT box) */
beq[local_dst#],defer[1]
alu[IpSrc,--,b,*l$index0[3]]
```

## **Basic NAT Microblock (9)**

```
/* At this point the packet conains TCP, UDP or ICMP, and has
                                                                 */
/* a non-local destination address. If the packet is incoming,
                                                                 */
/* drop it. If the packet is outgoing, perform NAT translation */
/* and send the packet to the Internet.
                                                                 */
alu[--,port,xor,NAT IFC]
beq[exception#]
/* Read the source and destination ports (or ICMP type and ID) */
read_src_and_dst_ports(NON_LOCAL_DST, IpHlen, IpProt,
                       SrcPort, DstPort)
.if (IpProt == IPT_ICMP)
        /* If the packet is ICMP, but not an echo request, */
        /* send the packet to the core as an exception
                                                            */
        alu[--,DstPort,xor,ICMP ECHO REO]
        bne[exception#],defer[1]
        alu[DstPort,--,b,0]
.endif
/* Perfrom NAT lookup for an outgoing packet */
nat_lookup_outgoing(IpSrc,SrcPort,IpDst,DstPort,IpProt,
                    nat port, if out)
alu[SrcPort, --, b, nat_port]
```

# **Basic NAT Microblock (10)**

```
tx pkt#:
        /* If NAT lookup failed, send the packet to core */
        /* as an exception
                                                          */
        alu[--,--,~b,nat_port]
        beg[exception#]
        .set if out /* inserted to prevent an assembler warning */
        /* At this point, NAT lookup has been successful, and the ARP */
        /* table must be consulted to determine the correct Ethernet */
        /* address for the frame.
                                                                       */
        alu[d] exception req, --, b, 1,<<10]
        arp lookup(if out, IpDst, EthDstW0, EthDstW1)
        alu[dl exception req, --, b, 0]
        /* Modify the packet header */
        modify and save packet header(if out, EthDstW0, EthDstW1, IpHlen,
                                 IpProt, IpSrc, IpDst, SrcPort, DstPort)
        /* Create a TX request for transmit queue */
        alu[temp, --, b, if_out, <<24] /* 27:24 output port */
        ld_field[temp, 0111, dl_buf_handle] /* 23:00 buffer handle */
        alu[$txreq, temp, OR, one, <<31] /* 31 valid bit
                                                                    */
                                            /* bits 31:28 reserved */
        /* Jump to Scratch ring write for the corresponding port */
        alu[temp, --, b, if_out, <<2]
        jump[temp,write ring0#],targets[write ring0#,write ring1#,\
                                   write ring2#, write ring3#]
```

# **Basic NAT Microblock (11)**

```
write ring0#:
        write tx ring(0,start#)
write_ring1#:
        write tx ring(1,start#)
write ring2#:
        write tx ring(2,start#)
write ring3#:
        write tx ring(3,start#)
        /* If Scratch ring is full -- wait voluntarily */
full ring0#:
        ctx arb[voluntary], br[write ring0#]
full_ring1#:
        ctx arb[voluntary],br[write ring1#]
full_ring2#:
        ctx arb[voluntary], br[write ring2#]
full_ring3#:
        ctx arb[voluntary], br[write ring3#]
local dst#:
        /* If the Destination IP address in an incoming datagram is */
        /* not the address of the NAT box address, send the packet */
        /* to the core as an exception.
                                                                      */
        alu[--, IpDst, xor, NAT_ip]
        bne[exception#]
```

## **Basic NAT Microblock (12)**

```
/* At this point the incoming packet contains TCP, UDP,
                                                             */
/* or ICMP and has a local IP destination. Read the source */
/* and destination ports.
                                                             */
read src and dst ports(NON LOCAL SRC, IpHlen, IpProt,
                       SrcPort, DstPort)
.if (IpProt == IPT ICMP)
        /* If the packet is ICMP, but not an echo reply, */
        /* send the packet to the core as an exception. */
        alu[--,SrcPort,xor,ICMP_ECHO_REP]
        bne[exception#],defer[1]
        alu[SrcPort, --, b, 0]
.endif
/* Perform NAT lookup for an incoming packet */
nat_lookup_incoming(IpDst,DstPort,IpSrc,SrcPort,IpProt,
                    nat port, if out)
alu[DstPort, --, b, nat_port]
br[tx_pkt#] /* jump to the transmission code */
```

# **Basic NAT Microblock (13)**

exception#:

/\* send to the NAT core component \*/
dl exception set(NAT CC ID, 0)

/\* this is a packet (not message) \*/

dl\_exception\_set\_priority(0)

dl\_exception\_send(dl\_buf\_handle)

ring\_empty#:

br[start#] /\* jump back to the main loop to continue probing \*/

drop#: /\* Drop the packet by freeing its buffer \*/
 dl\_buf\_free(dl\_buf\_handle,BUF\_FREE\_LISTO)
 br[start#] /\* go back to the main loop start \*/

52

# Macros Used To Implement NAT (1)

```
/* NAT_macros.uc - Microassembly macros used with NAT */
```

```
/* Macro to read source and destination ports from UDP or TCP packet */
#macro read src and dst ports(callsite, hdr len, IpProt, SrcPort, DstPort)
       .begin
       .reg buf_offset sdram_offset pkt_offset
       /* assume no IP options */
       .if (IpProt == IPT_ICMP)
#if (streq(callsite, 'NON LOCAL SRC'))
             alu[DstPort, --, b, *1$index0[6], >>16]
             alu[SrcPort, 0xFF, and, *1$index0[5], >>24]
#else
             alu[SrcPort, --, b, *1$index0[6], >>16]
             alu[DstPort, 0xFF, and, *1$index0[5], >>24]
#endif
       .else
             alu[SrcPort, --, b, *1$index0[5], >>16]
             ld field w clr[DstPort,0011,*l$index0[5]]
       .endif
       .end
#endm
```

# Macros Used To Implement NAT (2)

```
/* Macro to do NAT lookup for packet with local src
                                                            */
#macro nat lookup outgoing(ip addr loc, lport, ip addr rem, \
                       rport,prot,nport,if_out)
       .begin
       .reg cnt tmp offset entry w1 tm offset $timer bm
       .sig hash done read done write done
      xbuf_alloc[$hash128 w,4,read_write]
       /* hash IP address, port and protocol */
       alu[$hash128 w0,--,b,ip addr rem]
       alu[$hash128 w1,--,b,ip addr loc]
       alu[entry_w1, rport, or, lport, <<16]
      alu[$hash128 w2,--,b,entry w1]
       alu[$hash128_w3,--,b,prot]
      hash 128[$hash128 w0,1], sig done[hash done]
      alu[cnt,--,b,zero]
       alu[nport, --,~b, zero]
       ctx arb[hash done]
       /* compute the hash value mod the number of buckets */
       /* in the hash table
                                                    */
      alu[offset,$hash128 w0,and,nat tab bit mask]
       /* computer byte offset into NAT table */
       alu[offset, --, b, offset, <<SHIFT VAL]
       alu[offset,offset,-,16]
```

#### **Macros Used To Implement NAT (3)**

```
/* search the bucket linearly */
search start#:
                alu[--,HASH BUCKET SZ,-,cnt]
                ble[search done#]
                alu[offset,offset,+,16]
                dram[read, $$entry w0, f nat table, offset, 2],
                                            sig done[read done]
                ctx arb[read done]
                /* Verify that values in the entry match the */
                /* search keys
                                                               */
                /* check valid bit */
                br bclr[$$entry_w0,31,search_start#],defer[3]
                alu[cnt,cnt,+,one]
                alu[tmp,0xFF,and,$$entry w0,>>16]
                alu[--,tmp,xor,prot] /* check protocol */
                bne[search_start#],defer[3]
                alu[tm offset, --, b, offset, >>4]
                alu[tmp,tm_offset,and,3]
                alu[--,$$entry_w1,xor,entry_w1] /* check ports */
                bne[search start#],defer[3]
                alu[tmp,--,b,tmp,<<3]
                alu[tmp, 31, -, tmp]
                alu[--,$$entry_w2,xor,ip_addr_loc] /* check local IP */
                bne[search start#],defer[3]
                alu[--,tmp,or,zero] /* dummy instruction for */
                                     /* indirect shift
                                                               */
                alu[$timer_bm,--,b,one,<<indirect]
                alu[--,$$entry_w3,xor,ip_addr_rem] /* check remote IP */
                bne[search start#]
```

## Macros Used To Implement NAT (4)

```
/* at this point, the code has found a match in the NAT */
/* table, and must update timer for the entry */
sram[set,$timer_bm,f_timer,tm_offset],sig_done[write_done]
ld_field_w_clr[nport,0011,$$entry_w0]
alu[if_out,--,b,NAT_IFC]
alu[ip_addr_loc,--,b,NAT_ip]
ctx_arb[write_done]
```

search\_done#: .end #endm

# **Macros Used To Implement NAT (5)**

```
/* Macro to perform NAT lookup for a packet with a local destination */
\#macro nat lookup incoming(ip addr loc, nport, ip addr rem, rport, \setminus
                       prot, lport, if_out)
       .begin
       .reg cnt tmp offset port tmp tm offset $timer bm
       .sig hash done read done write done
      xbuf_alloc[$hash128 w,4,read_write]
       /* hash IP address, port and protocol */
       alu[$hash128 w0,--,b,ip addr rem]
       alu[$hash128_w1,rport,or,nport,<<16]
       alu[$hash128_w2,--,b,prot]
       alu[$hash128_w3,--,b,zero]
      hash_128[$hash128_w0,1], sig done[hash done]
       alu[cnt,--,b,zero]
       alu[lport, --,~b, zero]
       ctx arb[hash done]
       /* compute the hash value mod the number of buckets */
       /* in the hash table
                                                    */
       alu[offset,$hash128 w0,and,nat tab bit mask]
       /* compute byte offset into NAT table */
      alu[offset, --, b, offset, <<SHIFT_VAL]
      alu[offset,offset,-,16]
```

#### **Macros Used To Implement NAT (6)**

```
/* search the bucket linearly */
search start#:
                alu[--,HASH BUCKET SZ,-,cnt]
                ble[search done#]
                alu[offset,offset,+,16]
                dram[read, $$entry w0, r nat table, offset, 2],
                                           sig done[read done]
                ctx arb[read done]
                /* Verify that values in the entry match */
                /* the search keys
                                                           */
                /* check valid bit */
                br_bclr[$$entry_w0,31,search_start#],defer[3]
                alu[cnt,cnt,+,one]
                alu[tmp,0xFF,and,$$entry w0,>>16]
                alu[--,tmp,xor,prot] /* check protocol */
                bne[search_start#],defer[3]
                alu[tm offset, --, b, offset, >>4]
                alu[port_tmp,0,+16,$$entry_w1]
                alu[--,port_tmp,xor,rport] /* check remote port */
                bne[search_start#],defer[3]
                alu[tmp,tm_offset,and,3]
                alu[port_tmp,0,+16,$$entry_w0]
                /* check NAT (destination) port */
                alu[--,port_tmp,xor,nport]
                bne[search_start#],defer[3]
                alu[tmp,--,b,tmp,<<3]
                alu[tmp, 31, -, tmp]
                alu[--,$$entry w3,xor,ip addr rem] /* check remote IP */
                bne[search start#],defer[2]
```

## **Macros Used To Implement NAT (7)**

```
alu[--,tmp,or,zero] /* dummy instruction for */
                               /* indirect shift
                                                     */
              alu[$timer_bm,--,b,one,<<indirect]
              /* at this point, the code has found a match in the NAT */
              /* table, and must update timer for the entry
                                                               */
              sram[set,$timer bm,r timer,tm offset],sig done[write done]
              alu[if out,one,+,NAT IFC] /* so that if out!=NAT IFC */
              alu[ip addr loc, --, b, $$entry w2]
              alu[lport, --, b, $$entry_w1, >>16]
              ctx arb[write done]
search done#:
       .end
#endm
*/
/* Macro to read Ethernet and IP headers from DRAM
#macro eth iphdr load(buf handle, req sig)
       .begin
             sdram offset buf offset
       .req
       /* Read 40 bytes of ETH/IP Header from DRAM */
       /* (DRAM reads are in quadwords -- 8 bytes */
      dl buf get data[sdram offset, buf handle]
      dl meta_get_offset[buf_offset]
      dram[read,$$iphdr0,sdram offset,buf offset,5],sig done[reg sig]
       ctx arb[req sig]
       .end
#endm
```

# **Macros Used To Implement NAT (8)**

```
/* Macro to write modified Ethernet and IP headers from Local
                                                      */
/* memory back into DRAM
                                                      */
#macro eth iphdr store(sdram offset, buf offset, req sig)
      byte align be[--,eth ipt]
      byte align be[$$iphdr3,*l$index0[0]]
      byte align be[$$iphdr4,*l$index0[1]]
      byte align be[$$iphdr5,*1$index0[2]]
      byte align be[$$iphdr6,*1$index0[3]]
      byte align be[$$iphdr7,*1$index0[4]]
      byte align be[$$iphdr8,*1$index0[5]]
      byte align be[$$iphdr9,*1$index0[6]]
      dram[write,$$iphdr0,sdram offset,buf offset,5],sig done[reg sig]
#endm
```

# **Macros Used To Implement NAT (9)**

```
*/
/* Macro to update IP checksum
/* cksum new = cksum old + old val + ~new val
                                                */
#macro cksum upd(cksum,old val,new val)
     .begin
     .reg x not new val
     alu[x,--,b,old_val,>>16]
     alu[cksum,cksum,+,x]
     alu[cksum,cksum,+16,old_val]
     alu[not new val, --, ~b, new val]
     alu[x,--,b,not new val,>>16]
     alu[cksum,cksum,+,x]
     alu[cksum,cksum,+16,not new val]
     .end
#endm
/* Macro to add carry into checksum
                                                */
/* cksum = cksum>>16 + cksum&0xffff
                                                */
/* cksum = cksum>>16 + cksum&Oxffff
                                                */
#macro cksum_carry(cksum)
     .begin
     .req x
     alu[x,--,b,cksum,>>16]
     alu[cksum,x,+16,cksum]
     alu[x,--,b,cksum,>>16]
     alu[cksum,x,+16,cksum]
     .end
#endm
```

NSD-Intel -- Chapt. 25

# **Macros Used To Implement NAT (10)**

```
/* Macro to modify and store Eth frame header, IP packet header
                                                             */
/* and UDP, TCP, or ICMP packet header
                                                             */
#macro modify and save packet header(if out,EthDstW0,EthDstW1,IpHlen,
                                IpProt, IpSrc, IpDst, SrcPort, DstPort)
       .begin
       .reg tmp cksum buf offset pkt offset sdram offset
       .reg if ip if eth w0 if eth w1 oldId oldPorts oldIpSrc oldIpDst
       .sig dram rd dram wr1 dram wr2
       /* Compute sdram offset for the buffer */
      dl buf get data[sdram offset, dl buf handle]
       /* Set the Ethernet header */
      net if data get(if out,if ip,if eth w0,if eth w1)
      alu[$$iphdr0,--,b,EthDstW0]
      alu[$$iphdr1,EthDstW1,or,if_eth_w0,>>16]
       dbl shf[$$iphdr2,if eth w0,if eth w1,>>16]
       /* Update the IP header */
       ld field w clr[cksum,0011,*l$index0[2]]
      alu[oldIpSrc,--,b,*l$index0[3]]
      alu[oldIpDst, --, b, *1$index0[4]]
       /* calculate new checksum */
       cksum_upd(cksum,oldIpSrc,IpSrc)
       cksum_upd(cksum,oldIpDst,IpDst)
       cksum carry(cksum)
```

### Macros Used To Implement NAT (11)

```
/* Save the new checksum and IP addresses */
ld_field[*l$index0[2],0011,cksum]
alu[*1$index0[3],--,b,IpSrc]
alu[*1$index0[4],--,b,IpDst]
/* Update the TCP, UDP, or ICMP header. Note: we assume that */
/* the datagram does not contain IP options
                                                               */
.if (IpProt == IPT ICMP)
        /* Update the ICMP header */
        ld field w clr[cksum,0011,*1$index0[5]]
        alu[oldId, --, b, *1$index0[6], >>16]
        .if (IpSrc == NAT ip)
                cksum upd(cksum,oldId,SrcPort)
                ld field[*1$index0[6],1100,SrcPort,<<16]
        .else
                cksum_upd(cksum,oldId,DstPort)
                ld field[*l$index0[6],1100,DstPort,<<16]
        .endif
        cksum carry(cksum)
        ld_field[*l$index0[5],0011,cksum]
        dl meta get offset[buf offset]
        /* Save the modified Ethernet and IP headers */
        eth_iphdr_store(sdram_offset, buf_offset, dram_wr1)
```

#### **Macros Used To Implement NAT (12)**

```
/* Update the TCP or UDP header */
dl meta get offset[buf offset]
.if (IpProt == IPT_TCP)
        alu[pkt_offset,buf_offset,+,48]
.else
        alu[pkt_offset,buf_offset,+,40]
.endif
/* Read the old UDP or TCP checksum */
dram[read,$$pkt hdr0,sdram offset,pkt offset,1],
                                    sig done[dram rd]
alu[oldPorts, --, b, *1$index0[5]]
alu[*1$index0[5],DstPort,or,SrcPort,<<16]
/* Save the modified Ethernet and IP headers */
eth iphdr store(sdram offset, buf offset, dram wr1)
/* Wait for the checksum to be read */
ctx arb[dram rd]
.if (IpProt == IPT_TCP)
        ld field w clr[cksum,0011,$$pkt hdr0]
.else
        alu[cksum, --, b, $$pkt_hdr0, >>16]
        /* If UDP checksum is 0, no update needed */
        alu[--,cksum,xor,zero]
        bne[wait#]
.endif
cksum_upd(cksum,oldIpSrc,IpSrc)
cksum_upd(cksum,oldIpDst,IpDst)
```

.else

## Macros Used To Implement NAT (13)

```
alu[tmp,DstPort,or,SrcPort,<<16]
                cksum_upd(cksum,oldPorts,tmp)
                cksum carry(cksum)
                .if (IpProt == IPT_TCP)
                        alu[tmp,--,b,cksum]
                         ld field[tmp,1100,$$pkt hdr0]
                 .else
                        alu[tmp,--,b,cksum,<<16]
                         ld field[tmp,0011,$$pkt hdr0]
                 .endif
                alu[$$pkt_hdr0,--,b,tmp]
                alu[$$pkt hdr1,--,b,$$pkt hdr1]
                dram[write,$$pkt_hdr0,sdram_offset,pkt_offset,1],
                                                   sig done[dram wr2]
                ctx arb[dram wr1,dram wr2],br[done#]
        .endif
wait#:
        ctx arb[dram wr1]
done#:
        .end
```

#endm

# **Macros Used To Implement NAT (14)**

```
/* Macro to obtain a copy of network interface settings
                                                           */
#macro net if data get(ifnum,if ip,if eth w0,if eth w1)
      alu[temp,--,b,ifnum,<<3]
       jump[temp, if 0#], targets[if 0#, if 1#]
       /* set network interface parameters */
if 0#:
      immed[if_ip, IF0_IP]
       immed w1[if_ip, IF0_IP>>16]
       immed[if_eth_w0,IF0_ETH_W0]
       immed w1[if eth w0, IF0 ETH W0>>16]
      br[end of if table#],defer[2]
       immed[if eth w1, IF0 ETH W1]
       immed w1[if eth w1, IF0 ETH W1>>16]
      nop /* added for alignment */
if 1#:
       immed[if ip, IF1 IP]
       immed w1[if ip, IF1 IP>>16]
       immed[if eth w0, IF1 ETH W0]
       immed w1[if eth w0, IF1 ETH W0>>16]
       immed[if eth w1, IF1 ETH W1]
       immed w1[if eth w1, IF1 ETH W1>>16]
end of if table#:
#endm
```

# **Macros Used To Implement NAT (15)**

```
/* Macro to perform ARP table lookup
                                                           */
#macro arp_lookup(port,IpDst,EthAddrW0,EthAddrW1)
       .begin
       .reg ip addr cnt tmp offset $hash48 w0 $hash48 w1
       .reg $entry w0 $entry w1 $entry w2
       .sig hash done read done
       .xfer order $hash48 w0 $hash48 w1
       .xfer order $entry w0 $entry w1 $entry w2
       .if (port == NAT IFC)
             alu[ip addr,--,b,qateway ip]
       .else
             alu[ip addr,--,b,IpDst]
       .endif
       /* Hash the IP address */
      alu[$hash48 w0,--,b,ip addr]
      alu[$hash48_w1,--,b,zero]
      hash 48[$hash48 w0,1], sig done[hash done]
      alu[cnt,--,b,zero]
      ctx arb[hash done]
      /* Compute the hash value mod the size of the ARP table */
      alu[offset,$hash48 w0,and,arp tab bit mask]
      /* Compute the byte offset into the ARP table */
      alu[offset, --, b, offset, <<4]
      /* Adjust the start of the table */
      alu[offset,offset,-,16]
```

### **Macros Used To Implement NAT (16)**

```
/* Search the table sequentially */
arp search start#:
              alu[--, arp tab bit mask, -, cnt]
              blt[exception#] /* lookup failed */
              alu[offset,offset,+,16]
              alu[offset,offset,and,arp tab bit mask,<<4]
              sram[read,$entry w0,arp tab,offset,3],ctx swap[read done]
              alu[--,$entry_w0,xor,ip_addr]
              bne[arp search start#],defer[1]
              alu[cnt,cnt,+,one]
              br bclr[$entry w2,0,arp search start#]
arp search end#:
       /* Set the word 0 of the Ethernet address */
       alu[EthAddrW0, --, b, $entry_w1]
       /* Set word 1 of the Ethernet address */
       ld field w clr[EthAddrW1,1100,$entry w2]
       /* Set the output port */
       ld field w clr[if out,0011,$entry w2,>>1]
       .end
#endm
/* Macro to write the current packet on the TX ring
                                                              */
#macro write tx ring(ring num, label)
#define eval RN PACKET TX SCR RING /**/ring num
       br inp state[SCR RING/**/RN/**/ FULL, full ring/**/ring num/**/#]
       scratch[put,$txreq,zero,(RN<<2),1],sig_done[sig_scr_put]</pre>
       ctx arb[sig scr put], br[label]
```

#endm

# **Core Component Responsibilities**

- Initialization (performed once at startup)
- Processing exception packets
  - ARP request with the NAT system as the target
  - ICMP echo request with the NAT system as the destination
  - TCP, UDP, or ICMP Echo, packet for which no NAT table entry exists
  - Note: other packets are dropped
- Processing requests from the user interface
- Cleanup (performed once at shutdown)

# **Core Component Implementation**

- Divided into three files
- Conceptual purpose
  - Initialization and driver for pseudo device
  - Packet handler
  - Definitions of protocol headers (include file)

#### **Core Initialization And Pseudo-Device Driver (1)**

```
/* NAT_pseudo_dev.c - NAT core comp. & driver (Linux kernel module) */
```

```
#include <linux/kernel.h>
                           /* Code runs in the Linux kernel
                                                                       */
#include <linux/module.h>
                           /* The code runs as a kernel module
                                                                       */
#include <linux/fs.h>
                           /* NAT pseudo device is a character device
                                                                       */
#include <asm/uaccess.h>
                           /* Needed for communication with user space */
#include <enpv2 types.h>
#include <ix rm.h>
                           /* Code uses the IXA Resource Manager
                                                                       */
#undef LINUX
                           /* Prevents the compiler from complaining
                                                                       */
#include <ix cc.h>
                           /* Code uses the IXA CCI
                                                                       */
#include <ix cci.h>
#include "NAT shared defs.h"
#include "NAT_types.h"
#include "NAT scratch rings.h"
#define ME_MASK 0x07 /* system uses microengines 0, 1 and 2 */
#define CONTEXT MASK 255
                           /* context mask -- enable all contexts */
#define HASH_MULT_WO_0x12345678 /* hash multiplier -- word 0 */
#define HASH MULT_W1 0x87654321 /* hash multiplier -- word 1 */
#define HASH MULT W2 0x56781234 /* hash multiplier -- word 2 */
#define HASH_MULT_W3 0x43218765 /* hash multiplier -- word 3 */
```

# **Core Initialization And Pseudo-Device Driver (2)**

```
/* macro to log error message and terminate resource manager */
#define panic(...) { printk("%s: ",NAT_DRIVER_NAME);\
                     printk(____________);
                     ix rm term();\
                     unregister chrdev(NAT major, NAT DRIVER NAME);
                     return(-1);
/* macro to clear block of kernel memory */
#define bzero(buf,size) ix_ossl_memset(buf,0,size)
/* macro to convert microengine sequence number */
/* into microengine ID
                                                 */
#define ME ID(i) ((i%ME CL SZ) ((i/ME CL SZ) << 4))
/* macro to convert SRAM offset and memory channel into */
/* microengine addressing
                                                         */
#define ME SRAM ADDR(offset,memChan)
( (memChan)?(offset | (0x20000000<<memChan)):offset )
/* Module parameter -- a UOF file name */
static char * Uof file;
/* Module parameter -- Linux major device number for NAT psuedo device */
static unsigned int NAT major = NAT DEF MAJOR NUMBER;
MODULE AUTHOR("Internetworking Lab, CS, Purdue University");
MODULE DESCRIPTION ("NAT pseudo-device driver for IXP2XXX");
MODULE PARM(Uof file, "s");
MODULE PARM(NAT major, "i");
```

# **Core Initialization And Pseudo-Device Driver (3)**

```
/* Static gateway IP address */
unsigned int gateway ip= GW IP;
/* Static network interface configuration */
net if iface table[PORTS NUM]={
        {0xC0A80002 /* 192.168.0.2 */,
         0x01010101,0x01010000 /* 01:01:01:01:01:01 */},
        {0x0A000001 /* 10.0.0.1 */,
         0x02020202,0x02020000 /* 02:02:02:02:02:02 */} };
/* External procedures */
extern ix error nat pkt handler(ix buffer handle, ix uint32, void*);
extern ix error resolve arp(unsigned int);
/* Local procedures
                       */
static int patch microblocks(ix buffer free list info);
static int create scr rings();
static int init hash();
static ix error nat table timer(void*);
static ix error exe init f(ix exe handle,void**);
static ix error exe fini f(ix exe handle,void*);
static ix error cc init f(ix cc handle,void**);
static ix error cc fini f(ix cc handle,void*);
static ix exe handle exeHandle;
static ix cc handle ccHandle;
static ix event handle eveHandle;
static int nat_open(struct inode *, struct file *);
```

# **Core Initialization And Pseudo-Device Driver (4)**

static int nat\_release(struct inode \*, struct file \*); static int nat\_ioctl(struct inode \*, struct file \*, unsigned int, unsigned long);

/\* Verbosity level \*/ int verb=SILENT;

/\* Pointers to various run-time data structures \*/ nat entry \*f nat table, \*r nat table; unsigned int \*f\_index, \*r\_index; arp entry \*arp table; unsigned char \*f timer, \*r timer; void \*rx cntr, \*tx cntr;

```
/* Scratch rings */
ix hw ring handle rxToNatRing, txScrRing[4];
```

```
/* List of free buffers */
ix buffer free list handle hwFreeList = 0;
```

```
/* Operations for the NAT pseudo-device */
static struct file operations nat fops = {
        ioctl:
                       nat_ioctl,
       open:
                      nat_open,
       release:
                       nat release
```

```
};
```

# **Core Initialization And Pseudo-Device Driver (5)**

```
static int nat_open(struct inode *inode, struct file *filp)
ł
        MOD_INC_USE_COUNT;
        return 0;
static int nat_release(struct inode *inode, struct file *filp)
        MOD DEC USE COUNT;
        return 0;
static int nat_ioctl(struct inode *inode, struct file *fp,
                       unsigned int cmd, unsigned long buf)
{
        switch (cmd) {
                case SILENT:
                        verb = SILENT;
                        break;
                case VERBOSE:
                        verb = VERBOSE;
                        break;
                case GET ARP TABLE:
                        if ((char *)buf != NULL)
                                 return copy to user((char *)buf, arp table,
                                         ARP TABLE SIZE*sizeof(arp entry));
                        break;
```

#### **Core Initialization And Pseudo-Device Driver (6)**

```
case GET_NAT_TABLE:
        if ((char *)buf != NULL)
                return copy to user((char *)buf,
                        (void*)f nat table,
                        NAT TABLE SIZE*sizeof(nat entry));
                break;
case GET TIMER TABLE:
        if ((char *)buf != NULL)
                return copy to user((char *)buf,
                        (void*)f_timer, 2*NAT_TABLE_SIZE);
                break;
case GET RX COUNTER:
        if ((char *)buf != NULL)
                return copy to user((char *)buf,rx cntr,
                                     RX CNTR SIZE);
                break;
case GET_TX_COUNTER:
        if ((char *)buf != NULL)
                return copy to user((char *)buf,tx cntr,
                                     TX CNTR SIZE);
                break;
case CLR RX COUNTER:
        bzero(rx cntr,RX CNTR SIZE);
        break;
case CLR TX COUNTER:
       bzero(tx cntr,TX CNTR SIZE);
       break;
default:
        return INVALID CMD;
```

# **Core Initialization And Pseudo-Device Driver (7)**

```
return 0;
}
int init module()
í
        ix error err;
        ix buffer free list info hwFreeListInfo;
        int i;
        if (Uof_file == NULL) {
                printk("%s: no microcode file specified!\n",
                       NAT DRIVER NAME);
                return -1;
        /* Register the pseudo-device with Linux */
        if (register chrdev(NAT major, NAT DRIVER NAME, &nat fops) < 0) {
                printk("%s: can't get major number %d\n",
                       NAT_DRIVER_NAME, NAT_major);
                return -1;
        /* Initialize Intel's Resource Manager */
        printk("%s: Initializing Resource Manager\n", NAT DRIVER NAME);
        err=ix_rm_init(0);
        if (err != IX_SUCCESS) {
                printk("Error: ix rm_init failed\n");
                return -1;
```

}

#### **Core Initialization And Pseudo-Device Driver (8)**

```
/* Register the exception packet handler */
printk("%s: Setting packet receive mode (to callback)\n",
      NAT DRIVER NAME);
err = ix rm packet set receive mode(NAT CC ID,
                       IX COMM ID MODE CALLBACK);
if (err != IX SUCCESS)
        panic("ix rm packet set receive mode failed\n");
printk("%s: Registering packet handler\n", NAT DRIVER NAME);
err = ix rm packet handler register (NAT CC ID, nat pkt handler,
                                    NULL);
if (err != IX SUCCESS)
        panic("ix rm packet handler register failed\n");
/* Allocate a free buffer list */
err = ix rm hw buffer free list create(NUM BUFFERS,
                 sizeof(ix hw buffer meta), BUF SIZE,
                 BUF SRAM CHAN, BUF DRAM CHAN, & hwFreeList);
if (err != IX SUCCESS)
        panic("ix rm hw buffer free list create failed\n");
/* Read freelist info (it will be needes later) */
err = ix rm buffer free list get info(hwFreeList,
                                      &hwFreeListInfo);
if (err != IX SUCCESS)
        panic("ix rm hw buffer free list get info failed\n");
```

### **Core Initialization And Pseudo-Device Driver (9)**

```
/* Allocate RX counters (SRAM, channel 0) */
err = ix rm mem alloc(IX MEMORY TYPE SRAM, 0,
                      RX CNTR SIZE, &rx cntr);
if (err != IX_SUCCESS)
        panic("ix rm mem alloc failed for RX counters\n");
/* Clear RX counters */
bzero(rx cntr,RX CNTR SIZE);
/* Allocate TX counters (SRAM, channel 1) */
err = ix rm mem alloc(IX_MEMORY_TYPE_SRAM, 1,
                      TX CNTR SIZE, &tx cntr);
if (err != IX SUCCESS)
        panic("ix rm mem alloc failed for TX counters\n");
/* Clear TX counters */
bzero(tx cntr,TX CNTR SIZE);
/* Allocate the NAT table in DRAM */
err = ix rm mem alloc(IX MEMORY TYPE DRAM, 0,
                      2*NAT TABLE SIZE*sizeof(nat entry),
                      (void**)&f nat table);
if (err != IX SUCCESS)
        panic("ix rm mem alloc failed for NAT table\n");
/* Clear the NAT table */
bzero((void*)f nat table,2*NAT TABLE SIZE*sizeof(nat entry));
/* Set the base address for reverse NAT table */
r nat table=f nat table+NAT TABLE SIZE;
```

#### **Core Initialization And Pseudo-Device Driver (10)**

```
/* Allocate the NAT index table in DRAM */
err = ix rm mem alloc(IX MEMORY TYPE DRAM, 0,
                      2*NAT TABLE SIZE*sizeof(unsigned int),
                      (void**)&f index);
if (err != IX SUCCESS)
        panic("ix rm mem alloc failed for NAT index table\n");
/* Clear the NAT index table */
bzero((void*)f_index,2*NAT_TABLE_SIZE*sizeof(unsigned int));
/* Set the base for the reverse NAT index table */
r index=f index+NAT TABLE SIZE;
/* Allocate the timer table in SRAM */
err = ix rm mem alloc(IX MEMORY TYPE SRAM, 0, NAT TABLE SIZE,
                      (void**)&f timer);
if (err != IX SUCCESS)
        panic("ix rm mem alloc failed for timer table\n");
/* Clear the timer table */
bzero((void*)f timer,2*NAT TABLE SIZE);
/* Set the base address for the reverse timer table */
r timer=f timer+NAT TABLE SIZE;
/* Allocate the ARP table in DRAM */
err = ix rm mem alloc(IX MEMORY TYPE SRAM, 1,
          ARP TABLE SIZE*sizeof(arp entry), (void**)&arp table);
if (err != IX SUCCESS)
        panic("ix rm mem alloc failed for ARP table\n");
/* Clear the ARP table */
bzero((void*)arp table,ARP TABLE SIZE*sizeof(arp entry));
```

## **Core Initialization And Pseudo-Device Driver (11)**

```
/* Reset the microengines */
printk("%s: Resetting all microengines\n", NAT DRIVER NAME);
ix rm ueng reset all();
/* Get the microcode from the UOF file */
printk("%s: Setting ucode\n", NAT_DRIVER_NAME);
err = ix rm ueng set ucode(Uof file);
if (err != IX SUCCESS)
        panic("ix rm ueng set ucode failed\n");
/* Patch the microcode symbols before actually */
/* loading microcode.
                                                */
if (patch_microblocks(hwFreeListInfo) < 0)
        return(-1);
/* Create scratch rings */
if (create scr rings() < 0)
        return(-1);
/* Load microcode into microengines */
printk("%s: Loading ucode\n", NAT_DRIVER_NAME);
err = ix rm ueng load();
if (err != IX SUCCESS)
        panic("ix rm ueng load failed\n");
/* Initialize hash unit */
if (init hash() < 0)
        return(-1);
```

# **Core Initialization And Pseudo-Device Driver (12)**

```
/* Start the assigned microengines */
for (i=0;i<ME NUM;i++) {</pre>
        if ((ME_MASK>>i)&0x1) {
                printk("%s: Starting ME%i\n",NAT_DRIVER_NAME, i);
                err = ix rm ueng start(ME ID(i), CONTEXT MASK);
                if (err != IX SUCCESS)
                   panic("ix rm ueng start failed for ME %i\n",i);
        }
/* Resolve an ARP entry for the gateway */
if (resolve arp(GW IP) != IX SUCCESS)
        panic("can't resolve ARP entry for the gatewayn");
/* Create an execution engine (i.e., a kernel thread) for the */
/* NAT timer aging procedure
                                                                */
err=ix cci init(); /* Initialize Intel's CCI */
if (err != IX SUCCESS)
        panic("ix cci init failed\n");
printk("%s: Creating timer thread\n",NAT_DRIVER_NAME);
err=ix cci exe run(NULL, exe init f, exe fini f, "NAT timer",
                   &exeHandle);
if (err != IX SUCCESS) {
        ix cci fini();
        panic("ix cci exe run failed\n");
return 0;
```

# **Core Initialization And Pseudo-Device Driver (13)**

```
/* Cleanup */
void cleanup module()
        ix error err;
        int i;
        /* Stop each of the assigned microengines */
        for (i=0;i<ME NUM;i++) {</pre>
                if ((ME_MASK>>i)&Ox1) {
                         printk("%s: Stopping ME%i\n",NAT_DRIVER_NAME,i);
                         err = ix rm ueng stop(ME ID(i));
                         if (err != IX_SUCCESS)
                             printk(
                                  "%s: ix rm ueng stop failed for ME %i\n",
                                  NAT DRIVER NAME, i);
                 }
        /* Unregister the packet handler */
        ix rm packet handler unregister(NAT CC ID);
        /* Terminate the timer thread */
        printk("%s: Stopping timer thread\n",NAT_DRIVER_NAME);
        ix cci exe shutdown(exeHandle);
        ix_cci_fini();
```

ł

# **Core Initialization And Pseudo-Device Driver (14)**

```
/* Terminate the Resource Manager */
        ix rm term();
        /* unregister pseudo-device */
        unregister chrdev(NAT major, NAT DRIVER NAME);
}
/* NAT table management: periodically go through the timer table */
/* and update (age) each of the timers
                                                                   */
ix error nat table timer(void* dummy)
        int i;
        for (i=0;i<2*NAT TABLE SIZE;i++)</pre>
                if (f nat table[i].valid)
                        f_timer[i]=f_timer[i]>>1;
        return(IX_SUCCESS);
}
ix error exe init f(ix exe handle exeHandle, void** ppContext)
        ix cc init context dummy;
        return ix cci cc create(exeHandle, cc init f, cc fini f,
                                 (void*)&dummy,&ccHandle);
}
ix error exe fini f(ix exe handle exeHandle, void* pContext)
        return ix cci cc destroy(ccHandle);
```

# **Core Initialization And Pseudo-Device Driver (15)**

ix\_error cc\_init\_f(ix\_cc\_handle ccHandle,void\*\* ppContext)

ix\_error cc\_fini\_f(ix\_cc\_handle ccHandle,void\* pContext)

return ix\_cci\_cc\_remove\_event\_handler(ccHandle,eveHandle);

# **Core Initialization And Pseudo-Device Driver (16)**

```
/* Patch the microcode symbols before actually loading microcode. */
/*
                                                             */
/* The imported variables that must be patched are:
                                                             */
/* BUF FREE LISTO
                -- get from freelist allocation,
                                                             */
                     used by all microblocks
/*
                                                             */
                  -- get from freelist allocation,
                                                             */
/* BUF SRAM BASE
                     used by all microblocks
                                                             */
/*
/* DL REL BASE
                  -- compute from freelist allocation
                                                             */
/*
                     parameters, used by all microblocks
                                                             */
                  -- get from freelist allocation (RX ublock)
/* FREE LIST ID
                                                             */
/* PACKET COUNTERS SRAM BASE -- get from memory allocation for
                                                             */
                             RX counters, used by RX ublock
/*
                                                             */
/* PACKET TX COUNTER BASE
                         -- get from memory allocation for
                                                             */
/*
                             TX counters, used by TX ublock
                                                             */
/* ARP TABLE BASE
                  -- get from memory allocation,
                                                             */
/*
                     used by NAT microblock only
                                                             */
                  -- get from memory allocation,
/* NAT TABLE BASE
                                                             */
                     used by NAT microblock only
/*
                                                             */
/* TIMER TABLE BASE -- get from memory allocation,
                                                             */
                     used by NAT microblock only
/*
                                                             */
/* GATEWAY IP ADDR -- gateway IP address, hardcoded,
                                                             */
/*
                     used by NAT microblock only
                                                             */
/* IFO IP, IF1 IP,
                                                             */
/* IFO ETH WO, IFO ETH W1,
                                                             */
/* IF1 ETH W0, IF1 ETH W1 -- interface settings from interface
                                                             */
/*
                          table, used by NAT microblock only
                                                             */
```

# **Core Initialization And Pseudo-Device Driver (17)**

```
int patch microblocks(ix buffer free list info
                                                 hwFreeListInfo)
        ix error err;
        ix imported symbol importSymbols[15];
        ix uint32
                        memChan;
        ix uint32
                        offset;
        /* Set common symbols */
        importSymbols[0].m Name="BUF_FREE_LISTO";
        importSymbols[0].m Value = hwFreeListInfo.m FreeListInfo;
        importSymbols[1].m Name="BUF SRAM BASE";
        err = ix rm get phys offset( hwFreeListInfo.m pMetaBaseAddress,
                                     NULL, &memChan, &offset, NULL);
        if (err != IX SUCCESS)
                panic("ix rm get phys offset failed for %s\n",
                      importSymbols[1].m Name);
        importSymbols[1].m Value = ME SRAM ADDR(offset,memChan);
        importSymbols[2].m Name = "DL REL BASE";
        err = ix rm get phys offset(hwFreeListInfo.m pDataBaseAddress,
                                    NULL, &memChan, &offset, NULL);
        if (err != IX SUCCESS)
                panic("ix_rm_get_phys_offset failed for %s\n",
                      importSymbols[2].m Name);
        importSymbols[2].m_Value = offset -
             ((importSymbols[1].m_Value*hwFreeListInfo.m_DataElementSize)/
                hwFreeListInfo.m MetaElementSize);
```

### **Core Initialization And Pseudo-Device Driver (18)**

```
/* Set RX specific symbols */
importSymbols[3].m Name="PACKET COUNTERS SRAM BASE";
err = ix_rm_get_phys_offset(rx_cntr,NULL,&memChan,&offset,NULL);
if (err != IX SUCCESS)
        panic("ix rm get phys offset failed for %s\n",
              importSymbols[3].m Name);
importSymbols[3].m Value = ME SRAM ADDR(offset,memChan);
importSymbols[4].m Name="FREE LIST ID";
importSymbols[4].m Value = hwFreeListInfo.m FreeListInfol;
/* Patch ME 0x00 -- RX microblock */
err = ix rm ueng patch symbols(0x00,5, importSymbols);
if (err != IX SUCCESS)
     panic("ix rm ueng patch symbols failed for RX microblockn");
/* Set NAT specific symbols */
importSymbols[3].m Name="NAT_TABLE_BASE";
err = ix rm get phys offset((void*)f nat table,
                             NULL, &memChan, &offset, NULL);
if (err != IX SUCCESS)
        panic("ix rm get phys offset failed for %s\n",
               importSymbols[3].m_Name);
importSymbols[3].m Value = offset;
importSymbols[4].m Name="ARP_TABLE_BASE";
err = ix rm get phys offset((void*)arp table,
                            NULL, &memChan, &offset, NULL);
```

#### **Core Initialization And Pseudo-Device Driver (19)**

```
if (err != IX_SUCCESS)
        panic("ix rm get phys offset failed for %s\n",
              importSymbols[3].m Name);
importSymbols[4].m Value = ME SRAM ADDR(offset,memChan);
importSymbols[5].m Name="GATEWAY IP ADDR";
importSymbols[5].m Value=gateway ip;
importSymbols[6].m Name="IF0 IP";
importSymbols[6].m Value=iface table[0].ip addr;
importSymbols[7].m Name="IF0 ETH W0";
importSymbols[7].m Value=iface table[0].eth w0;
importSymbols[8].m Name="IF0 ETH W1";
importSymbols[8].m Value=iface table[0].eth w1;
importSymbols[9].m_Name="IF1_IP";
importSymbols[9].m Value=iface table[1].ip addr;
importSymbols[10].m Name="IF1 ETH W0";
importSymbols[10].m Value=iface table[1].eth w0;
importSymbols[11].m Name="IF1 ETH W1";
importSymbols[11].m Value=iface table[1].eth w1;
importSymbols[12].m Name="TIMER TABLE BASE";
err = ix rm get phys offset((void*)f timer,
                            NULL, &memChan, &offset, NULL);
if (err != IX SUCCESS)
        panic("ix rm get phys offset failed for %s\n",
              importSymbols[3].m Name);
importSymbols[12].m Value = ME SRAM ADDR(offset, memChan);
```

### **Core Initialization And Pseudo-Device Driver (20)**

```
/* Patch ME 0x01 -- NAT microblock */
err = ix rm ueng patch symbols(0x01,13, importSymbols);
if (err != IX_SUCCESS)
    panic("ix rm ueng patch symbols failed for NAT microblock\n");
/* Set counter base for TX */
importSymbols[3].m Name="PACKET TX COUNTER BASE";
err = ix rm get phys offset((void*)tx cntr,
                            NULL, &memChan, &offset, NULL);
if (err != IX SUCCESS)
        panic("ix rm get phys offset failed for %s\n",
              importSymbols[3].m Name);
importSymbols[3].m Value = ME SRAM ADDR(offset,memChan);
/* Patch ME 0x02 -- TX microblock */
err = ix rm ueng patch symbols(0x02,4, importSymbols);
if (err != IX SUCCESS)
   panic("ix rm ueng patch symbols failed for TX microblock\n");
return(1);
```

}

### **Core Initialization And Pseudo-Device Driver (21)**

```
/* Function to create RX and TX scratch memory rings */
int create scr rings()
        ix error err;
        err = ix rm hw scratch ring create(0,
                           (PKT RX TO NAT SCR RING SIZE>>9),
                           PKT RX TO NAT SCR RING, &rxToNatRing);
        if (err != IX_SUCCESS)
          panic("ix rm hw scratch ring create failed for Rx->Nat ring\n");
        err = ix rm hw scratch ring create(0,
                           (PACKET TX SCR RING 0 SIZE>>9),
                           PACKET TX SCR RING 0, &txScrRing[0]);
        if (err != IX SUCCESS)
             panic("ix rm hw scratch ring create failed for TX 0 ring\n");
        err = ix rm hw scratch ring create(0,
                           (PACKET TX SCR RING 1 SIZE>>9),
                           PACKET TX SCR RING 1, &txScrRing[1]);
        if (err != IX SUCCESS)
             panic("ix rm hw scratch ring create failed for TX 1 ring\n");
        err = ix rm hw scratch ring create(0,
                           (PACKET TX SCR RING 2 SIZE>>9),
                           PACKET TX SCR RING 2, &txScrRing[2]);
        if (err != IX SUCCESS)
             panic("ix rm hw scratch ring create failed for TX 2 ring\n");
        err = ix rm hw scratch ring create(0,
                           (PACKET TX SCR RING 3 SIZE>>9),
                           PACKET TX SCR RING 3, &txScrRing[3]);
```

# **Core Initialization And Pseudo-Device Driver (22)**

```
if (err != IX_SUCCESS)
    panic("ix_rm_hw_scratch_ring_create failed for TX 3 ring\n");
return(1);
```

}

# **Core Initialization And Pseudo-Device Driver (23)**

/\* Function to initialize the 128-bit and 48-bit hash multipliers \*/ int init\_hash()

```
ix error err;
ix hash multiplier 128 hash128m;
ix hash multiplier 48 hash48m;
hash128m.m LWO=HASH MULT WO;
hash128m.m LW1=HASH MULT W1;
hash128m.m LW2=HASH MULT W2;
hash128m.m LW3=HASH MULT W3;
printk("%s: Setting hash 128 multiplier to 0x%08X%08X%08X\n",
      NAT DRIVER NAME,
      (unsigned int)hash128m.m_LW3, (unsigned int)hash128m.m_LW2,
      (unsigned int)hash128m.m LW1, (unsigned int)hash128m.m LW0);
err=ix rm hash 128 multiplier set(&hash128m);
if (err != IX_SUCCESS)
        panic("ix rm hash 128 multiplier set failed\n");
if (err != IX SUCCESS)
        panic("ix rm hash 64 multiplier set failedn");
hash48m.m LWO=HASH MULT WO;
hash48m.m LW1=HASH MULT W1;
printk("%s: Setting hash 48 multiplier to 0x%08X%08X\n",
     NAT DRIVER NAME,
     (unsigned int)hash48m.m LW1, (unsigned int)hash48m.m LW0);
err=ix rm hash 48 multiplier set(&hash48m);
if (err != IX_SUCCESS)
        panic("ix rm hash 48 multiplier set failedn");
return(1);
```

}

ł

### Packet Formats Used By The Core (1)

/\* NAT\_net.h - protocol declarations used by the core component \*/

```
/* Ethernet packet header */
typedef struct eth s {
        unsigned char e dst[6];
        unsigned char e src[6];
        unsigned short e_type;
        unsigned short data[1];
} eth;
/* ARP packet header
                          */
typedef struct arp s {
        unsigned short ar hrd;
        unsigned short ar pro;
        unsigned char ar hln;
        unsigned char ar_pln;
        unsigned short ar op;
        unsigned char ar sha[6];
        /* declared as two shorts for alignment */
        unsigned short ar_spal;
        unsigned short ar spa2;
        unsigned char ar tha[6];
        unsigned int ar_tpa;
```

} arp;

#### **Packet Formats Used By The Core (2)**

```
/* IP packet header
                          */
typedef struct ip s {
        unsigned char ip v : 4;
        unsigned char ip hl : 4;
        unsigned char ip tos;
        unsigned short ip len;
        unsigned short ip id;
        unsigned short ip frag;
        unsigned char ip ttl;
        unsigned char ip p;
        unsigned short ip sum;
        unsigned int ip_src;
        unsigned int ip dst;
        unsigned int data[1];
} ip;
/* ICMP packet header
                          */
typedef struct icmp s {
        unsigned char icmp type;
        unsigned char icmp code;
        unsigned short icmp cksum;
        unsigned short icmp id;
        unsigned short icmp_seq;
        unsigned int data[1];
```

} icmp;

### Packet Formats Used By The Core (3)

```
/* TCP packet header
                          */
typedef struct tcp_s {
        unsigned short tcp_sport;
        unsigned short tcp dport;
        unsigned int tcp_seq;
        unsigned int tcp ack;
        unsigned char tcp offset;
        unsigned char top code;
        unsigned short tcp window;
        unsigned short tcp cksum;
        unsigned short tcp_urgptr;
        unsigned int data[1];
} tcp;
/* UDP packet header
                          */
typedef struct udp_s {
        unsigned short udp_sport;
        unsigned short udp dport;
        unsigned short udp len;
        unsigned short udp cksum;
} udp;
/* Transmit request structure */
typedef struct tx req s {
        unsigned int valid : 1;
        unsigned int reserved : 3;
        unsigned int port : 4;
        unsigned int buff handle : 24;
} ix_tx_req;
```

## **Core Component Packet Handler (1)**

/\* NAT\_pkt\_handler.c - packet handler and table management functions \*/

```
#include <enpv2 types.h>
#include <ix rm.h>
                           /* Code uses the IXA Resource Manager
                                                                        */
#undef LINUX
                           /* Prevents the compiler from complaining
                                                                        */
#include <ix cc.h>
                           /* Code uses the IXA CCI
                                                                        */
#include <ix cci.h>
#include "NAT shared defs.h"
#include "NAT_types.h"
#include "NAT net.h"
/* macro to drop a packet and quit packet handler */
#define drop(arg hBuffer) { ix rm buffer free(arg hBuffer);\
                           return IX SUCCESS; }
/* Verbosity level */
extern int verb;
/* Static gateway IP address */
extern unsigned int gateway_ip;
/* Static network interface configuration */
extern net if iface table[];
/* Global NAT port */
static unsigned short global nport=0;
/* Scratch rings */
extern ix hw ring handle rxToNatRing, txScrRing[];
```

## **Core Component Packet Handler (2)**

```
/* List of free buffers */
extern ix_buffer_free_list_handle hwFreeList;
```

```
/* Pointers to various run-time data structures */
extern nat entry *f nat table, *r nat table;
extern unsigned int *f_index, *r_index;
extern arp entry *arp table;
extern unsigned char *f_timer, *r_timer;
/* global procedures */
ix error nat pkt handler(ix buffer handle,ix uint32,void*);
ix error resolve arp(unsigned int);
/* local procedures */
static void process_arp_req(arp*, ix_hw_buffer_meta*,
                             ix buffer handle,eth*);
static void process arp rep(arp*, ix hw buffer meta*);
static void send icmp echo rep(ip*, icmp*, ix hw buffer meta*,
                                ix buffer handle,eth*);
static int
            process icmp(icmp*, nat_entry*);
static int
            process udp(udp*,nat entry*);
static int
            process tcp(tcp*,nat entry*);
static char* find arp entry(unsigned int);
static int
             add arp entry(arp entry*);
             add nat entry(nat entry*);
static int
```

```
static void add_r_nat_entry(unsigned int);
```

static void del\_nat\_entry(unsigned int);

# **Core Component Packet Handler (3)**

```
static int set new nport(nat_entry*);
static void send pkt(void*, unsigned int, eth*,
                      unsigned char *, unsigned short);
/* Packet handler called when an exception packet arrives */
ix error nat pkt handler(
        ix buffer handle arg hBuffer,
        ix uint32 arg UserData,
                                   /* exception code */
       void* arg pContext
{
        ix hw buffer meta *meta data;
       void *buf;
        int cksum;
        eth *eth pkt;
        arp *arp_pkt;
        arp entry ae;
        ip* ip_pkt;
        icmp* icmp pkt;
        tcp* tcp_pkt;
       udp* udp_pkt;
       nat entry ne;
        char * qw eth;
        ix rm buffer get data(arg hBuffer, &buf);
        ix rm buffer get meta(arg hBuffer, (void **)&meta data);
        eth_pkt=(eth*)((int)buf+(int)(meta_data->m_Offset));
```

#### **Core Component Packet Handler (4)**

```
if (eth_pkt->e_type == ETH_ARP) { /* got ARP */
        if (verb == VERBOSE)
                printk("Got ARP packet\n");
        arp_pkt=(arp*)(eth_pkt->data);
        if ( arp pkt->ar op == ARP REQ &&
             arp pkt->ar tpa ==
                 iface table[meta data->m InputPort].ip addr) {
                /* Process ARP request */
                process arp reg(arp pkt, meta data,
                                arg hBuffer, eth pkt);
                return IX_SUCCESS;
        } else
        if ( arp pkt->ar op == ARP REP &&
             arp pkt->ar tpa ==
                 iface table[meta data->m InputPort].ip addr) {
                /* Process an ARP reply */
                process_arp_rep(arp_pkt,meta_data);
                return IX SUCCESS;
        } else drop(arg hBuffer);
if (eth pkt->e type != ETH IP)
        drop(arg hBuffer);
/* got IP packet */
ip pkt=(ip*)(eth pkt->data);
```

#### **Core Component Packet Handler (5)**

```
if (verb == VERBOSE) {
       printk("Got IP packet\n");
       printk("\tIP src = %i.%i.%i.%i\n",IP2B(ip pkt->ip src));
       printk("\tIP dst = %i.%i.%i.%i\n",IP2B(ip pkt->ip dst));
       printk("\tprotocol: %i\n", ip pkt->ip p);
       printk("\tingress port = %i\n", meta_data->m_InputPort);
/* For simplicity, the code drops a datagram that */
/* has IP options
                                                   */
if ( ip pkt \rightarrow 1 > 5)
   drop(arg hBuffer);
if ( ip pkt->ip dst ==
      iface table[meta data->m InputPort].ip addr) {
    /* The packet is destined to the NAT box itself */
        if (ip pkt->ip p == IPT_ICMP) { /* got ICMP */
                icmp pkt=(icmp*)(ip pkt->data);
                if (icmp pkt->icmp type == ICMP ECHO REQ) {
                        /* received ping */
                        if (verb == VERBOSE)
                                printk("Got ping for us\n");
                        /* Send an echo reply */
                        send icmp echo rep(ip pkt, icmp pkt,
                                            meta data,
                                            arg_hBuffer,eth pkt);
                        return IX_SUCCESS;
                }
        }
```

#### **Core Component Packet Handler (6)**

```
if ( /* Packet not from the gateway */
   meta data->m InputPort != NAT IFC &&
    /* An entry is present in ARP table for the gateway */
    (qw eth = find arp entry(GW IP)) != NULL ) {
        /* The packet is an exception because the NAT lookup */
        /* failed, so add a new entry to the NAT table
                                                              */
       ne.valid=0; /* will be set to 1 later */
       ne.prot=ip pkt->ip p;
       ne.ip addr loc=ip pkt->ip src;
       ne.ip addr rem=ip pkt->ip dst;
        switch (ip pkt->ip p) {
                case IPT ICMP:
                        /* packet is ICMP */
                        icmp pkt=(icmp*)(ip pkt->data);
                        if (process icmp(icmp pkt, \&ne) < 0)
                                drop(arg hBuffer);
                        break;
                case IPT TCP:
                        /* received TCP */
                        tcp_pkt=(tcp*)(ip_pkt->data);
                        if (process_tcp(tcp_pkt, \&ne) < 0)
                                drop(arg hBuffer);
                        break;
                case IPT UDP:
                        /* received UDP */
                        udp_pkt=(udp*)(ip_pkt->data);
                        if (process udp(udp pkt, \&ne) < 0)
                                drop(arg hBuffer);
                        break;
```

# **Core Component Packet Handler (7)**

```
default: drop(arg_hBuffer);
```

```
/* Create an ARP entry for this packet in case a reply */
        /* comes later
                                                                 */
        ae.ip addr=ip pkt->ip src;
        ae.eth w0=*(int*)eth pkt->e src;
        ae.eth wl=(*(short*)\&eth pkt->e src[4]);
        ae.ifnum = meta_data->m_InputPort;
        ae.valid = 1;
        /* Update the IP checksum */
        cksum = ip pkt->ip sum+(ip pkt->ip src>>16)
              + (ip pkt->ip src&0xFFFF)
              + ((~iface table[NAT IFC].ip addr)>>16)
              + ((~iface table[NAT IFC].ip addr)&0xFFFF);
        cksum=(cksum&0xFFFF)+(cksum>>16);
        cksum=(cksum&0xFFFF)+(cksum>>16);
        /* Update the IP packet */
        ip_pkt->ip_src = iface_table[NAT_IFC].ip_addr;
        ip pkt->ip sum=cksum&0xFFFF;
        /* Transmit the packet */
        send pkt((void*)arg hBuffer,NAT IFC,eth pkt,
                 qw eth, ETH IP);
        /* Add the ARP entry that was created above */
        add arp entry(&ae);
        return IX SUCCESS;
/* Drop the packet */
drop(arg hBuffer);
```

}

# **Core Component Packet Handler (8)**

```
/* Function to pass packet to TX microblock */
static void send pkt(void* buf, unsigned int ifnum, eth* eth pkt,
       unsigned char * eth_addr, unsigned short eth type)
{
        ix tx req txreq;
        ix uint32 txreq size;
        /* set ethernet addresses */
        *(int*)eth pkt->e dst=*(int*)eth addr;
        *(short*)((int*)eth_pkt->e_dst+1)=*(short*)((int*)eth_addr+1);
        *(int*)eth pkt->e src=iface table[ifnum].eth w0;
        *(short*)((int*)eth pkt->e src+1)=(iface table[ifnum].eth w1>>16);
        eth_pkt->e_type = eth_type;
        /* prepare Tx request */
        txreq.valid = 1;  /* valid request */
        txreq.reserved=0; /* reserved -- set to zero */
        txreq.port = ifnum; /* outgoing interface */
        txreq.buff_handle = (unsigned int)buf; /* buffer handle */
        txreq size=1;
        /* put Tx request on appropriate Tx scratch ring */
        ix rm hw ring put(txScrRing[txreq.port], &txreq size,
                          (ix uint32 *)&txreq);
```

}

# **Core Component Packet Handler (9)**

```
/* ARP lookup function */
char * find arp entry(unsigned int ipaddr)
í
        ix hash 48 hash48v;
        int i, j;
        hash48v.m_LWO = ipaddr;
        hash48v.m LW1 = 0;
        ix rm hash 48 hash(&hash48v);
        j = hash48v.m LWO&ARP TABLE BIT MASK;
        for (i=0;i<ARP_TABLE_SIZE;i++) {</pre>
                if (ipaddr == arp table[j].ip addr && arp table[j].valid)
                         return((char*)&(arp_table[j].eth_w0));
                j=(j+1)&ARP TABLE BIT MASK;
        return NULL;
}
/* Function to resolve an ARP entry for given IP address (used */
/* to obtain an ARP entry for the gateway)
                                                                 */
ix error resolve arp(unsigned int ipaddr)
        char eth_bcast[]={0xff,0xff,0xff,0xff,0xff,0xff;
        void * buf;
        eth *eth pkt;
        arp *arp pkt;
        ix buffer handle hBuffer;
        ix hw buffer meta *meta data;
        int i;
```

# **Core Component Packet Handler (10)**

```
for (i=0;i<GW MAC RES ATTEMPTS;i++) {
        ix rm buffer alloc(hwFreeList,&hBuffer);
        ix rm buffer get data(hBuffer,(void**)&buf);
        ix rm buffer get meta(hBuffer, (void **)&meta data);
        meta data->m Offset=0;
        meta data->m BufferSize=60;
        meta data->m PacketSize=60;
        eth_pkt=(eth*)((int)buf+(int)meta_data->m_Offset);
        arp_pkt=(arp*)(eth_pkt->data);
        arp pkt->ar hrd = 1; /* Ethernet */
        arp pkt->ar hln = 6;
        arp pkt->ar pro = ETH IP; /* IPv4 */
        arp pkt -> ar pln = 4;
        arp pkt->ar op = ARP REO;
        *(int*)arp pkt->ar tha=0;
        *(short*)((int*)arp_pkt->ar_tha+1)=0;
        arp pkt->ar tpa=GW IP;
        *(int*)arp pkt->ar sha=iface table[NAT_IFC].eth w0;
        *(short*)((int*)arp pkt->ar sha+1)=
                (iface table[NAT IFC].eth w1>>16);
        arp_pkt->ar_spa1=
                (short)(iface table[NAT IFC].ip addr>>16);
        arp_pkt->ar_spa2=
                (short)(iface table[NAT IFC].ip addr&0xFFFF);
        send pkt((void*)hBuffer, NAT_IFC, eth_pkt,
                 eth bcast, ETH ARP);
        printk("%s: Resolving gateway MAC address...\n",
                NAT DRIVER NAME);
```

## **Core Component Packet Handler (11)**

```
ix ossl sleep(500);
                if (find arp_entry(GW_IP) != NULL)
                        return IX SUCCESS;
        return(-1);
}
/* Function to process an ARP request */
void process arp req(arp* arp pkt, ix hw buffer meta* meta data,
        ix buffer handle arg hBuffer, eth *eth pkt)
{
        arp entry ae;
        ae.ip addr = (arp pkt->ar spa1<<16) arp pkt->ar spa2;
        ae.eth w0 = *(int*)arp_pkt->ar_sha;
        ae.eth w1 = (*(short*)((int*)arp_pkt->ar_sha+1));
        ae.ifnum = meta_data->m_InputPort;
        ae.valid = 1i
        arp_pkt->ar_op = ARP_REP;
        *(int*)arp pkt->ar tha=*(int*)arp pkt->ar sha;
        *(short*)(arp_pkt->ar_tha+4)=*(short*)(arp_pkt->ar_sha+4);
        arp pkt->ar tpa=(arp pkt->ar spa1<<16) arp pkt->ar spa2;
        *(int*)arp_pkt->ar_sha=iface_table[meta_data->m_InputPort].eth w0;
        *(short*)(arp_pkt->ar_sha+4)=
                         iface table[meta data->m InputPort].eth w1>>16;
        arp pkt->ar spal=iface table[meta data->m InputPort].ip addr>>16;
        arp pkt->ar spa2=
                    iface table[meta data->m InputPort].ip addr&0xFFFF;
```

### **Core Component Packet Handler (12)**

```
send_pkt((void*)arg_hBuffer, meta_data->m_InputPort,
                  eth pkt, eth pkt->e src, ETH ARP);
        if (verb == VERBOSE)
                printk("Sent ARP reply\n");
        /* Also add an entry into arp table */
        if (!add arp entry(&ae))
                printk("%s: ARP table full!", NAT DRIVER NAME);
}
/* Function to process an ARP reply */
void process arp rep(arp* arp pkt, ix hw buffer meta* meta data)
        arp_entry_ae;
        ae.ip addr = (arp pkt->ar spa1<<16) | arp pkt->ar spa2;
        ae.eth w0 = *(int*)arp_pkt->ar_sha;
        ae.eth w1 = *(short*)(arp_pkt->ar_sha+4);
        ae.ifnum = meta_data->m_InputPort;
        ae.valid = 1i
        if (!add arp entry(&ae))
                printk("%s: ARP table full!", NAT_DRIVER_NAME);
}
```

# **Core Component Packet Handler (13)**

```
/* Function to insert an entry into the ARP table.
                                                                          */
/* Note: because our code uses a simplified ARP table in which entries */
/* do not expire, there is no need to check for duplicate entries.
                                                                         */
int add arp entry(arp entry *ae)
í
        ix hash 48 hash48v;
        int i, j;
        hash48v.m LWO = ae->ip addr;
        hash48v.m LW1 = 0;
        ix rm hash 48 hash(&hash48v);
        j = hash48v.m LWO&ARP TABLE BIT MASK;
        for (i=0;i<ARP TABLE SIZE;i++) {</pre>
                if (ae->ip addr == arp table[j].ip addr &&
                    arp_table[j].valid )
                        return(1);
                if ( !arp table[j].valid ) {
                         arp table[j]=*ae;
                        return(1);
                j=(j+1)&ARP_TABLE BIT MASK;
        return(0);
}
```

# **Core Component Packet Handler (14)**

```
/* Function to insert an entry into the NAT table */
int add_nat_entry(nat_entry* ne)
```

```
ix hash 128 hash128v;
unsigned char timer, del_timer;
int i, j, del cand;
hash128v.m LWO = ne->ip addr rem;
hash128v.m LW1 = ne->ip addr loc;
hash128v.m_LW2 = (ne->lport<<16) |ne->rport;
hash128v.m_LW3 = ne->prot;
ix rm hash 128 hash(&hash128v);
j = (hash128v.m LWO&NAT TABLE BIT MASK) << HASH BUCKET SHIFT;
del cand=j;
for (i=0;i<HASH_BUCKET_SIZE;i++,j++) {
        if (f_nat_table[j].valid &&
             ne->ip addr loc == f nat table[j].ip addr loc &&
             ne->ip addr rem == f nat table[j].ip addr rem &&
             ne->lport == f nat table[j].lport &&
             ne->rport == f_nat_table[j].rport &&
             ne->prot == f_nat_table[j].prot ) {
                ne->nport=f_nat_table[j].nport;
                return(1);
        }
```

ł

## **Core Component Packet Handler (15)**

```
if ( !f_nat_table[j].valid ) {
                if (set new nport(ne) < 0)
                        return(-1);
                f nat table[j]=*ne;
                add r nat entry(j);
                f_nat_table[j].valid=1;
                return(1);
        }
        /* No free slot was found; choose a candidate */
        /* for deletion
                                                       */
        timer=f timer[j]|r timer[f index[j]];
        del_timer = f_timer[del_cand] |r_timer[f_index[del_cand]];
        if ( timer < del_timer ||
             ( timer == del timer &&
               f_nat_table[j].prot!=f_nat_table[del_cand].prot &&
               (f nat table[j].prot == IPT ICMP ||
                 (f nat table[j].prot == IPT UDP &&
                   f_nat_table[del_cand].prot == IPT_TCP ))))
                del cand=j;
del nat entry(del cand);
if (set new nport(ne) < 0)
        return(-1);
f nat table[del cand]=*ne;
add r nat entry(del cand);
f nat table[del cand].valid=1;
return(1);
```

# **Core Component Packet Handler (16)**

```
/* Function to delete an entry from the NAT table */
void del nat entry (unsigned int entry index)
ł
        f_nat_table[entry_index].valid=0;
        f timer[entry index]=0;
        r nat table[f index[entry index]].valid=0;
        r timer[f index[entry index]]=0;
}
/* Function to add an entry to the reverse NAT table */
void add r nat entry(unsigned int entry index)
        ix hash 128 hash128v;
        int i, j, k, del_cand, r del cand;
        unsigned char timer, del_timer;
        nat entry *ne=&f nat table[entry index];
        hash128v.m LWO = ne->ip addr rem;
        hash128v.m_LW1 = (ne->nport<<16) |ne->rport;
        hash128v.m LW2 = ne->prot;
        hash128v.mLW3 = 0;
        ix rm hash 128 hash(&hash128v);
        j=(hash128v.m LWO&NAT TABLE BIT MASK)<<HASH BUCKET SHIFT;
        del cand=r index[j];
        r del cand=j;
```

# **Core Component Packet Handler (17)**

```
for (i=0;i<HASH BUCKET_SIZE;i++,j++) {
        /* Check whether the slot is empty */
        if (!r nat table[j].valid) {
                /* we found an empty slot in reverse NAT table */
                r nat table[j]=f nat table[entry index];
                f index[entry index]=j;
                r index[j]=entry index;
                r nat table[j].valid=1;
                return;
        }
        /* Find a canditate for deletion */
       k = r index[j];
        timer = f timer[k] r timer[j];
        del timer=f timer[del cand] r timer[r del cand];
        if ( timer < del_timer ||
             (timer == del timer &&
               f nat table[k].prot!=f nat table[del cand].prot &&
                (f_nat_table[k].prot == IPT_ICMP ||
                   (f nat table[k].prot == IPT UDP &&
                     f nat table[del cand].prot == IPT TCP )))) {
                del_cand=k;
                r_del_cand=j;
        }
}
```

# **Core Component Packet Handler (18)**

```
/* This point is reached if no slot is empty */
        del nat entry(del cand);
        r nat table[r del cand]=f nat table[entry index];
        r index[r del cand] = entry index;
        f index[entry index] = r del cand;
        r nat table[r del cand].valid=1;
/* Function to calculate a value for a new NAT port */
int set new nport(nat entry* ne)
        ix hash 128 hash128v;
        int i,j,k;
        ne->nport=++global_nport;
        /* Try at most NEW_NPORT_ATTEMPS values, and then give up */
        for (i=0;i<NEW NPORT ATTEMPS;i++) {</pre>
                hash128v.m LWO = ne->ip addr rem;
                hash128v.m_LW1 = (ne->nport<<16) |ne->rport;
                hash128v.m_LW2 = ne->prot;
                hash128v.m LW3 = 0;
                ix rm hash 128 hash(&hash128v);
                j=(hash128v.m LWO&NAT TABLE BIT MASK) << HASH BUCKET SHIFT;
```

# **Core Component Packet Handler (19)**

```
for (k=0;k<HASH_BUCKET_SIZE;k++,j++) {
    if ( r_nat_table[j].valid &&
        r_nat_table[j].ip_addr_rem == ne->ip_addr_rem &&
        r_nat_table[j].rport == ne->rport &&
        r_nat_table[j].nport == ne->nport &&
        r_nat_table[j].prot == ne->prot )
        break;
    }
    if (k==HASH_BUCKET_SIZE)
        /* An unused NAT port value has been found */
        return(ne->nport);
    /* try the next NAT port value */
        ne->nport=++global_nport;
}
return(-1);
```

}

# **Core Component Packet Handler (20)**

```
/* Function to send echo response */
void send icmp echo rep(ip* ip pkt, icmp* icmp pkt,
        ix hw buffer meta* meta data, ix buffer handle arg hBuffer,
        eth *eth pkt)
{
        unsigned int cksum;
        icmp pkt->icmp type = ICMP ECHO REP;
        cksum = icmp pkt->icmp cksum+(ICMP ECHO REQ<<8)
              + ((~(ICMP_ECHO_REP<<8))&0xFFFF);
        cksum = (cksum & 0xFFFF) + (cksum >> 16);
        cksum = (cksum & 0xFFFF) + (cksum >> 16);
        icmp pkt->icmp cksum = cksum&0xFFFF;
        ip pkt->ip dst = ip pkt->ip src;
        ip pkt->ip src = iface table[meta data->m InputPort].ip addr;
        send_pkt((void*)arg_hBuffer,meta_data->m_InputPort, eth_pkt,
                 eth pkt->e src, ETH IP);
}
```

#### **Core Component Packet Handler (21)**

```
/* Function to translate ICMP packet */
int process icmp(icmp* icmp pkt, nat entry* ne)
ł
        unsigned int cksum;
        /* If this is not an echo request -- drop the packet */
        if (icmp pkt->icmp type != ICMP ECHO REQ)
                return(-1);
        /* For ICMP echo request we do ID field translation */
        ne->lport=icmp_pkt->icmp_id;
        ne->rport=0;
        if (add nat entry(ne) < 0)
                return(-1);
        icmp pkt->icmp id=ne->nport;
        /* update ICMP checksum */
        cksum= icmp pkt->icmp cksum+ne->lport
             + ((~icmp pkt->icmp id)&0xFFFF);
        cksum=(cksum&0xFFFF)+(cksum>>16);
        cksum=(cksum&0xFFFF)+(cksum>>16);
        icmp pkt->icmp cksum = cksum&0xFFFF;
        return(1);
```

}

#### **Core Component Packet Handler (22)**

```
/* Function to translate TCP packet */
int process tcp(tcp* tcp pkt,nat entry* ne)
       unsigned int cksum;
        if (verb == VERBOSE) {
                printk("\tTCP source port = %i\n", tcp_pkt->tcp_sport);
                printk("\tTCP dest. port = %i\n", tcp pkt->tcp dport);
        /* Perform TCP source port translation */
       ne->lport=tcp_pkt->tcp_sport;
       ne->rport=tcp_pkt->tcp_dport;
        if (add nat entry(ne) < 0)
                return(-1);
        tcp pkt->tcp sport=ne->nport;
        /* Update the TCP checksum */
        cksum = tcp_pkt->tcp_cksum+ne->lport+(ne->ip_addr_loc>>16)
              + (ne->ip addr loc&0xFFFF)+((~tcp pkt->tcp sport)&0xFFFF)
              + ((~iface table[NAT IFC].ip addr)>>16)
              + ((~iface table[NAT IFC].ip addr)&0xFFFF);
        cksum=(cksum&0xFFFF)+(cksum>>16);
        cksum=(cksum&0xFFFF)+(cksum>>16);
        tcp pkt->tcp cksum = cksum&0xFFFF;
       return(1);
```

```
}
```

#### **Core Component Packet Handler (23)**

```
/* Function to translate UDP packet */
int process udp(udp* udp pkt,nat entry* ne)
       unsigned int cksum;
        if (verb == VERBOSE) {
                printk("\tUDP source port = %i\n", udp_pkt->udp_sport);
                printk("\tUDP dest. port = %i\n", udp pkt->udp dport);
        /* Perform UDP source port translation */
       ne->lport=udp_pkt->udp_sport;
       ne->rport=udp_pkt->udp_dport;
        if (add nat entry(ne) < 0)
                return(-1);
        udp_pkt->udp_sport=ne->nport;
        /* Update the UDP checksum */
        if (udp pkt->udp cksum) {
                cksum = udp_pkt->udp_cksum+ne->lport+(ne->ip_addr_loc>>16)
                      + (ne->ip addr loc&0xFFFF)
                      + ((~udp pkt->udp sport)&0xFFFF)
                      + ((~iface table[NAT IFC].ip addr)>>16)
                      + ((~iface table[NAT IFC].ip addr)&0xFFFF);
                cksum=(cksum&0xFFFF)+(cksum>>16);
                cksum=(cksum&0xFFFF)+(cksum>>16);
                udp pkt->udp cksum = cksum&0xFFFF;
       return(1);
```

## **User Interface Application**

- Allows user to interact with core component
- Core component
  - Defines pseudo-device in Linux kernel
  - Installs driver for pseudo-device
- To execute a command, user interface performs an opeation on the pseudo-device

#### **Code For User Interface (1)**

```
/* NAT_control.c -- user interface and control functions for NAT */
```

```
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <enpv2 types.h>
#include <sys/mman.h>
#include "NAT shared defs.h"
#include "NAT_types.h"
#define USAGE ∖
"Usage: %s [ show | clear | silent | arp | nat | verbose ]\n", argv[0]
void AppShow(void)
ł
        int i;
        UINT32 *p1, *p2;
        char buf[1024];
        int natfd;
        natfd = open(NAT DEV FILE, O RDWR, 0);
        if ( natfd == -1 ) \{
                printf("Failed to open %s\n",NAT_DEV_FILE);
                return;
        }
```

#### **Code For User Interface (2)**

```
void AppSilent()
{
        int natfd;
        natfd = open(NAT_DEV_FILE, O_RDWR, 0);
        if ( natfd == -1 ) \{
                printf("Failed to open %s\n",NAT_DEV_FILE);
                return;
        ioctl(natfd,SILENT,NULL);
        close(natfd);
}
void AppVerbose()
        int natfd;
        natfd = open(NAT_DEV_FILE, O_RDWR, 0);
        if ( natfd == -1 ) \{
                printf("Failed to open %s\n",NAT_DEV_FILE);
                return;
        ioctl(natfd,VERBOSE,NULL);
        close(natfd);
}
```

#### **Code For User Interface (3)**

```
void AppSilent()
{
        int natfd;
        natfd = open(NAT_DEV_FILE, O_RDWR, 0);
        if ( natfd == -1 ) \{
                printf("Failed to open %s\n",NAT_DEV_FILE);
                return;
        ioctl(natfd,SILENT,NULL);
        close(natfd);
}
void AppVerbose()
        int natfd;
        natfd = open(NAT_DEV_FILE, O_RDWR, 0);
        if ( natfd == -1 ) \{
                printf("Failed to open %s\n",NAT_DEV_FILE);
                return;
        ioctl(natfd,VERBOSE,NULL);
        close(natfd);
}
```

#### **Code For User Interface (4)**

```
void AppClear( void )
{
    int natfd;
    natfd = open(NAT_DEV_FILE, O_RDWR, 0);
    if ( natfd == -1 ) {
        printf("Failed to open %s\n",NAT_DEV_FILE);
        return;
    }
    ioctl(natfd,CLR_RX_COUNTER,NULL);
    ioctl(natfd,CLR_TX_COUNTER,NULL);
    close(natfd);
    printf("Counters cleared\n");
    return;
}
```

## **Code For User Interface (5)**

```
void AppGetArpTbl( void )
        int natfd, i;
        arp entry buf[ARP TABLE SIZE];
        natfd = open(NAT_DEV_FILE, O_RDWR, 0);
        if ( natfd == -1 ) \{
                printf("Failed to open %s\n",NAT_DEV_FILE);
                return;
        ioctl(natfd,GET_ARP_TABLE,buf);
        close(natfd);
        for (i=0;i<ARP_TABLE_SIZE;i++) {</pre>
          if (buf[i].valid)
            printf(
             "%i.%i.%i.%i -> %02X:%02X:%02X:%02X:%02X:%02X iface: %i\n",
             IP2B(buf[i].ip addr), ETH2B(buf[i].eth w0),
             buf[i].ifnum);
        }
        return;
```

ł

}

#### **Code For User Interface (6)**

```
void AppGetNatTbl( void )
        int natfd,i;
        nat_entry_buf[NAT_TABLE_SIZE];
        unsigned char timer[2*NAT TABLE SIZE];
        natfd = open(NAT DEV FILE, O RDWR, 0);
        if ( natfd == -1 ) 
                printf("Failed to open %s\n",NAT_DEV_FILE);
                return;
        ioctl(natfd,GET_NAT_TABLE,buf);
        ioctl(natfd,GET_TIMER_TABLE,timer);
        for (i=0;i<NAT TABLE SIZE;i++) {</pre>
          if (buf[i].valid) {
            printf("IP local: %i.%i.%i.%i port local: %i\n",
                   IP2B(buf[i].ip addr loc), buf[i].lport);
            printf("IP remote: %i.%i.%i.%i port remote: %i\n",
                   IP2B(buf[i].ip addr rem), buf[i].rport);
            printf("protocol: %i port (NAT): %i timer: %i index: %i\n\n",
                   buf[i].prot, buf[i].nport,
                   timer[i]|timer[NAT_TABLE_SIZE+i],i);
        close(natfd);
        return;
```

#### **Code For User Interface (7)**

```
int main(int argc, char **argv)
{
        if (argc != 2) {
                printf(USAGE);
                return 0;
        if (strncmp(argv[1], "show", 4) == 0) {
                AppShow();
        } else if (strncmp(argv[1], "clear", 5) == 0) {
                AppClear();
        } else if (strncmp(arqv[1], "silent", 6) == 0) 
                AppSilent();
        } else if (strncmp(argv[1], "verbose", 7) == 0) 
                AppVerbose();
        } else if (strncmp(argv[1], "arp", 3) == 0) {
                AppGetArpTbl();
        } else if (strncmp(argv[1], "nat", 3) == 0) 
                AppGetNatTbl();
        } else {
                printf("Invalid parameter\n");
                printf(USAGE);
       return 0;
}
```

## **Summary**

- Example system implements NAT
- Code uses RX and TX microblocks from Intel's SDK
- NAT microblock implements NAT in fast path
- Core component handles exceptions
- User interface provides interaction with core component

# **Questions?**

#### X

# **Switching Fabrics**

## **Physical Interconnection**

- Physical box with backplane
- Individual *blades* plug into backplane slots
- Each blade contains one or more network connections

## **Logical Interconnection**

- Known as *switching fabric*
- Handles transport from one blade to another
- Becomes bottleneck as number of interfaces scales

## **Illustration Of Switching Fabric**



• Any input port can send to any output port

# **Switching Fabric Properties**

- Used inside a single network system
- Interconnection among I/O ports (and possibly CPU)
- Can transfer unicast, multicast, and broadcast packets
- Scales to arbitrary data rate on any port
- Scales to arbitrary packet rate on any port
- Scales to arbitrary number of ports
- Has low overhead
- Has low cost

# **Types Of Switching Fabrics**

- Space-division (separate paths)
- Time-division (shared medium)

## **Space-Division Fabric (separate paths)**



• Can use multiple paths simultaneously

## **Space-Division Fabric (separate paths)**



- Can use multiple paths simultaneously
- Still have *port contention*

• High speed

- High speed
- Low cost

• High speed and low cost!

## **Possible Compromise**

- Separation of physical paths
- Less parallel hardware
- Crossbar design

## **Space-Division (Crossbar Fabric)**



## Crossbar

- Allows simultaneous transfer on disjoint pairs of ports
- Can still have *port contention*

## Crossbar

- Allows simultaneous transfer on disjoint pairs of ports
- Can still have *port contention*

# **Solving Contention**

- Queues (FIFOs)
  - Attached to input
  - Attached to output
  - At intermediate points

## **Crossbar Fabric With Queuing**



#### **Time-Division Fabric (shared bus)**



- Chief advantage: low cost
- Chief disadvantage: low speed

# **Time-Division Fabric (shared memory)**



- *May* be better than shared bus
- Usually more expensive

## **Multi-Stage Fabrics**

- Compromise between pure time-division and pure spacedivision
- Attempt to combine advantages of each
  - Lower cost from time-division
  - Higher performance from space-division
- Technique: limited sharing

## **Banyan Fabric**

- Example of multi-stage fabric
- Features
  - Scalable
  - Self-routing
  - Packet queues allowed, but not required

## **Basic Banyan Building Block**



- Address added to front of each packet
- One bit of address used to select output

#### **4-Input And 8-Input Banyan Switches**



## Summary

- Switching fabric provides connections inside single network system
- Two basic approaches
  - Time-division has lowest cost
  - Space-division has highest performance
- Multistage designs compromise between two
- Banyan fabric is example of multistage

# **Questions?**

#### XIV

#### **Issues In Scaling A Network Processor**

## **Design Questions**

- Can we make network processors
  - Faster?
  - Easier to use?
  - More powerful?
  - More general?
  - Cheaper?
  - All of the above?
- Scale is fundamental

## **Scaling The Processor Hierarchy**

- Make processors faster
- Use more concurrent threads
- Increase processor types
- Increase numbers of processors

#### **The Pyramid Of Processor Scale**



• Lower levels need the most increase

# **Scaling The Memory Hierarchy**

- Size
- Speed
- Throughput
- Cost

## **Memory Speed**

- Access latency
  - Raw read/write access speed
  - SRAM 2 10 ns
  - DRAM 50 70 ns
  - External memory takes order of magnitude longer than onboard

## Memory Speed (continued)

- Memory cycle time
  - Measure of successive read/write operations
  - Important for networking because packets are large
  - Read Cycle time (tRC) is time for successive fetch operations
  - Write Cycle time (tWC) is time for successive store operations

## **The Pyramid Of Memory Scale**



• Largest memory is least expensive

## **Memory Bandwidth**

- General measure of throughput
- More parallelism in access path yields more throughput
- Cannot scale arbitrarily
  - Pinout limits
  - Processor must have addresses as wide as bus

# **Types Of Memory**

| Memory Technology        | Abbreviation    | Purpose                                          |
|--------------------------|-----------------|--------------------------------------------------|
| Synchronized DRAM        | SDRAM           | Synchronized with CPU<br>for lower latency       |
| Quad Data Rate SRAM      | QDR-SRAM        | Optimized for low latency<br>and multiple access |
| Zero Bus Turnaround SRAM | <b>ZBT-SRAM</b> | Optimized for random<br>access                   |
| Fast Cycle RAM           | FCRAM           | Low cycle time optimized<br>for block transfer   |
| Double Data Rate DRAM    | DDR-DRAM        | Optimized for low<br>latency                     |
| Reduced Latency DRAM     | RLDRAM          | Low cycle time and<br>low power requirements     |

## **Memory Cache**

- General-purpose technique
- May not work well in network systems

## **Memory Cache**

- General-purpose technique
- May not work well in network systems
  - Low temporal locality

## **Memory Cache**

- General-purpose technique
- May not work well in network systems
  - Low temporal locality
  - Large cache size (either more entries or larger granularity of access)

## **Content Addressable Memory (CAM)**

- Combination of mechanisms
  - Random access storage
  - Exact-match pattern search
- Rapid search enabled with parallel hardware

# **Arrangement Of CAM**



• Organized as array of slots

# **Lookup In Conventional CAM**

- Given
  - Pattern for which to search
  - Known as *key*
- CAM returns
  - First slot that matches key, or
  - All slots that match key

# **Ternary CAM (T-CAM)**

- Allows masking of entries
- Good for network processor

# **T-CAM Lookup**

- Each slot has bit mask
- Hardware uses mask to decide which bits to test
- Algorithm

```
for each slot do {
    if ( ( key & mask ) == ( slot & mask ) ) {
        declare key matches slot;
    } else {
        declare key does not match slot;
    }
}
```

#### **Partial Matching With A T-CAM**



• Key matches slot #1

# **Using A T-CAM For Classification**

- Extract values from fields in headers
- Form values in contiguous string
- Use a key for T-CAM lookup
- Store classification in slot

## **Classification Using A T-CAM**



# **Software Scalability**

- Not always easy
- Many resource constraints
- Difficulty arises from
  - Explicit parallelism
  - Code optimized by hand
  - Pipelines on heterogeneous hardware

## Summary

- Scalability key issue
- Primary subsystems affecting scale
  - Processor hierarchy
  - Memory hierarchy
- Many memory types available
  - SRAM
  - SDRAM
  - CAM
- T-CAM useful for classification

# **Questions?**

#### XV

#### **Examples Of Commercial Network Processors**

## **Commercial Products**

- Emerge in late 1990s
- Become popular in early 2000s
- Exceed thirty vendors by 2003
- Fewer than thirty vendors by 2004

## **Examples**

- Chosen to
  - Illustrate concepts
  - Show broad categories
  - Expose the variety
- Not necessarily "best"
- Not meant as an endorsement of specific vendors
- Show a snapshot as of 2004

## Short Pipeline Of Unconventional Processors (Agere)

- Two-stage pipeline
  - Classification
  - Forwarding (traffic management)
- Unusual, special-purpose processors
  - Classification uses programmable pattern matching engine
  - Traffic manager uses programmable queue selection mechanism
- Model is *APP550*

#### **Agere Architecture**



## Languages Used By Agere

- FPL
  - Functional Programming Language
  - Produces code for FPP
  - Non-procedural
- C-NP
  - C for Network Processors
  - Produces code for engines on chip
  - Similar to shell scripts

## Architecture Of Agere's APP550 chip



#### **Processors On Agere's APP550**

| Engine                    | Purpose                                                                        |  |
|---------------------------|--------------------------------------------------------------------------------|--|
| Pattern Processing Engine | Classification                                                                 |  |
| State Engine              | Gathering state information for scheduling and verifying flow is within bounds |  |
| Reorder Buffer Manager    | Ensure packet order                                                            |  |
| PDU Assembler             | Collect all blocks of a frame                                                  |  |
| Traffic Manager           | Schedule packets and shape traffic flow                                        |  |
| Stream EDitor (SED)       | Modify outgoing packet                                                         |  |

# **Augmented RISC (Alchemy)**

- Based on MIPS-32 CPU
  - Five-stage pipeline
- Augmented for packet processing
  - Instructions (e.g. multiply-and-accumulate)
  - Memory cache
  - I/O interfaces

### **Alchemy Architecture**



# Parallel Embedded Processors Plus Coprocessors (AMCC)

- One to six nP core processors
- Various engines
  - Packet metering
  - Packet transform
  - Packet policy

### **AMCC Architecture**



12

# Parallel Pipelines Of Homogeneous Processors (Cisco)

- Parallel eXpress Forwarding (PXF)
- Arranged in parallel pipelines
- Packet flows through one pipeline
- Each processor in pipeline dedicated to one task

#### **Cisco Architecture**



# **Pipeline Of Parallel Heterogeneous Processors (EZchip)**

- Four processor types
- Each type optimized for specific task

## **EZchip NP-1c Architecture**



### **EZchip Processor Types**

| Processor Type | Optimized For                              |
|----------------|--------------------------------------------|
| TOPparse       | Header field extraction and classification |
| TOPsearch      | Table lookup                               |
| TOPresolve     | Queue management and forwarding            |
| TOPmodify      | Packet header and content modification     |

# **Extensive And Diverse Processors** (Hifn, formerly IBM)

- Multiple processor types
- Extensive use of parallelism
- Separate ingress and egress processing paths
- Multiple onboard data stores
- Model is *NP4GS3*

### Hifn NP4GS3 Architecture



### **Hifn's Embedded Processor Complex**



# **Packet Engines**

- Found in Embedded Processor Complex
- Programmable
- Handle many packet processing tasks
- Operate in parallel (sixteen)
- Known as *picoengines*

### **Other Processors On The IBM Chip**

| Coprocessor | Purpose                                           |
|-------------|---------------------------------------------------|
| Data Store  | Provides frame buffer DMA                         |
| Checksum    | Calculates or verifies header checksums           |
| Enqueue     | Passes outgoing frames to switch or target queues |
| Interface   | Provides access to internal registers and memory  |
| String Copy | Transfers internal bulk data at high speed        |
| Counter     | Updates counters used in protocol processing      |
| Policy      | Manages traffic                                   |
| Semaphore   | Coordinates and synchronizes threads              |

## Flexible RISC Plus Coprocessors (Motorola C-PORT)

- Onboard processors can be
  - Dedicated
  - Parallel clusters
  - Pipeline

### **C-Port Architecture**



### **Internal Structure Of A C-Port Channel Processor**



• Actually a processor complex

NSD-Intel -- Chapt. 15

# **Extremely Long Pipeline (Xelerated)**

- Pipeline contains 200 processors
- Each processor can execute four instructions per packet
- External coprocessor calls used to pass state

### **Xelerated Architecture**



• Pipeline has 200 stages

### **Xelerated Internal Architecture**



# Summary

- Many network processor architecture variations
- Examples include
  - Augmented RISC processor
  - Embedded parallel processors plus coprocessors
  - Parallel pipelines of homogeneous processors
  - Pipeline of parallel heterogeneous processors
  - Extensive and diverse processors
  - Flexible RISC plus coprocessors
  - Extremely long pipeline

# **Questions?**

### XVII

### **Design Tradeoffs And Consequences**

# Low Development Cost Vs. Performance

- The fundamental economic motivation
- ASIC costs \$1M to develop
- Network processor costs programmer time

# Programmability Vs. Processing Speed

- Programmable hardware is slower
- Flexibility costs...

# Speed Vs. Functionality

- Generic idea:
  - Processor with most functionality is slowest
  - Adding functionality to NP lowers its overall "speed"

# **Speed**

- Difficult to define
- Can include
  - Packet Rate
  - Data Rate
  - Burst size

# Per-Interface Rates Vs. Aggregate Rates

- Per-interface rate important if
  - Physical connections form bottleneck
  - System scales by having faster interfaces
- Aggregate rate important if
  - Fabric forms bottleneck
  - System scales by having more interfaces

# Increasing Processing Speed Vs. Increasing Bandwidth

Will network processor capabilities or the bandwidth of network connections increase more rapidly?

- What is the effect of more transistors?
- Does Moore's Law apply to bandwidth?

# Lookaside Coprocessors Vs. Flow-Through Coprocessors

- Flow-through pipeline
  - Operates at wire speed
  - Difficult to change
- Lookaside
  - Modular and easy to change
  - Invocation can be bottleneck

# Uniform Pipeline Vs. Synchronized Pipeline

- Uniform pipeline
  - Operates in lock-step like assembly line
  - Each stage must finish in exactly the same time
- Synchronized pipeline
  - Buffers allow computation at each stage to differ
  - Synchronization expensive

# Explicit Parallelism Vs. Cost And Programmability

- Explicit parallelism
  - Hardware is less complex
  - More difficult to program
- Implicit parallelism
  - Easier to program
  - Slightly lower performance

# Parallelism Vs. Strict Packet Ordering

- Increased parallelism
  - Improves performance
  - Results in out-of-order packets
- Strict packet ordering
  - Aids protocols such as TCP
  - Can nullify use of parallelism

# Stateful Classification Vs. High-Speed Parallel Classification

- Static classification
  - Keeps no state
  - Is the fastest
- Dynamic classification
  - Keeps state
  - Requires synchronization for updates

## Memory Speed Vs. Programmability

- Separate memory *banks* 
  - Allow parallel accesses
  - Yield high performance
  - Difficult to program
- Non-banked memory
  - Easier to program
  - Lower performance

# I/O Performance Vs. Pin Count

- Bus width
  - Increase to produce higher throughput
  - Decrease to take fewer pins

## **Programming Languages**

- A three-way tradeoff
- Can have two, but not three of
  - Ease of programming
  - Functionality
  - Performance

## **Programming Languages That Offer High Functionality**

- Ease of programming vs. speed
  - High-level language offers ease of programming, but lower performance
  - Low-level language offers higher performance, but makes programming more difficult

## **Programming Languages That Offer Ease Of Programming**

- Speed vs. functionality
  - For restricted language, compiler can generate optimized code
  - Broad functionality and ease of programming lead to inefficient code

## Programming Languages That Offer High Performance

- Ease of programming vs. functionality
  - Optimizing compiler and ease of programming imply a restricted application
  - Optimizing code for general applications requires more programmer effort

## Multithreading: Throughput Vs. Ease Of Programming

- Multiple threads of control can increase throughput
- Planning the operation of threads that exhibit less contention requires more programmer effort

## Traffic Management Vs. High-Speed Forwarding

- Traffic management
  - Can manage traffic on multiple, independent flows
  - Requires extra processing
- Blind forwarding
  - Performed at highest speed
  - Does not distinguish among flows

## Generality Vs. Specific Architectural Role

- General-purpose network processor
  - Used in any part of any system
  - Used with any protocol
  - More expensive
- Special-purpose network processor
  - Restricted to one role / protocol
  - Less expensive, but may need many types

## Special-Purpose Memory Vs. General-Purpose Memory

- General-purpose memory
  - Single type of memory serves all needs
  - May not be optimal for any use
- Special-purpose memory
  - Optimized for one use
  - May require multiple memory types

## Backward Compatibility Vs. Architectural Advances

- Backward compatibility
  - Keeps same instruction set through multiple versions
  - May not provide maximal performance
- Architectural advances
  - Allows more optimizations
  - Difficult for programmers

## Parallelism Vs. Pipelining

- Both are fundamental performance techniques
- Usually used in combination: pipeline of parallel processors
  - How long is pipeline?
  - How much parallelism at each stage?

#### **Summary**

- Many design tradeoffs
- No easy answers

# **Questions?**

