Porting OpenBSD to the Solbourne S4000


Every PMAP Goes to Heaven

The pmap is one of the most critical parts of the kernel. It is responsible for all the gory details of address translation, and also has to maintain the related structures as efficiently as possible. Make a mistake and the kernel dies; use low-quality algorithms or data structures and the kernel crawls.

A simple pmap layout...

Until I figure out how to use the ASI_PID MMU register to get contexts, the S4000 pmap will be a classical two-level page table pmap. A good existing example of such a two-level pmap is the mips pmap.

One nice thing about two-level pmaps is that they are well known, and anyone familiar enough with existing pmap code can write one from scratch in a very short time.

The key component of the pmap code is the struct pmap, which allows the code to manage the entire virtual address space of a given process. A special pmap is also used for the kernel itself, and is available as pmap_kernel() (which is often a macro). The struct pmap gives us full knowledge of the MMU structures for the given pmap, and - of course - only one pmap may be active at any given time.

I started with the following definitions in <machine/pte.h>, describing the page directory and page table entries:

typedef u_int32_t	pt_entry_t;

typedef struct {
	u_int32_t	pde_pa;
	pt_entry_t*	pde_va;
} pd_entry_t;

Then, the struct pmap in <machine/pmap.h> was pretty obvious:

struct pmap {
	pd_entry_t		*pm_segtab;	/* first level table */
	paddr_t			pm_psegtab;	/* pa of above */

	int			pm_refcount;	/* reference count */
	struct simplelock	pm_lock;
	struct pmap_statistics	pm_stats;	/* pmap statistics */
};

typedef struct pmap *pmap_t;

The last three fields are usual administrivia components: a reference count, which prevents pmaps from being destroyed as long as some part of the system still references them, a simplelock (which amounts to nothing at all for single-processor code, but the machine-independent code expects to find it), and a set of statistics counters (which turns out to be necessary, despite what one could think).
The other two fields are the real guts of the pmap: we'll store a pointer to the page directory (first level table), as both virtual and physical addresses.

Since the page directory will always be allocated (unless your pmap uses zero memory (-: ), one could wonder why it exists as a pointer and not an array, removing the need for a memory allocation call.
The reason is that this table is large (4096 bytes), and would give the pmap structure a very awkward size: large, but not a power of two. The kernel allocation routines are much more efficient for either powers of two (such as a 4096-byte page directory) or small structures (such as a pmap structure of 30 bytes or so). If the pmap structure were large, it would cause a lot of fragmentation in kernel memory, which is something we can't afford.
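
The sizes quoted above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a 32-bit address space and 8KB pages (inferred from the table sizes, not stated here), models the pointer member of pd_entry_t as a 32-bit integer so the numbers come out the same on any host, and the names pd_entry32_t, PDT_ENTRIES and PT_ENTRIES are made up for the occasion.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t pt_entry_t;

/*
 * Stand-in for pd_entry_t: the pointer member is modelled as a 32-bit
 * integer, since the target is a 32-bit machine (assumption).
 */
typedef struct {
	uint32_t	pde_pa;
	uint32_t	pde_va;
} pd_entry32_t;

#define PDT_SIZE	4096u	/* page directory size, from the text */
#define PT_SIZE		4096u	/* page table size, from the text */
#define PAGE_SHIFT	13	/* 8KB pages: an assumption */

#define PDT_ENTRIES	(PDT_SIZE / sizeof(pd_entry32_t))
#define PT_ENTRIES	(PT_SIZE / sizeof(pt_entry_t))

/* Bytes of virtual address space covered by one page table. */
static uint64_t
segment_bytes(void)
{
	return (uint64_t)PT_ENTRIES << PAGE_SHIFT;
}

/* Bytes of virtual address space covered by the whole directory. */
static uint64_t
address_space_bytes(void)
{
	return segment_bytes() * PDT_ENTRIES;
}
```

Under these assumptions, 512 directory entries, each pointing to a table of 1024 entries for 8KB pages, cover the 4GB address space exactly.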

...with mapping tricks

So here is how a process pmap would be created:

pmap_t
pmap_create()
{
	pmap_t pmap;
	u_int pde;

	DPRINTF(PDB_CREATE, ("pmap_create()"));

	pmap = pool_get(&pmappool, PR_WAITOK);

	bzero(pmap, sizeof(*pmap));
	pmap->pm_refcount = 1;
	simple_lock_init(&pmap->pm_lock);

	/*
	 * Allocate the page directory
	 */
	pmap->pm_segtab = (pd_entry_t *)uvm_km_zalloc(kernel_map, PDT_SIZE);
	if (pmap_extract(pmap_kernel(), (vaddr_t)pmap->pm_segtab,
	    &pmap->pm_psegtab) == FALSE)
		panic("pmap_create: pmap_extract failed!");

	/*
	 * Shadow the kernel map in all user pmaps.
	 */
	for (pde = (KERNBASE >> PDT_INDEX_SHIFT); pde < NBR_PDE; pde++) {
		pmap->pm_segtab[pde].pde_pa =
		    pmap_kernel()->pm_segtab[pde].pde_pa;
		pmap->pm_segtab[pde].pde_va =
		    pmap_kernel()->pm_segtab[pde].pde_va;
	}

	DPRINTF(PDB_CREATE, (" -> %p\n", pmap));

	return (pmap);
}

The struct pmap is allocated from a pool (a more efficient allocator than malloc for frequently allocated and released objects). Then the first level page directory is allocated from the kernel_map object (a view of the memory available to the kernel) with uvm_km_zalloc(), which also fills the memory with zeroes before returning it. We then extract its physical address, which we will need when programming the ASI_PDBR register in pmap_activate().

Page tables will be allocated on-demand in pmap_enter(), so that only the necessary page tables get allocated (allocating the 512 4KB page tables would not only eat 2MB per pmap, but would also be terribly inefficient). They will not be released automatically when empty, though, in order to save time by not performing such a check in pmap_remove(), and because they might need to be recreated very shortly.
A specific routine, pmap_collect(), is invoked when the virtual memory system is short on physical memory, and tries to free as much physical memory as possible; this routine is (hopefully) not invoked very often, and is allowed to take some time to complete; in our case, it will check for these empty page tables and relinquish them.
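
To make the allocation policy concrete, here is a small userland model of it. This is not the kernel code: toy_pmap, toy_enter, toy_remove and toy_collect are hypothetical names, calloc() and free() stand in for the kernel allocators, and the shift values assume 8KB pages with 8MB covered by each page table.

```c
#include <stdint.h>
#include <stdlib.h>

#define NPDE		512	/* directory entries (4096 / 8) */
#define NPTE		1024	/* entries per 4KB page table */
#define PG_SHIFT	13	/* 8KB pages: an assumption */
#define SEG_SHIFT	23	/* one page table maps 8MB: an assumption */

typedef uint32_t pt_entry_t;

struct toy_pmap {
	pt_entry_t *segtab[NPDE];	/* NULL until first use */
};

/* Enter a mapping, creating the page table on demand. */
static int
toy_enter(struct toy_pmap *pm, uint32_t va, pt_entry_t pte)
{
	unsigned int pdi = va >> SEG_SHIFT;
	unsigned int pti = (va >> PG_SHIFT) & (NPTE - 1);

	if (pm->segtab[pdi] == NULL) {
		pm->segtab[pdi] = calloc(NPTE, sizeof(pt_entry_t));
		if (pm->segtab[pdi] == NULL)
			return -1;
	}
	pm->segtab[pdi][pti] = pte;
	return 0;
}

/* Remove a mapping; the page table is kept even if it is now empty. */
static void
toy_remove(struct toy_pmap *pm, uint32_t va)
{
	unsigned int pdi = va >> SEG_SHIFT;
	unsigned int pti = (va >> PG_SHIFT) & (NPTE - 1);

	if (pm->segtab[pdi] != NULL)
		pm->segtab[pdi][pti] = 0;
}

/* Free every page table with no entries left; return how many. */
static unsigned int
toy_collect(struct toy_pmap *pm)
{
	unsigned int pdi, pti, freed = 0;

	for (pdi = 0; pdi < NPDE; pdi++) {
		if (pm->segtab[pdi] == NULL)
			continue;
		for (pti = 0; pti < NPTE; pti++)
			if (pm->segtab[pdi][pti] != 0)
				break;
		if (pti == NPTE) {
			free(pm->segtab[pdi]);
			pm->segtab[pdi] = NULL;
			freed++;
		}
	}
	return freed;
}
```

The expensive scan lives entirely in toy_collect(), so the common enter and remove paths stay short, which is the trade-off described above.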

The final part is a bit tricky. Even though only one pmap is active at any given time, we must always be able to service interrupts and traps, whose handlers reside in kernel code. This is why, on sparc, the address space is split in two parts: a large part is available to userland processes, ranging from address zero to VM_MAXUSER_ADDRESS, while a distinct part is reserved for the kernel, and ranges from VM_MIN_KERNEL_ADDRESS to VM_MAX_KERNEL_ADDRESS.
Ok, I lied, there is a third part, above VM_MAX_KERNEL_ADDRESS, which is used for I/O (device) mappings, but can be considered as kernel as well. This is summed up in <machine/vmparam.h> as:

/*
 * User/kernel map constants.  Note that sparc/vaddrs.h defines the
 * IO space virtual base, which must be the same as VM_MAX_KERNEL_ADDRESS:
 * tread with care.
 */
#define	VM_MIN_ADDRESS		((vaddr_t)0)
#define	VM_MAX_ADDRESS		((vaddr_t)KERNBASE)
#define	VM_MAXUSER_ADDRESS	((vaddr_t)KERNBASE)
#define	VM_MIN_KERNEL_ADDRESS	((vaddr_t)KERNBASE)
#define	VM_MAX_KERNEL_ADDRESS	((vaddr_t)0xfe000000)

And we have, on sparc, in <machine/vmparam.h>:

#define	KERNBASE	0xf8000000	/* start of kernel virtual space */
#define	KERNTEXTOFF	0xf8004000	/* start of kernel text */

while on solbourne, we use:

#define	KERNBASE	0xfd080000	/* start of kernel virtual space */
#define	KERNTEXTOFF	0xfd084000	/* start of kernel text */

since the PROM loads us at a higher virtual address.

So what the above for() loop does is mimic the kernel part in every single pmap, by copying pmap_kernel()'s page directory entries into the end of the new pmap's page directory, so that both directories point to the very same page tables. Because of this, any change to the kernel mappings is immediately visible and taken into account in the active pmap.
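
The effect of the shadowing loop can be tried out in isolation. In this sketch KERNBASE is the solbourne value quoted above, while PDT_INDEX_SHIFT and NBR_PDE are assumed values (8MB per directory entry, 512 entries), and shadow_kernel() is a hypothetical helper wrapping the loop.

```c
#include <stddef.h>
#include <stdint.h>

#define KERNBASE	0xfd080000u	/* solbourne value, from the text */
#define PDT_INDEX_SHIFT	23		/* assumed: one pde maps 8MB */
#define NBR_PDE		512		/* assumed: 4096 / 8 */

typedef uint32_t pt_entry_t;

typedef struct {
	uint32_t	pde_pa;
	pt_entry_t	*pde_va;
} pd_entry_t;

/*
 * Copy the kernel part of the kernel directory into a user directory,
 * and return the number of entries copied.
 */
static unsigned int
shadow_kernel(pd_entry_t *user, const pd_entry_t *kernel)
{
	unsigned int pde, copied = 0;

	for (pde = KERNBASE >> PDT_INDEX_SHIFT; pde < NBR_PDE; pde++) {
		user[pde] = kernel[pde];
		copied++;
	}
	return copied;
}
```

With these assumed values the loop starts at index 506 and copies six entries. Since only the pointers are copied, a pte written through the kernel directory afterwards is seen through the user directory as well.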

Early allocation

The final part of getting the puzzle into place is the pmap_bootstrap() function, which sets up the initial kernel pmap. As none of the kernel memory allocators are available (because they rely on pmap_kernel() being initialized!), we need to compute our own needs, and satisfy them by pretending the kernel image (code, data, bss and symbols) is larger than what it really is.

Because of this, pmap_bootstrap() is often the most complicated routine in the whole pmap.c file, and this port is no exception. We start by sizing the memory by probing the memory controllers:

void
pmap_bootstrap(size_t promdata)
{
	extern caddr_t end, etext;
	extern vaddr_t esym;
	u_int32_t icuconf;
	u_int8_t imcmcr;
	vaddr_t ekern;
	vaddr_t va, eva;
	unsigned int ntables, tabidx;
	pd_entry_t *pde;
	pt_entry_t *pte;
	extern char **prom_argv;

	/*
	 * Compute memory size by checking the iCU for the number of iMC,
	 * then each iMC for its status.
	 */

	icuconf = *(u_int32_t *)ICU_CONF;
	physmem = 0;

	imcmcr = *(u_int8_t *)MC0_MCR;
	if (imcmcr & MCR_BANK0_AVAIL)
		physmem += (imcmcr & MCR_BANK0_32M) ? 32 : 8;
	if (imcmcr & MCR_BANK1_AVAIL)
		physmem += (imcmcr & MCR_BANK1_32M) ? 32 : 8;

	if ((icuconf & CONF_NO_EXTRA_MEMORY) == 0) {
		imcmcr = *(u_int8_t *)MC1_MCR;
		if (imcmcr & MCR_BANK0_AVAIL)
			physmem += (imcmcr & MCR_BANK0_32M) ? 32 : 8;
		if (imcmcr & MCR_BANK1_AVAIL)
			physmem += (imcmcr & MCR_BANK1_32M) ? 32 : 8;
	}

	/* scale to pages */
	physmem <<= (20 - PAGE_SHIFT);
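
The decoding logic is easier to exercise when pulled out into pure functions. In the sketch below the MCR_* bit values are invented for illustration (the real assignments are not given here), PAGE_SHIFT assumes 8KB pages, and imc_megabytes() and mb_to_pages() are hypothetical helpers.

```c
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define MCR_BANK0_AVAIL	0x01
#define MCR_BANK0_32M	0x02
#define MCR_BANK1_AVAIL	0x04
#define MCR_BANK1_32M	0x08

#define PAGE_SHIFT	13	/* 8KB pages: an assumption */

/* Megabytes reported by one memory controller status byte. */
static unsigned int
imc_megabytes(uint8_t mcr)
{
	unsigned int mb = 0;

	if (mcr & MCR_BANK0_AVAIL)
		mb += (mcr & MCR_BANK0_32M) ? 32 : 8;
	if (mcr & MCR_BANK1_AVAIL)
		mb += (mcr & MCR_BANK1_32M) ? 32 : 8;
	return mb;
}

/* Convert megabytes to pages, as the final shift does. */
static unsigned int
mb_to_pages(unsigned int mb)
{
	return mb << (20 - PAGE_SHIFT);
}
```

A controller with two 32MB banks would report 64MB, which with 8KB pages scales to 8192 pages.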

Then we do administrative initializations:

	/*
	 * Set virtual page size
	 */
	uvmexp.pagesize = PAGE_SIZE;
	uvm_setpagesize();

	/*
	 * Initialize kernel pmap
	 */
	simple_lock_init(&pmap_kernel()->pm_lock);
	pmap_kernel()->pm_refcount = 1;

and it's time to do our real allocations. The first few are trivial:

	/*
	 * Compute kernel fixed memory usage
	 */
	ekern = (vaddr_t)&end;
#if defined(DDB) || NKSYMS > 0
	if (esym != 0)
		ekern = esym;
#endif

	/*
	 * Reserve room for the PROM data we're interested in.
	 */
	prom_argv = (char **)ekern;
	ekern += promdata;

	/*
	 * From then on, all allocations will be multiples of the
	 * page size.
	 */
	ekern = round_page(ekern);

	/*
	 * Reserve two _virtual_ pages for copy and zero operations.
	 */
	vreserve = ekern;
	ekern += 2 * PAGE_SIZE;
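
What these first allocations amount to is a bump allocator: each request is served from the current end of the kernel image, which is then moved up. A minimal sketch of the idea follows; steal() is a hypothetical helper, round_page() is redefined locally, and PAGE_SIZE assumes 8KB pages.

```c
#include <stdint.h>

#define PAGE_SIZE	8192u	/* 8KB pages: an assumption */
#define round_page(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/*
 * Hand out `size' bytes starting at the current end of the kernel
 * image, and advance that end accordingly.
 */
static uint32_t
steal(uint32_t *ekern, uint32_t size)
{
	uint32_t va = *ekern;

	*ekern += size;
	return va;
}
```

Nothing is ever given back, which is fine here: everything reserved this way lives for as long as the kernel runs.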

Then the real tricky part starts. We need to allocate ourselves enough tables to map the whole kernel memory space, as well as the device registers we use for the console output:

	/*
	 * Initialize fixed mappings.
	 * We want to keep the PTW mapping the kernel for now, but all
	 * devices needed during early bootstrap need to have their own
	 * mappings.
	 */

	/* Step 0: reserve memory for the kernel pde. */

	bzero((caddr_t)ekern, PDT_SIZE);
	pmap_kernel()->pm_segtab = (pd_entry_t *)ekern;
	pmap_kernel()->pm_psegtab = ekern;
	ekern += PDT_SIZE;      /* not rounded anymore ! */

	/* Step 1: count how many page tables we'll need. */

	/* kernel address space*/
	ntables = howmany(VM_MAX_KERNEL_ADDRESS - PHYSMEM_BASE, NBSEG);

	/* serial */
	ntables++;

	/* round to a NON multiple of 2 - we were offset half a page before */
	if ((ntables & 1) == 0)
		ntables++;

	/* Step 2: connect them to the page directory */

	bzero((caddr_t)ekern, ntables * PT_SIZE);
	tabidx = 0;

	va = (vaddr_t)PHYSMEM_BASE;
	while (va < VM_MAX_KERNEL_ADDRESS) {
		pde = pmap_pde(pmap_kernel(), va);

		pde->pde_va = (pt_entry_t *)(ekern + tabidx * PT_SIZE);
		pde->pde_pa = (vaddr_t)pde->pde_va;

		va += NBSEG;
		tabidx++;
	}

	va = trunc_page(ZS0_BASE);
	{
		pde = pmap_pde(pmap_kernel(), va);

		pde->pde_va = (pt_entry_t *)(ekern + tabidx * PT_SIZE);
		pde->pde_pa = (vaddr_t)pde->pde_va;

		tabidx++;
	}

	ekern += ntables * PT_SIZE;
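
The rounding of ntables to an odd number in step 1 is there to bring ekern back to a page boundary. The sketch below checks that arithmetic, assuming 8KB pages with 4KB page directory and page tables (which is what "offset half a page" suggests); make_odd() and end_after_tables() are hypothetical helpers.

```c
#include <stdint.h>

#define PAGE_SIZE	8192u	/* 8KB pages: an assumption */
#define PDT_SIZE	4096u	/* half a page */
#define PT_SIZE		4096u	/* half a page */

/* Make the table count odd, as step 1 does. */
static unsigned int
make_odd(unsigned int ntables)
{
	if ((ntables & 1) == 0)
		ntables++;
	return ntables;
}

/*
 * Address reached after placing the page directory and ntables page
 * tables, starting from a page-aligned ekern.
 */
static uint32_t
end_after_tables(uint32_t ekern, unsigned int ntables)
{
	return ekern + PDT_SIZE + ntables * PT_SIZE;
}
```

The page directory leaves ekern half a page past a boundary, so an odd number of half-page tables is needed to land on a boundary again.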

(pmap_pde is a simple inline routine that returns the page directory entry for a given va in a given pmap. Similarly, I have a pde_pte routine which gives me the pointer to the pte for a given va in a given pde, and pmap_pte, which combines both.)
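
These helpers are not listed here, but they could plausibly look like the following. This is a guess at their shape rather than the actual code: the shift and mask values assume 8KB pages and 1024 ptes per table, PT_INDEX_MASK is a made-up name, and struct pmap is cut down to the single field the lookups need.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	13	/* 8KB pages: an assumption */
#define PDT_INDEX_SHIFT	23	/* one pde maps 8MB: an assumption */
#define PT_INDEX_MASK	0x3ffu	/* 1024 ptes per table: an assumption */

typedef uint32_t vaddr_t;	/* 32-bit target */
typedef uint32_t pt_entry_t;

typedef struct {
	uint32_t	pde_pa;
	pt_entry_t	*pde_va;
} pd_entry_t;

struct pmap {
	pd_entry_t	*pm_segtab;	/* first level table */
};

/* Page directory entry covering va in the given pmap. */
static inline pd_entry_t *
pmap_pde(struct pmap *pmap, vaddr_t va)
{
	return &pmap->pm_segtab[va >> PDT_INDEX_SHIFT];
}

/* Pte for va within the given pde, or NULL if it has no page table. */
static inline pt_entry_t *
pde_pte(pd_entry_t *pde, vaddr_t va)
{
	if (pde->pde_va == NULL)
		return NULL;
	return &pde->pde_va[(va >> PAGE_SHIFT) & PT_INDEX_MASK];
}

/* Both steps combined. */
static inline pt_entry_t *
pmap_pte(struct pmap *pmap, vaddr_t va)
{
	return pde_pte(pmap_pde(pmap, va), va);
}
```

Returning NULL for a missing page table lets callers distinguish "no table yet" from "table present, entry invalid".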

And now that we have all our page directories and page tables ``allocated'' and connected, we can fill them:

	/* Step 3: fill them */

	va = (vaddr_t)KERNBASE;
	while (va < ekern) {
		pde = pmap_pde(pmap_kernel(), va);
		eva = (va & PDT_INDEX_MASK) + NBSEG;
		if (eva > ekern)
			eva = ekern;
		pte = pde_pte(pde, va);
		for (; va < eva; va += PAGE_SIZE) {
#ifdef DEBUG
			if (pte == NULL)
				panic("NULL pte for kernel va %x", va);
#endif
			*pte = va | PG_V | PG_U | PG_CACHE;
			if (va < (vaddr_t)&etext)
				*pte |= PG_RO;
			pte++;
		}
	}

	va = trunc_page(ZS0_BASE);
	{
		pte = pmap_pte(pmap_kernel(), va);
#ifdef DEBUG
		if (pte == NULL)
			panic("NULL pte for kernel va %x", va);
#endif
		*pte = va | PG_V | PG_S | PG_U | PG_IO;
	}

Note that, despite allocating kernel page tables up to VM_MAX_KERNEL_ADDRESS, we only initialize ptes up to ekern - the remainder of the page tables is left zeroed, which marks the corresponding pages as invalid (PG_NV), that is, unused.

Finally, our last chore is to tell the memory system about the physical memory available for it:

	/*
	 * Tell the VM system about the available memory.
	 * Physical memory starts at PHYSMEM_BASE; kernel uses space
	 * from KERNBASE to ekern at this point.
	 */

	virtual_avail = ekern;
	virtual_end = VM_MAX_KERNEL_ADDRESS;

	uvm_page_physload(atop(ekern), atop(PHYSMEM_BASE) + physmem,
	    atop(ekern), atop(PHYSMEM_BASE) + physmem, VM_FREELIST_DEFAULT);
	uvm_page_physload(atop(PHYSMEM_BASE), atop(KERNBASE),
	    atop(PHYSMEM_BASE), atop(KERNBASE), VM_FREELIST_DEFAULT);

(this last chunk is not actually what I wrote initially - more on this later).

The last part of the pmap_bootstrap() work is actually delayed until it returns to bootstrap(), which has itself been invoked from the assembly code in locore.s:

	/*
	 * Invoke early C code, still using the PROM trap, so that it can
	 * do proper TLB insertion for us while we are accessing various
	 * onboard devices.
	 */

	call	_C_LABEL(bootstrap)
	 nop

	/*
	 * Step 4: change the trap base register, now that our trap handlers
	 * will function (they need the tables we just set up).
	 *
	 * XXX save old tbr, to be able to return to the PROM?
	 */
	set	trapbase_kap, %g6
	wr	%g6, 0, %tbr
	nop; nop; nop                   ! paranoia

	/*
	 * Step 5: activate our translations...
	 */
	set	_C_LABEL(kernel_pmap_store), %o0
	ld	[%o0 + PMAP_PSEGTAB], %o1
	sta	%o1, [%g0] ASI_PDBR
	nop; nop; nop

	/*
	 * ... flush all stale TLB entries
	 */
	sta	%g0, [%g0] ASI_PID
	sta	%g0, [%g0] ASI_PIID
	sta	%g0, [%g0] ASI_GTLB_INVALIDATE

	/*
	 * ... and disable PTW0 and PTW1 for now (maybe PTW2 as well?)
	 */
	lda	[%g0] ASI_PTW0, %o1
	andn	%o1, PTW_V, %o1
	sta	%o1, [%g0] ASI_PTW0

	lda	[%g0] ASI_PTW1, %o1
	andn	%o1, PTW_V, %o1
	sta	%o1, [%g0] ASI_PTW1

	/* XXX before invalidating PTW2, we need to set up a few fixed TLBs
	   or we won't be able to trap, at all. */
#if 0
	lda	[%g0] ASI_PTW2, %o1
	andn	%o1, PTW_V, %o1
	sta	%o1, [%g0] ASI_PTW2
#endif

Yes, all this work would have been useless if we did not eventually switch to the tables we had just set up! But I had to keep one translation window mapping the kernel, for now... (can you see why? I'll explain this later... after I fix it!)


miod@online.fr