Introduction to Kernel Hacking

profilevidhi03
CSC112project1.pdf

Project 1-Introduction to Kernel Hacking

Preface

The projects are a fundamental part of this course. Projects always require a significant amount of time. Do not procrastinate! It is likely they will take longer than you expect. Do not wait until the day before the assignment is due to start. These assignments should be started pretty much when they are handed out.

All of the assignments will be in C. We assume that you have enough programming background that learning the basics of a new language (if it is indeed new to you) will not be difficult.

To develop a better sense of how an operating system works, you will do a few projects inside a real OS kernel called xv6. The kernel we'll be using is a port of the original Unix (version 6) and is runnable on modern x86 processors. It was developed at MIT and is a small and relatively understandable OS and thus an excellent focus for simple projects. Read more about xv6 from MIT’s website. Also, it is discussed on Hacker News. But the best way to fully understand the system is to read the source code and the book.

Goal

This first project is just a warm-up, and thus relatively light on work. The goal of the project is simple: to add two system calls to xv6. Your system calls, getreadcount() and getopenfilecount(). The former simply returns how many times the read() system call has been called by user processes since the time that the kernel was booted. The latter simply returns how many files have been opened so far by the calling process.

Install xv6

xv6 is very lightweight, only takes a few seconds to compile, and allows remote debugging. This all makes it great for understanding how modern operating systems hang together. To build xv6, you'll use two sets of tools: an x86 emulator, QEMU, for running your kernel; and a compiler toolchain, including assembler, linker, C compiler, and debugger, for compiling and testing your kernel. Read more about the tools.

The easiest way to get a compatible toolchain is to install a modern Linux distribution on your computer. With platform virtualization, Linux can cohabitate with your normal computing

environment. Installing a Linux virtual machine is a two-step process. First, you download the virtualization platform.

We use VirtualBox which is a little slower but free! You can download VirtualBox here and follow the installation instruction here.

(Note, if you are using a Mac OS running on Apple Silicon (M1/M2 chip) where the VirtualBox is not supported, you may use Multipass to install an Ubuntu. Please follow the instruction here. WMware is also a good option.)

Once the virtualization platform is installed, download a boot disk image for the Linux distribution. Ubuntu Desktop is what we use. Please download it here and you would get a file named something like ubuntu-18.04.1-desktop-amd64.iso. Start up your virtualization platform and create a new virtual machine following this tutorial. Use the downloaded Ubuntu image as a boot disk. To conduct step 3 of the tutorial, we need to install tools like gcc inside the Linux virtual machine. The steps would be described below. Also, please note that all the operations and steps in the following paragraphs are conducted inside the Linux virtual machine.

To get the source code of xv6, we first need to install Git, which is now very popular for maintaining open-source projects. Follow the instructions to install git. If you are new to Ubuntu and don’t know how to invoke the terminal, you can check this answer.

Then, we use Git to clone the latest xv6 source code to Ubuntu using the following command in the terminal:

# Download the code git clone git://github.com/mit-pdos/xv6-public.git

In this project, Git is only used to download the xv6 repository. If you are new to Git and interested in learning about it, you can find some good resources here.

(Note, if you are using a Mac OS running on Apple Silicon (M1/M2 chip), you need to use the xv6-riscv version-https://github.com/mit-pdos/xv6-riscv.git. This instruction is written for xv6-public which is developed for x86 architecture. There would be small differences between the xv6-riscv and xv6-public versions.)

To compile and run the xv6, we need to install the required toolchains using the following commands in the terminal:

# Install the compiler

sudo apt install gcc

# Install the compile tool sudo apt install make

# Install the emulator sudo apt install qemu-kvm

Finally, we can compile and boot into the xv6 kernel using the following commands:

# Switch to the source code folder cd xv6-public

# Compile make

# Run under QEMU make qemu-nox

Asides: Try to run ls command to see which programs are provided in the xv6 kernel to play around inside the system; Git clone would create a folder called xv6-public which is the place that holds the source code of the kernel; sudo would run the command following it with higher security privileges as a superuser, and you are required to enter the password; Always run make clean before you want to recompile the kernel; qemu-nox is used to run the system without GUI (here, x refers to graphics window), and run with only the serial console; To exit the xv6, you may simply close the terminal, or press ctrl+A then press X.

xv6 System Call Background

To be able to implement this project, you'll have to understand a little bit about how xv6 implements system calls. A full explanation can be found in chapter 3 of the xv6 book (though most of the low-level details won't be necessary for you to complete this project). As you recall from the OSTEP book or our lectures, a system call is a protected transfer of control from an application (running in user mode) to the OS (running in kernel mode). The general approach, which we refer to as limited direct execution (LDE), enables the kernel to maintain control of the machine while generally letting user applications run efficiently and without kernel intervention.

We'll specifically trace what happens in the code in order to understand a system call. System calls allow the operating system to run code on the behalf of user requests but in a protected manner, both by jumping into the kernel (in a very specific and restricted way) and also by simultaneously raising the privilege level of the hardware, so that the OS can perform certain restricted operations.

System Call Overview

Before delving into the details, we first provide an overview of the entire process. The problem we are trying to solve is simple: how can we build a system such that the OS is allowed access to all of the resources of the machine (including access to special instructions, physical memory, and to any devices) while user programs are only able to do so in a restricted manner?

The way we achieve this goal is with hardware support. The hardware must explicitly have a notion of privilege built into it, and thus be able to distinguish when the OS is running versus typical user applications.

Getting Into The Kernel: A Trap

The first step in a system call begins at the user level with an application. The application that wishes to make a system call (such as read()) calls the relevant library routine. However, all the library version of the system call does is place the proper arguments in relevant registers and issue some kind of trap instruction, as we see in an expanded version of usys.S (Some macros are used to define these functions so as to make life easier for the kernel developer; the example shows the macro expanded to the actual assembly code). You can look at the source code of usys.S and check a nice explanation of it here.

.globl read; read: movl $5, %eax; int $64; ret

File: usys.S

Here we can see that the read() library function actually doesn't do much at all; it moves the value 5 into the register %eax and issues the x86 trap instruction which is confusingly called int (short for interrupt). The value in %eax is going to be used by the kernel to vector to the right system call, i.e., it determines which system call is being invoked. The int instruction takes one argument (here it is 64), which tells the hardware which trap type this is. In xv6, trap 64 is used

to handle system calls. Any other arguments which are passed to the system call are passed on the stack.

Kernel Side: Trap Tables

Once the int instruction is executed, the hardware takes over and does a bunch of work on behalf of the caller. One important thing the hardware does is to raise the privilege level of the CPU to kernel mode; on x86 this usually means moving from a CPL (Current Privilege Level) of 3 (the level at which user applications run) to CPL 0 (in which the kernel runs). Yes, there are a couple of in-between privilege levels, but most systems do not make use of these.

The second important thing the hardware does is to transfer control to the trap vectors of the system. To enable the hardware to know what code to run when a particular trap occurs, the OS, when booting, must make sure to inform the hardware of the location of the code to run when such traps take place. This is done in main.c as follows:

int main(void) { ... tvinit(); // trap vectors initialized here ...

}

FILE: main.c

The routine tvinit() is the relevant one here. Peeking inside of it, we see:

void tvinit(void) { int i;

for(i = 0; i < 256; i++) SETGATE(idt[i], 0, SEG_KCODE<<3, vectors[i], 0);

// this is the line we care about... SETGATE(idt[T_SYSCALL], 1, SEG_KCODE<<3, vectors[T_SYSCALL], DPL_USER);

initlock(&tickslock, "time"); }

FILE: trap.c

The SETGATE() macro is the relevant code here. It is used to set the idt array to point to the proper code to execute when various traps and interrupts occur. For system calls, the single SETGATE() call (which comes after the loop) is the one we're interested in. Here is what the macro does (as well as the gate descriptor it sets):

// Gate descriptors for interrupts and traps struct gatedesc { uint off_15_0: 16; // low 16 bits of offset in segment uint cs: 16; // code segment selector uint args: 5; // # args, 0 for interrupt/trap gates uint rsv1: 3; // reserved(should be zero I guess) uint type: 4; // type(STS_{TG, IG32, TG32}) uint s: 1; // must be 0 (system) uint dpl: 2; // descriptor(meaning new) privilege level uint p: 1; // Present uint off_31_16: 16; // high bits of offset in segment };

// Set up a normal interrupt/trap gate descriptor. // - istrap: 1 for a trap (= exception) gate, 0 for an interrupt gate. // interrupt gate clears FL_IF, trap gate leaves FL_IF alone // - sel: Code segment selector for interrupt/trap handler // - off: Offset in code segment for interrupt/trap handler // - dpl: Descriptor Privilege Level - // the privilege level required for the software to invoke // this interrupt/trap gate explicitly using an int instruction. #define SETGATE(gate, istrap, sel, off, d) \ { \ (gate).off_15_0 = (uint) (off) & 0xffff; \ (gate).cs = (sel); \ (gate).args = 0; \ (gate).rsv1 = 0; \ (gate).type = (istrap)? STS_TG32 : STS_IG32; \ (gate).s = 0; \ (gate).dpl = (d); \ (gate).p = 1; \ (gate).off_31_16 = (uint) (off) >> 16; \

}

FILE: mmu.h

As you can see from the code, all the SETGATE() macros do is set the values of an in-memory data structure. Most important is the off parameter, which tells the hardware where the trap handling code is. In the initialization code, the value vectors[T_SYSCALL] is passed in; thus, whatever the vectors array points to will be the code to run when a system call takes place. There are other details (which are important too); consult an x86 hardware architecture manual (particularly Chapters 3a and 3b) for more information.

Note, however, that we still have not informed the hardware of this information, but rather filled a data structure. The actual hardware informing occurs a little later in the boot sequence; in xv6, it happens in the routine mpmain() in the file main.c, which calls idtinit in trap.c, which calls lidt() in the include file x86.h:

static void mpmain(void) { idtinit(); ...

void idtinit(void) { lidt(idt, sizeof(idt)); }

static inline void lidt(struct gatedesc *p, int size) { volatile ushort pd[3];

pd[0] = size-1; pd[1] = (uint)p; pd[2] = (uint)p >> 16;

asm volatile("lidt (%0)" : : "r" (pd)); }

Here, you can see how (eventually) a single assembly instruction is called to tell the hardware where to find the interrupt descriptor table (IDT) in memory. Note this is done in mpmain() as each processor in the system must have such a table (they all use the same one of course). Finally, after executing this instruction (which is only possible when the kernel is running, in privileged mode), we are ready to think about what happens when a user application invokes a system call.

From Low-level To The C Trap Handler

The OS has carefully set up its trap handlers, and thus we are ready to see what happens on the OS side once an application issues a system call via the int instruction. Before any code is run, the hardware must perform a number of tasks. The first thing it does is those tasks that are difficult/impossible for the software to do itself, including saving the current PC (IP or EIP in Intel terminology) onto the stack, as well as a number of other registers such as the eflags register (which contains the current status of the CPU while the program was running), stack pointer, and so forth. One can see what the hardware is expected to save by looking at the trapframe structure as defined in x86.h.

struct trapframe { // registers as pushed by pusha uint edi; uint esi; uint ebp; uint oesp; // useless & ignored uint ebx; uint edx; uint ecx; uint eax;

// rest of trap frame ushort es; ushort padding1; ushort ds; ushort padding2; uint trapno;

// below here defined by x86 hardware uint err; uint eip;

ushort cs; ushort padding3; uint eflags;

// below here only when crossing rings, such as from user to kernel uint esp; ushort ss; ushort padding4; };

File: x86.h

As you can see from the bottom of the trapframe structure, some pieces of the trap frame are filled in by the hardware (up to the err field); the rest will be saved by the OS. The first code OS runs is vector64() as found in vectors.S (which is automatically generated by the script vectors.pl).

.globl vector64 vector64: pushl $64 jmp alltraps

File: vectors.S (generated by vectors.pl)

This code pushes the trap number onto the stack (filling in the trapno field of the trap frame) and then calls alltraps() to do most of the saving of context into the trap frame.

# vectors.S sends all traps here. .globl alltraps alltraps: # Build trap frame. pushl %ds pushl %es pushal

# Set up data segments. movl $SEG_KDATA_SEL, %eax movw %ax,%ds movw %ax,%es

# Call trap(tf), where tf=%esp pushl %esp call trap addl $4, %esp

File: trapasm.S

The code in alltraps() pushes a few more segment registers (not described here, yet) onto the stack before pushing the remaining general purpose registers onto the trap frame via a pushal instruction. Then, the OS changes the descriptor segment and extra segment registers so that it can access its own (kernel) memory. Finally, the C trap handler is called.

The C Trap Handler

Once done with the low-level details of setting up the trap frame, the low-level assembly code calls up into a generic C trap handler called trap(), which is passed a pointer to the trap frame. This trap handler is called upon all types of interrupts and traps, and thus checks the trap number field of the trap frame (trapno) to determine what to do. The first check is for the system call trap number (T_SYSCALL, or 64 as defined somewhat arbitrarily in traps.h), which then handles the system call, as you see here:

void trap(struct trapframe *tf) { if(tf->trapno == T_SYSCALL){ if(cp->killed) exit(); cp->tf = tf; syscall(); if(cp->killed) exit(); return; } ... // continues }

FILE: trap.c

The code isn't too complicated. It checks if the current process (that made the system call) has been killed; if so, it simply exits and cleans up the process (and thus does not proceed with the system call). It then calls syscall() to actually perform the system call; more details on that are

below. Finally, it checks whether the process has been killed again before returning. Note that we'll follow the return path below in more detail.

static int (*syscalls[])(void) = { [SYS_chdir] sys_chdir, [SYS_close] sys_close, [SYS_dup] sys_dup, [SYS_exec] sys_exec, [SYS_exit] sys_exit, [SYS_fork] sys_fork, [SYS_fstat] sys_fstat, [SYS_getpid] sys_getpid, [SYS_kill] sys_kill, [SYS_link] sys_link, [SYS_mkdir] sys_mkdir, [SYS_mknod] sys_mknod, [SYS_open] sys_open, [SYS_pipe] sys_pipe, [SYS_read] sys_read, [SYS_sbrk] sys_sbrk, [SYS_sleep] sys_sleep, [SYS_unlink] sys_unlink, [SYS_wait] sys_wait, [SYS_write] sys_write, };

void syscall(void) { int num;

num = cp->tf->eax; if(num >= 0 && num < NELEM(syscalls) && syscalls[num]) cp->tf->eax = syscalls[num](); else { cprintf("%d %s: unknown sys call %d\n", cp->pid, cp->name, num);

cp->tf->eax = -1;

} }

File: syscall.c

Vectoring To The System Call

Once we finally get to the syscall() routine in syscall.c, not much work is left to do (see above). The system call number has been passed to us in the register %eax, and now we unpack that number from the trap frame and use it to call the appropriate routine as defined in the system call table syscalls[]. Pretty much all operating systems have a table similar to this to define the various system calls they support. After carefully checking that the system call number is in bounds, the pointed-to routine is called to handle the call. For example, if the system call read() was called by the user, the routine sys_read() will be invoked here. The return value, you might note, is stored in%eax to pass back to the user.

The Return Path

The return path is pretty easy. First, the system call returns an integer value, which the code in syscall() grabs and places into the %eax field of the trap frame. The code then returns to trap(), which simply returns to where it was called from in the assembly trap handler.

# Return falls through to trapret... .globl trapret trapret: popal popl %es popl %ds addl $0x8, %esp # trapno and err code iret

File: trapasm.S

This return code doesn't do too much, just making sure to pop the relevant values off the stack to restore the context of the running process. Finally, one more special instruction is called: iret, or the return-from-trap instruction. This instruction is similar to a return from a procedure call, but simultaneously lowers the privilege level back to user mode and jumps back to the instruction immediately following the int instruction called to invoke the system call, restoring all the state that has been saved into the trap frame. At this point, the user stub for read() (as seen

in the usys.S code) is run again, which just uses a normal return-from-procedure-call instruction (ret) in order to return to the caller.

Summary

We have seen the path in and out of the kernel on a system call. As you can tell, it is much more complex than a simple procedure call and requires a careful protocol on behalf of the OS and hardware to ensure that the application state is properly saved and restored on entry and return. As always, the concept is easy: with operating systems, the devil is always in the details.

Your System Call

Your new system calls should have the following return types and parameters:

int getreadcount(void) int getopenfilecount(void)

The first system call returns the value of a counter (perhaps called readcount or something like that) which is incremented every time any process calls the read() system call. The second system call returns the number of the files opened by the current process at the moment when the system call is invoked. That's it!

Hints

You will need to modify a number of different files for this project, though the total number of lines of code you'll be adding is quite small. Most of the time will be spent on understanding the code. There shouldn't be a whole lot of code added. At a minimum, you'll need to alter syscall.h, syscall.c, user.h, usys.S, and sysfile.c (or sysproc.c) to implement your new system call. Check this great answer for more details.

To count the read system call, you will likely need to add the readcount as the global variable in sysfile.c (or sysproc.c) to track the number of read system calls. To understand how a file descriptor table is constructed, you may take a look at the fdalloc method from sysfile.c, which show you how xv6 finds an empty slot in the table for the to-be-opened file. Essentially, if a table entry of a process’s ofile is 0, that entry is not pointing to any opened file. Initially, the first three entries are used for stdout (ofile[0]), stdin (ofile[1]), and stderr (ofile[2]);

One good way to start hacking inside a large code base is to find something similar to what you want to do and to carefully copy/modify that. Here, you should find some other system call, like getpid() (or any other simple call). Copy it in all the ways you think are needed, and then modify it to do what you need.

Test your System Calls

The following is the test program syscallTest.c:

#include "types.h" #include "user.h" #include "param.h"

#define assert(x) if (x) { /* pass */ } else { \ printf(1, "assert failed %s %s %d\n", #x , __FILE__, __LINE__); \ exit(); \ }

void readfile(char *file, int howmany) { int i; // assumes file opens successfully... int fd = open(file, 0); char buf; // assumes file is big enough... for (i = 0; i < howmany; i++) (void) read(fd, &buf, 1);

close(fd); }

int main(int argc, char *argv[]) { int rc1 = getreadcount(); printf(1, "Read count %d\n", rc1); int rc = fork(); if (rc < 0) {

printf(1, "Fork failed!\n"); exit();

} readfile("README", 5); if (rc > 0) {

wait(); int rc2 = getreadcount(); printf(1, "Read count %d\n", rc2); assert((rc2 - rc1) == 10); printf(1, "TEST PASSED\n"); int fd1 = open("ls", 0); int fd2 = open("syscallTest", 0); int fd3 = open("cat", 0); if(fd1 * fd2 * fd3 < 0){

printf(1, "Open failed!\n"); exit();

} close(fd1); int of = getopenfilecount(); printf(1, "Number of opened files %d\n", of - 3); assert(of == 5); printf(1, "TEST PASSED\n"); close(fd2); close(fd3);

} exit(); }

The output of the testing program should say “TEST PASSED” (similar to output shown below) if the system calls are implemented properly

$ syscallTest Read count 48 Read count 58 TEST PASSED Number of opened files 2 TEST PASSED $ syscallTest Read count 70 Read count 80 TEST PASSED

Number of opened files 2 TEST PASSED

To compile and generate the syscallTest executable into xv6, you should save the source code syscallTest.c in the xv6-public folder. Then, you should modify theMakefile to add syscallTest as a new user program in xv6 as follows before you compile the xv6 by running make:

UPROGS=\ _cat\ _echo\ _forktest\ _grep\ _init\ _kill\ _ln\ _ls\ _mkdir\ _rm\ _sh\ _stressfs\ _usertests\ _wc\ _zombie\ _syscallTest\

EXTRA=\ mkfs.c ulib.c user.h cat.c echo.c forktest.c grep.c kill.c\ ln.c ls.c mkdir.c rm.c stressfs.c usertests.c wc.c zombie.c\ printf.c syscallTest.c umalloc.c\ README dot-bochsrc *.pl toc.* runoff runoff1 runoff.list\ .gdbinit.tmpl gdbutil\

You may also refer to this tutorial about how to add a user program to xv6.

Submission

Upload all of your source files (but not .o files, please, or binaries!) as a zip file xv6-public.zip to Blackboard. A simple way to do this is to type make in your xv6-public folder to make sure it

builds, and then type make clean to remove unneeded files before compressing the folder. Also, please make a SCREENSHOT file, which shows the output of your implementation.