The random rantings of a concerned programmer.

fsking diskless

December 09th, 2007 | Category: Random

LOL.

So I spent today redoing the diskless stuff from scratch (using my other post as a reference), and wow, there’s a lot of shit that went wrong. I’ve been going through that post and trying to fix all the errors that I found (no idea where they came from?), but it’s quite interesting.

I’ve been banging my head against the wall for the past 4 hours because the damn thing would load PXE fine, then start loading the kernel and just freeze. Turns out I was an idiot and forgot to do a make distribution and it didn’t have a HINTS to use (I think) and just broke. Or something.

Anyway, I got it working with 7.0-BETA4 now, which is nice. I think I’m going to go back sometime and re-do everything again just to make sure I’ve got all the kinks worked out. I’m at the “fight with fstab” stage; fstab isn’t being read in at an early enough point for my tastes, and the memory-mounted filesystems aren’t layered over the read-only NFS mount early enough to get some of the more important things done.

Which is booooo.


NFS makes my setup cry :(

December 03rd, 2007 | Category: Random

So I’ve had a single diskless node running with an NFS-mounted filesystem, and I have to say, I’m a bit disappointed so far. The network hardware I’m using is bottlenecked to 10Mb/s at the hub (blah etc I know), so there are throughput issues in addition to the latency issues inherent in any non-local storage media.
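
For a rough sense of scale, here is the best-case arithmetic for a link like that, ignoring NFS and protocol overhead entirely (so real numbers will be worse). The 100MB size is just an illustrative figure, not a measurement.

```sh
#!/bin/sh
# Best-case seconds to move a payload over a link, ignoring all overhead.
# Integer arithmetic only; the sizes and speeds are illustrative.
transfer_seconds() {
    size_mb=$1      # payload size in megabytes
    link_mbit=$2    # link speed in megabits per second
    echo $(( size_mb * 8 / link_mbit ))
}

transfer_seconds 100 10    # 100MB over a 10Mb/s hub
```

So pulling 100MB across the hub takes over a minute even when nothing else is going on.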

What’s annoying is that any process which touches the disk in any way is implicitly inhibited by that NFS overhead, which is significant. And basically anything you’d want to do with a machine is going to touch the disk in some way. So performance of the diskless node is capped.

The second problem is that a single diskless node hammering the fileserver for a routine thing (like make buildkernel) rapes the hell out of the fileserver, flooding it with I/O interrupts from both the network and the disk. In the tests I ran, the single node burned up to 30% of the fileserver’s cycles.

The third issue is a matter of file distribution – which basically is the fundamental problem with the way this setup is laid out. Each diskless node currently is set up so it has a dedicated root directory somewhere on the disk, which makes sharing common binaries incredibly difficult. It’s basically modeled such that the diskless node appears to have a disk, and the disk is just offloaded remotely. This doesn’t work.

I’m basically going to scrap this approach; the idea of having each diskless node be a unique machine is silly at best. When the whole system comes together, the user shouldn’t be touching the diskless nodes at all. They should be managed in the background; the kernel should automatically manage a distributed process tree and load-balance tasks between each of the attached nodes.

So how should a node be structured? Well, I’m not sure yet. A set of requirements needs to be defined, I guess.

  • No common binary should exist in more than one physical location on the disk. Duplication of data should be kept to an absolute minimum – not only does it take up more physical space, but it also makes the system more difficult to maintain.
  • Each diskless node should have its own copy of every common system binary, a kind of forced aggressive caching (or just pre-caching) such that there is no latency between execution request and execution.
  • The diskless nodes should each have their own independent writable persistent memory, for things like configuration files, and such. Adding to this, kernels should be specified per-node, to allow for hardware diversification without imposing kernel bloat.

So I guess the structure I’m thinking about -

/              nfs-mounted read-only, shared minimum boot stuff
/boot /etc      nfs-mounted read-only, per-node configuration and kernels.
/var /tmp       mfs-mounted volatile scratch space
/usr            mfs-mounted over a read-only nfs-mount. On boot,
                common binaries (/usr/bin, etc) are copied from the
                nfs mount into the mfs-mounted region to force caching.
/persist        nfs-mounted per-node persistent scratch space
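
The "copy on boot" part of the /usr line could look roughly like this. It's only a sketch: it assumes the read-only NFS copy of /usr is also reachable at a separate mount point (the /nfs/usr path in the example is made up), and that the memory filesystem is already mounted.

```sh
#!/bin/sh
# Copy selected subdirectories from a read-only tree into a memory-backed
# tree so later reads never touch the network. All paths are hypothetical.
precache() {
    nfs_usr=$1    # read-only source tree (assumed separate NFS mount point)
    mfs_usr=$2    # destination on the memory filesystem
    shift 2       # remaining arguments: subdirectories to copy
    for d in "$@"; do
        mkdir -p "$mfs_usr/$d" || return 1
        # -R recurses, -p keeps modes and timestamps
        cp -Rp "$nfs_usr/$d/." "$mfs_usr/$d/" || return 1
    done
}

# Example (assumed mount points):
# precache /nfs/usr /usr bin sbin lib
```

The memory filesystem has to be big enough to hold whatever gets copied, which is the obvious catch with this layout.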

An idea I had was to create a large-block prefetching system or script that could be run when you knew a long disk-bound operation was going to be using a specific region of disk, such that the entire region could be pulled over NFS at once, slammed into a local MFS mount, then when the operation was finished, written back.

The problem with such a setup is (obviously) serious locking issues, which I’m not really sure how to correct. This problem doesn’t really manifest itself for some scenarios (like make), where the writeback isn’t really necessary (you dump the source into local memory, compile it, then dump the object files back somewhere. Since you’re compiling it, it’s assumed no one else is writing to those files).
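
A minimal sketch of that prefetch / work / write-back cycle, with no locking whatsoever: it simply assumes nobody else touches the region while the command runs. The function name and the paths in the example are made up for illustration.

```sh
#!/bin/sh
# Pull a directory into local scratch space, run a command there, then
# copy the whole tree back. No locking: concurrent writers will lose data.
with_prefetch() {
    region=$1     # directory on the NFS mount
    scratch=$2    # directory on a local memory filesystem
    shift 2       # remaining arguments: the command to run in scratch
    mkdir -p "$scratch" || return 1
    cp -Rp "$region/." "$scratch/" || return 1
    (cd "$scratch" && "$@") || return 1
    # Write-back copies everything, not just the files that changed.
    cp -Rp "$scratch/." "$region/"
}

# Example (hypothetical paths):
# with_prefetch /usr/src/sys /tmp/work make
```

For the make case the write-back could be narrowed to the object files, which would cut the traffic on the return trip.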

Anyway. Hopefully I can get the bloody NFS usage down soon, because it’s raping the performance of the system. And actually, I should be spending less time on this and more time doing homework and preparing for finals.

lol


Setting up Netboot

November 30th, 2007 | Category: Random

STILL WORKING OUT LOL. Process taken so far:

Okay, so I’ve been trying to get bloody netboot/diskless operation working for a while. There are always pesky little bugs which crop up; I almost got it running about a month ago, but I couldn’t manage to get the NFS server running correctly, and though the kernel was loaded properly on the target machine, it was unable to mount a root filesystem because NFS was broken.

So I’m starting with a clean slate, trying out FreeBSD 7.0 for the first time and documenting the steps I take so I can try to better re-create any problems I cause myself. And, lol, if I can get it working this time then having this will be handy when I need to set it up again from scratch.

For the most part, the handbook entry on diskless operation is one of the best sources for the entire process; it only lacks in that it doesn’t go into much detail about each of the subsystems. It also doesn’t give any hints regarding what to do when shit goes wrong, but blah. There’s a couple of other (mostly dated) guides on the net, but there’s so many ways to do this it’s easy to get confused.

The setup I’m building is basically a Beowulf cluster – it consists of a set of machines connected on a private, internal network. Only one machine (the head or master node) is actually connected to an external network, thus the internal ones are fairly well-shielded from anything malicious from the outside.

The master node is the only machine with a drive; the rest of the nodes will boot diskless from it. The master node has two NICs – one for the external network which is DHCP-configured (or however your network works), and one for the internal network on which we’ll run a DHCP server.

Each of the nodes which will be booting diskless needs special hardware to actually do it; essentially you need a NIC with a PXE bootrom (and a BIOS which will let you boot from it). A quick way to check is to see if you can boot from your network card from the BIOS – if you can, then you probably have a bootrom. I’m not going to really go into the complicated steps involved in flashing the damned things because that shit is messy. Just hope you have one already loaded with a build which works >_>

I’m using the subnet 192.168.100.0/24 for that internal subnet; the master node will be located at 192.168.100.3, and the test diskless client I’m configuring will be at 192.168.100.11 with hostname suigintou.

Configuring the Master

  1. Installed FreeBSD 7.0-BETA3, minimal distribution. I chose 7.0 because I’ve been wanting to try it out for some time, and since I was starting from scratch I didn’t have a reason not to. Most of the documentation I’m using is for 6.2, but meh it shouldn’t make a difference. It’s not like there’s that much of a change in the components this uses.
  2. Post-install configuration: get distributions: ports, man, info, doc, src, games (for fortune). Of utmost importance is that you get the entire source tree, because you’ll need it when it comes time to make distribution. And you need ports to get the dhcp server. Technically, man, info, doc and games can be omitted, but I chose not to in this run.
  3. Configure sshd’s settings in /etc/ssh/sshd_config and enable it in rc.conf:
    sshd_enable="YES"

    then start it

    /etc/rc.d/sshd start

    Up to this point, I was working off the actual machine. As soon as sshd was started I beheaded the machine and did the rest of the stuff remotely. Because working through putty while browsing the internet and doing other things is much more fun.

  4. Configure secondary network interface for internal LAN. You’ll need to figure out what your secondary network interface is (with ifconfig) and replace fxp0 with it, durr.
    ifconfig fxp0 inet 192.168.100.3 netmask 255.255.255.0

    And add a line in rc.conf so this gets done at boot-time from now on -

    ifconfig_fxp0="inet 192.168.100.3 netmask 255.255.255.0"
  5. Lay out directories for everything:
    mkdir /diskless
    mkdir /diskless/tftp
    mkdir /diskless/suigintou
  6. Re-build pxeboot from source (because I’ve had… problems with the binary that comes with the distribution for some reason) -
    cd /sys/boot
    make
    make install

    And copy it over into the tftp folder to be served up

    cp /boot/pxeboot /diskless/tftp
  7. Install net/isc-dhcp3-server, configure rc.conf to boot it at startup only on internal interface:
    dhcpd_enable="YES"
    dhcpd_flags="-q"
    dhcpd_ifaces="fxp0"
  8. Configure /usr/local/etc/dhcpd.conf for each diskless host:
    default-lease-time 0;
    max-lease-time 7200;
    authoritative;
    ddns-update-style none;
    
    option domain-name-servers 192.168.100.3;
    
    subnet 192.168.100.0 netmask 255.255.255.0 {
            option subnet-mask 255.255.255.0;
            option broadcast-address 192.168.100.255;
    
            host suigintou {
                    hardware ethernet 00:E0:81:02:B9:92;
                    fixed-address 192.168.100.11;
                    next-server 192.168.100.3;
                    filename "/diskless/tftp/pxeboot";
                    # On another machine, this didn't work (TFTP Error: file not found)
                    # The easiest way to fix this is to tftp into the localhost and try to
                    # fetch the file by hand, then put what works into the filename.
                    option root-path "192.168.100.3:/diskless/suigintou";
            }
    
            # ... etc
    }

    For each host you’re booting diskless, you’ll want to add another host{ } block. The MAC address of each block is used to associate the diskless client with a hostname. I’ll probably end up tinkering with the root-path option to specify different configurations for each diskless machine, and possibly provide swap space for them (though NFS swap is ick).

  9. Enable inetd in rc.conf:
    inetd_enable="YES"

    and have inetd start tftp when needed, for both udp (standard) and tcp (for weird PXE hardware?) connections in inetd.conf -

    tftp    dgram   udp wait    root    /usr/libexec/tftpd  tftpd -l -s /diskless/tftp
    tftp    stream  tcp wait    root    /usr/libexec/tftpd  tftpd -l -s /diskless/tftp

    and restart inetd -

    /etc/rc.d/inetd restart
  10. Enable the NFS server in rc.conf:
    rpcbind_enable="YES"
    mountd_enable="YES"
    nfs_server_enable="YES"

    and export the proper directories for each host (only 1 here) in /etc/exports:

    /diskless/suigintou -alldirs -ro 192.168.100.11

    Start up NFS with

    /etc/rc.d/rpcbind start
    /etc/rc.d/nfsd start
    /etc/rc.d/mountd start

    And verify that everything is properly exported with showmount -e. The output should look something like this -

    # showmount -e
    Exports list on localhost:
    /diskless/suigintou                     192.168.100.0 

    If there’s nothing listed there, then something isn’t set up properly and you’ll get NFS mount errors when you boot the diskless node.

  11. Prepare a DISKLESS kernel configuration, based on the GENERIC configuration. If you haven’t compiled a custom kernel before, you’ll benefit from reading the handbook article on building and installing custom kernels.
    cp /sys/i386/conf/GENERIC /sys/i386/conf/DISKLESS

    and add the following options into the DISKLESS kernel configuration:

    options     BOOTP
    options     BOOTP_NFSROOT

    The handbook article on diskless doesn’t bother to tell you that you shouldn’t modify the GENERIC configuration directly, but you shouldn’t. Always make a copy of GENERIC and work from that copy, so that when you break something you can always easily revert.

  12. Next, write a script to build the distribution from source -
    #!/bin/sh
    export DESTDIR=/diskless/suigintou/
    mkdir -p ${DESTDIR}
    cd /usr/src; make buildworld && make buildkernel KERNCONF=DISKLESS
    cd /usr/src/etc; make distribution

    I took this script straight from Diskless Operation in the handbook, but added the KERNCONF=DISKLESS to indicate that we want to use the DISKLESS kernel instead of the GENERIC kernel.

  13. And execute that script to build the distribution. This is taking forever to finish blah blah.
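
While that builds, the rc.conf knobs from the steps above can be sanity-checked with something like the following. It's a plain grep over the file, so it only catches the exact knob="YES" form used in this post and doesn't evaluate rc.conf the way the rc system does.

```sh
#!/bin/sh
# Report which rc.conf knobs are set to "YES". Returns non-zero if any
# are missing. Only matches lines of the exact form knob="YES".
check_knobs() {
    conf=$1
    shift
    missing=0
    for knob in "$@"; do
        if grep -q "^${knob}=\"YES\"" "$conf"; then
            echo "ok      $knob"
        else
            echo "MISSING $knob"
            missing=1
        fi
    done
    return $missing
}

# Example, using the knobs from steps 3, 7, 9 and 10:
# check_knobs /etc/rc.conf sshd_enable dhcpd_enable inetd_enable \
#     rpcbind_enable mountd_enable nfs_server_enable
```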


FUCK THAT DIDN’T WORK. SOMETHING IS WRONG WITH THE DESTDIR BULLSHIT >:(

Okay, I think I found a fix -

  1. Build the world and the kernel. Building the world takes fucking ages to do; if you’ve done it before you shouldn’t need to do it again. Ever. You’ll need to compile the kernel in any case.

    cd /usr/src
    make buildworld
    make buildkernel KERNCONF=DISKLESS
  2. Once that’s done and over with, you need to slam that stuff into the prepared place for it -
    cd /usr/src
    make installworld DESTDIR=/diskless/suigintou
    make installkernel DESTDIR=/diskless/suigintou KERNCONF=DISKLESS
    make distribution DESTDIR=/diskless/suigintou

    As a random note, if you fuck something up and aren’t able to delete certain files anymore, it’s because the installkernel make script sets a “no change” flag on a bunch of files so you can’t accidentally fuck your system with rm -rf /*. Anyway, to kill the flag, use chflags [-R] noschg.

  3. So now we’ve got our root filesystem ready to export. Now just gotta make sure all the processes we need are running (dhcpd, inetd and nfsd), then try booting the remote system… BUM BUM BUMMMMMM

If all goes well, you should be able to boot your remote machine.
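
Once one node boots, adding more means repeating the host { } block from the dhcpd.conf step plus a matching /etc/exports line for each one. A small sketch that prints both, reusing the server address and the /diskless layout from this post; the node name, MAC and IP in the example are made up.

```sh
#!/bin/sh
# Print the dhcpd.conf host block for one diskless node. The output
# belongs inside the existing subnet { } block.
dhcp_host() {
    name=$1; mac=$2; ip=$3
    cat <<EOF
host $name {
        hardware ethernet $mac;
        fixed-address $ip;
        next-server 192.168.100.3;
        filename "/diskless/tftp/pxeboot";
        option root-path "192.168.100.3:/diskless/$name";
}
EOF
}

# Print the matching /etc/exports line for the same node.
export_line() {
    name=$1; ip=$2
    echo "/diskless/$name -alldirs -ro $ip"
}

# Example (hypothetical second node):
# dhcp_host node2 00:11:22:33:44:55 192.168.100.12
# export_line node2 192.168.100.12
```

Each new node still needs its own populated root under /diskless, and dhcpd and mountd need a restart or reload to pick up the changes.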

Reasons this is Fucked.

The problem is that the entire NFS filesystem will be read-only, which breaks all kinds of shit. One solution I’ve found so far is to slap a union’ed memory-based filesystem over parts of it, like

mdmfs -M -s16m -o union md1 /etc

I had to boot the machine in single-user mode to even do this, because the master.passwd requires a lock to open. Thus, we need to put a memory-backed filesystem over /etc, then touch master.passwd to copy it into the memory-backed part. unionfs is really cool...

Ideally, what I want is to be able to NFS-mount a read-only root directory, then NFS-mount with unionfs a whole filesystem over that, such that we can both modify files AND have those changes be persistent. Memory-backed file systems are great, except that they're completely lost when you reboot...

Now to figure out how to do that...

Okay, woot figured it out. Basically, you'll want to lay out the fstab on the client machine something like this:

# Mount the memory-backed filesystems
/dev/md0 /var  mfs rw,-M,union,-s4m 2 0
/dev/md1 /tmp  mfs rw,-M,union,-s8m 2 0

# Mount the NFS-backed filesystems
192.168.100.3:/diskless/suigintou/etc /etc nfs rw
192.168.100.3:/diskless/suigintou/usr /usr nfs rw

The fstab file format is really archaic: it uses a space-delimited list of things. This implies that the list of options must be comma-delimited and can CONTAIN NO SPACES. Took me a fucking half-hour to work out why mount_md was breaking shit. Anyway.
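
A stray space in the options list shows up as extra fields, so a quick field count would have caught it. This is a sketch that assumes every entry has between 4 and 6 whitespace-separated fields (device, mount point, type, options, then the optional dump and pass numbers); it checks nothing else about the file.

```sh
#!/bin/sh
# Flag fstab lines with a suspicious number of fields, which is what a
# space inside the options list produces. Skips comments and blank lines.
check_fstab() {
    awk '
        /^[ \t]*(#|$)/ { next }
        NF < 4 || NF > 6 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Example:
# check_fstab /etc/fstab
```

It returns non-zero when anything is flagged, so it can sit in front of a reboot in a script.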

That should just about do it. I'm tired of editing this post, lawds. Now I need to find me a new CMOS battery so I can actually reboot this machine and have it load everything without me going through the BIOS menus to acknowledge that yes, I know, the battery is dead. Fucking fuckity fuck.
