Building a distributed data warehouse with FreeBSD
July 4, 2009
A data warehouse is, as the name implies, a place where large amounts of data are archived, usually for the purposes of future browsing and analysis.
The problem with a standard data warehouse is that it requires a fat server; in particular, a server with a large amount of disk space.
This can be expensive, especially as consumer-level PCs rarely contain more than two drive bays, and what diskspace is available
is usually spread across these same consumer-level PCs, not concentrated on the server.
Enter FreeBSD. With the right combination of tools, all of the consumer-level PCs on the LAN can be utilised to
create a single, seamless, distributed data warehouse. Rather than concentrating the diskspace in several large drives in the server, a
distributed data warehouse spreads the chore of holding the data amongst the PCs. The server merely provides the interface, and serves the files
- although it could of course host data directories as well.
The required tools are as follows:
FreeBSD (Linux can probably do this too)
Apache (and PHP if you want to do fancy interfaces - not required, however)
ln
mount_smbfs (here, we assume the consumer-level PCs are running Windows or Samba)
The distributed data warehouse is built by mounting the workstations' shares on the server, and placing a symlink to each mountpoint in a directory that is
browsable via Apache. This creates a website consisting of links to remote volumes; however, the fact that the volumes are remote is
transparent to the website user, who sees a single website. To the website user, the remote volumes look identical, and work identically,
to directories on the server.
It is, in essence, a read-only SMB-to-HTTP gateway, which presents a unified interface to distributed resources. It allows each machine on
the LAN to share files via HTTP, without needing to install a webserver on each machine.
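For example, the layout on the server might look something like this (all paths and names illustrative):

    /warehouse/pc1_data1                      smbfs mount of //PC1/DATA1
    /warehouse/pc2_data1                      smbfs mount of //PC2/DATA1
    /usr/local/www/data/warehouse/            directory served by Apache
        pc1_data1 -> /warehouse/pc1_data1     symlink
        pc2_data1 -> /warehouse/pc2_data1     symlink

A visitor browsing http://server/warehouse/pc1_data1/ is, without knowing it, reading files straight off PC1.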
Benefits:
All the resources of the warehouse are available to any device that runs a web browser, including PCs and mobile phones.
All the resources of the warehouse are available to any user that can connect to the server with HTTP, including (by using port forwarding) internet users. Warehouse users do not need to connect with SMB and do not need to know any hostnames, sharenames, usernames or passwords.
The total size of the warehouse can vastly exceed the amount of diskspace available in the server, or indeed in any single machine.
If a drive, machine or network segment hosting a remote volume fails, the rest of the warehouse stays up.
If a workstation is rebooted, reconfigured or replaced, the rest of the warehouse stays up.
Any workstation that supports SMB can become part of the warehouse, including Windows and Linux PCs, and Macs running OSX.
Any device on any workstation can become part of the warehouse, including DVD-ROMs, flash memory, and USB devices - as the devices are connected to the workstations, rather than the server, the server does not need drivers for these devices.
Data does not need to be copied/moved to the server in order to enter the warehouse. Issues with replication, synchronisation, and duplication do not arise.
A workstation can change which data is shared simply by creating a new share with the same name.
Any data in the shared directories is visible to the server, and can be backed up by the server onto tape, or whatever other backup device it may have.
The amount of total and free space on each remote volume is available, via the df command on the server, permitting easy monitoring of available resources (see the example after this list).
Shares can be disabled simply by changing the password to the share. Conversely, shares can be made inaccessible to users of the host system by setting file-system permissions on the host.
Logging and auditing can be implemented if required.
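As an example of such monitoring, df -h on the server might produce output along these lines (all figures illustrative):

    Filesystem               Size    Used   Avail Capacity  Mounted on
    /dev/ad0s1a              9.7G    3.1G    5.8G    35%    /
    //WAREHOUSE@PC1/DATA1    233G    190G     43G    82%    /warehouse/pc1_data1
    //WAREHOUSE@PC2/DATA1    120G     80G     40G    67%    /warehouse/pc2_data1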
Drawbacks:
The server serves all the files from all the workstations, so it is a potential bandwidth
bottleneck, particularly for users accessing the server over an internet connection. This is not a problem for a server with few concurrent users;
larger-scale applications, however, will require higher-speed connections.
A single point of failure is created, in that if the FreeBSD server goes down, the entire warehouse goes down too.
Mission-critical data should not be served from a consumer-level PC at the end of a network connection. The PC may be rebooted, the user may
eject their device, a cable or switch might get unplugged, etc.
How to build the warehouse:
Note: you will likely need root-level server access for the tasks below.
Install FreeBSD and Apache and get them working. Ensure you can see, in your web browser, a directory listing of the root of the website
you have created. You may wish to create an alias for your warehouse. If you want it to be accessible from the web,
you'll probably need a dynamic DNS account and a forwarded port on your router; you can then give people an address such as
http://warehouse.mydomain.dynamic-dns-provider.com/
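For reference, a minimal sketch of the relevant httpd.conf fragment, assuming Apache 2.2-era syntax and symlinks living under /usr/local/www/data/warehouse (adjust the paths to your installation):

    Alias /warehouse "/usr/local/www/data/warehouse"
    <Directory "/usr/local/www/data/warehouse">
        # FollowSymLinks is needed, or Apache will refuse to follow
        # the symlinks to the smbfs mountpoints
        Options Indexes FollowSymLinks
        AllowOverride Indexes AuthConfig
        Order allow,deny
        Allow from all
    </Directory>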
Go to each PC that will be a part of the warehouse, and create a shared directory, with permissions as appropriate. For example, create a world-readable share; or first create a "warehouse" user on the PC and, when creating the share, give the warehouse user read permission.
On the server, log in as root, and create mountpoints for each remote volume. For example, if you have two PCs called PC1 and PC2, each with two shared directories:
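(A sketch - the share names, mountpoints and document root below are assumptions; adjust them to your own setup.)

    # create a mountpoint for each remote volume
    mkdir -p /warehouse/pc1_data1 /warehouse/pc1_data2
    mkdir -p /warehouse/pc2_data1 /warehouse/pc2_data2

    # mount each share as the "warehouse" user
    mount_smbfs //warehouse@PC1/DATA1 /warehouse/pc1_data1
    mount_smbfs //warehouse@PC1/DATA2 /warehouse/pc1_data2
    mount_smbfs //warehouse@PC2/DATA1 /warehouse/pc2_data1
    mount_smbfs //warehouse@PC2/DATA2 /warehouse/pc2_data2

    # symlink the mountpoints into the website
    ln -s /warehouse/pc1_data1 /usr/local/www/data/warehouse/pc1_data1
    ln -s /warehouse/pc1_data2 /usr/local/www/data/warehouse/pc1_data2
    ln -s /warehouse/pc2_data1 /usr/local/www/data/warehouse/pc2_data1
    ln -s /warehouse/pc2_data2 /usr/local/www/data/warehouse/pc2_data2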
Note: you'll need to enter the warehouse user's password for each mount_smbfs command. Also, the mounts are lost if the server is rebooted;
to rebuild the warehouse after a reboot, re-run the mount_smbfs commands, entering the password for each share.
Note: if mount_smbfs can't resolve the hostnames you provide, either use IP addresses instead, enter the hostnames into /etc/hosts, or
install Samba.
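For example, /etc/hosts entries for the two PCs might read (addresses illustrative):

    192.168.1.101   PC1
    192.168.1.102   PC2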
(optional) Create an index page for the root of the website. If this is not done, Apache will show the directory listing in its usual way
(directory listing must be permitted in .htaccess or httpd.conf for this to work). Some notes on Apache's directory listing feature:
to show the listing as a table rather than a bulleted list: put IndexOptions FancyIndexing into .htaccess
to allow IndexOptions in .htaccess, add "Indexes" to the AllowOverride line in the virtual host's container in httpd.conf
to auto-resize the name field (and avoid truncation of the names), use IndexOptions NameWidth=* in .htaccess
to list directories before other files: use IndexOptions FoldersFirst in .htaccess
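Putting those together, the warehouse's .htaccess might simply read:

    IndexOptions FancyIndexing NameWidth=* FoldersFirst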
Notes:
If a remote volume becomes unavailable (host rebooted, etc), the server simply reports an empty directory, until the workstation comes back online; the server automatically reconnects as needed.
The mounts must be remade if the server is rebooted - while the mount_smbfs commands can be put into a startup script,
they require a password to be entered for each remote volume. It may be possible to store the passwords on the server in
a file called .nsmbrc, as sketched below, although this has not been tested here.
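A guess at what a ~/.nsmbrc entry for the warehouse user's password on PC1 might look like (the names in the section header must be uppercase; check the mount_smbfs man page before relying on this):

    [PC1:WAREHOUSE]
    password=secret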
Don't put a mount to a remote volume in fstab. If the remote machine is unavailable at the time the server is rebooted,
for any reason (powered off, networking problem etc), the server will fail to boot.
Don't put a mount to a local DVD or CD drive in fstab. If the drive is empty at the time the machine is rebooted,
the server will fail to boot.
No special permissions need to be set on the mountpoints or symlinks (although this may be due to the security configuration of the test server used for this article).
By using Samba instead of Apache, a distributed filesystem can be created, with transparent read/write access available to all
workstations at the drive-letter level. That is, workstations can map a drive to a Samba share on the server, which contains
symlinks to the mountpoints of remote volumes. This will permit workstations to use a path like L:\DATA2\DOCS (for example) without
needing to know that the directory is in fact a share on another PC on the LAN.
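A minimal smb.conf sketch for such a share, assuming the symlinks live in /usr/local/www/data/warehouse (the symlink options are needed so Samba will follow links that point outside the share tree):

    [warehouse]
        path = /usr/local/www/data/warehouse
        read only = no
        follow symlinks = yes
        wide links = yes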
By combining Samba and Apache as above, an intranet is created, with documents viewable with a web browser, yet editable only by specific
users, or groups of users (depending on the permissions used).
Apache authentication (via a .htaccess file) can be used to restrict access to the warehouse. By placing the symlinks in sub-directories
with their own .htaccess files, access to each remote volume can also be restricted.
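For example, an .htaccess at the warehouse root might contain the following (the password file is an assumption, created beforehand with htpasswd):

    AuthType Basic
    AuthName "Data Warehouse"
    AuthUserFile /usr/local/etc/apache/warehouse.passwd
    Require valid-user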
SSL could be added to the server, to provide remote users with secure access to firewalled data.
So when, you may ask, would you want to serve large quantities of non-mission-critical data from consumer-level hardware?
You have a large collection of photos you took while trainspotting, and want to share them with friends.
You want to listen to your own music collection while at a friend's house, or overseas.
You have a cheap-assed hosting account and you want to link to some giant files for free.
You want to be able to reboot systems, upgrade hardware, and recover from outages while the library is online.
You don't want to use existing content-sharing services; they get hacked, and anyway, you don't want to give them an irrevocable,
worldwide license to everything that you post.
You don't want to mess around copying and synchronising multiple versions of files.
You want a simple way to bypass your own firewall(s).
You're building a library which includes hyperlinks to files on remote volumes, and you want to manage the
library with a PHP application, and/or, access it via any web browser, on any capable device, anywhere in the world.