DFSA - Direct File-System Access:

The first generation of MOSIX has brought about great performance
improvements in CPU jobs - "number crunchers", but cannot help in the
case of I/O tasks, which need to communicate with their home-node as
often as every system-call, and are therefore better off remaining there.

The second generation of MOSIX, now in the stage of Alpha-testing, includes
DFSA, whereby the more common system-calls can be (under certain conditions)
performed directly on the caller's current node, thus increasing both the
benefit of migrating I/O-oriented (or mixed I/O and CPU) tasks and the
probability that they will migrate.

DFSA operates over suitable, cluster-wide shared file-systems that fulfill
certain requirements.  The only file-system to currently fulfill those
requirements is the MOSIX File-System (MFS).

DFSA requires that MFS is mounted at the same spot (mount-point) by all the
nodes in the MOSIX cluster.  It also assumes that user/group IDs, and thus
permissions, are identical throughout the cluster.
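
One way to check the ID assumption is to compare the name-to-ID mapping of two
nodes.  The following is only a sketch: the function names are made up for
this illustration, and it assumes that the IDs come from passwd/group format
files (name in field 1, numeric ID in field 3) and that a copy of the other
node's file has already been fetched by some means of your choice.

```shell
#!/bin/sh
# Print sorted "name:id" pairs from a passwd- or group-format file.
# (hypothetical helper, not part of MOSIX)
id_pairs() {
    awk -F: '!/^#/ && NF >= 3 { print $1 ":" $3 }' "$1" | sort
}

# Succeed only if both files map every name to the same numeric ID.
# (hypothetical helper, not part of MOSIX)
same_ids() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    a=$(id_pairs "$1")
    b=$(id_pairs "$2")
    [ "$a" = "$b" ]
}
```

For example, "same_ids /etc/passwd /tmp/passwd.node2" (where the second file
is a copy taken from another node) succeeds only when both nodes agree.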

The interface to DFSA is very simple - all you need to make
a particular file-system operate in DFSA mode is to run:

echo {mount-point} > /proc/mosix/admin/dfsa{n}		(1 <= n <= 8)

(you should do the same on all the nodes in the cluster; it does not matter
whether this is done before or after the file-system is actually mounted)

To cancel this mode, run:

echo - > /proc/mosix/admin/dfsa{n}
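
The two commands above can be wrapped in small helpers that reject an
out-of-range slot number or a relative path-name before anything is written.
This is a sketch only: the function names and the DFSA_ADMIN_DIR variable are
inventions of this example (the variable exists so that the helpers can be
tried against a scratch directory); only the /proc/mosix/admin/dfsa{n} files
and the "-" convention come from the text above.

```shell
#!/bin/sh
# Assumption: DFSA_ADMIN_DIR is a hypothetical override for dry runs;
# on a MOSIX node the files are under /proc/mosix/admin.
DFSA_ADMIN_DIR=${DFSA_ADMIN_DIR:-/proc/mosix/admin}

# dfsa_declare N PATH - declare PATH (absolute) in DFSA slot N (1..8).
dfsa_declare() {
    case $1 in
        [1-8]) ;;
        *) echo "dfsa_declare: slot must be 1..8" >&2; return 1 ;;
    esac
    case $2 in
        /*) ;;
        *) echo "dfsa_declare: path-name must be absolute" >&2; return 1 ;;
    esac
    echo "$2" > "$DFSA_ADMIN_DIR/dfsa$1"
}

# dfsa_cancel N - cancel DFSA mode for slot N (1..8).
dfsa_cancel() {
    case $1 in
        [1-8]) ;;
        *) echo "dfsa_cancel: slot must be 1..8" >&2; return 1 ;;
    esac
    echo - > "$DFSA_ADMIN_DIR/dfsa$1"
}
```

For example, "dfsa_declare 1 /mfs" followed later by "dfsa_cancel 1".  As
with the plain commands, the same calls should be made on every node.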

You may also designate symbolic-links to operate in DFSA mode, by running:

echo {symbolic-link-into-DFSA} > /proc/mosix/admin/dfsa{n} (1 <= n <= 8)

but in the case of symbolic-links, unlike mount-points, the symbolic-link
must be defined before writing this definition, so if the link is on a
mounted file-system, that file-system must be mounted first.
The link must point directly to or into a DFSA file-system (attempting to
point to another symbolic-link is unhelpful and falls into the "hard cases"
below).

Both mount-points and symbolic-links must be declared as absolute path-names.
It is the responsibility of the System-Administrator to make sure that
the mount-points and links are identical on all the nodes: failing to make
such declarations on some nodes may degrade performance, but making
inconsistent declarations may cause unpredictable behaviour.
To make any changes to the symbolic-links, their DFSA definition(s) must first
be removed (failing to do this will produce unpredictable results).
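
Before declaring a symbolic-link, it may be worth checking that it satisfies
the rules above.  The following sketch (the function name is made up for this
example) tests that the link is given as an absolute path-name, that it
already exists as a symbolic-link, and that its immediate target lies at or
below a given DFSA mount-point.  It is deliberately conservative: relative
targets and targets containing ".." are rejected, and because only the
immediate target is examined, a link to another symbolic-link is rejected
too.  It assumes that a "readlink" command is available.

```shell
#!/bin/sh
# dfsa_link_ok LINK MOUNT_POINT   (hypothetical helper, not part of MOSIX)
# Succeeds if LINK is an absolute path-name of an existing symbolic-link
# whose immediate target is MOUNT_POINT itself or a path-name below it.
dfsa_link_ok() {
    case $1 in
        /*) ;;
        *) return 1 ;;                 # must be declared as absolute
    esac
    [ -L "$1" ] || return 1            # must already exist as a link
    target=$(readlink "$1") || return 1
    case $target/ in
        */../*) return 1 ;;            # could step out of the file-system
    esac
    mp=${2%/}                          # tolerate a trailing "/"
    case $target/ in
        "$mp"/*) return 0 ;;           # directly to or into the mount-point
    esac
    return 1
}
```

For example, "dfsa_link_ok /mfstmp /mfs" can be run on each node before
writing /mfstmp to one of the dfsa{n} files.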

Requirements from a complying file-system:
------------------------------------------
1) All operations on the file-system must be synchronous, in the sense that
   there is at most one buffer/inode cache throughout the cluster.
   (On client-server file-systems, this usually means that the whole cache
   is maintained on the server - however, a sophisticated server may "lend"
   the cache of particular inodes to particular clients at any given time.
   On shared-hardware file-systems, this probably requires either a hardware
   invalidation signal or a new version to be marked on each inode after each
   modification.)

2) The time-stamps on files and between files of the same file-system must be
   consistent and advancing (unless the clock is deliberately set backwards),
   regardless of the node from which modifications are made.

3) The file-system must populate the following three new super-block operations:
   a) "identify" - to encapsulate identifying information about a "dentry",
      in a way that is sufficient to be able to re-establish that open
      file/directory on another node.  The information must currently
      be of a fixed size, but this restriction could be relaxed later if
      required.
   b) "compare" - to determine whether a particular "dentry" matches the
      above encapsulated identifying information.
   c) to produce a live new "dentry" based only on a super-block and the
      above encapsulated identifying information.

   Given the distributed nature of the file-system, it is also highly
   recommended that it populates the following two new inode-operations:

   a) "checkpath", making sure that the "getcwd" system-call produces identical
      results, regardless of the node from which the call is made.
      The "dcache" cannot be trusted, since other processes can move (or remove)
      the current-directory at any time.  This "checkpath" operation should
      also be able to adjust the "dcache", to suit the correct directory-name.
   b) "dotdot", to produce the true parent "dentry" of a given directory,
      rather than trusting the correctness of the "dcache".

4) The file-system must ensure that files/directories are not cleared when
   unlinked, for as long as any process in the cluster still holds them open.
   There are several possible techniques to achieve this, but given the
   distributed nature of the file-system, some form of garbage-collection
   is probably also called for.

What system-calls are supported:
--------------------------------
The following system-calls are normally supported, being run directly by
the process, while any other calls, as well as hard cases, still need to go
via the home-node:

	read, readv, write, writev
	lseek, llseek
	open, creat, close
	dup, dup2, fcntl (F_DUPFD,F_GETFD,F_SETFD,F_GETFL,F_SETFL)
	getdents, old_readdir
	fsync, fdatasync
	chdir, fchdir, getcwd
	stat, newstat, lstat, newlstat, fstat, newfstat
	access
	truncate, ftruncate
	chmod, chown, lchown, fchmod, fchown, utime
	symlink, readlink
	mkdir, rmdir
	link, unlink, rename

Examples of hard cases:
* if not all nodes have the same DFSA definitions, or the same mounted DFSA
  file-systems, or they have the same mounts - but with different mount-flags.
* if the process is being traced or system-call-traced.
* if the process has a non-standard root-directory.
* if the process shares either its files or current directory as a result of
  the "clone" system-call.
* operations during re-configuration of DFSA on either the home-node or the
  node where the process runs.
* operations involving special files (i.e. anything other than regular files,
  directories or symbolic-links)
* operations on files that were commonly opened and still shared with other
  related processes.
* dup2, where the second file-descriptor is an already open non-DFSA file
  (that requires closing).
* use of path-names that leave the DFSA file-system, so if, for example:
	"/mfs" is a DFSA file-system
	/mfstmp is a symbolic link to /mfs/2/tmp, and is declared under DFSA.
	/mtmp is a symbolic link to /mfstmp, and is declared under DFSA.
	/mfs2 is a symbolic link to /mfs/2, but is not declared.
	on node #2, /fie is a symbolic link to "/tmp/foo".
  then the following are accepted as simple cases (and identical):
	/mfs/2/tmp/foo
	//mfs//2/tmp/foo
	/./mfs/2/tmp/foo
	/mfstmp/foo
	/mfs/2/fie
	mfs/2/tmp/foo  (when in the root directory)
	
  but not the following:
	/tmp/../mfs/tmp/foo
		(although it makes sense, the kernel is not allowed to assume
		that each node has an accessible "/tmp" directory!)
	/mfs/2/../../mfs/2/tmp/foo
		(because it steps out of the "/mfs" DFSA file-system)
	/mfs2/tmp/foo
		(/mfs2 is not declared, hence no assurance was provided
		that it is identical on all nodes)
	/mtmp/foo
		(not pointing DIRECTLY to a DFSA file-system)
	mfstmp/foo  (when in the root directory)
		(just a difficult case to recognize)
* chdir/fchdir when the previous directory is non-DFSA.
* link/rename that fail due to an attempted cross-device link.
* open/dup/dup2/fcntl(F_DUPFD) that require an allowable-increase in the
  maximal file-descriptor index.
* when the home-node DEPUTY has pending requests for us (such as signals,
  requests for "ps" information, requests to migrate or consider migration, etc.)
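
The path-name examples in the list above can be approximated by a purely
lexical pre-check.  The sketch below is not the kernel's algorithm and the
function name is made up for this example.  It assumes that the caller passes
only the DFSA mount-points and those declared links that point directly into
a DFSA file-system (so /mfs and /mfstmp in the example, but not /mtmp).  It
does not look at symbolic-links inside the path-name or at any of the other
hard cases, and it treats every relative path-name as hard, which is stricter
than the text (the kernel accepts mfs/2/tmp/foo from the root directory).

```shell
#!/bin/sh
# dfsa_path_simple PATH PREFIX...   (hypothetical helper, not part of MOSIX)
# Succeeds if PATH is absolute, has no ".." component and, after dropping
# empty and "." components, starts with one of the PREFIX path-names.
# The body runs in a subshell, so the IFS and "set -f" changes do not leak.
dfsa_path_simple() (
    path=$1
    shift
    case $path in
        /*) ;;
        *) exit 1 ;;                 # relative: treated as hard here
    esac
    set -f                           # no globbing while splitting
    IFS=/
    norm=
    for comp in $path; do
        case $comp in
            ''|.) ;;                 # "//" and "/./" change nothing
            ..) exit 1 ;;            # may step out of the file-system
            *) norm=$norm/$comp ;;
        esac
    done
    for prefix in "$@"; do
        case $norm/ in
            "$prefix"/*) exit 0 ;;   # whole-component prefix match
        esac
    done
    exit 1
)
```

For example, "dfsa_path_simple //mfs//2/tmp/foo /mfs /mfstmp" succeeds, while
the same call with /mfs2/tmp/foo or /tmp/../mfs/tmp/foo fails.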

Deviations from normal Linux/Unix/Posix behaviour:
--------------------------------------------------
It was impossible to maintain 100% compatibility on DFSA file-systems,
but the deviations are kept to a minimum:

* A process that received a signal may continue running a few DFSA system-calls
  before it actually receives and handles the signal.
  (in contrast, any POSIX process that receives a signal may possibly
   complete the next system-call, but not issue any new ones beyond that).

* Simultaneous mapping and I/O on the same DFSA file creates unpredictable
  results as follows:
  1) Execution (and library and all other file-mappings) is not always
     protected against other process(es) modifying the file (e.g. either the
     writing-process or the executing/mapping process may fail to receive the
     "ETXTBSY" error).
  2) The "MS_INVALIDATE" flag of "msync" may fail to ensure that previous
     "write"(s) to a mapped DFSA file are discarded.
  3) When a process modifies memory that is mapped as "MAP_SHARED" to a DFSA
     file, but has not yet written it back (using "msync", "munmap", "exec"
     or "exit"), it is possible that another process that reads that file as
     it migrates will first see some of the changes but later (as opposed to
     normal behaviour) see the old values (or some of them) again.
