Operations and maps on finite sequences

A lot of what I’m trying to do with mathematical models of computer systems involves operations on finite sequences.

Define a “finite sequence of length n>0” to be any total map f: {1 … n} → X for some set X. The 0-length sequence is the null map  “nulls”. If f is a finite sequence of length n, then g = f affix c is the map g: {1 … n+1}   → X ∪ {c}  so that g(i)= f(i) for  i ≤ n and g(n+1) = c.  Also, if g is length n>0 then there is an f of length n-1  and some c so that g = f affix c .

A primitive recursive function on finite sequences is given by the rules  F(nulls) = k and  F( f affix c) = G(c,F(f)). 

For example we can define prefix by  (c prefix nulls) = (nulls affix c) and ( c prefix (f affix d)) = (c prefix f) affix d.

I use affix instead of append because prepend is ugly.

Two observations about common usage in computer science that differs from standard mathematical practice. First, note how I’m defining sequences as maps but being imprecise about the function images, about the set X. In CS we really only have one type of object – a sequence of binary digits- so X is always a set of finite sequences of binary digits. This is a fundamental property of computing. So if I wanted to be pedantic, I’d first define the set B of finite sequences of binary digits as the set of all maps g:{1 … n} → {0,1}  plus the null map, and then define general finite sequences as maps  {1 … n} → B.  But we use those binary sequences as representations of other mathematical objects and even objects that are usually not considered mathematical objects such as MPEG encodings of video.  So it’s convenient to speak of a sequence of integers or strings or a sequence of assorted objects without making the translation between representation and denotation  (connotation? ). The second  observation is that, for the same reason, second order functions are not unusual in computer science – e.g. defining prefix as a function that operates on sequences which are also functions. Note however that in non CS applied math, second order functions are also common.

Painting is The Mathematician by Diego Rivera.


What does the UNIX file system do?

Unix, Linux, Windows and other operating systems and the world wide web all support file systems with the familiar path file names  like

 or "/system/passwords/secret/dontread.txt"

although sometimes with different separator characters between the individual “flat” file names. For example, Windows uses “”. As long as we know how to separate flat file names in the sequence, it doesn’t matter. The flat file names are chained together in a path through the file system that shows “where” a file can be found. URL’s in the world wide web are just path file names with some more information around them.  Constructing the file system involves a clever technique for embedding a tree in a simpler file system where file names are just numbers.

For historical reasons, the base file system uses numbers called “inode numbers” to name files.  Ignoring modifications, this file system looks like a function F:InodeNumbers → FileData. The tree emerges from information stored in some of the files. FileData includes some files that are just data and some files that are maps called “directories” (or “folders”). Directory maps have the form d: SimpleFileNames → InodeNumbers.  If we have a path file name “a/b/c” and a starting inode number i, we can first get d1 = F(i), the contents of file i which should be a directory, then get ia= d1(a) the inode number of the file named a, and then da= F(ia) and  ib= da(b) and db= F(ib) and  ic= db(c)  and then the contents of the file “a/b/c” is F(ic ) – assuming that the path is defined.  More concisely, we can write  ia= F(i)(a) and  ib= F(ia)(b) and so on where functions are resolved left to right: for example, F(i)  is a map which is then applied to a. 

Computing the translation of a path file name to an inode number can be defined recursively in terms of a function usually called namei (for names to inode numbers).  If the path file name is the empty path, then we are already where it leads: namei(i,Empty) = i.  If the path file name is not empty, it has the form a/p  where  is a simple file name (of any length) and  is a path file name with one less simple file name in it than the original path: namei(i,a/p) = namei(F(i)(a),p) It’s possible that namei(i,p) is not defined – for example, F(i) might not even be a directory function or it might be one but d=F(i) might not be defined on the leftmost simple file name in the path. In that case, we have “file not found” or “404” in the case of a URL.

A UNIX type file name has a special inode number for the “root” directory.  For any path  file contents is then U(p)= F(namei(root,p)).  A consistent file system will have at least the following properties.

  1. No orphans. For every  in InodeNumbers,  if F(i)  is defined there must be a path  so that namei(root, p) = i. 
  2. No dangling references. For every so that F(i)  is a directory function and for ever simple file name so that F(i)(a)  is defined, F( (F(i))(a)) must also be defined (that is, if F(i)=d  and d(a)=j  it must be the case that  F(j) is defined.)  

Another useful property limits cycles or loops through the file system and aliases. If U(p)  is a directory, let Children(p) = {a: U(p)(a) is defined} where  is a variable over flat file names.  Then define find(p) = {p} if is not a directory or Children(p) = emptyset and define find(p) = union{find(pa): a in Children(p)} . If there are no loops, this is a well defined function that terminates with the set of leaf nodes reachable from p. For example if one were in an organization concerned about security, there might be regular monitoring of find(/home/snowden) to see if any unauthorized data had been collected.

The most stringent non-alias requirement would be that if namei(root,p) = namei(root,q) then p=q. There can be no loops if there are no aliases. This requirement is usually relaxed to accommodate the “parent” and “self” pseudo file names, and hard and soft links. The simple file name “.” is usually reserved to mean “self” so that if F(i)  is a directory F(i)(“.”) = i. The pseudo-file-name “..” is used for “parent” so that if F(i)(a)=j  and F(j) is also a directory, then F(j)(“..”) = i. These pseudo-file names introduce both loops and aliases so we could just limit the requirement for no aliases to the cases where and  don’t contain any pseudo-file-names. Note that the definition of the parent pseudo-file-name limits many kinds of loops because it cannot be that a directory points back at two different parents.

Soft links, a later addition to UNIX files, are a more complex problem. For soft links we add file contents that are path file names and modify namei  so that if j=F(i)(a)  is a soft link with F(j)=q, then namei(i,a/p) = namei(root,concat(q,p)). The original definition of namei has a nice property that the path shrinks by one flat name at every step and this change loses that property and makes it easy to create loops that never finish. The solution to that is to count soft links and just give up if a path takes us to more than some set limit number of soft links.



Paths versus Recursion

 * Iterative DepthFirst file list (c) Victor Yodaiken 2013
 * "Not the way we do it in Brooklyn" - Dave "Kinch" Arnow.
 * Data structure is P - the current path, with some aux data
 * Two basic operations:
 * 1) Lp(P) - starts at path P and extends it to the leftmost
 * reachable file/directory
 * 2) IterateDF(P) iterates by advancing a path to the next in depth
 * first order
 * So program is
 * Initialize(P);
 * do{
 * PrintPath(P);
 * }while(IterateDF(P) != EOF)
 * Horribly inefficient - can be cured by caching positions in directory
 * entries.
 * */
Link to Code