The summary of this error.
According to a report, the problem occurs because the 32-bit call emulation layer does not check whether the call is truly in the Syscall table. Ben Hawkes, who discovered the problem, says the vulnerability can be exploited to execute arbitrary code with kernel rights. An exploit (direct download of source code) is already in circulation; in a test conducted by The H’s associates at heise Security on 64-bit Ubuntu 10.04, it opened a shell with root rights.source
and the technical description tells us that it involves a missing register clear operation that was fixed once and then reintroduced.
All well and good, but this bug was patched in 220.127.116.11. They fixed the bug by reloading (and thus zero-extending) the original value of eax from the stack. But… strangely enough, in the LOAD_ARGS32 macro that was responsible for this reloading, I couldn’t actually see a specific reloading of eax anymore:
I showed this to my friend Robert Swiecki who had written an exploit for the original bug in 2007, and he immediately said something along the lines of “well this is interesting”. We pulled up his old exploit from 2007, and with a few minor modifications to the privilege escalation code, we had a root shell. So what happened to the patch for CVE-2007-4573? Unfortunately in early 2008 there was a regression that removed eax reloading from LOAD_ARGS32:
and then we look at the change that caused the problem to re-appear and we see two things.
This unifies and cleans up the syscall tracing code on i386 and x86_64.
Using a single function for entry and exit tracing on 32-bit made the
do_syscall_trace() into some terrible spaghetti. The logic is clear and
simple using separate syscall_trace_enter() and syscall_trace_leave()
functions as on 64-bit.
First, there is no regression testing. Sad. Second working code is changed to “clean it up”.
I wish that processor architecture was not so committed to a obsolete model of software. For example, it has become clear over the last few years that the sloppy shared memory thread model of programming has multiple drawbacks. Sharing of data, in particular, should be tightly controlled. Yet processor architectures are “optimized” for unstructured sharing of data between threads with enormously complex “snooping” caches running at high speed to inspect all transactions on what have become essentially cache buses, and initiating complex transactions when conflicts are detected. I put “optimized” in quotes because the performance is terrible for the reason that violations of locality cannot easily be cured. That is, caches are designed to permit totally random sharing of memory without software action. Performance gets hammered by cache locality anyways. Essentially what we have is an expensive (both in transistors and power) fix for poorly designed software that really does not get much advantage out of it, but gets protected from the logical (but not temporal) consequences of sloppy data sharing by the hardware. In point of fact, most threads do not share much memory and the ones that do could be rewritten to explicitly pass control of memory to each other.
Sadly, we are in a situation where processor architectures are constrained to run obsolete software, and then software is constrained to try to take advantage of obsolete hardware. All of this is especially good only for electric power utilities and manufacturers of air conditioning equipment.
A second, related, area of obsolete designs is in memory mapping and paging. This is treated as a problem that has been solved for all time by a hierarchical paging model that was finalized in the 1980s. Is the paging model designed for machines with 10 meg of memory and a 100megabyte swap partition on a slow disk a good model for machines with 1G or 10G of memory and couple of hundred spare gigabytes on disk, or maybe 40 spare on solid state drives? Well, if you give it a little thought, you can see a lot of potential problems. In the good old days, paging in 512 byte blocks or 4k byte blocks from a disk drive organized around such block, to and from a highly constrained memory where you really wanted to be able to avoid wasting even a couple of kilobytes required one design. But what happens when the entire concept of disk blocks disappears on the drive and a system with 10M of unused memory is almost certainly stalled? What kinds of memory use patterns are good for LAMP type systems where jobs do not appear and disappear ? You can look at processor design/operating systems sometimes and think that, just maybe, 1970s computer science departments are not the ultimate final inspiration for all architecture.