Month: July 2001

tech ed day 5

i had to wait until the last day of tech ed to experience a speaker with such a cult following that he can get away with holding his talk from a bathtub on stage. of course i’m talking about don box here.

besides being a great speaker, don is known for co-authoring soap and for sitting on the xml schema working group. don spoke at length about how massive the transition from traditional win32-style programming to .net will be. in his view it compares only to the change from DOS to windows nt.

besides cracking jokes all the time, don showed how the move to richer metadata in the type system conveys the intent of a programmer’s code better than current approaches do. in his words, understanding the matrix helps you understand the clr: there is an (idealized) world inside the clr, and tough reality beneath. much as there has long been a distinction between userland and kernel mode, don argues that adding another layer of abstraction will help get better results. while it is certainly true that higher levels of abstraction give you more leverage, you cannot help but wonder how layers upon layers of cruft (.net was basically bolted onto com implementation-wise to maintain compatibility with the installed base) make for a stable system…

Tech Ed Day 4

due to the attractions of barcelona’s nightlife, i missed most of the talk about attributed programming. it would have been interesting, but as it was, it went over my head a bit..

uddi was touted as a solution for discovering web services and for facilitating the integration of applications across the network. while a directory of services is certainly useful, it remains to be seen how many directories will be vying for attention and thus reduce the reach of each of them. wsdl, the standard for describing the actual apis, turns out to be a “throw everything in” kind of standard. even microsoft’s implementations (there are 3 of them) don’t interoperate with each other..

the talk on java vs .net was very well done, and while the 2 platforms look remarkably similar, java does not currently have a web services strategy. what became evident, though, is that all major vendors are betting on web services and have at least agreed on soap for interop.

the evening held a gigantic party in store. microsoft had rented the olympic stadium and the surrounding area and threw a party for all 9000 tech ed attendees. attractions ranged from spacing to foods of all sorts, including an attempt to produce the largest paella ever made (with a diameter of 5m they seem to have succeeded), to clowns, to a concert by a queen lookalike band, to the final fireworks.

.net dangers and community answers

very timely. the last few days have been a wake-up call for the open source community about what .net means for the future of the internet. so it’s very reassuring to see this editorial from this week’s lwn.net.

One frequently-heard criticism of free software is that it lacks innovation. The free software development process can do well at reimplementing others’ good ideas, but is not able to produce those good ideas itself. Free software advocates dismiss that criticism with plenty of counterexamples. But it still hurts a bit sometimes. There is currently an opportunity, however, for the community to show what it can do. A challenge which should be accepted if we want to remain in control of our computing future.
That challenge, of course, is Microsoft’s “.NET” initiative, and the HailStorm component in particular. HailStorm is Microsoft’s bid to be the intermediary in authentication and business transactions across the net. If the company has its way, everybody will have a Microsoft “Passport,” which will be required to be visible on the net. The protocols behind this system will be “open” (based on standards like XML and SOAP), but Microsoft will hold the copyrights and decide what is acceptable.

It is interesting to note that these protocols have been explicitly designed to be independent of little details like which operating system you’re running. Microsoft is saying, essentially, that, at this level of play, who owns the desktop is no longer important. Linux could yet conquer the desktop, but lose the net.

Scattered responses have been seen across the community, including .NET implementations, talk of a free C# compiler, or a “dotGNU” framework. But these are catching-up actions. There is little new there; it is more an effort to keep up with what Microsoft is doing. That approach should be seen as a serious mistake. It is time for the free software community to take the lead.

Doing so will require the presentation of an alternative proposal. What is needed is a compelling vision of how we will deal with each other on the net of the future. The community needs to design a framework which handles tasks like authentication and transactions, but which meets a number of goals that may not be high on Microsoft’s agenda:

The full set of protocols which implement this framework must be open, with an open development and extension process.

No one company or institution should be indispensable to the operation of the framework. No company or institution should be able to dictate the terms under which anybody may participate in life on the net.

Security and privacy must be central to the framework’s design. All security protocols must be open and heavily reviewed.

The framework must bring the net toward its potential as the ultimate communication channel between people worldwide, and it must allow the creation of amazing new services and resources that we can not yet imagine.
The success of the Internet is due to a great many things, but one aspect, in particular, was crucial: nobody’s permission is required to place a new service or protocol in service on the net. Where would we be now if Tim Berners-Lee had been required to clear the World-Wide Web through a Microsoft-controlled standards process – and let Microsoft copyright the protocols too? Any vision of the net of the future must include the same openness to be acceptable.

The free software community could generate that vision, but it is going to have to set itself to the task in a hurry. It is also, for better or for worse, going to need some serious corporate involvement. Companies are needed to help fund the development of a new set of network standards, make sure they meet corporate needs, and, frankly, to insure that it is all taken seriously. There should be no shortage of companies with an interest in a net that is nobody’s proprietary platform. It is time for them to step up and help with the creation of a better alternative.

The community needs to act here. Playing a catch-up role in the design of the net of the future is no way to assure freedom, or even a whole lot of fun. Large-scale architectural design is hard to do in the free development mode, but we need to figure out how to do it well. Either that, or accept the criticism that we can’t really innovate.

Disk bandwidth estimation

Daniel Phillips is fast becoming a major league kernel hacker.

This is an experimental attempt to optimize my previous early flush
patch by adding continuous disk bandwidth estimation. In spirit, the
new modifications are similar to Stephen Tweedie’s “sard” disk
monitoring patch, though it was only after implementing my own ideas
that I became aware of the overlap. On the other hand, what I have done
here is quite lightweight, on the order of 20 lines or so, and seems to
produce good results.

It is far from clear that this continuous bandwidth feedback from the IO
queue is the “right” approach. Alternatively, it would be quite easy to
provide an interface from userland to allow the administrator to provide
a one-time bandwidth estimate, perhaps derived from hdparm -t. On the
other hand, it would be just as easy to provide both an automatic
estimation and a manual override. One big advantage of making the
automatic method the default is that no tuning needs to be done in order
to get decent performance from a new install. Another potential
advantage is that bandwidth can change under different loads, so any
one-time estimate may prove to be sub-optimal.

The Patch
---------

This is a patch set with 3 parts:

1) A lightly edited version of the early flush patch
2) Add-on bandwidth estimation
3) Add-on proc interface for bandwidth estimate and transfer rate

Each part depends on the ones before it and each results in a usable
system. I.e., to get the original early flush behavior, just omit the
second and third patches.

The second patch adds bandwidth estimation and this is where things get
interesting from the benchmarking point of view. At this point I
haven’t done any rigorous benchmarking and I can only guess at the
performance effects. On the other hand, by monitoring the bandwidth
estimate, I’ve learned some interesting things about how well we are
doing in terms of optimizing disk seeks (not spectacularly well) and I
have also noticed what appears to be a low-level problem in the disk
queue, causing short periods of unreasonably low block transfer rates on
my laptop.

To apply:

cd /usr/src/yourtree
patch -p0 <thispatch

To reverse, you must separate the patch into its 3 parts and
reverse them in reverse order. Sorry. I will try to avoid placing multiple
patches in one file in the future. 😉

For example:

<edit this file into 3 parts: look for early.flush.1/2/3>
patch -p0 <early.flush.3 --reverse
patch -p0 <early.flush.2 --reverse
patch -p0 <early.flush.1 --reverse

Method
------

As expected, estimating disk bandwidth is a little tricky. There
are several problems.

– There could be several disks on the downstream end
– Some of them might not even be disks: ramdisk, flash, nbd.
– Need to know when transfers are running back to back
– Seeking can make the transfer rate highly variable

The way I decided to go at it is by considering 2 types of sample
periods: a) sample periods with continuous activity and b) sample
periods with some idle time. Sample periods that include idle time
only cause the bandwidth estimate to increase; those with continuous
activity can cause the bandwidth estimate to increase or decrease.

The bandwidth samples thus obtained tend to fluctuate rapidly. To make
them more useful, I filter them. The line:

bandwidth_sectors = (bandwidth_sectors*3 + bandwidth_sample)/4;

implements a simple low-pass filter using only shifts and adds.
In some respects, what has been implemented is a feedback loop. When
early flushing is the only active disk IO process, the estimate of disk
bandwidth will tend to be continuously improved. This happens because
the flush will try to keep the queue full to a level somewhat
greater than (150% of) the bandwidth estimate, allowing the estimate to
increase by 50% on each poll interval. When the queue has been properly
saturated with transfers the estimate can decrease as well. Hence the
flushing behavior causes migration towards a position of improved
knowledge about the underlying hardware.
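
As a rough userspace sketch of the estimator described above (the
function name and its arguments are made up for illustration; this is
not the patch itself):

/* Idle periods may only raise the estimate; fully busy periods may
 * raise or lower it.  Values are in sectors per poll interval. */
#include <stdio.h>

static unsigned long bandwidth_sectors = 100;   /* running estimate */

static void update_bandwidth(unsigned long bandwidth_sample, int had_idle_time)
{
        if (had_idle_time && bandwidth_sample <= bandwidth_sectors)
                return;

        /* the low-pass filter from the patch; roughly e = e - e/4 + s/4,
         * which the compiler reduces to shifts and adds */
        bandwidth_sectors = (bandwidth_sectors*3 + bandwidth_sample)/4;
}

int main(void)
{
        update_bandwidth(400, 0);  /* saturated interval: estimate climbs */
        update_bandwidth(15, 1);   /* mostly idle: ignored, never drags it down */
        update_bandwidth(15, 0);   /* busy but seek-bound: estimate falls */
        printf("estimate: %lu sectors/interval\n", bandwidth_sectors);
        return 0;
}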

Observations
------------

It turns out that measured bandwidth tends to fluctuate a great deal –
by a factor of 20 to 40. This reflects the difference between
sequential transfers and those that require large amounts of seeking. For
example, an IDE disk may be capable of transferring a 4K block in 250
microseconds, but if the blocks are all on separate tracks the actual
transfer time may be 5 milliseconds or so, somewhere in the range of the
disk’s average access time. This gives a factor of 20 bandwidth
difference depending on access patterns. I observed this in practice.

Interestingly, the use of smaller blocks gives an even wider variance.
This is because of the larger number of seeks possible for a given
amount of data. I see 2-3 times as much variance with 1K blocks as with
4K blocks. This is an important reason why larger block sizes are good
for throughput. (However, note that the improvement could be illusory if
the data items being transferred are significantly smaller than the block
size.)
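
A quick back-of-the-envelope check of those figures, using the numbers
quoted above rather than any new measurements:

#include <stdio.h>

int main(void)
{
        double seq_us_4k = 250.0;  /* sequential 4K block transfer, ~250 us       */
        double seeky_us  = 5000.0; /* one block per seek: roughly the access time */

        /* worst-case spread between sequential and fully seek-bound access;
         * in practice 1K blocks showed 2-3x the variance of 4K blocks,
         * somewhat below this worst case */
        printf("4K blocks: factor %.0f\n", seeky_us / seq_us_4k);       /* ~20 */
        printf("1K blocks: factor %.0f\n", seeky_us / (seq_us_4k / 4)); /* ~80 */
        return 0;
}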

Peak transfer rates don’t vary much with block size and remain near the
raw transfer rate of the disk as measured by hdparm -t. This is
encouraging as far as correctness of the measuring method goes.

A Low Level Disk Transfer Anomaly
---------------------------------

I have consistently observed a troubling anomaly in low level disk
transfer throughput. On rare occasions, the low level transfer rate
seems to drop to ~10 blocks/second on my laptop. During these
periods of slow transfers, the IO queue is typically backed up by a few
tens of sectors. It is hard to imagine any hardware cause for this. I
do not think that this measurement is due to a flaw in my method of
collecting statistics; nonetheless, it is possible. If I have made no
mistake, then there is indeed something odd going on down at the lowest
levels of disk access.

Application to Early Flushing
-----------------------------

The early flush algorithm essentially tries to use disk bandwidth that
would otherwise be unused. When it detects a period of disk inactivity
it tries to write out as many old buffers as it can, without loading up
the disk queue so much that some higher priority user of the disk
bandwidth, such as the swapper, would be delayed too much. In other
words, it wants to submit enough sectors for io to keep the disk busy
continuously, and not a lot more than that. To do this accurately it
needs to know the disk bandwidth.
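
A minimal userspace sketch of that sizing rule, with made-up helper
names rather than the patch’s actual interfaces:

#include <stdio.h>

static unsigned long bandwidth_sectors = 200;  /* estimate: sectors per poll interval */
static unsigned long queue_depth = 40;         /* sectors currently in the IO queue   */

static void submit_old_buffers(unsigned long nr_sectors)
{
        printf("flushing %lu sectors of old dirty buffers\n", nr_sectors);
        queue_depth += nr_sectors;
}

/* Called once per poll interval when no other disk activity is seen.
 * Fill the queue to ~150% of what the disk can drain in one interval:
 * enough to keep it busy, not so much that a higher-priority user such
 * as the swapper gets stuck behind low-priority writeout. */
static void early_flush(void)
{
        unsigned long target = bandwidth_sectors + bandwidth_sectors/2;

        if (queue_depth < target)
                submit_old_buffers(target - queue_depth);
}

int main(void)
{
        early_flush();
        printf("queue now holds %lu sectors\n", queue_depth);
        return 0;
}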

As discussed above, disk bandwidth is not a simple number, it depends on
what the disk is actually doing. It’s possible that keeping a
continuous estimate of disk throughput as I do in this patch is better
than assuming some fixed number. There are dangers too. Suppose for
example that a period of coherent IO results in a bandwidth estimate
close to the raw transfer rate of the drive, then activity ceases and
the early flush uses that estimate to begin a flush episode.
Unfortunately, the blocks being flushed turn out to be highly
fragmented, and so 20 times more blocks are scheduled for IO than would
be ideal. If there is no new demand for disk bandwidth during the
period of the flush episode, no harm is done, because the estimate will
be improved over the next few sample periods. But if there is sudden
demand, the higher priority user will be delayed by the low priority
blocks in the queue. Hopefully, such an unfortunate combination of
factors is a rare event, nonetheless I am giving consideration to how
the possible bad effects could be ameliorated.

I tested this patch just once on a live system, for a reality check. In
that test I saw a 5% improvement in kernel compile speed:

Command

time make clean bzImage modules

Vanilla kernel

real 11m58.176s
user 10m37.840s
sys 0m28.740s

With early flush + bandwidth sensing

real 11m21.227s
user 8m38.160s
sys 0m48.460s

More testing needs to be done to see if this is reproducible.

Other applications
------------------

There are other areas in the kernel that could benefit from using disk
bandwidth and queue size input. One example is page laundering.

Currently, page laundering relies on memory pressure and clean page
statistics to decide how many pages to submit for writing.
Unfortunately, under some loads, memory pressure is continuous, and that
statistic carries little useful information. Similarly, some loads use
dirty pages as fast as they are cleaned, so the clean page statistic is
not reliable either.

Alternatively, page_launder could sense the length of the io queue and
use the disk bandwidth statistic to guide its decisions on how many
pages to write out. It is counterproductive to load up the io queue
with too many dirty page writeouts, if only because a sudden relaxation
of the load can leave the system busily writing out pages when it should
be reading, e.g., swapping a gui program back in that was swapped out
under load. So instead, page_launder can write out enough pages to let
the elevator work efficiently and stop there.
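
A hypothetical sketch of how such a sizing decision could look, again
in userspace form with invented names rather than real kernel code:

#include <stdio.h>

#define SECTORS_PER_PAGE 8                     /* 4K page, 512-byte sectors */

static unsigned long bandwidth_sectors = 400;  /* sectors drained per interval */
static unsigned long queued_sectors = 520;     /* sectors already in the queue */

/* Decide how many dirty pages to submit on this pass. */
static unsigned long pages_to_launder(void)
{
        unsigned long target = bandwidth_sectors + bandwidth_sectors/2;

        if (queued_sectors >= target)
                return 0;  /* queue full enough: more writeout only adds latency */
        return (target - queued_sectors) / SECTORS_PER_PAGE;
}

int main(void)
{
        printf("launder %lu pages this pass\n", pages_to_launder());
        return 0;
}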

Other applications will no doubt be found. Even the possibilities for
opportunistic IO have hardly been mined out.

Possible Improvements
---------------------

The current patch is most probably sub-optimal. For one thing, it lumps
reads and writes together in one bandwidth statistic. For another, the
full/partial sample distinction is overly crude. Something along the
lines of what Stephen Tweedie does in his sard patch with idle time
measurement would likely be superior.

From the enterprise-computing point of view, the major improvement that
needs to be made is in separate analysis of multiple block devices. This
per-device information needs to be propagated back into kernel
mechanisms such as bdflush, page_launder and the swapper. Needless to
say, this is 2.5 material.

Proc Interface
--------------

In the third patch of this set I create a simple proc interface to
expose the bandwidth estimation, and another simple statistic, current
transfer rate, to user space. This is used as follows:

daniel@starship# cat /proc/bandwidth
1720 0

The first number is the current bandwidth estimate and the second is the
current transfer rate. Note that the bandwidth estimate is updated only
when there is disk activity, and it can vary a great deal as described
above. Do not be surprised to see a strangely low bandwidth estimate
when the system is sitting idle – it can easily result from a final
burst of disk access that is extremely fragmented.
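
For reference, a read-only proc entry of this kind would be wired up
roughly as below in a 2.4-era kernel; this is a guess at the shape of
the third patch, not its actual code, and the two exported variables
are placeholders:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>

extern unsigned long bandwidth_sectors;      /* filtered estimate           */
extern unsigned long current_transfer_rate;  /* raw rate over last interval */

static int bandwidth_read_proc(char *page, char **start, off_t off,
                               int count, int *eof, void *data)
{
        int len = sprintf(page, "%lu %lu\n",
                          bandwidth_sectors, current_transfer_rate);
        *eof = 1;
        return len;
}

int init_module(void)
{
        create_proc_read_entry("bandwidth", 0, NULL, bandwidth_read_proc, NULL);
        return 0;
}

void cleanup_module(void)
{
        remove_proc_entry("bandwidth", NULL);
}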

I do not pretend that this proc interface is correct in any way;
however, it should be fun to play with.

JXTA, JAX, .Net, ONE.. What a mess

I am still thinking about what the rest of the world will do to answer .net. In the process I have come across so many new acronyms it makes you puke.

  • JXTA seems to be a peer to peer framework for the java language.
  • JAX (Java API for XML)
  • ONE (Open Network Environment)

Apparently Dave Winer is

working on a big-picture road map for XML storage, membership and other cool related stuff. It’s a technical, economic and political document. It’s not wussy. A declaration of independence from our Friends Up North. We can’t get locked in the trunk with the rest of our friends, there’s simply not enough room for comfort. We like lots of space.

tech ed day 3

the day started off with an in-depth session about c#. c# has some nice properties that can stand on their own, but industry support will be crucial. versioning of classes is an approach to tackling the fragile base class problem, where changes in a base class lead to bugs in derived classes because the derived classes expect certain methods or variables to be there. versioning can at least give the programmer a hint where problems may arise. if i understood this correctly, this versioning information is part of the metadata that is stored alongside the classes and can therefore be used at runtime. another nifty feature is xml comments. extending on the javadoc idea,

they can contain structured comments which can then be transformed with an xsl stylesheet. besides this there are some minor cleanups of c++, like requiring boolean values in every if/while construct, or escaping entire strings like this: string bla = @"\\server\share\bla.txt";

the next presentation was quite impressive, with mark russinovich of sysinternals.com fame at the helm. he gave a walkthrough of some of his tools, like filemon and regmon, which monitor file and registry accesses, respectively. his tools are even used within microsoft. his process explorer also does a lot more than the built-in task manager, like killing any process without giving stupid access denied errors. he even has a nifty little tool to remotely execute commands. this hack works by auto-installing a service via the admin share of a remote computer and then carrying out the requested operation.

after his session i tried to charge my notebook but only got to 50%, meaning i had to look for power strips all day 🙂 the lunch session was held very informally by mark russinovich. his first slide surely caught our attention.

he then went on to demonstrate how far windows has come in terms of architecture, stability and scalability. he threw in lots of tidbits, like the fact that the build number for windows has been continuously increased since 1992; the most current is 2505 (XP RC1). so this basically means that the windows os has gone through some 2500 complete builds in 10 years. locking has been made more fine-grained in XP, resulting in scalability increases. i can see it now: a new round of windows benchmarks stacked against linux benchmarks. it also came to light that the nt kernel is written in a somewhat object-oriented style (it even uses exception handling, i hear). if details like these interest you, you should check out the nt resource kit, as it comes with great documentation.

the rest of the afternoon was spent in 2 sessions about debugging, one called analyzing crash dumps and the other .net debugging. the first one was quite interesting; i learned that microsoft has a tool to analyze crashes which uses heuristics to determine error patterns in your application. it is somewhat similar to dawson engler’s meta-level compilation, except that it analyzes the binary and is therefore most likely less powerful than dawson’s approach.

in between we squeezed in a meeting with jose osuna, the academic manager responsible for switzerland. we had a good talk and i hope we can organize some events with him in the future.

now i am off to catch some of barcelona’s nightlife. i’ll skip the graveyard session for once.

Hailstorm alternatives

Dave Winer’s xmlStorageSystem is a proposal to implement a hailstorm-like cloud to store xml data. Very interesting. I hope this gets broad support. Google already has some juicy stuff.

xns.org has been developing xns for several years and aims for the hailstorm space. it seems like every day brings a new initiative in the web services arena. note to self: i ought to write up a comparison of the various identity services that are being developed:

  • hailstorm
  • xmlstoragesystem
  • xns

oh my!

TechEd day 2

notes on .net and how open source may counter the threat, some stats and great food. we hurried to the conference area after a much too early rise. it was on the way to the conference that we realized for the first time how huge teched is.

the main room was just gigantic.

we were greeted by queen’s “barcelona” anthem, followed by some dull marketing fluff. among reams of uninteresting tidbits we learned that some 9000 people were attending teched. after a while anders hejlsberg entered the stage to give the first keynote. considered by some to be one of the best programmers around, he nevertheless delivered a performance that left a lot to be desired. of course, he had to remain on the surface; this being the keynote, he had no chance to demonstrate his considerable talents as a language / systems architect. he was quite successful in giving a glimpse of the .net framework and its far-reaching impact, however. all of the day’s sessions centered around .net. the point that microsoft believes in open standards was driven home many times, with some credible demonstrations like microsoft’s early involvement in xml standardization and its increasing reliance on established standards like kerberos, ldap, dynamic dns, wbem (web based enterprise management), xpath, xslt and http (the list goes on). over the course of these presentations it became very clear that by introducing the concept of web services, microsoft has unleashed something much larger than it can ever hope to control the way it has in the past. web services have all the ingredients of a disruptive technology. they place simplicity where complexity and opaque systems have reigned for so long.

their complete reliance on xml for all aspects has brought them criticism from some quarters: that they are not efficient and that xml adds nothing that was not there before. i was wondering along these lines as well. however, when i saw how the concept of web services has evolved in one year, i started to notice similarities to the classic and incredibly successful osi layering model. web services start where osi ends, but they share the concept of piling independent services on top of each other. this has been a very powerful architecture in networking systems, especially tcp/ip. since xml is such a simple representation of data, it has been very easy to extend web services with additional layers and make them increasingly powerful. i believe that the benefits of large-scale adoption of xml will be reaped with ever more layers stacked on each other, with ever increasing power.

although web services are an active area for the w3c, it remains doubtful how the industry will counter microsoft’s .net juggernaut. declaring support for soap, as ibm, sun, oracle and others have done, is not going to cut it. what is needed is a credible architecture that can compete feature by feature with .net. although all the components like apache (web server), soap for apache, jabber (xml messaging), kdevelop (ide), postgresql (database), ldap (directory) exist in the open source community, they are not part of an overall architecture. it would be a major undertaking to get the developers of the respective components to talk to each other and agree on common interfaces. the old unix argument about never setting policy looks quite silly when you realize what productivity gains microsoft will be leveraging with their .net platform.

it also became quite evident that we have seen nothing yet in terms of the web services architecture. many key pieces are missing, like metadata to enable the retrieval and processing of semantics from
data (to support agent technology, for instance), the question of payment for web services, and global, fine-grained security matrices (who has access to which of my data). web services are loosely coupled,
but they have no mechanism to guard against api changes or to facilitate negotiations on usage terms.

besides all these lofty ideas we came back to reality quickly when we saw the enormous amount of logistics that went into this conference. details like having a dining hall for 9000 people
or being so well organized that leaving my camera in the computer area was not a complete disaster (i struck it lucky when i got it back from the lost & found counter) made a big impression on me. the all you can eat buffets every few meters had their influence as well..

i learned a few interesting details about eai (enterprise application integration), an area where bea systems has been strong and where microsoft made their debut with biztalk server. for instance, most people who believe they need synchronous interfaces (i.e. immediate access to results) actually don’t.
you can fool these people with clever tricks like pretending to be synchronous on the front end via http redirects while your backend interface is in fact asynchronous. the graveyard session for the day was actually quite funny, even though the main speaker had to boast about his accomplishments all the time. they shared many anecdotes, like being used as a spam relay during scalability testing, or their isp wrongly throttling the bandwidth of their incoming mail connection to 70 kbps for 500 concurrent users 🙂 they made up for that with their end-to-end ipsec deployment (it would have been too lovely to sniff passwords on a lan with 6000 mobile ethernet clients..) and by replicating several databases to london in real time. after this session we were driven to a nice location just opposite our hotel for the swiss country dinner. it was basically one of the nicest places i have been to in quite some time. great job microsoft.

tech ed day 1

submitted directly from the conference floor via wireless ethernet.. 🙂 that’s the power of wireless, i guess..
we entered barcelona around 10, after a refreshing flight with a sensational view over the alps. after checking into the hotel we went to the conference center, where we were greeted by bunches of geeks sitting on the floor, huddled over their notebooks or their newly acquired geek toys (aka the compaq ipac).

we were handed a monstrous conference backpack in bright yellow. the backpacks were just too offensive for our visual cortexes, so we had to dispose of them soon afterwards. we queued to get a hot new compaq ipac with a wireless ethernet card. soon afterwards we were successfully checking into the abstrakt portal. does this rule or what. somewhat relieved, and with our ipacs in hand, we headed back to the hotel to chill out and play with them.

although they have a high coolness factor and it was great fun playing around with them, we concluded that the analog conference schedule still beats the online version by a lot in terms of usability. besides, the paper version does not forget your notes after a reset. after hunger sent us out to fetch some food (we found a selection of tapas), we headed back to the conference for a special student welcome dinner. microsoft must have taken a page from other conferences, since we were greeted by nice hostesses. what a contrast to all these shy geeks.. unfortunately there were not too many female geeks around, as was to be expected.
we were then driven to a restaurant, and a large buffet was quickly consumed. we shared the table with 2 guys from cambridge who are working as microsoft consultants during their summer break. they are currently implementing a voice over ip application over gprs using the compaq ipac. to our great amusement they were avid slashdot regulars, and the rest of the evening was thus spent in merry geek lore. topics ranged from umts to the singularity to debian installs. in short, a very refreshing discussion. we were then
advised as to what sessions out of the 264 we should attend. clearly there will be some hard choices to be made as some interesting sessions collide.
after we were handed a fancy schmancy jacket in brightest yellow (we kept it because it looked kinda neat for a change) we were dismissed and spent the rest of the evening catching up with various projects each of us had been silently advancing. barcelona is one heck of a nice city by night.
so many decent places to hang out.

conference week

in less than 7 hours i will be on the plane to barcelona, heading to microsoft teched 2001. i hear there is a lot of petty crime in barcelona. what the heck, i’m taking my cam, notebook etc. with me nonetheless. it will be interesting to spend a week with my flat mates for once. 😉