Wednesday, December 30, 2009

Arduino - Open Source Initiative of 2009

For those who do not know, Arduino is an open-source embedded prototyping platform: a single-board microcontroller with a set of sensors and embedded I/O controls. The hardware typically contains an Atmel ATmega328 AVR microcontroller, a thermistor, a crystal oscillator, and so on, although it varies from board to board. The ATmega328 is an advanced RISC microcontroller with 32 x 8-bit general-purpose registers and a two-cycle multiplier. The Arduino programming language is an extension of C++ and is easy to learn. The source code of the entire software suite, which includes an IDE, is released under the GPL. The hardware design is also available under a Creative Commons license for non-commercial use. Those who cannot build the hardware themselves can purchase a board from Arduino at low cost.
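To give a feel for the language, here is the classic minimal sketch - blinking an LED - written against the standard Arduino core API; I am assuming a board with an LED wired to digital pin 13, which is true of the common boards of this generation.

    // Minimal Arduino sketch: blink an LED on digital pin 13.
    const int ledPin = 13;           // most boards of this era have an on-board LED here

    void setup() {
      pinMode(ledPin, OUTPUT);       // configure the pin as a digital output
    }

    void loop() {
      digitalWrite(ledPin, HIGH);    // LED on
      delay(1000);                   // wait one second
      digitalWrite(ledPin, LOW);     // LED off
      delay(1000);
    }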

Arduino's target users are hobbyists and artists interested in creating interactive objects. But something tells me users can do much more than that. I can even imagine potential entrepreneurs with an idea for a commercial electronic product prototyping it with Arduino.

Arduino first came into the picture in March last year. But this year its fame has grown very fast, especially after the BBC ran an article and video on it by its technology correspondent Mark Ward in May. Although that article focused on how Arduino helps connect the real world to the web world (which is cool in itself), that is only part of the story. Kudos to all the developers. You have done a great job in building Arduino and opening it up to the world.

Friday, December 25, 2009

Ubuntu's backward step

I have a laptop that runs Ubuntu Karmic (multi-booting with OpenSolaris), and I recommend Ubuntu to all my friends who want to give Linux a shot. The most important reason is Ubuntu's ease of use. Over the past several months many people have been shifting from Windows to an open source OS because, unlike in the past, Linux is now as easy to use as Windows. I still remember the days of the 9-CD Debian installation; now anyone with basic computer knowledge can install Ubuntu. There is also a huge amount of online help for Ubuntu.

One of my friends, who installed Ubuntu on my advice, called me yesterday to say that it throws a low-graphics error, even though her laptop has an NVIDIA GeForce card. Ah, that's easy: you just need to shut down GNOME, update /etc/X11/xorg.conf, download and install the driver, and invoke runlevel 6. Simple! This is the typical problem of being a computer engineer - assuming that the person at the other end understands the lingo of computers. Not surprisingly, my friend could not make head or tail of it.

Then I got an idea. xorg.conf is handled through debconf, and Ubuntu is built on Debian. Therefore she should be able to run dpkg-reconfigure xserver-xorg. She might still need to shut down GNOME, but that's OK. She wrote back saying dpkg-reconfigure was not found. I was startled. I went back to my laptop, tried dpkg-reconfigure, and it was not there.

Does anyone have any clue why it was removed without providing an alternative? Ubuntu was making it easy for everyone to use Linux, and in that journey this is suddenly a step backward. How many people who own a graphics card know anything about xorg.conf and its significance? Let's hope Ubuntu provides an alternative in its next release.

Friday, October 30, 2009

Temporary Suspension

I have not been posting anything on my blog recently, as many of you noticed and some of you mailed. The fact is that I am busy with two completely independent pieces of work: 1. designing and implementing a completely new heuristic for logic minimization, and 2. physical layout design for a CMOS charge pump.

Both are interesting and consume most of my time. As a result, I cannot find time to post my views here. So until mid-December there will only be sporadic updates on the blog. Nothing more than that. Starting mid-December, I will be back on track.

Wednesday, September 16, 2009

Arrow of Time

Last month (Aug 09), the world of physics was enthralled by a mathematical work that offered an explanation of the arrow of time. The paper was by Lorenzo Maccone, one of the big names in quantum physics of this generation. Physicists in universities (at least the big ones) around the world have had hard copies of this paper in their hands, gearing up for lunchtime debates on its merits and demerits. As a result, we now have a paper that provides a counter-argument to Maccone's proposal. Moreover, this paper says that the quantum explanation given by Maccone actually complicates the arrow of time rather than resolving it. The counter-argument comes from professors at Imperial College London. Before taking a deep dive into it, let's look at what the arrow of time means.

The term arrow of time was coined by the English astronomer Arthur Eddington. Does the name ring a bell? Yes, he is the same Eddington who studied the 1919 solar eclipse to confirm the theory of relativity. The mathematical equations of physics are time symmetric. That is, whatever happened over a period of time can be reversed without violating any physical law. It's like rewinding your VCR. But the crystal jar that I broke last week did not rebuild itself. And it is ridiculous to think that it would.

When we look at the real world, if we leave things alone they always drift towards chaos. In the language of thermodynamics, entropy always increases in the real world, as per the second law. In the movie Godzilla, while running through Manhattan, Godzilla dashes against the Chrysler Building and it falls down and breaks. That is possible and realistic, given proper assumptions about the strength of Godzilla and the weakness of the building. Run it in reverse: Godzilla runs through the rubble and it reassembles into a tower and rises back to the top of the building. Does that sound realistic? No. But in a universal, completely mathematical sense, any object can eventually return to lower entropy. So the arrow of time is this phenomenon: in reality we always move forward in time with a monotonic increase in entropy. At the quantum level, the evolution of quanta is given by Schrodinger's equation, which is time-symmetric. Here the term entropy is not about heat but about information.
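To put the two statements the puzzle rests on side by side (standard textbook forms, not anything specific to Maccone's paper): the second law for an isolated system, and the Schrodinger equation, whose dynamics is time-reversal invariant in the sense that reversing t and conjugating the wave function gives another valid solution:

\[ \frac{dS}{dt} \ge 0, \qquad i\hbar\,\frac{\partial \psi(t)}{\partial t} = \hat{H}\,\psi(t), \qquad \psi(t) \;\to\; \psi^{*}(-t). \]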

Let's look at a quantum-level example. I have built a particle accelerator in my backyard. I power it up and it generates a quantum particle. But how would I know whether it actually generated a particle or not? The Copenhagen interpretation says that unless I observe the particle, I cannot say. In other words, unless I observe the particle, I cannot assume that it exists, because I have no information about its existence. I decide to test the accelerator, I use a Geiger-Muller counter, and it ticks. Wow! Now I know my accelerator works. But what I did by observing its presence is collapse the wave function of that particle. So I know that I generated a particle in the past, because it left a trace of information.

The time-symmetry of the mathematics and the second law of nature can be looked at as a paradox. If the underlying dynamics is time-symmetric, it should be impossible to obtain from it a time-asymmetric law like the second law of thermodynamics. But everybody knows that the second law works. This is called Loschmidt's paradox. Maccone, in his work, has made an attempt to answer Loschmidt's paradox.
"How wonderful that we have met with a paradox. Now we have some hope of making progress." -Niels Bohr
Now let us get back to the particle-accelerator-in-my-backyard example to explain what Maccone says. Assume that my Geiger-Muller counter ticks, but I have selective amnesia and forget the count immediately after hearing it. Or, in other words, what if somebody wiped out the neurons of my brain in which I stored the information that I heard a count? OK, my example is bad. Let me make it more classical. What if the crystal jar that I broke came back together from its shards, but I simply don't remember that event happening? The point is this: any process that results in a decrease in entropy is accompanied by the erasure of its observer's memory. At the quantum level such momentary entropy decreases are observed, but never at the macroscopic level, because of decoherence. Here is the finding in Maccone's own words:
... any decrease in entropy of a system that is correlated with an observer entails a memory erasure of said observer, in the absence of reservoirs (or is a zero-entropy process for a super-observer that keeps track of all the correlations). That might seem to imply that an observer should be able to see entropy-decreasing processes when considering systems that are uncorrelated from her.
Quick note: the particle observation that I used as an example is one process in quantum physics that is considered irreversible - the measurement described by the Born rule. Even that would be reversible according to this paper, if somebody wiped out the observer's neurons at the instant of observation, as I said earlier.

At this point, if you are really confused, shocked, and find it all illogical and bulls**t, please wait a moment and read this quote before we go to the counter-argument:
"No, no, you're not thinking; you're just being logical." - Niels Bohr.
The counter-argument by Jennings et al. starts with the claim made by Maccone, as given in the quote. During an entropy-decreasing event, the mutual information about the event is destroyed as part of the memory erasure. The important point is what kind of mutual information has to be nullified as part of that erasure. Classical mutual information. Is there another kind? Oh, yes: quantum mutual information. So Maccone's proof holds good only if a reduction in quantum mutual information goes hand in hand with a reduction in classical mutual information.
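For reference, here are the standard definitions (mine, not quoted from either paper): classical mutual information in terms of Shannon entropies H, and quantum mutual information in terms of von Neumann entropies S of the density matrices:

\[ I_{c}(A\!:\!B) = H(A) + H(B) - H(A,B), \qquad I_{q}(A\!:\!B) = S(\rho_{A}) + S(\rho_{B}) - S(\rho_{AB}). \]

The quantum quantity can be strictly larger than the classical one, and that gap is exactly what the counter-argument exploits.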

As noted, mutual information I(A:B) measures how much information system A holds about system B. If the reduction in quantum mutual information happens without any change to the classical mutual information, the earlier proof no longer holds. The counter-argument does not stop there: if quantum mutual information can be reduced with no harm done to classical mutual information, the situation gets worse. The authors give a mathematical proof of a case in which an entropy-decreasing event erases the quantum information but actually increases the classical information. What does that mean? It would be possible for people to remember observing the crystal jar reforming itself. But nobody does. So Loschmidt's paradox remains unsolved. I am not aware whether Maccone has responded to this yet. Or maybe, like Walter Ritz and Albert Einstein, Maccone and the Imperial College professors will agree to disagree.

Thursday, September 10, 2009

Required: An Educational Reform

President Barack Obama, in his congressional speech on health care, compared the proposed plan to the current education system in the United States. There are private universities like Stanford and Harvard offering top-class education at a high cost, while public universities like UC Berkeley, UT Austin, UA Tucson, and UMass Amherst offer good quality education at an affordable cost. Good analogy. The US higher education system is arguably the best in the world. But that does not mean it is the best that could ever be achieved.

Felix Salmon has written a blog post at Reuters on rising tuition fees. Currently tuition stands at 60% of the US median income - I am talking about undergraduate education. College tuition has increased at double the rate of US median income, which also means that college tuition is rising almost as fast as health insurance premiums. What is more alarming is that at some public schools, like the University of Massachusetts, only 33% of the students who enroll in undergraduate education graduate. From a socio-economic standpoint, US higher education needs some serious reform.

Obama's message was that he did not want any American to be in a situation where a treatment for their disease exists but they cannot afford it. Nice vision. I think the same applies to education. After all, illiteracy is a disease.

Monday, September 07, 2009

IDE Decision

Texas Instruments has recently made a sound decision to base its Code Composer Studio IDE on Eclipse. That's good news. It takes nothing away from the old CCStudio IDE, which is comprehensive, very good, and easy to learn. But that is not all today's developers look for: how easy is it to customize the old CCStudio IDE? As the IDE becomes Eclipse-based, we can expect independent developers to write plugins for CCStudio, and that would certainly improve productivity on the embedded systems and DSP development front. After looking at many IDEs, I can make an easy business/productivity call: your IDE should be based on Eclipse. It's the second best IDE (after Visual Studio) that I have seen, although I personally write my Java code only in Emacs.

Saturday, September 05, 2009

Taco Bell Revelation

Today I had breakfast at Taco Bell. While I was waiting for my bean burrito, I noticed that they sell an egg burrito for $0.89. What's more, it has bacon in it. Well, I don't eat bacon, but the menu board reminded me of some old stories. Back at the beginning of the year, some Ivy-League-educated people scared all my friends into believing the US would be drowning in inflation. But I was not scared. I said I would rather welcome some inflation. The Taco Bell menu teaches my friends that their fear was uncalled for.

Saturday, August 29, 2009

TR-35 2009

What can you achieve in the first 35 years of your life? OK, this has a wide range of answers, since in some fields, like sports, people peak before 35. Let me narrow the scope. What can you achieve in the field of science and technology before you are 35 years old? Maybe you can win the Nobel Prize in Physics. Or maybe you could formulate 3,900 mathematical equations and be considered by some an equal of Jacobi or Euler. Or maybe you can win a TR-35.

The TR-35 winners for 2009 have been announced. Congratulations, winners. Your achievements are inspiring. If I could do half of what any one of you has done, I would take a deep breath and hang up my boots. Special congratulations to Kevin Fu of the University of Massachusetts Amherst and Jose Gomez-Marquez of MIT for standing out as the best among the best.

Friday, August 28, 2009

Synergy among processors

A simple question. Jack is a cobbler working at DumbCobbler Inc. He stitches on average 20 shoes per day. DumbCobbler Inc. has won some new orders, so it hires 9 new cobblers to work with Jack. Do the math: how many shoes can DumbCobbler make in a day? If the answer is 200,...

Wrong. They actually make 225 shoes a day. Where do the extra 25 shoes come from, if all the cobblers are equally qualified and each can make only 20 shoes per day in isolation? The extra 25 is the result of synergy. When a team of talented people works together, a kind of team dynamic develops among them, and even without much process improvement they produce more in a given time. Where there is synergy, the whole is greater than the mathematical sum of its parts.

Now let me change the problem a little. I have a processor, say a 32-bit MIPS processor. I have an image processing problem, and the processor takes 6 seconds to run the algorithm that solves it. Now I put two such MIPS processors in a multicore environment and run the same algorithm. How much time does it take now? If the answer is 3 seconds, ...

Wrong again. The answer would probably be somewhere between 4 and 5 seconds. It depends on two factors. One is whether the algorithm can be parallelized at all; some algorithms are inherently sequential. For example, adding the numbers in an array (am I sure?). The second factor is how skilled the programmer is at recognizing the parallelism present in the algorithm. Adding the numbers in an array can in fact be done in parallel, since you can divide and conquer, as in the sketch below. But even in a completely parallel implementation the lower bound remains intact: if one processor takes 6 units of time, two processors can at best take 3 units of time to finish the job. A little more, perhaps, for synchronization - but never less, unlike the humans.
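As a toy illustration of the divide-and-conquer version (plain modern C++ with two threads, not tied to any particular MIPS toolchain), here is the parallel sum; the join() calls are the "little more for synchronization":

    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1000000, 1);    // the array to be summed
        long long left = 0, right = 0;
        auto mid = data.begin() + data.size() / 2;

        // Divide: each thread sums one half of the array in parallel.
        std::thread t1([&] { left  = std::accumulate(data.begin(), mid, 0LL); });
        std::thread t2([&] { right = std::accumulate(mid, data.end(), 0LL); });

        // Conquer: wait for both halves (the synchronization cost) and combine.
        t1.join();
        t2.join();
        std::cout << (left + right) << "\n";  // prints 1000000
        return 0;
    }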

So will processors ever get synergy? If we take processors to be dumb compared to humans, would robots with AI have synergy? For that, someone would have to explain to me how synergy works in terms of a cerebral model. Has any psychologist tried? No idea; I don't follow medical research.

But here is an interesting fact. IBM's Cell multicore architecture contains a set of RISC processors called SPEs, which stands for Synergistic Processing Elements. They are SIMD processors suited for vector processing (as any SIMD processor is). An SPE does not do out-of-order execution; with 128 registers, register renaming can be done liberally, which obviates the need for OoO hardware. Instead of a cache, the SPE uses something called a local store, which, like a cache, sits on the chip. SPEs act together when a set of them is chained for stream processing. This property is of great use in graphics work that requires fast video processing. More on the Cell architecture some other time.

Probably because these processors can be chained together, they carry the prefix "Synergistic". It still does not produce the effect of the whole being greater than the sum of the parts. That is something we will have to wait a long time for.

Wednesday, August 26, 2009

Returning 80s

Fashion is cyclic: what goes out of fashion today will be back in fashion tomorrow, and what used to be fashionable is slowly coming back. I am not talking about actual fashion news. I am talking about silicon PCBs and 3D chips with buses running through bare silicon between layers. Some startups are trying to build PCB-like systems on bare silicon - one does it in 3D and the other in 2D. With high ASIC prices and the need for compact logic, this may become an alternative for many struggling IC makers.

Ted Kennedy - your memories shall never die!

Ted Kennedy has passed away. I cannot forget his famous speech, "The Dream Shall Never Die". It is an inspiration for all youngsters who feel heartbroken about our current economic state: for college grads who have no funding, for graduates who cannot find a job, for hard workers who just lost their jobs through no fault of their own. A must-watch video.

Sunday, August 23, 2009

Unemployed Engineers

This news on the IEEE website is two months old, but I read it only now. And it is bad news. The unemployment rate for electrical and electronics engineers hit a record high of 8.6%. What's worse, the biggest jump came in the last quarter, in which the rate more than doubled.
The news for EEs was particularly bad as the jobless rate more than doubled from 4.1 percent in the first quarter to a record-high 8.6 percent in the second. The previous quarterly record was 7 percent, in the first quarter of 2003.
In the first quarter, computer professionals' jobless rate stood at 5.6%, and it is still standing at 5.6%. Good that it has not gone up too.

This is two-month-old news. The economy is now improving: GDP is picking up in the US and Europe, home sales are rising, and we are certainly on the path to recovery. So can we expect the unemployment claims for engineers to come down? Not so fast. Unfortunately that's not how it works. The economy is currently undergoing a jobless recovery, like it did back in 2001. This means you can see a rise in GDP and the consumer price index while the unemployment rate stays put for a while. So the bad days are far from over. Let's hope we can get out of this hell soon.
UPDATED 23-AUG-09: Here is Paul Krugman affirming jobless recovery.

Saturday, August 22, 2009

Tom, Jerry and Attenboroughii

Looks like Tom has got a new friend to help him catch Jerry!
Sir David, 83, said: "... This is a remarkable species the largest of its kind. I'm told it can catch rats then eat them with its digestive enzymes. It's certainly capable of that."

Thanks, Dr.DeLong!

At the beginning of this month, Richard Posner published a harsh criticism of Christina Romer's prediction about the stimulus package. The article even questioned the ethical responsibility of Romer and some prominent progressive economists (Stiglitz's name was missing!!).

But there was a basic mathematical mistake in the criticism: Posner compared annual GDP with quarterly spending. Although I noticed this, my inferiority complex did not allow me to publish it on my blog. After all, I am not a professional economist. I am an engineer who loves macroeconomics - and, in fact, anything that can be derived from reason.

Brad DeLong, on the other hand, had no such inhibition about spotting this flaw and other mistakes in Posner's article.
Posner is trying to get his readers to compare the number 5 (the percentage-point swing in the growth rate between the first and the second quarter of 2009) to the number 2/3 (the percentage share of second-quarter stimulus expenditures to annual GDP). He hopes that they will conclude that Christina Romer's claims are wrong because the effect is disproportionate to the cause: $1 of stimulus could not reasonably be expected to produce $7.5 of boost within the same quarter. But the stimulus money spent in the second quarter was spent in one quarter, so the right yardstick to use to evaluate it is not annual but rather quarterly GDP--stimulus spending in the second quarter was not 2/3 of one percent but 2.6% percent. And the level of production in the economy in the first quarter was not 6% but rather 1.5% below its level in the fourth quarter--the 6% number is not the decline from one quarter to the next but rather the rate of decline, how much the decline would be after a year were it to go on for four quarters. So the right comparison is 1.5% to 2.6%[1].

Posner is off by a factor of 16.
I am glad my observation was right that Posner was wrong to compare annual GDP with quarterly fiscal spending. Thanks, Dr. DeLong, for bringing it up.

Friday, August 21, 2009

I still don't know how Core i7 is made!

Here is an interesting video on the making of the Intel Core i7. It starts off well, but it slowly becomes more of a marketing video than a technical one. It gives an overview of the architecture, but I was looking for an in-depth analysis of the microarchitecture, with the design decisions and why they were made. It disappointed me there. Still, it is a good video for a high-level picture and worth watching.

Thursday, August 20, 2009

Proposals for Smart Grid

Here is some good news from last weekend. What was earlier considered an impossible and impractical challenge has received more than 400 proposals. I am talking about the Smart Grid project (part of Grid Vision 2030), through which the United States Department of Energy (DOE) is trying to modernize and improve the electric grid through the introduction of microprocessor- and microcontroller-based control systems.

Frankly, I have been wondering whether the DOE does not already use microprocessors - processors have been around for decades. The devil is probably in the details. Anyway, it is good news in terms of efficiency: if we can reduce the power lost in transmission and distribution, we also reduce the global warming caused by human activity. Transmission and distribution losses account for 7.2% of the total power produced. The DOE should be appreciated for its good job of conducting e-forums on the Smart Grid. Something equally aggressive needs to be done in developing countries like India, which experience power shortages and huge transmission and distribution losses.

Tuesday, August 18, 2009

Garbage collector for C - not for embedded systems programming

A quick and dirty post while drinking a tall non-fat latte in a coffee shop. A few months ago a friend asked me whether C/C++ had a garbage collector, since neither language has a native one. My answer was no. Apparently my answer was wrong. There is a garbage collector for C, and it is not just some crackpot code - it is hosted on a Hewlett-Packard site. Sweet. But do C or C++ need it? I have coded Java for fun as well as for food. From a coding perspective a garbage collector is very helpful and makes your job easy. Collection happens efficiently without your intervention, but it still uses up considerable CPU cycles. So there are times when the garbage collector becomes a performance nightmare. In those cases you need to tune it for efficiency, which eats up the time you saved while coding.
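The collector in question is, I believe, the Boehm-Demers-Weiser conservative collector hosted at HP Labs. A minimal sketch of how it is used, assuming libgc is installed and the program is linked with -lgc (the header name can vary by distribution):

    #include <gc.h>        // Boehm-Demers-Weiser collector; sometimes installed as <gc/gc.h>
    #include <cstdio>

    int main() {
        GC_INIT();                                        // initialize the collector
        for (int i = 0; i < 100000; ++i) {
            int *p = (int *)GC_MALLOC(64 * sizeof(int));  // collected memory: no free() needed
            p[0] = i;                                     // touch the block so it is really used
        }
        std::printf("heap size: %lu bytes\n", (unsigned long)GC_get_heap_size());
        return 0;
    }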

C and C++ are used widely in embedded systems because of their efficiency, low CPU overhead, and real-time performance. Although some have demonstrated Java programs approaching the speed of C programs, you cannot expect that consistently across the wide range of embedded systems programming. Garbage collection is a sensible abstraction when the operating system takes care of memory for you. But in embedded systems programming, you know how much memory you have and you know how to manage it - the same way the operating system itself manages memory. So you have to be disciplined enough in your programming to allocate and free memory and to construct and destroy objects yourself. Leaving it to a garbage collector will slow down your embedded system.
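To make the contrast concrete, here is the kind of thing an embedded programmer does instead: a fixed, statically sized block pool carved up by hand, so that memory behaviour is known before the system ever runs. This is an illustrative sketch of the general technique, not any particular RTOS API.

    #include <cstddef>
    #include <cstdint>

    // A tiny fixed-size block pool: all memory is reserved up front,
    // allocation and release are O(1), and no collector can pause us.
    template <std::size_t BlockSize, std::size_t BlockCount>
    class BlockPool {
        alignas(std::max_align_t) std::uint8_t storage_[BlockSize * BlockCount];
        void* free_list_ = nullptr;
    public:
        BlockPool() {
            // Thread every block onto a singly linked free list.
            for (std::size_t i = 0; i < BlockCount; ++i) {
                void* block = storage_ + i * BlockSize;
                *static_cast<void**>(block) = free_list_;
                free_list_ = block;
            }
        }
        void* allocate() {                  // pop a block, or nullptr if the pool is exhausted
            if (!free_list_) return nullptr;
            void* block = free_list_;
            free_list_ = *static_cast<void**>(block);
            return block;
        }
        void release(void* block) {         // push the block back onto the free list
            *static_cast<void**>(block) = free_list_;
            free_list_ = block;
        }
    };

    BlockPool<64, 128> messagePool;         // e.g. 128 message buffers of 64 bytes each

    int main() {
        void* buf = messagePool.allocate(); // bounded, predictable cost
        messagePool.release(buf);           // returned by the programmer, not by a collector
        return 0;
    }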

Garbage collection is a great idea. C or C++ can certainly use it when programming a general purpose processor that runs an operating system. But it is not for embedded systems.

Wednesday, August 12, 2009

Will technological growth lead to job crisis?

Gregory Clark has written an article in The Washington Post claiming that in the future we will not have jobs, because people will be replaced by machines.
I recently carried out a complicated phone transaction with United Airlines but never once spoke to a human; my mechanical interlocutor seemed no less capable than the Indian call-center operatives it replaced. Outsourcing to India and China may be only a brief historical interlude before the great outsourcing yet to come -- to machines. And as machines expand their domain, basic wages could easily fall so low that families cannot support themselves without public assistance.
Interesting. But this is not the first time I have heard this. As an engineer, I have been on the receiving end of this blame on several occasions, when I met old bank employees and typewriter mechanics who were gripped by the fear of losing their jobs when computers were introduced.

So does innovation lead to crisis? If so, why were we not happy being cavemen, hunting rabbits and being hunted by saber-tooths? Is the evolution of mankind as an intellectual animal actually a negative thing?

I disagree with Gregory Clark, and this is my perspective. There is a strange feature of technological growth: it never stops. In a competitive environment, people always want to make their product better in the market and faster to reach the market. The only way of doing that is through technology. So every technical innovation has a lifetime, which ends when something better is invented.

When cavemen invented the wheel, many people who used to push big blocks of stone might have lost their jobs. But their kids might have got jobs as wheel-makers. The important part is that even the son of the jobless caveman should have access to the school of wheel-making. Once that is assured, the cycle can continue endlessly. So unskilled workers just need enough pay to keep themselves going and to educate their kids into skilled workers. Government intervention must ensure this happens, through means like minimum wage policy, inflation control, education subsidies, and so on.

Tuesday, August 04, 2009

Multicore Processor Simulator

Back in 2005, I listened to a recording of Intel President Paul Otellini at the Intel Developer Forum. Describing Intel's future direction, he said:
We are designing all of our future development to multicore designs. We believe this is a key inflection point for the industry.
Following the diminishing returns from instruction-level parallelism in uniprocessors, the computer architecture world decided that multicore processors and chip multiprocessors are the direction of the future.

I knew the importance of multicore processors even before they became famous in general purpose computing: part of my undergraduate research thesis involved implementing digital beamforming on a quad-core SHARC processor. Now it is apparent that multicore processors are here to stay, and whether you like it or not, parallel programming is the future of computing. Web programs already run in parallel, managed by web application servers. Embedded systems programming is rapidly introducing parallelism wherever performance matters. There are still two issues that make parallel programming difficult. One is the availability of debugging tools, especially for the rather peculiar bugs like heisenbugs. Firms are moving towards debuggers that reveal heisenbugs and ease the programming. Although multicore vendors and compiler designers are coming up with parallel-programming debugger extensions to solve this problem, it remains clear, present, and painful at this stage.

The second issue is the lack of simulators for multicore processors. SimpleScalar is certainly an excellent processor simulator, but simulating a chip multiprocessor (CMP) with hundreds of cores is still an open problem for the computer architecture community. Recently, Monchiero et al. of Hewlett-Packard Laboratories came up with an idea for simulating large shared-memory CMPs, published in a recent SIGARCH publication.

The best part of this paper is the simplicity of the underlying idea: translate the thread-level parallelism of the software into core-level parallelism in the simulated CMP. The first step is to use an existing full-system simulator to separate the instruction streams belonging to different threads. Then the instruction flow of each thread is mapped onto a different core of the target CMP. The final step is simulating the synchronization between the cores. The simulator described in this paper can take any multithreaded application running on a conventional system simulator and extend the evaluation to any homogeneous multicore processor. I believe this framework will be used in many CMP simulators in the future.
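Here is a toy sketch of the mapping idea (my own illustration, not the HP framework): take per-thread instruction traces of the kind a full-system simulator would emit, pin each trace to one simulated core, and advance the cores in lockstep so that cross-core synchronization could be modelled at every step.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // One instruction trace per software thread, as a full-system simulator might emit.
    using Trace = std::vector<std::string>;

    int main() {
        std::vector<Trace> threads = {
            {"load r1", "add r1,r2", "store r1"},
            {"load r3", "mul r3,r4"},
        };

        // Map thread i onto core i of the simulated CMP (one core per thread here).
        std::vector<std::size_t> pc(threads.size(), 0);   // per-core "program counter"
        bool work_left = true;
        for (std::size_t cycle = 0; work_left; ++cycle) {
            work_left = false;
            for (std::size_t core = 0; core < threads.size(); ++core) {
                if (pc[core] < threads[core].size()) {
                    std::cout << "cycle " << cycle << " core " << core
                              << ": " << threads[core][pc[core]++] << "\n";
                    work_left = true;
                }
            }
            // A real simulator would model the synchronization between cores here
            // (locks, barriers, memory ordering) rather than simple lockstep stepping.
        }
        return 0;
    }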

UPDATED ON 02/02/2010: This might be a viable multicore processor simulator.

Saturday, August 01, 2009

Uranium Ore sold online!

Check out what Amazon is selling: a can of uranium ore! Also look at what customers bought after viewing this item, and what customers who bought it also bought. Very funny!

Friday, July 31, 2009

Bokode - the barcode killer and much more

At the SIGGRAPH 2009 conference, the MIT Media Lab is going to present a paper titled Bokode: Imperceptible Visual Tags for Camera Based Interaction from a Distance. Here is the abstract of the paper:
We show a new camera based interaction solution where an ordinary camera can detect small optical tags from a relatively large distance. Current optical tags, such as barcodes, must be read within a short range and the codes occupy valuable physical space on products. We present a new low-cost optical design so that the tags can be shrunk to 3mm visible diameter, and unmodified ordinary cameras several meters away can be set up to decode the identity plus the relative distance and angle. The design exploits the bokeh effect of ordinary cameras lenses, which maps rays exiting from an out of focus scene point into a disk like blur on the camera sensor. This bokeh-code or Bokode is a barcode design with a simple lenslet over the pattern. We show that an off-the-shelf camera can capture Bokode features of 2.5 microns from a distance of over 4 meters. We use intelligent binary coding to estimate the relative distance and angle to the camera, and show potential for applications in augmented reality and motion capture. We analyze the constraints and performance of the optical system, and discuss several plausible application scenarios.
I have not read the full paper, but those interested can find it here (5.5 MB), and here is the news release. This paper might change the way the future looks. Bokode is a kind of barcode design that can be detected by an ordinary camera. The most obvious use is in the retail industry, in place of the barcode. Bokode has several advantages that the ordinary barcode lacks: it can be read by an out-of-focus cellphone camera, and it has uses in machine vision - for instance, identifying the locations, positions, and angles of different objects in a plane. I am not an imaging expert, but I think Bokode may be the starting point for many new imaging devices. MIT has also released some cool sketches of projected future scenarios.

Tuesday, July 28, 2009

ARM-wrestling with Intel

The ARM Cortex A8 is finally going to run at GHz speeds, delivering more than 2000 MIPS. So your netbooks and iPhones may just get faster. If your response is that Intel's Atom is already beyond the GHz mark, here is the best part of the news: the Cortex A8 does all this while consuming just 640 mW and can run from a supply as low as 1 volt. Currently the iPhone 3GS runs a Cortex A8 at 600 MHz. Both Intel and ARM know that netbooks and smartphones are the computers of tomorrow, as the PC was back in the eighties.

So both companies are gearing up from opposite directions to capture this market. Intel's x86-based Atom runs at 2 GHz, but the problem is that it is like one of GM's gas-guzzlers: people will not go for a PDA or netbook that drains its battery at a fast rate. Intel has the speed; the problem is the power consumption, which it is working on. It has already announced Medfield, a 32 nm Atom slated to hit the market in 2010. A smaller chip with lower power consumption - the best fit to compete with ARM. CNET already reports Intel's Medfield as the smartphone chip of 2011. The figure (courtesy: Intel/CNET) shows Intel's strategy.

As far as ARM is concerned, market presence is its huge advantage: almost all the latest handheld gadgets have ARM inside. ARM developers have more experience in embedded systems and so are well placed to build low-power processors; their current task is speeding the processor up to meet the x86 standard. Both the Cortex A8 and Atom are superscalar designs. I think both of them use AMBA interconnects. Starting from ARMv5TE (introduced in 1999), ARM has had a DSP instruction set extension, which Atom also has. But the similarities end there; the Cortex architecture is strikingly different from x86. This fall, Texas Instruments is going to sample the OMAP4, which has two Cortex A9 cores running in parallel where Atom offers a single core. There are already plans for a quad-core Cortex A9 (see figure, courtesy ARM/CNET), which would certainly pose stiffer competition to Medfield.

Friday, July 24, 2009

Recession and the broken Okun's law

In economics, Okun's law describes the relationship between unemployment and the GDP gap. It states that for every 1% increase in unemployment, GDP falls by roughly 2% of potential GDP. This is an empirical rule based on observation rather than theory. Conversely, it seems reasonable to assume that for every 2% reduction in the GDP gap, a 1% reduction in unemployment can be expected.
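In its usual "gap" form (with the coefficient of roughly 2 quoted above), Okun's law reads:

\[ \frac{\bar{Y} - Y}{\bar{Y}} \;\approx\; c\,(u - \bar{u}), \qquad c \approx 2, \]

where Y is actual output, \bar{Y} is potential output, u is the unemployment rate and \bar{u} is its natural rate.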

But the problem is that Okun's law does not always hold, and it breaks down especially in recessions. As a result, we may get a GDP recovery that ends the recession and even restarts growth, while the unemployment rate stays stuck on the verge of 10%. It is bad news, but unfortunately some of the liberal economists I respect (Bradford DeLong, and Nouriel Roubini a.k.a. Dr. Doom) suggest exactly that.

UPDATED ON 1st AUGUST: Paul Krugman predicts a jobless recovery based on the recent GDP figures.
UPDATED FOR CORRECTION: Actually, Paul Krugman says Okun's law is behaving itself for now. My misunderstanding. But the question is, will it keep behaving during the GDP recovery? I don't think so.

Thursday, July 23, 2009

Standing on Flights

Those who have been to India might have noticed that people are allowed to stand in buses while travelling. I have noticed the same in New York subway trains. What about standing on short flights? A recent survey by Ryanair suggests that 60% of passengers would do so if the ticket were free, and 42% would do so for half-price tickets. The airline is also considering replacing normal seats with the kind of vertical seats found on amusement-park roller coasters. Soon we can expect IT employees going onsite to stand all the way, since standing tickets will come cheap.

Weighing Scale for Molecules

A group of physicists headed by Dr. Michael Roukes has developed a nanoelectromechanical system (NEMS) based mass spectrometer, which can measure the mass of things as small as a single molecule. The professor and his group have been working on this for the past 10 years at Caltech's Kavli Nanoscience Institute.

In layman's terms, it is a NEMS resonator that oscillates at a particular frequency. You drop a molecule on it, and because of the mass added by that molecule, the resonant frequency of the resonator changes. The change in frequency is mapped to the mass of the molecule. As you might imagine, the frequency shift also depends on where on the resonator the molecule lands. To deal with this, the molecule has to be dropped several times, and averaging the measurements gives the mass. Here is the complete press release. [Image Credit: Caltech/Akshay Naik, Selim Hanay]
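In the simplest picture (a textbook harmonic-oscillator approximation, not the group's actual calibration), the resonant frequency and the shift caused by an added mass are related by:

\[ f_{0} = \frac{1}{2\pi}\sqrt{\frac{k}{m_{\mathrm{eff}}}}, \qquad \frac{\Delta f}{f_{0}} \;\approx\; -\frac{\Delta m}{2\,m_{\mathrm{eff}}}, \]

so a tiny added mass produces a small but measurable downward shift in frequency; the landing-position dependence enters through how strongly the added mass couples to the vibrating mode.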

Now a curious question. Do they have a nanometric tuning fork to measure the change in standing wave frequency of this nanometric resonator?

Tuesday, July 21, 2009

Linux boots in one second

MontaVista has recently set a record for embedded Linux boot time: they booted it in one second! This achievement was made on a Freescale MPC5121E RISC processor. Have a look at the demo from the Freescale Technology Forum. This is undeniably a great achievement.
The application requirements demanded visual feedback of critical real-time data in one second or less from cold power-on. These performance improvements were achieved through a combination of careful tuning of the entire software stack and a highly optimized kernel.
It would be great if the MontaVista team published a whitepaper on what they did to achieve this. The usual techniques for speeding up boot include kernel execute-in-place (XIP), which executes the kernel directly from flash memory, and copying the kernel using DMA. Often, just increasing decompression speed with a fast decompressor like UCL does the job. It would not be a surprise if MontaVista used all of these together, along with something special. Let's wait and hope they throw more light on their careful tuning and kernel optimization.

Monday, July 20, 2009

India Microprocessor - Sensibility with no Sense

The top scientists in India are going to convene for a project to build an India Microprocessor. Network security has become a hot topic in many countries' security circles, thanks to Chinese hackers, who have broken into several governmental networks over the past few years. They managed to take down a Russian consulate website and made nearly 70,000 attempts in a single day to penetrate the NYPD network.
“History has shown that the need for defence[sic] security has sparked a chip industry in most nations,” she [Poornima Shenoy, the president of Indian Semiconductor Association] said.
Unlike the US and China, India still does not have chip-making technology, and Zerone seeks to change that.
Ms. Shenoy is right about the history. Not just the microprocessor - almost everything we use now, from the Internet to the mobile phone, came out of the need for defense security. India is fundamentally afraid that it might be denied microprocessor technology at some point. India has not produced any evidence for that suspicion; it might be confidential. Are we going to develop our own mobile phones by the same logic?

The entire story would make sense if this were some nascent technology, like nanotechnology-based carbon chips or biologically inspired IC manufacturing. Microprocessor technology has been in place for the past three to four decades. The SPARC RISC architecture that they plan to follow came out in 1986. I am not taking anything away from the SPARC architecture; the point is simply that it is not what can be called cutting-edge. This raises a lot of questions.

Why should a country suddenly decide to invest in developing a new processor? It could instead encourage companies in India to make them. What happens when the processor technology changes or the processor proves inefficient on SPEC benchmarks? Those who know history will be aware that processor technology has many failure stories. Is India going to build its own fab to fabricate this processor? Building a fab just to fabricate a single kind of chip is a big waste of money; it may take up to 40 years to recover the cost. What if the processor falls behind over time? Is India planning to build subsequent versions? That would mean keeping a permanent research team on the payroll - convening a set of passionate designers is different from keeping them together permanently. In 1994, Intel's Pentium processor had a small bug in its FPU's floating-point division. Intel incurred a huge loss fixing the bug and replacing the processors it had sold. Would India Inc. do the same?

This argument may sound like questioning the country's ability to make microprocessors. I do not think all government-run science projects are inefficient. I rather feel the government should channel its effort into exploring new cryptographic algorithms, if it wants to improve network security, and into socially progressive technologies like clean energy. Building general purpose microprocessors in India is better left to private firms, with the government simply providing the necessary facilities. And Intel did develop the first made-in-India microprocessor.
“Unless India has its own microprocessor, we can never ensure that networks (that require microprocessors) such as telecom, Army WAN, and microprocessors used in BARC, ISRO, in aircraft such as Tejas, battle tanks and radars are not compromised,” the document points out.
The entire argument is about India not investing directly in a general-purpose processor - the kind we use in PCs and game consoles. I think aircraft like the Tejas, battle tanks, and radars would be using application-specific embedded processors, microcontrollers, and digital signal processors. I do not think, and would not prescribe, that general purpose microprocessors be used in battle tanks. Making those chips indigenously is a completely different ball game, and it would make sense for India to invest in that directly.

India putting money directly into a general-purpose processor might make bold headlines, and TV channels might insist that Indian citizens should feel proud of the achievement. What may be the most sensible news for the media makes no sense to me as an engineer.

Thursday, July 16, 2009

Apple, is thy pseudonym Stimulus?

A little strange, but good news! Integrated circuit sales rose worldwide by 16% last quarter - the biggest quarter-on-quarter growth since the second quarter of 1984. In particular, DSP unit shipments increased by 40% QoQ. One probable reason I can think of: in the last quarter both the Apple iPhone 3GS and the Palm Pre were introduced, and with their reduced price lines both have been selling like hot cakes. In hindsight, the Apple Macintosh was introduced in January 1984, and it too sold like hot cakes - a probable reason for the previous growth record. Apple, is thy pseudonym Stimulus?

Wednesday, July 15, 2009

Intel x86 Processors – CISC or RISC? Or both??

The argument between CISC and RISC architectures is longstanding. For compiler designers, RISC is a bit of a burden, since the same C code translates to noticeably more lines of RISC assembly than of x86 assembly. But from a purely academic point of view, it is easy to see that RISC wins the argument because of several advantages. A RISC instruction set is small, so it is easy to optimize the hardware for it. Simple instructions that complete in a single clock cycle are a typical RISC characteristic and permit aggressive pipelining. RISC invests more area in registers (some architectures use register windowing), allowing easier out-of-order execution. OoO and pipelining are possible in CISC too, but they are clumsier.

One reason RISC could not win despite all these advantages is Intel. Microsoft is another major reason: during the PC revolution, Windows 95 had no support for RISC processors. Intel, with its CISC-based x86 architecture, blocked all the avenues into general purpose computing for RISC processors. RISC has a strong presence in embedded processing, however, because of its low power, real-time behavior, and small area.

Two years ago I tried to investigate why Intel did not change its x86 core to a RISC. The findings were fascinating, but at the time I did not get around to writing them down in a post like this. Better late than never. After its success with CISC-based CPUs, Intel entered the RISC zone in 1990 with the introduction of the i960. The i960, however, mainly targeted the embedded systems domain rather than general purpose computers, understandably due to the lack of software support.

In the general computing domain, the Intel Pentium employed two pipelines for its IA-32 instructions. The presence of variable-length instructions forced an inherently sequential decode, because every cycle involved first identifying the length of the current instruction: a new instruction can begin anywhere within the bytes the processor fetches. As the world moved towards parallel execution, the only advantage CISC still enjoyed was software support, and that might not last.

Sometimes, just when you think you know where things are heading, a groundbreaking invention changes the entire scenario. One such seminal invention was the high performance substrate (HPS), introduced by the famous microarchitecture guru Yale Patt. Although I am tempted to explain HPS in detail, I will consider it out of the scope of this post. A very simple (not necessarily accurate) description is that Patt showed how to convert a CISC instruction into multiple RISC-like instructions, or micro-ops.
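As a rough illustration of what a micro-op decomposition looks like (my own toy model, not Intel's or Patt's actual encoding): a single memory-to-register CISC instruction such as ADD [addr], reg breaks into a load, a register-to-register add, and a store, each of which resembles a RISC instruction.

    #include <iostream>
    #include <string>
    #include <vector>

    // Toy model: decompose a CISC-style "ADD [mem], reg" into RISC-like micro-ops.
    struct MicroOp { std::string op, dst, src; };

    std::vector<MicroOp> decodeAddMemReg(const std::string& mem, const std::string& reg) {
        return {
            {"LOAD",  "tmp", mem},    // micro-op 1: bring the memory operand into a temporary register
            {"ADD",   "tmp", reg},    // micro-op 2: plain register-to-register add
            {"STORE", mem,   "tmp"},  // micro-op 3: write the result back to memory
        };
    }

    int main() {
        for (const auto& u : decodeAddMemReg("[0x1000]", "eax"))
            std::cout << u.op << " " << u.dst << ", " << u.src << "\n";
        return 0;
    }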

Intel demonstrated its quick reflexes by implementing this in its P6 architecture. Like any successful, innovative company, Intel is good at riding a new wave: it did so by jumping from its memory business to microprocessors back in the eighties, and it did so again with micro-op decoding. Intel's first IA-32-to-micro-op decoder featured in the Pentium Pro. The P6 architecture contained three parallel decoders that simultaneously decode CISC instructions into micro-ops, resulting in deeply pipelined execution (see figure). This decoding hardware can become extremely complex, but as feature sizes shrank at a very fast rate, Intel did not face any significant performance issue with the approach.

Now we are in the post-RISC era, where processors have the advantages of both RISC and CISC. The gap between the two has blurred significantly, thanks to the scale of integration possible today and the increased importance of parallelism; trying to jot down the differences is no longer very meaningful. Intel's Core 2 Duo can execute more than one CISC instruction per clock cycle, and those instructions are pipelined. On the other hand, RISC instructions are also becoming more complex (CISC-like) to take advantage of the available processing speed, and RISC processors use complicated hardware for superscalar execution. So at present, classifying a processor as RISC or CISC is almost impossible, because under the hood they look so similar.

Intel stayed with CISC even as the rest of the world moved towards RISC, and it enjoyed the advantage of software support. When the situation started favoring RISC with the advent of parallel processing, Intel used micro-op conversion to exploit the pipelining advantages of RISC. Current Intel processors have a highly advanced micro-op generator and intricate hardware to execute complex instructions in a single cycle - a powerful CISC-RISC combination.

Monday, July 13, 2009

Atmel's battery authentication IC - a reality check

Atmel has introduced a cryptographic battery authentication IC, the AT88SA100S, in an attempt to curb the market for counterfeit batteries, which cause all sorts of problems that tarnish the brand value of the original equipment manufacturer (OEM). The idea is essentially to digitally sign the battery: the OEM gets a signature that the counterfeiters cannot forge.
The AT88SA100S CrytpoAuthentication™ IC is the only battery authentication IC that uses a SHA-256 cryptographic engine...
SHA-256! An excellent hashing algorithm, developed and recommended by the NSA itself. I don't think there are many commercial hardware implementations of SHA-256. SHA-2 style hashing like SHA-256 requires many more registers and gates than SHA-1 implementations. As a result, the die size and the critical path grow, and the operating frequency suffers. Frequency specifications are not given in this press release. But here comes the most important claim:
...a SHA-256 cryptographic engine and a 256-bit key that cannot be cracked using brute force methods.
Now that's interesting. I am not a professional cryptanalyst or a professor of mathematics. But what I know is that any N-bit hash function can be cracked by brute force with at most 2^N trials - in this case 2^256 trials. A collision attack can be done in about 2^(N/2) trials - in this case 2^128 trials. For a 256-bit hash, a 50% probability of a random collision can be reached through a birthday attack with roughly 4 x 10^38 attempts. These are huge numbers of trials that may take years of computational time, but you still cannot categorically claim that brute force is impossible. Maybe in their implementation brute force is simply not allowed - something like waiting for three unsuccessful attempts and then self-destructing. Such a scheme would not fly: I don't want my iPhone bricked just because I tried putting in a phony battery. I need more detail.
The 256-bit key is stored in the on-chip SRAM at the battery manufacturer’s site and is powered by the battery pack itself. Physical attacks to retrieve the key are very difficult to effect because removing the CryptoAuthentication chip from the battery erases the SRAM memory, rendering the chip useless.

Challenge/response Authentication. Battery authentication is based on a "challenge/response" protocol between the microcontroller in the portable end-product (host) and the CrytpoAuthentication IC in the battery (client).
The first point makes a lot of sense: the key is stored in SRAM powered by the battery itself, so if you pull the chip, the CMOS SRAM cells lose power and thus the key. The second point is that it uses challenge/response authentication. It is a bit like UNIX password authentication - the user supplies a password, it is hashed, and the hash is compared with the hash stored on the server. In this case, I think the battery would supply its hash to the device, and the device must have a table of possible battery manufacturer IDs and their hashes. How secure is that table?
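To make the protocol concrete, here is a toy sketch of one challenge/response round (my own illustration: std::hash stands in for SHA-256 and is emphatically not cryptographically secure, and every name in it is hypothetical):

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <random>
    #include <string>

    // Stand-in for SHA-256(key || challenge). The real device runs the actual hash in silicon.
    std::uint64_t toyDigest(const std::string& key, std::uint64_t challenge) {
        return std::hash<std::string>{}(key + std::to_string(challenge));
    }

    int main() {
        const std::string sharedKey = "256-bit-secret";   // provisioned into host and battery IC

        // Host side: generate a fresh random challenge for every authentication attempt.
        std::mt19937_64 rng(std::random_device{}());
        std::uint64_t challenge = rng();

        // Client (battery) side: respond with a digest over the secret key and the challenge.
        std::uint64_t response = toyDigest(sharedKey, challenge);

        // Host side: recompute the expected digest with its own copy of the key and compare.
        bool authentic = (response == toyDigest(sharedKey, challenge));
        std::cout << (authentic ? "battery accepted" : "battery rejected") << "\n";
        return 0;
    }

In Atmel's scheme the verification presumably happens against whatever key material the host stores, which is exactly the "how secure is that table" question above.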

The security of a system lies in its overall implementation rather than in the strength of the cryptographic algorithm it uses for communication. The algorithm is just a part of it, not all of it.

Overall, Atmel has done a good job. Soon we can expect electronic devices to accept only authentic batteries that do not leak and spoil the device itself. Soon we can say auf Wiedersehen to counterfeit batteries and their makers.

Mukherjee at Minsky moment - a clarification

A few friends reacted caustically over the phone to my previous post, blaming me for not understanding the importance of reducing the fiscal deficit. One of them is a professional economist working at a small south Indian university.

Let me clarify my position. In normal times, when banks are not closing at a fast rate and costing jobs all around the world, I am entirely for reducing the fiscal deficit and, if required, even building up a fiscal surplus. But during the current crisis I would advise against fiscal retrenchment. The problem in India is not the fiscal deficit as such, but its distribution and how it is funded. Here is the take of Roubini's Global EconoMonitor on India's situation:
However, the annual growth rate for Community, Social and personal services has remarkably increased to 13.1% in 2008-09 as compared to 6.8% in 2007-08 reflecting the impact of increased expenditures by the Government through financing schemes like NREGS. It is important to notice that such expenditures have not only increased the fiscal deficit beyond the estimated budget for 2009-10, but only 9% of the Indian workforce engaged in Community, Social, and Personal services expected to be benefited through it. Thus the excess flow of subsidized bank credits to GoI for financing the budget deficit is ultimately restraining the economic growth.
Herbert Hoover tried fiscal retrenchment during a downturn, and that was one of the prime reasons for the Great Depression. FDR's fiscal retrenchment in 1937 brought on a double-dip depression. Both times, many economists and Wall Street welcomed the move; in hindsight it turned out to be a terribly bad choice. Here is the President's chief economic advisor talking about the lessons of the Great Depression. I am just happy that India is not repeating that mistake.

Sunday, July 12, 2009

Mukherjee at Minsky Moment

In India, Mr. Pranab Mukherjee presented the first budget of his term on July 6th. The budget speech was criticized heavily by economic luminaries, and the Sensex responded by falling 870 points. Pranab Mukherjee has gone on a spending spree on infrastructure and on building a social security safety net. Dr. Jayaprakash Narayan called it a lackluster budget, with the fiscal deficit crossing Rs. 10 lakh crore. While I agree with Dr. Narayan's point about the reduced allocation to healthcare, I beg to disagree with his view on the country's deficit control.

I accept that India is carrying a huge debt. But let us not blame Mr. Mukherjee for this; we have been running huge fiscal deficits for the past several years. We are currently in the middle of a great recession, and this is not the right time for deficit control. Mukherjee did the right thing - a Keynesian economist's thing - by increasing fiscal spending to soften the impact of the downturn. The fiscal deficit is something we have to control in the long run; in the middle of a Minsky moment, Pranab Mukherjee has done a decent job by not minding it this time.

However, once the recession is over, the finance minister (whoever it is at that time) must make sure to reduce the fiscal deficit in the same Keynesian spirit in which spending was increased this year. Never mind the reaction of the Sensex; never mind Narayan's comment. You have done a good job at a bad time, Mr. Mukherjee! Fiscal deficit control? Previous finance ministers should have done it. Future finance ministers should do it. Not this finance minister.

Wednesday, July 08, 2009

Connection Machines – Prelude to Parallel Processing

Computer architecture entered a new phase with the stored-program concept and a programmable, general-purpose computing architecture. The credit for this development goes to John von Neumann, Grace Hopper, and Howard Aiken. Later it was relatively easy to build newer computers and, eventually, microprocessors, since the general computing architecture was well established.

However, there was a problem with this primordial architecture. Unlike human intelligence, it relied heavily on a single powerful processor that operated on the stored program in sequential order. The first computer to depart from this view and behave more like a human brain was the Connection Machine (CM).

In the early eighties, Danny Hillis, a graduate student in the MIT Artificial Intelligence Lab, designed a highly parallel supercomputer that incorporated 65,536 processors. The design was commercially manufactured under the name CM-1 by Thinking Machines Corporation (TMC), a company Danny Hillis created. The thousands of processors that formed CM-1 were extremely simple one-bit processors connected together in a complex hypercube network. The routing mechanism between the processors in CM-1 was designed by Nobel Laureate Richard Feynman himself. In 1985, CM-1 was a dream SIMD machine for labs working on artificial intelligence (AI). But it had some practical problems. First, it was too expensive a machine for budding AI labs to purchase. Second, it did not have a FORTRAN compiler, FORTRAN being the most popular programming language among scientists at that time. Third, it had no floating-point hardware, a must for scientific analysis. So CM-1, although a parallel-processing marvel, was too immature to face the market. Learning from the mistakes made in the design of CM-1, Thinking Machines released CM-2, which had floating-point processors and a FORTRAN compiler. But still it did not fly. Evidently Danny Hillis was making a machine for a future that the present had no use for.

In the early nineties, Thinking Machines introduced CM-5, which featured in the control station of Steven Spielberg’s Jurassic Park. It is considered not only a technological marvel but also a totally sexy supercomputer (see figure). Instead of simple one-bit processors, CM-5 had clusters of powerful SPARC processors. TMC also moved away from the hypercube concept and built the data network as a binary fat tree. CM-5 is a synchronized MIMD machine combined with some of the best aspects of SIMD. The system could support up to 16,384 SPARC processors. The processing nodes and data networks interacted through 1-micron standard-cell CMOS network interfaces with clock synchronization. The raw bandwidth for each processing node was 40 MBps, but in a fat tree, as you go up the levels, the cumulative bandwidth can reach several GBps (if all this reminds you of Beowulf, you are not alone!). TMC guaranteed that CM-5 was completely free of the fetch-deadlock problem that occurs between multiple processors (using this).

Although it looks like a great architecture, from a purely technical standpoint it is evident that TMC toned down its idea of a completely parallel machine, simply because in the eighties it did not sell, partly due to the lack of market readiness and partly due to some flaws in the design. Secondly, the failure of the earlier CM series took a toll on TMC’s strategy. When CM-5 was introduced, the future of the company depended on sales of that supercomputer. Los Alamos National Laboratory bought one. The Jurassic Park set bought one. I cannot think of any other major customer. If CM-5 had survived, it would have had to fight with the likes of the Intel Paragon and Beowulf clusters for market space.

After the cold war, DARPA cut down its funding for high-performance computing, which fell as a final blow on Thinking Machines. One fine morning in 1994, TMC filed for Chapter 11 bankruptcy protection. The Inc. gives an alternative explanation for the failure of Thinking Machines. It is a good read, but I cannot buy its opinion about Danny Hillis. Perhaps The Inc. should restrict itself to its primary aim of advising budding entrepreneurs and refrain from measuring scientific minds. Nobody can deny that Danny Hillis was a genius; the problem was that he was an out-of-this-world freak who could not become a good businessman. Currently he is working on an ambitious project to build a monumental mechanical clock that would run for multiple millennia.

The path that Thinking Machines took was certainly not the way to build a successful enterprise. But the architecture it introduced in the early eighties was truly a stroke of genius that every computer architect must study and understand.

Tuesday, July 07, 2009

Robot Democracy

The previous blogpost has stimulated some of the philosophical gyri of my brain.

After long, inhuman experimentation, we finally figured out that democracy is the best form of government for humans. But we do not extend democratic rights to other organisms like cats and dogs, or to the machines working in the assembly lines of Ford Motors. That is not surprising, because humans are far superior to other organisms and machines in terms of intelligence, general awareness, creativity, and common sense.

Now assume that, thousands of years from now in a science-fictional world, robots also become more and more intelligent - not just in terms of computational power, but also in terms of general awareness, creativity, and so on. They become smart enough to understand that there is no such thing as free lubricant oil, and so take up roles as professors and surgeons to establish themselves in society. In such a world, will this civilized society extend democratic rights to artificial intelligence? Or will these robots still have to toil under despotism? If a humanoid becomes visibly smarter than the dumbest human with democratic rights, could that humanoid be considered for promotion on the basis of its artificial intelligence? The answers to these absolutely crazy questions would determine whether there would be robotic terrorism and a terminator-style man-vs-machine war in the future.

So why shouldn't I sit down and write a science fiction story about a robot that becomes a lawyer and fights for robot rights in a Gandhian way? In every science fiction story, robots fight humans to control the world. This time, let one fight for its rights and free will through Ahimsa.

Future belongs to carbon based lifeforms

Many science fiction novels and movies are based on either robots ruling mankind in the distant future or warfare between humans and robots. Going by recent technological advancements, one thing seems as clear as a beacon: the future world is going to be ruled by carbon-based lifeforms, in one form or the other.

Sunday, July 05, 2009

Hardware / Software Partitioning Decision

The most important part of a hardware/software partitioning scheme is determining which parts of the software need to be moved to the FPGA. The problem becomes complex in a system with multiple applications running at a time. Kalavade et al. have given a set of thumb rules to decide whether a given node can be moved to hardware or not. Here they are (a small scoring sketch follows the list):
  • Repetition of a node: how many times does a given type of node occur across all applications? The higher this number, the better it is to implement the node in hardware.
  • Performance-area ratio of a node: what is the performance gain if the node is implemented in hardware, relative to the area penalty of that implementation? The higher the ratio, the better the node is suited for hardware.
  • Urgency of the node: how many times does the given node appear in the critical path of the applications? The higher this number, the greater the overall performance gain if the node is moved to hardware.
  • Concurrency of the node: how many concurrent instances of the given node can potentially run at a time (on average)? Hardware is always good at doing things in parallel.
By weighing these four factors, we can shortlist the nodes that qualify for hardware implementation.
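To make the thumb rules concrete, here is a minimal scoring sketch in Python. The Node fields mirror the four factors above, but the weights, node names, and numbers are hypothetical illustrations of mine, not values from Kalavade et al.

# A minimal scoring sketch for the four thumb rules above.  The weights,
# node names, and numbers are hypothetical illustrations, not values from
# the original paper.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    repetition: int          # occurrences of this node type across all applications
    perf_area_ratio: float   # speedup gained per unit of FPGA area spent
    urgency: int             # appearances on application critical paths
    concurrency: float       # average number of concurrent instances

def hardware_score(n: Node, w=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine the four thumb rules into a single figure of merit."""
    return (w[0] * n.repetition +
            w[1] * n.perf_area_ratio +
            w[2] * n.urgency +
            w[3] * n.concurrency)

nodes = [
    Node("fft",    repetition=6, perf_area_ratio=4.0, urgency=5, concurrency=3.0),
    Node("parser", repetition=2, perf_area_ratio=0.8, urgency=1, concurrency=1.0),
]

# The highest-scoring nodes are the first candidates for hardware implementation.
for n in sorted(nodes, key=hardware_score, reverse=True):
    print(f"{n.name}: score = {hardware_score(n):.2f}")

In practice the weights would come from profiling data, and the shortlist would still have to be pruned against the available FPGA area.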

Saturday, July 04, 2009

Numeric definite integral in FPGA

Today, when I looked at the comp.arch.fpga USENET feed, I saw an interesting question: how do you perform integration in an FPGA? Most of the time the answer would be that, since an FPGA does not handle continuous functions, summation is the approximation for integration. At first glance this may look like the right answer, but when we look at the purpose of integration, we find that it is too crude an approximation.

The purpose of integration is usually to find the area under a curve. To evaluate that numerically, we go for adaptive quadrature algorithms such as adaptive Simpson's quadrature, Lobatto quadrature, Gauss-Kronrod quadrature, etc. If you are using MATLAB, you should be familiar with the quad functions, which implement these adaptive quadrature algorithms.

In an FPGA, however, we never get the input as a function; we get the outputs of a function. For example, you would never receive something like f(x) = sin(x) + 20 for all 0 <= x <= pi. Instead you receive the value of f(x) for each discrete value of x. In that case, the integral can be approximated, for each pair of adjacent samples, as the area of the triangle formed between the two sample values plus the area of the rectangle formed by the lower of the two values and the x-axis (y = 0). This has to be done for each value of x (usually one sample per CLK) and accumulated. Once we have the accumulated result, we multiply it by the x-axis spacing between samples. Usually in an FPGA the x-axis is CLK, so we use the CLK period as the scale.

So at each CLK (somebody please tell me how to use LaTeX with Blogger):
int(n) = int(n-1) + (diff(f(n), f(n-1)) >> 1) + min(f(n), f(n-1))
The function is not as difficult to implement as it looks. The term (diff(f(n), f(n-1)) >> 1) gives the area of the triangle and min(f(n), f(n-1)) gives the area of the rectangle; "int" is the integral, i.e. the area under the curve up to that point. The formula works as-is if the CLK frequency is 1 Hz; for other frequencies, the final result just needs to be multiplied by the CLK period.

For a monotonically non-decreasing function [f(n) >= f(n-1) for 0 <= n <= N], the formula reduces to:
int(n) = int(n-1) + ((f(n) - f(n-1)) >> 1) + f(n-1)
This works neatly, but it is up to the implementer to decide whether a plain summation or the area under the curve is what they actually need.
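As a sanity check, here is a small software model of the per-clock accumulator, written in Python rather than HDL. It assumes unsigned integer samples, the right shift truncates odd differences exactly as the hardware shift would, and the example input is my own.

# Software model of the per-clock trapezoidal accumulator described above.
# In hardware this is just a register, an adder, a subtractor/comparator and
# a wired shift; the >> 1 truncates odd differences, as it would on chip.

def integrate_stream(samples, clk_period=1.0):
    acc = 0
    prev = samples[0]
    for cur in samples[1:]:
        diff = abs(cur - prev)
        acc += (diff >> 1) + min(cur, prev)   # triangle + rectangle per interval
        prev = cur
    return acc * clk_period                   # scale by the sample spacing at the end

# Example: f(x) = 2x sampled once per clock for x = 0..10.  The differences are
# even, so the shift loses nothing and the result matches the exact area.
samples = [2 * x for x in range(11)]
print(integrate_stream(samples))              # 100.0, the exact area of the ramp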

Friday, July 03, 2009

USB 3.0 – knowns and unknowns

One of the biggest pieces of technology news that I keep seeing in journals is the advent of USB 3.0. It is not quite iPhone-level hype, but certainly all the major news feeds I subscribe to cover it. Linux has come up with a driver for USB 3.0. Windows 7 will have USB 3.0 support. All of this could help you in 2010 with high-speed transfers of HD videos and large databases.

One year ago, when some device drivers on my Debian installation got accidentally messed up, I sat down to write the parallel port and USB device drivers myself. The parallel port driver worked great, but while I was in the file_operations read part of the USB driver, a power fluctuation at home crashed my computer. Still, the exercise gave me a chance to understand the USB architecture.

When the USB 3.0 specification was released, I wanted to know how they could claim it to be backward compatible. It turned out to be very simple. USB 3.0 has the same bus structure as USB 2.0, with a USB 3.0 SuperSpeed structure added in parallel. The baseline topology is the same as USB 2.0 - a tiered star topology. No wonder it is backward compatible. In the PHY layer, USB 3.0 has eight wires instead of the four in previous versions: four are the usual USB 2.0 wires, and the other four are two SuperSpeed transmitter wires and two SuperSpeed receiver wires. So at the physical layer, USB 3.0 is essentially USB 2.0 with a faster architecture running alongside it through the entire PHY layer, with its own dedicated spread-spectrum clocking (which is known for reducing EMI). This layer also provides shift-register-based scrambling. When you design and develop peripherals for USB 3.0, I would advise you to turn scrambling off, do your unit testing, and then enable it again (just as in PCIe development); otherwise you would have a tough time validating the results.
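For anyone who has not met shift-register scrambling before, here is a rough Python illustration. The polynomial x^16 + x^5 + x^4 + x^3 + 1 and the all-ones seed are the values commonly quoted for PCIe-style scramblers; the bit ordering and byte handling below are simplified choices of mine and do not follow the exact serialization rules of either specification.

# Simplified illustration of shift-register (LFSR) scrambling of the kind used
# in high-speed serial PHYs.  Polynomial x^16 + x^5 + x^4 + x^3 + 1, seed 0xFFFF;
# the bit ordering here is illustrative only, not the normative spec behaviour.

def lfsr_step(state: int) -> int:
    """Advance the 16-bit Galois LFSR by one bit."""
    msb = (state >> 15) & 1
    state = (state << 1) & 0xFFFF
    if msb:
        state ^= 0x0039        # feedback taps for x^5 + x^4 + x^3 + 1
    return state

def scramble(data: bytes, seed: int = 0xFFFF) -> bytes:
    """XOR each byte with 8 bits of LFSR keystream (free-running, not self-syncing)."""
    state = seed
    out = bytearray()
    for byte in data:
        keystream = 0
        for bit in range(8):
            keystream |= ((state >> 15) & 1) << bit
            state = lfsr_step(state)
        out.append(byte ^ keystream)
    return bytes(out)

# Scrambling is its own inverse as long as both ends start from the same seed,
# which is why you can switch it off for unit tests and re-enable it later.
payload = b"unit-test pattern"
assert scramble(scramble(payload)) == payload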

One of the important features of USB 3.0 is its power-management architecture. Power management is done at three loosely coupled levels: localized link power management, USB device power management, and USB function power management. You can also enable a remote-wake feature and wake up a device remotely.

One thing that is still completely in the clouds is the USB 3.0 host controller chip architecture (a proprietary design). I did not find any document on the Internet about it, and it is a complete mystery how they deliver that SuperSpeed; the USB 3.0 HCI specification does not seem to be completely open yet. From what we know, USB 3.0 looks a lot like a PCI-SIG controller, more specifically the PCI Express 2.0 architecture (5 Gbps) packaged differently, although Intel denies it. Both use the same encoding scheme, shift-register-based scrambling, spread-spectrum clocking, and so on. So if you know the PCIe 2.0 architecture, then you know more than 50% of USB 3.0. Having said all this, a lot of things are still hidden, and many of the melodies are still unheard. As Keats says,
Heard melodies are sweet, but those unheard are sweeter.
Let’s wait with all ears to listen to the unheard, whenever it gets loud enough.

Tuesday, June 30, 2009

Ride on the STBus - a brief look at the different protocols

The System-on-Chip (SoC) communication architecture is a vital component that interconnects heterogeneous components and IP blocks and supplies a mechanism for data and control transfer. It is a significant factor in the performance and power consumption of the chip, especially when feature sizes are in nanometers. As a result, a given architecture must deliver an agreed QoS through arbitration mechanisms. Recently I took some interest in going through the VSIA (Virtual Socket Interface Alliance) standards for intra-chip communication and the closely aligned bus architecture of STMicroelectronics (STBus), to see how it fits in an automobile collision avoidance system (Blogger sucks. I cannot upload a PDF! Ridiculous!!).

The STBus IP connects an initiator and a target. The initiator (master) initiates the communication by sending a service request to the target and waiting for a response. The target receives the service request, validates it, processes it, and sends back the response. There are configuration registers to change the behavior of request and response handling; they allow changing the bus behavior and adjusting the traffic depending on priority, bandwidth, etc.


STBus Block Diagram

STBus can operate with three protocols: Type 1, Type 2, and Type 3. Type 1 is for simple, low-performance access to peripherals. Type 2 is a little more complex, with support for a pipelined architecture. Type 3 extends the support to asynchronous interactions and complex instructions of varied sizes. All of these are implemented using a shared-multiplexer or crossbar-multiplexer based architecture. After going through the characteristics of these protocols, it was easy for me to decide that the Type 1 and Type 2 interfaces suffice for my collision avoidance system.

The Type 1 protocol is the best candidate for general-purpose I/O with very minimal operations. It has a very simple handshake, with each packet containing the type of transaction (request or response), the position of the last cell, the address of the operation, and the related data. All peripherals are required to instantiate these mandatory signals.

The Type 2 protocol has everything in Type 1, with pipelining, source labeling, prioritization, and so on added. Another important feature of Type 2 is that a transaction can be split into two parts: the request and the response. Once the initiator sends the request part, it can carry on with other activities without waiting for the response, since the response behaves like a separate transaction. This property certainly adds to the performance of the system. Pipelined transactions arise as a result of this transaction splitting: a Type 2 initiator can send consecutive transactions as in a pipeline. The most important point to note is that the order of transactions has to be maintained with care. Responses are expected in the same sequence in which the requests were sent. As a result, the initiator-target pairing has to be maintained until all responses are received, and establishing contact with a new target is forbidden while the current target still owes responses to outstanding service requests (a toy model of these ordering rules follows the Type 3 description below).

The Type 3 protocol is a more advanced protocol, prescribed only if both interacting systems are intelligent enough to cope with the complex handshakes between initiator and target. Its two main features are shaped packets and out-of-order transaction management. Shaped packets allow requests and responses of varied sizes to be transferred, which is certainly an improvement in bandwidth usage. Out-of-order transaction management allows transactions to be processed in any order, because in Type 3 every transaction is tagged with a four-byte transaction ID.
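Here is the toy model promised in the Type 2 description: a tiny Python sketch, with class and method names entirely of my own invention (nothing to do with STBus signal names), that enforces the two ordering rules - responses drain in request order, and the initiator cannot pair with a new target while responses are still pending.

# Toy model of the Type 2 ordering rules: in-order responses, and no new
# target pairing while responses are outstanding.  Names are illustrative only.

from collections import deque

class Type2Initiator:
    def __init__(self):
        self.current_target = None
        self.outstanding = deque()          # requests awaiting responses, in order

    def send_request(self, target, tag):
        if self.outstanding and target != self.current_target:
            raise RuntimeError("cannot switch target with responses pending")
        self.current_target = target
        self.outstanding.append(tag)

    def receive_response(self, tag):
        expected = self.outstanding.popleft()
        if tag != expected:
            raise RuntimeError("response arrived out of order")
        if not self.outstanding:
            self.current_target = None      # free to pair with a new target

# Pipelined usage: several requests in flight, responses drained in order.
init = Type2Initiator()
for req in ("rd@0x00", "rd@0x04", "rd@0x08"):
    init.send_request("dsp", req)
for resp in ("rd@0x00", "rd@0x04", "rd@0x08"):
    init.receive_response(resp)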

In the collision avoidance system I was talking about, all the peripherals and external registers can be connected through the Type 1 protocol. Type 2 fits well between my baseband demodulator and the signal processor or FPGA. The baseband demodulator gives me complex baseband signal values for a 64-element array, and for this array I have to perform power spectrum estimation. My transaction size is going to be fixed, and the initiator (the baseband modem) is coupled with the target (DSP/FPGA). Spectral estimation is usually a complex process and may take more time than baseband demodulation, so Type 2 allows me to split the request and the response and pipeline the flow to the processor. I may need to add a buffer at the target end and latency at the initiator end to balance the difference in speed. In Type 2, since I cannot resend a message that has already been delivered, I have to piggyback the acknowledgement to decide whether the initiator can send another asynchronous packet or go into a wait state.

Friday, June 26, 2009

A Quick Tour of the FORst

In an earlier post about VLSI routing, I promised a little wonkish post on the Obstacle-Avoiding Rectilinear Steiner Minimum Tree (OARSMT). This may be it, though still not so wonkish. In this post, I will try to explain FORst, one of the earlier methods (2004) for solving the OARSMT problem.

OARSMT is an NP-complete problem, so it cannot be solved exactly in polynomial time. VLSI designers therefore use different heuristics to derive a solution for OARSMT, or simply RSMT. The simplest of all is to derive a rectilinear minimum spanning tree and take that as the approximation for the RSMT. Theoretically it is a 1.5-approximation, i.e. the wire length can be up to 50% more than the optimum if we use the rectilinear minimum spanning tree as the RSMT, although in practice it is a lot better (a small sketch of this baseline follows).
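To make that baseline concrete, here is a minimal Python sketch of the approximation: Prim's algorithm under the Manhattan metric, with made-up terminal coordinates. Note that it ignores obstacles entirely, which is exactly the gap FORst sets out to close.

# Baseline approximation: total length of a rectilinear (Manhattan) minimum
# spanning tree over the terminals, built with Prim's algorithm.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def rectilinear_mst_length(terminals):
    """Total wire length of a Manhattan-metric MST over the terminals."""
    in_tree = {terminals[0]}
    remaining = set(terminals[1:])
    total = 0
    while remaining:
        dist, nxt = min((manhattan(u, v), v)
                        for u in in_tree for v in remaining)
        total += dist
        in_tree.add(nxt)
        remaining.remove(nxt)
    return total

terminals = [(0, 0), (4, 1), (1, 5), (6, 6)]   # made-up pin locations
print(rectilinear_mst_length(terminals))       # 17; the true RSMT can only be shorter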

In this paper, the authors derive a three-step heuristic algorithm for OARSMT. The first step is to split the entire graph into several subgraphs depending on the locations of the terminals. For this, we have to construct a full Steiner tree - a tree in which every terminal is a leaf - using Hwang's theorem. Once that is done, we wipe out all edges that pass over one or more obstacles; the nodes of the resulting subtrees constitute the subgraphs.

The second step is RSMT construction. The subgraphs we constructed are free of obstacles, so we can use any heuristic to construct the RSMTs. The original paper uses a combination of ant colony optimization (ACO) and a greedy approximation: ACO is used on subgraphs with few terminals, for accuracy, and the greedy method is used on larger ones and for connecting them together, since it produces results faster.

The third step is to combine all the RSMTs: the nearest nodes of each RSMT are joined to those of adjacent subgraphs. The paper gives experimental evidence that even a large graph with many obstacles can be routed in a short time. A typical nanometer-scale integrated circuit fabric has thousands of nodes and hundreds of obstacles such as power lines, IP blocks, etc. All those cases can be treated as large-scale OARSMT problems, and FORst is a fine candidate if you are looking for a solution.
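And as a tiny illustration of the third step, here is a runnable snippet that finds the cheapest Manhattan edge joining two already-built subtrees; the coordinates are made up, and the real FORst merge of course considers all adjacent subgraphs rather than a single pair.

# Joining two already-constructed subtrees by their nearest pair of nodes
# under the Manhattan metric (a single-pair illustration of step three).

from itertools import product

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def nearest_joining_edge(tree_a, tree_b):
    """Return (node_in_a, node_in_b, length) for the cheapest connecting edge."""
    u, v = min(product(tree_a, tree_b), key=lambda pair: manhattan(*pair))
    return u, v, manhattan(u, v)

tree_a = [(0, 0), (2, 1), (1, 3)]
tree_b = [(6, 4), (5, 2), (8, 5)]
print(nearest_joining_edge(tree_a, tree_b))    # ((2, 1), (5, 2), 4)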

FORst is now nearly five years old, and there have been several improvements and newer heuristics resulting in more and more efficient implementations. Still, we just could not skip our FORst. It is like the 8088 processor: however outdated, you always learn it before proceeding to the architecture of the P4.