Thursday, May 05, 2011

Bloom Filter in Python

A Bloom filter is an extremely space-efficient probabilistic data structure. You can read more about it on its Wikipedia page: http://en.wikipedia.org/wiki/Bloom_filter. A Bloom filter requires very little storage per key: an enormous amount of data can be represented in a small Bloom filter, if we can afford the risk of a very small percentage of false positives. The Venti network storage system, introduced with the Plan 9 distributed operating system, uses a Bloom filter in its storage index.

A few days ago, Raymond Hettinger wrote a simple Bloom filter as an ActiveState recipe. This reminded me that I had written a Bloom filter a long time ago with a different hash function. So I forked Hettinger's code and extended the Bloom filter to allow union and intersection operations between two Bloom filters. These operations are straightforward, and valid only if both Bloom filters use the same hash function and the same bit-array size. They can be useful in distributed systems, where two Bloom filters are constructed independently and merged eventually. I also externalized the hash function as an input to the Bloom filter class, so users can experiment with different hash functions, as long as they supply a generator function that yields hash values.
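To make the idea concrete, here is a minimal sketch (not Hettinger's recipe or my actual code, and the names are illustrative): a Bloom filter whose probe function is supplied by the caller as a generator yielding bit positions, with union and intersection implemented as bitwise OR and AND of the underlying bit arrays. Both filters must share the same probe function and size for the merge to be meaningful.

```python
class BloomFilter:
    def __init__(self, num_bits, probe_func):
        self.num_bits = num_bits
        self.probe_func = probe_func   # generator yielding bit positions
        self.array = 0                 # a Python int used as a bit array

    def add(self, key):
        for pos in self.probe_func(key, self.num_bits):
            self.array |= 1 << pos

    def __contains__(self, key):
        # True only if every probed bit is set (may be a false positive)
        return all(self.array >> pos & 1
                   for pos in self.probe_func(key, self.num_bits))

    def union(self, other):
        # key is in (a | b) if it was added to a or to b
        result = BloomFilter(self.num_bits, self.probe_func)
        result.array = self.array | other.array
        return result

    def intersection(self, other):
        # a superset of the filter built from the common keys
        result = BloomFilter(self.num_bits, self.probe_func)
        result.array = self.array & other.array
        return result

def probes(key, num_bits, k=5):
    """Derive k bit positions from a key by repeated hashing."""
    h = hash(key)
    for _ in range(k):
        yield h % num_bits
        h = hash((h, key))
```

Note that the intersection is only an approximation: its bit array is a superset of what you would get by building a filter from the common keys directly, so it can report a few extra false positives.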

You can find the Python source code for the Bloom filter on my GitHub or in the ActiveState recipe.

A Bloom filter is a very good data structure for space-efficient storage, but it is not suitable if you cannot tolerate any false positives in searches. It is also unsuitable if you need operations such as sorting the dataset or deleting elements from it.

However, with some slight modifications and at the cost of some space, we can achieve a few more things. For instance, if you want to count the number of times a given key is searched, you can allocate more than one bit for every hash position and increment that position's counter on every search. The least counter value among a key's positions (minus one, to discount the insertion) tells how many times that key has been searched. This would increase the storage requirement, and my implementation does not include any such variation.
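A sketch of that counting variation (illustrative names, not part of my implementation): counters replace bits, every successful search increments the key's counters, and the minimum counter minus one bounds the search count from above, since collisions can only inflate individual counters.

```python
class CountingLookupFilter:
    def __init__(self, num_slots, probe_func):
        self.counts = [0] * num_slots   # a counter per slot, not a bit
        self.probe_func = probe_func

    def add(self, key):
        for pos in self.probe_func(key, len(self.counts)):
            self.counts[pos] += 1

    def search(self, key):
        positions = list(self.probe_func(key, len(self.counts)))
        if not all(self.counts[p] for p in positions):
            return False               # definitely never added
        for p in positions:            # record this search
            self.counts[p] += 1
        return True

    def times_searched(self, key):
        """Upper bound on searches for key, discounting the insertion."""
        return min(self.counts[p]
                   for p in self.probe_func(key, len(self.counts))) - 1

def probes(key, num_slots, k=5):
    """Derive k slot positions from a key by repeated hashing."""
    h = hash(key)
    for _ in range(k):
        yield h % num_slots
        h = hash((h, key))
```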

Tuesday, April 12, 2011

Debugging Python on Emacs IDE

I've tried a few IDEs for standalone Python development, but at the end of the day I came back to Emacs. I love Emacs because it is lightweight, and I can do pretty much everything any other IDE can, and much more if I write a little Lisp.

In this article, I'll cover how to step through your standalone Python code and debug it using Emacs as the editor/IDE. First of all, you have to install the following Emacs extension modules and Python debugging tools.
  • Install Pymacs from this github site and thank François Pinard of Montreal for programming it
  • Install Rope and Ropemacs from here (using Mercurial)
  • Check that you have pdb, just by typing it at the command line (you most likely do)
Now the step-by-step debugging goes like this
  1. With your Python program in a buffer, type M-x pdb. It will ask how you want to run PDB, with a prompt like pdb a.out. Replace a.out with your Python module name. In the screenshot, it is adder.py (which does nothing useful).
  2. This opens a new PDB window (Screenshot)
  3. Go to the lines where you need a breakpoint and type C-x SPC (you can see the screenshot for the statement about it). If you are using Ubuntu, you get a very good look and feel, as in that screenshot.
  4. Type c in the PDB buffer to run up to the breakpoint
  5. From the breakpoint, type n for the next line or s to step into the functions on that line. In Ubuntu, you'd have a GUI to assist you.
  6. To watch a variable, type p var, as in the screenshot.
  7. At any time during debugging, w prints out the stack, and u and d let you go up and down the stack.
For most standalone Python scripts, this will improve your productivity a lot. You can do more complex debug operations, but this is a very good start.
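The adder.py in the screenshots isn't shown in the post, but a stand-in might look like this, with comments mapping the pdb commands from the steps above onto the code (the file contents here are my guess, purely for illustration):

```python
def add(a, b):
    result = a + b              # 'p result' prints the variable (step 6)
    return result

def main():
    total = 0
    for i in range(5):          # set a breakpoint here, then 'c' to reach it
        total = add(total, i)   # 'n' steps over; 's' steps into add()
    return total                # 'w' shows the stack; 'u'/'d' move along it

if __name__ == '__main__':
    print(main())
```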

Saturday, April 09, 2011

Browserling

With all your web apps and client-side scripts, cross-browser testing is vital. Great were the roaring 90s, when Internet Explorer was the only browser. But now the end user may use any browser – mostly Chrome, Firefox, Internet Explorer, or Safari. Despite several standardization efforts, there is no guarantee that scripts that run in one browser will run in another. And as an end user, the message "this page works only in Firefox" is the biggest turn-off for me.

Here is an easy way to test how your web UI looks in different browsers without actually installing them on your computer: Browserling. It's an open-source, node.js-based StackVM implementation by Peteris Krumins, and you can follow the growth of this project and all his node.js tools here.

Sunday, February 13, 2011

A Sneak-Peek into Linux Kernel - Chapter 4: Process Scheduling: Part-1

So far, we have covered process creation and termination in Linux. In this post and the next one, we will go through process scheduling. The task scheduler, or process scheduler, or simply scheduler, is the part of the kernel that decides which process gets executed at any given time, keeping in mind the best utilization of the available resources. The scheduler in Linux is preemptive – i.e. it wouldn't mind stopping a partially executed task to run a different task that is better suited. The preempted task is eventually scheduled to continue from the point where it stopped.

Scheduling policy is the strategy a multitasking OS follows to maximize efficiency. In Linux, as in most operating systems, processes are divided into two kinds: IO-bound and processor-bound. An IO-bound process is one that spends most of its time waiting for interrupts from IO devices – like a word processor waiting for keyboard input. A processor-bound task is one whose instruction sequence executes on the processor until it is either preempted or finished. Linux tries to give more importance to IO-bound processes because of their user-interactive nature, without compromising much on the background tasks that require processor time. The Linux scheduler is also a dynamic priority-based scheduler: all processes start with a base priority, which is then adjusted inversely based on how often they use up their allocated timeslices without being interrupted by any IO device.

The Linux scheduler is defined in kernel/sched.c. The scheduling algorithm is called the Completely Fair Scheduler (CFS), developed by the Hungarian hacker Ingo Molnar. The Completely Fair Scheduler is an O(log n) algorithm built on a red-black tree that implements a timeline of future tasks ready for execution. Before CFS, Linux used the O(1) scheduler. In it, the bitmap-based priority queue data structure used by the priority-based scheduling class is given by:

struct rt_prio_array {
    DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
    struct list_head queue[MAX_RT_PRIO];
};
Here MAX_RT_PRIO is the number of real-time priority levels, equal to MAX_USER_RT_PRIO = 100; together with the 40 nice levels, the total number of priority levels (MAX_PRIO) is 140. The main scheduling data structure is the run-queue, defined as struct rq. This structure actually dates from the previous O(1) scheduling algorithm, except for a few changes. One important change is the introduction of a reference to struct cfs_rq, and the leaves of the cfs_rq given by struct list_head leaf_cfs_rq_list.
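The bitmap in rt_prio_array is what made the O(1) scheduler O(1): finding the highest-priority non-empty queue is a single find-first-set on the bitmap. A conceptual Python sketch of the idea (not kernel code; only MAX_RT_PRIO mirrors a kernel constant, the rest is illustrative):

```python
from collections import deque

MAX_RT_PRIO = 100   # real-time priority levels; 0 is the highest priority

class PrioArray:
    def __init__(self):
        self.bitmap = 0                                  # bit i set <=> queue[i] non-empty
        self.queue = [deque() for _ in range(MAX_RT_PRIO)]

    def enqueue(self, prio, task):
        self.queue[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if not self.bitmap:
            return None
        # isolate the lowest set bit = highest priority non-empty queue
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queue[prio].popleft()
        if not self.queue[prio]:
            self.bitmap &= ~(1 << prio)
        return task
```

Because the number of priority levels is fixed, the find-first-set step takes constant time regardless of how many tasks are queued.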

The struct cfs_rq contains a struct load_weight, a simple structure with two unsigned longs used in load balancing. Its important fields are:

struct cfs_rq {
    struct load_weight load;
    unsigned long nr_running;
    struct rb_root tasks_timeline;
    struct rb_node *rb_leftmost;
    ...
};
The nr_running variable gives the number of runnable processes. tasks_timeline is implemented as a red-black tree, as already mentioned, referenced by its root. The leftmost node of the red-black tree is also tracked: it points to the most eligible task for running next (see the enqueue methods declared in sched_class).
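The timeline idea can be sketched conceptually in Python: tasks are kept ordered by how long they have (virtually) run, and the scheduler always picks the leftmost, i.e. least-run, entry. Here a min-heap stands in for the kernel's red-black tree – both give the minimum cheaply, though only the rb-tree also supports O(log n) removal of arbitrary nodes; all names are illustrative:

```python
import heapq

class CfsTimeline:
    def __init__(self):
        self.timeline = []      # (vruntime, name) min-heap, ordered by vruntime
        self.nr_running = 0

    def enqueue(self, vruntime, name):
        heapq.heappush(self.timeline, (vruntime, name))
        self.nr_running += 1

    def pick_next(self):
        """Pop the 'leftmost' task: the one with the least virtual runtime."""
        if not self.timeline:
            return None
        self.nr_running -= 1
        return heapq.heappop(self.timeline)[1]
```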

CFS also eliminates the need for the priority queue by introducing two new structures into struct task_struct: struct sched_entity and struct sched_class. The important fields of struct sched_entity are:

struct sched_entity {
    struct load_weight load;
    struct rb_node run_node;
    struct list_head group_node;
    ...
};
Here run_node is the task's node in the red-black tree. The struct sched_class contains pointers to the functions required to implement the CFS scheduler: enqueue and dequeue functions to add the struct task_struct to the run-queue and remove it, as the task's runnable status changes; the check_preempt_curr function, which ensures fairness (the F of CFS) by checking whether the current task should be preempted; the set_curr_task function, which sets the given task as the one currently running on the given CPU; and the prio_changed function, which signals a change of priority in dynamic priority-driven scheduling. The averaging period used by the scheduler is given by:

    return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
Here NSEC_PER_MSEC stands for nanoseconds per millisecond. One observation that can be made is that the CFS scheduling algorithm in Linux stretches its timekeeping down to the scale of nanoseconds.
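Transcribing the one-liner into Python makes the arithmetic concrete. Assuming the sysctl's default of 1000 ms (my recollection of the mainline default, so treat the number as an assumption), the period works out to half a second expressed in nanoseconds:

```python
NSEC_PER_MSEC = 1_000_000   # nanoseconds per millisecond

def sched_avg_period(sysctl_sched_time_avg=1000):
    # half the sched_time_avg sysctl, converted from ms to ns
    # (default of 1000 ms is an assumption, not from the post)
    return sysctl_sched_time_avg * NSEC_PER_MSEC // 2
```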

Since Linux allows preemptive scheduling, CPU time is divided into several timeslices, and a context switch takes place when a timeslice ends. In CFS, the timeslice length is variable rather than fixed. When to schedule a new task is determined by the sched_clock_tick() function in kernel/sched_clock.c, by a resched_task() call made through a change of priority requested by the user (via the nice() system call, among others), or by the sched_yield() system call.

In the next post, we will see how spinlocks and rcutree are used in scheduling. We will also look at the different scheduling policies in Linux.