Archive for August, 2009


In an effort to learn Perl I was going through the Llama book. When I came to the second chapter there was an exercise where we need to read two numbers and print their product. I wrote the program and it worked well except for a small glitch. Here is my code and the output.

#!/usr/bin/perl
print "Enter the first number : ";
$num1 = <STDIN>;
print "Enter the second number : " ;
$num2 = <STDIN>;
$prd = $num1 * $num2;
print "The product of $num1 and $num2 is $prd\n";

The code worked well and gave 20 when I entered 4 and 5, but it printed the output like this.

Enter the first number : 4
Enter the second number : 5
The product of 4
and 5
is
20

Now this was annoying. Having only just started programming in Perl, my previous experience in C, Java and even PHP never made me expect something like this. I expected it to print the whole sentence on a single line. What could be wrong? Then I recalled the Llama book mentioning that <STDIN> keeps the trailing ‘\n’ in the input and that chomp() is needed to get rid of it.

So I modified the code and added two more lines:

chomp($num1);
chomp($num2);
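
For completeness, here is the full script with the fix in place. Placing each chomp() right after the corresponding <STDIN> read is my choice; any point before the multiplication would work just as well.

#!/usr/bin/perl
# Read two numbers, strip the trailing newlines, and print the product.
print "Enter the first number : ";
$num1 = <STDIN>;
chomp($num1);          # remove the trailing "\n" left by <STDIN>
print "Enter the second number : ";
$num2 = <STDIN>;
chomp($num2);
$prd = $num1 * $num2;
print "The product of $num1 and $num2 is $prd\n";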

This fixed it. But I still found it odd and thought this behaviour of <STDIN> a nuisance. So I went on a bit of a knowledge hunt and asked in the #perl channel on freenode. Until then I had been wary, merely listening and not talking, and even while asking this I was afraid of getting a stern ‘RTFM’. But the folks in there responded nicely. Thanks to <claes_>, who gave the apt reply. Here is what I learned from him.

There is a special variable in Perl named $/ which is initialized to ‘\n’. This variable is called the input record separator and is used to separate the records (lines) in your input. Since $/ contains ‘\n’, Perl assumes your input has ended once you hit the return key. (Note: though the ‘\n’ is not part of the intended input, since you entered it, it is included as part of your input.) That’s why we get the newline when printing it back. chomp() removes this trailing ‘\n’. If we explicitly assign, say, ‘f’ to $/, then <STDIN> reads everything up to and including the first ‘f’ it sees. The behaviour of chomp() changes too: it will now remove a trailing ‘f’ from your input (once again, though ‘f’ wasn’t in your intended input, since you entered it, it will be there). Please don’t actually do this; leave $/ as ‘\n’, or be ready to bear the dire after-effects. So the $/ variable is what governs how user input is read from standard input.
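
Here is a small sketch of that behaviour. The strings and the choice of ‘f’ as a separator are just made-up examples; the point is that chomp() always removes whatever $/ currently holds.

#!/usr/bin/perl
use strict;
use warnings;

my $line = "hello\n";
chomp($line);                # $/ is "\n" by default, so the newline is removed
print "[$line]\n";           # prints [hello]

{
    local $/ = 'f';          # temporarily make 'f' the record separator
    my $word = "aloof";
    chomp($word);            # chomp() now strips a trailing 'f' instead
    print "[$word]\n";       # prints [aloo]
}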

Another option in the above example is the chop() function [I earlier called it a better option, but I am taking that back as per the info from ‘Andrew’]. While chomp() removes whatever is declared in the $/ variable, chop() unconditionally removes the last character from the input.

So in the snippet below
$str = "Christy";
chop($str);
print $str;

the output is ‘Christ’.

I think I was successful in explaining the issue; may this save another newbie someday.

[Revision: Thursday 3 Sep, 2009: There was some confusion in the explanation of the input record separator. Thanks to Chas Owen, I have corrected it.]


Alan Haggai Alavi, a friend of mine and another Perl hacker, introduced me to Perl a few days back. He has been doing steady Perl coding for a while, has submitted some modules to CPAN, and is a known name among some of the elite coders in the Perl community. I am happy to have him as my friend.

Much to the anger of die-hard Perl hackers, I was another ignorant fellow who thought Perl was synonymous with CGI. But Haggai and some reading on the matter got rid of that misinformation. Haggai showed me some samples of a CMS he was developing in Perl, using Catalyst as the framework. He showed me how to do a unit test, though most of it was Latin to me, since I haven’t yet done test-driven development (I know that for a programmer with more than two years of experience this is a shame, but none of the work I have done so far demanded it; I’m still pretty much a beginner). However, this got me excited and I started looking into Perl.

There are so many myths regarding Perl, and I’m yet to gain the experience to believe or dispel any of them. From what I hear from people who actually code Perl, it is a robust, powerful and less restrictive language. So I’m going to learn it.

One more thing I love about Perl is the passion among Perl coders. They eat, drink and sleep Perl and will do anything and everything for the language. I haven’t seen that kind of emotion in any other language community. The Perl coders are also humble. They don’t belittle other languages; they even tell people to learn more languages to experience the difference and gain wisdom. For comparison, ask a Python programmer what he feels about Perl. I think this cool behaviour of Perl programmers must have been inherited from Larry Wall (I don’t know him personally, but I read that he is a very religious guy, so I assume he is humble too).

Haggai also introduced me to Ohloh and Stack Overflow. Since I haven’t developed any open source software and am not yet a genius, my rating there is very poor. But I believe I can improve in the coming days.

I also decided to join the Iron Man competition. Maybe I’m vain, but I have a dream: one day I too will be a known name in the open source community. Like Matt S Trout, Shlomi Fish, Randal Schwartz, brian d foy and Larry Wall (oh God, I’m really vain), one day the Perl community will know me too.

In an attempt to move into the field of enterprise application development, I started refreshing my Java recently. I was going through a well known book when I stumbled upon the implementation of strings in Java. I have high esteem for the developers at Sun, but I really could not digest the fact that the Sun engineers thought 2 bytes would be enough for characters. It was kind of Y2K-ish. Now that Unicode has grown beyond what 16 bits can represent, I was eager to find out how Sun tackled this problem. The book touched on the matter only vaguely, and since that didn’t completely clear my doubts, I decided to investigate. These are my findings.

Before we dive into the Java implementation of the standard, we should understand what UTF is. At least some of us might have seen it somewhere. Maybe those of us who have the creepy habit of going through the source of an HTML page might have seen the following,
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />.
And we may have a vague idea of what it does. Don’t worry, let’s understand what it really is.

Unicode is an internationally accepted standard that defines a character set and corresponding encodings. The above piece of markup informs the browser that the document should be interpreted using the UTF-8 encoding. So what exactly is UTF, or Unicode Transformation Format?

In order to completely know about Unicode, we have to go back a few decades, to when people thought the earth was the center of the universe and was indeed flat. Oh sorry, we have to go back a few decades, to when the majority of software was written by English-speaking people. It was only natural that they thought the only characters ever to be encountered in the realm of programming would be the English alphabet, numerical digits and a couple of other prominent characters. So it seemed logical to use 2⁸, or 256, slots to represent the set of characters. That was enough for the commonly used characters back then, and space was left over for the inclusion of more characters in the future. The problem started when different people started encoding different characters into that free space. What evolved was chaos. Also, when the internet happened, people all around the globe started using technology, tweaking programs to their likes and in their own native languages. It became impossible to fit all their characters into the tiny space of 256 slots. Soon the encoding system known as ASCII ran out of space and the need for a better encoding system arose. To make a long story short, thus evolved UTF. More on this can be read from here and here.

In Unicode, characters are represented by code points. A code point is usually written as a hex number preceded by U+. For example, U+2122 means ™, the trademark symbol. The prominent Unicode encodings are UTF-8, UTF-16 and UTF-32. These three are methods to represent the Unicode character set using 8-bit, 16-bit and 32-bit code units respectively. To get a more detailed idea of Unicode, please read Joel Spolsky’s post.
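
Just to make the idea concrete, here is a tiny sketch in Perl (the language used elsewhere on this blog) that turns the code point number into the actual character:

#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');   # emit UTF-8 bytes on output
my $tm = chr(0x2122);                  # code point U+2122, the trademark sign
print "U+2122 is: $tm\n";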

In Java, from the beginning, characters were represented by 16 bits, and for some time that was enough to represent all the characters. But once the characters included in Unicode outgrew the 16-bit realm, Java was faced with a dilemma: either change the char representation to 32 bits or use some other method. It is not a huge issue in practice, since most of the characters outside the 16-bit range are rarely used. But since Java is a language that believes very much in portability, and the engineers at Sun are much more intelligent than average developers like us, they found a way to circumvent this issue, and Java is now equipped to represent the characters beyond the 16-bit range as well. So how does Java tackle the supplementary characters outside the 16 bits? What Sun employed to get out of this mess was the UTF-16 encoding. So what is UTF-16?

To quote Wikipedia, UTF-16 is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding maps each character to a sequence of 16-bit words. Characters are known as code points and the 16-bit words are known as code units. The basic characters, those from the Basic Multilingual Plane (BMP), can be represented using a single 16-bit word. For characters outside this plane we need to use a pair of 16-bit words called a surrogate pair. Thus all the code points from U+0000 through U+10FFFF, except for U+D800–U+DFFF (these are not assigned any characters), can be represented using UTF-16. Why are those numbers not assigned any characters? It is an intelligent choice made by the Unicode community in designing the UTF-16 encoding scheme.

The characters outside the BMP (those from U+10000 through U+10FFFF) are represented using a pair of 16-bit words, as I said before. This pair is known as a surrogate pair. First, 0x10000 is subtracted from the original code point to make it a 20-bit number. That number is then split into two 10-bit halves, each of which is loaded into a surrogate, with the higher-order bits in the first. The two surrogates will lie in the ranges 0xD800–0xDBFF and 0xDC00–0xDFFF. Since we have left that region unassigned, we can be sure that a word in it is not a character by itself but needs further processing before the original code point is recovered. You can read the UTF-16 specification from Sun here.
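
To make the arithmetic concrete, here is a small sketch in Perl (again, the language used elsewhere on this blog) that encodes one code point outside the BMP by hand. U+1D11E, the musical G clef, is just a made-up example:

#!/usr/bin/perl
use strict;
use warnings;

my $code_point = 0x1D11E;                    # an example character outside the BMP
my $offset     = $code_point - 0x10000;      # now a 20-bit number
my $high       = 0xD800 + ($offset >> 10);   # top 10 bits go into the high surrogate
my $low        = 0xDC00 + ($offset & 0x3FF); # bottom 10 bits go into the low surrogate
printf "U+%04X is encoded as the surrogate pair %04X %04X\n",
       $code_point, $high, $low;             # prints D834 DD1E for this example

Decoding is just the reverse: strip the 0xD800 and 0xDC00 offsets from the two surrogates, glue the 10-bit halves back together and add 0x10000.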
