Month: May 2010

Applying the rules …

Developing software has many things in common with aircraft development, depending, of course, on how you look at it. In both cases, getting technical innovations to market as quickly as possible is key to success, and while our users may not fall from the sky if our software fails, there are many software products that carry enormous dollar and human costs when they don’t work as they should. Even…

Kelly Johnson’s 14 Rules of Management

Kelly Johnson’s 14 Rules of Management, in their original form (highlighting added by me):

1. The Skunk Works manager must be delegated practically complete control of his program in all aspects. He should report to a division president or higher.
2. Strong but small project offices must be provided both by the military and industry.
3…

The top 10 tricks of Perl one-liners

I’m a recovering perl hacker. Perl used to be far and away my language of choice, but these days I’m more likely to write new code in Python, largely because far more of my friends and coworkers are comfortable with it.

I’ll never give up perl for quick one-liners on the command-line or in one-off scripts for munging text, though. Anything that lasts long enough to make it into git somewhere usually gets rewritten in Python, but nothing beats perl for interactive messing with text.

Perl, never afraid of obscure shorthands, has accrued an impressive number of features that help with this use case. I’d like to share some of my favorites that you might not have heard of.

One-liners primer

We’ll start with a brief refresher on the basics of perl one-liners. The core of any perl one-liner is the -e switch, which lets you pass a snippet of code on the command-line:

perl -e 'print "hi\n"' prints “hi” to the console.

The second standard trick of perl one-liners is the -n and -p flags. Both of these make perl put an implicit loop around your program, running it once for each line of input, with the line in the $_ variable. -p also adds an implicit print at the end of each iteration.

Both of these use perl’s special “ARGV” magic file handle internally. What this means is that if there are any files listed on the command-line after your -e, perl will loop over the contents of the files, one at a time. If there aren’t any, it will fall back to looping over standard input.

perl -ne 'print if /foo/' acts a lot like grep foo, and perl -pe 's/foo/bar/' replaces foo with bar.

Most of the rest of these tricks assume you’re using either -n or -p, so I won’t mention them every time.

The top 10 one-liner tricks

Trick #1: -l

Smart newline processing. Normally, perl hands you entire lines, including a trailing newline. With -l, it will strip the trailing newline off of any lines read, and automatically add a newline to anything you print (including via -p).

Suppose I wanted to strip trailing whitespace from a file. I might naïvely try something like

perl -pe 's/\s*$//'

The problem, however, is that the line ends with "\n", which is whitespace, and so that snippet will also remove all newlines from my file! -l solves the problem, by pulling off the newline before handing my script the line, and then tacking a new one on afterwards:

perl -lpe 's/\s*$//'

Trick #2: -0

Occasionally, it’s useful to run a script over an entire file, or over larger chunks at once. -0 makes -n and -p feed you chunks split on NULL bytes instead of newlines. This is often useful for, e.g., processing the output of find -print0. Furthermore, perl -0777 makes perl not do any splitting at all, and pass entire files to your script in $_.

find . -name '*~' -print0 | perl -0ne unlink

could be used to delete all ~-files in a directory tree, without having to remember how xargs works.

Trick #3: -i

-i tells perl to operate on files in-place. If you use -n or -p with -i, and you pass perl filenames on the command-line, perl will run your script on those files, and then replace their contents with the output. -i optionally accepts a backup suffix as an argument; perl will write backup copies of edited files to names with that suffix added.

perl -i.bak -ne 'print unless /^#/' script.sh

would strip all whole-line comments from script.sh, but leave a copy of the original in script.sh.bak.

Trick #4: The .. operator

Perl’s .. operator is a stateful operator: it remembers state between evaluations. As long as its left operand is false, it returns false. Once the left operand evaluates to true, it returns true on every evaluation until the right operand evaluates to true, at which point it resets and goes back to testing the left operand.

What does that mean in practice? It’s a range operator: It can be easily used to act on a range of lines in a file. For instance, I can extract all GPG public keys from a file using:

perl -ne 'print if /-----BEGIN PGP PUBLIC KEY BLOCK-----/../-----END PGP PUBLIC KEY BLOCK-----/' FILE
Trick #5: -a

-a turns on autosplit mode – perl will automatically split input lines on whitespace into the @F array. If you ever run into any advice that accidentally escaped from 1980 telling you to use awk because it automatically splits lines into fields, this is how you use perl to do the same thing without learning another, even worse, language.

As an example, you could print a list of files along with their link counts using

ls -l | perl -lane 'print "$F[8] $F[1]"'

Trick #6: -F

-F is used in conjunction with -a, to choose the delimiter on which to split lines. To print every user in /etc/passwd (which is colon-separated with the user in the first column), we could do:

perl -F: -lane 'print $F[0]' /etc/passwd
Trick #7: \K

\K is undoubtedly my favorite little-known-feature of Perl regular expressions. If \K appears in a regex, it causes the regex matcher to drop everything before that point from the internal record of “Which string did this regex match?”. This is most useful in conjunction with s///, where it gives you a simple way to match a long expression, but only replace a suffix of it.

Suppose I want to replace the From: field in an email. We could write something like

perl -lape 's/(^From:).*/$1 Nelson Elhage <nelhage\@ksplice.com>/'

But having to parenthesize the right bit and include the $1 is annoying and error-prone. We can simplify the regex by using \K to tell perl we don’t want to replace the start of the match:

perl -lape 's/^From:\K.*/ Nelson Elhage <nelhage\@ksplice.com>/'
Trick #8: $ENV{}

When you’re writing a one-liner using -e in the shell, you generally want to quote it with ', so that dollar signs inside the one-liner aren’t expanded by the shell. But that makes it annoying to use a ' inside your one-liner, since you can’t escape a single quote inside single quotes in the shell.

Let’s suppose we wanted to print the username of anyone in /etc/passwd whose name included an apostrophe. One option would be to use a standard shell-quoting trick to include the ':

perl -F: -lane 'print $F[0] if $F[4] =~ /'"'"'/' /etc/passwd

But counting apostrophes and backslashes gets old fast. A better option, in my opinion, is to use the environment to pass the regex into perl, which lets you dodge a layer of parsing entirely:

env re="'" perl -F: -lane 'print $F[0] if $F[4] =~ /$ENV{re}/' /etc/passwd

We use the env command to place the regex in a variable called re, which we can then refer to from the perl script through the %ENV hash. This way is slightly longer, but I find the savings in counting backslashes or quotes to be worth it, especially if you need to end up embedding strings with more than a single metacharacter.

Trick #9: BEGIN and END

BEGIN { ... } and END { ... } let you put code that gets run entirely before or after the loop over the lines.

For example, I could sum the values in the second column of a CSV file using:

perl -F, -lane '$t += $F[1]; END { print $t }'
Trick #10: -MRegexp::Common

Using -M on the command line tells perl to load the given module before running your code. There are thousands of modules available on CPAN, many of them potentially useful in one-liners, but one of my favorites for one-liner use is Regexp::Common, which, as its name suggests, contains regular expressions to match many commonly-used pieces of data.

The full set of regexes available in Regexp::Common is available in its documentation, but here’s an example of where I might use it:

Neither ifconfig nor the ip tool that is supposed to replace it provides, as far as I know, an easy way of extracting information for use by scripts. The ifdata program provides such an interface, but isn’t installed everywhere. Using perl and Regexp::Common, however, we can do a pretty decent job of extracting an IP from ip’s output:

ip address list eth0 | \
  perl -MRegexp::Common -lne 'print $1 if /($RE{net}{IPv4})/'

So, those are my favorite tricks, but I always love learning more. What tricks have you found or invented for messing with perl on the command-line? What’s the most egregious perl “one-liner” you’ve wielded, continuing to tack on statements well after the point where you should have dropped your code into a real script?

~nelhage

Date Columns at the end of Indexes

A generally good rule of thumb is that date columns should appear at the end of a multi-column index and not at the start or the middle. Why? Because generally date columns are used in range comparisons in queries a lot more often than equality comparisons. The effect of putting such a date column in the middle of an index is to distribute values of the following columns across the different dates within the index.

What this means is that if you have an index on 3 columns – A, D, Z say – where D is a date and A and Z are not (either numeric or a character string), and you execute a query constraining on all 3 columns but with a range constraint on D, the database server will have to read and check a great many index entries:

select ... from T where A = 123 and D > to_date ('20090101', 'YYYYMMDD') and Z = 456 ;

An index is a tree-like structure organized by the columns in it. So at the highest levels it branches out for different values of A. Then lower down, for each value of A it branches out for each value of D within that particular value of A. And underneath that we have each value of Z that occurs for each value of D. The value 456 of Z may only occur a few times in the whole table, but it could occur under any value of D, and so could be in many different parts of the index.

This means that when executing the query, the database server will traverse the index, find the branch for A with the value 123 and then all the values for D underneath it. Due to the date range constraint, it will have to check many different values of D to see which ones also have Z with a value of 456.

If the index was on A, Z, D instead, then fewer index entries would need to be checked when executing the query. First it would traverse the index to 123 for A. Then it would traverse down the next level to 456 for Z. Then it would traverse down the next level to D and all values greater than the specified date (1/1/2009). At this point all such entries in the index match the 3 constraints, and so can be retrieved.

This alternate index leads to far fewer index entries being read and checked when executing this particular query.
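As a sketch, the two alternatives described above would be created like this (the table T and the index names are illustrative, not from the original system):

```sql
-- Date in the middle: under each value of A the entries fan out by D,
-- so a range scan on D must inspect many entries to find Z = 456.
CREATE INDEX t_a_d_z ON T (A, D, Z);

-- Date at the end: the equality columns A and Z come first, so the
-- scan descends straight to the (A, Z) slice and walks only the
-- qualifying range of D.
CREATE INDEX t_a_z_d ON T (A, Z, D);
```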

I am not saying that dates should always go at the end of indexes. If Z was not constrained in the query, only A and D, then the second index would not be much use, and the first index would be much better. There may also be cases where dates are used with equality constraints rather than ranges. But the general rule is still a good one. Put date columns at the end of an index, unless you know the queries being executed would benefit from that date column appearing earlier in the index. As ever the best indexes are those that are most useful to the queries you execute and cover the referenced columns.

An example of this I have just come across resulted in a query execution time coming down from 10,000 ms (10 seconds) to 20 ms, a reduction of more than two orders of magnitude – a major improvement I think you’ll agree. To me this just shows the greater “efficiency” of the index – far fewer index blocks need to be visited to satisfy the query, and those fewer blocks are more likely to be cached in memory too. Hence the orders of magnitude reduction in the elapsed time of the query.

Furthermore this query is executed in a loop some 200+ times within the application I was looking at. So the net effect is not just 10 seconds down to 20 ms for a single execution, but really 2,000 seconds, or 33 minutes, down to only a few seconds across the overall job each day it is run. A worthwhile improvement from having an alternate index with the date column at the end of it.

Underperformance of V$SQL_OPTIMIZER_ENV

A couple of days ago, I was trying to search for the non-default optimizer environment settings (aka parameters) of a specific SQL query which has multiple child cursors. But it soon turned out that it’s almost impossible to search this simple view, simply because of its poor performance. The following is what I mean. 1. Create simple objects and gather […]

Oracle on ext4 warning

I bought a new laptop recently (Dell Studio XPS 16): Core i7, 8GB, 7200RPM disk. I installed my Linux distro of choice, Fedora 12, and started installing Oracle. Yeah, I know Fedora isn’t a supported OS, but I’ve got almost every version from 8.1.7.4 to 11.2 and I’ve never really hit any problems […]
