Wiki: Perl

Perl Coding and Syntax Reference
by Oliver; 2014-01-13
   

Introduction

This article is no longer maintained


If you started scripting with bash, as I did, you'll find Perl to be a breath of fresh air. As my co-worker said, "Perl is my first love. Bash is that dirty girl I call when I'm drunk." Perl is a good choice for shell-scripting because it's easy to use, crystal-clear, and beloved by many. System commands are a cinch and its regex abilities are legendary. However, there's a great civil war raging in the CS community between Python and Perl. If you're new to programming—the only reason for being on this page!—it makes more sense to start with Python, Perl's wildly popular younger brother. Just as spoken languages gain in prestige from more speakers working within their medium, so in computer languages numbers matter. More users means more momentum means more packages means more users... repeat cycle.

The go-to source for all things Perl is the official docs. This wiki is a collection of Perl miscellany; so you know what to expect, it's more a reminder to myself than a carefully crafted article. I'm a firm believer in the basics and that's mostly what you'll find here. Before we dive in, I can't resist linking this picture:

image
(Image credit: Learning Perl - Oleg Volk)

Declaring Empty Variables in Perl

In Perl, there are ordinary variables—called scalars—which can take on either numbers or strings, as well as arrays and hashes. Here's how to declare an empty variable for each data type:
$var= "";    # empty scalar variable
@arr = ();   # empty array
%hash = ();  # empty hash

Variable Scope in Perl

By default, variables are global in Perl. This can lead to dangerous name conflicts, so you usually want to turn this default off. To do so, use strict and declare variables with my, which will cause them to live only in the curly bracket blocks { } where they're declared. Examine the following script, tmp.pl:
#!/usr/bin/env perl

use strict; 

$a="hello";
print $a."\t".$ARGV[0]."\n";
As we'll see below, $ARGV[0] represents the first argument supplied by the user. In words, the above code snippet prints the variable $a, a tab, the argument, and a newline. If we run it on the command line, we get:
$ ./tmp.pl joe
hello	joe
Let's change the script to:
#!/usr/bin/env perl

use strict; 

{
	$a="hello";
}
print $a."\t".$ARGV[0]."\n";
Now:
$ ./tmp.pl joe
hello	joe
Let's change the script to use my:
#!/usr/bin/env perl

use strict; 

{
	my $a="hello";
}
print $a."\t".$ARGV[0]."\n";
Now:
$ ./tmp.pl joe
	joe
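At last hello isn't printed, because the $a declared with my exists only inside the block; the $a outside the block is a different, still-empty package variable. One caveat: these scripts compile under use strict only because $a and $b are special package variables (used by sort) that are exempt from strict. With any other name, say $x, the first two versions would fail to compile with an error like: Global symbol "$x" requires explicit package name.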
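To see the same scoping rule with an ordinary variable name, here is a minimal sketch (the subroutine scope_demo and the variable $name are made up for illustration). The inner my creates a new variable that hides the outer one only until the closing brace:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# hypothetical helper: records what $name holds inside and outside a block
sub scope_demo
{
    my $name = "outer";        # visible until the end of the subroutine
    my @seen;
    {
        my $name = "inner";    # a separate variable, lives only in this block
        push @seen, $name;
    }
    push @seen, $name;         # the outer $name was never touched
    return @seen;
}

print join("\t", scope_demo()), "\n";   # prints: inner	outer
```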

Working with Scalars, Arrays, and Hashes in Perl

Let's illustrate how to work with each basic data type in Perl:
#!/usr/bin/env perl

use strict; 
use warnings; 

# ordinary scalars can be either strings or numbers 
my $a = "joe";
my $b = 10;

print "var: ",$a,"\t",$b,"\n";
Running this produces:
var: joe	10
Note that Perl scalars can be either strings or numbers. If we add some more code about arrays:
# array
my @c = ("joe", "joe2", "joe3");
my @d = ();
push(@d, 1); 
push(@d, 2); 
push(@d, 3); 

print "arr: ",$c[0],"\t",$c[1],"\n\n";

foreach my $elt (@d)
{
	print "arr: ",$elt,"\n";
}
print "\n";
print "length of array is: ",scalar(@d),"\n";
we get:
arr: joe	joe2

arr: 1
arr: 2
arr: 3

length of array is: 3
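Note that scalar is the oddly named function used to get the array's length: it forces scalar context, and an array evaluated in scalar context yields its number of elements.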
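As a small sketch of the same idea, here are a few other ways context shows up when working with an array (the variable names are just for illustration):

```perl
use strict;
use warnings;

my @d = (1, 2, 3);

my $count      = @d;     # assigning an array to a scalar gives its length
my $last_index = $#d;    # index of the last element (length - 1)
my ($first)    = @d;     # parentheses force list context: first element

print "count: $count\n";             # count: 3
print "last index: $last_index\n";   # last index: 2
print "first: $first\n";             # first: 1
```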

Putting in some code about hashes:
# hash
my %h = ("joe", 1, "joe2", 2, "joe3", "three");
my %f = ();
$f{2} = "doe";

print "hash: ",$f{2},"\n\n";

foreach my $key (keys %h)
{
	print "hash: ",$key,"\t",$h{$key},"\n";
}
produces:
hash: doe

hash: joe	1
hash: joe3	three
hash: joe2	2
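Note that keys is a function in Perl which returns a list of the keys of the hash. Hashes are unordered, so the keys can come back in any order (as they did above) and the order can differ between runs. There is also a function values which returns the values: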
foreach my $val (values %h)
{
	print "hash vals: ",$val,"\n";
}
This produces:
hash vals: 1
hash vals: three
hash vals: 2
To see your whole hash, use the Dumper function from the Data::Dumper module:
use Data::Dumper;
print Dumper \%h;
Note that this function takes a reference to the hash, not the hash itself. We'll discuss references below. This outputs:
$VAR1 = {
          'joe' => 1,
          'joe3' => 'three',
          'joe2' => 2
        };
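The key order in the output above is not alphabetical because hashes are unordered. If a stable order matters, one option is to sort the keys yourself; Data::Dumper can also be asked to sort them via its $Data::Dumper::Sortkeys setting. A minimal sketch, reusing the hash from above:

```perl
use strict;
use warnings;
use Data::Dumper;

my %h = ("joe", 1, "joe2", 2, "joe3", "three");

# sort the keys to get a stable, alphabetical order
foreach my $key (sort keys %h)
{
    print "hash: ", $key, "\t", $h{$key}, "\n";
}

# ask Data::Dumper to sort the keys as well
$Data::Dumper::Sortkeys = 1;
print Dumper \%h;
```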

Conditional Logic

For example:
my $a=0; 
my $b=1;

if ($a)
{
	print "A","\n";                                                                                                                                                        
}
elsif ($b)
{                                                                                                                                                                   
	print "B","\n";
}
else                                                                                                                                                    
{
	print "C","\n"; 
}

# output is: B 
Perl's logical operators are:
&&    and
||    or
!     not
You can also use the English words, which mean the same thing but bind less tightly (they have lower precedence):
and   and
or    or
not   not
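The if ($a) test above depends on which values Perl counts as true. As a rough sketch (truth_label is a made-up helper), the false values are 0, the string "0", the empty string, and undef; everything else is true:

```perl
use strict;
use warnings;

# hypothetical helper: report how Perl treats a value in a condition
sub truth_label
{
    my ($value) = @_;
    return $value ? "true" : "false";
}

# the false values
print truth_label(0),     "\n";   # false
print truth_label("0"),   "\n";   # false
print truth_label(""),    "\n";   # false
print truth_label(undef), "\n";   # false

# everything else is true, including strings that merely look like zero
print truth_label("0.0"), "\n";   # true
print truth_label("00"),  "\n";   # true
print truth_label(" "),   "\n";   # true
```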

Loops

Basic for loop to print 1 through 10:
#!/usr/bin/env perl

use strict; 

for (my $i = 1; $i <= 10; $i++)
{
	print $i,"\n";
}
Perl also has a special foreach loop for arrays:
my @arr = ("hello", 2, "goodbye");

foreach my $elt (@arr)
{
	print $elt,"\n";
}
Produces:
hello
2
goodbye
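The two loop styles can be combined: foreach also accepts a range built with the .. operator, which often reads more easily than the three-part for loop. A small sketch (sum_range is a made-up helper):

```perl
use strict;
use warnings;

# hypothetical helper: add up the integers from $from to $to
sub sum_range
{
    my ($from, $to) = @_;
    my $total = 0;
    foreach my $i ($from .. $to)   # the range operator yields $from, $from+1, ..., $to
    {
        $total += $i;
    }
    return $total;
}

print sum_range(1, 10), "\n";   # 55
```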

File I/O

Suppose we have two files:
a
b
c
and:
1
2
3
and we want to interleave the rows of these files. Here's a script to do that, illustrating basic file I/O:
#!/usr/bin/env perl

use strict; 
use warnings; 

# script to take two files of the same length and interleave

my ($input1, $input2, $output1) = @ARGV;

open my $fh1, '<', $input1 or die "Cannot read $input1: $!";    # read
open my $fh2, '<', $input2 or die "Cannot read $input2: $!";    # read
open my $fh3, '>', $output1 or die "Cannot write $output1: $!"; # write

while(<$fh1>) 
{
	print $fh3 $_; 
	$_ = <$fh2>;
	print $fh3 $_; 
}
	
close $fh1;
close $fh2;
close $fh3;
The variable $fh1 stores the file handle of the first file argument, $fh2 corresponds to the second file, and:
print $fh3 $_;
prints a line to the third file (we'll discuss the special variable $_ below). We could run our script on the command line as:
$ ./myscript.pl file1.txt file2.txt file3.txt
This produces file3.txt:
a
1
b
2
c
3
Read more about file handling in Perl at tutorialspoint.com.
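The script above assumes both files have the same number of lines. Here is a sketch of one way to relax that assumption, wrapped in a made-up subroutine interleave_files; once the shorter file runs out, the remaining lines of the longer one are simply copied through:

```perl
use strict;
use warnings;

# hypothetical helper: interleave the lines of two files into a third
sub interleave_files
{
    my ($in1, $in2, $out) = @_;

    open my $fh1, '<', $in1 or die "Cannot read $in1: $!";
    open my $fh2, '<', $in2 or die "Cannot read $in2: $!";
    open my $fh3, '>', $out or die "Cannot write $out: $!";

    while (1)
    {
        my $line1 = <$fh1>;    # undef once the first file is exhausted
        my $line2 = <$fh2>;    # undef once the second file is exhausted
        last unless defined $line1 or defined $line2;
        print {$fh3} $line1 if defined $line1;
        print {$fh3} $line2 if defined $line2;
    }

    close $fh1;
    close $fh2;
    close $fh3 or die "Cannot finish writing $out: $!";
}
```

It could be called as interleave_files("file1.txt", "file2.txt", "file3.txt"), using the file names from the example above.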

String Manipulation

To concatenate strings in Perl use a dot:
my $a = "hello";
print $a." goodbye "."goodbye ".$a."\n";
produces:
hello goodbye goodbye hello
If we use double quotes, we can actually put the variables right inside. This snippet is identical to the one above:
my $a = "hello";
print "$a goodbye goodbye $a\n";
This makes for much more convenient reading, so you should use this form when possible and avoid unnecessary concatenation. On the other hand, if we use single quotes:
print '$a goodbye goodbye $a\n';
we get a completely literal interpretation:
$a goodbye goodbye $a\n
Split on tab:
my $line="hello\tkitty";
print $line,"\n";

my @arr = split(/\t/, $line);

print $arr[0],"\n";
print $arr[1],"\n";
Output is:
hello	kitty
hello
kitty
Join on comma:
my $c = join(",", @arr);
print $c,"\n";
Output is:
hello,kitty
Perl chomp removes a newline character ("\n") from the end of a string:
chomp($mystr);
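Putting chomp, split, and join together, here is a minimal sketch that converts one tab-separated line into a comma-separated one. The helper name tsv_to_csv is made up, and the conversion is naive: it does no quoting of fields that themselves contain commas.

```perl
use strict;
use warnings;

# hypothetical helper: turn one tab-separated line into a comma-separated one
sub tsv_to_csv
{
    my ($line) = @_;
    chomp($line);                       # drop the trailing newline, if any
    my @fields = split(/\t/, $line);    # break the line apart on tabs
    return join(",", @fields);          # glue the fields back together with commas
}

print tsv_to_csv("hello\tkitty\n"), "\n";   # hello,kitty
```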

$_ and @_

$_ and @_ are special variables in Perl. Recall the for loop over an array we saw above:
my @arr = ("hello", 2, "goodbye");

foreach my $elt (@arr)
{
	print $elt,"\n";
}
We could do this identically as:
my @arr = ("hello", 2, "goodbye");

foreach (@arr)
{
	print $_,"\n";
}
What's $_? It is a slippery thing, sometimes called an "implicit variable," which springs into existence when there should be a variable handy but we haven't specified one. In the above case, we iterated through an array but didn't provide a variable. Perl allows us to be lazy in this way and use $_ as a shorthand.

$_ often comes up in the context of file reading. For example, to read input piped in from stdin the syntax is:
while (<STDIN>)
{
	print $_;
}
In this case, $_ represents a line from stdin (it could just as well be one from a file).

We'll discuss command line Perl below, but here's a foretaste:
$ echo joe | perl -ne '{ chomp($_); print $_," ",$_,"\n" }'
joe joe
Here $_ is a stand-in for the input string piped via stdin. Because of $_'s protean nature, it's best to get a feel for it through practice. If you'd like to read more about it, this tutorial isn't bad.

We'll see @_ below when we discuss Perl functions, which are called subroutines. It is, simply, an array which exists implicitly in every subroutine and holds the arguments passed to the subroutine. Because it's an array, you can access its elements with the usual array syntax: $_[0] is its first element, and so on. These elements are not to be confused with the variable we discussed before, $_, which has a totally different meaning. This Stack Overflow post has a nice discussion about @_. You can read about all Perl special variables in the official docs.
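To make the difference concrete, here is a small sketch with two made-up subroutines. The first loops over @_ and lets each argument land in $_; the second reads $_[0], which is an element of @_ and has nothing to do with $_:

```perl
use strict;
use warnings;

# hypothetical helper: add up however many numbers are passed in
sub sum_all
{
    my $total = 0;
    foreach (@_)          # @_ holds the arguments; each one lands in $_ in turn
    {
        $total += $_;
    }
    return $total;
}

# hypothetical helper: $_[0] is the first element of @_, not the scalar $_
sub first_arg
{
    return $_[0];
}

print sum_all(1, 2, 3), "\n";       # 6
print first_arg("a", "b"), "\n";    # a
```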

Example with Hashes: Combine Rows with Same First Field in a File

Let's say we have a file of the form:
a	1
a	2
b	3
c	4
d	5
b	6
a	7
e	8
and we want to get this in the form:
e	8
c	4
a	1,2,7
b	3,6
d	5
Here's a script to do that with hashes on input piped in from stdin:
#!/usr/bin/env perl

use Data::Dumper; 
use strict; 

my %h=();

while (<STDIN>)
{
	chomp($_); 
	my @a=split; 

	if ( not exists $h{$a[0]} )   # exists, so that a stored value of 0 still counts as an entry
	{ 
		$h{$a[0]} = $a[1]; 
	} 
	else 
	{ 
		$h{$a[0]} = $h{$a[0]}.",".$a[1]; 
	}
}

foreach my $key ( keys %h ) 
{
	print $key,"\t",$h{$key},"\n"
}

# print Dumper \%h
This script loops over stdin. It gets the first field of the input and makes an entry in the hash, %h, if one does not exist yet:
field 1 --> field 2
If the key already exists in the hash table, it concatenates the second field onto the current value with a comma.
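An alternative to gluing strings together is to store an array of values per key and join them only when printing. The sketch below assumes the same two-column input; group_rows is a made-up helper that takes the lines as arguments rather than reading stdin, so it is easy to try out:

```perl
use strict;
use warnings;

# hypothetical helper: group the second fields by the first field
sub group_rows
{
    my @lines = @_;
    my %h = ();
    foreach my $line (@lines)
    {
        chomp($line);
        my ($key, $value) = split(' ', $line);   # split on whitespace
        push @{ $h{$key} }, $value;              # the array is created on first use
    }
    return \%h;
}

my $grouped = group_rows("a\t1\n", "a\t2\n", "b\t3\n", "a\t7\n");
foreach my $key (sort keys %$grouped)
{
    print $key, "\t", join(",", @{ $grouped->{$key} }), "\n";
}
# prints:
# a	1,2,7
# b	3
```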

Command Line Perl and Regex

I'm borrowing this section from An Introduction to Unix. You can make Perl run on the command line and execute code line by line using the flags -ne:
cat file.txt | perl -ne '{ some code }'
This makes Perl behave just like awk—and it obeys some of the same conventions, such as using the BEGIN and END keywords. You can read more about Perl's command line options here and in even gorier detail here, but let's see how to use it. Suppose you have a text file and you want to format it as an HTML table:
$ cat test_table.txt
x	y	z
1	2	3
a	b	c
Doing this by hand would be pure torture. Let's do it with a one-liner on the command line:

$ cat test_table.txt | perl -ne 'BEGIN{print "<table border=\"1\">\n";}{chomp($_); my @line=split("\t",$_); print "<tr>"; foreach my $elt (@line) { print "<td>$elt</td>"; } print "</tr>\n";}END{print "</table>\n";}'

Expanding this for readability:
$ cat test_table.txt | perl -ne 'BEGIN{print "<table border=\"1\">\n";}{
	chomp($_); 
	my @line=split("\t",$_); 
	print "<tr>"; 
	foreach my $elt (@line) { print "<td>$elt</td>"; } 
	print "</tr>\n";
  }END{print "</table>\n";}'
<table border="1">
<tr><td>x</td><td>y</td><td>z</td></tr>
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
</table>
All we're doing here is embedding each line in a table row (tr) tag, and each field in a table data (td) tag, plus printing table tags at the beginning and end of the file.

Somewhere off the internet, I stole this neat perl regex cheat sheet:

Syntax    Equivalent Syntax    What It Represents
\d        [0-9]                Any digit
\D        [^0-9]               Any character not a digit
\w        [0-9a-zA-Z_]         Any "word character"
\W        [^0-9a-zA-Z_]        Any character not a word character
\s        [ \t\n\r\f]          Whitespace (space, tab, newline, carriage return, form feed)
\S        [^ \t\n\r\f]         Any non-whitespace character
.                              Any character except newline
If you think of the above as nouns, you can think of the following as adjectives:

Quantifiers, etc.    What It Means
*                    Match 0 or more times
+                    Match 1 or more times
?                    Match 1 or 0 times
{n}                  Match exactly n times
{n,}                 Match at least n times
{n,m}                Match at least n but not more than m times
^                    Match at the beginning of a line
$                    Match at the end of a line
Read more about perl regex here. To get a feeling for how to use these, let's take an example file such that:
$ cat test.txt
889
tttxc234
wer1
CAT
asfwaffffffff2342525
Every line follows the pattern "a string of non-digits followed by a string of digits" except for 889, which is just digits, and CAT, which is just non-digits. Look at the following:
$ cat test.txt | perl -ne '{chomp($_); if ($_ =~ m/(\d*)/) {print $_,"\n";}}'
889
tttxc234
wer1
CAT
asfwaffffffff2342525
In Perl, the syntax:
some value =~ m/regular expression/
tests for a match against a regular expression. Referring to our cheat sheet, the above command prints every row because every row has at least 0 digits. Let's change the asterisk to a plus sign:
$ cat test.txt | perl -ne '{chomp($_); if ($_ =~ m/(\d+)/) {print $_,"\n";}}'
889
tttxc234
wer1
asfwaffffffff2342525
This prints everything with at least 1 digit, which is every row except CAT. Let's invert this and print the rows with non-digits:
$ cat test.txt | perl -ne '{chomp($_); if ($_ =~ m/(\D+)/) {print $_,"\n";}}'
tttxc234
wer1
CAT
asfwaffffffff2342525
This prints everything with at least 1 non-digit, which is every row except 889. We can try to match a more specific pattern:
$ cat test.txt | perl -ne '{chomp($_); if ($_ =~ m/(\d+)(\D+)/) {print $_,"\n";}}'
$
This prints everything with the pattern at least 1 digit, at least 1 non-digit, which no rows follow. What about this?
$ cat test.txt | perl -ne '{chomp($_); if ($_ =~ m/(\D+)(\d+)/) {print $_,"\n";}}'
tttxc234
wer1
asfwaffffffff2342525
It prints everything with the pattern at least 1 non-digit, at least 1 digit, which three rows follow.

We can also grab pieces of our regular expression as follows:
$ cat test.txt | perl -ne '{chomp($_); if ($_ =~ m/(\D+)(\d+)/) {print $1,"\n";}}'
tttxc
wer
asfwaffffffff
$ cat test.txt | perl -ne '{chomp($_); if ($_ =~ m/(\D+)(\d+)/) {print $2,"\n";}}'
234
1
2342525
$1 refers to the piece in the first ( ), $2 the second, and so on.
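Inside a script, the captured pieces can also be collected directly: in list context, a successful match returns the captured groups, and a failed match returns an empty list. A minimal sketch (split_letters_digits is a made-up helper; unlike the one-liners above, the pattern here is anchored with ^ and $ so the whole string must fit):

```perl
use strict;
use warnings;

# hypothetical helper: split a string like "tttxc234" into its two parts
sub split_letters_digits
{
    my ($str) = @_;
    # list assignment: $letters gets group 1, $digits gets group 2,
    # and both stay undefined if the match fails
    my ($letters, $digits) = $str =~ m/^(\D+)(\d+)$/;
    return ($letters, $digits);
}

foreach my $str ("889", "tttxc234", "wer1", "CAT")
{
    my ($letters, $digits) = split_letters_digits($str);
    if (defined $letters)
    {
        print "$str: $letters $digits\n";
    }
    else
    {
        print "$str: no match\n";
    }
}
# prints:
# 889: no match
# tttxc234: tttxc 234
# wer1: wer 1
# CAT: no match
```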

Let's take another example, the one with emails. You have a file such that:
$ cat mail.txt
xd2@joe.com
malformed.hotmail.com
malformed@@hotmail.com
carlos_danger@gmail.com
hellokitty@yahoo.com
Then to get the strings that are appropriately formatted as emails, we could do the following:
$ cat mail.txt | perl -ne '{chomp($_); if ($_ =~ m/(\w+)@(\w+)/) {print $_,"\n";}}'
xd2@joe.com
carlos_danger@gmail.com
hellokitty@yahoo.com
$ cat mail.txt | perl -ne '{chomp($_); if ($_ =~ m/(\w+)\@{1}(\w+)/) {print $_,"\n";}}'
xd2@joe.com
carlos_danger@gmail.com
hellokitty@yahoo.com
These do the same thing, but we're being a little more explicit in the second case. Escaping the @ sign with a backslash isn't a bad idea because in Perl @ can denote an array. If we wanted to grab lines with two @s, the syntax would be:
$ cat mail.txt | perl -ne '{chomp($_); if ($_ =~ m/(\w+)\@{2}(\w+)/) {print $_,"\n";}}'
malformed@@hotmail.com
Question: what would this do?
$ cat mail.txt | perl -ne '{chomp($_); if ( $_ =~ m/(\w+)(\@+)(\w+)/) {print $2,"\n";}}'
Answer:
$ cat mail.txt | perl -ne '{chomp($_); if ( $_ =~ m/(\w+)(\@+)(\w+)/) {print $2,"\n";}}'
@
@@
@
@

More Examples of Command Line Perl

Split a file on tab and spit out the first three columns:
$ cat file.txt | 
 perl -ne '{chomp($_); @a = split(/\t/, $_); print $a[0]."\t".$a[1]."\t".$a[2]."\n";}'
Print only rows which begin with a number:
$ cat file.txt | perl -ne '{if ($_ =~ m/^\d/) {print $_;}}'
Replace white space (any combination of spaces and tabs) with tabs:
$ cat file.txt | perl -ne '{chomp($_); s/[ \t]+/\t/g; print $_,"\n"}'
Remove pure empty lines from a file:
$ cat file.txt | perl -ne '{print if not m/^$/}' 	    
Remove empty lines with white space:
$ cat file.txt | perl -ne '{print if not m/^(\s*)$/}' 
Filter file such that there are only unique entries in the first column:
$ cat file.txt | perl -ne '{my @a=split; if (!($h{$a[0]})) {print $_;}; $h{$a[0]}=1;}'
Count the number of occurrences of each element in the first column of a file:
$ cat file.txt | perl -ne '{my @a=split; $h{$a[0]}++;}END{foreach my $key ( sort keys %h ) {print "$key\t$h{$key}\n";}}'
Here's an example of regex substitution. Removing leading and trailing Ns from a sequence (bioinformatics):
$ echo NNNNACTGAAANNNNNN | perl -ne '{chomp($_); $line=$_; $line =~ s/^(N+)//; $line =~ s/(N+)$//; print $line,"\n"}'
ACTGAAA
Or, using capture groups (this version assumes Ns on both ends and only A, C, T, or G in between):
$ echo NNNNACTGAAANNNNNN | perl -ne '{chomp($_); $line=$_; $line =~ /^(N+)([ACTG]*)(N+)$/; print $2,"\n"}'
ACTGAAA

Arguments to Your Script

Much like C, a Perl script stores its arguments in the array @ARGV. You might see any of the following at the beginning of a script which expects two arguments. They all print the same thing:
my ($arg1, $arg2) = @ARGV;
print "$arg1 $arg2\n";
or:
my $arg1 = shift;
my $arg2 = shift;
print "$arg1 $arg2\n";
or:
print "$ARGV[0] $ARGV[1]\n";
Note that shift removes and returns the first element of an array (pop does the same for the last element), so the second form also empties @ARGV as it goes. Outside of a subroutine, we don't have to explicitly state what this array is: Perl knows we're referring to @ARGV.

For more advanced argument handling, Perl has modules like Getopt::Long. Here's a bit of random script I wrote using it:
use Getopt::Long;

my ($help, $infile, $outputdir, $instances, $prefix, $count);

GetOptions (    'help' => \$help,                     # bool_help
                'inp=s' => \$infile,                  # input file
                'outputdir=s' => \$outputdir,         # output directory
                'inst=i' => \$instances,              # number of instances
                'prefix=s' => \$prefix,               # file prefix
                'count=i' => \$count );               # count
The =s syntax denotes a string is expected; =i is for an integer; and the default is boolean. So inp is a flag whose value will be stored in the variable $infile. You can imagine calling this script as:
$ ./myscript.pl --inp myfile --outputdir /my/path --inst 4 --prefix tmp --count 1
If you're writing a script with many options, Getopt is infinitely superior to positional arguments.
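For experimenting without touching the real command line, Getopt::Long also exports GetOptionsFromArray on request, which parses any array you hand it. The sketch below wraps it in a made-up subroutine parse_args and stores the options in a hash; the option names are borrowed from the example above, and the default of 1 for inst is an assumption for illustration:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# hypothetical helper: parse a list of arguments into a hash of options
sub parse_args
{
    my @args = @_;
    my %opt = (inst => 1);    # assumed default, used if --inst is absent

    GetOptionsFromArray(\@args,
        'help'        => \$opt{help},        # boolean flag
        'inp=s'       => \$opt{inp},         # string
        'outputdir=s' => \$opt{outputdir},   # string
        'inst=i'      => \$opt{inst},        # integer
    ) or die "Error in command line arguments\n";

    return \%opt;
}

my $opt = parse_args('--inp', 'myfile', '--outputdir', '/my/path', '--inst', 4);
print "$opt->{inp} $opt->{outputdir} $opt->{inst}\n";   # myfile /my/path 4
```

In a real script you would normally call GetOptions as shown earlier, which reads @ARGV directly.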

System Calls in Perl

System commands are easy in Perl. Just use system:
my $cmd="ls -hl";
print $cmd."\n";
system($cmd);  # execute command
What if we want to store the output of the system command in a variable? In that case, use backticks à la bash:
my $output=`ls -hl`;	# execute command
print $output;		# print output		
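Neither snippet above checks whether the command succeeded. After system or backticks, Perl leaves the raw status in the special variable $?, with the command's exit code in its high byte. Here is a sketch under the assumption of a Unix-like system where echo and ls exist (run_capture is a made-up helper):

```perl
use strict;
use warnings;

# hypothetical helper: run a command, return its output and exit code
sub run_capture
{
    my ($cmd) = @_;
    my $output = `$cmd`;         # backticks capture standard output
    my $exit_code = $? >> 8;     # meaningful when the command actually started
    return ($output, $exit_code);
}

my ($output, $exit_code) = run_capture("echo hello");
print "output: $output";
print "exit code: $exit_code\n";

# system returns the same raw status, so 0 means success;
# passing the command as a list bypasses the shell
system("ls", "-hl") == 0 or die "ls failed: $?\n";
```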

Perl Subroutines (Functions)

Perl functions are called subroutines. As we remarked above, whatever arguments you pass to the function are accessible in the array @_, which is filled in automatically each time the function is called. Let's write the simplest subroutine ever:
sub addone
{
	my $arg1 = shift;
	# adds one to input
	return $arg1 + 1
}
So:
print addone(4),"\n";
prints 5 as expected.

Liberally make use of subroutines according to the DRY (Don't repeat yourself) principle. If you find yourself duplicating code to do the same thing, stop and instead package it into a re-usable function.
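Subroutines can take any number of arguments and can return a list as well as a single value. A minimal sketch with a made-up subroutine min_max:

```perl
use strict;
use warnings;

# hypothetical helper: return the smallest and largest of the numbers given
sub min_max
{
    my @numbers = @_;            # copy all the arguments out of @_
    my ($min, $max) = ($numbers[0], $numbers[0]);
    foreach my $n (@numbers)
    {
        $min = $n if $n < $min;
        $max = $n if $n > $max;
    }
    return ($min, $max);         # a subroutine can return a list
}

my ($min, $max) = min_max(4, 10, -2, 7);
print "min: $min, max: $max\n";   # min: -2, max: 10
```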

Making Your Own Perl Modules, @INC

If you're writing many subroutines, it becomes convenient to package them into a module. While a subroutine can only be used within a particular script, a module can be used across many scripts by importing it as:
use My_Module;
Let's create My_Module.pm:
package My_Module;

use strict; 

sub addtwo
{
	my $arg1 = shift;
	# adds two to input
	return $arg1 + 2
}

1;
One strange feature of this is the 1 at the end. This source explains:
When a module is loaded (via use) the compiler will complain unless the last statement executed when it is loaded is true. This line ensures that this is the case (as long as you don't place any code after this line). It's Perl's way of making sure that it successfully parsed all the way to the end of the file.
We can include the new module in our main script as:
#!/usr/bin/env perl

use strict; 
use My_Module;

print My_Module::addtwo(4),"\n";
which returns 6, as expected. Sometimes you'll see the syntax:
print &My_Module::addtwo(4),"\n";
where the ampersand reminds us that addtwo is a subroutine, but I'd steer clear of the unnecessary verbiage. In this case the .pm file, our module, was in the current working directory so Perl found it. But will it always find it? No, if we cd somewhere else and try to run the script from a different directory, where My_Module.pm doesn't reside, we'll get the error:
Can't locate My_Module.pm in @INC 
How do we see the paths in which Perl is looking for modules? The special array, @INC, shows us the paths included in the search space:
foreach (@INC)
{
        print $_."\n";
}
This might output stuff like:
/usr/local/lib64/perl5
/usr/share/perl5
...
If we want to add our own path to this list, one easy way to do it is by setting the PERL5LIB variable in the shell:
$ export PERL5LIB=/path/to/homemade/modules:$PERL5LIB
Now if we look at the contents of @INC again, it's been updated to:
/path/to/homemade/modules
/usr/local/lib64/perl5
/usr/share/perl5
...
and we can run the script from any directory we like.
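Another common approach is to add the path from inside the script with the lib pragma, often together with the core FindBin module so the path is relative to the script's own location. In this sketch the subdirectory name lib is an assumption; substitute wherever your .pm files live:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use FindBin qw($Bin);    # $Bin is the directory containing this script
use lib "$Bin/lib";      # "lib" is a hypothetical subdirectory holding .pm files

# the new path is now part of the module search space
foreach (@INC)
{
    print $_, "\n";
}
```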

Getting Stuff from CPAN

CPAN is an online repository of over 100,000 open source, user-generated Perl modules. To get stuff from CPAN you can go there directly or use cpanminus, "a script to get, unpack, build and install modules from CPAN." For instance, to get HTML-Tree from the command line:
$ curl -L http://cpanmin.us | perl - HTML::TreeBuilder

Using Perl's map Function

map is a Perl tool which applies some function to the elements of an array. For example:
#!/usr/bin/env perl

use strict;
use warnings;

my @data = (1,2,3);
foreach(@data) 
{
        print $_."\n";
}

print "\n";

my @data2 = map { $_ * 2 } @data;
foreach(@data2) 
{
        print $_."\n";
}
The output is:
1
2
3

2
4
6
map can also create hashes. As the docs say, it "returns a list, which can be assigned to a hash such that the elements become key/value pairs." For example:
#!/usr/bin/env perl

use strict; 
use warnings; 
use Data::Dumper;

my @a = (1,2,3,4);

my %h = map{$_ => 1} @a;

print Dumper \%h;
Produces:
$VAR1 = {
          '4' => 1,
          '1' => 1,
          '3' => 1,
          '2' => 1
        };
The following example shows how to take the union of the keys of two hashes:
# get union of keys of %h1 and %h2
my @h1keys = (keys %h1);
my @h2keys = (keys %h2);	
my @uniq_keys_array = keys %{{map {$_ => 1} (@h1keys, @h2keys)}};
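Here is the same union idea as a complete script that can be run as is. The two small hashes are made up for illustration, and the intermediate hash %seen is spelled out rather than built anonymously:

```perl
use strict;
use warnings;

my %h1 = (a => 1, b => 2);
my %h2 = (b => 3, c => 4);

# build a throwaway hash whose keys are all the keys of both hashes;
# duplicate keys collapse, which is what makes this a union
my %seen = map { $_ => 1 } (keys %h1, keys %h2);
my @uniq_keys_array = sort keys %seen;

print join(",", @uniq_keys_array), "\n";   # a,b,c
```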

Referencing and De-referencing in Perl

Just as in C, we might want to refer to an object's address in memory rather than the object itself. In Perl, we can get a reference to an object using a backslash:
$my_scalar_ref = \$my_scalar;
$my_array_ref = \@my_array;
$my_hash_ref = \%my_hash;
To de-reference (return the object from the reference):
$my_sca = $$my_scalar_ref;
@my_arr = @$my_array_ref;
%my_h = %$my_hash_ref;
Here's an example of looping through a hash of a hash:
# loop through outer keys
foreach my $key ( sort keys %h )
{
	print "key: $key \n";

	# get the inner hash
	my %h2 = %{$h{$key}};
	
	# loop through inner keys
	foreach my $key2 ( keys %h2 )
	{
		print "key2: $key2, value: $h2{$key2} \n";		
	}
}
What's going on here is that we have a primary hash, %h, which maps keys to values (as all hashes do), but the values are references to another hash:
key --> hash_reference
So, to get the actual hash this reference points to—i.e., to de-reference—we use:
my %h2 = %{$h{$key}};
And now it's business as usual with the hash %h2.
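Copying the inner hash into %h2 is not required. As a sketch, the arrow operator can reach into the inner hash in place; the data below is made up for illustration:

```perl
use strict;
use warnings;

# a hash whose values are references to anonymous hashes
my %h = (
    joe  => { age => 30, city => "Boston" },
    jane => { age => 25, city => "Denver" },
);

foreach my $key (sort keys %h)
{
    # %{ ... } de-references just long enough to list the inner keys
    foreach my $key2 (sort keys %{ $h{$key} })
    {
        # the arrow follows the reference without copying the inner hash
        print "$key\t$key2\t$h{$key}->{$key2}\n";
    }
}
# prints:
# jane	age	25
# jane	city	Denver
# joe	age	30
# joe	city	Boston
```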

Making a Help Section for Your Script with Here Documents

Make a help section for your script using Here Documents:
my $usage = <<_EOUSAGE_;

###########################################################################
#
#  About: This script does this ... 
#
#  Usage example: $0 ...
#
###########################################################################

_EOUSAGE_

my $arg1 = $ARGV[0];

# test for "no arguments" first, so $arg1 is never compared while undefined
if (scalar(@ARGV) == 0 or $arg1 eq "-h" or $arg1 eq "--h" or $arg1 eq "-help" or $arg1 eq "--help")
{
        print $usage;
        exit;
}

Example with Hashes: Loop through a Fasta File and Store IDs (Bioinformatics)

For this example, you need to know that a fasta file is one type of file format in which sequencing data is stored in bioinformatics. In fasta format, every sequence has an ID line, which begins with >, followed by the sequence itself, which is allowed to span multiple lines. For DNA, the sequence is composed of the letters (bases) A C T G. A fasta file of two genes could look like this:
$ cat myfasta.fa
>GeneA
ATGCTGAAAGGTCGTAGGATTCGTAG
>GeneB
ATGAACGTAA
The following subroutine simply returns a hash of the first words in the fasta IDs mapped to 1. For example, if a particular ID is:
>c3_1 [3 - 89]
then I want to put c3_1 into the hash. The point is to illustrate file I/O, regex, and hash operations.
# return reference to a hash mapping the first word of each fasta ID to "1"
sub fastaid_firstword_hash
{
        # arg 1 - file name
        my $infile = shift;
        my %h = ();             # empty hash

        if ( -s $infile )       # if file exists and is non-empty
        {
                open(my $fh, '<', $infile) or die "Cannot read $infile: $!";
                while (<$fh>)
                {
                        chomp $_;
                        # header looks like >c3_1 [3 - 89] or just >GeneA
                        if ($_ =~ m/^>(\S+)/)
                        {
                                # don't keep the leading ">", just get first word
                                my $key = $1;
                                # hash key is fasta id
                                $h{$key} = 1;
                        }
                }
                close($fh);
        }
        return \%h;
}
Of course, we could do the same thing in a single line with map:
my %h_fasta = map { /^>(\S+)/ ? ($1 => 1) : () } split(/\n/, `cat $infile`);
How does this work? split returns an array. Because we're splitting the text returned from the system command:
`cat file`
on a newline, we get an array comprised of all the rows of the file. This array gets passed to map which then parses each element of the array according to the regex and returns a hash of the first matched part of the regex—whatever's after the > character and before whitespace—to 1.
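A natural next step is to keep the sequences as well as the IDs. The sketch below maps each ID to its full sequence, joining sequences that span several lines. parse_fasta is a made-up helper that takes the lines as arguments instead of a file name, and the multi-line GeneA record is the example sequence from above split in two:

```perl
use strict;
use warnings;

# hypothetical helper: map each fasta ID (first word of the header) to its sequence
sub parse_fasta
{
    my @lines = @_;
    my %seq = ();
    my $id;                               # ID of the record being read
    foreach my $line (@lines)
    {
        chomp($line);
        if ($line =~ m/^>(\S+)/)
        {
            $id = $1;                     # start a new record
            $seq{$id} = "";
        }
        elsif (defined $id)
        {
            $seq{$id} .= $line;           # sequences may span several lines
        }
    }
    return \%seq;
}

my $seq = parse_fasta(">GeneA\n", "ATGCTGAAAGG\n", "TCGTAGGATTCGTAG\n",
                      ">GeneB desc\n", "ATGAACGTAA\n");
print "$_\t$seq->{$_}\n" foreach sort keys %$seq;
# prints:
# GeneA	ATGCTGAAAGGTCGTAGGATTCGTAG
# GeneB	ATGAACGTAA
```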

Example: Sorting on Some Field within a Complicated Pattern

Here's a Schwartzian transform example, via my friend Albert. The problem is as follows. We have some complicated pattern like:
L, Albert, 2
E, Oliver, 3
K, Hossein, 1
and we want to sort this alphabetically on the names in the second field. That is, we want the output to be:
L, Albert, 2
K, Hossein, 1
E, Oliver, 3
Note the positions of Hossein and Oliver have been swapped. Here's how to do it:
#!/usr/bin/env perl

use strict;
use warnings;

my @data = (
"L, Albert, 2",
"E, Oliver, 3",
"K, Hossein, 1"
);

foreach(@data) 
{
        chomp;
        print $_."\n";
}

# start at the bottom and "pipe" backwards
my @sorted = map { $_->[0] }
             sort {  $a->[1] cmp $b->[1] }
             map { [ $_, /.*?,(.*),.*?$/ ] } @data;

print '------------'."\n";

foreach(@sorted)
{
        chomp;
        print $_."\n";
}
Let's break this down. If we change the hard part to:
my @sorted = 	map { [ $_, /.*?,(.*),.*?$/ ] } @data;

foreach my $elt (@sorted)
{
        print $elt."\n";                        # $elt is a reference, so this prints ARRAY(0x...)
        print "element 1 ".${$elt}[0]."\n";     # first element of the de-referenced array
        print "element 2 ".${$elt}[1]."\n";     # second element
}
We get:
ARRAY(0x2444878)
element 1 L, Albert, 2
element 2  Albert
ARRAY(0x2460ea0)
element 1 E, Oliver, 3
element 2  Oliver
ARRAY(0x2460e58)
element 1 K, Hossein, 1
element 2  Hossein
We've used the map function to create a new array of anonymous arrays:
[anonymous array 1, anon array 2, anon array 3]
Once we de-reference the anonymous arrays:
@{$elt}
we can access their elements. These arrays map each whole line to the second field—the name field:
[whole line, name]
according to the regular expression we provided in the map function. If we add a line:
my @sorted = 	sort {  $a->[1] cmp $b->[1] }
		map { [ $_, /.*?,(.*),.*?$/ ] } @data;
we're now sorting based on the name field—element 1 in 0-based counting. Note that you read this starting at the bottom and "piping" backwards, so the output of map gets passed to sort. Finally, we add another line:
my @sorted =	map { $_->[0] }	
		sort {  $a->[1] cmp $b->[1] }
		map { [ $_, /.*?,(.*),.*?$/ ] } @data;
to re-grab the whole line—element 0 in 0-based counting—after sorting. Mission accomplished!
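Why bother with the transform at all? A plain sort can extract the name inside the comparison block, but then the extraction runs again on every comparison, whereas the transform runs it once per line. The sketch below puts both side by side on the same data; name_field is a made-up helper wrapping the regular expression used above:

```perl
use strict;
use warnings;

# hypothetical helper: pull the name (second field) out of a line
sub name_field
{
    my ($line) = @_;
    my ($name) = $line =~ /.*?,(.*),.*?$/;
    return $name;
}

my @data = ("L, Albert, 2", "E, Oliver, 3", "K, Hossein, 1");

# plain sort: simple, but name_field runs again on every comparison
my @sorted_plain = sort { name_field($a) cmp name_field($b) } @data;

# Schwartzian transform: name_field runs once per line
my @sorted_st = map  { $_->[0] }
                sort { $a->[1] cmp $b->[1] }
                map  { [ $_, name_field($_) ] } @data;

print "$_\n" foreach @sorted_st;
# prints:
# L, Albert, 2
# K, Hossein, 1
# E, Oliver, 3
```

For three lines the difference is negligible; the transform starts to pay off when the list is long or the extraction is expensive.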