handyfloss

Because FLOSS is handy, isn’t it?

Posts Tagged ‘howto’

Summary of my Python optimization adventures

Posted by isilanes on February 17, 2008

Blog moved to: handyfloss.net

Entry available at: http://handyfloss.net/2008.02/summary-of-my-python-optimization-adventures/

This is a follow-up to two previous posts. In the first one I spoke about saving memory by reading a file line by line instead of all at once, and in the second one I recommended using Unix commands.

The script reads a host.gz log file from a given BOINC project (more precisely one I got from MalariaControl.net, because it is a small project, so its logs are also smaller), and extracts how many computers are running the project, and how much credit they are getting. The statistics are separated by operating system (Windows, Linux, MacOS and other).

Version 0

Here I read the whole file into RAM, then process it with Python alone. Running time: 34.1s.

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]
  
# Process file:
f = gzip.open('host.gz','r')
for line in f.readlines():
  if re.search('total_credit',line):
    # Strip the opening and closing <total_credit> tags, leaving only
    # the number between them
    credit = float(re.sub('</?total_credit>','',line.split()[0]))
  elif re.search('os_name',line):
    if re.search('Windows',line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif re.search('Linux',line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif re.search('Darwin',line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# Return output:
nstring = ''
cstring = ''
for osy in os_list:
  nstring +=   "%15.0f " % (stat[osy][0])
  try:
    cstring += "%15.0f " % (stat[osy][1])
  except:
    print osy,stat[osy]

print nstring
print cstring

Version 1

The only difference is a “for line in f:” instead of “for line in f.readlines():”. This saves a LOT of memory, because only one line is held at a time instead of the whole decompressed file, but it is slower. Running time: 44.3s.
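A minimal sketch of the two reading styles, written for Python 3 rather than the Python 2 of the scripts in this post. The function names and the tiny sample file are made up for illustration; they are not part of the original script.

```python
import gzip
import os
import tempfile


def count_lines_all_at_once(path):
    """Version 0 style: readlines() builds a list with every line."""
    with gzip.open(path, "rt") as f:
        lines = f.readlines()  # the whole decompressed file sits in RAM
    return len(lines)


def count_lines_streaming(path):
    """Version 1 style: iterate over the file object itself."""
    count = 0
    with gzip.open(path, "rt") as f:
        for _line in f:  # only the current line is kept around
            count += 1
    return count


def write_sample(path, n_hosts=3):
    """Write a tiny stand-in for host.gz (hypothetical content)."""
    with gzip.open(path, "wt") as f:
        for i in range(n_hosts):
            f.write("    <total_credit>%d.5</total_credit>\n" % i)
            f.write("    <os_name>Linux</os_name>\n")


if __name__ == "__main__":
    sample = os.path.join(tempfile.mkdtemp(), "sample.gz")
    write_sample(sample)
    print(count_lines_all_at_once(sample), count_lines_streaming(sample))
```

Both functions see the same lines; the difference is only in how much of the file is alive in memory at any moment.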

Version 2

In this version, I use precompiled regular expressions, and the time saving is noticeable. Running time: 26.2s.

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]


# Compile each pattern once and keep the bound method for the loop
pattern    = r'total_credit'
match_cre  = re.compile(pattern).match
pattern    = r'os_name'
match_os   = re.compile(pattern).match
pattern    = r'Windows'
search_win = re.compile(pattern).search
pattern    = r'Linux'
search_lin = re.compile(pattern).search
pattern    = r'Darwin'
search_dar = re.compile(pattern).search

# Process file:
f = gzip.open('host.gz','r')

for line in f:
  if match_cre(line,5):
    # Strip the opening and closing <total_credit> tags, leaving only
    # the number between them
    credit = float(re.sub('</?total_credit>','',line.split()[0]))
  elif match_os(line,5):
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# etc.

Version 3

Later I decided to use AWK to perform the heaviest part: parsing the big file, to produce a second, smaller, file that Python will read. Running time: 14.8s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]
  
pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search

# Distill file with AWK. printf omits the newline, so each total_credit
# line is joined with the os_name line that follows it:
tmp = 'bhs.tmp'
os.system('zcat host.gz | awk \'/total_credit/{printf $0}/os_name/{print}\' > '+tmp)

# Process tmp file:
f = open(tmp)
for line in f:
  line = re.sub('>','<',line)
  aline = line.split('<')
  credit = float(aline[2])
  os_str = aline[6]
  if search_win(os_str):
    stat['win'][0] += 1
    stat['win'][1] += credit
  elif search_lin(os_str):
    stat['lin'][0] += 1
    stat['lin'][1] += credit
  elif search_dar(os_str):
    stat['dar'][0] += 1
    stat['dar'][1] += credit
  else:
    stat['oth'][0] += 1
    stat['oth'][1] += credit
f.close()

# etc

Version 4

Instead of using AWK, I decided to use grep, with the idea that nothing can beat this tool, when it comes to pattern matching. I was not disappointed. Running time: 5.4s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]
  
pattern    = r'total_credit'
search_cre = re.compile(pattern).search

pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search

# Distill file with grep:
tmp = 'bhs.tmp'
os.system('zcat host.gz | grep -e total_credit -e os_name > '+tmp)

# Process tmp file:
f = open(tmp)
for line in f:
  if search_cre(line):
    line = re.sub('>','<',line)
    aline = line.split('<')
    credit = float(aline[2])
  else:
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit

f.close()

# etc

Version 5

I was not completely happy yet. Reading the grep man page I discovered the -F flag, and decided to use it. This flag tells grep to treat each pattern as a fixed (literal) string rather than a regular expression, so no regular expression machinery is needed to match it. Using the -F flag I further reduced the running time to 1.5s.
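The code for Version 5 is not shown in the post, so the following is only a sketch of how the Version 4 pipeline might look with -F added, in Python 3. The function names are invented here; the file names host.gz and bhs.tmp come from the earlier versions, and the command assumes zcat and grep are on the PATH.

```python
import shlex
import subprocess


def build_filter_command(gz_path, tmp_path, fixed_strings=True):
    """Return the shell pipeline that distills the log into tmp_path.

    With fixed_strings=True grep gets -F, so both patterns are matched
    as literal strings instead of regular expressions.
    """
    flag = "-F " if fixed_strings else ""
    return "zcat %s | grep %s-e total_credit -e os_name > %s" % (
        shlex.quote(gz_path), flag, shlex.quote(tmp_path))


def distill(gz_path="host.gz", tmp_path="bhs.tmp"):
    """Run the pipeline through the shell, as os.system did before."""
    return subprocess.call(build_filter_command(gz_path, tmp_path), shell=True)


if __name__ == "__main__":
    # Only print the command; call distill() where host.gz actually exists.
    print(build_filter_command("host.gz", "bhs.tmp"))
```

The rest of the script (reading bhs.tmp and updating the counters) would stay as in Version 4.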

[Figure: running time vs. script version (time_vs_version.png)]

Posted in howto

Speeding up file processing with Unix commands

Posted by isilanes on February 17, 2008

Entry available at: http://handyfloss.net/2008.02/speeding-up-file-processing-with-unix-commands/

In my last post I commented on some changes I made to a Python script that processes a file, to reduce the memory overhead of reading the whole file into RAM.

I realized that the script needed much optimizing, so I read the link a reader (Paddy3118) was kind enough to point me to, and learned that I could save time by compiling my search expressions. Basically my script opens a gzipped file, searches for lines containing some keywords, and uses the info read from those lines. The original script would take 44 seconds to process a 6.9 MB file (49 MB uncompressed). Using compile on the search expressions, this time went down to 29 s. I also tried using match instead of search, and expressions like “if pattern in line:” instead of re.search(), but these didn’t make much of a difference.
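The alternatives mentioned above can be put side by side. This is a Python 3 sketch with a made-up sample line; the function names are mine, and the offset 5 assumes (as the script in the summary post does) that the tag name starts at the sixth character of the line.

```python
import re
import timeit

LINE = "    <os_name>Microsoft Windows XP</os_name>\n"  # made-up sample


def is_os_line_search(line):
    """Module-level re.search: the pattern is looked up on every call."""
    return re.search("os_name", line) is not None


_search_os = re.compile("os_name").search


def is_os_line_compiled(line):
    """Precompiled pattern, bound search method stored once."""
    return _search_os(line) is not None


_match_os = re.compile("os_name").match


def is_os_line_match(line, pos=5):
    """match() only tries the given position instead of scanning."""
    return _match_os(line, pos) is not None


def is_os_line_in(line):
    """Plain substring test, no regular expressions involved."""
    return "os_name" in line


if __name__ == "__main__":
    for func in (is_os_line_search, is_os_line_compiled,
                 is_os_line_match, is_os_line_in):
        seconds = timeit.timeit(lambda: func(LINE), number=100000)
        print("%-22s %.3f s" % (func.__name__, seconds))
```

All four give the same answer for a given line; the timings depend on the machine and the Python version, so measure on your own data before picking one.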

Later I thought that Unix commands such as grep were especially suited for the task, so I gave them a try. I modified my script to run in two steps: in the first one I used zcat and awk (called from within the script) to create a much smaller temporary file with only the lines containing the information I wanted. In a second step, I would process this file with standard Python code. This hybrid approach reduced the processing time to just 12 s. Sometimes using the best tool really makes a difference, and it seems that the Unix utilities are hard to match in terms of performance.

It is only after programming exercises like this one that one realizes how important writing good code is (something I will probably never do, but I try). For some reason I always think of Windows, and how Microsoft refuses to make an efficient program, relying on improvements in the hardware instead. It’s as if I tried to speed up my first script using a faster computer, instead of fixing the code to be more efficient.

Posted in howto

IMAP access to GMAIL with KMail

Posted by isilanes on February 3, 2008

I recently discovered that Gmail offers IMAP access to the service. I must admit that I have never used IMAP, but it is a very good idea for simplifying access to one’s account from anywhere, and having your e-mail always up to date on any number of computers. You can think of IMAP as all the good things of POP3 (custom UI, great flexibility) and web-mail (central repository of messages) together, without their drawbacks.

Although I think Google is an evil company that wants to take the world over, I have surrendered to their superb e-mail service, Gmail, with its huge inbox and fast and reliable access. I was happy with POP3, go figure with IMAP…

Of course, I have had to configure my e-mail client, KMail, to use IMAP. For that, I have followed the instructions, e.g., in linux.wordpress.org.

First, you have to allow IMAP connection to Gmail. For that, you just need to go to Settings in your Gmail account, then Forwarding and POP/IMAP, and Enable IMAP (I think it’s on by default).

Second, create an IMAP account in KMail: Settings -> Configure KMail -> Accounts -> Add -> IMAP. You will be prompted for some info:

  • Account name: anything to let you identify it.
  • Login: your full Gmail address.
  • Host: imap.gmail.com
  • Port: 993

Small trick: the default Trash folder is “Local Folders/trash”. If you keep this, when you “delete” a message from the IMAP account, it will be moved to the “General” KMail trash. The problem is that it means moving the message outside the IMAP tree, and I have found that the IMAP mechanism (probably as a security measure) keeps a copy of the message in the original location (i.e., it is actually not erased). To avoid that, you can put something like “Gmail IMAP/[Gmail]/Trash” as Trash folder, and make the deleted message be moved to the Trash inside the IMAP folder. There, it is deleted exactly as if you access your Gmail account from the web and click on “Delete”.

Third, in the Security tab of the dialog window we have just filled in, choose “Use SSL for secure mail download” in Encryption and “Clear Text” in Authentication method.
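The same settings (host imap.gmail.com, port 993, SSL, plain login) can be tried outside KMail with Python's standard imaplib module. This is only a sketch: the function names are mine, the address and password are placeholders, and Gmail may nowadays insist on an app-specific password or OAuth instead of the account password.

```python
import imaplib

GMAIL_IMAP_HOST = "imap.gmail.com"
GMAIL_IMAP_PORT = 993  # IMAP over SSL


def open_gmail_imap(address, password):
    """Open an SSL connection and log in with the full Gmail address."""
    conn = imaplib.IMAP4_SSL(GMAIL_IMAP_HOST, GMAIL_IMAP_PORT)
    conn.login(address, password)  # "Clear Text" login inside the SSL tunnel
    return conn


def list_folders(conn):
    """Return the raw folder listing, e.g. to locate the Trash folder."""
    status, data = conn.list()
    return status, data


if __name__ == "__main__":
    # Placeholders: replace with real credentials before running.
    connection = open_gmail_imap("someone@gmail.com", "not-a-real-password")
    print(list_folders(connection))
    connection.logout()
```

If the login succeeds here, the same values should work in the KMail dialog.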

That’s it, you’re done!

So far I have only used IMAP at home (lousy 300 kb connection), and I think it is a bit on the slow side of the scale, but except for that, I am starting to love IMAP.

Posted in Evil software

Flash: better without Flash

Posted by isilanes on January 6, 2008

Remember my previous post about a problem with Flash in Firefox/Iceweasel? Now the second part.

After following my own instructions, I ended up with a Flash installation that could play YouTube videos correctly, but some other Flash animations would not work. By chance, my computer at work would play any Flash animation just fine, so… why would that be?

To find out the reason, I have compared what Flash-related packages I have installed in Homer (my computer at work) and Heracles (the one at home). The result is quite surprising:


Homer[~]: aptitude search flash
p   flashplayer-mozilla       - Macromedia Flash Player
p   flashrom                  - Universal flash programming utility
p   flashybrid                - automates use of a flash disk as the root filesystem
p   libflash-dev              - GPL Flash (SWF) Library - development files
p   libflash-mozplugin        - GPL Flash (SWF) Library - Mozilla-compatible plugin
p   libflash-swfplayer        - GPL Flash (SWF) Library - stand-alone player
p   libflash0c2               - GPL Flash (SWF) Library - shared library
p   libroxen-flash2           - Flash2 module for the Roxen Challenger web server
p   m16c-flash                - Flash programmer for Renesas M16C and R8C microcontrollers
p   vrflash                   - tool to flash kernels and romdisks to Agenda VR
Homer[~]: aptitude search swf
p   libflash-swfplayer        - GPL Flash (SWF) Library - stand-alone player
p   libswf-perl               - Ming (SWF) module for Perl
p   libswfdec-0.5-4           - SWF (Macromedia Flash) decoder library
p   libswfdec-0.5-4-dbg       - SWF (Macromedia Flash) decoder library
p   libswfdec-0.5-dev         - SWF (Macromedia Flash) decoder library
v   libswfdec-dev             -
p   pyvnc2swf                 - screen recording tool to SWF movie
v   swf-player                -
p   swfdec-mozilla            - Mozilla plugin for SWF files (Macromedia Flash)
p   swfmill                   - xml2swf and swf2xml processor

Yes, Flash works perfectly on Homer because it has no package installed with swf or flash in its name! And I don’t have any Gnash package installed, either. I removed all swf/flash-related packages on Heracles, and now Flash works perfectly on my home computer too.

Posted in howto

Hard links: an example case

Posted by isilanes on November 29, 2007

One argument I tend to hear from Windows users is that in Windows you can do as much as you can with Linux, and that the technical advantages of Linux only show up if you are really an utter geek. This is one of (I hope) a series of entries in my blog, illustrating some cases where this doesn’t hold: I took advantage of tools provided by Linux in a way that anyone could have, not just geeks.

The moral of it all is that Windows encourages a lack of choice and flexibility that makes users tend not to be creative, and think the cage Windows keeps them in is actually a shelter from the storm, when it’s not. They think that what can’t be done with Windows, needs not be done. I think otherwise…

Today I will try to provide an example in which hard links can be useful. Under Windows XP hard links can be created using the fsutil utility, but only on NTFS file systems, and only by the Administrator account (and only from the command line). If you want to learn more about links, and especially Windows links, read this interesting sell-shocked.org article.

The problem

I download a lot of music from Jamendo, using the BitTorrent p2p protocol. After having downloaded a given album, I tend to leave the torrent open, so that people can continue uploading from my computer.

However, I also want to have my music collection tidy and ordered, so I immediately organize the newly-downloaded songs, moving them to a neat directory tree I have with all my music.

So, there is a conflict between keeping the files in the bittorrent download/upload dir, and properly organizing them. I don’t want to have to wait until I decide to stop sharing a file to organize it, and I don’t want to risk deleting the files if I remove them from the bittorrent client before saving them elsewhere. I could get over all this by simply making a copy of the files… but then I would be filling twice as much disk space, and with GBs of shared files, this is not neat at all.

The solution

What I do is hardlink all the downloaded files to their final location. If I download all torrents to /scratch/ktorrent/, a downloaded album will look like this:


% ls /scratch/ktorrent/album1/
song1.ogg song2.ogg song3.ogg [...]

If I want to save the album under my artist1 directory, I do the following:


% mkdir /scratch/music/artist1/album1
% ln /scratch/ktorrent/album1/* /scratch/music/artist1/album1/

This way all the “song*.ogg” files will appear to be in both /scratch/music/artist1/album1/ and /scratch/ktorrent/album1/ at the same time.

Benefits:

1 – I can keep sharing the files in /scratch/ktorrent/album1/, while listening to and/or manipulating the /scratch/music/artist1/album1/ files as if I had 2 copies of each.

2 – The total size is not affected. The hard links do not “occupy” space (only a few bytes each).

3 – I can delete the files in the shared directory without any fear. Only the “copy” in /scratch/ktorrent/ disappears, while the other “copy” in /scratch/music/artist1/album1/ becomes the only copy (just as if it had always been a “normal” file, and the only one).

Recall that all files are hard links. Normally a given file is the only hard link to a given piece of data in the hard disk, but there can be more “links” pointing to that data. When we remove files, we only remove the “link” pointing to the data.
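The behaviour described above can be checked with a few lines of Python 3. This is a sketch run in a temporary directory: the directory layout mimics the example, the file content is dummy data, and hardlink_album is a name I made up for the equivalent of the ln command (note that, unlike plain mkdir, it also creates missing parent directories).

```python
import os
import tempfile


def hardlink_album(src_dir, dst_dir):
    """Hard-link every regular file in src_dir into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            dst = os.path.join(dst_dir, name)
            os.link(src, dst)  # new name, same data on disk
            linked.append(dst)
    return linked


def demo():
    root = tempfile.mkdtemp()
    torrent = os.path.join(root, "ktorrent", "album1")
    music = os.path.join(root, "music", "artist1", "album1")
    os.makedirs(torrent)
    song = os.path.join(torrent, "song1.ogg")
    with open(song, "wb") as f:
        f.write(b"not really music")

    (copy,) = hardlink_album(torrent, music)
    same_inode = os.stat(song).st_ino == os.stat(copy).st_ino
    links_before = os.stat(copy).st_nlink  # two names for one file

    os.remove(song)  # drop the name in the torrent directory
    links_after = os.stat(copy).st_nlink   # one name left, data intact
    with open(copy, "rb") as f:
        data = f.read()
    return same_inode, links_before, links_after, data


if __name__ == "__main__":
    print(demo())
```

Both names must live on the same file system, which is why the example keeps everything under /scratch/.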

Posted in Windows no-nos

Unicode in the command line

Posted by isilanes on November 20, 2007

This is a short HowTo for making Unicode work in Linux, specifically in the command line. Yet more specifically, in the konsole terminal. This is useful if you want to be able to use characters like ‘ñ’ or accents like in ‘á’ and ‘ö’.

1 – Modify your shell locale variables

You need locale settings that support UTF-8 (for example en_US.UTF-8). For that, you can add the following lines to .tcshrc or whatever script runs at login:


setenv mylang   en_US.UTF-8
setenv LANG     $mylang
setenv LC_CTYPE $mylang
setenv LANGUAGE $mylang
setenv LC_ALL   $mylang

The ‘$mylang’ thing is just because I’m lazy, and I might want to change them all in the future, and I don’t want to type too much.
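To check which of these variables actually decides the character encoding, here is a small Python 3 sketch. It relies on the usual POSIX precedence (LC_ALL first, then LC_CTYPE, then LANG); the function names are mine, and LANGUAGE is left out because, as far as I know, it only affects the language of translated messages.

```python
import os


def effective_ctype(env):
    """Return the locale name that governs character handling.

    Precedence: LC_ALL, then LC_CTYPE, then LANG. Empty values count
    as unset, and "C" is the fallback when nothing is set.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def is_utf8_locale(env):
    """True if the effective locale names a UTF-8 codeset."""
    codeset = effective_ctype(env).partition(".")[2]  # text after the dot
    codeset = codeset.partition("@")[0]               # drop any @modifier
    return codeset.replace("-", "").lower() == "utf8"


if __name__ == "__main__":
    print(effective_ctype(os.environ), is_utf8_locale(os.environ))
```

Running it in a new terminal after logging in again shows whether the settings took effect.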

2 – Modify your global locales

I don’t know if this is needed, but it doesn’t hurt. In Debian:

% dpkg-reconfigure locales

and follow the instructions, using en_US.UTF-8 or something similar as default.

3 – Modify the encoding of Konsole

In the menus:

Settings->Encoding->Unicode (utf8)

Make this permanent with:

Settings->Save as Default

Then choose xterm and not linux as keyboard setting:

Settings->Keyboard->Xterm (XFree 4.x.x)

You can make this permanent in the Session tab of:

Settings->Configure Konsole

namely inserting “xterm” in the box labeled “$TERM”.

If you follow these instructions, you will be able to introduce non-ASCII text in the terminal, and use non-ASCII filenames without problem.

Posted in Free software and related beasts

e-mail howto

Posted by isilanes on November 14, 2007

When we send e-mails (especially mass forwards) we might not be aware that on the other side of the wire there is some person that could be annoyed by some of our acts. We could help others behave nicely with us if we started behaving correctly with others. This post tries to help you with that.

All the following is my opinion, but I’m not asking you to do it because it’s my opinion. I think that, besides, it’s also sensible. Judge yourself.

Avoid HTML messages at all costs

In fact, only plain text e-mails should ever be sent (and anything else as an attachment). Sophocles, Shakespeare, Cervantes… they all used plain text, and managed to get their message through, didn’t they?

There are two reasons to use plain text. Firstly, HTML merely adds bloat. The e-mail will be unnecessarily fat, without adding the slightest actual content. Secondly, and maybe even more importantly, HTML is used in e-mails by spammers and crackers to force the receiver to execute unwanted actions, including: visiting unsolicited web pages, sending private data (as, e.g., the confirmation of the actual existence of the receiver, something very valuable for a spammer), and, if the HTML includes malicious Java, JavaScript or ActiveX code and the receiver is not correctly protected (*cough* Windows users *cough*), anything from crashing the mail client to setting your screen on fire and killing the little puppy you got yesterday.

For the second reason in the previous paragraph, any knowledgeable user will abhor receiving HTML e-mails (I do), and will have it completely deactivated (the mail client will not interpret the HTML code, and will display it literally instead, which is 100% safe, except if ugly symbols hurt your eyeballs). Thus, your pretty HTML message will not be correctly read by the receiver, and will at least burden him with the annoyance of either activating the HTML back, or reading the source code. And in this day and age, even allowing HTML e-mails on a per-sender basis is risky as can be, since anyone can forge anyone else’s e-mail address.

So, don’t ever send HTML messages, and also deactivate the rendering of HTML messages you receive altogether. The first thing will make your receivers happier, and the second one will keep you safer.

Use care if sending mass forwards

Can you name something more unpleasant than those silly mass forwards of 2MB PowerPoints with “witty” sentences, and almost always ending in “send it to 1000 friends or die a slow and painful death”?

For me, there are two kinds of forwards: the ones I name above, and the ones with funny, interesting and/or useful data. The first one: avoid them like the plague. Don’t ever send/answer/forward them. The only use they can have is negative: they clutter the net, they slow down the download of other (possibly important) e-mails for the receiver, they waste bandwidth and connection time for those who have either or both limited, and they don’t actually add anything to the life of the receiver, except anger towards a sob who pretends to be her “friend”, and then blackmails her to spread the same message or “suffer consequences”.

For the (veeery few) contents you want to spread to legitimately help/amuse/enlighten the receivers: choose a suitable format! If the content is a joke or similar, send it in plain text. It works all the same! Don’t send a huge PowerPoint just for the sake of it. If the content is a (presumably big) file (a movie file, a presentation that is amusing in itself, an article with images and links…), put it online and send a link instead! Sending just a link is much more comfortable for the receiver, since the size of the e-mail is tiny, and she can choose whether or not to download the file, after all. Not everyone has a personal web page, but at times it proves invaluable… look for online storage solutions, as there are many free ones.

Also take into account that mass forwards can be used by spammers to get a list of valid addresses to bomb with their mails. The more “evil” a spammer, the more friendly she’ll pretend to be, to be included in the more people’s distribution lists, so that she’ll be sent all their mass forwards, along with the addresses of maybe hundreds of victims.

To avoid that, try to send your forwards only to people you actually know, and think are not spammers. Even safer: DO NOT DISCLOSE the addresses of all the receivers of your e-mails to every other receiver. It’s easy: with any half-decent e-mail client (KMail, Thunderbird and even Outlook can) you can choose to make any receiver “To:”, “CC:” or “BCC:” (“Para:”, “CC” and “CCO” in the Spanish version of Outlook Express). Send all your forwards with BCC to be on the safe side.
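For those who send mail from a script rather than from a mail client, a plain-text message with every receiver in BCC could be built roughly like this in Python 3. It is a sketch with invented addresses and an invented function name; the SMTP server in the usage note is an assumption too.

```python
from email.message import EmailMessage


def build_forward(sender, recipients, subject, body):
    """Build a plain-text message that hides the receivers from each other."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = sender                  # the only address everyone sees
    msg["Bcc"] = ", ".join(recipients)  # real receivers, kept private
    msg["Subject"] = subject
    msg.set_content(body)               # plain text, no HTML part
    return msg


if __name__ == "__main__":
    message = build_forward(
        "me@example.org",
        ["ana@example.org", "jon@example.org"],
        "A joke worth reading",
        "Plain text works all the same.\n",
    )
    print(message["To"], "|", message.get_content_type())
```

To actually send it, smtplib.SMTP("localhost").send_message(message) would deliver to the BCC addresses while leaving the Bcc header out of what is transmitted; "localhost" is just a placeholder for whatever SMTP server you use.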

Trim the excess

Whenever you answer to or forward an e-mail, depending on the configuration of your e-mail client it will automatically attach the original message, quoted. Now, if the receiver answers to your answer, she’ll quote your text AND your quotation of her original message. Then you answer and… you get the picture: e-mails flying around with hundreds of lines that only add: a) superfluous size excess and b) confusion, since sometimes it is not easy to find exactly the new material (coloring quotations helps, though).

Quoting the e-mail we answer to can be useful, but when answering to an answer, be nice and take the ten seconds you need to properly delete what is not needed.

Also remember that blindly forwarding messages can make you disclose to third parties information that the original sender wanted just you to read. Watch out for that!

Don’t overspread e-mail addresses

Don’t make spammers’ day by providing them with your e-mail, much less with mine!

Spammers are out there, like the truth in The X-Files. They never sleep. They have no mercy. They will relentlessly go on and on, harvesting e-mail addresses to prey upon. You have to understand that the most valuable thing for a spammer is a list of valid e-mail addresses. Valid e-mails are those that will be actually read, or at least received.

The ways in which spammers build their lists include:

* Unprotected addresses publicly available on the Web
* Being included in a “mass forward” (see above)
* Random spam

Unprotected public addresses include valid e-mail addresses that appear literally in a web page, or that are sent to USENET or other discussion forums. For that reason, if you want to protect your address, while still making it possible for others to contact you, don’t ever put your address on the web like this:

myname@mydomain.com

Instead, put something like:

myname AT mydomain DOT com

or:

mynameIHATESPAMMERS@mydomain.com

Or any other combination that makes the literal e-mail completely invalid, but from which a human reader can work out the correct address. You have to understand that the spammers use robots to harvest e-mails from the web, that is, there are computer programs looking for e-mails, not human beings (even stretching the meaning of “human being” to include scum like spammers). An address that needs human “logic” to be read will not be parsed correctly by robots.

In that regard, beware that both “protected” addresses above are far from perfect. It’s trivial to write a robot program that replaces every “AT” with an “@”, and any “DOT” with a “.”, and/or eliminates spaces, capital letters or words like “SPAMMER(S)” etc. So be colorful, and think like a robot can’t think :^)
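To show how little effort such a robot needs, here is a Python 3 sketch that undoes both “protections” from the examples above. The function names and the rule of dropping four or more consecutive capitals are mine, purely for illustration.

```python
import re


def deobfuscate(text):
    """Undo the 'name AT domain DOT com' style of obfuscation."""
    text = re.sub(r"\s+AT\s+", "@", text)
    text = re.sub(r"\s+DOT\s+", ".", text)
    return text


def strip_uppercase_filler(address):
    """Drop long runs of capitals (e.g. IHATESPAMMERS) from the local part."""
    local, _, domain = address.partition("@")
    return re.sub(r"[A-Z]{4,}", "", local) + "@" + domain


if __name__ == "__main__":
    print(deobfuscate("myname AT mydomain DOT com"))
    print(strip_uppercase_filler("mynameIHATESPAMMERS@mydomain.com"))
```

Two regular expressions are enough for the first scheme and one for the second, which is why a less predictable disguise is worth the effort.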

A second approach to protecting your e-mail could be to use a specific anti-spam address. There are companies like Bluebottle who provide such a service. As you can see, the e-mail I provide in this Web site belongs to that category, and is a completely free account (they offer further services, that I do not need, for a fee).

These “anti-spam” e-mail accounts basically contact the sender each time they receive an e-mail. Then the sender has to perform some kind of basic action (click a button or similar) to assure that they are valid senders, and if they fail to, the e-mail is filtered. The validation action has the sole actual purpose of making sure that the sender is human. ANY human sender is let through, but the spam robots normally don’t have the wit to answer properly when prompted by the Bluebottle server. Yes, this might piss off the legitimate senders, because they are required to click a silly button before their message goes through. However, this is done only once. After the first authentication, all the e-mails coming from that address will be automatically accepted.

Being included in a mass forward is discussed above, and random spam messages are those offering medicines or pornography. If you answer to one of them, you might not get infected with a virus or anything, but the sender might secretly know that you actually exist (because she is notified when you answer or click the link), and remember: valid addresses are what spammers seek.

Posted in Miscellaneous