handyfloss

Because FLOSS is handy, isn’t it?

Archive for the ‘howto’ Category

Some more tweaks to my Python script

Posted by isilanes on February 19, 2008

All the comments to my previous post have provided me with hints to increase further the efficiency of a script I am working on. Here I present the advices I have followed, and the speed gain they provided me. I will speak of “speedup”, instead of timing, because this second set of tests has been made in a different computer. The “base” speed will be the last value of my previous test set (1.5 sec in that computer, 1.66 in this one). A speedup of “2″ will thus mean half an execution time (0.83 s in this computer).

Version 6: Andrew Dalke suggested the substitution of:

line = re.sub('>','<',line)

with:

line   = line.replace('>','<')

Avoiding the re module seems to speed up things, if we are searching for fixed strings, so the additional features of the re module are not needed.

This is true, and I got a speedup of 1.37.

Version 7: Andrew Dalke also suggested substituting:

search_cre = re.compile(r'total_credit').search
if search_cre(line):

with:

if 'total_credit' in line:

This is more readable, more concise, and apparently faster. Doing it increases the speedup to 1.50.

Version 8: Andrew Dalke also proposed flattening some variables, and specifically avoiding dictionary search inside loops. I went further than his advice, even, and substituted:

stat['win'] = [0,0]

loop
  stat['win'][0] = something
  stat['win'][1] = somethingelse

with:

win_stat_0 = 0
win_stat_1 = 0

loop
  win_stat_0 = something
  win_stat_1 = somethingelse

This pushed the speedup futher up, to 1.54.

Version 9: Justin proposed reducing the number of times some patterns were matched, and extract some info more directly. I attained that by substituting:

loop:
  if 'total_credit' in line:
    line   = line.replace('>','<')
    aline  = line.split('<')
    credit = float(aline[2])

with:

pattern    = r'total_credit>([^<]+)<';
search_cre = re.compile(pattern).search

loop:
  if 'total_credit' in line:
    cre    = search_cre(line)
    credit = float(cre.group(1))

This trick saved enough to increase the speedup to 1.62.

Version 10: The next tweak was an idea of mine. I was diggesting a huge log file with zcat and grep, to produce a smaller intermediate file, which Python would process. The structure of this intermediate file is of alternating lines with “total_credit” then “os_name” then “total_credit”, and so on. When processing this file with Python, I was searching the line for “total_credit” to differentiate between these two lines, like this:

for line in f:
  if 'total_credit' in line:
    do something
  else:
    do somethingelse

But the alternating structure of my input would allow me to do:

odd = True
for line in f:
  if odd:
    do something
    odd = False
  else:
    do somethingelse
    odd = True

Presumably, checking falsity of a boolean is faster than matching a pattern, although in this case the gain was not huge: the speedup went up to 1.63.

Version 11: Another clever suggestion by Andrew Dalke was to avoid using the intermediate file, and use os.popen to connect to and read from the zcat/grep command directly. Thus, I substituted:

os.system('zcat host.gz | grep -F -e total_credit -e os_name > '+tmp)

f = open(tmp)
for line in f:
  do something

with:

f = os.popen('zcat host.gz | grep -F -e total_credit -e os_name')

for line in f:
  do something

This saves disk I/O time, and the performance is increased accordingly. The speedup goes up to 1.98.

All the values I have given are for a sample log (from MalariaControl.net) with 7 MB of gzipped info (49 MB uncompressed). I also tested my scripts with a 267 MB gzipped (1.8 GB uncompressed) log (from SETI@home), and a plot of speedups vs. versions follows:

versions2.png

Execution speedup vs. version
(click to enlarge)

Notice how the last modification (avoiding the temporary file) is of much more importance for the bigger file than for the smaller one. Recall also that the odd/even modification (version 10) is of very little importance for the small file, but quite efficient for the big file (compare it with Version 9).

The plot doesn’t tell (it compares versions with the same input, not one input with the other), but my eleventh version of the script runs the 267 MB log faster than the 7 MB one with Version 1! For the 7 MB input, the overall speedup from Version 1 to Version 11 is above 50.

Posted in howto | Tagged: , , , , | 7 Comments »

Summary of my Python optimization adventures

Posted by isilanes on February 17, 2008

Blog moved to: handyfloss.net

Entry available at: http://handyfloss.net/2008.02/summary-of-my-python-optimization-adventures/

This is a follow up to two previous posts. In the first one I spoke about saving memory by reading line-by-line, instead of all-at-once, and in the second one I recommended using Unix commands.

The script reads a host.gz log file from a given BOINC project (more precisely one I got from MalariaControl.net, because it is a small project, so its logs are also smaller), and extracts how many computers are running the project, and how much credit they are getting. The statistics are separated by operating system (Windows, Linux, MacOS and other).

Version 0

Here I read the whole file to RAM, then process it with Python alone. Running time: 34.1s.

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]
  
# Process file:
f = gzip.open('host.gz','r')
for line in f.readlines():
  if re.search('total_credit',line):
    # The following line lacks a '' behind the "total_credit" thing
    # because WordPress won't accept them (it keeps mangling the text 
    # if I do include them)
    credit = float(re.sub('/?total_credit','',line.split()[0])
  elif re.search('os_name',line):
    if re.search('Windows',line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif re.search('Linux',line):
        stat['lin'][0] += 1
        stat['lin'][1] += credit
    elif re.search('Darwin',line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# Return output:
nstring = ''
cstring = ''
for osy in os_list:
  nstring +=   "%15.0f " % (stat[osy][0])
  try:
    cstring += "%15.0f " % (stat[osy][1])
  except:
    print osy,stat[osy]

print nstring
print cstring

Version 1

The only difference is a “for line in f:“, instead of “for line in f.readlines():“. This saves a LOT of memory, but is slower. Running time: 44.3s.

Version 2

In this version, I use precompiled regular expresions, and the time-saving is noticeable. Running time: 26.2s

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]


pattern    = r'total_credit'
match_cre  = re.compile(pattern).match
pattern    = r'os_name';
match_os   = re.compile(pattern).match
pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search

# Process file:
f = gzip.open('host.gz','r')

for line in f:
  if match_cre(line,5):
    # The following line lacks a '' behind the "total_credit" thing
    # because WordPress won't accept them (it keeps mangling the text 
    # if I do include them)
    credit = float(re.sub('/?total_credit','',line.split()[0])
  elif match_os(line,5):
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# etc.

Version 3

Later I decided to use AWK to perform the heaviest part: parsing the big file, to produce a second, smaller, file that Python will read. Running time: 14.8s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]
  
pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search

# Distile file with AWK:
tmp = 'bhs.tmp'
os.system('zcat host.gz | awk \'/total_credit/{printf $0}/os_name/{print}\' > '+tmp)

stat = {}
for osy in os_list:
  stat[osy] = [0,0]
# Process tmp file:
f = open(tmp)
for line in f:
  line = re.sub('>','<',line)
  aline = line.split('<')
  credit = float(aline[2])
  os_str = aline[6]
  if search_win(os_str):
    stat['win'][0] += 1
    stat['win'][1] += credit
  elif search_lin(os_str):
    stat['lin'][0] += 1
    stat['lin'][1] += credit
  elif search_dar(os_str):
    stat['dar'][0] += 1
    stat['dar'][1] += credit
  else:
    stat['oth'][0] += 1
    stat['oth'][1] += credit
f.close()

# etc

Version 4

Instead of using AWK, I decided to use grep, with the idea that nothing can beat this tool, when it comes to pattern matching. I was not disappointed. Running time: 5.4s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]
  
pattern    = r'total_credit'
search_cre = re.compile(pattern).search

pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search

# Distile file with grep:
tmp = 'bhs.tmp'
os.system('zcat host.gz | grep -e total_credit -e os_name > '+tmp)

# Process tmp file:
f = open(tmp)
for line in f:
  if search_cre(line):
    line = re.sub('>','<',line)
    aline = line.split('<')
    credit = float(aline[2])
  else:
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit

f.close()

# etc

Version 5

I was not completely happy yet. I discovered the -F flag for grep (in the man page), and decided to use it. This flag tells grep that the pattern we are using is a literal, so no expansion of it has to be made. Using the -F flag I further reduced the running time to: 1.5s.

time_vs_version.png

Running time vs. script version (Click to enlarge)

Posted in howto | Tagged: , , , , | 13 Comments »

Speeding up file processing with Unix commands

Posted by isilanes on February 17, 2008

Blog moved to: handyfloss.net

Entry available at: http://handyfloss.net/2008.02/speeding-up-file-processing-with-unix-commands/

In my last post I commented some changes I made to a Python script to process a file reducing the memory overhead related to reading the file directly to RAM.

I realized that the script needed much optimizing, and resorted to reading the link a reader (Paddy3118) was kind enough to point me to, I realized I could save time by compiling my search expressions. Basically my script opens a gzipped file, searches for lines containing some keywords, and uses the info read from those lines. The original script would take 44 seconds to process a 6.9 MB file (49 MB uncompressed). Using compile on the search expressions, this time went down to 29 s. I tried using match instead of search, and expressions like “if pattern in line:“, instead of re.search(), but these didn’t make much of a difference.

Later I thought that Unix commands such as grep were specially suited for the task, so I gave them a try. I modified my script to run in two steps: in the first one I used zcat and awk (called from within the script) to create a much smaller temporary file with only the lines containing the information I wanted. In a second step, I would process this file with standard Python code. This hybrid approach reduced the processing time to just 12 s. Sometimes using the best tool really makes a difference, and it seems that the Unix utilities are hard to come close to in terms of performance.

It is only after programming exercises like this one that one realizes how important writing good code is (something I will probably never do, but I try). For some reason I always think of Windows, and how Microsoft refuses to make an efficient program, relying on improvementes on the hardware instead. It’s as if I tried to speed up my first script using a faster computer, instead of fixing the code to be more efficient.

Posted in howto | Tagged: , , , | 1 Comment »

Password cracking with John the Ripper

Posted by isilanes on February 10, 2008

Following some security policy updates (not necessarily for better) in my workplace, a colleague and I discussed the vulnerability of user passwords in the accounts of our computers. He assured that an attack with a cracker program such as John the Ripper could potentially break into someone’s account, if only the cracker would have access to an initial user account.

I am by no means an expert on cryptography and computer security, but I would like to outline some ideas about the subject here, and explain why my colleague was partially wrong.

How authentication works

When we log in to an account in a computer, we enter a password. The computer checks it, and if it is the correct one, we are granted access. For the computer to check the password, we must have told it beforehand what the correct password is. Now, if the computer knows our password, anyone with access to the place where it is stored could retrieve our password.

We can avoid that by not telling the computer our password, but only an encrypted version. The encrypted version can be obtained from the password, but there is no operation to obtain the password from its encrypted form. When the computer asks for a password, it applies the encrypting algorithm, and compares the result with the stored encrypted form. If they are equal, it infers that the password was correct, since only from the correct password could one obtain the encrypted form.

On the other hand, no-one can possibly obtain the original password, even by inspection of the contents of the host computer, because only the encrypted form is available there.

How password cracking works

I will only deal with brute force attacks, i.e., trying many passwords, until the correct one is found.

Despite the “romantic” idea that a cracker will try to log in to an account once and again, until she gets access, this method is really lame, since such repeated access tries can be detected and blocked.

The ideal approach is to somehow obtain the encrypted password that the computer stores, and then try (in the cracker’s computer) to obtain the plain password from it. To do so, the cracker will make a guess, encrypt it with the known encrypting method, and compare the result with the encrypted key, repeating the process until a match is found. This task is the one performed by tools such as John the Ripper.

Why this shouldn’t work in a safe (Linux) system

The security of a password relies heavily on the difficulty of guessing it by the cracker. If our password is the same as our user name, this will be the first guess of the cracker, and she’ll find it immediately. If our password is a word that appears in a dictionary, they’ll find it quickly. If it is a string of 12 unrelated characters, plus digits, dots or exclamation marks, then it will take ages for the cracking program to reach the point where it guesses it.

The second hurdle for the cracker is that, even if she gets access to a regular user account, the file where the encrypted passwords are stored is only readable by the root (administrator) user (in a Linux system). Information about users and their passwords is stored in /etc/passwd (that any user can read) and /etc/shadow (that only root can read). The encrypted password is stored only in the latter. In the past all info was in /etc/passwd, but later on it was split, to increase the security.

In short: you need root access to start trying to crack passwords in a machine… but, if you have root access, why bother? You already have full access to all accounts!

Posted in howto | Tagged: , , , , , | Leave a Comment »

Re-partitioning a disk infected with Vista to dual-boot with Linux

Posted by isilanes on January 9, 2008

Some time ago I helped a friend to install Linux into a Vista laptop (incidentally, another friend asked me about the subject today). The only aspect I’m covering in this post is the re-partitioning of the disk, which is a wee bit trickier than with XP and previous Windows versions.

With my laptop (one with XP preinstalled), I just inserted my favorite Linux CD, rebooted, and used the built-in partition utility that all Linux installation CDs have to downsize the Windows partition, and then make the Linux partitions in the remaining disk space. With Vista this is not the case. You have to be very careful, because Linux can not resize the Vista partitions (at least at the time of writing these lines). The problem is that Vista uses a modified NTFS format, and Linux can not cope with it yet (read more at my source for this info: pronetworks.org).

You can also find at pronetworks.org a detailed HowTo for making the resizing of a partition. In summary (e.g., for shrinking a partition to make room for Linux):

  1. Go to Control Panel -> Administrative Tools -> Computer Management
  2. Click on Disk Management (under Storage in left hand panel)
  3. Locate partition to shrink, right click on it, and from the context menu choose Shrink Volume
  4. Fill in the self-explanatory dialog box. Basically, enter amount of MB you want the partition to be reduced by.

You will thus end up with a smaller Vista partition, and some empty space. Now, you can insert the Linux CD, reboot, and install Linux in that empty space, without touching the Vista partition.

Posted in howto | Tagged: , , , , , | Leave a Comment »

Extracting audio from a YouTube video

Posted by isilanes on January 7, 2008

This HowTo really has two parts:

1 – How to download a video from YouTube
2 – How to extract audio from any video

The second step is not limited to videos obtained in the first step, and the first step can obviously be made for the sake of it.

How to download a video from YouTube

When you play a video on YouTube, the contents of a FLV file are streamed to your screen. Downloading this FLV file is a bit more tricky than it should be, because there is no direct indication of the URL of this file in the code of the page of the video.

Apparently some guys got over it, and they made the software I use to do the job: a Firefox extension called DownloadHelper. Using it is so easy: a three-sphere icon appears to the right of the URL bar of Firefox. When a page contains material that can be downloaded with DownloadHelper (such as a YouTube video), the spheres in the icon are colorful and move (otherwise they are grayed-out, and still). You can then click on the icon to see a list of items to download, and choose the one you want (usually the .flv file).

How to extract audio from any video

It is so easy to do from the command line. First we use MPlayer to extract the audio in PCM/WAV format:


% mplayer filename.flv -vo null -ao pcm:fast:file=filename.wav

Then, we make use of oggenc to encode the WAV into Ogg Vorbis. For example, to encode with quality level 7 (a reasonable tradeoff between quality and size):


% oggenc filename.wav -q 7 -o filename.ogg

And that’s all to it!

Posted in howto | Tagged: , , , , , , | 2 Comments »

Flash: better without Flash

Posted by isilanes on January 6, 2008

Remember my previous post about a problem with Flash in Firefox/Iceweasel? Now the second part.

After following my own instructions, I ended up with a Flash instalation that could play YouTube videos correctly, but some other Flash animations would not work. By chance, my computer at work would reproduce any Flash animation just fine, so… why would that be?

To find out the reason, I have compared what Flash-related packages I have installed in Homer (my computer at work) and Heracles (the one at home). The result is quite surprising:


Homer[~]: aptitude search flash
p   flashplayer-mozilla       - Macromedia Flash Player
p   flashrom                  - Universal flash programming utility
p   flashybrid                - automates use of a flash disk as the root filesystem
p   libflash-dev              - GPL Flash (SWF) Library - development files
p   libflash-mozplugin        - GPL Flash (SWF) Library - Mozilla-compatible plugin
p   libflash-swfplayer        - GPL Flash (SWF) Library - stand-alone player
p   libflash0c2               - GPL Flash (SWF) Library - shared library
p   libroxen-flash2           - Flash2 module for the Roxen Challenger web server
p   m16c-flash                - Flash programmer for Renesas M16C and R8C microcontrollers
p   vrflash                   - tool to flash kernels and romdisks to Agenda VR
Homer[~]: aptitude search swf
p   libflash-swfplayer        - GPL Flash (SWF) Library - stand-alone player
p   libswf-perl               - Ming (SWF) module for Perl
p   libswfdec-0.5-4           - SWF (Macromedia Flash) decoder library
p   libswfdec-0.5-4-dbg       - SWF (Macromedia Flash) decoder library
p   libswfdec-0.5-dev         - SWF (Macromedia Flash) decoder library
v   libswfdec-dev             -
p   pyvnc2swf                 - screen recording tool to SWF movie
v   swf-player                -
p   swfdec-mozilla            - Mozilla plugin for SWF files (Macromedia Flash)
p   swfmill                   - xml2swf and swf2xml processor

Yes, Flash works perfectly at Homer because it has no package installed with swf or flash in their name! And I don’t have any Gnash package installed, either. I removed all swf/flash-related packages on Heracles, and now Flash works perfectly in my home computer too.

Posted in howto | Tagged: , , , , , , , | 7 Comments »

Ignoring some file in a Subversion working copy

Posted by isilanes on January 3, 2008

Blog moved to: handyfloss.net

Entry available at: http://handyfloss.net/2008.01/ignoring-some-file-in-a-subversion-working-copy/

Sometimes you have files littering a svn working copy, but you don’t want to put them under version management. Often times it is impossible, or plain painful, to delete them, and having them appear in all svn status calls is uncomfortable. This can happen, for example, if you keep a repository of Python files. When a Python script is called from another script (a module, for example), a .pyc file is generated (a “compiled” version of the called .py module, that is just faster to load next time).

Clearly, you don’t want to put the .pyc files in the repository, but deleting them everytime you manage the working copy is painful. For that, you can make svn ignore some files, like e.g. .pyc files. You can do that in at least two ways: specifically for a directory, or as a general preference.

Setting ignore property on a single directory

You can set a property on any directory of a working copy, so that the chosen files in that directory will be ignored by svn. For example:


% svn propset 'svn:ignore' '*.pyc' pythonfiles/

will make svn ignore all .pyc files in the directory “pythonfiles”, but not those in other directories.

Setting ignore patterns globally

You can edit the ~/.subversion/config file, and add “*.pyc” in the line starting with "global-ignores =". For example, my such line is:


global-ignores = *.o *.lo *.la #*# .*.rej *.rej .*~ *~ .#* .DS_Store *.pyc *.swp

It works immediately, and for all the calls to svn status you do in any working copy on that machine.

Posted in howto | Tagged: , , , | 2 Comments »

Flash player problem in Debian Lenny: “This SWF file is known to trigger bugs in the swfdec decoder.”

Posted by isilanes on December 27, 2007

Blog moved to: handyfloss.net

Entry available at: http://handyfloss.net/2007.12/flash-player-problem-in-debian-lenny-this-swf-file-is-known-to-trigger-bugs-in-the-swfdec-decoder/

[Update: read a more recent post]

I have recently come accross this problem, and I am posting the solution I just found. The problem is the following: when trying to watch any online flash animation with Iceweasel on my Debian Lenny (e.g. in YouTube), instead of the video, I got a grey window, with the following black letters on it:

This SWF file is known to trigger bugs in the swfdec decoder. Playback is cancelled.

This bug can be followed in this August 2007 thread at donarmstrong.com. I am amazed why it didn’t affect me sooner than this week, but oh well…

The steps that fixed the issue for me where to install two packages. I am not 100% sure the first one was needed, though. I first installed the package flashplayer-mozilla, which uninstalled flashplugin-nonfree as a side effect. This step alone fixed nothing.

The second step was to actually read the donarmstrong.com bug report, and see that it says that “This is fixed in swfdec-0.4.2 [...]“. Great! I did a aptitude search swf, and found out that Lenny has libswfdec-0.5-4, and a swfdec-mozilla that depends on it. I installed the latter, which obviously also installed the former, plus it removed swf-player and libswfdec0.3.

After that, I opened Iceweasel again, and now flash works!

Posted in howto | Tagged: , , , , , , , | 1 Comment »

 
Follow

Get every new post delivered to your Inbox.