Cleaning Up E-Books

Jan 19, 2007

I have a large number of ebooks in Microsoft's .lit format. My Nokia 770 doesn't have any software to read a .lit format book. In fact, I can't say I've ever seen a .lit reader other than Microsoft's own.

What I have seen is the nifty and very usefull ConvertLIT which I use to down convert the files into plain HTML. I don't even bother with the images. The problem is, they tend to come out formatted in a hideous fashion. I came up with a nice combo of HTML tidy and a perl script.

Here's my command line for tidy, beware, this will modify your original copy!

tidy --bare yes --clean yes --drop-font-tags yes --drop-proprietary-attributes yes --enclose-text yes --output-xhtml yes --word-2000 yes --tidy-mark no --write-back yes TARGETFILENAME.htm

Here is my perl script, it just runs the file through some regex's and writes to the same filename with "NEW" appended. I also made a nice little progress bar because I was bored.

#!/usr/bin/perl

$file = $ARGV[0];   # Name the file
open(INFO, "< ".$file);   # Open the file
@lines = ;    # Read it into an array
close(INFO);      # Close the file

$size = @lines;
$counter = 0;
$size = $size / 50;

open(FILEWRITE, "> NEW".$file);
foreach(@lines) {
  $counter++;
  if(0 == ($counter % 50) || $counter == @lines) {
  print "\rProcessing: [";
  for($i = 0; $i < ($counter / $size); $i++) {
    print "+";
  }
  for($i = 0; $i < (49 - ($counter / $size)); $i++) {
    print "-";
  }
  print "]";
  }

  # Empty paragraph removal
  $_ =~ s/\s*<\/p>//mi;
  if($_ =~ m/^\s*\n$/) {
    # If the line is just a newline or newline and spaces, scrap it.
    $_ = '';
  }
  else {
    # Remove excess spaces
    $_ =~ s/  //mi;
    # I get these alot...
    $_ =~ s///mi;
  }
  print FILEWRITE $_;
}
close FILEWRITE;
print "\n";

You can download it here, but be careful with it.
cleaner.pl.txt

Update (01/21/07)
That perl script has a line $_ =~ s/ //mi; which doesn't really make that much sense looking at it now. I'm thinking $_ s/\s\s+/ /mi; for a replacement. Also, for some reason the server throws up a 500 error on trying to get that file, I'm working on it.