Cleaning Up E-Books
I have a large number of ebooks in Microsoft's .lit format. My Nokia 770 doesn't have any software to read a .lit format book. In fact, I can't say I've ever seen a .lit reader other than Microsoft's own.
What I have seen is the nifty and very useful ConvertLIT, which I use to down-convert the files into plain HTML. I don't even bother with the images. The problem is that the converted files tend to come out hideously formatted. To clean them up, I came up with a nice combo of HTML Tidy and a Perl script.
Here's my command line for tidy. Beware: because of --write-back yes, this will modify your original copy!
tidy --bare yes --clean yes --drop-font-tags yes --drop-proprietary-attributes yes --enclose-text yes --output-xhtml yes --word-2000 yes --tidy-mark no --write-back yes TARGETFILENAME.htm
Here is my Perl script. It just runs the file through some regexes and writes the result to the same filename with "NEW" prepended. I also made a nice little progress bar because I was bored.
#!/usr/bin/perl
$file = $ARGV[0]; # Name the file
open(INFO, "< ".$file) or die "Cannot open $file: $!"; # Open the file
@lines = <INFO>; # Read it into an array
close(INFO); # Close the file
$size = @lines;
$counter = 0;
$size = $size / 50;
open(FILEWRITE, "> NEW".$file) or die "Cannot write NEW${file}: $!";
foreach(@lines) {
$counter++;
if(0 == ($counter % 50) || $counter == @lines) {
print "\rProcessing: [";
for($i = 0; $i < ($counter / $size); $i++) {
print "+";
}
for($i = 0; $i < (49 - ($counter / $size)); $i++) {
print "-";
}
print "]";
}
# Empty paragraph removal
$_ =~ s/<p>\s*<\/p>//mi;
if($_ =~ m/^\s*\n$/) {
# If the line is just a newline or newline and spaces, scrap it.
$_ = '';
}
else {
# Remove excess spaces
$_ =~ s/ //mi;
# I get these a lot...
$_ =~ s///mi;
}
print FILEWRITE $_;
}
close FILEWRITE;
print "\n";
You can download it here, but be careful with it.
cleaner.pl.txt
Update (01/21/07)
That Perl script has a line, $_ =~ s/ //mi;, which doesn't really make much sense looking at it now. I'm thinking of $_ =~ s/\s\s+/ /mi; as a replacement. Also, for some reason the server throws a 500 error when you try to get that file. I'm working on it.
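For what it's worth, here is a minimal sketch of how that replacement might look in practice. It is not taken from cleaner.pl: the helper name collapse_spaces and the sample line are made up for illustration. It departs from the one-liner above in two deliberate ways. It adds the /g flag, since without it only the first run of whitespace on each line is replaced. It also matches [ \t] rather than \s, because \s matches the newline too, so trailing spaces at the end of a line would otherwise swallow the line break.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of cleaner.pl): collapse each run of two or
# more spaces or tabs into a single space, leaving the line ending alone.
sub collapse_spaces {
    my ($line) = @_;
    $line =~ s/[ \t]{2,}/ /g;
    return $line;
}

# Made-up sample line with runs of extra spaces.
my $sample = "<p>Some   text  here</p>\n";

# The substitution from the original script deletes only the first space.
(my $old_result = $sample) =~ s/ //mi;

print "old: $old_result";
print "new: ", collapse_spaces($sample);
```

Inside the script's loop, this would mean swapping the $_ =~ s/ //mi; line for $_ = collapse_spaces($_);, assuming the helper is defined near the top of the file.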