Thursday, August 23, 2018

Fetching multiple files from an internet site as a batch job


Sometimes one encounters a website that displays a book or manuscript page by page as individual JPEG files.  But what you need for your research is a single PDF of the item, so that you can move around in it easily and consult it offline.

There are several quick ways of getting these images as a batch job: here's one.   

  • First you have to identify the URL of one of the images.  I use Firefox, so I
    • first bring up a page that displays the first folio of the MS.
    • Then I press Ctrl+I to get the "page info" (or Firefox menu Tools/Page Info).
    • Then I select "Media" on the top line of the Info window.
    • Then I scroll down to the graphics file of the whole page, right-click it, and copy the URL (Ctrl+C).

      You now have a URL that looks like this:

      http://awebsite.net/uploads/manuscripts/miscellaneous/sometext/001.jpg

There may be a more direct way of getting this URL, but this is good enough for me.
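
If you prefer to stay at the command line, you can often pull the image addresses straight out of the page's HTML instead.  Something like the following works on many sites; the page address here is only a placeholder, not a real one:

$ curl -s http://awebsite.net/some-viewer-page.html | grep -Eo 'http[^"]+\.jpg' | sort -u

This quietly downloads the HTML (-s), picks out anything that looks like a .jpg link, and removes duplicates.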

The next bit is the nice bit.  Drop to the command line and use the utility "curl".  Here's the syntax ($ is my command prompt):

$ curl -O http://awebsite.net/uploads/manuscripts/miscellaneous/sometext/[001-268].jpg
  • Hit "enter" and several hundred jpeg files will be transferred to your directory.  It takes a couple of minutes, depending on your bandwidth.
    The bit in square brackets, "[001-268]", is curl's syntax for "please fetch 001.jpg, 002.jpg, ... 267.jpg, and 268.jpg".  Curl is one of the few tools that can fetch a whole numbered series of files with a single simple command.
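
The same trick adapts to other numbering schemes: with unpadded numbers you would write [1-268], and curl's -o option with "#1" lets you choose your own output names.  Here is an illustrative variant of the command above (the URL is still the placeholder from before), with a retry option added for flaky connections:

$ curl --retry 3 -o "page_#1.jpg" "http://awebsite.net/uploads/manuscripts/miscellaneous/sometext/[001-268].jpg"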

To convert them to a single PDF, I use ImageMagick:

$ convert *.jpg Hayanaratna.pdf
and wait for ten seconds.
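
The shell expands *.jpg in alphabetical order, so the zero-padded filenames keep the pages in sequence.  If the resulting PDF comes out unreasonably large, a quality setting on the same command will shrink it; the figure 70 below is only a suggestion to experiment with:

$ convert *.jpg -quality 70 Hayanaratna.pdf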

(I was taught about curl by Patrick McAllister - thanks Patrick!)

A quite different approach is to use wget to fetch a whole website in a single gulp.   That's what I use for GRETIL, for example, so that I have the whole archive on my hard drive.
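
Roughly, the mirroring command I mean looks like this.  The GRETIL address is given from memory, so check it before running, and the options (recursive mirror, stay below the starting directory, rewrite links for offline reading, pause briefly between requests) can be adjusted to taste:

$ wget --mirror --no-parent --convert-links --wait=1 http://gretil.sub.uni-goettingen.de/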

Tuesday, April 17, 2018

Improving PDFs

I sometimes do some processing on PDFs if I think they are important, or if I want to read them more conveniently.  I was recently trying to explain my techniques to my students, and I realized that I use a mixture of tools that are not at all obvious or easy to explain to someone not familiar with Unix.
So I'm going to write down here what I do, so that at least the information is available in one place.  I assume a general knowledge of Linux and an ability to work with command-line commands.

If I receive a PDF that is a scanned book, with one PDF page = one book opening (two facing pages), I want to chop it up so that one PDF page = one book page.  Here is what I do (a sketch that strings the non-interactive steps together appears after this list).
  • make a working directory
  • use pdftk to unpack the PDF into one file per page:
    $ pdftk foobar.pdf burst
  • I now have a directory full of one-page PDFs.  Nice.
  • convert them into jpegs using pdf2jpegs, a shell script that I wrote that contains this text:
    #!/bin/bash
    # Convert a directory full of single-page PDFs into JPEGs.
    # Each PDF produced by "pdftk burst" holds just one page, so
    # pdftoppm's output can be redirected straight into a .jpg file (at 400 dpi).
    for i in *.pdf; do pdftoppm -jpeg -r 400 "$i" >"$i.jpg"; done
  • I now have a directory full of jpegs, one jpeg per page.
  • Start the utility Scan Tailor and use it to
    • separate left and right pages into separate files
    • straighten the pages
    • select the text area of each page
    • create a margin around the text
    • finally, write out the resulting new pages
     
  • I now have a directory (../out) full of TIFF files, one page per file, smart.
  •  Combine the TIFFs into a single PDF using my shell script tiffs2pdf:
    #!/bin/bash
    # Create a PDF file from all the .tiff files in the current directory.
    # The argument provides the name of the output file (with .pdf appended).
    echo "Created a PDF from a directory full of .tif files"
    echo "Single argument - the filename of the output PDF (no .pdf extension)"
    tiffcp *.tif "/tmp/${1}.tiff"
    tiff2pdf "/tmp/${1}.tiff" > "${1}.pdf"
    echo "Created ${1}.pdf"
    rm "/tmp/${1}.tiff"
    echo "Removed temporary file /tmp/${1}.tiff"

    # thanks to http://ubuntuforums.org/showthread.php?t=155628
     
  • I now have a nice PDF that has one smart page per PDF page. 
  • If I want it OCRed, then I usually use Adobe Acrobat, a commercial program.  But if I'm uploading to Archive.org, that isn't necessary, because Archive.org does the OCR work itself, using ABBYY.
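
The non-interactive steps above can be strung together in a small wrapper script.  What follows is only a sketch of how my own pieces fit together, not a polished tool: it assumes pdftk and pdftoppm are installed and that the tiffs2pdf script above is on the PATH, and it still leaves the Scan Tailor stage to be done by hand.

    #!/bin/bash
    # Sketch: burst a scanned book PDF into pages, make 400 dpi JPEGs,
    # then hand over to Scan Tailor for the interactive clean-up.
    set -e
    pdftk "$1" burst                  # one PDF per page: pg_0001.pdf, pg_0002.pdf, ...
    for i in pg_*.pdf; do
        pdftoppm -jpeg -r 400 "$i" >"$i.jpg"
    done
    echo "Now run Scan Tailor on the JPEGs, writing its TIFFs to ./out,"
    echo "then:  cd out && tiffs2pdf name-of-book"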


That's all, folks!