Bay 12 Games Forum

Author Topic: How to compress many image files? Digitizing a bookshelf.  (Read 1472 times)

Truean

  • Bay Watcher
  • Ok.... [sigh] It froze over....
How to compress many image files? Digitizing a bookshelf.
« on: March 25, 2017, 06:37:39 pm »

Issue: I need to compress a great many image files for long term storage on a mass media storage device.

I have over 112 hardcopy books (which I have paid for already, long ago). They take up physical space and don't travel well. I can't bring myself to throw them out when I've managed to preserve them for years and paid for them.

So, I intend to take pictures of each and every page, place them in a folder of some kind, and store them digitally. The average length is 500 pages per book. Uncompressed, let's call that an average of 500 KB of storage per book; x 112 = 60GB.
500 x 112 = 56,000 pages / images (JPEG files).
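
The total depends entirely on the per-page file size, so it is worth running the arithmetic for a few cases. Below is a rough sketch: the book and page counts come from the post, while the per-page sizes are assumed values for illustration, not measurements. Note that 500 KB per book works out to 1 KB per page (about 56 MB in total), while a total near 60GB corresponds to roughly 1 MB per page.

```python
# Back-of-the-envelope storage estimate for the scanned library.
BOOKS = 112            # from the post
PAGES_PER_BOOK = 500   # average, from the post


def total_bytes(bytes_per_page: int) -> int:
    """Total storage for every page at a given per-page file size."""
    return BOOKS * PAGES_PER_BOOK * bytes_per_page


def human(n: float) -> str:
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1000


# Assumed per-page sizes, chosen only to show how the total scales.
for label, size in [("1 KB/page", 1_000),
                    ("100 KB/page", 100_000),
                    ("1 MB/page", 1_000_000)]:
    print(f"{label:>12}: {human(total_bytes(size))}")
```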


Question(s):

1.) Can a great many (estimated 56,000) image files be compressed with something like winzip?
2.) Can it be made into something like a PDF? (Somehow throwing all pages of the file into a word document and saving as PDF?).

Thank you for your time.
« Last Edit: March 25, 2017, 06:47:48 pm by Truean »
The kinda human wreckage that you love

Current Spare Time Fiction Project: (C) 2010 http://www.bay12forums.com/smf/index.php?topic=63660.0
Disclaimer: I never take cases online for ethical reasons. If you require an attorney; you need to find one licensed to practice in your jurisdiction. Never take anything online as legal advice, because each case is different and one size does not fit all. Wants nothing at all to do with law.

Please don't quote me.

wierd

  • Bay Watcher
  • I like to eat small children.
Re: How to compress many image files? Digitizing a bookshelf.
« Reply #1 on: March 25, 2017, 07:59:14 pm »

What image format?

Some compress better than others.

Truean

Re: How to compress many image files? Digitizing a bookshelf.
« Reply #2 on: March 25, 2017, 08:44:33 pm »

JPEG.

Fair point to make. Thank you.

wierd

Re: How to compress many image files? Digitizing a bookshelf.
« Reply #3 on: March 25, 2017, 08:54:21 pm »

I would suggest OCR: essentially, optical recognition of the text, with storage of the text in PDF form.

Text compresses REALLY REALLY well. Images? Not so much. We aren't interested in pixel data nearly as much as we are interested in the actual text. Inline any illustrations into the PDF as needed. (Ironically, you can do this with something as simple as MS Word, which has a "save to PDF" option. Failing that, there are various "drivers" you can install so that you can "Print to PDF" as well.)

Google's OCR is based on Tesseract, which is FOSS. If you don't want to monkey around with a clunky console program, you can use Google's free OCR service. (It's part of the Google Docs ecosystem.) It is very mature, and tries to preserve page formatting.

There is a reason I suggest this aside from the obvious space economy and ease of reading later: zipping up hundreds of JPEGs will do little to reduce the filesize further. JPEG data does not recompress well, as it is already very high entropy. You might save 1% space after zipping an entire book. After zipping, all those files are "all eggs in one basket", and if the zip gets damaged in any way, you lose the whole damn book. I do not consider this appropriate or desirable, and would steer you away from that direction.
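
To see why already-compressed data gains almost nothing from zipping, here is a minimal sketch using Python's `zlib` (the same DEFLATE family that zip uses). The inputs are stand-ins, not real files: random bytes play the role of high-entropy JPEG data, and a repeated sentence plays the role of text. Real prose compresses less dramatically than a repeated sentence, so treat the text figure as an upper bound on the effect.

```python
import os
import zlib


def ratio(data: bytes) -> float:
    """Compressed size divided by original size at maximum DEFLATE effort."""
    return len(zlib.compress(data, 9)) / len(data)


# Stand-in for OCR output: highly repetitive ASCII text.
text = ("It was the best of times, it was the worst of times. " * 2000).encode("ascii")

# Stand-in for JPEG payloads: random bytes have no redundancy left to remove.
noise = os.urandom(len(text))

print(f"text : {ratio(text):.3f}")
print(f"noise: {ratio(noise):.3f}")
```

A ratio near 1.0 means the archive is about as large as its input, which is the behaviour described above for a folder of JPEGs.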

One can convert 500 pages of text INTO TEXT, and now instead of taking about 30-60 MB of space per book, it takes about 600 KB to 1 MB. It can be compressed into a tiny fraction of that if you REALLY want to zip archive it. It also becomes digitally searchable, with a number of other perks, like being readable in just about any e-reader.
« Last Edit: March 25, 2017, 09:02:49 pm by wierd »

wierd

Re: How to compress many image files? Digitizing a bookshelf.
« Reply #4 on: March 25, 2017, 09:28:21 pm »

An article on using Google Drive/Google Docs to do the OCR:
https://opensource.com/life/15/9/open-source-extract-text-images

Tesseract, the software backend, might be more useful, and more private, though. Dunno if there are prebuilt binaries or not. Will look.
https://github.com/tesseract-ocr (Apache-licensed source repo on GitHub)

fakedit: appears so.
https://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-3.02-win32-portable.zip/

I know you have some tech swagger, so the command-line-driven interface won't slow you down. Name your scans sequentially, and make a batch processing script to OCR them in an automated manner, then assemble the text files together in something like Word, and save as PDF.
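
A batch script along those lines could look roughly like the sketch below. It assumes the `tesseract` executable is on the PATH and relies on its basic `tesseract <image> <output base>` invocation, which writes `<output base>.txt`. The folder names, the `*.jpg` pattern, and the function names are all hypothetical choices for illustration.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(scan_dir, out_dir, pattern="*.jpg", lang="eng"):
    """Return one tesseract command per scan, in sorted filename order."""
    scan_dir, out_dir = Path(scan_dir), Path(out_dir)
    commands = []
    for image in sorted(scan_dir.glob(pattern)):
        out_base = out_dir / image.stem  # tesseract appends ".txt" itself
        commands.append(["tesseract", str(image), str(out_base), "-l", lang])
    return commands


def run_ocr(scan_dir, out_dir, book_path):
    """OCR every scan, then join the per-page text files into one file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in build_ocr_commands(scan_dir, out_dir):
        subprocess.run(cmd, check=True)  # stop at the first failed page
    pages = sorted(out_dir.glob("*.txt"))
    text = "\n\n".join(p.read_text(encoding="utf-8") for p in pages)
    Path(book_path).write_text(text, encoding="utf-8")


# Hypothetical usage, one book per folder:
# run_ocr("scans/book_001", "ocr/book_001", "library/book_001.txt")
```

Because the pages are processed in sorted filename order, this only produces the right page sequence if the scans are named so that alphabetical order matches page order.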

TheBiggerFish

  • Bay Watcher
  • Somewhere around here.
Re: How to compress many image files? Digitizing a bookshelf.
« Reply #5 on: March 30, 2017, 07:19:43 pm »

Yeah, I was gonna say, I would think converting them to text would be better than compressing a lot of image files.

Also, JPEG is lossy.

It has been determined that Trump is an average unladen swallow travelling northbound at his maximum sustainable speed of -3 Obama-cubits per second in the middle of a class 3 hurricane.

George_Chickens

  • Bay Watcher
  • Ghosts are stored in the balls.
Re: How to compress many image files? Digitizing a bookshelf.
« Reply #6 on: March 31, 2017, 02:37:10 am »

>1.) Can a great many (estimated 56,000) image files be compressed with something like winzip?
Don't use WinZip or WinRAR; their compression is subpar at best. Use 7-Zip or PeaZip. Even then, you won't get much more compression, as, if I am not mistaken (I probably am!), JPEGs are already fairly compressed as is.
>2.) Can it be made into something like a PDF? (Somehow throwing all pages of the file into a word document and saving as PDF?).
There are tools to convert images to modifiable PDFs, but I can't name any off the top of my head. It's definitely an option.
Ghosts are stored in the balls?
also George_Chickens quit fucking my sister

BigD145

  • Bay Watcher
Re: How to compress many image files? Digitizing a bookshelf.
« Reply #7 on: April 01, 2017, 01:06:58 am »

EPUB would be the easiest format to read from and still have a small file size. PDFs tend to be quite a bit larger. CBZ and CBR are used for comics; they are ZIP and RAR archives, respectively, under a different extension, so any archive tool can open them.

Compression is one thing. Using the end results is another. How did you plan on accessing your library? Laptop? E-reader?

lordcooper

  • Bay Watcher
  • I'm a number!
Re: How to compress many image files? Digitizing a bookshelf.
« Reply #8 on: April 04, 2017, 06:22:05 am »

Being optimistic and assuming two seconds (on average) to take each photo, that's 31 hours of solid photography, and that's not factoring in breaks and general procrastination. Wouldn't it make more sense to just buy ebook versions instead?

It's also arguably a breach of copyright, which might be more important to you than to most people (you're a lawyer, right?).

I'm also incredibly sceptical about your assumption that a JPEG of a page is 1 KB. Here's a photo of page 1 of the first Harry Potter book. I've cropped it to minimise screen space and selected the lowest possible quality (smallest size) Photoshop allows. Uploading it to Imgur adds further compression, taking the filesize down to a mere 24KB. And it's barely legible at this quality.

So yeah, multiply your estimated space by at least 24 to get something realistic (the original image I used was 55KB, but your camera/scanner is likely a much higher resolution than the image I picked up online). Of course, cropping it and making alterations to minimise the filesize like this takes time. Let's be incredibly generous again and suggest five seconds for each image, plus the two for taking the image. That's another 78 hours of editing on top of the 31 hours of photography, so roughly 109 hours to make horrible, barely legible copies of these books.
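
For anyone who wants to check the time figures, the arithmetic is sketched below. The page count comes from the original post and the per-page timings from this reply; nothing here is measured. Five seconds per page on its own gives about 78 hours, and adding the two seconds of photography brings the total to about 109 hours.

```python
PAGES = 112 * 500  # 56,000 pages, from the original post


def hours(seconds_per_page: float) -> float:
    """Total working time in hours at a fixed number of seconds per page."""
    return PAGES * seconds_per_page / 3600


print(f"photography only (2 s/page): {hours(2):.1f} h")
print(f"editing only     (5 s/page): {hours(5):.1f} h")
print(f"both             (7 s/page): {hours(7):.1f} h")
```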

It's probably best to just buy the ebooks.
« Last Edit: April 04, 2017, 06:44:07 am by lordcooper »
Santorum leaves a bad taste in my mouth

wierd

Re: How to compress many image files? Digitizing a bookshelf.
« Reply #9 on: April 04, 2017, 07:33:32 am »

It may be possible to achieve the stated filesize if 1bpp is used, i.e. true black and white (not greyscale). Each pixel is then represented by a single bit rather than the whole 8 bits that greyscale needs. When a compressor that exploits neighboring-pixel patterns is applied to such an image, one can really cut down the filesize considerably. (Strictly speaking, JPEG has no 1bpp mode; bilevel scans are normally stored with CCITT Group 4 compression in TIFF or PDF, or with JBIG2 in PDF.)

Again, this is only true for 1bpp scans. This is NOT something you are going to get from a digital camera. You will need to do that with a flatbed scanner.
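
To put numbers on the bit-depth difference before any compression is applied, here is a small sketch of the raw (uncompressed) size of one scanned page. The 6 x 9 inch page and the 300 dpi resolution are assumptions picked for illustration; the function name is hypothetical.

```python
def raw_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of a scan, with each pixel row padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


# Assumed page: 6 x 9 inches scanned at 300 dpi (1800 x 2700 pixels).
for name, bpp in [("1bpp black and white", 1),
                  ("8bpp greyscale", 8),
                  ("24bpp colour", 24)]:
    print(f"{name:>22}: {raw_bytes(6, 9, 300, bpp):>10,} bytes")
```

Under these assumptions the 1bpp scan starts out eight times smaller than the greyscale one, and a bilevel compressor then shrinks it further.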

Personally, I would just use a shell script to take the output from a scan and pipe it through the console-based OCR I already mentioned, so that you just turn the page each time the script prompts, and it does all the work for you. (I would most likely write said script myself.) This would mostly automate the process, so my only real task at that point is ensuring good registration in the scanner.

Failing that, I would scan all the pages first at 1bpp, name the scans using some numerical sequence rule, then fire off a script against the directory containing the scan files.
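
One naming rule that works well with batch scripts is zero-padded page numbers, because a plain alphabetical sort then matches page order. A minimal sketch follows; the `page_` prefix, the four-digit width, and the `.png` extension are arbitrary choices, not anything prescribed above.

```python
def scan_name(page: int, width: int = 4, prefix: str = "page_", ext: str = ".png") -> str:
    """Build a zero-padded filename such as page_0007.png."""
    return f"{prefix}{page:0{width}d}{ext}"


pages = (1, 2, 10, 100)

# Zero-padded names sort alphabetically in true page order.
padded = [scan_name(n) for n in pages]
print(sorted(padded))

# Unpadded names do not: page_10 and page_100 sort ahead of page_2.
unpadded = [f"page_{n}.png" for n in pages]
print(sorted(unpadded))
```

Four digits cover books of up to 9,999 pages, which is plenty for a 500-page average.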

Either way, I just do the scanning, and the computer turns it back into useful text for me. I then discard the image files, as they are no longer useful, and keep the resulting etext.
