The ultimate goal of the project is to write a Python script that:
a) converts PDFs to sequential JPG pages
b) trims the white margins around the JPGs and saves them to disk.
The second part of the project is to upload the JPGs to AWS S3 with public read permissions, using Python and an AWS package.
The third part is garbage collection on AWS S3.
Given a list of PDF URLs + PDF_IDs, and an image quality/size (pixel length + width), the script should (see the sketch after this list):
1) Download the PDF from the provided link, e.g. https://s3.us-east-2.amazonaws.com/pdfs12z49a/sample+pdf/[login to view URL]
2) Convert each PDF into a series of JPGs (one per page) at the specified image quality/size
3) Trim the white margins from each JPG (the margin spacing will vary, so it must be calculated for each page)
4) Create a folder on disk named after the PDF_ID, and save each image in a subfolder derived from the image quality/size input (e.g. C:\PDF2JPG\PDF_ID\300dpi\[login to view URL], [login to view URL], etc.)
5) Output a list of lists for each PDF, containing PDF_ID, quality, page_number, location_on_disk:
['PDFID1', 300, 1, 'C:\PDF2JPG\PDFID1\300dpi\[login to view URL]'],
['PDFID1', 300, 2, 'C:\PDF2JPG\PDFID1\300dpi\[login to view URL]']
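Here is a minimal sketch of steps 1-5 above, assuming the third-party packages requests, pdf2image (which needs the poppler binaries installed) and Pillow. The helper names and the page_N.jpg filename pattern are my own placeholders, since the real filenames sit behind the [login to view URL] placeholders. The trim step uses the common Pillow recipe of diffing each page against an all-white image and cropping to the non-white bounding box, which handles margins that vary per page:

    import os
    import requests
    from pdf2image import convert_from_path
    from PIL import Image, ImageChops

    def trim_white_margins(image, tolerance=10):
        # Diff the page against an all-white canvas; getbbox() then
        # returns the bounding box of everything that is not white.
        background = Image.new(image.mode, image.size, (255, 255, 255))
        diff = ImageChops.difference(image, background)
        # Effectively diff - tolerance, so near-white noise is ignored.
        diff = ImageChops.add(diff, diff, 2.0, -tolerance)
        bbox = diff.getbbox()
        return image.crop(bbox) if bbox else image

    def pdf_to_trimmed_jpgs(pdf_url, pdf_id, dpi, out_root=r'C:\PDF2JPG'):
        out_dir = os.path.join(out_root, pdf_id, '%ddpi' % dpi)
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)

        # 1) Download the PDF from the provided link.
        pdf_path = os.path.join(out_dir, pdf_id + '.pdf')
        response = requests.get(pdf_url)
        response.raise_for_status()
        with open(pdf_path, 'wb') as f:
            f.write(response.content)

        # 2) Render one image per page at the requested DPI.
        pages = convert_from_path(pdf_path, dpi=dpi)

        # 3)-5) Trim, save, and collect [PDF_ID, quality, page, path] rows.
        rows = []
        for page_number, page in enumerate(pages, start=1):
            trimmed = trim_white_margins(page.convert('RGB'))
            jpg_path = os.path.join(out_dir, 'page_%d.jpg' % page_number)
            trimmed.save(jpg_path, 'JPEG')
            rows.append([pdf_id, dpi, page_number, jpg_path])
        return rows

The brief mentions pixel length + width as well as DPI; pdf2image's size= argument can cap the output dimensions if that variant is needed.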
Part 2 - Upload to AWS using Python 2.7 / an AWS package, given the list of lists from above (a sketch follows this list):
1) Create a new folder named PDF_ID within the existing bucket (S3 buckets cannot be nested, so the "sub buckets" here are key prefixes)
2) Upload all images to the S3 folder for PDF_ID with public read permissions
3) Output a list of lists for each PDF, containing PDF_ID, quality, page_number, AWS URL:
['PDFID1', 300, 1, '[login to view URL]'],
['PDFID1', 300, 2, '[login to view URL]']
and also output a list containing PDF_ID, page_number, URL.
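Here is a minimal sketch of the upload, assuming boto3 as the "AWS package" (the standard AWS SDK for Python; older releases still support Python 2.7) and treating the PDF_ID "sub bucket" as a key prefix inside the existing bucket. The bucket name and the path-style URL shape are copied from the sample link above and are assumptions, not confirmed values:

    import os
    import boto3

    EXISTING_BUCKET = 'pdfs12z49a'  # placeholder; real bucket provided after bid

    def upload_pages(rows):
        # rows is the Part 1 output: [[pdf_id, quality, page_number, path], ...]
        s3 = boto3.client('s3')
        uploaded = []
        for pdf_id, quality, page_number, local_path in rows:
            # "Sub bucket" = key prefix: PDF_ID/<quality>dpi/<filename>.
            key = '%s/%ddpi/%s' % (pdf_id, quality, os.path.basename(local_path))
            # public-read ACL so the returned URL works without signing.
            s3.upload_file(local_path, EXISTING_BUCKET, key,
                           ExtraArgs={'ACL': 'public-read'})
            url = 'https://s3.us-east-2.amazonaws.com/%s/%s' % (EXISTING_BUCKET, key)
            uploaded.append([pdf_id, quality, page_number, url])
        return uploaded

Note there is no separate "create folder" call: writing the first object under the PDF_ID/ prefix is what brings the folder into existence in S3.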
Part 3 - AWS garbage collector - Python + AWS package
Given a list of PDF_IDs, delete the folders (key prefixes) with those IDs.
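A minimal sketch of the garbage collector under the same assumptions (boto3, "sub buckets" as key prefixes). Deleting every object under the prefix removes the folder itself, because S3 folders exist only as shared key prefixes:

    import boto3

    def garbage_collect(pdf_ids, bucket_name='pdfs12z49a'):  # placeholder bucket
        bucket = boto3.resource('s3').Bucket(bucket_name)
        for pdf_id in pdf_ids:
            # Batch-deletes every object whose key starts with 'PDF_ID/'.
            bucket.objects.filter(Prefix=pdf_id + '/').delete()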
Ideally, I'm looking for somebody who has done this type of project in the past and has a script lying around. Once the bid is accepted, I will provide:
1) PDF IDs + links
2) A user ID + password for AWS with write permissions to test buckets