I have a batch of several hundred PDF files that have been annotated (mostly highlights, but some comments) using a variety of software platforms (iOS Goodreader, Skim, Adobe Acrobat, Apple [login to view URL]). I need a script that parses a specified set of PDF files and performs the actions below quickly and without errors. The script should rely primarily on built-in functions, but may use a small number of reasonably standard libraries (e.g. PyPDF2, poppler via python-poppler-qt5, PDFMiner, etc.). The desired actions are as follows:
(1) Extract annotations from the PDF files and export a markdown-formatted list of comments and highlights to an individual file for each PDF (naming convention, `[original PDF name][login to view URL]`). Note: if text is embedded in the PDF, provide the page number, the author of the annotation, and the date/time the annotation was created (ISO 8601 format), followed by the full text of the highlight or comment; if text is not embedded in the PDF (e.g. page images only), provide a list with the page number, author, and date/time of each annotation.
(2) Extract annotations from the PDF files and export a markdown-formatted list of comments and highlights to a single markdown file (naming convention, `[login to view URL]`), following the conventions noted above.
(3) Extract annotations from the PDF files and export the resulting dataframe as a JSON or CSV table, with page number, annotation type, content, author, and date/time (ISO 8601 format).
(4) Write a raw dump of all annotations from all files, in their original XML format, to a text file (with the name of each source file included in the dump).
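To illustrate the output described in (1) to (3), here is a minimal sketch using only the standard library. It assumes the annotations have already been read from the PDF by whichever library is chosen and are available as plain dictionaries; the dictionary keys and the function names `pdf_date_to_iso`, `annotation_to_markdown`, and `annotations_to_csv` are hypothetical, and the markdown layout is only one possible rendering. It also assumes annotation timestamps arrive in the usual PDF date form (`D:YYYYMMDDHHmmSS` with an optional UTC offset).

```python
import csv
import io
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:YYYYMMDDHHmmSS+HH'mm'; everything after
# the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)

# Column order for the table export in (3); names are illustrative.
FIELDS = ["page", "type", "content", "author", "date"]


def pdf_date_to_iso(raw):
    """Convert a PDF date string to ISO 8601; return it unchanged if unparseable."""
    match = _PDF_DATE.match(raw.strip())
    if not match:
        return raw
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tzinfo = None
    if sign in ("Z", "z"):
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tzinfo = timezone(delta if sign == "+" else -delta)
    try:
        stamp = datetime(int(year), int(month or 1), int(day or 1),
                         int(hour or 0), int(minute or 0), int(second or 0),
                         tzinfo=tzinfo)
    except ValueError:
        # Out-of-range fields (e.g. month 13): keep the raw value.
        return raw
    return stamp.isoformat()


def annotation_to_markdown(note):
    """Render one annotation as a markdown list item.

    The text is appended only when present, which covers image-only PDFs
    where no highlighted text can be recovered.
    """
    line = "- p. {}, {}, {}, {}".format(
        note["page"], note["type"], note["author"],
        pdf_date_to_iso(note["date"]))
    text = (note.get("content") or "").strip()
    return line + (": " + text if text else "")


def annotations_to_csv(notes):
    """Serialize annotations to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for note in notes:
        row = {key: note.get(key, "") for key in FIELDS}
        row["date"] = pdf_date_to_iso(note["date"])
        writer.writerow(row)
    return buffer.getvalue()
```

The same dictionaries could be passed to `json.dumps` for the JSON variant of (3); keeping the date conversion in one helper means every output format reports timestamps identically.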
Allow the user to set basic options on the command line:
(a) select the output format (one of 1-4 above)
(b) filter the input PDF files using a simple regexp on the filenames specified on the command line
(c) specify whether to overwrite existing markdown files
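The options in (a) to (c) could be exposed along the following lines. This is only a sketch built on the standard `argparse`, `re`, and `pathlib` modules; the flag names (`--output`, `--filter`, `--overwrite`), the output-mode labels, and the helper names are assumptions for illustration, not a required interface.

```python
import argparse
import re
from pathlib import Path

# Hypothetical labels for output actions 1-4.
OUTPUT_MODES = ["per-file", "combined", "table", "raw"]


def build_parser():
    """Create the command-line parser for options (a) to (c)."""
    parser = argparse.ArgumentParser(
        description="Extract annotations from PDF files.")
    parser.add_argument("pdfs", nargs="+", type=Path,
                        help="PDF files to process")
    parser.add_argument("--output", choices=OUTPUT_MODES, default="per-file",
                        help="output format (actions 1-4)")
    parser.add_argument("--filter", default=None, metavar="REGEX",
                        help="only process files whose name matches REGEX")
    parser.add_argument("--overwrite", action="store_true",
                        help="replace existing markdown files")
    return parser


def select_files(paths, pattern):
    """Keep only the paths whose file name matches the regexp, if one is given."""
    if pattern is None:
        return list(paths)
    regex = re.compile(pattern)
    return [path for path in paths if regex.search(path.name)]


def should_write(target, overwrite):
    """Write when overwriting is allowed or the target does not exist yet."""
    return overwrite or not target.exists()
```

With these names, an invocation such as `script.py --output table --filter '^2019' *.pdf` would process only files whose names start with `2019`, and existing markdown files would be left alone unless `--overwrite` is passed.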
All files are to be deposited in a GitHub repository ([login to view URL]) for testing and final delivery (via GitHub pull requests). The code must be well commented. The freelancer should be comfortable developing code that will be released under an open license (BSD or CC-BY). The final script should pass pylint with a score of 7 or higher and meet the criteria below:
- successful execution of the script on macOS, parsing the sample PDF files (to be provided) and performing actions 1-4 as defined above
- script passes pylint with a score of 7 or higher; the final pull request deposits the script together with a README documenting installation and operation (on macOS using Python 3.6+)