Freelance
PhP 10,000 / $200 for task
TBD
Mar 9, 2021
We need an executable program, preferably written in Python, that extracts data from anywhere between one and several hundred PDFs with variable formatting and writes the output to a CSV file. The data to be extracted will already be in one or more tables, one or more images, and formatted text such as a business name or address. The difficulty is that the information to be extracted can appear in different locations or have different dimensions depending on the source of the PDF. Job success has two criteria. The first is how the extraction is handled: 1) the ideal solution handles the variation automatically, or 2) an acceptable solution lets a user provide definitions from one or more templates to extract the necessary information. A bonus will be awarded if the ideal solution is developed. The second criterion is speed: the solution must be able to process a PDF within a reasonable amount of time.
We will provide 3 example PDFs with the locations to be extracted marked, along with the expected output format. The solution will be verified on examples that were not provided in advance, to confirm that it works as intended and does not rely on manually extracting the required information.
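
To make the template-based option above more concrete, here is a minimal sketch of how user-supplied field definitions could drive the extraction of formatted text and produce CSV output. Everything in it is an assumption for illustration only: the `Word` record, the template layout (field name mapped to a bounding box), the sample coordinates, and the helper names `extract_fields` and `rows_to_csv` are hypothetical. It also assumes a PDF library has already produced words with bounding boxes, with the origin at the top-left of the page. Tables and images are not covered here.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One word with its bounding box (hypothetical shape; origin assumed top-left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def extract_fields(words, template):
    """Collect the words whose centers fall inside each field's box.

    `template` maps a field name to an (x0, y0, x1, y1) box. A user would
    supply one such template per PDF layout.
    """
    row = {}
    for field, (bx0, by0, bx1, by1) in template.items():
        inside = [
            w for w in words
            if bx0 <= (w.x0 + w.x1) / 2 <= bx1
            and by0 <= (w.y0 + w.y1) / 2 <= by1
        ]
        # Reading order: top to bottom, then left to right.
        inside.sort(key=lambda w: (round(w.y0), w.x0))
        row[field] = " ".join(w.text for w in inside)
    return row


def rows_to_csv(rows, fieldnames):
    """Serialize extracted rows to CSV text, one row per PDF."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up words standing in for the output of a PDF text extractor.
sample_words = [
    Word("Acme", 50, 40, 90, 52),
    Word("Supplies", 95, 40, 150, 52),
    Word("12", 50, 70, 62, 82),
    Word("Main", 66, 70, 95, 82),
    Word("St", 99, 70, 112, 82),
    Word("Invoice", 400, 40, 450, 52),
]

# Made-up template for one layout: field name -> bounding box.
sample_template = {
    "business_name": (40, 30, 200, 60),
    "address": (40, 60, 200, 90),
}

row = extract_fields(sample_words, sample_template)
csv_text = rows_to_csv([row], fieldnames=list(sample_template))
print(csv_text)
```

In a design like this, supporting a new PDF source would mean adding a template rather than changing code. The automatic option would instead have to infer the boxes from the page content.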