Created attachment 77502
PDF File font problem
When running pdftotext on the attached "FRA_2803_DE_FD_1B686455.pdf" to extract the text, the program consumes memory indefinitely until it exhausts all memory available on the system. The tests were run on openSUSE 12.1, Ubuntu 12.04 and Ubuntu 12.10, with versions 0.18.0, 0.20.4 and 0.22.2 of pdftotext. Other tools such as pdftoppm and pdftohtml give the same result. The standard output of the command is also attached (output.txt), showing that the font is not recognized.
I think there is a problem with rendering embedded fonts that are corrupt within the PDF document. Opening the file with Adobe Reader gives the following message: "The font 'FrutigerLT-Cn' contains a bad /BBox", but the document can still be saved as text.
Because of the high memory consumption, I tried searching for memory leaks, but running valgrind on version 0.22.0 of pdftotext does not report any leaks.
Where is the pdf file?
Created attachment 77588
Sorry, this is the PDF file.
Quick analysis: there is a page that draws a form Xf4. While drawing that form, a new dictionary with resources is pushed, and the form's content calls the new Xf4 form. Unfortunately that entry's ref is null, so the loop in GfxResources::lookupXObject ends up picking up the same Xf4 form again, which then executes itself forever.
Suggestion: store in Gfx the forms we are currently drawing (using the ref); if we find a form we are already drawing, bail out.
Am I making any sense?
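A minimal sketch of the suggestion, written as a self-contained C++ toy model and not as poppler's actual code. The names Form, Document, Renderer, drawContent and formsInProgress_ are hypothetical; only the idea (remember the refs of the forms being drawn and bail out on re-entry) comes from the comment above. The lookup is assumed to walk the resource stack from innermost to outermost and to fall through null entries, which is how the self-reference arises.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an indirect reference: (object number, generation).
using Ref = std::pair<int, int>;

// Toy resource dictionary: XObject name -> index into Document::forms,
// or kNull for an entry whose reference resolves to null.
constexpr int kNull = -1;
using XObjectDict = std::map<std::string, int>;

struct Form {
    Ref ref;                       // reference identifying this form
    XObjectDict resources;         // the form's own XObject resources
    std::vector<std::string> ops;  // XObject names invoked by the form's content
};

struct Document {
    std::vector<Form> forms;
};

class Renderer {
public:
    explicit Renderer(const Document &doc) : doc_(doc) {}

    int formsDrawn = 0;         // forms actually executed
    int recursionsSkipped = 0;  // re-entrant invocations that were refused

    // Push a resource dictionary, run the content, pop the dictionary.
    void drawContent(const XObjectDict &resources,
                     const std::vector<std::string> &ops) {
        resourceStack_.push_back(&resources);
        for (const std::string &name : ops) {
            int idx = lookupXObject(name);
            if (idx != kNull) {
                drawForm(doc_.forms[idx]);
            }
        }
        resourceStack_.pop_back();
    }

private:
    // Search from the innermost dictionary outwards. A null entry does not
    // stop the search, so an outer definition can be picked up instead.
    int lookupXObject(const std::string &name) const {
        for (auto it = resourceStack_.rbegin(); it != resourceStack_.rend(); ++it) {
            auto found = (*it)->find(name);
            if (found != (*it)->end() && found->second != kNull) {
                return found->second;
            }
        }
        return kNull;
    }

    void drawForm(const Form &form) {
        // The guard: if this ref is already being drawn, bail out.
        if (!formsInProgress_.insert(form.ref).second) {
            ++recursionsSkipped;
            return;
        }
        ++formsDrawn;
        drawContent(form.resources, form.ops);
        auto erased = formsInProgress_.erase(form.ref);
        assert(erased == 1);
        (void)erased;  // silence unused-variable warnings when NDEBUG is set
    }

    const Document &doc_;
    std::vector<const XObjectDict *> resourceStack_;
    std::set<Ref> formsInProgress_;
};
```

In this model, a page whose resources map Xf4 to a form that invokes Xf4 through a null entry in its own resources draws that form once and then refuses the nested invocation. Without the formsInProgress_ check, drawForm would recurse without bound.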
No one commented, so I implemented it. It seems to work and causes no regressions. Committed; it will be in poppler 0.22.4.