Created attachment 47501
Fix pdftotext -htmlmeta to correctly output U+2019 in PDF metadata

pdftotext -htmlmeta is supposed to parse the PDF metadata and output it as HTML metadata. It generally works, but fails when decoding U+2019 (right single quotation mark). This is because U+2019 may be encoded in PDF documents as 0x90: the PDF document encoding (PDFDocEncoding) reuses some of the code points that ISO 8859-1 reserves for control characters. pdfinfo does the right thing, so I have attached a patch which makes pdftotext use the same approach as pdfinfo. pdftohtml has the same problem, but I haven't attempted to fix it.
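To illustrate the failure described above, here is a minimal Python sketch, not the poppler code. The table PDFDOC_OVERRIDES and the function decode_pdfdoc are hypothetical names, and the table covers only the four quotation-mark entries; every other byte is assumed to match ISO 8859-1, which is not true for all of PDFDocEncoding.

```python
# Hypothetical partial table: only the PDFDocEncoding quotation marks.
# The full encoding reassigns more bytes than are listed here.
PDFDOC_OVERRIDES = {
    0x8D: "\u201C",  # left double quotation mark
    0x8E: "\u201D",  # right double quotation mark
    0x8F: "\u2018",  # left single quotation mark
    0x90: "\u2019",  # right single quotation mark
}


def decode_pdfdoc(raw: bytes) -> str:
    """Decode bytes as PDFDocEncoding, falling back to ISO 8859-1 (assumption)."""
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


title = b"pdftotext\x90s"

# Treating the bytes as ISO 8859-1 turns 0x90 into the control character U+0090.
naive = title.decode("latin-1")

# Decoding through the PDFDocEncoding table yields U+2019 instead.
fixed = decode_pdfdoc(title)

print(naive.encode("utf-8").hex(" "))
print(fixed.encode("utf-8").hex(" "))
```

The first printed line contains c2 90 and the second contains e2 80 99, which are the two byte sequences that distinguish the broken output from the correct one.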
Please attach a PDF that shows the problem.
Created attachment 47513
Test case demonstrating problem with U+2019 in title

Attached as requested (generated by Word 2007 + Acrobat 9, the same as the document that was actually causing the problem).

$ pdftotext -htmlmeta /tmp/u2019test.pdf - | xxd | less
...
00000a0: 6d6c 223e 0a3c 6865 6164 3e0a 3c74 6974  ml">.<head>.<tit
00000b0: 6c65 3e54 6573 7420 6f66 2070 6466 746f  le>Test of pdfto
00000c0: 7465 7874 c290 7320 636f 6e76 6572 7369  text..s conversi
00000d0: 6f6e 206f 6620 552b 3230 3139 2e3c 2f74  on of U+2019.</t
...
[0xc2 0x90 is the UTF-8 encoding of U+0090]

$ pdfinfo /tmp/u2019test.pdf | xxd | less
...
0000000: 5469 746c 653a 2020 2020 2020 2020 2020  Title:          
0000010: 5465 7374 206f 6620 7064 6674 6f74 6578  Test of pdftotex
0000020: 74e2 8099 7320 636f 6e76 6572 7369 6f6e  t...s conversion
0000030: 206f 6620 552b 3230 3139 2e0a 4175 7468   of U+2019..Auth
0000040: 6f72 3a20 2020 2020 2020 2020 736a 6d32  or:         sjm2
...
[0xe2 0x80 0x99 is the UTF-8 encoding of U+2019]
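The two bracketed notes in the dumps can be checked independently of poppler. This short Python sketch decodes the byte sequences taken from the hex output; the variable names are only for illustration.

```python
import unicodedata

# Byte sequences copied from the two xxd dumps.
wrong = bytes.fromhex("c290").decode("utf-8")    # from pdftotext -htmlmeta
right = bytes.fromhex("e28099").decode("utf-8")  # from pdfinfo

for ch in (wrong, right):
    # Control characters have no Unicode name, so supply a default.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
```

This prints U+0090 with no name (it is a C1 control character) and U+2019 RIGHT SINGLE QUOTATION MARK.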
I've committed your patch and it will be in poppler >= 0.17.1. If you are interested in fixing pdftohtml, we'd like a patch for it.
Created attachment 47627
Fix encoding of PDF document metadata in output of pdftohtml

pdftohtml simply copies the PDF document title into the <title> HTML tag, which fails when the title is UCS-2 encoded, or when it contains characters which are in pdfDocEncoding (an ISO 8859-1 superset) but not in ISO 8859-1. This patch fixes the problem by decoding UCS-2 or pdfDocEncoding into Unicode, then encoding the result in the desired output encoding. HTML escaping wasn't being done either, so I have used the existing function HtmlFont::HtmlFilter to perform both HTML escaping and character set encoding. This static method had to be made public so that it can be called from pdftohtml. See bug #37900.
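The decode, escape, and re-encode steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patch itself: decode_pdf_text_string and html_title are hypothetical names, the override table is deliberately partial, UCS-2 is approximated with Python's utf-16-be codec, and html.escape stands in for HtmlFont::HtmlFilter, whose exact behaviour is not reproduced here.

```python
import html

# Hypothetical partial PDFDocEncoding table (single quotation marks only).
PDFDOC_OVERRIDES = {0x8F: "\u2018", 0x90: "\u2019"}


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string into Unicode."""
    # A leading FE FF byte order mark signals a big-endian 16-bit string.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Otherwise treat it as PDFDocEncoding (ISO 8859-1 fallback is an assumption).
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


def html_title(raw: bytes, out_encoding: str = "utf-8") -> bytes:
    """Return the title escaped for HTML and encoded for output."""
    text = html.escape(decode_pdf_text_string(raw), quote=False)
    # Characters the output encoding cannot represent become numeric references.
    return text.encode(out_encoding, errors="xmlcharrefreplace")


print(html_title(b"A & B\x90s"))
print(html_title(b"\xfe\xff" + "a<b".encode("utf-16-be")))
print(html_title(b"\x90", "latin-1"))
```

Decoding to Unicode first means escaping and output encoding are handled in one place, whichever way the title was stored in the PDF.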
Fix committed. Your help is appreciated, keep the patches coming :-)