| Summary: | pdftotext -htmlmeta and pdftohtml fail to decode U+2019 | ||
|---|---|---|---|
| Product: | poppler | Reporter: | Steven Murdoch <sjm217-freedesktop> |
| Component: | general | Assignee: | poppler-bugs <poppler-bugs> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | ||
| Priority: | medium | CC: | sjm217-freedesktop |
| Version: | unspecified | Keywords: | patch |
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Attachments: | Fix pdftotext -htmlmeta to correctly output U+2019 in PDF metadata; Test case demonstrating problem with U+2019 in title; Fix encoding of PDF document metadata in output of pdftohtml | ||

Description
Steven Murdoch
2011-06-03 16:17:16 UTC
Please attach a PDF with such a problem.

Created attachment 47513
Test case demonstrating problem with U+2019 in title

Attached as requested (generated by Word 2007 + Acrobat 9, the same as the document that was actually causing the problem).
$ pdftotext -htmlmeta /tmp/u2019test.pdf - | xxd | less
...
00000a0: 6d6c 223e 0a3c 6865 6164 3e0a 3c74 6974 ml">.<head>.<tit
00000b0: 6c65 3e54 6573 7420 6f66 2070 6466 746f le>Test of pdfto
00000c0: 7465 7874 c290 7320 636f 6e76 6572 7369 text..s conversi
00000d0: 6f6e 206f 6620 552b 3230 3139 2e3c 2f74 on of U+2019.</t
...
[0xc2 0x90 is the UTF-8 encoding of U+0090]
$ pdfinfo /tmp/u2019test.pdf | xxd | less
...
0000000: 5469 746c 653a 2020 2020 2020 2020 2020 Title:
0000010: 5465 7374 206f 6620 7064 6674 6f74 6578 Test of pdftotex
0000020: 74e2 8099 7320 636f 6e76 6572 7369 6f6e t...s conversion
0000030: 206f 6620 552b 3230 3139 2e0a 4175 7468 of U+2019..Auth
0000040: 6f72 3a20 2020 2020 2020 2020 736a 6d32 or: sjm2
...
[0xe2 0x80 0x99 is the UTF-8 encoding of U+2019]
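
The two dumps above can be reproduced in a few lines of Python. This is only an illustrative sketch, not poppler code: it assumes the title is stored in PDFDocEncoding, where byte 0x90 is the right single quotation mark, and that the buggy path treats that byte as ISO 8859-1 before writing UTF-8. The variable names are made up for the example.

```python
# Assumption: the title byte in the PDF is 0x90 (quoteright in PDFDocEncoding).
raw_title_byte = bytes([0x90])

# Misreading the byte as ISO 8859-1 gives U+0090, which UTF-8 encodes as c2 90.
naive_utf8 = raw_title_byte.decode("latin-1").encode("utf-8")

# The intended character is U+2019, which UTF-8 encodes as e2 80 99.
expected_utf8 = "\u2019".encode("utf-8")

print(naive_utf8.hex(), expected_utf8.hex())  # prints: c290 e28099
```

The first value matches the pdftotext -htmlmeta dump and the second matches the pdfinfo dump.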
I've committed your patch and it will be in poppler >= 0.17.1. If you are interested in fixing pdftohtml, we'd like a patch for it.

Created attachment 47627
Fix encoding of PDF document metadata in output of pdftohtml

pdftohtml simply copies the PDF document title into the <title> HTML tag, which fails when the title is UCS-2 encoded, or if it contains characters which are in pdfDocEncoding (an ISO 8859-1 superset) but not in ISO 8859-1. This patch fixes the problem by decoding UCS-2 or pdfDocEncoding into Unicode, then encoding the result in the desired output encoding. HTML escaping wasn't being done either, so I have used the existing function HtmlFont::HtmlFilter to perform both HTML escaping and character set encoding. This static method had to be made public so that it can be called from pdftohtml.

See bug #37900.

Fix committed. Your help is appreciated; keep the patches coming :-)
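
For readers who want to see the shape of the pdftohtml fix described in this thread (decode to Unicode, HTML-escape, then encode for output), here is a rough Python analogue. It is not the patch itself and does not use poppler's API. The function names `decode_pdf_text_string` and `html_title` are hypothetical, the byte-order-mark check for UCS-2 is an assumption about how such titles are marked, and `PDFDOC_OVERRIDES` covers only a handful of PDFDocEncoding bytes rather than the full table.

```python
import html

# Partial, illustrative PDFDocEncoding table: a few bytes whose meaning
# differs from ISO 8859-1. A real decoder needs the complete table.
PDFDOC_OVERRIDES = {
    0x84: "\u2014",  # em dash
    0x85: "\u2013",  # en dash
    0x8D: "\u201c",  # left double quotation mark
    0x8E: "\u201d",  # right double quotation mark
    0x8F: "\u2018",  # left single quotation mark
    0x90: "\u2019",  # right single quotation mark
}


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string into Unicode (hypothetical helper)."""
    # Assumption: a leading FE FF byte-order mark signals big-endian UCS-2/UTF-16.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Otherwise treat each byte as PDFDocEncoding, falling back to the
    # ISO 8859-1 code point for bytes missing from the partial table.
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


def html_title(raw: bytes, out_encoding: str = "utf-8") -> bytes:
    """Build a <title> element: decode, HTML-escape, then encode for output."""
    escaped = html.escape(decode_pdf_text_string(raw))
    # Characters the output encoding cannot represent become numeric references.
    return f"<title>{escaped}</title>".encode(out_encoding, errors="xmlcharrefreplace")


print(html_title(b"pdftotext\x90s"))
print(html_title(b"\xfe\xff\x20\x19", "iso-8859-1"))
```

Doing the escaping after decoding and before encoding means markup characters in a title are neutralised regardless of how the title was stored, and the numeric-reference fallback keeps characters such as U+2019 intact even when the output encoding is ISO 8859-1.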