Bug 37900

Summary:	pdftotext -htmlmeta and pdftohtml fail to decode U+2019
Product:	poppler	Reporter:	Steven Murdoch <sjm217-freedesktop>
Component:	general	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	medium	CC:	sjm217-freedesktop
Version:	unspecified	Keywords:	patch
Hardware:	All
OS:	All
Whiteboard:
Attachments:	Fix pdftotext -htmlmeta to correctly output U+2019 in PDF metadata (attachment 47501)
	Test case demonstrating problem with U+2019 in title (attachment 47513)
	Fix encoding of PDF document metadata in output of pdftohtml (attachment 47627)

Description Steven Murdoch 2011-06-03 16:17:16 UTC

Created attachment 47501
Fix pdftotext -htmlmeta to correctly output U+2019 in PDF metadata

pdftotext -htmlmeta is supposed to parse the PDF metadata and output it as HTML metadata. It generally works, but fails when decoding U+2019 (right single quotation mark).

The cause is that PDF documents may encode U+2019 as 0x90: the PDF document encoding assigns printable characters to some of the code points that ISO 8859-1 reserves for control characters. pdfinfo does the right thing, so I have attached a patch which makes pdftotext use the same approach as pdfinfo. pdftohtml has the same problem, but I haven't attempted to fix it.

Comment 1 Albert Astals Cid 2011-06-04 02:57:26 UTC

Please attach a pdf with such a problem.

Comment 2 Steven Murdoch 2011-06-04 03:24:37 UTC

Created attachment 47513
Test case demonstrating problem with U+2019 in title

Attached as requested (generated by Word 2007 + Acrobat 9, the same as the document that was actually causing the problem).

$ pdftotext -htmlmeta /tmp/u2019test.pdf - | xxd | less
...
00000a0: 6d6c 223e 0a3c 6865 6164 3e0a 3c74 6974  ml">.<head>.<tit
00000b0: 6c65 3e54 6573 7420 6f66 2070 6466 746f  le>Test of pdfto
00000c0: 7465 7874 c290 7320 636f 6e76 6572 7369  text..s conversi
00000d0: 6f6e 206f 6620 552b 3230 3139 2e3c 2f74  on of U+2019.</t
...

[0xc2 0x90 is the UTF-8 encoding of U+0090]

$ pdfinfo /tmp/u2019test.pdf | xxd | less

...
0000000: 5469 746c 653a 2020 2020 2020 2020 2020  Title:          
0000010: 5465 7374 206f 6620 7064 6674 6f74 6578  Test of pdftotex
0000020: 74e2 8099 7320 636f 6e76 6572 7369 6f6e  t...s conversion
0000030: 206f 6620 552b 3230 3139 2e0a 4175 7468   of U+2019..Auth
0000040: 6f72 3a20 2020 2020 2020 2020 736a 6d32  or:         sjm2
...

[0xe2 0x80 0x99 is the UTF-8 encoding of U+2019]
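
The two bracketed notes can be checked independently of poppler. The following is a small sketch that decodes the byte sequences seen in the two hexdumps and reports the resulting code points; the variable names are illustrative.

```python
import unicodedata

# Byte sequences copied from the pdftotext and pdfinfo hexdumps.
from_pdftotext = bytes.fromhex("c290").decode("utf-8")
from_pdfinfo = bytes.fromhex("e28099").decode("utf-8")

print(f"U+{ord(from_pdftotext):04X}", unicodedata.category(from_pdftotext))
print(f"U+{ord(from_pdfinfo):04X}", unicodedata.name(from_pdfinfo))
```

The first sequence decodes to a C1 control character (category `Cc`), the second to the right single quotation mark.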

Comment 3 Albert Astals Cid 2011-06-04 12:24:58 UTC

I've committed your patch and it will be in poppler >= 0.17.1

If you are interested in fixing pdftohtml we'd like a patch for it.

Comment 4 Steven Murdoch 2011-06-06 17:05:31 UTC

Created attachment 47627
Fix encoding of PDF document metadata in output of pdftohtml

pdftohtml simply copies the PDF document title into the <title> HTML
tag, which fails when the title is UCS-2 encoded, or if it contains
characters which are in pdfDocEncoding (an ISO 8859-1 superset), but not
in ISO 8859-1.  This patch fixes the problem by decoding UCS-2 or
pdfDocEncoding into Unicode, then encoding this in the desired output
encoding.  HTML escaping wasn't being done either, so I have used the
existing function HtmlFont::HtmlFilter to perform both HTML escaping
and character set encoding. This static method had to be made public
to call it from pdftohtml. See bug #37900.
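
The approach described in this comment (decode to Unicode, HTML-escape, then encode in the output encoding) can be sketched as follows. This is not the patch and not `HtmlFont::HtmlFilter`; it is a Python illustration under stated assumptions: the helper names `decode_pdf_text_string` and `html_title` are hypothetical, the `PDFDOC_OVERRIDES` table is a one-entry excerpt of pdfDocEncoding, and a title starting with the byte order mark 0xFE 0xFF is assumed to be the 16-bit big-endian case.

```python
import html

# Hypothetical one-entry excerpt of pdfDocEncoding; a real table has more.
PDFDOC_OVERRIDES = {0x90: "\u2019"}


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a metadata string into Unicode (illustrative helper)."""
    if raw.startswith(b"\xfe\xff"):
        # Assumption: a leading 0xFE 0xFF marks 16-bit big-endian text.
        return raw[2:].decode("utf-16-be")
    # Otherwise use the excerpt, falling back to Latin-1 for other bytes.
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


def html_title(raw: bytes, out_encoding: str = "utf-8") -> bytes:
    """Build a <title> element: decode, escape, then encode."""
    text = decode_pdf_text_string(raw)
    escaped = html.escape(text, quote=False)  # escapes &, < and >
    # Characters the output encoding cannot represent become numeric
    # character references instead of raising an error.
    return f"<title>{escaped}</title>".encode(out_encoding, errors="xmlcharrefreplace")


print(html_title(b"Test of pdftotext\x90s conversion of U+2019."))
print(html_title(b"\x90", out_encoding="latin-1"))
```

Escaping after decoding but before encoding keeps the two concerns separate: markup characters are neutralised on the Unicode string, and the choice of output encoding only affects the final byte conversion.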

Comment 5 Albert Astals Cid 2011-06-20 15:26:28 UTC

Fix committed, your help is appreciated, keep patches coming :-)
