Bug 2929 - searching should 'expand' ligatures so fi matches ﬁ etc
Summary: searching should 'expand' ligatures so fi matches ﬁ etc
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: PowerPC Linux (All)
Importance: high normal
Assignee: Kristian Høgsberg
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-04-08 04:07 UTC by Soeren Sonnenburg
Modified: 2006-05-25 04:43 UTC

See Also:


Attachments
the testcase (95.20 KB, application/pdf)
2005-04-08 04:11 UTC, Soeren Sonnenburg
uncompressed file (to be opened with FontForge) (6.59 KB, application/binary)
2006-05-14 18:08 UTC, Pablo Rodríguez
poppler-unicode-search.patch (380.92 KB, patch)
2006-05-16 16:36 UTC, Ed Catmur

Description Soeren Sonnenburg 2005-04-08 04:07:09 UTC
http://bugzilla.gnome.org/show_bug.cgi?id=170341

Open the PDF and search for 'suf': you will find "sufficient statistic" on
page 20.

Try searching for 'sufficient' or 'suff' and nothing is found.

Discussion in #evince gives the explanation: 'fi' is a ligature, so searching
for 'suf' works but 'suff' does not. There needs to be some clever trick (i.e.
additional code) to also find words that contain ligatures.
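
The failure can be reproduced outside poppler. The sketch below is only an illustration: it assumes the extracted text stores the ligature as the single code point U+FB01 (LATIN SMALL LIGATURE FI), which is consistent with the behaviour reported here but has not been checked against the attached PDF. The names `line` and `naive_find` are made up for the example.

```python
# Hypothetical line of extracted text: "sufficient" typeset with the
# single-character ligature U+FB01 (LATIN SMALL LIGATURE FI).
line = "suf\ufb01cient statistic"


def naive_find(haystack: str, needle: str) -> int:
    """Plain code-point search with no knowledge of ligatures."""
    return haystack.find(needle)


print(naive_find(line, "suf"))         # found at the start of the line
print(naive_find(line, "suff"))        # not found: the second 'f' is inside U+FB01
print(naive_find(line, "sufficient"))  # not found for the same reason
```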
Comment 1 Soeren Sonnenburg 2005-04-08 04:11:30 UTC
Created attachment 2350 [details]
the testcase

I forgot to mention, the first url links to the original evince bug.
Comment 2 Pablo Rodríguez 2005-05-06 04:53:34 UTC
Sorry if I'm not right, but I guess that ligatures do work, the problem is that
Adobe includes a table of these ligatures and how to handle them when copying
and searching for text. Poppler should add such information about ligatures to
handle them right.
Comment 3 Kristian Høgsberg 2005-05-06 08:48:28 UTC
(In reply to comment #2)
> Sorry if I'm not right, but I guess that ligatures do work, the problem is that
> Adobe includes a table of these ligatures and how to handle them when copying
> and searching for text. Poppler should add such information about ligatures to
> handle them right.

No, you're right: ligatures do work, they just confuse searching.  I've changed
the summary.
Comment 4 Pablo Rodríguez 2005-05-07 09:26:40 UTC
I'm not sure whether this will be helpful, but there is an interesting tip here:
http://omega.enstb.org/yannis/pdf/eurotex05t.pdf (pp. 22-3). PDF documents
generated from TeX sources could follow this convention.
Comment 5 Pablo Rodríguez 2005-10-03 02:38:58 UTC
The ligature text is not handled properly because the ActualText value is not
implemented in poppler. The PDF Reference 1.6 explains it on page 872 (section
10.8.3: Replacement Text).

I think it is an important feature to be implemented.
Comment 6 Pablo Rodríguez 2005-10-09 05:14:57 UTC
I don't know whether Albert Astals Cid is one of the recipients of this bug
report (the system didn't allow me to add his email address to the cc field). I
only want to draw his attention to this specific question (because of
http://lists.freedesktop.org/archives/poppler/2005-October/001029.html).

Ligatures aren't to be translated. acroread is able to "decompose" them because
it implements the ActualText value, as explained on page 872 of the PDF
Reference 1.6. 

(Sorry, I would submit a patch myself, but I'm afraid I cannot code.)
Comment 7 Ed Catmur 2005-11-29 02:15:30 UTC
This also messes up copying from Evince, esp. from LaTeX pdfs.

Suggest change summary to "Handle ActualText (searching and copying ligatures
e.g. ﬁ)"
Also change Hardware to All.

How much work would it be to implement ActualText?
Comment 8 Pablo Rodríguez 2006-05-14 07:47:53 UTC
I'm afraid that I was totally wrong. ActualText is indeed a way of doing this,
but as described in the PDF Reference 1.6 (see above), it is meant for other
purposes (see the example in the PDF Reference).

With ligatures, Acrobat reads the info from the glyph itself (included in the
embedded font). I don't know how it can be read, but if you open the attached
file (uncompressed.pdf) with the latest version of FontForge, select the fi
glyph, press Ctrl+i and open the "Ligature" tab, you will see that there is a
ligature.

As far as I know, these ligature tags are described in the OpenType
Specification
(http://partners.adobe.com/public/developer/opentype/index_spec.html). They
include at least liga, rlig, dlig, clig (and the deprecated dpng). There may be
more ligature tags, but poppler has to handle them all the same way.

(Sorry if I'm explaining something obvious, but I don't know whether my
explanation is clear.) The implementation should work in the opposite direction
to the way Pango >= 1.2 handles some ligatures. But poppler should read all
kinds of ligatures defined in fonts.

Thanks for your excellent work (and sorry for my mistake),


Pablo
Comment 9 Pablo Rodríguez 2006-05-14 18:08:08 UTC
Created attachment 5622 [details]
uncompressed file (to be opened with FontForge)

Sorry, I forgot to include the attachment.
Comment 10 Ed Catmur 2006-05-16 16:28:30 UTC
I'm not convinced the ligature properties are the solution to this; the ﬁ
characters are stored in poppler as Unicode text, so matching them is a Unicode
text search problem. The solution to that is to implement Unicode searching as
described in UTS#10 <http://www.unicode.org/reports/tr10/#Searching>.

I've put together a simple patch for searching; the stratagem used is to convert
the search term and search text (a line of the document) to NFKC (compatibility
composition) and do a binary match. To improve speed, the lines are shadowed
with the NFKC canonicalization as required, and reverse indices are provided
from the NFKC string back into the document line.

I'll attach the patch.
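
The idea of a normalized "shadow" line with reverse indices can be sketched in a few lines of Python. This is not taken from the attached patch; `nfkc_shadow` and `find_normalized` are hypothetical names, and the sketch makes one simplifying assumption: the line is normalized cluster by cluster (a starter plus the combining marks that follow it), which is not guaranteed to equal NFKC of the whole line in every case (Hangul jamo sequences, for example).

```python
import unicodedata
from typing import List, Optional, Tuple


def nfkc_shadow(line: str) -> Tuple[str, List[int]]:
    """Return an NFKC shadow of `line` plus a reverse index.

    back[i] is the index in `line` where the cluster that produced
    shadow[i] starts. A cluster is a character with combining class 0
    followed by any characters with a non-zero combining class.
    """
    shadow: List[str] = []
    back: List[int] = []
    i = 0
    while i < len(line):
        j = i + 1
        while j < len(line) and unicodedata.combining(line[j]) != 0:
            j += 1
        for ch in unicodedata.normalize("NFKC", line[i:j]):
            shadow.append(ch)
            back.append(i)
        i = j
    return "".join(shadow), back


def find_normalized(line: str, term: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span in the original line, end exclusive."""
    shadow, back = nfkc_shadow(line)
    needle = unicodedata.normalize("NFKC", term)
    if not needle:
        return None
    pos = shadow.find(needle)
    if pos < 0:
        return None
    # Map the last matched shadow character back to its source cluster
    # and extend the span to the end of that cluster.
    end = back[pos + len(needle) - 1] + 1
    while end < len(line) and unicodedata.combining(line[end]) != 0:
        end += 1
    return back[pos], end


line = "suf\ufb01cient statistic"  # assumed extracted text, with U+FB01
print(find_normalized(line, "suff"))
print(find_normalized(line, "sufficient"))
```

A match that ends inside a ligature is widened to cover the whole ligature in the original line, since a viewer can only highlight whole glyphs.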
Comment 11 Ed Catmur 2006-05-16 16:36:03 UTC
Created attachment 5638 [details] [review]
poppler-unicode-search.patch

Patch to do searches in normalization form NFKC.

Most of the patch is the Unicode data tables. The combining class and
composition tables are copied from GLib; the decomposition tables are generated
by the included Python script.
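
For readers curious what generating such a table involves, here is a minimal sketch. It is not the script included in the patch, and it does not reproduce that script's output format; it only assumes that Python's standard `unicodedata` module is an acceptable source for the data. The function name `compat_decomposition_table` is made up for the example.

```python
import unicodedata
from typing import Dict, List


def compat_decomposition_table(first: int, last: int) -> Dict[int, List[int]]:
    """Map each code point in [first, last] to its full compatibility
    decomposition (NFKD), skipping code points that do not decompose."""
    table: Dict[int, List[int]] = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        full = unicodedata.normalize("NFKD", ch)
        if full != ch:
            table[cp] = [ord(c) for c in full]
    return table


# The Latin ligatures of the Alphabetic Presentation Forms block.
table = compat_decomposition_table(0xFB00, 0xFB06)
for cp, seq in sorted(table.items()):
    print(f"U+{cp:04X} -> " + " ".join(f"U+{c:04X}" for c in seq))
```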

It fixes the bug for the above testcase.
Comment 12 Kristian Høgsberg 2006-05-22 16:49:38 UTC
Patch applied to CVS, thanks.  I was wondering if it is possible, and if it
makes sense, to make the matching even more liberal, i.e. so 'a' matches any
accented version of 'a' and so 'ae' matches 'æ'.  Another idea is to make
text-extraction (including pdftotext and copy-and-paste from viewers) use the
normalized form.

I'm closing this bug since it is now fixed, but now that we have those
unicode tables in poppler, maybe we can implement some of these ideas.
Comment 13 Pablo Rodríguez 2006-05-23 06:31:31 UTC
Kristian, text-extraction should not copy the ligature, but the normalized text.
At least this is what acroread does. I think the bug should be reopened until
this is implemented.
Comment 14 Kristian Høgsberg 2006-05-23 08:08:19 UTC
(In reply to comment #13)
> Kristian, text-extraction should not copy the ligature, but the normalized text.
> At least this is what acroread does. I think the bug should be reopened until
> this is implemented.

This bug is fixed; I've opened #7002 for the text extraction issue.
Comment 15 Ed Catmur 2006-05-24 21:37:29 UTC
(In reply to comment #12)
> I was wondering if it is possible, and if it
> makes sense, to make the matching even more liberal, i.e. so 'a' matches any
> accented version of 'a' and so 'ae' matches 'æ'. 

It's certainly possible, but it would require a lot of work, since character
matching is locale-sensitive. (Is 'ø' an accented version of 'o', or a separate
character? Does 'ô' match 'o', 'oe', or both? Is 'å' 'aa' or 'a'? Can a match
end in the middle of 'll', or is that a single letter?) 

At that point, it would probably be more sensible to use ICU.

Oh, and thanks for taking my patch.
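
Part of the difficulty shows up in the Unicode data itself. The small check below (a sketch using Python's `unicodedata`, not poppler code; `decomposition_of` is a made-up helper) prints the full compatibility decomposition of the characters mentioned above. 'ô' and 'å' decompose into a base letter plus a combining mark, while 'ø' and 'æ' have no decomposition at all, so normalization alone cannot make 'o' match 'ø' or 'ae' match 'æ'.

```python
import unicodedata
from typing import List


def decomposition_of(ch: str) -> List[str]:
    """Code points of the NFKD form of `ch`, formatted as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", ch)]


# U+00F4 (o with circumflex), U+00E5 (a with ring above),
# U+00F8 (o with stroke), U+00E6 (ae)
for ch in "\u00f4\u00e5\u00f8\u00e6":
    print(ch, decomposition_of(ch))
```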
Comment 16 Pablo Rodríguez 2006-05-25 21:43:05 UTC
(In reply to comment #15)
> (In reply to comment #12)
> > I was wondering if it is possible, and if it
> > makes sense, to make the matching even more liberal, i.e. so 'a' matches any
> > accented version of 'a' and so 'ae' matches 'æ'. 
> 
> It's certainly possible, but it would require a lot of work, since character
> matching is locale-sensitive. (Is 'ø' an accented version of 'o', or a separate
> character? Does 'ô' match 'o', 'oe', or both? Is 'å' 'aa' or 'a'? Can a match
> end in the middle of 'll', or is that a single letter?) 

Character matching seems to be locale-sensitive, even when considering only
diacritics (http://en.wikipedia.org/wiki/Diacritic). u and ü are the same letter
in Spanish and different letters in German.

I think the find feature should make it possible to enable or disable case
sensitivity and diacritic sensitivity. With diacritic sensitivity disabled, the
search should find the character no matter which diacritical marks it carries.
And it should be language- and locale-independent.
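
A locale-independent version of these two switches could look roughly like the sketch below. It is an illustration only, not a proposal for poppler's API: `fold` and `contains` are hypothetical names, diacritics are removed by dropping every combining mark after canonical decomposition, and characters without a decomposition (such as 'ø') are left untouched.

```python
import unicodedata


def fold(text: str, case_sensitive: bool, diacritic_sensitive: bool) -> str:
    """Reduce `text` to the form that is compared during a search."""
    if not diacritic_sensitive:
        # Decompose, then drop every combining mark.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if not case_sensitive:
        text = text.casefold()
    return unicodedata.normalize("NFC", text)


def contains(haystack: str, needle: str,
             case_sensitive: bool = True,
             diacritic_sensitive: bool = True) -> bool:
    """True if `needle` occurs in `haystack` under the chosen options."""
    return (fold(needle, case_sensitive, diacritic_sensitive)
            in fold(haystack, case_sensitive, diacritic_sensitive))


# "ping\u00fcino" is "pinguino" written with u-diaeresis (U+00FC).
print(contains("ping\u00fcino", "pinguino"))
print(contains("ping\u00fcino", "pinguino", diacritic_sensitive=False))
```

With both options left at their defaults the comparison is exact apart from canonical normalization, so the first call reports no match and the second one does.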

