http://bugzilla.gnome.org/show_bug.cgi?id=170341 Open the PDF and search for 'suf': you will find 'sufficient statistic' on page 20. Try searching for 'sufficient' or 'suff' and nothing is found. Discussion in #evince gives the explanation: 'fi' is a ligature, so searching for 'suf' works but 'suff' does not. There needs to be some clever trick (i.e. additional code) to also find words that contain ligatures.
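A minimal illustration of the behaviour, assuming the extracted text carries the ligature as the single codepoint U+FB01 (LATIN SMALL LIGATURE FI):

```python
# 'sufficient' as extracted when the 'fi' pair is a single ligature glyph
text = "suf\ufb01cient"

print("suf" in text)   # True: the prefix before the ligature matches
print("suff" in text)  # False: the next codepoint is U+FB01, not 'f'
print(len(text))       # 9, not 10: the ligature is one codepoint
```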
Created attachment 2350 [details] the testcase. I forgot to mention that the first URL links to the original Evince bug.
Sorry if I'm not right, but I guess that ligatures do work, the problem is that Adobe includes a table of these ligatures and how to handle them when copying and searching for text. Poppler should add such information about ligatures to handle them right.
(In reply to comment #2) > Sorry if I'm not right, but I guess that ligatures do work, the problem is that > Adobe includes a table of these ligatures and how to handle them when copying > and searching for text. Poppler should add such information about ligatures to > handle them right. No you're right, ligatures do work, they just confuse searching. I've changed the summary.
I'm not sure whether this will be helpful, but there is an interesting tip here: http://omega.enstb.org/yannis/pdf/eurotex05t.pdf (pp. 22-3). PDF documents generated from TeX sources could follow this convention.
The ligature text is not handled properly because the ActualText value is not implemented in poppler. The PDF Reference 1.6 explains it on page 872 (section 10.8.3: Replacement Text). I think this is an important feature to implement.
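For reference, a schematic content-stream fragment in the style of the Replacement Text mechanism (section 10.8.3); the font name and glyph string here are illustrative, not taken from the testcase:

```
% Wrap the ligature glyph in marked content carrying its replacement text
/Span << /ActualText (ffi) >> BDC
  /F1 12 Tf
  (?) Tj        % '?' stands for the single 'ffi' ligature glyph code
EMC
```

A viewer that implements ActualText would return "ffi" for searching and copying instead of the raw glyph code.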
I don't know whether Albert Astals Cid is one of the recipients of this bug report (the system hasn't allowed me to add his email address to the CC field). I only want to get his attention to this concrete question (because of http://lists.freedesktop.org/archives/poppler/2005-October/001029.html). Ligatures aren't meant to be translated; acroread is able to "decompose" them because it implements the ActualText value, as explained on page 872 of the PDF Reference 1.6. (Sorry, I would submit a patch myself, but I'm afraid I cannot code.)
This also messes up copying from Evince, esp. from LaTeX pdfs. Suggest change summary to "Handle ActualText (searching and copying ligatures e.g. fi)" Also change Hardware to All. How much work would it be to implement ActualText?
I'm afraid that I was totally wrong. ActualText is indeed a way of doing this, but as described in the PDF Specification 1.6 (see above), it is used for other purposes (there is an example in the specification). With ligatures, Acrobat reads the information from the glyph itself, included in the embedded font. I don't know how it can be read, but if you open the attached file (uncompressed.pdf) with the latest version of FontForge, select the fi glyph, press Ctrl+I and open the "Ligature" tab, you will see that there is a ligature. As far as I know, these ligature tags are described in the OpenType Specification (http://partners.adobe.com/public/developer/opentype/index_spec.html), at least the tags liga, rlig, dlig, clig (and the deprecated dpng). There may be more ligature features, but they would have to be handled the same way by poppler. (Sorry if I'm explaining something obvious, but I don't know whether my explanation is clear.) The implementation should work the opposite way from how Pango >= 1.2 handles some ligatures, but poppler should read all kinds of ligatures defined in fonts. Thanks for your excellent work (and sorry for my mistake), Pablo
Created attachment 5622 [details] uncompressed file (to be opened with FontForge) Sorry, I forgot to include the attachment.
I'm not convinced the ligature properties are the solution to this; the fi characters are stored in poppler as Unicode text, so matching them is a Unicode text search problem. The solution to that is to implement Unicode searching as described in UTS #10 <http://www.unicode.org/reports/tr10/#Searching>. I've put together a simple patch for searching; the stratagem used is to convert the search term and the search text (a line of the document) to NFKC (compatibility composition) and do a binary match. To improve speed, the lines are shadowed with the NFKC normalization as required, and reverse indices are provided from the NFKC string back into the document line. I'll attach the patch.
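The shadow-and-reverse-index idea can be sketched roughly as follows (a simplification of the patch, not its actual code: it normalizes character by character, which ignores combining sequences that normalize across character boundaries):

```python
import unicodedata

def nfkc_search(needle, line):
    """Find `needle` in `line` after NFKC normalization.

    Returns (start, end) offsets into the ORIGINAL line, or None.
    """
    # Shadow the line with its NFKC form, one source character at a time,
    # so each shadow offset can be mapped back to a source offset.
    norm_chars = [unicodedata.normalize("NFKC", ch) for ch in line]
    shadow = "".join(norm_chars)
    target = unicodedata.normalize("NFKC", needle)
    pos = shadow.find(target)
    if pos < 0:
        return None
    # Reverse index: offset in the shadow string -> offset in `line`.
    back = []
    for i, nc in enumerate(norm_chars):
        back.extend([i] * len(nc))
    start = back[pos]
    end = back[pos + len(target) - 1] + 1
    return start, end
```

With the ligature testcase, `nfkc_search("suff", "suf\ufb01cient")` returns (0, 4): the match extends through the 'fi' ligature so a viewer can highlight the right span.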
Created attachment 5638 [details] [review] poppler-unicode-search.patch Patch to do searches in normalization form NFKC. Most of the patch is the Unicode data tables. The combining class and composition tables are copied from GLib; the decomposition tables are generated by the included Python script. It fixes the bug for the above testcase.
Patch applied to CVS, thanks. I was wondering if it is possible, and if it makes sense, to make the matching even more liberal, i.e. so 'a' matches any accented version of 'a' and so 'ae' matches 'æ'. Another idea is to make text-extraction (including pdf2text and copy-and-paste from viewers) use the normalized form. I'm closing this bug for now, since it is now fixed, but now that we have those unicode tables in poppler, maybe we can implement some of these ideas.
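The normalized-extraction idea mentioned above amounts to running each extracted line through NFKC before emitting it; a one-step sketch (the function name is illustrative, not poppler API):

```python
import unicodedata

def extract_normalized(line):
    # Hypothetical post-processing step for text extraction
    # (pdftotext, copy-and-paste): fold compatibility characters
    # such as ligatures into their plain-letter equivalents.
    return unicodedata.normalize("NFKC", line)

print(extract_normalized("suf\ufb01cient"))  # "sufficient"
```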
Kristian, text extraction should not copy the ligature, but the normalized text. At least this is what acroread does. I think the bug should be reopened until this is implemented.
(In reply to comment #13) > Kristian, text extraction should not copy the ligature, but the normalized > text. At least this is what acroread does. I think the bug should be reopened > until this is implemented. This bug is fixed; I've opened #7002 for the text extraction issue.
(In reply to comment #12) > I was wondering if it is possible, and if it > makes sense, to make the matching even more liberal, i.e. so 'a' matches any > accented version of 'a' and so 'ae' matches 'æ'. It's certainly possible, but it would require a lot of work, since character matching is locale-sensitive. (Is 'ø' an accented version of 'o', or a separate character? Does 'ô' match 'o', 'oe', or both? Is 'å' 'aa' or 'a'? Can a match end in the middle of 'll', or is that a single letter?) At that point, it would probably be more sensible to use ICU. Oh, and thanks for taking my patch.
(In reply to comment #15) > (In reply to comment #12) > > I was wondering if it is possible, and if it > > makes sense, to make the matching even more liberal, i.e. so 'a' matches any > > accented version of 'a' and so 'ae' matches 'æ'. > > It's certainly possible, but it would require a lot of work, since character > matching is locale-sensitive. (Is 'ø' an accented version of 'o', or a separate > character? Does 'ô' match 'o', 'oe', or both? Is 'å' 'aa' or 'a'? Can a match > end in the middle of 'll', or is that a single letter?) Character matching does seem to be locale-sensitive, even when considering only diacritics (http://en.wikipedia.org/wiki/Diacritic): u and ü are the same letter in Spanish but different letters in German. I think the find feature should be able to enable/disable case sensitivity and diacritic sensitivity separately. When diacritic sensitivity is disabled, the search should find the character no matter which diacritical marks it has, and it should be language- and locale-independent.
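One locale-blind way to implement the proposed diacritic-insensitivity toggle is to decompose to NFKD and drop combining marks; a sketch (function names are illustrative), which also shows the limits comment #15 raises, since letters like 'ø' and 'æ' have no decomposition and are left untouched:

```python
import unicodedata

def fold_diacritics(s):
    # Decompose (NFKD), then drop combining marks (category Mn).
    # Locale-blind: 'u' and 'ü' compare equal, which suits Spanish
    # but not German; 'ø' has no decomposition and stays 'ø'.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

def find_insensitive(needle, haystack, diacritic_sensitive=False):
    # Toggleable diacritic sensitivity, combined with case folding.
    if not diacritic_sensitive:
        needle = fold_diacritics(needle)
        haystack = fold_diacritics(haystack)
    return needle.casefold() in haystack.casefold()
```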