GNOME Bug: http://bugzilla.gnome.org/show_bug.cgi?id=171010 [copied from bug] Reporter: tommi.komulainen@iki.fi (Tommi Komulainen) See the following pdf: http://www.architecture.external.hp.com/Download/arch_template_vers13_withexamples.pdf Copy the Table 1 on page 14 and paste it into OOo or gedit. The contents of the table now gets extra line breaks between the columns, but what's worse the left column is pasted on the line following the right column. They should be the other way around, definition before the description. ------- Additional Comment #1 From Tommi Komulainen 2005-03-20 17:56 UTC ------- OTOH copying the same table from xpdf 2.01 seems to work as one would expect. From acroread everything is as one long line with no line breaks. ------- Additional Comment #2 From Bryan W Clark 2005-03-24 00:28 UTC ------- This will hopefully get fixed when we have something like http://bugzilla.gnome.org/show_bug.cgi?id=165155 Can't quite dupe this one against that, but maybe it should depend on it...
As I reported in http://bugzilla.gnome.org/show_bug.cgi?id=326358 (now closed), this happens with the CVS manual (http://ftp.gnu.org/non-gnu/cvs/source/stable/1.11.21/cederqvist-1.11.21.pdf) also.
Ubuntu bug about that: https://launchpad.net/distros/ubuntu/+source/poppler/+bug/33288
I took a look at this today because its pretty old and the downstream evince bug has a lot of dupes: http://bugzilla.gnome.org/show_bug.cgi?id=325189 Some of the comments in the evince bug about mouseover modes/images are irrelevant, since the selection is just using poppler_page_get_text anyway. After reading the code, I am wondering if the bug is caused by TextPage::visitSelection iterating over blocks directly, rather than using TextFlow order? http://cgit.freedesktop.org/poppler/poppler/tree/poppler/TextOutputDev.cc?id=e3e4113c73128f49f99289b592446d4382b5d65c#n3898 So the fix might be fairly simple, just to iterate over blocks in flow order in TextPage::visitSelection, and only use the selection rect when visiting the first and last selected block - for intermediate blocks, the whole block should be selected so PDFRectangle passed in should be the block's own bbox. I might have a go at a patch along these lines, but if its the wrong approach let me know before I go too far.
Brian Ewins: did you ever create/submit the patch you proposed? Though I know nothing about Poppler, and thus can't comment on whether it's the "right" way to fix the bug, any solution would be better than none. This major bug has gone unfixed for way too long.
Created attachment 23584 [details] [review] UNTESTED fix for the bug. David: I hadn't and now I feel guilty :) Not at a machine I can build poppler on right now but I've typed up the change I intended; ymmv, may not even compile. Hope this nudges things along a bit.
Created attachment 23585 [details] [review] fix typo in patch
Created attachment 23691 [details] [review] a patch that actually compiles While this is still untested (jhbuild of evince is failing for me at the moment), at least the patched poppler compiles with no new warnings. I'll get this tested asap but posting this so no-one wastes time on the previous patch.
I managed to test this a bit today, and for what I was able to test, can confirm that the patch works. With the latest GnomeDeveloperKit iso, jhbuild of the Gnome 2.26 moduleset, I couldn't get evince to open pdfs - dbus issue I think - so I moved on to trying epdfview 0.1.7 with the jhbuilt poppler. That epdfview was able to select but not copy, (looked like an epdfview bug) but selection in multicolumn docs exhibits the bug described here and in bug 4006, with the selection expanding in odd ways. So I was able to go on and try the patch. I patched the copy of poppler that had been built (0.10.2); the patch above doesn't apply to that older tree but its just a matter of replacing TextPage::visitSelection with the code in the patch I posted. Rebuilt, ran epdfview, hey presto - the selection predictably follows the column text, which appears to fix this and bug 4006. I'll leave a comment over there too, in case anyone downstream wants to take a look and test it better.
Hi Brian, thank you very much for the patch, this has been a very important issue for a long time now, with lots of duplicates. The patch works great with evince, the selection follows columns perfectly. However, the selected text is not correcly pasted. See this screenshot: http://people.freedesktop.org/~carlosgc/ev-selection-columns.png The text on the second column is pasted before. I'm looking forward to seeing this finally fixed. Thanks!.
Thanks Carlos, looks like this just fixes bug 4006 so far then. I'll dig into that further later today, seems like we're close.
Ok the problem appears to be that TextSelectionDumper::getText does a top-down qsort on the fragments it was passed, when they were in reading order already. The qsort here is just a relic from before the 'flow' code centralized reading-order decisions, and getText should more closely resemble the section commented 'output the page "undoing" the layout' in TextPage::dump, complicated a little by the visitor infrastructure. This looks like I can wrap it up in a unit test.
Created attachment 24739 [details] [review] prevent re-sorting of selection fragments I managed to test this one with evince. Carlos, your screenshot in comment #9 shows what I presume was the intended old behaviour: its actually pasting the text with its layout intact; I'd agree this isn't what I'd want as a user. With this second patch applied, the selected parts of each column are presented in reading order instead of side-by-side. I think the output has excess whitespace - the columns should be directly below each other instead of leaving space for previous columns. Also it might also be better if TextPage::dump effectively selected the whole page and dumped its text by visiting the selection, to avoid duplication of this logic. However, I went for the simple fix first. This seems to work except for those cases where poppler's reading-order heuristics break down, eg, on paragraphs that flow around a diagram that contains text labels (the labels are viewed as part of the para because the fall within the flow bbox, I think), pasting tables (which unlike text columns, are best kept tabular), and most likely on rtl text.
Created attachment 24748 [details] [review] fix typo Oops. Replacing that last patch.
Created attachment 24749 [details] a more agressive fix, paste unformatted text In this version of the patch, I remove the formatting entirely from text selected in reading order, except for line breaks. This is closer to what is done by Acrobat: there there are several selection modes. 'Text' selection mode produces unformatted text in reading order; 'Table' selection mode corresponds to TextPage::getText(x1,y1,x2,y2) - formatted text from within a rectangular selection. The poppler code previously confused the two. I've got a couple of cleanup patches to follow on from this, to reduce duplication.
Hi Brian, I've noticed other problem with your patch. See this screenshot: http://people.freedesktop.org/~carlosgc/poppler-selection-problem.png It's always reproducible with the hig document (http://developer.gnome.org/projects/gup/hig/2.0/hig-2.0.pdf) and evince. Thanks!
Created attachment 24922 [details] [review] replacement patch Thanks for checking that one Carlos. It seems the logic I'd used to find the end of the selection was working in part by chance. Here's a different approach, which should be more deterministic. As before, we find the first block in reading order whose bottom right is below and right of either of the two corner points of the selection; start selecting text here. The other corner becomes the 'stop point'. Expand the selection from there until we find a block containing the stop point. If there is no such block, the selection ends before the first block that is below the stop point (the 'low_block' in the code). This makes a much better job of the gnome hig doc, and works fine on the multicolumn docs I've tried (at least, when poppler determines reading order correctly).
Created attachment 24928 [details] [review] sigh - fixed again I'd introduced a small bug in the last one tidying up the patch. There should never be more than two partially selected blocks (the first and last).
Brian, it seems there are still problems with the bullets. See: http://people.freedesktop.org/~carlosgc/poppler-selection-problem2.png This document is available here: http://research.sun.com/techrep/1994/smli_tr-94-29.pdf And even with the hig there is still a weird behaviour. See: http://people.freedesktop.org/~carlosgc/poppler-selection-problem3.png Note that only the first two bullets symbols are selected. Thanks!
Created attachment 24938 [details] [review] further improvements With the previous version of the patch, when the stop point lay outside any block, the choice of stop block was sometimes unnatural - this is very noticeable when doing a multicolumn select and moving the mouse into whitespace between paragraphs in the rightmost column. In this latest version of the patch, I've improved this quite a bit, so the selection expands much more smoothly. The 'natural' block to select is found like this: 1. if there are blocks directly above the stop point, use the lowest one that is still above the stop point 2. otherwise, find the nearest column, and choose the lowest block in that column that is still above the stop point. Step 1 selects the blocks you select when dragging down a column. Step 2 guesses which block you meant when you drag outside a column. Occasionally step 2 goes wrong: drag past the end of a narrow paragraph, and it will no longer be selected. However you can always get the desired effect, of selecting this whole para, by moving the mouse under the para. Its not ideal but it is reasonably discoverable, and could be fixed by better column identification. Its way better than the old behaviour anyway.
We've overlapped there Carlos, but the latest version doesn't fix those two issues. They are both down to problems with the column identification in poppler (a bit outside the scope of this bug, but there's more robust algorithms in ocropus). In the hig, poppler does not correctly identify the bullets as being part of their paragraphs, and then does not think that all the bullets form a single column; hence what you see. (this is a bit clearer when you look at the text you get from a paste.) In the sun report, if you drag down in the first column, you'll see that the second column gets highlighted as soon as you hit those bulleted items: this is because poppler thinks that the bulleted items are after the right hand column in reading order! I do think there's a problem with my code here though with the way the selection behaves when the start or end point are in those bullet pointed items. They should definitely always select when the mouse moves over them, but it seems to depend on the starting point of the selection. I'll take a look at that.
Created attachment 24983 [details] [review] another stab at it Here's a new version. Whats different this time is how much better the selection follows the mouse. For example, if you click in the margin then drag into the text, the selection is just the lines you have moused across; current evince grabs either the whole beginning or whole end of the paragraph. I've got rid of all of the 'flashing' I could see - where the selection suddenly expands then contracts again as you move - except for those instances where its caused by mistakes in the reading order; the hig document above is a good example of that. In order to do this I had to fix up TextBlock::visitSelection as well, since it often didn't find the end line and would suddenly cause a whole para to be selected (and then deselected again a few pixels further) I don't think this can be pushed much further without fixing up the reading order issues. But that's a pile of work ( the algorithm I'm considering is here: http://pubs.iupr.org/cache/2002-breuel-das.pdf.dir/m-p-00001.html ), although maybe the current code could just be tweaked to deal better with these bullet lists. I'd rather not sink time into that if there's no chance of this getting in.
Created attachment 25026 [details] [review] simpler patch This behaves much the same as the previous patch, but the algorithm is simpler and there's a lot less code, so it should be easier to review.
(In reply to comment #22) > Created an attachment (id=25026) [details] > simpler patch > > This behaves much the same as the previous patch, but the algorithm is simpler > and there's a lot less code, so it should be easier to review. > Does this patch depend on any of the previous patches? it applies cleanly on master, but it doesn't build: TextOutputDev.cc: In member function 'void TextPage::visitSelection(TextSelectionVisitor*, PDFRectangle*, SelectionStyle)': TextOutputDev.cc:3980: error: 'begin' was not declared in this scope TextOutputDev.cc:3980: error: 'end' was not declared in this scope TextOutputDev.cc: In function 'void TextOutputDev_outputToFile(void*, char*, int)': I'm sorry, but I don't have time to debug it now. Thanks!
Sorry about that Carlos, must've messed up again transferring the patch from the VM. Yes, it was supposed to apply on its own and compile. This line: + visitor->visitBlock (this, begin, end, selection); Should be: + visitor->visitBlock (this, best_block[start], best_block[stop], selection); But when I'm back tomorrow I'll respin the patch and make sure its the right one. Still thinking about how to improve the reading order.
Hey Brian, any progress on this? it would be really nice to have this finally fixed for poppler 0.12 Thanks!
Sorry - have been struggling to find free time to tackle this, though I should really have posted the corrected patch by now. I'll try to push things on in the next couple of days. I think this might need some discussion on-list before committing, since I won't have a fix for reading order errors by the June 8 code freeze - is this worth taking without further improvements there?
Brian, have you got any news on this patch? It would be a real improvement!
Hi Brian, I've been looking at your patch again and even after applying the change you mentioned in comment #24 it still doesn't build. Could you attach the last version of the patch that worked for you, please? If you are busy, we'll try to finish your great work, but we need a working patch first. Thanks!
Created attachment 30821 [details] [review] patch against poppler-0.12.0-46-g7670cc4 Sorry for being away for so long. This patch is exactly what I was using previously, hopefully without any errors this time; built and works for me on the gnome dev kit based on poppler 0.10.2, but this time diff generated directly against git rather than transferring back to the host vm first, so there's less chance this is messed up. A patch that applies to current poppler follows.
Created attachment 30822 [details] [review] patch against 0.10.2 Agh...the previous patch was the one rebased against master: poppler-0.12.0-46-g7670cc4. This one is the old pristine patch in case it is needed. This does copy & paste fine for me, but as I've previously mentioned 2 other fixes are needed: the code needs to take account of the primary writing direction on the page, shouldn't be too hard as this is already recorded; and the column selection needs improved, much harder.
Created attachment 31140 [details] [review] Fixes a segfault in previous patches when doc with no text was opened. This is just a fix to the previous patch, and still applies to poppler-0.12.0-46-g7670cc4 . 2 additional patches follow.
Created attachment 31141 [details] [review] remove existing reading-order sort This patch just removes the existing reading-order sort (which happened while flows were being constructed), and replaces it with an XY sort of the blocks. The blocks are then added to the flows preserving this sort order. This patch is just to make it easier to review what follows, although it does result in a usable build.
Created attachment 31143 [details] [review] Add a reading-order sort. In this patch, I replace the XY block sort with a reading-order sort. The sort is supposed to be similar to the one used in Ocropus, which was described in http://pubs.iupr.org/#2003-breuel-sdiut . However, my implementation is not right yet (this is particularly clear if you look at the gnome hig pdf: the headers are selected out-of-order). Some of the comments don't match the code in depth_first_visit as I flipped the comparisons a few times to get this result - thats likely where I've gone wrong. I've used the reading order sort on blocks rather than lines as in Breuel's paper; this seemed the simplest way to slot the code in. I hoped that this might make the overall selection order a bit more logical and then the problem with bullet points getting their own column could be fixed by tweaking the block-building code. I thought it worth posting up as a work in progress, to show where this sort could be integrated, since people have been asking about this bug over on the Ubuntu and Gnome trackers again. The patch takes no account of block rotation or RTL, again for ease of review. That's not going to be too hard to add. Depth_first_visit probably belongs in TextBlock and needs some support methods to do the required comparisons taking rotation into account.
Created attachment 31162 [details] [review] fixed reading order sort this fixes up my reading order sort (I had ommitted brackets round one boolean clause...gah). This now seems reasonably behaved on most of the test multicolumn documents. I have one document thats behaving oddly, but its an unusual case, with text all over a diagram and the columns flowing round it. Still to come: handling mixed block rotation, and RTL. I don't have any good test docs for those though.
Reassign the poppler-bugs
Created attachment 31217 [details] [review] 1/3. Updated series, restores triple-click, bugfixes.
Created attachment 31218 [details] [review] 2/3 updated series. This should be equivalent to the patch it replaces.
Created attachment 31219 [details] [review] 3/3 code reorganized to handle page rotation, some RTL. Now the primary page rotation and primary LTR flag are supported in the reading order sort. I havent added that support to the selection code yet. I followed the line of the rest of the code in using the primary page rotation rather than the block or line rotation. I think this could be improved upon (its going to guess the wrong reading order for rotated text) but its not a common case. The debug statements added here I left in because I have a tool that goes with them (to be added next) that helps check the result of the sort.
Created attachment 31220 [details] perl script to diagram log output figuring out where and why the reading order sort goes wrong on random documents is painful. I wrote this to help visualize the problems, it may help others working on the problem. With the debug statements enabled (the 'PAGE' before the ro sort, the list of blocks after, and the statements in visitDepthFirst), this will dump out svg representing one page of the pdf. (you need to change which page you're working on by editing the program) Red boxes are blocks, blue lines are 'rule 1', green lines are 'rule 2'. For both blue and green lines, a circle on the end indicates the 'after' block in the comparison. These lines only appear when a block was not already in sort order when it was visited; due to the other sorts in the code, the 0th block generally has no lines leaving it (for example). The upper number on each block is which position in the sorted array we were filling when we visited it - this number repeats as you follow the recursion in visitDepthFirst. The lower number is the position the block eventually sorted to.
Created attachment 31404 [details] [review] 1/5 selection follows flows Latest spin on the series. This one has some support for RTL (but not bidi). I don't think the order of the pasted text for RTL is correct yet, nor is the text dump - the lines look to be in reverse order? However that is fixable separately from making the selection move in the right direction.
Created attachment 31405 [details] [review] 2/5 updated patch
Created attachment 31406 [details] [review] 3/5 reading order (bug fixed)
Created attachment 31407 [details] [review] 4/5 change linebreaks This introduces linebreaks in the text dump after each line (instead of after each block), unless a hyphenation is being removed. This is closer to the output from acrobat.
Created attachment 31408 [details] [review] 5/5 rtl selection
it seems this bug now has a working patch and yet there has not been any activity for the past few weeks. It would really be great if this is committed soon as this is a really annoying bug for many.
There has been activity, see mailing list: http://lists.freedesktop.org/archives/poppler/2009-December/005341.html http://lists.freedesktop.org/archives/poppler/2009-December/005346.html Brian is still working on it.
Created attachment 33251 [details] [review] distinguish between columns and tables when selecting text Hi, I added ability to distinguish between columns and tables into Brian's patch. The patch finds all cells which belong to the same table and set ExMin, ExMax, EyMin and EyMax numbers which encloses whole table. With this enclosure, Brian's code can sort text blocks taking into account tables. It also changes the code which pastes the text to be able to paste cells, which are in the same row, correctly (even for multiple-line cells). Regards Marek
I've built and tested the patches on poppler_0.12.0-0ubuntu2.1 and it is definitely a HUGE improvement, simply the ability to select text in columns makes it now useful all of a sudden, many thanks to Brian and Marek for looking into this! (deb packages for ubuntu 9.10 are available from https://launchpad.net/~arand/+archive/poppler, (nil safety included)) There are however still some minor issues, For testing purposes I've been using these two papers: http://ntur.lib.ntu.edu.tw/bitstream/246246/164287/1/50.pdf http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-16-24-19724 (link to fulltext pdf available.) On the first Yuh-Sien SUN et. al. paper there is the equation on the start of page 2, as well as the combined equation/Fig. 7. on page 4; both does messy things to selection. On the second Trefor Sloanes et. al. paper the references on page 1 are a bit odd, as the numbers is treated as a separate column when selecting, whereas on page 2 they are not (second behavior is preferred I would say). Also equation (3) page 3 does a lot of messy things to selection of it, and most of the text following on that page as well. In fact equations seems to mess with selection as a general rule, which isn't very surprising I guess.. If not wanting to sit down and design them, I would say scientific papers makes rather good material for testing, due to their use of columns, diagrams with text, equations, reference lists, etc. and also due to availability e.g. opticsinfobase.org should be a fairly huge supply for test docs... Once again, thanks! This makes a huge difference when working with referencing/quoting on columned pdfs.
Brian, any progress on this since your last patch? Could you comment on Marek's patch? If we have a good patch, even though it still needs some improvements, I would like to include it in the next release so that it will be more tested. Albert?
I've just pushed all the patches. I'm leaving the bug open because even though selection works better now for some multicolumn docs, it still can be improved a lot. I hope it is easier now for you guys since the code have been merged instead of maintaining a set of patches. Thank you very much for working on this so important bug.
Thanks Carlos!
Created attachment 35494 [details] [review] a patch checking whether cells of an assumed table overlap Hi, this patch improves detection of tables. It checks whether cells of assumed table overlap in axis x or y. Marek
Hi Marek, thank you very much for the patch, do you have a test case where your patch makes a difference to try with?
Created attachment 35559 [details] reproducer for table cell overlapping Hi Carlos, this is a reproducer of the problem with overlapping cells. I can not post the original one, so I created this one. Select the Headline 1, Text 1, Headline 2 and Text 2. Paste it somewhere then and you'll see the problem. Now I see that it is not problem of not checking for overlapping but wrong check of overlapping. I'll post an updated patch (which do the same, but sooner in the code). The actual code doesn't check for overlapping of lower left cell and upper right cell in x and y. Regards Marek
Created attachment 35560 [details] [review] a patch improving checking of overlapping of cells
(In reply to comment #55) > Created an attachment (id=35560) [details] > a patch improving checking of overlapping of cells pushed to git master, thank you very much!
I've just added a selections demo to poppler-glib-demo, I hope it makes easier to test selections.
(In reply to comment #42) > Created an attachment (id=31406) [details] > 3/5 reading order (bug fixed) It seems this bug introduced an important performance regression, see the detailed analysis in poppler mailing list: http://lists.freedesktop.org/archives/poppler/2010-October/006566.html
Correctness trumps performance, right?
It should be possible to improve this substantially. When I wrote the patch I was being very conservative with the existing poppler data structures, so essentially that method is traversing an unordered list. If the block list was in isBeforeByRule1 order most of those comparisons would go away. I can't remember if this would break clients wanting access to the text in physical order-it's been a while since I looked at the code and I'm reading this on a phone. Can take a deeper look tomorrow.
Created attachment 40056 [details] [review] patch to fix performance regression Here's what I've got so far. On my very slow VM, this renders the paris bus map reported on the mailing list in ~15.2s, compared to ~60s without the patch. YX-sorting alone got the time down to ~16.2s. Rendering on other documents is as fast as ever. Almost all of the time rendering the bus map is prior to the sort, so there must still be some quadratic algorithms in there unrelated to reading order. There is one obvious fix on my list that I didn't implement (track the first unvisited block, start loops there) but I don't think this will make much difference for the effort it requires. I'll be offline until Monday 8 Nov, but I'd be grateful if some more eyes could look at this to make sure I haven't regressed anything.
Created attachment 40061 [details] [review] improved patch Had another look and tidied the code a bit removing repeated page orientation checks, and a redundant test for overlap in rule(2). This is noticeably faster rendering the bus map. (down to ~14.8s)
Can you please attach a patch without unnecessary spacing changes like - before = gTrue; + before = gTrue; Thanks :-)
(In reply to comment #62) > Created an attachment (id=40061) [details] > improved patch > > Had another look and tidied the code a bit removing repeated page orientation > checks, and a redundant test for overlap in rule(2). This is noticeably faster > rendering the bus map. (down to ~14.8s) I tested this in glib. It improved rendering for me significantly for the PDF's prone to be affected. I tested other PDFs as well and didn't notice anything. I did a little bit of testing with test selection as well, but not as much, text selection seemed OK.
Created attachment 40124 [details] [review] patch without extraneous whitespace changes Oops! Ok, here's the patch without the whitespace changes.
(In reply to comment #65) > Created an attachment (id=40124) [details] > patch without extraneous whitespace changes > > Oops! Ok, here's the patch without the whitespace changes. Tested with the new whitespace patch - renders the map PDFs much faster in cairo, other PDFs aven't changed much.
There are also 2 nested "for" cycles in the block preceded by this comment: /* set extended bounding boxes of all other blocks * so that they extend in x without hitting neighbours */ I'm working on an optimization of it. Marek
Created attachment 40308 [details] [review] optimization of search for tables Hi, this patch improves efficiency of searching for blocks which belong to a table. In the worst case it is still quadratic but it should be O(n*sqrt(n)) in average case. It creates a list of all y coordinates and then sort it. After that it goes through this list and for each y coordinate which begins a block it starts a local while cycle. This while cycle searches for blocks which overlaps with the actual block in y. It finds closest blocks from the left and from the right during this search. It does the same for the x coordinate and finds up and down adjacent blocks. After that, it uses this information for computing of ExMin, .., EyMax and for determining whether the actual block is part of a table. (You need to have the Brian's patch applied already.) Regards Marek
Brian, your patch changes the output of pdftotext in a file, is this to be expected?
While working on bug #71160 I've found another regression introduced by this fix. In some cases, additional lines are added to the selection. For example, open the hig document and go to the first page. Start selecting the second line, but dragging from the margin, and you will see that the first line is selected too. This is because the second line is more indented than the first one. This fix changed the way blocks and lines are included in the selection by using the manhattan distance, and in this case, the distance of the first line is less than the second line, but the first line doesn't even intersect with the selection rectangle. If you start the selection closer to the beginning of the second line, then the first line is not included because distance to the second line is less in such case. You can play with it now using the text demo of poppler-glib-demo. I've added an area selector to get the text of a given area. Try using X1=0, Y1=122, which should discard the first line, but it doesn't. However using X1=257, Y1=122 discards the first line entirely. So, I think that we need to somehow discard blocks and lines that don't intersect with the selection rectangle even if the manhattan distance is less than any other block/line.
(In reply to comment #70) (I originally replied on launchpad, which is supposed to copy it through to here, but it hasn't.) Carlos: it isn't a regression that lines outside a rectangle formed by the start and endpoints are included, it's the intent. Consider selecting in a document with two columns, starting in the 1st column 2/3 down the page, ending in the 2nd column 1/3 down the page. In this case, the correct selection consists entirely of lines that lie outside the rectangle formed by the start and endpoints (ie, the bottom 1/3 of the 1st column and the top 1/3 of the 2nd column). You get situations like this even for single column text; just choose start and end points vertically above each other. The motivation for this patch was that text selection by rectangles is fundamentally wrong. The correct approach is to reconstruct the reading order of text; then from two points on the page, find the nearest insertion points (where an edit cursor would go); swap the insertion points if necessary; then return the characters between them. The difficulties lie in inferring the reading order, and determining what 'nearest insertion point' means. Clicking inside a word, the nearest insertion point is obvious; it's the nearest character boundary. Click in a blank area, and it's less clear. In Breuel's algorithms that I used for determining reading order, there is something that helps here. There, line width is determined by expanding the line left and right to fit the column it contains. So the line 'box' contains the initial indent if it is the first line of a paragraph, or the trailing space in the last line; or the ragged space for left- or right- justified text. Poppler doesn't have columns as such, but blocks instead, and as I recall the line boxes are the tight bounding box of the words contained in the line. So we can try to determine insertion point by looking for the nearest block (horizontally and vertically), then the nearest line (vertically ONLY, so that we ignore indents/ragged space), then nearest character (horizontally). I mean these to be three different comparisons, discarding blocks, line and character candidates at each stage, not some single distance you sum up. The upshot would be that clicking in blank areas of a line that lie within its block's bounding box - or even nearby - will choose that line, not the one above or below. (It's been ages since I looked at the poppler code, I can't remember if this heuristic is what the patches do already)
Created attachment 97356 [details] [review] Improve efficiency of searching for tables (In reply to comment #68) > Created attachment 40308 [details] [review] [review] > optimization of search for tables This is an updated version of the patch. It needs Brian's patch to be already applied. (see #77087 for additional info)
I just realized we never really followed up on the last two patches. My concern is that they change the pdftotext output. I though they where for a) speed b) text selection so i'd prefer if they did not change pdftotext output. I've checked a few files of the changed ones and the changes can be argued not to be for better or worse, but then again the problem is that 1 out of 3 files i have has a changed output in pdftotext and having 1600 files in the test suite it makes it impossible for me to go over them all and verify the changes are "not better nor worse". Is there any chance we get patches that don't influence pdftotext output or at least not such drastically? And yes, i know it's ages ago since you wrote the patches, sorry.
I went through the patch written by me and unfortunately I can not make it so that it returns the same result as before. I've separated axes in which it searches for immediate up/down/left/right neighbours, sorted them and found the neighbours by single pass (+ number of possible neighbour candidates in the other axis for given block, which should be sqrt(n) in average case). The difference is that the patch searches for right-down neighbour by looking at down neighbour of its right neighbour and at the right neighbour of its down neighbour. If they match then it selects it as the right-bottom neighbour. Previous version just searched for the closest block which is to the right of the block and below the block (and looking at the code, the result could depend on order of the blocks in the searched array). Modifying the patch so that it would return the same results as without it would cost the whole efficiency. Anyway, the efficiency improvement of my patch is not as big as the one from Brian so you can reject it (but I think that it improves the conversion :) ).
Created attachment 121848 [details] [review] Cache result of inner loop in visitDepthFirst This is an alternative to Brian's patch in comment 65. This speeds up the visitDepthFirst function by caching the result in the inner loop. This provides a similar speedup without changing the output of pdftotext.
Comment on attachment 121848 [details] [review] Cache result of inner loop in visitDepthFirst Looks good to me, pushed. Thanks!
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/516.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.