Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(246)

Issue 1230583006: Fix for keeping lists structure (Closed)

Created:
5 years, 5 months ago by Marcelo Correa
Modified:
5 years, 4 months ago
Reviewers:
mdjones, cjhopman, wychen
CC:
dalmirdasilva
Base URL:
https://github.com/chromium/dom-distiller.git@master
Target Ref:
refs/heads/master
Visibility:
Public.

Description

Fix for keeping lists structure Created the WebTag class to represent the tags we want to preserve over the distillation process. A WebTag instance acts like a placeholder which is injected to the WebDocument when walking the DOM and is used to track the UL, OL and LI element positions. This mechanism could be used in the future for all the tags that we want to preserve. When the generateOuput() is called for each WebTag, it returns the html tag that the instance is representing. A new filter was created to process these WebTags and mark as content considering its content inside. This filter runs for the last since it relies on content found by all the others filters. BUG=502524 R=mdjones@chromium.org, wychen@chromium.org Committed: a4e94dd067684796435ca45d2f88e063b5af20af

Patch Set 1 #

Patch Set 2 : Using a new approach to keep list structure. #

Total comments: 4

Patch Set 3 : Unified the placeholder class into a single one. #

Patch Set 4 : Small code refactor for more appropriate names. #

Total comments: 18

Patch Set 5 : Fixed small issues. #

Patch Set 6 : StackEntry removed, using WebTag content flag instead. #

Total comments: 5

Patch Set 7 : class WebTagStructureKeeper renamed to NestedElementRetainer #

Patch Set 8 : Fixed imports order #

Total comments: 5

Patch Set 9 : Fixed NestedElementRetainer.process method and unit tests #

Total comments: 6

Patch Set 10 : Method addList renamed to addTag #

Total comments: 1

Patch Set 11 : Fixed grammar #

Total comments: 2

Patch Set 12 : Classes were documented. #

Total comments: 7

Patch Set 13 : Tag Elements verification logic delegated to WebTag class. #

Patch Set 14 : sync #

Total comments: 1

Patch Set 15 : Changed nestingTags List to Set. #

Patch Set 16 : canBeNested move out of the switch. #

Unified diffs Side-by-side diffs Delta from patch set Stats (+694 lines, -4 lines) Patch
M java/org/chromium/distiller/ContentExtractor.java View 1 2 3 4 5 6 7 2 chunks +2 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/webdocument/DomConverter.java View 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 2 chunks +11 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/webdocument/WebDocument.java View 1 2 3 4 5 6 7 8 9 1 chunk +4 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/webdocument/WebDocumentBuilder.java View 1 2 3 4 5 6 7 8 9 1 chunk +6 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/webdocument/WebDocumentBuilderInterface.java View 1 2 3 4 5 6 7 8 9 1 chunk +1 line, -0 lines 0 comments Download
A java/org/chromium/distiller/webdocument/WebTag.java View 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 chunk +50 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/webdocument/WebText.java View 1 2 3 4 5 6 7 8 9 10 11 12 1 chunk +8 lines, -2 lines 0 comments Download
A java/org/chromium/distiller/webdocument/filters/NestedElementRetainer.java View 1 2 3 4 5 6 7 8 9 10 11 1 chunk +49 lines, -0 lines 0 comments Download
M javatests/org/chromium/distiller/ContentExtractorTest.java View 1 2 3 4 5 6 7 8 9 10 11 12 1 chunk +261 lines, -0 lines 0 comments Download
M javatests/org/chromium/distiller/TestUtil.java View 1 1 chunk +12 lines, -0 lines 0 comments Download
M javatests/org/chromium/distiller/webdocument/DomConverterTest.java View 1 2 3 4 1 chunk +35 lines, -0 lines 0 comments Download
M javatests/org/chromium/distiller/webdocument/FakeWebDocumentBuilder.java View 1 2 3 4 5 6 7 8 9 1 chunk +3 lines, -0 lines 0 comments Download
M javatests/org/chromium/distiller/webdocument/TestWebDocumentBuilder.java View 1 2 3 4 5 6 7 8 9 1 chunk +12 lines, -0 lines 0 comments Download
A javatests/org/chromium/distiller/webdocument/WebTagTest.java View 1 2 3 4 5 6 7 8 9 10 11 12 1 chunk +40 lines, -0 lines 0 comments Download
M javatests/org/chromium/distiller/webdocument/WebTextTest.java View 1 1 chunk +15 lines, -0 lines 0 comments Download
A javatests/org/chromium/distiller/webdocument/filters/NestedElementRetainerTest.java View 1 2 3 4 5 6 7 8 9 1 chunk +185 lines, -0 lines 0 comments Download
M javatests/org/chromium/distiller/webdocument/filters/RelevantElementsTest.java View 1 2 3 4 1 chunk +0 lines, -2 lines 0 comments Download

Messages

Total messages: 61 (3 generated)
Marcelo Correa
This is a new approach for issue about ordered lists: https://codereview.chromium.org/1190343004/
5 years, 5 months ago (2015-07-09 16:10:29 UTC) #2
Marcelo Correa
5 years, 5 months ago (2015-07-09 16:11:43 UTC) #3
wychen
Thanks for the patch. We might need to keep LI elements separate, or the filter ...
5 years, 5 months ago (2015-07-10 01:42:42 UTC) #4
Marcelo Correa
On 2015/07/10 01:42:42, wychen wrote: > Thanks for the patch. > > We might need ...
5 years, 5 months ago (2015-07-13 13:41:37 UTC) #5
dalmirdasilva
On 2015/07/13 13:41:37, Marcelo Correa wrote: > On 2015/07/10 01:42:42, wychen wrote: > > Thanks ...
5 years, 5 months ago (2015-07-13 16:55:03 UTC) #6
wychen
On 2015/07/13 16:55:03, ds wrote: > On 2015/07/13 13:41:37, Marcelo Correa wrote: > > Hello ...
5 years, 5 months ago (2015-07-20 22:45:32 UTC) #7
Marcelo Correa
Using new approach to keep list structure.
5 years, 4 months ago (2015-07-29 14:36:26 UTC) #8
mdjones
Some initial comments/suggestions. Once upon a time, I worked on a solution to this problem ...
5 years, 4 months ago (2015-07-29 17:12:16 UTC) #9
Marcelo Correa
On 2015/07/29 17:12:16, mdjones wrote: > Some initial comments/suggestions. > > Once upon a time, ...
5 years, 4 months ago (2015-07-29 17:53:23 UTC) #10
mdjones
I believe what wychen was referring to with leaving them separate was that an entire ...
5 years, 4 months ago (2015-07-29 21:00:24 UTC) #11
Marcelo Correa
On 2015/07/29 21:00:24, mdjones wrote: > I believe what wychen was referring to with leaving ...
5 years, 4 months ago (2015-07-29 21:21:04 UTC) #12
wychen
Hi, Just talked with mdjones, he didn't mean to make you follow his old CL ...
5 years, 4 months ago (2015-07-29 23:53:28 UTC) #13
Marcelo Correa
On 2015/07/29 23:53:28, wychen wrote: > Hi, > > Just talked with mdjones, he didn't ...
5 years, 4 months ago (2015-07-30 01:23:59 UTC) #14
Marcelo Correa
All comments were addressed.
5 years, 4 months ago (2015-07-30 18:01:12 UTC) #15
wychen
Thanks for your explanation about LI. I see it is necessary. https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/DomConverter.java File java/org/chromium/distiller/webdocument/DomConverter.java (right): ...
5 years, 4 months ago (2015-08-01 01:00:21 UTC) #16
Marcelo Correa
Hello wychen! Yeah, I think WebTag would be a good name for this PlaceHolder. In ...
5 years, 4 months ago (2015-08-01 01:28:43 UTC) #17
dalmirdasilva
https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/WebText.java File java/org/chromium/distiller/webdocument/WebText.java (right): https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/WebText.java#newcode76 java/org/chromium/distiller/webdocument/WebText.java:76: return elementClonedRoot.getInnerHTML(); Is here the case where we can ...
5 years, 4 months ago (2015-08-01 16:27:04 UTC) #19
dalmirdasilva
https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java File java/org/chromium/distiller/webdocument/filters/RelevantElements.java (right): https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java#newcode62 java/org/chromium/distiller/webdocument/filters/RelevantElements.java:62: StackEntry stackEntry = holderStack.pop(); This might raise EmptyStackException if ...
5 years, 4 months ago (2015-08-03 15:43:50 UTC) #20
mdjones
https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/PlaceHolder.java File java/org/chromium/distiller/webdocument/PlaceHolder.java (right): https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/PlaceHolder.java#newcode35 java/org/chromium/distiller/webdocument/PlaceHolder.java:35: return sb.toString(); I think the following is more clear: ...
5 years, 4 months ago (2015-08-03 16:57:55 UTC) #21
dalmirdasilva
https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java File java/org/chromium/distiller/webdocument/filters/RelevantElements.java (right): https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java#newcode46 java/org/chromium/distiller/webdocument/filters/RelevantElements.java:46: } On 2015/08/03 16:57:55, mdjones wrote: > What if ...
5 years, 4 months ago (2015-08-03 17:13:18 UTC) #22
mdjones
https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java File java/org/chromium/distiller/webdocument/filters/RelevantElements.java (right): https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java#newcode46 java/org/chromium/distiller/webdocument/filters/RelevantElements.java:46: } > ...use Set<PlaceHolder> and Stack<PlaceHolder>... I don't, I ...
5 years, 4 months ago (2015-08-03 18:10:54 UTC) #23
Marcelo Correa
All comments addressed again, except that one related to use Set and Stack instead of ...
5 years, 4 months ago (2015-08-03 19:36:51 UTC) #24
mdjones
On 2015/08/03 19:36:51, Marcelo Correa wrote: > All comments addressed again, except that one related ...
5 years, 4 months ago (2015-08-03 20:22:20 UTC) #25
Marcelo Correa
Fixed what mdjones requested.
5 years, 4 months ago (2015-08-03 21:01:30 UTC) #26
wychen
On 2015/08/03 15:43:50, ds wrote: > https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java > File java/org/chromium/distiller/webdocument/filters/RelevantElements.java > (right): > > https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java#newcode62 ...
5 years, 4 months ago (2015-08-03 21:26:42 UTC) #27
mdjones
https://codereview.chromium.org/1230583006/diff/100001/java/org/chromium/distiller/ContentExtractor.java File java/org/chromium/distiller/ContentExtractor.java (right): https://codereview.chromium.org/1230583006/diff/100001/java/org/chromium/distiller/ContentExtractor.java#newcode96 java/org/chromium/distiller/ContentExtractor.java:96: WebTagStructureKeeper.process(documentInfo.document); This should probably be the last filter to ...
5 years, 4 months ago (2015-08-03 23:29:45 UTC) #28
Marcelo Correa
5 years, 4 months ago (2015-08-04 01:44:57 UTC) #29
wychen
https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java File java/org/chromium/distiller/webdocument/filters/RelevantElements.java (right): https://codereview.chromium.org/1230583006/diff/60001/java/org/chromium/distiller/webdocument/filters/RelevantElements.java#newcode62 java/org/chromium/distiller/webdocument/filters/RelevantElements.java:62: StackEntry stackEntry = holderStack.pop(); On 2015/08/03 15:43:50, ds wrote: ...
5 years, 4 months ago (2015-08-04 02:37:01 UTC) #30
Marcelo Correa
Fixed isContent flag in NestedElementRetainer.
5 years, 4 months ago (2015-08-04 03:47:04 UTC) #32
Marcelo Correa
Hello wychen, I believe it is necessary because we might have situations like <ol><li><ol><li><p>text</p></li>.... So ...
5 years, 4 months ago (2015-08-04 03:54:28 UTC) #33
wychen
On 2015/08/04 03:54:28, Marcelo Correa wrote: > Hello wychen, > I believe it is necessary ...
5 years, 4 months ago (2015-08-04 05:10:10 UTC) #34
Marcelo Correa
https://codereview.chromium.org/1230583006/diff/140001/java/org/chromium/distiller/webdocument/DomConverter.java File java/org/chromium/distiller/webdocument/DomConverter.java (right): https://codereview.chromium.org/1230583006/diff/140001/java/org/chromium/distiller/webdocument/DomConverter.java#newcode106 java/org/chromium/distiller/webdocument/DomConverter.java:106: builder.list(new WebTag(e.getTagName(), WebTag.TagType.START)); I don't think we can do ...
5 years, 4 months ago (2015-08-04 16:34:48 UTC) #35
wychen
https://codereview.chromium.org/1230583006/diff/140001/java/org/chromium/distiller/webdocument/DomConverter.java File java/org/chromium/distiller/webdocument/DomConverter.java (right): https://codereview.chromium.org/1230583006/diff/140001/java/org/chromium/distiller/webdocument/DomConverter.java#newcode106 java/org/chromium/distiller/webdocument/DomConverter.java:106: builder.list(new WebTag(e.getTagName(), WebTag.TagType.START)); I see. You are right. The ...
5 years, 4 months ago (2015-08-05 00:01:02 UTC) #36
Marcelo Correa
https://codereview.chromium.org/1230583006/diff/160001/java/org/chromium/distiller/webdocument/WebText.java File java/org/chromium/distiller/webdocument/WebText.java (right): https://codereview.chromium.org/1230583006/diff/160001/java/org/chromium/distiller/webdocument/WebText.java#newcode72 java/org/chromium/distiller/webdocument/WebText.java:72: // Since LI Tag is being wrapped by a ...
5 years, 4 months ago (2015-08-05 00:26:35 UTC) #37
Marcelo Correa
Comments addressed
5 years, 4 months ago (2015-08-05 00:44:04 UTC) #38
wychen
https://codereview.chromium.org/1230583006/diff/160001/java/org/chromium/distiller/webdocument/WebText.java File java/org/chromium/distiller/webdocument/WebText.java (right): https://codereview.chromium.org/1230583006/diff/160001/java/org/chromium/distiller/webdocument/WebText.java#newcode72 java/org/chromium/distiller/webdocument/WebText.java:72: // Since LI Tag is being wrapped by a ...
5 years, 4 months ago (2015-08-05 00:52:15 UTC) #39
wychen
The title to this CL and the commit message might need to be updated to ...
5 years, 4 months ago (2015-08-05 01:04:42 UTC) #40
Marcelo Correa
fixed grammar.
5 years, 4 months ago (2015-08-05 01:09:22 UTC) #41
Marcelo Correa
Title changed to reflect the new implementation
5 years, 4 months ago (2015-08-05 03:11:12 UTC) #42
mdjones
https://codereview.chromium.org/1230583006/diff/200001/java/org/chromium/distiller/webdocument/WebTag.java File java/org/chromium/distiller/webdocument/WebTag.java (right): https://codereview.chromium.org/1230583006/diff/200001/java/org/chromium/distiller/webdocument/WebTag.java#newcode3 java/org/chromium/distiller/webdocument/WebTag.java:3: public class WebTag extends WebElement { Briefly describe what ...
5 years, 4 months ago (2015-08-05 15:44:35 UTC) #43
wychen
I've rewrapped the commit message, and added back the bug id.
5 years, 4 months ago (2015-08-05 15:52:10 UTC) #44
Marcelo Correa
Classes were documented.
5 years, 4 months ago (2015-08-05 16:25:56 UTC) #45
Marcelo Correa
On 2015/08/05 15:52:10, wychen wrote: > I've rewrapped the commit message, and added back the ...
5 years, 4 months ago (2015-08-05 16:26:21 UTC) #46
wychen
https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/WebText.java File java/org/chromium/distiller/webdocument/WebText.java (right): https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/WebText.java#newcode78 java/org/chromium/distiller/webdocument/WebText.java:78: } else if (elementClonedRoot.getTagName().equals("LI")) { UL and OL might ...
5 years, 4 months ago (2015-08-05 19:46:15 UTC) #47
Marcelo Correa
https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/WebText.java File java/org/chromium/distiller/webdocument/WebText.java (right): https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/WebText.java#newcode78 java/org/chromium/distiller/webdocument/WebText.java:78: } else if (elementClonedRoot.getTagName().equals("LI")) { On 2015/08/05 19:46:15, wychen ...
5 years, 4 months ago (2015-08-05 20:31:24 UTC) #48
wychen
https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/WebText.java File java/org/chromium/distiller/webdocument/WebText.java (right): https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/WebText.java#newcode78 java/org/chromium/distiller/webdocument/WebText.java:78: } else if (elementClonedRoot.getTagName().equals("LI")) { On 2015/08/05 20:31:24, Marcelo ...
5 years, 4 months ago (2015-08-06 00:04:52 UTC) #49
Marcelo Correa
Verification Logic moved to WebTag class https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/WebText.java File java/org/chromium/distiller/webdocument/WebText.java (right): https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/WebText.java#newcode78 java/org/chromium/distiller/webdocument/WebText.java:78: } else if ...
5 years, 4 months ago (2015-08-06 01:17:58 UTC) #50
mdjones
https://codereview.chromium.org/1230583006/diff/260001/java/org/chromium/distiller/webdocument/WebTag.java File java/org/chromium/distiller/webdocument/WebTag.java (right): https://codereview.chromium.org/1230583006/diff/260001/java/org/chromium/distiller/webdocument/WebTag.java#newcode18 java/org/chromium/distiller/webdocument/WebTag.java:18: private static List<String> canBeNestedTags = Arrays.asList("UL", "OL", "LI"); Why ...
5 years, 4 months ago (2015-08-06 15:26:20 UTC) #51
Marcelo Correa
Using Set instead of List.
5 years, 4 months ago (2015-08-06 16:16:44 UTC) #52
Marcelo Correa
On 2015/08/06 16:16:44, Marcelo Correa wrote: > Using Set instead of List. Is there anything ...
5 years, 4 months ago (2015-08-10 13:43:52 UTC) #53
wychen
The score remains the same as of patch set 15. https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/DomConverter.java File java/org/chromium/distiller/webdocument/DomConverter.java (right): https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/DomConverter.java#newcode106 ...
5 years, 4 months ago (2015-08-11 00:12:15 UTC) #54
Marcelo Correa
https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/DomConverter.java File java/org/chromium/distiller/webdocument/DomConverter.java (right): https://codereview.chromium.org/1230583006/diff/220001/java/org/chromium/distiller/webdocument/DomConverter.java#newcode106 java/org/chromium/distiller/webdocument/DomConverter.java:106: builder.tag(new WebTag(e.getTagName(), WebTag.TagType.START)); On 2015/08/11 00:12:15, wychen wrote: > ...
5 years, 4 months ago (2015-08-11 00:21:01 UTC) #55
Marcelo Correa
canBeNested move out of the switch.
5 years, 4 months ago (2015-08-11 16:10:28 UTC) #56
wychen
lgtm
5 years, 4 months ago (2015-08-11 20:15:36 UTC) #57
Marcelo Correa
Great! mdjones, is there anything else ?
5 years, 4 months ago (2015-08-19 14:01:51 UTC) #58
Marcelo Correa
5 years, 4 months ago (2015-08-19 14:02:02 UTC) #59
mdjones
lgtm
5 years, 4 months ago (2015-08-25 20:41:15 UTC) #60
mdjones
5 years, 4 months ago (2015-08-25 22:46:33 UTC) #61
Message was sent while issue was closed.
Committed patchset #16 (id:300001) manually as
a4e94dd067684796435ca45d2f88e063b5af20af (presubmit successful).

Powered by Google App Engine
This is Rietveld 408576698