|
|
Created:
8 years, 5 months ago by frankwang Modified:
8 years, 4 months ago CC:
chromium-reviews, joi+watch-content_chromium.org, darin-cc_chromium.org, jam Base URL:
http://src.chromium.org/svn/trunk/src/ Visibility:
Public. |
DescriptionDuplicateContentResourceHandler looks at resources that come through and track, in histograms, which resources are duplicated, how big those resources are, and what type of resources they are.
Patch Set 1 #
Total comments: 27
Patch Set 2 : #
Total comments: 4
Patch Set 3 : #Patch Set 4 : #
Total comments: 4
Patch Set 5 : #
Total comments: 1
Patch Set 6 : #
Total comments: 17
Patch Set 7 : #Patch Set 8 : #
Total comments: 50
Patch Set 9 : #
Total comments: 2
Patch Set 10 : #
Total comments: 30
Patch Set 11 : Changes for second full pass-through #
Total comments: 6
Patch Set 12 : #
Total comments: 4
Patch Set 13 : Removed one histogram and changed uint32 types for windows compliance #
Total comments: 2
Patch Set 14 : Eliminated implicit type force. #
Total comments: 6
Patch Set 15 : Created helper function and moved comments. #Patch Set 16 : #
Total comments: 32
Patch Set 17 : #Patch Set 18 : #
Total comments: 4
Patch Set 19 : fixed some nits #
Total comments: 11
Patch Set 20 : more nits #Patch Set 21 : fixed parameter spacing #Patch Set 22 : final change to namespace #Patch Set 23 : changed static instance to LazyInstance #Messages
Total messages: 52 (0 generated)
https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:7: #include "content/browser/renderer_host/duplicate_resource_handler.h" Move this above #include <string> https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:20: std::set<uint32>* GetSetOfHashes(){ Nit: don't indent this. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:45: read_buffer_size_ = *buf_size; This is wrong. The buffer is provided by your downstream resource handler. if (next_handler_->OnWillRead(...)) { read_buffer_ = *buf; // Is it OK to keep this reference? read_buffer_size_ = *buf_size; return true; } return false; https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:52: if (!next_handler_->OnReadCompleted(...)) return false; https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:55: MurmurHash3_x86_32(read_buffer_, bytes_read, 0x0, &buf_hash); This is conceptually broken. The reads come to you in whatever quantity we got from the TCP stack, possibly joined by the BufferedResourceHandler. I think you need to take hashes of aligned 4k chunks of the file, concatenate those hashes, and use the hash of that string of hashes to identify each file. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:55: MurmurHash3_x86_32(read_buffer_, bytes_read, 0x0, &buf_hash); Also: MurmurHash3_x86_32(read_buffer_->data(), bytes_read, 0x0, &buf_hash); https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:57: static base::Histogram* num_hits(NULL); This isn't a common pattern for creating histograms. How about: const bool did_we_find_it = ...; UMA_HISTOGRAM_BOOLEAN("Duplicate.HitRate", did_we_find_it); if (did_we_find_it) UMA_HISTOGRAM_CUSTOM_ENUMERATION("Duplicate.HitSize", total_size_, frobnitz); else UMA_HISTOGRAM_CUSTOM_ENUMERATION("Duplicate.MissSize", total_size_, frobnitz); https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... File content/browser/renderer_host/duplicate_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:10: #include <vector> Remove unneeded headers. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:20: class DuplicateResourceHandler: public LayeredResourceHandler{ Nit: space before {. Same throughout this CL. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:24: net::URLRequest* request); Nit: indent https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:39: scoped_refptr<net::IOBuffer> read_buffer_; Include base/memory/ref_counted.h for this. Also, forward define IOBuffer too. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:43: }; // class DuplicateResourceHandler Nit: two spaces before //. Same issue throughout this CL. https://chromiumcodereview.appspot.com/10701151/diff/1/content/content_browse... File content/content_browser.gypi (right): https://chromiumcodereview.appspot.com/10701151/diff/1/content/content_browse... content/content_browser.gypi:502: 'browser/renderer_hoest/duplicate_resource_handler.h', Speling.
https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:72: I'm not sure if it's kosher to keep read_buffer_ around the way you do (I think it's fine, but we might want to make that an IOBuffer*?). But if you are keeping a scoped_ptr, you should: read_buffer_.reset(); So that you don't keep it alive any longer.
I made a few changes as specified. Using Murmur on the read_buffer_ is currently not working. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:7: #include "content/browser/renderer_host/duplicate_resource_handler.h" On 2012/07/11 23:39:46, gavinp wrote: > Move this above #include <string> Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:20: std::set<uint32>* GetSetOfHashes(){ On 2012/07/11 23:39:46, gavinp wrote: > Nit: don't indent this. Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:45: read_buffer_size_ = *buf_size; On 2012/07/11 23:39:46, gavinp wrote: > This is wrong. The buffer is provided by your downstream resource handler. > > if (next_handler_->OnWillRead(...)) { > read_buffer_ = *buf; // Is it OK to keep this reference? > read_buffer_size_ = *buf_size; > return true; > } > return false; Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:52: On 2012/07/11 23:39:46, gavinp wrote: > if (!next_handler_->OnReadCompleted(...)) > return false; Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:55: MurmurHash3_x86_32(read_buffer_, bytes_read, 0x0, &buf_hash); On 2012/07/11 23:39:46, gavinp wrote: > Also: > > MurmurHash3_x86_32(read_buffer_->data(), bytes_read, 0x0, &buf_hash); Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:57: static base::Histogram* num_hits(NULL); On 2012/07/11 23:39:46, gavinp wrote: > This isn't a common pattern for creating histograms. How about: > > const bool did_we_find_it = ...; > > UMA_HISTOGRAM_BOOLEAN("Duplicate.HitRate", did_we_find_it); > if (did_we_find_it) > UMA_HISTOGRAM_CUSTOM_ENUMERATION("Duplicate.HitSize", total_size_, frobnitz); > else > UMA_HISTOGRAM_CUSTOM_ENUMERATION("Duplicate.MissSize", total_size_, frobnitz); I changed it just for one histogram. I am still discussing with Elie regarding what histograms we need exactly since we are collecting data anyway. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.cc:72: On 2012/07/11 23:41:31, gavinp wrote: > I'm not sure if it's kosher to keep read_buffer_ around the way you do (I think > it's fine, but we might want to make that an IOBuffer*?). But if you are keeping > a scoped_ptr, you should: > > read_buffer_.reset(); > > So that you don't keep it alive any longer. Yes, I called release() on it because it was a scoped_refptr. I am not sure if what I am keeping is right either. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... File content/browser/renderer_host/duplicate_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:10: #include <vector> On 2012/07/11 23:39:46, gavinp wrote: > Remove unneeded headers. Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:20: class DuplicateResourceHandler: public LayeredResourceHandler{ On 2012/07/11 23:39:46, gavinp wrote: > Nit: space before {. Same throughout this CL. Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:24: net::URLRequest* request); On 2012/07/11 23:39:46, gavinp wrote: > Nit: indent Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:39: scoped_refptr<net::IOBuffer> read_buffer_; On 2012/07/11 23:39:46, gavinp wrote: > Include base/memory/ref_counted.h for this. Also, forward define IOBuffer too. Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/browser/render... content/browser/renderer_host/duplicate_resource_handler.h:43: }; // class DuplicateResourceHandler On 2012/07/11 23:39:46, gavinp wrote: > Nit: two spaces before //. Same issue throughout this CL. Done. https://chromiumcodereview.appspot.com/10701151/diff/1/content/content_browse... File content/content_browser.gypi (right): https://chromiumcodereview.appspot.com/10701151/diff/1/content/content_browse... content/content_browser.gypi:502: 'browser/renderer_hoest/duplicate_resource_handler.h', On 2012/07/11 23:39:46, gavinp wrote: > Speling. Done.
I made the segfault go away. However, when I launch the browser, it does not render any pages. In duplicate_resource_handler.cc, I think it might not be overriding a certain function correctly or missing a function to override. https://chromiumcodereview.appspot.com/10701151/diff/4003/content/browser/ren... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/4003/content/browser/ren... content/browser/renderer_host/duplicate_resource_handler.cc:45: } I added this function to prevent it from asking the next handler. https://chromiumcodereview.appspot.com/10701151/diff/4003/content/browser/ren... content/browser/renderer_host/duplicate_resource_handler.cc:56: Instead of storing the pointer, *buf is now equal to read_buffer_.get() so that I can access read_buffer_ later. This prevents the segfault that was happening. *buf is supposed to be assigned to something in this function. https://chromiumcodereview.appspot.com/10701151/diff/4003/content/browser/ren... File content/browser/renderer_host/duplicate_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/4003/content/browser/ren... content/browser/renderer_host/duplicate_resource_handler.h:29: enum { kReadBufSize = 4000 }; I set the buffer size to be 4000. In other handlers, the buffer sizes seem to vary. I don't know what the best one to use is. I think only look at the first 4000 bytes of a resource might actually be fine. https://chromiumcodereview.appspot.com/10701151/diff/4003/content/browser/ren... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/4003/content/browser/ren... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1045: new DuplicateResourceHandler(handler.Pass(), request)); I am not sure if this is in the right spot in terms of the line of handlers being initiated.
The browser now renders pages, but the histogram is not showing.
I just tested this, and I can't believe it renders. I have to go get dinner ready, I'll find more time for this tonight. http://codereview.chromium.org/10701151/diff/7001/content/browser/renderer_ho... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/7001/content/browser/renderer_ho... content/browser/renderer_host/duplicate_resource_handler.cc:63: MurmurHash3_x86_32(read_buffer_->data(), bytes_read, 0x0, &buf_hash); 0, but in hexadecimal? Just use 0. If it helps, you should know that 0 is an octal constant according to the grammar in the original K&R. http://codereview.chromium.org/10701151/diff/7001/content/browser/renderer_ho... content/browser/renderer_host/duplicate_resource_handler.cc:68: // if element is in hash Your comment not full sentence with punctuation http://codereview.chromium.org/10701151/diff/7001/content/browser/renderer_ho... File content/browser/renderer_host/duplicate_resource_handler.h (right): http://codereview.chromium.org/10701151/diff/7001/content/browser/renderer_ho... content/browser/renderer_host/duplicate_resource_handler.h:9: #include <set> You don't need this one, either. Move it to the .cc file. http://codereview.chromium.org/10701151/diff/7001/content/browser/renderer_ho... content/browser/renderer_host/duplicate_resource_handler.h:37: int bytes_hit_; // bytes that hit element in cache Nit: two spaces before //. http://codereview.chromium.org/10701151/diff/10005/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/10005/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:54: return true; I'm pretty sure you always want to pass through.
The problem was that I was not passing through correctly on the OnWillRead. The histogram was not showing because I reset the handler in the wrong spot in resource_dispatcher_host_impl.cc. I made some stylistic fixes and added the histograms we want. The histograms still need to be finalized, but this is a first real pass at the code (working, rendering, and displaying histograms without crashing). Patch 6 is the most recent patch.
https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:104: UMA_HISTOGRAM_COUNTS("Duplicate.SizeKB.Miss.CurrentCache", log10(bytes_read_)); I changed this locally to "Duplicate.Size.Miss.CurrentCache" since it is actually bytes base 10 because it was wrong and misleading.
The IOBuffer and hash are still being used wrong. However, the histograms are looking better. http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:26: static std::set<uint32> seen_resources; This won't build in clang, due to our static destructor rule. Do you think any of your hashes will ever get very big? If they will, you should consider dropping them and stop running when the system comes under memory pressure. http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:60: return next_handler_->OnWillRead(request_id, buf, buf_size, min_size); This doesn't work; you're setting *buf and *buf_size, but so does the AsyncResourceHandler, and it's later in the chain than the DuplicateResourceHandler. So your buffer isn't being used... You're just hashing initialized memory over and over again. You shouldn't be allocating your own buffer, either. You should be just keeping track of the one you're expecting to see in OnReadCompleted. And you need to grab it after OnWillRead(). http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:65: bool* defer) { Once you've fixed the above, you'll want to read the buffer here. Your earlier crashing version was crashing because of course the AsyncResourceHandler tosses the buffer once it's done. http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:69: // Find hash of buffer, using previous hash as the seed (first seed is 0) This does not work. MurmurHash3 is not incremental, seed doesn't work like that. I think you want to use MurmurHash2A, since we have code for it in third_party/smhasher (though we don't build it, that's easily fixed). http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:89: const bool did_we_find_resource_same_origin = Origin isn't the right term to use here, for http://foo.com/bar/quux, the web origin is ('http', 'foo.com', 80). Perhaps you want original_url? http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:96: UMA_HISTOGRAM_BOOLEAN("Duplicate.ProposedCache.HitRate", true); Each of these macro invocations has a global variable hidden in it, so it's best to avoid having too many redundant ones. Why not avoid the nested ifs, and instead something like: UMA_HISTOGRAM_BOOLEAN("Duplicate.ProposedCache.Hits", did_we_find_resource); I'm not a fan of "ProposedCache", how about "HashCache" ? Or are we even at the level of a cache here? Maybe "Duplicate.Hash.Hits" ? http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:97: UMA_HISTOGRAM_BOOLEAN("Duplicate.CurrentCache.HitRate", true); You're not really measuring the current cache here. You're measuring an hypothetical URL based cache that always works when the resources are identical. "Duplicate.HashWithUrl.Hit" ? http://codereview.chromium.org/10701151/diff/13001/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:104: UMA_HISTOGRAM_COUNTS("Duplicate.SizeKB.Miss.CurrentCache", log10(bytes_read_)); UMA_HISTOGRAM_COUNTS is already exponentially binned, so might as well: UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.Hash.Size", bytes_read_, 1, 0x7FFFFFFF, 50); http://codereview.chromium.org/10701151/diff/13001/content/content_browser.gypi File content/content_browser.gypi (right): http://codereview.chromium.org/10701151/diff/13001/content/content_browser.gy... content/content_browser.gypi:15: '<(webkit_src_dir)/Source/WebKit/chromium/WebKit.gyp:webkit', '../third_party/smhasher/smhasher.gyp:murmurhash3',
I fixed the issue with the IOBuffer but not hash. I thought it would be helpful to just upload this CL as I thought about the hash issue more. I did tests on the hash function (MurmurHash2A). It still doesn't seem to work in the way I want. I found a lot of false positives when hashing the resource. This is problematic. However, I am not sure if this is still an architectural problem (which I feel it is not) or a problem with using the hash function incrementally/problem with the hash function. Maybe, we should go with the original idea: hash each portion of the resource and store it in a vector, then hash the vector. That would definitely get around the problem. The only issue is converting the vector into an object usable in the MurmurHash. Other option is to use an incremental hash function. https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:60: return next_handler_->OnWillRead(request_id, buf, buf_size, min_size); On 2012/07/14 19:05:06, gavinp wrote: > This doesn't work; you're setting *buf and *buf_size, but so does the > AsyncResourceHandler, and it's later in the chain than the > DuplicateResourceHandler. So your buffer isn't being used... You're just hashing > initialized memory over and over again. > > You shouldn't be allocating your own buffer, either. You should be just keeping > track of the one you're expecting to see in OnReadCompleted. And you need to > grab it after OnWillRead(). Done. https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:65: bool* defer) { On 2012/07/14 19:05:06, gavinp wrote: > Once you've fixed the above, you'll want to read the buffer here. Your earlier > crashing version was crashing because of course the AsyncResourceHandler tosses > the buffer once it's done. Done. https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:89: const bool did_we_find_resource_same_origin = On 2012/07/14 19:05:06, gavinp wrote: > Origin isn't the right term to use here, for http://foo.com/bar/quux, the web > origin is ('http', 'foo.com', 80). Perhaps you want original_url? Done. https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:96: UMA_HISTOGRAM_BOOLEAN("Duplicate.ProposedCache.HitRate", true); On 2012/07/14 19:05:06, gavinp wrote: > Each of these macro invocations has a global variable hidden in it, so it's best > to avoid having too many redundant ones. > > Why not avoid the nested ifs, and instead something like: > > UMA_HISTOGRAM_BOOLEAN("Duplicate.ProposedCache.Hits", did_we_find_resource); > > I'm not a fan of "ProposedCache", how about "HashCache" ? Or are we even at the > level of a cache here? Maybe "Duplicate.Hash.Hits" ? Done. https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:97: UMA_HISTOGRAM_BOOLEAN("Duplicate.CurrentCache.HitRate", true); On 2012/07/14 19:05:06, gavinp wrote: > You're not really measuring the current cache here. You're measuring an > hypothetical URL based cache that always works when the resources are identical. > > "Duplicate.HashWithUrl.Hit" ? Done. https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:104: UMA_HISTOGRAM_COUNTS("Duplicate.SizeKB.Miss.CurrentCache", log10(bytes_read_)); On 2012/07/14 19:05:06, gavinp wrote: > UMA_HISTOGRAM_COUNTS is already exponentially binned, so might as well: > > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.Hash.Size", bytes_read_, 1, 0x7FFFFFFF, > 50); Done. https://chromiumcodereview.appspot.com/10701151/diff/13001/content/content_br... File content/content_browser.gypi (right): https://chromiumcodereview.appspot.com/10701151/diff/13001/content/content_br... content/content_browser.gypi:15: '<(webkit_src_dir)/Source/WebKit/chromium/WebKit.gyp:webkit', On 2012/07/14 19:05:06, gavinp wrote: > '../third_party/smhasher/smhasher.gyp:murmurhash3', Done.
On 2012/07/16 01:44:05, frankwang wrote: > I fixed the issue with the IOBuffer but not hash. I thought it would be helpful > to just upload this CL as I thought about the hash issue more. I did tests on > the hash function (MurmurHash2A). It still doesn't seem to work in the way I > want. I found a lot of false positives when hashing the resource. This is > problematic. However, I am not sure if this is still an architectural problem > (which I feel it is not) or a problem with using the hash function > incrementally/problem with the hash function. Maybe, we should go with the > original idea: hash each portion of the resource and store it in a vector, then > hash the vector. That would definitely get around the problem. The only issue is > converting the vector into an object usable in the MurmurHash. Other option is > to use an incremental hash function. My apologies for my earlier confusion on this; I think suggesting MurmurHash2 was a bad idea. My only excuse is that the MurmurHash wikipedia page. Reading the source to MurmurHash2A (from smhasher/MurmurHash2.cpp ) it's clear that Wikipedia is just lying. But here's the best plan: smhasher/PMurHash.h. What do you think? I have perf tested it on my system and it is close to memory bandwidth rates. - Gavin > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... > File content/browser/renderer_host/duplicate_resource_handler.cc (right): > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... > content/browser/renderer_host/duplicate_resource_handler.cc:60: return > next_handler_->OnWillRead(request_id, buf, buf_size, min_size); > On 2012/07/14 19:05:06, gavinp wrote: > > This doesn't work; you're setting *buf and *buf_size, but so does the > > AsyncResourceHandler, and it's later in the chain than the > > DuplicateResourceHandler. So your buffer isn't being used... You're just > hashing > > initialized memory over and over again. > > > > You shouldn't be allocating your own buffer, either. You should be just > keeping > > track of the one you're expecting to see in OnReadCompleted. And you need to > > grab it after OnWillRead(). > > Done. > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... > content/browser/renderer_host/duplicate_resource_handler.cc:65: bool* defer) { > On 2012/07/14 19:05:06, gavinp wrote: > > Once you've fixed the above, you'll want to read the buffer here. Your earlier > > crashing version was crashing because of course the AsyncResourceHandler > tosses > > the buffer once it's done. > > Done. > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... > content/browser/renderer_host/duplicate_resource_handler.cc:89: const bool > did_we_find_resource_same_origin = > On 2012/07/14 19:05:06, gavinp wrote: > > Origin isn't the right term to use here, for http://foo.com/bar/quux, the web > > origin is ('http', 'foo.com', 80). Perhaps you want original_url? > > Done. > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... > content/browser/renderer_host/duplicate_resource_handler.cc:96: > UMA_HISTOGRAM_BOOLEAN("Duplicate.ProposedCache.HitRate", true); > On 2012/07/14 19:05:06, gavinp wrote: > > Each of these macro invocations has a global variable hidden in it, so it's > best > > to avoid having too many redundant ones. > > > > Why not avoid the nested ifs, and instead something like: > > > > UMA_HISTOGRAM_BOOLEAN("Duplicate.ProposedCache.Hits", did_we_find_resource); > > > > I'm not a fan of "ProposedCache", how about "HashCache" ? Or are we even at > the > > level of a cache here? Maybe "Duplicate.Hash.Hits" ? > > Done. > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... > content/browser/renderer_host/duplicate_resource_handler.cc:97: > UMA_HISTOGRAM_BOOLEAN("Duplicate.CurrentCache.HitRate", true); > On 2012/07/14 19:05:06, gavinp wrote: > > You're not really measuring the current cache here. You're measuring an > > hypothetical URL based cache that always works when the resources are > identical. > > > > "Duplicate.HashWithUrl.Hit" ? > > Done. > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/browser/re... > content/browser/renderer_host/duplicate_resource_handler.cc:104: > UMA_HISTOGRAM_COUNTS("Duplicate.SizeKB.Miss.CurrentCache", log10(bytes_read_)); > On 2012/07/14 19:05:06, gavinp wrote: > > UMA_HISTOGRAM_COUNTS is already exponentially binned, so might as well: > > > > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.Hash.Size", bytes_read_, 1, 0x7FFFFFFF, > > 50); > > Done. > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/content_br... > File content/content_browser.gypi (right): > > https://chromiumcodereview.appspot.com/10701151/diff/13001/content/content_br... > content/content_browser.gypi:15: > '<(webkit_src_dir)/Source/WebKit/chromium/WebKit.gyp:webkit', > On 2012/07/14 19:05:06, gavinp wrote: > > '../third_party/smhasher/smhasher.gyp:murmurhash3', > > Done.
Hash function fixed. Let me know if there are any other issues that need to be fixed and next steps for landing.
This is getting a lot better. I've taken a pass through, and commented on what I saw. I'll have to go through again since I did ask for a lot of changes. I want to apologise right now for the requests I make that directly contradict any earlier requests I made. I was wrong back then, and I know that now. In addition to my comments, you should go through the lint results, and try to clean up any lint complaints that you cannot justify. You should add some tests to resource_dispatcher_host_unittest.cc. Consider conditionalising using this to just canary or dev. I suspect you'll want the faster cycle times of canary/dev anyway as beta has long latency, and release has very long latency. What are your estimates of the memory use of this? The memory cost per distinct resource is about 8 bytes. Can you use existing histograms to come up with an estimate of this cost? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:25: std::set<uint32>* GetSetOfHashes() { Naming by data type isn't ideal. Better by use: ContentHashes maybe? And ContentWithUrlHashes below? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:26: static std::set<uint32> seen_resources; Probably we should bite the bullet, and use base/memory/singleton.h for this. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:61: Lose this blank line. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:70: PMurHash32_Process(&ph1_,&pcarry_,read_buffer_->data(), bytes_read); spaces after commas. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:79: const std::string& security_info) { What should you do for status != net::URLRequestStatus::SUCCESS ? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:84: // from the same or different origin Comments should be sentences (have a period) and wrap at 80 columns. Origin isn't the right name for what you're talking about, the origin of a resource is a triple (scheme, hostname, portnumber) usually, as in ('http', 'foo.com', 80) is the origin of http://foo.com/bar/quux?blatto#thisisafragment. You want to compare the url. Of note also is that the urls you get will never have fragments. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:87: int url_length = strlen(url); This scares me a bit. Is it safe to trust c_str() after destruction of what may have been a temporary std::string? And why put the null on the string anyway, isn't std::string::data() good enough? I also don't like the O(n) strlen() when we had a C++ string which stored explicit length earlier. Lastly, you include <cstring>, but don't call std::strlen, and instead rely on one of your includes having included string.h. How about: const std::string url_spec = request_->url().spec(); then use url_spec.data() and url_spec.size() in the call to PMurHash32_Process? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:97: const bool did_we_find_resource = did_match_contents_ maybe? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:98: GetSetOfHashes()->find(resource_hash) != Use count(resource_hash) instead. You don't need to compare to end(), and it makes it more clear your testing for existence. That should make the comment above moot, too. Same just below. Also, might as well grab a ptr to this set rather than use this method, for easier reading: std::set<uint32>* content_hashes = GetSetOfHashes(); ... http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:103: const bool did_we_find_resource_original_url = I've thought about it more, and now I don't like this name, it sounds like you are checking if you found an URL match, when really you're checking URL and Data. did_match_contents_and_url_ ? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:108: // from the same url or different Best practice is to have a single instance of each histogram. It saves memory, and I think makes more readable code. Consider: UMA_HISTOGRAM_BOOLEAN("Duplicate.Hits", did_match_contents); UMA_HISTOGRAM_BOOLEAN("Duplicate.HitsSameUrl", did_match_contents && did_match_contents_and_url); if (did_match_contents && !did_match_contents_and_url) { UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.Size.HashHitUrlMiss", ...) .... http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:112: UMA_HISTOGRAM_BOOLEAN("Duplicate.Hash.Hits", true); I've thought about this a bit more: I think you just want Duplicate.Hits and Duplicate.HitsSameUrl . Sorts better (the two varieties will sort right next to each other), and shorter. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:121: UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.HashMiss.Size", bytes_read_, I think you want: "Duplicate.Size.HashHitUrlMiss" this will sort better in the histogram tool (put all varieties of sizes near each other), and better describes what we have. Do you also want .HashHit and .HashMiss ? .HashMissUrlHit ? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:124: UMA_HISTOGRAM_ENUMERATION("Duplicate.ResourceType", resource_type_, The name should reflect that it's for missed resources. "Duplicate.ResourceType.HashMiss". Do you also want to talk about the MIME type? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:129: // We did not see the resource so it is a miss on both caches I don't like comments like this. The logic should be concise enough to make this clear. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.h (right): http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:12: #include "googleurl/src/gurl.h" lose this include. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:27: ResourceType::Type resource_type, You're not using this. Lose the parameter if you don't want to use it. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:32: lose this blank line. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:46: uint32 pcarry_; I'm not a fan of either of these names. I know they come from the PMurHash interface. Maybe pmurhash_ph1_ and pmurhash_pcarry_ ? http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:47: int buffer_size_; Move this next to read_buffer_. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:49: scoped_refptr<net::IOBuffer> read_buffer_; You could use an IOBuffer* here safely I believe. Ask eroman. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:53: }; // class DuplicateResourceHandler For short classes, we don't need this comment. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:56: Extra blank line. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1016: Lose this blank line. http://codereview.chromium.org/10701151/diff/15004/content/browser/renderer_h... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1049: handler.reset( A short comment explaining what this resource handler does, and with a TODO() to clean up the experiment is probably good. http://codereview.chromium.org/10701151/diff/15004/third_party/smhasher/smhas... File third_party/smhasher/smhasher.gyp (right): http://codereview.chromium.org/10701151/diff/15004/third_party/smhasher/smhas... third_party/smhasher/smhasher.gyp:16: 'target_name': 'murmurhash2', Probably we can lose this target.
Some more histogram ideas, not sure which are most interesting: - hit rate binned by how many distinct hashes we've seen. - hit rate binned by how many distinct loads we've done. - hit rate binned by length of browser session. - hit rate binned by how many bytes we've seen.
How often do we believe collisions will occur? If we trust our hash is random, what can we do to remove collisions from our reported data?
I made some other notes, but the main note is in duplicate_resource_handler.cc. I directed you to where the segfault occurs. I change this to a constant, and it stopped segfaulting. I will try to hunt down why. Let me know if you figure it out before I do. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:25: std::set<uint32>* GetSetOfHashes() { On 2012/07/18 12:13:32, gavinp wrote: > Naming by data type isn't ideal. Better by use: ContentHashes maybe? And > ContentWithUrlHashes below? Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:26: static std::set<uint32> seen_resources; On 2012/07/18 12:13:32, gavinp wrote: > Probably we should bite the bullet, and use base/memory/singleton.h for this. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:61: On 2012/07/18 12:13:32, gavinp wrote: > Lose this blank line. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:70: PMurHash32_Process(&ph1_,&pcarry_,read_buffer_->data(), bytes_read); On 2012/07/18 12:13:32, gavinp wrote: > spaces after commas. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:79: const std::string& security_info) { On 2012/07/18 12:13:32, gavinp wrote: > What should you do for status != net::URLRequestStatus::SUCCESS ? I put in a check so that I pass through when it is not SUCCESS so the next handler can deal with it. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:84: // from the same or different origin On 2012/07/18 12:13:32, gavinp wrote: > Comments should be sentences (have a period) and wrap at 80 columns. Origin > isn't the right name for what you're talking about, the origin of a resource is > a triple (scheme, hostname, portnumber) usually, as in ('http', 'foo.com', 80) > is the origin of http://foo.com/bar/quux?blatto#thisisafragment. You want to > compare the url. Of note also is that the urls you get will never have > fragments. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:87: int url_length = strlen(url); On 2012/07/18 12:13:32, gavinp wrote: > This scares me a bit. Is it safe to trust c_str() after destruction of what may > have been a temporary std::string? And why put the null on the string anyway, > isn't std::string::data() good enough? I also don't like the O(n) strlen() when > we had a C++ string which stored explicit length earlier. Lastly, you include > <cstring>, but don't call std::strlen, and instead rely on one of your includes > having included string.h. > > How about: > > const std::string url_spec = request_->url().spec(); > > then use url_spec.data() and url_spec.size() in the call to PMurHash32_Process? Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:97: const bool did_we_find_resource = On 2012/07/18 12:13:32, gavinp wrote: > did_match_contents_ maybe? Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:103: const bool did_we_find_resource_original_url = On 2012/07/18 12:13:32, gavinp wrote: > I've thought about it more, and now I don't like this name, it sounds like you > are checking if you found an URL match, when really you're checking URL and > Data. did_match_contents_and_url_ ? Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:108: // from the same url or different On 2012/07/18 12:13:32, gavinp wrote: > Best practice is to have a single instance of each histogram. It saves memory, > and I think makes more readable code. Consider: > > UMA_HISTOGRAM_BOOLEAN("Duplicate.Hits", did_match_contents); > UMA_HISTOGRAM_BOOLEAN("Duplicate.HitsSameUrl", did_match_contents && > did_match_contents_and_url); > if (did_match_contents && !did_match_contents_and_url) { > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.Size.HashHitUrlMiss", ...) > .... Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:112: UMA_HISTOGRAM_BOOLEAN("Duplicate.Hash.Hits", true); On 2012/07/18 12:13:32, gavinp wrote: > I've thought about this a bit more: I think you just want Duplicate.Hits and > Duplicate.HitsSameUrl . Sorts better (the two varieties will sort right next to > each other), and shorter. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:121: UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.HashMiss.Size", bytes_read_, On 2012/07/18 12:13:32, gavinp wrote: > I think you want: "Duplicate.Size.HashHitUrlMiss" this will sort better in the > histogram tool (put all varieties of sizes near each other), and better > describes what we have. Do you also want .HashHit and .HashMiss ? > .HashMissUrlHit ? Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:124: UMA_HISTOGRAM_ENUMERATION("Duplicate.ResourceType", resource_type_, On 2012/07/18 12:13:32, gavinp wrote: > The name should reflect that it's for missed resources. > "Duplicate.ResourceType.HashMiss". Do you also want to talk about the MIME > type? Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:129: // We did not see the resource so it is a miss on both caches On 2012/07/18 12:13:32, gavinp wrote: > I don't like comments like this. The logic should be concise enough to make this > clear. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:12: #include "googleurl/src/gurl.h" On 2012/07/18 12:13:32, gavinp wrote: > lose this include. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:27: ResourceType::Type resource_type, On 2012/07/18 12:13:32, gavinp wrote: > You're not using this. Lose the parameter if you don't want to use it. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:32: On 2012/07/18 12:13:32, gavinp wrote: > lose this blank line. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:46: uint32 pcarry_; On 2012/07/18 12:13:32, gavinp wrote: > I'm not a fan of either of these names. I know they come from the PMurHash > interface. Maybe pmurhash_ph1_ and pmurhash_pcarry_ ? Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:47: int buffer_size_; On 2012/07/18 12:13:32, gavinp wrote: > Move this next to read_buffer_. I deleted the buffer_size_. I never use it anywhere. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:53: }; // class DuplicateResourceHandler On 2012/07/18 12:13:32, gavinp wrote: > For short classes, we don't need this comment. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:56: On 2012/07/18 12:13:32, gavinp wrote: > Extra blank line. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1016: On 2012/07/18 12:13:32, gavinp wrote: > Lose this blank line. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/content/browser/re... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1049: handler.reset( On 2012/07/18 12:13:32, gavinp wrote: > A short comment explaining what this resource handler does, and with a TODO() to > clean up the experiment is probably good. Done. https://chromiumcodereview.appspot.com/10701151/diff/15004/third_party/smhash... File third_party/smhasher/smhasher.gyp (right): https://chromiumcodereview.appspot.com/10701151/diff/15004/third_party/smhash... third_party/smhasher/smhasher.gyp:16: 'target_name': 'murmurhash2', On 2012/07/18 12:13:32, gavinp wrote: > Probably we can lose this target. Done. https://chromiumcodereview.appspot.com/10701151/diff/25002/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/25002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:95: const std::string url_spec = request_->url().spec(); The segfault happens here.
heh. http://codereview.chromium.org/10701151/diff/25002/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/25002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:95: const std::string url_spec = request_->url().spec(); It seems you never initialize request_.
Another pass through. I think we're getting pretty close! http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:22: class GlobalDuplicateRecords { I advise you to do some grepfights in our code for current Singleton names. Instead of giving advice, I will suggest you pick a name very consistant with the other extant Singletons. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:36: int* bytes_seen() { The names should be the same. Either gain a total on the method, or lose a total on the member. Either is fine. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:85: *bytes_seen += bytes_read; I think you're introducing a confusing error condition by doing this here. If the request ends in error, won't the bytes not count, and then you having added them to your global was a mistake? It's probably best to add bytes_read_ to bytes_seen once and at the end of the resource. I also really don't like the read vs seen distinction. I say we put total_ on the name, it's more clear. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:98: uint32 resource_hash = PMurHash32_Result(pmurhash_ph1_, I think contents_hash is a better name. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:101: // Hash url into the resource to see whether it is from the same or // Combine the contents_hash with the url, so we can test if future content identical resources have the same original url or not. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:114: std::set<uint32>* content_hashes = These automatics should have names identical to the methods they are calling. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:132: UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.TotalBytesSeen", *bytes_seen, This histogram doesn't seem very useful to me. Can you explain it? http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:134: content_and_url_hashes->insert(hashed_with_url); Move this line above all this histograms, so the code that is actually doing things is all together. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.h (right): http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:27: virtual ~DuplicateResourceHandler(); I have no real comment here, except to say that how sad it makes me that Visual Studio makes an error if you add OVERRIDE; to this. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:35: virtual bool OnReadCompleted(int request_id, int bytes_read, In method definitions, it's either all on one line, or one per line. I think this one can go one line. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.h:42: uint32 pmurhash_ph1_; I would like to see these pmumurhash members separated out by a blank line from the other members, with a short comment explaining that they are expected arguments to the incremental hash interface you are using. I think they'd be well placed at the end of the class, so they wouldn't get in the way of reading the other members. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1049: // TODO(frankwang) Clean up this experiment. Add : after ). Remove "in general". Wrap everything at 80 columns. Add ",gavinp" to frankwang. http://codereview.chromium.org/10701151/diff/26002/third_party/smhasher/READM... File third_party/smhasher/README.chromium (right): http://codereview.chromium.org/10701151/diff/26002/third_party/smhasher/READM... third_party/smhasher/README.chromium:4: Revision: 146 Does the fact that you had to make this change call into question this text file? Are there other examples of files like this? I'm not happy seeing this change.
You should figure out why check_deps is failing in the tries. It's blocking this change from getting meaningful testing, and either your deps are broken or check_deps is, and we should fix the one that is wrong.
I made all the noted changes or commented on them in this new patch. I finally got rid of all the lint errors. The one issue is with the "try." I am still unsure how to get rid of the "illegal include" error for the PMurHash.h include in my duplicate_resource_handler.cc file so that more tests can run. I can ping jam tomorrow to see if he can direct me to a fix. Does he work at Google? If so, what's his LDAP? https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:22: class GlobalDuplicateRecords { On 2012/07/20 01:37:59, gavinp wrote: > I advise you to do some grepfights in our code for current Singleton names. > Instead of giving advice, I will suggest you pick a name very consistant with > the other extant Singletons. I grepped for a bunch of Singletons. There is nothing distinguishing their names from other classes really. For all someone knows for a given class name, that class could be a Singleton. I think our naming is fine. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:36: int* bytes_seen() { On 2012/07/20 01:37:59, gavinp wrote: > The names should be the same. Either gain a total on the method, or lose a total > on the member. Either is fine. Done. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:85: *bytes_seen += bytes_read; On 2012/07/20 01:37:59, gavinp wrote: > I think you're introducing a confusing error condition by doing this here. If > the request ends in error, won't the bytes not count, and then you having added > them to your global was a mistake? It's probably best to add bytes_read_ to > bytes_seen once and at the end of the resource. > > I also really don't like the read vs seen distinction. I say we put total_ on > the name, it's more clear. Done. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:98: uint32 resource_hash = PMurHash32_Result(pmurhash_ph1_, On 2012/07/20 01:37:59, gavinp wrote: > I think contents_hash is a better name. Done. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:101: // Hash url into the resource to see whether it is from the same or On 2012/07/20 01:37:59, gavinp wrote: > // Combine the contents_hash with the url, so we can test if future content > identical resources have the same original url or not. Changed. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:114: std::set<uint32>* content_hashes = On 2012/07/20 01:37:59, gavinp wrote: > These automatics should have names identical to the methods they are calling. Done. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:132: UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.TotalBytesSeen", *bytes_seen, On 2012/07/20 01:37:59, gavinp wrote: > This histogram doesn't seem very useful to me. Can you explain it? I am using this histogram in a similar way to length of browser session, which was difficult to get when I looked it up. Basically, we want to see if the misses are uniformly distributed or if they are worse the more bytes we see. I am not particularly attached to this histogram. I just wanted to put it in just in case. In my mind, it was borderline, but it would be harder and more time consuming to land another histogram later on. I just put it in to be on the safe side. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:134: content_and_url_hashes->insert(hashed_with_url); On 2012/07/20 01:37:59, gavinp wrote: > Move this line above all this histograms, so the code that is actually doing > things is all together. Done. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:27: virtual ~DuplicateResourceHandler(); On 2012/07/20 01:37:59, gavinp wrote: > I have no real comment here, except to say that how sad it makes me that Visual > Studio makes an error if you add OVERRIDE; to this. That's an interesting tidbit. Thanks! https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:35: virtual bool OnReadCompleted(int request_id, int bytes_read, On 2012/07/20 01:37:59, gavinp wrote: > In method definitions, it's either all on one line, or one per line. I think > this one can go one line. Done. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.h:42: uint32 pmurhash_ph1_; On 2012/07/20 01:37:59, gavinp wrote: > I would like to see these pmumurhash members separated out by a blank line from > the other members, with a short comment explaining that they are expected > arguments to the incremental hash interface you are using. I think they'd be > well placed at the end of the class, so they wouldn't get in the way of reading > the other members. Done. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1049: // TODO(frankwang) Clean up this experiment. On 2012/07/20 01:37:59, gavinp wrote: > Add : after ). Remove "in general". Wrap everything at 80 columns. Add ",gavinp" > to frankwang. Done. https://chromiumcodereview.appspot.com/10701151/diff/26002/third_party/smhash... File third_party/smhasher/README.chromium (right): https://chromiumcodereview.appspot.com/10701151/diff/26002/third_party/smhash... third_party/smhasher/README.chromium:4: Revision: 146 On 2012/07/20 01:37:59, gavinp wrote: > Does the fact that you had to make this change call into question this text > file? Are there other examples of files like this? I'm not happy seeing this > change. There was a pre-submit error that said I should change the README to the revision I am using. I am currently using 146 so that I can have PMurHash.h. Let me know what I should do for this.
This is getting very good. So the tools warning comes from tools/checkdeps/checkdeps.py (found this by grepping "Illegal include" in our source tree). It's reading the include_rule entries out of DEPS files. I'll add one, and make an upload so we can run another try. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:22: class GlobalDuplicateRecords { On 2012/07/20 04:50:00, frankwang wrote: > On 2012/07/20 01:37:59, gavinp wrote: > > I advise you to do some grepfights in our code for current Singleton names. > > Instead of giving advice, I will suggest you pick a name very consistant with > > the other extant Singletons. > I grepped for a bunch of Singletons. There is nothing distinguishing their names > from other classes really. For all someone knows for a given class name, that > class could be a Singleton. I think our naming is fine. SGTM. http://codereview.chromium.org/10701151/diff/26002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:132: UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.TotalBytesSeen", *bytes_seen, On 2012/07/20 04:50:00, frankwang wrote: > On 2012/07/20 01:37:59, gavinp wrote: > > This histogram doesn't seem very useful to me. Can you explain it? > I am using this histogram in a similar way to length of browser session, which > was difficult to get when I looked it up. Basically, we want to see if the > misses are uniformly distributed or if they are worse the more bytes we see. I > am not particularly attached to this histogram. I just wanted to put it in just > in case. > Aha. So a fine way to get length of browser session is to grab base::TimeTicks::Now() in your singleton at construction. That will be very close to right, since it's initialized pretty much at startup. I also am not sure that this TotalBytesSeen does what you want. I think we want four histograms here to get the data you want: if (did_match_contents) { UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.Hit", ...); if (did_match_contents_and_url) UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.HitAndUrlHit", ...); else UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.HitAndUrlMiss", ...); } else { UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.Miss", ...); if (did_match_contents_and_url) UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.MissAndUrlHit", ...); else UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.MissAndUrlMiss", ...); } You might not want all six. But by comparing these, you can see how rates vary with size. > > > > In my mind, it was borderline, but it would be harder and more time consuming to > land another histogram later on. I just put it in to be on the safe side. http://codereview.chromium.org/10701151/diff/26002/third_party/smhasher/READM... File third_party/smhasher/README.chromium (right): http://codereview.chromium.org/10701151/diff/26002/third_party/smhasher/READM... third_party/smhasher/README.chromium:4: Revision: 146 On 2012/07/20 04:50:00, frankwang wrote: > On 2012/07/20 01:37:59, gavinp wrote: > > Does the fact that you had to make this change call into question this text > > file? Are there other examples of files like this? I'm not happy seeing this > > change. > There was a pre-submit error that said I should change the README to the > revision I am using. I am currently using 146 so that I can have PMurHash.h. Let > me know what I should do for this. It's fine. I just don't get why we have these files. Either we don't need this, or we don't need the DEPS file. But that's out of scope for our review. http://codereview.chromium.org/10701151/diff/35002/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/35002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:49: int total_bytes_seen_; I think you can't use an int here, it's very likely to overflow during a browser session. You'll need to use an int64. But histograms report by ints, which can be 32 bit. So when reporting by size, report by MB. Then you'll be good up to 2,251,799,812,636,672 bytes, which should be enough for anyone. http://codereview.chromium.org/10701151/diff/35002/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:135: } else { I think you don't need this else (insert does nothing if it's already there, and it's quite cheap on a hash). Worse, isn't it introducing a logic error? I think this means you never insert properly into content_and_url_matches. http://codereview.chromium.org/10701151/diff/35002/content/browser/renderer_h... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): http://codereview.chromium.org/10701151/diff/35002/content/browser/renderer_h... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1049: // TODO(frankwang, gavinp) Clean up this experiment. Need a : after the ).
Aha, reitveld won't let me upload a patch to your issue. So, you need to add a line to include rules in content/browser/DEPS , and you should be good.
I fixed the issues, and I decided to just keep the browser session instead of bytes. That seemed more in line with the data we want. Hopefully, with the change in include rule, the try will work. https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/26002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:132: UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.TotalBytesSeen", *bytes_seen, On 2012/07/20 11:38:46, gavinp wrote: > On 2012/07/20 04:50:00, frankwang wrote: > > On 2012/07/20 01:37:59, gavinp wrote: > > > This histogram doesn't seem very useful to me. Can you explain it? > > I am using this histogram in a similar way to length of browser session, which > > was difficult to get when I looked it up. Basically, we want to see if the > > misses are uniformly distributed or if they are worse the more bytes we see. I > > am not particularly attached to this histogram. I just wanted to put it in > just > > in case. > > > > Aha. So a fine way to get length of browser session is to grab > base::TimeTicks::Now() in your singleton at construction. That will be very > close to right, since it's initialized pretty much at startup. > > I also am not sure that this TotalBytesSeen does what you want. I think we want > four histograms here to get the data you want: > > if (did_match_contents) { > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.Hit", ...); > if (did_match_contents_and_url) > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.HitAndUrlHit", ...); > else > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.HitAndUrlMiss", ...); > } else { > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.Miss", ...); > if (did_match_contents_and_url) > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.MissAndUrlHit", ...); > else > UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.ByTotalMBSeen.MissAndUrlMiss", ...); > } > > You might not want all six. But by comparing these, you can see how rates vary > with size. > > > > > > > > > In my mind, it was borderline, but it would be harder and more time consuming > to > > land another histogram later on. I just put it in to be on the safe side. > I think browser session length should be sufficient for what we need. We don't need that much detail (at least for right now). https://chromiumcodereview.appspot.com/10701151/diff/35002/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/35002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:49: int total_bytes_seen_; On 2012/07/20 11:38:47, gavinp wrote: > I think you can't use an int here, it's very likely to overflow during a browser > session. > > You'll need to use an int64. But histograms report by ints, which can be 32 bit. > So when reporting by size, report by MB. Then you'll be good up to > 2,251,799,812,636,672 bytes, which should be enough for anyone. I took this out and went with session length. https://chromiumcodereview.appspot.com/10701151/diff/35002/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:135: } else { On 2012/07/20 11:38:47, gavinp wrote: > I think you don't need this else (insert does nothing if it's already there, and > it's quite cheap on a hash). Worse, isn't it introducing a logic error? I think > this means you never insert properly into content_and_url_matches. Done. https://chromiumcodereview.appspot.com/10701151/diff/35002/content/browser/re... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/35002/content/browser/re... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1049: // TODO(frankwang, gavinp) Clean up this experiment. On 2012/07/20 11:38:47, gavinp wrote: > Need a : after the ). Done.
http://codereview.chromium.org/10701151/diff/40003/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/40003/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:53: int64 browser_start_time_; Why isn't this of type base::TimeTicks ? http://codereview.chromium.org/10701151/diff/40003/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:130: GlobalDuplicateRecords::GetInstance()->browser_start_time(); Why isn't this of type base::TimeDelta ? http://codereview.chromium.org/10701151/diff/40003/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:138: 1, 0x7FFFFFFF, 50); Isn't this only useful if you have Duplicate.BySessionLength.<someotherthing> to compare it to? Can you explain the use of this without one of the other ones to compare to?
I decided against that histogram in question because I don't really see any use for it in the future. If we really need it, I'll need to think of better metrics to start off with. I also fixed the types to try to pass the win_rel test.
I just sent this in to the try bots. I think it's ready to change the name, and go through a formal review after those results come in. http://codereview.chromium.org/10701151/diff/40003/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/40003/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:138: 1, 0x7FFFFFFF, 50); On 2012/07/20 17:58:17, gavinp wrote: > Isn't this only useful if you have Duplicate.BySessionLength.<someotherthing> to > compare it to? Can you explain the use of this without one of the other ones to > compare to? I'm sad to see this idea gone. I think it's most interesting not with respect to time, but with respect to the number of resources retrieved. In particular, I'd like to gather enough data to estimate the equilibrium hit rate on each of these variables. With the data we have here, it's unclear how we'll separate newly started browsers (with many misses) from the older browsers with equilibrium rates.
There is still an error in win_rel that I need to fix. If you have any idea how to fix it, let me know. What are the steps needed to get it through the formal review process?
I tried the fix I suggested, and fired off a win_rel job. If it works, you know what to do. http://codereview.chromium.org/10701151/diff/29006/content/browser/renderer_h... File content/browser/renderer_host/duplicate_resource_handler.cc (right): http://codereview.chromium.org/10701151/diff/29006/content/browser/renderer_h... content/browser/renderer_host/duplicate_resource_handler.cc:110: const bool did_match_contents = content_matches->count(contents_hash); Visual studio warns on this int degrading to a bool: E:\b\build\slave\win\build\src\content\browser\renderer_host\duplicate_resource_handler.cc(110) : warning C4800: 'unsigned int' : forcing value to bool 'true' or 'false' (performance warning) I'd just add "> 0" to the end.
Hopefully, this eliminates the compile error for win_rel. https://chromiumcodereview.appspot.com/10701151/diff/29006/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/29006/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:110: const bool did_match_contents = content_matches->count(contents_hash); On 2012/07/22 17:20:18, gavinp wrote: > Visual studio warns on this int degrading to a bool: > > E:\b\build\slave\win\build\src\content\browser\renderer_host\duplicate_resource_handler.cc(110) > : warning C4800: 'unsigned int' : forcing value to bool 'true' or 'false' > (performance warning) > > I'd just add "> 0" to the end. Done.
On 2012/07/22 20:44:29, frankwang wrote: > Hopefully, this eliminates the compile error for win_rel. I think it will. I just fired off a try. My apologies over the confusing try that failed; git try has awesome behaviour of ignoring changes in the local index, and instead only trying committed changes. It is unclear to me why this is a good idea. > > https://chromiumcodereview.appspot.com/10701151/diff/29006/content/browser/re... > File content/browser/renderer_host/duplicate_resource_handler.cc (right): > > https://chromiumcodereview.appspot.com/10701151/diff/29006/content/browser/re... > content/browser/renderer_host/duplicate_resource_handler.cc:110: const bool > did_match_contents = content_matches->count(contents_hash); > On 2012/07/22 17:20:18, gavinp wrote: > > Visual studio warns on this int degrading to a bool: > > > > > E:\b\build\slave\win\build\src\content\browser\renderer_host\duplicate_resource_handler.cc(110) > > : warning C4800: 'unsigned int' : forcing value to bool 'true' or 'false' > > (performance warning) > > > > I'd just add "> 0" to the end. > > Done.
Hi Darin, I've heard you've done a lot of resource handler work lately, WDYT? Frank
Just some initial comments... https://chromiumcodereview.appspot.com/10701151/diff/37005/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/37005/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:89: MH_UINT32 contents_hash = PMurHash32_Result(pmurhash_ph1_, nit: move the code between the two OnResponseCompleted calls into a helper function so you can write the code like this: if (status.is_success()) DoStuff(); return next_handler_->OnResponseCompleted(...); https://chromiumcodereview.appspot.com/10701151/diff/37005/content/browser/re... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/37005/content/browser/re... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1046: // Install DuplicateResourceHandler, which keeps a hash of resources seen and nit: This is not the best place to write a comment that defines what DuplicateResourceHandler is. That kind of comment belongs in the header file for DuplicateResourceHandler. BTW, the name DuplicateResourceHandler is a bit odd. https://chromiumcodereview.appspot.com/10701151/diff/37005/third_party/smhash... File third_party/smhasher/smhasher.gyp (right): https://chromiumcodereview.appspot.com/10701151/diff/37005/third_party/smhash... third_party/smhasher/smhasher.gyp:19: 'src/PMurHash.h', did you forget to add these source files to the CL?
Some changes, comments, and questions. https://chromiumcodereview.appspot.com/10701151/diff/37005/content/browser/re... File content/browser/renderer_host/duplicate_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/37005/content/browser/re... content/browser/renderer_host/duplicate_resource_handler.cc:89: MH_UINT32 contents_hash = PMurHash32_Result(pmurhash_ph1_, On 2012/07/23 23:30:14, darin wrote: > nit: move the code between the two OnResponseCompleted calls into a helper > function so you can write the code like this: > > if (status.is_success()) > DoStuff(); > return next_handler_->OnResponseCompleted(...); Done. https://chromiumcodereview.appspot.com/10701151/diff/37005/content/browser/re... File content/browser/renderer_host/resource_dispatcher_host_impl.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/37005/content/browser/re... content/browser/renderer_host/resource_dispatcher_host_impl.cc:1046: // Install DuplicateResourceHandler, which keeps a hash of resources seen and On 2012/07/23 23:30:14, darin wrote: > nit: This is not the best place to write a comment that defines what > DuplicateResourceHandler is. That kind of comment belongs in the header file > for DuplicateResourceHandler. > > BTW, the name DuplicateResourceHandler is a bit odd. Done. Do you have any suggestions for the name? DuplicateResourceHandler was the first one that came to mind. https://chromiumcodereview.appspot.com/10701151/diff/37005/third_party/smhash... File third_party/smhasher/smhasher.gyp (right): https://chromiumcodereview.appspot.com/10701151/diff/37005/third_party/smhash... third_party/smhasher/smhasher.gyp:19: 'src/PMurHash.h', On 2012/07/23 23:30:14, darin wrote: > did you forget to add these source files to the CL? It is part of the repo if DEPS has smhasher at revision 146, so it does not show up in the CL. I also modified the necessary DEPS files.
I changed the class from DuplicateResourceHandler to something more descriptive DuplicateContentResourceHandler. I have also added a comment at the top of the header. What else needs to be done for LGTM?
https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:17: nit: no new line here https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:21: public: nit: indentation https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:23: return Singleton<GlobalDuplicateRecords>::get(); we should only use Singleton<> for static objects that may be accessed from multiple threads. it doesn't seem like you have to access this class from multiple threads. you can just use a static variable here and return the address of that. that will lazily construct the GlobalDuplicateRecords object w/o the overhead of Singleton. static GlobalDuplicateRecords* GetInstance() { static GlobalDuplicateRecords records; return &records; } https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:41: std::set<MH_UINT32> content_matches_; how big can these sets grow? do we need to worry about memory bloat in the browser process? https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:44: } // namespace nit: add a new line above the close of the namespace https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:74: PMurHash32_Process(&pmurhash_ph1_, &pmurhash_pcarry_, how do we know that this function isn't going to slow down resource loading appreciably? maybe you should collect metrics for that? you should probably run some tests on your local machine to see what the histogram reports. ideally, it would only have samples at 0. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:86: DuplicateContentResourceHandler::RecordContentMetrics(status); no need for the "DuplicateContentResourceHandler::" prefix. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:92: RecordContentMetrics(const net::URLRequestStatus& status) { you don't seem to need this parameter. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:93: MH_UINT32 contents_hash = PMurHash32_Result(pmurhash_ph1_, what about the runtime of this function? is it costly? https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:94: pmurhash_pcarry_, bytes_read_); nit: indentation of the parameters https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:99: const std::string url_spec = request_->url().spec(); nit: you should use a const ref here: "const std::string& url_spec" https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:122: content_and_url_matches->insert(hashed_with_url); nit: indentation https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:134: } // namespace content nit: add a new line above the close of the namespace https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:135: nit: extraneous blank line? https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.h:6: // url.It is an experiment meant to simulate cache hits for both a url-based nit: add a space after the period. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.h:46: void RecordContentMetrics(const net::URLRequestStatus& status); nit: put a new line above this function since it is not part of the implementation of ResourceHandler.
I fixed some nits and some small parts of the code. I also added some comments in response to concerns and comments that you had. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:17: On 2012/07/25 22:02:06, darin wrote: > nit: no new line here Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:21: public: On 2012/07/25 22:02:06, darin wrote: > nit: indentation Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:23: return Singleton<GlobalDuplicateRecords>::get(); On 2012/07/25 22:02:06, darin wrote: > we should only use Singleton<> for static objects that may be accessed from > multiple threads. it doesn't seem like you have to access this class from > multiple threads. > > you can just use a static variable here and return the address of that. that > will lazily construct the GlobalDuplicateRecords object w/o the overhead of > Singleton. > > static GlobalDuplicateRecords* GetInstance() { > static GlobalDuplicateRecords records; > return &records; > } Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:41: std::set<MH_UINT32> content_matches_; On 2012/07/25 22:02:06, darin wrote: > how big can these sets grow? do we need to worry about memory bloat in the > browser process? We crawled the Alexa top 25,000 list and found that there are about 620,000 unique resources on those pages, which is about 25 unique resources per page. We are using 8 bytes per resource. I think on average to be generous, a person will visit 100 pages per session. We imagine that the amount of duplicated content to hover around 10 percent. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:44: } // namespace On 2012/07/25 22:02:06, darin wrote: > nit: add a new line above the close of the namespace Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:74: PMurHash32_Process(&pmurhash_ph1_, &pmurhash_pcarry_, On 2012/07/25 22:02:06, darin wrote: > how do we know that this function isn't going to slow down resource loading > appreciably? maybe you should collect metrics for that? you should probably > run some tests on your local machine to see what the histogram reports. > ideally, it would only have samples at 0. PMurHash is relatively fast. The MurMur hash family is much faster than the typical lookup3. This is just an incremental version of that family which processes data at a rate of 3-4 GB/s. I don't think this speed will be an issue. I have also looked at the source code and the operations it does are very fast. Moreover, it is hardware optimized, from what I know, and designed with speed and performance in mind. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:86: DuplicateContentResourceHandler::RecordContentMetrics(status); On 2012/07/25 22:02:06, darin wrote: > no need for the "DuplicateContentResourceHandler::" prefix. Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:92: RecordContentMetrics(const net::URLRequestStatus& status) { On 2012/07/25 22:02:06, darin wrote: > you don't seem to need this parameter. Deleted. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:93: MH_UINT32 contents_hash = PMurHash32_Result(pmurhash_ph1_, On 2012/07/25 22:02:06, darin wrote: > what about the runtime of this function? is it costly? Comment left above about PMurHash. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:94: pmurhash_pcarry_, bytes_read_); On 2012/07/25 22:02:06, darin wrote: > nit: indentation of the parameters Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:99: const std::string url_spec = request_->url().spec(); On 2012/07/25 22:02:06, darin wrote: > nit: you should use a const ref here: "const std::string& url_spec" Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:122: content_and_url_matches->insert(hashed_with_url); On 2012/07/25 22:02:06, darin wrote: > nit: indentation Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:134: } // namespace content On 2012/07/25 22:02:06, darin wrote: > nit: add a new line above the close of the namespace Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:135: On 2012/07/25 22:02:06, darin wrote: > nit: extraneous blank line? Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.h:6: // url.It is an experiment meant to simulate cache hits for both a url-based On 2012/07/25 22:02:06, darin wrote: > nit: add a space after the period. Done. https://chromiumcodereview.appspot.com/10701151/diff/48001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.h:46: void RecordContentMetrics(const net::URLRequestStatus& status); On 2012/07/25 22:02:06, darin wrote: > nit: put a new line above this function since it is not part of the > implementation of ResourceHandler. Done.
just a few nits... then you are good to go https://chromiumcodereview.appspot.com/10701151/diff/56001/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/56001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:60: int* buf_size, int min_size) { nit: indentation nit: formatting, should look like this: bool Foo::Bar( A a, B b, C c, D d) { ... } https://chromiumcodereview.appspot.com/10701151/diff/56001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:70: OnReadCompleted(int request_id, int bytes_read, bool* defer) { ditto... you formatted correctly for OnResponseCompleted https://chromiumcodereview.appspot.com/10701151/diff/56001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:101: /*DVLOG(4) << "url: " << url_spec; nit: either uncomment this code or delete the code. https://chromiumcodereview.appspot.com/10701151/diff/56001/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/56001/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.h:30: ResourceType::Type resource_type, nit: indentation
nits fixed.
LGTM with nits fixed. (but I'm not an OWNER). https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:16: namespace { Why not move the anon namespace out of content? It doesn't reference anything in content. https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:71: bool DuplicateContentResourceHandler::OnReadCompleted( You can move the first parameter up here, and indent the rest; saving a line. Same for OnWillRead above as well. https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:119: UMA_HISTOGRAM_BOOLEAN("Duplicate.HitsSameUrl", did_match_contents && Could we break this line after the ", " for better readability, without gaining a line? looks to me like it fits. https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.h:7: // and content-based cache. "and a content-based cache."
more nits fixed. darin, WDYT? https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:16: namespace { On 2012/07/31 19:26:18, gavinp wrote: > Why not move the anon namespace out of content? It doesn't reference anything in > content. Done. https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:71: bool DuplicateContentResourceHandler::OnReadCompleted( On 2012/07/31 19:26:18, gavinp wrote: > You can move the first parameter up here, and indent the rest; saving a line. > Same for OnWillRead above as well. Is that the format? It seems like the parameters need to be on a different line according to darin's guideline. https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:119: UMA_HISTOGRAM_BOOLEAN("Duplicate.HitsSameUrl", did_match_contents && On 2012/07/31 19:26:18, gavinp wrote: > Could we break this line after the ", " for better readability, without gaining > a line? looks to me like it fits. Yup, it does. https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.h (right): https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.h:7: // and content-based cache. On 2012/07/31 19:26:18, gavinp wrote: > "and a content-based cache." Done.
https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:71: bool DuplicateContentResourceHandler::OnReadCompleted( On 2012/07/31 20:26:18, frankwang wrote: > On 2012/07/31 19:26:18, gavinp wrote: > > You can move the first parameter up here, and indent the rest; saving a line. > > Same for OnWillRead above as well. > Is that the format? It seems like the parameters need to be on a different line > according to darin's guideline. Each parameter must be on its own line, but it is preferred to move them up to the declaration line. From the chromium coding standard: For function declarations and definitions, put each argument on a separate line unless everything fits on one line. You should only break this rule when the arguments have some logical association like "int x, int y" or "char* str, int str_len" where putting a few arguments together makes semantic sense. Do NOT collapse function arguments together just because you have "too many": void OK1(TypeA first_argument, TypeB second_argument, TypeC third_argument); // Prefer this to the next form when it's possible. void OK2(TypeA first_argument, TypeB second_argument, TypeC thir_argument); void OK3( TypeA first_argument, TypeB second_argument, TypeC third_argument); // Only legal when |second_argument| and |third_argument| are intimately // related, e.g. (x, y) coords. void SpecialCase(TypeA first_argument, TypeB second_argument, TypeC third_argument, TypeD fourth_argument); // Not legal, args are collapsed but overall decl doesn't fit on one line void BadWrappingStyle( TypeA first_argument, TypeB second_argument, TypeC third_argument); // Perhaps there is a way to break things up into functions with fewer arguments... void MyFunctionTakesTooManyArguments(Type first_argument, Type second_argument, Type third_argument, Type fourth_argument, Type fifth_argument, Type sixth_argument, Type seventh_argument, Type etc_argument); Please change your functions to the OK2 format above where it fits.
https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... File content/browser/renderer_host/duplicate_content_resource_handler.cc (right): https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:16: namespace { On 2012/07/31 19:26:18, gavinp wrote: > Why not move the anon namespace out of content? It doesn't reference anything in > content. Sorry to contradict, but I think it is generally better to place the anonymous namespace within the containing namespace. That way, we are consistent from file to file. Sometimes the anon namespace will need to reference things from the content:: namespace. The last thing I want to see in source files under src/content is 'content::Foo' or 'using content::Foo'. Those just add noise to the source. There's no downside to putting the anon namespace inside the content namespace, right? https://chromiumcodereview.appspot.com/10701151/diff/54004/content/browser/re... content/browser/renderer_host/duplicate_content_resource_handler.cc:71: bool DuplicateContentResourceHandler::OnReadCompleted( gavin is correct. sorry, i assumed since you had placed the method name on a separate line that the OK2 format would not have worked. i only suggested OK3 format assuming that OK2 format would not have worked. sorry for the confusion (and welcome to chromium style nit picking)!!
LGTM
final nits fixed. I put the anon namespace back into the content namespace. I fixed the parameter spacing. I realized why I didn't do it for OnResponseCompleted (one of the parameters was too long).
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/frankwang@google.com/10701151/51005
Try job failure for 10701151-51005 (retry) on mac_rel for step "compile" (clobber build). It's a second try, previously, step "compile" failed. http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=mac_rel&nu...
On 2012/07/31 21:10:57, I haz the power (commit-bot) wrote: > Try job failure for 10701151-51005 (retry) on mac_rel for step "compile" > (clobber build). > It's a second try, previously, step "compile" failed. > http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=mac_rel&nu... The exit time destructor strikes again! This is what caused me to think you should use the singleton, I ran into this problem building your code too. Perhaps you could use "new" to allocate it? Then you'll leak it, so be prepared for a complaint from the memory trybots, and you can add a suppression for your "leak."
Changed to static variable to static LazyInstance::Leaky. Hope it goes through.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/frankwang@google.com/10701151/44014
Change committed as 149324
Hi, In investigating http://crbug.com/114570, once the WebKit patches listed there are applied, the two |set|s in this CL are causing a slow leak. Can you think of a way to prune GlobalDuplicateRecords::content_matches_ and ::content_and_url_matches_? Alternatively, we should exclude or handle blob urls. They are of the form "blob:http://google.com/<guid>/<raw_binary_data>". Because of the GUID, the metrics are probably not reporting what you want in this case anyway, as they'll always be unique. data: urls are worth considering too. There's no guid, but a page could be generating many unique data: urls which could eventually fill this set also. |