OLD | NEW |
---|---|
(Empty) | |
1 Here's a rough walkthrough of how this works. The ultimate output file is | |
2 database.filtered.json. | |
nweiz
2012/02/01 00:10:39
After reading this, I'm having trouble understandi
| |
3 | |
4 search.js | |
5 - read data/domTypes.json | |
nweiz
2012/02/01 00:10:39
What's in this file? Where does it come from?
| |
6 - for each dom type: | |
7 - search for page on www.googleapis.com | |
8 - write search results to output/search/<type>.json | |
9 . this is a list of search results and urls to pages | |
10 | |
11 crawl.js | |
12 - read data/domTypes.json | |
13 - for each dom type: | |
14 - for each output/search/<type>.json: | |
nweiz
2012/02/01 00:10:39
Isn't there only one of these files for each type?
| |
15 - for each result in the file: | |
16 - try to scrape that cached MDN page from webcache.googleusercontent.com | |
17 - write MDN page to output/crawl/<type><index of result>.html | |
18 - write output/crawl/cache.json | |
19 . it maps types -> search result page urls and titles | |
20 | |
21 extract.sh | |
nweiz
2012/02/01 00:10:39
Should probably mention which directory this needs
| |
22 - compile extract.dart to js | |
23 - run extractRunner.js | |
nweiz
2012/02/01 00:10:39
Is this the same as the compiled extract.dart? If
| |
24 - read data/domTypes.json | |
25 - read output/crawl/cache.json | |
26 - read data/dartIdl.json | |
nweiz
2012/02/01 00:10:39
What's in this file? Where does it come from?
| |
27 - for each scraped search result page: | |
28 - create a cleaned up html page in output/extract/<type><index>.html that | |
29 contains the scraped content + a script tag that includes extract.dart.js. | |
30 - create an args file in output/extract/<type><index>.html.json with some | |
31 data on how that file should be processed | |
nweiz
2012/02/01 00:10:39
s/that file/the HTML file/
What sort of data? Wha
| |
32 - invoke DumpRenderTree on that file | |
nweiz
2012/02/01 00:10:39
Make it more explicit that this invokes it in a he
| |
33 - when that returns, parse the console output and add it to database.json | |
nweiz
2012/02/01 00:10:39
Does this mean output/database.json?
| |
34 - add any errors to output/errors.json | |
35 - save output/database.json | |
nweiz
2012/02/01 00:10:39
Somewhat confusing given that you just said you we
| |
36 | |
37 extract.dart | |
nweiz
2012/02/01 00:10:39
Is this run within extractRunner.js? How is its fu
| |
38 - xhr output/extract/<type><index>.html.json | |
nweiz
2012/02/01 00:10:39
Is this different than the "read *.json" you're do
| |
39 - all sorts of shenanigans to actually pull the content out of the html | |
40 - build a JSON object with the results | |
41 - do a postMessage with that object so extractRunner.js can pull it out | |
42 | |
43 - run postProcess.dart | |
nweiz
2012/02/01 00:10:39
Is this run via DumpRenderTree? On the VM? On Frog
| |
44 - go through the results for each type looking for the best match | |
nweiz
2012/02/01 00:10:39
Mention what files you're using here.
| |
45 - write output/database.html | |
46 - write output/examples.html | |
47 - write output/obsolete.html | |
nweiz
2012/02/01 00:10:39
What are all these files for? Why are they in HTML
| |
48 - write output/database.filtered.json which is the best matches | |
nweiz
2012/02/01 00:10:39
Is this just a mapping of type names to the conten
| |
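For orientation, the file-naming scheme that the walkthrough describes can be sketched in JavaScript as below. This is an illustration only, not code from the change under review: the helper name `pipelinePaths`, its parameters, and the `output` root default are assumptions, and only the path patterns themselves are taken from the walkthrough.

```javascript
// Hypothetical helper (not part of the reviewed change): derives the
// per-result file paths named in the walkthrough for one DOM type and
// one search-result index.
function pipelinePaths(type, index, root = "output") {
  // <type><index>.html is the shared base name for crawl and extract output.
  const page = `${type}${index}.html`;
  return {
    // Search results for the type, written by search.js.
    search: `${root}/search/${type}.json`,
    // Scraped MDN page, written by crawl.js.
    crawl: `${root}/crawl/${page}`,
    // Cleaned-up page with the extract.dart.js script tag.
    extract: `${root}/extract/${page}`,
    // Args file that sits next to the cleaned-up page.
    args: `${root}/extract/${page}.json`,
  };
}

console.log(pipelinePaths("Node", 0));
```

With a type named Node and result index 0, the crawl path comes out as output/crawl/Node0.html and the args file as output/extract/Node0.html.json, which matches the patterns on lines 17 and 28-30 of the walkthrough.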