Flagging some more technical points about the scraping above (verbose, quickly written):
This scraping might be in the form of API calls that occur every few minutes. The burden of these calls seems small (?) relative to the mundane, everyday use of the API, e.g. see GreaterWrong or Issa Rice’s site.
Just to be super clear, I think the computing costs for the backend activity of these calls are probably <$1 a month.
It seems there aren’t rules/norms for rate limits, and there is some evidence that the EA Forum/LessWrong may not handle heavy use of API calls robustly:
Calls that seem fairly large are allowed. To me, these calls seem large compared to, say, the response and size limits of the Gmail API and other commercial APIs I’ve used.
Pagination isn’t directly supported in the API, and for many calls there aren’t even date filters (“before:”/”after:”) with which to approximate pagination. However, I found additional query “views”, such as MultiCommentOutput, which accept an offset, so you can paginate after all.
The API exposes certain information that isn’t available on the front-end website. However, I am reluctant to elaborate because (1) this same information is available another way, so it’s not quite a leak; (2) I’m a noob, but this was easy to find, which I take as a sign it’s benign and maybe already used; (3) I don’t want to just add a low-value ticket to someone’s Kanban board; and (4) I find this information interesting!
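To make the offset-based pagination concrete, here is a rough sketch in Python. The endpoint URL, the view name, and the field names are my assumptions about the forum’s GraphQL schema, not something confirmed above — check them against the live schema before using this.

```python
# Sketch: paginating forum comments via a "view" that accepts limit/offset.
# ASSUMPTIONS: the endpoint URL, the view name "recentComments", and the
# result fields are guesses at the schema; verify before relying on them.
import json
import urllib.request

GRAPHQL_URL = "https://forum.effectivealtruism.org/graphql"

def build_comments_query(limit: int, offset: int) -> str:
    """Build a GraphQL query string for one page of comments."""
    return """
    {
      comments(input: {terms: {view: "recentComments", limit: %d, offset: %d}}) {
        results { _id postedAt baseScore }
      }
    }
    """ % (limit, offset)

def fetch_page(limit: int, offset: int) -> list:
    """POST one page's query and return the list of comment records."""
    body = json.dumps({"query": build_comments_query(limit, offset)}).encode()
    req = urllib.request.Request(
        GRAPHQL_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]["comments"]["results"]

def iterate_all(page_size: int = 50):
    """Yield comments page by page; a short page signals the end."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```

Being gentle with page size and call frequency seems wise given the lack of stated rate limits.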
Other comments on the purpose (also verbose, quickly written):
This “higher resolution” scraping might help answer interesting questions. I don’t want to write details, mainly because I’m in the fun, initial 10%/ideation stage of a side project. In this stage, usually I see something shiny, like a batch of kittens in the neighborhood that need fostering, and the project ends.
Not really related to high-frequency temporal scraping, but related to scraping in general: this is useful for getting around certain limitations of the API. E.g., see the part in Issa Rice’s thoughtful walkthrough of GraphQL where he says “Some queries are hard/impossible to do. Examples: (1) getting comments of a user by placing conditions on the parent comment or post (e.g. finding all comments by user 1 where they are replying to user 2); (2) querying and sorting posts by a function of arbitrary fields (e.g. as a function of baseScore and voteCount); (3) finding the highest-karma users looking only at the past N days of activity.”
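One way scraping gets around those “hard/impossible” queries is to fetch broadly and filter client-side. A sketch of query (1) from that list, assuming hypothetical minimal comment records (the real API returns richer objects):

```python
# Client-side version of "all comments by user 1 replying to user 2".
# The record shape (_id, userId, parentCommentId) is a simplifying assumption.
def replies_from_to(comments, author_id, parent_author_id):
    """Return comments by author_id whose parent comment is by parent_author_id."""
    by_id = {c["_id"]: c for c in comments}  # index for parent lookups
    return [
        c for c in comments
        if c["userId"] == author_id
        and c.get("parentCommentId") in by_id
        and by_id[c["parentCommentId"]]["userId"] == parent_author_id
    ]
```

Once the data is scraped locally, the other two examples (sorting by a function of baseScore and voteCount, karma over the past N days) become equally straightforward in-memory operations.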
I guess one reason I’m writing all this is to make sure there isn’t some big blocker, before I spend the time grokking my AWS Lambda cookbook, or whatever.
Hey, I have a series of js snippets that I’ve put some love into that might be of help; do reach out via PM.
Hi Nuño,
This is generous of you.
So I managed to stitch together a quick script in Python, consisting of GraphQL queries created per the post here, plus Python requests/urllib3.
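The core of such a script is small. A sketch with requests — the view name and result fields are assumptions about the schema, per the walkthrough post, so verify them against the live API:

```python
# Minimal GraphQL-over-HTTP call with requests.
# ASSUMPTIONS: view "top" and the result fields are guesses at the schema.
import requests

GRAPHQL_URL = "https://forum.effectivealtruism.org/graphql"

def posts_query(view: str, limit: int) -> str:
    """Build a query string for posts under a given view."""
    return (
        '{ posts(input: {terms: {view: "%s", limit: %d}}) '
        '{ results { title baseScore } } }' % (view, limit)
    )

def top_posts(limit: int = 5, url: str = GRAPHQL_URL) -> list:
    """Fetch the top posts; raises on HTTP errors."""
    r = requests.post(url, json={"query": posts_query("top", limit)}, timeout=30)
    r.raise_for_status()
    return r.json()["data"]["posts"]["results"]
```

requests handles the JSON encoding and connection pooling (via urllib3 underneath), which is why it pairs naturally with this kind of quick script.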
If you have something interesting written up in js, that would be cool to share! I guess you have much deeper knowledge of the API than I do.
It was a bit of a hassle getting it packaged and running on AWS, with Lambda calls every few minutes. But I got it working!
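Packaging aside, the Lambda entry point itself can stay tiny when driven by a schedule rule (e.g. EventBridge `rate(5 minutes)`). A sketch, where `run_scrape` is a hypothetical placeholder for the actual query-and-store pass:

```python
# Sketch of a scheduled-scrape Lambda handler.
# run_scrape is a hypothetical stand-in for the real query-and-store logic.
import json
import time

def run_scrape() -> dict:
    """Placeholder for one polling pass (query the API, store a snapshot)."""
    return {"scraped_at": int(time.time()), "items": 0}

def lambda_handler(event, context):
    """AWS Lambda entry point; the schedule rule supplies `event`."""
    snapshot = run_scrape()
    return {"statusCode": 200, "body": json.dumps(snapshot)}
```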
Now, witness the firepower of this fully armed and operational battlestation!