Here is some code you can use to get the most essential data from all the forum posts: https://github.com/filyp/tree-of-tags/blob/main/tree_of_tags/forum_queries.py
Also, I used that data to build a topic model of the forum, as you mentioned. You can see it here.

I tried running this code, but it sends a query that gets an "Exceeded maximum value for skip" error. Looking at the ForumMagnum code, this is:
// Don't allow API requests with an offset provided >2000. This prevents some
// extremely-slow queries.
if (terms.offset && (terms.offset > maxAllowedSkip)) {
  throw new Error("Exceeded maximum value for skip");
}
It looks like the way tree_of_tags works, with an offset that keeps increasing by chunk_size, isn't compatible with the current forum API.
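For concreteness, the incompatible pattern looks roughly like this (a minimal sketch; the endpoint URL is a placeholder, and the paging loop is my reconstruction of the offset approach, not tree_of_tags's actual code):

import requests

# Placeholder endpoint for illustration; the real URL and headers depend on
# the forum instance being queried.
url = "https://forum.example.com/graphql"

offset_query = """
{
  posts(input: {terms: {view: "new", limit: %d, offset: %d}}) {
    results { _id }
  }
}
"""

chunk_size = 500
offset = 0
posts = []
while True:
    r = requests.post(url, json={"query": offset_query % (chunk_size, offset)})
    data = r.json()
    if data.get("errors"):
        print(data["errors"])  # here: "Exceeded maximum value for skip"
        break
    results = data["data"]["posts"]["results"]
    if not results:
        break
    posts.extend(results)
    # Incrementing the offset like this eventually passes the 2000 cap,
    # at which point the API rejects the query.
    offset += chunk_size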
When I tried repurposing this code I got a similar error, so that portion of my code now looks like this:
import json

import requests
import pandas as pd

query = """
{
  posts(input: {
    terms: {
      view: "new"
      after: "%s"
      before: "%s"
      limit: %d
      meta: null
    }
  }) {
    results {
      _id
      title
      postedAt
      postedAtFormatted
      user {
        username
        displayName
      }
      wordCount
      voteCount
      baseScore
      commentCount
      tags {
        name
      }
    }
  }
}
"""
def send_graphql_request(input_query):
    # POST the query to the forum's GraphQL endpoint and decode the JSON
    # reply. url and headers are defined elsewhere in the script.
    r = requests.post(url, json={"query": input_query}, headers=headers)
    data = json.loads(r.text)
    return data
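If a request fails, it can help to surface HTTP and GraphQL errors explicitly rather than silently returning a body with no data. A small variant (the error handling is my addition, not part of the original script):

def send_graphql_request_checked(input_query):
    # Variant with basic error handling (my addition): surface HTTP failures
    # and any GraphQL-level errors reported in the response body.
    r = requests.post(url, json={"query": input_query}, headers=headers)
    r.raise_for_status()
    data = r.json()
    if "errors" in data:
        raise RuntimeError(data["errors"])
    return data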
def return_data(input_query):
    # Collect results across time windows, then assemble them into a DataFrame
    data_list = []
    # The first window spans 1960-01-01 to 2019-01-01 in a single chunk;
    # after that, the windows advance in 4-month steps
    start_date = pd.Timestamp("1960-01-01")
    end_date = pd.Timestamp("2019-01-01")
    delta = pd.DateOffset(months=4)
    # Iterate over the time intervals until we reach the present
    while start_date < pd.Timestamp.now():
        print(start_date)
        # Fill in the after/before dates and the limit for this window
        query = input_query % (
            start_date.strftime("%Y-%m-%d"),
            end_date.strftime("%Y-%m-%d"),
            3000,
        )
        # Send the GraphQL request and pull out the posts for this window
        response = send_graphql_request(query)
        results = response["data"]["posts"]["results"]
        data_list.extend(results)
        # Slide the window forward for the next iteration
        start_date = end_date
        end_date += delta
    # Build a DataFrame from the collected posts
    df = pd.DataFrame(data_list)
    df["postedAt"] = pd.to_datetime(df["postedAt"])
    return df
So ~3000 seemed to be a fine chunk size to get every post, and I use the date strings to iterate through the date windows and collect all posts in postedAt order.
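As a quick usage sketch (it assumes url and headers are already defined for the forum's GraphQL endpoint):

# Usage sketch: pull all posts and inspect the newest few.
df = return_data(query)
print(df.sort_values("postedAt")[["title", "postedAt", "baseScore"]].tail())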
Thanks! I tried extending this for comments, but it looks like it ignores terms; filed an issue.
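For reference, a comments version of the query would look roughly like this (a sketch only; it assumes a comments resolver taking the same input shape as posts, the field selection is illustrative, and per the issue above the terms appear to be ignored):

# Sketch of the comments variant (assumes a comments resolver with the same
# input shape as posts; per the issue above, the terms appear to be ignored).
comments_query = """
{
  comments(input: {
    terms: {
      view: "new"
      after: "%s"
      before: "%s"
      limit: %d
    }
  }) {
    results {
      _id
      postedAt
      baseScore
      user {
        username
      }
    }
  }
}
"""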