Module: analyse

The analyse module reads all the fetch files and then performs the analysis. It works out how high each post ranks and which posts are local versus remote, then writes the analysis out to a JSON file.

At the moment, the ranking methodology is not great. I don't know a better way, but I'm open to suggestions. It boils down to three steps.

The method

  1. Find the top_n posts with the most boosts.
  2. Take those posts out of consideration. Look at the remaining posts and find the top_n posts with the most favourites.
  3. Take those posts out of consideration. Look at the remaining posts and find the top_n posts with the most replies.
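
A minimal pandas sketch of this successive-removal idea, assuming a DataFrame with reblogs_count, favourites_count, replies_count, and uri columns. The helper name rank_toots is just for illustration, and the real code additionally discounts self-replies before the final step:

import pandas as pd

def rank_toots(df: pd.DataFrame, top_n: int) -> dict:
    """Pick top_n by boosts, then favourites, then replies, removing winners each time."""
    remaining = df.copy()
    results = {}
    for label, order in [
        ("max_boosts", ["reblogs_count", "favourites_count", "replies_count"]),
        ("max_faves", ["favourites_count", "reblogs_count", "replies_count"]),
        ("max_replies", ["replies_count", "reblogs_count", "favourites_count"]),
    ]:
        winners = remaining.sort_values(by=order, ascending=False).head(top_n)
        results[label] = winners
        # winners leave the pool, so a toot can appear in at most one category
        remaining = remaining[~remaining["uri"].isin(winners["uri"])]
    return results

The secondary sort keys act as tie-breaks, mirroring the ordering used in the source code further down.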

Code Reference

Module for analyzing toots for a hashtag. Reads a JSON dump of toots presumably written by the fetch() function.

analyse(config)

Does a bunch of analysis over the toots and builds a dict of results suitable for sending to post(). The dict is written out to a JSON file via write_json() rather than returned (the function itself returns None). The whole process is described in more detail in the methodology documentation.

Config Parameters Used

  • mastoscore:hashtag: Hashtag to analyze
  • mastoscore:top_n: How many top toots to report
  • mastoscore:timezone: What timezone to convert times to
  • mastoscore:event_start: Start time of the event
  • post:tag_users: Whether we tag users with an @ or not

Parameters:

  • config (ConfigParser): A ConfigParser object from the config module. Required.

Returns:

None. The function writes its results to a JSON file via write_json() rather than returning them; the analysis dict it writes out includes:

  • preamble: A bit of information about the analysis: the hashtag and when the summary was generated.
  • num_toots: A few lines of text that describe the analysis: total number of toots, servers, participants, etc.
  • most_toots: A line about the person who posted the most toots.
  • max_boosts, max_faves, and max_replies: pandas DataFrames that contain the top_n toots in each of these categories.
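
For orientation, a hedged usage sketch. The section and option names come from the table above, but the literal values are made up, and building the config by hand with read_dict() (instead of via the config module) is purely illustrative; analyse() also reads journal-path, event-date, and logging options not shown here.

from configparser import ConfigParser
from mastoscore.analyse import analyse

# hypothetical minimal config; the real project builds this via its config module
config = ConfigParser()
config.read_dict({
    "mastoscore": {
        "hashtag": "examplecon",
        "top_n": "3",
        "timezone": "Europe/London",
        "event_start": "2024-06-01 09:00",
        # get_toots_df() additionally needs journaldir, journalfile and event date parts
    },
    "post": {"tag_users": "false"},
})

analyse(config)  # builds the analysis dict and writes it to a JSON file; returns None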

Source code in mastoscore/analyse.py
def analyse(config: ConfigParser) -> None:
    """
    Does a bunch of analysis over the toots. Returns a dict with the results suitable for
    sending to [post()](module-post.md). The whole process is described in more detail in
    the [methodology documentation](../methodology.md).

    ## Config Parameters Used

    | Option | Description |
    | ------- | ------- |
    | `mastoscore:hashtag` | Hashtag to analyze |
    | `mastoscore:top_n` | How many top toots to report |
    | `mastoscore:timezone` | What timezone to convert times to |
    | `mastoscore:event_start` | Start time of the event |
    | `post:tag_users` | Whether we tag users with an @ or not |

    Args:
        config: A ConfigParser object from the [config](module-config.md) module

    Returns:
        Dict that includes a few elements:
            - `preamble`: A bit of information about the analysis. Hashtag and when it was generated.
            - `num_toots`: A few lines of text that describe the analysis: total number of toots, servers, participants, etc.
            - `most_toots`: A line about the person that posted the most toots.
            - `max_boosts`, `max_faves`, and `max_replies`: pandas DataFrames that contain the `top_n` toots in each of these categories.

    """

    hashtag = config.get("mastoscore", "hashtag")
    top_n = config.getint("mastoscore", "top_n")
    timezone = config.get("mastoscore", "timezone")
    tag_users = config.getboolean("post", "tag_users")
    logger = get_logger(config, __name__)

    df = get_toots_df(config)
    if len(df) <= 0:
        return None

    analysis = dict()
    # some old data files don't have data for fields we expect
    df = df.replace(nan, None)
    # top poster
    most_toots_id = df["userid"].value_counts().idxmax()
    most_toots_name = df.loc[df["userid"] == most_toots_id][:1][
        "account.display_name"
    ].values[0]
    most_toots_count = len(df.loc[df["userid"] == most_toots_id])

    # Some overall statistics
    num_servers = df["server"].nunique()
    max_server = df["server"].value_counts().idxmax()
    max_server_toots = len(df.loc[df["server"] == max_server])

    # do the max_boosts stuff last because it is destructive. I remove selected toots
    # from the dataframe so that they can't appear twice. i.e., if you're the most
    # boosted toot, you're taken out of the running for most favourites and most replies,
    # even if you DO have the most favourites and most replies.
    maxdf = df.copy(deep=True)
    max_boosts = maxdf.sort_values(
        by=["reblogs_count", "favourites_count", "replies_count"], ascending=False
    ).head(top_n)

    # drop from df all the toots that are in the max_boosts df
    maxdf.drop(maxdf[maxdf["uri"].isin(max_boosts["uri"])].index, inplace=True)

    max_faves = maxdf.sort_values(
        by=["favourites_count", "reblogs_count", "replies_count"], ascending=False
    ).head(top_n)

    # drop from df all the toots that are in the max_faves df
    maxdf.drop(maxdf[maxdf["uri"].isin(max_faves["uri"])].index, inplace=True)

    # Count how many replies to each post are from the post's original author
    # Group by the original post ID and count replies where the author is the same
    cols = maxdf.columns.to_list()
    if "in_reply_to_id" in cols and "id" in cols:
        # Create a copy of the dataframe to work with for calculating external replies
        replies_df = maxdf.copy()
        logger.debug("removing self-replies")

        # Create a new column for external replies (total replies minus self-replies)
        # First, identify self-replies (where author replies to their own post)
        self_replies = replies_df[
            replies_df["in_reply_to_id"] == replies_df["userid"]
        ]

        # Count self-replies per original post
        self_reply_counts = (
            self_replies.groupby("in_reply_to_id")
            .size()
            .reset_index(name="self_reply_count")
        )

        # Merge this count back to the main dataframe
        replies_df = replies_df.merge(
            self_reply_counts,
            left_on="userid",
            right_on="in_reply_to_id",
            how="left",
        )

        # Fill NaN values with 0 (posts with no self-replies)
        replies_df["self_reply_count"] = replies_df["self_reply_count"].fillna(0)
        replies_df = replies_df.replace(nan, None)

        # Calculate external replies (total replies minus self-replies)
        replies_df["external_replies_count"] = (
            replies_df["replies_count"] - replies_df["self_reply_count"]
        )

        # Sort by external replies count instead of total replies
        max_replies = replies_df.sort_values(
            by=["external_replies_count", "reblogs_count", "favourites_count"],
            ascending=False,
        ).head(top_n)
        logger.debug(
            replies_df[
                ["id", "replies_count", "self_reply_count", "external_replies_count"]
            ]
        )
    else:
        # Fallback to original behavior if we don't have the necessary columns
        logger.debug("NOT removing self-replies")
        logger.debug(df.columns.tolist())
        max_replies = df.sort_values(
            by=["replies_count", "reblogs_count", "favourites_count"], ascending=False
        ).head(top_n)
    # Prepare the analysis
    # convert config strings into datetime structs
    tag = "@" if tag_users else ""
    timezone = pytimezone(timezone)
    start_dt = get_event_start(config)
    end_dt = get_event_end(config)
    event_start_str = start_dt.strftime("%a %e %b %Y %H:%M %Z")
    end_time = end_dt.strftime("%a %e %b %Y %H:%M %Z")
    right_now = datetime.datetime.now(tz=timezone).strftime("%a %e %b %Y %H:%M %Z")
    analysis["preamble"] = f"<p>Summary of #{hashtag} generated at {right_now}.</p>"
    analysis["num_toots"] = (
        f"We looked at {len(df)} toots posted between {event_start_str} and "
        f"{end_time} by {df['userid'].nunique()} "
        + f"different participants across {num_servers} different servers. {max_server} "
        + f"contributed the most toots at {max_server_toots}"
    )
    analysis["most_toots"] = (
        f"Most toots were from '{most_toots_name}' ({tag}{most_toots_id}) who posted {most_toots_count}"
    )
    analysis["max_boosts"] = max_boosts.to_dict(
        orient="records",
    )
    analysis["max_faves"] = max_faves.to_dict(orient="records")
    analysis["max_replies"] = max_replies.to_dict(orient="records")
    analysis["unique_ids"] = df["userid"].nunique()
    analysis["top_n"] = top_n
    analysis["hashtag"] = hashtag
    analysis["generated"] = right_now
    analysis["event_start"] = event_start_str
    analysis["gross_toots"] = len(df)
    analysis["event_end"] = end_time
    analysis["num_servers"] = num_servers
    analysis["max_server"] = {}
    analysis["max_server"]["name"] = max_server
    analysis["max_server"]["num"] = max_server_toots
    analysis["most_posts"] = {}
    analysis["most_posts"]["name"] = most_toots_name
    analysis["most_posts"]["id"] = most_toots_id
    analysis["most_posts"]["count"] = most_toots_count

    write_json(config, "analysis", analysis)

get_toots_df(config)

Opens the journal files from a hierarchical directory structure, parses the toots, filters them to the event window, and de-duplicates local versus remote copies. Returns a DataFrame with the result. This is its own function because the graph() module also calls it.

Parameters:

  • config (ConfigParser): A ConfigParser object from the config module. Required.

Config Parameters Used

  • mastoscore:journaldir: Base directory to read JSON files from
  • mastoscore:journalfile: Template for files to read
  • mastoscore:event_year: Year of the event (YYYY)
  • mastoscore:event_month: Month of the event (MM)
  • mastoscore:event_day: Day of the event (DD)

Returns:

Pandas DataFrame with all the toots pulled in and converted to normalised types.
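
A hedged sketch of how this is typically called (the graph module does the same thing). The config file name is an assumption; the comment shows the directory layout the function expects, derived from the options above.

from configparser import ConfigParser
from mastoscore.analyse import get_toots_df

config = ConfigParser()
config.read("mastoscore.ini")  # assumed file name; normally handled by the config module

# journal files are expected under
#   <journaldir>/<event_year>/<event_month>/<event_day>/<journalfile>-*.json
df = get_toots_df(config)
print(f"{len(df)} toots from {df['server'].nunique()} servers")
print(df[["userid", "reblogs_count", "favourites_count", "replies_count"]].head())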

Source code in mastoscore/analyse.py
def get_toots_df(config:ConfigParser) -> pd.DataFrame:
    """
    Opens the journal files from a hierarchical directory structure, parses the toots,
    and does a bunch of analysis over the toots. Returns a df with the results.
    This is its own method because the graph() module calls it.

    Args:
      config: A ConfigParser object from the [config](module-config.md) module

    ## Config Parameters Used
    - `mastoscore:journaldir`: Base directory to read JSON files from
    - `mastoscore:journalfile`: Template for files to read
    - `mastoscore:event_year`: Year of the event (YYYY)
    - `mastoscore:event_month`: Month of the event (MM)
    - `mastoscore:event_day`: Day of the event (DD)

    Returns:

    Pandas DataFrame with all the toots pulled in and converted to normalised types.
    """

    journaldir = config.get("mastoscore", "journaldir")
    journalfile = config.get("mastoscore", "journalfile")
    logger = get_logger(config, __name__)

    # Get date components from config
    try:
        year = config.get("mastoscore", "event_year")
        month = config.get("mastoscore", "event_month")
        day = config.get("mastoscore", "event_day")
        date_path = os.path.join(year, month, day)
        logger.info(f"Looking for journal files in date path: {date_path}")
    except Exception as e:
        logger.error(f"Failed to get date components from config: {e}")
        logger.error("Falling back to flat directory structure")
        date_path = ""

    df = pd.DataFrame([])
    # journal is now a template. Read all the matching files into a big data frame
    max_toots = 0
    max_toots_file = "none"
    nfiles = 0

    # Build the path to search for journal files
    if date_path:
        search_path = os.path.join(journaldir, date_path)
        p = Path(search_path).resolve()
        if not p.exists():
            logger.error(f"Directory {search_path} does not exist")
            # Try falling back to the base directory
            logger.info(f"Falling back to base directory: {journaldir}")
            p = Path(journaldir).resolve()
            # Look for files in the hierarchical structure
            pattern = f"**/{journalfile}-*.json"
        else:
            pattern = f"{journalfile}-*.json"
    else:
        p = Path(journaldir).resolve()
        # Look for files in the hierarchical structure
        pattern = f"**/{journalfile}-*.json"

    logger.info(f"Searching for files matching pattern: {pattern} in {p}")
    filelist = list(p.glob(pattern))

    if not filelist:
        logger.warning(f"No files found matching pattern {pattern} in {p}")
        # Try a more general search if specific path failed
        if date_path:
            logger.info("Trying broader search in entire journal directory")
            p = Path(journaldir).resolve()
            filelist = list(p.glob(f"**/{journalfile}-*.json"))

    for jfile in filelist:
        try:
            logger.debug(f"Attempting to read {jfile}")
            newdf = pd.read_json(jfile)
        except Exception as e:
            logger.critical(f"Failed to open {jfile}")
            logger.critical(e)
            continue
        if len(newdf) > max_toots:
            max_toots_file = jfile
            max_toots = len(newdf)
        nfiles = nfiles + 1
        df = pd.concat([df, newdf])
        logger.debug(f"Loaded {len(newdf)} toots from {jfile.name}")
        del newdf

    logger.info(f"Loaded {len(df)} total toots from {nfiles} JSON files")
    logger.info(f"Biggest was {max_toots} toots from {max_toots_file}")
    assert not df.empty
    # Now exclude toots that are too old or too new
    earliest, latest = get_fetch_window(config)
    df = df.loc[df["created_at"] >= earliest]
    df = df.loc[df["created_at"] <= latest]
    assert not df.empty
    # gather up the set we want to work on
    # 1. local toots
    # 2. remote toots where we didn't get a local version
    local_toots = df.loc[df["local"] == True]
    sources = local_toots["source"].unique()
    non_local_toots = df.loc[df["local"] == False]
    # drop all toots from servers we successfully contacted
    non_local_toots = non_local_toots.loc[~non_local_toots["server"].isin(sources)]
    # There will be more than one copy of non-local toots.
    # Iterate over each uri, find the copy of it that has the highest numbers
    # and keep it, deleting the others
    non_local_keepers = pd.DataFrame([])
    for uri in non_local_toots["uri"].unique():
        minidf = non_local_toots[non_local_toots["uri"] == uri]
        # logger.debug(f"{len(minidf)} toots for {uri}")
        minidf = minidf.sort_values(
            by=["reblogs_count", "favourites_count", "replies_count"], ascending=False
        ).head(1)
        non_local_keepers = pd.concat([non_local_keepers, minidf])
    logger.info(
        f"{len(local_toots)} local toots and {len(non_local_keepers)} non-local toots"
    )
    df = pd.concat([local_toots, non_local_keepers])

    # Quick check to make sure we don't have duplicates. Number of rows in the final
    # DataFrame and the number of unique URIs should be the same. If they're not, we
    # have duplicates somewhere.
    num_unique = len(df["uri"].unique())
    num_rows = len(df)
    if num_unique != num_rows:
        logger.error(
            f"We have {num_rows} toots, but {num_unique} URIs. Likely duplicates!"
        )
    else:
        logger.debug(
            f"Number of unique URIs ({num_unique}) == Number of rows ({num_rows}). All good."
        )
    return df

toots2df(toots, api_base_url)

Takes in a list of toots from a tooter object and turns it into a pandas DataFrame with a bunch of the data normalised.

Parameters:

  • toots (list): A list of toots in the same format as returned by the search_hashtag() API. Required.
  • api_base_url (str): Expected to include the protocol, like https://server.example.com. Required.

Returns:

DataFrame: A Pandas DataFrame that contains all the toots normalised. Normalisation includes:

  • Converting date fields like created_at to timezone-aware datetime objects
  • Converting integer fields like reblogs_count to integers
  • Adding some columns (see below)
  • Discarding all but a few columns. Different systems return different columns and I only use a few of them, so I just discard everything else; this cuts down on storage and processing time.

Synthetic columns added:

  • server: The server part of the toot's uri, i.e. the server that owns the toot: other.example.social even if this copy was fetched from server.example.com.
  • userid: The user's name in person@server.example.com format. Note it does not have the leading @ because tagging people is optional.
  • local: Boolean that is True if the toot comes from the api_base_url server. False otherwise.
  • source: The server part of api_base_url: server.example.com if the api_base_url is https://server.example.com, i.e. the server this copy was fetched from.
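
A hedged example of calling it on a couple of hand-built toots. The field names match what the function below reads; a real search_hashtag() response carries many more fields, and all the values here are made up.

from mastoscore.analyse import toots2df

toots = [
    {
        "uri": "https://server.example.com/users/alice/statuses/1",
        "url": "https://server.example.com/@alice/1",
        "id": "1",
        "content": "<p>Hello #example</p>",
        "created_at": "2024-06-01T10:00:00Z",
        "reblogs_count": 3,
        "favourites_count": 5,
        "replies_count": 1,
        "in_reply_to_id": None,
        "account": {"username": "alice", "display_name": "Alice",
                    "url": "https://server.example.com/@alice"},
    },
    {
        "uri": "https://other.example.social/users/bob/statuses/9",
        "url": "https://other.example.social/@bob/9",
        "id": "9",
        "content": "<p>Hi #example</p>",
        "created_at": "2024-06-01T11:00:00Z",
        "reblogs_count": 0,
        "favourites_count": 2,
        "replies_count": 0,
        "in_reply_to_id": None,
        "account": {"username": "bob", "display_name": "Bob",
                    "url": "https://other.example.social/@bob"},
    },
]

df = toots2df(toots, api_base_url="https://server.example.com")
# alice's toot is local (its uri starts with api_base_url); bob's is a remote copy
print(df[["userid", "server", "source", "local", "reblogs_count"]])
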
Source code in mastoscore/analyse.py
def toots2df(toots: list, api_base_url: str) -> pd.DataFrame:
    """
    Take in a list of toots from a tooter object, turn it into a
    pandas dataframe with a bunch of data normalized.

    Args:
      toots: list. A list of toots in the same format as returned by the search_hashtag() API
      api_base_url: string. Expected to include protocol, like `https://server.example.com`.

    Returns:
        A Pandas DataFrame that contains all the toots normalised. Normalisation includes:

            - Converting date fields like `created_at` to timezone-aware `datetime` objects
            - Converting integer fields like `reblogs_count` to integers
            - Adding some columns (see below)
            - Discarding all but a few columns. So many different systems return different columns, and I'm only
                using a few of them. So I just discard everything else. This cuts down on storage and processing time.

    # Synthetic columns added:
    - server: The server part of the toot's `uri`, i.e. the server that owns the toot: `other.example.social` even if this copy was fetched from `server.example.com`.
    - userid: The user's name in `person@server.example.com` format. Note it does not have the leading `@` because tagging people is optional.
    - local: Boolean that is **True** if the toot comes from the `api_base_url` server. **False** otherwise.
    - source: The server part of `api_base_url`: `server.example.com` if the `api_base_url` is `https://server.example.com`, i.e. the server this copy was fetched from.
    """

    df = pd.json_normalize(toots)
    df["source"] = api_base_url.split("/")[2]
    df["local"] = [True if i.startswith(api_base_url) else False for i in df["uri"]]
    # make a new "server" column off of uris
    df["server"] = [n.split("/")[2] for n in df["uri"]]
    df["userid"] = df["account.username"] + "@" + df["server"]
    df["reblogs_count"] = df["reblogs_count"].astype(int)
    df["replies_count"] = df["replies_count"].astype(int)
    df["favourites_count"] = df["favourites_count"].astype(int)
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True, format="ISO8601")

    # Define the columns to keep, all others will be deleted
    desired_columns = {
        "account.display_name",
        "account.indexable",
        "account.url",
        "content",
        "created_at",
        "external_replies_count",
        "favourites_count",
        "id",
        "in_reply_to_id",
        "local",
        "max_boosts",
        "max_faves",
        "max_replies",
        "most_toots",
        "num_toots",
        "preamble",
        "reblogs_count",
        "replies_count",
        "required",
        "self_reply_count",
        "server",
        "source",
        "uri",
        "url",
        "userid",
    }
    # Get the intersection of desired columns and actual columns
    columns_to_keep = list(desired_columns.intersection(df.columns))

    # Create new data frame with only desired columns, implicitly discarding all others
    small_df = df[columns_to_keep]

    return small_df