Module: analyse
The analyse module reads all the fetch files and then performs the analysis. It works out which posts rank highest in each category and which are local versus remote, and writes the analysis out to a JSON file.
At the moment, the ranking methodology is not great. I don't know a better way, but I'm open to suggestions. There are basically 2 rules:
- I prioritise boosts. Boosts are rarer than favourites, so a post that tops the boost count has done something a little more unusual than a post with a lot of favourites. For example, on a recent monsterdon, all 3 "most boosted" posts had 19 boosts, while the post with the most favourites had 76.
- The same post can't appear in more than one category. If a post has the most boosts and the most favourites, it can't win both.
The method
- Calculate the `top_n` posts with the most boosts.
- Take those posts out of consideration. Look at the remaining posts and find the `top_n` posts with the most favourites.
- Take those posts out of consideration. Look at the remaining posts and find the `top_n` posts with the most replies (see the sketch below).
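A minimal sketch of that selection, assuming a pandas DataFrame with one row per post and `reblogs_count`, `favourites_count`, `replies_count`, and `uri` columns (the real `analyse()` below also breaks ties on the other counts and handles self-replies):

```python
import pandas as pd

def pick_top(df: pd.DataFrame, top_n: int) -> dict:
    """Pick mutually exclusive top-N posts per category, boosts first."""
    remaining = df.copy()
    winners = {}
    # Categories in priority order: boosts, then favourites, then replies.
    for label, column in [
        ("max_boosts", "reblogs_count"),
        ("max_faves", "favourites_count"),
        ("max_replies", "replies_count"),
    ]:
        top = remaining.sort_values(by=column, ascending=False).head(top_n)
        winners[label] = top
        # Remove the winners so the same post can't appear in a later category.
        remaining = remaining[~remaining["uri"].isin(top["uri"])]
    return winners
```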
Code Reference
Module for analyzing toots for a hashtag. Reads a JSON dump of toots presumably written by the fetch() function.
analyse(config)
Does a bunch of analysis over the toots. Returns a dict with the results suitable for sending to post(). The whole process is described in more detail in the methodology documentation.
Config Parameters Used
| Option | Description |
|---|---|
| `mastoscore:hashtag` | Hashtag to analyze |
| `mastoscore:top_n` | How many top toots to report |
| `mastoscore:timezone` | What timezone to convert times to |
| `mastoscore:event_start` | Start time of the event |
| `post:tag_users` | Whether we tag users with an @ or not |
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `ConfigParser` | A ConfigParser object from the config module | required |
Returns:
| Type | Description |
|---|---|
| `None` | Dict that includes a few elements: `preamble`, `num_toots`, `most_toots`, and the `max_boosts`/`max_faves`/`max_replies` DataFrames (see the docstring below for details). The dict is written out to a JSON file via `write_json()` rather than returned. |
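A hedged usage sketch, assuming the module is imported as `mastoscore.analyse` (per the source listing below) and a config built by hand rather than by the config module. The option values are placeholders, and a real config also needs the journal and logging options used by `get_toots_df()` and `get_logger()`:

```python
from configparser import ConfigParser

from mastoscore.analyse import analyse

# Placeholder config covering the options in the table above; a real run
# also needs journaldir/journalfile, the event date parts, and logging settings.
config = ConfigParser()
config.read_string("""
[mastoscore]
hashtag = monsterdon
top_n = 3
timezone = US/Eastern
event_start = 2024-06-02 21:00

[post]
tag_users = false
""")

analyse(config)  # writes the analysis JSON via write_json(); returns None
```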
Source code in mastoscore/analyse.py
def analyse(config: ConfigParser) -> None:
"""
Does a bunch of analysis over the toots. Returns a dict with the results suitable for
sending to [post()](module-post.md). The whole process is described in more detail in
the [methodology documentation](../methodology.md).
## Config Parameters Used
| Option | Description |
| ------- | ------- |
| `mastoscore:hashtag` | Hashtag to analyze |
| `mastoscore:top_n` | How many top toots to report |
| `mastoscore:timezone` | What timezone to convert times to |
| `mastoscore:event_start` | Start time of the event |
| `post:tag_users` | Whether we tag users with an @ or not |
Args:
config: A ConfigParser object from the [config](module-config.md) module
Returns:
Dict that includes a few elements:
- `preamble`: A bit of information about the analysis. Hashtag and when it was generated.
- `num_toots`: A few lines of text that describe the analysis: total number of toots, servers, participants, etc.
- `most_toots`: A line about the person that posted the most toots.
- `max_boosts`, `max_faves`, and `max_replies`: pandas DataFrames that contain the `top_n` toots in each of these categories.
"""
hashtag = config.get("mastoscore", "hashtag")
top_n = config.getint("mastoscore", "top_n")
timezone = config.get("mastoscore", "timezone")
tag_users = config.getboolean("post", "tag_users")
logger = get_logger(config, __name__)
df = get_toots_df(config)
if len(df) <= 0:
return None
analysis = dict()
# some old data files don't have data for fields we expect
df = df.replace(nan, None)
# top poster
most_toots_id = df["userid"].value_counts().idxmax()
most_toots_name = df.loc[df["userid"] == most_toots_id][:1][
"account.display_name"
].values[0]
most_toots_count = len(df.loc[df["userid"] == most_toots_id])
# Some overall statistics
num_servers = df["server"].nunique()
max_server = df["server"].value_counts().idxmax()
max_server_toots = len(df.loc[df["server"] == max_server])
# do the max_boosts stuff last because it is destructive. I remove selected toots
# from the dataframe so that they can't appear twice. i.e., if you're the most
# boosted toot, you're taken out of the running for most favourites and most replies,
# even if you DO have the most favourites and most replies.
maxdf = df.copy(deep=True)
max_boosts = maxdf.sort_values(
by=["reblogs_count", "favourites_count", "replies_count"], ascending=False
).head(top_n)
# drop from df all the toots that are in the max_boosts df
maxdf.drop(maxdf[maxdf["uri"].isin(max_boosts["uri"])].index, inplace=True)
max_faves = maxdf.sort_values(
by=["favourites_count", "reblogs_count", "replies_count"], ascending=False
).head(top_n)
# drop from df all the toots that are in the max_faves df
maxdf.drop(maxdf[maxdf["uri"].isin(max_faves["uri"])].index, inplace=True)
# Count how many replies to each post are from the post's original author
# Group by the original post ID and count replies where the author is the same
cols = maxdf.columns.to_list()
if "in_reply_to_id" in cols and "id" in cols:
# Create a copy of the dataframe to work with for calculating external replies
replies_df = maxdf.copy()
logger.debug("removing self-replies")
# Create a new column for external replies (total replies minus self-replies)
# First, identify self-replies (where author replies to their own post)
self_replies = replies_df[
replies_df["in_reply_to_id"] == replies_df["userid"]
]
# Count self-replies per original post
self_reply_counts = (
self_replies.groupby("in_reply_to_id")
.size()
.reset_index(name="self_reply_count")
)
# Merge this count back to the main dataframe
replies_df = replies_df.merge(
self_reply_counts,
left_on="userid",
right_on="in_reply_to_id",
how="left",
)
# Fill NaN values with 0 (posts with no self-replies)
replies_df["self_reply_count"] = replies_df["self_reply_count"].fillna(0)
replies_df = replies_df.replace(nan, None)
# Calculate external replies (total replies minus self-replies)
replies_df["external_replies_count"] = (
replies_df["replies_count"] - replies_df["self_reply_count"]
)
# Sort by external replies count instead of total replies
max_replies = replies_df.sort_values(
by=["external_replies_count", "reblogs_count", "favourites_count"],
ascending=False,
).head(top_n)
logger.debug(
replies_df[
["id", "replies_count", "self_reply_count", "external_replies_count"]
]
)
else:
# Fallback to original behavior if we don't have the necessary columns
logger.debug("NOT removing self-replies")
logger.debug(df.columns.tolist())
max_replies = df.sort_values(
by=["replies_count", "reblogs_count", "favourites_count"], ascending=False
).head(top_n)
# Prepare the analysis
# convert config strings into datetime structs
tag = "@" if tag_users else ""
timezone = pytimezone(timezone)
start_dt = get_event_start(config)
end_dt = get_event_end(config)
event_start_str = start_dt.strftime("%a %e %b %Y %H:%M %Z")
end_time = end_dt.strftime("%a %e %b %Y %H:%M %Z")
right_now = datetime.datetime.now(tz=timezone).strftime("%a %e %b %Y %H:%M %Z")
analysis["preamble"] = f"<p>Summary of #{hashtag} generated at {right_now}.</p>"
analysis["num_toots"] = (
f"We looked at {len(df)} toots posted between {event_start_str} and "
f"{end_time} by {df['userid'].nunique()} "
+ f"different participants across {num_servers} different servers. {max_server} "
+ f"contributed the most toots at {max_server_toots}"
)
analysis["most_toots"] = (
f"Most toots were from '{most_toots_name}' ({tag}{most_toots_id}) who posted {most_toots_count}"
)
analysis["max_boosts"] = max_boosts.to_dict(
orient="records",
)
analysis["max_faves"] = max_faves.to_dict(orient="records")
analysis["max_replies"] = max_replies.to_dict(orient="records")
analysis["unique_ids"] = df["userid"].nunique()
analysis["top_n"] = top_n
analysis["hashtag"] = hashtag
analysis["top_n"] = top_n
analysis["generated"] = right_now
analysis["event_start"] = event_start_str
analysis["gross_toots"] = len(df)
analysis["event_end"] = end_time
analysis["num_servers"] = num_servers
analysis["max_server"] = {}
analysis["max_server"]["name"] = max_server
analysis["max_server"]["num"] = max_server_toots
analysis["most_posts"] = {}
analysis["most_posts"]["name"] = most_toots_name
analysis["most_posts"]["id"] = most_toots_id
analysis["most_posts"]["count"] = most_toots_count
write_json(config, "analysis", analysis)
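For orientation, a rough sketch of the dict that gets written out, with placeholder values; the keys mirror the assignments at the end of `analyse()` above, and the `max_*` entries are lists of toot records produced by `DataFrame.to_dict(orient="records")`:

```python
# Illustrative shape only; every value here is a placeholder.
analysis = {
    "preamble": "<p>Summary of #monsterdon generated at ...</p>",
    "num_toots": "We looked at ... toots posted between ... and ...",
    "most_toots": "Most toots were from '...' (...) who posted ...",
    "max_boosts": [...],   # top_n toot records
    "max_faves": [...],    # top_n toot records
    "max_replies": [...],  # top_n toot records
    "unique_ids": 321,     # distinct participants
    "top_n": 3,
    "hashtag": "monsterdon",
    "generated": "...",    # times formatted with "%a %e %b %Y %H:%M %Z"
    "event_start": "...",
    "event_end": "...",
    "gross_toots": 1234,
    "num_servers": 87,
    "max_server": {"name": "example.social", "num": 99},
    "most_posts": {"name": "...", "id": "someone@example.social", "count": 42},
}
```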
get_toots_df(config)
Opens the journal files from a hierarchical directory structure, parses the toots, and does a bunch of analysis over the toots. Returns a df with the results. This is its own method because the graph() module calls it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `ConfigParser` | A ConfigParser object from the config module | required |
Config Parameters Used
- `mastoscore:journaldir`: Base directory to read JSON files from
- `mastoscore:journalfile`: Template for files to read
- `mastoscore:event_year`: Year of the event (YYYY)
- `mastoscore:event_month`: Month of the event (MM)
- `mastoscore:event_day`: Day of the event (DD)
Returns:
Pandas DataFrame with all the toots pulled in and converted to normalised types.
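A minimal sketch of the file lookup this implies, using placeholder values for the config options listed above; it mirrors the path-building and glob patterns in the source below:

```python
import os
from pathlib import Path

# Placeholder values standing in for the config options listed above.
journaldir = "journals"
journalfile = "monsterdon"
year, month, day = "2024", "06", "02"

# get_toots_df() looks under journaldir/YYYY/MM/DD for journalfile-*.json,
# falling back to a recursive search of journaldir if that path is missing.
search_path = Path(os.path.join(journaldir, year, month, day)).resolve()
pattern = f"{journalfile}-*.json"
if not search_path.exists():
    search_path = Path(journaldir).resolve()
    pattern = f"**/{journalfile}-*.json"

for jfile in search_path.glob(pattern):
    print(jfile)  # each file is read with pd.read_json() and concatenated
```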
Source code in mastoscore/analyse.py
def get_toots_df(config:ConfigParser) -> pd.DataFrame:
"""
Opens the journal files from a hierarchical directory structure, parses the toots,
and does a bunch of analysis over the toots. Returns a df with the results.
This is its own method because the graph() module calls it.
Args:
config: A ConfigParser object from the [config](module-config.md) module
## Config Parameters Used
- `mastoscore:journaldir`: Base directory to read JSON files from
- `mastoscore:journalfile`: Template for files to read
- `mastoscore:event_year`: Year of the event (YYYY)
- `mastoscore:event_month`: Month of the event (MM)
- `mastoscore:event_day`: Day of the event (DD)
Returns:
Pandas DataFrame with all the toots pulled in and converted to normalised types.
"""
journaldir = config.get("mastoscore", "journaldir")
journalfile = config.get("mastoscore", "journalfile")
logger = get_logger(config, __name__)
# Get date components from config
try:
year = config.get("mastoscore", "event_year")
month = config.get("mastoscore", "event_month")
day = config.get("mastoscore", "event_day")
date_path = os.path.join(year, month, day)
logger.info(f"Looking for journal files in date path: {date_path}")
except Exception as e:
logger.error(f"Failed to get date components from config: {e}")
logger.error("Falling back to flat directory structure")
date_path = ""
df = pd.DataFrame([])
# journal is now a template. Read all the matching files into a big data frame
max_toots = 0
max_toots_file = "none"
nfiles = 0
# Build the path to search for journal files
if date_path:
search_path = os.path.join(journaldir, date_path)
p = Path(search_path).resolve()
if not p.exists():
logger.error(f"Directory {search_path} does not exist")
# Try falling back to the base directory
logger.info(f"Falling back to base directory: {journaldir}")
p = Path(journaldir).resolve()
# Look for files in the hierarchical structure
pattern = f"**/{journalfile}-*.json"
else:
pattern = f"{journalfile}-*.json"
else:
p = Path(journaldir).resolve()
# Look for files in the hierarchical structure
pattern = f"**/{journalfile}-*.json"
logger.info(f"Searching for files matching pattern: {pattern} in {p}")
filelist = list(p.glob(pattern))
if not filelist:
logger.warning(f"No files found matching pattern {pattern} in {p}")
# Try a more general search if specific path failed
if date_path:
logger.info("Trying broader search in entire journal directory")
p = Path(journaldir).resolve()
filelist = list(p.glob(f"**/{journalfile}-*.json"))
for jfile in filelist:
try:
logger.debug(f"Attempting to read {jfile}")
newdf = pd.read_json(jfile)
except Exception as e:
logger.critical(f"Failed to open {jfile}")
logger.critical(e)
continue
if len(newdf) > max_toots:
max_toots_file = jfile
max_toots = len(newdf)
nfiles = nfiles + 1
df = pd.concat([df, newdf])
logger.debug(f"Loaded {len(newdf)} toots from {jfile.name}")
del newdf
logger.info(f"Loaded {len(df)} total toots from {nfiles} JSON files")
logger.info(f"Biggest was {max_toots} toots from {max_toots_file}")
assert not df.empty
# Now exclude toots that are too old or too new
earliest, latest = get_fetch_window(config)
df = df.loc[df["created_at"] >= earliest]
df = df.loc[df["created_at"] <= latest]
assert not df.empty
# gather up the set we want to work on
# 1. local toots
# 2. remote toots where we didn't get a local version
local_toots = df.loc[df["local"] == True]
sources = local_toots["source"].unique()
non_local_toots = df.loc[df["local"] == False]
# drop all toots from servers we successfully contacted
non_local_toots = non_local_toots.loc[~non_local_toots["server"].isin(sources)]
# There will be more than one copy of non-local toots.
# Iterate over each uri, find the copy of it that has the highest numbers
# and keep it, deleting the others
non_local_keepers = pd.DataFrame([])
for uri in non_local_toots["uri"].unique():
minidf = non_local_toots[non_local_toots["uri"] == uri]
# logger.debug(f"{len(minidf)} toots for {uri}")
minidf = minidf.sort_values(
by=["reblogs_count", "favourites_count", "replies_count"], ascending=False
).head(1)
non_local_keepers = pd.concat([non_local_keepers, minidf])
logger.info(
f"{len(local_toots)} local toots and {len(non_local_keepers)} non-local toots"
)
df = pd.concat([local_toots, non_local_keepers])
# Quick check to make sure we don't have duplicates. Number of rows in the final
# DataFrame and the number of unique URIs should be the same. If they're not, we
# have duplicates somewhere.
num_unique = len(df["uri"].unique())
num_rows = len(df)
if num_unique != num_rows:
logger.error(
f"We have {num_rows} toots, but {num_unique} URIs. Likely duplicates!"
)
else:
logger.debug(
f"Number of unique URIs ({num_unique}) == Number of rows ({num_rows}). All good."
)
return df
toots2df(toots, api_base_url)
Take in a list of toots from a tooter object, turn it into a pandas dataframe with a bunch of data normalized.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `toots` | `list` | A list of toots in the same format as returned by the search_hashtag() API | required |
| `api_base_url` | `str` | Expected to include protocol, like `https://server.example.com` | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A Pandas DataFrame that contains all the toots normalised. Normalisation includes converting date fields like `created_at` to timezone-aware datetimes, converting count fields like `reblogs_count` to integers, adding the synthetic columns listed below, and discarding all but a few columns. |
Synthetic columns added:

- `server`: The server part of `api_base_url`: `server.example.com` if the `api_base_url` is `https://server.example.com`
- `userid`: The user's name in `person@server.example.com` format. Note it does not have the leading `@` because tagging people is optional.
- `local`: Boolean that is **True** if the toot comes from the `api_base_url` server, **False** otherwise.
- `source`: The server part of the server who owns the toot. I might be talking to `server.example.com`, but they've sent me a copy of a toot from `other.example.social` (see the usage sketch below).
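A hedged usage sketch, assuming the module is importable as `mastoscore.analyse`; the toot below is a hand-made placeholder carrying only the fields `toots2df()` touches, whereas real toots come from the Mastodon API with many more fields:

```python
from mastoscore.analyse import toots2df

# A single made-up toot in the rough shape the API returns; values are placeholders.
toots = [
    {
        "uri": "https://other.example.social/users/person/statuses/1",
        "url": "https://other.example.social/@person/1",
        "id": "1",
        "content": "<p>Hello #monsterdon</p>",
        "created_at": "2024-06-02T21:05:00.000Z",
        "reblogs_count": 2,
        "favourites_count": 5,
        "replies_count": 1,
        "in_reply_to_id": None,
        "account": {
            "username": "person",
            "display_name": "Person",
            "url": "https://other.example.social/@person",
        },
    }
]

df = toots2df(toots, api_base_url="https://server.example.com")
# server="other.example.social", userid="person@other.example.social",
# local=False, source="server.example.com"
print(df[["server", "userid", "local", "source"]])
```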
Source code in mastoscore/analyse.py
def toots2df(toots: list, api_base_url: str) -> pd.DataFrame:
"""
Take in a list of toots from a tooter object, turn it into a
pandas dataframe with a bunch of data normalized.
Args:
toots: list. A list of toots in the same format as returned by the search_hashtag() API
api_base_url: string. Expected to include protocol, like `https://server.example.com`.
Returns:
A Pandas DataFrame that contains all the toots normalised. Normalisation includes:
- Converting date fields like `created_at` to timezone-aware `datetime` objects
- Converting integer fields like `reblogs_count` to integers
- Adding some columns (see below)
- Discarding all but a few columns. So many different systems return different columns, and I'm only
using a few of them. So I just discard everything else. This cuts down on storage and processing time.
# Synthetic columns added:
- server: The server part of `api_base_url`: `server.example.com` if the `api_base_url` is `https://server.example.com`
- userid: The user's name in `person@server.example.com` format. Note it does not have the leading `@` because tagging people is optional.
- local: Boolean that is **True** if the toot comes from the `api_base_url` server. **False** otherwise.
- source: The server part of the server who owns the toot. I might be talking to `server.example.com`, but they've sent me a copy of a toot from `other.example.social`.
"""
df = pd.json_normalize(toots)
df["source"] = api_base_url.split("/")[2]
df["local"] = [True if i.startswith(api_base_url) else False for i in df["uri"]]
# make a new "server" column off of uris
df["server"] = [n.split("/")[2] for n in df["uri"]]
df["userid"] = df["account.username"] + "@" + df["server"]
df["reblogs_count"] = df["reblogs_count"].astype(int)
df["replies_count"] = df["replies_count"].astype(int)
df["favourites_count"] = df["favourites_count"].astype(int)
df["created_at"] = pd.to_datetime(df["created_at"], utc=True, format="ISO8601")
# Define the columns to keep, all others will be deleted
desired_columns = {
"account.display_name",
"account.indexable",
"account.url",
"content",
"created_at",
"external_replies_count",
"favourites_count",
"id",
"in_reply_to_id",
"local",
"max_boosts",
"max_faves",
"max_replies",
"most_toots",
"num_toots",
"preamble",
"reblogs_count",
"replies_count",
"required",
"self_reply_count",
"server",
"source",
"uri",
"url",
"userid",
}
# Get the intersection of desired columns and actual columns
columns_to_keep = list(desired_columns.intersection(df.columns))
# Create new data frame with only desired columns, implicitly discarding all others
small_df = df[columns_to_keep]
return small_df