# Module: fetch
The fetch module starts at the server where you're logged in and searches for a hashtag. Once it has all the toots your server knows about, it looks at where they came from, and for each server it finds mentioned, it calls fetch_hashtag_remote(). Each time it connects to a new server, it fetches every toot that server knows about the hashtag, then scans those toots for mentioned servers and adds any new ones to the list of servers to contact.
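In other words, fetch computes the transitive closure of servers reachable from your starting point. Here is a minimal, serial sketch of that loop. The real fetch() below runs the same bookkeeping with parallel workers; fetch_toots_for_tag is a hypothetical stand-in for fetch_hashtag_remote().

```python
def fetch_toots_for_tag(server: str, hashtag: str) -> list[dict] | None:
    """Hypothetical stand-in for fetch_hashtag_remote(): raw statuses or None."""
    ...  # network code elided

def crawl(start_server: str, hashtag: str) -> list[dict]:
    todo, done, failed = {start_server}, set(), set()
    all_toots: list[dict] = []
    while todo:
        server = todo.pop()
        toots = fetch_toots_for_tag(server, hashtag)
        if toots is None:
            failed.add(server)
            continue
        done.add(server)
        all_toots.extend(toots)
        for toot in toots:
            # A status URI names its home server:
            # "https://example.social/users/alice/statuses/1" -> "https://example.social"
            origin = "/".join(toot["uri"].split("/")[0:3])
            if origin not in done and origin not in failed:
                todo.add(origin)
    return all_toots
```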
## Public API
This code depends on Mastodon.py and uses it to connect to the servers it discovers. If you know anything about the fediverse, you know there's more out there than Mastodon servers: Pleroma, Akkoma, and various other ActivityPub-compatible servers. Some are derived from Mastodon and implement the same APIs; others don't. Likewise, some Mastodon servers offer public read APIs and others don't. Servers that allow public reads of their APIs will send you the details of their toots. Servers that don't allow public reads, or that don't implement a Mastodon-compatible timeline API, will be quietly skipped.
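As a sketch of what "quietly skipped" means in practice (this probe is an illustration, not the module's actual code path; the real connection logic lives in Tooter):

```python
from mastodon import Mastodon, MastodonAPIError, MastodonNetworkError

def probe_public_timeline(server: str, hashtag: str) -> list | None:
    """Return one page of public hashtag statuses, or None if the server
    refuses anonymous reads or doesn't speak a Mastodon-compatible API."""
    try:
        # No credentials here: this only works if the server allows public reads.
        api = Mastodon(api_base_url=server)
        return api.timeline_hashtag(hashtag, limit=40)
    except (MastodonAPIError, MastodonNetworkError):
        return None  # the quiet skip
```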
## Parallel fetching
Fetch now spins up parallel fetches. If you're using backstage, it reports updates every 2 seconds. Servers tend to update in page-sized increments: since we typically fetch 40 posts at a time, you should see a server's progress change in multiples of 40.
Fetching is limited by two things: the time of the post and the max value in the INI file. The max is the most we will fetch from any one server; it defaults to 2000. There is no overall limit, so if you hit 50 servers with 2000 posts each, we will happily fetch 100,000 posts.
The start of the fetch interval is calculated by taking the event_start and subtracting the hours_margin. The end of the fetch interval takes the event_start and adds the duration plus the hours_margin.
An example: imagine the event starts at 19:00, the duration is 75 minutes (1:15), and the hours_margin is 2 hours. The earliest possible toot will be at 17:00, and the latest possible toot will be at 19:00 + 1:15 + 2:00, or 22:15. Fetch will stop fetching from a particular server when it hits max or toots that are later than the latest time.
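Concretely, the window arithmetic is just this (values taken from the example above; the variable names mirror the INI settings):

```python
from datetime import datetime, timedelta

event_start = datetime.fromisoformat("2025-10-19 19:00")
duration = timedelta(minutes=75)    # 1:15
hours_margin = timedelta(hours=2)

oldest = event_start - hours_margin              # 2025-10-19 17:00
newest = event_start + duration + hours_margin   # 2025-10-19 22:15
```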
## Directory Structure
Fetch organizes data in a directory based on the date of the event. So if the journaldir is data, the hashtag is monsterdon, and the event_date is 2025-10-19, then the directory structure is data/2025/10/19 and all the files will have the monsterdon hashtag in their name. See the example below:
data
├── 2025
│ └── 10
│ ├── 19
│ │ ├── data-monsterdon-analysis.json
│ │ ├── data-monsterdon-fetch.json
│ │ ├── index.md
│ │ ├── monsterdon-20251019.png
│ │ ├── monsterdon-20251019.txt
│ │ ├── monsterdon-beige.party.json
│ │ ├── monsterdon-bolha.us.json
│ │ ... lots more files, one per server...
│ │ ├── wordcloud-monsterdon-20251019-remove.png
│ │ └── wordcloud-monsterdon-20251019-remove.txt
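A sketch of how that path is derived from the layout above (hypothetical; the real logic lives in create_journal_directory()):

```python
from os import makedirs
from os.path import join

journaldir, event_date = "data", "2025-10-19"   # values from the example above
year, month, day = event_date.split("-")
dir_path = join(journaldir, year, month, day)   # data/2025/10/19
makedirs(dir_path, exist_ok=True)
```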
## List of files
Every file has the hashtag and the date in its name. If you ran the same analysis on 2 different hashtags on the same day, none of the files would conflict, though they would all be stored in the same directory.
- data-monsterdon-analysis.json: Analysis of the results. It contains the contents of all the top posts and a bunch of meta statistics like top poster, busiest server, etc.
- data-monsterdon-fetch.json: Data about the fetch. Mainly the date it was done, the servers that succeeded and failed, and the gross total (not de-duplicated) of posts we fetched.
- index.md: The blog post entry. It's copied manually to the blog post directory.
- monsterdon-20251019.png: The histogram graph of activity generated by graph.
- monsterdon-20251019.txt: The alt text for the histogram graph generated by graph.
- monsterdon-[servername].json: The raw content of posts downloaded from server servername.
- wordcloud-monsterdon-20251019-remove.png: The wordcloud generated by graph.
- wordcloud-monsterdon-20251019-remove.txt: The alt text for the wordcloud generated by graph.
Module for fetching toots for a hashtag.
### fetch(config, progress=FetchProgress(), cancel_token=CancelToken(), mock=False)
This is the top-level function that will download toots and store them in a JSON cache. This function will create a tooter and log in to the server named in the cred_file.
#### Config Parameters Used
| Option | Description |
|---|---|
| fetch:overwrite | If True, overwrite files (re-fetch). If False and exists for a server, skip it. Default: False |
| fetch:parallel_workers | int, how many fetches to run in parallel. Default: 5 |
| mastoscore:api_base_url | Starting server for our first connection |
| mastoscore:journalfile | prefix for journal files. |
| mastoscore:timezone | Timezone for any non-UTC times |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | ConfigParser | A ConfigParser object from the config module | required |
| progress | FetchProgress | A thread-safe webapi FetchProgress object to track partial progress for reporting | FetchProgress() |
| cancel_token | CancelToken | A thread-safe webapi CancelToken object that the thread will check to determine if it should stop mid-fetch | CancelToken() |
| mock | bool | Whether we are meant to be mocking the fetch (used for testing the front end) | False |

Returns:

| Type | Description |
|---|---|
| dict \| None | A dictionary that becomes the contents of the data-{hashtag}-fetch.json statistics file. |
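A minimal calling sketch, assuming only the options documented above; a real run needs more configuration (the hashtag, event timing, cred_file, and so on), and the values shown are placeholders:

```python
from configparser import ConfigParser
from mastoscore.fetch import fetch

config = ConfigParser()
config["mastoscore"] = {
    "api_base_url": "https://mastodon.social",  # starting server (assumed value)
    "journalfile": "monsterdon",
    "timezone": "US/Eastern",
}
config["fetch"] = {"overwrite": "false", "parallel_workers": "5", "max": "2000"}

results = fetch(config)
if results is not None:
    print(f"{results['total_toots']} toots from {len(results['servers_done'])} servers")
```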
Source code in mastoscore/fetch.py
def fetch(
config: ConfigParser,
progress: FetchProgress = FetchProgress(),
cancel_token: CancelToken = CancelToken(),
mock: bool = False,
) -> dict | None:
"""
This is the top-level function that will download toots and store them in a JSON cache. This
function will create a [tooter](module-tooter.md) and login to the server named in the `cred_file`.
## Config Parameters Used
| Option | Description |
| ------ | ----------- |
| fetch:overwrite | If True, overwrite files (re-fetch). If False and exists for a server, skip it. Default: False
| fetch:parallel_workers | int, how many fetches to run in parallel. Default: 5
| mastoscore:api_base_url | Starting server for our first connection
| mastoscore:journalfile | prefix for journal files.
| mastoscore:timezone | Timezone for any non-UTC times
Args:
config: A ConfigParser object from the [config](module-config.md) module
progress: A thread-safe [webapi FetchProgress](module-webapi.md) FetchProgress object
to track partial progress for reporting.
cancel_token: a thread-safe [webapi FetchProgress](module-webapi.md) CancelToken
object that the thread will check to determine if it should stop mid-fetch
mock(bool): Whether we are meant to be mocking the fetch (used for testing the front end)
Returns:
A dictionary that becomes the contents of the data-{hashtag}-fetch.json statistics file.
"""
overwrite = config.getboolean("fetch", "overwrite", fallback=False)
parallel_workers = config.getint("fetch", "parallel_workers", fallback=5)
journalfile = config.get("mastoscore", "journalfile")
api_base_url = config.get("mastoscore", "api_base_url")
timezone = pytimezone(config.get("mastoscore", "timezone"))
fresults = {}
logger = get_logger(config, __name__)
oldest_date, newest_date = get_fetch_window(config)
oldest_str = oldest_date.strftime("%Y-%m-%d %H:%M:%S")
newest_str = newest_date.strftime("%Y-%m-%d %H:%M:%S")
logger.debug(f"Fetch window: {oldest_str} to {newest_str}")
# Create directory structure
dir_path = create_journal_directory(config)
if dir_path is None:
return None
# Initialize tracking - start with just the initial server
if mock:
fake = Faker()
num_servers = randint(10, 30)
# Generate random server URIs using Faker
servers_todo = {f"https://{fake.domain_name()}" for _ in range(num_servers)}
else:
servers_todo = {api_base_url}
servers_done = set()
servers_fail = set()
total_toots = 0
servers_lock = threading.Lock()
# Initialize progress tracking
progress.update(
len(servers_todo), len(servers_done), len(servers_fail), total_toots
)
logger.info(
f"Starting fetch from {api_base_url} with up to {parallel_workers} parallel workers"
)
fetch_start = datetime.datetime.now(tz=timezone)
all_done = threading.Event()
work_available = threading.Event()
with ThreadPoolExecutor(max_workers=parallel_workers) as executor:
futures = {}
def submit_pending():
"""Submit work from servers_todo up to parallel_workers limit.
Must be called while holding servers_lock."""
while len(futures) < parallel_workers and servers_todo:
uri = servers_todo.pop()
progress.set_pending(uri)
if mock:
maxtoots = config.getint("fetch", "max")
future = executor.submit(
mock_fetch_single_server,
uri,
maxtoots,
cancel_token,
progress,
config,
dir_path,
journalfile,
)
else:
future = executor.submit(
fetch_single_server,
config,
uri,
dir_path,
journalfile,
overwrite,
progress,
cancel_token,
)
future.add_done_callback(on_future_done)
futures[future] = uri
logger.debug(f"Submitted: {uri} (active workers: {len(futures)})")
def on_future_done(future):
"""Callback: bookkeeping only, no submitting. Signals main loop."""
nonlocal total_toots
with servers_lock:
uri = futures.pop(future)
try:
_, toots_result, newuris = future.result()
except Exception as e:
logger.error(f"Worker exception for {uri}: {e}")
toots_result = None
newuris = None
if toots_result is None:
servers_fail.add(uri)
logger.debug(f"Failed: {uri}")
else:
servers_done.add(uri)
total_toots += len(toots_result)
if newuris:
new_count = 0
for server in set(newuris):
if (
server not in servers_done
and server not in servers_todo
and server not in servers_fail
):
servers_todo.add(server)
progress.set_pending(server)
new_count += 1
if new_count > 0:
logger.info(
f"Discovered {new_count} new servers from {uri}"
)
progress.update(
len(servers_todo),
len(servers_done),
len(servers_fail),
total_toots,
uri,
)
logger.info(
f"Processed {uri}. Todo: {len(servers_todo)}, Active: {len(futures)}, Done: {len(servers_done)}, Fail: {len(servers_fail)}"
)
# Wake up main loop to submit more work (outside the lock)
work_available.set()
# Kick off initial batch
with servers_lock:
submit_pending()
# Main loop: wait for callbacks, then submit more work
while True:
work_available.wait(timeout=1.0)
work_available.clear()
if cancel_token.is_cancelled():
logger.info("Fetch cancelled by user")
with servers_lock:
if not futures:
break
continue
with servers_lock:
before = len(futures)
submit_pending()
after = len(futures)
if after > before:
logger.info(f"Main loop submitted {after - before} new workers (active: {after}, todo: {len(servers_todo)})")
elif servers_todo and after < parallel_workers:
logger.error(f"MAIN LOOP UNDERUTILIZED: {after} of {parallel_workers} active, {len(servers_todo)} waiting")
if not futures and not servers_todo:
break
fetch_end = datetime.datetime.now(tz=timezone)
fetch_duration = fetch_end - fetch_start
duration_string = ""
fetch_hours, fetch_seconds = divmod(int(fetch_duration.total_seconds()), 3600)
if fetch_hours > 1:
duration_string = f"{fetch_hours} hours"
elif fetch_hours == 1:
duration_string = f"{fetch_hours} hour"
fetch_minutes, fetch_seconds = divmod(fetch_seconds, 60)
if fetch_minutes > 1:
duration_string = f"{duration_string} {fetch_minutes} minutes"
elif fetch_minutes == 1:
duration_string = f"{duration_string} {fetch_minutes} minute"
if fetch_seconds > 1:
duration_string = f"{duration_string} {fetch_seconds} seconds"
elif fetch_seconds == 1:
duration_string = f"{duration_string} {fetch_seconds} second"
else:
duration_string = f"{duration_string} exactly"
fresults["total_toots"] = total_toots
fresults["servers_done"] = list(servers_done)
fresults["servers_fail"] = list(servers_fail)
fresults["oldest_date"] = oldest_date.strftime("%a %e %b %Y %H:%M:%S %Z")
fresults["fetch_time"] = fetch_start
fresults["fetch_end"] = fetch_end
fresults["fetch_duration"] = duration_string
fresults["fetch_version"] = __version__
write_json(config, "fetch", fresults)
logger.info(
f"Done! Collected {total_toots} toots from {len(servers_done)} servers with {len(servers_fail)} failures."
)
# Return summary for API
return {
"servers_done": list(servers_done),
"servers_fail": list(servers_fail),
"total_toots": total_toots,
"duration_seconds": fetch_duration.total_seconds(),
"output_file": dir_path,
}
### fetch_hashtag_remote(config, server, progress=FetchProgress(), cancel_token=CancelToken())
Given the api_base_url of a server, create a Tooter for that server. Connect and fetch the statuses. Return a few fields, but not all.
#### Config Parameters Used
| Option | Description |
|---|---|
| fetch:max | Max number of toots to pull from a server (default: 2000) |
| mastoscore:hashtag | Hashtag to search for |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | ConfigParser | A ConfigParser object from the config module | required |
| server | str | The api_base_url of a server to fetch from | required |
| progress | FetchProgress | Optional progress tracker for reporting incremental progress | FetchProgress() |
| cancel_token | CancelToken | Optional token that can tell us we need to cancel | CancelToken() |

Returns:

| Type | Description |
|---|---|
| list \| None | List of statuses in the raw JSON format from the API. Fields are not normalised or converted in any way. Since not all ActivityPub servers are exactly the same, it's not even guaranteed which fields you'll get. |
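For example, reusing a config like the one sketched under fetch() above (the server URL is a placeholder):

```python
from mastoscore.fetch import fetch_hashtag_remote

toots = fetch_hashtag_remote(config, "https://hachyderm.io")  # config as above
if toots is None:
    print("unreachable, not public, or nothing matched")
else:
    print(f"{len(toots)} raw statuses")
```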
Source code in mastoscore/fetch.py
def fetch_hashtag_remote(
config: ConfigParser,
server: str,
progress: FetchProgress = FetchProgress(),
cancel_token: CancelToken = CancelToken(),
) -> list | None:
"""
    Given the api_base_url of a server, create a Tooter for that server.
    Connect and fetch the statuses. Return a few fields, but not all.
## Config Parameters Used
| Option | Description |
| ------- | ------- |
| fetch:max | Max number of toots to pull from a server (default: 2000)
| mastoscore:hashtag | Hashtag to search for
Args:
config: A ConfigParser object from the [config](module-config.md) module
server: The api_base_url of a server to fetch from
progress: Optional progress tracker for reporting incremental progress
cancel_token: Optional token that can tell us we need to cancel
Returns:
List of statuses in the raw JSON format from the API. Fields are not
normalised or converted in any way. Since not all ActivityPub servers
        are exactly the same, it's not even guaranteed which fields you'll get.
"""
maxtoots = config.getint("fetch", "max")
hashtag = config.get("mastoscore", "hashtag")
oldest_date, newest_date = get_fetch_window(config)
logger = get_logger(config, __name__)
# Make the tooter that will do the searching.
try:
t = Tooter(config, "fetch", server)
logger.debug(f"Tooter created for {server}")
except (MastodonAPIError, MastodonNetworkError) as e:
logger.info(f"Cannot connect to {server}: {e}")
return None
except Exception as e:
logger.error(f"Failed to create Tooter for {server}: {e}")
return None
try:
# Collect toots from paginated generator
all_toots = []
for page in t.search_hashtag(hashtag, oldest_date, maxtoots):
# Check for cancellation between pages
if cancel_token.is_cancelled():
logger.debug(f"Fetch cancelled for {server}")
return None
all_toots.extend(page)
# Update progress with current toot count
if progress:
progress.update_server_progress(server, len(all_toots))
return all_toots if all_toots else None
except (MastodonAPIError, MastodonNetworkError) as e:
# Expected: 4xx errors, auth required, timeouts, connection failures
logger.info(f"Server {server} unavailable: {e}")
return None
except Exception as e:
logger.error(
f"fetch_hashtag_remote: unexpected error fetching {hashtag} from {server}: {e}"
)
logger.exception(e)
return None
### fetch_single_server(config, uri, dir_path, journalfile, overwrite, progress=FetchProgress(), cancel_token=CancelToken())
Fetch toots from a single server (worker function for parallel execution). Assumes the journal directory already exists.
Returns:

| Type | Description |
|---|---|
| tuple | (uri, toots_list, new_server_uris) or (uri, None, None) on failure |
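The tuple contract, as fetch()'s done-callback consumes it (a sketch; config and the paths are placeholders in the style of the earlier examples):

```python
from mastoscore.fetch import fetch_single_server

uri, toots, newuris = fetch_single_server(
    config, "https://hachyderm.io", "data/2025/10/19", "monsterdon", False
)
if toots is None:   # failure path: (uri, None, None)
    print(f"{uri} failed")
else:
    print(f"{uri}: {len(toots)} toots, {len(newuris)} candidate servers")
```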
Source code in mastoscore/fetch.py
def fetch_single_server(
config: ConfigParser,
uri: str,
dir_path: str,
journalfile: str,
overwrite: bool,
progress: FetchProgress = FetchProgress(),
cancel_token: CancelToken = CancelToken(),
) -> tuple:
"""
Fetch toots from a single server (worker function for parallel execution). Assumes
the journal directory already exists.
Returns:
tuple: (uri, toots_list, new_server_uris) or (uri, None, None) on failure
"""
logger = get_logger(config, __name__)
server = uri.split("/")[2]
jfilename = join(dir_path, f"{journalfile}-{server}.json")
# Check if file exists and we shouldn't overwrite
if not overwrite:
if exists(jfilename) and (stat(jfilename).st_size > 0):
logger.info(f"{jfilename} exists, extracting server URIs")
# Read existing file to discover server URIs
try:
import json
with open(jfilename, "r") as f:
existing = json.load(f)
progress.complete_server(uri, len(existing), failed=False)
newuris = ["/".join(s["uri"].split("/")[0:3]) for s in existing if "uri" in s]
return (uri, [], newuris)
except Exception as e:
logger.warning(f"Could not read {jfilename} for discovery: {e}")
return (uri, [], [])
# Mark server as started
progress.start_server(uri)
# Fetch from remote server with progress updates
newtoots = fetch_hashtag_remote(config, uri, progress, cancel_token)
if newtoots is None:
logger.debug(f"Got no toots back from {uri}")
progress.complete_server(uri, 0, failed=True)
return (uri, None, None)
# Convert to dataframe and write
try:
df = toots2df(newtoots, uri)
if not write_journal(config, df, server):
progress.complete_server(uri, len(newtoots), failed=True)
return (uri, None, None)
except Exception as e:
logger.error(f"Failed to convert {len(newtoots)} toots from {uri}: {e}")
progress.complete_server(uri, 0, failed=True)
return (uri, None, None)
# Mark server as complete
progress.complete_server(uri, len(newtoots), failed=False)
# Extract new server URIs
    newuris = ["/".join(s["uri"].split("/")[0:3]) for s in newtoots if "uri" in s]
return (uri, newtoots, newuris)
### mock_fetch_single_server(uri, max, cancel_token=CancelToken(), progress=FetchProgress(), config=None, dir_path=None, journalfile=None)
Simulate fetching from one server with paginated progress.
This function pretends to fetch posts from a bunch of fake named servers. It has a 25% failure rate (that is, 25% of the time, it returns a failure). This is mainly for testing the front-end UI on things like parallelism and cancelling.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| uri | str | A URL string. It could be anything. It's ignored and just returned back in the result data structure. | required |
| max | int | The most toots this mock server can pretend to have; the actual count is random, between 1 and max. | required |
| cancel_token | CancelToken | A thread-safe cancel token, like the real fetch would use | CancelToken() |
| progress | FetchProgress | A thread-safe progress object, like the real fetch would use | FetchProgress() |
| config | ConfigParser | ConfigParser for writing journal files | None |
| dir_path | str | Directory path for journal files | None |
| journalfile | str | Journal file prefix | None |

Returns:

| Type | Description |
|---|---|
| tuple | The standard output of fetch_single_server(): the URI, the list of toots, and a list of made-up new servers. |
Source code in mastoscore/fetch.py
def mock_fetch_single_server(
    uri: str,
    max: int,
cancel_token: CancelToken = CancelToken(),
progress: FetchProgress = FetchProgress(),
config: ConfigParser = None,
dir_path: str = None,
journalfile: str = None,
):
"""
Simulate fetching from one server with paginated progress.
This function pretends to fetch posts from a bunch of fake named servers. It has a 25% failure rate
(that is, 25% of the time, it returns a failure). This is mainly for testing the front-end UI on things
like parallelism and cancelling.
Args:
uri(str): A URL string. It could be anything. It's ignored and just returned back in the result
data structure.
        max(int): The most toots this mock server can pretend to have; the actual count is random, between 1 and max
        cancel_token: A thread-safe cancel token, like the real fetch would use
progress: A thread-safe progress object, like the real fetch would use
config: ConfigParser for writing journal files
dir_path: Directory path for journal files
journalfile: Journal file prefix
Returns:
        The standard output of fetch_single_server(), which is the URI, the list of toots,
        and a list of made-up new servers.
"""
from mastoscore.analyse import toots2df
from mastoscore.journal import write_journal
from mastoscore.config import get_logger
logger = get_logger(config, __name__) if config else None
# Mark as started
progress.start_server(uri)
# How speedy is this server? It will sleep this much between every progress update.
sleep_time = uniform(0.5, 3.0)
# 25% failure rate - fail immediately
if random() < 0.25:
sleep(sleep_time)
progress.complete_server(uri, 0, failed=True)
return (uri, None, None)
    # How many toots will this server have? Between 1 and max
num_toots = randint(1, max)
return_toots = []
while len(return_toots) < num_toots:
# Check cancellation
if cancel_token.is_cancelled():
progress.complete_server(uri, len(return_toots), failed=False)
return (uri, return_toots, [])
        # Simulate page fetch delay (0.5-3 seconds per page)
sleep(sleep_time)
# most mastodon servers do 40 per page
page_size = 40
if len(return_toots) + page_size > num_toots:
# we're done
page_size = num_toots - len(return_toots)
new_toots = [random_toot() for _ in range(page_size)]
return_toots.extend(new_toots)
# Update the progress object
progress.update_server_progress(uri, len(return_toots))
# Write journal file if config provided
if config and dir_path and journalfile:
try:
df = toots2df(return_toots, uri)
# Add mock flag to the dataframe
df['mock'] = True
server = uri.split("/")[2]
if not write_journal(config, df, server):
if logger:
logger.error(f"Failed to write mock journal for {uri}")
progress.complete_server(uri, len(return_toots), failed=True)
return (uri, None, None)
except Exception as e:
if logger:
logger.error(f"Failed to write mock journal for {uri}: {e}")
progress.complete_server(uri, len(return_toots), failed=True)
return (uri, None, None)
# Mark as complete
progress.complete_server(uri, len(return_toots), failed=False)
if randint(0, 9) < 1:
# Now pretend we found a few new servers. Keep this low, so it
# converges quickly.
fake = Faker()
# Generate random server URIs using Faker
new_uris = {f"https://{fake.domain_name()}" for _ in range(randint(1, 6))}
else:
        new_uris = set()
return (uri, return_toots, new_uris)
### random_toot()
Used by the mock_fetch_single_server() to generate random toots that fit the structure that fetch() expects.
Returns:

| Type | Description |
|---|---|
| dict | A dict very much like a real toot, except the dates are really random, so it doesn't make coherent sense. |
Source code in mastoscore/fetch.py
def random_toot() -> dict:
"""
Used by the mock_fetch_single_server() to generate random toots that fit the structure
that fetch() expects.
Returns:
        A dict very much like a real toot, except the dates are really random, so it
        doesn't make coherent sense.
"""
fake = Faker()
tootid = fake.credit_card_full()
server = fake.domain_name()
userid = fake.user_name()
toot = {
"account.display_name": fake.first_name() + fake.last_name(),
"account.username": userid,
"content": fake.words(nb=randint(10,20)),
"local": (randint(0,1) > 0),
"in_reply_to_id": str(tootid),
"favourites_count": randint(0,50),
"source": f"{server}",
"server": f"{server}",
"account.url": f"https://{server}/@{userid}",
"uri": f"https://{server}/users/{userid}/statuses/{tootid}",
"url": f"https://{server}/@{userid}/{tootid}",
"replies_count": randint(0,10),
"created_at": "2026-02-28T19:27:49Z",
"userid": f"{userid}@{server}",
"id": f"{tootid}",
"account.indexable": False,
"reblogs_count": randint(0,20),
}
return toot
### reset_fetch(config)
Delete all fetch data files for a given configuration.
Removes:
- All per-server journal files ({hashtag}-{server}.json)
- Fetch and analysis results files (data-{journalfile}-*.json)
#### Config Parameters Used
| Option | Description |
|---|---|
| mastoscore:hashtag | Hashtag to search for |
| mastoscore:journalfile | slug for journal files |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | ConfigParser | A ConfigParser object from the config module | required |

Returns:

| Type | Description |
|---|---|
| dict | A dict with files_deleted (a list of dicts with 'path' and 'size' for each file), count (the number of files deleted), and total_size (total bytes deleted). |
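A usage sketch (config as in the fetch() example above):

```python
from mastoscore.fetch import reset_fetch

summary = reset_fetch(config)
print(f"deleted {summary['count']} files, {summary['total_size']} bytes total")
for f in summary["files_deleted"]:
    print(f"  {f['path']} ({f['size']} bytes)")
```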
Source code in mastoscore/fetch.py
def reset_fetch(config: ConfigParser) -> dict:
"""
Delete all fetch data files for a given configuration.
Removes:
    - All per-server journal files ({hashtag}-{server}.json)
    - Fetch and analysis results files (data-{journalfile}-*.json)
## Config Parameters Used
| Option | Description |
| ------ | ----------- |
| mastoscore:hashtag | Hashtag to search for
| mastoscore:journalfile | slug for journal files
Args:
config: A ConfigParser object from the config module
Returns:
dict with:
- files_deleted: list of dicts with 'path' and 'size' for each file
- count: number of files deleted
- total_size: total bytes deleted
"""
logger = get_logger(config, __name__)
journalfile = config.get("mastoscore", "journalfile")
hashtag = config.get("mastoscore", "hashtag")
dir_path = create_journal_directory(config)
if dir_path is None or not exists(dir_path):
return {"files_deleted": [], "count": 0, "total_size": 0}
deleted_files = []
total_size = 0
    # Pattern 1: Fetch and analysis results files (data-{journalfile}-*.json)
    data_files = join(dir_path, f"data-{journalfile}-*.json")
    # Pattern 2: Per-server journal files ({hashtag}-{server}.json)
    journal_files = join(dir_path, f"{hashtag}-*.json")
# Find and delete all matching files
for pattern in [data_files, journal_files]:
for filepath in glob(pattern):
try:
# Get file size before deleting
file_size = stat(filepath).st_size
remove(filepath)
deleted_files.append({"path": filepath, "size": file_size})
total_size += file_size
logger.info(f"Deleted: {filepath} ({file_size} bytes)")
except Exception as e:
logger.error(f"Failed to delete {filepath}: {e}")
return {
"files_deleted": deleted_files,
"count": len(deleted_files),
"total_size": total_size,
}