Module: fetch

The fetch module starts at the server where you're logged in and searches for a hashtag. Once it has all the toots that server knows about, it looks at where they came from. For each server it finds mentioned, it calls fetch_hashtag_remote(). Each time it connects to a new server, it fetches every toot that server knows about for the hashtag, then looks at the servers mentioned in those toots and adds any new ones to the list of servers to contact.
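The crawl described above can be sketched as a simple work-queue loop. This is a minimal illustration, not the module's actual code; `fetch_toots_from` stands in for the real `fetch_hashtag_remote()`:

```python
from urllib.parse import urlparse

def crawl_hashtag(start_server, fetch_toots_from):
    """Breadth-first crawl: fetch a hashtag from one server, then from
    every new server mentioned in the toots that come back."""
    todo = {start_server}
    done, failed = set(), set()
    all_toots = []
    while todo:
        server = todo.pop()
        toots = fetch_toots_from(server)  # returns None on failure
        if toots is None:
            failed.add(server)
            continue
        done.add(server)
        all_toots.extend(toots)
        # Discover new servers from the origin of each toot's uri
        for toot in toots:
            parts = urlparse(toot.get("uri", ""))
            if not parts.netloc:
                continue
            origin = f"{parts.scheme}://{parts.netloc}"
            if origin not in done and origin not in failed:
                todo.add(origin)
    return all_toots, done, failed
```

Because `done` and `failed` are consulted before a server is queued, the crawl converges once every reachable server has been tried exactly once.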

Public API

This code depends on Mastodon.py and uses it to connect to the servers that are mentioned. If you know anything about the fediverse, you know there's more than just Mastodon servers out there: Pleroma, Akkoma, and various other ActivityPub-compatible servers. Some are derived from Mastodon and implement the same APIs; others don't. Likewise, some Mastodon servers offer public read APIs and others don't. Servers that allow public reads of a Mastodon-compatible timeline API will send you the details on their toots. Servers that don't allow public reads, or that don't implement such an API, will be quietly skipped.

Parallel fetching

Fetch spins up parallel fetches. If you're using backstage, it will report updates every 2 seconds. Servers tend to update in page sizes: since we are typically fetching 40 posts at a time, you should see a server's progress change in multiples of 40.

Fetching is limited by two things: the time of the post and the max number in the INI file. The max is the most we will fetch from any one server; there is no overall limit. The default max is 2000, so if you hit 50 servers with 2000 posts each, we will happily fetch 100,000 posts.

The start of the fetch interval is calculated by taking the event_start and subtracting the hours_margin. The end of the fetch interval takes the event_start and adds the duration plus the hours_margin.

An example: imagine the event starts at 19:00, the duration is 75 minutes (1:15), and the hours_margin is 2 hours. The earliest possible toot will be at 17:00, and the latest possible toot will be at 19:00 + 1:15 + 2:00, or 22:15. Fetch stops fetching from a particular server when it hits max or reaches toots later than the latest time.
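Following the rule above (event_start minus hours_margin through event_start plus duration plus hours_margin), the window can be reproduced with plain datetime arithmetic. This is a sketch; the real get_fetch_window() takes these values from the config:

```python
from datetime import datetime, timedelta

def fetch_window(event_start, duration_minutes, hours_margin):
    """Earliest and latest acceptable toot times for an event."""
    margin = timedelta(hours=hours_margin)
    oldest = event_start - margin
    newest = event_start + timedelta(minutes=duration_minutes) + margin
    return oldest, newest

# Event starting 19:00 with a 75-minute duration and 2-hour margin
oldest, newest = fetch_window(datetime(2025, 10, 19, 19, 0), 75, 2)
```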

Directory Structure

Fetch organizes data in a directory based on the date of the event. So if the journaldir is data, the hashtag is monsterdon, and the event_date is 2025-10-19, then the directory structure is data/2025/10/19 and all the files will have the monsterdon hashtag in their name. See the example below:

data
├── 2025
│  └── 10
│     ├── 19
│     │  ├── data-monsterdon-analysis.json
│     │  ├── data-monsterdon-fetch.json
│     │  ├── index.md
│     │  ├── monsterdon-20251019.png
│     │  ├── monsterdon-20251019.txt
│     │  ├── monsterdon-beige.party.json
│     │  ├── monsterdon-bolha.us.json
│     │  ... lots more files, one per server...
│     │  ├── wordcloud-monsterdon-20251019-remove.png
│     │  └── wordcloud-monsterdon-20251019-remove.txt
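The layout above can be reproduced with a couple of path helpers. These are hypothetical sketches for illustration; create_journal_directory() and write_journal() do the real work:

```python
import os

def journal_dir(journaldir, event_date):
    """Build journaldir/YYYY/MM/DD from an ISO event_date like '2025-10-19'."""
    year, month, day = event_date.split("-")
    return os.path.join(journaldir, year, month, day)

def journal_file(journaldir, event_date, hashtag, server):
    """Per-server journal file inside the event's directory."""
    return os.path.join(journal_dir(journaldir, event_date),
                        f"{hashtag}-{server}.json")
```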

List of files

Every file has the hashtag and the date in its name. If you ran the same analysis on two different hashtags on the same day, none of the files would conflict, though they would all be stored in the same directory.

Module for fetching toots for a hashtag.

fetch(config, progress=FetchProgress(), cancel_token=CancelToken(), mock=False)

This is the top-level function that downloads toots and stores them in a JSON cache. It creates a tooter and logs in to the server named in the cred_file.

Config Parameters Used

Option Description
fetch:overwrite If True, overwrite files (re-fetch). If False and a file already exists for a server, skip it. Default: False
fetch:parallel_workers int, how many fetches to run in parallel. Default: 5
mastoscore:api_base_url Starting server for our first connection
mastoscore:journalfile prefix for journal files.
mastoscore:timezone Timezone for any non-UTC times
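Put together, an INI file covering the fetch-related options documented on this page might look like the following. All values here are illustrative assumptions, except where the documented default is noted:

```ini
[mastoscore]
; starting server for the first connection (illustrative hostname)
api_base_url = https://mastodon.example
; prefix for journal files
journalfile = monsterdon
; hashtag to search for
hashtag = monsterdon
; timezone for any non-UTC times
timezone = US/Eastern

[fetch]
; re-fetch even if a server's journal file exists (default: False)
overwrite = False
; number of parallel fetch workers (default: 5)
parallel_workers = 5
; per-server cap on fetched toots (default: 2000)
max = 2000
```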

Parameters:

Name Type Description Default
config ConfigParser

A ConfigParser object from the config module

required
progress FetchProgress

A thread-safe FetchProgress object (from the webapi module) used to track partial progress for reporting.

FetchProgress()
cancel_token CancelToken

A thread-safe CancelToken object (from the webapi module) that the thread will check to determine if it should stop mid-fetch.

CancelToken()
mock bool

Whether we are meant to be mocking the fetch (used for testing the front end)

False

Returns:

Type Description
dict | None

A dictionary that becomes the contents of the data-{hashtag}-fetch.json statistics file.

Source code in mastoscore/fetch.py
def fetch(
    config: ConfigParser,
    progress: FetchProgress = FetchProgress(),
    cancel_token: CancelToken = CancelToken(),
    mock: bool = False,
) -> dict | None:
    """
    This is the top-level function that will download toots and store them in a JSON cache. This
    function will create a [tooter](module-tooter.md) and login to the server named in the `cred_file`.

    ## Config Parameters Used

    | Option | Description |
    | ------ | ----------- |
    | fetch:overwrite |  If True, overwrite files (re-fetch). If False and a file already exists for a server, skip it. Default: False
    | fetch:parallel_workers |  int, how many fetches to run in parallel. Default: 5
    | mastoscore:api_base_url |  Starting server for our first connection
    | mastoscore:journalfile |  prefix for journal files.
    | mastoscore:timezone |  Timezone for any non-UTC times

    Args:
        config: A ConfigParser object from the [config](module-config.md) module
        progress: A thread-safe FetchProgress object from the [webapi](module-webapi.md)
            module, used to track partial progress for reporting.
        cancel_token: a thread-safe CancelToken object from the [webapi](module-webapi.md)
            module that the thread will check to determine if it should stop mid-fetch
        mock(bool): Whether we are meant to be mocking the fetch (used for testing the front end)

    Returns:
        A dictionary that becomes the contents of the data-{hashtag}-fetch.json statistics file.
    """

    overwrite = config.getboolean("fetch", "overwrite", fallback=False)
    parallel_workers = config.getint("fetch", "parallel_workers", fallback=5)
    journalfile = config.get("mastoscore", "journalfile")
    api_base_url = config.get("mastoscore", "api_base_url")
    timezone = pytimezone(config.get("mastoscore", "timezone"))

    fresults = {}
    logger = get_logger(config, __name__)

    oldest_date, newest_date = get_fetch_window(config)
    oldest_str = oldest_date.strftime("%Y-%m-%d %H:%M:%S")
    newest_str = newest_date.strftime("%Y-%m-%d %H:%M:%S")
    logger.debug(f"Fetch window: {oldest_str} to {newest_str}")

    # Create directory structure
    dir_path = create_journal_directory(config)
    if dir_path is None:
        return None

    # Initialize tracking - start with just the initial server
    if mock:
        fake = Faker()
        num_servers = randint(10, 30)
        # Generate random server URIs using Faker
        servers_todo = {f"https://{fake.domain_name()}" for _ in range(num_servers)}
    else:
        servers_todo = {api_base_url}
    servers_done = set()
    servers_fail = set()
    total_toots = 0
    servers_lock = threading.Lock()

    # Initialize progress tracking
    progress.update(
        len(servers_todo), len(servers_done), len(servers_fail), total_toots
    )

    logger.info(
        f"Starting fetch from {api_base_url} with up to {parallel_workers} parallel workers"
    )
    fetch_start = datetime.datetime.now(tz=timezone)
    all_done = threading.Event()
    work_available = threading.Event()

    with ThreadPoolExecutor(max_workers=parallel_workers) as executor:
        futures = {}

        def submit_pending():
            """Submit work from servers_todo up to parallel_workers limit.
            Must be called while holding servers_lock."""
            while len(futures) < parallel_workers and servers_todo:
                uri = servers_todo.pop()
                progress.set_pending(uri)
                if mock:
                    maxtoots = config.getint("fetch", "max")
                    future = executor.submit(
                        mock_fetch_single_server,
                        uri,
                        maxtoots,
                        cancel_token,
                        progress,
                        config,
                        dir_path,
                        journalfile,
                    )
                else:
                    future = executor.submit(
                        fetch_single_server,
                        config,
                        uri,
                        dir_path,
                        journalfile,
                        overwrite,
                        progress,
                        cancel_token,
                    )
                future.add_done_callback(on_future_done)
                futures[future] = uri
                logger.debug(f"Submitted: {uri} (active workers: {len(futures)})")

        def on_future_done(future):
            """Callback: bookkeeping only, no submitting. Signals main loop."""
            nonlocal total_toots
            with servers_lock:
                uri = futures.pop(future)
                try:
                    _, toots_result, newuris = future.result()
                except Exception as e:
                    logger.error(f"Worker exception for {uri}: {e}")
                    toots_result = None
                    newuris = None

                if toots_result is None:
                    servers_fail.add(uri)
                    logger.debug(f"Failed: {uri}")
                else:
                    servers_done.add(uri)
                    total_toots += len(toots_result)

                    if newuris:
                        new_count = 0
                        for server in set(newuris):
                            if (
                                server not in servers_done
                                and server not in servers_todo
                                and server not in servers_fail
                            ):
                                servers_todo.add(server)
                                progress.set_pending(server)
                                new_count += 1
                        if new_count > 0:
                            logger.info(
                                f"Discovered {new_count} new servers from {uri}"
                            )

                progress.update(
                    len(servers_todo),
                    len(servers_done),
                    len(servers_fail),
                    total_toots,
                    uri,
                )

                logger.info(
                    f"Processed {uri}. Todo: {len(servers_todo)}, Active: {len(futures)}, Done: {len(servers_done)}, Fail: {len(servers_fail)}"
                )

            # Wake up main loop to submit more work (outside the lock)
            work_available.set()

        # Kick off initial batch
        with servers_lock:
            submit_pending()

        # Main loop: wait for callbacks, then submit more work
        while True:
            work_available.wait(timeout=1.0)
            work_available.clear()

            if cancel_token.is_cancelled():
                logger.info("Fetch cancelled by user")
                with servers_lock:
                    if not futures:
                        break
                continue

            with servers_lock:
                before = len(futures)
                submit_pending()
                after = len(futures)
                if after > before:
                    logger.info(f"Main loop submitted {after - before} new workers (active: {after}, todo: {len(servers_todo)})")
                elif servers_todo and after < parallel_workers:
                    logger.error(f"MAIN LOOP UNDERUTILIZED: {after} of {parallel_workers} active, {len(servers_todo)} waiting")
                if not futures and not servers_todo:
                    break

    fetch_end = datetime.datetime.now(tz=timezone)
    fetch_duration = fetch_end - fetch_start
    duration_string = ""
    fetch_hours, fetch_seconds = divmod(int(fetch_duration.total_seconds()), 3600)
    if fetch_hours > 1:
        duration_string = f"{fetch_hours} hours"
    elif fetch_hours == 1:
        duration_string = f"{fetch_hours} hour"

    fetch_minutes, fetch_seconds = divmod(fetch_seconds, 60)
    if fetch_minutes > 1:
        duration_string = f"{duration_string} {fetch_minutes} minutes"
    elif fetch_minutes == 1:
        duration_string = f"{duration_string} {fetch_minutes} minute"

    if fetch_seconds > 1:
        duration_string = f"{duration_string} {fetch_seconds} seconds"
    elif fetch_seconds == 1:
        duration_string = f"{duration_string} {fetch_seconds} second"
    else:
        duration_string = f"{duration_string} exactly"

    fresults["total_toots"] = total_toots
    fresults["servers_done"] = list(servers_done)
    fresults["servers_fail"] = list(servers_fail)
    fresults["oldest_date"] = oldest_date.strftime("%a %e %b %Y %H:%M:%S %Z")
    fresults["fetch_time"] = fetch_start
    fresults["fetch_end"] = fetch_end
    fresults["fetch_duration"] = duration_string
    fresults["fetch_version"] = __version__

    write_json(config, "fetch", fresults)
    logger.info(
        f"Done! Collected {total_toots} toots from {len(servers_done)} servers with {len(servers_fail)} failures."
    )

    # Return summary for API
    return {
        "servers_done": list(servers_done),
        "servers_fail": list(servers_fail),
        "total_toots": total_toots,
        "duration_seconds": fetch_duration.total_seconds(),
        "output_file": dir_path,
    }

fetch_hashtag_remote(config, server, progress=FetchProgress(), cancel_token=CancelToken())

Given the api_base_url of a server (for example, extracted from a toot's uri, like from Mastodon.status), create a Tooter for that server. Connect and fetch the statuses. Return a few fields, but not all.

Config Parameters Used

Option Description
fetch:max Max number of toots to pull from a server (default: 2000)
mastoscore:hashtag Hashtag to search for

Parameters:

Name Type Description Default
config ConfigParser

A ConfigParser object from the config module

required
server str

The api_base_url of a server to fetch from

required
progress FetchProgress

Optional progress tracker for reporting incremental progress

FetchProgress()
cancel_token CancelToken

Optional token that can tell us we need to cancel

CancelToken()

Returns:

Type Description
list | None

List of statuses in the raw JSON format from the API. Fields are not normalised or converted in any way. Since not all ActivityPub servers are exactly the same, there is no guarantee which fields you will get.

Source code in mastoscore/fetch.py
def fetch_hashtag_remote(
    config: ConfigParser,
    server: str,
    progress: FetchProgress = FetchProgress(),
    cancel_token: CancelToken = CancelToken(),
) -> list | None:
    """
    Given the api_base_url of a server (for example, extracted from a toot's uri, like
    from Mastodon.status), create a Tooter for that server. Connect and fetch the
    statuses. Return a few fields, but not all.

    ## Config Parameters Used

    | Option | Description |
    | ------- | ------- |
    | fetch:max | Max number of toots to pull from a server (default: 2000)
    | mastoscore:hashtag | Hashtag to search for

    Args:
      config: A ConfigParser object from the [config](module-config.md) module
      server: The api_base_url of a server to fetch from
      progress: Optional progress tracker for reporting incremental progress
      cancel_token: Optional token that can tell us we need to cancel

    Returns:
      List of statuses in the raw JSON format from the API. Fields are not
        normalised or converted in any way. Since not all ActivityPub servers
        are exactly the same, there is no guarantee which fields you will get.
    """

    maxtoots = config.getint("fetch", "max")
    hashtag = config.get("mastoscore", "hashtag")
    oldest_date, newest_date = get_fetch_window(config)

    logger = get_logger(config, __name__)

    # Make the tooter that will do the searching.
    try:
        t = Tooter(config, "fetch", server)
        logger.debug(f"Tooter created for {server}")
    except (MastodonAPIError, MastodonNetworkError) as e:
        logger.info(f"Cannot connect to {server}: {e}")
        return None
    except Exception as e:
        logger.error(f"Failed to create Tooter for {server}: {e}")
        return None

    try:
        # Collect toots from paginated generator
        all_toots = []
        for page in t.search_hashtag(hashtag, oldest_date, maxtoots):
            # Check for cancellation between pages
            if cancel_token.is_cancelled():
                logger.debug(f"Fetch cancelled for {server}")
                return None

            all_toots.extend(page)
            # Update progress with current toot count
            if progress:
                progress.update_server_progress(server, len(all_toots))

        return all_toots if all_toots else None
    except (MastodonAPIError, MastodonNetworkError) as e:
        # Expected: 4xx errors, auth required, timeouts, connection failures
        logger.info(f"Server {server} unavailable: {e}")
        return None
    except Exception as e:
        logger.error(
            f"fetch_hashtag_remote: unexpected error fetching {hashtag} from {server}: {e}"
        )
        logger.exception(e)
        return None

fetch_single_server(config, uri, dir_path, journalfile, overwrite, progress=FetchProgress(), cancel_token=CancelToken())

Fetch toots from a single server (worker function for parallel execution). Assumes the journal directory already exists.

Returns:

Name Type Description
tuple tuple

(uri, toots_list, new_server_uris) or (uri, None, None) on failure

Source code in mastoscore/fetch.py
def fetch_single_server(
    config: ConfigParser,
    uri: str,
    dir_path: str,
    journalfile: str,
    overwrite: bool,
    progress: FetchProgress = FetchProgress(),
    cancel_token: CancelToken = CancelToken(),
) -> tuple:
    """
    Fetch toots from a single server (worker function for parallel execution). Assumes
    the journal directory already exists.

    Returns:
        tuple: (uri, toots_list, new_server_uris) or (uri, None, None) on failure
    """
    logger = get_logger(config, __name__)
    server = uri.split("/")[2]
    jfilename = join(dir_path, f"{journalfile}-{server}.json")

    # Check if file exists and we shouldn't overwrite
    if not overwrite:
        if exists(jfilename) and (stat(jfilename).st_size > 0):
            logger.info(f"{jfilename} exists, extracting server URIs")
            # Read existing file to discover server URIs
            try:
                import json
                with open(jfilename, "r") as f:
                    existing = json.load(f)
                progress.complete_server(uri, len(existing), failed=False)
                newuris = ["/".join(s["uri"].split("/")[0:3]) for s in existing if "uri" in s]
                return (uri, [], newuris)
            except Exception as e:
                logger.warning(f"Could not read {jfilename} for discovery: {e}")
                return (uri, [], [])

    # Mark server as started
    progress.start_server(uri)

    # Fetch from remote server with progress updates
    newtoots = fetch_hashtag_remote(config, uri, progress, cancel_token)
    if newtoots is None:
        logger.debug(f"Got no toots back from {uri}")
        progress.complete_server(uri, 0, failed=True)
        return (uri, None, None)

    # Convert to dataframe and write
    try:
        df = toots2df(newtoots, uri)
        if not write_journal(config, df, server):
            progress.complete_server(uri, len(newtoots), failed=True)
            return (uri, None, None)
    except Exception as e:
        logger.error(f"Failed to convert {len(newtoots)} toots from {uri}: {e}")
        progress.complete_server(uri, 0, failed=True)
        return (uri, None, None)

    # Mark server as complete
    progress.complete_server(uri, len(newtoots), failed=False)

    # Extract new server URIs
    newuris = ["/".join(s["uri"].split("/")[0:3]) for s in newtoots]

    return (uri, newtoots, newuris)

mock_fetch_single_server(uri, max, cancel_token=CancelToken(), progress=FetchProgress(), config=None, dir_path=None, journalfile=None)

Simulate fetching from one server with paginated progress.

This function pretends to fetch posts from a bunch of fake named servers. It has a 25% failure rate (that is, 25% of the time, it returns a failure). This is mainly for testing the front-end UI on things like parallelism and cancelling.

Parameters:

Name Type Description Default
uri str

A URL string. It could be anything. It's ignored and just returned back in the result data structure.

required
max int

The maximum number of toots this mock server will pretend to have; the actual count is chosen at random between 1 and max.

required
cancel_token CancelToken

A thread-safe cancel token, like the real fetch would use

CancelToken()
progress FetchProgress

A thread-safe progress object, like the real fetch would use

FetchProgress()
config ConfigParser

ConfigParser for writing journal files

None
dir_path str

Directory path for journal files

None
journalfile str

Journal file prefix

None

Returns:

Type Description

The standard output of fetch_single_server(): the URI, the list of toots, and a list of made-up new servers.

Source code in mastoscore/fetch.py
def mock_fetch_single_server(
    uri:str,
    max:int,
    cancel_token: CancelToken = CancelToken(),
    progress: FetchProgress = FetchProgress(),
    config: ConfigParser = None,
    dir_path: str = None,
    journalfile: str = None,
):
    """
    Simulate fetching from one server with paginated progress.

    This function pretends to fetch posts from a bunch of fake named servers. It has a 25% failure rate
    (that is, 25% of the time, it returns a failure). This is mainly for testing the front-end UI on things
    like parallelism and cancelling.

    Args:
        uri(str): A URL string. It could be anything. It's ignored and just returned back in the result
            data structure.
        max(int): The maximum number of toots this mock server will pretend to have;
            the actual count is chosen at random between 1 and max.
        cancel_token: A thread-safe cancel token, like the real fetch would use
        progress: A thread-safe progress object, like the real fetch would use
        config: ConfigParser for writing journal files
        dir_path: Directory path for journal files
        journalfile: Journal file prefix

    Returns:
        The standard output of fetch_single_server(), which is the URI, the list of toots,
            and a list of made-up new servers.
    """
    from mastoscore.analyse import toots2df
    from mastoscore.journal import write_journal
    from mastoscore.config import get_logger

    logger = get_logger(config, __name__) if config else None

    # Mark as started
    progress.start_server(uri)

    # How speedy is this server? It will sleep this much between every progress update.
    sleep_time = uniform(0.5, 3.0)
    # 25% failure rate - fail immediately
    if random() < 0.25:
        sleep(sleep_time)
        progress.complete_server(uri, 0, failed=True)
        return (uri, None, None)

    # How many toots will this server have? Between 1 and max
    num_toots = randint(1, max)
    return_toots = []

    while len(return_toots) < num_toots:
        # Check cancellation
        if cancel_token.is_cancelled():
            progress.complete_server(uri, len(return_toots), failed=False)
            return (uri, return_toots, [])

        # Simulate page fetch delay (0.5-3 seconds per page)
        sleep(sleep_time)
        # most mastodon servers do 40 per page
        page_size = 40
        if len(return_toots) + page_size > num_toots:
            # we're done
            page_size = num_toots - len(return_toots)

        new_toots = [random_toot() for _ in range(page_size)]
        return_toots.extend(new_toots)
        # Update the progress object
        progress.update_server_progress(uri, len(return_toots))

    # Write journal file if config provided
    if config and dir_path and journalfile:
        try:
            df = toots2df(return_toots, uri)
            # Add mock flag to the dataframe
            df['mock'] = True
            server = uri.split("/")[2]
            if not write_journal(config, df, server):
                if logger:
                    logger.error(f"Failed to write mock journal for {uri}")
                progress.complete_server(uri, len(return_toots), failed=True)
                return (uri, None, None)
        except Exception as e:
            if logger:
                logger.error(f"Failed to write mock journal for {uri}: {e}")
            progress.complete_server(uri, len(return_toots), failed=True)
            return (uri, None, None)

    # Mark as complete
    progress.complete_server(uri, len(return_toots), failed=False)

    if randint(0, 9) < 1:
        # Now pretend we found a few new servers. Keep this low, so it
        # converges quickly.
        fake = Faker()
        # Generate random server URIs using Faker
        new_uris = {f"https://{fake.domain_name()}" for _ in range(randint(1, 6))}
    else:
        new_uris = []

    return (uri, return_toots, new_uris)

random_toot()

Used by the mock_fetch_single_server() to generate random toots that fit the structure that fetch() expects.

Returns:

Type Description
dict

A dict shaped very much like a real toot, except the dates are placeholders and the fields don't make coherent sense together.

Source code in mastoscore/fetch.py
def random_toot() -> dict:
    """
    Used by the mock_fetch_single_server() to generate random toots that fit the structure
    that fetch() expects.

    Returns:
        Dict shaped very much like a real toot, except the dates are placeholders and
        the fields don't make coherent sense together.
    """
    fake = Faker()

    # credit_card_number() returns digits only; credit_card_full() would embed newlines in the URLs below
    tootid = fake.credit_card_number()
    server = fake.domain_name()
    userid = fake.user_name()
    toot = {
        "account.display_name": fake.first_name() + fake.last_name(),
        "account.username": userid,
        "content": fake.words(nb=randint(10,20)),
        "local": (randint(0,1) > 0),
        "in_reply_to_id": str(tootid),
        "favourites_count": randint(0,50),
        "source": f"{server}",
        "server": f"{server}",
        "account.url": f"https://{server}/@{userid}",
        "uri": f"https://{server}/users/{userid}/statuses/{tootid}",
        "url": f"https://{server}/@{userid}/{tootid}",
        "replies_count": randint(0,10),
        "created_at": "2026-02-28T19:27:49Z",
        "userid": f"{userid}@{server}",
        "id": f"{tootid}",
        "account.indexable": False,
        "reblogs_count": randint(0,20),
    }

    return toot

reset_fetch(config)

Delete all fetch data files for a given configuration.

Removes
  • All per-server journal files ({hashtag}-{server}.json)
  • Data files (data-{journalfile}-*.json), including the fetch results file

Config Parameters Used

Option Description
mastoscore:hashtag Hashtag to search for
mastoscore:journalfile slug for journal files

Parameters:

Name Type Description Default
config ConfigParser

A ConfigParser object from the config module

required

Returns:

Type Description
dict

dict with:
  • files_deleted: list of dicts with 'path' and 'size' for each file
  • count: number of files deleted
  • total_size: total bytes deleted

Source code in mastoscore/fetch.py
def reset_fetch(config: ConfigParser) -> dict:
    """
    Delete all fetch data files for a given configuration.

    Removes:
        - All per-server journal files ({hashtag}-{server}.json)
        - Data files (data-{journalfile}-*.json), including the fetch results file

    ## Config Parameters Used

    | Option | Description |
    | ------ | ----------- |
    | mastoscore:hashtag | Hashtag to search for
    | mastoscore:journalfile | slug for journal files

    Args:
      config: A ConfigParser object from the config module

    Returns:
        dict with:
            - files_deleted: list of dicts with 'path' and 'size' for each file
            - count: number of files deleted
            - total_size: total bytes deleted
    """
    logger = get_logger(config, __name__)

    journalfile = config.get("mastoscore", "journalfile")
    hashtag = config.get("mastoscore", "hashtag")

    dir_path = create_journal_directory(config)

    if dir_path is None or not exists(dir_path):
        return {"files_deleted": [], "count": 0, "total_size": 0}

    deleted_files = []
    total_size = 0

    # Pattern 1: Data files, including fetch results (data-{journalfile}-*.json)
    data_files = join(dir_path, f"data-{journalfile}-*.json")

    # Pattern 2: Journal files per server ({hashtag}-*.json)
    journal_files = join(dir_path, f"{hashtag}-*.json")

    # Find and delete all matching files
    for pattern in [data_files, journal_files]:
        for filepath in glob(pattern):
            try:
                # Get file size before deleting
                file_size = stat(filepath).st_size
                remove(filepath)
                deleted_files.append({"path": filepath, "size": file_size})
                total_size += file_size
                logger.info(f"Deleted: {filepath} ({file_size} bytes)")
            except Exception as e:
                logger.error(f"Failed to delete {filepath}: {e}")

    return {
        "files_deleted": deleted_files,
        "count": len(deleted_files),
        "total_size": total_size,
    }