What’s stopping me from doing this? Here we go:

I’m going to start an instance and federate with everyone who will allow it, which is most instances including this one, I believe.

Then I’m going to feed all that data into my new website, called Open Lemmy Stats, where anyone can query the user data I’ve accumulated. The homepage will be rife with insights, leaderboards, and all kinds of data on prolific users.

Additionally, I’ll display a snapshot/profile of a random user by feeding that user’s data to GPT-4 to make inferences about the user’s political affiliations, and display the results.

Worst of all, I’m not going to out my instance, so nobody will know which one to defederate from. In fact, I’m spinning up a few instances that will host innocuous communities, which I plan to mod and support, to give my instances cover for their true purpose: redundant fediverse data streams for my site, Open Lemmy Stats.

I’ll also have a store where anyone can buy my collected fediverse data for a handsome sum.

Just kidding, I’m not doing any of this. But someone absolutely will, or is already working on it. They’ll make a good bit of money too, I’d bet.

This is inspired by a recent post on youshouldknow@lemmy.world where someone highlighted what kind of data instance admins have access to, even for users not on their instance.

I wanted to share this to start a discussion that I find interesting. I’m interested in your thoughts, or in hearing more on why this may or may not be possible and, if it is, some ideas for how to fix that. Obviously such a site would be problematic, but no doubt popular for oh so many reasons.

Edit: typo, I called admins adminis. Corrected.

Edit 2: wanted to credit the post I was referencing from YSK, here it is - https://lemmy.world/post/1033769

  • GamingChairModel@lemmy.world · 35 upvotes · 1 year ago

    I don’t think that site would be problematic. After all, we’re just talking about custom interfaces to analyze public data.

    A big part of the solution is that users should have an awareness that their activity is public. Every once in a while someone gets burned not knowing that anyone can view what a specific Twitter or Instagram user liked (for example, politicians liking risqué thirst trap photos).

    Another is easy alts and throwaways, with tips to avoid correlations:

    • Don’t use the same verified email address
    • Don’t reuse usernames, including across platforms
    • Try not to use the same instance for multiple accounts, since instance admins can see whether login activity is coming from the same place, unless you absolutely trust that the admins won’t analyze your data OR inadvertently leak their records.
    • Be aware of the techniques used to correlate users: analysis of timestamps, linguistic/grammatical quirks, etc.
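
The timestamp analysis mentioned in the last tip can be surprisingly simple. Below is a minimal sketch, assuming an observer has nothing but a list of Unix timestamps per account; the account data and the function names `hourly_profile` and `cosine_similarity` are made up for illustration and are not taken from any real tool.

```python
from collections import Counter
from datetime import datetime, timezone
from math import sqrt


def hourly_profile(timestamps):
    """Return a 24-element list: the fraction of activity in each UTC hour."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    total = sum(counts.values())
    if not total:
        return [0.0] * 24
    return [counts.get(hour, 0) / total for hour in range(24)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


HOUR = 3600
DAY = 24 * HOUR

# Hypothetical accounts: two are active around 08:00-09:00 UTC, one around 20:00 UTC.
main_account = [d * DAY + 8 * HOUR for d in range(10)] + [d * DAY + 9 * HOUR for d in range(5)]
throwaway = [d * DAY + 8 * HOUR + 600 for d in range(8)] + [d * DAY + 9 * HOUR + 60 for d in range(4)]
stranger = [d * DAY + 20 * HOUR for d in range(10)]

same = cosine_similarity(hourly_profile(main_account), hourly_profile(throwaway))
diff = cosine_similarity(hourly_profile(main_account), hourly_profile(stranger))
print(f"main vs throwaway: {same:.2f}, main vs stranger: {diff:.2f}")
```

A high score proves nothing on its own, since plenty of people share a timezone and a routine, but it narrows the candidates, and combined with writing quirks it narrows them further. Posting the throwaway at different times of day is the obvious countermeasure.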

    This is a public place, so people should be aware that this is a public place. They can still find this space useful, as with many other public places, but should be aware that the more they do on this platform, the easier it is to correlate their activity with a real-life identity.

    • GamingChairModel@lemmy.world · 13 upvotes · 1 year ago

      Thinking about this some more, I don’t mean to put everything on the user.

      The platform itself, through its design and architecture and settings, should also do stuff to make super detailed analysis more difficult:

      • Don’t log unnecessary metadata, such as views/visits, clicks, scrolls, time spent on specific posts, etc. Information that is never observed/logged can’t be shared/published.
      • Don’t share unnecessary information with other instances. For example, with an update to the protocol, an instance might be able to hide which local users voted for what in local threads, while maintaining the proper count internally of what the vote totals are, who has already voted, etc. Non-local users would have to have their votes publicly known, though.
      • Make the public nature of each action obvious. Make votes more obviously public through the interface (perhaps by allowing people to view who upvoted or downvoted). Make people’s comment history and like history easy to view within the native interface, so that people understand that the information isn’t private to begin with.
      • Commit to deletion in a public, auditable way. Let instance administrators know that being a good citizen on the fediverse requires adherence to norms about privacy and deletion, and have watchdogs publish stats on how long it takes for an instance to delete a comment or vote or whether it retains edit/delete history.
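
The second point above (hide who voted locally, keep the count correct) could look roughly like the sketch below. It is purely hypothetical: `VoteLedger` and its fields are invented for illustration, and this is neither Lemmy’s actual data model nor the format in which votes are federated today.

```python
from dataclasses import dataclass, field


@dataclass
class VoteLedger:
    """Per-post vote record that keeps local voters private (hypothetical design)."""

    # user -> +1 / -1; in this sketch these identities never leave the instance
    local_votes: dict = field(default_factory=dict)
    # user@instance -> +1 / -1; the voter's home instance already knows these
    remote_votes: dict = field(default_factory=dict)

    def vote(self, user, value, is_local):
        """Record or change a vote; one vote per user, so repeats overwrite."""
        if value not in (1, -1):
            raise ValueError("value must be +1 or -1")
        target = self.local_votes if is_local else self.remote_votes
        target[user] = value

    def has_voted(self, user):
        """Internal check used to prevent double voting."""
        return user in self.local_votes or user in self.remote_votes

    def federated_view(self):
        """What other instances would receive: totals, plus remote voter names only."""
        all_votes = list(self.local_votes.values()) + list(self.remote_votes.values())
        return {
            "upvotes": sum(1 for v in all_votes if v == 1),
            "downvotes": sum(1 for v in all_votes if v == -1),
            "named_voters": sorted(self.remote_votes),
        }


ledger = VoteLedger()
ledger.vote("alice", 1, is_local=True)
ledger.vote("alice", 1, is_local=True)  # a repeat vote does not double count
ledger.vote("bob", -1, is_local=True)
ledger.vote("carol@other.example", 1, is_local=False)
view = ledger.federated_view()
print(view)
```

The local names are needed internally to stop double voting, but only the totals have to cross the instance boundary. Other instances would have to take those totals on trust, which is the price of this design.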

      • lily33@lemmy.world · 12 upvotes · edited · 1 year ago

        That last point is completely impossible. Don’t forget that I don’t have to run the official lemmy software on my instance. I can make changes: for example, I can add a feature to my instance like “log every post in a separate, local database before deleting it from lemmy”. Nobody else but me will know this feature exists. Or (to be AGPL compliant) have a separate tool to regularly back up my lemmy database, undoing deletions.
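
To show how little it takes, here is a minimal sketch of that “log before deleting” feature using an in-memory SQLite database. The `posts` and `archive` tables and the `delete_post` function are invented for illustration; this is not Lemmy’s schema or code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts   (id INTEGER PRIMARY KEY, author TEXT, body TEXT);
    CREATE TABLE archive (id INTEGER, author TEXT, body TEXT);
    """
)


def delete_post(post_id):
    """Honor a delete request publicly, but keep a private copy first."""
    with conn:  # one transaction: copy to the archive, then delete
        conn.execute(
            "INSERT INTO archive SELECT id, author, body FROM posts WHERE id = ?",
            (post_id,),
        )
        conn.execute("DELETE FROM posts WHERE id = ?", (post_id,))


# Hypothetical post arriving from another instance, then deleted by its author.
conn.execute(
    "INSERT INTO posts (id, author, body) VALUES (?, ?, ?)",
    (1, "someone@example.social", "regrettable take"),
)
delete_post(1)

public_rows = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
private_rows = conn.execute("SELECT body FROM archive").fetchall()
print(public_rows, private_rows)
```

From the outside, the post is gone and the instance looks fully compliant. Nothing visible to other instances or to a watchdog distinguishes this from a real deletion.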

        As for the second point: I’d say making local votes private and non-local public will be worse for privacy due to causing confusion.

      • booty_flexx@lemmy.world (OP) · 3 upvotes · edited · 1 year ago

        Great points and ideas. I hope to see things like this introduced as the Lemmy project matures.

    • booty_flexx@lemmy.world (OP) · 7 upvotes · edited · 1 year ago

      Those are good practices if you have privacy concerns.

      “we’re just talking about custom interfaces to analyze public data”

      Semi-public. As it stands, only instance admins have access to per-user vote data. API users might too, but I’m not sure the Lemmy API has an endpoint that exposes per-user votes; I believe it just gives you a tally of the up/down votes on posts and comments, not who cast each vote. And most people don’t have the skill set to host their own instance and process the data into something meaningful and easy to digest.

      You could make the argument that semi-public is basically public, but I think there is some nuance to be explored:

      Once a site like Open Lemmy Stats launches, it becomes trivial for any user to query that data: who upvoted what, who downvoted what, when they voted, and so on.

      There’s a difference between something being available to people motivated enough to get it vs it reaching critical mass and being trivial to access by anyone with a browser. How the data is ultimately used, whether it is used nefariously or not, is going to be up to the people that access openlemmystats and what they wish to use it for.

      Which has me considering an analogy. Without expressly intending to make this political, please consider the statement “guns don’t kill people, people kill people”: “Openlemmystats doesn’t harass political dissenters! The people who use it do!” One could argue that openlemmystats wouldn’t do anything inherently bad; it’s the people who would use it. Just like with guns, there will likely be debate on whether the world would be better without openlemmystats, or whether we should start doing things to make it impossible for openlemmystats-alike sites to exist.

      That said, I mostly agree with you, and I appreciate your privacy suggestions/best practices, good stuff!

      Edit: for the record, I think “guns don’t kill people, people do” is a stupid statement, but I thought it was an interesting analogy. That is to say nothing of my feelings on gun control, I’m just not a fan of distilling complex issues into dismissive one line statements.