The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions, which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever-changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
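
In other words, a hard cap plus an unguarded failure path turned an oversized input file into a crash. A minimal Rust-flavored sketch of that failure mode (not Cloudflare’s published code; the 200-feature limit and the unwrap are both cited in the comments below):

    use std::fs;

    const MAX_FEATURES: usize = 200; // runtime limit, preallocated for performance

    fn load_features(path: &str) -> Result<Vec<String>, String> {
        let raw = fs::read_to_string(path).map_err(|e| e.to_string())?;
        let features: Vec<String> = raw.lines().map(str::to_owned).collect();
        if features.len() > MAX_FEATURES {
            return Err(format!("{} features exceeds limit of {}", features.len(), MAX_FEATURES));
        }
        Ok(features)
    }

    fn main() {
        // A doubled file trips the limit; unwrap() escalates that into a panic.
        let features = load_features("features.txt").unwrap();
        println!("{} features loaded", features.len());
    }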

  • MonkderVierte@lemmy.zip · 7 days ago

    Meaning an internal error, like the two prior ones.

    Almost like one big provider with 99.9999% availability is worse than ten with maybe 99.9% each.

    • Jason2357@lemmy.ca · 7 days ago

      Except, if you chose the wrong one of those ten and your company is the only one down for a day, you get fire-bombed. If “TEH INTERNETS ARE DOWN” and your website is down for a day, no one even calls you.

    • jj4211@lemmy.world · 7 days ago

      Note that this outage by itself, based on their chart, was kicking out errors over a span of about 8 hours. This one outage would have almost entirely blown their downtime allowance under a 99.9% availability criterion.

      If one big provider actually provided 99.9999%, that would be about 30 seconds of total outage over a typical year. Not even long enough for most users to be sure there was an “outage” at all. That wouldn’t be bad at all.
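
      A quick back-of-the-envelope check of those numbers (a sketch, assuming a 365-day year):

          fn main() {
              let year_secs = 365.0_f64 * 24.0 * 3600.0; // 31,536,000 s
              for (label, availability) in [("99.9%", 0.999_f64), ("99.9999%", 0.999999)] {
                  let budget = year_secs * (1.0 - availability); // allowed downtime
                  println!("{label}: {budget:.1} s/year (~{:.2} h)", budget / 3600.0);
              }
          }

      That gives roughly 8.76 hours of allowed downtime per year at 99.9%, and about 31.5 seconds at 99.9999%, matching both figures above.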

  • melsaskca@lemmy.ca · 7 days ago

    We are going to see a lot more of this type of bullshit now that there are no standards anymore. Fuck everything else and make that money, people!

  • falseWhite@lemmy.world · 7 days ago

    What are the chances they started using AI to automate some of this, and that’s the real reason? It sounds like no human was involved in breaking this.

    • stepintomydojo@sh.itjust.works · 7 days ago

      Zero for the triggering action. A human rolled out a permissions change in a database that led to an unexpected failure in a different system because that other system was missing some safety checks when loading the data (non-zero chance that code was authored in some way by AI).

  • mech@feddit.org · 7 days ago

    A permissions change in one database can bring down half the Internet now.

    • SidewaysHighways@lemmy.world · 7 days ago

      Certainly brought my Audiobookshelf to its knees when I decided that that LXC was gonna go ahead and be the Jellyfin server also.

    • CosmicTurtle0@lemmy.dbzer0.com · 7 days ago

      tbf IAM is the bastard child of many cloud providers.

      It exists to give CISOs and BROs the assurance that no one person has access to their infrastructure, and that if a company decides system A should no longer have access to system B, they can make that happen quickly.

      IAM is so complex now that it’s a field all in itself.

  • dan@upvote.au · 7 days ago

    When are people going to realise that routing a huge chunk of the internet through one private company is a bad idea? The entire point of the internet is that it’s a decentralized network of networks.

    • Jason2357@lemmy.ca · 7 days ago

      Someone always chimes into these discussions with the experience of being DDOSed and Cloudflare being the only option to prevent it.

      Sounds a lot like a protection racket to me.

  • Nighed@feddit.uk · 7 days ago

    Somewhere, the dev who was told that clustered databases in nonprod were too expensive and not needed is now updating the deploy scripts.

    • choopeek@lemmy.world · 7 days ago

      Sadly, in my case, even after almost destroying a production cluster, they still decided a test cluster is too expensive and they’ll just live with the risk.

  • ranzispa@mander.xyz · 7 days ago

    Before today, ClickHouse users would only see the tables in the default database when querying table metadata from ClickHouse system tables such as system.tables or system.columns.

    Since users already have implicit access to underlying tables in r0, we made a change at 11:05 to make this access explicit, so that users can see the metadata of these tables as well.

    I’m no expert, but this feels like something you’d need to ponder very carefully before deploying. You’re basically changing the result of all queries to your DB. I don’t work there, but I’m sure in plenty of places in the codebase there’s a bunch of “query this and pick column 5 from the result”.
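
    For example (a hypothetical sketch of the pattern this worries about, not code from the article), any consumer that trusts the shape and row set of a metadata query breaks the moment extra databases become visible:

        // Hypothetical: a row from a metadata query, consumed by position.
        fn fifth_column(row: &[String]) -> &str {
            // “pick column 5 from the result”: duplicate rows from a newly
            // visible database, or a changed column order, break this silently.
            &row[4]
        }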

  • Zwuzelmaus@feddit.org · 7 days ago

    a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size.

    Isn’t Cloudflare also offering bot prevention as a service?

    Imagine if the number of bots suddenly doubled…

    And they’re already on their knees?

    Muuuhahahahaaa…

    • dan@upvote.au · 7 days ago

      Did you read the article? It wasn’t taken down by the number of bots, but by the number of columns:

      In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features. Again, the limit exists because for performance reasons we preallocate memory for the features.

      When the bad file with more than 200 features was propagated to our servers, this limit was hit — resulting in the system panicking.

      They had some code to get a list of the database columns in the schema, but it accidentally wasn’t filtering by database name. This worked fine initially because the database user only had access to one DB. When the user was granted access to another DB, it started seeing way more columns than it expected.
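
      Roughly, the fix is a one-line scoping condition on the metadata query. A sketch based on the post-mortem’s description (not Cloudflare’s actual SQL; the table and database names are illustrative):

          fn main() {
              // Buggy: no database filter, so rows duplicate once a second
              // database (r0) becomes visible to the querying user.
              let buggy = "SELECT name, type FROM system.columns \
                           WHERE table = 'http_requests_features'";

              // Fixed: scope the metadata query to a single database.
              let fixed = "SELECT name, type FROM system.columns \
                           WHERE table = 'http_requests_features' \
                             AND database = 'default'";

              println!("before: {buggy}\nafter:  {fixed}");
          }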

  • panda_abyss@lemmy.ca · edited · 7 days ago

    Classic example of how dangerous rust is.

    If they had just used Python and ran the whole thing in a try block with bare except this would have never been an issue.

    Edit: this was a joke, and not well done. I thought the foolishness would come through.

    • Thallium_X@feddit.org · 7 days ago

      As a next step they should have wrapped everything in a while(true) loop so it automatically restarts and the program never dies.

    • Zwuzelmaus@feddit.org · 7 days ago

      So you think there is no error handling possible in Rust?

      Wait until you find out that Python doesn’t write the error handling by itself either…

    • dan@upvote.au · 7 days ago

      This can happen regardless of language.

      The actual issue is that they should be canarying changes. Push them to a small percentage of servers, and ensure nothing bad happens before pushing them more broadly. At my workplace, config changes are automatically tested on one server, then an entire rack, then an entire cluster, before fully rolling out. The rollout process watches the core logs for things like elevated HTTP 5xx errors.
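
      A minimal sketch of that idea (the stage sizes, threshold, and every name here are made up):

          // Staged rollout: deploy to progressively larger slices of the fleet,
          // checking the error rate after each stage and aborting on regression.
          fn rollout(stages: &[usize], error_rate_after: impl Fn(usize) -> f64) -> Result<(), String> {
              const MAX_5XX_RATE: f64 = 0.01; // abort if more than 1% of requests fail
              for &servers in stages {
                  let rate = error_rate_after(servers); // deploy + watch logs for this slice
                  if rate > MAX_5XX_RATE {
                      return Err(format!("aborted at {servers} servers: 5xx rate {rate:.3}"));
                  }
              }
              Ok(())
          }

          fn main() {
              // One server, then a rack, then a cluster, then everything.
              match rollout(&[1, 40, 1_000, 50_000], |_| 0.002) {
                  Ok(()) => println!("rollout complete"),
                  Err(e) => eprintln!("{e}"),
              }
          }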

    • jimmy90@lemmy.world · 7 days ago

      Honestly, this was a coding cock-up. There’s a code snippet in the article that unwraps on a Result, which you don’t do unless you’re fine with that part of the code crashing.

      I think they are turning linters back up to max and rooting through all their Rust code as we speak.

    • SinTan1729@programming.dev · edited · 7 days ago

      I hope you’re joking. If anything, Rust makes error handling easier by returning errors as values using the Result monad. As someone else pointed out, they literally used unwrap in their code, which basically means “panic if this ever returns an error”. You don’t do this unless it’s impossible to handle the error inside the program, or if panicking is the behavior you want for, e.g., security reasons.

      Even as an absolute amateur, whenever I post any Rust to the public, the first thing I do is get rid of unwrap as much as possible, unless I intentionally want the application to crash. Even then, I use expect instead of unwrap to have some logging. This is definitely the work of some underpaid intern.
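
      For illustration (a toy sketch, not the code from the article):

          use std::fs;

          fn main() {
              // unwrap(): panics with an opaque message if the file is missing.
              // let cfg = fs::read_to_string("features.txt").unwrap();

              // expect(): still panics, but with context, so the crash is diagnosable.
              // let cfg = fs::read_to_string("features.txt").expect("feature file must exist");

              // Handling the Result keeps the process alive:
              match fs::read_to_string("features.txt") {
                  Ok(cfg) => println!("loaded {} bytes", cfg.len()),
                  Err(e) => eprintln!("could not load feature file: {e}; keeping last good config"),
              }
          }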

      Also, Python is sloooowwww.

        • SinTan1729@programming.dev · 6 days ago

          Ah, that makes sense. To be fair tho, there’s a lot of unwarranted hate towards Rust, so it can be hard to tell.

          • panda_abyss@lemmy.ca · 6 days ago

            I should bite the bullet and learn it.

            I decided to learn Zig recently; it feels like crafting artisanal software, which is what I liked C for. But it’s kinda janky in that major features come and go with each point version (see io and async).

            There’s a place for engineering software, which is what Rust seems great at. It definitely seems like a tool I could/would use, as Rust is taking over many of my tool workflows.