• mindbleach@lemmy.world
    2 years ago

    The PS3 had a 128-bit CPU. Sort of. “Altivec” vector processing could split each 128-bit word into several values and operate on them simultaneously. So for example if you wanted to do 3D transformations using 32-bit numbers, you could do four of them at once, as easily as one. It doesn’t make doing one any faster.
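
    A rough sketch of the lane idea, in Python rather than real AltiVec code: a 128-bit word is modeled as one big integer holding four 32-bit lanes. The names `pack`, `unpack`, and `lane_add` are made up for illustration, and the loop over lanes stands in for what the hardware would do in a single step.

```python
LANES = 4                      # four values per 128-bit word
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four unsigned 32-bit values into one 128-bit integer (lane 0 is lowest)."""
    assert len(values) == LANES
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def lane_add(a, b):
    """Add two packed words lane by lane; each lane wraps on its own,
    so a carry never leaks into the neighboring lane."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])
```

    For example, `unpack(lane_add(pack([1, 2, 3, 0xFFFFFFFF]), pack([10, 20, 30, 1])))` gives `[11, 22, 33, 0]`: four independent additions, with the last lane wrapping around instead of spilling over.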

    Vector processing is present in nearly every modern CPU, though. Intel’s had it since the late 90s with MMX and SSE. Those just had to load registers 32 bits at a time before performing each single-instruction-multiple-data operation.

    The benefit of increasing bit depth is that you can move that data in parallel.

    The downside of increasing bit depth is that you have to move that data in parallel.

    To move a 32-bit number between places in a single clock cycle, you need 32 wires between two places. And you need them between any two places that will directly move a number. Routing all those wires takes up precious space inside a microchip. Indirect movement can simplify that diagram, but then each step requires a separate clock cycle. Which is fine - this is a tradeoff every CPU has made for thirty-plus years, as “pipelining.” Instead of doing a whole operation all-at-once, or holding back the program while each instruction is being cranked out over several cycles, instructions get broken down into stages according to which internal components they need. The processor becomes a chain of steps: decode instruction, fetch data, do math, write result. CPUs can often “retire” one instruction per cycle, even if instructions take many cycles from beginning to end.
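
    To put numbers on that, here is a back-of-the-envelope sketch, assuming an idealized pipeline where every stage takes exactly one cycle and nothing ever stalls (real CPUs are messier). The function names are illustrative.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction occupies the whole processor for n_stages cycles."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """The first instruction takes n_stages cycles to come out the far end;
    after that, one instruction retires every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)
```

    With the four stages named above (decode, fetch, math, write), 1000 instructions come to 4000 cycles unpipelined and 1003 pipelined, even though any single instruction still takes 4 cycles from beginning to end.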

    To move a 128-bit number between places in a single clock cycle, you need an obscene amount of space. Each lane is four times as wide and still has to go between all the same places. This is why 1990s consoles and graphics cards might advertise 256-bit interconnects between specific components, even for mundane 32-bit machines. They were speeding up one particular spot where a whole bunch of data went a very short distance between a few specific places.

    Modern video cards no doubt have similar shortcuts, but that’s no longer the primary way they perform ridiculous quantities of work. Mostly they wait.

    CPUs are linear. CPU design has sunk eleventeen hojillion dollars into getting instructions into and out of the processor, as soon as possible. They’ll pre-emptively read from slow memory into layers of progressively faster memory deeper inside the microchip. Having to fetch some random address means delaying things for agonizing microseconds with nothing to do. That focus on straight-line speed was synonymous with performance, long after clock rates hit the gigahertz barrier. There’s this Computer Science 101 concept called Amdahl’s Law that was taught wrong as a result of this - people insisted ‘more processors won’t work faster,’ when what it said was, ‘more processors do more work.’

    Video cards wait better. They have wide lanes where they can afford to, especially in one fat pipe to the processor, but to my knowledge they’re fairly conservative on the inside. They don’t have hideously-complex processors with layers of exotic cache memory. If they need something that’ll take an entire millionth of a second to go fetch, they’ll start that, and then do something else. When another task stalls, they’ll get back to the other one, and hey look the fetch completed. 3D rendering is fast because it barely matters what order things happen in. Each pixel tends to be independent, at least within groups of a couple hundred to a couple million, for any part of a scene. So instead of one ultra-wide high-speed data-shredder, ready to handle one continuous thread of whatever the hell a program needs next, there’s a bunch of mundane grinders being fed by hoppers full of largely-similar tasks. It’ll all get done eventually. Adding more hardware won’t do any single thing faster, but it’ll distribute the workload.
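
    A toy model of that "start the fetch, then do something else" behavior. It assumes each task is one fetch followed by some compute, that fetches cost nothing to issue, and that any number can be in flight at once. None of this is how a real GPU scheduler is built; it only shows where the time goes.

```python
def run_blocking(tasks):
    """tasks: list of (fetch_latency, compute_cycles).
    Sit idle through every fetch before computing on its result."""
    return sum(latency + work for latency, work in tasks)


def run_interleaved(tasks):
    """Issue every fetch at cycle 0, then compute on whichever task is ready."""
    clock = 0
    # Sorting by fetch latency gives the order in which results arrive.
    for ready_at, work in sorted(tasks):
        clock = max(clock, ready_at)   # idle only if nothing has arrived yet
        clock += work                  # one grinder, one task at a time
    return clock
```

    Eight tasks with a 100-cycle fetch and 10 cycles of work take 880 cycles when each fetch blocks, and 180 when the waits overlap. A single task takes 110 cycles either way, which is the point about not doing any one thing faster.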

    Video cards have recently been pushing the ability to go back to 16-bit operations. It lets them do more things per second. Parallelism has finally won, and increased bit depth is mostly an obstacle to that.

    So what 128-bit computing would look like is probably one core on a many-core chip. Like how Intel does mobile designs, with one fat full-featured dual-thread linear shredder, and a whole bunch of dinky little power-efficient task-grinders. Or… like a Sony console with a boring PowerPC chip glued to some wild multi-phase vector processor. A CPU that they advertised as a private supercomputer. A machine I wrote code for during a college course on machine vision. And it also plays Uncharted.

    The PS3 was originally intended to ship without a GPU. That’s part of its infamous launch price. They wanted a software-rendering beast, built on the Altivec unit’s impressive-sounding parallelism. This would have been a great idea back when TVs were all 480p and games came out on one platform. As HDTVs and middleware engines took off… it probably would have killed the PlayStation brand. But in context, it was a goofy path toward exactly what we’re doing now - with video cards you can program to work however you like. They’re just parallel devices pretending to act linear, rather than the other way around.

    • vrighter@discuss.tchncs.de
      2 years ago

      slight correction. vector processing is available on almost no common architectures. What most architectures have is SIMD instructions. Which means that code that was written for sse2 cannot and will not ever make use of the wider AVX-512 registers.

      The risc-v isa is going towards the vector processing route. The same code works on machines with wide vector registers, or ones with no real parallel ability, but will simply loop in hardware.

      Simd code running on a newer cpu with better simd capabilities will not run any faster. Unmodified vector code on a better vector processor will run faster.
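
      A sketch of the difference, with Python standing in for machine code. `add_fixed_simd4` is hard-wired to 4 lanes, the way SSE-era code is hard-wired to its register width; `add_vector` asks for a vector length on each pass, loosely modeled on the RISC-V approach. `hw_lanes` and the step counter are illustrative assumptions, with steps as a crude proxy for instructions issued.

```python
def add_fixed_simd4(a, b):
    """Code written for exactly 4 lanes: it steps by 4 no matter how wide
    the hardware underneath actually is."""
    out, steps = [], 0
    for i in range(0, len(a), 4):
        out.extend(x + y for x, y in zip(a[i:i + 4], b[i:i + 4]))
        steps += 1
    return out, steps


def add_vector(a, b, hw_lanes):
    """Vector-style loop: each pass takes as many elements as the hardware
    offers. hw_lanes must be at least 1."""
    out, steps, i = [], 0, 0
    while i < len(a):
        vl = min(hw_lanes, len(a) - i)   # stand-in for a "set vector length" instruction
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
        steps += 1
    return out, steps
```

      On 64 elements, `add_fixed_simd4` always takes 16 steps. The same unmodified `add_vector` takes 64 steps with `hw_lanes=1`, 16 with `hw_lanes=4`, and 4 with `hw_lanes=16`, and produces the same result each time.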

      • mindbleach@lemmy.world
        2 years ago

        Fancier tech co-opting an existing term doesn’t make the original use wrong.

        Any parallel array operation in hardware is vector processing.

        • vrighter@discuss.tchncs.de
          2 years ago

          fancy, vector processing predated simd. It’s how cray supercomputers worked in the 90s. You’re the one co-opting an existing term :)

          And it is in fact a big deal, with several advantages and disadvantages to both.

            • vrighter@discuss.tchncs.de
              2 years ago

              from the very first paragraph in the page:

              a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR Arithmetic Units.

              Where it pretty much states that scalar processors with simd instructions are not vector processors. Vector processors work on large 1 dimensional arrays. Call me crazy, but I wouldn’t call a register with 16 32-bit values a “large” vector.

              It also states they started in the 70s. That checks out. Which dates were you referring to?

              • mindbleach@lemmy.world
                2 years ago

                This is rapidly going to stop being a polite interaction if you can’t remember your own claims.

                SIMD predates the term vector processing, and was in print by 1966.

                Vector processing is at least as old as the Cray-1, in 1975. It was already automatically parallelizing what would’ve been loops on prior hardware.

                Hair-splitting about whether a processor can use vector processing or exclusively uses vector processing is a distinction that did not exist at the time and does not matter today. What the overwhelming majority of uses refer to is basically just SIMD extensions. Good luck separating the two when SIMT is a thing.

                • vrighter@discuss.tchncs.de
                  2 years ago

                  I’m not hair splitting over whether they can or not. scalar processors with simd cannot do vector processing, because vector processing is not simd.

                  yes an array of values can be called a vector in a lot of contexts. I could also say that vector processing involves dynamically allocated arrays, since that’s what c++ calls them. A word can be used in multiple contexts. When the word vector is used in the term “vector processor” it specifically excludes scalar processors with simd instructions. It refers to a particular architecture of machine. Just being able to handle a sequence of numbers is not enough. Simd can do it, as can scalar processors (one at a time, but they still handle “an array of numbers”). You can’t even say that they necessarily have to execute more than one at a time. A superscalar processor without simd can do that as well.

                  A vector processor is a processor specifically designed to handle large lists. And yes, I do consider gpus to be vector processors (exact same shader running on better vector hardware, does run faster.) They are specifically designed for it. simd on a scalar processor is just… not

                  • mindbleach@lemmy.world
                    2 years ago

                    A word can be used in multiple contexts.

                    Says user insisting an umbrella term has one narrow meaning.

                    A meaning that would include the SoundBlaster 32.

    • lte678@feddit.de
      2 years ago

      I am unsure about the historical reasons for moving from 32-bit to 64-bit, but wouldn’t the address space be a significantly larger factor? Like you said, CPUs have had vectoring instructions for a long time, and we wouldn’t move to 128-bit architectures just to be able to compute with numbers of that size. Memory bandwidth is, also as you say, limited by the bus widths and not the processor architecture. IMO, the most important reason that we transitioned to 64-bit is the larger address space without having to use stupidly complex memory mapping schemes. There are also some types of numbers like timestamps and counters that profit from 64-bit, but even here I am not sure if the complex architecture would yield a net slowdown or speedup.

      To answer the original question: 128 bits would have no helpful benefit for the address space (already massive) and probably just slow everyday calculations down.

      • mindbleach@lemmy.world
        2 years ago

        8-bit machines didn’t stop dead at 256 bytes of memory. Address length and bus width are completely independent. 1970s machines were often built with bit-slice memory, with however many bits of addressing, and one-bit output. If you wanted 8-bit memory then you’d wire eight chips in parallel - with the same address lines. Each chip would deliver a different part of the same logical byte.
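
        A small model of that wiring, assuming idealized chips with no timing or refresh concerns. `OneBitChip` and `ByteMemory` are invented names; the point is only that all eight chips see the same address and each supplies one bit of the byte.

```python
class OneBitChip:
    """A memory chip with many addresses but a single data bit at each one."""

    def __init__(self, address_bits):
        self.cells = [0] * (1 << address_bits)

    def write(self, address, bit):
        self.cells[address] = bit & 1

    def read(self, address):
        return self.cells[address]


class ByteMemory:
    """Eight one-bit chips wired in parallel to the same address lines."""

    def __init__(self, address_bits):
        self.chips = [OneBitChip(address_bits) for _ in range(8)]

    def write(self, address, value):
        # Chip i stores bit i of the byte.
        for i, chip in enumerate(self.chips):
            chip.write(address, (value >> i) & 1)

    def read(self, address):
        # Reassemble the logical byte from one bit per chip.
        return sum(chip.read(address) << i for i, chip in enumerate(self.chips))
```

        With `ByteMemory(16)`, an 8-bit-wide memory spans 65,536 addresses: the data width comes from the number of chips, the capacity from the number of address lines.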

        64-bit math doesn’t need 64-bit hardware, either. Turing completeness says any computer can run the same code - memory and time allowing. As a concrete example, JavaScript exclusively used 64-bit double floats, even when it was defined in the late 1990s, and ran exclusively on 32-bit machines.
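
        A minimal sketch of the same principle with integers rather than doubles: a 64-bit add built from 32-bit pieces. Python integers are unbounded, so the masks here imitate 32-bit registers, and the `carry` variable plays the role a hardware carry flag would. The helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def split(x):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return x & MASK32, (x >> 32) & MASK32


def join(lo, hi):
    """Recombine (low, high) 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo


def add64_with_32bit_ops(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit numbers using two 32-bit additions plus a carry."""
    lo = a_lo + b_lo
    carry = lo >> 32                       # 1 if the low halves overflowed
    lo &= MASK32
    hi = (a_hi + b_hi + carry) & MASK32    # wraps like a 64-bit register would
    return lo, hi
```

        For example, `join(*add64_with_32bit_ops(*split(0x1FFFFFFFF), *split(1)))` gives `0x200000000`. It works, at the cost of two additions and some carry bookkeeping where a 64-bit machine needs one instruction.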

        • lte678@feddit.de
          2 years ago

          Clearly you can address more bytes than your data bus width. But then why all the “hacks” on 32-bit architectures? Like the 36-bit address bus via memory mapping on SPARCv8 instead of using paired index registers (or ARMv7 with LPAE). From a performance perspective, using an address width that is not the native register width/internal data bus width is an issue. For a significant subset of operations multiple instructions are required instead of one.

          Also is your comment about Turing completeness to be taken seriously? We are talking about performance and practicality. Go ahead and crunch on some 64-bit floats using purely 8-bit arithmetic operations (or even using vector registers). Of course you can, but the point is that a suitable word size is more effective for certain computational tasks. For operations that are done frequently, they should ideally be done at native data-bus width. Vectored operations will also cost performance.

          • mindbleach@lemmy.world
            2 years ago

            If timestamps and counters represent a bottleneck, you have problems larger than bit depth.

            • lte678@feddit.de
              2 years ago

              Indeed, because those two things were only examples, meaning they would be indicative of your system having a bottleneck in almost all types of workloads. Supported by the generally higher performance in 64-bit mode.