
  • evilpaul666 - Monday, August 19, 2019 - link

    Isn't there a move towards having DRAM contents encrypted? If you bit-shift ciphertext don't you just get garbage?
  • Ian Cutress - Monday, August 19, 2019 - link

    Very good point
  • Rοb - Tuesday, August 20, 2019 - link

    Not with homomorphic encryption: https://en.wikipedia.org/wiki/Homomorphic_encrypti...
  • DanNeely - Wednesday, August 21, 2019 - link

    Isn't that still basically limited to research/very targeted small scale use because it has a massive power/performance penalty compared to working on non-encrypted data?
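
    A toy illustration of the homomorphic property in question (and explicitly not how encrypted-DRAM schemes like SME/TME work - those are not homomorphic): textbook RSA is multiplicatively homomorphic, so multiplying two ciphertexts gives a ciphertext of the product of the plaintexts. A minimal sketch with a tiny, insecure demo key:

    ```c
    #include <stdio.h>
    #include <stdint.h>

    /* Modular exponentiation: b^e mod n (all values small enough for uint64_t). */
    static uint64_t powmod(uint64_t b, uint64_t e, uint64_t n) {
        uint64_t r = 1;
        b %= n;
        while (e) {
            if (e & 1) r = r * b % n;
            b = b * b % n;
            e >>= 1;
        }
        return r;
    }

    int main(void) {
        const uint64_t n = 3233, e = 17, d = 2753;   /* textbook RSA, p=61, q=53 */
        uint64_t a = 5, b = 7;
        uint64_t ca = powmod(a, e, n);               /* encrypt a                */
        uint64_t cb = powmod(b, e, n);               /* encrypt b                */
        uint64_t cprod = ca * cb % n;                /* compute on ciphertexts   */
        printf("decrypt(E(a)*E(b)) = %llu, a*b = %llu\n",
               (unsigned long long)powmod(cprod, d, n),
               (unsigned long long)(a * b));         /* both print 35            */
        return 0;
    }
    ```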
  • Memo.Ray - Monday, August 19, 2019 - link

    I was thinking about this in the opposite direction: the security implications of this new feature. Now you can change the data directly in memory - another attack vector that you need to protect against.
  • Alexvrb - Monday, August 19, 2019 - link

    Awesome now we can have side octa channel attacks!
  • imaskar - Tuesday, August 20, 2019 - link

    Depends on the use case. If you're selling VMs, then yes. If it's your supercomputing cluster, then no - why would you encrypt it?
  • Elstar - Wednesday, August 21, 2019 - link

    The problem is not insurmountable. As long as the DPU has the key AND the key is not extractable, then encrypted memory should work with PIM DRAM.
  • TomWomack - Thursday, August 29, 2019 - link

    Not in the high-performance computing space, which is the only place where PIM processing seems plausible; the threat models have really diverged between 'protect this phone which will be running Javascript from arbitrary websites from Ruritanian customs agents with physical access to it' and 'protect this compute cluster which runs only executables compiled on the head-node by trusted users'
  • Dodozoid - Monday, August 19, 2019 - link

    IMP for In-Memory Processing sounds way better than PIM for Processing In Memory. Hope we can some day get to PARALLEL in-memory processing for an even merrier abbreviation.
  • azazel1024 - Wednesday, August 21, 2019 - link

    I prefer General In-Memory Processing. Just be careful of Last-level In-Memory Processing. I hear it can handicap your system if implemented wrong.
  • winkingchef - Monday, August 19, 2019 - link

    (This needs a #HotChips keyword tag)

    IMO this model of computing is the way the industry needs to go (pushing bits around from storage/memory to compute is wasteful of power).

    HOWEVER, I also believe the adoption of this technology will be held back by the current mechanical/thermal assumptions around DRAM DIMMs, which also drive the electricals (spacing DIMMs out creates a need for higher drive strength from the full rank of them to the CPU). Someone will need to take a risk to adopt this on their server architecture.
  • Threska - Monday, August 19, 2019 - link

    Basically diffusing a CPU into the space of a memory stick. Might be better to take the most used operations* in code and put those there.

    *Not to be confused with instructions. Too fine grained.
  • Elstar - Monday, August 19, 2019 - link

    So many questions: what’s the security model? What’s the coherency model? How does virtual memory or virtualization in general interact with this? What happens when one DPU program needs data outside of its 64 MiB region?
  • name99 - Saturday, August 24, 2019 - link

    Yeah, these two issues (security and concurrency) are the immediate sticking points, and saying that a C-library hides them doesn't help much.

    An additional problem that hasn't been raised is the issue of physical address bit swizzling. Obviously this work happens at the physical address level, not virtual address; but it's worse than that. Right now physical address bits are rearranged in the memory to maximize DRAM utilization across all possible channels. So very low bits will target different memory controllers, then slightly less low bits will target ranks, and so on.
    Meaning that I don't understand how this "No DPU sharing" provides real value. If DPUs can only access their own data with no DPU cross-traffic, then you're limited in how much damage you can do (but you're also massively limited in how much of value you can do...). But if DPUs can write to each other (necessary if you're going to support blits and memcopy, which would seem to be the first-order win), then who cares about this "no sharing" - how does it help?

    Onur Mutlu's solution is much less sophisticated, but also seems a lot more practical. All that solution does is perform on-DRAM blits and memcopies, in essence by allowing the CPU to send instructions through the memory controller saying "copy the DRAM page at address A to the DRAM page at address B". This gets you most of the win while providing no security weirdness, and a very obvious chokepoint for how you perform the virtual to (controller, rank, bank, page) mapping, rather than being a nightmare for generic code.
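
    To make the swizzling point concrete, here is a toy decode of consecutive physical addresses under an assumed, made-up interleaving (real controllers use BIOS-dependent and often hashed mappings), showing why physically consecutive cache lines don't land in one chip's 64 MB region:

    ```c
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Assumed toy mapping: bit 6 selects the channel, bits 7-9 the bank,
           the upper bits the row. Purely illustrative. */
        for (uint64_t pa = 0; pa < 4 * 64; pa += 64) {   /* 4 consecutive cache lines */
            unsigned channel = (unsigned)((pa >> 6) & 0x1);
            unsigned bank    = (unsigned)((pa >> 7) & 0x7);
            uint64_t row     = pa >> 16;
            printf("PA 0x%04llx -> channel %u, bank %u, row %llu\n",
                   (unsigned long long)pa, channel, bank, (unsigned long long)row);
        }
        return 0;   /* adjacent lines alternate between channels and banks */
    }
    ```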
  • abufrejoval - Monday, August 19, 2019 - link

    When I invented the concept perhaps ten years ago, I called it ‘computing memory’ and I was so glad I found no matches for it on Google. But as with many other of my ideas, plenty of others had already been there and actually done something about it, too.

    I was somewhat inspired by the Weitek floating-point co-processor, an Intel i387 competitor which used a 64K memory segment while it only had perhaps 32 registers: it used the least significant address bits to encode one operation while writing the data, perhaps another while reading it. It very neatly solved the memory bottleneck between the CPU and the co-processor at the time, offering something like 4x the performance of Intel’s own.

    Content-addressable memory and the Linda tuple space were other inspirations, and I always wanted SQL-RAM, where I’d put the WHERE clause into the address bits and then only retrieve matching data 😊.

    I liked Micron's Automata Processor a lot, because it was a nice programming paradigm, a good balance between problem-solving power and logic overhead, excellent for formal proofs, and unassailable by the likes of return-oriented programming. Unfortunately they shelved that project.

    HBM memory chip stacks offer free real estate below the DRAM cells on the silicon die carrier, much like offices on the ground floor of a high-rise building, with lots of elevators in the form of through-silicon vias (TSVs). Even if the die carrier were manufactured at a lower density, you’d have ample space for some simple logic at bandwidths much bigger than what the CPU sees behind the multiplexing memory bus. I believe SK Hynix was at one point begging for engineers to come forward with ideas for what to put there.

    When the HP Memristor was supposed to offer 1000 layers and density at linear cost, it became very clear that general purpose CPUs simply wouldn’t be able to take advantage of that, much like a Morris Minor with 1000 BHP.

    UPMEM is French, I believe; I’ve heard about them for a long time but have never seen a working chip yet. But computing memory is one of the few escape routes out of the von Neumann bottleneck.
  • abufrejoval - Monday, August 19, 2019 - link

    Another idea was to expand on the row buffer also used for refresh: use dual or tertiary buffers and a 'row ALU' that would operate on the entire memory row as a whole, e.g. for matching bit or byte patterns, then use some address bits to select transparent or operational access and to write/modify the ALU row buffers.
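
    Purely as a thought experiment of what that might look like to software (the address bit and the semantics below are invented for illustration; no real DRAM exposes such an interface):

    ```c
    #include <stdint.h>

    /* Hypothetical: the row is aliased twice in the physical map, and one
       address bit selects whether a write is a plain store or a row-ALU op. */
    #define ROW_ALU_SELECT (1ULL << 35)   /* hypothetical "operate" address bit */

    static inline void row_alu_match(volatile uint8_t *row, uint8_t pattern) {
        /* Writing through the alias would ask the row ALU to compare 'pattern'
           against every byte of the open row instead of storing it. */
        volatile uint8_t *alias =
            (volatile uint8_t *)((uintptr_t)row | ROW_ALU_SELECT);
        *alias = pattern;
    }
    ```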
  • SaberKOG91 - Monday, August 19, 2019 - link

    You didn't invent the idea: https://ieeexplore.ieee.org/document/592312

    I'm sure there are even earlier examples of this, but Patterson et al. were pretty prophetic in this arena.
  • abufrejoval - Tuesday, August 20, 2019 - link

    Of course, I invented it, and all on my own, too!

    But as I mentioned (with a bit of irony between the lines) and as you noticed, I just didn't invent it first, nor did I manage to put it into a physical implementation.

    It's rather obviously the only open avenue of escape so idea duplication is natural.

    I am actually more astonished at the lack of adoption over the last years, but silicon economics is still a game of scale.

    But perhaps more importantly, those companies whose rising computing-power vs. value-of-compute ratio is under the strongest pressure (GAFA/BATX) have found ways to push the energy cost out onto the client devices.
  • SaberKOG91 - Tuesday, August 20, 2019 - link

    It has nothing to do with adoption or scale. The materials science for modern memory technologies took a long time to catch up; some of it didn't even exist until around a decade ago, and it only caught up because CMOS process tech lagged and slowed down the progress of everything else.

    It isn't selfish to push computing to client devices. These devices are faster to adopt optimized accelerators because they are now mostly mobile and battery-restricted, which saves a tremendous amount of power overall, not just in the data center.
  • abufrejoval - Tuesday, August 20, 2019 - link

    How can it not be selfish when Facebook, Amazon and Google save data transmission capacity and compute power and use your phone and browser to mine all the personal information they sell or use?

    They make you the product and have you pay for it, too!

    Quite the opposite of altruism in my book. And completely unethical as well.
  • SaberKOG91 - Tuesday, August 20, 2019 - link

    The amount of power consumed by client devices and telecoms far far far exceeds data center power consumption. If you can optimize at the client level, you can save way more energy than anything you can do in the datacenter. This is why we are seeing more and more special accelerators in consumer electronics when those same accelerators aren't as prevalent in the datacenter. That's an industry trend as a whole and has nothing to do with FB, Google, or Amazon specifically.

    And for crying out loud, you don't get to complain about what they do with your data when you aren't paying for their services and still choose to use them. There's no such thing as a free lunch. There are plenty of alternative services to anything they offer that protect your privacy and give you more control over your data. It will cost more and will be less convenient, but if you care that much, surely you'll pay the cost?
  • abufrejoval - Tuesday, August 27, 2019 - link

    Sure, this is an industry trend as a whole, but yes, it has everything to do with FB, Google and Amazon specifically: they are the ones driving it, and they do it because they couldn't afford to spy as deeply on the unaware if they had to foot the energy bill.

    And yes, you have every right to complain, because they aren't telling you what they do and how they are making you pay for the new phone with the NN accelerator and for the energy, while they reap their profits from the insights they obtain from you.

    Consumers in Europe have a right to be uneducated, even stupid, and still not be abused. I understand North Americans tend to believe it's OK to exploit the innocent and unaware, but that's why we need to apply the ground rules to the clouds and bleed Wild-West data cowboys until they faint or bow to reason or the liege.

    We have such a rich history of punishment here in Europe; time to remember the Circus Maximus and the fun we've had since ;-)
  • bfredd9 - Friday, February 19, 2021 - link

    The basic idea of using a DRAM process for ALU computation was already exploited in the late '80s for embedded video processing: the SVP (Scan-line Video Processor), a general-purpose video processor.

    The SVP achieved a fast processing rate exceeding standard DSPs by integrating 1024 PEs (Processing Elements). 50 MHz operation of each PE in a SIMD (Single Instruction Multiple Data) scheme is realized with two-stage pipelines in the IG (Instruction Generator) and five-stage pipelines in the PE core. With a 20 ns DRAM cycle in each PE and the system clock generated through a PLL, the SVP enables full-spec EDTV2 (the second-generation Enhanced Definition Television in Japan).

    The problem, then as now, was not to replace conventional processors but to find the niche applications where the effective performance gains are real.
  • Chrishnaw - Monday, August 19, 2019 - link

    Would adding ECC to the mix complicate this at all, or would the in-memory processing be completely unaffected by ECC?

    Will this ever come to the consumer space, or is this strictly for enterprise computing?
  • KAlmquist - Thursday, August 22, 2019 - link

    The DIMM shown has 16 chips. To support ECC would require 18 chips; 16 to hold the data being stored and 2 to hold the error correction codes.

    It would certainly be possible to build a DIMM using 18 of their chips, but you couldn't do much in the way of computations with ECC enabled. The problem is updating the error correction codes when the memory data changes. The chips don't communicate with each other, so it is not possible to calculate the updated error correction codes from scratch; instead they have to be calculated using only the existing error correction codes. That means that the only operations that can be performed on ECC memory would be exclusive or and setting memory to a known constant value.
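
    The "exclusive or only" restriction follows from the fact that common ECC schemes are linear over XOR: each chip (data or check) can apply the same XOR delta locally and the codeword stays consistent, without the chips ever seeing each other's contents. A minimal sketch with a toy Hamming-style check code (the real SECDED code differs, but the linearity argument is the same):

    ```c
    #include <stdio.h>
    #include <stdint.h>

    /* Toy linear check code: XOR together the positions of all set data bits. */
    static uint8_t check_bits(uint64_t data) {
        uint8_t c = 0;
        for (int pos = 0; pos < 64; pos++)
            if ((data >> pos) & 1)
                c ^= (uint8_t)pos;
        return c;
    }

    int main(void) {
        uint64_t d     = 0x0123456789abcdefULL;   /* word held across the data chips */
        uint64_t delta = 0x00000000ffff0000ULL;   /* in-place XOR applied by a DPU   */
        /* Linearity: the check chip can update itself from the delta alone. */
        printf("check(d ^ delta)        = 0x%02x\n", check_bits(d ^ delta));
        printf("check(d) ^ check(delta) = 0x%02x\n", check_bits(d) ^ check_bits(delta));
        return 0;   /* the two values always match */
    }
    ```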
  • edzieba - Tuesday, August 20, 2019 - link

    Hot DIMMs! Could be the first time since the days of FBDIMMs that RAMsinks would be anything other than cosmetic.

    As for heat dissipation: for air-cooled servers it may even be beneficial to shift some thermal load away from the CPU socket(s) to reduce the potential for throttling. For CLC servers, a 1U chassis might hit Z-height issues when adding WC blocks to a vertical DIMM row, but otherwise there are off-the-shelf solutions for adding DIMMs to WC loops.
  • ballsystemlord - Tuesday, August 20, 2019 - link

    One misspelling (keep up the good work!):

    "The 14-stage pipeline us a basic in-order threaded CPU with dispatch/fetch/read/format/ALU/merge stages with access to the local SRAMs."
    "as" not "us":
    "The 14-stage pipeline as a basic in-order threaded CPU with dispatch/fetch/read/format/ALU/merge stages with access to the local SRAMs."
  • philehidiot - Tuesday, August 20, 2019 - link

    I am a lay-idiot. This sounds utterly friggin' awesome. Obviously, it's not going to be massively useful for the home gamer but for some people who play with massive datasets it's gonna be a gamechanger. Now, what I wanna know is two things: 1) how does this apply to my pr0n stash and 2) can it play Crysis yet?
  • Rudde - Wednesday, August 21, 2019 - link

    1) It depends on how much you are going to shift and rotate your stash.
    2) No. It doesn't support vector instructions (among other things).
  • philehidiot - Wednesday, August 21, 2019 - link

    Kill joy. But thanks for playing along with my drunken, technically illiterate comments.
  • FunBunny2 - Tuesday, August 20, 2019 - link

    "The idea behind In-Memory Processing, or ‘Processing In-Memory’, is that a number of those simple integer or floating point operations should be done while the memory is still in DRAM – no need to cart it over to the CPU, do the operation, and then send it back."

    FWIW, back in the late '70s TI built a mini, and later a chip with the same ISA, which had only a couple of registers. One was an instruction pointer, another was the context pointer, and perhaps one or two more. All instructions were executed on memory-resident data. Déjà vu all over again.
  • SaberKOG91 - Friday, August 23, 2019 - link

    Those were stack machines and were quickly replaced by virtual machines running on RISC processors for efficiency's sake. In Flynn's taxonomy these fall more into the category of MIMD (Multiple Instruction, Multiple Data) machines, whereas stack machines are SISD (Single Instruction, Single Data) machines. These chips are basically a modern take on the Berkeley VIRAM processors from the late '90s/early '00s. Their biggest advantage has to do with not needing to swap RAM in and out of caches to access all of it. If you could bypass the data caches and directly access RAM from the CPU you might incur higher latencies, but the energy cost wouldn't be as bad as you might think.
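
    For the cache-bypass point: x86 already offers something like this for stores in the form of non-temporal (streaming) instructions, which write-combine straight to memory instead of allocating cache lines. A small sketch, not UPMEM-specific:

    ```c
    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill a buffer without polluting the cache; dst must be 16-byte aligned. */
    void fill_bypassing_cache(void *dst, int32_t value, size_t bytes) {
        __m128i v = _mm_set1_epi32(value);
        __m128i *p = (__m128i *)dst;
        for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
            _mm_stream_si128(p + i, v);   /* non-temporal store */
        _mm_sfence();                     /* make the streaming stores visible */
    }
    ```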
  • abufrejoval - Tuesday, August 27, 2019 - link

    The TMS9900 microprocessor did indeed use a RAM-based register file to save CPU transistors while supporting a full 16-bit architecture in those 8-bit days. But that was only possible because even the simplest instructions typically took several clock cycles to complete back then, so the overhead of accessing a RAM-based register file didn't matter much, if at all: operating on RAM didn't slow computation, truly justifying the Random Access Memory name. Today RAM is the new tape, even with 3-4 levels of cache memory.

    In the case of the TMS9900, data actually got carried back and forth twice as often, as it was transferred over a multiplexed 8-bit RAM bus to the non-multiplexed 16-bit, 256-byte scratchpad RAM that represented the register file; the CPU would then do ALU operations as CPU-RAM read/write operations, only to transfer the results back to ordinary RAM afterwards.

    TI lost $111 million on that venture, perhaps another reason not to repeat that approach.
  • blacklion - Friday, August 23, 2019 - link

    I wonder how memory allocation is done from the point of view of the host.
    They write: work is submitted to the DPU via some OS driver. OK, that part is clear.
    But it is only half of the story. DPUs work with physical memory. User-level code (on the host processor) works with virtual addresses. So, to prepare a task for a DPU, it needs to know the virtual-to-physical translation, which is typically not allowed for user programs.
    And even worse: it needs to allocate chunks of memory in contiguous physical (not virtual!) address space. Again, typical OSes don't have such an API.
    Example: we want to add two arrays of float32 and store the result into a third. Let's say, for the sake of simplicity, each source array is 16 MiB. So we need to allocate 3 chunks of 16 MiB in the SAME 64 MiB PHYSICAL region to be able to process this data with DPUs! As far as I know, no general-purpose OS supports such allocations!
    And it cannot be solved with a "simple driver"; it requires changes to the very heart of the OS's virtual memory subsystem.
    I cannot find anything about this part in the slide deck :(
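
    For what it's worth, Linux does expose the virtual-to-physical mapping to user space via /proc/self/pagemap (privilege is needed to see real frame numbers on recent kernels), and hugepages give physically contiguous 2 MiB or 1 GiB chunks - but nothing in a stock kernel lets you ask for "three 16 MiB buffers inside the same 64 MiB physical region". A sketch of the lookup half, assuming Linux:

    ```c
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* Translate a virtual address to a physical one via /proc/self/pagemap.
       Returns 0 on failure or if the page is not present. */
    uint64_t virt_to_phys(const void *vaddr) {
        long page = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) return 0;
        uint64_t entry = 0;
        off_t off = (off_t)((uintptr_t)vaddr / page) * sizeof(entry);
        ssize_t got = pread(fd, &entry, sizeof(entry), off);
        close(fd);
        if (got != sizeof(entry) || !(entry & (1ULL << 63))) return 0; /* not present */
        uint64_t pfn = entry & ((1ULL << 55) - 1);                     /* bits 0-54   */
        return pfn * (uint64_t)page + ((uintptr_t)vaddr % page);
    }
    ```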
  • TomWomack - Thursday, August 29, 2019 - link

    That's exactly the same problem as allocating memory on a GPU, though at least accessing the memory from the CPU requires only cache invalidation (careful - the CPU cache hierarchy doesn't know about the processors in the memory!) rather than trips over a PCIe bus.
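
    A rough idea of what that invalidation would look like on x86 if the PIM region were mapped write-back (an assumption - a real SDK would presumably hide this): flush the affected lines before handing the buffer to the DPUs, and again before reading their results.

    ```c
    #include <emmintrin.h>   /* SSE2: _mm_clflush, _mm_mfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Write back and invalidate every cache line covering [buf, buf+bytes). */
    void flush_range(const void *buf, size_t bytes) {
        const size_t line = 64;                      /* typical x86 line size */
        uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(line - 1);
        uintptr_t end = (uintptr_t)buf + bytes;
        for (; p < end; p += line)
            _mm_clflush((const void *)p);
        _mm_mfence();                                /* order against later accesses */
    }
    ```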
  • ThopTv - Wednesday, August 28, 2019 - link

    One of the key critical future elements about this world of compute is moving data about. Moving data requires power, to the point where calling data from
  • Senbon-Sakura - Thursday, November 18, 2021 - link

    Given the large bandwidth of DRAM, I guess vector instructions would achieve even more gains for UPMEM - so why are only scalar instructions supported?
