"core", in terms of computer jargon, is old as sin.
Initially, it referred to "
Core memory", which was one of the very first kinds of electronic memory out there. It used small magnetic ferrite beads, laced over criss-crossing mesh of copper wire, where two wires passed over each other at 90 degrees, and a ferrite bead was placed diagonally over the intersection. One of the wires supplied "ring current", and the other was the sense wire. To flip a bit, ring current was supplied, which would alter the magnetic polarity of the ferrite bead, and thus change the inductance value of the sense wire. In this way, a state could be stored on the circuit. Assemblages of these memory units were called "Core memory", and it was due to the existence of ferrite bead cores being present in the memory's design. It later became subsumed by the notion that this kind of memory was deeply rooted at the heart of the computer, and thus "At it's core", and became misnomered as the base system memory, vs expansion memory.
This misnomer was later carried further with chips that put multiple CPUs on a single die, which are REALLY symmetric multiprocessing (SMP) with a unified cache architecture. Each individual CPU on the die is misnamed a "core", for lack of an easier-to-use alternative. "Processor unit" is more accurate, but cumbersome.
As for what each of them actually *IS*--
A CPU is made up of an instruction decoder unit (which decodes instruction words and, in modern CPUs, contains the microcode to perform these tasks; in older CPUs, this could be done with hardwired logic, i.e. hard-defined logic gate structures, as seen in the NEC V20 chip), which is connected to an array of ALUs, or Arithmetic Logic Units. ALUs are built from transistor logic gates, which are capable of performing logical operations such as AND, NOT, OR, and XOR. By stringing these together in an ordered fashion, complex operations can be performed.
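As a toy illustration of that "stringing together" (a sketch of the textbook full-adder construction, not how any particular ALU is actually wired): a one-bit adder can be expressed entirely with AND, OR, and XOR, and chaining several of them gives multi-bit addition:

    #include <stdio.h>
    #include <stdint.h>

    /* A one-bit full adder expressed purely as logic gates. */
    static void full_adder(int a, int b, int carry_in, int *sum, int *carry_out) {
        *sum       = (a ^ b) ^ carry_in;             /* XOR gates    */
        *carry_out = (a & b) | ((a ^ b) & carry_in); /* AND/OR gates */
    }

    /* Chain four full adders into a 4-bit ripple-carry adder. */
    static uint8_t add4(uint8_t a, uint8_t b) {
        uint8_t result = 0;
        int carry = 0;
        for (int i = 0; i < 4; i++) {
            int sum;
            full_adder((a >> i) & 1, (b >> i) & 1, carry, &sum, &carry);
            result |= (uint8_t)(sum << i);
        }
        return result & 0x0F;
    }

    int main(void) {
        printf("5 + 6 = %u (mod 16)\n", add4(5, 6)); /* prints 11 */
        return 0;
    }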
Connected to the decoder and the ALU array, there is a very high speed, privileged bit of SRAM called the "on-die cache". This stores recently fetched instructions and recently used data, so that the CPU can avoid redoing work it has already done, and avoid fetching data from the much slower system RAM when it does not have to.
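You can see the cache doing its job from ordinary code. In this sketch, walking a big array in memory order reuses cached lines, while striding across it defeats them; the exact timings depend entirely on your machine and cache sizes, so treat the numbers as illustrative:

    #include <stdio.h>
    #include <time.h>

    #define N 4096

    static int grid[N][N]; /* 64 MiB: far larger than any on-die cache */

    int main(void) {
        long long sum = 0;
        clock_t t0, t1;

        /* Row-major walk: consecutive accesses land in the same cached lines. */
        t0 = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += grid[i][j];
        t1 = clock();
        printf("row-major:    %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        /* Column-major walk: each access hits a different line, forcing
           far more fetches from slower memory. */
        t0 = clock();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += grid[i][j];
        t1 = clock();
        printf("column-major: %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        return (int)(sum & 1); /* keep the loops from being optimized away */
    }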
Because cache RAM is bulky (it takes up lots of die real estate), is difficult to design in a configuration that deals sensibly with signal propagation delays, and has to be placed so that it does not act like a thermal blanket over the ALU arrays-- it is often a shared commodity between processor "cores" on a multi-core chip. (Originally, the concept of SMP used multiple discrete processor ICs, each with its own discrete cache block!) This is where "cache contention" comes into play. Since there is only so much of this memory available, and multiple processing units are trying to read and write it all at the same time, the risk of one process evicting data needed by another increases as the number of threads being processed increases, and as the need for large data structures increases. Managing this problem for maximal performance is a headache, which is one of the reasons why game developers loathe the idea of multiprocessing in their game engines.
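Cache contention shows up even in tiny programs. One classic form is "false sharing": two threads hammering counters that happen to live on the same cache line keep yanking that line out of each other's cache. A pthreads sketch (the 64-byte line size is an assumption; it is common but not universal):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L

    /* Two counters packed side by side: almost certainly the same cache
       line, so the two threads fight over it (false sharing). */
    static struct { volatile long a, b; } packed;

    /* Two counters padded apart: each likely gets its own cache line
       (assuming 64-byte lines). */
    static struct { volatile long a; char pad[64]; volatile long b; } padded;

    static void *bump(void *p) {
        volatile long *ctr = p;
        for (long i = 0; i < ITERS; i++)
            (*ctr)++;
        return NULL;
    }

    static void run(volatile long *x, volatile long *y, const char *label) {
        struct timespec t0, t1;
        pthread_t a, b;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, bump, (void *)x);
        pthread_create(&b, NULL, bump, (void *)y);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%-25s %.2fs\n", label,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    }

    int main(void) {
        run(&packed.a, &packed.b, "contended (same line):");
        run(&padded.a, &padded.b, "padded (separate lines):");
        return 0;
    }

Build it with something like "cc -O2 -pthread demo.c". On most machines the padded run finishes noticeably faster, even though both do exactly the same amount of arithmetic.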