From the discussions here, this is what I think would work the fastest(likely a few false assumptions, invalidating the whole thing):
Assume that there will never be more than 65535 objects (and one null value). This allows the use of a 16 bit index rather than a 32 or 64 bit pointer, allowing 2-4 times as many in each cache entry, reducing cache misses.
Furthermore, allocate the entire array once, to eliminate the reallocation penalty.
Get the individual object size as small as possible, but able to fit an integer number of them exactly in a cache entry.
Store the objects in an array, fully preallocated, or in a series of fixed-size arrays that are allocated as needed and never freed (with an array of pointers to them, null until allocated).
Each object would contain two indexes, so that they form two doubly-linked lists, one of empty objects and one of used objects. The last used object should be followed by an empty object with a null index as it's next, to signify that all of the remaining objects are empty.
That way, adding a new object is usually as simple as a removal from a doubly linked list (constant time), an insertion into a doubly linked list (again, constant time), and object initialization. Removing an object takes just as long, minus the initialization. Depending on size, a new array may have to be allocated occsionally, but since it would be a separate array, no copying of data would be nessecary.
User data would be stored as a single void*, to fit as many objects into the cache as possible, and give the objects a fixed size.
Finally, don't even bother with something like this, find an adequate third-party library and focus on the impostant part: YOUR code. It is likely that an existing physics library is highly optimized, and you will probably spend days debugging physics math before you can even approach a sufficient setup to do everything you want it to.
Spend an hour on wikipedia, starting at
http://en.wikipedia.org/wiki/Physics_engine#Engines, and you will have better results, sooner.