Lua is actually closer to PHP in how it uses arrays.
"Until Lua 4.0, tables were implemented strictly as hash tables: all pairs were explicitly stored. Lua 5.0 brought a new algorithm to optimize the use of tables as arrays: it optimizes pairs with integer keys by not storing the keys and storing the values in an actual array. More precisely, in Lua 5.0, tables are implemented as hybrid data structures: they contain a hash part and an array part. Figure 2 shows a possible configuration for a table with the pairs "x" → 9.3, 1 → 100, 2 → 200, 3 → 300. Note the array part on the right: it does not store the integer keys. This division is made only at a low implementation level; access to table fields is transparent, even to the virtual machine. Tables automatically and dynamically adapt their two parts according to their contents: the array part tries to store the values corresponding to integer keys from 1 to some limit n. Values corresponding to non-integer keys or to integer keys outside the array range are stored in the hash part.

When a table needs to grow, Lua recomputes the sizes for its hash part and its array part. Either part may be empty. The computed size of the array part is the largest n such that at least half the slots between 1 and n are in use (to avoid wasting space with sparse arrays) and there is at least one used slot between n/2 + 1 and n (to avoid a size n when n/2 would do). After computing the new sizes, Lua creates the new parts and re-inserts the elements from the old parts into the new ones. As an example, suppose that a is an empty table; both its array part and hash part have size zero. If we execute a[1]=v, the table needs to grow to accommodate the new key. Lua will choose n = 1 for the size of the new array part (with a single entry 1 → v). The hash part will remain empty."
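The hybrid scheme the paper describes can be sketched in a few lines. This is a toy model in Python, not Lua's C: the class name and the migration policy are mine, and the real implementation's rehash logic (the power-of-two size computation quoted above) is simplified away.

```python
class HybridTable:
    """Toy model of a Lua 5.0 table: dense integer keys 1..n live in an
    array part (keys not stored), everything else in a hash part."""

    def __init__(self):
        self.array = []   # array part: values for keys 1..len(self.array)
        self.hash = {}    # hash part: non-integer and out-of-range keys

    def __setitem__(self, key, value):
        if key == len(self.array) + 1:           # next dense integer key
            self.array.append(value)
            # pull in hash-part keys that have become contiguous
            while len(self.array) + 1 in self.hash:
                self.array.append(self.hash.pop(len(self.array) + 1))
        elif isinstance(key, int) and 1 <= key <= len(self.array):
            self.array[key - 1] = value          # overwrite in array part
        else:
            self.hash[key] = value

    def __getitem__(self, key):
        if isinstance(key, int) and 1 <= key <= len(self.array):
            return self.array[key - 1]
        return self.hash.get(key)


t = HybridTable()
t["x"] = 9.3
t[1] = 100
t[2] = 200
t[3] = 300
# Mirrors the paper's Figure 2: array part [100, 200, 300] stores no
# integer keys; hash part holds {"x": 9.3}.
```

As in Lua, the split is invisible to the caller: `t[2]` and `t["x"]` go through the same indexing operation and only the lookup path differs.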
Since PHP arrays are within a factor of two of the theoretical optimum for a dynamically typed language that unifies arrays and hashes, all these languages (add JavaScript to the list) should have roughly comparable behaviour unless they do some ugly special casing for hashes whose keys are all integers. [edit: see the comment by InfernalH for a measurement that indicates that Ruby seems to do this.] You have to look that up in the relevant documentation and/or source code. Perl, on the other hand, should fare better, as it distinguishes arrays from hashes.
If you want performance, use a real programming language.
JavaScript engines have many types of arrays. They do indeed resort to "ugly special casing" for objects in which all keys are integers: for instance, SpiderMonkey calls them "dense arrays". V8 has a similar optimization. Likewise, Lua stores a hashtable part and an array part for all objects.
V8 and SpiderMonkey have optimizations for the values, too. When a value is a 31-bit integer (in the case of V8) or any number (in the case of SpiderMonkey), the value itself is optimized to avoid heap allocation. In V8's case optimized values take 32 bits, while in SpiderMonkey's case optimized values take 64 bits. So an array consisting of 100,000 integers will take 400K plus a negligible amount of malloc/GC slop in V8 and 800K in SpiderMonkey.
JavaScript also has typed arrays, which allow the programmer to use fine-grained, optimized array buffers. Performing this experiment with a typed array will yield a 400K buffer in all engines that support the feature.
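For comparison, here is the same experiment sketched in Python rather than JavaScript: the standard library's `array` module provides the same kind of fine-grained buffer as a JS typed array, so the 400K figure is easy to reproduce on any platform where a C int is 4 bytes.

```python
import sys
from array import array

# A plain list of 100,000 ints: an array of pointers, each pointing
# at a separately allocated int object.
boxed = list(range(100_000))

# A "typed array": one contiguous buffer of 4-byte signed ints,
# roughly analogous to a JavaScript Int32Array.
packed = array('i', range(100_000))

print(packed.itemsize * len(packed))   # payload bytes: 400000 where C int is 4 bytes
print(sys.getsizeof(boxed))            # pointer array alone, ~800K on 64-bit CPython
```

The packed buffer also avoids the per-element int objects entirely, which is where most of the boxed version's total cost actually goes.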
Python and Ruby both have separate arrays and hashes. They're completely different data structures in both cases.
Python's dict type is a hash table, as you'd expect. Python's list is a pointer-array-backed list (it may inline ints and similar things; I don't remember whether CPython does, and the exact details are implementation-dependent), and raw arrays are in the standard library if you need them.
From a very quick check, a list of 100k ints in CPython is ~1.5 MB, and a dict of 100k ints -> other ints is ~6 MB.
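Those ballpark numbers are easy to reproduce (exact figures vary by CPython version and platform). One caveat: `sys.getsizeof` counts only the container itself, not the int objects it points to, which is why the list's container size comes out below the ~1.5 MB total quoted above.

```python
import sys

n = 100_000
lst = list(range(n))
dct = {i: i + 1 for i in range(n)}

# getsizeof measures only the container: the pointer array for the
# list, the index table plus entry array for the dict. The boxed int
# objects (~28 bytes each in CPython) come on top of both figures.
print(sys.getsizeof(lst))   # ~800 KB on 64-bit CPython (8 bytes/pointer)
print(sys.getsizeof(dct))   # several MB: the dict stores keys, values and hashes
```

The gap between the two containers is the point of the comparison: a dict entry carries a key, a value and a cached hash, plus slack from the load factor, where a list slot is a single pointer.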
Ruby's hash and array implementations are similar, I think, although I don't know Ruby as well, so I don't know the specifics.
Good info about Python, but you are incorrect about Ruby. Hashes and arrays are totally different in the Ruby implementations I know of.
EDIT: Sorry, I misread you. Ruby MRI and CPython are indeed similar.
At least in Ruby MRI (the mainline implementation), arrays are implemented as a struct with a size and a pointer to a normal C array which contains the object references. (References in MRI are pointers to object structs, which use the lower bits to inline integers of <= 31 or 63 bits, plus true, false, nil and symbols.)
Hashes in MRI I have not looked into that much, but I believe they are hash tables which, in Ruby 1.9, retain insertion order using pointers, like a singly linked list.
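That scheme (a bucketed hash table whose entries are additionally threaded onto a linked list in insertion order) can be sketched as follows. This is a toy in Python, not MRI's C; the names are mine, and resizing is omitted for brevity.

```python
class OrderedHash:
    """Toy hash table whose entries are also threaded, in insertion
    order, onto a singly linked list. Fixed bucket count, no resizing."""

    class Entry:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.chain_next = None   # next entry in the same bucket
            self.order_next = None   # next entry in insertion order

    def __init__(self, nbuckets=8):
        self.buckets = [None] * nbuckets
        self.head = self.tail = None        # insertion-order list

    def _bucket(self, key):
        return hash(key) % len(self.buckets)

    def __setitem__(self, key, value):
        e = self.buckets[self._bucket(key)]
        while e:                            # update in place if key exists
            if e.key == key:
                e.value = value
                return
            e = e.chain_next
        e = self.Entry(key, value)
        b = self._bucket(key)
        e.chain_next = self.buckets[b]      # push onto the bucket chain
        self.buckets[b] = e
        if self.tail:                       # append to the order list
            self.tail.order_next = e
        else:
            self.head = e
        self.tail = e

    def __getitem__(self, key):
        e = self.buckets[self._bucket(key)]
        while e:
            if e.key == key:
                return e.value
            e = e.chain_next
        raise KeyError(key)

    def keys(self):                         # iterate in insertion order
        e, out = self.head, []
        while e:
            out.append(e.key)
            e = e.order_next
        return out
```

Lookup still goes through the buckets (so it stays O(1) on average); only iteration walks the insertion-order list.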
Ugly special casing requires code to detect those special cases. This adds overhead in terms of processing cycles and code. Most arrays, the majority of the time, store a tiny amount of information: less than a dozen entries. Doing that sort of detection is massive overkill and would actually hurt performance. You want to speed things up for the real world, not artificial benchmarks. How many times in your code are you really going to need 100,000 integers in memory? Probably once in your entire lifetime, if that.
I use much larger integer arrays all the time but that's beside the point. The question is whether or not it makes sense to make dynamic language runtimes as efficient as possible. I think it does, not because it necessarily helps all or even most existing programs, but because it enables people to create a much wider range of software with a language they already know. Many modern websites wouldn't be possible with JavaScript engines from 1999.
And if you optimize at all, memory usage must be a top priority. Lack of memory is a lot more costly in terms of performance than a couple of C based heuristic checks. Creating and then garbage collecting an array full of individual integer objects doesn't just use a lot more memory, it's also orders of magnitude slower. The array doesn't have to be very large for that to matter as there may be large numbers of arrays.
Yeah, I too am pretty sure that CPython does not inline integers in the pointers. From a quick glance at the source code I saw nothing of the sort.
Inlining integers in pointers by shifting up and adding 1 is a quite common trick, though, and I have seen it in more programming language implementations than MRI; I think at least some Prolog implementations do it, and so did older versions of SpiderMonkey (newer versions use a similar trick with doubles).
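The trick itself fits in a few lines. This is a toy model in Python, with plain ints standing in for machine words; real implementations do this on raw pointer bits, relying on heap objects being word-aligned so that genuine pointers always have a low bit of 0.

```python
# Toy model of pointer tagging: object pointers are word-aligned, so
# their low bit is always 0; a small integer i is stored directly in
# the "pointer" word as (i << 1) | 1 and recognized by its low bit.

def tag_int(i):
    return (i << 1) | 1          # e.g. 42 -> 85

def is_tagged_int(word):
    return word & 1 == 1         # real pointers have low bit 0

def untag_int(word):
    return word >> 1             # arithmetic shift restores the value

word = tag_int(42)
print(is_tagged_int(word), untag_int(word))   # True 42
```

The arithmetic shift on the way out also preserves negative values, which is part of why the shift-and-or form is preferred over simply adding 1.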
> Inlining integers in pointers by shifting up and adding 1 is a quite common trick though and I have seen it in more programming language implementations than MRI
This trick was already used in Smalltalk-80, btw. A more recent variant of this is NaN tagging, made popular by LuaJIT.
> Inlining integers in pointers by shifting up and adding 1 is a quite common trick though and I have seen it in more programming language implementations than MRI.
Yeah, I know about it; I just did not think MRI had bothered with it any more than CPython.
Python has lists and dicts, and lists don't get the per-item overhead of GC-ed keys. PyPy does better than CPython and implements specialised, low-overhead storage for some dicts, lists and sets of uniform type (in 1.6, nightly builds, and a branch respectively).