
Interesting. Wonder how this compares to python and ruby memory handling.


Lua is actually closer to PHP in how arrays are used in it.

"Until Lua 4.0, tables were implemented strictly as hash tables: all pairs were explicitly stored. Lua 5.0 brought a new algorithm to optimize the use of tables as arrays: it optimizes pairs with integer keys by not storing the keys and storing the values in an actual array. More precisely, in Lua 5.0, tables are implemented as hybrid data structures: they contain a hash part and an array part. Figure 2 shows a possible configuration for a table with the pairs "x" → 9.3, 1 → 100, 2 → 200, 3 → 300. Note the array part on the right: it does not store the integer keys. This division is made only at a low implementation level; access to table fields is transparent, even to the virtual machine. Tables automatically and dynamically adapt their two parts according to their contents: the array part tries to store the values corresponding to integer keys from 1 to some limit n. Values corresponding to non-integer keys or to integer keys outside the array range are stored in the hash part.

When a table needs to grow, Lua recomputes the sizes for its hash part and its array part. Either part may be empty. The computed size of the array part is the largest n such that at least half the slots between 1 and n are in use (to avoid wasting space with sparse arrays) and there is at least one used slot between n/2 + 1 and n (to avoid a size n when n/2 would do). After computing the new sizes, Lua creates the new parts and re-inserts the elements from the old parts into the new ones. As an example, suppose that a is an empty table; both its array part and hash part have size zero. If we execute a[1]=v, the table needs to grow to accommodate the new key. Lua will choose n = 1 for the size of the new array part (with a single entry 1 → v). The hash part will remain empty."

From "Implementation of Lua 5.0" -- http://lua.org/doc/jucs05.pdf
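The hybrid layout described in the quoted passage can be sketched in Python. This is a toy model for illustration only: the real implementation is C structs, and the `HybridTable` class with its fixed `array_size` is a hypothetical simplification that skips the resizing heuristic quoted above.

```python
# Toy sketch of Lua 5.0's hybrid table: integer keys 1..n live in a
# dense array part (the keys themselves are not stored); everything
# else falls back to the hash part.
class HybridTable:
    def __init__(self, array_size=0):
        self.array = [None] * array_size  # values for keys 1..array_size
        self.hash = {}                    # all other keys

    def get(self, key):
        if isinstance(key, int) and 1 <= key <= len(self.array):
            return self.array[key - 1]    # index math replaces key storage
        return self.hash.get(key)

    def set(self, key, value):
        if isinstance(key, int) and 1 <= key <= len(self.array):
            self.array[key - 1] = value
        else:
            self.hash[key] = value

# The configuration from the paper's Figure 2:
t = HybridTable(array_size=3)
t.set(1, 100); t.set(2, 200); t.set(3, 300)
t.set("x", 9.3)                           # non-integer key -> hash part
```

Access is transparent to the caller, as in Lua: `get` and `set` hide which part a key lands in.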


Since PHP arrays are within a factor of two of the theoretical optimum for a dynamically typed language that unifies arrays and hashes, all these languages (add JavaScript to the list) should have roughly comparable behaviour unless they do some ugly special casing for hashes where all keys are integers. [edit: see the comment by InfernalH for a measurement that indicates that ruby seems to do this.] You have to look that up in the relevant documentation and/or source code. Perl, on the other hand, should fare better, as it distinguishes arrays and hashes. If you want performance, use a real programming language.


JavaScript engines have many types of arrays. They do indeed resort to "ugly special casing" for objects in which all keys are integers: for instance, SpiderMonkey calls them "dense arrays". V8 has a similar optimization. Likewise, Lua stores a hashtable part and an array part for all objects.

V8 and SpiderMonkey have optimizations for the values, too. When a value is a 31-bit integer (in the case of V8) or any number (in the case of SpiderMonkey), the value itself is optimized to avoid heap allocation. In V8's case optimized values take 32 bits, while in SpiderMonkey's case optimized values take 64 bits. So an array consisting of 100,000 integers will take 400K plus a negligible amount of malloc/GC slop in V8 and 800K in SpiderMonkey.
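A quick sanity check of those numbers in Python (the 31-bit range and the byte counts follow directly from the claims above; `fits_smi` is a hypothetical name for V8's "small integer" test):

```python
# V8's "smi" trick: a value is stored inline if it fits in 31 bits;
# the remaining bit of the 32-bit word is the tag.
def fits_smi(n):
    return -(2**30) <= n < 2**30   # 31-bit signed range

# 100,000 inline values at 4 bytes each (V8) vs 8 bytes each (SpiderMonkey):
print(100_000 * 4)  # 400000 bytes, ~400K
print(100_000 * 8)  # 800000 bytes, ~800K
```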

JavaScript also has typed arrays, which allow the programmer to use fine-grained, optimized array buffers. Performing this experiment with a typed array will yield a 400K buffer in all engines that support the feature.
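Python's standard library has an analogous construct, which gives a rough cross-check of that 400K figure (the item size is platform-dependent, so treat the number as approximate):

```python
import array

# 'i' is a platform C int, typically 4 bytes; 100,000 of them should
# occupy roughly 400K, like a JS Int32Array of the same length.
a = array.array('i', range(100_000))
print(a.itemsize)            # typically 4
print(a.itemsize * len(a))   # ~400000 bytes on common platforms
```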


Python and Ruby both have separate arrays and hashes. They're completely different data structures in both cases.

Python's dict type is a hash table, like you'd expect. Python's list is backed by an array of pointers (it may inline ints and similar things; I don't remember if CPython does, and the exact details are implementation-dependent), and raw arrays are in the standard library if you need them.

From a very quick check, a list of 100k ints in CPython is ~1.5mb, and a dict of 100k ints -> other ints is ~6mb.
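That quick check can be reproduced with `sys.getsizeof`. Note that `getsizeof` counts only the container, so the int objects are summed separately; exact numbers vary by CPython version and platform.

```python
import sys

lst = list(range(100_000))
d = {i: i + 100_000 for i in range(100_000)}

# Container overhead alone: roughly 0.8 MB for the list (one pointer
# per slot on 64-bit), and several MB for the dict, whose entries are
# larger than a bare pointer.
print(sys.getsizeof(lst))
print(sys.getsizeof(d))

# Adding the int objects themselves (CPython caches small ints, but
# ints above 256 are separate heap objects) brings the list's total
# toward the ~1.5 MB figure mentioned above.
print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))
```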

Ruby's hash and array implementations are similar, I think, although I don't know Ruby as well, so I don't know the specifics.


Good info about Python, but you are incorrect about Ruby. Hashes and arrays are totally different in the Ruby implementations I know of.

EDIT: Sorry, I misread you. Ruby MRI and CPython are indeed similar.

At least in Ruby MRI (the mainline) arrays are implemented as a struct with a size and a pointer to a normal C array which contains the object references (references in MRI are pointers to object structs which use the lower bits to inline integers of <= 31 or 63 bits, true, false, nil and symbols).

Hashes in MRI I have not looked that much into, but I believe they are hash tables which, in Ruby 1.9, retain insertion order using pointers, like a singly linked list.


> Good info about Python, but you are incorrect about Ruby. Hashes and arrays are totally different in the Ruby implementations I know of.

From its context, I'm guessing samdk is saying:

> Ruby's hash and array implementations are similar [to Python's dict and list]

not that they're similar to one another, which would make absolutely no sense considering his comment starts with:

> Python and Ruby both have separate arrays and hashes.

so I'd say you agree with him and misread his comment.


Does Python have a standard implementation of odict already?

It's nice that Python and Ruby have data structures that trade some features for some memory and/or performance gain.


> Does python have standard implementation of odict already?

Yeppers, since 2.7/3.1: http://docs.python.org/library/collections.html#collections.... http://docs.python.org/py3k/library/collections.html#collect...
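For reference, it's `collections.OrderedDict`, a dict subclass that remembers insertion order:

```python
from collections import OrderedDict

# Available since 2.7/3.1. (Since CPython 3.7 the plain dict preserves
# insertion order too, but OrderedDict remains the explicit choice.)
od = OrderedDict()
od["b"] = 2
od["a"] = 1
print(list(od))  # ['b', 'a'] -- insertion order, not sorted order
```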


As I see it, doing the ugly special casing for you is exactly what any competent dynamic language implementation should do.


Ugly special casing requires code that would detect those special cases. This adds overhead in terms of processing cycles and code. Most arrays, the majority of the time, store a tiny amount of information: less than a dozen entries. Doing that sort of detection is massive overkill and would actually hurt performance. You want to speed things up for the real world, not artificial benchmarks. How many times in your code are you really going to need 100,000 integers in memory? Probably once in your entire lifetime, if that.


I use much larger integer arrays all the time but that's beside the point. The question is whether or not it makes sense to make dynamic language runtimes as efficient as possible. I think it does, not because it necessarily helps all or even most existing programs, but because it enables people to create a much wider range of software with a language they already know. Many modern websites wouldn't be possible with JavaScript engines from 1999.

And if you optimize at all, memory usage must be a top priority. Lack of memory is a lot more costly in terms of performance than a couple of C based heuristic checks. Creating and then garbage collecting an array full of individual integer objects doesn't just use a lot more memory, it's also orders of magnitude slower. The array doesn't have to be very large for that to matter as there may be large numbers of arrays.


Javascript also distinguishes arrays and hashmaps.


I tried with a "pure" Array - for Ruby it's 1Mb for 100 000 elements

EDIT: and for Hash with h[i] = i, it's ~6Mb


Does that account for the size taken by the integers themselves?


I assume so. If the integers fit in 31 or 63 bits, depending on your architecture, they are inlined in the pointer.


OK (because that's not the case in CPython, I think)


Yeah, I too am pretty sure that CPython does not inline integers in the pointers. And from a quick glance at the source code I saw nothing of the sort.

Inlining integers in pointers by shifting up and adding 1 is a quite common trick, though, and I have seen it in more programming language implementations than MRI: I think at least some Prolog implementations and older versions of SpiderMonkey use it (newer versions use a similar trick with doubles).
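The trick can be sketched in Python on plain integers (hypothetical illustration; real implementations do this on raw machine words):

```python
# "Shift up and add 1": a tagged word with its low bit set encodes an
# inline integer; a cleared low bit means an aligned heap pointer
# (heap objects are at least 2-byte aligned, so real pointers end in 0).
def tag_int(n):
    return (n << 1) | 1

def untag(word):
    if word & 1:
        return ("int", word >> 1)   # arithmetic shift recovers the value
    return ("pointer", word)

print(untag(tag_int(42)))   # ('int', 42)
print(untag(0x7f00a8))      # even word -> treated as a pointer
```

This halves the usable integer range (31 bits on a 32-bit word, 63 on 64-bit), which is exactly why MRI's inlined Fixnums top out at those sizes.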


> Inlining integers in pointers by shifting up and adding 1 is a quite common trick though and I have seen it in more programming language implementations than MRI.

This trick was already used in Smalltalk-80, btw. A more recent variant of this is NaN tagging, made popular by LuaJIT.


> Inlining integers in pointers by shifting up and adding 1 is a quite common trick though and I have seen it in more programming language implementations than MRI.

Yeah, I know about it, I just did not think MRI had bothered with it anymore than CPython.


false, true, nil, Symbols and Fixnums are all special-cased with type tags in MRI.


Python has lists and dicts, and lists don't get the per-item overhead of GC-ed keys. PyPy does better than CPython and implements specialised, low-overhead storage for some dicts, lists and sets of uniform type (in 1.6, nightly builds, and a branch respectively).

http://morepypy.blogspot.com/2011/10/more-compact-lists-with...



