
Datastructure required.

Level 14
Joined
Jun 27, 2008
Messages
1,325
Let * be a binary, associative, commutative operation and M a set of elements for which * is defined.
Let S = m_1 * m_2 * m_3 * ... * m_n for m_i ∈ M.
I need a data structure for M that allows easy recalculation of S when:
- one or more elements of M are changed
- one or more elements are added to M
- one or more elements are removed from M

The naive approach of calculating S by iterating over the set takes O(n) time and O(1) space.
Storing the product of each of the n subsets of M with n-1 elements gives O(1) time and O(n) space.
I'd like a solution with O(log(n)) time and a good space complexity.
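For concreteness, a minimal sketch of both baselines (the callable `op` and the list `elements` are placeholders of mine; the prefix/suffix construction shows how the n products of the (n-1)-element subsets can be built without an inverse, although keeping them up to date is exactly the problem discussed below):

```python
from functools import reduce

def naive_total(elements, op):
    """O(n) time, O(1) extra space: recompute S from scratch."""
    return reduce(op, elements)

def all_but_one(elements, op):
    """Product of every (n-1)-element subset via prefix/suffix products.

    O(n) time to build and O(n) space to store; with these in hand,
    S after changing element i to v is just op(all_but_one[i], v)."""
    n = len(elements)
    prefix = [None] * (n + 1)   # prefix[i] = elements[0] * ... * elements[i-1]
    suffix = [None] * (n + 1)   # suffix[i] = elements[i] * ... * elements[n-1]
    for i in range(n):
        prefix[i + 1] = elements[i] if prefix[i] is None else op(prefix[i], elements[i])
    for i in range(n - 1, -1, -1):
        suffix[i] = elements[i] if suffix[i + 1] is None else op(elements[i], suffix[i + 1])
    return [suffix[i + 1] if prefix[i] is None
            else prefix[i] if suffix[i + 1] is None
            else op(prefix[i], suffix[i + 1])
            for i in range(n)]
```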

I thought about a sqrt(n)-tree, which has middling time and space complexity but requires annoying rebalancing.
Ideas?
 
Level 6
Joined
Aug 26, 2009
Messages
192
Can you explain the sqrt(n)-tree a bit more? Do you mean a tree with sqrt(n) children at each node? If so, the space complexity would still be linear, wouldn't it? Imo a normal binary tree would be best: you'd have logarithmic time complexity for all those operations while having linear space complexity. I don't know if you can get it any better.
The subset idea has the following problems:
1. adding an element e, you'd have to multiply it into every subset product. Since there are n subsets, you'd have O(n) time complexity.

2. exchanging an element e with e' would force you to recalculate every subset containing e. This means you have to recalculate n-1 sets completely from scratch, each of them having n-2 elements plus e', which means n-2 multiplications. Thus you'd have O(n*(n-2)) = O(n^2).

3. removing an element would result in a lot of recalculations. See 2.

Thus you actually never get O(1), since you always have to recalculate the subsets. You end up spending O(n) space while gaining no speed at all.
 

peq

Level 6
Joined
May 13, 2007
Messages
171
If you use a tree you could just insert/delete elements in a way which keeps the tree balanced. You do not need a search tree, and it does not matter at which position in the tree you insert an element.

You can keep a separate data structure to find an element when it has to be removed.
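A tiny sketch of that separate lookup structure, assuming elements are identified by some handle and the tree stores its leaves in numbered slots (both names are illustrative, not from the post):

```python
class LeafIndex:
    """Maps an element handle to the leaf slot it occupies, so removal
    never has to search the tree itself."""
    def __init__(self):
        self.slot_of = {}

    def placed(self, handle, slot):      # record where an element was inserted
        self.slot_of[handle] = slot

    def locate(self, handle):            # which leaf to clear on removal
        return self.slot_of.pop(handle)
```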
 
Level 14
Joined
Jun 27, 2008
Messages
1,325
With the additional operations (add/remove elements) the tree is superior to the subset solution, I agree.
A tree with more than two children per node would require less space than a binary tree while increasing the time complexity a bit. The problem is that with a constant number of children per node, the performance varies depending on how many elements are in the set.
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
Why not use an AVL tree with a "*" solution cache? This obviously assumes "*" is a reversible operation (there exists an inverse to it), which sounds reasonable.

This means that you update the solution every time an element is added by combining the cached value with the new value via "*". When an element is removed, you use the inverse of "*" to remove the element from the cached total.

The AVL tree is used to maintain the set-like properties with good performance (O(log n)), while the value cache is used to maintain the value S with good performance (O(1)).
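If an inverse for "*" did exist (the later replies rule that out for this problem), the cached total alone gives O(1) updates; a minimal sketch using ordinary addition and subtraction as stand-ins (the class name is made up, and set membership itself could live in any dict or balanced tree):

```python
class CachedTotal:
    """Running value of S under a group operation; here + with - as its inverse."""
    def __init__(self, identity=0):
        self.total = identity

    def add(self, value):               # element added to the set
        self.total += value

    def remove(self, value):            # only valid because + has a true inverse
        self.total -= value

    def replace(self, old, new):        # O(1) per changed element
        self.total += new - old
```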

If "*" is changeable or no inverse exists (strange it would not be order dependant then, a violation of set properties) then unfortunately it will not be possible to change the computational order or the storage complexity down, next to hopping that storing smaller numbers uses smaller space (assuming vector output).
 
Level 6
Joined
Aug 26, 2009
Messages
192
If you had an inverse for every element, this would obviously be dead easy. No need for any data structure. But there isn't an inverse, at least not always. The only thing you know is that * is associative and commutative.
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
A speed-up could at least be obtained by caching the subset that is least often changed.

From the given set, there exists a subset which will exist longer than the other elements in the set. Caching the value of that subset will save you having to re-compute it and thus give you greatly improved performance at a fraction of the memory complexity that a full product tree might use. Obviously this depends on how the set is used, as all elements may have a very short life expectancy.

Otherwise it should be fine to go with a product tree. This is where each node contains the result of putting its children through the operation. All actual set values are leaves, all intermediate values are internal nodes, and the root node contains the solution. This uses n-1 internal nodes, where n is the number of values in the set. A simple AVL-like balancing algorithm would keep the O(log n) performance you require for computing the value.
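For illustration, a minimal array-backed sketch of such a product tree (a segment-tree layout rather than AVL rotations, which is a simplification on my part; it also assumes "*" has a neutral element to pad unused leaves with, e.g. +inf for min):

```python
class ProductTree:
    """Leaves hold the set values, each internal node holds
    op(left, right), and the root holds the overall value S."""
    def __init__(self, capacity, op, identity):
        self.size = 1
        while self.size < capacity:        # round capacity up to a power of two
            self.size *= 2
        self.op, self.identity = op, identity
        self.tree = [identity] * (2 * self.size)

    def update(self, leaf, value):
        """Set one leaf and repair its O(log n) ancestors."""
        pos = self.size + leaf
        self.tree[pos] = value
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.op(self.tree[2 * pos], self.tree[2 * pos + 1])
            pos //= 2
        return self.tree[1]                # the new S

    def total(self):
        return self.tree[1]
```

For example, `ProductTree(8, min, float('inf'))` handles min, an operation that is associative and commutative but has no inverse.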

Assuming by binary you mean "Boolean" (2-state), the memory complexity of such a tree is completely fine. If by binary you mean a computer-based number of any size (a vector), then the memory requirements might be too big, especially since each node may consume more memory the closer it is to the root.

Of course other trees can be used. For example you could create a tree that combines both points, where the least commonly changed subset is kept in a bulky, almost linear branch of the tree (right of the root) while the commonly changed and new values are placed on the left. This unbalanced tree would actually perform faster than a balanced tree, as the average insertion and deletion would require less than O(log n) work; the trade-off is that the long-lifecycle elements sacrificed to the right branch become closer to O(n) when they are accessed.

Which is best all comes down to the sort of data being put into the set. If 99% of it is unchanged, with only a small subset being changed at any time, then such an unbalanced tree would be considerably faster thanks to its better-than-O(log n) average performance. If all data is changed equally often, then the best you can do is a balanced tree.
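A rough sketch of that split, assuming elements can be classified as "cold" (rarely changed) or "hot" up front, and reusing the hypothetical ProductTree sketched above:

```python
from functools import reduce

class HotColdProduct:
    """S = op(cold_cache, hot_tree_total). A hot change costs O(log h);
    a cold change forces an O(c) rebuild of the cache, acceptable if rare."""
    def __init__(self, cold_values, hot_capacity, op, identity):
        self.op, self.identity = op, identity
        self.cold_values = list(cold_values)
        self.cold_cache = reduce(op, self.cold_values, identity)
        self.hot = ProductTree(hot_capacity, op, identity)

    def set_hot(self, leaf, value):
        self.hot.update(leaf, value)

    def set_cold(self, index, value):              # should be the rare case
        self.cold_values[index] = value
        self.cold_cache = reduce(self.op, self.cold_values, self.identity)

    def total(self):
        return self.op(self.cold_cache, self.hot.total())
```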
 
Level 14
Joined
Jun 27, 2008
Messages
1,325
If you had an inverse for every element, this would obviously be dead easy. No need for any data structure. But there isn't an inverse, at least not always. The only thing you know is that * is associative and commutative.

this.

Otherwise it should be fine to go with a product tree. This is where each node contains the result of putting its children through the operation. All actual set values are leaves, all intermediate values are internal nodes, and the root node contains the solution. This uses n-1 internal nodes, where n is the number of values in the set. A simple AVL-like balancing algorithm would keep the O(log n) performance you require for computing the value.
This is the tree I mentioned in my first two posts; its space complexity is quite bad when a binary tree is used. That's why I was thinking about something like a sqrt(n)-tree.

Assuming by binary you mean "Boolean" (2-state), the memory complexity of such a tree is completely fine. If by binary you mean a computer-based number of any size (a vector), then the memory requirements might be too big, especially since each node may consume more memory the closer it is to the root.
Binary means the operation takes two operands.

Of course other trees can be used. For example you could create a tree that combines both points, where the least commonly changed subset is kept in a bulky, almost linear branch of the tree (right of the root) while the commonly changed and new values are placed on the left. This unbalanced tree would actually perform faster than a balanced tree, as the average insertion and deletion would require less than O(log n) work; the trade-off is that the long-lifecycle elements sacrificed to the right branch become closer to O(n) when they are accessed.

Which is best all comes down to the sort of data being put into the set. If 99% of it is unchanged, with only a small subset being changed at any time, then such an unbalanced tree would be considerably faster thanks to its better-than-O(log n) average performance. If all data is changed equally often, then the best you can do is a balanced tree.
How often and how many elements are changed is undefined. However, I think we can assume that usually only a "small" part (less than 50%) of the elements is changed at once.

For now I'll try using a tree, but new ideas are still welcome.
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
its space complexity is quite bad when a binary tree is used
O(n) space complexity is not that bad, as it basically translates to a constant extra overhead for every element and can be disguised as such.

If space complexity is an issue, you can optimize it by using an N-child tree structure where N is greater than 2. This means that each node would have more than 2 children, which in return reduces the number of internal nodes and thus the space required, but at the cost of each node operation becoming more O(n)-like (slower node operations).
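To put rough, purely illustrative numbers on that trade-off (the n of four million is a made-up figure in the range mentioned later in the thread): a k-ary product tree has roughly n/(k-1) internal nodes, and a leaf update recombines about k children at each of about log_k(n) ancestors.

```python
import math

def nary_tree_costs(n, k):
    """Approximate internal-node count and per-update work for a k-ary product tree."""
    internal_nodes = n / (k - 1)
    update_work = k * math.log(n, k)     # ~k ops at each of ~log_k(n) ancestors
    return internal_nodes, update_work

for k in (2, 16, 64):
    nodes, work = nary_tree_costs(4_000_000, k)
    print(f"k={k:2d}: ~{nodes:>9,.0f} internal nodes, ~{work:3.0f} ops per leaf update")
```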

In any case, you will probably find that the highest performance and space efficiency are obtained if you classify elements of the set based on their usage pattern. This lets you move elements that are seldom or never changed out of the less space-efficient but faster structure and into a more space-efficient but slower one.

However, is space complexity really a concern? Data storage and memory are cheap (were cheap until that factory burned down the other day), so it is generally considered a good idea to sacrifice space for performance, as long as the sacrifice is reasonable. Memory proportional to the number of stored elements is totally reasonable and can be written off as a constant overhead per element. However, if you used some system with a space complexity of O(n^2) or, worse, O(n!), then space becomes a real problem, as anything but the smallest sets would start to use infeasible quantities of space.
 
Level 6
Joined
Aug 26, 2009
Messages
192
If space complexity is an issue, you can optimize it by using an N-child tree structure where N is greater than 2. This means that each node would have more than 2 children, which in return reduces the number of internal nodes and thus the space required, but at the cost of each node operation becoming more O(n)-like (slower node operations).

Ouhm. What do you think he means by this:

muzzel said:
This is the tree I mentioned in my first two posts; its space complexity is quite bad when a binary tree is used. That's why I was thinking about something like a sqrt(n)-tree.

However, is space complexity really a concern? Data storage and memory are cheap (were cheap until that factory burned down the other day), so it is generally considered a good idea to sacrifice space for performance, as long as the sacrifice is reasonable. Memory proportional to the number of stored elements is totally reasonable and can be written off as a constant overhead per element. However, if you used some system with a space complexity of O(n^2) or, worse, O(n!), then space becomes a real problem, as anything but the smallest sets would start to use infeasible quantities of space.

Well, he was talking about several million elements. So I think yes, space might be a problem here.
 
Level 14
Joined
Jun 27, 2008
Messages
1,325
If space complexity is an issue, you can optimize it by using an N-child tree structure where N is greater than 2. This means that each node would have more than 2 children, which in return reduces the number of internal nodes and thus the space required, but at the cost of each node operation becoming more O(n)-like (slower node operations).

In any case, you will probably find that the highest performance and space efficiency are obtained if you classify elements of the set based on their usage pattern. This lets you move elements that are seldom or never changed out of the less space-efficient but faster structure and into a more space-efficient but slower one.
That's ... exactly what I have been suggesting since my first post.

However, is space complexity really a concern? Data storage and memory are cheap (were cheap until that factory burned down the other day), so it is generally considered a good idea to sacrifice space for performance, as long as the sacrifice is reasonable. Memory proportional to the number of stored elements is totally reasonable and can be written off as a constant overhead per element. However, if you used some system with a space complexity of O(n^2) or, worse, O(n!), then space becomes a real problem, as anything but the smallest sets would start to use infeasible quantities of space.
How costly space usage is really depends on the problem, which I didn't explain at all. Short answer: yes, space is an issue.
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
Yes, space is an issue.
Must be quite something you are doing with it then, lol.

Anyway, you can get around space limitations through techniques such as memory mapping, which both increases the quantity of virtual memory you can allocate and implements persistence for the set.

Instead of storing your data structure in process memory obtained from the standard page allocation, you can map address space from a file on some large persistent backing store and use that mapped space as the storage for the data structure. The OS automatically handles writing and reading the pages from disk, just as it does with the standard page file, except that you can allocate as many files as you have disk space/address space for, and the data persists when the process closes.
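A minimal sketch of that technique in Python, with a made-up file name and a made-up record layout of one 8-byte value per tree node (the point is only that the mapped file, not the heap, backs the structure):

```python
import mmap
import struct

RECORD = struct.Struct("<d")        # hypothetical: one 8-byte float per node
NUM_NODES = 8_000_000               # hypothetical capacity, a few million nodes

with open("product_tree.bin", "w+b") as f:
    f.truncate(NUM_NODES * RECORD.size)              # reserve the backing storage
    with mmap.mmap(f.fileno(), NUM_NODES * RECORD.size) as view:

        def write_node(i, value):
            RECORD.pack_into(view, i * RECORD.size, value)

        def read_node(i):
            return RECORD.unpack_from(view, i * RECORD.size)[0]

        write_node(42, 3.5)
        assert read_node(42) == 3.5
        # Pages that are rarely touched get evicted from RAM by the OS,
        # while the hot part of the tree stays resident; the data also
        # survives process restarts because it lives in the file.
```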

Combined with a 64-bit process and a cheap 1 TB hard disk, this lets you allocate dozens of gigabytes of process address space for the data structure. All you then need to do is try to keep infrequently used elements clustered together.

Sure, you will take a performance hit from page faults since the process can no longer reside fully in memory, but since less than 50% of the data is frequently changed, most of the data structure can be paged out without much effect on performance. The operating system's memory management automatically evicts infrequently used pages in preference to heavily used ones, so the process should settle on a reasonably sized working set, hopefully small enough to sit in memory, despite having an enormous data structure loaded and in use.
 
I don't see the problem with this solution:

1) Outline of the tree:

[Attachment: tree.png – outline of the tree]


All your values m would be at the bottom of the tree, and the nodes in the middle just represent results of the binary operation.

When you insert, you should have a stack of empty leaf positions at the bottom of the tree that need to be filled first.

When the number of nodes at the bottom drops below a quarter of the space available there, you rebalance, because at that point you'd have a seriously malformed tree. You could do it as early as possible (i.e. when the number of nodes at the bottom equals half the space available, which is 2^tree_level with tree_level = 0 at the root), but this might not be worth it. In fact, you could postpone restructuring until the space available is 8 times the number of nodes; I really don't know what factor would work best here ^o^.

When you remove a node:

[Attachment: tree_delete.png – removing a node]


You would simply push the position of the removed node onto the insertion stack, so the next inserted item goes there, and then you would do O(log n) binary operations to recompute the results on the path to the root.
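A sketch of that insert/remove scheme on top of the hypothetical ProductTree from earlier (again assuming "*" has a neutral element so a freed leaf can be reset):

```python
class SlotManagedTree:
    """Insert/remove on a fixed-shape product tree using a free-slot stack."""
    def __init__(self, capacity, op, identity):
        self.tree = ProductTree(capacity, op, identity)
        self.identity = identity
        self.free = list(range(capacity - 1, -1, -1))   # every leaf starts empty

    def insert(self, value):
        slot = self.free.pop()                  # next empty leaf position
        self.tree.update(slot, value)           # O(log n) ancestors recomputed
        return slot                             # remember this to remove later

    def remove(self, slot):
        self.tree.update(slot, self.identity)   # the leaf becomes neutral again
        self.free.append(slot)                  # position reused by the next insert

    def total(self):
        return self.tree.total()
```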

I think you or DSG already said this, but I felt the need to stick my ego in here :D

edit
So Menag told me you're already doing something like this except with more than 2 children per node ;_;
I'll just pack my bags and ~leave~ ;-;
 

Level 14
Joined
Jun 27, 2008
Messages
1,325
OK, so for those who are interested in my solution to this:
While I don't have inverse elements, I was able to find an approximation for them with an upper bound on the error. So my solution is the naive approach of using the (approximate) inverse to remove an element while keeping track of the accumulated error. When the error exceeds a defined threshold, I recalculate the whole thing. It's simple and fast :)
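Since the thread never says what "*" actually is, this can only be sketched in shape: an approximate inverse, a running error bound, and a full O(n) recomputation once the bound crosses a threshold (the operation, the approximate inverse and the per-removal error bound below are all placeholders):

```python
from functools import reduce

class ApproxInverseTotal:
    """Maintains S with an approximate inverse; recomputes from scratch
    once the accumulated error bound exceeds max_error."""
    def __init__(self, values, op, approx_inv, step_error, max_error):
        self.values = dict(enumerate(values))      # handle -> element
        self.op, self.approx_inv = op, approx_inv
        self.step_error, self.max_error = step_error, max_error
        self._recompute()

    def _recompute(self):
        self.total = reduce(self.op, self.values.values())
        self.error_bound = 0.0

    def add(self, handle, value):
        self.values[handle] = value
        self.total = self.op(self.total, value)            # exact, no error added

    def remove(self, handle):
        value = self.values.pop(handle)
        self.total = self.approx_inv(self.total, value)    # approximate "undo"
        self.error_bound += self.step_error                # per-removal upper bound
        if self.error_bound > self.max_error:
            self._recompute()                              # occasional O(n) repair
```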
 