For this article I will be using Python 3.6 and IPython 6.2.1.

Before starting on the technical side, let's quickly present Python sets. If you want the full story, don't hesitate to go to the official documentation.

Sets are a Python data structure which can contain any hashable object. For each element a hash is computed. Consequently, there can't be duplicated values, as two equal values would get the same hash. It also means that accessing an object in a set should be close to O(1), though it can reach O(n) in the worst case.

This makes sets extremely efficient for membership testing. Contrary to a dictionary, a value in a set doesn't point to another value.
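These two properties, deduplication and the hashing requirement, are easy to observe directly; here is a minimal sketch:

```python
# Duplicated values collapse: both 2s hash to the same value.
s = {1, 2, 2, 3}
print(s, len(s))  # 3 distinct elements

# Only hashable objects are allowed; a list has no hash.
try:
    {[1, 2], 3}
except TypeError as err:
    print(err)  # unhashable type: 'list'
```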

There are two ways of creating a set:

```
In [1]: {"a","b"}
Out[1]: {'a', 'b'}
In [2]: set(["a","b"])
Out[2]: {'a', 'b'}
```

To test the membership of an object you can just use the usual *in* operator.

```
In [1]: "a" in {"a","b"}
Out[1]: True
```

What is the cost of creating a set? To test it, we generate a list and then create a set based on it, using the timeit feature of IPython.

```
In [1]: r = list(range(0,10))
In [2]: timeit(set(r))
560 ns ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

We plot the timings both for set and for list creation, as this will matter for future experiments.
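A minimal version of that experiment can be written with the stdlib `timeit` module instead of the IPython magic (the sizes here are arbitrary choices, not the exact ones used for the plot):

```python
from timeit import timeit

# Compare the per-call cost of building a set vs. a list
# from the same source data, for a few arbitrary sizes.
for n in (10, 1000, 10_000):
    r = list(range(n))
    t_set = timeit(lambda: set(r), number=1000) / 1000
    t_list = timeit(lambda: list(r), number=1000) / 1000
    print("{:>6} elements: set {:.2e}s, list {:.2e}s".format(n, t_set, t_list))
```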

Here are the results:

We can see that for relatively low numbers of elements, the time needed to create a set grows much faster than the time needed to create a list of the same size. This is expected, as the set needs to compute hashes and manage collisions.

For really large numbers of elements, both seem to tend toward the same time. This is most likely due to the memory reallocation needed while growing both data structures: the time spent on reallocation becomes predominant in the creation process.

So now the question: based on the creation time of both structures, when is it more interesting to do:

```
if 1 in {1,2,3,4}:
    pass
```

rather than:

```
if 1 in [1,2,3,4]:
    pass
```

We are going to test three cases here. For each one we instantiate the data structure and then do the membership test.

In the case of the list, as Python needs to scan it linearly, we consider two sub-cases: the best case, when the item we are looking for is the first item of the list, and the worst case, when the item is not in the list at all.
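Such a micro-benchmark could look like the following sketch. One caveat: CPython's peephole optimizer may turn a constant set literal on the right of *in* into a frozenset constant, which can skew this kind of measurement.

```python
from timeit import timeit

number = 100_000
# Each statement includes the literal, so construction cost is part
# of the timing (unless the interpreter optimizes it away).
t_set   = timeit('1 in {1, 2, 3, 4}', number=number) / number
t_first = timeit('1 in [1, 2, 3, 4]', number=number) / number  # list best case
t_miss  = timeit('5 in [1, 2, 3, 4]', number=number) / number  # list worst case
print(t_set, t_first, t_miss)
```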

Here are the results obtained.

From the experiment we can quickly see that using a set literal in a condition is not a good idea. The list best case is always much better than the set, and the list worst case is never really far from it.

There would be very little gain in using a set in a condition if you instantiate it only for that single test.

So when is a set becoming more interesting than a list for membership testing?

As a set is slower to create than a list, a set becomes interesting only when several membership tests are performed on it.

Let's take a look at the time cost of membership testing for both of them, with the same methodology as the previous experiment.
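A sketch of this measurement: each structure is built once, and only the *in* test itself is timed. The size of 1000 elements is an arbitrary choice.

```python
from timeit import timeit

n = 1000
data_list = list(range(n))
data_set = set(data_list)

reps = 10_000
# Structures are pre-built; only the membership test is measured.
t_set   = timeit(lambda: (n - 1) in data_set, number=reps) / reps
t_first = timeit(lambda: 0 in data_list, number=reps) / reps   # list best case
t_miss  = timeit(lambda: -1 in data_list, number=reps) / reps  # list worst case
print(t_set, t_first, t_miss)
```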

The difference is small, but set membership testing is always faster than the list one. We can see here that the O(1) complexity is achieved: it is the same complexity as testing the first element of the list, which is why the two are pretty close. But against the list worst case, where the complexity is O(n), the set shines.

So with multiple membership tests the set becomes more interesting; the question is how many tests you need before it pays off.

It seems there is a correlation where the number of membership tests needed is about twice the number of elements, compared with the list best case. Compared with the worst case, if we take the deviation into account, the set seems to become interesting after the very first test.
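That break-even point can also be estimated directly from measured costs: a set pays off after roughly k tests, where build_set + k·test_set equals build_list + k·test_list. A sketch under those assumptions, here against the list worst case:

```python
from timeit import timeit

n = 100
source = list(range(n))

reps = 10_000
build_set  = timeit(lambda: set(source), number=reps) / reps
build_list = timeit(lambda: list(source), number=reps) / reps

s = set(source)
test_set  = timeit(lambda: -1 in s, number=reps) / reps
test_miss = timeit(lambda: -1 in source, number=reps) / reps  # list worst case

# Solve build_set + k * test_set == build_list + k * test_miss for k.
k = (build_set - build_list) / (test_miss - test_set)
print("set pays off after ~{:.1f} worst-case membership tests".format(k))
```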

But sets can be used in a lot of different ways. I am not going to compare the efficiency of their special methods, like union or difference, here.

To deduplicate a list it is not rare to see:

```
set(my_list)
```

How efficient is it to use a set from the start? Let's consider two cases: when all the elements are different, and when they are all identical.

Here is the code used to get the measurements we need. We use three functions: the first tests deduplication with a set; the second does the same but converts the set back to a list at the end; the last is the most efficient list-based deduplication, which avoids inserting possibly duplicated elements in the first place.

```
from timeit import timeit as ti


def deduplicate_set(elements):
    my_set = set()
    # Simulating the progressive addition of values.
    for element in elements:
        my_set.add(element)
    # That's all: sets don't keep duplicated values.


def deduplicate_set_final_list(elements):
    # In this case we also take into account the
    # fact that someone wants a list in the end.
    my_set = set()
    # Simulating the progressive addition of values.
    for element in elements:
        my_set.add(element)
    # Conversion of the set into a list.
    list(my_set)


def deduplicate_list_pre(elements):
    my_list = list()
    # Simulating the progressive addition of values.
    for element in elements:
        # We avoid duplication by checking before insertion.
        if element not in my_list:
            my_list.append(element)


# To avoid temporary skewing we do several iterations.
nb_iter = 1000
tests_to_do = [10, 100, 500, 1000]
for value_to_test in tests_to_do:
    elements = list(range(value_to_test))
    t1 = ti("deduplicate_set(elements)",
            setup="from __main__ import elements, deduplicate_set",
            number=nb_iter) / nb_iter
    t2 = ti("deduplicate_set_final_list(elements)",
            setup="from __main__ import elements, deduplicate_set_final_list",
            number=nb_iter) / nb_iter
    t3 = ti("deduplicate_list_pre(elements)",
            setup="from __main__ import elements, deduplicate_list_pre",
            number=nb_iter) / nb_iter
    print("results for {} elements".format(value_to_test))
    print(t1)
    print(t2)
    print(t3)
```

All elements are different:

This was predictable, because *in* on a list is an O(n) operation, which makes the whole list-based deduplication O(n²). In this case the set shines like a thousand suns: even with a really low number of elements it is faster than the list.

Now let's try with all the elements being identical.

```
for value_to_test in tests_to_do:
    elements = [1] * value_to_test
    t1 = ti("deduplicate_set(elements)",
            setup="from __main__ import elements, deduplicate_set",
            number=nb_iter) / nb_iter
    t2 = ti("deduplicate_set_final_list(elements)",
            setup="from __main__ import elements, deduplicate_set_final_list",
            number=nb_iter) / nb_iter
    t3 = ti("deduplicate_list_pre(elements)",
            setup="from __main__ import elements, deduplicate_list_pre",
            number=nb_iter) / nb_iter
    print("results for {} elements".format(value_to_test))
    print(t1)
    print(t2)
    print(t3)
```

All elements are identical:

When the element to check is the first one, the list is faster than the set. I guess this case would rarely happen, but it is good to know.

The difference is much smaller than in the previous test. If you can predict what your data will look like, you will be able to make the right choice. Otherwise, the set seems to be the way to go.

To conclude this musing session, I must admit this is far from what I imagined: I thought the set would be much faster to build, and that it would be a handy tool in many cases. But it seems this drawback will limit its use.