Choosing a good data type for an identifier column is very important. You're more likely to compare these columns to other values (for example, in joins) and to use them for lookups than other columns. You're also likely to use them in other tables as foreign keys, so when you choose a data type for an identifier column, you're probably choosing the type in related tables as well. (It's a good idea to use the same data types in related tables, because you're likely to use them for joins.)
When choosing a type for an identifier column, you need to consider not only the storage type, but also how MySQL performs computations and comparisons on that type. Once you choose a type, make sure you use the same type in all related tables. Mixing different data types can cause performance problems, and even if it doesn't, implicit type conversions during comparisons can create hard-to-find errors. These may even crop up much later, after you've forgotten that you're comparing different data types.
Choose the smallest size that can hold your required range of values, and leave room for future growth if necessary. For example, using TINYINT for state_id, rather than INT.
Integers are usually the best choice for identifiers, because they're fast and they work with AUTO_INCREMENT.
ENUM and SET
The ENUM and SET types are generally a poor choice for identifiers, though they can be okay for static "definition tables" that contain status or "type" values. ENUM and SET columns are appropriate for holding information such as an order's status, a product's type, or a person's gender.
Avoid string types for identifiers if possible, because they take up a lot of space and are generally slower than integer types.You should also be very careful with completely "random" strings, such as those produced by MD5(), SHA1(), or UUID(). Each new value you generate with them will be distributed in arbitrary ways over a large space, which can slow INSERT and some types of SELECT queries:
- They slow INSERT queries because the inserted value has to go in a random location in indexes. This causes page splits, random disk accesses, and clustered index fragmentation for clustered storage engines(Such as InnoDB).
- They slow SELECT queries because logically adjacent rows will be widely dispersed on disk and in memory.
- Random values cause caches to perform poorly for all types of queries because they defeat locality of reference, which is how caching works. If the entire dataset is equally "hot" there is no advantage to having any particular part of the data cached in memory, and if the working set does not fit in memory, the cache will have a lot of flushes and misses.