Introduction to Tensors
What is a tensor?
PyTorch introduces a fundamental data structure: the tensor.
In simple words, a tensor is a data structure that stores a collection of numbers that are individually accessible by an index, and that can be indexed with multiple indices. In the context of deep learning, tensors are the generalization of vectors and matrices to an arbitrary number of dimensions. Another name for the same concept is multidimensional array.
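As a small illustration of indexing with one or several indices, here is a minimal sketch. The values and variable names are made up for the example.

```python
import torch

# A 2D tensor: three points, each with two coordinates (arbitrary values).
points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])

n_dims = points.ndim          # number of dimensions
shape = tuple(points.shape)   # size along each dimension
first_row = points[0]         # one index selects a row, which is itself a 1D tensor
element = points[0, 1]        # two indices select a single element
value = element.item()        # convert the 0-dimensional tensor to a Python float

print(n_dims, shape, first_row, value)
```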
The essence of tensors
While Python lists or tuples are collections of Python objects that are individually stored in memory, PyTorch tensors and NumPy arrays are views over contiguous memory blocks containing unboxed C numeric types.
Why unboxed C numeric types?
- Numbers in Python are objects. A 32-bit floating point number, which needs only 32 bits to be represented, is converted into a full Python object with reference counting and so on (this is called boxing). This isn’t a problem when dealing with a small set of numbers, but allocating millions of numbers this way gets inefficient.
- A list in Python is an indexable collection of pointers to Python objects of any kind, and is meant for sequential collections of them. Lists offer no efficiently implemented numerical operations, such as taking the dot product of two vectors. Moreover, lists are one-dimensional, and although you can create lists of lists, this is very inefficient.
- With NumPy arrays and tensors we rely on optimized code written in a compiled low-level language like C, which makes mathematical operations on numerical data much faster than running them in the Python interpreter.
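The difference between boxed and unboxed storage can be made visible with a rough sketch like the one below. The size reported by sys.getsizeof is specific to the Python implementation in use, so treat that number as indicative only.

```python
import sys
import torch

# A Python float is a full object (type pointer, reference count, value).
boxed_size = sys.getsizeof(1.0)   # bytes for one boxed float; implementation-specific

# A float32 tensor stores raw 4-byte values side by side in one memory block.
t = torch.zeros(1_000_000, dtype=torch.float32)
unboxed_size = t.element_size()                 # bytes per element
total_bytes = t.element_size() * t.nelement()   # bytes used by all the numbers

print(boxed_size, unboxed_size, total_bytes, t.is_contiguous())
```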
Creating a tensor
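There are several ways to create a tensor. The following is a minimal sketch of a few common constructors, with arbitrary shapes and values chosen for the example.

```python
import torch

from_list = torch.tensor([1.0, 2.0, 3.0])   # from a Python list
zeros = torch.zeros(3, 2)                   # 3x2 tensor filled with zeros
ones = torch.ones(2, 3)                     # 2x3 tensor filled with ones
steps = torch.arange(0, 10, 2)              # evenly spaced integers: 0, 2, 4, 6, 8

# Elements can be read and overwritten by index.
from_list[1] = 5.0

print(from_list, zeros.shape, ones.shape, steps)
```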
The tensor data types
Data types specify the possible values a tensor can hold and the number of bytes used to store each value. You can set the data type with the dtype argument of a tensor constructor.
- Computations in neural networks are usually executed with 32-bit floating point precision. 64-bit precision is usually not recommended, as it typically does not improve accuracy and requires more memory and computing time.
- When tensors are used to index other tensors, PyTorch expects them to have a 64-bit integer data type.
As such, we often use torch.float32 and torch.int64 as dtype.
Note that although the default floating point type in PyTorch is float32, in NumPy it’s float64.
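The defaults and the dtype argument described above can be checked with a short sketch like this one (variable names are illustrative, and it assumes default dtype settings have not been changed):

```python
import numpy as np
import torch

floats = torch.tensor([1.0, 2.0])               # default floating point dtype in PyTorch
ints = torch.tensor([1, 2])                     # default integer dtype in PyTorch
doubles = torch.ones(2, dtype=torch.float64)    # dtype set explicitly in the constructor
np_floats = np.array([1.0, 2.0])                # default floating point dtype in NumPy

# An int64 tensor used to index another tensor.
values = torch.tensor([10.0, 20.0, 30.0])
idx = torch.tensor([2, 0])
picked = values[idx]

# Converting to another dtype returns a new tensor.
as_float32 = doubles.to(torch.float32)

print(floats.dtype, ints.dtype, doubles.dtype, np_floats.dtype, picked)
```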
Moving tensors to GPU
Every PyTorch tensor can be transferred to a GPU that supports CUDA. Once it is there, all operations on the tensor are performed using the GPU-specific routines that come with PyTorch, which makes them capable of massively parallel and very fast computation.
To specify where a tensor is placed (GPU or CPU), we can either use the device argument in its constructor or copy an already created tensor to the desired device with the to method.
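A minimal sketch of both options follows. It falls back to the CPU when no CUDA device is available, so the device name is chosen at runtime; that fallback is an addition for the sake of a runnable example.

```python
import torch

# Pick the GPU if one is available, otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Option 1: place the tensor on the device when constructing it.
on_device = torch.tensor([[4.0, 1.0], [5.0, 3.0]], device=device)

# Option 2: create the tensor on the CPU, then transfer it with the to method.
cpu_tensor = torch.tensor([[4.0, 1.0], [5.0, 3.0]])
copied = cpu_tensor.to(device=device)

# Operations run on whichever device holds the tensor.
doubled = copied * 2

print(on_device.device, copied.device, doubled)
```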
We can also use the shorthand methods cpu and cuda instead of the to method to achieve the same goal.
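The shorthand methods can be sketched as follows. The cuda call is guarded so that the example also runs on a machine without a GPU.

```python
import torch

points = torch.tensor([1.0, 2.0, 3.0])

if torch.cuda.is_available():
    points_gpu = points.cuda()       # shorthand for points.to(device="cuda")
    points_back = points_gpu.cpu()   # shorthand for points_gpu.to(device="cpu")
else:
    points_back = points.cpu()       # already on the CPU, nothing to transfer

print(points_back.device)
```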
When the target is a GPU, the to method returns a tensor with the same numerical data, but stored in GPU RAM rather than regular system RAM. It’s also worth mentioning that with the to method we can change the placement and the data type simultaneously by providing both device and dtype as arguments.
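Changing placement and data type in one call might look like this sketch, which again falls back to the CPU when CUDA is not available:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

ints = torch.tensor([1, 2, 3])   # int64 tensor on the CPU

# Move to the chosen device and convert to float32 in a single call.
moved = ints.to(device=device, dtype=torch.float32)

print(moved.device, moved.dtype)
```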
NumPy Array vs Tensor
As you might already know, NumPy is another library for dealing with multidimensional arrays. In fact, it’s so popular that it has arguably become the lingua franca of data science. However, PyTorch tensors have a few superpowers in comparison to NumPy arrays: they can perform very fast operations on GPUs, distribute operations across multiple devices and machines, and keep track of the graph of computations that created them.
To get a NumPy array out of a tensor we can use the numpy method, and to go the other way we can use torch.from_numpy.
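A minimal sketch of the conversion in both directions, for a tensor that lives on the CPU:

```python
import numpy as np
import torch

points = torch.ones(3, 4)

points_np = points.numpy()           # tensor -> NumPy array
back = torch.from_numpy(points_np)   # NumPy array -> tensor

print(type(points_np), points_np.dtype, back.dtype)
```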
Interestingly, if the tensor is placed on the CPU, the NumPy array returned by the numpy method shares the same buffer with the tensor storage. This means the method can be executed at essentially no cost. It also means that modifying the NumPy array will change the original tensor. If the tensor is allocated on the GPU, its content first has to be copied to the CPU (for example with the cpu method) before it can be turned into a NumPy array, so that array does not share memory with the GPU tensor.
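The shared buffer for CPU tensors can be observed with a sketch like this. The clone call at the end is one way to get an independent array when sharing is not wanted.

```python
import torch

t = torch.zeros(3)
arr = t.numpy()        # shares memory with t (CPU tensor)

arr[0] = 7.0           # write through the NumPy array
shared = t[0].item()   # the tensor sees the change

# Cloning first gives an array that does not share memory with t.
independent = t.clone().numpy()
independent[1] = 9.0
untouched = t[1].item()

print(shared, untouched)
```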
Serializing Tensors
Serialization, also called marshaling, is the process of converting objects to byte streams. It allows the developer to save the state of an object and re-create it as needed, providing storage of objects as well as data exchange.
Now suppose that the data inside our tensor is valuable and we want to save it to a file and be able to load it back. To do that, we need to serialize the tensor object. PyTorch uses pickle under the hood, plus dedicated serialization code for the storage, and exposes this through the torch.save and torch.load functions.
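A minimal sketch of saving and loading with torch.save and torch.load follows. The file name is hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile
import torch

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "points.t")   # hypothetical file name

    # Save and load by passing a path.
    torch.save(points, path)
    loaded = torch.load(path)

    # Alternatively, pass an open file object.
    with open(path, "wb") as f:
        torch.save(points, f)
    with open(path, "rb") as f:
        loaded_from_file = torch.load(f)

print(loaded)
```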
The problem with this approach is that, although it’s fast and convenient, the saved tensor cannot be read with software other than PyTorch.
To save tensors and load them back in systems that already rely on other libraries like TensorFlow, we can use HDF5.
HDF5 is a portable, widely supported format for representing serialized multidimensional arrays, organized in a key-value dictionary. To use the HDF5 format from Python you must first install the h5py library. A tensor is converted to a NumPy array before it is written to an HDF5 dataset, and the data read back can be converted to a tensor again.
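Saving and loading through HDF5 might look like the sketch below. It assumes h5py is installed; the file name and the dataset key "coords" are made up for the example.

```python
import os
import tempfile

import h5py
import torch

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "points.hdf5")   # hypothetical file name

    # Write: convert the tensor to a NumPy array and store it under a key.
    with h5py.File(path, "w") as f:
        f.create_dataset("coords", data=points.numpy())

    # Read: slice the dataset (only that slice is read from disk),
    # then wrap the resulting NumPy array in a tensor.
    with h5py.File(path, "r") as f:
        dset = f["coords"]
        last_points = torch.from_numpy(dset[1:])

print(last_points)
```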
The majority of this article consists of my notes from chapter 3 of the book “Deep Learning with PyTorch”. If you want to learn PyTorch, I highly recommend reading this book. I also used pictures from the book for better understanding.
In my next article I will continue to write about PyTorch.