CS229 Spring 2020 Python Tutorial

Basic Python

If Statement

In [25]:
code = 230

if code == 229:
    print('Hello CS229!')
elif code == 230:
    print('That\'s deep learning!')
elif code < 200:
    print('That is some undergraduate class')
else:
    print('Wrong class!')
That's deep learning!

Python doesn't have "switch" statement.

Python Operators

Logical operators

In [26]:
true = True
false = False
if true:
    print("It's true!")
if not false:
    print("It's still true!")
if true and not false:
    print("Anyhow, it's true!")
if false or not true:
    print("True?")
else:
    print("Okay, it's false now....")
It's true!
It's still true!
Anyhow, it's true!
Okay, it's false now....

&, | and ~ are all bitwise operators.
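A quick sketch of the difference: on booleans the bitwise operators happen to agree with and/or/not, but on integers they act on bits, and they never short-circuit.

```python
# On booleans, & and | give boolean results too...
print(True & False)   # False
print(True | False)   # True

# ...but on integers they operate bit by bit
print(6 & 3)   # 0b110 & 0b011 = 0b010 = 2
print(6 | 3)   # 0b110 | 0b011 = 0b111 = 7
print(~5)      # bitwise NOT: ~x == -x - 1, so -6
```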

Arithmetic operators.

In [27]:
print(5 / 2) # floating number division
print(5 % 2) # remainder
print(5 ** 2) # exponentiation
print(5 // 2) # integer division
2.5
1
25
2

^ means bitwise XOR in Python.
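Since `^` is XOR, a classic trap is writing `^` when you mean exponentiation:

```python
print(5 ^ 3)   # 0b101 ^ 0b011 = 0b110 = 6, NOT 125
print(5 ** 3)  # exponentiation is **, giving 125
```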

Loop

We typically use range and enumerate for iterations. You can loop over any iterable.

In [28]:
for i in range(5):
    print(i)
0
1
2
3
4
In [29]:
a = 5
while a > 0:
    print(a)
    a -= 1
5
4
3
2
1

Python doesn't have increment/decrement operators like "a++" or "a--".
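The usual substitute is augmented assignment:

```python
a = 5
a += 1    # instead of a++
print(a)  # 6
a -= 2    # instead of a-- (twice)
print(a)  # 4
```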

Function

Python functions can take default arguments; they have to come after the non-default ones. Be VERY careful, because forgetting that you have a default argument can prevent you from debugging effectively.

In [30]:
def power(v, p=2):
    return v ** p # How to return multiple values?

print(power(10))
print(power(10, 3))
100
1000
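The comment in the cell above asks how to return multiple values; one answer is to return a tuple and unpack it. `power_with_base` is a hypothetical helper for illustration.

```python
def power_with_base(v, p=2):
    # "Returning multiple values" really returns one tuple
    return v ** p, v

result, base = power_with_base(10, 3)
print(result, base)  # 1000 10
```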

Functions can accept extra positional and keyword arguments. You can pass them on to another function, or use them directly.

In [31]:
def func2(*args, **kwargs):
    print(args)
    print(kwargs)
    
def func1(v, *args, **kwargs):
    
    func2(*args, **kwargs)
    
    if 'power' in kwargs:
        return v ** kwargs['power']
    else:
        return v

print(func1(10, 'extra 1', 'extra 2', power=3))
print('--------------')
print(func1(10, 5))
('extra 1', 'extra 2')
{'power': 3}
1000
--------------
(5,)
{}
10

Simple Python data types

String

See Python documentation here

In [32]:
cs_class_code = 'CS-229'

print('I like ' + str(cs_class_code) + ' a lot!')
print(f'I like {cs_class_code} a lot!')

print('I love CS229. (upper)'.upper())
print('I love CS229. (rjust 50)'.rjust(50))
print('we love CS229. (capitalize)'.capitalize())
print('       I love CS229. (strip)        '.strip())
I like CS-229 a lot!
I like CS-229 a lot!
I LOVE CS229. (UPPER)
                          I love CS229. (rjust 50)
We love cs229. (capitalize)
I love CS229. (strip)

"f"-string (f for formatting?) is new since Python 3.6. Embed values using { }

In [33]:
print(f'{print} (print a function)')
print(f'{type(229)} (print a type)')
<built-in function print> (print a function)
<class 'int'> (print a type)

For reference, here is how people used to format strings. It is still useful when you want more control.

In [34]:
print('Old school formatting: {2}, {1}, {0:10.2F}'.format(1.358, 'b', 'c'))
# Fill in order of 2, 1, 0. For the decimal number, fix at length of 10, round to 2 decimal places
Old school formatting: c, b,       1.36
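The same width/precision controls work inside f-strings, which is often the cleaner option today:

```python
value = 1.358
# {value:10.2f}: total width 10, rounded to 2 decimal places
print(f'f-string formatting: c, b, {value:10.2f}')
```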

List

In general, data structure documentations can be found here

In [35]:
list_1 = ['one', 'two', 'three']
list_2 = [1, 2, 3]

list_2.append(4)
list_2.insert(0, 'ZERO')

List extension is just addition

In [36]:
print(list_1 + list_2)

list_1_temp = ['a', 'b']
list_1_temp.extend(list_2)

print(list_1_temp)
['one', 'two', 'three', 'ZERO', 1, 2, 3, 4]
['a', 'b', 'ZERO', 1, 2, 3, 4]

But be VERY careful when you multiply a list; we will explain why later.

In [37]:
print(list_1 * 3 + list_2)
print([list_1] * 3 + list_2)
['one', 'two', 'three', 'one', 'two', 'three', 'one', 'two', 'three', 'ZERO', 1, 2, 3, 4]
[['one', 'two', 'three'], ['one', 'two', 'three'], ['one', 'two', 'three'], 'ZERO', 1, 2, 3, 4]

pprint is your friend

In [38]:
import pprint as pp
In [39]:
pp.pprint([list_1] * 5 + list_2)
pp.pprint([list_1] * 2 + [list_2] * 3)
[['one', 'two', 'three'],
 ['one', 'two', 'three'],
 ['one', 'two', 'three'],
 ['one', 'two', 'three'],
 ['one', 'two', 'three'],
 'ZERO',
 1,
 2,
 3,
 4]
[['one', 'two', 'three'],
 ['one', 'two', 'three'],
 ['ZERO', 1, 2, 3, 4],
 ['ZERO', 1, 2, 3, 4],
 ['ZERO', 1, 2, 3, 4]]

List comprehension can save a lot of lines

In [40]:
long_list = [i for i in range(9)]
long_long_list = [(i, j) for i in range(3) for j in range(5)]
long_list_list = [[i for i in range(3)] for _ in range(5)]

pp.pprint(long_list)
pp.pprint(long_long_list)
pp.pprint(long_list_list)
[0, 1, 2, 3, 4, 5, 6, 7, 8]
[(0, 0),
 (0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (1, 0),
 (1, 1),
 (1, 2),
 (1, 3),
 (1, 4),
 (2, 0),
 (2, 1),
 (2, 2),
 (2, 3),
 (2, 4)]
[[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]]

List is iterable!

In [41]:
string_list = ['a', 'b', 'c']
for s in string_list:
    print(s)
for i, s in enumerate(string_list):
    print(f'{i}, {s}')
a
b
c
0, a
1, b
2, c

Slicing. With numpy arrays (covered later), you can do this to multi-dimensional ones as well.

In [42]:
print(long_list[:5])
print(long_list[:-1])
print(long_list[4:-1])

long_list[3:5] = [-1, -2]
print(long_list)

long_list.pop()
print(long_list)
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4, 5, 6, 7]
[4, 5, 6, 7]
[0, 1, 2, -1, -2, 5, 6, 7, 8]
[0, 1, 2, -1, -2, 5, 6, 7]

Sorting a list (but remember that sorting can be costly). Documentation for sorting is here

In [43]:
random_list = [3, 12, 5, 6, 8, 2]
print(sorted(random_list))

random_list_2 = [(3, 'z'), (12, 'r'), (5, 'a'), (6, 'e'), (8, 'c'), (2, 'g')]
print(sorted(random_list_2, key=lambda x: x[1]))
[2, 3, 5, 6, 8, 12]
[(5, 'a'), (8, 'c'), (6, 'e'), (2, 'g'), (12, 'r'), (3, 'z')]
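Note that sorted builds a NEW list, while list.sort sorts in place; both take the same key/reverse parameters. A small sketch:

```python
random_list = [3, 12, 5, 6, 8, 2]

# sorted() returns a new list and accepts reverse=
print(sorted(random_list, reverse=True))  # [12, 8, 6, 5, 3, 2]

# list.sort() sorts in place (and returns None)
random_list.sort()
print(random_list)  # [2, 3, 5, 6, 8, 12]
```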

Think first before copying. Assignment copies by reference, not by value. More about copying here

In [44]:
orig_list = [[1, 2], [3, 4]]
dup_list = orig_list

dup_list[0][1] = 'okay'
pp.pprint(orig_list)
pp.pprint(dup_list)
[[1, 'okay'], [3, 4]]
[[1, 'okay'], [3, 4]]
In [45]:
a = [[1, 2, 3]]*3
b = [[1, 2, 3] for i in range(3)]
a[0][1] = 4
b[0][1] = 4
print(a)
print(b)
[[1, 4, 3], [1, 4, 3], [1, 4, 3]]
[[1, 4, 3], [1, 2, 3], [1, 2, 3]]
In [46]:
import copy
In [47]:
orig_list = [[1, 2], [3, 4]]
dup_list = copy.deepcopy(orig_list)

dup_list[0][1] = 'okay'
pp.pprint(orig_list)
pp.pprint(dup_list)
[[1, 2], [3, 4]]
[[1, 'okay'], [3, 4]]

Tuple

List that you cannot edit.

In [48]:
my_tuple = (10, 20, 30)
my_tuple[0] = 40
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-a4317678f4cc> in <module>
      1 my_tuple = (10, 20, 30)
----> 2 my_tuple[0] = 40

TypeError: 'tuple' object does not support item assignment

Unpacking assignment makes your code shorter (it also works for lists).

In [50]:
a, b, c = my_tuple
print(f"a={a}, b={b}, c={c}")
for obj in enumerate(my_tuple):
    print(obj)
a=10, b=20, c=30
(0, 10)
(1, 20)
(2, 30)

Dictionary/Set

Again, documentation for data structure is here

In [51]:
my_set = {i ** 2 % 3 for i in range(10)}
my_dict = {(5 - i): i ** 2 for i in range(10)}

print(my_set)
print(my_dict)

print(my_dict.keys())
{0, 1}
{5: 0, 4: 1, 3: 4, 2: 9, 1: 16, 0: 25, -1: 36, -2: 49, -3: 64, -4: 81}
dict_keys([5, 4, 3, 2, 1, 0, -1, -2, -3, -4])

Updating and/or adding content to a dictionary

In [52]:
second_dict = {'a': 10, 'b': 11}
my_dict.update(second_dict)

pp.pprint(my_dict)

my_dict['new'] = 10
pp.pprint(my_dict)
{-4: 81,
 -3: 64,
 -2: 49,
 -1: 36,
 0: 25,
 1: 16,
 2: 9,
 3: 4,
 4: 1,
 5: 0,
 'a': 10,
 'b': 11}
{-4: 81,
 -3: 64,
 -2: 49,
 -1: 36,
 0: 25,
 1: 16,
 2: 9,
 3: 4,
 4: 1,
 5: 0,
 'a': 10,
 'b': 11,
 'new': 10}

Here is how to iterate through a dictionary. And remember that a dictionary is NOT sorted by key.

In [53]:
for k, it in my_dict.items(): # similar to for loop over enumerate(list)
    print(k, it)
5 0
4 1
3 4
2 9
1 16
0 25
-1 36
-2 49
-3 64
-4 81
a 10
b 11
new 10
In [54]:
# Sorting keys by string order
for k, it in sorted(my_dict.items(), key=lambda x: str(x[0])):
    print(k, it)
-1 36
-2 49
-3 64
-4 81
0 25
1 16
2 9
3 4
4 1
5 0
a 10
b 11
new 10

For defaultdict and sorted dictionary, see the collections documentation
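A minimal defaultdict sketch, since it comes up constantly when counting things:

```python
from collections import defaultdict

# Count word occurrences without checking "if key in dict" first
counts = defaultdict(int)   # missing keys default to int() == 0
for word in ['cs229', 'python', 'cs229']:
    counts[word] += 1

print(dict(counts))  # {'cs229': 2, 'python': 1}
```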

Numpy

Numpy is a nice vector and matrix manipulation package.

In [55]:
import numpy as np

Array initialization

Initialize from an existing list. If the types are not consistent, numpy will give you a surprising result (everything gets cast to a common type).

In [56]:
from_list = np.array([1, 2, 3])
from_list_2d = np.array([[1, 2, 3.0], [4, 5, 6]])
from_list_bad_type = np.array([1, 2, 3, 'a'])
                               
pp.pprint(from_list)
print(f'\t Data type of integer is {from_list.dtype}')
pp.pprint(from_list_2d)
print(f'\t Data type of float is {from_list_2d.dtype}')
pp.pprint(from_list_bad_type)
array([1, 2, 3])
	 Data type of integer is int64
array([[1., 2., 3.],
       [4., 5., 6.]])
	 Data type of float is float64
array(['1', '2', '3', 'a'], dtype='<U21')

Initialize with ones, zeros, or as identity matrix

In [57]:
print(np.ones(3))
print(np.ones((3, 3)))

print(np.zeros(3))
print(np.zeros((3, 3)))

print(np.eye(3))
[1. 1. 1.]
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[0. 0. 0.]
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Sampling over uniform distribution on $[0, 1)$.

In [58]:
print(np.random.random(3))
print(np.random.random((2, 2)))
[0.32901573 0.175129   0.34132364]
[[0.32287561 0.60218408]
 [0.17216162 0.42272833]]

Sampling over standard normal distribution.

In [59]:
print(np.random.randn(3, 3))
[[-0.44315124 -1.21745661  0.20513334]
 [ 1.40976472  1.80851604 -0.72227264]
 [-0.70184302 -0.75835938 -0.08404159]]

Numpy has built-in samplers for a lot of other common (and some not so common) distributions.
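A few examples of those other samplers (this is just a small sample of what numpy.random offers):

```python
import numpy as np

ints = np.random.randint(0, 10, size=5)            # uniform integers in [0, 10)
normals = np.random.normal(5.0, 2.0, size=3)       # Gaussian with mean 5, std 2
picks = np.random.choice(['a', 'b', 'c'], size=4)  # sample with replacement

print(ints, normals, picks)
```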

Array shape

Shape/reshape and multi-dimensional arrays

In [60]:
array_1d = np.array([1, 2, 3, 4])
array_1by4 = np.array([[1, 2, 3, 4]])
array_2by4 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(array_1d.shape)
print(array_1by4.shape)

print(array_1d.reshape(-1, 4).shape)

print(array_2by4.size)
(4,)
(1, 4)
(1, 4)
8
In [61]:
large_array = np.array([i for i in range(400)])
large_array = large_array.reshape((20, 20))

print(large_array[:, 5])

large_3d_array = np.array([i for i in range(1000)])
large_3d_array = large_3d_array.reshape((10, 10, 10))

print(large_3d_array[:, 1, 1])
print(large_3d_array[2, :, 1])
print(large_3d_array[2, 3, :])

print(large_3d_array[1, :, :])
[  5  25  45  65  85 105 125 145 165 185 205 225 245 265 285 305 325 345
 365 385]
[ 11 111 211 311 411 511 611 711 811 911]
[201 211 221 231 241 251 261 271 281 291]
[230 231 232 233 234 235 236 237 238 239]
[[100 101 102 103 104 105 106 107 108 109]
 [110 111 112 113 114 115 116 117 118 119]
 [120 121 122 123 124 125 126 127 128 129]
 [130 131 132 133 134 135 136 137 138 139]
 [140 141 142 143 144 145 146 147 148 149]
 [150 151 152 153 154 155 156 157 158 159]
 [160 161 162 163 164 165 166 167 168 169]
 [170 171 172 173 174 175 176 177 178 179]
 [180 181 182 183 184 185 186 187 188 189]
 [190 191 192 193 194 195 196 197 198 199]]

Think about the order you need before using reshape.

In [62]:
small_array = np.arange(4)
print(np.reshape(small_array, (2, 2), order='C')) # Default order
print(np.reshape(small_array, (2, 2), order='F'))
[[0 1]
 [2 3]]
[[0 2]
 [1 3]]

Numpy math

This also works for sin, cos, tanh, etc.

In [63]:
array_1 = np.array([1, 2, 3, 4])

print(array_1 + 5)
print(array_1 * 5)
print(np.sqrt(array_1))
print(np.power(array_1, 2))
print(np.exp(array_1))
print(np.log(array_1))
[6 7 8 9]
[ 5 10 15 20]
[1.         1.41421356 1.73205081 2.        ]
[ 1  4  9 16]
[ 2.71828183  7.3890561  20.08553692 54.59815003]
[0.         0.69314718 1.09861229 1.38629436]

For sum, mean, std, var, etc., you can perform the operation along a chosen axis.

In [64]:
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

pp.pprint(array_2d)
print(f'shape={array_2d.shape}')
print(np.sum(array_2d))
print(np.sum(array_2d, axis=0))
print(np.sum(array_2d, axis=1))

array_3d = np.array([i for i in range(8)]).reshape((2, 2, 2))
pp.pprint(array_3d)

print(np.sum(array_3d, axis=0))
print(np.sum(array_3d, axis=1))
print(np.sum(array_3d, axis=(1, 2)))
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
shape=(3, 3)
45
[12 15 18]
[ 6 15 24]
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
[[ 4  6]
 [ 8 10]]
[[ 2  4]
 [10 12]]
[ 6 22]

Numpy tends to do things element-wise. But be VERY CAREFUL when dimensions don't match; we will cover this in broadcasting. Actually, just be careful with the dimensions of arrays in general.

In [65]:
array_1 = np.array([1, 2, 3, 4])
array_2 = np.array([3, 4, 5, 6])

print(array_1 * array_2)
print(array_1 * array_2.reshape(4, -1)) # Come back to this later
[ 3  8 15 24]
[[ 3  6  9 12]
 [ 4  8 12 16]
 [ 5 10 15 20]
 [ 6 12 18 24]]

Dot product can be written in 3 ways

In [66]:
print(array_1 @ array_2)
print(array_1.dot(array_2))
print(np.dot(array_1, array_2))

print(array_1.shape)
50
50
50
(4,)

Here, dot fails because the (1, 4) shapes are not aligned, even though it worked fine with the 1D arrays just now. Check the shapes!

In [67]:
array_1 = np.array([[1, 2, 3, 4]])
array_2 = np.array([[3, 4, 5, 6]])

print(array_1.shape)

print(array_1 * array_2)
print(array_1.dot(array_2))
(1, 4)
[[ 3  8 15 24]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-853ea17e99b1> in <module>
      5 
      6 print(array_1 * array_2)
----> 7 print(array_1.dot(array_2))

ValueError: shapes (1,4) and (1,4) not aligned: 4 (dim 1) != 1 (dim 0)

With proper handling of shapes, things work. Also, for 2D arrays, dot is just matrix multiplication. You might just want to write matrix multiply (matmul) to keep things consistent and be SURE that you have the correct shapes.

In [68]:
# T for transpose

print(array_1.dot(array_2.T))
print(array_1.T.dot(array_2))

print(np.matmul(array_1, array_2.T))
print(np.matmul(array_1.T, array_2))
[[50]]
[[ 3  4  5  6]
 [ 6  8 10 12]
 [ 9 12 15 18]
 [12 16 20 24]]
[[50]]
[[ 3  4  5  6]
 [ 6  8 10 12]
 [ 9 12 15 18]
 [12 16 20 24]]
In [69]:
weight_matrix = np.array([1, 2, 3, 4]).reshape(2, 2)
sample = np.array([[50, 60]]).T

np.matmul(weight_matrix, sample)
Out[69]:
array([[170],
       [390]])

And of course, we typically use matmul for 2D matrix multiplications. For 3 or more dimensions, Numpy treats the input as a stack of matrices. See the Matmul documentation

In [70]:
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])

print(np.matmul(mat1, mat2))
[[19 22]
 [43 50]]
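A small sketch of the stacked behavior: with 3D inputs, matmul multiplies corresponding 2D slices.

```python
import numpy as np

stack1 = np.arange(8).reshape(2, 2, 2)   # a stack of two 2x2 matrices
stack2 = np.arange(8).reshape(2, 2, 2)

batched = np.matmul(stack1, stack2)
print(batched.shape)  # (2, 2, 2)

# Each slice equals the ordinary 2D product of the corresponding slices
print(np.array_equal(batched[0], np.matmul(stack1[0], stack2[0])))  # True
```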

Notice that np.multiply is element-wise multiplication, NOT proper matrix multiplication.

In [71]:
a = np.array([i for i in range(10)]).reshape(2, 5)

print(a * a)
print(np.multiply(a, a))
print(np.multiply(a, 10))
[[ 0  1  4  9 16]
 [25 36 49 64 81]]
[[ 0  1  4  9 16]
 [25 36 49 64 81]]
[[ 0 10 20 30 40]
 [50 60 70 80 90]]

Broadcasting and dimension manipulation

Numpy can perform operations on arrays with different shapes, inferring/expanding dimensions as needed. Taking examples from Scipy's documentation on numpy:

A      (4d array):  8 x 1 x 6 x 1
B      (3d array):      7 x 1 x 5
Result (4d array):  8 x 7 x 6 x 5

A      (2d array):  5 x 4
B      (1d array):      1
Result (2d array):  5 x 4

A      (2d array):  5 x 4
B      (1d array):      4
Result (2d array):  5 x 4

A      (3d array):  15 x 3 x 5
B      (3d array):  15 x 1 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 1
Result (3d array):  15 x 3 x 5

Essentially, all dimensions of size 1 can be "overlooked" or "expanded" to match the dimension of the other operand. But the order must match: dimensions of size 1 are only prepended, not appended. For example, the following would not work, though you might think we could add another dimension at the end of B.

A      (3d array):  15 x 3 x 5
B      (2d array):       1 x 3
Result (3d array):  15 x 3 x 5
In [72]:
op1 = np.array([i for i in range(9)]).reshape(3, 3)
op2 = np.array([[1, 2, 3]])
op3 = np.array([1, 2, 3])

pp.pprint(op1)
pp.pprint(op2)

# Notice that the result here is DIFFERENT!
print(op2.shape)
pp.pprint(op1 + op2)
pp.pprint(op1 + op2.T)

# Notice that the result here are THE SAME!
print(op3.shape)
pp.pprint(op1 + op3)
pp.pprint(op1 + op3.T)
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
array([[1, 2, 3]])
(1, 3)
array([[ 1,  3,  5],
       [ 4,  6,  8],
       [ 7,  9, 11]])
array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11]])
(3,)
array([[ 1,  3,  5],
       [ 4,  6,  8],
       [ 7,  9, 11]])
array([[ 1,  3,  5],
       [ 4,  6,  8],
       [ 7,  9, 11]])

Here, broadcasting won't work for 15 x 3 x 5 with 1 x 3. Because dimensions are only prepended.

But it WILL work for 15 x 3 x 5 with 3 x 1.

In [73]:
op1 = np.array([i for i in range(225)]).reshape(15, 3, 5)
op2 = np.array([[1, 2, 3]])

# This does not work
# print(op1 + op2)

# This works
print(op1 + op2.T)

# BTW you can contract the cells by clicking on the left
[[[  1   2   3   4   5]
  [  7   8   9  10  11]
  [ 13  14  15  16  17]]

 [[ 16  17  18  19  20]
  [ 22  23  24  25  26]
  [ 28  29  30  31  32]]

 [[ 31  32  33  34  35]
  [ 37  38  39  40  41]
  [ 43  44  45  46  47]]

 [[ 46  47  48  49  50]
  [ 52  53  54  55  56]
  [ 58  59  60  61  62]]

 [[ 61  62  63  64  65]
  [ 67  68  69  70  71]
  [ 73  74  75  76  77]]

 [[ 76  77  78  79  80]
  [ 82  83  84  85  86]
  [ 88  89  90  91  92]]

 [[ 91  92  93  94  95]
  [ 97  98  99 100 101]
  [103 104 105 106 107]]

 [[106 107 108 109 110]
  [112 113 114 115 116]
  [118 119 120 121 122]]

 [[121 122 123 124 125]
  [127 128 129 130 131]
  [133 134 135 136 137]]

 [[136 137 138 139 140]
  [142 143 144 145 146]
  [148 149 150 151 152]]

 [[151 152 153 154 155]
  [157 158 159 160 161]
  [163 164 165 166 167]]

 [[166 167 168 169 170]
  [172 173 174 175 176]
  [178 179 180 181 182]]

 [[181 182 183 184 185]
  [187 188 189 190 191]
  [193 194 195 196 197]]

 [[196 197 198 199 200]
  [202 203 204 205 206]
  [208 209 210 211 212]]

 [[211 212 213 214 215]
  [217 218 219 220 221]
  [223 224 225 226 227]]]

Tile

Treat broadcasting as tiling the lower-dimensional array to suit the size of the "more complex" array.

In [74]:
array = np.array([1, 2, 3])

# np.tile(array, shape)
print(np.tile(array, 2))
print(np.tile(array, (2, 3)))
[1 2 3 1 2 3]
[[1 2 3 1 2 3 1 2 3]
 [1 2 3 1 2 3 1 2 3]]

Observe how, with transpose, the tiled result is different. op2 originally has shape (1, 3), so:

Tiling it by (1, 5) means tiling the 2nd dimension 5 times, yielding (1, 15).

Tiling the transpose, which has shape (3, 1), by (1, 5) means tiling the 2nd dimension 5 times, yielding (3, 5).

In [75]:
op1 = np.array([i for i in range(225)]).reshape(15, 3, 5)
op2 = np.array([[1, 2, 3]])

op_tiled= np.tile(op2, (1, 5))
print(op_tiled.shape)

op_tiled= np.tile(op2.T, (1, 5))
print(op_tiled.shape)
(1, 15)
(3, 5)

Expand/Squeeze

Add a dimension of size 1, or remove a dimension of size 1. Here we massage op2 (shape=(1, 3)) to the shape (15, 3, 5).

In [76]:
op_expanded = np.expand_dims(op2, axis=2)
print(op_expanded.shape)

op_tiled_2 = np.tile(op_expanded, (15, 1, 5))
print(op_tiled_2.shape)
(1, 3, 1)
(15, 3, 5)

Same effect with np.newaxis

In [77]:
op3 = np.array([i for i in range(9)]).reshape(3, 3)

op_na = op3[np.newaxis, :]
print(op_na)
print(op_na.shape)

op_na2 = op3[:, np.newaxis, :]
print(op_na2)
print(op_na2.shape)
[[[0 1 2]
  [3 4 5]
  [6 7 8]]]
(1, 3, 3)
[[[0 1 2]]

 [[3 4 5]]

 [[6 7 8]]]
(3, 1, 3)

Squeeze removes size 1 dimensions

In [78]:
print(op_expanded)
print(op_expanded.shape)

op_squeezed = np.squeeze(op_expanded)

print(op_squeezed)
[[[1]
  [2]
  [3]]]
(1, 3, 1)
[1 2 3]

Pairwise distance

Here are 3 ways to compute pairwise distances.

  • "Naive" method through tile expansion
  • Convert the tile/expansion to broadcasting
  • Scipy one line
In [79]:
samples = np.random.random((15, 5))
print(samples.shape)
print(samples)

# Without broadcasting
expanded1 = np.expand_dims(samples, axis=1)
tile1 = np.tile(expanded1, (1, samples.shape[0], 1))
#print(expanded1.shape)
#print(tile1.shape)
#print(tile1)

expanded2 = np.expand_dims(samples, axis=0)
tile2 = np.tile(expanded2, (samples.shape[0], 1 ,1))
#print(expanded2.shape)
#print(tile2.shape)
#print(tile2)

diff = tile2 - tile1
distances = np.linalg.norm(diff, axis=-1)
# print(distances)
print(np.mean(distances))
##################################


# With broadcasting
diff = samples[: ,np.newaxis, :] - samples[np.newaxis, :, :]
distances = np.linalg.norm(diff, axis=-1)
# print(distances)
print(np.mean(distances))


# With scipy
import scipy.spatial
distances = scipy.spatial.distance.cdist(samples, samples)
# print(distances)
# print(len(distances))
print(np.mean(distances))
(15, 5)
[[0.45767142 0.56489308 0.37910783 0.1012638  0.63895657]
 [0.72033823 0.28494664 0.86460006 0.81522924 0.05615894]
 [0.72889278 0.88609119 0.04580975 0.81831563 0.24520082]
 [0.68200685 0.6404537  0.70349505 0.58704715 0.58236006]
 [0.11619128 0.48050658 0.74821419 0.43276056 0.24725844]
 [0.95417451 0.95489342 0.07671449 0.86527711 0.64929007]
 [0.18535464 0.92787863 0.7322276  0.00184351 0.90755884]
 [0.89479318 0.99133381 0.23356447 0.30061149 0.93226858]
 [0.98611507 0.03185917 0.24049277 0.63320623 0.89291318]
 [0.76912372 0.3582217  0.22339368 0.50746419 0.51563737]
 [0.45958078 0.3723447  0.99481086 0.28386613 0.75707502]
 [0.27449411 0.56054339 0.91572132 0.97952258 0.35366246]
 [0.51649077 0.49313818 0.58891696 0.04172703 0.56133593]
 [0.24170496 0.01170604 0.82451557 0.34265237 0.42497829]
 [0.82304983 0.96870729 0.04454417 0.77944192 0.68369793]]
0.8702615298788167
0.8702615298788167
0.8702615298788167

Vectorization

tqdm is a nice package for you to track progress, or just kill time.

In [80]:
import time # time.time() gets wall time, time.perf_counter() gets a high-resolution timer
from tqdm import tqdm

Dot Product

Numpy is roughly 40 times faster than the loop here.

In [81]:
a = np.random.random(500000)
b = np.random.random(500000)

p_tic = time.perf_counter()
tic = time.time()

dot = 0.0;
for i in tqdm(range(len(a))):
    dot += a[i] * b[i]

print(dot)

toc = time.time()
p_toc = time.perf_counter()

print(f'Result: {dot}');
print(f'Compute time (wall): {round(1000 * (toc - tic), 6)}ms')
print(f'Compute time (cpu) : {round(1000 * (p_toc - p_tic), 6)}ms\n')

#####################################################################

p_tic = time.perf_counter()
tic = time.time()

print(np.array(a).dot(np.array(b)))

toc = time.time()
p_toc = time.perf_counter()

print(f'(vectorized) Result: {dot}');
print(f'(vectorized) Compute time: {round(1000 * (toc - tic), 6)}ms')
print(f'(vectorized) Compute time (cpu) : {round(1000 * (p_toc - p_tic), 6)}ms')
100%|██████████| 500000/500000 [00:00<00:00, 513976.60it/s]
125037.9051522837
Result: 125037.9051522837
Compute time (wall): 985.734701ms
Compute time (cpu) : 985.8469ms

125037.90515228486
(vectorized) Result: 125037.9051522837
(vectorized) Compute time: 23.534298ms
(vectorized) Compute time (cpu) : 23.6601ms

Matrix multiplication (2D)

Numpy is more than TWO THOUSAND times faster than loops here.

Matrix multiplication is an O(n^3) complexity operation if implemented naively.

In [82]:
def matrix_mul(X, Y):
    # allocate a fresh result instead of relying on a global
    result = np.zeros((len(X), len(Y[0])))
    # iterate through rows of X
    for i in range(len(X)):
        # iterate through columns of Y
        for j in range(len(Y[0])):
            # iterate through rows of Y
            for k in range(len(Y)):
                result[i][j] += X[i][k] * Y[k][j]
    return result
In [83]:
X = np.random.random((200, 200))
Y = np.random.random((200, 200))

result = np.zeros((200, 200))

p_tic = time.perf_counter()
tic = time.time()

# iterate through rows of X
for i in tqdm(range(len(X))):
    # iterate through columns of Y
    for j in range(len(Y[0])):
        # iterate through rows of Y
        for k in range(len(Y)):
            result[i][j] += X[i][k] * Y[k][j]

s = np.sum(result)

toc = time.time()
p_toc = time.perf_counter()

print(f'Result: {s}');
print(f'Compute time (wall): {round(1000 * (toc - tic), 6)}ms')
print(f'Compute time (cpu) : {round(1000 * (p_toc - p_tic), 6)}ms\n')

#####################################################################

p_tic = time.perf_counter()
tic = time.time()

result = np.matmul(X, Y)
s = np.sum(result)

toc = time.time()
p_toc = time.perf_counter()

print(f'(vectorized) Result: {s}');
print(f'(vectorized) Compute time: {round(1000 * (toc - tic), 6)}ms')
print(f'(vectorized) Compute time (cpu) : {round(1000 * (p_toc - p_tic), 6)}ms')
100%|██████████| 200/200 [00:14<00:00, 14.05it/s]
Result: 1999185.355486207
Compute time (wall): 14240.134239ms
Compute time (cpu) : 14240.2964ms

(vectorized) Result: 1999185.355486207
(vectorized) Compute time: 5.803108ms
(vectorized) Compute time (cpu) : 5.93ms

Pairwise distance, again

Again, numpy is nearly 70 times faster

In [84]:
samples = np.random.random((100, 5))

p_tic = time.perf_counter()
tic = time.time()

total_dist = []
for s1 in samples:
    for s2 in samples:
        d = np.linalg.norm(s1 - s2)
        total_dist.append(d)
        
avg_dist = np.mean(total_dist)

toc = time.time()
p_toc = time.perf_counter()

print(f'Result: {avg_dist}');
print(f'Compute time (wall): {round(1000 * (toc - tic), 6)}ms')
print(f'Compute time (cpu) : {round(1000 * (p_toc - p_tic), 6)}ms\n')


#####################################################################

p_tic = time.perf_counter()
tic = time.time()

diff = samples[: ,np.newaxis, :] - samples[np.newaxis, :, :]
distances = np.linalg.norm(diff, axis=-1)
avg_dist = np.mean(distances)

toc = time.time()
p_toc = time.perf_counter()

print(f'Result: {avg_dist}');
print(f'Compute time (wall): {round(1000 * (toc - tic), 6)}ms')
print(f'Compute time (cpu) : {round(1000 * (p_toc - p_tic), 6)}ms\n')
Result: 0.8657805911523523
Compute time (wall): 172.189951ms
Compute time (cpu) : 172.335ms

Result: 0.8657805911523523
Compute time (wall): 2.529621ms
Compute time (cpu) : 2.6422ms

You might want to make sure that an optimized BLAS ("basic linear algebra subprograms") library such as OpenBLAS is installed; it is what speeds up the linear algebra behind numpy. The configuration below shows this environment uses MKL, another such library.

In [85]:
np.show_config()
mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/jingboyang/anaconda3/envs/common/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/jingboyang/anaconda3/envs/common/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/jingboyang/anaconda3/envs/common/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/jingboyang/anaconda3/envs/common/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/jingboyang/anaconda3/envs/common/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/jingboyang/anaconda3/envs/common/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/jingboyang/anaconda3/envs/common/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/jingboyang/anaconda3/envs/common/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/jingboyang/anaconda3/envs/common/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/jingboyang/anaconda3/envs/common/include']

Matplotlib

Simple plotting

We want to plot with proper labels, series legend, and even markers.

In [86]:
# If you are using a headless environment. Very important if running on server
# import matplotlib
# matplotlib.use('Agg')

import matplotlib.pyplot as plt
In [87]:
def draw_simple_sin_cos(x_values):
    
    y1_values = np.sin(x_values * np.pi)
    y2_values = np.cos(x_values * np.pi)

    plt.plot(x_values, y1_values, label='Sine')
    plt.plot(x_values, y2_values, label='Cosine')

    plt.legend()
    plt.xlabel('x')
    plt.ylabel('values')
    plt.title(r'Values for sin and cos, scaled by $\phi_i$') # raw string for the LaTeX backslash
In [88]:
x_values = np.arange(0, 20, 0.001)

draw_simple_sin_cos(x_values)
plt.show()

You can adjust figure size for aspect ratio, then DPI for pixel density. Combined, these give you the resolution of the image.

In [89]:
plt.figure(figsize=(10,3), dpi=100) # 1000 x 300 pixels

draw_simple_sin_cos(x_values)

plt.savefig('tutorial_sin.jpg')
plt.show()

Subplots in a grid can share axis labels through sharex and sharey.

In [90]:
def draw_subplot_sin_cos(index, x_values, ax):
    
    y1_values = np.sin(x_values * np.pi)
    y2_values = np.cos(x_values * np.pi)

    ax.plot(x_values, y1_values, c='r', label='Sine')
    ax.scatter(x_values, y2_values, s=4, label='Cosine')

    ax.legend()
    ax.set_xlabel('x')
    ax.set_ylabel('values')
    ax.set_title(f'Values for sin and cos (Subplot #{index})')
In [91]:
fig, ax_list = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
#fig, ax_list = plt.subplots(nrows=2, ncols=2,
#                            sharex='col', sharey='row',
#                            figsize=(10, 10))

i = 0
for r, row in enumerate(ax_list):
    for c, ax in enumerate(row):
        x_values = np.arange(i, i + 10, 0.1)
        draw_subplot_sin_cos(i, x_values, ax)
        i += 1

plt.show()

Confusion matrix

Here we show plotting confusion matrix from scratch. For a pre-built one, see implementation by scikit-learn

In [92]:
fig, ax = plt.subplots(figsize=(10,10))

color='YlGn'

labels = ['Python', 'C++', 'Fortran']

cm = np.array([[0.7, 0.3, 0.2], [0.1, 0.5, 0.4], [0.05, 0.1, 0.85]])
heatmap = ax.pcolor(cm, cmap=color)
fig.colorbar(heatmap)
ax.invert_yaxis()
ax.xaxis.tick_top()

ax.set_title('Confusion Matrix')
ax.set_xlabel('Prediction')
ax.set_ylabel('Ground Truth')

ax.set_xticks(np.arange(cm.shape[0]) + 0.5, minor=False)
ax.set_yticks(np.arange(cm.shape[1]) + 0.5, minor=False)
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)

plt.show()

Show image

When showing images, remember to tell matplotlib the range of pixel values (vmin/vmax). Typically pixel values are either 0-1 or 0-255.

In [93]:
img_arr = np.random.random((256, 256))# 0 -> 1
print(img_arr.shape)

plt.imshow(img_arr, cmap='gray', vmin=0.2, vmax=0.25)
plt.show()
(256, 256)

Color images for matplotlib are channel-last: (rows, columns, RGB).

In [94]:
img_arr = np.random.random((256, 256, 3))# R, C, (RGB)
print(img_arr.shape)

plt.imshow(img_arr, vmin=0, vmax=1)
plt.show()
(256, 256, 3)

Remember to move axis around if you want to use the default plotting tool.

In [95]:
img_arr = np.random.random((3, 256, 256))# (RGB) R C
print(img_arr.shape)

img_arr = np.moveaxis(img_arr, 0, -1)
print(img_arr.shape)

plt.imshow(img_arr, vmin=0, vmax=1)
plt.show()
(3, 256, 256)
(256, 256, 3)
In [96]:
import imageio

fname = 'sample.jpg'
img = imageio.imread(fname)

pp.pprint(img)
print(img.shape)
array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ...,

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]], dtype=uint8)
(211, 862, 3)
In [97]:
plt.figure(dpi=250)   # dpi=500 -> larger
plt.imshow(img, vmin=0, vmax=10000, interpolation='bilinear')
plt.show()

Pandas

Pandas is a great data-processing library for table/database-like data, and excellent for data that arrives as, or that you want to export to, CSV/Excel. Part of the following content is inspired by a Pandas tutorial online.

File operations

In [98]:
import pandas as pd
In [99]:
data = pd.read_csv('train.csv')

data_short = data[:20]
In [100]:
data_short
Out[100]:
x_1 x_2 x_3 x_4 y
0 1.0 0.0 2.976142 0.651482 10
1 0.0 1.0 1.411390 0.743732 12
2 0.0 1.0 1.039892 1.290588 7
3 1.0 0.0 2.338679 0.973942 15
4 0.0 1.0 2.385257 0.297921 9
5 0.0 1.0 2.912910 0.244489 8
6 1.0 0.0 2.585491 0.133044 9
7 1.0 0.0 2.961107 0.338565 8
8 0.0 1.0 0.161944 0.481609 4
9 0.0 1.0 2.512621 1.118481 16
10 1.0 0.0 2.711287 0.463432 14
11 1.0 0.0 1.479011 0.860247 7
12 1.0 0.0 0.223923 1.030258 8
13 1.0 0.0 2.918245 0.409249 10
14 0.0 1.0 1.447071 0.061543 4
15 1.0 0.0 2.269534 1.754568 19
16 0.0 1.0 2.804809 1.114212 15
17 0.0 1.0 2.539715 1.850662 21
18 0.0 1.0 1.300125 1.178924 5
19 1.0 0.0 1.275172 0.756409 11

You can get basic statistics with little effort, like this:

In [101]:
print(data['x_1'].describe())     # For one column

data.describe()                   # For the entire dataframe
count    2500.000000
mean        0.286400
std         0.452169
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: x_1, dtype: float64
Out[101]:
x_1 x_2 x_3 x_4 y
count 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000
mean 0.286400 0.713600 1.506160 0.974763 9.764000
std 0.452169 0.452169 0.862699 0.577296 4.559893
min 0.000000 0.000000 0.004021 0.001589 1.000000
25% 0.000000 0.000000 0.753142 0.481975 6.000000
50% 0.000000 1.000000 1.488827 0.969344 9.000000
75% 1.000000 1.000000 2.262212 1.473104 13.000000
max 1.000000 1.000000 2.998699 1.997793 29.000000

See the Pandas documentation for the parameters of the to_csv function.

In [102]:
data_short.to_csv('data_short.csv', index=False)
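A to_csv/read_csv round trip can be checked without touching disk by writing to an in-memory buffer. A minimal sketch (the dataframe here is made up for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# index=False drops the row-index column, so reading the CSV back
# reproduces the original dataframe exactly.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)
print(df.equals(df2))  # True
```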

Manipulations

Columns can be selected and filtered based on value/name. Be careful with operator precedence when filtering: the bitwise operators & and | bind more tightly than comparisons, so put parentheses around each condition.

In [103]:
data_short[['x_1', 'y']]
Out[103]:
x_1 y
0 1.0 10
1 0.0 12
2 0.0 7
3 1.0 15
4 0.0 9
5 0.0 8
6 1.0 9
7 1.0 8
8 0.0 4
9 0.0 16
10 1.0 14
11 1.0 7
12 1.0 8
13 1.0 10
14 0.0 4
15 1.0 19
16 0.0 15
17 0.0 21
18 0.0 5
19 1.0 11
In [104]:
data_short[(data_short['y'] > 5) & (data_short['x_3'] < 1.5)]       # Use & | instead of and/or. Put brackets around
Out[104]:
x_1 x_2 x_3 x_4 y
1 0.0 1.0 1.411390 0.743732 12
2 0.0 1.0 1.039892 1.290588 7
11 1.0 0.0 1.479011 0.860247 7
12 1.0 0.0 0.223923 1.030258 8
19 1.0 0.0 1.275172 0.756409 11
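The precedence pitfall above can be illustrated on a tiny made-up dataframe: without parentheses, Python would parse `df['y'] > 5 & df['x_3'] < 1.5` as `df['y'] > (5 & df['x_3']) < 1.5`, which fails.

```python
import pandas as pd

df = pd.DataFrame({'y': [10, 3, 7], 'x_3': [0.5, 2.0, 1.0]})

# Parentheses are required because & binds more tightly than > and <.
mask = (df['y'] > 5) & (df['x_3'] < 1.5)
print(df[mask])  # keeps rows 0 and 2
```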

A function can be applied row by row to generate a new column (you could use this to attach a trained model's predictions, for instance). Here we add a column based on a condition.

In [105]:
def filter_func(row):
    
    if row['x_1'] == 1.0 and row['x_2'] == 0.0:
        return row['y'] * 10
    
    return -1

data_short['new_column'] = data_short[['x_1', 'x_2', 'y']].apply(filter_func, axis=1)

data_short
/home/jingboyang/anaconda3/envs/common/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Out[105]:
x_1 x_2 x_3 x_4 y new_column
0 1.0 0.0 2.976142 0.651482 10 100.0
1 0.0 1.0 1.411390 0.743732 12 -1.0
2 0.0 1.0 1.039892 1.290588 7 -1.0
3 1.0 0.0 2.338679 0.973942 15 150.0
4 0.0 1.0 2.385257 0.297921 9 -1.0
5 0.0 1.0 2.912910 0.244489 8 -1.0
6 1.0 0.0 2.585491 0.133044 9 90.0
7 1.0 0.0 2.961107 0.338565 8 80.0
8 0.0 1.0 0.161944 0.481609 4 -1.0
9 0.0 1.0 2.512621 1.118481 16 -1.0
10 1.0 0.0 2.711287 0.463432 14 140.0
11 1.0 0.0 1.479011 0.860247 7 70.0
12 1.0 0.0 0.223923 1.030258 8 80.0
13 1.0 0.0 2.918245 0.409249 10 100.0
14 0.0 1.0 1.447071 0.061543 4 -1.0
15 1.0 0.0 2.269534 1.754568 19 190.0
16 0.0 1.0 2.804809 1.114212 15 -1.0
17 0.0 1.0 2.539715 1.850662 21 -1.0
18 0.0 1.0 1.300125 1.178924 5 -1.0
19 1.0 0.0 1.275172 0.756409 11 110.0
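The SettingWithCopyWarning above appears because data_short is a slice of data. One way to avoid it is to take an explicit copy of the slice before adding columns. A minimal sketch with a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# .copy() makes the slice an independent dataframe, so assigning a new
# column no longer triggers SettingWithCopyWarning.
short = df[:2].copy()
short['c'] = short['a'] * 10
print(short['c'].tolist())  # [10, 20]
```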

Iterating through Pandas rows can be done as follows. Each row behaves like a dictionary (it is a pandas Series). Adding a column directly from a list of values is also valid.

In [106]:
col2 = []
for i, row in data_short.iterrows():
    print(f'Row {i}: y-value: {row["y"]}')
    col2.append(row['y'] ** 2)

data_short['col_2'] = col2
data_short
Row 0: y-value: 10.0
Row 1: y-value: 12.0
Row 2: y-value: 7.0
Row 3: y-value: 15.0
Row 4: y-value: 9.0
Row 5: y-value: 8.0
Row 6: y-value: 9.0
Row 7: y-value: 8.0
Row 8: y-value: 4.0
Row 9: y-value: 16.0
Row 10: y-value: 14.0
Row 11: y-value: 7.0
Row 12: y-value: 8.0
Row 13: y-value: 10.0
Row 14: y-value: 4.0
Row 15: y-value: 19.0
Row 16: y-value: 15.0
Row 17: y-value: 21.0
Row 18: y-value: 5.0
Row 19: y-value: 11.0
/home/jingboyang/anaconda3/envs/common/lib/python3.7/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Out[106]:
x_1 x_2 x_3 x_4 y new_column col_2
0 1.0 0.0 2.976142 0.651482 10 100.0 100.0
1 0.0 1.0 1.411390 0.743732 12 -1.0 144.0
2 0.0 1.0 1.039892 1.290588 7 -1.0 49.0
3 1.0 0.0 2.338679 0.973942 15 150.0 225.0
4 0.0 1.0 2.385257 0.297921 9 -1.0 81.0
5 0.0 1.0 2.912910 0.244489 8 -1.0 64.0
6 1.0 0.0 2.585491 0.133044 9 90.0 81.0
7 1.0 0.0 2.961107 0.338565 8 80.0 64.0
8 0.0 1.0 0.161944 0.481609 4 -1.0 16.0
9 0.0 1.0 2.512621 1.118481 16 -1.0 256.0
10 1.0 0.0 2.711287 0.463432 14 140.0 196.0
11 1.0 0.0 1.479011 0.860247 7 70.0 49.0
12 1.0 0.0 0.223923 1.030258 8 80.0 64.0
13 1.0 0.0 2.918245 0.409249 10 100.0 100.0
14 0.0 1.0 1.447071 0.061543 4 -1.0 16.0
15 1.0 0.0 2.269534 1.754568 19 190.0 361.0
16 0.0 1.0 2.804809 1.114212 15 -1.0 225.0
17 0.0 1.0 2.539715 1.850662 21 -1.0 441.0
18 0.0 1.0 1.300125 1.178924 5 -1.0 25.0
19 1.0 0.0 1.275172 0.756409 11 110.0 121.0
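For a computation like the squared column above, a vectorized expression is usually preferable to iterrows, both for brevity and speed. A sketch on a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'y': [10, 12, 7]})

# Vectorized arithmetic over the whole column replaces the explicit loop.
df['col_2'] = df['y'] ** 2
print(df['col_2'].tolist())  # [100, 144, 49]
```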

Not a great example here, but loc indexes by label while iloc indexes by integer position. For example, you can use iloc with -1 (the last row), but NOT loc with -1 unless -1 is an actual index label.

In [107]:
print(data.loc[19])
print(data.iloc[-1])
x_1     1.000000
x_2     0.000000
x_3     1.275172
x_4     0.756409
y      11.000000
Name: 19, dtype: float64
x_1     0.000000
x_2     1.000000
x_3     1.825617
x_4     0.059309
y      11.000000
Name: 2499, dtype: float64
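The loc/iloc distinction is clearer with a non-default index. A sketch with made-up string labels:

```python
import pandas as pd

df = pd.DataFrame({'v': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df.loc['b', 'v'])    # by label -> 20
print(df.iloc[-1]['v'])    # by position; 'loc[-1]' would raise KeyError here
```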

Create dataframe

You can create a dataframe from dictionaries in a row- or column-major manner. Notice that missing entries are filled with NaN.

In [108]:
data_list = [{'a': i, 'b': i + 1} for i in range(15)]
data_list[5] = {'a': 10, 'b': 9, 'c': -1}

df = pd.DataFrame(data_list)
df
Out[108]:
a b c
0 0 1 NaN
1 1 2 NaN
2 2 3 NaN
3 3 4 NaN
4 4 5 NaN
5 10 9 -1.0
6 6 7 NaN
7 7 8 NaN
8 8 9 NaN
9 9 10 NaN
10 10 11 NaN
11 11 12 NaN
12 12 13 NaN
13 13 14 NaN
14 14 15 NaN
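The NaN entries produced above can be replaced with a default value via fillna. A sketch with a made-up record list:

```python
import pandas as pd

records = [{'a': 0, 'b': 1}, {'a': 1, 'b': 2, 'c': -1}]
df = pd.DataFrame(records)

# Keys missing from a record become NaN; fillna substitutes a default.
df = df.fillna(0)
print(df['c'].tolist())  # [0.0, -1.0]
```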

A dataframe can also be created from a 2D array. Naming the rows and columns is good practice.

In [109]:
data_2d = np.array([i for i in range(50)]).reshape(5, 10)

df = pd.DataFrame(data_2d, columns=[f'col {i}' for i in range(10)], index=[f'row {i}' for i in range(5)])
df
Out[109]:
col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9
row 0 0 1 2 3 4 5 6 7 8 9
row 1 10 11 12 13 14 15 16 17 18 19
row 2 20 21 22 23 24 25 26 27 28 29
row 3 30 31 32 33 34 35 36 37 38 39
row 4 40 41 42 43 44 45 46 47 48 49

Similarly, you can create a dataframe directly from a dictionary. The orient argument controls whether the dictionary keys become row or column labels.

In [110]:
data_dict = {'col 1': [3, 2, 1, 0],
        'col 2': ['a', 'b', 'c', 'd']}

df = pd.DataFrame.from_dict(data_dict)
df
Out[110]:
col 1 col 2
0 3 a
1 2 b
2 1 c
3 0 d
In [111]:
df = pd.DataFrame.from_dict(data_dict, orient='index')
df
Out[111]:
0 1 2 3
col 1 3 2 1 0
col 2 a b c d

Simple plotting

Pandas also supports plotting; the figures it generates are the same style as Matplotlib's. Pandas plotting is a quick way to visualize data, but you may still want Matplotlib for more formal plots with greater flexibility.

In [112]:
data.plot(kind='scatter', x='x_3', y='y', title='Plot of Data');
In [113]:
data['y'].plot(kind='hist', title='Y');
In [114]:
data.boxplot(column='x_3', by='y');
In [115]:
data.to_numpy()
Out[115]:
array([[ 1.        ,  0.        ,  2.97614241,  0.65148205, 10.        ],
       [ 0.        ,  1.        ,  1.4113903 ,  0.74373156, 12.        ],
       [ 0.        ,  1.        ,  1.03989184,  1.2905879 ,  7.        ],
       ...,
       [ 0.        ,  1.        ,  1.49124324,  0.84115559,  7.        ],
       [ 0.        ,  1.        ,  2.8631773 ,  1.13793409, 12.        ],
       [ 0.        ,  1.        ,  1.82561719,  0.05930945, 11.        ]])
In [116]:
plt.scatter(data['x_1'], data['y'])
Out[116]:
<matplotlib.collections.PathCollection at 0x7f36ce2ccfd0>