To put things into perspective, we were running an Inception3 architecture on a sample of 18 thousand documents with a single 12GB Tesla K80 GPU, and each epoch took about 30 minutes. With Horovod and an upgraded instance with four 12GB Tesla K80 GPUs, each epoch dropped to about 5–6 minutes.
Books
● TensorFlow for Machine Intelligence (TFFMI)
● Hands-On Machine Learning with Scikit-Learn and TensorFlow. Chapter 9: Up and running with TensorFlow
● Fundamentals of Deep Learning. Chapter 3: Implementing Neural Networks in TensorFlow (FODL)
TensorFlow is being constantly updated, so books might become outdated fast.
TF Learn simple example
import tensorflow as tf
from sklearn import cross_validation, metrics
# Load dataset.
iris = tf.contrib.learn.datasets.load_dataset('iris')
x_train, x_test, y_train, y_test = cross_validation.train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42)
# Build 3 layer DNN with 10, 20, 10 units respectively.
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(
x_train)
classifier = tf.contrib.learn.DNNClassifier(
feature_columns=feature_columns, hidden_units=[10, 20, 10], n_classes=3)
# Fit and predict.
classifier.fit(x_train, y_train, steps=200)
predictions = list(classifier.predict(x_test, as_iterable=True))
score = metrics.accuracy_score(y_test, predictions)
print('Accuracy: {0:f}'.format(score))
What’s a tensor? An n-dimensional array
● 0-d tensor: scalar (number)
● 1-d tensor: vector
● 2-d tensor: matrix
● and so on
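To make the rank terminology concrete, here is a minimal sketch in plain Python (no TensorFlow required) that infers the shape of a nested list. `shape_of` is a hypothetical helper written for this illustration, and it assumes the nesting is regular (every sub-list at a given depth has the same length).

```python
def shape_of(value):
    """Return the shape of a regularly nested list as a tuple.

    A bare number has shape (), which corresponds to rank 0.
    Assumes every sub-list at a given depth has the same length.
    """
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if len(value) == 0:
            break
        value = value[0]
    return tuple(shape)


scalar = 3                        # 0-d tensor
vector = [1, 2, 3]                # 1-d tensor
matrix = [[1, 2, 3], [4, 5, 6]]   # 2-d tensor

for name, tensor in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    shape = shape_of(tensor)
    # The rank is simply the number of dimensions in the shape.
    print(name, "rank:", len(shape), "shape:", shape)
```

The rank is the length of the shape tuple, which is why a scalar, with shape `()`, counts as a 0-d tensor.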
TensorBoard
● Nodes: operators, variables, and constants
● Edges: tensors
Data Flow -> Tensor Flow (I know, mind=blown)
import tensorflow as tf
a = tf.add(3, 5)
print a
>> Tensor("Add:0", shape=(), dtype=int32) # (Not 8)
How to get the value of a? Create a session and assign it to the variable sess so we can call it later. Within the session, evaluate the graph to fetch the value of a.
import tensorflow as tf
a = tf.add(3, 5)
sess = tf.Session()
print sess.run(a)
sess.close()
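The difference between building a node and running it can be imitated in a few lines of plain Python. The sketch below is a toy model, not TensorFlow's implementation: `Node`, `constant`, `add`, and `run` are hypothetical names chosen to mirror the snippets above.

```python
class Node:
    """A toy graph node: an operation plus the nodes it takes as inputs."""

    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = tuple(inputs)

    def __repr__(self):
        # Like printing a Tensor: you get a description, not a value.
        return "Node(%r)" % self.name


def constant(value, name="Const"):
    return Node(name, lambda: value)


def add(x, y, name="Add"):
    return Node(name, lambda a, b: a + b, [x, y])


def run(node):
    """Evaluate a node by first evaluating everything it depends on."""
    args = [run(dependency) for dependency in node.inputs]
    return node.fn(*args)


a = add(constant(3), constant(5))
print(a)       # Node('Add'): building the node computes nothing
print(run(a))  # 8: the value only exists once the node is run
```

In this toy, `run` plays the role of `sess.run`: nothing is computed until it is called.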
subgraph
x = 2
y = 3
add_op = tf.add(x, y)
mul_op = tf.multiply(x, y) # named tf.mul before TensorFlow 1.0
useless = tf.multiply(x, add_op)
pow_op = tf.pow(add_op, mul_op)
with tf.Session() as sess:
    # pass every tensor whose value you want as a list in fetches
    z, not_useless = sess.run([pow_op, useless])
Run part of a graph on a specific GPU or CPU (for parallel computation)
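with tf.device('/gpu:2'):
    # matmul needs 2-D operands, so give the constants explicit shapes
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# log_device_placement=True logs the device each op is assigned to
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print sess.run(c)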
Multiple graphs require multiple sessions, and each session will try to use all available resources by default
● You can't pass data between them without going through Python/NumPy, which doesn't work in a distributed setting
● It’s better to have disconnected subgraphs within one graph
TensorBoard
a = tf.constant(2) # name='a'
b = tf.constant(3) # name='b'
x = tf.add(a, b) # name='add'
with tf.Session() as sess:
    writer = tf.summary.FileWriter('./graphs', sess.graph)
    print sess.run(x)
# close the writer when you’re done using it
writer.close()
$ python a.py
$ tensorboard --logdir="./graphs"
tf.constant(value, dtype=None, shape=None, name='Const', verify_shape=False)
# e.g. b = tf.constant([[0, 1], [2, 3]], name="b")
tf.zeros([2, 3], tf.int32) # [[0, 0, 0], [0, 0, 0]]
tf.zeros_like(input_tensor) # [[0, 0], [0, 0], [0, 0]] for a 3 x 2 input_tensor
tf.ones(shape, dtype=tf.float32, name=None)
tf.ones_like(input_tensor) # [[1, 1], [1, 1], [1, 1]] for a 3 x 2 input_tensor
tf.fill(dims, value, name=None)
tf.linspace(10.0, 13.0, 4, name="linspace") # [10.0 11.0 12.0 13.0], a sequence of num evenly spaced values
tf.range(start, limit, delta) # with start=3, limit=1, delta=-0.5: [3, 2.5, 2, 1.5]
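The two sequence constructors are easy to check by hand. The following plain-Python sketch reproduces the values in the comments above. `linspace` and `frange` are hypothetical stand-ins written for this note, not the TensorFlow functions, and they assume the semantics shown there (both endpoints included for linspace, limit excluded for range).

```python
def linspace(start, stop, num):
    """Return num evenly spaced values from start to stop, both included."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / float(num - 1)
    return [start + i * step for i in range(num)]


def frange(start, limit, delta):
    """Return start, start + delta, ... stopping before limit is reached."""
    if delta == 0:
        raise ValueError("delta must be non-zero")
    values = []
    i = 0
    while True:
        current = start + i * delta
        past_limit = current >= limit if delta > 0 else current <= limit
        if past_limit:
            break
        values.append(current)
        i += 1
    return values


print(linspace(10.0, 13.0, 4))  # [10.0, 11.0, 12.0, 13.0]
print(frange(3, 1, -0.5))       # [3.0, 2.5, 2.0, 1.5]
```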
Unlike NumPy or Python sequences, TensorFlow sequences are not iterable.
for _ in np.linspace(0, 10, 4): # OK
for _ in tf.linspace(0, 10, 4): # TypeError("'Tensor' object is not iterable.")
for _ in range(4): # OK
for _ in tf.range(4): # TypeError("'Tensor' object is not iterable.")
Generate random constants from certain distributions.
tf.random_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32, seed=None, name=None)
tf.truncated_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32, seed=None,
name=None)
tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None,
name=None)
tf.random_shuffle(value, seed=None, name=None)
tf.random_crop(value, size, seed=None, name=None)
tf.multinomial(logits, num_samples, seed=None, name=None)
tf.random_gamma(shape, alpha, beta=None, dtype=tf.float32, seed=None, name=None)
a = tf.constant([3, 6])
b = tf.constant([2, 2])
tf.add(a, b) # >> [5 8]
tf.add_n([a, b, b]) # >> [7 10]. Equivalent to a + b + b
tf.multiply(a, b) # >> [6 12] because multiply is element-wise (named tf.mul before TensorFlow 1.0)
tf.matmul(a, b) # >> ValueError
tf.matmul(tf.reshape(a, shape=[1, 2]), tf.reshape(b, shape=[2, 1])) # >> [[18]]
tf.div(a, b) # >> [1 3]
tf.mod(a, b) # >> [1 0]
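The difference between element-wise multiplication and matrix multiplication can be verified with plain Python lists. This is only an illustrative sketch: `elementwise_mul` and `matmul` are hypothetical helpers, not TensorFlow functions.

```python
def elementwise_mul(a, b):
    """Multiply two equal-length vectors position by position."""
    return [x * y for x, y in zip(a, b)]


def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix, both given as nested lists."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


a = [3, 6]
b = [2, 2]
print(elementwise_mul(a, b))  # [6, 12]

row = [a]                     # reshape a to 1 x 2
col = [[v] for v in b]        # reshape b to 2 x 1
print(matmul(row, col))       # [[18]] because 3*2 + 6*2 = 18
```

Matrix multiplication only makes sense for 2-d operands with matching inner dimensions, which is why the vectors have to be reshaped first.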
The graph's definition is stored as a protobuf. Protobuf stands for protocol buffer, “Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.”
import tensorflow as tf
my_const = tf.constant([1.0, 2.0], name="my_const")
print tf.get_default_graph().as_graph_def()
node {
  name: "my_const"
  op: "Const"
  attr {
    key: "dtype"
    value {
      type: DT_FLOAT
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_FLOAT
        tensor_shape {
          dim {
            size: 2
          }
        }
        tensor_content: "\000\000\200?\000\000\000@"
      }
    }
  }
}
versions {
  producer: 17
}
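The tensor_content string in this dump is the raw bytes of the constant. As a small sanity check, the sketch below decodes it with Python's struct module, assuming the values are packed as little-endian 32-bit floats (an assumption that holds on typical x86 machines and that the decoded values bear out here).

```python
import struct

# The tensor_content field from the dump above, written as a bytes literal.
# The octal escape \200 is byte 0x80, "?" is 0x3f and "@" is 0x40.
tensor_content = b"\000\000\200?\000\000\000@"

# Assumption: two little-endian 32-bit floats ("<2f"), 4 bytes each.
values = struct.unpack("<2f", tensor_content)
print(values)  # (1.0, 2.0), the values passed to tf.constant
```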
Variables
b = tf.Variable([2, 3], name="vector")
# create variable W as a 784 x 10 tensor, filled with zeros
W = tf.Variable(tf.zeros([784, 10]))
tf.Variable holds several ops:
x.initializer # init op
x.value() # read op
x.assign(...) # write op
x.assign_add(...) # and more
You have to initialize variables before using them
Initialize all variables at once:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    # run the initializer op with sess.run(); no value is fetched
    sess.run(init)
Initialize only a subset of variables by passing a list of the variables to initialize:
init_ab = tf.variables_initializer([a, b], name="init_ab")
with tf.Session() as sess:
    sess.run(init_ab)
Initialize each variable separately using tf.Variable.initializer:
# create variable W as a 784 x 10 tensor, filled with zeros
W = tf.Variable(tf.zeros([784, 10]))
with tf.Session() as sess:
    sess.run(W.initializer)
Print a variable
W = tf.Variable(tf.truncated_normal([700, 10]))
with tf.Session() as sess:
    sess.run(W.initializer)
    print W # Tensor("Variable/read:0", shape=(700, 10), dtype=float32)
    print W.eval() # actual variable value
W = tf.Variable(10)
W.assign(100)
with tf.Session() as sess:
    sess.run(W.initializer)
    print W.eval() # 10
Why 10 and not 100? W.assign(100) doesn’t assign the value 100 to W; instead it creates an assign op to do that. For this op to take effect, we have to run it in a session.
W = tf.Variable(10)
assign_op = W.assign(100)
with tf.Session() as sess:
    sess.run(assign_op)
    print W.eval() # 100
Note that we don’t have to initialize W in this case, because assign() does it for us. In fact, the initializer op is the assign op that assigns the variable’s initial value to the variable itself.
Interesting example:
a = tf.Variable(2, name="scalar") # create a variable whose original value is 2
a_times_two = a.assign(a * 2) # assign a * 2 to a and call that op a_times_two
init = tf.global_variables_initializer()
with tf.Session() as sess:
    # have to initialize a, because a_times_two op depends on the value of a
    sess.run(init)
    sess.run(a_times_two) # >> 4
    sess.run(a_times_two) # >> 8
    sess.run(a_times_two) # >> 16
TensorFlow assigns a * 2 to a every time a_times_two is fetched, so each fetch recomputes a * 2 from the current value of a.
Increment and decrement: tf.Variable.assign_add() and tf.Variable.assign_sub()
Because TensorFlow sessions maintain values separately, each Session can have its own current
value for a variable defined in a graph.
W = tf.Variable(10)
sess1 = tf.Session()
sess2 = tf.Session()
sess1.run(W.initializer)
sess2.run(W.initializer)
print sess1.run(W.assign_add(10)) # >> 20
print sess2.run(W.assign_sub(2)) # >> 8
print sess1.run(W.assign_add(100)) # >> 120
print sess2.run(W.assign_sub(50)) # >> -42
sess1.close()
sess2.close()
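One way to picture why the two sessions diverge is to treat a variable as a key and each session as its own table of current values. The sketch below is a toy model in plain Python, not TensorFlow's implementation: the `Variable` and `Session` classes here are hypothetical and only mimic the calls used above. It also shows that assign_add() merely builds an op, which has no effect until a session runs it.

```python
class Variable:
    """A toy variable: it only remembers its initial value."""

    def __init__(self, initial_value):
        self.initial_value = initial_value

    @property
    def initializer(self):
        # Returns an op (a function of session state); nothing runs yet.
        def op(state):
            state[self] = self.initial_value
            return state[self]
        return op

    def assign_add(self, delta):
        def op(state):
            state[self] += delta
            return state[self]
        return op

    def assign_sub(self, delta):
        return self.assign_add(-delta)


class Session:
    """Each toy session keeps its own table of current variable values."""

    def __init__(self):
        self.state = {}

    def run(self, op):
        return op(self.state)


W = Variable(10)
sess1, sess2 = Session(), Session()
sess1.run(W.initializer)
sess2.run(W.initializer)

pending = W.assign_add(1000)        # an op is created but never run: no effect

r1 = sess1.run(W.assign_add(10))    # 20
r2 = sess2.run(W.assign_sub(2))     # 8
r3 = sess1.run(W.assign_add(100))   # 120
r4 = sess2.run(W.assign_sub(50))    # -42
print(r1, r2, r3, r4)
```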
Declare a variable that depends on other variables
# W is a random 700 x 10 tensor
W = tf.Variable(tf.truncated_normal([700, 10]))
U = tf.Variable(W * 2)
# safer: use initialized_value() to make sure that W is initialized before
# its value is used to initialize U
U = tf.Variable(W.initialized_value() * 2)
InteractiveSession makes itself the default session, so you can call run() or eval() without explicitly naming the session. This is convenient in interactive shells and IPython notebooks, as it avoids having to pass an explicit Session object to run ops.
sess = tf.InteractiveSession()
a = tf.constant(5.0)
b = tf.constant(6.0)
c = a * b
print(c.eval()) # we can just use 'c.eval()' without passing 'sess'
sess.close()
# your graph g has 5 ops: a, b, c, d, e
with g.control_dependencies([a, b, c]):
    # `d` and `e` will only run after `a`, `b`, and `c` have executed.
    d = ...
    e = ...
placeholder
Use tf.Variable for trainable variables such as the weights (W) and biases (B) of your model. Use tf.placeholder to feed actual training examples. The difference is that with tf.Variable you have to provide an initial value when you declare it. With tf.placeholder you don't have to provide an initial value; you specify it at run time with the feed_dict argument of Session.run. To save or rebuild the graph, we only need to save or restore the Variables. Placeholders are mostly holders for the different datasets.
# create a placeholder of type float 32-bit, shape is a vector of 3 elements
a = tf.placeholder(tf.float32, shape=[3])
# create a constant of type float 32-bit, shape is a vector of 3 elements
b = tf.constant([5, 5, 5], tf.float32)
# use the placeholder as you would a constant or a variable
c = a + b # Short for tf.add(a, b)
# If we try to fetch c without feeding a, we will run into an error.
with tf.Session() as sess:
    print(sess.run(c))
>> InvalidArgumentError (the placeholder a must be fed a value)
with tf.Session() as sess:
    # feed [1, 2, 3] to placeholder a via the dict {a: [1, 2, 3]}
    # fetch value of c
    print(sess.run(c, {a: [1, 2, 3]}))
with tf.Session() as sess:
    for a_value in list_of_a_values:
        print(sess.run(c, {a: a_value}))
# create Operations, Tensors, etc (using the default graph)
a = tf.add(2, 5)
b = tf.multiply(a, 3) # named tf.mul before TensorFlow 1.0
# start up a `Session` using the default graph
sess = tf.Session()
# define a dictionary that says to replace the value of `a` with 15
replace_dict = {a: 15}
# Run the session, passing in `replace_dict` as the value to `feed_dict`
sess.run(b, feed_dict=replace_dict) # returns 45
feed_dict can be extremely useful to test your model. When you have a large graph and just
want to test out certain parts, you can provide dummy values so TensorFlow won’t waste time
doing unnecessary computations.
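The saving described here can be seen in a toy evaluator. The sketch below is plain Python under stated assumptions, not TensorFlow: `Op` and `run` are hypothetical names, and the `evaluated` list exists only to record which ops were actually computed. When a value is fed for an intermediate op, that op is never evaluated.

```python
class Op:
    """A toy op: a function plus its inputs (other ops or plain numbers)."""

    def __init__(self, name, fn, inputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs


evaluated = []  # names of the ops that were actually computed


def run(op, feed_dict=None):
    feed_dict = feed_dict or {}
    if op in feed_dict:
        # A fed value replaces the op: its inputs are never evaluated.
        return feed_dict[op]
    args = [run(i, feed_dict) if isinstance(i, Op) else i for i in op.inputs]
    evaluated.append(op.name)
    return op.fn(*args)


a = Op("add", lambda x, y: x + y, [2, 5])
b = Op("mul", lambda x, y: x * y, [a, 3])

print(run(b))                     # 21, computing both add and mul
del evaluated[:]                  # reset the log
print(run(b, feed_dict={a: 15}))  # 45
print(evaluated)                  # ['mul']: add was skipped
```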
The trap of lazy loading
One of the most common TensorFlow non-bug bugs I see (and used to commit) is what my friend Danijar and I call “lazy loading”. Lazy loading refers to a programming pattern in which you defer declaring or initializing an object until it is needed. In the context of TensorFlow, it means you defer creating an op until you need to compute it. For example, this is normal loading: you create the op z when you assemble the graph.
x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')
z = tf.add(x, y)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./graphs', sess.graph) # './graphs' is an assumed log directory
    for _ in range(10):
        sess.run(z)
    writer.close()
This is what happens when someone decides to be clever and use lazy loading to save one line
of code:
x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./graphs', sess.graph) # './graphs' is an assumed log directory
    for _ in range(10):
        sess.run(tf.add(x, y)) # create the op add only when you need to compute it
    writer.close()
Let’s see the graphs for them on TensorBoard.
With normal loading, the graph looks just like we expected.
With lazy loading, well, the node “Add” is missing, which is understandable since we added the node “Add” after we had written the graph to FileWriter. This makes it harder to read the graph, but it’s not a bug.
So, what’s the big deal?
Let’s look at the graph definition. Remember that to print out the graph definition, we use:
print tf.get_default_graph().as_graph_def()
The protobuf for the graph in normal loading has only 1 node “Add”. On the other hand, the protobuf for the graph in lazy loading has 10 copies of the node “Add”. It adds a new node “Add” every time you want to compute z.
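The node counts can be reproduced with a toy graph that simply records every node added to it. This is a plain-Python sketch, not TensorFlow: the `Graph` class and the two `build_*` functions are hypothetical, and the "Add", "Add_1", ... naming only imitates how TensorFlow makes node names unique.

```python
class Graph:
    """A toy graph that records the name of every node added to it."""

    def __init__(self):
        self.nodes = []
        self._counts = {}

    def add_node(self, op_type):
        # Make names unique: "Add", "Add_1", "Add_2", ...
        n = self._counts.get(op_type, 0)
        self._counts[op_type] = n + 1
        name = op_type if n == 0 else "%s_%d" % (op_type, n)
        self.nodes.append(name)
        return name


def build_normal(steps=10):
    graph = Graph()
    z = graph.add_node("Add")   # the op is created once, before the loop
    for _ in range(steps):
        pass                    # stand-in for sess.run(z)
    return graph


def build_lazy(steps=10):
    graph = Graph()
    for _ in range(steps):
        graph.add_node("Add")   # stand-in for sess.run(tf.add(x, y))
    return graph


print(len(build_normal().nodes))  # 1
print(len(build_lazy().nodes))    # 10
print(build_lazy().nodes[:3])     # ['Add', 'Add_1', 'Add_2']
```

The lazy version grows with the number of steps, which is exactly what bloats the graph definition in a long training loop.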
You probably think: “This is stupid. Why would I want to compute the same value more than once?” and think that it’s a bug that nobody will ever commit. It happens more often than you think. For example, you might want to compute the same loss function or make some prediction after a certain number of training samples. Before you know it, you’ve computed it thousands of times and added thousands of unnecessary nodes to your graph. Your graph definition becomes bloated, slow to load, and expensive to pass around.
There are two ways to avoid this bug. First, always separate the definition of ops and their execution when you can. But when that is not possible because you want to group related ops into classes, you can use a Python property to ensure that your function is only loaded once, when it’s first called. This is not a Python course so I won’t dig into how to do it. But if you want to know, check out this wonderful blog post by Danijar Hafner.
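For readers who want a rough idea of the property approach, here is one common way to write such a caching decorator. It is a sketch under stated assumptions rather than the code from the blog post: `lazy_property`, `Model`, and `build_count` are hypothetical names, and the string "prediction_op" stands in for a real TensorFlow op.

```python
import functools


def lazy_property(function):
    """Turn a method into a property whose body runs only on first access."""
    attribute = "_cache_" + function.__name__

    @property
    @functools.wraps(function)
    def wrapper(self):
        if not hasattr(self, attribute):
            # First access: build the value once and cache it on the instance.
            setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return wrapper


class Model:
    def __init__(self):
        self.build_count = 0

    @lazy_property
    def prediction(self):
        # In real code this is where the TensorFlow ops would be created.
        self.build_count += 1
        return "prediction_op"


model = Model()
for _ in range(10):
    op = model.prediction   # accessed ten times ...
print(model.build_count)    # ... but built only once: 1
```

Because the op is created on first access and reused afterwards, repeated fetches inside a training loop no longer add nodes to the graph.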
print sess.graph.as_graph_def()
print tf.get_default_graph().as_graph_def()
We also add some extra evidence called a bias. Basically, we want to be able to say that some things are more likely independent of the input.
But it's often more helpful to think of softmax the first way: exponentiating its inputs and then normalizing them. The exponentiation means that one more unit of evidence increases the weight given to any hypothesis multiplicatively. And conversely, having one less unit of evidence means that a hypothesis gets a fraction of its earlier weight. Softmax then normalizes these weights, so that they add up to one, forming a valid probability distribution.
x = tf.placeholder("float", [None, 784])
x is a placeholder, a value that we'll input when we ask TensorFlow to run a computation. (Here None means that a dimension can be of any length.)
A Variable is a modifiable tensor that lives in TensorFlow's graph of interacting operations. For machine learning applications, one generally has the model parameters be Variables.
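The "exponentiate, then normalize" description of softmax can be checked numerically. The following is a minimal plain-Python sketch (the `softmax` function is written here for illustration, and the evidence values are made up). It also confirms the multiplicative claim: one extra unit of evidence multiplies a hypothesis's weight relative to the others by e.

```python
import math


def softmax(evidence):
    """Exponentiate each input, then normalize so the results sum to one."""
    # Subtracting the maximum avoids overflow and does not change the result.
    largest = max(evidence)
    weights = [math.exp(e - largest) for e in evidence]
    total = sum(weights)
    return [w / total for w in weights]


probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))                 # 1.0 (up to rounding): a valid distribution

# Give the last hypothesis one more unit of evidence.
more = softmax([1.0, 2.0, 4.0])
ratio_before = probs[2] / probs[0]
ratio_after = more[2] / more[0]
print(ratio_after / ratio_before)  # about 2.718, i.e. e
```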
In order to train our model, we need to define what it means for the model to
be good. Well, actually, in machine learning we typically define what it means
for a model to be bad, called the cost or loss, and then try to minimize how bad
it is. But the two are equivalent.
One very common cost function is cross-entropy, defined as $H_{y'}(y) = -\sum_i y'_i \log(y_i)$, where y is our predicted probability distribution, and y′ is the true distribution (the one-hot vector we'll input). In some rough sense, the cross-entropy is measuring how inefficient our predictions are for describing the truth. Going into more detail about cross-entropy is beyond the scope of this tutorial, but it's well worth understanding.
What TensorFlow actually does when you ask it to minimize the cost, behind the scenes, is add new operations to your graph which implement backpropagation and gradient descent. Then it gives you back a single operation which, when run, will do a step of gradient descent training, slightly tweaking your variables to reduce the cost.
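To see what one such training step does, the sketch below computes the cross-entropy of a softmax prediction and applies a single gradient descent update by hand, in plain Python. Everything in it is an assumption made for illustration: the three logits stand in for trainable variables, the learning rate of 0.5 is arbitrary, and the gradient p - y′ is the standard hand-derived gradient of cross-entropy with respect to softmax inputs, whereas TensorFlow derives gradients automatically.

```python
import math


def softmax(logits):
    largest = max(logits)
    weights = [math.exp(z - largest) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(y_pred, y_true):
    """H_{y'}(y) = -sum_i y'_i * log(y_i); zero-weight terms are skipped."""
    return -sum(t * math.log(p) for p, t in zip(y_pred, y_true) if t > 0)


def gradient_step(logits, y_true, learning_rate):
    """One gradient descent update on the logits."""
    y_pred = softmax(logits)
    # Gradient of the cross-entropy with respect to each logit is p_i - y'_i.
    gradient = [p - t for p, t in zip(y_pred, y_true)]
    return [z - learning_rate * g for z, g in zip(logits, gradient)]


logits = [0.0, 0.0, 0.0]   # stand-in for trainable variables
y_true = [0.0, 1.0, 0.0]   # one-hot true distribution

loss_before = cross_entropy(softmax(logits), y_true)
logits = gradient_step(logits, y_true, learning_rate=0.5)
loss_after = cross_entropy(softmax(logits), y_true)

print(loss_before)  # log(3), about 1.0986, for a uniform prediction
print(loss_after)   # smaller: the step reduced the cost
```

Running the update repeatedly keeps lowering the cost, which is what happens each time the training operation is run in a loop.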