Fix several typos in docs.

Qian Xiao 2020-07-11 16:23:50 -07:00
parent 84f4c15278
commit 37dc3e08a4
1 changed files with 5 additions and 5 deletions

@@ -14,10 +14,10 @@ initially in CPU memory. Then each of the :math:`G` GPUs obtains a mini-batch of
 along with the current model parameters. Using this mini-batch each GPU then
 computes the gradients for all model parameters and sends these gradients back
 to the CPU when the GPU is done with its mini-batch. The CPU then asynchronously
-updates the model parameters whenever it recieves a set of gradients from a GPU.
+updates the model parameters whenever it receives a set of gradients from a GPU.
 Asynchronous parallel optimization has several advantages and several
-disadvantages. One large advantage is throughput. No GPU will every be waiting
+disadvantages. One large advantage is throughput. No GPU will ever be waiting
 idle. When a GPU is done processing a mini-batch, it can immediately obtain the
 next mini-batch to process. It never has to wait on other GPUs to finish their
 mini-batch. However, this means that the model updates will also be asynchronous
@@ -63,7 +63,7 @@ advantages of asynchronous and synchronous optimization.
 Hybrid Parallel Optimization
 ----------------------------
-Hybrid parallel optimization combines most of the benifits of asynchronous and
+Hybrid parallel optimization combines most of the benefits of asynchronous and
 synchronous optimization. It allows for multiple GPUs to be used, but does not
 suffer from the incorrect gradient problem exhibited by asynchronous
 optimization.
@@ -86,7 +86,7 @@ to use multiple GPUs in parallel. Furthermore, unlike asynchronous parallel
 optimization, the incorrect gradient problem is not present here. In fact,
 hybrid parallel optimization performs as if one is working with a single
 mini-batch which is :math:`G` times the size of a mini-batch handled by a single GPU.
-Hoewever, hybrid parallel optimization is not perfect. If one GPU is slower than
+However, hybrid parallel optimization is not perfect. If one GPU is slower than
 all the others in completing its mini-batch, all other GPUs will have to sit
 idle until this straggler finishes with its mini-batch. This hurts throughput.
 But, if all GPUs are of the same make and model, this problem should be
@@ -99,7 +99,7 @@ synchronous optimization. So, we will, for our work, use this hybrid model.
 Adam Optimization
 -----------------
-In constrast to
+In contrast to
 `Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_,
 in which `Nesterovs Accelerated Gradient Descent <www.cs.toronto.edu/~fritz/absps/momentum.pdf>`_ was used, we will use the Adam method for optimization `[3] <http://arxiv.org/abs/1412.6980>`_,
 because, generally, it requires less fine-tuning.
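
Reviewer note: the documentation touched by this commit states that hybrid parallel optimization behaves as if one were working with a single mini-batch :math:`G` times the size of a per-GPU mini-batch. The following is a minimal sketch that checks this claim numerically. It is not code from this repository. It assumes a one-parameter linear model with a mean squared error loss and equally sized per-GPU mini-batches, and the names ``gradient`` and ``hybrid_gradient`` are hypothetical.

```python
# Illustrative sketch only: the model, the data, and the function names are
# assumptions made for this example, not part of the documented project.

def gradient(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def hybrid_gradient(w, shards):
    """Average the gradients that each of the G workers computed on its shard."""
    # On real hardware each entry would be computed on a separate GPU.
    per_worker = [gradient(w, shard) for shard in shards]
    return sum(per_worker) / len(per_worker)


G = 4                                                   # number of simulated GPUs
data = [(float(i), 3.0 * i + 1.0) for i in range(8)]    # 8 (x, y) samples
shards = [data[g::G] for g in range(G)]                 # G mini-batches of equal size

w = 0.5
combined = gradient(w, data)           # one mini-batch that is G times larger
averaged = hybrid_gradient(w, shards)  # synchronous average over the G workers

# The two values agree up to floating-point rounding.
print(abs(combined - averaged) < 1e-9)
```

The equality relies on every shard having the same number of samples. With unequal shard sizes, a plain average of per-worker gradients would weight samples unevenly and would no longer match the gradient of the combined mini-batch.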