Street by predicting character classifier in left to

Street view number detection is called
natural scene text recognition problem which is quite different from printed
character or handwritten recognition. Research in this field was started in
90’s, but still it is considered as an unsolved issue. As I mentioned earlier
that the difficulties arise due to fonts variation, scales, rotations, low
lights etc.

     In earlier years to deal with natural
scene text identification sequentially, first character classification by
sliding window or connected components mainly used. 4 After that word
prediction can be done by predicting character classifier in left to right
manner. Recently segmentation method guided by supervised classifier use where
words can be recognized through a sequential beam search. 4 But none of this
can help to solve the street view recognition problem.

     In recent works convolutional neural
networks proves its capabilities more accurately to solve object recognition
task. 4 Some research has done with CNN to tackle scene text recognition
tasks. 4 Studies on CNN shows its huge capability to represent all types of
character variation in the natural scene and till now it is holding this high
variability. Analysis with convolutional neural network stars at early 80’s and
it successfully applied for handwritten digit recognition in 90’s. 4 With the
recent development of computer resources, training sets, advance algorithm and
dropout training deep convolutional neural networks become more efficient to recognize
natural scene digit and characters. 3

      Previously CNN used mainly to detecting a
single object from an input image. It was quite difficult to isolate each
character from a single image and identify them. Goodfellow et al., solve this
problem by using deep large CNN directly to model the whole image and with a
simple graphical model as the top inference layer. 4

      The rest of the paper is designed in
section III Convolutional neural network architecture, section IV Experiment,
Result, and Discussion and Future Work and Conclusion in section V.          

Convolutional Neural Networks
(CNN) is a multilayer network to handle complex and high-dimensional data, its
architecture is same as typical neural networks. 8 Each layer contains some
neuron which carries some weight and biases. Each neuron takes images as
inputs, then move onward for implementation and reduce parameter numbers in the
network. 7 The first layer is a convolutional layer. Here input will be
convoluted by a set of filters to extract the feature from the input. The size
of feature maps depends on three parameters: number of filters, stride size,
padding. After each convolutional layer, a non-linear operation, ReLU use. It
converts all negative value to zero. Next is pooling or sub-sampling layer, it
will reduce the size of feature maps. Pooling can be different types: max,
average, sum. But max pooling is generally used. Down-sampling also controls
overfitting. Pooling layer output is using to create feature extractor. Feature
extractor retrieves selective features from the input images. These layers will
have moved to fully connected layers (FCL) and the output layer. In CNN
previous layer output considers as next layer input. For the different type of
problem, CNN is different.


main objective of this project is detecting and identifying house-number signs
from street view images. The dataset I am considering for this project is
street view house numbers dataset taken from 5 has similarities with MNIST
dataset. The SVHN dataset has more than 600,000 labeled.