Street view number detection is callednatural scene text recognition problem which is quite different from printedcharacter or handwritten recognition. Research in this field was started in90’s, but still it is considered as an unsolved issue. As I mentioned earlierthat the difficulties arise due to fonts variation, scales, rotations, lowlights etc. In earlier years to deal with naturalscene text identification sequentially, first character classification bysliding window or connected components mainly used. 4 After that wordprediction can be done by predicting character classifier in left to rightmanner. Recently segmentation method guided by supervised classifier use wherewords can be recognized through a sequential beam search.
4 But none of thiscan help to solve the street view recognition problem. In recent works convolutional neuralnetworks proves its capabilities more accurately to solve object recognitiontask. 4 Some research has done with CNN to tackle scene text recognitiontasks.
4 Studies on CNN shows its huge capability to represent all types ofcharacter variation in the natural scene and till now it is holding this highvariability. Analysis with convolutional neural network stars at early 80’s andit successfully applied for handwritten digit recognition in 90’s. 4 With therecent development of computer resources, training sets, advance algorithm anddropout training deep convolutional neural networks become more efficient to recognizenatural scene digit and characters.
3 Previously CNN used mainly to detecting asingle object from an input image. It was quite difficult to isolate eachcharacter from a single image and identify them. Goodfellow et al., solve thisproblem by using deep large CNN directly to model the whole image and with asimple graphical model as the top inference layer. 4 The rest of the paper is designed insection III Convolutional neural network architecture, section IV Experiment,Result, and Discussion and Future Work and Conclusion in section V. Convolutional Neural Networks(CNN) is a multilayer network to handle complex and high-dimensional data, itsarchitecture is same as typical neural networks. 8 Each layer contains someneuron which carries some weight and biases. Each neuron takes images asinputs, then move onward for implementation and reduce parameter numbers in thenetwork.
7 The first layer is a convolutional layer. Here input will beconvoluted by a set of filters to extract the feature from the input. The sizeof feature maps depends on three parameters: number of filters, stride size,padding. After each convolutional layer, a non-linear operation, ReLU use.
Itconverts all negative value to zero. Next is pooling or sub-sampling layer, itwill reduce the size of feature maps. Pooling can be different types: max,average, sum. But max pooling is generally used.
Down-sampling also controlsoverfitting. Pooling layer output is using to create feature extractor. Featureextractor retrieves selective features from the input images. These layers willhave moved to fully connected layers (FCL) and the output layer. In CNNprevious layer output considers as next layer input. For the different type ofproblem, CNN is different.
Themain objective of this project is detecting and identifying house-number signsfrom street view images. The dataset I am considering for this project isstreet view house numbers dataset taken from 5 has similarities with MNISTdataset. The SVHN dataset has more than 600,000 labeled.