The purpose of the project is to design a device that helps blind people understand
the world around them. Building on rapidly developing technologies such as face recognition and optical character recognition (OCR), a
Raspberry Pi based smart camera, “The Eye”, is designed and implemented in this project. Users only need to take a photo with the
device, and it will give a general description of the photo, recognize the faces in it, and extract any text information. All of
this information is then output in audio form.
Introduction
Since our target users are blind, this device is designed to be as simple and easy to use as possible. On the
hardware side, a piCamera is used to take photos, and users access all of the device's functions by pressing
buttons on the Raspberry Pi.
Our smart camera can send the photo to a cloud server and receive a general description of the photo in return. The face
recognition function is mainly implemented with the OpenCV library: a Haar cascade classifier is used for face detection, and all
three face recognition models in the library are combined for better recognition performance. For text extraction, we
use the “tesseract-ocr” library to transform the photo into text. Finally, the “flite” library outputs all
the information in audio form.
Design Plan
Hardware Design
As the following figure shows, the system consists of the Raspberry Pi and three additional modules. The
piCamera acquires the image, and what the camera sees is displayed on the piTFT. The
acquired images are sent to the Raspberry Pi, which performs the general description, face recognition, and text
extraction tasks with the help of cloud servers. The results are then output through a speaker in audio form.
Figure 1. Hardware design of “The Eye” system
Software Design
The following flow chart shows the overall structure of the software.
It mainly consists of five modules: image acquisition and display, general description, face recognition, text
extraction, and audio output. Once the code starts running, the piCamera continuously takes photos and
displays the latest one on the piTFT. At the same time, the press event of each button is checked. If no
button is pressed, the code keeps checking for press events while acquiring and displaying images. Once
a button press is detected, the corresponding code is executed. As shown in the flow chart, pressing button
17 makes the device output a brief description, button 22 triggers the face recognition function,
and button 23 triggers text extraction.
Figure 2. Software design of "The Eye system"
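The button-to-function dispatch described above can be sketched as follows. The action names are placeholders of ours, and on the device `read_pin` would be a small wrapper around `GPIO.input` from RPi.GPIO:

```python
# Mapping from the flow chart: button 17 -> description,
# button 22 -> face recognition, button 23 -> text extraction.
# The action names are placeholders for this sketch.
BUTTON_ACTIONS = {
    17: "general_description",
    22: "face_recognition",
    23: "text_extraction",
}

def check_buttons(read_pin):
    """Poll each button once.  `read_pin(pin)` should return True when
    the button on that GPIO pin is pressed.  Returns the name of the
    function to run, or None if no button is pressed."""
    for pin, action in sorted(BUTTON_ACTIONS.items()):
        if read_pin(pin):
            return action
    return None
```

The main loop keeps refreshing the latest camera frame and calls `check_buttons` on every iteration; when it returns an action, the current frame is handed to the corresponding module.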
Image Acquiring & Display
We use the piCamera to take photos, and what it sees is displayed on the piTFT.
Image acquisition is implemented with the Python package “picamera”: the camera continuously takes
photos, and the latest one is stored in the variable “image”. This variable keeps being refreshed
until a button is pressed; once that happens, the latest image is taken and processed by the
corresponding code. Image display is mainly used for debugging: the Python package “pygame” displays
the images taken by the piCamera on the piTFT. The following figure shows the front view and rear view
of the device with the “image acquiring and display” module running.
Figure 3. Front view and rear view of the smart camera
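A minimal sketch of this acquisition-and-display loop, assuming the “picamera” and “pygame” packages and a 320x240 piTFT (the resolution is our assumption, not stated in the report):

```python
import io

def capture_and_display(screen=None):
    """Continuously capture JPEG frames with the piCamera, keeping the
    latest one in `image`; optionally blit each frame to a pygame
    surface on the piTFT.  Sketch only -- runs on the Pi, not a desktop."""
    import picamera  # imported lazily so the file can be loaded off-device
    import pygame

    with picamera.PiCamera(resolution=(320, 240)) as camera:
        stream = io.BytesIO()
        for _ in camera.capture_continuous(stream, format='jpeg'):
            stream.seek(0)
            image = pygame.image.load(stream, 'frame.jpg')  # latest frame
            if screen is not None:
                screen.blit(image, (0, 0))
                pygame.display.update()
            # Rewind and truncate so the next capture overwrites the stream.
            stream.seek(0)
            stream.truncate()
```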
General Description
For the general description, we want to give the user some general information about the image taken by
the piCamera. The description takes two forms: a short sentence describing the whole view, and a set of
keywords. For this function we mainly rely on Microsoft Cognitive Services and CloudSight.ai, both cloud
services that take in an image and help analyze it. Since we are using cloud services, the key tasks are
making the HTTP request, encoding the image data, and decoding the JSON response.
First, we signed up for Microsoft Cognitive Services and CloudSight.ai and obtained the account keys needed to make requests. Then we set up
the header, parameters, and body of the HTTP POST request; the image is encoded into a Base64 string and placed in the request body.
The request is then sent using the Python “requests” package. The response is a JSON string, which can be
decoded into a JSON object; since a JSON object is simply a dictionary in Python, we can retrieve data by
key. For the one-sentence description the key is “captions”,
and for the keywords the key is “tags”. Finally, we send the sentence and keywords to the audio output module.
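The request assembly and response parsing can be sketched as below. The endpoint URL is a placeholder for one region's Computer Vision endpoint, and the exact body encoding a given service expects (Base64 string versus raw bytes) should be checked against its documentation, so treat this as an outline rather than the device's exact code:

```python
import base64
import json

# Placeholder endpoint and region -- substitute your own subscription.
ANALYZE_URL = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"

def build_request(image_bytes, subscription_key):
    """Assemble the pieces of the HTTP POST: headers, query parameters,
    and the body (the image encoded as a Base64 string, as described
    above)."""
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/octet-stream",
    }
    params = {"visualFeatures": "Description,Tags"}
    body = base64.b64encode(image_bytes)
    return headers, params, body

def parse_description(response_text):
    """Decode the JSON response and pull out the one-sentence caption
    and the keyword list via the "captions" and "tags" keys."""
    obj = json.loads(response_text)
    caption = obj["description"]["captions"][0]["text"]
    keywords = obj["description"]["tags"]
    return caption, keywords
```

With the `requests` package the call is then roughly `requests.post(ANALYZE_URL, headers=headers, params=params, data=body)`.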
Face Detection and Recognition
1. Face Detection
For face detection, we implemented both online and offline methods. When the device has
internet access, either method can be used to detect the faces in a photo; without internet
access, only the offline method is available.
1.1 Online Face Detection
As with the general description above, online face detection makes an HTTP POST request to the
Microsoft cloud server and decodes the JSON response. The response contains a good amount of
information: by changing the parameters of the HTTP request, we can ask for attributes such as age,
hair color, whether the person is smiling, accessories, and so on. The JSON object holds a list of
all the detected faces. For each face we read the age, hair, gender, and smiling attributes and
compose sentences like the following:
“She is a stranger. She is around 22 years old. She is not smiling. She has black hair.”
In order to do face recognition, we also need small images of the faces. We find the
location and size of each detected face and crop the face image using this information.
Since OpenCV images are stored as NumPy arrays, cropping a face is simply taking a subarray
of the original image.
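Since each detected face comes with a bounding rectangle, the cropping step is one line of NumPy slicing; the (x, y, w, h) box format here matches what OpenCV's cascade detectors return:

```python
import numpy as np

def crop_faces(image, face_boxes):
    """Crop detected faces out of an OpenCV image.  The image is a NumPy
    array indexed [row, column], so each crop is just a subarray.
    `face_boxes` is a list of (x, y, w, h) rectangles, with (x, y) the
    top-left corner of the face."""
    return [image[y:y + h, x:x + w] for (x, y, w, h) in face_boxes]
```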
1.2 Offline Face Detection
We also implemented our own offline face detection function with OpenCV,
in case there is no internet connection. OpenCV provides a widely used detection algorithm
called the cascade classifier, with two main variants: the LBP classifier and the Haar
classifier. LBP is much faster but more likely to miss faces, while Haar is slower but
generally more accurate. We tested both algorithms with the same parameters and compared the
results: Haar detected our faces most of the time, while LBP sometimes missed them despite its
speed. We therefore chose Haar as our detection algorithm.
First, the Haar model, an XML file, is loaded during initialization. Second, a detector is created and its parameters are set,
including "minNeighbors" and "scaleFactor". "scaleFactor" controls how finely the image pyramid is sampled, and so determines the
running time: a value closer to 1 searches more scales, which takes longer but finds more faces. "minNeighbors" sets how many
overlapping candidate detections are required to accept a face: if it is too small, more ‘faces’ are detected, including more false
positives, but real faces are less likely to be missed. We finally settled on "minNeighbors = 2" and "scaleFactor = 1.1",
and did not set a minimum or maximum area.
2. Face Recognition
After the device detects all the faces in a photo, we also want it
to recognize us among them. More specifically, the device is expected to classify each face into one of three
categories: Yixuan, Yufu, or Stranger.
The first step in implementing this function is creating the training dataset. We took photos
of each of us with the piCamera and used the face detection module to crop the faces out of them. As for the
size of the dataset, we found that training the model on too many photos made the device more likely to
recognize a stranger as one of us, so we took 20 photos of each of us for training. The label for
“Yixuan” is 2 and the label for “Yufu” is 3.
Figure 4. Samples of training dataset
Then we need to create a face recognition model. OpenCV provides three
face recognition methods: Eigenfaces, Fisherfaces, and Local Binary Patterns Histograms (LBPH).
Each time we feed a photo to a model, it outputs a label and a confidence value (lower is
better). If the confidence is worse than the threshold, the photo is classified as “Stranger”;
otherwise, the model decides according to the label. After many experiments with the three
methods, we found that each performs well in different situations; for example, the Eigenfaces
method is sensitive to distance. We therefore decided to combine all three models: each makes
its own decision, and we take the majority vote.
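OpenCV's `cv2.face` module provides the three recognizers; the voting step itself can be sketched as below. The thresholds are placeholders, since the report does not give the values used on the device:

```python
STRANGER = 0  # placeholder label for strangers; 2 = Yixuan, 3 = Yufu as above

def majority_vote(predictions, thresholds):
    """`predictions` is a list of (label, confidence) pairs, one per
    model (lower confidence is better); `thresholds` gives each model's
    cutoff.  A prediction over its threshold votes 'Stranger'.  Returns
    the label chosen by at least two of the three models, or STRANGER
    when there is no majority."""
    votes = [label if confidence < threshold else STRANGER
             for (label, confidence), threshold in zip(predictions, thresholds)]
    for candidate in set(votes):
        if votes.count(candidate) >= 2:
            return candidate
    return STRANGER
```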
Text Extraction
1. Online Text Extraction
For online text extraction, we again use Microsoft Cognitive Services.
The reason we use an online method even after implementing offline OCR is that Microsoft
Cognitive Services separates the different text regions for us, so text from unrelated regions
is not mixed together and the result makes more sense. We make the HTTP request the same way as
described above. Since the words in the JSON object are detected separately, we need to combine
them into a single string of text. The string may contain special characters, such as double and
single quotes, that degrade the audio output, so we replace these characters with spaces before
sending the string to ‘flite’.
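The flattening and cleanup steps can be sketched as follows. The regions/lines/words key names follow Microsoft's OCR response format as far as we can tell, so verify them against the actual service response:

```python
def words_to_text(ocr_json):
    """Flatten an OCR response (regions -> lines -> words) into one
    string, keeping regions separate so unrelated columns of text are
    not mixed together."""
    region_texts = []
    for region in ocr_json.get("regions", []):
        lines = [" ".join(word["text"] for word in line["words"])
                 for line in region["lines"]]
        region_texts.append(" ".join(lines))
    return ". ".join(region_texts)

def clean_for_speech(text):
    """Replace quote characters that interfere with the audio output
    with spaces, as described above."""
    for ch in ('"', "'", '\u2018', '\u2019', '\u201c', '\u201d'):
        text = text.replace(ch, ' ')
    return text
```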
2. Offline Text Extraction
Another OCR method, which can run offline, is also implemented using the
“Tesseract OCR” software. First, we installed it on the Raspberry Pi by running
“sudo apt-get install tesseract-ocr”. After that, a single command extracts the text from any
photo we want. The basic format of the command is: “tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]”.
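From Python the same command can be run with `subprocess`. The output base name below is our choice; Tesseract writes its result to `<outputbase>.txt`:

```python
import subprocess

def ocr_offline(image_path, output_base="ocr_result"):
    """Run Tesseract OCR on an image file and return the extracted text.
    Assumes the 'tesseract-ocr' package is installed as described above."""
    subprocess.run(["tesseract", image_path, output_base], check=True)
    # Tesseract writes the recognized text to <output_base>.txt.
    with open(output_base + ".txt") as f:
        return f.read()
```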
Audio Output
Since the target users of our device are blind, we need to output
the information in audio form. We use “Flite”, an embedded speech synthesis library that can
synthesize strings of text and text files from the command line. After getting the text from the
previous modules, we run the corresponding command with “subprocess”. The format of the command
is: “flite -t ‘text information’”.
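A sketch of that call; passing the text as its own argv element (rather than building a single quoted shell string) sidesteps most quoting problems:

```python
import subprocess

def speak(text):
    """Read `text` aloud with the 'flite' synthesizer (equivalent to
    running: flite -t 'text information')."""
    subprocess.run(["flite", "-t", text], check=True)
```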
Result
General Description
To test the first function, we pressed button 17. The output is printed in the console window, as shown
in the following figure, and the text is then sent to "flite" and transformed into audio.
As the figure shows, our device correctly gives a one-sentence description, "A person sitting at a table using a laptop computer",
and also outputs a list of keywords describing the photo.
Figure 5. Output of the first function
Face Recognition
After pressing button 22, the device takes a picture and recognizes the faces in it. In the
following demo, a photo of both of us was taken, and the device successfully recognized us and gave
a brief description of each of us.
Figure 6. Output of the second function
Text Extraction
For the third function, we press button 23 and extract the text from a photo of a magazine.
The following figure shows the result. As with the first function, the output is first printed
in the console window and then sent to "flite" for audio output.