
Raspberry Pi robotic server and motion, video, sensor controller.

View the Project on GitHub hcl337/ancilla


def: a person whose work provides necessary support to the primary activities of an organization, institution, or industry.

Project Goal

Create a social robot which conforms to very simple human social rules and recognizes social cues such as eye contact, facial expressions, speaking and known objects.

Supported Interactions

The overall goal is to mimic simple human social interaction:

Running the system

  1. Check out the code base.
  2. Install all dependencies from the setup script.
  3. Run python src/ to start the robot.
  4. In your browser, connect to the robot's IP address to watch what is happening.

Overall System Design Philosophy

Having gone through this in multiple projects, the overall goal is a set of hardware and software that is very easy to support, with no hidden elements that will break or be forgotten in the future. The decisions below are designed to keep the system easily supportable, cost-effective and understandable for new people.


The system contains full onboard processing so there are no external computers needed. It also has a full web interface allowing for easy understanding of what is going on inside the system by going to the support URL.


One of the Raspberry Pis is for core processing and the other is dedicated to environment processing. A third may be necessary if the processing load of controlling the robot and doing vision processing is too much.

Web Interface

One of the biggest challenges in embedded systems is being able to understand and interact with them successfully. Therefore, I am going to expose the key elements in a password-protected web interface.

Here is the API documentation.

To change the password for the web server interface, run the following command in the /src/webserver directory.

python -c "import hashlib, getpass; print(hashlib.sha512(getpass.getpass().encode('utf-8')).hexdigest())" > password.txt
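The command above stores a SHA-512 hex digest in password.txt. As a minimal sketch of how a login check against that file could work, assuming the web server simply hashes the submitted password and compares digests (the function names here are hypothetical, not taken from the project's code):

```python
import hashlib
import hmac


def hash_password(password):
    """Return the SHA-512 hex digest of a password, as written to password.txt."""
    return hashlib.sha512(password.encode("utf-8")).hexdigest()


def password_matches(candidate, stored_hex_digest):
    """Compare a submitted password against the stored digest.

    hmac.compare_digest does the comparison in constant time, and strip()
    drops the trailing newline left by the shell redirect.
    """
    return hmac.compare_digest(hash_password(candidate), stored_hex_digest.strip())


# Stand-in for the contents of password.txt
stored = hash_password("s3cret") + "\n"
print(password_matches("s3cret", stored))
print(password_matches("wrong", stored))
```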

Processing Code Libraries


The system will use two cameras to enable both full environment awareness and targeted vision. For environmental awareness, background subtraction is the most important step: knowing which elements matter and which are just walls. If a camera is moving on servos, it is very difficult to guess which pixels correspond to foreground or background data without 3D pixels (maybe a future project :-) ). Therefore, by using a static wide-angle camera, a standard background removal can be done to remove non-salient objects, color clustering can segment the image into elements, and those elements can then be clustered into people, objects, etc.
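The background-removal step for the static camera can be sketched in a few lines. This is a minimal illustration only, assuming grayscale frames stored as nested lists and a running-average background model; the function names and thresholds are hypothetical, and on the Pi this would more likely be done with OpenCV than by hand.

```python
# Illustrative sketch only: frames are small grayscale images stored as
# nested lists, and the function names are hypothetical.

def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the running-average background model."""
    return [
        [(1.0 - alpha) * b + alpha * p for b, p in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30):
    """Mark pixels that differ from the background by more than threshold."""
    return [
        [abs(p - b) > threshold for b, p in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


background = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]  # empty room
frame = [[10, 12, 10], [10, 200, 10], [10, 10, 9]]       # something in the middle

mask = foreground_mask(background, frame)
background = update_background(background, frame, alpha=0.5)
```

Only the pixels flagged in the mask would be passed on to color clustering. A small alpha keeps the model slow to change, so something that stands still is not absorbed into the background too quickly.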


Key Tracking Actions

Vision Code Libraries

Raspberry Pi Vision Install

sudo apt-get install python-opencv libjpeg-dev



Robotic head with 5 DOF.


Movement Code Libraries

Important Notes on Driving Hobby Servos

PWM does not mean PWM

NOTE: Even though the servos take a 0 to 3.3 V control signal and the 12-bit range is 0 to 4095, driving these servos across that full range will blow them up. The actual range of RobotGeek servos is 150 to 600, so positions need to be mapped onto that range correctly.
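A minimal sketch of that mapping, assuming the 150 to 600 range corresponds linearly to 0 to 180 degrees of travel (the travel figure and the function name are assumptions, not measured values). Clamping the input means a bad angle can never push the output outside the safe range:

```python
SERVO_MIN_TICKS = 150  # lower safe limit quoted above for RobotGeek servos
SERVO_MAX_TICKS = 600  # upper safe limit quoted above


def angle_to_ticks(angle_deg, max_angle_deg=180.0):
    """Map an angle onto the safe tick range (assumes a linear 0-180 degree travel)."""
    clamped = min(max(angle_deg, 0.0), max_angle_deg)
    span = SERVO_MAX_TICKS - SERVO_MIN_TICKS
    return int(round(SERVO_MIN_TICKS + span * clamped / max_angle_deg))


print(angle_to_ticks(90))
print(angle_to_ticks(999))  # out-of-range input is clamped, not passed through
```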

From here:

0.5 ms / 4.8 usec = 104, the number required by our program to position the servo at 0 degrees
1.5 ms / 4.8 usec = 312, the number required by our program to position the servo at 90 degrees
2.5 ms / 4.8 usec = 521, the number required by our program to position the servo at 180 degrees
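The arithmetic above is a pulse width divided by the length of one tick. As a rough sketch (function names are hypothetical), the tick length follows from the update frequency and the 12-bit resolution. At exactly 50 Hz it works out to about 4.88 usec rather than the 4.8 usec used above, so the counts for a given pulse width shift slightly:

```python
def tick_length_us(freq_hz, resolution=4096):
    """Length of one tick in microseconds for a given update rate."""
    return 1000000.0 / (freq_hz * resolution)


def pulse_ms_to_ticks(pulse_ms, tick_us=4.8):
    """Unrounded tick count for a pulse width, using the 4.8 usec figure by default."""
    return pulse_ms * 1000.0 / tick_us


for pulse_ms in (0.5, 1.5, 2.5):
    quoted = pulse_ms_to_ticks(pulse_ms)
    exact = pulse_ms_to_ticks(pulse_ms, tick_us=tick_length_us(50))
    print(pulse_ms, round(quoted, 1), round(exact, 1))
```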

Smooth control of position at slow speeds is hard

The controller only updates at 50 Hz, and the actual position control of the servos seems to be accurate only to about 0.5 degrees, which means the whole thing can jitter a LOT. To account for this, we need to adjust the interpolation algorithms.
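One possible adjustment, sketched here as an assumption rather than what the project actually does: generate one setpoint per 50 Hz update along an eased curve, and snap each setpoint to the 0.5 degree grid so the servo is never asked for a step smaller than it can resolve. The names and the choice of smoothstep easing are illustrative.

```python
UPDATE_HZ = 50         # controller update rate noted above
RESOLUTION_DEG = 0.5   # approximate positioning accuracy noted above


def smoothstep(t):
    """Ease-in/ease-out curve from 0 to 1 for t in [0, 1]."""
    return t * t * (3.0 - 2.0 * t)


def plan_move(start_deg, end_deg, duration_s):
    """Return one setpoint per controller update, snapped to the servo's resolution."""
    steps = max(1, int(round(duration_s * UPDATE_HZ)))
    setpoints = []
    for i in range(1, steps + 1):
        ideal = start_deg + (end_deg - start_deg) * smoothstep(i / float(steps))
        setpoints.append(round(ideal / RESOLUTION_DEG) * RESOLUTION_DEG)
    return setpoints


move = plan_move(0.0, 10.0, 1.0)
print(len(move), move[0], move[-1])
```

At slow speeds many consecutive setpoints will be identical, which is the point: the servo holds still between steps it can actually make instead of hunting around an unreachable position.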

A few things I have seen online:


There are multiple


TTS Tutorials and resources

TTS Libraries



There are two ways that speech recognition can be implemented: either local (Sphinx) or cloud-based (Amazon, Google). Cloud-based recognition is generally more accurate, but there is a larger delay between speech and recognition. If local recognition is used, a small vocabulary should be specified.
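To illustrate why a small vocabulary helps, here is a rough sketch that snaps a noisy recognition result to the nearest phrase in a short command list. It does not configure Sphinx itself; it is a post-processing step using Python's standard difflib, and the vocabulary and function name are made up for the example.

```python
import difflib

# Hypothetical command list; the real vocabulary would depend on the robot's behaviours.
VOCABULARY = ["hello robot", "look left", "look right", "stop"]


def match_command(hypothesis, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest known command, or None if nothing is similar enough."""
    cleaned = hypothesis.lower().strip()
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_command("helo robot"))
print(match_command("xyzzy"))
```

With only a handful of allowed phrases, a slightly wrong transcription still lands on the intended command, and anything unrecognisable is rejected rather than guessed.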

Speech Recognition Libraries