Nimish Sawant Apr 28, 2017 11:52 AM IST
Persistent Systems is a technology services company headquartered in Pune. Every year it hosts a 24-hour global hackathon called 'Semicolons', in which self-managed teams ideate and compete against each other to come up with innovative solutions to everyday problems. This year’s theme was Digital Transformation, and it drew 45 teams across 11 global centres in 5 countries, involving 600+ Persistent employees.
The winning solution this year was Drishti — an application which uses advances in deep learning, artificial intelligence, image recognition, person identification, speech-to-text and accessibility technologies to bring value to those who are visually impaired.
We spoke to team leader Pandurang Kamat, chief architect for innovation and R&D at Persistent Systems, who also heads the company's Blockchain division.
Tech2: On the use-cases and the ideation of the Drishti app
Kamat: This year the team that I was representing decided to focus on digital disruption. We generally look for a combination that is meaningful in an enterprise, social or consumer setting and that uses cutting-edge technology. Last year, we worked on a blockchain system and tried to build a smart contract solution for each citizen contributing towards alleviating climate change, and looked at how one can gamify the whole product.
This year we realised we wanted to focus on deep learning and machine learning. It is a trending technology. We wanted to use it to do something in the social sector. After discussing with the team, we decided to do something for the visually impaired. One of the major applications of deep learning technology is computer vision. It is not limited to identifying what is in the scene; it also classifies the scenario and what is happening in the scene, based on the way things are placed.
This project was inspired by the Microsoft Research video which showed how a visually impaired researcher had used technology to build successful smart glasses for blind people. That work also interprets a scene, which is very different from just image recognition.
We realised that slowly, but surely, these high-end technologies are becoming more commoditised and are no longer just in the realm of rocket scientists. We spent three weeks studying deep learning and learning how to use it, and then went about building an application that makes the world more accessible to a visually impaired person by doing certain very specific tasks. The first step was to identify and classify objects and persons and describe the scene with directional guidance.
What components of the smartphone have you used to ensure that a person gets an aural context of the surroundings?
The first thing is the camera of the smartphone. We had plans to use GPS to showcase locations, but that was over and above our brief. We also used a local text-to-speech engine on the phone to give an audio output from your speaker or headphone jack. We have ensured that the processing happens in the cloud so that one is not limited by the handset.
We built software and deployed it on Google Compute. We built a progressive architecture. Our aim was to make technology invisible to the user. We want to ensure that once the camera captures something, the app is able to interpret it quickly. If the interpretation happens, say, after 30 seconds, then it defeats the purpose. It’s not natural.
In our app, as soon as the photo is taken, the first level of interpretation is near-instantaneous: in a couple of seconds you get a description of your surroundings. This is not a full-blown application, but whatever we could do in 24 hours. So we focussed on one user and connected to his Google+ Photos and Facebook accounts to get some contextual information, and then our machine learning algorithm learnt from that data. After that, when a friend came in front of the visually impaired subject, the app could recognise that person.
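The progressive interpretation described above can be illustrated with a minimal Python sketch. It is not the team's implementation: the stage names, the three stub functions and their canned sentences are all hypothetical stand-ins for cloud-side models. It only shows the general idea of running the cheapest analysis first and handing each result to the listener as soon as it is ready, rather than waiting for every stage to finish.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class Stage:
    name: str                        # hypothetical label for the stage
    analyse: Callable[[bytes], str]  # takes image bytes, returns text to speak


# The three functions below are stubs with canned text. In a real system
# each would call a model running in the cloud (an assumption, not shown).
def quick_caption(image: bytes) -> str:
    return "You are in a room with two people."


def objects_with_direction(image: bytes) -> str:
    return "There is a chair to your left and a door ahead."


def known_people(image: bytes) -> str:
    return "The person in front of you is a friend."


def describe_progressively(image: bytes, stages: List[Stage]) -> Iterator[str]:
    """Yield each stage's description as soon as that stage finishes.

    Because this is a generator, a later stage does not run until the
    caller has consumed the result of the stage before it.
    """
    for stage in stages:
        yield stage.analyse(image)


# Stages are ordered from fastest and coarsest to slowest and most specific.
PIPELINE = [
    Stage("caption", quick_caption),
    Stage("objects", objects_with_direction),
    Stage("people", known_people),
]

if __name__ == "__main__":
    # print() stands in for the phone's text-to-speech engine.
    for sentence in describe_progressively(b"fake-image-bytes", PIPELINE):
        print(sentence)
```

With this shape, the listener hears the coarse description first, and the slower, more specific stages add detail afterwards.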
Do you require a specific kind of hardware?
No, all the processing and heavy lifting happens in the cloud, so there is no stress on the hardware. Our architecture progressively pushed information to the visually impaired person as the scene unfolded. Of course, the learning stage of deep learning algorithms requires massively parallel computing, so you can accelerate it by using specialised hardware such as a higher-end GPU, which does massively parallel processing. We also divided the screen into different cards so that the visually impaired person is better able to navigate the smartphone. As you touch a particular card, you get an audio response.
Increasingly, a lot of the smart virtual assistants are doing things locally, thanks to libraries that are written to use just a limited amount of the phone's processing power for some of the local tasks. So there is the global model, which does the heavy lifting in the cloud, and then there is the lighter version, which is present locally on the phone for easier tasks. TensorFlow from Google is something you can run on the phone today.
Marketing the Drishti app
We are still trying to figure out ways to move forward with this. Our intent is not to immediately monetise the product or bring it to market. This app is what we came up with in 24 hours. So what we will do is take the learning from this app and use it in broader applications. We don’t have an operation in the consumer space at the moment. We are not sure if we would like to license this technology to third parties. We will take a call over the next few months as to which of the 45 ideas that were presented at the hackathon have market viability. We would certainly love to carry forward the learnings from this 24-hour app development competition and build upon them.