How does Xerini build software?
The problem: Wide-scale data collection
If you find yourself in need, you rarely need just one thing. You might find your local food bank and maybe even shelter for a night, but finding a clear path to get your life back is very difficult. – Mohsin Ravjani, Co-founder & CEO of Change Ahead
In the UK, a staggering 8 million people are one pay cheque away from being homeless. If you find yourself in a situation like that, it becomes increasingly difficult to find all of the support structures that can help you. Change Ahead’s mission is to close the gap between people in need and the organisations and individuals seeking to provide this support. They are building an app that can display geo-located web resources (links) that connect a person to everything from basic necessities like food and shelter to organisations that give career advice and opportunities. It is a very noble cause underpinned by a complicated real-world data problem.
We at Xerini take this cause to heart, so we looked into how we could use our expertise with data solutions to help solve the problem of acquiring large amounts of useful data quickly. But where do we start with such a daunting task? Where is the best place to get useful data from? For that matter, what does “useful” even mean? It’s easy to get bogged down in the potential avenues this project could take. A complicated tool that mines the web and learns a sentiment system to distinguish links that can help from links that cannot sounds amazing. Unfortunately, building a system like that takes time and the problem is pressing. We need to home in on a solution that brings value to our client now rather than later, while ensuring that we can keep improving it and eventually reach the ideal of the complicated autonomous miner. So what are the steps to building such a solution?
Step 1: Isolate the main need - Finding useful web resources
Our client wants to be able to find web resources for a given location in the UK. The “best” place to find resources might not be immediately apparent. A simple example is the trade-off between mining updates from a local council website and looking at global results from a search engine. A council website can give us good-quality information, as councils also aggregate useful resources at a council-wide level. An example is Kent County Council’s social care and health page, which keeps an up-to-date index of both digital and real-world resources for physical and mental health. This is good for Kent, but what happens if your council hasn’t kept an index? That doesn’t mean that food banks or mental health centres don’t exist in your area; it just means they are more difficult to find.
Search engines don’t rely on someone having curated the resources beforehand, so they cover more ground, but the data comes with substantial noise. We don’t need links to the local newspaper telling us about the opening of a new food bank in Barnsley; we need the website of the food bank. We don’t need journal papers coming out of Oxford about state-of-the-art research on PTSD; we need mental health clinics in Oxford that can help people with this problem.
At a moment like this, we pause for an important discussion with the client. These discussions tend to be very productive: they encourage good, transparent communication and also help the client reflect on and understand their problem better. This discussion told us that the priority was not to miss any resources, so we based our search for web resources on the Google search API. To keep the benefits of the other approach within reach, we implemented a generic interface that makes the search functionality a slot-in module, which can easily be expanded in the future depending on our client’s interests. For example, if cost becomes an issue, we might switch from Google to Bing, as its services tend to be cheaper. Or we could compose several search modules together to enhance the results.
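To make the slot-in idea concrete, here is a minimal Python sketch of what such a generic search interface could look like. It is an illustration under our own assumptions rather than the production code: the names WebResource, SearchProvider, StaticProvider and CombinedProvider are hypothetical, and the stand-in provider returns canned results instead of calling a real search API.

```python
from dataclasses import dataclass
from typing import List, Protocol, Set


@dataclass(frozen=True)
class WebResource:
    """One search hit (hypothetical shape, for illustration only)."""
    title: str
    url: str
    description: str = ""


class SearchProvider(Protocol):
    """Anything that can turn a query string into web resources."""

    def search(self, query: str) -> List[WebResource]:
        ...


class StaticProvider:
    """Stand-in for a real backend such as a Google or Bing client."""

    def __init__(self, results: List[WebResource]) -> None:
        self._results = results

    def search(self, query: str) -> List[WebResource]:
        # A real provider would call its search API here.
        return list(self._results)


class CombinedProvider:
    """Composes several providers and de-duplicates their results by URL."""

    def __init__(self, providers: List[SearchProvider]) -> None:
        self._providers = providers

    def search(self, query: str) -> List[WebResource]:
        seen: Set[str] = set()
        merged: List[WebResource] = []
        for provider in self._providers:
            for resource in provider.search(query):
                if resource.url not in seen:
                    seen.add(resource.url)
                    merged.append(resource)
        return merged
```

Because CombinedProvider satisfies the same interface as the providers it wraps, swapping one backend for another, or stacking two of them, does not require changes to the code that calls search.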
After the first iteration, the search interface is in place.
The input is a search criterion composed of a UK-based town or city, a category and subcategory for the type of resource, and a free-text box for keywords that lets the user narrow down the search as they please. At this point, we not only demo the tool to our clients but also give them the ability to interact with it. Xerini is a firm believer that clients understand their own requirements better if they can get their hands on a tool as soon as possible. If they find issues, even better! We now know what to focus on from day one. A quick deployment on Azure and the tool is in the hands of our clients!
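As an illustration, the search criterion could be modelled along these lines. This is a sketch under our own assumptions: the SearchCriterion class and its to_query method are hypothetical names, and joining the fields into one query string is only one plausible way to feed a search engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCriterion:
    """The four inputs of the search form (hypothetical field names)."""
    town: str              # UK-based town or city
    category: str          # type of resource
    subcategory: str = ""  # optional refinement of the category
    keywords: str = ""     # free-text box

    def to_query(self) -> str:
        # Assumption: non-empty fields are simply joined into one query string.
        parts = [self.category, self.subcategory, self.keywords, self.town]
        return " ".join(p.strip() for p in parts if p.strip())
```

For example, a criterion for Oxford with category "Health", subcategory "Mental health" and keyword "PTSD" would yield the query "Health Mental health PTSD Oxford".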
Now that it is possible to search for resources, we come to the integral discussion of what makes a link useful. We have already hinted that news and academic articles might not be of interest. We could brainstorm with our client and expand that list of “useless” sources to include unrelated company websites (searching for “food” often just returns restaurant websites) and other such filters that restrict the results. We could then simply add a blacklist of common domains that fit those categories and call the job done. But following discussions with our client, we realised that these rules do not hold universally. For example, a company will sometimes run a limited-time campaign helping people in need, which should be listed as a useful result. This and other similar factors showed us that basic rules might help but will inevitably miss important resources, especially in the future, as this tool is required to keep an up-to-date index of these web resources. Because of that, we opted for a human-in-the-loop solution. Our client, who has the expertise to understand what is helpful to people in need, is now able to tag links as “accepted” or “rejected” depending on their usefulness. This manual process works for the first iteration of the product and is the stepping stone for a crucial future improvement, as discussed in a moment.
We complete “Step 1” by adding export functionality for the filtered data. At this stage, we have an effective tool that essentially replaces Google searches and an Excel spreadsheet. While this is not as useful as we intend the final product to be, it does add value by streamlining the process, making the job of manually collecting resources quicker and less error-prone. If this were all we had been able to deliver within the budget, it would still provide value for our client.
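The tagging workflow can be sketched in a few lines of Python. The names below (Status, ReviewedLink, USUALLY_NOISE) and the example domains are hypothetical. The point of the sketch is the design choice described above: a domain list only flags a link as probable noise, while the accept or reject decision stays with a person.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


# Hypothetical examples of domains that are usually, but not always, noise.
USUALLY_NOISE = {"news.example", "journal.example"}


@dataclass
class ReviewedLink:
    url: str
    title: str
    status: Status = Status.PENDING

    @property
    def flagged(self) -> bool:
        """True if the host is a listed domain or one of its subdomains."""
        host = urlparse(self.url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in USUALLY_NOISE)

    def tag(self, accepted: bool) -> None:
        """Record the reviewer's decision; a flag never decides on its own."""
        self.status = Status.ACCEPTED if accepted else Status.REJECTED
```

A flagged link, such as a company page announcing a limited-time campaign, can still be tagged as accepted by the reviewer.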
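An export of this kind needs very little code. The sketch below assumes that each link is held as a dictionary with a "status" key and that CSV is the target format; both are our assumptions for illustration, as are the column names.

```python
import csv
import io
from typing import Dict, Iterable

# Hypothetical column layout for the exported file.
EXPORT_COLUMNS = ["town", "category", "title", "url"]


def export_accepted(links: Iterable[Dict[str, str]]) -> str:
    """Return the accepted links as CSV text, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=EXPORT_COLUMNS,
        extrasaction="ignore",  # drop keys such as "status" that are not exported
        lineterminator="\n",
    )
    writer.writeheader()
    for link in links:
        if link.get("status") == "accepted":
            writer.writerow(link)
    return buffer.getvalue()
```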
Step 2: Understand key features - Augmenting resources
Now we can take a look at the key features that would substantially increase the value of this product. Currently, we have raw information from Google along with the label that the users have set, all of which corresponds to a city and category of help that the client has specified. What they also need to know is whether a resource is physical or not and, if it is, where it is located. As with the search functionality, we create a location finder module that queries the Google Location API with the name of a resource, searches for its locations and augments the resource with them. As before, because of the modular structure of the tool, we can easily create a different implementation of the location search that leads to a similar result in a different manner. For example, instead of using Google’s API, we can simply mine the web page itself and search for the address there. Even better, we can compose the searches as a fallback mechanism (use the Google location search and, if it’s not successful, try to mine the website), or even compare results from different searches to check that we have found the right address. A lot of possibilities open up when developing modular software, and our client has the freedom to “choose their own adventure” based on their requirements.
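The fallback composition mentioned above can be expressed generically. In this sketch a location finder is assumed to be any function that takes a resource name and URL and returns an address or None. The two finders shown are hypothetical stubs standing in for an API lookup and a page scraper; neither calls a real service.

```python
from typing import Callable, List, Optional

# Assumption: a finder maps (resource name, resource URL) to an address or None.
LocationFinder = Callable[[str, str], Optional[str]]


def with_fallback(finders: List[LocationFinder]) -> LocationFinder:
    """Build a finder that tries each finder in order until one succeeds."""

    def find(name: str, url: str) -> Optional[str]:
        for finder in finders:
            address = finder(name, url)
            if address:
                return address
        return None  # no finder found an address; treat as non-physical/unknown

    return find


def api_lookup_stub(name: str, url: str) -> Optional[str]:
    """Hypothetical stand-in for a location API call that finds nothing."""
    return None


def page_scrape_stub(name: str, url: str) -> Optional[str]:
    """Hypothetical stand-in for mining the web page for an address."""
    return "1 Example Street, Barnsley"


find_location = with_fallback([api_lookup_stub, page_scrape_stub])
```

Here find_location("Example Food Bank", "https://fb.example") falls through the first stub and returns the address from the second. Comparing results across finders, rather than stopping at the first hit, would be a small variation on the same loop.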
Step 3: Improve the solution by adding modules
At this point, the MVP and the main required functionality are in the hands of our clients. We have gathered some know-how over the past month and have tweaked both functionality and UX in response to the great constructive feedback we’re receiving. On top of that, over this month we have aggregated a fairly small but high-quality dataset of accepted and rejected entries associated with our enriched web resources. This is a supervised machine learning problem waiting to be solved! Using the contents of the title and description of the links, we can build a sentiment-analysis-style model that we train to distinguish between Accepted and Rejected results.
Using TensorFlow, we build a small two-layer recurrent neural network and train it on the data we have collected. We use GRUs, which are in the same family as the LSTM layers we have already covered, if you’re interested in a more technical read-through of our ML modelling practices. We then package our model in a Docker container and serve it using TensorFlow Serving, which allows us to call the model over REST (or gRPC) and get a percentage score for how confident the model is that a new link is going to be accepted. Now, when users search for resources, they can choose to have the results automatically sorted into auto-accepted and auto-rejected buckets. At this early stage we favour manual control over full automation, which is why we keep the automatically sorted results separate. We also apply very conservative thresholds: we auto-accept only links that the model is more than 90% sure are useful and auto-reject only those with a score below 10%. Even though this doesn’t do much to remove the manual work of accepting and rejecting, it does speed up sorting by providing another indicator in the form of the percentages.
You can imagine that the first iteration of the model doesn’t have incredibly good predictive capabilities, because the data is fairly scarce. We have about 300 classified links, which is nothing compared to, say, the 50K IMDb reviews dataset that is often used in sentiment analysis examples. But that is not the crucial part here. The most important thing is that we have built a system that is future-proof, so changes will be seamless for our client. After one month we can re-train the model on newly aggregated data, tweak it, and silently roll out v2 of the predictive component; the users will see nothing but improved performance as they continue to use the tool. We have not achieved the perfect “autonomous miner” yet, but we have clearly laid the road towards it, with entirely actionable items along the way.
It has been a couple of months of back-and-forth development, and our clients have had the Step 1 application from as early as the second week. This maximised the efficacy of our communication. They could tell us immediately what they liked, didn’t like and needed. It was also great for promoting client reflection. We do not expect our clients to know exactly what they need down to the tiniest bolt from day one. Xerini is also here to provide consultation and to help clients understand their problem better. We are firm believers that just listening well isn’t enough: you need to understand. We also know that software can be a dynamic and cruel beast, and sometimes a feature that was initially wanted has to be modified or even removed. Avoiding over-engineering in favour of modular software helps with that and gives our clients the ability to get a tool that fits their requirements and restrictions.
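Before any model is trained, the tagged links have to be turned into text and label pairs. The helper below is a minimal sketch of that step, assuming each link is a dictionary with "title", "description" and "status" keys; the function name and the lower-casing are our own illustrative choices.

```python
from typing import Dict, List, Tuple


def to_training_pairs(links: List[Dict[str, str]]) -> Tuple[List[str], List[int]]:
    """Turn tagged links into (texts, labels); 1 = accepted, 0 = rejected."""
    texts: List[str] = []
    labels: List[int] = []
    for link in links:
        status = link.get("status")
        if status not in ("accepted", "rejected"):
            continue  # untagged links carry no label and are skipped
        text = f"{link.get('title', '')} {link.get('description', '')}"
        texts.append(text.strip().lower())
        labels.append(1 if status == "accepted" else 0)
    return texts, labels
```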
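The serving call and the thresholds can be sketched as follows. The URL pattern and the "instances"/"predictions" JSON keys follow TensorFlow Serving's REST predict endpoint. Everything else is an assumption for illustration: the model name link_classifier, the host, and the idea that the served model accepts raw strings and returns one score between 0 and 1 per link. The sketch only builds the request and interprets a response; it does not send anything over the network.

```python
import json
from typing import List, Tuple

ACCEPT_ABOVE = 0.90  # auto-accept only when the model is more than 90% sure
REJECT_BELOW = 0.10  # auto-reject only when the score is below 10%


def build_predict_request(
    texts: List[str],
    model: str = "link_classifier",        # hypothetical model name
    host: str = "http://localhost:8501",   # assumed host and REST port
) -> Tuple[str, str]:
    """Return the URL and JSON body for a TensorFlow Serving predict call."""
    url = f"{host}/v1/models/{model}:predict"
    body = json.dumps({"instances": texts})
    return url, body


def bucket(score: float) -> str:
    """Sort one score into a bucket; anything in between stays manual."""
    if score > ACCEPT_ABOVE:
        return "auto-accepted"
    if score < REJECT_BELOW:
        return "auto-rejected"
    return "manual"


def bucket_response(response_body: str) -> List[str]:
    """Bucket every prediction, assuming one single-score row per link."""
    predictions = json.loads(response_body)["predictions"]
    return [bucket(float(row[0])) for row in predictions]
```

With a response body of {"predictions": [[0.97], [0.5], [0.03]]}, the three links land in the auto-accepted, manual and auto-rejected buckets respectively. Keeping the thresholds as named constants makes it easy to loosen them once the model has earned more trust.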
We at Xerini are incredibly happy to be a part of this very noble cause and we will continue to support it. Make sure to stay tuned and watch how together with Change Ahead we can make people’s lives better, one link at a time!