Building the Discovery Engine

How can we identify prospects for the Neuroscience Institute? This was the question that research analyst Mallory Lass posed to Prospect Development at a staff meeting. Because this was a new institute, we couldn't start with donors, graduates or ticketholders as we might have for an established program. Instead, Lass described her approach as mining the text of old contact reports and fund biographies, researching academic majors and student organizations that had some overlap with brain science, identifying events that had featured some neuroscience-related content and pulling their attendees, and so on. At each step she was building not only a list of prospects, but also a deep understanding of the fundraising landscape for the new institute.

The task involved both traditional prospect research and data analysis, but what struck me during Lass’s presentation was how much additional energy went into overcoming the limitations of tools that did not support the sophisticated work she was doing. She had to document each step thoroughly, in notes she kept as she worked, so that she could, among other things, go back and adjust earlier steps after reviewing the results. Because the process was iterative, she was often working with several spreadsheets simultaneously. These frequently needed to be merged and reconciled, and then fed back into the database to pull more reports.

The Discovery Engine (referred to around the office as the "Disco Engine") is a prospecting tool built in-house by members of UC Berkeley's Prospect Development team using R, a freely available and open-source programming language and environment. The goal of the Disco Engine is to allow users to precisely define constituencies for their fundraising clients. To accomplish this goal, it provides two complementary resources: (1) a vocabulary and grammar for unambiguously describing constituencies of interest (that "unambiguous" bit is important, because our constituency definitions are automatically translated into SQL queries whose results come back as IDs), and (2) hooks of discoverability for analysts to grab onto, enabling them to define the novel constituencies that are expected to fund new cross-disciplinary programs.

Vocabulary and grammar

The basic building block for any Disco Engine constituency definition is the widget — a fill-in-the-blanks predicate. For example, "majored in X" is a widget, where "X" can be filled in with "mathematics" or "biology" or any other major. The important part is that each widget is focused on an isolated characteristic. In the language of the Disco Engine, you specify anyone who majored in mathematics with: majored_in(mathematics). These definitions can then be whisked off to our Data Warehouse and returned as IDs of individuals in our database who fit the definition, by using the display function:

[Image: Display math 2.PNG]
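
As a rough illustration of the idea (not the Disco Engine's actual code), a widget can be thought of as a function that fills in a blank and returns a predicate object, which a separate function translates into SQL. Everything in this sketch is an assumption made for the example: the `widget` and `to_sql` helpers, the `degrees` table, and the `entity_id` and `major` columns. It also takes the major as a quoted string, whereas the real tool accepts `majored_in(mathematics)`.

```r
# Hypothetical sketch only: helper names, table and columns are invented.

# A widget is a fill-in-the-blanks predicate: it remembers which table and
# column it describes, and waits for the value that fills the blank.
widget <- function(table, column) {
  function(value) {
    structure(
      list(table = table, column = column, value = as.character(value)),
      class = "predicate"
    )
  }
}

# "majored in X" as a widget over an assumed degrees table
majored_in <- widget(table = "degrees", column = "major")

# Translate a predicate into a SQL query that returns matching IDs
to_sql <- function(pred) {
  # Double any single quotes so the value is a valid SQL string literal
  value <- gsub("'", "''", pred$value, fixed = TRUE)
  sprintf(
    "SELECT DISTINCT entity_id FROM %s WHERE %s = '%s'",
    pred$table, pred$column, value
  )
}

math_majors <- majored_in("mathematics")
cat(to_sql(math_majors), "\n")
```

Keeping the predicate separate from its SQL translation is one way to let a single definition be displayed, exported or combined with others before any query runs.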

In addition to displaying the IDs on the screen, display also gives you the option to export the IDs to a savedlist. We already use savedlists regularly in our reporting environment, where a number of different reports take a savedlist and return details about the individuals with those IDs. We're also getting ready to roll out a new feature that allows users to output full reports instead of just savedlists.

Of course, finding math majors was never that difficult to begin with. What makes the Disco Engine valuable is its ability to combine predicates into more complex definitions, and the ability to manage that complexity by assigning names to definitions. Imagine that a client is holding an event near campus for the parents of math majors. To identify potential invitees:

[Image: math major 2.PNG]

In addition to %and%, individual predicates can also be combined with %or% and %but_not%.
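
R lets users define their own infix operators of the form `%name%`, which is what makes this grammar possible. The sketch below shows one way such operators could behave, treating each definition as a set of IDs. It is an assumption-laden toy rather than the Disco Engine's implementation: the `people` data frame, its IDs and the `lives_in` widget are invented, and the real tool builds SQL instead of filtering an in-memory table.

```r
# Toy stand-in for the data warehouse; every ID and value here is made up.
people <- data.frame(
  id    = 1:5,
  major = c("mathematics", "biology", "mathematics", "physics", "mathematics"),
  city  = c("Berkeley", "Berkeley", "Oakland", "Berkeley", "Chicago"),
  stringsAsFactors = FALSE
)

# Hypothetical widgets: each returns the IDs matching one isolated characteristic
majored_in <- function(x) people$id[people$major == x]
lives_in   <- function(x) people$id[people$city == x]

# Combining definitions is set arithmetic on IDs
`%and%`     <- function(a, b) intersect(a, b)
`%or%`      <- function(a, b) union(a, b)
`%but_not%` <- function(a, b) setdiff(a, b)

# Naming intermediate definitions keeps larger ones readable
math_majors     <- majored_in("mathematics")
local_math      <- math_majors %and% lives_in("Berkeley")
math_or_physics <- math_majors %or% majored_in("physics")
distant_math    <- math_majors %but_not% lives_in("Berkeley")

print(local_math)
print(math_or_physics)
print(distant_math)
```

All `%name%` operators in R share the same precedence and are evaluated left to right, so a definition that mixes them needs parentheses (or named intermediate definitions, as above) to make the intended grouping explicit.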


Predicates act as basic building blocks that can be combined with one another into more complex predicates, or definitions. The examples above are easy to understand even if you've never had any training with our tools. But to create these definitions, a user would need to be able to discover the correct widget for their purposes. To that end, users of the Disco Engine can, whenever necessary, bring up a searchable, sortable list of widgets:

[Image: job title.png. Caption: Searching for widgets related to employment data]
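
A searchable, sortable widget list can be approximated with a small catalog and a case-insensitive text match. The following is a minimal sketch under stated assumptions: the catalog contents, the `find_widgets` function and every widget name other than `majored_in` are hypothetical and chosen only to illustrate the idea.

```r
# Hypothetical widget catalog; names and descriptions are invented for
# illustration (only majored_in appears in the article).
widget_catalog <- data.frame(
  widget = c("majored_in", "has_job_title", "works_at", "attended_event"),
  description = c(
    "Earned a degree with the given major",
    "Employment record lists the given job title",
    "Employment record lists the given employer",
    "Registered for or attended the given event"
  ),
  stringsAsFactors = FALSE
)

# Return catalog rows whose name or description matches the search term,
# ignoring case, sorted alphabetically by widget name.
# Note: the term is treated as a regular expression by grepl().
find_widgets <- function(term, catalog = widget_catalog) {
  hit <- grepl(term, catalog$widget, ignore.case = TRUE) |
    grepl(term, catalog$description, ignore.case = TRUE)
  result <- catalog[hit, , drop = FALSE]
  result[order(result$widget), , drop = FALSE]
}

print(find_widgets("employment"))
```

Searching descriptions as well as names matters for discoverability: an analyst looking for "employment" finds the job-title and employer widgets even though neither name contains that word.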

Since we all use RStudio (also free and open-source!), we also benefit from its user-friendly auto-completion, a type of interaction that anyone who has used a search engine will find familiar.

[Image caption: Auto-complete is a key part of the user interface]

Furthermore, a growing suite of additional services, which we call "bots," plugs into the Disco ecosystem and makes it even easier to start building the perfect definition. Returning to the Neuroscience example, we might try to get started by calling on the brainstorm_bot:

[Image: brainstorm bot.PNG]

Training and ongoing development

Building the Disco Engine required thoughtful design and some programming skills, but that by itself was not sufficient for a successful rollout. After a successful demonstration of the tool, management agreed to extend the required database access to our team. We then held multiple trainings that the entire team attended. A documentation site also helps analysts navigate the tool.

The Disco Engine itself has grown by leaps and bounds since its initial rollout. This growth has been driven entirely by the needs of analysts on the Prospect Development team. In addition to the cost savings, building a tool in-house using open-source technology has meant that enhancing the tool is as easy as dropping by my desk and asking, "Can we make the Disco Engine do this?" We now have widgets that utilize text-mining and other advanced techniques, along with additional features to improve usability, and even more ambitious extensions are on the way. Analytics team members have developed new coding skills, and research analysts have deepened their domain expertise, which benefits their clients.

Putting it in context

The Disco Engine is not the only response to the increase in multi-disciplinary projects we've seen at Berkeley. Prospect Development, behind the efforts of Mallory Lass and Abbey Myszka, has been revamping our interest coding in order to more precisely identify and track our prospects' philanthropic interests and align them with known fundraising priorities. Our Prospect Analysis team has a monthly process of importing data from various public sources. The data are used not only as wealth indicators, but also as indicators of philanthropic interests (e.g., an alum's contribution to an environmental PAC reveals potential interest in some of the environmental research that happens here). At a campus level, we've implemented Prospect Management policies that encourage collaboration among fundraising units. And it is not only the newer initiatives that benefit from this work. Even degree-granting units with traditional constituencies raise money from outside of their narrowly defined constituencies: the MBA with an interest in social welfare, or the engineer with a passion for classics. Our efforts to improve the precision of constituency definitions benefit everyone.

Tarak Shah is the assistant director for prospect analytics in UC Berkeley’s Prospect Development, where he has worked since 2006 building analyses, models, tools and other data-related products. He has previously presented at CARA, CASE and APRA conferences, and participated on the team that built Berkeley’s entity match web service, which received a 2016 CASE Grand Gold Award. He is online at


