Your data model is your product: building citation discovery on Neo4j

When I started Cite Smart AI, the obvious instinct was to reach for a relational database. Papers in one table, citations in a join table, done. I’m glad I didn’t, because what made the product interesting was treating citations as what they actually are: a graph.

The shape of the problem

Academic citations form a network. Paper A cites B and C; B cites D; D cites back into the same cluster A came from. The questions researchers actually ask are traversal questions:

“What does this paper depend on, two hops out?”
“Which papers sit between these two ideas?”
“What’s the tightly-connected cluster around this topic?”

In a relational schema, every one of those is a recursive, multi-join query that gets uglier the deeper you go. In a graph database, they’re the native operation.
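To make "traversal question" concrete, here is a minimal Python sketch of the first question (what a paper depends on, N hops out) as a breadth-first walk over an in-memory adjacency map. The paper ids, the edges, and the `within_hops` helper are all invented for illustration; this is not how Cite Smart stores or queries its data.

```python
from collections import deque

# Toy citation graph: paper id -> ids of the papers it cites.
# Ids and edges are made up for this example.
CITES = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": ["C"],
}

def within_hops(graph, start, max_hops):
    """Return every paper reachable from `start` in 1..max_hops citation steps.

    The start paper itself is never included in the result.
    """
    seen = {start}
    found = set()
    frontier = deque([(start, 0)])
    while frontier:
        paper, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for cited in graph.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                found.add(cited)
                frontier.append((cited, depth + 1))
    return found

print(sorted(within_hops(CITES, "A", 1)))  # ['B', 'C']
print(sorted(within_hops(CITES, "A", 2)))  # ['B', 'C', 'D']
```

The hop limit is just a parameter of the walk. That is the property a graph database gives you at the query level.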

The same question, two ways

Here’s “everything within two hops of a paper” in SQL - and it’s already straining:

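-- Hop 1 is c1, hop 2 is c2. The second hop must be a LEFT JOIN:
-- an inner join would drop one-hop papers that cite nothing themselves.
SELECT DISTINCT p2.*
FROM citations c1
LEFT JOIN citations c2 ON c1.cited_id = c2.citing_id
JOIN papers p2 ON p2.id IN (c1.cited_id, c2.cited_id)
WHERE c1.citing_id = $1;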

Now the same thing in Cypher, on Neo4j:

MATCH (p:Paper {id: $id})-[:CITES*1..2]->(related:Paper)
RETURN DISTINCT related;

The Cypher version isn’t just shorter - it extends to “1..5 hops” by changing one number, where the SQL grows another join per level or has to be rewritten as a recursive CTE. The query reads like the question. That’s the tell that you’ve picked the right model.
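For comparison, the usual relational way to avoid one join per level is a recursive common table expression. Here is a hedged sketch using Python's built-in sqlite3 module, so it runs without a server. The `papers` and `citations` tables mirror the SQL above, but the sample rows and the `papers_within` helper are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers (id TEXT PRIMARY KEY);
    CREATE TABLE citations (citing_id TEXT, cited_id TEXT);
    -- Sample data, invented for illustration.
    INSERT INTO papers VALUES ('A'), ('B'), ('C'), ('D'), ('E');
    INSERT INTO citations VALUES ('A','B'), ('A','C'), ('B','D'), ('D','E');
""")

# Walk the citations table one hop at a time, up to :max_hops.
# UNION (not UNION ALL) drops duplicate (id, depth) rows, and the depth
# bound guarantees termination even if the data contains cycles.
N_HOP_SQL = """
WITH RECURSIVE reachable(id, depth) AS (
    SELECT cited_id, 1 FROM citations WHERE citing_id = :start
    UNION
    SELECT c.cited_id, r.depth + 1
    FROM reachable r
    JOIN citations c ON c.citing_id = r.id
    WHERE r.depth < :max_hops
)
SELECT DISTINCT p.id
FROM papers p
JOIN reachable r ON r.id = p.id
ORDER BY p.id;
"""

def papers_within(conn, start, max_hops):
    """Return ids of papers reachable from `start` in 1..max_hops hops."""
    rows = conn.execute(N_HOP_SQL, {"start": start, "max_hops": max_hops})
    return [row[0] for row in rows]

print(papers_within(conn, "A", 2))  # ['B', 'C', 'D']
print(papers_within(conn, "A", 3))  # ['B', 'C', 'D', 'E']
```

It works, and the depth is now a parameter, but the query is a dozen lines of bookkeeping for what Cypher writes as `*1..2`.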

Why this mattered for the product

Once citations lived in a graph, a whole category of features stopped being “features” and became one-line queries:

Shortest path between two papers → shortestPath().
Clusters of related work → community detection over the graph (in Neo4j, an algorithm call from the Graph Data Science library rather than core Cypher).
“Papers that cite both A and B” → a two-pattern match.
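To show what two of these queries compute, here is a plain Python sketch over a toy adjacency map. It is only an illustration: in the product these were Cypher queries run by the database, and the graph, `shortest_citation_path`, and `citing_both` below are hypothetical names and data invented for this example.

```python
from collections import deque

# Toy citation graph: paper id -> ids of the papers it cites (made-up data).
CITES = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def shortest_citation_path(graph, source, target):
    """Breadth-first search along citation direction.

    Returns one path with the fewest hops from source to target,
    or None if target is not reachable.
    """
    previous = {source: None}  # paper -> the paper we reached it from
    queue = deque([source])
    while queue:
        paper = queue.popleft()
        if paper == target:
            path = []
            while paper is not None:
                path.append(paper)
                paper = previous[paper]
            return path[::-1]
        for cited in graph.get(paper, []):
            if cited not in previous:
                previous[cited] = paper
                queue.append(cited)
    return None

def citing_both(graph, first, second):
    """Return the papers whose citation list contains both given papers."""
    return {p for p, cited in graph.items() if first in cited and second in cited}

print(shortest_citation_path(CITES, "A", "E"))  # ['A', 'B', 'D', 'E']
print(citing_both(CITES, "B", "C"))             # {'A'}
```

Each function here is the kind of bespoke machinery that the graph database made unnecessary.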

I didn’t have to build bespoke machinery for each. The database already understood the relationships, so the work shifted from plumbing to presentation - which is where the value actually was.

The trade-offs (it’s not free)

Graph databases aren’t a default. A few things to weigh:

Your team has to learn Cypher. It’s pleasant, but it’s another language.
Aggregations and tabular reporting are often easier in SQL. If your core access pattern is “rows and sums,” a graph is the wrong tool.
Operational maturity. The Postgres ecosystem is enormous; with a graph database, plan for fewer off-the-shelf answers.

The rule I now use: if your most important questions are about relationships between entities, model the relationships first. If they’re about attributes of entities, stay relational.

Takeaway

The biggest architectural lever in Cite Smart wasn’t the AI layer - it was choosing a data model that matched the shape of the questions. Pick that early and the rest of the system gets simpler. Pick it wrong and you spend the project fighting your own schema.