Select Star from Database
It’s come to my attention that a lot of people have a limited understanding of the variety and diversity of database solutions that exist. Let’s change that now.
It’s important to understand that this isn’t a comparison. These are all powerful tools that solve specific problems. The fact is – you don’t have these problems and you don’t need these tools. But it’s useful to know about them, to know what technical problems they solve, and what they’re really for.
In addition to covering those questions (or at least as part of answering them), I'll show what queries and code for these systems actually look like, with some examples. I'm not going to show the whole setup or insertion process, just some sample queries that pull data from an existing, populated database.
First of all, we're going to start with the more general purpose databases, which sit along a continuum. That continuum is based on what we store in a database. The answer seems obvious: data. But what we actually store is data and the relationships between data. It's very important to understand that latter part.
In-Memory Cache
The classic examples here are Redis and Memcached. I’ll only be talking about Redis here because it seems to have eaten Memcached’s lunch. Redis isn’t something a lot of people think of when the word “database” comes up. Redis isn’t fancy. It isn’t doing a lot. It’s fantastic at what it’s actually doing though.
In the continuum established before, Redis is pure data, and simple at that. It doesn’t know anything about relationships, and solely stores discrete chunks of data.
Redis exists for a very specific job – typically, it is a cache. You store stuff in Redis that you don't want to have to keep getting out of another database. Redis doesn't live on disk, it's entirely in memory. That gives it sweet FA for latency, and crazy high performance.
Why would you use it?
Caching, primarily. Redis is a great way of putting frequently used and/or infrequently changed data somewhere that is ultra-fast to access, taking load off your database. In web terms this might mean your current specials, your leaderboard, current exchange rates, or user session data. Really, anything where you want to take load off the primary database.
How does it work?
Internally, everything in Redis is a string: stored as a string, indexed by a string. More specifically, Redis is often referred to as a "key-value store", and that's basically what it's for. Have a key, get a value. People will tell you this is false, that you can store other things, but no. Redis just provides some helpful methods for working with the data; it's still stored as strings. The other thing critical to understanding why Redis is so fast is that everything is stored in memory, not on disk.
How do you use it?
The Redis client has a few basic methods for things like Get and Delete. The real key is… well… the key. You don't want to use a key like "1". You're more likely to want something like current-rates:eur. The colon separates the key into a namespace and a detail, and you can add any further detail you like: current-rates:eur:1m-average.
const euroRate = await redisClient.get("current-rates:eur")
There's nothing special about the colon, it's just a convention, and Redis has good support for dealing with keys via wildcards, such as selecting or deleting everything matching current-rates:*.
const keys = await redisClient.keys("current-rates:*")
const rates = await redisClient.mget(keys)
Something important to note is that you don't have to store only basic strings. It's common and routine to store entire JSON objects; you just have to serialize them on the way in and deserialize them on the way out.
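As a rough sketch, reusing the redisClient from the examples above with an invented current-rates key, that's just JSON.stringify on the way in and JSON.parse on the way out:

const rates = { eur: 0.92, gbp: 0.79 } // any object you want to cache

// store it as a string...
await redisClient.set("current-rates", JSON.stringify(rates))

// ...and rebuild the object when you read it back
const cached = JSON.parse(await redisClient.get("current-rates"))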
What are the benefits?
Redis is extremely fast, so regularly accessed data can be served with minimal latency, improving the responsiveness of an application and reducing load on the primary database.
What are the downsides?
There's always a cost, and in this case it's financial. Redis caches can be made as big as you like, but they require a lot of RAM, and that comes at a solid price. This also necessitates another server, more infrastructure, complexities in deployment, etc. There are also some development requirements to factor in. Anywhere that updates the cached data now also has to update Redis, and the lookups for that data need to change from "query the database" to "check Redis; if it's not in Redis, query the database, return the value to the user, and update Redis".
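That lookup flow is often called cache-aside. A minimal sketch, reusing the redisClient from above, with getRatesFromDatabase standing in for your real database query:

async function getCurrentRates() {
  // 1. Check Redis first
  const cached = await redisClient.get("current-rates")
  if (cached) {
    return JSON.parse(cached) // cache hit: no database query needed
  }

  // 2. Cache miss: fall back to the primary database
  const rates = await getRatesFromDatabase() // hypothetical query function

  // 3. Store the result so the next lookup is a hit
  await redisClient.set("current-rates", JSON.stringify(rates))
  return rates
}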
Summary
Redis is a high performance optimization for applications with heavy database load. Like any optimization it should not be undertaken prematurely. Wait until you have a need. But the vast majority of production applications use Redis or something like it. It’s probably the first optimization you’d reach for when fine-tuning an application.
Document Databases
These are often referred to as NoSQL databases, and MongoDB is probably the best known. I’m not a big fan of calling something by what it isn’t, so I’ll stick with Document Database.
The term is potentially misleading. It's not for storing documents like Word docs or Excel spreadsheets. Rather, each record in a collection is treated as a separate document, much like an HTML page is a document.
Each entity in a collection is very much self-contained. The example I usually use is something like a real estate listing, which is a pretty good candidate. The listing isn’t extensively related to other things. Sure you might have an agent id or phone number or something, but that’s fine.
Document databases fit in the middle of our data-versus-relationships continuum. They're almost entirely based around storing data. They can do some limited cross-collection connections, and they can store "subdocuments", so if you wanted Posts as the collection with Comments as subdocuments, you can totally do that. But beyond that they're very much intended to store data and then read it back in much the same shape.
Why would you use it?
A key advantage of document databases is that they're schemaless. That means they have no specific formal structure. To go back to our real estate example, listings can have wildly different requirements. How much land is this luxury apartment on? How many bedrooms is this shopping center? If you tried to imagine making a form, you'd need a lot of logic to say that you don't need to enter the council zoning details on a condo rental, or the number of bathrooms in a hospital.
Document databases let you simply store whatever data you want to store, rather than constraining you with a specific, rigid structure. This has benefits for flexible data like above, but also for rapid prototyping, where you might not know what your fields even are.
How does it work?
The MongoDB engine stores each record in a collection, which is essentially the type of record it is. The record itself is stored as JSON. (It's actually BSON, binary JSON, a format that's faster to read and traverse, but from a user's point of view it's close enough.)
A user is then able to query the collection however they like and get the data out.
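As a rough illustration, a single document in a listings collection might look something like this (the fields are invented):

const listing = {
  _id: "665f1c2ab7e4d21a9c3e8f01",
  title: "3 bedroom house with garden",
  price: 450000,
  bedrooms: 3,
  bathrooms: 2,
  agent: { name: "Jane Smith", phone: "555-0100" },
  features: ["garage", "solar panels"],
  landSize: 600, // fields like this simply don't exist on listings where they make no sense
}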
How do you use it?
The real-world syntax and approach are one of the biggest benefits MongoDB has over other styles of database. It's super easy and intuitive to use, and especially to query. This is particularly the case in a JavaScript environment, which is where MongoDB got its biggest push, largely as part of the MEAN stack – MongoDB, Express, Angular and Node.js.
Mongo is often queried using a tool called Mongoose. Mongoose allows you to define a schema – if you think there’s something funny about taking a schemaless database and then forcing a schema on it you’re not alone – and then query using those objects.
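As a minimal sketch, the Product model used in the queries below might be defined like this (the schema is illustrative; its fields simply mirror what the queries need):

import mongoose from "mongoose";

const productSchema = new mongoose.Schema({
  name: String,
  price: Number,
  category: String,
  stock: Number,
});

const Product = mongoose.model("Product", productSchema);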
const productsInRange = await Product.find({
price: { $gte: 50, $lte: 100 },
category: "Electronics",
});
This finds all the products priced between $50 and $100 in the Electronics category, and the following will find products in either of two categories with stock levels above 10.
const audioOrVideo = await Product.find({
category: { $in: ["Audio", "Video"] },
stock: { $gt: 10 },
});
This is a very intuitive system, and it makes searching and querying very easy. Much the same applies to inserting new data, updating, etc.
What are the downsides?
You may well find requirements get more and more complex, and MongoDB or its ilk no longer provides a particularly good match. You'll encounter the cost after that, as you have to rebuild everything from the ground up on what you arguably should have used in the first place.
Another hidden cost is that of being schemaless. If you have a MySQL table with a middle_name field, it's going to come back as, at worst, an empty string. If you use MongoDB, maybe that field doesn't exist on the object at all. The need to double check any field that might potentially be missing is a common fact of life in a system written with no schema.
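In practice that means small defensive checks anywhere you read a field that might not be there, something like:

const middleName = user.middle_name ?? "" // the field may simply not exist on this document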
Summary
I'm not going to lie here, I actually don't like document databases. I've never written one that I didn't have to un-write. In my experience a huge amount of data is actually relational, or as it changes it quickly becomes so, and I find MongoDB to be a Fisher-Price database for babies. I know a lot of smart people use it, but I'm not one of them.
Relational Databases
As alluded to before, a huge amount of data is relational. But what does that mean? We're now heading to a point where the relationships between data are as important, or nearly as important, as the data itself.
Relational data is data that has a relationship. If I have an order, that order has products. Product and order have a relationship. Product and category have a relationship. Order and user have a relationship. And so on.
These relationships are defined in the very structure of the database. Unlike the schemaless document databases above, relational databases are very very schemaful. Everything is rigidly defined, ruthlessly established, and strictly enforced.
The best known examples are things like MySQL and PostgreSQL, but there are plenty of others – SQL Server from Microsoft, Oracle, SQLite, and many more.
Relational databases are at the core of basically every single web application ever made, ever, in the history of the world. You ignore this technology at your peril; it is beyond foundational and well into necessary.
How do they work?
How long have you got? This is a massive topic. Essentially relational databases store everything in tables. Each table has a given structure, a list of columns. Think of each table like an Excel spreadsheet, with named columns, and indeed, many editors make it look quite a lot like that.
The difference comes in relationships. A key part of relational databases is that you don’t store duplicate details. The order table doesn’t store the user. The order table stores the user’s id. Then there is a “foreign key constraint” to say that this user_id field represents an id field on the users table.
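As a rough sketch of what that looks like in practice, here's a Postgres-flavored pair of table definitions run through the node-postgres client (table and column names are illustrative):

import pg from "pg";

const client = new pg.Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

await client.query(`
  CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    name    TEXT NOT NULL
  );

  CREATE TABLE orders (
    order_id   SERIAL PRIMARY KEY,
    -- the foreign key constraint: user_id must point at a real row in users
    user_id    INTEGER NOT NULL REFERENCES users (user_id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
  );
`);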
These constraints establish the relationships. One thing that’s important to understand is that unlike a document database there is no correct way of looking at the data. Sure, you can get all the products from an order. But it’s just as valid to get all the orders for a user. Or all the products a user has purchased in the last year. Or the total number of categories the user has purchased products from. Or the total amount of money paid by the user. Or the average per order. Or all the products in a category. Or all the subcategories in a category.
Why would you use it?
Why would you not? It's the most common way to store any form of persistent data. Ultimately the question is to determine whether your data is inherently relational. I've struggled myself to figure out whether a lot of real-world data really is relational or whether my brain has just been warped to fit this paradigm. The likely answer is a bit of both. Nevertheless, most cases you could come up with as random examples, such as products or orders or posts, end up being relational. Orders have products, products have categories, posts have comments, and all of them relate to users. If your mental model consists of terms like "belongs to" and "has a" it's probably relational.
How do you use it?
This is where things start to get harder, and this is where you need to know SQL. There's a reason document databases call themselves NoSQL: SQL is hard.
It’s incredibly difficult to learn to a high level, but it’s enormously powerful. SQL is used for updating and inserting the data, and for querying, but also for creating and defining the tables themselves.
SELECT
p.*, oi.quantity, o.*, u.*
FROM
orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN users u ON o.user_id = u.user_id
WHERE
o.order_id = 12345;
Things don’t always have to be hard. It’s possible to learn enough SQL to get by with basic usage. Also, there are abstractions over SQL. It’s quite common to use an ORM, where the relationships are established in code as part of models, and then queried accordingly.
This can greatly simplify the code, making the app logic more maintainable and readable.
$order = Order::with(['orderItems.product', 'user'])->find(12345);
With that established, I can’t stress this enough: do not use an ORM because you don’t know SQL. If you do not know SQL you must learn SQL.
What are the downsides?
Databases can be expensive. They take up a lot of space and use a lot of resources. This is really the same regardless of implementation, whether it's managed in the cloud or installed on your own server.
Databases like Postgres can be time-consuming and expensive to maintain, to optimize, and to tune. Queries can be poorly written, get stuck, or run slowly. Quality SQL is not at all easy to write, and database design and tuning is an entire field of its own.
Ultimately questions about pros and cons come with an important question — compared to what? There’s a reason relational databases have dominated the industry for decades.
Summary
Just use a relational database. Probably 90%+ of applications in the world use something like Postgres or MySQL as their core technology. These tools will be perfect for at least 90% of requirements, and good enough for another 5%.
If you don’t know what you need, it’s almost certainly a relational database. If you’ve got a specific need that wouldn’t work in a relational database, think twice. Maybe three times. I bet it would. If you come from a bootcamp background and have been using MongoDB for all your projects, I strongly recommend giving something like MySQL a go.
Graph Databases
This is further along our spectrum, and now we hit a point where the relationships matter more than the data itself. This might not seem particularly intuitive. Why would the relationship matter more? The key is to think about networks. It doesn't really matter what the network is, but social networks are quite a common example.
It's easy to store a user's details, and even to store a user's friends, in something like MySQL. But how would you query friends of a friend? How would you query how many degrees of separation there are between any two people (think LinkedIn's 1st, 2nd, etc)? The fact is it's extremely difficult, if not impossible, to build queries that do this sort of thing. Even if you could, they would become exponentially slower as you increased the number of "hops" between the nodes.
This is where tools like Neo4j become dominant. They let you query directly on the relationships as well as the data.
Why would you use it?
If your core business is network based. This isn’t as obvious as it might seem. Examples might include social media as shown above, but also potentially computer networking or logistics. We’ll go more into the details of that now.
How does it work?
The name "graph database" doesn't seem to mean much, not a pie chart in sight. The graph in this case comes from Graph Theory, the branch of mathematics concerned with the shapes and patterns of interconnected points. In Graph Theory the points are nodes and the connections between them are edges. The whole thing is the graph.
Graph theory is about doing things like finding the fewest edges from one node to another, or finding the shortest route that visits every node exactly once (the famous Traveling Salesman Problem).
Edges can also be weighted, and this is where things become important for logistics. If you think of the US highway system as your new graph, your nodes are towns, and the edges are highways. These highways are not all equal, and so they can be weighted with whatever makes sense for your business. Time, distance, fuel usage. The graph database can use this information to select the most efficient path between any two nodes, or calculate the cost of a specific path.
How do you use it?
Like relational databases, graph databases use a query language, typically custom to the vendor. Neo4j uses Cypher, which has an SQL-like syntax but with additional constructs for navigating relationships.
When I say SQL-like, that's a little vague. It's a bit like SQL, but it's not actually based on it. In some ways it's backwards.
MATCH (u:User {id: 123})-[:FRIEND]->(f:User)-[:PURCHASED]->(p:Product)
RETURN p
This query will find the products purchased by the friends of a given user. As you can see, the initial MATCH tells it what to find: User is a node label, and the :FRIEND and :PURCHASED parts are relationships (edges) that you can query by.
What are the downsides?
Graph databases are actually relatively common, and most cloud providers offer some kind of managed solution. This is not to say there can’t be significant costs, but they’ll be in the ballpark of a relational database.
Specialist Databases
Not all databases are general purpose like the above. General purpose databases are excellent for solving general purpose problems, but some domains have specific needs that aren’t covered, or aren’t covered sufficiently.
It's worth noting that many of these requirements can be covered by a general purpose database, just not in an optimized way, or with a lower performance ceiling. I labor this point more later, but it's important to clarify that I'm not advocating you learn or use these tools. I'm mostly talking about them because they're interesting and cool.
Search Index Database
If you've ever wondered how something like Google works, this is how. Many people have added a full-text index to something like Postgres or MySQL, or used a LIKE query condition. But the requirements of search can be vastly more than that.
If you think about a dating site like Tinder (put scare quotes around “dating” if you like) the process of matching is pretty complex. Many different conditions exist, there are quiz questions, and these are weighted, and the end result gives a match quality rather than just a yes or no.
This is what tools like Elasticsearch or Algolia are used for. I’ll use Elasticsearch as my canonical example here because it’s what I’ve used most.
Why would you use it?
If search or matching is the primary goal of the company or application. Particularly if you need the search or matching to be “non-boolean”, so you don’t just want things to pass or fail, but have better or worse matches, or degrees of success.
How does it work?
Basically, Elasticsearch builds an enormous index from your records. Take a dating profile, for example: the data is pulled apart and indexed along all of its relevant components, so you'd pull out all the user's profile details, recent logins, answers to quiz questions, and so on.
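As a rough sketch, feeding a profile into that index with the official JavaScript client (v8 syntax) might look something like this; the index name and fields are invented:

import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

await es.index({
  index: "profiles",
  id: "user-42",
  document: {
    smoker: "no",
    age: 31,
    location: { lat: 40.7128, lon: -74.006 },
    last_login: "2024-06-01T12:00:00Z",
  },
});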
This index is then queried through a set of scoring functions written to check various properties. For example, is the user a smoker? That might be a definite no, a definite yes, or have no effect. Typically these functions return a multiplier: returning 1 has no effect, returning 1.1 makes it a 10% stronger match, 0.2 makes it a much weaker match, and so on. Other criteria are less simple; geographical distance, for example, can be calculated and added to the weighting. An incompatible sexual orientation or gender returns a 0. Cumulatively these numbers produce a match score that shows how good a match is, and results are ordered by that. These functions, and how they work and are weighted, are essentially "the algorithm" in a lot of search systems.
Most critically it does all of the above in milliseconds. Elasticsearch is blindingly fast at doing complex search queries of its indexed data.
How do you use it?
Like most of these systems, the idea is to run a query through a client. Writing the query for something like this is actually pretty complex.
{
  query: {
    function_score: {
      query: {
        bool: {
          must: [
            { match: { smoker: 'no' } },
            { range: { age: { gte: 25, lte: 35 } } }
          ]
        }
      },
      functions: [
        {
          gauss: {
            location: {
              origin: { lat: 40.7128, lon: -74.0060 },
              scale: "50km", // Defines the "drop-off" for score impact
              offset: "5km", // No score boost within 5 km
              decay: 0.5     // Decay rate; lower values make proximity matter more
            }
          }
        }
      ],
      score_mode: "multiply", // How scores from multiple functions are combined
      boost_mode: "sum"       // How the function score combines with the query's match score
    }
  }
}
This object is pretty complex and it’s out of scope to go through it in any real detail, but building up this kind of object is often a pretty complex process of conditional logic and input parsing.
What is the cost?
Hoo boy. Absolutely enormous. Elasticsearch requires massive machines with extreme amounts of RAM to store the indices it creates. It's the fact that it's in memory that makes it so extremely performant. This means production machines can need anywhere from 64 up to 256 GB of provisioned RAM. And that's assuming minimal scaling; Elasticsearch scales horizontally, spreading its index across multiple machines, which multiplies the cost.
Summary
Elasticsearch is the sort of technology entire businesses are built on. It's not just "search". Any sort of matching of users and services, matching of people, or finding appropriate products can be done with these search databases. The "feed" in almost any social media system is probably something like Elasticsearch. Any dating site. Obviously any search engine. Any product recommendation engine. They are incredibly powerful tools, but come at a cost, both in terms of the time it takes to get them working and get them right, and in the cost of… you know… money.
Time Series
In rare cases, the data doesn't matter quite as much as the time. This doesn't seem to make much sense, but think of the example of something like sensor output. Maybe the sensor just dumps a single integer for the current humidity. There needs to be an identifier for the sensor, but that's about it.
InfluxDB is a commonly used open source time series database.
Why would you use it?
This section will be longer than most because it's not as obvious. The short answer is that you have massive amounts of highly time-specific but otherwise minimal data.
Examples might be sensor data, or some kind of system logging. Almost invariably this will be data you need to write a billion times and read like… twice. Maybe to make a report, or to get the daily average of data that was collected by the millisecond.
What is less obvious about a time series database is why you'd use it instead of Postgres or MySQL. Postgres has timestamps too. I can make the exact same structure in a readings table, so where is the benefit?
There are two key benefits. The first is write speed. Time series DBs are intended to have thousands of writes per second. They store small payloads at ungodly speeds, well beyond what a general purpose database is intended to handle. The other is the focus on time.
We’ll look at more details soon, but in SQL it’s possible to group by a period of time. In a Time Series database this form of grouping is vastly easier, and much faster.
High end enterprise Time Series Databases like kdb+ are cornerstones of high frequency trading, running in memory with extreme performance.
How does it work?
Time series databases are essentially a timestamp associated with some data. The data itself is almost always minimal - an identifier and a value, or a tiny payload.
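As a rough sketch, writing a single reading with the InfluxDB v2 JavaScript client looks something like this (the bucket, org, and field names are illustrative):

import { InfluxDB, Point } from "@influxdata/influxdb-client";

const writeApi = new InfluxDB({ url: "http://localhost:8086", token: process.env.INFLUX_TOKEN })
  .getWriteApi("my-org", "sensors");

const reading = new Point("humidity")
  .tag("sensor_id", "greenhouse-3") // the identifier
  .floatField("value", 54.2)        // the value
  .timestamp(new Date());           // the part that really matters

writeApi.writePoint(reading);
await writeApi.close(); // flushes any buffered points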
How do you use it?
The query language for most time series DBs draws heavily on SQL to make usage more familiar, augmented with specific functions and syntax for time aggregation.
SELECT MEAN("close")
FROM "stock_prices"
WHERE "ticker" = 'AAPL'
AND time >= '2024-01-01T00:00:00Z'
AND time <= '2024-01-07T23:59:59Z'
GROUP BY time(1h)
As said, this looks a lot like SQL; the main addition is time-specific syntax like GROUP BY time(1h), which buckets the results into hourly averages.
Summary
Time series DBs are cool, but aren't something I'm likely to ever need myself. The time series functionality is easy enough to approximate in a general purpose database at modest scale; it's only at serious write volumes that the dedicated tools really earn their place.
Transactions
The term "transaction" is overloaded in database contexts. Relational databases have a concept whereby a set of operations can be wrapped in a "transaction". If all the operations are successful, the transaction is committed; if there is an error, the transaction is rolled back and none of the operations take effect.
That’s not what we’re talking about here. We’re talking about financial transactions – transferring funds between accounts.
Handling fund transfers is not as simple as it might seem. There are a lot of rules and implications to these transfers. A transfer should only be allowed if there is sufficient balance for the transaction, funds cannot be double-spent, and transfers need to update the balance of a given account. These actions are not just important, they are mission-critical.
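As a toy illustration of those invariants (this is not TigerBeetle code, just the rules from the paragraph above expressed in plain JavaScript):

function applyTransfer(accounts, { from, to, amount }) {
  const source = accounts.get(from)
  const destination = accounts.get(to)

  if (amount <= 0) throw new Error("amount must be positive")
  if (source.balance < amount) throw new Error("insufficient funds") // no overdrafts, no double-spends

  // In a real system these two updates must happen atomically, or not at all
  source.balance -= amount
  destination.balance += amount
}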
TigerBeetle is a database designed solely around financial transactions. It really has only two types of things it can store: accounts and transfers.
Blockchains
It might not seem obvious, but blockchains are essentially transaction databases. They are decentralized and append-only, but those features are not unique to blockchains – the same terms could be applied to TigerBeetle.
Why would you use it?
When the application is entirely dedicated to financial transfers, such as currency transfers or cryptocurrency. This wouldn't include a large financial institution like a bank (there are actually specific databases intended for banks), but it might include currency exchanges or investment platforms, which is what I work on.
How does it work?
TigerBeetle stores both accounts and transfers in tables with highly specific structures. There are fixed fields for the data, including a few that can be used to store custom data. Transactions (transfers) are relatively simple, but understanding how accounts work within your domain can take a little bit of doing. For example, accounts have a LedgerID.
Ledgers aren't any sort of abstraction, they're just a way to group accounts. For example, a ledger ID of 1 might represent US dollar transactions, while 2 represents Euro. You then cannot simply send transfers denominated in USD (ledger 1) to Euro accounts (ledger 2). Additionally there is a set of UserData fields (UserData8, UserData16, UserData32) that are those numbers of bytes in size. These can be used to store data to help you identify them.
In my own implementation, for example, these fields store the transaction type (deposit, withdrawal, investment, interest payment, etc) and the USD value of the cryptocurrency transaction.
How do you use it?
Most databases here use the pattern of a client plus a query language of some kind. TigerBeetle doesn't. There's a client, sure, but it exposes only a minimal set of specific, pre-implemented functions, for example CreateTransfer or LookupAccounts. The LookupAccounts function needs to be passed a list of ids, which means you need to look those ids up elsewhere. There's some rudimentary search in the QueryAccounts function, but it's so beta it's not even in the Go client library I use.
Speaking of which, I have used this entirely in Go, and my code samples reflect that.
// Find the user's wallet accounts in Postgres first...
accountSearch := models.Account{
    OwnableModel: models.OwnableModel{UserID: userID},
    AccountCode:  models.AccountCodeWallet,
}
accounts := accountService.getAccounts(accountSearch)
accountIds := accountService.ExtractIDs(accounts)

// ...then use those IDs to fetch balances from TigerBeetle
balances := s.tbClient.LookupAccounts(accountIds)
The first few lines query the Postgres models; only the last line actually talks to TigerBeetle. Most TigerBeetle functions expect you to already know the ID of an account or a transaction in order to get details on it.
What does it cost?
Quite a lot. The high availability and distributed nature of TigerBeetle inherently requires multiple instances and that means infrastructure costs. There is benefit in the robustness of transactions and in the management of account balances.
TigerBeetle doesn't replace existing databases. In fact, it has almost no searching ability, so if you want to get account details, or to search for transactions by type or by user, you actually need to run something like Postgres in parallel. In fact you're going to need Postgres anyway for general purpose data. This is actually a common theme with these specialized databases: they're good for storing the specific key data they're intended for, but they're rarely suitable for the general data you need for a full application – user data, logins, settings, etc. We'll talk about that more later.
Summary
This is one of the few of these specialist databases I've got personal experience with, and it was actually the catalyst for writing this article. It has proven such an elegant and effective solution to the problem that it got me thinking it's worth discussing the merits of these focused solutions.
There are noticeable downsides, but in this case they come from being an early adopter, rather than from the niche nature of the solution. In fact, all of these systems are well-proven and battle tested in their own domain.
I’ll go into more detail about TigerBeetle and my own experiences with it in a separate article.
GIS – Geographic Information Systems
GIS is solely devoted to the storage of geographical information. This means map co-ordinates, regions, and more. The intent is to be able to store this location data and run queries based on geography – whether a point is in a region, how close two points are, and more.
There are actually modules that can be added to Postgres to allow it to do this, but the king of the carefully mapped hill is ArcGIS. Arc is an enterprise mapping and geography solution.
Why would you use it?
Because your primary business requirements involve geographic features or mapping, and you need accurate and powerful tools to manage them. Maybe you work in mining, surveying, or some sort of land authority. Urban planning, wildlife management or conservation, real estate, or utilities. Knowing exactly where gas lines or electrical conduits are is of obvious importance.
How does it work?
This is actually an enormous question. GIS information can be some of the most complex in any domain, because it models details of the real, natural world. It's not just co-ordinates. It's shapes, terrain, topology, contours, hydrology, and more, stored in complex vector files, not unlike an SVG. It's also files like LIDAR mappings, 3D models, and satellite imagery.
Focusing on ArcGIS as the industry standard, the model is to store data in "feature classes". These are broken into several main categories: point, line, and polygon. There are also some grouped ones, like multi-point. These classes are then named, so you might make a Rivers feature class that extends the line feature class and define it to have depth, flow rate, maybe fish species, whatever. Rivers is just what we named it; we could have called it "Riparian system" if we wanted to. The properties alongside the river might go in an "Allotment" feature class, which extends the polygon feature class and has zoning details, size, ownership, water allocations, and some basic topology.
How do you use it?
Despite the underlying complexity it’s actually not as hard to use as it might seem. Essentially there are two layers to a GIS system. The first is to query for the core data. This is not at all dissimilar to SQL and there’s a good reason for that. Let’s say we have a true crime podcast and we want to get some serial killer victimology data. (I was going to go with an urban planning example, but I thought I’d use something less bleak.)
Let’s say we want to find all of the women the Green River Killer dumped or abducted within 25km of his home. The most idiomatic way is using ArcPy, a Python version of the Arc lookup syntax.
import arcpy
house_point = arcpy.Point(-122.2590, 47.3681)
house_feature = arcpy.PointGeometry(house_point, arcpy.SpatialReference(4326))
# Create a 25 km buffer
buffer_distance = "25 Kilometers"
house_buffer = arcpy.Buffer_analysis(house_feature, "House_Buffer", buffer_distance)
# Attribute query to filter victim locations for Green River Killer case
victim_query = "Case_Type = 'Green River Killer' AND Location_Type IN ('Abduction', 'Discovery')"
arcpy.MakeFeatureLayer_management("Victim_Locations", "Filtered_Victim_Locations", victim_query)
# Spatial selection for locations within the 25 km buffer
arcpy.SelectLayerByLocation_management("Filtered_Victim_Locations", "WITHIN", "House_Buffer")
Where a tool like ArcGIS differs significantly from the other databases we’ve talked about is that Arc has a full interface. Commands like the above can be saved, and then run to get data, sure, but for ad-hoc searches you can simply enter your requirements into a form.
What does it cost?
Costs for ArcGIS are a little hard to figure out. There are several products, not least ArcOnline, which is a software-as-a-service solution, and ArcPro, which is a more standard licensed software install. There are multiple tiers and multiple options, with ArcOnline Creator clearly costing less than Professional Plus. The pricing page literally doesn't include pricing.
Some ArcPro licence prices I found were around $3,000 per user per year, which is not insignificant. Others were more like $900. This certainly isn't an open source solution. Products like Arc are there for professionals doing a professional job. With that said, I imagine something like this is just a cost of doing business for a surveyor, urban planning department, or the like.
There is also Arc Enterprise, a locally hosted server solution, whose cost I’m not even going to look up. You don’t need to know and neither do I.
Summary
This is such a niche that it's very unlikely anyone reading this – or me – will ever have the need. And to be honest, if I did need GIS functionality I'd be adding it to Postgres anyway. This isn't at all a dodgy solution. Postgres with PostGIS is considered a pretty effective solution to most common mapping and searching requirements.
Overall Summary
As developers we have a myriad ways to store data. All of them have benefits, costs, upsides and downsides. Some of them have specific development environments they are an ideal fit for, while others are optimized around niche businesses.
There are a few things that we can say in conclusion, though.
First of all – start general. It's highly unlikely you need a time series database initially. It's unlikely you'll need a graph database, or Elasticsearch, right out of the gate.
Secondly, this isn’t a zero-sum game. You typically don’t replace one of these solutions with another. You typically add it. If you have a MySQL database you may well add Elasticsearch to it, or add Redis for caching, or implement time series for logging. This makes these solutions cumulative, and so are their costs. If you don’t absolutely need them, don’t use them. And if you definitely do need them, use the simplest version that solves your need.
Finally, and maybe this is just a duplicate of the first one, but don’t prematurely optimize.
It is important to reiterate that these are technical solutions to given problems, and they’re often integrated in a complementary way. Without being absolutist about it, there’s almost nothing you couldn’t do with just a Postgres or MySQL database, and maybe throw some Redis caching on there for an easy performance win.
But it's useful to know that these other, more niche solutions are available. There may come a time when you have to collect IoT data from temperature sensors, or (like me) build the backend for a cryptocurrency investment platform. And so it's nice to have an awareness of what your options are. It still might be easier to just store all your stuff in MySQL. But maybe not.