Friday, December 8, 2023
Google search engine
HomeUncategorizedCases where full scans are better than indexes

Cases where full scans are better than indexes

Sometimes you have a lot of data, and one approach to support quick
searches is pre-processing it to build an index so a search can
involve only looking at a small fraction of the total data. The
threshold at which it’s worth switching to indexing, though, might be
higher than you’d guess. Here are some cases I’ve worked on where
full scans were better engineering choices:

  • Ten years ago I wrote an interoffice messaging application for
    a small billing service. Messages were stored in MySQL and I was
    going to add indexing if full-text searches got slow or we had load
    issues, but even with ten years worth of messages to search it stayed
    responsive.

  • I recently came across
    someone maintaining a 0.5GB full text index to support searching their
    shell history, 100k commands. I use
    grep on a flat file, and testing now it takes 200ms for a
    query across my 180k history entries.

  • My contra dance search
    tool
    ranks each dance in response to your query, with no
    geospatial indexing, because there are just ~350 dances.

  • The viral counts
    explorer
    I’ve been playing with for work searches the taxonomic
    tree of human viruses in real time, scanning ~15k names with JS’s
    “includes” command about as fast as you can type.

  • When I worked in ads I would often need to debug issues using
    production logs, and would use Dremel (Melnik
    2010
    , Melnik
    2020
    ) to run a distributed scan of very large amounts of data at
    interactive speeds. Because queries were relatively rare, an index
    would have been far more expensive to maintain.

Unless you know from the start that you’ll be searching hundreds of
millions of records, consider starting with simple scans and only add
indexing if you can’t get acceptable performance. And even then, if
queries are rare and highly varied you may still do better to do
the work at query time instead of ingestion time.

Comment via: facebook, lesswrong, mastodon

Recent posts on blogs I like:

Cycling Injuries Analysis in DC

I looked at a few years’ worth of data for DC (where I live) to try to figure out how risky cycling is here.


via Home

May 19, 2023

Who is nature good for?

Not necessarily good for the fish.
The post Who is nature good for? appeared first on Otherwise.


via Otherwise

May 8, 2023

Some mistakes I made as a new manager

the trough of zero dopamine • managing the wrong amount • procrastinating on hard questions • indefinitely deferring maintenance • angsting instead of asking


via benkuhn.net

April 23, 2023

more
   
(via openring)

Read More

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments