NHacker Next
More info about synonyms at Google (2010) (mattcutts.com)
bms2297 2 days ago [-]
One of the most important components of pre-2010 Google's search system was its synonym discovery mechanism. Simply put, queries would be "expanded" with synonyms. Google automatically generated synonym choices that took into account the context of surrounding words, with the understanding that synonyms are highly context dependent. Steven Baker, John Lamping, and a couple of others were key engineers of the system.

Does anyone with an NLP background care to take some guesses at how the synonym extraction methodology worked? My only piece of information is that it likely used the query log itself to do so.

evmar 2 days ago [-]
I was on the team too, with less impact than the names you mentioned. The team filed a number of patents that describe parts of how it worked. You can query a patent search engine with terms like [baker synonyms]. Looking now, I think Steve was on most of the patents, and you can also gather adjacent co-author names from there.

[I am not a fan of patents, but to the extent they have any positives they in principle serve to share knowledge about how inventions work. Also I am not a lawyer but I think patents last 20 years from filing date and these were filed ~20 years ago maybe?]

robrenaud 2 days ago [-]
I got a couple patents while at Google. I sent a nice readable 4 page design doc that I wrote to a patent lawyer, and I got back 40 pages of nonsense that I basically didn't understand.

I wish there were some kind of readability requirement for patents, if they are to continue to exist.

bruckie 2 days ago [-]
Ha, I had the exact same experience. The lawyers sent me what they wrote for me to verify that they didn't mess anything up, and it almost seemed like they had based the patent on an entirely different document than the one I wrote up for the invention disclosure—not because they actually messed it up, but because patentese is almost as far from regular English as any foreign language.
neodypsis 2 days ago [-]
Someone should finetune an LLM to create a "patentese" assistant writer.
gregw134 2 days ago [-]
Please no
ajb 2 days ago [-]
The reverse would be good though
neves 1 days ago [-]
It is already one of the main applications of LLM systems. Lawyers use them a lot.
dekhn 1 days ago [-]
Same - I had a very trivial idea and talked to a few lawyers who turned it into something far, far more complicated (and clever; in fact I wondered why they didn't just write patents on their own all day long). At the time, each submitted patent earned you $1K, and if it was accepted (I don't recall the actual terms) it was $5K. Easy money! Hopefully, my ex-employer won't abuse this patent...
ajb 2 days ago [-]
Yeah, factor of 10 expansion was my experience as well.
bms2297 1 days ago [-]
Very cool that you worked on it! I've found most, I think, of the patents. They are, as has been hinted at in this thread, very difficult to parse and (imo) don't actually reveal much, though I may just lack the expertise! That's why I was hoping to get some NLP folks to speculate!

Your point on dates is something I did want to call out - I wouldn't be asking this if it wasn't ancient history. I have no interest in doing anything sinister. Just trying to explore a fun part of Internet history. Any shot I could shoot you an email to chat?

RaoulP 1 days ago [-]
I just realised that this technique is absent from local/desktop search. Meaning that in most systems you’re expected to recall how something was phrased, if you want to have a chance of finding it.

I know “Google Desktop” used to be a product years ago. What’s the state of that space today?

theolivenbaum 18 hours ago [-]
We experimented with it some time ago for our https://curiosity.ai app. The initial training on your data was a bit heavy (at the time; probably fine by today's standards), but it gave nice results if you had enough files. It needs to be done with care, though: for small datasets there's not enough information for a model to learn, and you end up introducing more noise than anything.
bckr 1 days ago [-]
There are some Spotlight replacements on Mac that I’ve been meaning to try out, but not sure they do this specifically.

- https://www.raycast.com/

- https://www.cerebroapp.com/

- https://news.ycombinator.com/item?id=33816014

user_7832 1 days ago [-]
I personally use Everything by Voidtools.

Remembering the exact name is great, but I find using *.docx combined with sorting by date is often good enough.

IshKebab 1 days ago [-]
> in principle serve to share knowledge about how inventions work

Emphasis on "in principle". Most patents - especially software patents - are completely unintelligible. They also tend to describe the system just enough that the holder can sue people who do the same thing, but nowhere near enough that you could actually implement it based on the patent.

jjtheblunt 2 days ago [-]
Around 2001 I was using WordNet to do the same, in my Motorola Labs days.

https://en.wikipedia.org/wiki/WordNet

mcculley 2 days ago [-]
WordNet is insufficient for disambiguation, right? That's why you need the query log.
jjtheblunt 2 days ago [-]
That would be great.
cdavid 2 days ago [-]
If you have access to the query log (i.e. "who makes which query in what context"), you can see which queries are "close" to others in context.

For example, with session data, you can detect manual query rewriting and use it as a signal for which queries are close to others in time. You can do various fancy things from just that.

Nowadays, a simple way to start would be to use SOTA LLMs to generate synonyms offline and use them for query expansion at query time. At least where queries are small, that should give decent results. It has diminishing returns because of cost, however (the more synonyms, the more expensive it is to query the index), and you also trade precision for ever-smaller gains in recall.

Ofc, for complex search like Google's, I am sure it is much more complicated.
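To make the session-rewrite idea concrete: here's a toy sketch (my own illustration, definitely not Google's pipeline; the session format and the single-term-diff heuristic are assumptions). When a user edits a query and exactly one term changes while the surrounding terms stay fixed, the swapped pair is a synonym candidate in that context:

```python
from collections import Counter

def mine_rewrite_pairs(sessions):
    """Collect candidate synonym pairs from consecutive queries in a session.

    `sessions` is a list of sessions, each an ordered list of query strings.
    If two consecutive queries have the same length and differ in exactly
    one term, the context is preserved and the swapped pair is counted.
    """
    pair_counts = Counter()
    for session in sessions:
        for q1, q2 in zip(session, session[1:]):
            t1, t2 = q1.split(), q2.split()
            if len(t1) != len(t2):
                continue
            diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
            if len(diffs) == 1:  # single-term rewrite: context preserved
                a, b = diffs[0]
                pair_counts[tuple(sorted((a, b)))] += 1
    return pair_counts

sessions = [
    ["cheap flights nyc", "cheap airfare nyc"],
    ["buy used car", "buy secondhand car"],
    ["cheap flights nyc", "cheap airfare nyc"],
]
print(mine_rewrite_pairs(sessions).most_common(1))
# [(('airfare', 'flights'), 2)]
```

A real system would then have to filter these candidates aggressively (misspelling fixes and refinements also show up as single-term rewrites), which is presumably where most of the hard work was.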

bms2297 1 days ago [-]
Re: LLMs, I was trying to better understand how pre-LLM search worked, hence the interest in the topic.

Any chance you have any open source links that discuss how you practically operate a system based on the concept you describe (manual query rewrite w/i a session as your data set)? Perhaps it's obvious to an NLP person how to reduce that "idea" to practice, but it is not to me!

You're definitely right about the idea though - a former Search engineer obliquely mentioned that this sort of session based manual query rewriting was very core to how the synonym system worked.

cdavid 1 days ago [-]
It is hard to find modern references on this. When I led a search group, coming from an ML but non search background, I found the following most useful

1. The Query Understanding series: https://queryunderstanding.com/query-understanding-8a2b16024...
2. Deep Learning for Search (Manning). It covers some non-DL techniques.

Then it is mostly papers and talking to people. Berlin Buzzwords has videos of all its talks and is very non-academic but technical.

bpiche 1 days ago [-]
Maybe pointwise mutual information (pmi)
bms2297 1 days ago [-]
Say more!? :)
bpiche 1 days ago [-]
Kind of like a distance/similarity metric for words, but not an embedding: it's based on joint probability. I once used it to automatically group words into multiword tokens. The wiki page is informative, but I wouldn't know what they were actually using.
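As a rough illustration (a toy example of mine, not anything Google is known to have used), PMI over bigram counts looks like this: it scores how much more often two words co-occur than independence would predict.

```python
import math
from collections import Counter

def pmi(w1, w2, word_counts, pair_counts, n_words, n_pairs):
    """Pointwise mutual information: log p(w1, w2) / (p(w1) * p(w2)).

    High PMI means the pair co-occurs far more often than chance predicts,
    which is a signal for collocations like multiword tokens.
    """
    p_joint = pair_counts[(w1, w2)] / n_pairs
    p_indep = (word_counts[w1] / n_words) * (word_counts[w2] / n_words)
    return math.log(p_joint / p_indep)

tokens = "new york is big , new york is busy , the city is big".split()
words = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(round(pmi("new", "york", words, bigrams, len(tokens), len(tokens) - 1), 2))
# 2.02 -- "new york" co-occurs far more than chance, so it's a collocation
```

On real corpora you would also smooth the counts and drop rare words, since PMI is notoriously unstable for low-frequency pairs.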
choppaface 2 days ago [-]
Another underrated feature of 2010-era search was Matt Cutts himself, the author of the article. He was an outlier at Google in that he did real community engagement as well as anti-spam work, which is a huge contrast with today's Google and how the internet has reacted to present-day SEO.

While the Matt Cutts era search tech is interesting, it's crucial to keep in mind that the dataset was very different then too, as a result of Matt Cutts' own attitude towards spam and SEO.

Back in 2010, LDA was big, and Google had used probabilistic networks, e.g. Rephil (large noisy-OR networks), as models:

https://uh.edu/nsm/computer-science/events/seminars/2016/110...

Would the same things work today given how SEO spam and Google ads work? The same models are probably useful but it’s the noise and the long tail of the data that makes the problem hard.

bpiche 1 days ago [-]
I was a fan of LDA but would not agree that it is 'probably useful' today. It's an unsupervised clustering algorithm based on Gibbs sampling. Like k-means, it's gonna return a few buckets that will have to be reviewed by a human for data exploration. In this case instead of neatly labeled buckets, these are unlabeled distributions of distributions (lists of single word tokens). If you do some kind of multiword tokenization preprocessing, it'll return a few lists of words and multiword tokens for each document. How is this useful to an end user? Even internally, they're not useful embeddings/vectorizations. Would love to hear some contrary opinions
choppaface 1 days ago [-]
In many applications, especially Google's display ad targeting market, the "accuracy" of the clusters isn't as important as the lift in key metrics (e.g. click rates or revenue) and the overall efficiency of the method. Indeed, the clustering algo might get things "dead wrong" but somehow surface something that causes clicks and revenue to increase. LDA offered a big improvement over e.g. TF-IDF models, just as t-SNE improved on LDA, and now LLM embeddings are on average better and potentially cheap to compute.

LDA could be useful if your success metric is perplexity; k-means is useful if vector distance is very meaningful for your problem. Also well-studied algorithms are generally useful for initial studies in a new, unknown dataset. As always with ML, the dataset and setting are just as important as the model and algorithm.

bpiche 1 days ago [-]
Thank you for the well considered response
dkjaudyeqooe 2 days ago [-]
> A lot of people seem to think that Google only does simple-minded matching of the users’ keywords with words that we indexed

Oh what a dream if that were true! Instead, every year, the 'synonyms' get broader and broader. To me it looks like they're using synonyms of synonyms (of synonyms).

One thing is surely true: Google abhors a vacuum, they will show you results, no matter how tenuously connected to your query.

Calavar 2 days ago [-]
> Oh what a dream if that were true! Instead, every year, the 'synonyms' get broader and broader. To me it looks like they're using synonyms of synonyms (of synonyms).

Forget about synonyms of synonyms - I've seen antonyms bolded as matches in google search results. I have to imagine I'm not the only one.

userbinator 2 days ago [-]
I've also seen a few cases where it decided to cross out the word "not", basically inverting all the results.
a_wild_dandan 2 days ago [-]
Google has slashed one term in my two word query. The results of this 50% query reduction, predictably, were complete trash.
marginalia_nu 1 days ago [-]
While I agree that Google's query interpretation as it works right now is annoying and frustrating more than it is helpful, that's largely owing to how inscrutable it is. If you search for cats and get dogs, it's not obvious why that happened and not obvious how to prevent it from happening. That problem is in no way intrinsic to synonym generation, but likely a result of leaning too much into embeddings to do the heavy lifting.

That said, skipping synonyms and other query variant generation is definitely throwing the baby out with the bathwater. When it works well it massively increases the recall of the search engine at very little loss of precision, which is important given the scale and noisiness of web results.

dkjaudyeqooe 19 hours ago [-]
> That said, skipping synonyms and other query variant generation is definitely throwing the baby out with the bathwater.

I disagree; I want control back. They've neutered almost every query-narrowing option they used to have. They don't care anymore; they just want to give you results.

I'm not saying the, let's call it 'query automation', is always bad. I just want the option to turn it off to some degree. Apparently we can no longer be trusted with that.

marginalia_nu 19 hours ago [-]
There's really no reason why you couldn't still have control with e.g. a verbatim mode or quotes.

This is a Google problem, not a search problem.

asddubs 1 days ago [-]
I don't know why this is downvoted; it's overstating things a little, but there's some truth to it for sure. The other day I was googling for the name of the \mid LaTeX math symbol. It's basically a pipe with some space around it, a separator line useful in definitions.

So I tried to google various variations of "separator line latex math", but it always synonymed them to include "line break latex", which is obviously "\\" and not what I want.

base698 1 days ago [-]
I've found Google totally useless for all things LaTeX and resorted to using LLMs.
dkjaudyeqooe 19 hours ago [-]
Some people can't handle, or don't recognize, hyperbole.

It was written, and was meant to be read in, an exasperated and slightly dramatic tone. I thought that was obvious from the opening sentence, but apparently not.

bombcar 1 days ago [-]
For what it’s worth Kagi gets what you wanted (I think) in the first four results for your first search.
wizzwizz4 19 hours ago [-]
Fwiw, the usual method for finding LaTeX symbols is Detexify: https://detexify.kirelabs.org/classify.html
asddubs 17 hours ago [-]
Oh yeah, thanks for that. I actually knew about this but completely forgot it existed. Getting back into LaTeX after not having used it for a while.
mmastrac 2 days ago [-]
Given this was published in 2010, and https://en.wikipedia.org/wiki/Word2vec was published in 2013, perhaps this was an early precursor?

From the article linked from this blog post: "Enabling computers to understand language remains one of the hardest problems in artificial intelligence."

visarga 2 days ago [-]
I worked on this task for a year, and it doesn't work very well, because in embedding space relatedness, synonymy, and antonymy are mixed up and require pairwise thresholding. You can probably get to 90% but not 99% this way. Better to use a cross-entropy approach.

In modern RAG applications we return top-k results for this reason - the system can't reliably surface the correct snippet in a single result, so the hard part of deciding what is useful and what is not is left to the LLM.

fuzzy_biscuit 2 days ago [-]
Oh man, I miss the days when Matt Cutts was the de facto search liaison at Google. When I was doing agency SEO, I read his posts and followed him with fervor.
1970-01-01 23 hours ago [-]
Honoring the deliberate use of verbatim operators (text in quotes) and delivering verbatim results, just as they did a decade ago, would be a fantastic improvement. I've found this problem in every search engine, and it's infuriating. Quick examples:

    "A+Z" becomes "A-Z":
https://www.google.com/search?q="A%2BZ"

    "(dog)" becomes "dog":
https://search.yahoo.com/search?p="(dog)"

https://www.google.com/search?&q="(dog)"

https://www.bing.com/search?q="(dog)"

https://yandex.com/search/?text="(dog)"

https://kagi.com/search?q=%22%28dog%29%22

https://www.searchenginewatch.com/2011/11/18/google-introduc...

48864w6ui 17 hours ago [-]
Nearly all customer facing computing is now optimized for people who don't know how computers work, not for people who do.
e____g 2 days ago [-]
(2010)