Marginalia – A search engine that prioritizes non-commercial content

578 points by herbertl 3 months ago

Levitating 3 months ago

I recently wrote an essay about this search engine, and its ranking algorithms.

Initially Marginalia used an interesting variant of PageRank discussed in the original paper, called Personalized PageRank.[1] Currently pages are ranked with BM25.

I think Personalized PageRank is still used for a new feature of Marginalia which is ranking pages based on similarity. I think this is already integrated into the website but there used to only be this testing page: https://explore2.marginalia.nu/

In any case I have a lot of respect for the creator. Marginalia has seen a lot of growth and it's been interesting reading the blogposts.[2]

[1]: https://www.marginalia.nu/log/26-personalized-pagerank/
[2]: https://www.marginalia.nu/log/
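
A minimal Python sketch of the Personalized PageRank idea mentioned above, not Marginalia's actual code. Ordinary PageRank lets the random surfer jump to any page; the personalized variant restricts that jump to a chosen seed set, so rank flows outward from the seeds. The function name, the toy graph, the damping value of 0.85, and the way dangling nodes are handled are all assumptions made for illustration.

```python
def personalized_pagerank(links, seeds, damping=0.85, iterations=50):
    """Power iteration for Personalized PageRank.

    links: dict mapping each node to a list of nodes it links to.
    seeds: set of nodes the random jump is allowed to land on.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    # Teleport distribution: uniform over the seeds, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Every node starts with its share of the random-jump mass.
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = links.get(n, [])
            if out:
                # Split this node's rank evenly across its outbound links.
                share = damping * rank[n] / len(out)
                for t in out:
                    nxt[t] += share
            else:
                # Dangling node (no outbound links): hand its rank back to the seeds.
                for s in nodes:
                    nxt[s] += damping * rank[n] * teleport[s]
        rank = nxt
    return rank


# Hypothetical toy graph: a -> b -> c -> a, plus d -> a. Only "a" is a seed.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
toy_rank = personalized_pagerank(toy_links, seeds={"a"})
for node in sorted(toy_rank, key=toy_rank.get, reverse=True):
    print(node, round(toy_rank[node], 3))
```

In this toy graph "d" ends up with no rank at all: nothing links to it and it is not a seed, which is the sense in which the seed set decides whose "opinion" of the web counts.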

marginalia_nu 3 months ago

I'm using PPR for domain rankings, but it's a very weak factor. It mostly affects the physical ordering of the results in the index, and since queries run under a timeout, this means higher-ranking results are discovered first. Though in general, as I mentioned, this is a weak effect.
Explore2 and the website discovery tools now built into the search engine use cosine similarity of the incident link vectors. I wrote a blog post about the technique called "Creepy Website Similarity" :-) https://www.marginalia.nu/log/69-creepy-website-similarity/
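
A rough Python sketch of cosine similarity over incident link vectors, under the assumption that each site is represented by the set of domains linking to it (a binary vector). For binary vectors the cosine reduces to the size of the overlap divided by the geometric mean of the two set sizes. The function names and the example domains are hypothetical, and this is not taken from the Marginalia codebase.

```python
import math


def cosine_similarity(inbound_a, inbound_b):
    """Cosine similarity of two binary inbound-link vectors, given as sets."""
    if not inbound_a or not inbound_b:
        return 0.0
    overlap = len(inbound_a & inbound_b)
    return overlap / math.sqrt(len(inbound_a) * len(inbound_b))


def most_similar(target, inbound, top_n=3):
    """Rank every other site by how similar its inbound links are to the target's."""
    scores = [
        (cosine_similarity(inbound[target], links), site)
        for site, links in inbound.items()
        if site != target
    ]
    return [site for score, site in sorted(scores, reverse=True)[:top_n] if score > 0]


# Hypothetical data: which domains link to each site.
inbound = {
    "retro-a.example": {"hub1.example", "hub2.example", "hub3.example"},
    "retro-b.example": {"hub1.example", "hub2.example"},
    "shop.example": {"ads.example"},
}
print(most_similar("retro-a.example", inbound))
```

Two sites count as similar when the same domains link to both of them, regardless of what the pages themselves say.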
- jmholla 3 months ago
  
  Two things I've noted with the redesign (I hope it's ok to share it here):
  * I couldn't find the random button. But, I used the old site to find the correct URL. [0]
  * The random URL on the old site is broken. [1]
  [0]: https://marginalia-search.com/explore/random
  [1]: https://old-search.marginalia.nu/explore/random
_emacsomancer_ 3 months ago

where is the recent essay?
- Levitating 3 months ago
  
  It was for a university assignment, I am not sure if I could or should share it.
  - hooli_gan 3 months ago
    
    If you don't share the task description or other materials provided by your university, I don't see the problem with sharing your essay. (Not a lawyer)
    
    hmlwilliams 3 months ago
    
    It's a sure fire way to get your uni submission incorrectly flagged as plagiarism. How straightforward it is to prove you're the original author varies by institution.
    
    em-bee 3 months ago
    
    once the essay is already graded, it should not matter. and less as time passes.
    on the contrary i thought any essay, paper or report written for uni or school can be considered for publication in some form. not everything is worth publishing of course, but if it is, talk to your supervisors about it.
    http://academia.stackexchange.com/questions/65166/ddg#65208
  - _emacsomancer_ 3 months ago
    
    Ah, that makes sense - didn't realize it was a uni thing.

marginalia_nu 3 months ago

(Creator here) I recently moved the website from search.marginalia.nu to marginalia-search.com and gave it a bit of a visual touch up, on the basis that I felt it was working too well to be a weird subdomain outgrowth off my blog.

It's still the same search engine :-)

stackghost 3 months ago

Lately there's been some consternation in the small-web community about the aggressive crawling by OpenAI and friends.
How often does yours (re)crawl its indexed sites?
- marginalia_nu 3 months ago
  
  I do a big recrawl around once every 10 weeks, but I have an RSS crawler that does daily visits and fetches new items. But that should be very low volume traffic.
  - stackghost 3 months ago
    
    Interesting, does your RSS crawler auto-detect feeds? That's a neat solution I'd never considered.
    
    marginalia_nu 3 months ago
    
    The big crawler tries to find feeds, and the RSS crawler acts on what the big one has found.
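
One common way a crawler can find feeds is the autodiscovery convention, where a page advertises its feed with a `<link rel="alternate">` tag in the HTML head. The sketch below shows that convention in Python using only the standard library. Whether Marginalia's crawler works this way is an assumption, and the class name, function name, and URLs are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None when the attribute has no value.
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        feed_type = attr.get("type", "").lower()
        if "alternate" in rel and feed_type in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


page = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
    '<link rel="stylesheet" href="/style.css">'
    '</head><body>Hello</body></html>'
)
print(find_feeds(page, "https://blog.example/posts/"))
```

A split like the one described above would let the large crawl record these URLs once, so a lightweight daily job only needs to poll the feeds.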
axiomdata316 3 months ago

It looks like the "random" button on the old version no longer works. Did you remove that feature?
- marginalia_nu 3 months ago
  
  Hmm, I think it's just a broken redirect. It works if you search for browse:random . I'll try to fix it later today.
alberth 3 months ago

Any reason why you don’t use a shorter domain name?
(Amazing what you built, btw)
- marginalia_nu 3 months ago
  
  Hard to find one that makes sense and isn't inscrutable like mxyzp.tlk.
  If I ever do find one that's appropriate I might set up a CNAME record to point to marginalia-search.com
  - alberth 3 months ago
    
    FYI - https://www.dynadot.com/market/search has a nice way you can filter their Aftermarket domains for sales by Character Count, TLD, No IDN/Number/Hyphens.
    I've found over the years plenty of <= 5 character domains that have a meaningful name just from using their search filters.
    (I'm not affiliated with them in any way)

disqard 3 months ago

If somebody hasn't been following this, it's one person's passion project, and it doesn't aim to compete with Google/Bing/DDG.

It's more of a way to find interesting things on the "small web".

Please keep that in mind when you check this out.

(I'm not the creator)

lolinder 3 months ago

I don't know if something has changed since then, but in 2023 the author did quit their job to start working on the search engine full time [0]. So yes, it's a passion project, but it's also not just a side project.
[0] https://www.marginalia.nu/log/83_full_time/
- barbs 3 months ago
  
  That's really cool, although I'm curious about their plans to monetize the site, if any.
  - ChemTails 3 months ago
    
    According to the creator, $200/month will maintain operations. He doesn’t mention plans to monetize.
    
    marginalia_nu 3 months ago
    
    The plan, as much as there is one, is to have very low costs, provide outsized value and somehow scrape by on grants and public donations. I'm currently set for about 1-2 more years of full time work (had 6 months' funds + a 1 year grant when I left my job 18 months ago).
    
    mannycalavera42 3 months ago
    
    I simply LOVE and LOVE and LOVE marginalia. Keep rocking mate
eitland 3 months ago

> If somebody hasn't been following this, it's one person's passion project, and it doesn't aim to compete with Google/Bing/DDG.
That is correct.
But I used[1] to find that in certain niches (Linux and open source software, history) I'd often get significantly better results with Marginalia than with Google or DDG.
This seems to be related both to
- the input stupidifier that seems to be in use with mainstream search engines (if I search for anything unusual it assumes I misspelled or mis-remembered and replaces my search with something I didn't search for before sending it to the backend)
- and mainstream engines going out of their way to prefer corporate media over original, authoritative content
> It's more of a way to find interesting things on the "small web".
Yes, he keeps saying that, but it was[2] still not just interesting but useful for me.
[1][2]: past tense because I switched to Kagi almost 3 years ago and now I don't have to maintain all the hacks (like separate search engines for separate niche topics) that I used to have. Full disclaimer: I know certain other people frequently want to contest this, saying they get great results with Google or bad results with Kagi, to which all I have to say is I have a documented history of Google consistently failing even simple queries going back more than a decade, and if Google works for you, more power to you, but it hasn't worked reliably for me for over a decade and I am fed up.

hatefulmoron 3 months ago

I don't have anything unique to say, but I love when this project comes up on HN. The project exemplifies what a patient and loving hand can do.

nostradumbasp 3 months ago

same. I was thinking about doing something similar at some point but this exceeds my expectations. Seriously a great job. Let us know if there's any way to contribute.
- marginalia_nu 3 months ago
  
  Well it's open source and contributions are welcome, though it's a fairly sprawling java project that's probably not the easiest to get into still (through no lack of effort making it more accessible): https://github.com/MarginaliaSearch/MarginaliaSearch
  If you have spare dollars but not time, you can also contribute to the war chest: https://about.marginalia-search.com/article/supporting/

dredmorbius 3 months ago

I'm surprised to see that, as of now at least, there's apparently no Marginalia bang search at DDG:

<https://duckduckgo.com/bangs?q=marginalia>

I've submitted it as a suggestion.

rossdavidh 3 months ago

Decided to repeat my last few google searches ("celery", "Wardian case", and "ferry from Italy to Greece", if you're curious), and it came up with reasonable answers as the top one in every case! I will give it a try.

benreesman 3 months ago

I don’t have much to add other than to say that this is the most pleasant and useful and wholesome thing I’ve seen on the web in a long time. The technology looks extremely solid and the experience is flawless.

To the author, thank you so much for this trip to the glory days of the Internet.

miki123211 3 months ago

I tried a few relatively simple terms in the areas I have an interest in, e.g. ("attention is all you need paper", "sip rfc", "klatt speech synthesizer", "crafting interpreters book", "Scott Alexander Substack"), and none of them actually produced the results I expected.

Marginalia seems to work okay-ish if you want to learn about something, but not if you want to find something.

If you want to read what people think about Scott Alexander's Substack, you will get some decent results, but not if you want to find the newsletter itself.

marginalia_nu 3 months ago

Yeah navigational queries are a bit hit and miss, but I feel google does them very well as is so it's not a huge priority area. I've mostly been focusing my efforts on the stuff Google is bad at, finding human content about some topic.

AndrewStephens 3 months ago

This is great - as a site owner I especially appreciated that you publish discovered backlinks. I discovered a few that I didn't know about thanks to your service.

Timwi 3 months ago

I looked at the backlinks for a website of mine and none of them actually linked to my site, so I was very confused...
- marginalia_nu 3 months ago
  
  Hmm, can you share the domain name? Either here or email me? I can look into it. (Though it's most likely suffix-list related, there are some outstanding issues in that regard.)
  - Timwi 3 months ago
    
    I owe you an apology; I clicked on the ↗ icons, not realizing that those take me to the domain’s homepage instead of the specific URL containing the link. I have now checked them all and they do in fact link to the domain. (Which, for the record, is ktane.timwi.de, a repository of documents relating to a video game.)

dang 3 months ago

Much discussed under its previous domain (marginalia.nu):

Phrase matching in Marginalia Search - https://news.ycombinator.com/item?id=41696046 - Sept 2024 (24 comments)

Marginalia: 3 Years - https://news.ycombinator.com/item?id=39501061 - Feb 2024 (44 comments)

Interview with Viktor Lofgren from Marginalia Search - https://news.ycombinator.com/item?id=38470832 - Nov 2023 (21 comments)

Moving Marginalia to a new server - https://news.ycombinator.com/item?id=37800753 - Oct 2023 (39 comments)

Marginalia.nu API - https://news.ycombinator.com/item?id=35871186 - May 2023 (22 comments)

Marginalia: DIY search engine that focuses on non-commercial content - https://news.ycombinator.com/item?id=35611923 - April 2023 (193 comments)

Marginalia Search has received an NLNet grant - https://news.ycombinator.com/item?id=34945541 - Feb 2023 (17 comments)

A Theoretical Justification (2021) - https://news.ycombinator.com/item?id=32586273 - Aug 2022 (22 comments)

The Evolution of Marginalia's Crawling - https://news.ycombinator.com/item?id=32565052 - Aug 2022 (22 comments)

Marginalia Goes Open Source - https://news.ycombinator.com/item?id=31536626 - May 2022 (72 comments)

Uncertain Future for Marginalia Search - https://news.ycombinator.com/item?id=31200319 - April 2022 (37 comments)

Marginalia Search: 1 Year - https://news.ycombinator.com/item?id=30823481 - March 2022 (29 comments)

Show HN: Marginalia – Exploration Mode - https://news.ycombinator.com/item?id=30047455 - Jan 2022 (53 comments)

A search engine that favors text-heavy sites and punishes modern web design - https://news.ycombinator.com/item?id=28550764 - Sept 2021 (717 comments)

https://news.ycombinator.com/user?id=marginalia_nu

Funes- 3 months ago

Hell yeah. The very first one was submitted by yours truly. I love this project and hope to work on similar ones in the near future.

metadat 3 months ago

My personal website has zero ads or anything malicious, 100% original content, yet it still doesn't show up in the first 5 pages when searching marginalia for `<first-name> <last-name>`. There are some marginally noteworthy folks who share my name, but it's basically some artist nobody cares about, a Floridian pest control expert who I regularly get mis-addressed email replies from, and a low-ranked pro football player with an unexceptional web presence nobody appears to care about. I thought I could at least eke out a #4 or #3, but no?

I imagined my ad-free, www-free, firstname-lastname domain would be welcome on marginalia, but it seems deranked just like what Google has inexplicably done since I bought the domain. Despite following best practices and webmaster console recommendations. A squatter owned the domain for some time before 2010, but there's no evidence of interesting nefarious activity.

My firstname-lastname GitHub account, linked from my site homepage, has a noteworthy number of open-source contributions to all sorts of projects.

P.s. marginalia_nu: Your Java and architectural design is beautiful, you're a true craftsman. <3 Best in show.

marginalia_nu 3 months ago

Could you email me the domain name to contact@marginalia-search.com (or share it here if you don't mind linking it to your account)? I'll have a look and see what's up.
> P.s. marginalia_nu: Your Java and architectural design is beautiful, you're a true craftsman. <3 Best in show
Well like any project this size (especially when it's a one man show) it has its warts and the design isn't entirely uncontroversial[1], but I'm pretty happy with it.
[1] Last time I posted about it someone likened it to https://www.youtube.com/watch?v=y8OnoxKotPQ , not entirely unfairly, though it does look like it does for very solid reasons that go beyond "netflix uses microservices", which is usually where things don't work out so well.
roywiggins 3 months ago

Are you sure you're in the index?
https://github.com/MarginaliaSearch/submit-site-to-marginali...
- metadat 3 months ago
  
  Do all sites manually submit themselves via PR? I'm happy to do it, never encountered this information previously.
  - marginalia_nu 3 months ago
    
    (Creator of the search engine) It's not required, they will be discovered eventually, though it's a way of speeding up the process, as the index is anything but exhaustive still.
    
    nonrandomstring 3 months ago
    
    Submitted manually more than once in the last year and still no listings.
    Completely understand that you'd be overwhelmed by bots, that personally curating sites is beyond human effort, and that you want to be choosy.
    But I've a site that is clearly non-commercial, definitely "marginal" interest, full of completely original human-generated content, yet sadly I've never seen a hit from Marginalia in the logs despite trying to get listed.
    Are you excluding certain content? Political? Anti-Bigtech? Sites "related to hacking"?
    
    marginalia_nu 3 months ago
    
    I'm blacklisting a few websites that are very spammy, like e-pharma spam and the like, and a small number of websites that are in very poor taste, but it should tell you that the domain is blocked if you use the site inspector[1].
    Either way, I'd love to investigate to see what's going on. Mind sharing the domain name? Either here, or if you don't want to doxx yourself, email contact@marginalia-search.com
    [1] https://marginalia-search.com/site
    
    nonrandomstring 3 months ago
    
    Thank you, I will DM you as suggested.
  - stebalien 3 months ago
    
    If and only if your site isn't already referenced by some other site in the index (see option A).
kristopolous 3 months ago

I'm up on the first page, in something from 17 years ago.
The internet is kind of like a fire. Most things go away but a few things miraculously survive for fairly unexpected and benign reasons.
- metadat 3 months ago
  
  I suppose the real question is: How heavily does Marginalia consider Google (and other primary indices) ranking? Because Google inexplicably hates my good-actor domain and continues to instead promote the likes of cyberciti[d0t]bjz and nixcrapt[d0t]c0m.
  - marginalia_nu 3 months ago
    
    Not looking at Google at all. All indexing is custom and in-house.
  - hcs 3 months ago
    
    I doubt that it considers Google at all, just submit your site.

black_puppydog 3 months ago

This is for English content only, right? I'm trying with some specifically French queries and it only gives me (relevant, but) English results.

marginalia_nu 3 months ago

Yeah, for now I'm focusing on doing English search well before I tackle additional languages. It's some effort to get it working well for any given language, and I don't think a search engine that doesn't work well is going to make anyone happy.

Llamamoe 3 months ago

I haven't had much luck using it as a search engine, but I ADORE using the "explore random websites" feature[1], and just browsing random personal blogs and sites, or kitsch neocity pages.

All of it feels so human and is a joy to view, in contrast to the raw garbage Google spews at me whenever I bother to use it.

[1] https://search.marginalia.nu/explore/random

rafaelgoncalves 3 months ago

whoa, didn't know this project, looks really incredible, will add it to my other search engines and give it a try, thanks to the devs/creators and hope for success to you all!

ajdude 3 months ago

I really like the search engine, and I am absolutely adding it to my rotation when searching for interesting things

ycombinete 3 months ago

I use it whenever I'm searching for cooking recipes. It's how I found the Cooking for Engineers forums.
https://www.cookingforengineers.com/forums

fsflover 3 months ago

Looks similar to https://wiby.me.

yurisalazar 3 months ago

Wow, this project has been going on for 3 years, amazing stuff, I love reading devlogs

vekker 3 months ago

Awesome! The name is just a bit funny if your native tongue is Dutch.

jorams 3 months ago

Is it? The word means the same in English as it does in Dutch.

modzu 3 months ago

very cool, but wish i could remember the name

biohcacker84 3 months ago

I've noticed an uptick of new search engines.

Likely because of the enshittification of google.

But what is most on my mind, is how the enshittification of google might have affected the global economy.....

Allow me to try and explain. Market efficiency is deeply tied to market information. And early google was fantastic at providing accurate and valuable information. It's hard to put into words, and I am not aware of any studies, but it felt like it sped up and/or made the whole economy more efficient.

And today it feels like that has been lost. And I wonder if there could ever be formal accounting to detect how much peak google contributed to.... not so much GDP, but how the economy functions. And compare to today.

marginalia_nu 3 months ago

I honestly think this is as much the enshittification of google as it is the enshittification of the web. Early google was great in large part because the early web had a much better signal to noise ratio.
Going on for well over a decade we've been putting far more slop on the web than useful content. It really adds up over time.
I also feel this is largely why the GPTs have been so successful. They're if nothing else a lot less annoying than the web search experience has become.
- plagiarist 3 months ago
  
  I agree, though I think Google and all the adtech companies share a lot of the blame for the structure of the modern web.
  The result is current web search returns static output from smaller models in the form of affiliate marketing listicles. Using a larger LLM directly on your query is simply better. There is a semantic/vector search happening with them such that there might be relevant terms or concepts even in a false answer. That's a step forward towards real information at least. That doesn't happen with the listicles.
- eitland 3 months ago
  
  > I honestly think this is as much the enshittification of google as it is the enshittification of the web. Early google was great in large part because the early web had a much better signal to noise ratio.
  Hi again! Lots of respect for your work.
  I considered this for a while, but I am now certain that it cannot be the whole explanation.
  Why? Because for me Kagi has consistently outperformed Google to the point where it feels like having Google 2009 back, just with some extra improvements: It finds the things I ask for and prioritises them well, so I rarely have to look at page 2 or anything.
rapnie 3 months ago

And DDG also enshittifies from its glory days of being an ethical alternative, where their AI prefers to e.g. explain a commercial service in their Assist section at the top, and not the common dictionary term.
- yegg 3 months ago
  
  For clarity, we have not changed anything with regards to our vision and there is currently no commercial intent infused into Assist, which can also be turned off. That said, I'd love to follow up on this example -- which query in particular?

BOOSTERHIDROGEN 3 months ago

I am aware that SearxNG has chosen to remove the engine; hopefully, a mutually beneficial agreement can be reached.

marginalia_nu 3 months ago

SearxNG integration didn't work great, in part because they had some issues where they were hitting the search engine like a shotgun with 10+ identical requests, but also because the API has a public key that's nearly unusable due to bot spam, meaning anyone who wants to route traffic to marginalia search needs to email me and get a key. Not a super convenient system.
I'm not opposed to seeing a return of the integration though it probably could do with some more design work.

cratermoon 3 months ago

I'm less than impressed with the quality of results. Sure, the sites linked are generally not commercial, although various stackexchange.com posts show up, as do a lot of wordpress.com and blogspot.com blogs. Some sketchy results seemed highly ranked, too, including conspiracy theorists and fringe ideas.