Levitating 2 days ago

I recently wrote an essay about this search engine, and its ranking algorithms.

Initially Marginalia used an interesting variant of PageRank discussed in the original paper, called Personalized PageRank.[1] Currently pages are ranked with BM25.
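
For readers unfamiliar with the difference, here is a minimal sketch of Personalized PageRank by power iteration. It is not Marginalia's code: the function name, the graph representation and the damping value of 0.85 are assumptions for illustration. The only change from plain PageRank is that the random surfer teleports back to a chosen seed set instead of to a uniformly random page.

```python
def personalized_pagerank(links, seeds, damping=0.85, iterations=50):
    """Power-iteration sketch of Personalized PageRank.

    links: dict mapping a node to the list of nodes it links to (hypothetical format).
    seeds: set of nodes the random surfer teleports back to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets} | set(seeds)
    # Teleport distribution: uniform over the seeds, zero everywhere else.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is handed back to the seeds.
        dangling = sum(rank[n] for n in nodes if not links.get(n))
        new_rank = {
            n: (1.0 - damping) * teleport[n] + damping * dangling * teleport[n]
            for n in nodes
        }
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Seeding with every node equally gives ordinary PageRank back, so the seed set is the whole "personalization".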

I think Personalized PageRank is still used for a new feature of Marginalia which is ranking pages based on similarity. I think this is already integrated into the website but there used to only be this testing page: https://explore2.marginalia.nu/

In any case I have a lot of respect for the creator. Marginalia has seen a lot of growth and it's been interesting reading the blogposts.[2]

[1]: https://www.marginalia.nu/log/26-personalized-pagerank/ [2]: https://www.marginalia.nu/log/

  • marginalia_nu 2 days ago

    I'm using PPR for domain rankings, but it's a very weak factor. It mostly affects the physical ordering of the results in the index, and given that queries have a timeout they'll execute for, it makes it so that higher ranking results are discovered first. Though in general, as I mentioned, this is a weak effect.
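
    A rough sketch of why index order matters under a timeout, assuming documents are stored best-ranked first. The function, its parameters and the injectable clock are hypothetical and are not taken from Marginalia's code base.

```python
import time


def search_with_budget(ranked_docs, matches, budget_seconds, clock=time.monotonic):
    """Scan documents in index order and stop once the time budget is spent.

    ranked_docs is assumed to be ordered by domain rank, best first, so
    whatever is found before the deadline comes from higher-ranked domains.
    """
    deadline = clock() + budget_seconds
    hits = []
    for doc in ranked_docs:
        if clock() >= deadline:
            break  # out of time: lower-ranked documents are never examined
        if matches(doc):
            hits.append(doc)
    return hits
```

    The ranking never reorders the hits here; it only decides which part of the index gets looked at before the query gives up.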

    Explore2 and the website discovery tools now built into the search engine are using cosine similarity of the incident link vectors. I wrote a blog post about the technique called "Creepy Website Similarity" available :-) https://www.marginalia.nu/log/69-creepy-website-similarity/
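
    To illustrate the idea, a minimal sketch assuming each domain is described by the set of domains that link to it, i.e. a binary incident link vector. The function names and data layout are made up for the example, and the blog post may weight things differently.

```python
import math


def cosine_similarity(inlinks_a, inlinks_b):
    """Cosine similarity of two binary link vectors, each given as a set of linking domains."""
    if not inlinks_a or not inlinks_b:
        return 0.0
    shared = len(inlinks_a & inlinks_b)
    return shared / math.sqrt(len(inlinks_a) * len(inlinks_b))


def most_similar(target, inlinks, top_n=3):
    """Rank the other domains by how much their set of linkers overlaps with the target's."""
    scored = [
        (cosine_similarity(inlinks[target], sources), domain)
        for domain, sources in inlinks.items()
        if domain != target
    ]
    scored.sort(reverse=True)
    return [(domain, score) for score, domain in scored[:top_n] if score > 0]
```

    Two sites come out as similar when the same third parties link to both, whether or not they link to each other.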

  • _emacsomancer_ 2 days ago

    where is the recent essay?

    • Levitating 2 days ago

      It was for a university assignment, I am not sure if I could or should share it.

      • hooli_gan a day ago

        If you don't share the task description or other materials provided by your university, I don't see the problem with sharing your essay. (Not a lawyer)

        • hmlwilliams a day ago

          It's a sure fire way to get your uni submission incorrectly flagged as plagiarism. How straightforward it is to prove you're the original author varies by institution.

          • em-bee a day ago

            once the essay is already graded, it should not matter. and less as time passes.

            on the contrary i thought any essay, paper or report written for uni or school can be considered for publication in some form. not everything is worth publishing of course, but if it is, talk to your supervisors about it.

            http://academia.stackexchange.com/questions/65166/ddg#65208

      • _emacsomancer_ a day ago

        Ah, that makes sense - didn't realize it was a uni thing.

marginalia_nu 2 days ago

(Creator here) I recently moved the website from search.marginalia.nu to marginalia-search.com and gave it a bit of a visual touch up, on the basis that I felt it was working too well to be a weird subdomain outgrowth off my blog.

It's still the same search engine :-)

  • stackghost a day ago

    Lately there's been some consternation in the small-web community about the aggressive crawling by OpenAI and friends.

    How often does yours (re)crawl its indexed sites?

    • marginalia_nu a day ago

      I do a big recrawl around once every 10 weeks, but I have an RSS crawler that does daily visits and fetches new items. But that should be very low volume traffic.

      • stackghost a day ago

        Interesting, does your RSS crawler auto-detect feeds? That's a neat solution I'd never considered.

        • marginalia_nu a day ago

          The big crawler tries to find feeds, and the RSS crawler acts on what the big one has found.
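
          One common way to find feeds while crawling is the autodiscovery convention of `<link rel="alternate">` tags in the page head. The sketch below assumes that convention; the class and function names are hypothetical, and Marginalia's crawler may well detect feeds differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            # Resolve relative hrefs against the URL of the page being crawled.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

          A daily feed crawler would then only need the stored list of feed URLs, which keeps its traffic small compared with a full recrawl.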

  • axiomdata316 a day ago

    It looks like the "random" button on the old version no longer works. Did you remove that feature?

    • marginalia_nu a day ago

      Hmm, I think it's just a broken redirect. It works if you search for browse:random . I'll try to fix it later today.

  • alberth a day ago

    Any reason why you don’t use a shorter domain name?

    (Amazing what you built, btw)

    • marginalia_nu a day ago

      Hard to find one that makes sense and isn't inscrutable like mxyzp.tlk.

      If I ever do find one that's appropriate I might set up a CNAME record to point to marginalia-search.com

      • alberth a day ago

        FYI - https://www.dynadot.com/market/search has a nice way you can filter their Aftermarket domains for sales by Character Count, TLD, No IDN/Number/Hyphens.

        I've found over the years plenty of <= 5 character domains that have a meaningful name just from using their search filters.

        (I'm not affiliated with them in any way)

disqard 2 days ago

If somebody hasn't been following this, it's one person's passion project, and it doesn't aim to compete with Google/Bing/DDG.

It's more of a way to find interesting things on the "small web".

Please keep that in mind when you check this out.

(I'm not the creator)

  • lolinder 2 days ago

    I don't know if something has changed since then, but in 2023 the author did quit their job to start working on the search engine full time [0]. So yes, it's a passion project, but it's also not just a side project.

    [0] https://www.marginalia.nu/log/83_full_time/

    • barbs 2 days ago

      That's really cool, although I'm curious about their plans to monetize the site, if any.

      • ChemTails 2 days ago

        According to the creator, $200/month will maintain operations. He doesn’t mention plans to monetize.

        • marginalia_nu 2 days ago

          The plan, as much as there is one, is to have very low costs, provide outsized value and somehow scrape by on grants and public donations. I'm currently set for about 1-2 more years of full time work (had 6 months' funds + a 1 year grant when I left my job 18 months ago).

          • mannycalavera42 2 days ago

            I simply LOVE and LOVE and LOVE marginalia. Keep rocking mate

  • eitland a day ago

    > If somebody hasn't been following this, it's one person's passion project, and it doesn't aim to compete with Google/Bing/DDG.

    That is correct.

    But I used[1] to find that in certain niches (Linux and open source software, history) I'd often get significantly better results with Marginalia than with Google or DDG.

    This seems to be related both to

    - the input stupidifier that seems to be in use with mainstream search engines (if I search for anything unusual it assumes I misspelled or mis-remembered and replaces my search with something I didn't search for before sending it to the backend)

    - and mainstream engines going out of their way to prefer corporate media over original, authoritative content

    > It's more of a way to find interesting things on the "small web".

    Yes, he keeps saying that, but it was[2] still not just interesting but useful for me.

    [1][2]: past tense because I switched to Kagi almost 3 years ago and now I don't have to maintain all the hacks (like separate search engines for separate niche topics) that I used to have. Full disclaimer: I know certain other people frequently want to contest this, saying they get great results with Google or bad results with Kagi, to which all I have to say is I have documented history of Google consistently failing even simple queries going more than a decade back and if Google works for you, more power to you, but it hasn't worked reliably for me for over a decade and I am fed up.

hatefulmoron 2 days ago

I don't have anything unique to say, but I love when this project comes up on HN. The project exemplifies what a patient and loving hand can do.

rossdavidh 2 days ago

Decided to repeat my last few google searches ("celery", "Wardian case", and "ferry from Italy to Greece", if you're curious), and it came up with reasonable answers as the top one in every case! I will give it a try.

benreesman 2 days ago

I don’t have much to add other than to say that this is the most pleasant and useful and wholesome thing I’ve seen on the web in a long time. The technology looks extremely solid and the experience is flawless.

To the author, thank you so much for this trip to the glory days of the Internet.

AndrewStephens a day ago

This is great - as a site owner I especially appreciated that you publish discovered backlinks. I discovered a few that I didn't know about thanks to your service.

  • Timwi 20 hours ago

    I looked at the backlinks for a website of mine and none of them actually linked to my site, so I was very confused...

    • marginalia_nu 8 hours ago

      Hmm, can you share the domain name? Either here or email me? I can look into it. (Though it's most likely suffix-list related, there are some outstanding issues in that regard.)

miki123211 2 days ago

I tried a few relatively simple terms in the areas I have an interest in, e.g. ("attention is all you need paper", "sip rfc", "klatt speech synthesizer", "crafting interpreters book", "Scott Alexander Substack"), and none of them actually produced the results I expected.

Marginalia seems to work okay-ish if you want to learn about something, but not if you want to find something.

If you want to read what people think about Scott Alexander's Substack, you will get some decent results, but not if you want to find the newsletter itself.

  • marginalia_nu a day ago

    Yeah navigational queries are a bit hit and miss, but I feel google does them very well as is so it's not a huge priority area. I've mostly been focusing my efforts on the stuff Google is bad at, finding human content about some topic.

black_puppydog a day ago

This is for English content only, right? I'm trying with some specifically French queries and it only gives me (relevant, but) English results.

  • marginalia_nu a day ago

    Yeah, for now I'm focusing on doing English search well before I tackle additional languages. It's some effort to get it working well for any given language, and I don't think a search engine that doesn't work well is going to make anyone happy.

metadat 2 days ago

My personal website has 0-ads or anything malicious, 100% original content, yet it still doesn't show up in the first 5 pages when searching marginalia for `<first-name> <last-name>'. There are some marginally noteworthy folks who share my name, but it's basically some artist nobody cares about, a Floridian pest control expert who I regularly get mis-addressed email replies from, and a low-ranked pro football player with an unexceptional web presence nobody appears to care about. I thought I could at least eke out a #4 or #3, but no?

I imagined my ad-free, www-free, firstname-lastname domain would be welcome on marginalia, but it seems deranked just like what Google has inexplicably done since I bought the domain. Despite following best practices and webmaster console recommendations. A squatter owned the domain for some time before 2010, but there's no evidence of interesting nefarious activity.

My firstname-lastname GitHub account, linked from my site homepage, has a noteworthy number of open-source contributions to all sorts of projects.

P.s. marginalia_nu: Your Java and architectural design is beautiful, you're a true craftsman. <3 Best in show.

  • marginalia_nu a day ago

    Could you email me the domain name to contact@marginalia-search.com (or share it here if you don't mind linking it to your account)? I'll have a look and see what's up.

    > P.s. marginalia_nu: Your Java and architectural design is beautiful, you're a true craftsman. <3 Best in show

    Well like any project this size (especially when it's a one man show) it has its warts and the design isn't entirely uncontroversial[1], but I'm pretty happy with it.

    [1] Last time I posted about it someone likened it to https://www.youtube.com/watch?v=y8OnoxKotPQ , not entirely unfairly, though it does look like it does for very solid reasons that go beyond "netflix uses microservices", which is usually where things don't work out so well.

  • roywiggins 2 days ago
    • metadat 2 days ago

      Do all sites manually submit themselves via PR? I'm happy to do it, never encountered this information previously.

      • marginalia_nu 2 days ago

        (Creator of the search engine) It's not required, they will be discovered eventually, though it's a way of speeding up the process, as the index is anything but exhaustive still.

        • nonrandomstring 2 days ago

          Submitted manually more than once in the last year and still no listings.

          Completely understand that you'd be overwhelmed by bots, that personally curating sites is beyond human effort, and that you want to be choosy.

          But I've a site that is clearly non-commercial, definitely "marginal" interest, full of completely original human-generated content, yet sadly I've never seen a hit from Marginalia in the logs despite trying to get listed.

          Are you excluding certain content? Political? Anti-Bigtech? Sites "related to hacking"?

          • marginalia_nu a day ago

            I'm blacklisting a few websites that are very spammy, like e-pharma spam and the like, and a small number of websites that are in very poor taste, but it should tell you that the domain is blocked if you use the site inspector[1].

            Either way, I'd love to investigate to see what's going on. Mind sharing the domain name? Either here, or if you don't want to doxx yourself, email contact@marginalia-search.com

            [1] https://marginalia-search.com/site

      • stebalien 2 days ago

        If and only if your site isn't already referenced by some other site in the index (see option A).

  • kristopolous 2 days ago

    I'm up on the first page, in some thing from 17 years ago.

    The internet is kind of like a fire. Most things go away but a few things miraculously survive for fairly unexpected and benign reasons.

    • metadat 2 days ago

      I suppose the real question is: How heavily does Marginalia consider Google (and other primary indices) ranking? Because Google inexplicably hates my good-actor domain and continues to instead promote the likes of cyberciti[d0t]bjz and nixcrapt[d0t]c0m.

      • marginalia_nu 2 days ago

        Not looking at Google at all. All indexing is custom and in-house.

      • hcs 2 days ago

        I doubt that it considers Google at all, just submit your site.

dang 2 days ago

Much discussed under its previous domain (marginalia.nu):

Phrase matching in Marginalia Search - https://news.ycombinator.com/item?id=41696046 - Sept 2024 (24 comments)

Marginalia: 3 Years - https://news.ycombinator.com/item?id=39501061 - Feb 2024 (44 comments)

Interview with Viktor Lofgren from Marginalia Search - https://news.ycombinator.com/item?id=38470832 - Nov 2023 (21 comments)

Moving Marginalia to a new server - https://news.ycombinator.com/item?id=37800753 - Oct 2023 (39 comments)

Marginalia.nu API - https://news.ycombinator.com/item?id=35871186 - May 2023 (22 comments)

Marginalia: DIY search engine that focuses on non-commercial content - https://news.ycombinator.com/item?id=35611923 - April 2023 (193 comments)

Marginalia Search has received an NLNet grant - https://news.ycombinator.com/item?id=34945541 - Feb 2023 (17 comments)

A Theoretical Justification (2021) - https://news.ycombinator.com/item?id=32586273 - Aug 2022 (22 comments)

The Evolution of Marginalia's Crawling - https://news.ycombinator.com/item?id=32565052 - Aug 2022 (22 comments)

Marginalia Goes Open Source - https://news.ycombinator.com/item?id=31536626 - May 2022 (72 comments)

Uncertain Future for Marginalia Search - https://news.ycombinator.com/item?id=31200319 - April 2022 (37 comments)

Marginalia Search: 1 Year - https://news.ycombinator.com/item?id=30823481 - March 2022 (29 comments)

Show HN: Marginalia – Exploration Mode - https://news.ycombinator.com/item?id=30047455 - Jan 2022 (53 comments)

A search engine that favors text-heavy sites and punishes modern web design - https://news.ycombinator.com/item?id=28550764 - Sept 2021 (717 comments)

https://news.ycombinator.com/user?id=marginalia_nu

  • Funes- 2 days ago

    Hell yeah. The very first one was submitted by yours truly. I love this project and hope to work on similar ones in the near future.

Llamamoe 2 days ago

I haven't had much luck using it as a search engine, but I ADORE using the "explore random websites" feature[1], and just browsing random personal blogs and sites, or kitsch neocity pages.

All of it feels so human and is a joy to view, in contrast to the raw garbage Google spews at me whenever I bother to use it.

[1] https://search.marginalia.nu/explore/random

rafaelgoncalves 2 days ago

whoa, didn't know this project, looks really incredible, will add it to my other search engines and give it a try, thanks to the devs/creators and hope for success to you all!

ajdude 2 days ago

I really like the search engine, and I am absolutely adding it to my rotation when searching for interesting things

yurisalazar 2 days ago

Wow, this project has been going on for 3 years, amazing stuff, I love reading devlogs

vekker 2 days ago

Awesome! The name is just a bit funny if your native tongue is Dutch.

  • jorams 2 days ago

    Is it? The word means the same in English as it does in Dutch.

biohcacker84 a day ago

I've noticed an uptick of new search engines.

Likely because of the enshittification of google.

But what is most on my mind, is how the enshittification of google might have affected the global economy.....

Allow me to try and explain. Market efficiency is deeply tied to market information. And early google was fantastic at providing accurate and valuable information. It's hard to put into words, and I am not aware of any studies, but it felt like it sped up and/or made the whole economy more efficient.

And today it feels like that has been lost. And I wonder if there could ever be formal accounting to detect how much peak google contributed to.... not so much GDP, but how the economy functions. And compare to today.

  • marginalia_nu a day ago

    I honestly think this is as much the enshittification of google as it is the enshittification of the web. Early google was great in large part because the early web had a much better signal to noise ratio.

    Going on for well over a decade we've been putting far more slop on the web than useful content. It really adds up over time.

    I also feel this is largely why the GPTs have been so successful. They're if nothing else a lot less annoying than the web search experience has become.

    • plagiarist a day ago

      I agree, though I think Google and all the adtech companies share a lot of the blame for the structure of the modern web.

      The result is current web search returns static output from smaller models in the form of affiliate marketing listicles. Using a larger LLM directly on your query is simply better. There is a semantic/vector search happening with them such that there might be relevant terms or concepts even in a false answer. That's a step forward towards real information at least. That doesn't happen with the listicles.

    • eitland a day ago

      > I honestly think this is as much the enshittification of google as it is the enshittification of the web. Early google was great in large part because the early web had a much better signal to noise ratio.

      Hi again! Lots of respect for your work.

      This thing I considered for a while but I am now certain that it cannot be the whole explanation.

      Why? Because for me Kagi has consistently outperformed Google to the point where it feels like having Google 2009 back, just with some extra improvements: It finds the things I ask for and prioritises them well, so I rarely have to look at page 2 or anything.

  • rapnie a day ago

    And DDG also enshittifies from its glory days of being an ethical alternative, where their AI prefers to e.g. explain a commercial service in their Assist section at the top, and not the common dictionary term.

    • yegg a day ago

      For clarity, we have not changed anything with regards to our vision and there is currently no commercial intent infused into Assist, which can also be turned off. That said, I'd love to follow up on this example -- which query in particular?

BOOSTERHIDROGEN 2 days ago

I am aware that SearxNG has chosen to remove the engine; hopefully, a mutually beneficial agreement can be reached.

  • marginalia_nu 2 days ago

    SearxNG integration didn't work great, in part because they had some issues where they were hitting the search engine like a shotgun with 10+ identical requests, but also because the API has a public key that's nearly unusable due to bot spam, meaning anyone who wants to route traffic to marginalia search needs to email me and get a key. Not a super convenient system.

    I'm not opposed to seeing a return of the integration though it probably could do with some more design work.

modzu 2 days ago

very cool, but wish i could remember the name

cratermoon 2 days ago

I'm less than impressed with the quality of results. Sure, the sites linked are generally not commercial, although various stackexchange.com posts show up, as do a lot of wordpress.com and blogspot.com blogs. Some sketchy results seemed highly ranked, too, including conspiracy theorists and fringe ideas.