How does Lemmy work with search engines?

melonpunk@lemmy.world · 1 year ago

How does Lemmy work with search engines?

wpuckering@lm.williampuckering.com · edit-2 1 year ago

There’s a lot of things that factor into the answer, but I think overall it’s gonna be pretty random. Some instances are on domains without “Lemmy” in the name, some don’t include “Lemmy” in the site name configuration, and in the case of some like my own instance, I set the X-Robots-Tag response header such that search engines that properly honor the header won’t crawl or index content on my instance. I’ve actually taken things a step further with mine and put all public paths except for the API endpoints behind authentication (so that Lemmy clients and federation still work with it), so you can’t browse my instance content without going through a proper client for extra privacy. But that goes off-topic.
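For anyone who wants to try the header approach described above, a minimal Nginx sketch might look like the following. It assumes Nginx sits in front of the instance as a reverse proxy. The hostname is a placeholder, and this is not necessarily how the instance above is actually configured.

```nginx
server {
    # Placeholder hostname (an assumption, not taken from the thread)
    server_name lemmy.example.com;

    # Ask well-behaved crawlers not to index or follow anything served here.
    # "always" attaches the header to error responses as well as successful ones.
    add_header X-Robots-Tag "noindex, nofollow" always;

    # ... existing location blocks go here ...
}
```

Note that an add_header directive inside a nested location block replaces the ones inherited from the server level, so the header may need repeating there. As with any crawler hint, it only affects bots that choose to honor it.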

Reddit was centralized, so it could be optimized for SEO. Lemmy instances are individually run, with different configuration at both the infrastructure level and the application level. If most people leave things fairly vanilla, that should result in pretty good discovery of Lemmy content across those instances. But I would think most people technical enough to host their own instances will have deviated from the defaults and (hopefully) implemented some hardening, which would likely mess with SEO.

So yeah, expect it to be pretty random, but not necessarily unworkable.

OrangeSlice@lemmy.ml · 1 year ago

Easily the best answer here, I think the people who think it will work “just like Reddit” are unfamiliar with federation still, and aren’t used to thinking things through in those terms.

Not to mention that Google results in general have been pretty trash for a couple years now. I don’t expect fediverse content to be prominent for some time unless there is a dedicated service that indexes everything.

fizzym4d@lemmy.fmhy.ml · 1 year ago

Your “off-topic” sounded pretty cool to me! I love that that is something anyone can do when hosting a lemmy instance. You get to choose if it’s searchable on the web! Obviously there are search engines which ignore the no scraping/indexing header, but the rest of what you did should counteract that, noice.

wpuckering@lm.williampuckering.com · edit-2 1 year ago

Yeah, if you’re running something yourself, you can do pretty much whatever you want in order to protect it. Especially if it’s behind a reverse proxy. Firewalls are great for protecting ports, but reverse proxies can be their own form of protection, and I don’t think a lot of people associate them with “protection” so much. Why expose paths (unauthenticated) that don’t need to be? For instance, in my case with my Lemmy instance, all any other instance needs is access to the /api path which I leave open. And all the other paths are behind basic authentication which I can access, so I can still use the Lemmy web interface on my own instance if I want to. But if I don’t want others browsing to my instance to see what communities have been added, or I don’t want to give someone an easy glance into what comments or posts my profile has made across all instances (for a little more privacy), then I can simply hide that behind the curtain without losing any functionality.

It’s easy to think of these things when you have relevant experience with web development, debugging web applications, full stack development, and subject matter knowledge in related areas, if you have a tendency to approach things with a security-oriented mindset. I’m not trying to sound arrogant, but honestly my professional experience has a lot to do with how my personal habits have formed around my hobbies. So I have a tendency to take things as far as I can with everything that I know, and stuff like this is the result lol. Might be totally unnecessary without much actual value, but it errs on the side of “a little more secure”, and why not, if it’s fun?

Arinshot@lemmy.world · 1 year ago

I’d be interested in how you did this, this seems like one of the best ways I’ve seen for securing a lemmy instance.

wpuckering@lm.williampuckering.com · 1 year ago

I have a single Nginx container that handles reverse proxying of all my selfhosted services, and I break every service out into its own configuration file, and use include directives to share common configuration across them. For anyone out there with Nginx experience, my Lemmy configuration file should make it fairly clear in terms of how I handle what I described above:

server {
  include ssl_common.conf;
  server_name lm.williampuckering.com;
  set $backend_client lemmy-ui:1234;
  set $backend_server lemmy-server:8536;
  
  location / {
    set $authentication "Authentication Required";
    include /etc/nginx/proxy_nocache_backend.conf;
    
    if ($http_accept = "application/activity+json") {
      set $authentication off;
      set $backend_client $backend_server;
    }
    if ($http_accept = "application/ld+json; profile=\"https://www.w3.org/ns/activitystreams\"") {
      set $authentication off;
      set $backend_client $backend_server;
    }
    if ($request_method = POST) {
      set $authentication off;
      set $backend_client $backend_server;
    }
    
    auth_basic $authentication;
    auth_basic_user_file htpasswd;
    proxy_pass http://$backend_client;
  }
  
  location ~* ^/(api|feeds|nodeinfo|\.well-known) {
    include /etc/nginx/proxy_nocache_backend.conf;
    proxy_pass http://$backend_server;
  }
  
  location ~* ^/pictrs {
    proxy_cache lemmy_cache;
    include /etc/nginx/proxy_cache_backend.conf;
    proxy_pass http://$backend_server;
  }
  
  location ~* ^/static {
    proxy_cache lemmy_cache;
    include /etc/nginx/proxy_cache_backend.conf;
    proxy_pass http://$backend_client;
  }
  
  location ~* ^/css {
    proxy_cache lemmy_cache;
    include /etc/nginx/proxy_cache_backend.conf;
    proxy_pass http://$backend_client;
  }
}

It’s definitely in need of some clean-up (for instance, there’s no need for multiple location blocks that do the same thing for caching; a single expression can handle all of the ones with identical configuration to reduce the number of lines required), but I’ve been too lazy to clean things up. However, it should serve as a good example and communicate the general idea of what I’m doing.
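As a rough sketch of that clean-up, the two blocks that share identical settings (/static and /css, both proxied to $backend_client) could be merged into one regex location. The /pictrs block proxies to $backend_server instead, so it would stay separate. This assumes the same variables, cache zone, and include file as the configuration above:

```nginx
# One block replaces the separate /static and /css locations,
# which had identical cache settings and the same upstream.
location ~* ^/(static|css) {
  proxy_cache lemmy_cache;
  include /etc/nginx/proxy_cache_backend.conf;
  proxy_pass http://$backend_client;
}
```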

maynarkh@feddit.nl · 1 year ago

One easy way to do that is to set up something like Nginx as a reverse proxy in front and forward /api clean, but forward everything else with basic auth.

The steps broadly would be:

1. Set up an Nginx instance
2. Set up a block in Nginx to proxy / to your Lemmy instance
3. Set up basic auth on that block
4. Set up a smaller block that will only proxy calls to /api and other endpoints you want public, like previously with /
5. Make your Lemmy instance unreachable from the broader internet, e.g. if you’re on a single server, make it listen on 127.0.0.1 instead of 0.0.0.0, but make sure Nginx can still reach it

And you’re done.
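The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a drop-in config: the hostname and htpasswd path are placeholders, TLS setup is omitted, and the ports assume lemmy-ui on 1234 and the Lemmy backend on 8536, both bound to 127.0.0.1 on the same server.

```nginx
server {
    listen 80;
    server_name lemmy.example.com;  # placeholder hostname (assumption)

    # Public endpoints: forwarded to the Lemmy backend without authentication
    location ~ ^/(api|feeds|nodeinfo|\.well-known) {
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:8536;  # assumed backend address
    }

    # Everything else: web UI behind basic auth
    location / {
        auth_basic "Authentication Required";
        auth_basic_user_file /etc/nginx/htpasswd;  # placeholder path
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:1234;  # assumed lemmy-ui address
    }
}
```

This simplified version does not special-case ActivityPub requests (Accept: application/activity+json) or POSTs that arrive on other paths, which the longer configuration earlier in the thread routes to the backend without authentication. Federation may need that extra handling.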

Icarus@lemmy.ml · 1 year ago

I usually don’t see lemmy on search engines, sadly.

Monkey With A Shell@lemmy.socdojo.com · 1 year ago

A lot of search engines rely on backlinks to rank the reliability/validity of a site, so even if a given instance was picked up, getting enough places to reference it for it to be seen as a valid source would be a pretty heavy lift.

Xylight (Photon dev)@lemmy.xylight.dev · 1 year ago

Unfortunately Lemmy isn’t great for SEO because lemmy-ui heavily relies on JavaScript to render the page, which search bots avoid.

ccx@sopuli.xyz · 1 year ago

One thing I’d love to see, and which would probably help quite a lot with searchability, is for blog and CMS software to integrate a “discuss on Fediverse” button instead of having a dedicated comment system.

It could bring up possible communities based on blogpost/article tags. And since Lemmy supports pingbacks, the system would know about the discussion threads and could even show the last few posts from each.

To me it seems like a win/win situation for all parties involved.

Kresten@feddit.dk · 1 year ago

There’s an issue on GitHub about the topic https://github.com/LemmyNet/lemmy-ui/issues/1285

kadu@lemmy.world · 1 year ago

One thing to keep in mind is that Google currently penalizes links that don’t end in the common top domains like “.com”, “.org” and similar. So something like lemmy.world, if indexed, will rank lower than a site ending in .com with the same keyword density.

Briongloid@aussie.zone · 1 year ago

Google went from being the most important website on the internet to being more and more useless, it’s amazing seeing such a massive company go downhill. But they have so much money that they’ll be able to stay big forever from capital alone.

gun@lemmy.ml · 1 year ago

What do you use as a search engine instead of Google? I feel like I’ve tried everything, but always end up back at Google search.

Ministar@lemmy.world · 1 year ago

Been using Ecosia and so far it’s been very good. I have not needed to use Google once.

Sploosh the Water@vlemmy.net · 1 year ago

Would be cool to have a browser extension that can return searched terms from fediverse sources that you choose.

Ben@lemmy.ml · 1 year ago

Actually, the point is that if someone searched the internet for ‘fediverse sources’, they wouldn’t find a relevant thread on lemmy.world, or lemmy.ml, or whatever.

ccx@sopuli.xyz · 1 year ago

Making it a Searx plugin would probably do better in terms of making it accessible to a lot of people.

I wonder how good various ActivityPub instances are at searching. Having pregenerated fulltext indexes of public content available for download could go a long way toward making it easy and fast to build a search engine.

Ben@lemmy.ml · 1 year ago

I actually added a custom search engine to Firefox… so I can search something on Lemmy. I have the keyword ‘LW’ for Lemmy.World search right now (because Lemmy.ml was offline a while).

Basically, run a Lemmy search with a placeholder term such as ssss, then replace ssss with %s in the resulting URL and copy the entire link. https://lemmy.world/search/q/%s/type/All/sort/TopAll/listing_type/All/community_id/0/creator_id/0/page/1

Then, using the ‘add custom search engine’ extension on Firefox, you add it.