Happy Friday (Finally!)

Admiral Patrick@dubvee.org · 2 days ago

Ban ~~USA~~ politics from this sub please

Admiral Patrick@dubvee.org · 4 days ago

1080p buffered generously but it worked :) The sweet spot was having it transcode to 720p (yay hardware acceleration). I wasn’t sharing it with anyone at the time, so it was just me watching at work on one phone while using my second phone at home for internet.

Admiral Patrick@dubvee.org · 4 days ago

Just about anything as long as you don’t need to serve it to hundreds of people simultaneously. Hell, I once hosted Jellyfin over a 3G hotspot and it managed.

Pretty much any web-based app will work fine. Streaming servers (Emby, Plex, Jellyfin, etc) work fine for a few simultaneous people as long as you’re not trying to push 4K or something. 1080p can work fine at 4 Mbps or less (transcoding is your friend here). Chat servers (Matrix, XMPP, etc) are also a good candidate.

I hosted everything I wanted with 30 Mbps upload before I got symmetric fiber.

Admiral Patrick@dubvee.org · 6 days ago

Happy Friday (Finally!)

Admiral Patrick@dubvee.org · edit-2 6 days ago

Maybe I should flesh it out into an actual guide. The Nepenthes docs are “meh” at best and completely gloss over integrating it into your stack.

You’ll also need to give it corpus text to generate slop from. I used transcripts from 4 or 5 weird episodes of Voyager (let’s be honest: shit got weird on Voyager lol), mixed with some Jack Handey quotes and a few transcripts of Married…with Children episodes.

https://content.dubvee.org/ is where that bot traffic lands up if you want to see what I’m feeding them.

Admiral Patrick@dubvee.org · 6 days ago

Thanks!

Mostly there are three steps involved:

1. Set up Nepenthes to receive the traffic.
2. Perform bot detection on inbound requests (I use a regex list, and one is provided below).
3. Configure traffic rules in your load balancer / reverse proxy to send the detected bot traffic to Nepenthes instead of the actual backend for the service(s) you run.

Here’s a rough guide I commented a while back: https://dubvee.org/comment/5198738

Here’s the post link at lemmy.world which should have that comment visible: https://lemmy.world/post/40374746

You’ll have to resolve my comment link on your instance since my instance is set to private now, but in case that doesn’t work, here’s the text of it:

So, I set this up recently and agree with all of your points about the actual integration being glossed over.

I already had bot detection set up in my Nginx config, so adding Nepenthes was just a matter of changing what it does on a match. Previously, I had just returned either 404 or 444 to those requests, but now it redirects them to Nepenthes.
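
For reference, a minimal sketch of what the earlier drop-instead-of-redirect behavior could look like, assuming the same $ua_disallowed variable that the map file further down defines. This is a guess at the shape of it, not the actual earlier config:

```nginx
# Hypothetical earlier version of the action file: drop bot requests outright.
if ($ua_disallowed) {
    # 444 is an nginx-specific code: close the connection without sending any response.
    return 444;
}
```

Swapping `return 444;` for `return 404;` would send a normal "not found" response instead of silently closing the connection.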

Rather than trying to do rewrites and pretend the Nepenthes content is under my app’s URL namespace, I just do a redirect which the bot crawlers tend to follow just fine.

There are several parts to this, split up to keep my config sane. The first two are include files.

1. An include file that looks at the user agent, compares it to a list of bot UA regexes, and sets a variable to either 0 or 1. By itself, that include file doesn’t do anything more than set that variable. This allows me to have it as a global config without having it apply to every virtual host.
2. An include file that performs the action if the variable is set to true. This has to be included in the server portion of each virtual host where I want the bot traffic to go to Nepenthes. If this isn’t included in a virtual host’s server block, then bot traffic is allowed.
3. A virtual host where the Nepenthes content is presented. I run a subdomain (content.mydomain.xyz). You could also do this as a path off of your protected domain, but this works for me and keeps my already complex config from getting any worse. Plus, it was easier to integrate into my existing bot config. Had I not already had that, I would have run it off of a path (and may go back and do that when I have time to mess with it again).

The map-bot-user-agents.conf is included in the http section of Nginx and applies to all virtual hosts. You can either include this in the main nginx.conf or at the top (above the server section) in your individual virtual host config file(s).

The deny-disallowed.conf is included individually in each virtual host’s server section. Even though the bot detection is global, if the virtual host’s server section does not include the action file, then nothing is done.
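
As a rough illustration of where the two includes go, here is a skeleton of the http block. The snippet paths, server names, and backend addresses are placeholders and not from the original setup; TLS and the rest of nginx.conf (events block, etc.) are left out for brevity:

```nginx
http {
    # Global: only sets $ua_disallowed to 0 or 1, takes no action by itself.
    include /etc/nginx/snippets/map-bot-user-agents.conf;   # assumed path

    # Protected virtual host: bots get redirected.
    server {
        listen 80;
        server_name app.mydomain.xyz;                       # hypothetical name

        # Opt this vhost in to the bot action.
        include /etc/nginx/snippets/deny-disallowed.conf;   # assumed path

        location / {
            proxy_pass http://127.0.0.1:8080;               # hypothetical backend
        }
    }

    # Unprotected virtual host: no action file included, so bot traffic is allowed.
    server {
        listen 80;
        server_name other.mydomain.xyz;                     # hypothetical name

        location / {
            proxy_pass http://127.0.0.1:8081;               # hypothetical backend
        }
    }
}
```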

Files

map-bot-user-agents.conf

Note that I’m treating Google’s crawler the same as an AI bot because…well, it is. They’re abusing their search position by double-dipping on the crawler so you can’t opt out of being crawled for AI training without also preventing it from crawling you for search engine indexing. Depending on your needs, you may need to comment that out. I’ve also commented out the Python requests user agent. And forgive the mess at the bottom of the file. I inherited the seed list of user agents and haven’t cleaned up that massive regex one-liner.

# Map bot user agents
## Sets the $ua_disallowed variable to 0 or 1 depending on the user agent. Non-bot UAs are 0, bots are 1

map $http_user_agent $ua_disallowed {
    default 		0;
    "~PerplexityBot"	1;
    "~PetalBot"		1;
    "~applebot"		1;
    "~compatible; zot"	1;
    "~Meta"		1;
    "~SurdotlyBot"	1;
    "~zgrab"		1;
    "~OAI-SearchBot"	1;
    "~Protopage"	1;
    "~Google-Test"	1;
    "~BacklinksExtendedBot" 1;
    "~microsoft-for-startups" 1;
    "~CCBot"		1;
    "~ClaudeBot"	1;
    "~VelenPublicWebCrawler"	1;
    "~WellKnownBot"	1;
    #"~python-requests"	1;
    "~bitdiscovery"	1;
    "~bingbot"		1;
    "~SemrushBot" 	1;
    "~Bytespider" 	1;
    "~AhrefsBot" 	1;
    "~AwarioBot"	1;
#    "~Poduptime" 	1;
    "~GPTBot" 		1;
    "~DotBot"	 	1;
    "~ImagesiftBot"	1;
    "~Amazonbot"	1;
    "~GuzzleHttp" 	1;
    "~DataForSeoBot" 	1;
    "~StractBot"	1;
    "~Googlebot"	1;
    "~Barkrowler"	1;
    "~SeznamBot"	1;
    "~FriendlyCrawler"	1;
    "~facebookexternalhit" 1;
    "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfly|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTTrack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErsPRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^QueryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanner|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^VoidEYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;

}

deny-disallowed.conf

# Deny disallowed user agents
if ($ua_disallowed) {
    # This redirects them to the Nepenthes domain. So far, pretty much all the bot crawlers have been happy to accept the redirect and crawl the tarpit continuously.
    return 301 https://content.mydomain.xyz/;
}
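
The third piece, the virtual host that fronts Nepenthes, isn't shown above. A minimal sketch of what it might look like follows. The listen address for Nepenthes (127.0.0.1:8893) and the certificate paths are assumptions, so check them against your own Nepenthes config and TLS setup:

```nginx
# Hypothetical virtual host for the tarpit subdomain.
server {
    listen 443 ssl;
    server_name content.mydomain.xyz;

    ssl_certificate     /etc/ssl/example/fullchain.pem;   # placeholder path
    ssl_certificate_key /etc/ssl/example/privkey.pem;     # placeholder path

    # deny-disallowed.conf is deliberately NOT included here:
    # redirecting bots from this vhost back to itself would loop.

    location / {
        proxy_pass http://127.0.0.1:8893;                 # assumed Nepenthes address
        proxy_set_header X-Forwarded-For $remote_addr;    # pass the client IP along
        proxy_buffering off;                              # relay the slow responses as they arrive
    }
}
```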

Admiral Patrick@dubvee.org · 6 days ago

I was blocking them but decided to shunt their traffic to Nepenthes instead. There’s usually 3-4 different bots thrashing around in there at any given time.

If you have the resources, I highly recommend it.

Admiral Patrick@dubvee.org · 6 days ago

Most of the requirements are going to be for the database, and that depends on:

How many active users you expect
How many large rooms you or your users join

I left many of the large Matrix spaces I was in, and mine is now mostly just 1:1 chats or a group chat with a handful of friends. Given that low-usage case, I can run my server on a Pi 3 with 4 GB of RAM quite comfortably. I don’t do that in practice, but I do have that set up as a backup server - it periodically syncs the database from my main server - and it works fine. The bottleneck there, really, is the SD card storage since I didn’t want an external SSD hanging off of it.

Even when I was active in several large Matrix spaces/rooms, a USFF Optiplex with a quad core i5, 8 GB of RAM, and a 500GB SSD was more than enough to run it comfortably alongside some other services like LibreTranslate.

Admiral Patrick@dubvee.org · 7 days ago

Other Republican politicians, including former President Donald Trump, also criticized the show as inappropriate.

If only
USA Today needs to either use a better model or just get rid of the AI-generated key point summary.

Admiral Patrick@dubvee.org · 7 days ago

Orphan Black: Live

Admiral Patrick@dubvee.org · edit-2 8 days ago

Somewhere between 35 and 39, but yeah. Not sure how old she was when we got her (fully grown), but I was 5 or 6 then and was 40 when she passed. Have to assume it was just old age. Always called her “Horse, of Course” lol

Admiral Patrick@dubvee.org · 8 days ago

Sorry to hear. How old was he? My family had a horse since I was like 5 or 6. She hated being ridden but would follow you around like a dog. She died year-before-last at, I believe, age 39.

Admiral Patrick@dubvee.org · edit-2 8 days ago

I prefer sans-serif fonts visually but prefer serif for readability. So I use Atkinson Hyperlegible which is a mish-mash of both.

And bonus meme:

Admiral Patrick@dubvee.org · 12 days ago

Atkinson Hyperlegible is my new jam. I’m dyslexic and it helps tremendously even though that’s not its primary goal. It also looks a lot better than OpenDyslexic which I used to use.

Loaded “Hyperlegible” onto my Kobo, the reader app on my phone, and set it as the default font on my desktop environment.

Also added it as an option in Tesseract UI (which I swear I’ll be releasing “soon”).

Admiral Patrick@dubvee.org · edit-2 14 days ago

Kinda like in recent Trek series where, when an episode begins with a day in the limelight of a non-main character, it’s basically their eulogy.

DIS: Lt. ~~Miriam~~ Airiam. Stupid autocorrect.
SNW: Ensign Gamble

Admiral Patrick@dubvee.org · 14 days ago

I made my own smart outlets with an ESP-01, dual relay board, and ESPHome. Also made some temp/humidity sensors as well as a 20x4 text display. All powered by a bunch of ESP-01s I bought cheap and in bulk from Ali and programmed with ESPHome, which handles most of the work of interfacing with the components as well as the HomeAssistant integration.

https://esphome.io/

Admiral Patrick@dubvee.org · edit-2 14 days ago

Basically the only thing you want to present with a challenge is the paths/virtual hosts for the web frontends.

Anything under /api/v3/ is the client-to-server API (i.e. how your client talks to your instance) and needs to be obstruction-free. Otherwise, clients/apps won’t be able to use the API. Same for /pictrs since that proxies through Lemmy and is a de-facto API endpoint (even though it’s a separate component).

Federation traffic also needs to be exempt, but it’s identified by the HTTP Accept request header and request method rather than by route.

Looking at the Nginx proxy config, there’s this mapping which tells Nginx how to route inbound requests:

nginx_internal.conf: https://raw.githubusercontent.com/LemmyNet/lemmy-ansible/main/templates/nginx_internal.conf

    map "$request_method:$http_accept" $proxpass {
        # If no explicit matches exists below, send traffic to lemmy-ui
        default "http://lemmy-ui:1234/";

        # GET/HEAD requests that accepts ActivityPub or Linked Data JSON should go to lemmy.
        #
        # These requests are used by Mastodon and other fediverse instances to look up profile information,
        # discover site information and so on.
        "~^(?:GET|HEAD):.*?application\/(?:activity|ld)\+json" "http://lemmy:8536/";

        # All non-GET/HEAD requests should go to lemmy
        #
        # Rather than calling out POST, PUT, DELETE, PATCH, CONNECT and all the verbs manually
        # we simply negate the GET|HEAD pattern from above and accept all possibly $http_accept values
        "~^(?!(GET|HEAD)).*:" "http://lemmy:8536/";
    }
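
Putting those exemptions together, here is a rough sketch of how the routing could look with a challenge service in front of the web frontend. It reuses the two regexes from the mapping above. Everything else (the upstream names, the addresses, the server name, and the existence of a separate challenge proxy on port 8923) is a placeholder for illustration, not part of the Lemmy config:

```nginx
# Goes inside the http block. All names and addresses below are hypothetical.
upstream lemmy_backend   { server 127.0.0.1:8536; }   # Lemmy API / federation
upstream challenge_proxy { server 127.0.0.1:8923; }   # challenge service sitting in front of lemmy-ui

# Decide where requests that are not under an API path should go.
map "$request_method:$http_accept" $frontend_target {
    # Everything else is assumed to be a browser hitting the web frontend: challenge it.
    default "http://challenge_proxy";

    # Federation lookups (GET/HEAD asking for ActivityPub or Linked Data JSON): no challenge.
    "~^(?:GET|HEAD):.*?application\/(?:activity|ld)\+json" "http://lemmy_backend";

    # All non-GET/HEAD requests: no challenge.
    "~^(?!(GET|HEAD)).*:" "http://lemmy_backend";
}

server {
    listen 80;
    server_name lemmy.example.com;

    # Client-to-server API: must stay obstruction-free.
    location /api/v3/ {
        proxy_pass http://lemmy_backend;
        proxy_set_header Host $host;
    }

    # pict-rs proxies through Lemmy and is effectively an API endpoint too.
    location /pictrs/ {
        proxy_pass http://lemmy_backend;
        proxy_set_header Host $host;
    }

    # Everything else is routed by the map above.
    location / {
        proxy_pass $frontend_target;
        proxy_set_header Host $host;
    }
}
```

The map values point at named upstream groups so that nginx can resolve the variable-based proxy_pass without needing a resolver directive.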

Admiral Patrick@dubvee.org · 14 days ago

It kinda can but not as easily.

Back when I just downloaded everything under the sun on Napster/Limewire, I’d make highly curated CDs of known-hits as well as ones where I sprinkle in some random songs that were in my downloads that I’d never heard before. Not exactly the same, but I’ve definitely listened to a CD I made and been like “what’s that song?! I love it!”.

Plus, for road trips, everyone would usually burn a CD or two of their own to swap in (a precursor to “pass the aux cord”) so there was some novelty/variety.

Admiral Patrick@dubvee.org · 14 days ago

Nothing hits better on a drive than a good mixed CD. Even making a playlist on your phone, which is basically the same thing, is totally not the same.

Admiral Patrick@dubvee.org · 15 days ago

Yeah, I think it was changed in Win 10 (or maybe 8/8.1?)

Admiral Patrick@dubvee.org · edit-2 15 days ago

Finally, I’m not the only one noticing this.

I’ve long said that the moment “My Computer” changed to “This PC”, it showed how MS really thought of your computer as theirs that they so graciously allow you to use once in a while.