Long time postgres developer, working at Microsoft. Account about tech, not politics. For the latter look to @AndresFreundPol
Public Key
npub1ly44p7gfxnqm237hpxc8dynusdz4jfvtqrh5nmgrwcrsxkmz5n6q6gks2j
Profile Code
nprofile1qqs0j26slyynfsd4gltsnvrkjf7gx32eyk9spm6fa5phvpcrtd32faqpzemhxue69uhhyetvv9ujuurjd9kkzmpwdejhg2kqtav
Author Public Key
npub1ly44p7gfxnqm237hpxc8dynusdz4jfvtqrh5nmgrwcrsxkmz5n6q6gks2j
Published at
2023-11-21T05:06:31+01:00
Event JSON
{
  "id": "b687d686b3150c3bf7cf6381f2c7bed5ffa459ed363d4ba9d7c302839b02b331",
  "pubkey": "f92b50f90934c1b547d709b076927c834559258b00ef49ed037607035b62a4f4",
  "created_at": 1700539591,
  "kind": 0,
  "tags": [
    [
      "proxy",
      "https://mastodon.social/users/AndresFreundTec",
      "activitypub"
    ]
  ],
  "content": "{\"name\":\"AndresFreundTec\",\"about\":\"Long time postgres developer, working at Microsoft.\\n\\nAccount about tech, not politics. For the latter look to @AndresFreundPol\",\"picture\":\"https://files.mastodon.social/accounts/avatars/109/362/110/832/715/599/original/6a9a410580be97af.jpg\",\"nip05\":\"[email protected] \"}",
  "sig": "ff5bc57290d541655dd6175b59dca680832eefa6e8376fbeeea46cfb330d449b1ca9a561f19187ce425e71905b188f6b7df580627243ab57107d180b0e362b45"
}
Last Notes

AndresFreundTec @npub1hxa…j94a They either train as PCIe 4, don't work at all, or also have AER errors. So I suspect it's a mainboard/firmware issue :/

AndresFreundTec @npub1h59…waea Thanks.

AndresFreundTec @npub1h59…waea Several DEs start themselves via systemd these days, so a graphical terminal will often have the systemd limits applied.

AndresFreundTec @npub1h59…waea (Since most things spawn from somewhere within a systemd instance these days, the PAM limits are quickly overridden by systemd.)

AndresFreundTec Here are the slides for a talk I just gave about using perf c2c to find cache line contention in postgres: https://anarazel.de/talks/2024-05-29-pgconf-dev-c2c/postgres-perf-c2c.pdf

AndresFreundTec @npub1hxa…j94a Yep. This is with turbo verified to be disabled, C-states disabled, and frequency monitored...

AndresFreundTec A potentially interesting detail: if I interpret https://community.intel.com/t5/Software-Tuning-Performance/Understanding-PCICFG-space-information/m-p/1138821#M6581 correctly, both my CPUs are 18-core models with some cores disabled. Both have the same CAPID0 (0x001881fa) and CAPID4 (0x24000e80), but different CAPID6 values (0x0001b4e3 and 0x0002c6f8). AFAICT that makes them HCC parts with 10 slices enabled.

AndresFreundTec @npub1yc6…lmgn Nope. No difference above noise. That core is the only "really slow" one.
AndresFreundTec I don't think that's it - I added userspace vs kernel cycles,instructions: https://gist.github.com/anarazel/ca7d1db68fb7380d21f6fd819a147df1 There are a few more kernel instructions/cycles in the slow case, but that's just because the slow case takes longer. If I measure for a fixed time, it's about the same.

AndresFreundTec My best theory so far is that somehow, on that one core, there are more conflicts on L1i entries, leading to a lower hit rate. I haven't figured out what the precise keying scheme for the L1i is (it's 8-way associative).

AndresFreundTec @npub142j…7ugm Yep. There are practically no interrupts, there are no SMIs, no evidence of throttling in performance counters.

AndresFreundTec @npub1e5z…g25u It's possible, but somehow it seems odd to end up with different numbers of L1i misses etc. without causing apparent corruption.

AndresFreundTec @npub1e5z…g25u The others perform like 11; 10 is slow. Note how in the paste above the instruction numbers are almost identical, but core 10 needed a lot more cycles...

AndresFreundTec @npub1e5z…g25u Core 10 is the first core of the second socket, 11 the second. The first socket does not show the same for cores 0 and 1. Nor does any other combination I've tried.

AndresFreundTec @npub17lg…9uux Hah. This is weird, but not that kind of weird, I strongly suspect. Although I wouldn't mind getting that farm.

AndresFreundTec @npub16hg…mtal It's the same NUMA node...
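The L1i-conflict theory in the notes above can be illustrated with the textbook set-index calculation. This is only a sketch under assumptions: a conventional 32 KiB, 8-way L1i with 64-byte lines (so 64 sets); the actual indexing/hashing on these CPUs may differ, and the example address is made up.

```shell
# Textbook set index for an instruction address, assuming a 32 KiB, 8-way
# L1i with 64-byte lines => 32768 / (8 * 64) = 64 sets. Under this scheme,
# two code addresses conflict when their bits 6..11 match.
addr=$(( 0x41fa40 ))                    # example address, made up
line_size=64
ways=8
sets=$(( 32768 / (ways * line_size) ))  # 64 sets
set_index=$(( (addr >> 6) % sets ))     # bits 6..11 of the address
echo "set index: $set_index"
```

With 8 ways per set, more than 8 hot cache lines mapping to the same set would evict each other, which would show up as extra L1i pressure.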
AndresFreundTec https://gist.github.com/anarazel/ca7d1db68fb7380d21f6fd819a147df1 How can two cores on the same CPU have such crazily different icache behaviour? For the same process!

AndresFreundTec Well, color me very confused. In a CPU-bound workload, two cores on the same socket have substantially different performance (a 32% slowdown). If I just migrate the running process between the cores, performance changes immediately. This is on a 2x Xeon 5215 system. I checked that it's not thermals or CPU frequency/boost, and the system is idle. Here's the odd part: the biggest difference evident in perf counters is a 2.5x difference in icache_64b.iftag_stall, with ~same icache_64b.iftag_miss.

AndresFreundTec @npub16ew…us82 I assume this one is also mounted with barrier=0?

AndresFreundTec @npub1hxa…j94a Very interesting. Thanks. A depressingly large performance diff between ext4 and btrfs, even with nocow. Interesting that dsync wins with ext4 but loses on btrfs.

AndresFreundTec @npub12yj…3l94 Thanks. What filesystem is this? These don't show the same slowdown we've seen with O_DSYNC/FUA for other Samsung SSDs. I suspect the filesystem doesn't use FUA writes...

AndresFreundTec @npub1jpa…6p97 I'm curious about your results with the SK Hynix, because they're the first non-Samsung one where FUA writes are slower. Albeit to a much lesser degree.

AndresFreundTec @npub1jpa…6p97 Thanks! Any chance you could figure out the model number of either, e.g. via smartctl -xa /dev/nvme0n1 or lsblk -o path,model,fstype,size,mountpoints?
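The core-to-core comparison described in the notes above can be sketched roughly as follows: pin the same process to each core in turn and record the icache iftag events. This only builds and prints the commands rather than running them, since they need a live pid and perf(1); the core numbers and $PGPID are placeholders.

```shell
# Construct the per-core comparison: migrate the process to one core, then
# count the icache_64b.iftag_* events mentioned above for a fixed interval.
# PGPID stands in for the pid of the benchmarked postgres backend.
events="cycles,instructions,icache_64b.iftag_stall,icache_64b.iftag_miss"
cmds=""
for core in 10 11; do
  cmds="${cmds}taskset -cp $core \$PGPID && perf stat -e $events -p \$PGPID -- sleep 10
"
done
printf '%s' "$cmds"
```

Comparing the two runs over the same fixed interval sidesteps the "slow case takes longer" skew noted earlier: with time held constant, only the per-cycle behaviour differs.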
AndresFreundTec Collecting the information here: https://gist.github.com/anarazel/b527e5317bb7d58483a9858f5f2435ca The background is that I'd like to switch to using O_DSYNC by default for postgres' WAL, but it appears some drives react unfavorably.

AndresFreundTec @npub1u8f…p9n0 Thanks!

AndresFreundTec @npub1p4g…8pte Thanks!

AndresFreundTec Going to collect the information here: https://gist.github.com/anarazel/b527e5317bb7d58483a9858f5f2435ca

AndresFreundTec Any chance a few folks could run the following fio command on various SSDs and tell me the latency, drive model and filesystem? fio --directory /srv/dev/fio/ --runtime 3 --time_based --output-format json --overwrite 1 --size=8MB --buffered 0 --bs=4096 --rw=write --name write-dsync --wait_for_previous --sync=dsync --name write-fdatasync --wait_for_previous --fdatasync=1 --name write-nondurable --wait_for_previous | jq '.jobs[] | [.jobname, .write.iops]' (--directory needs to be adjusted.)

AndresFreundTec The "new" Linux CVE approach seems ... unhelpful at best. I know from experience that dealing with security reports is a pain, and I assume that's way worse for Linux than for Postgres. But this just seems purely aimed at annoying everyone.

AndresFreundTec Congratulations to fellow postgres hackers @npub1zcv…cczq and Richard Guo for becoming committers!
https://postgr.es/m/df222085-2248-4d89-8935-256a9c384878%40postgresql.org

AndresFreundTec I did not think that finding a security vulnerability would lead to tabloids digging around in my life. Including "analyzing" (aka making things up about) the one personal-ish picture I've ever shared on social media. FFS. Oddly enough, they weren't interested in lots of kinda pretty graphs.

AndresFreundTec @npub17lg…9uux FWIW, I hadn't been on twitter in months, and this made me go back - at least earlier on there was distinct information, particularly around reverse engineering efforts.

AndresFreundTec I am a bit concerned by all the focus on small-ish projects with overwhelmed maintainers. There are indeed a lot of problems in that area. But I am certain that lots of experienced OSS devs can think of a few large and crucial projects where they fairly easily could have hidden something small in a larger change. Without a lot of prior contributions to the project.

AndresFreundTec @npub178r…qekt Even if you don't care about collateral harm, with something like a backdoor in ssh, it just seems too likely you'd otherwise accidentally make yourself vulnerable too, somewhere in your org.

AndresFreundTec I wholeheartedly agree with what Russ wrote here: "Also if there's anything the community can do for Lasse personally, please pass that along." "Anyone can be the victim of social engineering." "I suspect many of us here have had nightmares about being in Lasse's position, and probably will have more of them in the future." Indeed.
https://www.openwall.com/lists/oss-security/2024/03/30/25

AndresFreundTec I accidentally found a security issue while benchmarking postgres changes. If you run Debian testing, unstable, or some other more "bleeding edge" distribution, I strongly recommend upgrading ASAP. https://www.openwall.com/lists/oss-security/2024/03/29/4
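For interpreting the fio numbers requested in the notes above: these jobs run at queue depth 1, so the per-write latency is roughly the inverse of the reported IOPS. A sketch of the conversion; the 2500 IOPS figure is a made-up example value, not a measurement.

```shell
# Convert queue-depth-1 synchronous-write IOPS to approximate per-write
# latency: latency_us = 1e6 / IOPS. 2500 IOPS is an example value.
iops=2500
latency_us=$(awk -v iops="$iops" 'BEGIN { printf "%d", 1000000 / iops }')
echo "~${latency_us} us per 4k write"
```

A large gap between the write-dsync and write-fdatasync jobs on the same drive is the FUA-related slowdown being collected in the gist.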