Server Admin Log/Archive 5: Difference between revisions

From Wikitech

Revision as of 02:44, 10 September 2005

September 10

  • 02:40 Tim: Changed names of Seoul machines
  • 02:15 brion: set edit rate limit for new accounts to same as ip rate limit
  • 01:40 brion: installed rsvg (librsvg2) on mediawiki-installation machines, enabled SVG uploads

September 9

  • 06:30 brion: restarted stalled de,en dumps

September 8

  • 19:18 brion: checker daemon running
  • 10:50 brion: setting up vandal checker daemon on larousse
  • 10:42 hashar: enabled subpages for portal (100) and portal discussion (101) on dewiki.
  • 7:45 hashar: added two namespaces for frwiki : 100=>Portail, 101=>Discussion_Portail .

September 7

  • 22:00 jeronim: fixed avar's login problem on servers in the mediawiki-installation group -
    • nscd -i passwd did not work
    • /etc/init.d/nscd restart ; /etc/init.d/sshd restart did solve the problem on each machine except for benet; for benet, problem was finally solved after doing the restarts twice more, then nscd -i passwd, then doing the 2 restarts with a pause in the middle
  • 21:30 jeronim: killed everyone's ssh sessions and sshd on zwinger (sorry)
  • 10:25 midom: After Tim put the memcached patch live, the site's sessions were switched from NFS to memcached.
  • 06:54 brion: killed stalled backup -- memcached send hang for the last day or so. It's continuing w/ dkwiki; will rerun stalled dewiki and enwiki

September 6

  • 19:55 brion: tgwiktionary to lowercase
  • 05:30 brion: set up experimental upload verification hook

September 5

  • 12:40 brion: set up to shut down search builder daemon every hour (at 47 minutes) to protect against memory leaks in builder; search-update-daemon wrapper script set to auto-restart 5 seconds after shutdown/crash of the daemon
  • 09:05 brion: rebuildMessages.php --update on all wikis to add various new messages
  • 06:09 brion: starting mass lucene updates of pages edited in august
  • 05:18 brion: lucene back-deletions done, reoptimizing build index
  • 01:10 brion: search updater up; running queued deletions
  • 00:45 brion: vincent back in active search rotation
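
The 12:40 entry describes a wrapper that restarts the search update daemon 5 seconds after it shuts down or crashes. A minimal Python sketch of that restart pattern follows. The actual wrapper script is not shown in this log, so the function name `supervise`, its parameters, and the placeholder command are all assumptions made for illustration.

```python
import subprocess
import sys
import time


def supervise(cmd, restart_delay=5.0, max_runs=None):
    """Run cmd repeatedly, sleeping restart_delay seconds between runs.

    max_runs is a hypothetical knob that bounds the loop (handy for testing);
    None means restart forever. Returns the exit codes that were observed.
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        # Blocks until the child exits, whether by crash or scheduled shutdown.
        exit_codes.append(subprocess.call(cmd))
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(restart_delay)
    return exit_codes


if __name__ == "__main__":
    # Placeholder child process standing in for the real daemon.
    fake_daemon = [sys.executable, "-c", "raise SystemExit(3)"]
    print(supervise(fake_daemon, restart_delay=0.1, max_runs=2))
```

Pairing a loop like this with a scheduled hourly shutdown means a leaking process never runs for more than about an hour, at the cost of a few seconds without the daemon on each cycle.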

September 4

  • 23:55 brion: splitting lucene config to lucene.php. putting coronelli on search, with optimized index
  • 19:30 jeronim: created helpdesk-l
  • 17:20 jeronim: fuchsia does not boot on the latest kernel (see below), but it does boot on the 2.6.11-1.33_FC3smp kernel, so switched it to boot that kernel by default
  • 16:27 mark: Because of cascading incidents in knams, we moved all traffic to florida and lopar via DNS.
  • 14:30 jeronim: fuchsia was dead or very close, so power-cycled it using the IPMI. It is broken:
Copyright (c) 1999-2004 LSI Logic Corporation
ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 27 (level, low) -> IRQ 177
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator}
insmod: error inserting '/lib/mptscsih.ko': -1 Unknown symbol in module
/sbin/udevstart exited abnormaly!
Creating root device
dev: label /1 not found
Mounting root filesystem
mount: error 2 mounting ext2
mount: error 2 mounting none
Switching to new root
: mount failed: 22
umount /initrd/dev failed:
Kernel panic - not syncing: Attempted to kill init!
Call Trace:<ffffffff80138164>{panic+196} <ffffffff8034f811>{__down_read+49}
       <ffffffff80207ef1>{__up_read+33} <ffffffff8013ae53>{do_exit+99}
       <ffffffff80207db1>{__up_write+49} <ffffffff8013ba8f>{do_group_exit+239}
       <ffffffff8010eaa6>{system_call+126}
  • 13:16 Tim: made /home/wikipedia/lib/install.sh ignore x86_64 machines, added a part to clean up rubbish left in /usr/lib, then ran it everywhere with dsh -a -f
  • 04:20 Tim: reinstalling PHP 4.4.0 with exif support. Using php-upgrade-440, which calls the new script /home/wikipedia/lib/install.sh to set up shared libraries in /usr/local/lib.

September 3

  • 18:40 jeronim: removed body of mailman archive messages here and here on yannf's request
  • 06:40 brion: relaunch updated backup script with some of the broken bits fixed.
  • 04:50 Tim: Finished benchmarking PHP 4.4.0, see GCC benchmarking. Now deploying the new binaries, from source tree /home/wikipedia/src/php/php-4.4.0-gcc4
  • sometime brion: added .log to text/plain on benet's lighty

September 2

  • 12:00 brion: ran backup test on aawiki using the new dump splitter and partial new backup script. (script is in ~brion/run-backup.sh if anyone wants to examine it)
  • 07:19 Tim: compiling GCC 4.0.1 on zwinger. It will be installed with a program suffix, so gcc is still the old compiler, and gcc-4.0.1 is the new one. Source directory is /home/wikipedia/src/gcc/gcc-4.0.1, build directory is /home/wikipedia/src/gcc/gcc-4.0.1-build.
  • 06:21 Tim: removing hypatia from perlbal nodelist for an hour or so, for some benchmarking

September 1

  • 07:45 brion: set sitename/meta namespace on mtwiki
  • 07:00 brion: running cleanupTitles.php to rename broken pages. Will be at Special:Prefixindex/Broken/ at each wiki.

August 30

  • 17:30 jeronim: made a robots.txt on larousse (noc/kohl) to disallow some dynamic pages and a few others
  • 16:40 jeronim: created wikimediapl-l

August 29

  • 21:30 brion: blocked wissens-schatz.de for remote loading
  • 17:30 jeluf: anonymized a name in the archive of wikide-l
  • 11:30 brion: running a batch job checking for invalid titles on various wikis (cleanupTitles). shouldn't interfere with anything, making no changes.

August 28

  • 22:15 brion: locking plwiktionary for capitalization change
  • 15:18 hashar: created wikimk-l mailing list.
  • 15:15 mark: Brought mayflower back up. Repaired the filesystems, and rebooted it. It was reporting lines like
Aug 28 04:22:34 mayflower kernel: swap_free: Bad swap file entry 7800007ffffff00f
  • 14:30 mark: Another Kennisnet V-20 went down, this time it was mayflower dying somewhere this morning. Depooled it... As it's not critical and we still have SP access, I will have a look at it first.

August 27

  • 00:45 brion: turned on wegge's experimental watchlist bot thingy on dawiki

August 26

  • sometime: lots of data imported on wikisources

August 25

  • 16:02 jeronim: added fc-mirror.wikimedia.org DNS entry for fedora mirror
    • fc-mirror 1H IN CNAME albert
  • 15:40 hashar: created wikials-l mailing list. TODO: delete /h/w/htdocs/mail/.index.html.sw(o|p) (swap files by fire).
  • 19:00 mark: PowerDNS on pascal appeared corrupted. Most probably because of an overlapping zones problem in bindbackend (not bindbackend2). I integrated rev.wikimedia.org into the wikimedia.org zone to avoid that.
  • 16:09 hashar: blacklisted www . izynews . com on florida squids (using acl badbadip src 62.75.174.182/32). Need to be done on kennisnet and paris cluster too.
  • 11:00 brion: set up https on kohl. (old ssl key files backed up; wasn't using the established password, nobody knew what it might have been)
  • 07:05 brion: rebuilt interwiki tables; using correct interwikis for the new wikisources.
  • 06:51 brion: added sr.wikisource.org
  • 02:02 hashar: updated in HEAD LanguagePt.php from meta. Watch out when synchronising.

August 24

  • 14:04 hashar: disabled lucene search. Daemon runs on maurus but times out / doesn't give any output.
  • 04:00 Jamesday: started nice bzip2 for slow query log and first 72 binary logs on adler to free 40GB of disk. Can archive them on another box later.
    • use avicenna for binlog archives -- Tim 05:53, 25 Aug 2005 (UTC)
  • 00:43 brion: trying out an older version of MWDaemon on vincent to see if memory leak is a new code problem

August 23

August 22

  • 22:12 brion: upped max post size to 75mb on squids; were problems posting large videos to commons (or something)
  • 21:50 brion: renamed presswiki to internalwiki

August 21

  • 22:53 brion: bugzilla up; removed ssl-ticket.wikimedia.org from pascal's apache conf.d dir
  • 22:48 brion: bugzilla.wikimedia.org appears to be offline.
  • 13:30 Tim: reduced lucene load on vincent to 1/4, maybe that will stop it from locking up (which it did again)
  • 13:00 Tim: restarted lucene on vincent, it was closing connections as soon as they were established
  • 06:27 brion: otrs now accessible again on https://ticket.wikimedia.org/ ; now with redirect for the index page! For reference: Apache is in /usr/local/otrs
  • 06:00 brion: trying to start otrs on ragweed. apache configuration appears to be borked.

August 20

  • 10:00 jeluf: finished OTRS transition to ragweed. SpamAssassin setup finished.
  • 09:53 Tim: Switched site to 1.6alpha
  • 08:16 Tim: Applying schema update for 1.6alpha, basically an ALTER TABLE watchlist
  • 01:00 Tim: ran update-special-pages

August 19

  • 23:30 brion: changed postfix 'myhostname' setting from zwinger.wikimedia.org to mail.wikimedia.org, should prevent the mail loop errors reported sending to the full addr
  • 23:00 brion: ran namespace conflict checks for updates on tawiki and gawiki
  • 21:40 brion: updated rebuildInterwiki

August 18

  • 23:30 jeluf: OTRS status: Installed apache/php/perl/postfix/mysql client on ragweed. Using pascal as DB server. Problems with sessions, sessions seem to be mixed up, sometimes I get logged in as presroi, sometimes as JeLuF :-/ Stopped apache for now. Postfix still accepting new tickets.
  • 22:30 mark: Changed DNS CNAME ticket.wikimedia.org to point to ragweed
  • 22:17 brion: disabled account creation throttle on press wiki; this is closed wiki and all accounts are created by an admin
  • 10:00 midom: suda is back again, with enwiki and commonswiki databases
  • 05:00 jeluf: copied OTRS tables to pascal, copied otrs binaries to pascal, configured pascal to serve https. Can access old tickets again. Currently can't send new tickets to otrs. DNS change needs to be done.
  • 00:55 brion: recreated wikimediasr-l list on zwinger

August 17

  • 19:27 brion: fixed bug in db.php that set all database load factors to NULL

August 16

  • 20:15 jeluf: renamed project namespace on cswikibooks to Wikiknihy.
  • 15:30 midom: resumed idle bacon's mysql replication, we might need to do external store migration soon, and bring back suda with smaller dataset.

August 15

  • 21:46 kate: always_bcc on zwinger was set to "quagga" and its mbox was full, so it generated lots of bounce messages. i removed the setting.
  • 12:30 mark: Mint seems to have at least a bad disk, possibly other problems. Sun will look at it. In the meantime, we can *try* to network boot it and recover data.
  • 10:30 jeronim: had a look at mint via the IPMI - tried to power cycle it but it wouldn't switch off. Mark will tell the kennisnet guys about it. There's a dump of the OTRS DB from before the transfer to mint in albert:/root. If mailman is to be put back to zwinger, chapter-l and the new Serbian list will need to be re-created (and maybe some other lists?).
  • 09:00 mark: Mint apparently is fucked, RAID and SP settings were reverted to factory defaults. Trying to do data recovery now. Possibly a power problem?

August 14

  • 19:51 brion: mail config on zwinger broken or funky or otherwise annoying; just leaving it off for now. moved dns for mail back to mint (which is still dead) sighhhh
  • 19:26 brion: moved mail.wikimedia.org back to zwinger due to extended outage on mint. With our limited support contract on knams we can't afford to have this critical service there.
  • 14:30 midom: srv27,srv26,srv25 joined external storage service, waiting for payload
  • 09:30 brion: mint is offline, no ping
  • 00:20 brion: stopped bacon to run backup dump
  • 01:00 jeluf: enabled spamassassin for OTRS on mint (~otrs/.procmailrc)

August 13

  • sometime kate: moved otrs to mint
  • 23:25 brion: added wikimediasr-l aliases to mailman on mint
  • sometime someone: Apparently mail.wikimedia.org has been moved to mint.
  • 10:42 jeronim: set ticket.wikimedia.org to CNAME mint.knams.wikimedia.org. (move of OTRS to mint is in progress)
  • 00:58 Tim: started update-special-pages
  • 00:19 Tim: it happened again so I disabled otrs's crontab. Original crontab is in /opt/otrs/crontab

August 12

  • 23:18-23:30 Tim: An OTRS process on albert (PostMaster.pl) developed a runaway memory leak, causing heavy swapping. This slowed down albert sufficiently to cause the entire apache cluster to lock up with high load. Killed the process at 23:30 and the site soon returned to normal.
  • 09:30 brion: took srv1 out of 'apaches' node group and shut off apache on it. DON'T RUN APACHE ON SRV1

August 11

  • 21:26 Tim: TICK TICK TICK, that's the sound of 58 servers with their clocks ticking in synchrony, maximum offset 80ms.
  • 20:30 Tim: Added the missing restrict line for 10.0.0.200 to ntp.conf on (almost) all machines
  • 19:30 Tim: Synchronised ntp.conf on hypatia, humboldt, rose, anthony, rabanus, diderot and srv1 with /home/config/others/etc/ntp.conf.vlan2 . This made them remotely queryable, for easier debugging in the future, and also switched their preferred server from zwinger to the cisco (in broadcastclient mode).
  • 18:35 Tim: Fixed tingxi's resolv.conf
  • 17:45 mark: Fixed inconsistent favicons on apaches. Older apaches had symlinks to a common (wikipedia) favicon, which got overwritten with the new wikinews favicon by brion. Removed the symlinks, and put the correct favicons in place.
  • 12:20 brion: set up pl.wikimedia.org and press.wikimedia.org (press is locked, and currently has no user accounts. a sysop/bureaucrat will need to be added for it to be used)
  • 07:28 brion: updated wikinews.org favicon

August 9

  • 23:20 mark: Rerouted Europe back to knams, because all sorts of weird problems were occurring. Fixed a typo (pmpta) in DNS. Some nameservers report TTL 0 for some of our DNS records - need to investigate that.
  • 22:20 mark: Moved Squid service IP 207.142.131.246 from overloaded srv10 to srv5. Cleared the ARP entry on the l3 switch.
  • 22:00 mark: Reroute everything from knams to pmtpa directly, because of routing problems
  • 13:35 mark: changed biruni's hostname from biruni.wikimedia.org to biruni
  • 13:30 mark: added avicenna and biruni to node_groups/apaches
  • 13:00 mark: Restarted apaches on avicenna, alrazi and biruni with -DSLOW, and changed startup scripts
  • 08:52 jeronim: blocked 61.48.105.65 spammer IP from all wikis using block-ip-all - so ipblocklist message will speak of "vandalism" instead of "spam"
  • 08:25 jeronim: created chapter-l for mailman on mint

August 8

  • 09:22 kate: enabled greylisting on mail.wm.org
  • 20:54 hashar: readded srv2 (with ip x.x.0.1 ) to the apache pool
  • 18:25 hashar: avicenna & biruni readded. Monitoring error log, #wikipedia and memory.
  • 17:43 brion: added /mnt/upload mounts on avicenna and biruni
  • 17:32 hashar: forgot sync-common on avicenna and biruni :/ I thought scap would do the job ... They are both missing the upload directory.
  • 15:45 brion: stopped apache on avicenna and biruni pending more information on reported errors
  • 15:36 hashar: TODO: biruni hostname seems wrong /etc/sysconfig/network list HOSTNAME=biruni.wikimedia.org whereas other servers just get HOSTNAME=zwinger or HOSTNAME=srv30 ...
  • 15:36 hashar: removed srv1 from mediawiki-installation dsh file (as apache is not meant to run on it).
  • 15:24 hashar: brought biruni back into mediawiki-installation pool
  • 15:12 hashar: brought avicenna back into mediawiki-installation pool
  • 14:30 hashar: started apache on srv11.
  • 06:30 kate: moved mailing lists to mint. let's see if it starts sucking less.

August 7

  • 20:50 brion: postfix hung zombified on zwinger, wouldn't restart automatically. had to remove master.pid and restart.
  • 16:25 brion: installed DynamicPageList on wikiquote per [1]
  • 15:50 brion: locked tlhwiki
  • 07:47 brion: added application/ogg as mime type for ogg files on albert
  • 00:59 brion: set localized logo for ptwiktionary

August 3

  • 14:15 mark: Switched over upload.wikimedia.org to lighttpd instead of apache on albert
  • 12:00 brion: added frankfurt city map to wikimania whitelist. whoops!

August 2

  • 15:45 mark: Bound albert's apache to a single IP, instead of INADDR_ANY
  • 09:40 brion: added wildcard subdomains for wiktionary.com redirection

August 1

  • 22:30 all: samuel's disk filled up. Switched master to adler. Re-syncing samuel from suda.
  • 14:50 mark: Put all kennisnet squids back into DNS, updated DNS on pascal

July 31

  • 11:50 brion: knams squid at 145.97.39.138 is not reachable, but still in dns rotation. THIS IS BAD
  • 01:50 brion: pascal is offline, reason unknown. bugzilla down, no NFS for knams cluster.

July 28

  • 01:06 kate: put a new skin on bugzilla

July 27

  • 18:50 brion: blocked irc4ever.net remote page loaders

July 26

  • 08:08 kate: upgraded mysql on vandale to 5.0.9

July 25

  • 19:05 brion: set $wgMetaNamespace to 'Vikipedi' on trwiki, refreshing links
  • 18:15 mark: Added two missing kennisnet squid IPs to the udpmcast startup script on larousse, and restarted it.
  • 17:29 brion: added wikimania-l mailing list
  • 17:25 mark: Pointed thailand at knams as a test - some people there say it is much faster than pmtpa. Will eventually be replaced by the yahoo cluster anyway...

July 24

  • 16:15 brion: set ndswiktionary to capitallinks off
  • 10:10 brion: updated sudoers file on srv0 so syncs work again

July 22

  • 22:50 brion: restarted search update daemon... still seems to be a memory leak and it hangs when it gets too large
  • 22:31 brion: moved wiki.mediawiki.org to www.mediawiki.org and redirected from mediawiki.org and wiki.mediawiki.org to it
  • 22:07 brion: srv0 clock was about 150 seconds in future. kate did something to fix it. synchronized all apaches from system to hc time to hope reboot works. Fixed one revision reported to be in a weird inversion appearance.
  • 13:50 brion: took avicenna out of search group to do experiments on index

July 21

  • 23:30 Tim: added rollback group
  • 22:00 Tim: moved group settings from CommonSettings.php to InitialiseSettings.php

July 20

  • 23:45 brion: updated clocks on srv1, rabanus, etc all apaches... hopefully
  • 21:40 brion: set wgCapitalLinks off on afwiktionary
  • 19:20 mark: Removed legacy zone gdns.wikimedia.org and corresponding georecord rr.gdns.wikimedia.org from all nameservers. It's not being used anymore, and only confuses people.
  • 19:05 mark: Pointed france and switzerland back at lopar in geodns
  • 14:10 brion: created wikinews-hotline mailing list by request

July 19

  • 23:58 Tim: fixed Special:Uncategorizedcategories, now running updateSpecialPages.php on /h/w/c/smallwikis
  • 15:30 brion: reverting build copy of search index to the previous version to try working around some corruption from daemon crash (?)

July 17

  • 18:27 mark: An empty line in the geomap file caused problems and made the site go down for non EU users. Apparently geobackend currently doesn't handle empty lines in geomap files (a bug which I will fix), so don't use them.
  • 18:18 mark: Pointed all European countries at knams wrt geodns
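
The 18:27 entry warns that a single empty line in the geomap file took the site down for non-EU users. A pre-deployment check along the following lines could catch that. This is only a sketch: the function name `find_blank_lines` and the sample records are hypothetical, and it assumes nothing about the geomap format beyond it being line-oriented text.

```python
def find_blank_lines(text):
    """Return the 1-based line numbers of empty or whitespace-only lines."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if not line.strip()]


# Made-up records; the real geomap syntax is not shown in this log.
sample = "record-a\n\nrecord-b\n"
print(find_blank_lines(sample))
```

A non-empty result would be a reason to refuse to install the file until the listed lines are removed.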

July 16

  • 17:07 kate: wrote a new statistics system and replaced webalizer with it
  • 07:30 brion: had to restart search daemons again due to breakage. whyyyyyyy they worked before *sob*
  • 00:15 hashar: overloaded suda for almost 5 minutes by running the unbugged updateSpecialPages script . Might be cause of Wantedpages.

July 14

  • 02:50 brion: separated mediawiki-installation and apache node groups. These must not point to the same file.
  • 02:00-3:15 erik: created Japanese Wikinews at http://ja.wikinews.org/

July 13

  • 20:59 brion: had to interrupt bgwiki backup due to memcached hang
  • 06:10 brion: restarted search servers; 'too many open files'
  • 01:30 brion: started backup on benet (slave stopped). updates in #wikimedia.15status

July 12

  • 23:35 brion: commented out lopar from geodns for now (moved them to knams)
  • 23:20 brion: there's intermittent packet loss to lopar...
  • 19:10 mark: Site was down due to crashed perlbal on holbach, restarted it
  • 12:03 kate: put lily back to squid pool
  • 08:10 jeronim: set yum on larousse (FC2) to use fedoralegacy.org
  • 08:00 mark: lily's hardware has been replaced.
  • 07:40 jeronim: set HostnameLookups Off on larousse's apache at hashar's request
  • 07:10 jeronim: added CNAME commons.wikipedia.org -> commons.wikimedia.org
  • 00:40 brion: restarted mysql on james's advice with config change. innodb_lock_monitor fails, however. have innodb_status_file=1 set now. had to do 'slave stop' on samuel, which is master. wtf

July 11

  • 23:40 brion: set innodb_lock_monitor on samuel on Jamesday's recommendation. will be active when mysqld restarted
  • 23:20 jeluf: restarted ServmonII. Died when it lost its irc connection earlier today.
  • 23:05 brion: removed the fateful link so editing that page works for now
  • 22:30 brion: disabled deletion of recentchanges records due to slowness there. hacked Title::touchArray to go row by row due to weird hangings trying to edit Template:POTD on enwiki. Not sure what's wrong, it consistently hangs at User:Mulad/portal. What could be locking it?
  • 18:30 brion: biased search load to maurus, as avicenna (with less memory) was being sluggish. added comment to output saying which server was hit
  • 15:10 mark: Removed authoritative zones that were no longer pointing at zwinger from zwinger's Bind configuration (interferes with resolving). Set up AXFR slaving of zones that are supposed to be served by the new PowerDNS servers, but which are still delegated to Zwinger/bomis/fuchsia.
  • 14:50 mark: Fixed reverse DNS for knams

July 10

  • 17:00 brion: shut down slave thread on ariel before it explodes
  • 05:40 hashar: check out our new portal: http://noc.wikimedia.org/
  • 01:07 kate: removed ariel from load balancing because it only has 700MB of disk space left.

July 9

  • 10:30 brion: fixed up steward mode in special:makesysop plugin to provide the full userrights options
  • some time in the morning kate: reverse dns for knams started working, although under *.rev.wikipedia.org.
  • 08:02 brion: reassigned 'developer's on meta to steward group

July 8

  • 5:20 brion: started mass lucene index builds using the updater daemon. once done, will sync current index files out. (progress in #wikimedia.15status)

July 7

  • 13:50 brion: added page update hook for the lucene update daemon, see wikitech-l post
  • 11:38 mark: Installed java (!) on pascal, to allow Kennisnet/ZX to upgrade the SP and BIOS on lily.
  • 11:34 brion: maurus had bogus hostname (maurus.wikimedia.org, doesn't resolve). fixed live and in /etc/sysconfig/network
  • 08:55 brion: upgraded PEAR::XML_RPC to 1.3.2 on mediawiki-installation group. Patching mono on avicenna and maurus for ximian bug 75467
  • 08:30 brion: noticed vincent seems to be hung
  • 07:00 Jamesday: changed holbach cache split from 200M/2800M to 200M/2500M because of excessive page faulting in vmstat, not yet restarted.

July 6

  • 14:40 Tim: named on albert exited for no apparent reason, causing site-wide slowdown. Logged on via the scs and started it.
  • 07:00 brion: all wikis reading from 1.5 code now. zh-min-nan.wikipedia.org has the UI broken -- code problem selecting wrong UI language [since fixed]
  • 06:30 brion: fixed up broken conversions on sdwiki, rowikibooks, fiu_vrowiki, cowikibooks, aawiki
  • 06:00 brion: upgraded meta to 1.5
  • 04:00 kate: upgraded all knams machines to current kernel to fix bad pmd problem

July 5

  • 10:43 kate: put back mint to squid pool
  • 9:15 mark: Added zh-tw.wikimedia.org CNAME record to the wikipedia.org zonefile, as it was missing (and is not in langlist, for not being a language)
  • 8:40 mark: Added an admin account on lily's SP, and set up temporary port forwarding on pascal to give ZX (sysadmin partner of Kennisnet) access to diagnose lily's hardware problems

July 4

  • Jason/mark: Many Wikimedia project domains have been changed to use the new PowerDNS DNS servers, so if you see any DNS related problems, it may be related to that
  • 19:32 kate: set up squid log migration system
  • 08:10 brion: migrated forgotten changes to InitialiseSettings from 1.4 to 1.5 (jbowiki caps, fiu-vro logo, zhwiki externalstorage)
  • 03:08 kate: removed srv1 from apache pool again.

3 July

  • 21:35 jeronim: srv1, srv2 & LDAP alive again after manual reboot by colo staff. not sure if domas actually emailed about scs-ext problem.
  • 20:05 jeronim: and scs-ext.wm.org doesn't work anymore. dammit has emailed colo about this and srv1/srv2 problem
  • 20:00 midom: oopsie, srv1 also didn't come up after reboot, and apparently it was LDAP server... LDAP down.
  • 19:00 midom: resyncing holbach, updated misbehaving apache hosts (srv2,srv3,anthony,rose), srv2 didn't come up after reboot.
  • 06:10 brion: holbach crashed again, mysqld was restarting over and over. killed it for now.
  • 05:05 brion: fixed more wikimania registration files
  • 02:20 brion: fixed missing db config in wikimania attendees list

2 July

  • 21:55 brion: holbach died. restarted zhwiki conversion w/o it.
  • 19:30 brion: started asian large-wiki upgrades: jawiki, zhwiki
  • 16:00 midom: bacon joined perlbal service, restarted perlbal on holbach, site looks happier.
  • 09:00 brion: eswiki upgraded, doing ptwiki now. dammit took ariel out of rotation, ready for reloading
  • 07:40 kate: moved bugzilla to pascal
  • 06:51 brion: fixed db host for wikimania registration
  • 06:45 midom: samuel is our master.
    • mediawiki 1.4, mediawiki 1.5, bugzilla, and otrs should be configured properly for new master. is there anything else? [search server update needs changing anyway, working on this --brion]
  • 04:50 brion: ran refreshLinks on enwikinews
  • 04:30 brion: disabled sorbs checking for now
  • 02:40 Jamesday: changed bacon cache split from 800M/2000M to 200M/2600M, not yet restarted.
  • 02:30 Jamesday: changed holbach cache split from 1000M/2000M to 200M/2800M, not yet restarted.
  • 02:05 brion: running background refreshLinks.php on dewiki

1 July

  • 22:20 Jamesday: changed ariel my.cnf from MyISAM/InnoDB cache split of 1700M/3900M to 300M/5100M assuming minimal MyISAM use now. We've been this high before for InnoDB but there's a small chance that the new kernel on Ariel might not like going above 4G on the next restart - reduce it to 3900 if that happens. Not restarting ariel now because one is planned anyway and it's not that urgent - should improve load handling ability though. Decreased binlog_cache_size from 1M to 128k (it's per session and doesn't really need to be 1M).
  • 08:20 brion: changed Revision legacy encoding conversion to use //IGNORE in iconv... this may need tweaking
  • 06:10 brion: dewiki done.
  • 05:56 brion: moved 1.5 skins dir from /w/skins-1.5 to /skins-1.5. Turns out squid configuration does cache-control rewriting on /w which makes them uncacheable. Bad squid!
  • 00:45 brion: switched 1.5 wikis to shared filesystem sessions. A hack in User::matchEditToken fatally broke save attempts by previously-logged-in users because it didn't bother to check that memcached sessions were in use; I've commented it out.
  • 00:30 brion: switched 1.4 wikis to shared filesystem sessions, perhaps this will relieve memcached session problems?
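
The 22:20 entry's numbers can be checked with a little arithmetic. The sketch below uses the split values from that entry and assumes binary units (1G = 1024M, 1M = 1024k). The helper name `cache_split_summary` and the session count are invented for illustration and are not taken from ariel's configuration.

```python
MB_PER_GB = 1024  # assumption: binary units, 1G = 1024M


def cache_split_summary(myisam_mb, innodb_mb):
    """Summarise a MyISAM/InnoDB cache split given in megabytes."""
    return {
        "total_mb": myisam_mb + innodb_mb,
        "innodb_over_4g": innodb_mb > 4 * MB_PER_GB,
    }


old = cache_split_summary(1700, 3900)  # split before the change
new = cache_split_summary(300, 5100)   # split after the change
print(old)
print(new)

# Per the entry, binlog_cache_size is per session, so the saving scales with
# the number of sessions. The session count is an invented example value.
example_sessions = 400
saved_kb = (1024 - 128) * example_sessions
print(saved_kb // 1024, "MB saved across", example_sessions, "sessions")
```

The new split uses slightly less memory in total than the old one, but it moves the InnoDB share past the 4G mark that the entry flags as a possible problem for the new kernel.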

30 June

  • 23:00 brion: installed test fix for firefox intermittent download problem
  • 06:30 brion: set tidy's line wrapping off on 1.4 config as well (already on 1.5)
  • 01:50 brion: finished. running refreshLinks.php on en.wiktionary.org (in background)
  • 01:00 brion: running cleanupCaps.php on en.wiktionary.org to rename all article pages to lowercase

29 June

  • ??:?? brion: somebody moved en.wiktionary.org to wgCapitalLinks off, throwing it into total chaos. thanks!
  • 22:12 brion: removed some unused, added Mac OS X 10.4 to bugzilla operating systems list
  • 18:18 brion: set $wgCapitalLinks off on jbo.wikipedia.org

28 June

  • 22:20 brion: adding image table entries for 'missing' images (probably broken or half-canceled uploads from months back, mostly)
  • 15:50 kate: setup ganglia, ssh, yum on adler and samuel
  • 05:45 kate: set up and documented a better LDAP setup. removed srv1 from apache pool.
  • 01:03 brion: enwiki upgrade broke with its slave reads: page table was incomplete. rebuilding page table from ariel, ETA ~2hrs

27 June

  • 22:35 brion: turned off image metadata loading to speed things up -- will need to do that in a later script run
  • 20:25 brion: dropped & recreated empty links on enwiki to free innodb space (already converted)
  • 19:15 brion: disabled email authentication for now; will do mass checks later
  • 08:40 brion: enwiki upgrade is now pulling revision data from adler, writing to ariel.
  • 07:55 brion: somewhere in the midst of upgrading things. enwiki is going now; upgrade1_5.php is hacked up, please don't run any others until it's restored!
  • 03:35 brion: adler was broken and badly lagging because somebody removed its 'cur2' tables and replication died when we dropped them from the master. fixed, returning...
  • 02:15 brion: commons, wikinews, wikiquote, wiktionaries, and some misc are upgraded. Wikipedias and some others remain... Need to clear disk space on ariel

26 June

  • 04:50 brion: commonswiki being upgraded; ETA in 6-7am range
  • 04:10 brion: upgraded nostalgiawiki as test
  • 02:10 brion: setting things up preparing for 1.5 upgrade

24 June

  • 14:00 brion: adler back in rotation. probably needs reconfiguration for future...
  • 13:30 brion: took adler out of rotation; mysqld crashed OOM and is recovering

23 June

  • 21:45 mark: ...apparently because it was pointing at cache.wikimedia.org., which didn't exist in the new DNS zones... added.
  • 21:45 mark: wiktionary.zone contained an old record fr.wiktionary.org CNAME wikipedia.geo.blitzed.org which for some reason made things break only now. Removed.
  • 05:25 brion: changed sitename, metanamespace on la.wikiquote

22 June

  • 12:00 mark: Changed www-dumps.knams DNS to CNAME dumps in preparation for moving vandale to an internal vlan
  • 05:45 midom: set the global mysql timeout in php to 2 seconds.
  • 05:22 Tim: restored load to samuel, also experimentally changed some other loads
  • 04:28 Tim: realised that the site was down because of samuel and changed the load balancing ratios accordingly. At this time samuel is busy doing InnoDB recovery.
  • 04:22 Tim: finished moving binlogs
  • 04:18 samuel's mysqld exited for no apparent reason
  • ~04:10 Tim: started moving binlogs 232-240 to khaldun

21 June

  • 20:48 mark: Network problem has been worked around, switched geodns back.
  • 19:30 mark: Severe network / reachability problems for florida, but knams seems to be able to reach it. Pointed all of geodns at knams and lopar exclusively.
  • 00:01 kate: moved binlogs 228..231 from ariel to khaldun

19 June

  • 20:00 mark: New DNS setup is active, but DNS zone delegations still need to be dealt with. Please note that there are NO wildcards anymore, so you will need to update DNS zonefiles when creating wikis! Also, for the next week or so, update both the old DNS setup, and the new one when changing records.
    Problems will occur, DNS records may be missing, please tell me or update it yourself!
  • 19:00 mark: I broke the site (for the first time, yay! :) because of the mixed old and new DNS setup; the old zonefile was using rr.chtpa. while the new one expected rr.pmtpa.. Oops. Negative cache TTL of 1H means that some users will not be able to access the site for a while.
  • 18:50 mark: Activated new DNS setup on zwinger, which is partly used by the old Bind DNS setup
  • 17:30 mark: Added records ns0/1/2 to wikimedia.org to allow changing NS delegation for the new setup
  • 15:15 mark: Removed zwinger/gdns1 from the list of geodns nameservers in wikimedia.org, in order to build a new setup on zwinger
  • 00:50 brion: upgraded mono on vincent to 1.1.8 (rpm packages), running a mass lucene index update

18 June

  • 12:30 jeronim: added dsh node groups at florida: squids_lopar, squids_knams & squids_global
  • 11:30ish jeronim: added Disallow: /wiki/? to robots.txt because bots were indexing stuff like http://en.wikipedia.org/wiki/?title=Nl%253Aolijfboom&action=edit
  • 06:14-06:37 Tim: Started Folding@Home on mint, ragweed, hawthorn and mayflower
  • 06:14 Tim: Started Folding@Home on iris. I started it on sage and clematis a few days ago without logging it here.
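
The effect of the Disallow: /wiki/? rule added at 11:30ish can be sketched as below. This is only an illustration assuming the simple prefix matching of the original robots exclusion convention; the helper name is_disallowed is hypothetical and is not part of any crawler or of the site's configuration.

```python
from urllib.parse import urlsplit


def is_disallowed(url, disallow_prefixes):
    """Return True if the URL's path (plus query string) starts with any
    Disallow prefix. Simplified: no wildcards, no Allow rules, no
    per-user-agent groups."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # An empty Disallow value means "allow everything", so skip it.
    return any(prefix and target.startswith(prefix) for prefix in disallow_prefixes)


rules = ["/wiki/?"]  # the rule from the log entry above

# The query-string form that bots were indexing is blocked...
blocked = is_disallowed(
    "http://en.wikipedia.org/wiki/?title=Nl%253Aolijfboom&action=edit", rules
)
# ...while an ordinary article URL is still crawlable.
article = is_disallowed("http://en.wikipedia.org/wiki/Olijfboom", rules)
print(blocked, article)
```

Under this assumption only URLs whose path is exactly /wiki/ followed by a query string are excluded; normal /wiki/Title pages do not match the prefix.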

16 June

  • 23:50ish brion: uplink via level3 died for a few minutes, either was fixed or PowerMedium rerouted it and we're back.
  • 22:24 brion: webster is dead: ssh doesn't let in, scs doesn't respond on it. does ping. possible kernel panic.
  • 20:15 jeluf: webster's mysqld got a signal 11, recovered automatically.

15 June

  • 21:10 mark: Set up new requests/s stats at http://noc.wikimedia.org/reqstats/
  • 20:40 mark: Removed lily from the knams squid pool in wikimedia.org DNS, it's broken.
  • 20:30 mark: Added missing peer statements to knams squids

14 June

  • 08:45 jeluf: moved ariel_log_bin.21[345678] to khaldun

13 June

  • 20:55 brion: truncated searchindex table on a bunch of wikis, freed ~5gb of disk space on ariel
  • 20:30 brion: rebuilt search indexes for dawiktionary, svwiktionary due to bad encoding config
  • 16:00 midom: used new apache-restart-all-hard (really hard!) so that slow watchlists (which actually was segfaulting apaches due to bad bytecode in cache on nearly all servers) would become fast ;-) we really need blank page logging somewhere..
  • 09:30 midom: removed icpagents from some of apache hosts, was a major headache recently.. ;-)
  • 09:00 midom: commented out some defunct or non-apache hosts (uh oh, nearly 10 in total) in perlbal's nodelist.

12 June

  • 03:12 brion: started adding name_title unique index on remaining smaller wikis (<10k pages), 30 second wait between each
  • 01:25 brion: unlocked ja.wiktionary on Angela's request

11 June

  • 22:20 jeluf: moved binlogs 204-212 from ariel to khaldun
  • 12:02 brion: stopped index addition for now (left off at bgwiktionary), will run at non-peak hours
  • 11:54 brion: dupe checks done, adding index...
  • 11:29 brion: running cleanupDupes.php on all wikis not already protected with a unique namespace+title index, then adding the index. the largest wikipedias were already protected.

10 June

  • 10:00 brion: ran a salt fixup script to correct entries which had been erroneously re-saved with bad password due to memcached records floating around in the first couple days

9 June

  • 21:45 chaper: inserted the CD that was delivered with the new hosts into srv4. jeluf mounted to /media/cdrom. Apparently containing RAID controller software for many OSes incl linux.
  • 19:55 kate: lily hung and died again during fsck. moved its ip to ragweed and left it off for now.
  • 02:50ish brion: inserted live debugging hack in Article.php for deletion problem on en.wikipedia.org (bug 2195)

8 June

  • 19:25 jeluf: moved binlogs 200-204 from ariel to khaldun. New DB servers have arrived in data center.
  • 13:15 brion: ran namespaceDupe checker on skwiktionary, skwikiquote due to prob w/ namespace changes there
  • 00:35 brion: added wikiskan-l list for Scandinavians

7 June

  • 22:07 kate: setup dumps mirror at http://www-dumps.knams.wikimedia.org/
  • 19:00 jeluf: moved binlogs 198 and 199 from ariel to khaldun
  • 18:48 brion: reactivated search
  • 9:00-19:00 all: Moved to new Tampa data center
  • 10:00 brion: replaced lighttpd on fuchsia with apache because the errordocument stopped working for no reason
  • 08:00 or so; brion: added fuchsia to wikimedia.org dns, using an alias from dammit because of crappy verio interface. still not on wikipedia.org because we can't get in to it.
  • 07:00 or somewhat: horrible things begin

6 June

  • 13:40 kate: started copying dumps to vandale
  • 11:30 kate: made a small db change for wikimania registration to implement a change in the form. left a backup of the old one at zwinger:/root/wikimania.prekate.sql
  • 10:05 kate: set up logrotate on knams
  • 01:43 Tim: moved binlogs 194-197

5 June

  • 22:40 kate: reinstalled mint with better partition layout, added it to squid pool
  • 21:00 gwicke: fixed mysql error messages in this wiki after config tweak to index words from 3 chars. You should now be able to search for things like 'DNS'.
  • 14:55 mark: bound bind to 145.97.39.130 (pascal's main ip) only, adapted firewallinit to allow incoming DNS zone transfers
  • 14:19 kate: added lily to squid pool
  • 13:25 mark: Added ip 145.97.39.158 to pascal, adapted /sbin/ifup-local.
  • 10:02 kate: iris -> squid pool
  • 09:03 kate: clematis -> squid pool
  • 08:46 kate: sv,dk,no.wp -> knams
  • 08:09 kate: de.wp -> knams
  • 07:56 kate: put mayflower to knams squid pool. fixed typo in commonsettings breaking squid caching.

4 June

  • 18:27 kate: added hawthorn to squid pool
  • 18:10 kate: created rr.knams pool, put UK, NL, DE and LT on it.
  • 16:28 kate, jer, dammit: started squid on ragweed, put it in lopar pool for now
  • 15:30 jeluf: moved binlogs 190-193 to khaldun
  • 13:54 jeronim: built new squid for will as old one had file descriptor limit of 1024 instead of 8192 so it was running out of FDs. In /home/wikipedia/src/squid/squid-2.5.STABLE9.wp20050604.S9plus.no2GB[icpfix,nortt,htcpclr]

3 June

  • 23:30 brion: fixed salting on user_newpassword for accounts not touched since the change.
  • 20:40 mark: Wrote /sbin/ifup-local script on pascal, to handle post-ifup tasks. Currently adds 10.21.0.2/24 IP to eth1 for accessing the LOMs.
  • 20:00 mark: Set up permanent source routing on pascal for Kennisnet out of band access using /etc/sysconfig/network-scripts/route-eth1 and rc.local
  • 19:05 mark: Rebooted csw2-knams with newer crypto image, setup SSH, changed DNS resolver
  • 09:40 kate: created 400GB LV at /sqldata on vandale, ext3. installed mysql. copied ariel's my.cnf over (can someone look at what needs to be changed there?). did not populate any sql data yet.
  • 05:50 kate: REMOVED WILDCARD NS RECORD under *.wikimedia.org. this means you will need to add NS records for new wikis in that domain or they won't work.
  • 05:48 kate: set up recursing NS on pascal and mayflower; tested pdns slave for wikimedia.org on fuchsia, seems to work (but not authoritative yet).
  • 00:05 Tim: moving binlogs 186-189 from ariel to khaldun

2 June

  • 06:15 brion: clearing user records from memcached. two instances of can't-log-in reported might have been caused by stale cache records re-saving bogus unsalted passwords, but that's sheer speculation.
  • 06:00 JeLuF: fixed mail on dalembert and goeje to use smtp.pmtpa.wmnet as smarthost
  • 05:45 JeLuF: removed moreri and bart from "apaches" nodegroup

1 June

  • 19:10 JeLuF: moved binlogs 184 and 185 from ariel to khaldun
  • 15:04 Tim: fixed timezone on coronelli
  • 14:35 Tim: had a go at fixing ntpd on various servers. It was not installed on coronelli and not running on srv5, fixed both fairly easily. Synchronised configuration files on srv11-30, they're still reporting "synchronization failed" as ntpd starts up, although I was able to synchronise their clocks manually with ntpdate. "ntpdc -p" seems to indicate that they are working properly.
  • 5:10 jeluf: Added index, set site to read/write
  • 04:10 brion: updated user tables for password hash salting.
  • 3:00 jeluf: set farm to read only
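
For readers unfamiliar with the password hash salting mentioned at 04:10, here is a minimal sketch. It assumes the scheme MediaWiki used in this era, where the user id acts as the salt over a plain MD5 of the password; the function names are hypothetical and this is not the actual upgrade script.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def unsalted_hash(password):
    # Old-style record: identical passwords give identical hashes.
    return md5_hex(password)


def salted_hash(user_id, password):
    # Assumed salted form: md5("<user_id>-" + md5(password)).
    return md5_hex(f"{user_id}-{md5_hex(password)}")


def upgrade_unsalted(user_id, old_hash):
    # An existing unsalted hash can be upgraded in place without knowing
    # the password, which is what makes a bulk table update possible.
    return md5_hex(f"{user_id}-{old_hash}")


same_password = "example"
print(salted_hash(1, same_password) != salted_hash(2, same_password))
print(upgrade_unsalted(1, unsalted_hash(same_password)) == salted_hash(1, same_password))
```

Under this assumption, a stale cached record holding the old unsalted value would no longer match the salted comparison if it were saved back, which fits the login problems speculated about in the 2 June and 10 June entries.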

Archives