commit a20106c85295d24487cdf571cde72f7a51df828d Author: Ben Hutchings Date: Thu Aug 2 14:38:04 2012 +0100 Linux 3.2.25 commit 901cf39b0b51c3c31ad0f0e3104742e4d1c831ef Author: Joonsoo Kim Date: Mon Jul 30 14:39:04 2012 -0700 mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page() commit dc32f63453f56d07a1073a697dcd843dd3098c09 upstream. Commit a6bc32b89922 ("mm: compaction: introduce sync-light migration for use by compaction") changed the declaration of migrate_pages() and migrate_huge_pages(). But it missed changing the argument of migrate_huge_pages() in soft_offline_huge_page(). In this case, we should call migrate_huge_pages() with MIGRATE_SYNC. Additionally, there is a mismatch between type the of argument and the function declaration for migrate_pages(). Signed-off-by: Joonsoo Kim Cc: Christoph Lameter Cc: Mel Gorman Acked-by: David Rientjes Cc: "Aneesh Kumar K.V" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 86fccf3e1ef30d1970dbd01ce69208ed1b334a8e Author: Maarten Lankhorst Date: Mon Jun 4 12:00:31 2012 +0200 nouveau: Fix alignment requirements on src and dst addresses commit ce806a30470bcd846d148bf39d46de3ad7748228 upstream. Linear copy works by adding the offset to the buffer address, which may end up not being 16-byte aligned. Some tests I've written for prime_pcopy show that the engine allows this correctly, so the restriction on lowest 4 bits of address can be lifted safely. The comments added were by envyas, I think because I used a newer version. Signed-off-by: Maarten Lankhorst [bwh: Backported to 3.2: no # prefixes in nva3_copy.fuc] Signed-off-by: Ben Hutchings commit 704c3d1f64a8f1200cafbcee732b89fe341fc18f Author: Chris Mason Date: Wed Jul 25 15:57:13 2012 -0400 Btrfs: call the ordered free operation without any locks held commit e9fbcb42201c862fd6ab45c48ead4f47bb2dea9d upstream. Each ordered operation has a free callback, and this was called with the worker spinlock held. Josef made the free callback also call iput, which we can't do with the spinlock. This drops the spinlock for the free operation and grabs it again before moving through the rest of the list. We'll circle back around to this and find a cleaner way that doesn't bounce the lock around so much. Signed-off-by: Chris Mason Signed-off-by: Ben Hutchings commit 86949428b4319a44d0c8ba776974143f938b6758 Author: Jerome Glisse Date: Thu Jul 19 17:25:55 2012 -0400 drm/radeon: on hotplug force link training to happen (v2) commit ca2ccde5e2f24a792caa4cca919fc5c6f65d1887 upstream. To have DP behave like VGA/DVI we need to retrain the link on hotplug. For this to happen we need to force link training to happen by setting connector dpms to off before asking it turning it on again. v2: agd5f - drop the dp_get_link_status() change in atombios_dp.c for now. We still need the dpms OFF change. Signed-off-by: Jerome Glisse Signed-off-by: Alex Deucher Signed-off-by: Dave Airlie Signed-off-by: Ben Hutchings commit 5f2e8ffa509e206725ad62119a5d6ba6294a0023 Author: Jerome Glisse Date: Thu Jul 19 17:15:56 2012 -0400 drm/radeon: fix hotplug of DP to DVI|HDMI passive adapters (v2) commit 266dcba541a1ef7e5d82d9e67c67fde2910636e8 upstream. No need to retrain the link for passive adapters. v2: agd5f - no passive DP to VGA adapters, update comments - assign radeon_connector_atom_dig after we are sure we have a digital connector as analog connectors have different private data. - get new sink type before checking for retrain. No need to check if it's no longer a DP connection. Signed-off-by: Jerome Glisse Signed-off-by: Alex Deucher Signed-off-by: Dave Airlie Signed-off-by: Ben Hutchings commit 7b38f3bc17d4f8b259d41d2a9646a1a3ad772fd5 Author: Jerome Glisse Date: Tue Jul 17 17:17:16 2012 -0400 drm/radeon: fix non revealent error message commit 8d1c702aa0b2c4b22b0742b72a1149d91690674b upstream. We want to print link status query failed only if it's an unexepected fail. If we query to see if we need link training it might be because there is nothing connected and thus link status query have the right to fail in that case. To avoid printing failure when it's expected, move the failure message to proper place. Signed-off-by: Jerome Glisse Signed-off-by: Dave Airlie Signed-off-by: Ben Hutchings commit 12342624d45832c39cd6fc4ad2cc00c17d9d3854 Author: Jerome Glisse Date: Thu Jul 12 18:23:05 2012 -0400 drm/radeon: fix bo creation retry path commit d1c7871ddb1f588b8eb35affd9ee1a3d5e11cd0c upstream. Retry label was at wrong place in function leading to memory leak. Signed-off-by: Jerome Glisse Reviewed-by: Michel Dänzer Reviewed-by: Christian König Signed-off-by: Dave Airlie [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings commit 3a928a5ea792e05f7acc8852babfce2273c3c185 Author: Lan Tianyu Date: Fri Jul 20 13:29:16 2012 +0800 ACPI/AC: prevent OOPS on some boxes due to missing check power_supply_register() return value check commit f197ac13f6eeb351b31250b9ab7d0da17434ea36 upstream. In the ac.c, power_supply_register()'s return value is not checked. As a result, the driver's add() ops may return success even though the device failed to initialize. For example, some BIOS may describe two ACADs in the same DSDT. The second ACAD device will fail to register, but ACPI driver's add() ops returns sucessfully. The ACPI device will receive ACPI notification and cause OOPS. https://bugzilla.redhat.com/show_bug.cgi?id=772730 Signed-off-by: Lan Tianyu Signed-off-by: Len Brown Signed-off-by: Ben Hutchings commit 295cc2b91b078eb04f42db40ee04788c685dd351 Author: J. Bruce Fields Date: Mon Jul 23 15:17:17 2012 -0400 locks: fix checking of fcntl_setlease argument commit 0ec4f431eb56d633da3a55da67d5c4b88886ccc7 upstream. The only checks of the long argument passed to fcntl(fd,F_SETLEASE,.) are done after converting the long to an int. Thus some illegal values may be let through and cause problems in later code. [ They actually *don't* cause problems in mainline, as of Dave Jones's commit 8d657eb3b438 "Remove easily user-triggerable BUG from generic_setlease", but we should fix this anyway. And this patch will be necessary to fix real bugs on earlier kernels. ] Signed-off-by: J. Bruce Fields Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit ad15432068b43135d7af36e555c9d48b0d37c3c3 Author: Mark Brown Date: Fri Jul 20 17:29:34 2012 +0100 ASoC: dapm: Fix _PRE and _POST events for DAPM performance improvements commit 0ff97ebf0804d2e519d578fcb4db03f104d2ca8c upstream. Ever since the DAPM performance improvements we've been marking all widgets as not dirty after each DAPM run. Since _PRE and _POST events aren't part of the DAPM graph this has rendered them non-functional, they will never be marked dirty again and thus will never be run again. Fix this by skipping them when marking widgets as not dirty. Signed-off-by: Mark Brown Acked-by: Liam Girdwood Signed-off-by: Ben Hutchings commit d9af29329519c623a5a9c8b39439ed2886c9f30b Author: Theodore Ts'o Date: Mon Jul 23 00:00:20 2012 -0400 ext4: undo ext4_calc_metadata_amount if we fail to claim space commit 03179fe92318e7934c180d96f12eff2cb36ef7b6 upstream. The function ext4_calc_metadata_amount() has side effects, although it's not obvious from its function name. So if we fail to claim space, regardless of whether we retry to claim the space again, or return an error, we need to undo these side effects. Otherwise we can end up incorrectly calculating the number of metadata blocks needed for the operation, which was responsible for an xfstests failure for test #271 when using an ext2 file system with delalloc enabled. Reported-by: Brian Foster Signed-off-by: "Theodore Ts'o" Signed-off-by: Ben Hutchings commit 50e7ae34b18bb8a4911349184685fc1c4d9c0f50 Author: Brian Foster Date: Sun Jul 22 23:59:40 2012 -0400 ext4: don't let i_reserved_meta_blocks go negative commit 97795d2a5b8d3c8dc4365d4bd3404191840453ba upstream. If we hit a condition where we have allocated metadata blocks that were not appropriately reserved, we risk underflow of ei->i_reserved_meta_blocks. In turn, this can throw sbi->s_dirtyclusters_counter significantly out of whack and undermine the nondelalloc fallback logic in ext4_nonda_switch(). Warn if this occurs and set i_allocated_meta_blocks to avoid this problem. This condition is reproduced by xfstests 270 against ext2 with delalloc enabled: Mar 28 08:58:02 localhost kernel: [ 171.526344] EXT4-fs (loop1): delayed block allocation failed for inode 14 at logical offset 64486 with max blocks 64 with error -28 Mar 28 08:58:02 localhost kernel: [ 171.526346] EXT4-fs (loop1): This should not happen!! Data will be lost 270 ultimately fails with an inconsistent filesystem and requires an fsck to repair. The cause of the error is an underflow in ext4_da_update_reserve_space() due to an unreserved meta block allocation. Signed-off-by: Brian Foster Signed-off-by: "Theodore Ts'o" Signed-off-by: Ben Hutchings commit 99a779227aef0e93a77f7b962b266416f0ead9a6 Author: Daniel Drake Date: Tue Jul 3 23:13:39 2012 +0100 mmc: sdhci-pci: CaFe has broken card detection commit 55fc05b7414274f17795cd0e8a3b1546f3649d5e upstream. At http://dev.laptop.org/ticket/11980 we have determined that the Marvell CaFe SDHCI controller reports bad card presence during resume. It reports that no card is present even when it is. This is a regression -- resume worked back around 2.6.37. Around 400ms after resuming, a "card inserted" interrupt is generated, at which point it starts reporting presence. Work around this hardware oddity by setting the SDHCI_QUIRK_BROKEN_CARD_DETECTION flag. Thanks to Chris Ball for helping with diagnosis. Signed-off-by: Daniel Drake [stable@: please apply to 3.0+] Signed-off-by: Chris Ball Signed-off-by: Ben Hutchings commit e87ebd5357b62b81c70d36c79b1945b45c4c8404 Author: Al Viro Date: Sat Jul 21 08:55:18 2012 +0100 iscsi-target: Drop bogus struct file usage for iSCSI/SCTP commit bf6932f44a7b3fa7e2246a8b18a44670e5eab6c2 upstream. From Al Viro: BTW, speaking of struct file treatment related to sockets - there's this piece of code in iscsi: /* * The SCTP stack needs struct socket->file. */ if ((np->np_network_transport == ISCSI_SCTP_TCP) || (np->np_network_transport == ISCSI_SCTP_UDP)) { if (!new_sock->file) { new_sock->file = kzalloc( sizeof(struct file), GFP_KERNEL); For one thing, as far as I can see it'not true - sctp does *not* depend on socket->file being non-NULL; it does, in one place, check socket->file->f_flags for O_NONBLOCK, but there it treats NULL socket->file as "flag not set". Which is the case here anyway - the fake struct file created in __iscsi_target_login_thread() (and in iscsi_target_setup_login_socket(), with the same excuse) do *not* get that flag set. Moreover, it's a bloody serious violation of a bunch of asserts in VFS; all struct file instances should come from filp_cachep, via get_empty_filp() (or alloc_file(), which is a wrapper for it). FWIW, I'm very tempted to do this and be done with the entire mess: Signed-off-by: Al Viro Cc: Andy Grover Cc: Hannes Reinecke Cc: Christoph Hellwig Signed-off-by: Nicholas Bellinger [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings commit 533c8ed12eea42986cb692c45c4c86512d1ca37d Author: Dan Williams Date: Thu Jun 21 23:36:20 2012 -0700 libsas: fix sas_discover_devices return code handling commit b17caa174a7e1fd2e17b26e210d4ee91c4c28b37 upstream. commit 198439e4 [SCSI] libsas: do not set res = 0 in sas_ex_discover_dev() commit 19252de6 [SCSI] libsas: fix wide port hotplug issues The above commits seem to have confused the return value of sas_ex_discover_dev which is non-zero on failure and sas_ex_join_wide_port which just indicates short circuiting discovery on already established ports. The result is random discovery failures depending on configuration. Calls to sas_ex_join_wide_port are the source of the trouble as its return value is errantly assigned to 'res'. Convert it to bool and stop returning its result up the stack. Tested-by: Dan Melnic Reported-by: Dan Melnic Signed-off-by: Dan Williams Reviewed-by: Jack Wang Signed-off-by: James Bottomley Signed-off-by: Ben Hutchings commit 705c1a5ba71fc1fad9ed0f2fd15e31db2c623dfd Author: Dan Williams Date: Thu Jun 21 23:36:15 2012 -0700 libsas: continue revalidation commit 26f2f199ff150d8876b2641c41e60d1c92d2fb81 upstream. Continue running revalidation until no more broadcast devices are discovered. Fixes cases where re-discovery completes too early in a domain with multiple expanders with pending re-discovery events. Servicing BCNs can get backed up behind error recovery. Signed-off-by: Dan Williams Signed-off-by: James Bottomley Signed-off-by: Ben Hutchings commit c6e92669dd2703bd2243c2b16cd5a5283135151b Author: Dan Williams Date: Thu Jun 21 23:25:32 2012 -0700 fix eh wakeup (scsi_schedule_eh vs scsi_restart_operations) commit 57fc2e335fd3c2f898ee73570dc81426c28dc7b4 upstream. Rapid ata hotplug on a libsas controller results in cases where libsas is waiting indefinitely on eh to perform an ata probe. A race exists between scsi_schedule_eh() and scsi_restart_operations() in the case when scsi_restart_operations() issues i/o to other devices in the sas domain. When this happens the host state transitions from SHOST_RECOVERY (set by scsi_schedule_eh) back to SHOST_RUNNING and ->host_busy is non-zero so we put the eh thread to sleep even though ->host_eh_scheduled is active. Before putting the error handler to sleep we need to check if the host_state needs to return to SHOST_RECOVERY for another trip through eh. Since i/o that is released by scsi_restart_operations has been blocked for at least one eh cycle, this implementation allows those i/o's to run before another eh cycle starts to discourage hung task timeouts. Reported-by: Tom Jackson Tested-by: Tom Jackson Signed-off-by: Dan Williams Signed-off-by: James Bottomley Signed-off-by: Ben Hutchings commit f60c4d00f7e25849c1e743cf66fde70bdac63dae Author: Dan Williams Date: Thu Jun 21 23:47:28 2012 -0700 fix hot unplug vs async scan race commit 3b661a92e869ebe2358de8f4b3230ad84f7fce51 upstream. The following crash results from cases where the end_device has been removed before scsi_sysfs_add_sdev has had a chance to run. BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 IP: [] sysfs_create_dir+0x32/0xb6 ... Call Trace: [] kobject_add_internal+0x120/0x1e3 [] ? trace_hardirqs_on+0xd/0xf [] kobject_add_varg+0x41/0x50 [] kobject_add+0x64/0x66 [] device_add+0x12d/0x63a [] ? _raw_spin_unlock_irqrestore+0x47/0x56 [] ? module_refcount+0x89/0xa0 [] scsi_sysfs_add_sdev+0x4e/0x28a [] do_scan_async+0x9c/0x145 ...teach scsi_sysfs_add_devices() to check for deleted devices() before trying to add them, and teach scsi_remove_target() how to remove targets that have not been added via device_add(). Reported-by: Dariusz Majchrzak Signed-off-by: Dan Williams Signed-off-by: James Bottomley Signed-off-by: Ben Hutchings commit 09411e4280933eb335e1f723dc215a5583c3be8f Author: Bart Van Assche Date: Fri Jun 29 15:34:26 2012 +0000 Avoid dangling pointer in scsi_requeue_command() commit 940f5d47e2f2e1fa00443921a0abf4822335b54d upstream. When we call scsi_unprep_request() the command associated with the request gets destroyed and therefore drops its reference on the device. If this was the only reference, the device may get released and we end up with a NULL pointer deref when we call blk_requeue_request. Reported-by: Mike Christie Signed-off-by: Bart Van Assche Reviewed-by: Mike Christie Reviewed-by: Tejun Heo [jejb: enhance commend and add commit log for stable] Signed-off-by: James Bottomley Signed-off-by: Ben Hutchings commit 9c63d964e3f093b858a6f06875a68bc4206f71ba Author: Bart Van Assche Date: Fri Jun 29 15:33:22 2012 +0000 Fix device removal NULL pointer dereference commit 67bd94130015c507011af37858989b199c52e1de upstream. Use blk_queue_dead() to test whether the queue is dead instead of !sdev. Since scsi_prep_fn() may be invoked concurrently with __scsi_remove_device(), keep the queuedata (sdev) pointer in __scsi_remove_device(). This patch fixes a kernel oops that can be triggered by USB device removal. See also http://www.spinics.net/lists/linux-scsi/msg56254.html. Other changes included in this patch: - Swap the blk_cleanup_queue() and kfree() calls in scsi_host_dev_release() to make that code easier to grasp. - Remove the queue dead check from scsi_run_queue() since the queue state can change anyway at any point in that function where the queue lock is not held. - Remove the queue dead check from the start of scsi_request_fn() since it is redundant with the scsi_device_online() check. Reported-by: Jun'ichi Nomura Signed-off-by: Bart Van Assche Reviewed-by: Mike Christie Reviewed-by: Tejun Heo Signed-off-by: James Bottomley Signed-off-by: Ben Hutchings commit 68e9e9fee2bbec4853a993e98b0df8479292f572 Author: Tejun Heo Date: Wed Dec 14 00:33:37 2011 +0100 block: add blk_queue_dead() commit 34f6055c80285e4efb3f602a9119db75239744dc upstream. There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead() macro and use it. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo Signed-off-by: Jens Axboe Signed-off-by: Ben Hutchings commit a7c5635a91e082f0ac07df5fed58014d70296ef6 Author: Dylan Reid Date: Thu Jul 19 17:52:58 2012 -0700 ALSA: hda - Turn on PIN_OUT from hdmi playback prepare. commit 9e76e6d031482194a5b24d8e9ab88063fbd6b4b5 upstream. Turn on the pin widget's PIN_OUT bit from playback prepare. The pin is enabled in open, but is disabled in hdmi_init_pin which is called during system resume. This causes a system suspend/resume during playback to mute HDMI/DP. Enabling the pin in prepare instead of open allows calling snd_pcm_prepare after a system resume to restore audio. Signed-off-by: Dylan Reid Signed-off-by: Takashi Iwai Signed-off-by: Ben Hutchings commit 0bc80e0aca34d785bf3b4476a11e4aba668a6979 Author: Michel Dänzer Date: Tue Jul 17 19:02:09 2012 +0200 drm/radeon: Try harder to avoid HW cursor ending on a multiple of 128 columns. commit f60ec4c7df043df81e62891ac45383d012afe0da upstream. This could previously fail if either of the enabled displays was using a horizontal resolution that is a multiple of 128, and only the leftmost column of the cursor was (supposed to be) visible at the right edge of that display. The solution is to move the cursor one pixel to the left in that case. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=33183 Signed-off-by: Michel Dänzer Reviewed-by: Alex Deucher Signed-off-by: Dave Airlie Signed-off-by: Ben Hutchings commit a6624e8b990f77bd4ee64f77c2b95c7df005f002 Author: Joerg Roedel Date: Thu Jul 19 13:42:54 2012 +0200 iommu/amd: Fix hotplug with iommu=pt commit 2c9195e990297068d0f1f1bd8e2f1d09538009da upstream. This did not work because devices are not put into the pt_domain. Fix this. Signed-off-by: Joerg Roedel [bwh: Backported to 3.2: do not use iommu_dev_data::passthrough] Signed-off-by: Ben Hutchings commit bac92b49c77bd2cb455770a8f1c567817907428b Author: David Henningsson Date: Wed Jul 18 07:38:46 2012 +0200 ALSA: hda - Add support for Realtek ALC282 commit 4e01ec636e64707d202a1ca21a47bbc6d53085b7 upstream. This codec has a separate dmic path (separate dmic only ADC), and thus it looks mostly like ALC275. BugLink: https://bugs.launchpad.net/bugs/1025377 Tested-by: Ray Chen Signed-off-by: David Henningsson Signed-off-by: Takashi Iwai Signed-off-by: Ben Hutchings commit c4f1f3253303865d0fda62c27057009096eeb6ef Author: Tejun Heo Date: Tue Jul 17 12:39:26 2012 -0700 workqueue: perform cpu down operations from low priority cpu_notifier() commit 6575820221f7a4dd6eadecf7bf83cdd154335eda upstream. Currently, all workqueue cpu hotplug operations run off CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to ensure that workqueue is up and running while bringing up a CPU before other notifiers try to use workqueue on the CPU. Per-cpu workqueues are supposed to remain working and bound to the CPU for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even with workqueue offlining running with higher priority because workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which runs the per-cpu workqueue without concurrency management without explicitly detaching the existing workers. However, if the trustee needs to create new workers, it creates unbound workers which may wander off to other CPUs while CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU down is cancelled, the per-CPU workqueue may end up with workers which aren't bound to the CPU. While reliably reproducible with a convoluted artificial test-case involving scheduling and flushing CPU burning work items from CPU down notifiers, this isn't very likely to happen in the wild, and, even when it happens, the effects are likely to be hidden by the following successful CPU down. Fix it by using different priorities for up and down notifiers - high priority for up operations and low priority for down operations. Workqueue cpu hotplug operations will soon go through further cleanup. Signed-off-by: Tejun Heo Acked-by: "Rafael J. Wysocki" Signed-off-by: Ben Hutchings commit f46edd5989ae87bd9eb05041c3d275ccc4abe081 Author: Forest Bond Date: Fri Jul 13 12:26:06 2012 -0400 rtlwifi: rtl8192de: Fix phy-based version calculation commit f1b00f4dab29b57bdf1bc03ef12020b280fd2a72 upstream. Commit d83579e2a50ac68389e6b4c58b845c702cf37516 incorporated some changes from the vendor driver that made it newly important that the calculated hardware version correctly include the CHIP_92D bit, as all of the IS_92D_* macros were changed to depend on it. However, this bit was being unset for dual-mac, dual-phy devices. The vendor driver behavior was modified to not do this, but unfortunately this change was not picked up along with the others. This caused scanning in the 2.4GHz band to be broken, and possibly other bugs as well. This patch brings the version calculation logic in parity with the vendor driver in this regard, and in doing so fixes the regression. However, the version calculation code in general continues to be largely incoherent and messy, and needs to be cleaned up. Signed-off-by: Forest Bond Signed-off-by: Larry Finger Signed-off-by: John W. Linville Signed-off-by: Ben Hutchings commit 88981d86ee7c3d3bcfe75f359363658478667fa5 Author: Heiko Carstens Date: Fri Jul 13 15:45:33 2012 +0200 s390/idle: fix sequence handling vs cpu hotplug commit 0008204ffe85d23382d6fd0f971f3f0fbe70bae2 upstream. The s390 idle accounting code uses a sequence counter which gets used when the per cpu idle statistics get updated and read. One assumption on read access is that only when the sequence counter is even and did not change while reading all values the result is valid. On cpu hotplug however the per cpu data structure gets initialized via a cpu hotplug notifier on CPU_ONLINE. CPU_ONLINE however is too late, since the onlined cpu is already running and might access the per cpu data. Worst case is that the data structure gets initialized while an idle thread is updating its idle statistics. This will result in an uneven sequence counter after an update. As a result user space tools like top, which access /proc/stat in order to get idle stats, will busy loop waiting for the sequence counter to become even again, which will never happen until the queried cpu will update its idle statistics again. And even then the sequence counter will only have an even value for a couple of cpu cycles. Fix this by moving the initialization of the per cpu idle statistics to cpu_init(). I prefer that solution in favor of changing the notifier to CPU_UP_PREPARE, which would be a different solution to the problem. Signed-off-by: Heiko Carstens Signed-off-by: Martin Schwidefsky [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings commit 3e51f8abdd9f7c7a70f1f77295c46a88a46766a6 Author: Roland Dreier Date: Mon Jul 16 15:34:25 2012 -0700 target: Check number of unmap descriptors against our limit commit 7409a6657aebf8be74c21d0eded80709b27275cb upstream. Fail UNMAP commands that have more than our reported limit on unmap descriptors. Signed-off-by: Roland Dreier Signed-off-by: Nicholas Bellinger [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings commit 76191bb22af0d68261d5d9a9b6294d68133813e0 Author: Roland Dreier Date: Mon Jul 16 15:34:24 2012 -0700 target: Fix possible integer underflow in UNMAP emulation commit b7fc7f3777582dea85156a821d78a522a0c083aa upstream. It's possible for an initiator to send us an UNMAP command with a descriptor that is less than 8 bytes; in that case it's really bad for us to set an unsigned int to that value, subtract 8 from it, and then use that as a limit for our loop (since the value will wrap around to a huge positive value). Fix this by making size be signed and only looping if size >= 16 (ie if we have at least a full descriptor available). Also remove offset as an obfuscated name for the constant 8. Signed-off-by: Roland Dreier Signed-off-by: Nicholas Bellinger [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings commit b67183922769604a26c7a5ed1560fb71fd805635 Author: Roland Dreier Date: Mon Jul 16 15:34:23 2012 -0700 target: Fix reading of data length fields for UNMAP commands commit 1a5fa4576ec8a462313c7516b31d7453481ddbe8 upstream. The UNMAP DATA LENGTH and UNMAP BLOCK DESCRIPTOR DATA LENGTH fields are in the unmap descriptor (the payload transferred to our data out buffer), not in the CDB itself. Read them from the correct place in target_emulated_unmap. Signed-off-by: Roland Dreier Signed-off-by: Nicholas Bellinger [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings commit 8681d6103dc63879b6b679b2bfa9f59121277b41 Author: Roland Dreier Date: Mon Jul 16 15:34:22 2012 -0700 target: Add range checking to UNMAP emulation commit 2594e29865c291db162313187612cd9f14538f33 upstream. When processing an UNMAP command, we need to make sure that the number of blocks we're asked to UNMAP does not exceed our reported maximum number of blocks per UNMAP, and that the range of blocks we're unmapping doesn't go past the end of the device. Signed-off-by: Roland Dreier Signed-off-by: Nicholas Bellinger [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings commit 0c2305d15f446ea286545b12587a7c836a1fb08c Author: Roland Dreier Date: Mon Jul 16 15:34:21 2012 -0700 target: Add generation of LOGICAL BLOCK ADDRESS OUT OF RANGE commit e2397c704429025bc6b331a970f699e52f34283e upstream. Many SCSI commands are defined to return a CHECK CONDITION / ILLEGAL REQUEST with ASC set to LOGICAL BLOCK ADDRESS OUT OF RANGE if the initiator sends a command that accesses a too-big LBA. Add an enum value and case entries so that target code can return this status. Signed-off-by: Roland Dreier Signed-off-by: Nicholas Bellinger Signed-off-by: Ben Hutchings commit 6e59fd8e47abf43042a41b904f095c02c47a8468 Author: Bjørn Mork Date: Thu Jul 12 12:37:32 2012 +0200 USB: option: add ZTE MF821D commit 09110529780890804b22e997ae6b4fe3f0b3b158 upstream. Sold by O2 (telefonica germany) under the name "LTE4G" Tested-by: Thomas Schäfer Signed-off-by: Bjørn Mork Signed-off-by: Greg Kroah-Hartman Signed-off-by: Ben Hutchings commit e93afa13c1f85af4eb6c3ceb6d5f4f7ac11b86d2 Author: Andrew Bird (Sphere Systems) Date: Sun Mar 25 00:10:28 2012 +0000 USB: option: Ignore ZTE (Vodafone) K3570/71 net interfaces commit f264ddea0109bf7ce8aab920d64a637e830ace5b upstream. These interfaces need to be handled by QMI/WWAN driver Signed-off-by: Andrew Bird Signed-off-by: David S. Miller Signed-off-by: Ben Hutchings commit 91953dc518f589779c11f3774c6d7ca5b2bb86c6 Author: Amitkumar Karwar Date: Wed Jul 11 18:12:57 2012 -0700 mwifiex: correction in mcs index check commit fe020120cb863ba918c6d603345342a880272c4d upstream. mwifiex driver supports 2x2 chips as well. Hence valid mcs values are 0 to 15. The check for mcs index is corrected in this patch. For example: if 40MHz is enabled and mcs index is 11, "iw link" command would show "tx bitrate: 108.0 MBit/s" without this patch. Now it shows "tx bitrate: 108.0 MBit/s MCS 11 40Mhz" with the patch. Signed-off-by: Amitkumar Karwar Signed-off-by: Bing Zhao Signed-off-by: John W. Linville Signed-off-by: Ben Hutchings commit 216fce53c4a85d012a5dc28eff803b9a28874ba8 Author: Tiejun Chen Date: Wed Jul 11 14:22:46 2012 +1000 powerpc: Add "memory" attribute for mfmsr() commit b416c9a10baae6a177b4f9ee858b8d309542fbef upstream. Add "memory" attribute in inline assembly language as a compiler barrier to make sure 4.6.x GCC don't reorder mfmsr(). Signed-off-by: Tiejun Chen Signed-off-by: Benjamin Herrenschmidt Signed-off-by: Ben Hutchings commit c17fcfb52422d098dde1ccdc4a3d24a7ceeb33e2 Author: Jan Kara Date: Tue Jul 10 17:58:04 2012 +0200 udf: Improve table length check to avoid possible overflow commit 57b9655d01ef057a523e810d29c37ac09b80eead upstream. When a partition table length is corrupted to be close to 1 << 32, the check for its length may overflow on 32-bit systems and we will think the length is valid. Later on the kernel can crash trying to read beyond end of buffer. Fix the check to avoid possible overflow. Reported-by: Ben Hutchings Signed-off-by: Jan Kara Signed-off-by: Ben Hutchings commit e44be732863c136e9560c6bf7bf9b9d709a59fca Author: Theodore Ts'o Date: Mon Jul 9 16:27:05 2012 -0400 ext4: fix overhead calculation used by ext4_statfs() commit 952fc18ef9ec707ebdc16c0786ec360295e5ff15 upstream. Commit f975d6bcc7a introduced bug which caused ext4_statfs() to miscalculate the number of file system overhead blocks. This causes the f_blocks field in the statfs structure to be larger than it should be. This would in turn cause the "df" output to show the number of data blocks in the file system and the number of data blocks used to be larger than they should be. Signed-off-by: "Theodore Ts'o" [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings commit 6d0511498af83b9d09be2f52a69e3f9aca5db81e Author: Hans de Goede Date: Wed Jul 4 09:18:01 2012 +0200 usbdevfs: Correct amount of data copied to user in processcompl_compat commit 2102e06a5f2e414694921f23591f072a5ba7db9f upstream. iso data buffers may have holes in them if some packets were short, so for iso urbs we should always copy the entire buffer, just like the regular processcompl does. Signed-off-by: Hans de Goede Acked-by: Alan Stern Signed-off-by: Greg Kroah-Hartman Signed-off-by: Ben Hutchings commit 87b98a1d1bd63f002763b5cb153928d260850234 Author: Borislav Petkov Date: Thu Jun 21 14:07:16 2012 +0200 x86, microcode: Sanitize per-cpu microcode reloading interface commit c9fc3f778a6a215ace14ee556067c73982b6d40f upstream. Microcode reloading in a per-core manner is a very bad idea for both major x86 vendors. And the thing is, we have such interface with which we can end up with different microcode versions applied on different cores of an otherwise homogeneous wrt (family,model,stepping) system. So turn off the possibility of doing that per core and allow it only system-wide. This is a minimal fix which we'd like to see in stable too thus the more-or-less arbitrary decision to allow system-wide reloading only on the BSP: $ echo 1 > /sys/devices/system/cpu/cpu0/microcode/reload ... and disable the interface on the other cores: $ echo 1 > /sys/devices/system/cpu/cpu23/microcode/reload -bash: echo: write error: Invalid argument Also, allowing the reload only from one CPU (the BSP in that case) doesn't allow the reload procedure to degenerate into an O(n^2) deal when triggering reloads from all /sys/devices/system/cpu/cpuX/microcode/reload sysfs nodes simultaneously. A more generic fix will follow. Cc: Henrique de Moraes Holschuh Cc: Peter Zijlstra Signed-off-by: Borislav Petkov Link: http://lkml.kernel.org/r/1340280437-7718-2-git-send-email-bp@amd64.org Signed-off-by: H. Peter Anvin Signed-off-by: Ben Hutchings commit f7b867b55f47fb4330ea7b8a19dac2ce196258f0 Author: Shuah Khan Date: Sun May 6 11:11:04 2012 -0600 x86, microcode: microcode_core.c simple_strtoul cleanup commit e826abd523913f63eb03b59746ffb16153c53dc4 upstream. Change reload_for_cpu() in kernel/microcode_core.c to call kstrtoul() instead of calling obsoleted simple_strtoul(). Signed-off-by: Shuah Khan Reviewed-by: Borislav Petkov Link: http://lkml.kernel.org/r/1336324264.2897.9.camel@lorien2 Signed-off-by: H. Peter Anvin Signed-off-by: Ben Hutchings commit 701aa1144e7576cc3a6a9858d2f6ce7ba92bdec4 Author: Srivatsa S. Bhat Date: Sat Jun 16 15:30:45 2012 +0200 ftrace: Disable function tracing during suspend/resume and hibernation, again commit 443772d408a25af62498793f6f805ce3c559309a upstream. If function tracing is enabled for some of the low-level suspend/resume functions, it leads to triple fault during resume from suspend, ultimately ending up in a reboot instead of a resume (or a total refusal to come out of suspended state, on some machines). This issue was explained in more detail in commit f42ac38c59e0a03d (ftrace: disable tracing for suspend to ram). However, the changes made by that commit got reverted by commit cbe2f5a6e84eebb (tracing: allow tracing of suspend/resume & hibernation code again). So, unfortunately since things are not yet robust enough to allow tracing of low-level suspend/resume functions, suspend/resume is still broken when ftrace is enabled. So fix this by disabling function tracing during suspend/resume & hibernation. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Rafael J. Wysocki Signed-off-by: Ben Hutchings commit 2a129c733126dba186f325cb29041e49c6c22347 Author: Theodore Ts'o Date: Sat Jun 30 19:14:57 2012 -0400 ext4: pass a char * to ext4_count_free() instead of a buffer_head ptr commit f6fb99cadcd44660c68e13f6eab28333653621e6 upstream. Make it possible for ext4_count_free to operate on buffers and not just data in buffer_heads. Signed-off-by: "Theodore Ts'o" [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings commit 0331c3a64bd80c9fd4cb453a19d0e7c888ca50ce Author: Kevin Cernekee Date: Sun Jun 24 21:11:22 2012 -0700 usb: gadget: Fix g_ether interface link status commit 31bde1ceaa873bcaecd49e829bfabceacc4c512d upstream. A "usb0" interface that has never been connected to a host has an unknown operstate, and therefore the IFF_RUNNING flag is (incorrectly) asserted when queried by ifconfig, ifplugd, etc. This is a result of calling netif_carrier_off() too early in the probe function; it should be called after register_netdev(). Similar problems have been fixed in many other drivers, e.g.: e826eafa6 (bonding: Call netif_carrier_off after register_netdevice) 0d672e9f8 (drivers/net: Call netif_carrier_off at the end of the probe) 6a3c869a6 (cxgb4: fix reported state of interfaces without link) Fix is to move netif_carrier_off() to the end of the function. Signed-off-by: Kevin Cernekee Signed-off-by: Felipe Balbi Signed-off-by: Ben Hutchings commit 8d54ec42181c58503d7432f5bba7e781e20f13d4 Author: Albert Pool Date: Mon May 14 18:08:32 2012 +0200 rt2800usb: 2001:3c17 is an RT3370 device commit 8fd9d059af12786341dec5a688e607bcdb372238 upstream. D-Link DWA-123 rev A1 Signed-off-by: Albert Pool Acked-by: Gertjan van Wingerde Signed-off-by: John W. Linville Signed-off-by: Ben Hutchings commit cd7100a842c28ec46c015b7e3e2c5d360f50d4fa Author: Xose Vazquez Perez Date: Tue Apr 17 16:28:05 2012 +0200 wireless: rt2x00: rt2800usb more devices were identified commit e828b9fb4f6c3513950759d5fb902db5bd054048 upstream. found in 2012_03_22_RT5572_Linux_STA_v2.6.0.0_DPO RT3070: (0x2019,0x5201) Planex Communications, Inc. RT8070 (0x7392,0x4085) 2L Central Europe BV 8070 7392 is Edimax RT35xx: (0x1690,0x0761) Askey was Fujitsu Stylistic 550, but 1690 is Askey Signed-off-by: Xose Vazquez Perez Acked-by: Gertjan van Wingerde Signed-off-by: John W. Linville Signed-off-by: Ben Hutchings commit 0fd47a3fd485742b146e1bec8fbd422b9dd57c8c Author: Xose Vazquez Perez Date: Tue Apr 17 01:50:32 2012 +0200 wireless: rt2x00: rt2800usb add more devices ids commit 63b376411173c343bbcb450f95539da91f079e0c upstream. They were taken from ralink drivers: 2011_0719_RT3070_RT3370_RT5370_RT5372_Linux_STA_V2.5.0.3_DPO 2012_03_22_RT5572_Linux_STA_v2.6.0.0_DPO 0x1eda,0x2210 RT3070 Airties 0x083a,0xb511 RT3370 Panasonic 0x0471,0x20dd RT3370 Philips 0x1690,0x0764 RT35xx Askey 0x0df6,0x0065 RT35xx Sitecom 0x0df6,0x0066 RT35xx Sitecom 0x0df6,0x0068 RT35xx Sitecom 0x2001,0x3c1c RT5370 DLink 0x2001,0x3c1d RT5370 DLink 2001 is D-Link not Alpha Signed-off-by: Xose Vazquez Perez Acked-by: Gertjan van Wingerde Signed-off-by: John W. Linville [bwh: Backported to 3.2: drop the 5372 devices] Signed-off-by: Ben Hutchings commit bb761c983bed94f5bedc6751a51dfab847a6bac3 Author: Jeff Layton Date: Wed Jul 11 09:09:36 2012 -0400 cifs: when CONFIG_HIGHMEM is set, serialize the read/write kmaps commit 3cf003c08be785af4bee9ac05891a15bcbff856a upstream. Jian found that when he ran fsx on a 32 bit arch with a large wsize the process and one of the bdi writeback kthreads would sometimes deadlock with a stack trace like this: crash> bt PID: 2789 TASK: f02edaa0 CPU: 3 COMMAND: "fsx" #0 [eed63cbc] schedule at c083c5b3 #1 [eed63d80] kmap_high at c0500ec8 #2 [eed63db0] cifs_async_writev at f7fabcd7 [cifs] #3 [eed63df0] cifs_writepages at f7fb7f5c [cifs] #4 [eed63e50] do_writepages at c04f3e32 #5 [eed63e54] __filemap_fdatawrite_range at c04e152a #6 [eed63ea4] filemap_fdatawrite at c04e1b3e #7 [eed63eb4] cifs_file_aio_write at f7fa111a [cifs] #8 [eed63ecc] do_sync_write at c052d202 #9 [eed63f74] vfs_write at c052d4ee #10 [eed63f94] sys_write at c052df4c #11 [eed63fb0] ia32_sysenter_target at c0409a98 EAX: 00000004 EBX: 00000003 ECX: abd73b73 EDX: 012a65c6 DS: 007b ESI: 012a65c6 ES: 007b EDI: 00000000 SS: 007b ESP: bf8db178 EBP: bf8db1f8 GS: 0033 CS: 0073 EIP: 40000424 ERR: 00000004 EFLAGS: 00000246 Each task would kmap part of its address array before getting stuck, but not enough to actually issue the write. This patch fixes this by serializing the marshal_iov operations for async reads and writes. The idea here is to ensure that cifs aggressively tries to populate a request before attempting to fulfill another one. As soon as all of the pages are kmapped for a request, then we can unlock and allow another one to proceed. There's no need to do this serialization on non-CONFIG_HIGHMEM arches however, so optimize all of this out when CONFIG_HIGHMEM isn't set. Reported-by: Jian Li Signed-off-by: Jeff Layton Signed-off-by: Steve French [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings commit e67de1a52fb8fdad45a4a99d2bcb0ac41d1a75ab Author: françois romieu Date: Wed Jun 20 12:09:18 2012 +0000 r8169: RxConfig hack for the 8168evl. commit eb2dc35d99028b698cdedba4f5522bc43e576bd2 upstream. The 8168evl (RTL_GIGA_MAC_VER_34) based Gigabyte GA-990FXA motherboards are very prone to NETDEV watchdog problems without this change. See https://bugzilla.kernel.org/show_bug.cgi?id=42899 for instance. I don't know why it *works*. It's depressingly effective though. For the record: - the problem may go along IOMMU (AMD-Vi) errors but it really looks like a red herring. - the patch sets the RX_MULTI_EN bit. If the 8168c doc is any guide, the chipset now fetches several Rx descriptors at a time. - long ago the driver ignored the RX_MULTI_EN bit. e542a2269f232d61270ceddd42b73a4348dee2bb changed the RxConfig settings. Whatever the problem it's now labeled a regression. - Realtek's own driver can identify two different 8168evl devices (CFG_METHOD_16 and CFG_METHOD_17) where the r8169 driver only sees one. It sucks. Signed-off-by: Francois Romieu Signed-off-by: David S. Miller Signed-off-by: Ben Hutchings commit 2c608967d48e65f6d6ef3727a58e4608c1784f22 Author: Alan Cox Date: Tue May 15 18:44:15 2012 +0100 x86: Fix boot on Twinhead H12Y commit 80b3e557371205566a71e569fbfcce5b11f92dbe upstream. Despite lots of investigation into why this is needed we don't know or have an elegant cure. The only answer found on this laptop is to mark a problem region as used so that Linux doesn't put anything there. Currently all the users add reserve= command lines and anyone not knowing this needs to find the magic page that documents it. Automate it instead. Signed-off-by: Alan Cox Tested-and-bugfixed-by: Arne Fitzenreiter Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=10231 Link: http://lkml.kernel.org/r/20120515174347.5109.94551.stgit@bluebook Signed-off-by: Ingo Molnar Signed-off-by: Ben Hutchings commit 1b5b48b8974d2df187f27f99445ba2db44030dd0 Author: Ezequiel Garcia Date: Wed Jul 18 10:05:26 2012 -0300 cx25821: Remove bad strcpy to read-only char* commit 380e99fc44d79bc35af9ff1d3316ef4027ce775e upstream. The strcpy was being used to set the name of the board. Since the destination char* was read-only and the name is set statically at compile time; this was both wrong and redundant. The type of char* is changed to const char* to prevent future errors. Reported-by: Radek Masin Signed-off-by: Ezequiel Garcia [ Taking directly due to vacations - Linus ] Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 318ee7542a4f4a17a743668ae1d547e3c97a49f3 Author: roger blofeld Date: Thu Jun 21 05:27:14 2012 +0000 powerpc/ftrace: Fix assembly trampoline register usage commit fd5a42980e1cf327b7240adf5e7b51ea41c23437 upstream. Just like the module loader, ftrace needs to be updated to use r12 instead of r11 with newer gcc's. Signed-off-by: Roger Blofeld Signed-off-by: Benjamin Herrenschmidt Signed-off-by: Ben Hutchings commit c812fe636b6e1e3b3b409b316157f56054252e25 Author: Peter Zijlstra Date: Thu May 17 17:15:29 2012 +0200 sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra Cc: pjt@google.com Cc: Venkatesh Pallipadi Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org Signed-off-by: Ingo Molnar [bwh: Backported to 3.2: adjust filenames and context] Signed-off-by: Ben Hutchings commit 1217fd87cfd53eab752837f19386e0a5594719d9 Author: Peter Zijlstra Date: Fri May 11 17:31:26 2012 +0200 sched/nohz: Fix rq->cpu_load[] calculations commit 556061b00c9f2fd6a5524b6bde823ef12f299ecf upstream. While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Signed-off-by: Peter Zijlstra Cc: pjt@google.com Cc: Venkatesh Pallipadi Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org Signed-off-by: Ingo Molnar [bwh: Backported to 3.2: adjust filenames and context; keep functions static] Signed-off-by: Ben Hutchings commit 2b2c3e47f0a98261d5d6ab7f61420fb4e479b472 Author: Mark Rustad Date: Thu Jun 21 12:23:42 2012 -0700 Fix NULL dereferences in scsi_cmd_to_driver commit 222a806af830fda34ad1f6bc991cd226916de060 upstream. Avoid crashing if the private_data pointer happens to be NULL. This has been seen sometimes when a host reset happens, notably when there are many LUNs: host3: Assigned Port ID 0c1601 scsi host3: libfc: Host reset succeeded on port (0c1601) BUG: unable to handle kernel NULL pointer dereference at 0000000000000350 IP: [] scsi_send_eh_cmnd+0x58/0x3a0 Process scsi_eh_3 (pid: 4144, threadinfo ffff88030920c000, task ffff880326b160c0) Stack: 000000010372e6ba 0000000000000282 000027100920dca0 ffffffffa0038ee0 0000000000000000 0000000000030003 ffff88030920dc80 ffff88030920dc80 00000002000e0000 0000000a00004000 ffff8803242f7760 ffff88031326ed80 Call Trace: [] ? lock_timer_base+0x70/0x70 [] scsi_eh_tur+0x3e/0xc0 [] scsi_eh_test_devices+0x76/0x170 [] scsi_eh_host_reset+0x85/0x160 [] scsi_eh_ready_devs+0x91/0x110 [] scsi_unjam_host+0xed/0x1f0 [] scsi_error_handler+0x1a8/0x200 [] ? scsi_unjam_host+0x1f0/0x1f0 [] kthread+0x9e/0xb0 [] kernel_thread_helper+0x4/0x10 [] ? kthread_freezable_should_stop+0x70/0x70 [] ? gs_change+0x13/0x13 Code: 25 28 00 00 00 48 89 45 c8 31 c0 48 8b 87 80 00 00 00 48 8d b5 60 ff ff ff 89 d1 48 89 fb 41 89 d6 4c 89 fa 48 8b 80 b8 00 00 00 <48> 8b 80 50 03 00 00 48 8b 00 48 89 85 38 ff ff ff 48 8b 07 4c RIP [] scsi_send_eh_cmnd+0x58/0x3a0 RSP CR2: 0000000000000350 Signed-off-by: Mark Rustad Tested-by: Marcus Dennis Signed-off-by: James Bottomley [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings commit e07e835a265976a8c13c7990a7e03627040372c2 Author: Konstantin Khlebnikov Date: Wed Apr 25 16:01:46 2012 -0700 mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce large amounts of memory barrier related damage v3") Local variable "page" can be uninitialized if the nodemask from vma policy does not intersects with nodemask from cpuset. Even if it doesn't happens it is better to initialize this variable explicitly than to introduce a kernel oops in a weird corner case. mm/hugetlb.c: In function `alloc_huge_page': mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function Signed-off-by: Konstantin Khlebnikov Acked-by: Mel Gorman Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Ben Hutchings commit e8bf81d11ffae58e78a1d8f58f51accdae627c65 Author: Mel Gorman Date: Wed Mar 21 16:34:11 2012 -0700 cpuset: mm: reduce large amounts of memory barrier related damage v3 commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman Cc: Miao Xie Cc: David Rientjes Cc: Peter Zijlstra Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman [bwh: Forward-ported from 3.0 to 3.2: apply the upstream changes to get_any_partial()] Signed-off-by: Ben Hutchings commit 01d90f1200a300dfa8d0fb58d98c18ca51c7363b Author: Johannes Weiner Date: Thu Jan 12 17:18:06 2012 -0800 mm: vmscan: convert global reclaim to per-memcg LRU lists commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch. Stable note: Not tracked in Bugzilla. This is a partial backport of an upstream commit addressing a completely different issue that accidentally contained an important fix. The workload this patch helps was memcached when IO is started in the background. memcached should stay resident but without this patch it gets swapped. Sometimes this manifests as a drop in throughput but mostly it was observed through /proc/vmstat. Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant to fix a problem whereby small scan targets on memcg were ignored causing priority to raise too sharply. It forced scanning to take place if the target was small, memcg or kswapd. From the time it was introduced it caused excessive reclaim by kswapd with workloads being pushed to swap that previously would have stayed resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan: convert global reclaim to per-memcg LRU lists] by making it harder for kswapd to force scan small targets but that patchset is not suitable for backporting. This was later changed again by commit [90126375: mm/vmscan: push lruvec pointer into get_scan_count()] into a format that looks like it would be a straight-forward backport but there is a subtle difference due to the use of lruvecs. The impact of the accidental fix is to make it harder for kswapd to force scan small targets by taking zone->all_unreclaimable into account. This patch is the closest equivalent available based on what is backported. Signed-off-by: Mel Gorman Signed-off-by: Ben Hutchings commit 33e037b9f75061e882e81106be10a2a4de42850d Author: Hugh Dickins Date: Tue Jan 10 15:08:33 2012 -0800 mm: test PageSwapBacked in lumpy reclaim commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. Even though the subject refers to lumpy reclaim, it impacts compaction as well. Lumpy reclaim does well to stop at a PageAnon when there's no swap, but better is to stop at any PageSwapBacked, which includes shmem/tmpfs too. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Reviewed-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Ben Hutchings commit cef8678eb9832de6ac3f32810509393e28416d7e Author: Minchan Kim Date: Tue Jan 10 15:08:18 2012 -0800 mm/vmscan.c: consider swap space when deciding whether to continue reclaim commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream. Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU usage on swapless systems with high anonymous memory usage. It's pointless to continue reclaiming when we have no swap space and lots of anon pages in the inactive list. Without this patch, it is possible when swap is disabled to continue trying to reclaim when there are only anonymous pages in the system even though that will not make any progress. Signed-off-by: Minchan Kim Cc: KOSAKI Motohiro Acked-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Johannes Weiner Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 2addaa3c03214d7c5d256dd8a96cf3db83d7f737 Author: Konstantin Khlebnikov Date: Tue Jan 10 15:07:03 2012 -0800 vmscan: activate executable pages after first usage commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages the first class citizen") was noticeably weakened in commit 645747462435d84 ("vmscan: detect mapped file pages used only once"). Currently these pages can become "first class citizens" only after second usage. After this patch page_check_references() will activate they after first usage, and executable code gets yet better chance to stay in memory. Signed-off-by: Konstantin Khlebnikov Cc: Pekka Enberg Cc: Minchan Kim Cc: KAMEZAWA Hiroyuki Cc: Wu Fengguang Cc: Johannes Weiner Cc: Nick Piggin Cc: Mel Gorman Cc: Shaohua Li Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 1bdeeecd038e66a5722f07ff96d46f0ea0d72084 Author: Konstantin Khlebnikov Date: Tue Jan 10 15:06:59 2012 -0800 vmscan: promote shared file mapped pages commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. The specific workload being addressed here in described in paragraph four and while paragraph five says it did not help performance as such, it made a difference to major page faults. I'm aware of at least one bug for a large vendor that was due to increased major faults. Commit 645747462435 ("vmscan: detect mapped file pages used only once") greatly decreases lifetime of single-used mapped file pages. Unfortunately it also decreases life time of all shared mapped file pages. Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed in fault path") page-fault handler does not mark page active or even referenced. Thus page_check_references() activates file page only if it was used twice while it stays in inactive list, meanwhile it activates anon pages after first access. Inactive list can be small enough, this way reclaimer can accidentally throw away any widely used page if it wasn't used twice in short period. After this patch page_check_references() also activate file mapped page at first inactive list scan if this page is already used multiple times via several ptes. I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5 (~2.6.18). There a complete mess with >100 web/mail/spam/ftp containers, they share all their files but there a lot of anonymous pages: ~500mb shared file mapped memory and 15-20Gb non-shared anonymous memory. In this situation major-pagefaults are very costly, because all containers share the same page. In my load kernel created a disproportionate pressure on the file memory, compared with the anonymous, they equaled only if I raise swappiness up to 150 =) These patches actually wasn't helped a lot in my problem, but I saw noticable (10-20 times) reduce in count and average time of major-pagefault in file-mapped areas. Actually both patches are fixes for commit v2.6.33-5448-g6457474, because it was aimed at one scenario (singly used pages), but it breaks the logic in other scenarios (shared and/or executable pages) Signed-off-by: Konstantin Khlebnikov Acked-by: Pekka Enberg Acked-by: Minchan Kim Reviewed-by: KAMEZAWA Hiroyuki Cc: Wu Fengguang Cc: Johannes Weiner Cc: Nick Piggin Cc: Mel Gorman Cc: Shaohua Li Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit d1c7dc4c0e3468ede76163d283c15c7afdfd1edb Author: Mel Gorman Date: Thu Jan 12 17:19:49 2012 -0800 mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. If compaction can proceed for a given zone, shrink_zones() does not reclaim any more pages from it. After commit [e0c2327: vmscan: abort reclaim/compaction if compaction can proceed], do_try_to_free_pages() tries to finish as soon as possible once one zone can compact. This was intended to prevent slabs being shrunk unnecessarily but there are side-effects. One is that a small zone that is ready for compaction will abort reclaim even if the chances of successfully allocating a THP from that zone is small. It also means that reclaim can return too early even though sc->nr_to_reclaim pages were not reclaimed. This partially reverts the commit until it is proven that slabs are really being shrunk unnecessarily but preserves the check to return 1 to avoid OOM if reclaim was aborted prematurely. [aarcange@redhat.com: This patch replaces a revert from Andrea] Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit b37d0e5983457f81ee31370edbd5af5a7667312d Author: Mel Gorman Date: Thu Jan 12 17:19:33 2012 -0800 mm: vmscan: do not OOM if aborting reclaim to start compaction commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream. Stable note: Not tracked in Bugzilla. This patch makes later patches easier to apply but otherwise has little to justify it. The problem it fixes was never observed but the source of the theoretical problem did not exist for very long. During direct reclaim it is possible that reclaim will be aborted so that compaction can be attempted to satisfy a high-order allocation. If this decision is made before any pages are reclaimed, it is possible that 0 is returned to the page allocator potentially triggering an OOM. This has not been observed but it is a possibility so this patch addresses it. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 553ae61ff28b9416c6faa59a02d79c21c3e7e845 Author: Mel Gorman Date: Thu Jan 12 17:19:45 2012 -0800 mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. This patch addresses a problem where the fix regressed THP allocation success rates. In commit e0887c19 ("vmscan: limit direct reclaim for higher order allocations"), Rik noted that reclaim was too aggressive when THP was enabled. In his initial patch he used the number of free pages to decide if reclaim should abort for compaction. My feedback was that reclaim and compaction should be using the same logic when deciding if reclaim should be aborted. Unfortunately, this had the effect of reducing THP success rates when the workload included something like streaming reads that continually allocated pages. The window during which compaction could run and return a THP was too small. This patch combines Rik's two patches together. compaction_suitable() is still used to decide if reclaim should be aborted to allow compaction is used. However, it will also ensure that there is a reasonable buffer of free pages available. This improves upon the THP allocation success rates but bounds the number of pages that are freed for compaction. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 4df9e19392b2aee80bbe12cd1e6f2fcb7fa3ae3c Author: Mel Gorman Date: Thu Jan 12 17:19:43 2012 -0800 mm: compaction: introduce sync-light migration for use by compaction commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream. Stable note: Not tracked in Buzilla. This was part of a series that reduced interactivity stalls experienced when THP was enabled. These stalls were particularly noticable when copying data to a USB stick but the experiences for users varied a lot. This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT mode that avoids writing back pages to backing storage. Async compaction maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is used. This avoids sync compaction stalling for an excessive length of time, particularly when copying files to a USB stick where there might be a large number of dirty pages backed by a filesystem that does not support ->writepages. [aarcange@redhat.com: This patch is heavily based on Andrea's work] [akpm@linux-foundation.org: fix fs/nfs/write.c build] [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build] Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 34c8ed43743851d7b1ad9e92f5ef7ca0b7ec606f Author: Mel Gorman Date: Thu Jan 12 17:19:38 2012 -0800 mm: compaction: make isolate_lru_page() filter-aware again commit c82449352854ff09e43062246af86bdeb628f0c3 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware") noted that compaction does not migrate dirty or writeback pages and that is was meaningless to pick the page and re-add it to the LRU list. This had to be partially reverted because some dirty pages can be migrated by compaction without blocking. This patch updates "mm: compaction: make isolate_lru_page" by skipping over pages that migration has no possibility of migrating to minimise LRU disruption. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Reviewed-by: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit e9a127ba4b7a07e007af6d4c07006121438affaa Author: Mel Gorman Date: Thu Jan 12 17:19:41 2012 -0800 mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred commit 66199712e9eef5aede09dbcd9dfff87798a66917 upstream. Stable note: Not tracked in Buzilla. This was part of a series that reduced interactivity stalls experienced when THP was enabled. If compaction is deferred, direct reclaim is used to try to free enough pages for the allocation to succeed. For small high-orders, this has a reasonable chance of success. However, if the caller has specified __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense to fail the allocation rather than stall the caller in direct reclaim. This patch skips direct reclaim if compaction is deferred and the caller specifies __GFP_NO_KSWAPD. Async compaction only considers a subset of pages so it is possible for compaction to be deferred prematurely and not enter direct reclaim even in cases where it should. To compensate for this, this patch also defers compaction only if sync compaction failed. Signed-off-by: Mel Gorman Acked-by: Minchan Kim Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 419276781056d8942c97b11447b8302d739e8c64 Author: Mel Gorman Date: Thu Jan 12 17:19:34 2012 -0800 mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Asynchronous compaction is used when allocating transparent hugepages to avoid blocking for long periods of time. Due to reports of stalling, there was a debate on disabling synchronous compaction but this severely impacted allocation success rates. Part of the reason was that many dirty pages are skipped in asynchronous compaction by the following check; if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page) rc = -EBUSY; This skips over all mapping aops using buffer_migrate_page() even though it is possible to migrate some of these pages without blocking. This patch updates the ->migratepage callback with a "sync" parameter. It is the responsibility of the callback to fail gracefully if migration would block. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit 46aadfff5c7660f3736c7330952c380d58aacec5 Author: Mel Gorman Date: Thu Jan 12 17:19:22 2012 -0800 mm: compaction: allow compaction to isolate dirty pages commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Short summary: There are severe stalls when a USB stick using VFAT is used with THP enabled that are reduced by this series. If you are experiencing this problem, please test and report back and considering I have seen complaints from openSUSE and Fedora users on this as well as a few private mails, I'm guessing it's a widespread issue. This is a new type of USB-related stall because it is due to synchronous compaction writing where as in the past the big problem was dirty pages reaching the end of the LRU and being written by reclaim. Am cc'ing Andrew this time and this series would replace mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch. I'm also cc'ing Dave Jones as he might have merged that patch to Fedora for wider testing and ideally it would be reverted and replaced by this series. That said, the later patches could really do with some review. If this series is not the answer then a new direction needs to be discussed because as it is, the stalls are unacceptable as the results in this leader show. For testers that try backporting this to 3.1, it won't work because there is a non-obvious dependency on not writing back pages in direct reclaim so you need those patches too. Changelog since V5 o Rebase to 3.2-rc5 o Tidy up the changelogs a bit Changelog since V4 o Added reviewed-bys, credited Andrea properly for sync-light o Allow dirty pages without mappings to be considered for migration o Bound the number of pages freed for compaction o Isolate PageReclaim pages on their own LRU list This is against 3.2-rc5 and follows on from discussions on "mm: Do not stall in synchronous compaction for THP allocations" and "[RFC PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed patch eliminated stalls due to compaction which sometimes resulted in user-visible interactivity problems on browsers by simply never using sync compaction. The downside was that THP success allocation rates were lower because dirty pages were not being migrated as reported by Andrea. His approach at fixing this was nacked on the grounds that it reverted fixes from Rik merged that reduced the amount of pages reclaimed as it severely impacted his workloads performance. This series attempts to reconcile the requirements of maximising THP usage, without stalling in a user-visible fashion due to compaction or cheating by reclaiming an excessive number of pages. Patch 1 partially reverts commit 39deaf85 to allow migration to isolate dirty pages. This is because migration can move some dirty pages without blocking. Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using synchronous compaction when it should be. This is unrelated to the reported stalls but is worth fixing. Patch 3 checks if we isolated a compound page during lumpy scan and account for it properly. For the most part, this affects tracing so it's unrelated to the stalls but worth fixing. Patch 4 notes that it is possible to abort reclaim early for compaction and return 0 to the page allocator potentially entering the "may oom" path. This has not been observed in practice but the rest of the series potentially makes it easier to happen. Patch 5 adds a sync parameter to the migratepage callback and gives the callback responsibility for migrating the page without blocking if sync==false. For example, fallback_migrate_page will not call writepage if sync==false. This increases the number of pages that can be handled by asynchronous compaction thereby reducing stalls. Patch 6 restores filter-awareness to isolate_lru_page for migration. In practice, it means that pages under writeback and pages without a ->migratepage callback will not be isolated for migration. Patch 7 avoids calling direct reclaim if compaction is deferred but makes sure that compaction is only deferred if sync compaction was used. Patch 8 introduces a sync-light migration mechanism that sync compaction uses. The objective is to allow some stalls but to not call ->writepage which can lead to significant user-visible stalls. Patch 9 notes that while we want to abort reclaim ASAP to allow compation to go ahead that we leave a very small window of opportunity for compaction to run. This patch allows more pages to be freed by reclaim but bounds the number to a reasonable level based on the high watermark on each zone. Patch 10 allows slabs to be shrunk even after compaction_ready() is true for one zone. This is to avoid a problem whereby a single small zone can abort reclaim even though no pages have been reclaimed and no suitably large zone is in a usable state. Patch 11 fixes a problem with the rate of page scanning. As reclaim is rarely stalling on pages under writeback it means that scan rates are very high. This is particularly true for direct reclaim which is not calling writepage. The vmstat figures implied that much of this was busy work with PageReclaim pages marked for immediate reclaim. This patch is a prototype that moves these pages to their own LRU list. This has been tested and other than 2 USB keys getting trashed, nothing horrible fell out. That said, I am a bit unhappy with the rescue logic in patch 11 but did not find a better way around it. It does significantly reduce scan rates and System CPU time indicating it is the right direction to take. What is of critical importance is that stalls due to compaction are massively reduced even though sync compaction was still allowed. Testing from people complaining about stalls copying to USBs with THP enabled are particularly welcome. The following tests all involve THP usage and USB keys in some way. Each test follows this type of pattern 1. Read from some fast fast storage, be it raw device or file. Each time the copy finishes, start again until the test ends 2. Write a large file to a filesystem on a USB stick. Each time the copy finishes, start again until the test ends 3. When memory is low, start an alloc process that creates a mapping the size of physical memory to stress THP allocation. This is the "real" part of the test and the part that is meant to trigger stalls when THP is enabled. Copying continues in the background. 4. Record the CPU usage and time to execute of the alloc process 5. Record the number of THP allocs and fallbacks as well as the number of THP pages in use a the end of the test just before alloc exited 6. Run the test 5 times to get an idea of variability 7. Between each run, sync is run and caches dropped and the test waits until nr_dirty is a small number to avoid interference or caching between iterations that would skew the figures. The individual tests were then writebackCPDeviceBasevfat Disable THP, read from a raw device (sda), vfat on USB stick writebackCPDeviceBaseext4 Disable THP, read from a raw device (sda), ext4 on USB stick writebackCPDevicevfat THP enabled, read from a raw device (sda), vfat on USB stick writebackCPDeviceext4 THP enabled, read from a raw device (sda), ext4 on USB stick writebackCPFilevfat THP enabled, read from a file on fast storage and USB, both vfat writebackCPFileext4 THP enabled, read from a file on fast storage and USB, both ext4 The kernels tested were 3.1 3.1 vanilla 3.2-rc5 freemore Patches 1-10 immediate Patches 1-11 andrea The 8 patches Andrea posted as a basis of comparison The results are very long unfortunately. I'll start with the case where we are not using THP at all writebackCPDeviceBasevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%) +/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%) User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%) +/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%) Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%) +/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%) THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) The THP figures are obviously all 0 because THP was enabled. The main thing to watch is the elapsed times and how they compare to times when THP is enabled later. It's also important to note that elapsed time is improved by this series as System CPu time is much reduced. writebackCPDevicevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%) +/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%) User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%) +/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%) Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%) +/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%) THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%) +/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%) Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%) +/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%) Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%) +/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03 Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28 The first thing to note is the "Elapsed Time" for the vanilla kernels of 2249 seconds versus 56 with THP disabled which might explain the reports of USB stalls with THP enabled. Applying the patches brings performance in line with THP-disabled performance while isolating pages for immediate reclaim from the LRU cuts down System CPU time. The "Fault Alloc" success rate figures are also improved. The vanilla kernel only managed to allocate 76.6 pages on average over the course of 5 iterations where as applying the series allocated 181.20 on average albeit it is well within variance. It's worth noting that applies the series at least descreases the amount of variance which implies an improvement. Andrea's series had a higher success rate for THP allocations but at a severe cost to elapsed time which is still better than vanilla but still much worse than disabling THP altogether. One can bring my series close to Andrea's by removing this check /* * If compaction is deferred for high-order allocations, it is because * sync compaction recently failed. In this is the case and the caller * has requested the system not be heavily disrupted, fail the * allocation now instead of entering direct reclaim */ if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD)) goto nopage; I didn't include a patch that removed the above check because hurting overall performance to improve the THP figure is not what the average user wants. It's something to consider though if someone really wants to maximise THP usage no matter what it does to the workload initially. This is summary of vmstat figures from the same test. 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 Page Ins 3257266139 1111844061 17263623 10901575 161423219 Page Outs 81054922 30364312 3626530 3657687 8753730 Swap Ins 3294 2851 6560 4964 4592 Swap Outs 390073 528094 620197 790912 698285 Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831 Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966 Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726 Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360 Kswapd efficiency 83% 69% 58% 57% 79% Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490 Direct efficiency 74% 9% 0% 1% 0% Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435 Percentage direct scans 96% 99% 99% 98% 99% Page writes by reclaim 722646 529174 620319 791018 699198 Page writes file 332573 1080 122 106 913 Page writes anon 390073 528094 620197 790912 698285 Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032 Page rescued immediate 0 0 0 87848 0 Slabs scanned 23552 23552 9216 8192 9216 Direct inode steals 231 0 0 0 0 Kswapd inode steals 0 0 0 0 0 Kswapd skipped wait 28076 786 0 61 6 THP fault alloc 609 383 753 906 1433 THP collapse alloc 12 6 0 0 6 THP splits 536 211 456 593 1136 THP fault fallback 4406 4633 4263 4110 3583 THP collapse fail 120 127 0 0 4 Compaction stalls 1810 728 623 779 3200 Compaction success 196 53 60 80 123 Compaction failures 1614 675 563 699 3077 Compaction pages moved 193158 53545 243185 333457 226688 Compaction move failure 9952 9396 16424 23676 45070 The main things to look at are 1. Page In/out figures are much reduced by the series. 2. Direct page scanning is incredibly high (264745.137 pages scanned per second on the vanilla kernel) but isolating PageReclaim pages on their own list reduces the number of pages scanned significantly. 3. The fact that "Page rescued immediate" is a positive number implies that we sometimes race removing pages from the LRU_IMMEDIATE list that need to be put back on a normal LRU but it happens only for 0.07% of the pages marked for immediate reclaim. writebackCPDeviceext4 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%) +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%) User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%) +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%) Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%) +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%) THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%) +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%) Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%) +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%) Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%) +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45 Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26 Similar test but the USB stick is using ext4 instead of vfat. As ext4 does not use writepage for migration, the large stalls due to compaction when THP is enabled are not observed. Still, isolating PageReclaim pages on their own list helped completion time largely by reducing the number of pages scanned by direct reclaim although time spend in congestion_wait could also be a factor. Again, Andrea's series had far higher success rates for THP allocation at the cost of elapsed time. I didn't look too closely but a quick look at the vmstat figures tells me kswapd reclaimed 8 times more pages than the patch series and direct reclaim reclaimed roughly three times as many pages. It follows that if memory is aggressively reclaimed, there will be more available for THP. writebackCPFilevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%) +/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%) User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%) +/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%) Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%) +/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%) THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%) +/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%) Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%) +/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%) Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%) +/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76 Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28 In this case, the test is reading/writing only from filesystems but as it's vfat, it's slow due to calling writepage during compaction. Little to observe really - the time to complete the test goes way down with the series applied and THP allocation success rates go up in comparison to 3.2-rc5. The success rates are lower than 3.1.0 but the elapsed time for that kernel is abysmal so it is not really a sensible comparison. As before, Andrea's series allocates more THPs at the cost of overall performance. writebackCPFileext4 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%) +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%) User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%) +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%) Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%) +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%) THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%) +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%) Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%) +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%) Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%) +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45 Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26 Same type of story - elapsed times go down. In this case, allocation success rates are roughtly the same. As before, Andrea's has higher success rates but takes a lot longer. Overall the series does reduce latencies and while the tests are inherency racy as alloc competes with the cp processes, the variability was included. The THP allocation rates are not as high as they could be but that is because we would have to be more aggressive about reclaim and compaction impacting overall performance. This patch: Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware") noted that compaction does not migrate dirty or writeback pages and that is was meaningless to pick the page and re-add it to the LRU list. What was missed during review is that asynchronous migration moves dirty pages if their ->migratepage callback is migrate_page() because these can be moved without blocking. This potentially impacted hugepage allocation success rates by a factor depending on how many dirty pages are in the system. This patch partially reverts 39deaf85 to allow migration to isolate dirty pages again. This increases how much compaction disrupts the LRU but that is addressed later in the series. Signed-off-by: Mel Gorman Reviewed-by: Andrea Arcangeli Reviewed-by: Rik van Riel Reviewed-by: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings commit c653414d1983c064934ff3aa66bea2d7297f8222 Author: Mel Gorman Date: Tue Jan 10 15:07:14 2012 -0800 mm: reduce the amount of work done when updating min_free_kbytes commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream. Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 . Large machines with 1TB or more of RAM take a long time to boot without this patch and may spew out soft lockup warnings. When min_free_kbytes is updated, some pageblocks are marked MIGRATE_RESERVE. Ordinarily, this work is unnoticable as it happens early in boot but on large machines with 1TB of memory, this has been reported to delay boot times, probably due to the NUMA distances involved. The bulk of the work is due to calling calling pageblock_is_reserved() an unnecessary amount of times and accessing far more struct page metadata than is necessary. This patch significantly reduces the amount of work done by setup_zone_migrate_reserve() improving boot times on 1TB machines. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings