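Yesterday, a SQL Server 2008 R2 database raised the following alert shortly after 5 a.m.:
After creating a test environment from a backup, I found that the log file of testdb was far too large, at 400 GB. The test environment does not need that much log space, and tying up 400 GB of disk is a waste, so I used DBCC SHRINKFILE to shrink the log file: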
In transactional replication, the transaction log of the database involved in replication is both written to and read from. Without replication, a transaction log is almost always written to, and rarely read from. Because a transaction log is both written to and read from when using transactional replication, this can cause I/O performance issues on databases that experience large numbers of transactions.
dbcc shrinkfile(testdb_log_5,10240,notruncate)
dbcc shrinkfile(testdb_log_5,10240,truncateonly)
To help reduce this problem, place the transaction log of each database involved in transactional replication on its own dedicated RAID 1 or RAID 10 disk array, connected through its own SCSI or Fibre Channel controller. [6.0, 7.0, 2000, 2005] Updated 1-6-2006
The log scan number (620023:3702:1) passed to log scan in database 'xxxx' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.
If you are using transactional replication, you may want to monitor two latencies: the time the Log Reader Agent takes to move transactions from a database's transaction log into the distribution database, and the time the Distribution Agent takes to move transactions from the distribution database to the Subscriber database. The sum of these two figures is the time it takes a transaction to get from the publication database to the Subscriber database.
The test database uses the Simple recovery model and has no active users, so the most likely explanation is that its transaction log records are marked for replication. Counting them with the function below shows that a large number of log records have not yet been read by the Log Reader.
The counters for these two processes are the SQL Server Replication LogReader: Delivery Latency counter and the SQL Server Replication Dist.: Delivery Latency counter.
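At first glance I assumed the database was corrupted (data corruption), but DBCC CHECKDB showed that it was actually intact. So the error had to be related to replication. After searching, though, the only material I found on a similar error was the blog post "The process could not execute 'sp_repldone/sp_replcounters'":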
select count(0) from sys.fn_dblog(null,null) f where f.Description ='REPLICATE'
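A complementary check, offered as a sketch rather than as a step the author took: sys.databases exposes log_reuse_wait_desc, which reports REPLICATION when replication-marked log records are what prevents truncation. The database name testdb comes from the text above.

```sql
-- Why can the log not be truncated? A value of REPLICATION means
-- records marked for replication are still waiting for the Log Reader.
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'testdb';

-- Log size and percentage in use for every database on the instance.
DBCC SQLPERF(LOGSPACE);
```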
If you see a significant increase in the latency for either of these processes, this should be a signal to you to find out what new or different action has happened to cause the increased latency. [6.5, 7.0, 2000, 2005] Updated 1-6-2006
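Besides the Performance Monitor counters, SQL Server 2005 and later can measure the same two latencies with tracer tokens. This is a different technique from the counters named above, shown only as a rough sketch; PublishedDb and MyPublication are hypothetical names, and the 30-second wait is an arbitrary assumption.

```sql
-- Run at the Publisher, in the publication database (hypothetical name).
USE PublishedDb;

DECLARE @token_id int;

-- Write a tracer token into the publication database's transaction log.
EXEC sys.sp_posttracertoken
     @publication = N'MyPublication',      -- hypothetical publication name
     @tracer_token_id = @token_id OUTPUT;

-- Give the Log Reader and Distribution Agents time to move the token.
WAITFOR DELAY '00:00:30';

-- Returns distributor_latency, subscriber_latency and overall_latency
-- (in seconds) for each Subscriber that has received the token.
EXEC sys.sp_helptracertokenhistory
     @publication = N'MyPublication',
     @tracer_id = @token_id;
```

Rows with NULL latency values mean the token has not yet arrived at that stage.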
In the Publisher database, use sp_repltrans to view the transactions that the Log Reader has not yet marked as distributed.
The last LSN in the transaction log is lower than the LSN the Log Reader is trying to find. An old backup may have been restored on top of the published database. After the restore, the new transaction log does not contain the data that the Distributor and Subscriber(s) now have.
sp_repltrans returns a result set of all the transactions in the publication database transaction log that are marked for replication but have not been marked as distributed.
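A minimal sketch of that check, assuming it is run in the publication database at the Publisher (testdb here, following the earlier text). sp_replcounters is added as a related procedure that summarizes the same backlog; it is not mentioned by the author at this point.

```sql
USE testdb;  -- assumption: the database being investigated

-- One row (xdesid, xact_seqno) per transaction that is marked for
-- replication but not yet marked as distributed.
EXEC sp_repltrans;

-- Per published database: count of pending replicated transactions,
-- replication rate, latency, and the begin/end LSNs.
EXEC sp_replcounters;
```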
Transactional replication offers an option called Immediate-Updating Subscribers. This feature provides transactional consistency across all Subscribers of a Publisher. Using the Microsoft Distributed Transaction Coordinator (MSDTC), Immediate-Updating Subscribers lets a Subscriber update its local copy of the data while the Publisher's data is updated in the same transaction. The change is then replicated from the Publisher to the other Subscribers, so they all have consistent data.
How to fix this
Ensure database consistency by running DBCC CHECKDB on the database.
If an old backup was restored on top of the published database, use sp_replrestart.
If going back to the most recent transaction log backup is not an option, execute sp_replrestart on the Publisher in the published database. This stored procedure is used when the highest log sequence number (LSN) value at the Distributor does not match the highest LSN value at the Publisher.
This stored procedure inserts compensating LSNs (no-op records) into the publisher database log file until one of the compensating LSNs becomes higher than the highest distributed LSN recorded in the distribution database for this published database. It then inserts this new high LSN into the msrepl_transactions table in the distribution database and executes sp_repldone on the published database to update the internal structures and mark a new starting point for the Log Reader Agent.
Ensure that the Log Reader Agent is stopped and that there are no incoming transactions on the published database when this stored procedure is executed.
Since transactions may have been lost, we recommend reinitializing the Subscriber(s) and/or recreating the publication and subscription(s). For large databases, consider using "Initialize from Backup" as discussed in SQL Server Books Online.
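The steps above can be sketched roughly as follows. PublishedDb is a hypothetical database name, and the sketch assumes the Log Reader Agent has already been stopped and no transactions are arriving.

```sql
-- Run at the Publisher, in the published database (hypothetical name).
USE PublishedDb;

-- Step 1: confirm the database is consistent before touching replication.
DBCC CHECKDB (N'PublishedDb') WITH NO_INFOMSGS;

-- Step 2: resynchronize the Publisher's high LSN with the Distributor.
-- Takes no parameters; the Log Reader Agent must be stopped first.
EXEC sp_replrestart;
```

After this, restart the Log Reader Agent and reinitialize the Subscriber(s) as recommended above.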
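In this case, however, the database had neither been corrupted nor restored. The only remaining explanation is that something went wrong in replication, yet I found no related error messages in SQL Server's replication. This replication was generated automatically by AWS DMS, so much of its internal state is unclear to me (for example, whether an exception occurred), and I could not find detailed official documentation on this error either. I am recording it here.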
While the option is effective, it is also a resource hog. Because of this, you should only use Immediate-Updating when it is absolutely necessary. As an alternative to Immediate-Updating, consider only replicating changes from a Subscriber to a Publisher at regular intervals, such as once an hour, or once a day. This will significantly reduce server overhead. [7.0, 2000, 2005] Updated 1-6-2006
Unable to execute procedure. The database is not published. Execute the procedure in a database that is published for replication.
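The text stops at this error. One commonly used way to clear leftover replication state from a restored copy that is not meant to be published is sp_removedbreplication. The sketch below is an assumption about a possible next step, not something the source confirms; testdb is the database name used earlier.

```sql
-- Assumption: testdb is a restored test copy that should not be published.
-- Removes replication objects and settings from the database.
EXEC sp_removedbreplication @dbname = N'testdb';

-- Re-check whether replication is still holding the log.
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'testdb';
```

If log_reuse_wait_desc no longer shows REPLICATION, the shrink can be retried.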