https://blog.csdn.net/Zdelta/article/details/104310984
https://blog.csdn.net/weixin_39401430/article/details/122396832

Background
A crawler project: the target site was throwing captchas very frequently, so an IP proxy was added.

However:

1. The C# HttpClient object does not release its resources promptly, which exhausts the system's sockets, and memory usage keeps climbing.

2. Using a single global static HttpClient instead is no better: the proxy address can only be fixed once at initialization, so the proxy cannot be switched dynamically.


// A new object is created for every request to the target address; even
// calling Dispose() in code does not release the underlying system sockets
HttpClient httpClient = new HttpClient(new HttpClientHandler
{
    AutomaticDecompression = DecompressionMethods.GZip,
    UseCookies = true,
    Proxy = new WebProxy(new Uri(GetProxy())) { Credentials = null, UseDefaultCredentials = false },
    // the proxy address cannot be changed once the handler is instantiated
    UseProxy = true,
    AllowAutoRedirect = true,
    ClientCertificateOptions = ClientCertificateOption.Automatic,
    ServerCertificateCustomValidationCallback = (message, cert, chain, error) => true
});
httpClient.Timeout = new TimeSpan(0, 0, 3);
httpClient.BaseAddress = uri;
var result = httpClient.GetAsync(uri).Result.Content.ReadAsStringAsync().Result;

So the idea: call Python's requests library instead.
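The appeal is that requests takes the proxy as a per-call parameter instead of baking it into the client at construction time. A minimal sketch of that property (make_proxies and fetch are illustrative names, not from the project):

```python
def make_proxies(proxy_address):
    # route both http and https traffic through the same HTTP proxy
    return {"http": "http://" + proxy_address,
            "https": "http://" + proxy_address}

def fetch(url, proxy_address, timeout=3):
    # imported lazily so the helper above stays dependency-free
    import requests
    # verify=False plays the role of the C# ServerCertificateCustomValidationCallback
    return requests.get(url, proxies=make_proxies(proxy_address),
                        timeout=timeout, verify=False)
```

Each call can receive a different proxy_address, which is exactly what a fixed HttpClientHandler.Proxy cannot do.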

Implementation
IronPython can run Python directly inside Visual Studio, but it does not support third-party packages such as requests, so the script is invoked from the command line instead.

1. python request.py

Print the result of the Python request to standard output:


import ast
import sys
import time

import requests

FAIL_MESSAGE = "request failed"


def send_request(**kwargs):
    url = kwargs.get('url')
    if not url:
        raise Exception("invalid url")
    headers = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
    }
    headers.update(kwargs.get('headers', dict()))
    timeout = kwargs.get('timeout', 3)
    # Getproxy() is defined elsewhere and returns the proxy-pool API URL
    proxy_api = kwargs.get('proxy_api', Getproxy())
    # '验证码' ("captcha") is the marker text of the target site's block page
    verify_text = kwargs.get('verify_text', '验证码')
    # fetch a fresh proxy address from the pool for every request
    proxy_address = requests.get(proxy_api).json()[0]['proxy_address']
    time.sleep(0.1)
    try:
        proxies = {"http": "http://" + proxy_address, "https": "http://" + proxy_address}
        response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
        if verify_text in response.text:
            # hit a captcha page: retry recursively with a new proxy
            send_request(**kwargs)
        else:
            print(response.text)
    except Exception as exception:
        print(FAIL_MESSAGE, exception)


if __name__ == '__main__':
    # argv[1] is a Python dict literal, e.g. "{'url':'https://...'}"
    kwargs = ast.literal_eval(sys.argv[1])
    send_request(**kwargs)
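The argv contract between the two sides can be checked in isolation. ast.literal_eval only parses Python literals, so the argument string is never executed as code:

```python
import ast

# the C# caller passes a single argument shaped like a Python dict literal
arg = "{'url':'https://example.com','timeout':5}"
kwargs = ast.literal_eval(arg)  # parses literals only; never executes code
print(kwargs['url'])  # → https://example.com
```

The literal uses single quotes and no spaces, so it survives Windows command-line splitting as a single argument.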

2. Reading the output from C#

private static string RequestByPython(Uri uri)
{
    // argument convention agreed with request.py: a Python dict literal
    var cmdArgs = "{'url':'" + uri + "'}";
    Process process = new Process();
    // path of the .py script
    string path = Directory.GetCurrentDirectory() + PythonRequestFile;
    // PythonPath: local Python install path + /python.exe
    process.StartInfo.FileName = PythonPath;
    // invoke the script from the command line: python <script> <args>
    string sArguments = path;
    sArguments += " " + cmdArgs;
    process.StartInfo.Arguments = sArguments;
    process.StartInfo.UseShellExecute = false;
    process.StartInfo.RedirectStandardOutput = true;
    process.StartInfo.RedirectStandardInput = true;
    process.StartInfo.RedirectStandardError = true;
    process.StartInfo.CreateNoWindow = true;
    process.Start();
    StringBuilder stringBuilder = new StringBuilder();
    StreamReader streamReader = process.StandardOutput;
    // note: ReadLine() strips line terminators, so the page source is flattened
    while (!streamReader.EndOfStream)
    {
        stringBuilder.Append(streamReader.ReadLine());
    }
    process.WaitForExit();
    var result = stringBuilder.ToString();
    process.Dispose();
    return result;
}
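For testing request.py without the C# host, the same invocation can be sketched with Python's subprocess module (run_request_script is a hypothetical helper, not part of the project):

```python
import subprocess
import sys

def run_request_script(script_path, url, python_exe=sys.executable):
    # same argument convention as RequestByPython: script path, then the dict literal
    cmd_args = "{'url':'" + url + "'}"
    proc = subprocess.run([python_exe, script_path, cmd_args],
                          capture_output=True, text=True)
    # the page source (or FAIL_MESSAGE) arrives on stdout, just as in the C# reader
    return proc.stdout
```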

Results
1. Memory usage now rises and falls within a stable range. Perfect!

2. The page source printed to the console and read back loses its original formatting, so it needs HTML reformatting, or the scraping regexes must be adjusted (e.g. to lazy matching).
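The format change comes from the ReadLine() loop in RequestByPython: each line is appended without its terminator, so the HTML arrives flattened. A small sketch of the effect:

```python
# what the StringBuilder.Append(ReadLine()) loop effectively does to the source
page = "<div>\n  <p>hello</p>\n</div>"
joined = "".join(page.splitlines())
print(joined)  # → <div>  <p>hello</p></div>
```

Any regex that anchors on line boundaries in the original source must therefore be rewritten for the single-line form.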

Note:

Whether in Python or C#, HTTPS certificate errors must be ignored when going through the proxy!

Document updated: 2022-12-12 08:02   Author: admin